The Parallel Book
Edited by:
Paul E. McKenney
Facebook
[email protected]
v2022.09.25a
Legal Statement
This work represents the views of the editor and the authors and does not necessarily
represent the view of their respective employers.
Trademarks:
• IBM, z Systems, and PowerPC are trademarks or registered trademarks of
International Business Machines Corporation in the United States, other countries,
or both.
• Linux is a registered trademark of Linus Torvalds.
• Intel, Itanium, Intel Core, and Intel Xeon are trademarks of Intel Corporation or
its subsidiaries in the United States, other countries, or both.
• Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or
elsewhere.
• SPARC is a registered trademark of SPARC International, Inc. Products bearing
SPARC trademarks are based on an architecture developed by Sun Microsystems,
Inc.
• Other company, product, and service names may be trademarks or service marks
of such companies.
The non-source-code text and images in this document are provided under the terms
of the Creative Commons Attribution-Share Alike 3.0 United States license.1 In brief,
you may use the contents of this document for any purpose, personal, commercial, or
otherwise, so long as attribution to the authors is maintained. Likewise, the document
may be modified, and derivative works and translations made available, so long as
such modifications and derivations are offered to the public on equal terms as the
non-source-code text and images in the original document.
Source code is covered by various versions of the GPL.2 Some of this code is
GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later. See
the comment headers of the individual source files within the CodeSamples directory in
the git archive3 for the exact licenses. If you are unsure of the license for a given code
fragment, you should assume GPLv2-only.
Combined work © 2005–2022 by Paul E. McKenney. Each individual contribution
is copyright by its contributor at the time of contribution, as recorded in the git archive.
1 https://creativecommons.org/licenses/by-sa/3.0/us/
2 https://www.gnu.org/licenses/gpl-2.0.html
3 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
Contents

2 Introduction
  2.1 Historic Parallel Programming Difficulties
  2.2 Parallel Programming Goals
    2.2.1 Performance
    2.2.2 Productivity
    2.2.3 Generality
  2.3 Alternatives to Parallel Programming
    2.3.1 Multiple Instances of a Sequential Application
    2.3.2 Use Existing Parallel Software
    2.3.3 Performance Optimization
  2.4 What Makes Parallel Programming Hard?
    2.4.1 Work Partitioning
    2.4.2 Parallel Access Control
    2.4.3 Resource Partitioning and Replication
    2.4.4 Interacting With Hardware
    2.4.5 Composite Capabilities
    2.4.6 How Do Languages and Environments Assist With These Tasks?
  2.5 Discussion

5 Counting
  5.1 Why Isn’t Concurrent Counting Trivial?
  5.2 Statistical Counters
    5.2.1 Design
    5.2.2 Array-Based Implementation
    5.2.3 Per-Thread-Variable-Based Implementation
    5.2.4 Eventually Consistent Implementation
    5.2.5 Discussion
  5.3 Approximate Limit Counters
    5.3.1 Design
    5.3.2 Simple Limit Counter Implementation
    5.3.3 Simple Limit Counter Discussion
    5.3.4 Approximate Limit Counter Implementation
    5.3.5 Approximate Limit Counter Discussion
  5.4 Exact Limit Counters
    5.4.1 Atomic Limit Counter Implementation
    5.4.2 Atomic Limit Counter Discussion
    5.4.3 Signal-Theft Limit Counter Design
    5.4.4 Signal-Theft Limit Counter Implementation
    5.4.5 Signal-Theft Limit Counter Discussion
    5.4.6 Applying Exact Limit Counters
  5.5 Parallel Counting Discussion

7 Locking
  7.1 Staying Alive
    7.1.1 Deadlock
    7.1.2 Livelock and Starvation
    7.1.3 Unfairness
    7.1.4 Inefficiency
  7.2 Types of Locks
    7.2.1 Exclusive Locks
    7.2.2 Reader-Writer Locks
    7.2.3 Beyond Reader-Writer Locks
    7.2.4 Scoped Locking
  7.3 Locking Implementation Issues
    7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange
    7.3.2 Other Exclusive-Locking Implementations
  7.4 Lock-Based Existence Guarantees
  7.5 Locking: Hero or Villain?
    7.5.1 Locking For Applications: Hero!
    7.5.2 Locking For Parallel Libraries: Just Another Tool

11 Validation
  11.1 Introduction
    11.1.1 Where Do Bugs Come From?
    11.1.2 Required Mindset
    11.1.3 When Should Validation Start?
    11.1.4 The Open Source Way
  11.2 Tracing
  11.3 Assertions
  11.4 Static Analysis
  11.5 Code Review
    11.5.1 Inspection
    11.5.2 Walkthroughs
    11.5.3 Self-Inspection
  11.6 Probability and Heisenbugs
    11.6.1 Statistics for Discrete Testing
    11.6.2 Statistics Abuse for Discrete Testing
    11.6.3 Statistics for Continuous Testing
    11.6.4 Hunting Heisenbugs
  11.7 Performance Estimation
    11.7.1 Benchmarking
    11.7.2 Profiling
    11.7.3 Differential Profiling
    11.7.4 Microbenchmarking
    11.7.5 Isolation
    11.7.6 Detecting Interference
  11.8 Summary

Glossary

Bibliography

Credits
  LaTeX Advisor
  Reviewers
  Machine Owners
  Original Publications
  Figure Credits
  Other Support

Acronyms

Index
If you would only recognize that life is hard, things would be so much easier for you.

Chapter 1
How to Use This Book

The purpose of this book is to help you program shared-memory parallel systems without risking your sanity.1 Nevertheless, you should think of the information in this book as a foundation on which to build, rather than as a completed cathedral. Your mission, if you choose to accept, is to help make further progress in the exciting field of parallel programming—progress that will in time render this book obsolete. Parallel programming in the 21st century is no longer focused solely on science, research, and grand-challenge projects. And this is all to the good, because it means that parallel programming is becoming an engineering discipline. Therefore, as befits an engineering discipline, this book examines specific parallel-programming tasks and describes how to approach them. In some surprisingly common cases, these tasks can be automated.

This book is written in the hope that presenting the engineering discipline underlying successful parallel-programming projects will free a new generation of parallel hackers from the need to slowly and painstakingly reinvent old wheels, enabling them to instead focus their energy and creativity on new frontiers. However, what you get from this book will be determined by what you put into it. It is hoped that simply reading this book will be helpful, and that working the Quick Quizzes will be even more helpful. However, the best results come from applying the techniques taught in this book to real-life problems. As always, practice makes perfect.

But no matter how you approach it, we sincerely hope that parallel programming brings you at least as much fun, excitement, and challenge as it has brought to us!

1 Or, perhaps more accurately, without much greater risk to your sanity than that incurred by non-parallel programming. Which, come to think of it, might not be saying all that much.

1.1 Roadmap

Cat: Where are you going?
Alice: Which way should I go?
Cat: That depends on where you are going.
Alice: I don’t know.
Cat: Then it doesn’t matter which way you go.

Lewis Carroll, Alice in Wonderland

This book is a handbook of widely applicable and heavily used design techniques, rather than a collection of optimal algorithms with tiny areas of applicability. You are currently reading Chapter 1, but you knew that already. Chapter 2 gives a high-level overview of parallel programming.

Chapter 3 introduces shared-memory parallel hardware. After all, it is difficult to write good parallel code unless you understand the underlying hardware. Because hardware constantly evolves, this chapter will always be out of date. We will nevertheless do our best to keep up. Chapter 4 then provides a very brief overview of common shared-memory parallel-programming primitives.

Chapter 5 takes an in-depth look at parallelizing one of the simplest problems imaginable, namely counting. Because almost everyone has an excellent grasp of counting, this chapter is able to delve into many important parallel-programming issues without the distractions of more-typical computer-science problems. My impression is that this chapter has seen the greatest use in parallel-programming coursework.

Chapter 6 introduces a number of design-level methods of addressing the issues identified in Chapter 5. It turns out that it is important to address parallelism at the design level when feasible: To paraphrase Dijkstra [Dij68], “retrofitted parallelism considered grossly suboptimal” [McK12c].
The next three chapters examine three important approaches to synchronization. Chapter 7 covers locking, which is still not only the workhorse of production-quality parallel programming, but is also widely considered to be parallel programming’s worst villain. Chapter 8 gives a brief overview of data ownership, an often overlooked but remarkably pervasive and powerful approach. Finally, Chapter 9 introduces a number of deferred-processing mechanisms, including reference counting, hazard pointers, sequence locking, and RCU.

Chapter 10 applies the lessons of previous chapters to hash tables, which are heavily used due to their excellent partitionability, which (usually) leads to excellent performance and scalability.

As many have learned to their sorrow, parallel programming without validation is a sure path to abject failure. Chapter 11 covers various forms of testing. It is of course impossible to test reliability into your program after the fact, so Chapter 12 follows up with a brief overview of a couple of practical approaches to formal verification.

Chapter 13 contains a series of moderate-sized parallel programming problems. The difficulty of these problems varies, but they should be appropriate for someone who has mastered the material in the previous chapters.

Chapter 14 looks at advanced synchronization methods, including non-blocking synchronization and parallel real-time computing, while Chapter 15 covers the advanced topic of memory ordering. Chapter 16 follows up with some ease-of-use advice. Chapter 17 looks at a few possible future directions, including shared-memory parallel system design, software and hardware transactional memory, and functional programming for parallelism. Finally, Chapter 18 reviews the material in this book and its origins.

This chapter is followed by a number of appendices. The most popular of these appears to be Appendix C, which delves even further into memory ordering. Appendix E contains the answers to the infamous Quick Quizzes, which are discussed in the next section.

1.2 Quick Quizzes

“Quick quizzes” appear throughout this book, and the answers may be found in Appendix E starting on page 461. Some of them are based on material in which that quick quiz appears, but others require you to think beyond that section, and, in some cases, beyond the realm of current knowledge. As with most endeavors, what you get out of this book is largely determined by what you are willing to put into it. Therefore, readers who make a genuine effort to solve a quiz before looking at the answer find their effort repaid handsomely with increased understanding of parallel programming.

Quick Quiz 1.1: Where are the answers to the Quick Quizzes found?

Quick Quiz 1.2: Some of the Quick Quiz questions seem to be from the viewpoint of the reader rather than the author. Is that really the intent?

Quick Quiz 1.3: These Quick Quizzes are just not my cup of tea. What can I do about it?

In short, if you need a deep understanding of the material, then you should invest some time into answering the Quick Quizzes. Don’t get me wrong, passively reading the material can be quite valuable, but gaining full problem-solving capability really does require that you practice solving problems. Similarly, gaining full code-production capability really does require that you practice producing code.

Quick Quiz 1.4: If passively reading this book doesn’t get me full problem-solving and code-production capabilities, what on earth is the point???

I learned this the hard way during coursework for my late-in-life Ph.D. I was studying a familiar topic, and was surprised at how few of the chapter’s exercises I could answer off the top of my head.2 Forcing myself to answer the questions greatly increased my retention of the material. So with these Quick Quizzes I am not asking you to do anything that I have not been doing myself.

Finally, the most common learning disability is thinking that you already understand the material at hand. The quick quizzes can be an extremely effective cure.

2 So I suppose that it was just as well that my professors refused to let me waive that class!
1.3 Alternatives to This Book

Between two evils I always pick the one I never tried before.

Mae West

As Knuth learned the hard way, if you want your book to be finite, it must be focused. This book focuses on shared-memory parallel programming, with an emphasis on software that lives near the bottom of the software stack, such as operating-system kernels, parallel data-management systems, low-level libraries, and the like. The programming language used by this book is C.

If you are interested in other aspects of parallelism, you might well be better served by some other book. Fortunately, there are many alternatives available to you:

1. If you prefer a more academic and rigorous treatment of parallel programming, you might like Herlihy’s and Shavit’s textbook [HS08, HSLS20]. This book starts with an interesting combination of low-level primitives at high levels of abstraction from the hardware, and works its way through locking and simple data structures including lists, queues, hash tables, and counters, culminating with transactional memory, all in Java. Michael Scott’s textbook [Sco13] approaches similar material with more of a software-engineering focus, and, as far as I know, is the first formally published academic textbook with a section devoted to RCU.
   Herlihy, Shavit, Luchangco, and Spear did catch up in their second edition [HSLS20] by adding short sections on hazard pointers and on RCU, with the latter in the guise of EBR.3 They also include a brief history of both, albeit with an abbreviated history of RCU that picks up almost a year after it was accepted into the Linux kernel and more than 20 years after Kung’s and Lehman’s landmark paper [KL80]. Those wishing a deeper view of the history may find it in this book’s Section 9.5.5.
   However, readers who might otherwise suspect a hostile attitude towards RCU on the part of this textbook’s first author should refer to the last full sentence on the first page of one of his papers [BGHZ16]. This sentence reads “QSBR [a particular class of RCU implementations] is fast and can be applied to virtually any data structure.” These are clearly not the words of someone who is hostile towards RCU.

2. If you would like an academic treatment of parallel programming from a programming-language-pragmatics viewpoint, you might be interested in the concurrency chapter from Scott’s textbook [Sco06, Sco15] on programming-language pragmatics.

3. If you are interested in an object-oriented patternist treatment of parallel programming focussing on C++, you might try Volumes 2 and 4 of Schmidt’s POSA series [SSRB00, BHS07]. Volume 4 in particular has some interesting chapters applying this work to a warehouse application. The realism of this example is attested to by the section entitled “Partitioning the Big Ball of Mud”, in which the problems inherent in parallelism often take a back seat to getting one’s head around a real-world application.

4. If you want to work with Linux-kernel device drivers, then Corbet’s, Rubini’s, and Kroah-Hartman’s “Linux Device Drivers” [CRKH05] is indispensable, as is the Linux Weekly News web site (https://lwn.net/). There is a large number of books and resources on the more general topic of Linux kernel internals.

5. If your primary focus is scientific and technical computing, and you prefer a patternist approach, you might try Mattson et al.’s textbook [MSM05]. It covers Java, C/C++, OpenMP, and MPI. Its patterns are admirably focused first on design, then on implementation.

6. If your primary focus is scientific and technical computing, and you are interested in GPUs, CUDA, and MPI, you might check out Norm Matloff’s “Programming on Parallel Machines” [Mat17]. Of course, the GPU vendors have quite a bit of additional information [AMD20, Zel11, NVi17a, NVi17b].

7. If you are interested in POSIX Threads, you might take a look at David R. Butenhof’s book [But97]. In addition, W. Richard Stevens’s book [Ste92, Ste13] covers UNIX and POSIX, and Stewart Weiss’s lecture notes [Wei13] provide a thorough and accessible introduction with a good set of examples.

8. If you are interested in C++11, you might like Anthony Williams’s “C++ Concurrency in Action: Practical Multithreading” [Wil12, Wil19].

3 Albeit an implementation that contains a reader-preemption bug noted by Richard Bornat.
9. If you are interested in C++, but in a Windows environment, you might try Herb Sutter’s “Effective Concurrency” series in Dr. Dobb’s Journal [Sut08]. This series does a reasonable job of presenting a commonsense approach to parallelism.

10. If you want to try out Intel Threading Building Blocks, then perhaps James Reinders’s book [Rei07] is what you are looking for.

11. Those interested in learning how various types of multi-processor hardware cache organizations affect the implementation of kernel internals should take a look at Curt Schimmel’s classic treatment of this subject [Sch94].

12. If you are looking for a hardware view, Hennessy’s and Patterson’s classic textbook [HP17, HP11] is well worth a read. A “Readers Digest” version of this tome geared for scientific and technical workloads (bashing big arrays) may be found in Andrew Chien’s textbook [Chi22]. If you are looking for an academic textbook on memory ordering, that of Daniel Sorin et al. [SHW11, NSHW20] is highly recommended. For a memory-ordering tutorial from a Linux-kernel viewpoint, Paolo Bonzini’s LWN series is a good place to start [Bon21a, Bon21e, Bon21c, Bon21b, Bon21d, Bon21f].

13. Finally, those using Java might be well-served by Doug Lea’s textbooks [Lea97, GPB+ 07].

However, if you are interested in principles of parallel design for low-level software, especially software written in C, read on!

1.4 Sample Source Code

Use the source, Luke!

Unknown Star Wars fan

This book discusses its fair share of source code, and in many cases this source code may be found in the CodeSamples directory of this book’s git tree. For example, on UNIX systems, you should be able to type the following:

find CodeSamples -name rcu_rcpls.c -print

This command will locate the file rcu_rcpls.c, which is called out in Appendix B. Non-UNIX systems have their own well-known ways of locating files by filename.

1.5 Whose Book Is This?

If you become a teacher, by your pupils you’ll be taught.

Oscar Hammerstein II

As the cover says, the editor is one Paul E. McKenney. However, the editor does accept contributions via the [email protected] email list. These contributions can be in pretty much any form, with popular approaches including text emails, patches against the book’s LaTeX source, and even git pull requests. Use whatever form works best for you.

To create patches or git pull requests, you will need the LaTeX source to the book, which is at git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git. You will of course also need git and LaTeX, which are available as part of most mainstream Linux distributions. Other packages may be required, depending on the distribution you use. The required list of packages for a few popular distributions is listed in the file FAQ-BUILD.txt in the LaTeX source to the book.

To create and display a current LaTeX source tree of this book, use the list of Linux commands shown in Listing 1.1. In some environments, the evince command that displays perfbook.pdf may need to be replaced, for example, with acroread. The git clone command need only be used the first time you create a PDF; subsequently, you can run the commands shown in Listing 1.2 to pull in any updates and generate an updated PDF. The commands in Listing 1.2 must be run within the perfbook directory created by the commands shown in Listing 1.1.

Listing 1.1: Creating an Up-To-Date PDF
git clone git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
cd perfbook
# You may need to install a font. See item 1 in FAQ.txt.
make                     # -jN for parallel build
evince perfbook.pdf &    # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf & # One-column version for e-readers
make help                # Display other build options

Listing 1.2: Generating an Updated PDF
git remote update
git checkout origin/master
make                     # -jN for parallel build
evince perfbook.pdf &    # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf & # One-column version for e-readers

PDFs of this book are sporadically posted at https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html and at http://www.rdrop.com/users/paulmck/perfbook/.

The actual process of contributing patches and sending git pull requests is similar to that of the Linux kernel, which is documented here: https://www.kernel.org/doc/html/latest/process/submitting-patches.html. One important requirement is that each patch (or commit, in the case of a git pull request) must contain a valid Signed-off-by: line, which has the following format:

Signed-off-by: My Name <[email protected]>

Please see https://lkml.org/lkml/2007/1/15/219 for an example patch with a Signed-off-by: line. Note well that the Signed-off-by: line has a very specific meaning, namely that you are certifying that:

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

You must use your real name: I unfortunately cannot accept pseudonymous or anonymous contributions.

The language of this book is American English, however, the open-source nature of this book permits translations, and I personally encourage them. The open-source licenses covering this book additionally allow you to sell your translation, if you wish. I do request that you send me a copy of the translation (hardcopy if available), but this is a request made as a professional courtesy, and is not in any way a prerequisite to the permission that you already have under the Creative Commons and GPL licenses. Please see the FAQ.txt file in the source tree for a list of translations currently in progress. I consider a translation effort to be “in progress” once at least one chapter has been fully translated.

There are many styles under the “American English” rubric. The style for this particular book is documented in Appendix D.

As noted at the beginning of this section, I am this book’s editor. However, if you choose to contribute, it will be your book as well. In that spirit, I offer you Chapter 2, our introduction.
If parallel programming is so hard, why are there so many parallel programs?

Unknown

Chapter 2
Introduction

Parallel programming has earned a reputation as one of the most difficult areas a hacker can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race conditions, non-determinism, Amdahl’s-Law limits to scaling, and excessive realtime latencies. And these perils are quite real; we authors have accumulated uncounted years of experience along with the resulting emotional scars, grey hairs, and hair loss.

However, new technologies that are difficult to use at introduction invariably become easier over time. For example, the once-rare ability to drive a car is now commonplace in many countries. This dramatic change came about for two basic reasons: (1) Cars became cheaper and more readily available, so that more people had the opportunity to learn to drive, and (2) Cars became easier to operate due to automatic transmissions, automatic chokes, automatic starters, greatly improved reliability, and a host of other technological improvements.

The same is true for many other technologies, including computers. It is no longer necessary to operate a keypunch in order to program. Spreadsheets allow most non-programmers to get results from their computers that would have required a team of specialists a few decades ago. Perhaps the most compelling example is web-surfing and content creation, which since the early 2000s has been easily done by untrained, uneducated people using various now-commonplace social-networking tools. As recently as 1968, such content creation was a far-out research project [Eng68], described at the time as “like a UFO landing on the White House lawn” [Gri00].

Therefore, if you wish to argue that parallel programming will remain as difficult as it is currently perceived by many to be, it is you who bears the burden of proof, keeping in mind the many centuries of counter-examples in many fields of endeavor.

2.1 Historic Parallel Programming Difficulties

Not the power to remember, but its very opposite, the power to forget, is a necessary condition for our existence.

Sholem Asch

As indicated by its title, this book takes a different approach. Rather than complain about the difficulty of parallel programming, it instead examines the reasons why parallel programming is difficult, and then works to help the reader to overcome these difficulties. As will be seen, these difficulties have historically fallen into several categories, including:

1. The historic high cost and relative rarity of parallel systems.

2. The typical researcher’s and practitioner’s lack of experience with parallel systems.

3. The paucity of publicly accessible parallel code.

4. The lack of a widely understood engineering discipline of parallel programming.

5. The high overhead of communication relative to that of processing, even in tightly coupled shared-memory computers.

Many of these historic difficulties are well on the way to being overcome. First, over the past few decades, the cost of parallel systems has decreased from many multiples of that of a house to that of a modest meal, courtesy of Moore’s Law [Moo65]. Papers calling out the advantages of multicore CPUs were published as early
as 1996 [ONH+ 96]. IBM introduced simultaneous multithreading into its high-end POWER family in 2000, and multicore in 2001. Intel introduced hyperthreading into its commodity Pentium line in November 2000, and both AMD and Intel introduced dual-core CPUs in 2005. Sun followed with the multicore/multi-threaded Niagara in late 2005. In fact, by 2008, it was becoming difficult to find a single-CPU desktop system, with single-core CPUs being relegated to netbooks and embedded devices. By 2012, even smartphones were starting to sport multiple CPUs. By 2020, safety-critical software standards started addressing concurrency.

Second, the advent of low-cost and readily available multicore systems means that the once-rare experience of parallel programming is now available to almost all researchers and practitioners. In fact, parallel systems have long been within the budget of students and hobbyists. We can therefore expect greatly increased levels of invention and innovation surrounding parallel systems, and that increased familiarity will over time make the once prohibitively expensive field of parallel programming much more friendly and commonplace.

Third, in the 20th century, large systems of highly parallel software were almost always closely guarded proprietary secrets. In happy contrast, the 21st century has seen numerous open-source (and thus publicly available) parallel software projects, including the Linux kernel [Tor03], database systems [Pos08, MS08], and message-passing systems [The08, Uni08a]. This book will draw primarily from the Linux kernel, but will provide much material suitable for user-level applications.

Fourth, even though the large-scale parallel-programming projects of the 1980s and 1990s were almost all proprietary projects, these projects have seeded other communities with cadres of developers who understand the engineering discipline required to develop production-quality parallel code. A major purpose of this book is to present this engineering discipline.

Unfortunately, the fifth difficulty, the high cost of communication relative to that of processing, remains largely in force. This difficulty has been receiving increasing attention during the new millennium. However, according to Stephen Hawking, the finite speed of light and the atomic nature of matter will limit progress in this area [Gar07, Moo03]. Fortunately, this difficulty has been in force since the late 1980s, so that the aforementioned engineering discipline has evolved practical and effective strategies for handling it. In addition, hardware designers are increasingly aware of these issues, so perhaps future hardware will be more friendly to parallel software, as discussed in Section 3.3.

Quick Quiz 2.1: Come on now!!! Parallel programming has been known to be exceedingly hard for many decades. You seem to be hinting that it is not so hard. What sort of game are you playing?

However, even though parallel programming might not be as hard as is commonly advertised, it is often more work than is sequential programming.

Quick Quiz 2.2: How could parallel programming ever be as easy as sequential programming?

It therefore makes sense to consider alternatives to parallel programming. However, it is not possible to reasonably consider parallel-programming alternatives without understanding parallel-programming goals. This topic is addressed in the next section.

2.2 Parallel Programming Goals

If you don’t know where you are going, you will end up somewhere else.

Yogi Berra

The three major goals of parallel programming (over and above those of sequential programming) are as follows:

1. Performance.

2. Productivity.

3. Generality.

Unfortunately, given the current state of the art, it is possible to achieve at best two of these three goals for any given parallel program. These three goals therefore form the iron triangle of parallel programming, a triangle upon which overly optimistic hopes all too often come to grief.1

Quick Quiz 2.3: Oh, really??? What about correctness, maintainability, robustness, and so on?

Quick Quiz 2.4: And if correctness, maintainability, and robustness don’t make the list, why do productivity and generality?

1 Kudos to Michael Wong for naming the iron triangle.
2.2.2 Productivity

Quick Quiz 2.9: Why all this prattling on about non-technical issues??? And not just any non-technical issue, but productivity of all things?

One such machine was the CSIRAC, the oldest still-intact stored-program computer, which was put into operation in 1949 [Mus04, Dep06]. Because this machine was built before the transistor era, it was constructed of 2,000 vacuum tubes, ran with a clock frequency of 1 kHz, consumed 30 kW of power, and weighed more than three metric tons. Given that this machine had but 768 words of RAM, it is safe to say that it did not suffer from the productivity issues that often plague today’s large-scale software projects.

Today, it would be quite difficult to purchase a machine with so little computing power. Perhaps the closest equivalents are 8-bit embedded microprocessors exemplified by the venerable Z80 [Wik08], but even the old Z80 had a CPU clock frequency more than 1,000 times faster than the CSIRAC. The Z80 CPU had 8,500 transistors, and could be purchased in 2008 for less than $2 US per unit in 1,000-unit quantities. In stark contrast to the CSIRAC, software-development costs are anything but insignificant for the Z80.

The CSIRAC and the Z80 are two points in a long-term trend, as can be seen in Figure 2.2. This figure plots an approximation to computational power per die over the past four decades, showing an impressive six-order-of-magnitude increase over a period of forty years. Note that the advent of multicore CPUs has permitted this increase to continue apace despite the clock-frequency wall encountered in 2003, albeit courtesy of dies supporting more than 50 hardware threads each.

Figure 2.2: MIPS per Die for Intel CPUs (MIPS on a log scale versus year, 1975–2020)

One of the inescapable consequences of the rapid decrease in the cost of hardware is that software productivity becomes increasingly important. It is no longer sufficient merely to make efficient use of the hardware: It is now necessary to make extremely efficient use of software developers as well. This has long been the case for sequential hardware, but parallel hardware has become a low-cost commodity only recently. Therefore, only recently has high productivity become critically important when creating parallel software.

Quick Quiz 2.10: Given how cheap parallel systems have become, how can anyone afford to pay people to program them?

Perhaps at one time, the sole purpose of parallel software was performance. Now, however, productivity is gaining the spotlight.

2.2.3 Generality

One way to justify the high cost of developing parallel software is to strive for maximal generality. All else being equal, the cost of a more-general software artifact can be spread over more users than that of a less-general one. In fact, this economic force explains much of the maniacal focus on portability, which can be seen as an important special case of generality.4

4 Kudos to Michael Wong for pointing this out.

Figure 2.3: Software Layers and Performance, Productivity, and Generality (the software stack from bottom to top: Hardware, Firmware, Hypervisor, Operating System Kernel, System Libraries, Container)

Figure 2.4: Tradeoff Between Productivity and Generality (special-purpose environments productive for users 1 through 4 surround a general-purpose environment near the hardware/abstraction center)

Unfortunately, generality often comes at the cost of performance, productivity, or both. For example, portability is often achieved via adaptation layers, which inevitably exact a performance penalty. To see this more generally, consider the following popular parallel programming environments:

C/C++ “Locking Plus Threads”: This category, which includes POSIX Threads (pthreads) [Ope97], Windows Threads, and numerous operating-system kernel environments, offers excellent performance (at least within the confines of a single system) and generality, but relatively low productivity.

Java: This general-purpose and inherently multithreaded programming environment is widely believed to offer much higher productivity than C or C++, courtesy of the automatic garbage collector and the rich set of class libraries. However, its performance, though greatly improved in the early 2000s, lags that of C and C++.

MPI: This Message Passing Interface [MPI08] powers the largest scientific and technical computing clusters in the world and offers unparalleled performance and scalability. In theory, it is general purpose, but it is mainly used for scientific and technical computing. Its productivity is believed by many to be even lower than that of C/C++ “locking plus threads” environments.

OpenMP: This set of compiler directives can be used to parallelize loops. It is thus quite specific to this task, and this specificity often limits its performance. It is, however, much easier to use than MPI or C/C++ “locking plus threads” (see the short example following this list).

SQL: Structured Query Language [Int92] is specific to relational database queries. However, its performance is quite good as measured by the Transaction Processing Performance Council (TPC) benchmark results [Tra01]. Productivity is excellent; in fact, this parallel programming environment enables people to make good use of a large parallel system despite having little or no knowledge of parallel programming concepts.
The nirvana of parallel programming environments, one that offers world-class performance, productivity, and generality, simply does not yet exist. Until such a nirvana appears, it will be necessary to make engineering tradeoffs among performance, productivity, and generality. One such tradeoff is depicted by the green “iron triangle”5 shown in Figure 2.3, which shows how productivity becomes increasingly important at the upper layers of the system stack, while performance and generality become increasingly important at the lower layers of the system stack. The huge development costs incurred at the lower layers must be spread over equally huge numbers of users (hence the importance of generality), and performance lost in lower layers cannot easily be recovered further up the stack. In the upper layers of the stack, there might be very few users for a given specific application, in which case productivity concerns are paramount. This explains the tendency towards “bloatware” further up the stack: Extra hardware is often cheaper than extra developers. This book is intended for developers working near the bottom of the stack, where performance and generality are of greatest concern.

5 Kudos to Michael Wong for coining “iron triangle.”

It is important to note that a tradeoff between productivity and generality has existed for centuries in many fields. For but one example, a nailgun is more productive than a hammer for driving nails, but in contrast to the
nailgun, a hammer can be used for many things besides driving nails. It should therefore be no surprise to see similar tradeoffs appear in the field of parallel computing.

This tradeoff is shown schematically in Figure 2.4. Here, users 1, 2, 3, and 4 have specific jobs that they need the computer to help them with. The most productive possible language or environment for a given user is one that simply does that user’s job, without requiring any programming, configuration, or other setup.

Quick Quiz 2.11: This is a ridiculously unachievable ideal! Why not focus on something that is achievable in practice?

Unfortunately, a system that does the job required by user 1 is unlikely to do user 2’s job. In other words, the most productive languages and environments are domain-specific, and thus by definition lacking generality.

Another option is to tailor a given programming language or environment to the hardware system (for example, low-level languages such as assembly, C, C++, or Java) or to some abstraction (for example, Haskell, Prolog, or Snobol), as is shown by the circular region near the center of Figure 2.4. These languages can be considered to be general in the sense that they are equally ill-suited to the jobs required by users 1, 2, 3, and 4. In other words, their generality comes at the expense of decreased productivity when compared to domain-specific languages and environments. Worse yet, a language that is tailored to a given abstraction is likely to suffer from performance and scalability problems unless and until it can be efficiently mapped to real hardware.

Is there no escape from the iron triangle’s three conflicting goals of performance, productivity, and generality?

It turns out that there often is an escape, for example, using the alternatives to parallel programming discussed in the next section. After all, parallel programming can be a great deal of fun, but it is not always the best tool for the job.

2.3 Alternatives to Parallel Programming

Experiment is folly when experience shows the way.

Roger M. Babson

In order to properly consider alternatives to parallel programming, you must first decide on what exactly you expect the parallelism to do for you. As seen in Section 2.2, the primary goals of parallel programming are performance, productivity, and generality. Because this book is intended for developers working on performance-critical code near the bottom of the software stack, the remainder of this section focuses primarily on performance improvement.

It is important to keep in mind that parallelism is but one way to improve performance. Other well-known approaches include the following, in roughly increasing order of difficulty:

1. Run multiple instances of a sequential application.

2. Make the application use existing parallel software.

3. Optimize the serial application.

These approaches are covered in the following sections.

2.3.1 Multiple Instances of a Sequential Application

Running multiple instances of a sequential application can allow you to do parallel programming without actually doing parallel programming. There are a large number of ways to approach this, depending on the structure of the application.

If your program is analyzing a large number of different scenarios, or is analyzing a large number of independent data sets, one easy and effective approach is to create a single sequential program that carries out a single analysis, then use any of a number of scripting environments (for example the bash shell) to run a number of instances of that sequential program in parallel. In some cases, this approach can be easily extended to a cluster of machines.

This approach may seem like cheating, and in fact some denigrate such programs as “embarrassingly parallel”. And in fact, this approach does have some potential disadvantages, including increased memory consumption, waste of CPU cycles recomputing common intermediate results, and increased copying of data. However, it is often extremely productive, garnering extreme performance gains with little or no added effort.
2.3.2 Use Existing Parallel Software

There is no longer any shortage of parallel software environments that can present a single-threaded programming environment, including relational databases [Dat82], web-application servers, and map-reduce environments. For example, a common design provides a separate process for each user, each of which generates SQL from user queries. This per-user SQL is run against a common relational database, which automatically runs the users’ queries concurrently. The per-user programs are responsible only for the user interface, with the relational database taking full responsibility for the difficult issues surrounding parallelism and persistence.

In addition, there are a growing number of parallel library functions, particularly for numeric computation. Even better, some libraries take advantage of special-purpose hardware such as vector units and general-purpose graphical processing units (GPGPUs).

Taking this approach often sacrifices some performance, at least when compared to carefully hand-coding a fully parallel application. However, such sacrifice is often well repaid by a huge reduction in development effort.

Quick Quiz 2.12: Wait a minute! Doesn’t this approach simply shift the development effort from you to whoever wrote the existing parallel software you are using?

2.3.3 Performance Optimization

Up through the early 2000s, CPU clock frequencies doubled every 18 months. It was therefore usually more important to create new functionality than to carefully optimize performance. Now that Moore’s Law is “only” increasing transistor density instead of increasing both transistor density and per-transistor performance, it might be a good time to rethink the importance of performance optimization. After all, new hardware generations no longer bring significant single-threaded performance improvements. Furthermore, many performance optimizations can also conserve energy.

From this viewpoint, parallel programming is but another performance optimization, albeit one that is becoming much more attractive as parallel systems become cheaper and more readily available. However, it is wise to keep in mind that the speedup available from parallelism is limited to roughly the number of CPUs (but see Section 6.5 for an interesting exception). In contrast, the speedup available from traditional single-threaded software optimizations can be much larger. For example, replacing a long linked list with a hash table or a search tree can improve performance by many orders of magnitude. This highly optimized single-threaded program might run much faster than its unoptimized parallel counterpart, making parallelization unnecessary. Of course, a highly optimized parallel program would be even better, aside from the added development effort required.

Furthermore, different programs might have different performance bottlenecks. For example, if your program spends most of its time waiting on data from your disk drive, using multiple CPUs will probably just increase the time wasted waiting for the disks. In fact, if the program was reading from a single large file laid out sequentially on a rotating disk, parallelizing your program might well make it a lot slower due to the added seek overhead. You should instead optimize the data layout so that the file can be smaller (thus faster to read), split the file into chunks which can be accessed in parallel from different drives, cache frequently accessed data in main memory, or, if possible, reduce the amount of data that must be read.

Quick Quiz 2.13: What other bottlenecks might prevent additional CPUs from providing additional performance?

Parallelism can be a powerful optimization technique, but it is not the only such technique, nor is it appropriate for all situations. Of course, the easier it is to parallelize your program, the more attractive parallelization becomes as an optimization. Parallelization has a reputation of being quite difficult, which leads to the question “exactly what makes parallel programming so difficult?”

2.4 What Makes Parallel Programming Hard?

Real difficulties can be overcome; it is only the imaginary ones that are unconquerable.

Theodore N. Vail

It is important to note that the difficulty of parallel programming is as much a human-factors issue as it is a set of technical properties of the parallel programming problem. We do need human beings to be able to tell parallel systems what to do, otherwise known as programming. But parallel programming involves two-way communication, with a program’s performance and scalability being the communication from the machine to the human. In short, the human writes a program telling the computer what to do, and the computer critiques this program via the resulting performance and scalability. Therefore, appeals to abstractions or to mathematical analyses will often be of severely limited utility.

In the Industrial Revolution, the interface between human and machine was evaluated by human-factor studies, then called time-and-motion studies. Although there have
been a few human-factor studies examining parallel programming [ENS05, ES05, HCS+ 05, SS94], these studies have been extremely narrowly focused, and hence unable to demonstrate any general results. Furthermore, given that the normal range of programmer productivity spans more than an order of magnitude, it is unrealistic to expect an affordable study to be capable of detecting (say) a 10 % difference in productivity. Although the multiple-order-of-magnitude differences that such studies can reliably detect are extremely valuable, the most impressive improvements tend to be based on a long series of 10 % improvements. We must therefore take a different approach.

One such approach is to carefully consider the tasks that parallel programmers must undertake that are not required of sequential programmers. We can then evaluate how well a given programming language or environment assists the developer with these tasks. These tasks fall into the four categories shown in Figure 2.5, each of which is covered in the following sections.

Figure 2.5: Categories of Tasks Required of Parallel Programmers (work partitioning, parallel access control, resource partitioning and replication, and interacting with hardware, all influencing performance, productivity, and generality)

2.4.1 Work Partitioning

Work partitioning is absolutely required for parallel execution: If there is but one “glob” of work, then it can be executed by at most one CPU at a time, which is by definition sequential execution. However, partitioning the work requires great care. For example, uneven partitioning can result in sequential execution once the small partitions have completed [Amd67]. In less extreme cases, load balancing can be used to fully utilize available hardware and restore performance and scalability.

Although partitioning can greatly improve performance and scalability, it can also increase complexity. For example, partitioning can complicate handling of global errors and events: A parallel program may need to carry out non-trivial synchronization in order to safely process such global events. More generally, each partition requires some sort of communication: After all, if a given thread did not communicate at all, it would have no effect and would thus not need to be executed. However, because communication incurs overhead, careless partitioning choices can result in severe performance degradation.

Furthermore, the number of concurrent threads must often be controlled, as each such thread occupies common resources, for example, space in CPU caches. If too many threads are permitted to execute concurrently, the CPU caches will overflow, resulting in a high cache miss rate, which in turn degrades performance. Conversely, large numbers of threads are often required to overlap computation and I/O so as to fully utilize I/O devices.

Quick Quiz 2.14: Other than CPU cache capacity, what might require limiting the number of concurrent threads?

Finally, permitting threads to execute concurrently greatly increases the program’s state space, which can make the program difficult to understand and debug, degrading productivity. All else being equal, smaller state spaces having more regular structure are more easily understood, but this is a human-factors statement as much as it is a technical or mathematical statement. Good parallel designs might have extremely large state spaces, but nevertheless be easy to understand due to their regular structure, while poor designs can be impenetrable despite having a comparatively small state space. The best designs exploit embarrassing parallelism, or transform the problem to one having an embarrassingly parallel solution. In either case, “embarrassingly parallel” is in fact an embarrassment of riches. The current state of the art enumerates good designs; more work is required to make more general judgments on state-space size and structure.

2.4.2 Parallel Access Control

Given a single-threaded sequential program, that single thread has full access to all of the program’s resources. These resources are most often in-memory data structures, but can be CPUs, memory (including caches), I/O devices, computational accelerators, files, and much else besides.

The first parallel-access-control issue is whether the form of access to a given resource depends on that resource’s location. For example, in many message-passing environments, local-variable access is via expressions and assignments, while remote-variable access uses an entirely different syntax, usually involving messaging.
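The short pthreads sketch below ties the two preceding ideas together: the work of summing an array is statically partitioned among four threads, and the shared total is accessed with ordinary C expressions and assignments (plus a lock), the same syntax used for any other variable in a shared-memory environment. It is only an illustration under those assumptions, not code from this book’s CodeSamples.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (4 * 1024 * 1024)

static long x[N];
static long total;             /* Shared; protected by total_lock. */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums its own quarter of the array (work partitioning),
 * then merges its result into the shared total under a lock
 * (parallel access control). */
static void *sum_part(void *arg)
{
        long me = (long)arg;
        long i, sum = 0;

        for (i = me * (N / NTHREADS); i < (me + 1) * (N / NTHREADS); i++)
                sum += x[i];
        pthread_mutex_lock(&total_lock);
        total += sum;          /* Same syntax as any other C assignment. */
        pthread_mutex_unlock(&total_lock);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long t;

        for (t = 0; t < N; t++)
                x[t] = 1;
        for (t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL, sum_part, (void *)t);
        for (t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
        printf("total = %ld\n", total);
        return 0;
}

A production version would need to choose the number of threads and the partition boundaries to match the hardware and the data, which is exactly the work-partitioning judgment discussed above.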
can sometimes greatly enhance both performance and scalability [Met99].

2.4.6 How Do Languages and Environments Assist With These Tasks?

Although many environments require the developer to deal manually with these tasks, there are long-standing environments that bring significant automation to bear. The poster child for these environments is SQL, many implementations of which automatically parallelize single large queries and also automate concurrent execution of independent queries and updates.

These four categories of tasks must be carried out in all parallel programs, but that of course does not necessarily mean that the developer must manually carry out these tasks. We can expect to see ever-increasing automation of these four tasks as parallel systems continue to become cheaper and more readily available.

Quick Quiz 2.16: Are there any other obstacles to parallel programming?

2.5 Discussion

Until you try, you don’t know what you can’t do.
Henry James

The authors of these older guides were well up to the parallel programming challenge back in the 1980s. As such, there are simply no excuses for refusing to step up to the parallel-programming challenge here in the 21st century!

We are now ready to proceed to the next chapter, which dives into the relevant properties of the parallel hardware underlying our parallel software.
Chapter 3

Hardware and its Habits

Premature abstraction is the root of all evil.
A cast of thousands
[Figure: a modern CPU core with two hardware threads (Thread 0 and Thread 1), a 4.0 GHz clock, a 20 MB L3 cache, and a 20-stage pipeline, with instructions decoded and translated into micro-ops for the scheduler.]
3.2 Overheads
Table 3.1: CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10 GHz

  Operation                       Cost (ns)     Ratio (cost/clock)   CPUs
  Clock period                          0.5             1.0
  Same-CPU CAS                          7.0            14.6          0
  Same-CPU lock                        15.4            32.3          0
  In-core blind CAS                     7.2            15.2          224
  In-core CAS                          18.0            37.7          224
  Off-core blind CAS                   47.5            99.8          1–27, 225–251
  Off-core CAS                        101.9           214.0          1–27, 225–251
  Off-socket blind CAS                148.8           312.5          28–111, 252–335
  Off-socket CAS                      442.9           930.1          28–111, 252–335
  Cross-interconnect blind CAS        336.6           706.8          112–223, 336–447
  Cross-interconnect CAS              944.8         1,984.2          112–223, 336–447
  Off-System
    Comms Fabric                      5,000          10,500
    Global Comms                195,000,000     409,500,000
The “same-CPU” prefix means that the CPU now performing the CAS operation on a given variable was also the last CPU to access this variable, so that the corresponding cacheline is already held in that CPU’s cache. Similarly, the same-CPU lock operation (a “round trip” pair consisting of a lock acquisition and release) consumes more than fifteen nanoseconds, or more than thirty clock cycles. The lock operation is more expensive than CAS because it requires two atomic operations on the lock data structure, one for acquisition and the other for release.

In-core operations involving interactions between the hardware threads sharing a single core are about the same cost as same-CPU operations. This should not be too surprising, given that these two hardware threads also share the full cache hierarchy.

In the case of the blind CAS, the software specifies the old value without looking at the memory location. This approach is appropriate when attempting to acquire a lock. If the unlocked state is represented by zero and the locked state is represented by the value one, then a CAS operation on the lock that specifies zero for the old value and one for the new value will acquire the lock if it is not already held. The key point is that there is only one access to the memory location, namely the CAS operation itself.

In contrast, a normal CAS operation’s old value is derived from some earlier load. For example, to implement an atomic increment, the current value of that location is loaded and that value is incremented to produce the new value. Then in the CAS operation, the value actually loaded would be specified as the old value and the incremented value as the new value. If the value had not been changed between the load and the CAS, this would increment the memory location. However, if the value had in fact changed, then the old value would not match, causing a miscompare that would result in the CAS operation failing. The key point is that there are now two accesses to the memory location, the load and the CAS.

Thus, it is not surprising that in-core blind CAS consumes only about seven nanoseconds, while in-core CAS consumes about 18 nanoseconds. The non-blind case’s extra load does not come for free. That said, the overheads of these operations are similar to single-CPU CAS and lock, respectively.

Quick Quiz 3.6: Table 3.1 shows CPU 0 sharing a core with CPU 224. Shouldn’t that instead be CPU 1???

A blind CAS involving CPUs in different cores but on the same socket consumes almost fifty nanoseconds, or almost one hundred clock cycles. The code used for this cache-miss measurement passes the cache line back and forth between a pair of CPUs, so this cache miss is satisfied not from memory, but rather from the other CPU’s cache. A non-blind CAS operation, which as noted earlier must look at the old value of the variable as well as store a new value, consumes over one hundred nanoseconds, or more than two hundred clock cycles.
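For concreteness, the following rough sketch (not taken from this book’s CodeSamples) shows both flavors in C, using GCC’s __atomic_compare_exchange_n() builtin; the lock and counter variables are purely illustrative:

static int lock;                  /* 0: unlocked, 1: locked (illustrative). */
static unsigned long counter;     /* Illustrative counter. */

static int try_acquire(void)      /* Blind CAS: one access to the lock word. */
{
	int expected = 0;

	return __atomic_compare_exchange_n(&lock, &expected, 1, 0,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static void atomic_inc(void)      /* Load followed by CAS: two accesses. */
{
	unsigned long old, new;

	do {
		old = __atomic_load_n(&counter, __ATOMIC_RELAXED);
		new = old + 1;
	} while (!__atomic_compare_exchange_n(&counter, &old, new, 0,
					      __ATOMIC_RELAXED, __ATOMIC_RELAXED));
}

In the blind case, the expected value (zero) is supplied without first loading the lock word, whereas the increment must load counter before it can compute the value to be stored.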
Table 3.2: Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10 GHz

  Level   Scope    Line Size   Sets     Ways   Size
  L0      Core     64          64       8      32K
  L1      Core     64          64       8      32K
  L2      Core     64          1024     16     1024K
  L3      Socket   64          57,344   11     39,424K

Think about this a bit. In the time required to do one CAS operation, the CPU could have executed more than two hundred normal instructions. This should demonstrate the limitations not only of fine-grained locking, but of any other synchronization mechanism relying on fine-grained global agreement.

If the pair of CPUs are on different sockets, the operations are considerably more expensive. A blind CAS operation consumes almost 150 nanoseconds, or more than three hundred clock cycles. A normal CAS operation consumes more than 400 nanoseconds, or almost one thousand clock cycles.

Worse yet, not all pairs of sockets are created equal. This particular system appears to be constructed as a pair of four-socket components, with additional latency penalties when the CPUs reside in different components. In this case, a blind CAS operation consumes more than three hundred nanoseconds, or more than seven hundred clock cycles. A CAS operation consumes almost a full microsecond, or almost two thousand clock cycles.

Quick Quiz 3.7: Surely the hardware designers could be persuaded to improve this situation! Why have they been content with such abysmal performance for these single-instruction operations?

Unfortunately, the high speed of within-core and within-socket communication does not come for free. First, there are only two CPUs within a given core and only 56 within a given socket, compared to 448 across the system. Second, as shown in Table 3.2, the in-core caches are quite small compared to the in-socket caches, which are in turn quite small compared to the 1.4 TB of memory configured on this system. Third, again referring to Table 3.2, the caches are organized as a hardware hash table with a limited number of items per bucket. For example, the raw size of the L3 cache (“Size”) is almost 40 MB, but each bucket (“Line”) can only hold 11 blocks of memory (“Ways”), each of which can be at most 64 bytes (“Line Size”). This means that only 12 cachelines of memory (admittedly at carefully chosen addresses) are required to overflow this 40 MB cache. On the other hand, equally careful choice of addresses might make good use of the entire 40 MB. Spatial locality of reference is clearly extremely important, as is spreading the data across memory.

I/O operations are even more expensive. As shown in the “Comms Fabric” row, high performance (and expensive!) communications fabric, such as InfiniBand or any number of proprietary interconnects, has a latency of roughly five microseconds for an end-to-end round trip, during which time more than ten thousand instructions might have been executed. Standards-based communications networks often require some sort of protocol processing, which further increases the latency. Of course, geographic distance also increases latency, with the speed-of-light through optical fiber latency around the world coming to roughly 195 milliseconds, or more than 400 million clock cycles, as shown in the “Global Comms” row.

Quick Quiz 3.8: These numbers are insanely large! How can I possibly get my head around them?

3.2.3 Hardware Optimizations

It is only natural to ask how the hardware is helping, and the answer is “Quite a bit!”

One hardware optimization is large cachelines. This can provide a big performance boost, especially when software is accessing memory sequentially. For example, given a 64-byte cacheline and software accessing 64-bit variables, the first access will still be slow due to speed-of-light delays (if nothing else), but the remaining seven can be quite fast. However, this optimization has a dark side, namely false sharing, which happens when different variables in the same cacheline are being updated by different CPUs, resulting in a high cache-miss rate. Software can use the alignment directives available in many compilers to avoid false sharing, and adding such directives is a common step in tuning parallel software.
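As a rough illustration (not code from this book), one common form such a directive takes is GCC’s aligned attribute, here used to give each of a pair of heavily updated counters its own cacheline; the names and the 64-byte line size are assumptions:

struct padded_counter {
	unsigned long value;
} __attribute__((__aligned__(64)));	/* One counter per 64-byte cacheline. */

/* Without the attribute, these two counters would likely share a cacheline,
 * so that CPUs updating them would bounce that line back and forth. */
static struct padded_counter counts[2];

static void inc_count(int idx)
{
	__atomic_fetch_add(&counts[idx].value, 1, __ATOMIC_RELAXED);
}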
A second related hardware optimization is cache prefetching, in which the hardware reacts to consecutive accesses by prefetching subsequent cachelines, thereby evading speed-of-light delays for these subsequent cachelines. Of course, the hardware must use simple heuristics to determine when to prefetch, and these heuristics can be fooled by the complex data-access patterns in many applications. Fortunately, some CPU families allow for this by providing special prefetch instructions.
Unfortunately, the effectiveness of these instructions in the general case is subject to some dispute.

A third hardware optimization is the store buffer, which allows a string of store instructions to execute quickly even when the stores are to non-consecutive addresses and when none of the needed cachelines are present in the CPU’s cache. The dark side of this optimization is memory misordering, for which see Chapter 15.

A fourth hardware optimization is speculative execution, which can allow the hardware to make good use of the store buffers without resulting in memory misordering. The dark side of this optimization can be energy inefficiency and lowered performance if the speculative execution goes awry and must be rolled back and retried. Worse yet, the advent of Spectre and Meltdown [Hor18] made it apparent that hardware speculation can also enable side-channel attacks that defeat memory-protection hardware so as to allow unprivileged processes to read memory that they should not have access to. It is clear that the combination of speculative execution and cloud computing needs more than a bit of rework!

A fifth hardware optimization is large caches, allowing individual CPUs to operate on larger datasets without incurring expensive cache misses. Although large caches can degrade energy efficiency and cache-miss latency, the ever-growing cache sizes on production microprocessors attest to the power of this optimization.

A final hardware optimization is read-mostly replication, in which data that is frequently read but rarely updated is present in all CPUs’ caches. This optimization allows the read-mostly data to be accessed exceedingly efficiently, and is the subject of Chapter 9.

In short, hardware and software engineers are really on the same side, with both trying to make computers go fast despite the best efforts of the laws of physics, as fancifully depicted in Figure 3.11.

Figure 3.11: Hardware and Software: On Same Side

3.3 Hardware Free Lunch?

The great trouble today is that there are too many people looking for someone else to do something for them. The solution to most of our troubles is to be found in everyone doing something for themselves.
Henry Ford, updated

The major reason that concurrency has been receiving so much focus over the past few years is the end of Moore’s-Law induced single-threaded performance increases (or “free lunch” [Sut08]), as shown in Figure 2.1 on page 9. This section briefly surveys a few ways that hardware designers might bring back the “free lunch”.

However, the preceding section presented some substantial hardware obstacles to exploiting concurrency. One severe physical limitation that hardware designers face is the finite speed of light. As noted in Figure 3.10 on page 22, light can manage only about an 8-centimeter round trip in a vacuum during the duration of a 1.8 GHz clock period. This distance drops to about 3 centimeters for a 5 GHz clock. Both of these distances are relatively small compared to the size of a modern computer system.

To make matters even worse, electric waves in silicon move from three to thirty times more slowly than does light in a vacuum, and common clocked logic constructs run still more slowly, for example, a memory reference may need to wait for a local cache lookup to complete before the request may be passed on to the rest of the system. Furthermore, relatively low speed and high power drivers are required to move electrical signals from one silicon die to another, for example, to communicate between a CPU and main memory.

Quick Quiz 3.9: But individual electrons don’t move anywhere near that fast, even in conductors!!! The electron drift velocity in a conductor under semiconductor voltage levels is on the order of only one millimeter per second. What gives???

There are nevertheless some technologies (both hardware and software) that might help improve matters:
These technologies might not restore the single-threaded performance increases to which some people have become accustomed. That said, they may be necessary steps on the path to the late Jim Gray’s “smoking hairy golf balls” [Gra02].
between electricity and light and vice versa, resulting in both power-consumption and heat-dissipation problems. That said, absent some fundamental advances in the field of physics, any exponential increases in the speed of data flow will be sharply limited by the actual speed of light in a vacuum.

3.3.4 Special-Purpose Accelerators

A general-purpose CPU working on a specialized problem is often spending significant time and energy doing work that is only tangentially related to the problem at hand. For example, when taking the dot product of a pair of vectors, a general-purpose CPU will normally use a loop (possibly unrolled) with a loop counter. Decoding the instructions, incrementing the loop counter, testing this counter, and branching back to the top of the loop are in some sense wasted effort: The real goal is instead to multiply corresponding elements of the two vectors. Therefore, a specialized piece of hardware designed specifically to multiply vectors could get the job done more quickly and with less energy consumed.

This is in fact the motivation for the vector instructions present in many commodity microprocessors. Because these instructions operate on multiple data items simultaneously, they would permit a dot product to be computed with less instruction-decode and loop overhead.
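For reference, here is the scalar loop in question as a rough sketch (not code from this book); each pass spends loop-control instructions on work that a vector unit would amortize over several elements at once:

/* Scalar dot product: the multiply-and-add is the real work, while the
 * counter increment, comparison, and backwards branch are per-iteration
 * overhead that vector instructions largely eliminate. */
double dot_product(const double *a, const double *b, long n)
{
	double sum = 0.0;
	long i;

	for (i = 0; i < n; i++)
		sum += a[i] * b[i];
	return sum;
}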
Similarly, specialized hardware can more efficiently encrypt and decrypt, compress and decompress, encode and decode, and many other tasks besides. Unfortunately, this efficiency does not come for free. A computer system incorporating this specialized hardware will contain more transistors, which will consume some power even when not in use. Software must be modified to take advantage of this specialized hardware, and this specialized hardware must be sufficiently generally useful that the high up-front hardware-design costs can be spread over enough users to make the specialized hardware affordable. In part due to these sorts of economic considerations, specialized hardware has thus far appeared only for a few application areas, including graphics processing (GPUs), vector processors (MMX, SSE, and VMX instructions), and, to a lesser extent, encryption. And even in these areas, it is not always easy to realize the expected performance gains, for example, due to thermal throttling [Kra17, Lem18, Dow20].

Unlike the server and PC arena, smartphones have long used a wide variety of hardware accelerators. These hardware accelerators are often used for media decoding, so much so that a high-end MP3 player might be able to play audio for several minutes—with its CPU fully powered off the entire time. The purpose of these accelerators is to improve energy efficiency and thus extend battery life: Special purpose hardware can often compute more efficiently than can a general-purpose CPU. This is another example of the principle called out in Section 2.2.3: Generality is almost never free.

Nevertheless, given the end of Moore’s-Law-induced single-threaded performance increases, it seems safe to assume that increasing varieties of special-purpose hardware will appear.

3.3.5 Existing Parallel Software

Although multicore CPUs seem to have taken the computing industry by surprise, the fact remains that shared-memory parallel computer systems have been commercially available for more than a quarter century. This is more than enough time for significant parallel software to make its appearance, and it indeed has. Parallel operating systems are quite commonplace, as are parallel threading libraries, parallel relational database management systems, and parallel numerical software. Use of existing parallel software can go a long ways towards solving any parallel-software crisis we might encounter.

Perhaps the most common example is the parallel relational database management system. It is not unusual for single-threaded programs, often written in high-level scripting languages, to access a central relational database concurrently. In the resulting highly parallel system, only the database need actually deal directly with parallelism. A very nice trick when it works!

3.4 Software Design Implications

One ship drives east and another west
While the self-same breezes blow;
’Tis the set of the sail and not the gale
That bids them where to go.
Ella Wheeler Wilcox

The values of the ratios in Table 3.1 are critically important, as they limit the efficiency of a given parallel application. To see this, suppose that the parallel application uses CAS operations to communicate among threads. These CAS operations will typically involve a cache miss, that is, assuming that the threads are communicating primarily with each other rather than with themselves.
Suppose further that the unit of work corresponding to each CAS communication operation takes 300 ns, which is sufficient time to compute several floating-point transcendental functions. Then about half of the execution time will be consumed by the CAS communication operations! This in turn means that a two-CPU system running such a parallel program would run no faster than a sequential implementation running on a single CPU.

The situation is even worse in the distributed-system case, where the latency of a single communications operation might take as long as thousands or even millions of floating-point operations. This illustrates how important it is for communications operations to be extremely infrequent and to enable very large quantities of processing.

Quick Quiz 3.10: Given that distributed-systems communication is so horribly expensive, why does anyone bother with such systems?

3. The bad news is that the overhead of cache misses is still high, especially on large systems.

The remainder of this book describes ways of handling this bad news.

In particular, Chapter 4 will cover some of the low-level tools used for parallel programming, Chapter 5 will investigate problems and solutions to parallel counting, and Chapter 6 will discuss design disciplines that promote performance and scalability.
Chapter 4

Tools of the Trade

You are only as good as your tools, and your tools are only as good as you are.
Unknown
Please note that this chapter provides but a brief introduction. More detail is available from the references (and from the Internet), and more information will be provided in later chapters.

4.1 Scripting Languages

The supreme excellence is simplicity.
Henry Wadsworth Longfellow, simplified

The Linux shell scripting languages provide simple but effective ways of managing parallelism. For example, suppose that you had a program compute_it that you needed to run twice with two different sets of arguments. This can be accomplished using UNIX shell scripting as follows:

1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out

Lines 1 and 2 launch two instances of this program, redirecting their output to two separate files, with the & character directing the shell to run the two instances of the program in the background. Line 3 waits for both instances to complete, and lines 4 and 5 display their output. The resulting execution is as shown in Figure 4.1: The two instances of compute_it execute in parallel, wait completes after both of them do, and then the two instances of cat execute sequentially.

Figure 4.1: Execution Diagram for Parallel Shell Execution

Quick Quiz 4.2: But this silly shell script isn’t a real parallel program! Why bother with such trivia???

Quick Quiz 4.3: Is there a simpler way to create a parallel shell script? If so, how? If not, why not?

For another example, the make software-build scripting language provides a -j option that specifies how much parallelism should be introduced into the build process. Thus, typing make -j4 when building a Linux kernel specifies that up to four build steps be executed concurrently.
It is hoped that these simple examples convince you that parallel programming need not always be complex or difficult.

Quick Quiz 4.4: But if script-based parallel programming is so easy, why bother with anything else?

4.2 POSIX Multiprocessing

A camel is a horse designed by committee.
Unknown

This section scratches the surface of the POSIX environment, including pthreads [Ope97], as this environment is readily available and widely implemented. Section 4.2.1 provides a glimpse of the POSIX fork() and related primitives, Section 4.2.2 touches on thread creation and destruction, Section 4.2.3 gives a brief overview of POSIX locking, and, finally, Section 4.2.4 describes a specific lock which can be used for data that is read by many threads and only occasionally updated.

4.2.1 POSIX Process Creation and Destruction

Processes are created using the fork() primitive, they may be destroyed using the kill() primitive, and they may destroy themselves using the exit() primitive. A process executing a fork() primitive is said to be the “parent” of the newly created process. A parent may wait on its children using the wait() primitive.

Please note that the examples in this section are quite simple. Real-world applications using these primitives might need to manipulate signals, file descriptors, shared memory segments, and any number of other resources. In addition, some applications need to take specific actions if a given child terminates, and might also need to be concerned with the reason that the child terminated. These issues can of course add substantial complexity to the code. For more information, see any of a number of textbooks on the subject [Ste92, Wei13].

If fork() succeeds, it returns twice, once for the parent and again for the child. The value returned from fork() allows the caller to tell the difference, as shown in Listing 4.1 (forkjoin.c).

Listing 4.1: Using the fork() Primitive
 1 pid = fork();
 2 if (pid == 0) {
 3   /* child */
 4 } else if (pid < 0) {
 5   /* parent, upon error */
 6   perror("fork");
 7   exit(EXIT_FAILURE);
 8 } else {
 9   /* parent, pid == child ID */
10 }

Line 1 executes the fork() primitive, and saves its return value in local variable pid. Line 2 checks to see if pid is zero, in which case, this is the child, which continues on to execute line 3. As noted earlier, the child may terminate via the exit() primitive. Otherwise, this is the parent, which checks for an error return from the fork() primitive on line 4, and prints an error and exits on lines 5–7 if so. Otherwise, the fork() has executed successfully, and the parent therefore executes line 9 with the variable pid containing the process ID of the child.

The parent process may use the wait() primitive to wait for its children to complete. However, use of this primitive is a bit more complicated than its shell-script counterpart, as each invocation of wait() waits for but one child process. It is therefore customary to wrap wait() into a function similar to the waitall() function shown in Listing 4.2 (api-pthreads.h), with this waitall() function having semantics similar to the shell-script wait command.

Listing 4.2: Using the wait() Primitive
 1 static __inline__ void waitall(void)
 2 {
 3   int pid;
 4   int status;
 5
 6   for (;;) {
 7     pid = wait(&status);
 8     if (pid == -1) {
 9       if (errno == ECHILD)
10         break;
11       perror("wait");
12       exit(EXIT_FAILURE);
13     }
14   }
15 }

Each pass through the loop spanning lines 6–14 waits on one child process. Line 7 invokes the wait() primitive, which blocks until a child process exits, and returns that child’s process ID. If the process ID is instead −1, this indicates that the wait() primitive was unable to wait on a child. If so, line 9 checks for the ECHILD errno, which indicates that there are no more child processes, so that line 10 exits the loop. Otherwise, lines 11 and 12 print an error and exit.

Quick Quiz 4.5: Why does this wait() primitive need to be so complicated? Why not just make it work like the shell-script wait does?
Listing 4.3: Processes Created Via fork() Do Not Share Memory
 1 int x = 0;
 2
 3 int main(int argc, char *argv[])
 4 {
 5   int pid;
 6
 7   pid = fork();
 8   if (pid == 0) { /* child */
 9     x = 1;
10     printf("Child process set x=1\n");
11     exit(EXIT_SUCCESS);
12   }
13   if (pid < 0) { /* parent, upon error */
14     perror("fork");
15     exit(EXIT_FAILURE);
16   }
17
18   /* parent */
19
20   waitall();
21   printf("Parent process sees x=%d\n", x);
22
23   return EXIT_SUCCESS;
24 }

Listing 4.4: Threads Created Via pthread_create() Share Memory
 1 int x = 0;
 2
 3 void *mythread(void *arg)
 4 {
 5   x = 1;
 6   printf("Child process set x=1\n");
 7   return NULL;
 8 }
 9
10 int main(int argc, char *argv[])
11 {
12   int en;
13   pthread_t tid;
14   void *vp;
15
16   if ((en = pthread_create(&tid, NULL,
17                            mythread, NULL)) != 0) {
18     fprintf(stderr, "pthread_create: %s\n", strerror(en));
19     exit(EXIT_FAILURE);
20   }
21
22   /* parent */
23
24   if ((en = pthread_join(tid, &vp)) != 0) {
25     fprintf(stderr, "pthread_join: %s\n", strerror(en));
26     exit(EXIT_FAILURE);
27   }
28   printf("Parent process sees x=%d\n", x);
29
30   return EXIT_SUCCESS;
31 }

It is critically important to note that the parent and child do not share memory. This is illustrated by the program shown in Listing 4.3 (forkjoinvar.c), in which the child sets x to 1 on line 9 before exiting, but the parent’s print statement on line 21 nevertheless sees x equal to 0.
and initializes a POSIX lock named lock_a, while line 2 similarly defines and initializes a lock named lock_b. Line 4 defines and initializes a shared variable x.

Lines 6–33 define a function lock_reader() which repeatedly reads the shared variable x while holding the lock specified by arg. Line 12 casts arg to a pointer to a pthread_mutex_t, as required by the pthread_mutex_lock() and pthread_mutex_unlock() primitives.

Quick Quiz 4.10: Why not simply make the argument to lock_reader() on line 6 of Listing 4.5 be a pointer to a pthread_mutex_t?

Quick Quiz 4.11: What is the READ_ONCE() on lines 20 and 47 and the WRITE_ONCE() on line 47 of Listing 4.5?

Lines 14–18 acquire the specified pthread_mutex_t, checking for errors and exiting the program if any occur. Lines 19–26 repeatedly check the value of x, printing the new value each time that it changes. Line 25 sleeps for one millisecond, which allows this demonstration to run nicely on a uniprocessor machine. Lines 27–31 release the pthread_mutex_t, again checking for errors and exiting the program if any occur. Finally, line 32 returns NULL, again to match the function type required by pthread_create().

Quick Quiz 4.12: Writing four lines of code for each acquisition and release of a pthread_mutex_t sure seems painful! Isn’t there a better way?

Listing 4.6: Demonstration of Same Exclusive Lock
 1 printf("Creating two threads using same lock:\n");
 2 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
 3 if (en != 0) {
 4   fprintf(stderr, "pthread_create: %s\n", strerror(en));
 5   exit(EXIT_FAILURE);
 6 }
 7 en = pthread_create(&tid2, NULL, lock_writer, &lock_a);
 8 if (en != 0) {
 9   fprintf(stderr, "pthread_create: %s\n", strerror(en));
10   exit(EXIT_FAILURE);
11 }
12 if ((en = pthread_join(tid1, &vp)) != 0) {
13   fprintf(stderr, "pthread_join: %s\n", strerror(en));
14   exit(EXIT_FAILURE);
15 }
16 if ((en = pthread_join(tid2, &vp)) != 0) {
17   fprintf(stderr, "pthread_join: %s\n", strerror(en));
18   exit(EXIT_FAILURE);
19 }

Creating two threads using same lock:
lock_reader(): x = 0

Listing 4.7: Demonstration of Different Exclusive Locks
 1 printf("Creating two threads w/different locks:\n");
 2 x = 0;
 3 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
 4 if (en != 0) {
 5   fprintf(stderr, "pthread_create: %s\n", strerror(en));
 6   exit(EXIT_FAILURE);
 7 }
 8 en = pthread_create(&tid2, NULL, lock_writer, &lock_b);
 9 if (en != 0) {
10   fprintf(stderr, "pthread_create: %s\n", strerror(en));
11   exit(EXIT_FAILURE);
12 }
13 if ((en = pthread_join(tid1, &vp)) != 0) {
14   fprintf(stderr, "pthread_join: %s\n", strerror(en));
15   exit(EXIT_FAILURE);
16 }
17 if ((en = pthread_join(tid2, &vp)) != 0) {
18   fprintf(stderr, "pthread_join: %s\n", strerror(en));
19   exit(EXIT_FAILURE);
20 }

Because the two threads are using different locks, they do not exclude each other, and can run concurrently.
thinktime argument controlling the time between the release of the reader-writer lock and the next acquisition, line 4 defines the readcounts array into which each reader thread places the number of times it acquired the lock.

Quick Quiz 4.17: Instead of using READ_ONCE() everywhere, why not just declare goflag as volatile on line 10 of Listing 4.8?

Quick Quiz 4.18: READ_ONCE() only affects the compiler, not the CPU. Don’t we also need memory barriers to make sure that the change in goflag’s value propagates to the CPU in a timely fashion in Listing 4.8?

Quick Quiz 4.19: Would it ever be necessary to use READ_ONCE() when accessing a per-thread variable, for example, a variable declared using GCC’s __thread storage class?

The loop spanning lines 23–41 carries out the performance test. Lines 24–28 acquire the lock, lines 29–31 hold the lock for the specified number of microseconds, lines 32–36 release the lock, and lines 37–39 wait for the specified number of microseconds before re-acquiring the lock. Line 40 counts this lock acquisition.

Line 42 moves the lock-acquisition count to this thread’s element of the readcounts[] array, and line 43 returns, terminating this thread.

Figure 4.2 shows the results of running this test on a 224-core Xeon system with two hardware threads per core for a total of 448 software-visible CPUs. The thinktime parameter was zero for all these tests, and the holdtime parameter set to values ranging from one microsecond (“1us” on the graph) to 10,000 microseconds (“10000us” on the graph). The actual value plotted is:

    L_N / (N * L_1)    (4.1)

where N is the number of threads in the current run, L_N is the total number of lock acquisitions by all N threads in the current run, and L_1 is the number of lock acquisitions in a single-threaded run. Given ideal hardware and software scalability, this value will always be 1.0.

As can be seen in the figure, reader-writer locking scalability is decidedly non-ideal, especially for smaller sizes of critical sections. To see why read-acquisition can be so slow, consider that all the acquiring threads must update the pthread_rwlock_t data structure. Therefore, if all 448 executing threads attempt to read-acquire the reader-writer lock concurrently, they must update this underlying pthread_rwlock_t one at a time. One lucky thread might do so almost immediately, but the least-lucky thread must wait for all the other 447 threads to do their updates. This situation will only get worse as you add CPUs. Note also the logscale y-axis. Even though the 10,000 microsecond trace appears quite ideal, it has in fact degraded by about 10 % from ideal.

Quick Quiz 4.20: Isn’t comparing against single-CPU throughput a bit harsh?

Quick Quiz 4.21: But one microsecond is not a particularly small size for a critical section. What do I do if I need a much
smaller critical section, for example, one containing only a few instructions?

Quick Quiz 4.22: The system used is a few years old, and new hardware should be faster. So why should anyone worry about reader-writer locks being slow?

Listing 4.9: Compiler Barrier Primitive (for GCC)
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) \
  ({ typeof(x) ___x = ACCESS_ONCE(x); ___x; })
#define WRITE_ONCE(x, val) \
  do { ACCESS_ONCE(x) = (val); } while (0)
#define barrier() __asm__ __volatile__("": : :"memory")
write atomics. The read-modify-write atomics include atomic_fetch_add(), atomic_fetch_sub(), atomic_fetch_and(), atomic_fetch_xor(), atomic_exchange(), atomic_compare_exchange_strong(), and atomic_compare_exchange_weak(). These operate in a manner similar to those described in Section 4.2.5, but with the addition of memory-order arguments to _explicit variants of all of the operations. Without memory-order arguments, all the atomic operations are fully ordered, and the arguments permit weaker orderings. For example, “atomic_load_explicit(&a, memory_order_relaxed)” is vaguely similar to the Linux kernel’s “READ_ONCE()”.1

4.2.7 Atomic Operations (Modern GCC)

One restriction of the C11 atomics is that they apply only to special atomic types, which can be problematic. The GNU C compiler therefore provides atomic intrinsics, including __atomic_load(), __atomic_load_n(), __atomic_store(), __atomic_store_n(), __atomic_thread_fence(), etc. These intrinsics offer the same semantics as their C11 counterparts, but may be used on plain non-atomic objects. Some of these intrinsics may be passed a memory-order argument from this list: __ATOMIC_RELAXED, __ATOMIC_CONSUME, __ATOMIC_ACQUIRE, __ATOMIC_RELEASE, __ATOMIC_ACQ_REL, and __ATOMIC_SEQ_CST.
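For illustration, here is a minimal sketch (not taken from this book’s CodeSamples) that applies two of these intrinsics to a plain unsigned long; the variable and function names are hypothetical:

#include <stdio.h>

static unsigned long counter;	/* A plain object: no _Atomic qualifier needed. */

static void inc_counter(void)
{
	/* Atomic increment with relaxed ordering: atomicity only. */
	__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

static unsigned long read_counter(void)
{
	/* Relaxed atomic load, roughly analogous to READ_ONCE(). */
	return __atomic_load_n(&counter, __ATOMIC_RELAXED);
}

int main(void)
{
	inc_counter();
	printf("counter = %lu\n", read_counter());
	return 0;
}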
__sync_ family of primitives all provide full memory-
4.2.8 Per-Thread Variables ordering semantics, which in the past motivated many
developers to create their own implementations for situa-
Per-thread variables, also called thread-specific data, tions where the full memory ordering semantics are not
thread-local storage, and other less-polite names, are used required. The following sections show some alternatives
extremely heavily in concurrent code, as will be explored from the Linux kernel and some historical primitives used
in Chapters 5 and 8. POSIX supplies the pthread_key_ by this book’s sample code.
create() function to create a per-thread variable (and
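As a rough sketch (not code from this book), the following fragment shows the __thread specifier in action; a C11 build could substitute _Thread_local in the declaration, and the counter name is hypothetical:

#include <stdio.h>
#include <pthread.h>

static __thread unsigned long counter;	/* Each thread gets its own instance. */

static void *count_three(void *arg)
{
	int i;

	for (i = 0; i < 3; i++)
		counter++;		/* Touches only this thread's counter. */
	printf("counter = %lu\n", counter);
	return NULL;
}

int main(void)
{
	pthread_t tid;

	if (pthread_create(&tid, NULL, count_three, NULL) != 0)
		return 1;
	count_three(NULL);		/* The parent's counter also reaches 3. */
	pthread_join(tid, NULL);
	return 0;
}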
4.3 Alternatives to POSIX Operations

The strategic marketing paradigm of Open Source is a massively parallel drunkard’s walk filtered by a Darwinistic process.
Bruce Perens

Unfortunately, threading operations, locking primitives, and atomic operations were in reasonably wide use long before the various standards committees got around to them. As a result, there is considerable variation in how these operations are supported. It is still quite common to find these operations implemented in assembly language, either for historical reasons or to obtain better performance in specialized circumstances. For example, GCC’s __sync_ family of primitives all provide full memory-ordering semantics, which in the past motivated many developers to create their own implementations for situations where the full memory ordering semantics are not required. The following sections show some alternatives from the Linux kernel and some historical primitives used by this book’s sample code.

4.3.1 Organization and Initialization

Although many environments do not require any special initialization code, the code samples in this book start with a call to smp_init(), which initializes a mapping from pthread_t to consecutive integers. The userspace RCU library2 similarly requires a call to rcu_init(). Although these calls can be hidden in environments (such as that of GCC) that support constructors, most of the RCU flavors supported by the userspace RCU library also require each thread invoke rcu_register_thread() upon thread creation and rcu_unregister_thread() before thread exit.
In the case of the Linux kernel, it is a philosophical question as to whether the kernel does not require calls to special initialization code or whether the kernel’s boot-time code is in fact the required initialization code.

4.3.2 Thread Creation, Destruction, and Control

The Linux kernel uses struct task_struct pointers to track kthreads, kthread_create() to create them, kthread_should_stop() to externally suggest that they stop (which has no POSIX equivalent),3 kthread_stop() to wait for them to stop, and schedule_timeout_interruptible() for a timed wait. There are quite a few additional kthread-management APIs, but this provides a good start, as well as good search terms.

The CodeSamples API focuses on “threads”, which are a locus of control.4 Each such thread has an identifier of type thread_id_t, and no two threads running at a given time will have the same identifier. Threads share everything except for per-thread local state,5 which includes program counter and stack.

The thread API is shown in Listing 4.10, and members are described in the following sections.

Listing 4.10: Thread API
int smp_thread_id(void)
thread_id_t create_thread(void *(*func)(void *), void *arg)
for_each_thread(t)
for_each_running_thread(t)
void *wait_thread(thread_id_t tid)
void wait_all_threads(void)

3 POSIX environments can work around the lack of kthread_should_stop() by using a properly synchronized boolean flag in conjunction with pthread_join().
4 There are many other names for similar software constructs.

4.3.2.1 create_thread()

The create_thread() primitive creates a new thread, starting the new thread’s execution at the function func specified by create_thread()’s first argument, and passing it the argument specified by create_thread()’s second argument. This newly created thread will terminate when it returns from the starting function specified by func. The create_thread() primitive returns the thread_id_t corresponding to the newly created child thread.

This primitive will abort the program if more than NR_THREADS threads are created, counting the one implicitly created by running the program. NR_THREADS is a compile-time constant that may be modified, though some systems may have an upper bound for the allowable number of threads.

4.3.2.2 smp_thread_id()

Because the thread_id_t returned from create_thread() is system-dependent, the smp_thread_id() primitive returns a thread index corresponding to the thread making the request. This index is guaranteed to be less than the maximum number of threads that have been in existence since the program started, and is therefore useful for bitmasks, array indices, and the like.

4.3.2.3 for_each_thread()

The for_each_thread() macro loops through all threads that exist, including all threads that would exist if created. This macro is useful for handling the per-thread variables introduced in Section 4.2.8.

4.3.2.4 for_each_running_thread()

The for_each_running_thread() macro loops through only those threads that currently exist. It is the caller’s responsibility to synchronize with thread creation and deletion if required.

4.3.2.5 wait_thread()

The wait_thread() primitive waits for completion of the thread specified by the thread_id_t passed to it. This in no way interferes with the execution of the specified thread; instead, it merely waits for it. Note that wait_thread() returns the value that was returned by the corresponding thread.

4.3.2.6 wait_all_threads()

The wait_all_threads() primitive waits for completion of all currently running threads. It is the caller’s responsibility to synchronize with thread creation and deletion if required.
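To make this API concrete, here is a rough usage sketch (not one of this book’s code samples); the "api.h" include name is an assumption about how the CodeSamples declarations would be pulled in:

#include <stdio.h>
#include "api.h"	/* Assumed header providing the Listing 4.10 API. */

#define MY_NTHREADS 4	/* Hypothetical thread count, below NR_THREADS. */

void *hello_thread(void *arg)
{
	/* smp_thread_id() returns a small integer suitable for array indexing. */
	printf("Hello from thread %d\n", smp_thread_id());
	return NULL;
}

int main(int argc, char *argv[])
{
	int i;

	smp_init();			/* Map pthread_t values to small integers. */
	for (i = 0; i < MY_NTHREADS; i++)
		create_thread(hello_thread, NULL);
	wait_all_threads();		/* Wait for every child to complete. */
	return 0;
}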
Quick Quiz 4.27: What problems could occur if the variable counter were incremented without the protection of mutex?

However, the spin_lock() and spin_unlock() primitives do have performance consequences, as will be seen in Chapter 10.

4.3.4 Accessing Shared Variables

It was not until 2011 that the C standard defined semantics for concurrent read/write access to shared variables. However, concurrent C code was being written at least a quarter century earlier [BK85, Inm85]. This raises the question as to what today’s greybeards did back in long-past pre-C11 days. A short answer to this question is “they lived dangerously”.

At least they would have been living dangerously had they been using 2021 compilers. In (say) the early 1990s, compilers did fewer optimizations, in part because there were fewer compiler writers and in part due to the relatively small memories of that era. Nevertheless, problems did arise, as shown in Listing 4.14, which the compiler is within its rights to transform into Listing 4.15. As you can see, the temporary on line 1 of Listing 4.14 has been optimized away, so that global_ptr will be loaded up to three times.

Listing 4.14: Living Dangerously Early 1990s Style
1 ptr = global_ptr;
2 if (ptr != NULL && ptr < high_address)
3   do_low(ptr);

Listing 4.15: C Compilers Can Invent Loads
1 if (global_ptr != NULL &&
2     global_ptr < high_address)
3   do_low(global_ptr);

Quick Quiz 4.28: What is wrong with loading Listing 4.14’s global_ptr up to three times?

Section 4.3.4.1 describes additional problems caused by plain accesses, while Sections 4.3.4.2 and 4.3.4.3 describe some pre-C11 solutions. Of course, where practical, direct C-language memory references should be replaced by the primitives described in Section 4.2.5 or (especially) Section 4.2.6. Use these primitives to avoid data races, that is, ensure that if there are multiple concurrent C-language accesses to a given variable, all of those accesses are loads.

4.3.4.1 Shared-Variable Shenanigans

Given code that does plain loads and stores,6 the compiler is within its rights to assume that the affected variables are neither accessed nor modified by any other thread. This assumption allows the compiler to carry out a large number of transformations, including load tearing, store tearing, load fusing, store fusing, code reordering, invented loads, invented stores, store-to-load transformations, and dead-code elimination, all of which work just fine in single-threaded code. But concurrent code can be broken by each of these transformations, or shared-variable shenanigans, as described below.

Load tearing occurs when the compiler uses multiple load instructions for a single access. For example, the compiler could in theory compile the load from global_ptr (see line 1 of Listing 4.14) as a series of one-byte loads. If some other thread was concurrently setting global_ptr to NULL, the result might have all but one byte of the pointer set to zero, thus forming a “wild pointer”. Stores using such a wild pointer could corrupt arbitrary regions of memory, resulting in rare and difficult-to-debug crashes.

Worse yet, on (say) an 8-bit system with 16-bit pointers, the compiler might have no choice but to use a pair of 8-bit instructions to access a given pointer. Because the C standard must support all manner of systems, the standard cannot rule out load tearing in the general case.

Store tearing occurs when the compiler uses multiple store instructions for a single access. For example, one thread might store 0x12345678 to a four-byte integer variable at the same time another thread stored 0xabcdef00. If the compiler used 16-bit stores for either access, the result might well be 0x1234ef00, which could come as quite a surprise to code loading from this integer. Nor is this a strictly theoretical issue. For example, there are CPUs that feature small immediate instruction fields, and on such CPUs, the compiler might split a 64-bit store into two 32-bit stores in order to reduce the overhead of explicitly forming the 64-bit constant in a register, even on a 64-bit CPU. There are historical reports of this actually happening in the wild (e.g. [KM13]), but there is also a recent report [Dea19].7

6 That is, normal loads and stores instead of C11 atomics, inline assembly, or volatile accesses.
7 Note that this tearing can happen even on properly aligned and machine-word-sized accesses, and in this particular case, even for volatile stores. Some might argue that this behavior constitutes a bug in the compiler, but either way it illustrates the perceived value of store tearing from a compiler-writer viewpoint.
Listing 4.16: Inviting Load Fusing
1 while (!need_to_stop)
2   do_something_quickly();

Listing 4.17: C Compilers Can Fuse Loads
 1 if (!need_to_stop)
 2   for (;;) {
 3     do_something_quickly();
 4     do_something_quickly();
 5     do_something_quickly();
 6     do_something_quickly();
 7     do_something_quickly();
 8     do_something_quickly();
 9     do_something_quickly();
10     do_something_quickly();
11     do_something_quickly();
12     do_something_quickly();
13     do_something_quickly();
14     do_something_quickly();
15     do_something_quickly();
16     do_something_quickly();
17     do_something_quickly();
18     do_something_quickly();
19   }

Of course, the compiler simply has no choice but to tear some stores in the general case, given the possibility of code using 64-bit integers running on a 32-bit system. But for properly aligned machine-sized stores, WRITE_ONCE() will prevent store tearing.

Load fusing occurs when the compiler uses the result of a prior load from a given variable instead of repeating the load. Not only is this sort of optimization just fine in single-threaded code, it is often just fine in multithreaded code. Unfortunately, the word “often” hides some truly annoying exceptions.

For example, suppose that a real-time system needs to invoke a function named do_something_quickly() repeatedly until the variable need_to_stop was set, and that the compiler can see that do_something_quickly() does not store to need_to_stop. One (unsafe) way to code this is shown in Listing 4.16. The compiler might reasonably unroll this loop sixteen times in order to reduce the per-invocation overhead of the backwards branch at the end of the loop. Worse yet, because the compiler knows that do_something_quickly() does not store to need_to_stop, the compiler could quite reasonably decide to check this variable only once, resulting in the code shown in Listing 4.17. Once entered, the loop on lines 2–19 will never exit, regardless of how many times some other thread stores a non-zero value to need_to_stop. The result will at best be consternation, and might well also include severe physical damage.

The compiler can fuse loads across surprisingly large spans of code. For example, in Listing 4.18, t0() and t1() run concurrently, and do_something() and do_something_else() are inline functions. Line 1 declares pointer gp, which C initializes to NULL by default. At some point, line 5 of t0() stores a non-NULL pointer to gp. Meanwhile, t1() loads from gp three times on lines 10, 12, and 15. Given that line 13 finds that gp is non-NULL, one might hope that the dereference on line 15 would be guaranteed never to fault. Unfortunately, the compiler is within its rights to fuse the read on lines 10 and 15, which means that if line 10 loads NULL and line 12 loads &myvar, line 15 could load NULL, resulting in a fault.8 Note that the intervening READ_ONCE() does not prevent the other two loads from being fused, despite the fact that all three are loading from the same variable.

Listing 4.18: C Compilers Can Fuse Non-Adjacent Loads
 1 int *gp;
 2
 3 void t0(void)
 4 {
 5   WRITE_ONCE(gp, &myvar);
 6 }
 7
 8 void t1(void)
 9 {
10   p1 = gp;
11   do_something(p1);
12   p2 = READ_ONCE(gp);
13   if (p2) {
14     do_something_else();
15     p3 = *gp;
16   }
17 }

Quick Quiz 4.29: Why does it matter whether do_something() and do_something_else() in Listing 4.18 are inline functions?

Store fusing can occur when the compiler notices a pair of successive stores to a given variable with no intervening loads from that variable. In this case, the compiler is within its rights to omit the first store. This is never a problem in single-threaded code, and in fact it is usually not a problem in correctly written concurrent code. After all, if the two stores are executed in quick succession, there is very little chance that some other thread could load the value from the first store.

However, there are exceptions, for example as shown in Listing 4.19.

8 Will Deacon reports that this happened in the Linux kernel.
Listing 4.19: C Compilers Can Fuse Stores
 1 void shut_it_down(void)
 2 {
 3   status = SHUTTING_DOWN; /* BUGGY!!! */
 4   start_shutdown();
 5   while (!other_task_ready) /* BUGGY!!! */
 6     continue;
 7   finish_shutdown();
 8   status = SHUT_DOWN; /* BUGGY!!! */
 9   do_something_else();
10 }
11
12 void work_until_shut_down(void)
13 {
14   while (status != SHUTTING_DOWN) /* BUGGY!!! */
15     do_more_work();
16   other_task_ready = 1; /* BUGGY!!! */
17 }

The function shut_it_down() stores to the shared variable status on lines 3 and 8, and so assuming that neither start_shutdown() nor finish_shutdown() access status, the compiler could reasonably remove the store to status on line 3. Unfortunately, this would mean that work_until_shut_down() would never exit its loop spanning lines 14 and 15, and thus would never set other_task_ready, which would in turn mean that shut_it_down() would never exit its loop spanning lines 5 and 6, even if the compiler chooses not to fuse the successive loads from other_task_ready on line 5.

And there are more problems with the code in Listing 4.19, including code reordering.

Code reordering is a common compilation technique used to combine common subexpressions, reduce register pressure, and improve utilization of the many functional units available on modern superscalar microprocessors. It is also another reason why the code in Listing 4.19 is buggy. For example, suppose that the do_more_work() function on line 15 does not access other_task_ready. Then the compiler would be within its rights to move the assignment to other_task_ready on line 16 to precede line 14, which might be a great disappointment for anyone hoping that the last call to do_more_work() on line 15 happens before the call to finish_shutdown() on line 7.

It might seem futile to prevent the compiler from changing the order of accesses in cases where the underlying hardware is free to reorder them. However, modern machines have exact exceptions and exact interrupts, meaning that any interrupt or exception will appear to have happened at a specific place in the instruction stream. This means that the handler will see the effect of all prior instructions, but won’t see the effect of any subsequent instructions. READ_ONCE() and WRITE_ONCE() can therefore be used to control communication between interrupted code and interrupt handlers, independent of the ordering provided by the underlying hardware.9

Invented loads were illustrated by the code in Listings 4.14 and 4.15, in which the compiler optimized away a temporary variable, thus loading from a shared variable more often than intended.

Invented loads can be a performance hazard. These hazards can occur when a load of a variable in a “hot” cacheline is hoisted out of an if statement. These hoisting optimizations are not uncommon, and can cause significant increases in cache misses, and thus significant degradation of both performance and scalability.

Invented stores can occur in a number of situations. For example, a compiler emitting code for work_until_shut_down() in Listing 4.19 might notice that other_task_ready is not accessed by do_more_work(), and stored to on line 16. If do_more_work() was a complex inline function, it might be necessary to do a register spill, in which case one attractive place to use for temporary storage is other_task_ready. After all, there are no accesses to it, so what is the harm?

Of course, a non-zero store to this variable at just the wrong time would result in the while loop on line 5 terminating prematurely, again allowing finish_shutdown() to run concurrently with do_more_work(). Given that the entire point of this while appears to be to prevent such concurrency, this is not a good thing.

Using a stored-to variable as a temporary might seem outlandish, but it is permitted by the standard. Nevertheless, readers might be justified in wanting a less outlandish example, which is provided by Listings 4.20 and 4.21.

Listing 4.20: Inviting an Invented Store
1 if (condition)
2   a = 1;
3 else
4   do_a_bunch_of_stuff();

Listing 4.21: Compiler Invents an Invited Store
1 a = 1;
2 if (!condition) {
3   a = 0;
4   do_a_bunch_of_stuff();
5 }

A compiler emitting code for Listing 4.20 might know that the value of a is initially zero, which might be a strong temptation to optimize away one branch by transforming this code to that in Listing 4.21. Here, line 1 unconditionally stores 1 to a, then resets the value back to zero on line 3 if condition was not set. This transforms the if-then-else into an if-then, saving one branch.

9 That said, the various standards committees would prefer that you use atomics or variables of type sig_atomic_t, instead of READ_ONCE() and WRITE_ONCE().
Listing 4.22: Inviting a Store-to-Load Conversion
1 r1 = p;
2 if (unlikely(r1))
3   do_something_with(r1);
4 barrier();
5 p = NULL;

Listing 4.23: Compiler Converts a Store to a Load
1 r1 = p;
2 if (unlikely(r1))
3   do_something_with(r1);
4 barrier();
5 if (p != NULL)
6   p = NULL;

It is unlikely that any of these surprises will be at all pleasant. Elimination of store-only variables is especially dangerous in cases where external code locates the variable via symbol tables: The compiler is necessarily ignorant of such external-code accesses, and might thus eliminate a variable that the external code relies upon.

Reliable concurrent code clearly needs a way to cause the compiler to preserve the number, order, and type of important accesses to shared memory, a topic taken up by Sections 4.3.4.2 and 4.3.4.3, which are up next.
that access’s size and type are available.12 Concurrent code relies on this constraint to avoid unnecessary load and store tearing.

2. Implementations must not assume anything about the semantics of a volatile access, nor, for any volatile access that returns a value, about the possible set of values that might be returned.13 Concurrent code relies on this constraint to avoid optimizations that are inapplicable given that other processors might be concurrently accessing the location in question.

3. Aligned machine-sized non-mixed-size volatile accesses interact naturally with volatile assembly-code sequences before and after. This is necessary because some devices must be accessed using a combination of volatile MMIO accesses and special-purpose assembly-language instructions. Concurrent code relies on this constraint in order to achieve the desired ordering properties from combinations of volatile accesses and other means discussed in Section 4.3.4.3.

Concurrent code also relies on the first two constraints to avoid undefined behavior that could result due to data races if any of the accesses to a given object was either non-atomic or non-volatile, assuming that all accesses are aligned and machine-sized. The semantics of mixed-size accesses to the same locations are more complex, and are left aside for the time being.

So how does volatile stack up against the earlier examples?

Using READ_ONCE() on line 1 of Listing 4.14 avoids invented loads, resulting in the code shown in Listing 4.24.

Listing 4.24: Avoiding Danger, 2018 Style
1 ptr = READ_ONCE(global_ptr);
2 if (ptr != NULL && ptr < high_address)
3   do_low(ptr);

As shown in Listing 4.25, READ_ONCE() can also prevent the loop unrolling in Listing 4.17.

Listing 4.25: Preventing Load Fusing
1 while (!READ_ONCE(need_to_stop))
2   do_something_quickly();

READ_ONCE() and WRITE_ONCE() can also be used to prevent the store fusing and invented stores that were shown in Listing 4.19, with the result shown in Listing 4.26. However, this does nothing to prevent code reordering, which requires some additional tricks taught in Section 4.3.4.3.

Listing 4.26: Preventing Store Fusing and Invented Stores
 1 void shut_it_down(void)
 2 {
 3   WRITE_ONCE(status, SHUTTING_DOWN); /* BUGGY!!! */
 4   start_shutdown();
 5   while (!READ_ONCE(other_task_ready)) /* BUGGY!!! */
 6     continue;
 7   finish_shutdown();
 8   WRITE_ONCE(status, SHUT_DOWN); /* BUGGY!!! */
 9   do_something_else();
10 }
11
12 void work_until_shut_down(void)
13 {
14   while (READ_ONCE(status) != SHUTTING_DOWN) /* BUGGY!!! */
15     do_more_work();
16   WRITE_ONCE(other_task_ready, 1); /* BUGGY!!! */
17 }

Finally, WRITE_ONCE() can be used to prevent the store invention shown in Listing 4.20, with the resulting code shown in Listing 4.27.

Listing 4.27: Disinviting an Invented Store
1 if (condition)
2   WRITE_ONCE(a, 1);
3 else
4   do_a_bunch_of_stuff();

To summarize, the volatile keyword can prevent load tearing and store tearing in cases where the loads and stores are machine-sized and properly aligned. It can also prevent load fusing, store fusing, invented loads, and invented stores. However, although it does prevent the compiler from reordering volatile accesses with each other, it does nothing to prevent the CPU from reordering these accesses. Furthermore, it does nothing to prevent either compiler or CPU from reordering non-volatile accesses with each other or with volatile accesses. Preventing these types of reordering requires the techniques described in the next section.

12 Note that this leaves unspecified what to do with 128-bit loads and stores on CPUs having 128-bit CAS but not 128-bit loads and stores.
13 This is strongly implied by the implementation-defined semantics called out above.

4.3.4.3 Assembling the Rest of a Solution

Additional ordering has traditionally been provided by recourse to assembly language, for example, GCC asm directives. Oddly enough, these directives need not actually contain assembly language, as exemplified by the barrier() macro shown in Listing 4.9.

In the barrier() macro, the __asm__ introduces the asm directive, the __volatile__ prevents the compiler from optimizing the asm away, the empty string specifies
v2022.09.25a
4.3. ALTERNATIVES TO POSIX OPERATIONS 45
Listing 4.28: Preventing C Compilers From Fusing Loads Ordering is also provided by some read-modify-write
1 while (!need_to_stop) { atomic operations, some of which are presented in Sec-
2 barrier();
3 do_something_quickly(); tion 4.3.5. In the general case, memory ordering can be
4 barrier(); quite subtle, as discussed in Chapter 15. The next section
5 }
covers an alternative to memory ordering, namely limiting
or even entirely avoiding data races.
Listing 4.29: Preventing Reordering
1 void shut_it_down(void)
2 { 4.3.4.4 Avoiding Data Races
3 WRITE_ONCE(status, SHUTTING_DOWN);
4 smp_mb(); “Doctor, it hurts my head when I think about
5 start_shutdown();
6 while (!READ_ONCE(other_task_ready)) concurrently accessing shared variables!”
7 continue;
8 smp_mb(); “Then stop concurrently accessing shared vari-
9 finish_shutdown(); ables!!!”
10 smp_mb();
11 WRITE_ONCE(status, SHUT_DOWN);
12 do_something_else(); The doctor’s advice might seem unhelpful, but one
13 } time-tested way to avoid concurrently accessing shared
14
15 void work_until_shut_down(void) variables is access those variables only when holding a
16 { particular lock, as will be discussed in Chapter 7. Another
17 while (READ_ONCE(status) != SHUTTING_DOWN) {
18 smp_mb(); way is to access a given “shared” variable only from a
19 do_more_work(); given CPU or thread, as will be discussed in Chapter 8. It
20 }
21 smp_mb(); is possible to combine these two approaches, for example,
22 WRITE_ONCE(other_task_ready, 1); a given variable might be modified only by a given CPU or
23 }
thread while holding a particular lock, and might be read
either from that same CPU or thread on the one hand, or
that no actual instructions are to be emitted, and the from some other CPU or thread while holding that same
final "memory" tells the compiler that this do-nothing lock on the other. In all of these situations, all accesses to
asm can arbitrarily change memory. In response, the the shared variables may be plain C-language accesses.
compiler will avoid moving any memory references across Here is a list of situations allowing plain loads and stores
the barrier() macro. This means that the real-time- for some accesses to a given variable, while requiring
destroying loop unrolling shown in Listing 4.17 can be markings (such as READ_ONCE() and WRITE_ONCE()) for
prevented by adding barrier() calls as shown on lines 2 other accesses to that same variable:
and 4 of Listing 4.28. These two lines of code prevent the
1. A shared variable is only modified by a given owning
compiler from pushing the load from need_to_stop into
CPU or thread, but is read by other CPUs or threads.
or past do_something_quickly() from either direction.
All stores must use WRITE_ONCE(). The owning
However, this does nothing to prevent the CPU from
CPU or thread may use plain loads. Everything else
reordering the references. In many cases, this is not
must use READ_ONCE() for loads.
a problem because the hardware can only do a certain
amount of reordering. However, there are cases such 2. A shared variable is only modified while holding a
as Listing 4.19 where the hardware must be constrained. given lock, but is read by code not holding that lock.
Listing 4.26 prevented store fusing and invention, and All stores must use WRITE_ONCE(). CPUs or threads
Listing 4.29 further prevents the remaining reordering holding the lock may use plain loads. Everything
by addition of smp_mb() on lines 4, 8, 10, 18, and 21. else must use READ_ONCE() for loads.
The smp_mb() macro is similar to barrier() shown in
Listing 4.9, but with the empty string replaced by a string 3. A shared variable is only modified while holding a
containing the instruction for a full memory barrier, for given lock by a given owning CPU or thread, but is
example, "mfence" on x86 or "sync" on PowerPC. read by other CPUs or threads or by code not holding
that lock. All stores must use WRITE_ONCE(). The
Quick Quiz 4.31: But aren’t full memory barriers very owning CPU or thread may use plain loads, as may
heavyweight? Isn’t there a cheaper way to enforce the ordering
any CPU or thread holding the lock. Everything else
needed in Listing 4.29?
must use READ_ONCE() for loads.
v2022.09.25a
46 CHAPTER 4. TOOLS OF THE TRADE
4. A shared variable is only accessed by a given CPU Listing 4.30: Per-Thread-Variable API
or thread and by a signal or interrupt handler running DEFINE_PER_THREAD(type, name)
DECLARE_PER_THREAD(type, name)
in that CPU’s or thread’s context. The handler can per_thread(name, thread)
use plain loads and stores, as can any code that __get_thread_var(name)
init_per_thread(name, v)
has prevented the handler from being invoked, that
is, code that has blocked signals and/or interrupts.
All other code must use READ_ONCE() and WRITE_ nothing happens unless the original value of the atomic
ONCE(). variable is different than the value specified (these are very
5. A shared variable is only accessed by a given CPU handy for managing reference counters, for example).
or thread and by a signal or interrupt handler running An atomic exchange operation is provided by atomic_
in that CPU’s or thread’s context, and the handler xchg(), and the celebrated compare-and-swap (CAS)
always restores the values of any variables that it operation is provided by atomic_cmpxchg(). Both
has written before return. The handler can use plain of these return the old value. Many additional atomic
loads and stores, as can any code that has prevented RMW primitives are available in the Linux kernel, see
the handler from being invoked, that is, code that the Documentation/atomic_t.txt file in the Linux-
has blocked signals and/or interrupts. All other code kernel source tree.14
can use plain loads, but must use WRITE_ONCE() This book’s CodeSamples API closely follows that of
to prevent store tearing, store fusing, and invented the Linux kernel.
stores.
4.3.6 Per-CPU Variables
Quick Quiz 4.32: What needs to happen if an interrupt or
signal handler might itself be interrupted? The Linux kernel uses DEFINE_PER_CPU() to define a
per-CPU variable, this_cpu_ptr() to form a reference
In most other cases, loads from and stores to a shared to this CPU’s instance of a given per-CPU variable, per_
variable must use READ_ONCE() and WRITE_ONCE() or cpu() to access a specified CPU’s instance of a given
stronger, respectively. But it bears repeating that neither per-CPU variable, along with many other special-purpose
READ_ONCE() nor WRITE_ONCE() provide any ordering per-CPU operations.
guarantees other than within the compiler. See the above Listing 4.30 shows this book’s per-thread-variable API,
Section 4.3.4.3 or Chapter 15 for information on such which is patterned after the Linux kernel’s per-CPU-
guarantees. variable API. This API provides the per-thread equivalent
Examples of many of these data-race-avoidance patterns of global variables. Although this API is, strictly speaking,
are presented in Chapter 5. not necessary,15 it can provide a good userspace analogy
to Linux kernel code.
4.3.5 Atomic Operations Quick Quiz 4.33: How could you work around the lack of a
The Linux kernel provides a wide variety of atomic opera- per-thread-variable API on systems that do not provide it?
tions, but those defined on type atomic_t provide a good
start. Normal non-tearing reads and stores are provided by
atomic_read() and atomic_set(), respectively. Ac- 4.3.6.1 DEFINE_PER_THREAD()
quire load is provided by smp_load_acquire() and The DEFINE_PER_THREAD() primitive defines a per-
release store by smp_store_release(). thread variable. Unfortunately, it is not possible to pro-
Non-value-returning fetch-and-add operations are pro- vide an initializer in the way permitted by the Linux
vided by atomic_add(), atomic_sub(), atomic_ kernel’s DEFINE_PER_CPU() primitive, but there is an
inc(), and atomic_dec(), among others. An atomic init_per_thread() primitive that permits easy runtime
decrement that returns a reached-zero indication is pro- initialization.
vided by both atomic_dec_and_test() and atomic_
sub_and_test(). An atomic add that returns the
new value is provided by atomic_add_return().
Both atomic_add_unless() and atomic_inc_not_ 14 As of Linux kernel v5.11.
zero() provide conditional atomic operations, where 15 You could instead use __thread or _Thread_local.
v2022.09.25a
4.4. THE RIGHT TOOL FOR THE JOB: HOW TO CHOOSE? 47
v2022.09.25a
48 CHAPTER 4. TOOLS OF THE TRADE
v2022.09.25a
As easy as 1, 2, 3!
Unknown
Chapter 5
Counting
Counting is perhaps the simplest and most natural thing number of structures in use exceeds an exact limit (again, say
a computer can do. However, counting efficiently and 10,000). Suppose further that these structures are short-lived,
scalably on a large shared-memory multiprocessor can and that the limit is rarely exceeded, that there is almost always
be quite challenging. Furthermore, the simplicity of the at least one structure in use, and suppose further still that it is
underlying concept of counting allows us to explore the necessary to know exactly when this counter reaches zero, for
example, in order to free up some memory that is not required
fundamental issues of concurrency without the distractions
unless there is at least one structure in use.
of elaborate data structures or complex synchronization
primitives. Counting therefore provides an excellent
Quick Quiz 5.5: Removable I/O device access-count prob-
introduction to parallel programming.
lem. Suppose that you need to maintain a reference count on
This chapter covers a number of special cases for which a heavily used removable mass-storage device, so that you can
there are simple, fast, and scalable counting algorithms. tell the user when it is safe to remove the device. As usual, the
But first, let us find out how much you already know about user indicates a desire to remove the device, and the system
concurrent counting. tells the user when it is safe to do so.
Quick Quiz 5.1: Why should efficient and scalable counting
Section 5.1 shows why counting is non-trivial. Sec-
be hard??? After all, computers have special hardware for the
tions 5.2 and 5.3 investigate network-packet counting
sole purpose of doing counting!!!
and approximate structure-allocation limits, respectively.
Section 5.4 takes on exact structure-allocation limits. Fi-
Quick Quiz 5.2: Network-packet counting problem. Sup-
pose that you need to collect statistics on the number of nally, Section 5.5 presents performance measurements
networking packets transmitted and received. Packets might and discussion.
be transmitted or received by any CPU on the system. Suppose Sections 5.1 and 5.2 contain introductory material,
further that your system is capable of handling millions of while the remaining sections are more advanced.
packets per second per CPU, and that a systems-monitoring
package reads the count every five seconds. How would you
implement this counter? 5.1 Why Isn’t Concurrent Counting
Quick Quiz 5.3: Approximate structure-allocation limit Trivial?
problem. Suppose that you need to maintain a count of the
number of structures allocated in order to fail any allocations
Seek simplicity, and distrust it.
once the number of structures in use exceeds a limit (say,
10,000). Suppose further that the structures are short-lived, Alfred North Whitehead
the limit is rarely exceeded, and a “sloppy” approximate limit
is acceptable. Let’s start with something simple, for example, the
straightforward use of arithmetic shown in Listing 5.1
Quick Quiz 5.4: Exact structure-allocation limit problem. (count_nonatomic.c). Here, we have a counter on
Suppose that you need to maintain a count of the number of
line 1, we increment it on line 5, and we read out its value
structures allocated in order to fail any allocations once the
on line 10. What could be simpler?
49
v2022.09.25a
50 CHAPTER 5. COUNTING
10
100
1 atomic_t counter = ATOMIC_INIT(0);
2
3 static __inline__ void inc_count(void) Number of CPUs (Threads)
4 {
5 atomic_inc(&counter);
6 }
Figure 5.1: Atomic Increment Scalability on x86
7
8 static __inline__ long read_count(void)
9 {
10 return atomic_read(&counter); times slower than non-atomic increment, even when only
11 } a single thread is incrementing.1
This poor performance should not be a surprise, given
the discussion in Chapter 3, nor should it be a surprise
Quick Quiz 5.6: One thing that could be simpler is ++ instead that the performance of atomic increment gets slower
of that concatenation of READ_ONCE() and WRITE_ONCE(). as the number of CPUs and threads increase, as shown
Why all that extra typing??? in Figure 5.1. In this figure, the horizontal dashed line
resting on the x axis is the ideal performance that would
This approach has the additional advantage of being be achieved by a perfectly scalable algorithm: With
blazingly fast if you are doing lots of reading and almost such an algorithm, a given increment would incur the
no incrementing, and on small systems, the performance same overhead that it would in a single-threaded program.
is excellent. Atomic increment of a single global variable is clearly
There is just one large fly in the ointment: This approach decidedly non-ideal, and gets multiple orders of magnitude
can lose counts. On my six-core x86 laptop, a short run worse with additional CPUs.
invoked inc_count() 285,824,000 times, but the final
Quick Quiz 5.9: Why doesn’t the horizontal dashed line on
value of the counter was only 35,385,525. Although the x axis meet the diagonal line at 𝑥 = 1?
approximation does have a large place in computing, loss
of 87 % of the counts is a bit excessive. Quick Quiz 5.10: But atomic increment is still pretty fast.
Quick Quiz 5.7: But can’t a smart compiler prove that line 5 And incrementing a single variable in a tight loop sounds pretty
of Listing 5.1 is equivalent to the ++ operator and produce an unrealistic to me, after all, most of the program’s execution
x86 add-to-memory instruction? And won’t the CPU cache should be devoted to actually doing work, not accounting for
cause this to be atomic? the work it has done! Why should I care about making this go
faster?
Quick Quiz 5.8: The 8-figure accuracy on the number of
For another perspective on global atomic increment,
failures indicates that you really did test this. Why would it be
necessary to test such a trivial program, especially when the
consider Figure 5.2. In order for each CPU to get a
bug is easily seen by inspection? chance to increment a given global variable, the cache
line containing that variable must circulate among all
The straightforward way to count accurately is to use 1 Interestingly enough, non-atomically incrementing a counter will
atomic operations, as shown in Listing 5.2 (count_ advance the counter more quickly than atomically incrementing the
atomic.c). Line 1 defines an atomic variable, line 5 counter. Of course, if your only goal is to make the counter increase
quickly, an easier approach is to simply assign a large value to the counter.
atomically increments it, and line 10 reads it out. Be- Nevertheless, there is likely to be a role for algorithms that use carefully
cause this is atomic, it keeps perfect count. However, it is relaxed notions of correctness in order to gain greater performance and
slower: On my six-core x86 laptop, it is more than twenty scalability [And91, ACMS03, Rin13, Ung11].
v2022.09.25a
5.2. STATISTICAL COUNTERS 51
Mark Twain
Memory System Interconnect Memory
This section covers the common special case of statistical
counters, where the count is updated extremely frequently
Interconnect Interconnect and the value is read out rarely, if ever. These will be used
Cache Cache Cache Cache to solve the network-packet counting problem posed in
CPU 4 CPU 5 CPU 6 CPU 7 Quick Quiz 5.2.
Figure 5.2: Data Flow For Global Atomic Increment 5.2.1 Design
Statistical counting is typically handled by providing a
counter per thread (or CPU, when running in the kernel),
so that each thread updates its own counter, as was fore-
shadowed in Section 4.3.6 on page 46. The aggregate
value of the counters is read out by simply summing up
all of the threads’ counters, relying on the commutative
and associative properties of addition. This is an example
of the Data Ownership pattern that will be introduced in
Section 6.3.4 on page 86.
One one thousand. Quick Quiz 5.12: But doesn’t the fact that C’s “integers” are
Two one thousand.
Three one thousand...
limited in size complicate things?
v2022.09.25a
52 CHAPTER 5. COUNTING
to anything attempting to read out the count. The use the network-packet counting problem presented at the
of WRITE_ONCE() prevents this optimization and others beginning of this chapter.
besides. Quick Quiz 5.17: The read operation takes time to sum
Quick Quiz 5.14: What other nasty optimizations could up the per-thread values, and during that time, the counter
GCC apply? could well be changing. This means that the value returned
by read_count() in Listing 5.3 will not necessarily be exact.
Lines 10–18 show a function that reads out the aggregate Assume that the counter is being incremented at rate 𝑟 counts
per unit time, and that read_count()’s execution consumes
value of the counter, using the for_each_thread()
𝛥 units of time. What is the expected error in the return value?
primitive to iterate over the list of currently running
threads, and using the per_thread() primitive to fetch
the specified thread’s counter. This code also uses READ_ However, many implementations provide cheaper mech-
ONCE() to ensure that the compiler doesn’t optimize these anisms for per-thread data that are free from arbitrary
loads into oblivion. For but one example, a pair of array-size limits. This is the topic of the next section.
consecutive calls to read_count() might be inlined, and
an intrepid optimizer might notice that the same locations 5.2.3 Per-Thread-Variable-Based Imple-
were being summed and thus incorrectly conclude that it
would be simply wonderful to sum them once and use the
mentation
resulting value twice. This sort of optimization might be The C language, since C11, features a _Thread_local
rather frustrating to people expecting later read_count() storage class that provides per-thread storage.2 This can be
calls to account for the activities of other threads. The use used as shown in Listing 5.4 (count_end.c) to implement
of READ_ONCE() prevents this optimization and others a statistical counter that not only scales well and avoids
besides. arbitrary thread-number limits, but that also incurs little
Quick Quiz 5.15: How does the per-thread counter variable or no performance penalty to incrementers compared to
in Listing 5.3 get initialized? simple non-atomic increment.
Lines 1–4 define needed variables: counter is the
per-thread counter variable, the counterp[] array allows
Quick Quiz 5.16: How is the code in Listing 5.3 supposed
to permit more than one counter? threads to access each others’ counters, finalcount ac-
cumulates the total as individual threads exit, and final_
This approach scales linearly with increasing number mutex coordinates between threads accumulating the total
of updater threads invoking inc_count(). As is shown value of the counter and exiting threads.
by the green arrows on each CPU in Figure 5.4, the
reason for this is that each CPU can make rapid progress 2 GCC provides its own __thread storage class, which was used
incrementing its thread’s variable, without any expensive in previous versions of this book. The two methods for specifying a
cross-system communication. As such, this section solves thread-local variable are interchangeable when using GCC.
v2022.09.25a
5.2. STATISTICAL COUNTERS 53
Listing 5.4: Per-Thread Statistical Counters counter-pointers to that variable rather than setting them to
1 unsigned long _Thread_local counter = 0; NULL?
2 unsigned long *counterp[NR_THREADS] = { NULL };
3 unsigned long finalcount = 0;
4 DEFINE_SPINLOCK(final_mutex); Quick Quiz 5.20: Why on earth do we need something as
5
heavyweight as a lock guarding the summation in the function
6 static inline void inc_count(void)
7 { read_count() in Listing 5.4?
8 WRITE_ONCE(counter, counter + 1);
9 } Lines 25–32 show the count_register_thread()
10
11 static inline unsigned long read_count(void) function, which must be called by each thread before its
12 { first use of this counter. This function simply sets up this
13 int t;
14 unsigned long sum; thread’s element of the counterp[] array to point to its
15
16 spin_lock(&final_mutex);
per-thread counter variable.
17 sum = finalcount;
18 for_each_thread(t)
Quick Quiz 5.21: Why on earth do we need to acquire the
19 if (counterp[t] != NULL) lock in count_register_thread() in Listing 5.4? It is a
20 sum += READ_ONCE(*counterp[t]); single properly aligned machine-word store to a location that
21 spin_unlock(&final_mutex);
22 return sum; no other thread is modifying, so it should be atomic anyway,
23 } right?
24
25 void count_register_thread(unsigned long *p)
26 { Lines 34–42 show the count_unregister_
27 int idx = smp_thread_id(); thread() function, which must be called prior to exit
28
29 spin_lock(&final_mutex); by each thread that previously called count_register_
30 counterp[idx] = &counter; thread(). Line 38 acquires the lock, and line 41 releases
31 spin_unlock(&final_mutex);
32 } it, thus excluding any calls to read_count() as well as
33 other calls to count_unregister_thread(). Line 39
34 void count_unregister_thread(int nthreadsexpected)
35 { adds this thread’s counter to the global finalcount,
36 int idx = smp_thread_id(); and then line 40 NULLs out its counterp[] array entry.
37
38 spin_lock(&final_mutex); A subsequent call to read_count() will see the exiting
39 finalcount += counter; thread’s count in the global finalcount, and will
40 counterp[idx] = NULL;
41 spin_unlock(&final_mutex); skip the exiting thread when sequencing through the
42 } counterp[] array, thus obtaining the correct total.
This approach gives updaters almost exactly the same
performance as a non-atomic add, and also scales linearly.
Quick Quiz 5.18: Doesn’t that explicit counterp array On the other hand, concurrent reads contend for a sin-
in Listing 5.4 reimpose an arbitrary limit on the number
gle global lock, and therefore perform poorly and scale
of threads? Why doesn’t the C language provide a per_
abysmally. However, this is not a problem for statistical
thread() interface, similar to the Linux kernel’s per_cpu()
primitive, to allow threads to more easily access each others’ counters, where incrementing happens often and readout
per-thread variables? happens almost never. Of course, this approach is consid-
erably more complex than the array-based scheme, due to
The inc_count() function used by updaters is quite the fact that a given thread’s per-thread variables vanish
simple, as can be seen on lines 6–9. when that thread exits.
The read_count() function used by readers is a bit Quick Quiz 5.22: Fine, but the Linux kernel doesn’t have
more complex. Line 16 acquires a lock to exclude exiting to acquire a lock when reading out the aggregate value of
threads, and line 21 releases it. Line 17 initializes the per-CPU counters. So why should user-space code need to do
sum to the count accumulated by those threads that have this???
already exited, and lines 18–20 sum the counts being
Both the array-based and _Thread_local-based ap-
accumulated by threads currently running. Finally, line 22
proaches offer excellent update-side performance and
returns the sum.
scalability. However, these benefits result in large read-
Quick Quiz 5.19: Doesn’t the check for NULL on line 19 side expense for large numbers of threads. The next
of Listing 5.4 add extra branch mispredictions? Why not
section shows one way to reduce read-side expense while
have a variable set permanently to zero, and point unused
still retaining the update-side scalability.
v2022.09.25a
54 CHAPTER 5. COUNTING
v2022.09.25a
5.3. APPROXIMATE LIMIT COUNTERS 55
Quick Quiz 5.23: Why doesn’t inc_count() in Listing 5.5 5.3 Approximate Limit Counters
need to use atomic instructions? After all, we now have
multiple threads accessing the per-thread counters!
An approximate answer to the right problem is worth
a good deal more than an exact answer to an
approximate problem.
Quick Quiz 5.24: Won’t the single global thread in the func-
tion eventual() of Listing 5.5 be just as severe a bottleneck John Tukey
as a global lock would be?
Another special case of counting involves limit-checking.
For example, as noted in the approximate structure-
Quick Quiz 5.25: Won’t the estimate returned by read_ allocation limit problem in Quick Quiz 5.3, suppose that
count() in Listing 5.5 become increasingly inaccurate as the you need to maintain a count of the number of structures
number of threads rises?
allocated in order to fail any allocations once the number
of structures in use exceeds a limit, in this case, 10,000.
Suppose further that these structures are short-lived, that
Quick Quiz 5.26: Given that in the eventually-consistent
this limit is rarely exceeded, and that this limit is approx-
algorithm shown in Listing 5.5 both reads and updates have
extremely low overhead and are extremely scalable, why imate in that it is OK to exceed it sometimes by some
would anyone bother with the implementation described in bounded amount (see Section 5.4 if you instead need the
Section 5.2.2, given its costly read-side code? limit to be exact).
5.3.1 Design
Quick Quiz 5.27: What is the accuracy of the estimate
returned by read_count() in Listing 5.5? One possible design for limit counters is to divide the
limit of 10,000 by the number of threads, and give each
thread a fixed pool of structures. For example, given 100
threads, each thread would manage its own pool of 100
structures. This approach is simple, and in some cases
works well, but it does not handle the common case where
5.2.5 Discussion a given structure is allocated by one thread and freed by
another [MS93]. On the one hand, if a given thread takes
These three implementations show that it is possible credit for any structures it frees, then the thread doing
to obtain near-uniprocessor performance for statistical most of the allocating runs out of structures, while the
counters, despite running on a parallel machine. threads doing most of the freeing have lots of credits that
they cannot use. On the other hand, if freed structures
Quick Quiz 5.28: What fundamental difference is there are credited to the CPU that allocated them, it will be
between counting packets and counting the total number of necessary for CPUs to manipulate each others’ counters,
bytes in the packets, given that the packets vary in size? which will require expensive atomic instructions or other
means of communicating between threads.3
In short, for many important workloads, we cannot fully
Quick Quiz 5.29: Given that the reader must sum all the partition the counter. Given that partitioning the counters
threads’ counters, this counter-read operation could take a long was what brought the excellent update-side performance
time given large numbers of threads. Is there any way that for the three schemes discussed in Section 5.2, this might
the increment operation can remain fast and scalable while
be grounds for some pessimism. However, the eventually
allowing readers to also enjoy not only reasonable performance
and scalability, but also good accuracy?
consistent algorithm presented in Section 5.2.4 provides
an interesting hint. Recall that this algorithm kept two sets
of books, a per-thread counter variable for updaters and a
Given what has been presented in this section, you
should now be able to answer the Quick Quiz about 3 That said, if each structure will always be freed by the same CPU
statistical counters for networking near the beginning of (or thread) that allocated it, then this simple partitioning approach works
this chapter. extremely well.
v2022.09.25a
56 CHAPTER 5. COUNTING
global_count variable for readers, with an eventual() Listing 5.6: Simple Limit Counter Variables
thread that periodically updated global_count to be 1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
eventually consistent with the values of the per-thread 3 unsigned long globalcountmax = 10000;
counter. The per-thread counter perfectly partitioned 4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
the counter value, while global_count kept the full 6 unsigned long *counterp[NR_THREADS] = { NULL };
value. 7 DEFINE_SPINLOCK(gblcnt_mutex);
v2022.09.25a
5.3. APPROXIMATE LIMIT COUNTERS 57
countermax 3 Listing 5.7: Simple Limit Counter Add, Subtract, and Read
globalreserve
4
5 return 1;
6 }
countermax 1 counter 1
7 spin_lock(&gblcnt_mutex);
8 globalize_count();
countermax 0 9 if (globalcountmax -
counter 0
10 globalcount - globalreserve < delta) {
11 spin_unlock(&gblcnt_mutex);
12 return 0;
globalcount
13 }
14 globalcount += delta;
15 balance_count();
16 spin_unlock(&gblcnt_mutex);
17 return 1;
18 }
19
20 static __inline__ int sub_count(unsigned long delta)
21 {
22 if (counter >= delta) {
Figure 5.5: Simple Limit Counter Variable Relationships 23 WRITE_ONCE(counter, counter - delta);
24 return 1;
25 }
26 spin_lock(&gblcnt_mutex);
in other words, no thread is permitted to access or modify 27 globalize_count();
any of the global variables unless it has acquired gblcnt_ 28 if (globalcount < delta) {
29 spin_unlock(&gblcnt_mutex);
mutex. 30 return 0;
Listing 5.7 shows the add_count(), sub_count(), 31 }
32 globalcount -= delta;
and read_count() functions (count_lim.c). 33 balance_count();
34 spin_unlock(&gblcnt_mutex);
Quick Quiz 5.30: Why does Listing 5.7 provide add_ 35 return 1;
count() and sub_count() instead of the inc_count() and 36 }
37
dec_count() interfaces show in Section 5.2? 38 static __inline__ unsigned long read_count(void)
39 {
Lines 1–18 show add_count(), which adds the speci- 40 int t;
41 unsigned long sum;
fied value delta to the counter. Line 3 checks to see if 42
v2022.09.25a
58 CHAPTER 5. COUNTING
the expression preceding the less-than sign shown in Fig- Listing 5.9: Simple Limit Counter Utility Functions
ure 5.5 as the difference in height of the two red (leftmost) 1 static __inline__ void globalize_count(void)
2 {
bars. If the addition of delta cannot be accommodated, 3 globalcount += counter;
then line 11 (as noted earlier) releases gblcnt_mutex 4 counter = 0;
5 globalreserve -= countermax;
and line 12 returns indicating failure. 6 countermax = 0;
Otherwise, we take the slowpath. Line 14 adds delta 7 }
8
to globalcount, and then line 15 invokes balance_ 9 static __inline__ void balance_count(void)
count() (shown in Listing 5.9) in order to update both the 10 {
11 countermax = globalcountmax -
global and the per-thread variables. This call to balance_ 12 globalcount - globalreserve;
count() will usually set this thread’s countermax to 13 countermax /= num_online_threads();
14 globalreserve += countermax;
re-enable the fastpath. Line 16 then releases gblcnt_ 15 counter = countermax / 2;
mutex (again, as noted earlier), and, finally, line 17 returns 16 if (counter > globalcount)
17 counter = globalcount;
indicating success. 18 globalcount -= counter;
19 }
Quick Quiz 5.32: Why does globalize_count() zero the 20
v2022.09.25a
5.3. APPROXIMATE LIMIT COUNTERS 59
this function does not change the aggregate value of the by the bottommost dotted line connecting the leftmost
counter, but instead changes how the counter’s current and center configurations. In other words, the sum of
value is represented. Line 3 adds the thread’s counter globalcount and the four threads’ counter variables is
variable to globalcount, and line 4 zeroes counter. the same in both configurations. Similarly, this change did
Similarly, line 5 subtracts the per-thread countermax not affect the sum of globalcount and globalreserve,
from globalreserve, and line 6 zeroes countermax. It as indicated by the upper dotted line.
is helpful to refer to Figure 5.5 when reading both this The rightmost configuration shows the relationship
function and balance_count(), which is next. of these counters after balance_count() is executed,
Lines 9–19 show balance_count(), which is roughly again by thread 0. One-quarter of the remaining count,
speaking the inverse of globalize_count(). This func- denoted by the vertical line extending up from all three
tion’s job is to set the current thread’s countermax vari- configurations, is added to thread 0’s countermax and
able to the largest value that avoids the risk of the counter half of that to thread 0’s counter. The amount added to
exceeding the globalcountmax limit. Changing the thread 0’s counter is also subtracted from globalcount
current thread’s countermax variable of course requires in order to avoid changing the overall value of the counter
corresponding adjustments to counter, globalcount (which is again the sum of globalcount and the three
and globalreserve, as can be seen by referring back to threads’ counter variables), again as indicated by the
Figure 5.5. By doing this, balance_count() maximizes lowermost of the two dotted lines connecting the center and
use of add_count()’s and sub_count()’s low-overhead rightmost configurations. The globalreserve variable
fastpaths. As with globalize_count(), balance_ is also adjusted so that this variable remains equal to the
count() is not permitted to change the aggregate value sum of the four threads’ countermax variables. Because
of the counter. thread 0’s counter is less than its countermax, thread 0
Lines 11–13 compute this thread’s share of that por- can once again increment the counter locally.
tion of globalcountmax that is not already covered by Quick Quiz 5.37: In Figure 5.6, even though a quarter of the
either globalcount or globalreserve, and assign the remaining count up to the limit is assigned to thread 0, only an
computed quantity to this thread’s countermax. Line 14 eighth of the remaining count is consumed, as indicated by the
makes the corresponding adjustment to globalreserve. uppermost dotted line connecting the center and the rightmost
Line 15 sets this thread’s counter to the middle of the configurations. Why is that?
range from zero to countermax. Line 16 checks to
see whether globalcount can in fact accommodate this Lines 21–28 show count_register_thread(),
value of counter, and, if not, line 17 decreases counter which sets up state for newly created threads. This
accordingly. Finally, in either case, line 18 makes the function simply installs a pointer to the newly created
corresponding adjustment to globalcount. thread’s counter variable into the corresponding entry of
the counterp[] array under the protection of gblcnt_
Quick Quiz 5.36: Why set counter to countermax / 2
mutex.
in line 15 of Listing 5.9? Wouldn’t it be simpler to just take
countermax counts? Finally, lines 30–38 show count_unregister_
thread(), which tears down state for a soon-to-be-exiting
It is helpful to look at a schematic depicting how the thread. Line 34 acquires gblcnt_mutex and line 37 re-
relationship of the counters changes with the execution of leases it. Line 35 invokes globalize_count() to clear
first globalize_count() and then balance_count(), out this thread’s counter state, and line 36 clears this
as shown in Figure 5.6. Time advances from left to right, thread’s entry in the counterp[] array.
with the leftmost configuration roughly that of Figure 5.5.
The center configuration shows the relationship of these
5.3.3 Simple Limit Counter Discussion
same counters after globalize_count() is executed by
thread 0. As can be seen from the figure, thread 0’s This type of counter is quite fast when aggregate val-
counter (“c 0” in the figure) is added to globalcount, ues are near zero, with some overhead due to the com-
while the value of globalreserve is reduced by this same parison and branch in both add_count()’s and sub_
amount. Both thread 0’s counter and its countermax count()’s fastpaths. However, the use of a per-thread
(“cm 0” in the figure) are reduced to zero. The other three countermax reserve means that add_count() can fail
threads’ counters are unchanged. Note that this change even when the aggregate value of the counter is nowhere
did not affect the overall value of the counter, as indicated near globalcountmax. Similarly, sub_count() can fail
v2022.09.25a
60 CHAPTER 5. COUNTING
globalize_count() balance_count()
cm 3
globalreserve
c 3
globalreserve
cm 3 cm 3
globalreserve
c 3 c 3
cm 2
c 2
cm 2 cm 2
c 2 c 2
cm 1 c 1
cm 1 c 1 cm 1 c 1
cm 0
c 0
cm 0 c 0
globalcount
globalcount
globalcount
v2022.09.25a
5.4. EXACT LIMIT COUNTERS 61
5.3.5 Approximate Limit Counter Discus- Listing 5.12: Atomic Limit Counter Variables and Access
sion Functions
1 atomic_t __thread counterandmax = ATOMIC_INIT(0);
2 unsigned long globalcountmax = 1 << 25;
These changes greatly reduce the limit inaccuracy seen in 3 unsigned long globalcount = 0;
the previous version, but present another problem: Any 4 unsigned long globalreserve = 0;
5 atomic_t *counterp[NR_THREADS] = { NULL };
given value of MAX_COUNTERMAX will cause a workload- 6 DEFINE_SPINLOCK(gblcnt_mutex);
dependent fraction of accesses to fall off the fastpath. As 7 #define CM_BITS (sizeof(atomic_t) * 4)
8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
the number of threads increase, non-fastpath execution 9
will become both a performance and a scalability problem. 10 static __inline__ void
11 split_counterandmax_int(int cami, int *c, int *cm)
However, we will defer this problem and turn instead to 12 {
counters with exact limits. 13 *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
14 *cm = cami & MAX_COUNTERMAX;
15 }
16
17 static __inline__ void
5.4 Exact Limit Counters 18
19
split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)
{
20 unsigned int cami = atomic_read(cam);
21
Exactitude can be expensive. Spend wisely. 22 *old = cami;
23 split_counterandmax_int(cami, c, cm);
Unknown 24 }
25
26 static __inline__ int merge_counterandmax(int c, int cm)
To solve the exact structure-allocation limit problem noted 27 {
28 unsigned int cami;
in Quick Quiz 5.4, we need a limit counter that can 29
tell exactly when its limits are exceeded. One way of 30 cami = (c << CM_BITS) | cm;
31 return ((int)cami);
implementing such a limit counter is to cause threads 32 }
that have reserved counts to give them up. One way to
do this is to use atomic instructions. Of course, atomic
instructions will slow down the fastpath, but on the other variable is of type atomic_t, which has an underlying
hand, it would be silly not to at least give them a try. representation of int.
Lines 2–6 show the definitions for globalcountmax,
5.4.1 Atomic Limit Counter Implementa- globalcount, globalreserve, counterp, and
gblcnt_mutex, all of which take on roles similar to
tion
their counterparts in Listing 5.10. Line 7 defines CM_
Unfortunately, if one thread is to safely remove counts BITS, which gives the number of bits in each half of
from another thread, both threads will need to atomically counterandmax, and line 8 defines MAX_COUNTERMAX,
manipulate that thread’s counter and countermax vari- which gives the maximum value that may be held in either
ables. The usual way to do this is to combine these two half of counterandmax.
variables into a single variable, for example, given a 32-bit Quick Quiz 5.39: In what way does line 7 of Listing 5.12
variable, using the high-order 16 bits to represent counter violate the C standard?
and the low-order 16 bits to represent countermax.
Lines 10–15 show the split_counterandmax_
Quick Quiz 5.38: Why is it necessary to atomically manip-
ulate the thread’s counter and countermax variables as a int() function, which, when given the underlying int
unit? Wouldn’t it be good enough to atomically manipulate from the atomic_t counterandmax variable, splits it
them individually? into its counter (c) and countermax (cm) components.
Line 13 isolates the most-significant half of this int,
The variables and access functions for a simple atomic placing the result as specified by argument c, and line 14
limit counter are shown in Listing 5.12 (count_lim_ isolates the least-significant half of this int, placing the
atomic.c). The counter and countermax variables in result as specified by argument cm.
earlier algorithms are combined into the single variable Lines 17–24 show the split_counterandmax() func-
counterandmax shown on line 1, with counter in the tion, which picks up the underlying int from the spec-
upper half and countermax in the lower half. This ified variable on line 20, stores it as specified by the
v2022.09.25a
62 CHAPTER 5. COUNTING
v2022.09.25a
5.4. EXACT LIMIT COUNTERS 63
Listing 5.14: Atomic Limit Counter Read Listing 5.15: Atomic Limit Counter Utility Functions 1
1 unsigned long read_count(void) 1 static void globalize_count(void)
2 { 2 {
3 int c; 3 int c;
4 int cm; 4 int cm;
5 int old; 5 int old;
6 int t; 6
7 unsigned long sum; 7 split_counterandmax(&counterandmax, &old, &c, &cm);
8 8 globalcount += c;
9 spin_lock(&gblcnt_mutex); 9 globalreserve -= cm;
10 sum = globalcount; 10 old = merge_counterandmax(0, 0);
11 for_each_thread(t) 11 atomic_set(&counterandmax, old);
12 if (counterp[t] != NULL) { 12 }
13 split_counterandmax(counterp[t], &old, &c, &cm); 13
14 sum += c; 14 static void flush_local_count(void)
15 } 15 {
16 spin_unlock(&gblcnt_mutex); 16 int c;
17 return sum; 17 int cm;
18 } 18 int old;
19 int t;
20 int zero;
21
22 if (globalreserve == 0)
global counters, and then lines 22–23 recheck whether 23 return;
delta can be accommodated. If, after all that, the addition 24 zero = merge_counterandmax(0, 0);
25 for_each_thread(t)
of delta still cannot be accommodated, then line 24 26 if (counterp[t] != NULL) {
releases gblcnt_mutex (as noted earlier), and then line 25 27 old = atomic_xchg(counterp[t], zero);
28 split_counterandmax_int(old, &c, &cm);
returns failure. 29 globalcount += c;
Otherwise, line 28 adds delta to the global counter, 30 globalreserve -= cm;
31 }
line 29 spreads counts to the local state if appropriate, 32 }
line 30 releases gblcnt_mutex (again, as noted earlier),
and finally, line 31 returns success.
Lines 34–63 of Listing 5.13 show sub_count(), which local variable zero to a combined zeroed counter and
is structured similarly to add_count(), having a fastpath countermax. The loop spanning lines 25–31 sequences
on lines 41–48 and a slowpath on lines 49–62. A line-by- through each thread. Line 26 checks to see if the current
line analysis of this function is left as an exercise to the thread has counter state, and, if so, lines 27–30 move that
reader. state to the global counters. Line 27 atomically fetches
Listing 5.14 shows read_count(). Line 9 acquires the current thread’s state while replacing it with zero.
gblcnt_mutex and line 16 releases it. Line 10 initializes Line 28 splits this state into its counter (in local variable
local variable sum to the value of globalcount, and the c) and countermax (in local variable cm) components.
loop spanning lines 11–15 adds the per-thread counters to Line 29 adds this thread’s counter to globalcount,
this sum, isolating each per-thread counter using split_ while line 30 subtracts this thread’s countermax from
counterandmax on line 13. Finally, line 17 returns the globalreserve.
sum. Quick Quiz 5.44: What stops a thread from simply refilling its
Listings 5.15 and 5.16 show the utility func- counterandmax variable immediately after flush_local_
tions globalize_count(), flush_local_count(), count() on line 14 of Listing 5.15 empties it?
balance_count(), count_register_thread(), and
count_unregister_thread(). The code for Quick Quiz 5.45: What prevents concurrent execution of
globalize_count() is shown on lines 1–12 of List- the fastpath of either add_count() or sub_count() from
ing 5.15, and is similar to that of previous algorithms, interfering with the counterandmax variable while flush_
local_count() is accessing it on line 27 of Listing 5.15?
with the addition of line 7, which is now required to split
out counter and countermax from counterandmax.
The code for flush_local_count(), which moves Lines 1–22 on Listing 5.16 show the code for
all threads’ local counter state to the global counter, is balance_count(), which refills the calling thread’s local
shown on lines 14–32. Line 22 checks to see if the value counterandmax variable. This function is quite similar
of globalreserve permits any per-thread counts, and, to that of the preceding algorithms, with changes required
if not, line 23 returns. Otherwise, line 24 initializes to handle the merged counterandmax variable. Detailed
v2022.09.25a
64 CHAPTER 5. COUNTING
this slowdown, it is worthwhile looking for algorithms READY are green, REQ is red, and ACK is blue.
v2022.09.25a
5.4. EXACT LIMIT COUNTERS 65
to each thread, and the corresponding signal handler Listing 5.17: Signal-Theft Limit Counter Data
checks the corresponding thread’s theft and counting 1 #define THEFT_IDLE 0
2 #define THEFT_REQ 1
variables. If the theft state is not REQ, then the signal 3 #define THEFT_ACK 2
handler is not permitted to change the state, and therefore 4 #define THEFT_READY 3
5
simply returns. Otherwise, if the counting variable is set, 6 int __thread theft = THEFT_IDLE;
indicating that the current thread’s fastpath is in progress, 7 int __thread counting = 0;
8 unsigned long __thread counter = 0;
the signal handler sets the theft state to ACK, otherwise 9 unsigned long __thread countermax = 0;
to READY. 10 unsigned long globalcountmax = 10000;
11 unsigned long globalcount = 0;
If the theft state is ACK, only the fastpath is permitted 12 unsigned long globalreserve = 0;
to change the theft state, as indicated by the blue color. 13 unsigned long *counterp[NR_THREADS] = { NULL };
14 unsigned long *countermaxp[NR_THREADS] = { NULL };
When the fastpath completes, it sets the theft state to 15 int *theftp[NR_THREADS] = { NULL };
READY. 16 DEFINE_SPINLOCK(gblcnt_mutex);
17 #define MAX_COUNTERMAX 100
Once the slowpath sees a thread’s theft state is
READY, the slowpath is permitted to steal that thread’s
count. The slowpath then sets that thread’s theft state to Quick Quiz 5.50: In Listing 5.18’s function flush_local_
IDLE. count_sig(), why are there READ_ONCE() and WRITE_
Quick Quiz 5.48: In Figure 5.7, why is the REQ theft state ONCE() wrappers around the uses of the theft per-thread
colored red? variable?
Listing 5.17 (count_lim_sig.c) shows the data struc- Quick Quiz 5.51: In Listing 5.18, why is it safe for line 28 to
directly access the other thread’s countermax variable?
tures used by the signal-theft based counter implemen-
tation. Lines 1–7 define the states and values for the
per-thread theft state machine described in the preceding Quick Quiz 5.52: In Listing 5.18, why doesn’t line 33 check
for the current thread sending itself a signal?
section. Lines 8–17 are similar to earlier implementa-
tions, with the addition of lines 14 and 15 to allow remote
Quick Quiz 5.53: The code shown in Listings 5.17 and 5.18
access to a thread’s countermax and theft variables,
works with GCC and POSIX. What would be required to make
respectively.
it also conform to the ISO C standard?
Listing 5.18 shows the functions responsible for migrat-
ing counts between per-thread variables and the global The loop spanning lines 35–48 waits until each thread
variables. Lines 1–7 show globalize_count(), which reaches READY state, then steals that thread’s count.
is identical to earlier implementations. Lines 9–19 show Lines 36–37 skip any non-existent threads, and the loop
flush_local_count_sig(), which is the signal han- spanning lines 38–42 waits until the current thread’s
dler used in the theft process. Lines 11 and 12 check to theft state becomes READY. Line 39 blocks for a
see if the theft state is REQ, and, if not returns without millisecond to avoid priority-inversion problems, and if
change. Line 13 executes a memory barrier to ensure line 40 determines that the thread’s signal has not yet
that the sampling of the theft variable happens before any arrived, line 41 resends the signal. Execution reaches
change to that variable. Line 14 sets the theft state to line 43 when the thread’s theft state becomes READY,
ACK, and, if line 15 sees that this thread’s fastpaths are so lines 43–46 do the thieving. Line 47 then sets the
not running, line 16 sets the theft state to READY. thread’s theft state back to IDLE.
v2022.09.25a
66 CHAPTER 5. COUNTING
v2022.09.25a
5.4. EXACT LIMIT COUNTERS 67
Listing 5.20: Signal-Theft Limit Counter Subtract Function Listing 5.22: Signal-Theft Limit Counter Initialization Func-
1 int sub_count(unsigned long delta) tions
2 { 1 void count_init(void)
3 int fastpath = 0; 2 {
4 3 struct sigaction sa;
5 WRITE_ONCE(counting, 1); 4
6 barrier(); 5 sa.sa_handler = flush_local_count_sig;
7 if (READ_ONCE(theft) <= THEFT_REQ && 6 sigemptyset(&sa.sa_mask);
8 counter >= delta) { 7 sa.sa_flags = 0;
9 WRITE_ONCE(counter, counter - delta); 8 if (sigaction(SIGUSR1, &sa, NULL) != 0) {
10 fastpath = 1; 9 perror("sigaction");
11 } 10 exit(EXIT_FAILURE);
12 barrier(); 11 }
13 WRITE_ONCE(counting, 0); 12 }
14 barrier(); 13
15 if (READ_ONCE(theft) == THEFT_ACK) { 14 void count_register_thread(void)
16 smp_mb(); 15 {
17 WRITE_ONCE(theft, THEFT_READY); 16 int idx = smp_thread_id();
18 } 17
19 if (fastpath) 18 spin_lock(&gblcnt_mutex);
20 return 1; 19 counterp[idx] = &counter;
21 spin_lock(&gblcnt_mutex); 20 countermaxp[idx] = &countermax;
22 globalize_count(); 21 theftp[idx] = &theft;
23 if (globalcount < delta) { 22 spin_unlock(&gblcnt_mutex);
24 flush_local_count(); 23 }
25 if (globalcount < delta) { 24
26 spin_unlock(&gblcnt_mutex); 25 void count_unregister_thread(int nthreadsexpected)
27 return 0; 26 {
28 } 27 int idx = smp_thread_id();
29 } 28
30 globalcount -= delta; 29 spin_lock(&gblcnt_mutex);
31 balance_count(); 30 globalize_count();
32 spin_unlock(&gblcnt_mutex); 31 counterp[idx] = NULL;
33 return 1; 32 countermaxp[idx] = NULL;
34 } 33 theftp[idx] = NULL;
34 spin_unlock(&gblcnt_mutex);
35 }
Listing 5.21: Signal-Theft Limit Counter Read Function
1 unsigned long read_count(void)
2 { Lines 1–12 of Listing 5.22 show count_init(), which
3 int t;
4 unsigned long sum; set up flush_local_count_sig() as the signal han-
5
dler for SIGUSR1, enabling the pthread_kill() calls
6 spin_lock(&gblcnt_mutex);
7 sum = globalcount; in flush_local_count() to invoke flush_local_
8 for_each_thread(t) count_sig(). The code for thread registry and unregistry
9 if (counterp[t] != NULL)
10 sum += READ_ONCE(*counterp[t]); is similar to that of earlier examples, so its analysis is left
11 spin_unlock(&gblcnt_mutex); as an exercise for the reader.
12 return sum;
13 }
v2022.09.25a
68 CHAPTER 5. COUNTING
Quick Quiz 5.57: What else had you better have done when
using a biased counter?
5.5 Parallel Counting Discussion
Although a biased counter can be quite helpful and
useful, it is only a partial solution to the removable I/O
This idea that there is generality in the specific is of
device access-count problem called out on page 49. When
far-reaching importance.
attempting to remove a device, we must not only know
the precise number of current I/O accesses, we also need Douglas R. Hofstadter
to prevent any future accesses from starting. One way
to accomplish this is to read-acquire a reader-writer lock This chapter has presented the reliability, performance, and
when updating the counter, and to write-acquire that same scalability problems with traditional counting primitives.
reader-writer lock when checking the counter. Code for The C-language ++ operator is not guaranteed to function
doing I/O might be as follows: reliably in multithreaded code, and atomic operations to a
v2022.09.25a
5.5. PARALLEL COUNTING DISCUSSION 69
5.5 Parallel Counting Discussion

This idea that there is generality in the specific is of far-reaching importance.

Douglas R. Hofstadter

This chapter has presented the reliability, performance, and scalability problems with traditional counting primitives. The C-language ++ operator is not guaranteed to function reliably in multithreaded code, and atomic operations to a single variable neither perform nor scale well. This chapter therefore presented a number of counting algorithms that perform and scale extremely well in certain special cases.

It is well worth reviewing the lessons from these counting algorithms. To that end, Section 5.5.1 overviews requisite validation, Section 5.5.2 summarizes performance and scalability, Section 5.5.3 discusses the need for specialization, and finally, Section 5.5.4 enumerates lessons learned and calls attention to later chapters that will expand on these lessons.

5.5.1 Parallel Counting Validation

Many of the algorithms in this section are quite simple, so much so that it is tempting to declare them to be correct by construction or by inspection. Unfortunately, it is all too easy for those carrying out the construction or the inspection to become overconfident, tired, confused, or just plain sloppy, all of which can result in bugs. And early implementations of these limit counters have in fact contained bugs, in some cases aided and abetted by the complexities inherent in maintaining a 64-bit count on a 32-bit system. Therefore, validation is not optional, even for the simple algorithms presented in this chapter.

The statistical counters are tested for acting like counters (“counttorture.h”), that is, that the aggregate sum in the counter changes by the sum of the amounts added by the various update-side threads. The limit counters are also tested for acting like counters (“limtorture.h”), and additionally checked for their ability to accommodate the specified limit.

Both of these test suites produce performance data that is used in Section 5.5.2.
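For concreteness, here is a minimal sketch of the sort of check that counttorture.h carries out, using the statistical counters' inc_count() and read_count() from Section 5.2; the thread-creation scaffolding shown here is ordinary pthreads rather than this book's test harness, and the constants are illustrative only:

    #include <pthread.h>
    #include <assert.h>

    #define NUPDATERS 4
    #define NUPDATES  (100 * 1000)

    static void *updater(void *arg)
    {
            long i;

            for (i = 0; i < NUPDATES; i++)
                    inc_count();    /* update-side API from Section 5.2 */
            return NULL;
    }

    void counttorture_sketch(void)
    {
            pthread_t tid[NUPDATERS];
            unsigned long before = read_count();
            int i;

            for (i = 0; i < NUPDATERS; i++)
                    pthread_create(&tid[i], NULL, updater, NULL);
            for (i = 0; i < NUPDATERS; i++)
                    pthread_join(tid[i], NULL);

            /* The aggregate count must have grown by exactly the sum
             * of the amounts added by the updater threads. */
            assert(read_count() - before ==
                   (unsigned long)NUPDATERS * NUPDATES);
    }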
Although this level of validation is good and sufficient for textbook implementations such as these, it would be wise to apply additional validation before putting similar algorithms into production. Chapter 11 describes additional approaches to testing, and given the simplicity of most of these counting algorithms, most of the techniques described in Chapter 12 can also be quite helpful.

5.5.2 Parallel Counting Performance

The top half of Table 5.1 shows the performance of the four parallel statistical counting algorithms. All four algorithms provide near-perfect linear scalability for updates. The per-thread-variable implementation (count_end.c) is significantly faster on updates than the array-based implementation (count_stat.c), but is slower at reads on large numbers of cores, and suffers severe lock contention when there are many parallel readers. This contention can be addressed using the deferred-processing techniques introduced in Chapter 9, as shown on the count_end_rcu.c row of Table 5.1. Deferred processing also shines on the count_stat_eventual.c row, courtesy of eventual consistency.

Quick Quiz 5.60: On the count_stat.c row of Table 5.1, we see that the read-side scales linearly with the number of threads. How is that possible given that the more threads there are, the more per-thread counters must be summed up?

Quick Quiz 5.61: Even on the fourth row of Table 5.1, the read-side performance of these statistical counter implementations is pretty horrible. So why bother with them?

The bottom half of Table 5.1 shows the performance of the parallel limit-counting algorithms. Exact enforcement of the limits incurs a substantial update-side performance penalty, although on this x86 system that penalty can be reduced by substituting signals for atomic operations. All of these implementations suffer from read-side lock contention in the face of concurrent readers.

Quick Quiz 5.62: Given the performance data shown in the bottom half of Table 5.1, we should always prefer signals over atomic operations, right?

Quick Quiz 5.63: Can advanced techniques be applied to address the lock contention for readers seen in the bottom half of Table 5.1?

In short, this chapter has demonstrated a number of counting algorithms that perform and scale extremely well in a number of special cases. But must our parallel counting be confined to special cases? Wouldn't it be better to have a general algorithm that operated efficiently in all cases? The next section looks at these questions.

5.5.3 Parallel Counting Specializations

The fact that these algorithms only work well in their respective special cases might be considered a major problem with parallel programming in general. After all, the C-language ++ operator works just fine in single-threaded code, and not just for special cases, but in general, right?

This line of reasoning does contain a grain of truth, but is in essence misguided. The problem is not parallelism as such, but rather scalability.
Table 5.1: Statistical and limit counter performance (updates and reads, in nanoseconds)

Algorithm                                              Reads (ns)
(count_*.c)    Section  Exact?  Updates (ns)   1 CPU   8 CPUs   64 CPUs   420 CPUs
stat           5.2.2      -          6.3          294      303       315        612
stat_eventual  5.2.4      -          6.4            1        1         1          1
end            5.2.3      -          2.9          301    6,309   147,594    239,683
end_rcu        13.5.1     -          2.9          454      481       508      2,317
lim            5.3.2      N          3.2          435    6,678   156,175    239,422
lim_app        5.3.4      N          2.4          485    7,041   173,108    239,682
lim_atomic     5.4.1      Y         19.7          513    7,085   199,957    239,450
lim_sig        5.4.4      Y          4.7          519    6,805   120,000    238,811
To understand this, first consider the C-language ++ operator. The fact is that it does not work in general, only for a restricted range of numbers. If you need to deal with 1,000-digit decimal numbers, the C-language ++ operator will not work for you.

Quick Quiz 5.64: The ++ operator works just fine for 1,000-digit numbers! Haven't you heard of operator overloading???

This problem is not specific to arithmetic. Suppose you need to store and query data. Should you use an ASCII file? XML? A relational database? A linked list? A dense array? A B-tree? A radix tree? Or one of the plethora of other data structures and environments that permit data to be stored and queried? It depends on what you need to do, how fast you need it done, and how large your data set is—even on sequential systems.

Similarly, if you need to count, your solution will depend on how large of numbers you need to work with, how many CPUs need to be manipulating a given number concurrently, how the number is to be used, and what level of performance and scalability you will need.

Nor is this problem specific to software. The design for a bridge meant to allow people to walk across a small brook might be as simple as a single wooden plank. But you would probably not use a plank to span the kilometers-wide mouth of the Columbia River, nor would such a design be advisable for bridges carrying concrete trucks. In short, just as bridge design must change with increasing span and load, so must software design change as the number of CPUs increases. That said, it would be good to automate this process, so that the software adapts to changes in hardware configuration and in workload. There has in fact been some research into this sort of automation [AHS+03, SAH+03], and the Linux kernel does some boot-time reconfiguration, including limited binary rewriting. This sort of adaptation will become increasingly important as the number of CPUs on mainstream systems continues to increase.

In short, as discussed in Chapter 3, the laws of physics constrain parallel software just as surely as they constrain mechanical artifacts such as bridges. These constraints force specialization, though in the case of software it might be possible to automate the choice of specialization to fit the hardware and workload in question.

Of course, even generalized counting is quite specialized. We need to do a great number of other things with computers. The next section relates what we have learned from counters to topics taken up later in this book.

5.5.4 Parallel Counting Lessons

The opening paragraph of this chapter promised that our study of counting would provide an excellent introduction to parallel programming. This section makes explicit connections between the lessons from this chapter and the material presented in a number of later chapters.

The examples in this chapter have shown that an important scalability and performance tool is partitioning. The counters might be fully partitioned, as in the statistical counters discussed in Section 5.2, or partially partitioned as in the limit counters discussed in Sections 5.3 and 5.4. Partitioning will be considered in far greater depth in Chapter 6, and partial parallelization in particular in Section 6.4, where it is called parallel fastpath.

Quick Quiz 5.65: But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space?
Chapter 6

Partitioning and Synchronization Design

Divide and rule.

Philip II of Macedon

... replication, is covered in Chapter 9.2

2 But feel free to instead think in terms of chopsticks.
Figure 6.4: Dining Philosophers Problem, Partitioned

... partition technique. Here the upper and rightmost philosophers share a pair of forks, while the lower and leftmost philosophers share another pair of forks. If all philosophers are simultaneously hungry, at least two will always be able to eat concurrently. In addition, as shown in the figure, the forks can now be bundled so that the pair are picked up and put down simultaneously, simplifying the acquisition and release algorithms.

Quick Quiz 6.1: Is there a better solution to the Dining Philosophers Problem?

Quick Quiz 6.2: How would you validate an algorithm alleged to solve the Dining Philosophers Problem?

This is an example of “horizontal parallelism” [Inm85] or “data parallelism”, so named because there is no dependency among the pairs of philosophers. In a horizontally parallel data-processing system, a given item of data would be processed by only one of a replicated set of software components.

Quick Quiz 6.3: And in just what sense can this “horizontal parallelism” be said to be “horizontal”?

6.1.2 Double-Ended Queue

A double-ended queue is a data structure containing a list of elements that may be inserted or removed from either end [Knu73]. It has been claimed that a lock-based implementation permitting concurrent operations on both ends of the double-ended queue is difficult [Gro07]. This section shows how a partitioning design strategy can result in a reasonably simple implementation. But first, how should such an implementation be validated?

A good place to start is with invariants. For example, if elements are pushed onto one end of a double-ended queue and popped off of the other, the order of those elements must be preserved. Similarly, if elements are pushed onto one end of the queue and popped off of that same end, the order of those elements must be reversed. Any element popped from the queue must have been most recently pushed onto that queue, and if the queue is emptied, all elements pushed onto it must have already been popped from it.

The beginnings of a test suite for concurrent double-ended queues (“deqtorture.h”) provides the following checks:

1. Element-ordering checks provided by CHECK_SEQUENCE_PAIR().

2. Checks that elements popped were most recently pushed, provided by melee().

3. Checks that elements pushed are popped before the queue is emptied, also provided by melee().

This suite includes both sequential and concurrent tests. Although this suite is good and sufficient for textbook code, you should test considerably more thoroughly for code intended for production use. Chapters 11 and 12 cover a large array of validation tools and techniques. But with a prototype test suite in place, we are ready to look at the double-ended-queue algorithms in the next sections.

6.1.2.2 Left- and Right-Hand Locks

One seemingly straightforward approach would be to use a doubly linked list with a left-hand lock for left-hand-end enqueue and dequeue operations along with a right-hand lock for right-hand-end operations, as shown in Figure 6.5. However, the problem with this approach is that the two locks' domains must overlap when there are fewer than four elements on the list. This overlap is due to the fact that removing any given element affects not only that element, but also its left- and right-hand neighbors. These domains are indicated by color in the figure, with blue with ...
[Figure residue: diagrams of double-ended queues with left- and right-hand locks ("Lock L", "Lock R") and headers ("Header L", "Header R") holding zero, one, and two elements.]

... same double-ended queue, as we can unconditionally left-enqueue elements to the left-hand queue and right-enqueue elements to the right-hand queue. The main complication arises when dequeuing from an empty queue, in which case it is necessary to:

1. If holding the right-hand lock, release it and acquire the left-hand lock.

2. Acquire the right-hand lock.
Figure 6.7: Hashed Double-Ended Queue (four underlying queues DEQ 0 through DEQ 3, each with its own lock Lock 0 through Lock 3, holding elements indexed L-8 ... R7)

Listing 6.1: Lock-Based Parallel Double-Ended Queue Data Structure
1 struct pdeq {
2   spinlock_t llock;
3   int lidx;
4   spinlock_t rlock;
5   int ridx;
6   struct deq bkt[PDEQ_N_BKTS];
7 };
Listing 6.1 shows the corresponding data structure: the left-hand lock on line 2, the left-hand index on line 3, the right-hand lock on line 4 (which is cache-aligned in the actual implementation), the right-hand index on line 5, and, finally, the hashed array of simple lock-based double-ended queues on line 6. A high-performance implementation would of course use padding or special alignment directives to avoid false sharing.

Listing 6.2 (lockhdeq.c) shows the implementation of the enqueue and dequeue functions.4 Discussion will focus on the left-hand operations, as the right-hand operations are trivially derived from them.

4 One could easily create a polymorphic implementation in any number of languages, but doing so is left as an exercise for the reader.

Listing 6.2: Lock-Based Parallel Double-Ended Queue Implementation
1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)
2 {
3   struct cds_list_head *e;
4   int i;
5
6   spin_lock(&d->llock);
7   i = moveright(d->lidx);
8   e = deq_pop_l(&d->bkt[i]);
9   if (e != NULL)
10    d->lidx = i;
11  spin_unlock(&d->llock);
12  return e;
13 }
14
15 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
16 {
17   struct cds_list_head *e;
18   int i;
19
20   spin_lock(&d->rlock);
21   i = moveleft(d->ridx);
22   e = deq_pop_r(&d->bkt[i]);
23   if (e != NULL)
24     d->ridx = i;
25   spin_unlock(&d->rlock);
26   return e;
27 }
28
29 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
30 {
31   int i;
32
33   spin_lock(&d->llock);
34   i = d->lidx;
35   deq_push_l(e, &d->bkt[i]);
36   d->lidx = moveleft(d->lidx);
37   spin_unlock(&d->llock);
38 }
39
40 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
41 {
42   int i;
43
44   spin_lock(&d->rlock);
45   i = d->ridx;
46   deq_push_r(e, &d->bkt[i]);
47   d->ridx = moveright(d->ridx);
48   spin_unlock(&d->rlock);
49 }

Lines 1–13 show pdeq_pop_l(), which left-dequeues and returns an element if possible, returning NULL otherwise. Line 6 acquires the left-hand spinlock, and line 7 computes the index to be dequeued from. Line 8 dequeues the element, and, if line 9 finds the result to be non-NULL, line 10 records the new left-hand index. Either way, line 11 releases the lock, and, finally, line 12 returns the element if there was one, or NULL otherwise.

Lines 29–38 show pdeq_push_l(), which left-enqueues the specified element. Line 33 acquires the left-hand lock, and line 34 picks up the left-hand index. Line 35 left-enqueues the specified element onto the double-ended queue indexed by the left-hand index. Line 36 then updates the left-hand index and line 37 releases the lock.

As noted earlier, the right-hand operations are completely analogous to their left-handed counterparts, so their analysis is left as an exercise for the reader.

Quick Quiz 6.5: Is the hashed double-ended queue a good solution? Why or why not?
6.1.2.5 Compound Double-Ended Queue Revisited

This section revisits the compound double-ended queue, using a trivial rebalancing scheme that moves all the elements from the non-empty queue to the now-empty queue.

Quick Quiz 6.6: Move all the elements to the queue that became empty? In what possible universe is this brain-dead solution in any way optimal???

In contrast to the hashed implementation presented in the previous section, the compound implementation will build on a sequential implementation of a double-ended queue that uses neither locks nor atomic operations.

Listing 6.3 shows the implementation. Unlike the hashed implementation, this compound implementation is asymmetric, so that we must consider the pdeq_pop_l() and pdeq_pop_r() implementations separately.

Quick Quiz 6.7: Why can't the compound parallel double-ended queue implementation be symmetric?

The pdeq_pop_l() implementation is shown on lines 1–16 of the figure. Line 5 acquires the left-hand lock, which line 14 releases. Line 6 attempts to left-dequeue an element from the left-hand underlying double-ended queue, and, if successful, skips lines 8–13 to simply return this element. Otherwise, line 8 acquires the right-hand lock, line 9 left-dequeues an element from the right-hand ...
... compared to software alternatives [DCW+11] and even compared to algorithms using hardware assist [DLM+10]. Nevertheless, the best we can hope for from such a scheme is 2x scalability, as at most two threads can be holding the dequeue's locks concurrently. This limitation also applies to algorithms based on non-blocking synchronization, such as the compare-and-swap-based dequeue algorithm of Michael [Mic03].5

5 This paper is interesting in that it showed that special double-compare-and-swap (DCAS) instructions are not needed for lock-free implementations of double-ended queues. Instead, the common compare-and-swap (e.g., x86 cmpxchg) suffices.

Quick Quiz 6.11: Why are there not one but two solutions to the double-ended queue problem?

In fact, as noted by Dice et al. [DLM+10], an unsynchronized single-threaded double-ended queue significantly outperforms any of the parallel implementations they studied. Therefore, the key point is that there can be significant overhead enqueuing to or dequeuing from a shared queue, regardless of implementation. This should come as no surprise in light of the material in Chapter 3, given the strict first-in-first-out (FIFO) nature of these queues.

Furthermore, these strict FIFO queues are strictly FIFO only with respect to linearization points [HW90]6 that are not visible to the caller, in fact, in these examples, the linearization points are buried in the lock-based critical sections. These queues are not strictly FIFO with respect to (say) the times at which the individual operations started [HKLP12]. This indicates that the strict FIFO property is not all that valuable in concurrent programs, and in fact, Kirsch et al. present less-strict queues that provide improved performance and scalability [KLP12].7 All that said, if you are pushing all the data used by your concurrent program through a single queue, you really need to rethink your overall design.

6 In short, a linearization point is a single point within a given function where that function can be said to have taken effect. In this lock-based implementation, the linearization points can be said to be anywhere within the critical section that does the work.
7 Nir Shavit produced relaxed stacks for roughly the same reasons [Sha11]. This situation leads some to believe that the linearization points are useful to theorists rather than developers, and leads others to wonder to what extent the designers of such data structures and algorithms were considering the needs of their users.

6.1.3 Partitioning Example Discussion

The optimal solution to the dining philosophers problem given in the answer to the Quick Quiz in Section 6.1.1 is an excellent example of “horizontal parallelism” or “data parallelism”. The synchronization overhead in this case is nearly (or even exactly) zero. In contrast, the double-ended queue implementations are examples of “vertical parallelism” or “pipelining”, given that data moves from one thread to another. The tighter coordination required for pipelining in turn requires larger units of work to obtain a given level of efficiency.

Quick Quiz 6.12: The tandem double-ended queue runs about twice as fast as the hashed double-ended queue, even when I increase the size of the hash table to an insanely large number. Why is that?

Quick Quiz 6.13: Is there a significantly better way of handling concurrency for double-ended queues?

These two examples show just how powerful partitioning can be in devising parallel algorithms. Section 6.3.5 looks briefly at a third example, matrix multiply. However, all three of these examples beg for more and better design criteria for parallel programs, a topic taken up in the next section.

6.2 Design Criteria

One pound of learning requires ten pounds of commonsense to apply it.

Persian proverb

One way to obtain the best performance and scalability is to simply hack away until you converge on the best possible parallel program. Unfortunately, if your program is other than microscopically tiny, the space of possible parallel programs is so huge that convergence is not guaranteed in the lifetime of the universe. Besides, what exactly is the “best possible parallel program”? After all, Section 2.2 called out no fewer than three parallel-programming goals of performance, productivity, and generality, and the best possible performance will likely come at a cost in terms of productivity and generality. We clearly need to be able to make higher-level choices at design time in order to arrive at an acceptably good parallel program before that program becomes obsolete.

However, more detailed design criteria are required to actually produce a real-world design, a task taken up in this section. This being the real world, these criteria often conflict to a greater or lesser degree, requiring that the designer carefully balance the resulting tradeoffs.
As such, these criteria may be thought of as the “forces” acting on the design, with particularly good tradeoffs between these forces being called “design patterns” [Ale79, GHJV95].

The design criteria for attaining the three parallel-programming goals are speedup, contention, overhead, read-to-write ratio, and complexity:

Speedup: As noted in Section 2.2, increased performance is the major reason to go to all of the time and trouble required to parallelize it. Speedup is defined to be the ratio of the time required to run a sequential version of the program to the time required to run a parallel version.

Contention: If more CPUs are applied to a parallel program than can be kept busy by that program, the excess CPUs are prevented from doing useful work by contention. This may be lock contention, memory contention, or a host of other performance killers.

Work-to-Synchronization Ratio: A uniprocessor, single-threaded, non-preemptible, and non-interruptible8 version of a given parallel program would not need any synchronization primitives. Therefore, any time consumed by these primitives (including communication cache misses as well as message latency, locking primitives, atomic instructions, and memory barriers) is overhead that does not contribute directly to the useful work that the program is intended to accomplish. Note that the important measure is the relationship between the synchronization overhead and the overhead of the code in the critical section, with larger critical sections able to tolerate greater synchronization overhead. The work-to-synchronization ratio is related to the notion of synchronization efficiency.

Read-to-Write Ratio: A data structure that is rarely updated may often be replicated rather than partitioned, and furthermore may be protected with asymmetric synchronization primitives that reduce readers' synchronization overhead at the expense of that of writers, thereby reducing overall synchronization overhead. Corresponding optimizations are possible for frequently updated data structures, as discussed in Chapter 5.

Complexity: A parallel program is more complex than an equivalent sequential program because the parallel program has a much larger state space than does the sequential program, although large state spaces having regular structures can in some cases be easily understood. A parallel programmer must consider synchronization primitives, messaging, locking design, critical-section identification, and deadlock in the context of this larger state space.

This greater complexity often translates to higher development and maintenance costs. Therefore, budgetary constraints can limit the number and types of modifications made to an existing program, since a given degree of speedup is worth only so much time and trouble. Worse yet, added complexity can actually reduce performance and scalability.

Therefore, beyond a certain point, there may be potential sequential optimizations that are cheaper and more effective than parallelization. As noted in Section 2.2.1, parallelization is but one performance optimization of many, and is furthermore an optimization that applies most readily to CPU-based bottlenecks.

8 Either by masking interrupts or by being oblivious to them.

These criteria will act together to enforce a maximum speedup. The first three criteria are deeply interrelated, so the remainder of this section analyzes these interrelationships.9

9 A real-world parallel system will be subject to many additional design criteria, such as data-structure layout, memory size, memory-hierarchy latencies, bandwidth limitations, and I/O issues.

Note that these criteria may also appear as part of the requirements specification. For example, speedup may act as a relative desideratum (“the faster, the better”) or as an absolute requirement of the workload (“the system must support at least 1,000,000 web hits per second”). Classic design pattern languages describe relative desiderata as forces and absolute requirements as context.

An understanding of the relationships between these design criteria can be very helpful when identifying appropriate design tradeoffs for a parallel program.

1. The less time a program spends in exclusive-lock critical sections, the greater the potential speedup. This is a consequence of Amdahl's Law [Amd67] because only one CPU may execute within a given exclusive-lock critical section at a given time. More specifically, for unbounded linear scalability, the fraction of time that the program spends in a given exclusive critical section must decrease as the number of CPUs increases. For example, a program will not scale to 10 CPUs unless it spends much ...
2. Contention effects consume the excess CPU and/or wallclock time when the actual speedup is less than the number of available CPUs. The larger the gap between the number of CPUs and the actual speedup, the less efficiently the CPUs will be used. Similarly, the greater the desired efficiency, the smaller the achievable speedup.

3. If the available synchronization primitives have high overhead compared to the critical sections that they guard, the best way to improve speedup is to reduce the number of times that the primitives are invoked. This can be accomplished by batching critical sections, using data ownership (see Chapter 8), using asymmetric primitives (see Chapter 9), or by using a coarse-grained design such as code locking.

4. If the critical sections have high overhead compared to the primitives guarding them, the best way to improve speedup is to increase parallelism by moving to reader/writer locking, data locking, asymmetric, or data ownership.

5. If the critical sections have high overhead compared to the primitives guarding them and the data structure being guarded is read much more often than modified, the best way to increase parallelism is to move to reader/writer locking or asymmetric primitives.

6. Many changes that improve SMP performance, for example, reducing lock contention, also improve real-time latencies [McK05c].

Quick Quiz 6.14: Don't all these problems with critical sections mean that we should just always use non-blocking synchronization [Her90], which don't have critical sections?

It is worth reiterating that contention has many guises, including lock contention, memory contention, cache overflow, thermal throttling, and much else besides. This chapter looks primarily at lock and memory contention.

Figure 6.10: Design Patterns and Lock Granularity (arrows labeled "Partition"/"Batch" between code locking and data locking, and "Own"/"Disown" for data ownership)

6.3 Synchronization Granularity

Doing little things well is a step toward doing big things better.

Harry F. Banks

Figure 6.10 gives a pictorial view of different levels of synchronization granularity, each of which is described in one of the following sections. These sections focus primarily on locking, but similar granularity issues arise with all forms of synchronization.

6.3.1 Sequential Program

If the program runs fast enough on a single processor, and has no interactions with other processes, threads, or interrupt handlers, you should remove the synchronization primitives and spare yourself their overhead and complexity. Some years back, there were those who would argue that Moore's Law would eventually force all programs into this category. However, as can be seen in Figure 6.11, the exponential increase in single-threaded performance halted in about 2003. Therefore, increasing performance will increasingly require parallelism.10 Given that back in 2006 Paul typed the first version of this sentence on a dual-core laptop, and further given that many of the graphs added in 2020 were generated on a system with ...

10 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one or more instructions per clock, and MIPS for older CPUs requiring multiple clocks to execute even the simplest instruction. The reason for taking this approach is that the newer CPUs' ability to retire multiple instructions per clock is typically limited by memory-system performance.
Figure 6.11: MIPS/Clock-Frequency Trend for Intel CPUs (clock frequency or MIPS versus year, 1975–2020; alongside it, a plot of relative performance of Ethernet versus x86 CPUs over roughly the same period)

... of Java, uses classes with synchronized instances, you are instead using “data locking”, described in Section 6.3.3.
Listing 6.6: Data-Locking Hash Table Search
1 struct hash_table       /* lines 1-15 reconstructed; cf. Listing 6.8 minus the per-node lock */
2 {
3   long nbuckets;
4   struct bucket **buckets;
5 };
6
7 struct bucket {
8   spinlock_t bucket_lock;
9   node_t *list_head;
10 };
11
12 typedef struct node {
13   unsigned long key;
14   struct node *next;
15 } node_t;
16
17 int hash_search(struct hash_table *h, long key)
18 {
19   struct bucket *bp;
20   struct node *cur;
21   int retval;
22
23   bp = h->buckets[key % h->nbuckets];
24   spin_lock(&bp->bucket_lock);
25   cur = bp->list_head;
26   while (cur != NULL) {
27     if (cur->key >= key) {
28       retval = (cur->key == key);
29       spin_unlock(&bp->bucket_lock);
30       return retval;
31     }
32     cur = cur->next;
33   }
34   spin_unlock(&bp->bucket_lock);
35   return 0;
36 }

Figure 6.14: Data Locking
... such as hash tables, as well as in situations where multiple entities are each represented by an instance of a given data structure. The Linux-kernel task list is an example of the latter, each task structure having its own alloc_lock and pi_lock.

A key challenge with data locking on dynamically allocated structures is ensuring that the structure remains in existence while the lock is being acquired [GKAS99]. The code in Listing 6.6 finesses this challenge by placing the locks in the statically allocated hash buckets, which are never freed. However, this trick would not work if the hash table were resizeable, so that the locks were now dynamically allocated. In this case, there would need to be some means to prevent the hash bucket from being freed during the time that its lock was being acquired.

Quick Quiz 6.17: What are some ways of preventing a structure from being freed while its lock is being acquired?

6.3.4 Data Ownership

Data ownership partitions a given data structure over the threads or CPUs, so that each thread/CPU accesses its subset of the data structure without any synchronization overhead whatsoever. However, if one thread wishes to access some other thread's data, the first thread is unable to do so directly. Instead, the first thread must communicate with the second thread, so that the second thread performs the operation on behalf of the first, or, alternatively, migrates the data to the first thread.

Data ownership might seem arcane, but it is used very frequently (a short sketch illustrating the first item follows this list):

1. Any variables accessible by only one CPU or thread (such as auto variables in C and C++) are owned by that CPU or process.

2. An instance of a user interface owns the corresponding user's context. It is very common for applications interacting with parallel database engines to be written as if they were entirely sequential programs. Such applications own the user interface and his current action. Explicit parallelism is thus confined to the database engine itself.

3. Parametric simulations are often trivially parallelized by granting each thread ownership of a particular region of the parameter space. There are also computing frameworks designed for this type of problem [Uni08a].

If there is significant sharing, communication between the threads or CPUs can result in significant complexity and overhead. Furthermore, if the most-heavily used data happens to be that owned by a single CPU, that CPU will be a “hot spot”, sometimes with results resembling that shown in Figure 6.15. However, in situations where no sharing is required, data ownership achieves ideal performance, and with code that can be as simple as the sequential-program case shown in Listing 6.4. Such situations are often referred to as “embarrassingly parallel”, and, in the best case, resemble the situation previously shown in Figure 6.14.

Another important instance of data ownership occurs when the data is read-only, in which case, all threads can “own” it via replication.

Where data locking partitions both the address space (with one hash bucket per partition) and time (using per-bucket locks), data ownership partitions only the address space. The reason that data ownership need not partition time is because a given thread or CPU is assigned permanent ownership of a given address-space partition.

Quick Quiz 6.18: But won't system boot and shutdown (or application startup and shutdown) be partitioning time, even for data ownership?

Data ownership will be presented in more detail in Chapter 8.
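As a minimal illustration of the first item in the list above (variables accessible by only one thread), a per-thread counter declared with C11's _Thread_local gives each thread its own instance, so updates need no synchronization whatsoever; the names here are purely illustrative:

    /* Each thread increments only its own copy of my_events, so this
     * update requires neither locks nor atomic operations. Reading a
     * total across threads would require communication, as noted above. */
    static _Thread_local unsigned long my_events;

    static inline void record_event(void)
    {
            my_events++;
    }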
6.3.5 Locking Granularity and Performance

This section looks at locking granularity and performance from a mathematical synchronization-efficiency viewpoint. Readers who are uninspired by mathematics might choose to skip this section.

The approach is to use a crude queueing model for the efficiency of synchronization mechanisms that operate on a single shared global variable, based on an M/M/1 queue. M/M/1 queuing models are based on an exponentially distributed “inter-arrival rate” 𝜆 and an exponentially distributed “service rate” 𝜇. The inter-arrival rate 𝜆 can be thought of as the average number of synchronization operations per second that the system would process if the synchronization were free, in other words, 𝜆 is an inverse measure of the overhead of each non-synchronization unit of work. For example, if each unit of work was a transaction, and if each transaction took one millisecond to process, excluding synchronization overhead, then 𝜆 would be 1,000 transactions per second.
... average number of synchronization operations per second that the system would process if the overhead of each transaction was zero, and ignoring the fact that CPUs must wait on each other to complete their synchronization operations, in other words, 𝜇 can be roughly thought of as the synchronization overhead in absence of contention. For example, suppose that each transaction's synchronization operation involves an atomic increment instruction, and that a computer system is able to do a private-variable atomic increment every 5 nanoseconds on each CPU (see Figure 5.1).13 The value of 𝜇 is therefore about 200,000,000 atomic increments per second.

13 Of course, if there are 8 CPUs all incrementing the same shared variable, then each CPU must wait at least 35 nanoseconds for each of the other CPUs to do its increment before consuming an additional 5 nanoseconds doing its own increment. In fact, the wait will be longer due to the need to move the variable from one CPU to another.

Of course, the value of 𝜆 increases as increasing numbers of CPUs increment a shared variable because each CPU is capable of processing transactions independently (again, ignoring synchronization):

    𝜆 = 𝑛𝜆0    (6.1)

Here, 𝑛 is the number of CPUs and 𝜆0 is the transaction-processing capability of a single CPU. Note that the expected time for a single CPU to execute a single transaction in the absence of contention is 1/𝜆0.

Because the CPUs have to “wait in line” behind each other to get their chance to increment the single shared variable, we can use the M/M/1 queueing-model expression for the expected total waiting time:

    𝑇 = 1 / (𝜇 − 𝜆)    (6.2)

Substituting the above value of 𝜆:

    𝑇 = 1 / (𝜇 − 𝑛𝜆0)    (6.3)

Now, the efficiency is just the ratio of the time required to process a transaction in absence of synchronization (1/𝜆0) to the time required including synchronization (𝑇 + 1/𝜆0):

    𝑒 = (1/𝜆0) / (𝑇 + 1/𝜆0)    (6.4)

Substituting the above value for 𝑇 and simplifying:

    𝑒 = (𝜇/𝜆0 − 𝑛) / (𝜇/𝜆0 − (𝑛 − 1))    (6.5)

But the value of 𝜇/𝜆0 is just the ratio of the time required to process the transaction (absent synchronization overhead) to that of the synchronization overhead itself (absent contention). If we call this ratio 𝑓, we have:

    𝑒 = (𝑓 − 𝑛) / (𝑓 − (𝑛 − 1))    (6.6)

Figure 6.16: Synchronization Efficiency (efficiency versus number of CPUs/threads, for overhead ratios 𝑓 = 10, 25, 50, 75, and 100)

Figure 6.16 plots the synchronization efficiency 𝑒 as a function of the number of CPUs/threads 𝑛 for a few values of the overhead ratio 𝑓. For example, again using the 5-nanosecond atomic increment, the 𝑓 = 10 line corresponds to each CPU attempting an atomic increment every 50 nanoseconds, and the 𝑓 = 100 line corresponds to each CPU attempting an atomic increment every 500 nanoseconds, which in turn corresponds to some hundreds (perhaps thousands) of instructions. Given that each trace drops off sharply with increasing numbers of CPUs or threads, we can conclude that synchronization mechanisms based on atomic manipulation of a single global shared variable will not scale well if used heavily on current commodity hardware. This is an abstract mathematical depiction of the forces leading to the parallel counting algorithms that were discussed in Chapter 5. Your real-world mileage may differ.
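To make Equation 6.6 concrete, the following short program (illustrative only, not part of the book's CodeSamples) tabulates the synchronization efficiency for a given overhead ratio 𝑓:

    #include <stdio.h>

    /* Synchronization efficiency from Equation 6.6: e = (f - n) / (f - (n - 1)). */
    static double sync_efficiency(double f, int n)
    {
            return (f - n) / (f - (n - 1));
    }

    int main(void)
    {
            int n;

            /* f = 100 corresponds to roughly 500 ns of work per 5 ns atomic increment. */
            for (n = 1; n <= 100; n++)
                    printf("%3d CPUs: efficiency %.3f\n", n, sync_efficiency(100.0, n));
            return 0;
    }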
Nevertheless, the concept of efficiency is useful, even in cases having little or no formal synchronization. Consider for example a matrix multiply, in which the columns of one matrix are multiplied (via “dot product”) by the rows of another, resulting in an entry in a third matrix. Because none of these operations conflict, it is possible to partition the columns of the first matrix among a group of threads, with each thread computing ...
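A minimal sketch of this partitioning appears below; the matrix size, the thread count, and the use of plain POSIX threads are illustrative assumptions rather than the book's actual matrix-multiply code. Because each thread owns a disjoint block of result columns, no two threads ever write the same element, so no synchronization is needed inside the compute loop:

    #include <pthread.h>

    #define N       256          /* illustrative matrix dimension */
    #define NTHREAD 4            /* illustrative thread count */

    static double a[N][N], b[N][N], c[N][N];

    struct colrange {
            int firstcol;
            int lastcol;         /* exclusive */
    };

    /* Compute one thread's block of result columns of c = a * b. */
    static void *matmul_part(void *arg)
    {
            struct colrange *rp = arg;
            int i, j, k;

            for (j = rp->firstcol; j < rp->lastcol; j++)
                    for (i = 0; i < N; i++) {
                            double dot = 0.0;

                            for (k = 0; k < N; k++)
                                    dot += a[i][k] * b[k][j];
                            c[i][j] = dot;
                    }
            return NULL;
    }

    void matmul(void)
    {
            pthread_t tid[NTHREAD];
            struct colrange range[NTHREAD];
            int t;

            for (t = 0; t < NTHREAD; t++) {
                    range[t].firstcol = t * N / NTHREAD;
                    range[t].lastcol = (t + 1) * N / NTHREAD;
                    pthread_create(&tid[t], NULL, matmul_part, &range[t]);
            }
            for (t = 0; t < NTHREAD; t++)
                    pthread_join(tid[t], NULL);
    }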
1. Reader/Writer Locking (described below in Section 6.4.1).

2. Read-copy update (RCU), which may be used as a high-performance replacement for reader/writer locking, is introduced in Section 9.5. Other alternatives include hazard pointers (Section 9.3) and sequence locking (Section 9.4). These alternatives will not be discussed further in this chapter.

3. Hierarchical Locking ([McK96a]), which is touched upon in Section 6.4.2.

4. Resource Allocator Caches ([McK96a, MS93]). See Section 6.4.3 for more detail.

6.4.1 Reader/Writer Locking

If synchronization overhead is negligible (for example, if the program uses coarse-grained parallelism with large critical sections), and if only a small fraction of the critical sections modify data, then allowing multiple readers to proceed in parallel can greatly increase scalability. Writers exclude both readers and each other. There are many implementations of reader-writer locking, including the POSIX implementation described in Section 4.2.4. Listing 6.7 shows how the hash search might be implemented using reader-writer locking.

Listing 6.7: Reader-Writer-Locking Hash Table Search
1 rwlock_t hash_lock;
2
3 struct hash_table
4 {
5   long nbuckets;
6   struct node **buckets;
7 };
8
9 typedef struct node {
10   unsigned long key;
11   struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16   struct node *cur;
17   int retval;
18
19   read_lock(&hash_lock);
20   cur = h->buckets[key % h->nbuckets];
21   while (cur != NULL) {
22     if (cur->key >= key) {
23       retval = (cur->key == key);
24       read_unlock(&hash_lock);
25       return retval;
26     }
27     cur = cur->next;
28   }
29   read_unlock(&hash_lock);
30   return 0;
31 }

Reader/writer locking is a simple instance of asymmetric locking. Snaman [ST87] describes a more ornate six-mode asymmetric locking design used in several clustered systems. Locking in general and reader-writer locking in particular is described extensively in Chapter 7.

6.4.2 Hierarchical Locking

The idea behind hierarchical locking is to have a coarse-grained lock that is held only long enough to work out which fine-grained lock to acquire. Listing 6.8 shows how our hash-table search might be adapted to do hierarchical locking, but also shows the great weakness of this approach: We have paid the overhead of acquiring a second lock, but we only hold it for a short time. In this case, the data-locking approach would be simpler and likely perform better.

Quick Quiz 6.22: In what situation would hierarchical locking work well?

6.4.3 Resource Allocator Caches

This section presents a simplified schematic of a parallel fixed-block-size memory allocator. More detailed descriptions may be found in the literature [MG92, MS93, BA01, MSK01, Eva11, Ken20] or in the Linux kernel [Tor03].

6.4.3.1 Parallel Resource Allocation Problem

The basic problem facing a parallel memory allocator is the tension between the need to provide extremely fast memory allocation and freeing in the common case and the need to efficiently distribute memory in the face of unfavorable allocation and freeing patterns.

To see this tension, consider a straightforward application of data ownership to this problem—simply carve up memory so that each CPU owns its share. For example, suppose that a system with 12 CPUs has 64 gigabytes of memory, for example, the laptop I am using right now. We could simply assign each CPU a five-gigabyte region of memory, and allow each CPU to allocate from its own region, without the need for locking and its complexities and overheads. Unfortunately, this scheme fails when CPU 0 only allocates memory and CPU 1 only frees it, as happens in simple producer-consumer workloads.
Listing 6.8: Hierarchical-Locking Hash Table Search
1 struct hash_table       /* lines 1-4 reconstructed from the code below */
2 {
3   long nbuckets;
4   struct bucket **buckets;
5 };
6
7 struct bucket {
8   spinlock_t bucket_lock;
9   node_t *list_head;
10 };
11
12 typedef struct node {
13   spinlock_t node_lock;
14   unsigned long key;
15   struct node *next;
16 } node_t;
17
18 int hash_search(struct hash_table *h, long key)
19 {
20   struct bucket *bp;
21   struct node *cur;
22   int retval;
23
24   bp = h->buckets[key % h->nbuckets];
25   spin_lock(&bp->bucket_lock);
26   cur = bp->list_head;
27   while (cur != NULL) {
28     if (cur->key >= key) {
29       spin_lock(&cur->node_lock);
30       spin_unlock(&bp->bucket_lock);
31       retval = (cur->key == key);
32       spin_unlock(&cur->node_lock);
33       return retval;
34     }
35     cur = cur->next;
36   }
37   spin_unlock(&bp->bucket_lock);
38   return 0;
39 }

The other extreme, code locking, suffers from excessive lock contention and overhead [MS93].

6.4.3.2 Parallel Fastpath for Resource Allocation

The commonly used solution uses parallel fastpath with each CPU owning a modest cache of blocks, and with a large code-locked shared pool for additional blocks. To prevent any given CPU from monopolizing the memory blocks, we place a limit on the number of blocks that can be in each CPU's cache. In a two-CPU system, the flow of memory blocks will be as shown in Figure 6.19: When a given CPU is trying to free a block when its pool is full, it sends blocks to the global pool, and, similarly, when that CPU is trying to allocate a block when its pool is empty, it retrieves blocks from the global pool.

Figure 6.19: Allocator Cache Schematic (a code-locked "Global Pool" exchanging blocks with per-CPU pools, "CPU 0 Pool" and "CPU 1 Pool", on overflow and empty; allocate/free operate on the per-CPU pools)

Listing 6.9: Allocator-Cache Data Structures
1 #define TARGET_POOL_SIZE 3
2 #define GLOBAL_POOL_SIZE 40
3
4 struct globalmempool {
5   spinlock_t mutex;
6   int cur;
7   struct memblock *pool[GLOBAL_POOL_SIZE];
8 } globalmem;
9
10 struct perthreadmempool {
11   int cur;
12   struct memblock *pool[2 * TARGET_POOL_SIZE];
13 };
14
15 DEFINE_PER_THREAD(struct perthreadmempool, perthreadmem);

The “Global Pool” of Figure 6.19 is implemented by globalmem of type struct globalmempool, and the two CPU pools by the per-thread variable perthreadmem of type struct perthreadmempool. Both of these data structures have arrays of pointers to blocks in their pool fields, which are filled from index zero upwards. Thus, if globalmem.pool[3] is NULL, then the remainder of the array from index 4 up must also be NULL. The cur fields contain the index of the highest-numbered full element of the pool array, or −1 if all elements are empty. All elements from globalmem.pool[0] through globalmem.pool[globalmem.cur] must be full, and all the rest must be empty.15
Table 6.1: Schematic of Real-World Parallel Allocator

Level              Locking          Purpose
Per-thread pool    Data ownership   High-speed allocation
Global block pool  Data locking     Distributing blocks among threads
Coalescing         Data locking     Combining blocks into pages
System memory      Code locking     Memory from/to system

Third, coalesced memory must be returned to the underlying memory system, and pages of memory must also be allocated from the underlying memory system. The locking required at this level will depend on that of the underlying memory system, but could well be code locking. Code locking can often be tolerated at this level, because this level is so infrequently reached in well-designed systems [MSK01].

Concurrent userspace allocators face similar challenges [Ken20, Eva11].

Despite this real-world design's greater complexity, the underlying idea is the same—repeated application of parallel fastpath, as shown in Table 6.1.

6.5 Beyond Partitioning

It is all right to aim high if you have plenty of ammunition.

Hawley R. Everhart

This chapter has discussed how data partitioning can be used to design simple linearly scalable parallel programs. Section 6.3.4 hinted at the possibilities of data replication, which will be used to great effect in Section 9.5.

The main goal of applying partitioning and replication is to achieve linear speedups, in other words, to ensure that the total amount of work required does not increase significantly as the number of CPUs or threads increases. A problem that can be solved via partitioning and/or replication, resulting in linear speedups, is embarrassingly parallel. But can we do better?

To answer this question, let us examine the solution of labyrinths and mazes. Of course, labyrinths and mazes have been objects of fascination for millennia [Wik12], so it should come as no surprise that they are generated and solved using computers, including biological computers [Ada11], GPGPUs [Eri08], and even discrete hardware [KFC11]. Parallel solution of mazes is sometimes used as a class project in universities [ETH11, Uni10] and as a vehicle to demonstrate the benefits of parallel-programming frameworks [Fos10].

Common advice is to use a parallel work-queue algorithm (PWQ) [ETH11, Fos10]. This section evaluates this advice by comparing PWQ against a sequential algorithm (SEQ) and also against an alternative parallel algorithm, in all cases solving randomly generated square mazes. Section 6.5.1 discusses PWQ, Section 6.5.2 discusses an alternative parallel algorithm, Section 6.5.4 analyzes its anomalous performance, Section 6.5.5 derives an improved sequential algorithm from the alternative parallel algorithm, Section 6.5.6 makes further performance comparisons, and finally Section 6.5.7 presents future directions and concluding remarks.

6.5.1 Work-Queue Parallel Maze Solver

PWQ is based on SEQ, which is shown in Listing 6.12 (pseudocode for maze_seq.c). The maze is represented by a 2D array of cells and a linear-array-based work queue named ->visited.

Listing 6.12: SEQ Pseudocode
1 int maze_solve(maze *mp, cell sc, cell ec)
2 {
3   cell c = sc;
4   cell n;
5   int vi = 0;
6
7   maze_try_visit_cell(mp, c, c, &n, 1);
8   for (;;) {
9     while (!maze_find_any_next_cell(mp, c, &n)) {
10      if (++vi >= mp->vi)
11        return 0;
12      c = mp->visited[vi].c;
13    }
14    do {
15      if (n == ec) {
16        return 1;
17      }
18      c = n;
19    } while (maze_find_any_next_cell(mp, c, &n));
20    c = mp->visited[vi].c;
21  }
22 }
Line 7 visits the initial cell, and each iteration of the loop spanning lines 8–21 traverses passages headed by one cell. The loop spanning lines 9–13 scans the ->visited[] array for a visited cell with an unvisited neighbor, and the loop spanning lines 14–19 traverses one fork of the submaze headed by that neighbor. Line 20 initializes for the next pass through the outer loop.

The pseudocode for maze_try_visit_cell() is shown on lines 1–12 of Listing 6.13 (maze.c). Line 4 checks to see if cells c and t are adjacent and connected, while line 5 checks to see if cell t has not yet been visited. The celladdr() function returns the address of the specified cell. If either check fails, line 6 returns failure. Line 7 indicates the next cell, line 8 records this cell in the next slot of the ->visited[] array, line 9 indicates that this slot is now full, and line 10 marks this cell as visited and also records the distance from the maze start. Line 11 then returns success.
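Lines 1–12 of Listing 6.13 do not appear in this extract. The following sketch is a reconstruction from the description above, not the original pseudocode; the helper names match the surrounding text, while the VISITED flag and the encoding of the cell value are assumptions:

    int maze_try_visit_cell(maze *mp, cell c, cell t, cell *n, int d)
    {
            if (!maze_cells_connected(mp, c, t) ||  /* line 4: adjacent and connected? */
                (*celladdr(mp, t) & VISITED))       /* line 5: not yet visited? */
                    return 0;                       /* line 6: either check failed */
            *n = t;                                 /* line 7: indicate the next cell */
            mp->visited[mp->vi].c = t;              /* line 8: record it in the work array */
            mp->vi++;                               /* line 9: that slot is now full */
            *celladdr(mp, t) |= VISITED | d;        /* line 10: mark visited, record distance */
            return 1;                               /* line 11: success */
    }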
The pseudocode for maze_find_any_next_cell() is shown on lines 14–28 of Listing 6.13 (maze.c). Line 17 picks up the current cell's distance plus 1, while lines 19, 21, 23, and 25 check the cell in each direction, and lines 20, 22, 24, and 26 return true if the corresponding cell is a candidate next cell. The prevcol(), nextcol(), prevrow(), and nextrow() each do the specified array-index-conversion operation. If none of the cells is a candidate, line 27 returns false.

Listing 6.13 (fragment; lines 1–21 not shown)
22     return 1;
23   if (maze_try_visit_cell(mp, c, prevrow(c), n, d))
24     return 1;
25   if (maze_try_visit_cell(mp, c, nextrow(c), n, d))
26     return 1;
27   return 0;
28 }

The path is recorded in the maze by counting the number of cells from the starting point, as shown in Figure 6.22, where the starting cell is in the upper left and the ending cell is in the lower right. Starting at the ending cell and following consecutively decreasing cell numbers traverses the solution.

Figure 6.23: CDF of Solution Times For SEQ and PWQ (probability versus solution time in ms)

The parallel work-queue solver is a straightforward parallelization of the algorithm shown in Listings 6.12 and 6.13. Line 10 of Listing 6.12 must use fetch-and-add, and the local variable vi must be shared among the various threads. Lines 5 and 10 of Listing 6.13 must be combined into a CAS loop, with CAS failure indicating a loop in the maze. Lines 8–9 of this listing must use fetch-and-add to arbitrate concurrent attempts to record cells in the ->visited[] array.

This approach does provide significant speedups on a dual-CPU Lenovo W500 running at 2.53 GHz, as shown in Figure 6.23, which shows the cumulative distribution functions (CDFs) for the solution times of the two algorithms, based on the solution of 500 different square 500-by-500 randomly generated mazes. The substantial overlap of the projection of the CDFs onto the x-axis will be addressed in Section 6.5.4.

Interestingly enough, the sequential solution-path tracking works unchanged for the parallel algorithm. However, this uncovers a significant weakness in the parallel algorithm: At most one thread may be making progress along the solution path at any given time. This weakness is addressed in the next section.
Listing 6.14: Partitioned Parallel Solver Pseudocode
1 int maze_solve_child(maze *mp, cell *visited, cell sc)
2 {
3   cell c;
4   cell n;
5   int vi = 0;
6
7   myvisited = visited; myvi = &vi;
8   c = visited[vi];
9   do {
10    while (!maze_find_any_next_cell(mp, c, &n)) {
11      if (visited[++vi].row < 0)
12        return 0;
13      if (READ_ONCE(mp->done))
14        return 1;
15      c = visited[vi];
16    }
17    do {
18      if (READ_ONCE(mp->done))
19        return 1;
20      c = n;
21    } while (maze_find_any_next_cell(mp, c, &n));
22    c = visited[vi];
23  } while (!READ_ONCE(mp->done));
24  return 1;
25 }

Listing 6.15: Partitioned Parallel Helper Pseudocode
1 int maze_try_visit_cell(struct maze *mp, int c, int t,
2                         int *n, int d)
3 {
4   cell_t t;
5   cell_t *tp;
6   int vi;
7
8   if (!maze_cells_connected(mp, c, t))
9     return 0;
10  tp = celladdr(mp, t);
11  do {
12    t = READ_ONCE(*tp);
13    if (t & VISITED) {
14      if ((t & TID) != mytid)
15        mp->done = 1;
16      return 0;
17    }
18  } while (!CAS(tp, t, t | VISITED | myid | d));
19  *n = t;
20  vi = (*myvi)++;
21  myvisited[vi] = t;
22  return 1;
23 }
Figure 6.24: CDF of Solution Times For SEQ, PWQ, and PART (probability versus solution time in ms)

Figure 6.25: CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios (probability versus speedup relative to SEQ)

... in CodeSamples/SMPdesign/maze/*.c. Example checks include:

...

3. Newly created mazes with unreachable cells.

Figure 6.26: Reason for Small Visit Percentages
Figure 6.27: Correlation Between Visit Percentage and Solution Time (solution time in ms versus percent of maze cells visited, for SEQ, PWQ, and PART)

Figure 6.28: PWQ Potential Contention Points

... traversing the solution from the upper left reaches the circle, the other thread cannot reach the upper-right portion of the maze. Similarly, if the other thread reaches the square, the first thread cannot reach the lower-left portion of the maze. Therefore, PART will likely visit a small fraction of the non-solution-path cells. In short, the superlinear speedups are due to threads getting in each others' way. This is a sharp contrast with decades of experience with parallel programming, where workers have struggled to keep threads out of each others' way.

Figure 6.27 confirms a strong correlation between cells visited and solution time for all three methods. The slope of PART's scatterplot is smaller than that of SEQ, indicating that PART's pair of threads visits a given fraction of the maze faster than can SEQ's single thread. PART's scatterplot is also weighted toward small visit percentages, confirming that PART does less total work, hence the observed humiliating parallelism.

The fraction of cells visited by PWQ is similar to that of SEQ. In addition, PWQ's solution time is greater than that of PART, even for equal visit fractions. The reason for this is shown in Figure 6.28, which has a red circle on each cell with more than two neighbors. Each such cell can result in contention in PWQ, because one thread can enter but two threads can exit, which hurts performance, as noted earlier in this chapter. In contrast, PART can incur such contention but once, namely when the solution is located. Of course, SEQ never contends.

Figure 6.29: Effect of Compiler Optimization (-O3) (CDF of speedup relative to SEQ, for PWQ, PART, and SEQ -O3)

Although PART's speedup is impressive, we should not neglect sequential optimizations. Figure 6.29 shows that SEQ, when compiled with -O3, is about twice as fast as unoptimized PWQ, approaching the performance of unoptimized PART. Compiling all three algorithms with -O3 gives results similar to (albeit faster than) those shown in Figure 6.25, except that PWQ provides almost no speedup compared to SEQ, in keeping with Amdahl's Law [Amd67]. However, if the goal is to double performance compared to unoptimized SEQ, as opposed to achieving optimality, compiler optimizations are quite attractive.

Cache alignment and padding often improves performance by reducing false sharing. However, for these maze-solution algorithms, aligning and padding the maze-cell array degrades performance by up to 42 % for 1000x1000 mazes. Cache locality is more important than avoiding false sharing, especially for large mazes. For smaller 20-by-20 or 50-by-50 mazes, aligning and padding can produce up to a 40 % performance improvement for PART, but for these small sizes, SEQ performs better anyway because there is insufficient time for PART to make up for the overhead of thread creation and destruction.

In short, the partitioned parallel maze solver is an interesting example of an algorithmic superlinear speedup. If “algorithmic superlinear speedup” causes cognitive dissonance, please proceed to the next section.
Figure 6.30: Partitioned Coroutines (CDF of speedup relative to SEQ (-O3))

Figure 6.31: Varying Maze Size vs. SEQ (PWQ and PART speedup versus maze size)

Figure 6.32: Varying Maze Size vs. COPART (PWQ and PART speedup versus maze size)

Figure 6.33: Mean Speedup vs. Number of Threads, 1000x1000 Maze (PWQ and PART relative to COPART)
6.5.5 Alternative Sequential Maze Solver
The presence of algorithmic superlinear speedups suggests simulating parallelism via co-routines, for example, manually switching context between threads on each pass through the main do-while loop in Listing 6.14. This context switching is straightforward because the context consists only of the variables c and vi: Of the numerous ways to achieve the effect, this is a good tradeoff between context-switch overhead and visit percentage. As can be seen in Figure 6.30, this coroutine algorithm (COPART) is quite effective, with the performance on one thread being within about 30 % of PART on two threads (maze_2seq.c).

6.5.6 Performance Comparison II

Figures 6.31 and 6.32 show the effects of varying maze size, comparing both PWQ and PART running on two threads against either SEQ or COPART, respectively, with 90-percent-confidence error bars. PART shows superlinear scalability against SEQ and modest scalability against COPART for 100-by-100 and larger mazes. PART exceeds theoretical energy-efficiency breakeven against COPART at roughly the 200-by-200 maze size, given that power consumption rises as roughly the square of the frequency for high frequencies [Mud01], so that 1.4x scaling on two threads consumes the same energy as a single thread at equal solution speeds. In contrast, PWQ shows poor scalability against both SEQ and COPART unless unoptimized: Figures 6.31 and 6.32 were generated using -O3.

Figure 6.33 shows the performance of PWQ and PART relative to COPART. For PART runs with more than two threads, the additional threads were started evenly spaced along the diagonal connecting the starting and ending cells. Simplified link-state routing [BG87] was used to detect early termination on PART runs with more than two threads (the solution is flagged when a thread is
connected to both beginning and end). PWQ performs quite poorly, but PART hits breakeven at two threads and again at five threads, achieving modest speedups beyond five threads. Theoretical energy efficiency breakeven is within the 90-percent-confidence interval for seven and eight threads. The reasons for the peak at two threads are (1) the lower complexity of termination detection in the two-thread case and (2) the fact that there is a lower probability of the third and subsequent threads making useful forward progress: Only the first two threads are guaranteed to start on the solution line. This disappointing performance compared to results in Figure 6.32 is due to the less-tightly integrated hardware available in the larger and older Xeon system running at 2.66 GHz.

6.5.7 Future Directions and Conclusions

Much future work remains. First, this section applied only one technique used by human maze solvers. Others include following walls to exclude portions of the maze and choosing internal starting points based on the locations of previously traversed paths. Second, different choices of starting and ending points might favor different algorithms. Third, although placement of the PART algorithm's first two threads is straightforward, there are any number of placement schemes for the remaining threads. Optimal placement might well depend on the starting and ending points. Fourth, study of unsolvable mazes and cyclic mazes is likely to produce interesting results. Fifth, the lightweight C++11 atomic operations might improve performance. Sixth, it would be interesting to compare the speedups for three-dimensional mazes (or of even higher-order mazes). Finally, for mazes, humiliating parallelism indicated a more-efficient sequential implementation using coroutines. Do humiliatingly parallel algorithms always lead to more-efficient sequential implementations, or are there inherently humiliatingly parallel algorithms for which coroutine context-switch overhead overwhelms the speedups?

This section demonstrated and analyzed parallelization of maze-solution algorithms. A conventional work-queue-based algorithm did well only when compiler optimizations were disabled, suggesting that some prior results obtained using high-level/overhead languages will be invalidated by advances in optimization.

This section gave a clear example where approaching parallelism as a first-class optimization technique rather than as a derivative of a sequential algorithm paves the way for an improved sequential algorithm. High-level design-time application of parallelism is likely to be a fruitful field of study. This section took the problem of solving mazes from mildly scalable to humiliatingly parallel and back again. It is hoped that this experience will motivate work on parallelism as a first-class design-time whole-application optimization technique, rather than as a grossly suboptimal after-the-fact micro-optimization to be retrofitted into existing programs.

6.6 Partitioning, Parallelism, and Optimization

Knowledge is of no value unless you put it into practice.
Anton Chekhov

Most important, although this chapter has demonstrated that applying parallelism at the design level gives excellent results, this final section shows that this is not enough. For search problems such as maze solution, this section has shown that search strategy is even more important than parallel design. Yes, for this particular type of maze, intelligently applying parallelism identified a superior search strategy, but this sort of luck is no substitute for a clear focus on search strategy itself.

As noted back in Section 2.2, parallelism is but one potential optimization of many. A successful design needs to focus on the most important optimization. Much though I might wish to claim otherwise, that optimization might or might not be parallelism.

However, for the many cases where parallelism is the right optimization, the next section covers that synchronization workhorse, locking.
Locking

Locking is the worst general-purpose synchronization mechanism except for all those other mechanisms that have been tried from time to time.
In recent concurrency research, locking often plays the role of villain. Locking stands accused of inciting deadlocks, convoying, starvation, unfairness, data races, and all manner of other concurrency sins. Interestingly enough, the role of workhorse in production-quality shared-memory parallel software is also played by locking. This chapter will look into this dichotomy between villain and hero, as fancifully depicted in Figures 7.1 and 7.2.

There are a number of reasons behind this Jekyll-and-Hyde dichotomy:

1. Many of locking's sins have pragmatic design solutions that work well in most cases, for example:

   (a) Use of lock hierarchies to avoid deadlock.

   (b) Deadlock-detection tools, for example, the Linux kernel's lockdep facility [Cor06a].

   (c) Locking-friendly data structures, such as arrays, hash tables, and radix trees, which will be covered in Chapter 10.

2. Some of locking's sins are problems only at high levels of contention, levels reached only by poorly designed programs.

3. Some of locking's sins are avoided by using other synchronization mechanisms in concert with locking. These other mechanisms include statistical counters (see Chapter 5), reference counters (see Section 9.2), hazard pointers (see Section 9.3), sequence-locking readers (see Section 9.4), RCU (see Section 9.5), and simple non-blocking data structures (see Section 14.2).

4. Until quite recently, almost all large shared-memory parallel programs were developed in secret, so that it was not easy to learn of these pragmatic solutions.

5. Locking works extremely well for some software artifacts and extremely poorly for others. Developers who have worked on artifacts for which locking works well can be expected to have a much more positive opinion of locking than those who have worked on artifacts for which locking works poorly, as will be discussed in Section 7.5.

6. All good stories need a villain, and locking has a long and honorable history serving as a research-paper whipping boy.

Quick Quiz 7.1: Just how can serving as a whipping boy be considered to be in any way honorable???

This chapter will give an overview of a number of ways to avoid locking's more serious sins.

7.1 Staying Alive

I work to stay alive.
Bette Davis

Given that locking stands accused of deadlock and starvation, one important concern for shared-memory parallel developers is simply staying alive. The following sections therefore cover deadlock, livelock, starvation, unfairness, and inefficiency.

7.1.1 Deadlock

Deadlock occurs when each member of a group of threads is holding at least one lock while at the same time waiting on a lock held by a member of that same group. This happens even in groups containing a single thread when
Figure 7.5: Without qsort() Local Locking Hierarchy
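To make the qsort() deadlock scenario sketched in Figure 7.5 concrete, here is a minimal hypothetical example (it is not taken from the book's CodeSamples): one thread holds an application lock while calling qsort(), whose comparison callback acquires a second application lock, while another thread acquires those same locks in the opposite order. The lock and function names are illustrative only:

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int table[1024];

static int cmp(const void *lp, const void *rp)
{
	int l = *(const int *)lp;
	int r = *(const int *)rp;

	pthread_mutex_lock(&lock_b);	/* callback reaches back into application locks */
	/* ... consult application state protected by lock_b ... */
	pthread_mutex_unlock(&lock_b);
	return l - r;
}

static void *thread_1(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_a);	/* held across the library call */
	qsort(table, 1024, sizeof(table[0]), cmp);	/* cmp() wants lock_b */
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

static void *thread_2(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);	/* opposite order: lock_b, then lock_a */
	pthread_mutex_lock(&lock_a);
	/* ... */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

If thread_1() holds lock_a inside qsort() while thread_2() already holds lock_b, each thread waits forever for the lock the other holds. Because the acquisition of lock_b is buried inside a callback invoked by library code, no local inspection of thread_1() reveals the cycle, which is why the surrounding discussion calls for design-level rather than mechanical fixes.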
have three layers to the global deadlock hierarchy, the first containing Locks A and B, the second containing Lock C, and the third containing Lock D.

Please note that it is not typically possible to mechanically change cmp() to use the new Lock D. Quite the opposite: It is often necessary to make profound design-level modifications. Nevertheless, the effort required for such modifications is normally a small price to pay in order to avoid deadlock. More to the point, this potential deadlock should preferably be detected at design time, before any code has been generated!

For another example where releasing all locks before invoking unknown code is impractical, imagine an iterator over a linked list, as shown in Listing 7.1 (locked_list.c). The list_start() function acquires a lock on the list and returns the first element (if there is one), and list_next() either returns a pointer to the next element in the list or releases the lock and returns NULL if the end of the list has been reached.

Listing 7.1: Concurrent List Iterator
 1 struct locked_list {
 2     spinlock_t s;
 3     struct cds_list_head h;
 4 };
 5
 6 struct cds_list_head *list_start(struct locked_list *lp)
 7 {
 8     spin_lock(&lp->s);
 9     return list_next(lp, &lp->h);
10 }
11
12 struct cds_list_head *list_next(struct locked_list *lp,
13                                 struct cds_list_head *np)
14 {
15     struct cds_list_head *ret;
16
17     ret = np->next;
18     if (ret == &lp->h) {
19         spin_unlock(&lp->s);
20         ret = NULL;
21     }
22     return ret;
23 }

Listing 7.2 shows how this list iterator may be used. Lines 1–4 define the list_ints element containing a single integer, and lines 6–17 show how to iterate over the list. Line 11 locks the list and fetches a pointer to the first element, line 13 provides a pointer to our enclosing list_ints structure, line 14 prints the corresponding integer, and line 15 moves to the next element. This is quite simple, and hides all of the locking.

Listing 7.2: Concurrent List Iterator Usage
 1 struct list_ints {
 2     struct cds_list_head n;
 3     int a;
 4 };
 5
 6 void list_print(struct locked_list *lp)
 7 {
 8     struct cds_list_head *np;
 9     struct list_ints *ip;
10
11     np = list_start(lp);
12     while (np != NULL) {
13         ip = cds_list_entry(np, struct list_ints, n);
14         printf("\t%d\n", ip->a);
15         np = list_next(lp, np);
16     }
17 }

That is, the locking remains hidden as long as the code processing each list element does not itself acquire a lock that is held across some other call to list_start() or list_next(), which results in deadlock. We can avoid the deadlock by layering the locking hierarchy to take the list-iterator locking into account.

This layered approach can be extended to an arbitrarily large number of layers, but each added layer increases the complexity of the locking design. Such increases in complexity are particularly inconvenient for some types of object-oriented designs, in which control passes back and forth among a large group of objects in an undisciplined manner.1 This mismatch between the habits of object-oriented design and the need to avoid deadlock is an important reason why parallel programming is perceived by some to be so difficult.

1 One name for this is "object-oriented spaghetti code."

Some alternatives to highly layered locking hierarchies are covered in Chapter 9.

7.1.1.4 Temporal Locking Hierarchies

One way to avoid deadlock is to defer acquisition of one of the conflicting locks. This approach is used in Linux-kernel RCU, whose call_rcu() function is invoked by the Linux-kernel scheduler while holding its locks. This means that call_rcu() cannot always safely invoke the scheduler to do a wakeup, for example, in order to wake up an RCU kthread in order to start the new grace period that is required by the callback queued by call_rcu().

Quick Quiz 7.5: What do you mean "cannot always safely invoke the scheduler"? Either call_rcu() can or cannot safely invoke the scheduler, right?

However, grace periods last for many milliseconds, so waiting another millisecond before starting a new grace period is not normally a problem. Therefore, if call_rcu() detects a possible deadlock with the scheduler, it arranges to start the new grace period later, either within a timer
handler or within the scheduler-clock interrupt handler, depending on configuration. Because no scheduler locks are held across either handler, deadlock is successfully avoided.

The overall approach is thus to adhere to a locking hierarchy by deferring lock acquisition to an environment

Listing 7.3: Protocol Layering and Deadlock
1 spin_lock(&lock2);
2 layer_2_processing(pkt);
3 nextlayer = layer_1(pkt);
4 spin_lock(&nextlayer->lock1);
5 layer_1_processing(pkt);
6 spin_unlock(&lock2);
7 spin_unlock(&nextlayer->lock1);
In an important special case of conditional locking, all needed locks are acquired before any processing is carried out, where the needed locks might be identified by hashing the addresses of the data structures involved. In this case, processing need not be idempotent: If it turns out to be impossible to acquire a given lock without first releasing one that was already acquired, just release all the locks and try again. Only once all needed locks are held will any processing be carried out.

However, this procedure can result in livelock, which will be discussed in Section 7.1.2.

Quick Quiz 7.10: When using the "acquire needed locks first" approach described in Section 7.1.1.7, how can livelock be avoided?

A related approach, two-phase locking [BHG87], has seen long production use in transactional database systems. In the first phase of a two-phase locking transaction, locks are acquired but not released. Once all needed locks have been acquired, the transaction enters the second phase, where locks are released, but not acquired. This locking approach allows databases to provide serializability guarantees for their transactions, in other words, to guarantee that all values seen and produced by the transactions are consistent with some global ordering of all the transactions. Many such systems rely on the ability to abort transactions, although this can be simplified by avoiding making any changes to shared data until all needed locks are acquired. Livelock and deadlock are issues in such systems, but practical solutions may be found in any of a number of database textbooks.

7.1.1.8 Single-Lock-at-a-Time Designs

In some cases, it is possible to avoid nesting locks, thus avoiding deadlock. For example, if a problem is perfectly partitionable, a single lock may be assigned to each partition. Then a thread working on a given partition need only acquire the one corresponding lock. Because no thread ever holds more than one lock at a time, deadlock is impossible.

However, there must be some mechanism to ensure that the needed data structures remain in existence during the time that neither lock is held. One such mechanism is discussed in Section 7.4 and several others are presented in Chapter 9.

7.1.1.9 Signal/Interrupt Handlers

Deadlocks involving signal handlers are often quickly dismissed by noting that it is not legal to invoke pthread_mutex_lock() from within a signal handler [Ope97]. However, it is possible (though often unwise) to hand-craft locking primitives that can be invoked from signal handlers. Besides which, almost all operating-system kernels permit locks to be acquired from within interrupt handlers, which are analogous to signal handlers.

The trick is to block signals (or disable interrupts, as the case may be) when acquiring any lock that might be acquired within a signal (or an interrupt) handler. Furthermore, if holding such a lock, it is illegal to attempt to acquire any lock that is ever acquired outside of a signal handler without blocking signals.

Quick Quiz 7.11: Suppose Lock A is never acquired within a signal handler, but Lock B is acquired both from thread context and by signal handlers. Suppose further that Lock A is sometimes acquired with signals unblocked. Why is it illegal to acquire Lock A holding Lock B?

If a lock is acquired by the handlers for several signals, then each and every one of these signals must be blocked whenever that lock is acquired, even when that lock is acquired within a signal handler.

Quick Quiz 7.12: How can you legally block signals within a signal handler?

Unfortunately, blocking and unblocking signals can be expensive in some operating systems, notably including Linux, so performance concerns often mean that locks acquired in signal handlers are only acquired in signal handlers, and that lockless synchronization mechanisms are used to communicate between application code and signal handlers.

Or that signal handlers are avoided completely except for handling fatal errors.

Quick Quiz 7.13: If acquiring locks in signal handlers is such a bad idea, why even discuss ways of making it safe?

7.1.1.10 Discussion

There are a large number of deadlock-avoidance strategies available to the shared-memory parallel programmer, but there are sequential programs for which none of them is a good fit. This is one of the reasons that expert programmers have more than one tool in their toolbox:
Locking is a powerful concurrency tool, but there are jobs better addressed with other tools.

Quick Quiz 7.14: Given an object-oriented application that passes control freely among a group of objects such that there is no straightforward locking hierarchy,a layered or otherwise, how can this application be parallelized?

a Also known as "object-oriented spaghetti code."

Nevertheless, the strategies described in this section have proven quite useful in many settings.

7.1.2 Livelock and Starvation

Although conditional locking can be an effective deadlock-avoidance mechanism, it can be abused. Consider for example the beautifully symmetric example shown in Listing 7.5. This example's beauty hides an ugly livelock. To see this, consider the following sequence of events:

Listing 7.5: Abusing Conditional Locking
 1 void thread1(void)
 2 {
 3 retry:
 4     spin_lock(&lock1);
 5     do_one_thing();
 6     if (!spin_trylock(&lock2)) {
 7         spin_unlock(&lock1);
 8         goto retry;
 9     }
10     do_another_thing();
11     spin_unlock(&lock2);
12     spin_unlock(&lock1);
13 }
14
15 void thread2(void)
16 {
17 retry:
18     spin_lock(&lock2);
19     do_a_third_thing();
20     if (!spin_trylock(&lock1)) {
21         spin_unlock(&lock2);
22         goto retry;
23     }
24     do_a_fourth_thing();
25     spin_unlock(&lock1);
26     spin_unlock(&lock2);
27 }

1. Thread 1 acquires lock1 on line 4, then invokes do_one_thing().

2. Thread 2 acquires lock2 on line 18, then invokes do_a_third_thing().

3. Thread 1 attempts to acquire lock2 on line 6, but fails because Thread 2 holds it.

4. Thread 2 attempts to acquire lock1 on line 20, but fails because Thread 1 holds it.

5. Thread 1 releases lock1 on line 7, then jumps to retry at line 3.

6. Thread 2 releases lock2 on line 21, and jumps to retry at line 17.

7. The livelock dance repeats from the beginning.

Quick Quiz 7.15: How can the livelock shown in Listing 7.5 be avoided?

Livelock can be thought of as an extreme form of starvation where a group of threads starves, rather than just one of them.3

3 Try not to get too hung up on the exact definitions of terms like

Livelock and starvation are serious issues in software transactional memory implementations, and so the concept of contention manager has been introduced to encapsulate these issues. In the case of locking, simple exponential backoff can often address livelock and starvation. The idea is to introduce exponentially increasing delays before each retry, as shown in Listing 7.6.

Listing 7.6: Conditional Locking and Exponential Backoff
 1 void thread1(void)
 2 {
 3     unsigned int wait = 1;
 4 retry:
 5     spin_lock(&lock1);
 6     do_one_thing();
 7     if (!spin_trylock(&lock2)) {
 8         spin_unlock(&lock1);
 9         sleep(wait);
10         wait = wait << 1;
11         goto retry;
12     }
13     do_another_thing();
14     spin_unlock(&lock2);
15     spin_unlock(&lock1);
16 }
17
18 void thread2(void)
19 {
20     unsigned int wait = 1;
21 retry:
22     spin_lock(&lock2);
23     do_a_third_thing();
24     if (!spin_trylock(&lock1)) {
25         spin_unlock(&lock2);
26         sleep(wait);
27         wait = wait << 1;
28         goto retry;
29     }
30     do_a_fourth_thing();
31     spin_unlock(&lock1);
32     spin_unlock(&lock2);
33 }
The Rust programming language takes lock/data association a step further by allowing the developer to make a compiler-visible association between a lock and the data that it protects [JJKD21]. When such an association has been made, attempts to access the data without the benefit of the corresponding lock will result in a compile-time diagnostic. The hope is that this will greatly reduce the frequency of this class of bugs. Of course, this approach does not apply straightforwardly to cases where the data to be locked is distributed throughout the nodes of some data structure or when that which is locked is purely abstract, for example, when a small subset of state-machine transitions is to be protected by a given lock. For this reason, Rust allows locks to be associated with types rather than data items or even to be associated with nothing at all. This last option permits Rust to emulate traditional locking use cases, but is not popular among Rust developers. Perhaps the Rust community will come up with other mechanisms tailored to other locking use cases.

7.2 Types of Locks

Only locks in life are what you think you know, but don't. Accept your ignorance and try something new.
Dennis Vickers

There are a surprising number of types of locks, more than this short chapter can possibly do justice to. The following sections discuss exclusive locks (Section 7.2.1), reader-writer locks (Section 7.2.2), multi-role locks (Section 7.2.3), and scoped locking (Section 7.2.4).

7.2.1 Exclusive Locks

Exclusive locks are what they say they are: Only one thread may hold the lock at a time. The holder of such a lock thus has exclusive access to all data protected by that lock, hence the name.

Of course, this all assumes that this lock is held across all accesses to data purportedly protected by the lock. Although there are some tools that can help (see for example Section 12.3.1), the ultimate responsibility for ensuring that the lock is always acquired when needed rests with the developer.

Quick Quiz 7.19: Does it ever make sense to have an exclusive lock acquisition immediately followed by a release of that same lock, that is, an empty critical section?

It is important to note that unconditionally acquiring an exclusive lock has two effects: (1) Waiting for all prior holders of that lock to release it and (2) Blocking any other acquisition attempts until the lock is released. As a result, at lock acquisition time, any concurrent acquisitions of that lock must be partitioned into prior holders and subsequent holders. Different types of exclusive locks use different partitioning strategies [Bra11, GGL+19], for example:

1. Strict FIFO, with acquisitions starting earlier acquiring the lock earlier.

2. Approximate FIFO, with acquisitions starting sufficiently earlier acquiring the lock earlier.

3. FIFO within priority level, with higher-priority threads acquiring the lock earlier than any lower-priority threads attempting to acquire the lock at about the same time, but so that some FIFO ordering applies for threads of the same priority.

4. Random, so that the new lock holder is chosen randomly from all threads attempting acquisition, regardless of timing.

5. Unfair, so that a given acquisition might never acquire the lock (see Section 7.1.3).

Unfortunately, locking implementations with stronger guarantees typically incur higher overhead, motivating the wide variety of locking implementations in production use. For example, real-time systems often require some degree of FIFO ordering within priority level, and much else besides (see Section 14.3.5.1), while non-realtime systems subject to high contention might require only enough ordering to avoid starvation, and finally, non-realtime systems designed to avoid contention might not need fairness at all.

7.2.2 Reader-Writer Locks

Reader-writer locks [CHP71] permit any number of readers to hold the lock concurrently on the one hand or a single writer to hold the lock on the other. In theory, then, reader-writer locks should allow excellent scalability for data that is read often and written rarely. In practice, the scalability will depend on the reader-writer lock implementation.

The classic reader-writer lock implementation involves a set of counters and flags that are manipulated atomically.
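The preceding paragraph describes the classic implementation only in outline, so the following is one minimal sketch of a "counters and flags" reader-writer lock built from a single atomic word. It is an illustration under stated assumptions, not the implementation the book benchmarks: bit 31 serves as the writer flag, the low bits count active readers, the lock spins rather than sleeping, and it makes no fairness guarantees.

#include <stdatomic.h>

#define RW_WRITER (1u << 31)

typedef struct {
	atomic_uint state;	/* writer flag plus reader count */
} rwlock_sketch_t;

static void sketch_read_lock(rwlock_sketch_t *lp)
{
	unsigned int old;

	for (;;) {
		old = atomic_load(&lp->state);
		if (!(old & RW_WRITER) &&
		    atomic_compare_exchange_weak(&lp->state, &old, old + 1))
			return;		/* no writer present: reader count incremented */
	}
}

static void sketch_read_unlock(rwlock_sketch_t *lp)
{
	atomic_fetch_sub(&lp->state, 1);	/* drop our reader count */
}

static void sketch_write_lock(rwlock_sketch_t *lp)
{
	unsigned int expected = 0;

	/* Wait for zero readers and no writer, then claim the writer flag. */
	while (!atomic_compare_exchange_weak(&lp->state, &expected, RW_WRITER))
		expected = 0;
}

static void sketch_write_unlock(rwlock_sketch_t *lp)
{
	atomic_store(&lp->state, 0);
}

Even in this stripped-down form, every acquisition and release is an atomic read-modify-write on a shared cache line, which foreshadows the short-critical-section overhead discussed next.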
This type of implementation suffers from the same problem as does exclusive locking for short critical sections: The overhead of acquiring and releasing the lock is about two orders of magnitude greater than the overhead of a simple instruction. Of course, if the critical section is long enough, the overhead of acquiring and releasing the lock becomes negligible. However, because only one thread at a time can be manipulating the lock, the required critical-section size increases with the number of CPUs.

It is possible to design a reader-writer lock that is much more favorable to readers through use of per-thread exclusive locks [HW92]. To read, a thread acquires only its own lock. To write, a thread acquires all locks. In the absence of writers, each reader incurs only atomic-instruction and memory-barrier overhead, with no cache misses, which is quite good for a locking primitive. Unfortunately, writers must incur cache misses as well as atomic-instruction and memory-barrier overhead—multiplied by the number of threads.

In short, reader-writer locks can be quite useful in a number of situations, but each type of implementation does have its drawbacks. The canonical use case for reader-writer locking involves very long read-side critical sections, preferably measured in hundreds of microseconds or even milliseconds.

As with exclusive locks, a reader-writer lock acquisition cannot complete until all prior conflicting holders of that lock have released it. If a lock is read-held, then read acquisitions can complete immediately, but write acquisitions must wait until there are no longer any readers holding the lock. If a lock is write-held, then all acquisitions must wait until the writer releases the lock. Again as with exclusive locks, different reader-writer lock implementations provide different degrees of FIFO ordering to readers on the one hand and to writers on the other.

But suppose a large number of readers hold the lock and a writer is waiting to acquire the lock. Should readers be allowed to continue to acquire the lock, possibly starving the writer? Similarly, suppose that a writer holds the lock and that a large number of both readers and writers are waiting to acquire the lock. When the current writer releases the lock, should it be given to a reader or to another writer? If it is given to a reader, how many readers should be allowed to acquire the lock before the next writer is permitted to do so?

There are many possible answers to these questions, with different levels of complexity, overhead, and fairness. Different implementations might have different costs, for example, some types of reader-writer locks incur extremely large latencies when switching from read-holder to write-holder mode. Here are a few possible approaches:

1. Reader-preference implementations unconditionally favor readers over writers, possibly allowing write acquisitions to be indefinitely blocked.

2. Batch-fair implementations ensure that when both readers and writers are acquiring the lock, both have reasonable access via batching. For example, the lock might admit five readers per CPU, then two writers, then five more readers per CPU, and so on.

3. Writer-preference implementations unconditionally favor writers over readers, possibly allowing read acquisitions to be indefinitely blocked.

Of course, these distinctions matter only under conditions of high lock contention.

Please keep the waiting/blocking dual nature of locks firmly in mind. This will be revisited in Chapter 9's discussion of scalable high-performance special-purpose alternatives to locking.

7.2.3 Beyond Reader-Writer Locks

Reader-writer locks and exclusive locks differ in their admission policy: Exclusive locks allow at most one holder, while reader-writer locks permit an arbitrary number of read-holders (but only one write-holder). There is a very large number of possible admission policies, one of which is that of the VAX/VMS distributed lock manager (DLM) [ST87], which is shown in Table 7.1. Blank cells indicate compatible modes, while cells containing "X" indicate incompatible modes.

Table 7.1: VAX/VMS Distributed Lock Manager Policy

  Mode                     Null  CR    CW    PR    PW    EX
  Null (Not Held)
  Concurrent Read (CR)                                   X
  Concurrent Write (CW)                      X     X     X
  Protected Read (PR)                  X           X     X
  Protected Write (PW)                 X     X     X     X
  Exclusive (EX)                 X     X     X     X     X
The VAX/VMS DLM uses six modes. For purposes of comparison, exclusive locks use two modes (not held and held), while reader-writer locks use three modes (not held, read held, and write held).

The first mode is null, or not held. This mode is compatible with all other modes, which is to be expected: If a thread is not holding a lock, it should not prevent any other thread from acquiring that lock.

The second mode is concurrent read, which is compatible with every other mode except for exclusive. The concurrent-read mode might be used to accumulate approximate statistics on a data structure, while permitting updates to proceed concurrently.

The third mode is concurrent write, which is compatible with null, concurrent read, and concurrent write. The concurrent-write mode might be used to update approximate statistics, while still permitting reads and concurrent updates to proceed concurrently.

The fourth mode is protected read, which is compatible with null, concurrent read, and protected read. The protected-read mode might be used to obtain a consistent snapshot of the data structure, while permitting reads but not updates to proceed concurrently.

The fifth mode is protected write, which is compatible with null and concurrent read. The protected-write mode might be used to carry out updates to a data structure that could interfere with protected readers but which could be tolerated by concurrent readers.

The sixth and final mode is exclusive, which is compatible only with null. The exclusive mode is used when it is necessary to exclude all other accesses.

It is interesting to note that exclusive locks and reader-writer locks can be emulated by the VAX/VMS DLM. Exclusive locks would use only the null and exclusive modes, while reader-writer locks might use the null, protected-read, and protected-write modes.

Quick Quiz 7.20: Is there any other way for the VAX/VMS DLM to emulate a reader-writer lock?

Although the VAX/VMS DLM policy has seen widespread production use for distributed databases, it does not appear to be used much in shared-memory applications. One possible reason for this is that the greater communication overheads of distributed databases can hide the greater overhead of the VAX/VMS DLM's more-complex admission policy.

Nevertheless, the VAX/VMS DLM is an interesting illustration of just how flexible the concepts behind locking can be. It also serves as a very simple introduction to the locking schemes used by modern DBMSes, which can have more than thirty locking modes, compared to VAX/VMS's six.

7.2.4 Scoped Locking

The locking primitives discussed thus far require explicit acquisition and release primitives, for example, spin_lock() and spin_unlock(), respectively. Another approach is to use the object-oriented resource-acquisition-is-initialization (RAII) pattern [ES90].5 This pattern is often applied to auto variables in languages like C++, where the corresponding constructor is invoked upon entry to the object's scope, and the corresponding destructor is invoked upon exit from that scope. This can be applied to locking by having the constructor acquire the lock and the destructor free it.

5 Though more clearly expressed at https://www.stroustrup.com/bs_faq2.html#finally.

This approach can be quite useful, in fact in 1990 I was convinced that it was the only type of locking that was needed.6 One very nice property of RAII locking is that you don't need to carefully release the lock on each and every code path that exits that scope, a property that can eliminate a troublesome set of bugs.

6 My later work with parallelism at Sequent Computer Systems very quickly disabused me of this misguided notion.

However, RAII locking also has a dark side. RAII makes it quite difficult to encapsulate lock acquisition and release, for example, in iterators. In many iterator implementations, you would like to acquire the lock in the iterator's "start" function and release it in the iterator's "stop" function. RAII locking instead requires that the lock acquisition and release take place in the same level of scoping, making such encapsulation difficult or even impossible.

Strict RAII locking also prohibits overlapping critical sections, due to the fact that scopes must nest. This prohibition makes it difficult or impossible to express a number of useful constructs, for example, locking trees that mediate between multiple concurrent attempts to assert an event. Of an arbitrarily large group of concurrent attempts, only one need succeed, and the best strategy for the remaining attempts is for them to fail as quickly and painlessly as possible. Otherwise, lock contention becomes pathological on large systems (where "large" is many hundreds of CPUs). Therefore, C++17 [Smi19] has escapes from strict RAII in its unique_lock class, which allows the scope of the critical section to be controlled to roughly the same extent as can be achieved with explicit lock acquisition and release primitives.
Figure 7.10: Locking Hierarchy

Listing 7.7 (lines 16–22):
16     if (!READ_ONCE(gp_flags)) {
17         WRITE_ONCE(gp_flags, 1);
18         do_force_quiescent_state();
19         WRITE_ONCE(gp_flags, 0);
20     }
21     raw_spin_unlock(&rnp_old->fqslock);
22 }
Example strict-RAII-unfriendly data structures from Linux-kernel RCU are shown in Figure 7.10. Here, each CPU is assigned a leaf rcu_node structure, and each rcu_node structure has a pointer to its parent (named, oddly enough, ->parent), up to the root rcu_node structure, which has a NULL ->parent pointer. The number of child rcu_node structures per parent can vary, but is typically 32 or 64. Each rcu_node structure also contains a lock named ->fqslock.

The general approach is a tournament, where a given CPU conditionally acquires its leaf rcu_node structure's ->fqslock, and, if successful, attempts to acquire that of the parent, then release that of the child. In addition, at each level, the CPU checks a global gp_flags variable, and if this variable indicates that some other CPU has asserted the event, the first CPU drops out of the competition. This acquire-then-release sequence continues until either the gp_flags variable indicates that someone else won the tournament, one of the attempts to acquire an ->fqslock fails, or the root rcu_node structure's ->fqslock has been acquired. If the root rcu_node structure's ->fqslock is acquired, a function named do_force_quiescent_state() is invoked.

Simplified code to implement this is shown in Listing 7.7. The purpose of this function is to mediate between CPUs who have concurrently detected a need to invoke the do_force_quiescent_state() function. At any given time, it only makes sense for one instance of do_force_quiescent_state() to be active, so if there are multiple concurrent callers, we need at most one of them to actually invoke do_force_quiescent_state(), and we need the rest to (as quickly and painlessly as possible) give up and leave.

To this end, each pass through the loop spanning lines 7–15 attempts to advance up one level in the rcu_node hierarchy. If the gp_flags variable is already set (line 8) or if the attempt to acquire the current rcu_node structure's ->fqslock is unsuccessful (line 9), then local variable ret is set to 1. If line 10 sees that local variable rnp_old is non-NULL, meaning that we hold rnp_old's ->fqslock, line 11 releases this lock (but only after the attempt has been made to acquire the parent rcu_node structure's ->fqslock). If line 12 sees that either line 8 or 9 saw a reason to give up, line 13 returns to the caller. Otherwise, we must have acquired the current rcu_node structure's ->fqslock, so line 14 saves a pointer to this structure in local variable rnp_old in preparation for the next pass through the loop.

If control reaches line 16, we won the tournament, and now hold the root rcu_node structure's ->fqslock. If line 16 still sees that the global variable gp_flags is zero, line 17 sets gp_flags to one, line 18 invokes do_force_quiescent_state(), and line 19 resets gp_flags back to zero. Either way, line 21 releases the root rcu_node structure's ->fqslock.

Quick Quiz 7.21: The code in Listing 7.7 is ridiculously complicated! Why not conditionally acquire a single global lock?
Quick Quiz 7.22: Wait a minute! If we "win" the tournament on line 16 of Listing 7.7, we get to do all the work of do_force_quiescent_state(). Exactly how is that a win, really?

This function illustrates the not-uncommon pattern of hierarchical locking. This pattern is difficult to implement using strict RAII locking,7 just like the iterator encapsulation noted earlier, and so explicit lock/unlock primitives (or C++17-style unique_lock escapes) will be required for the foreseeable future.

Developers are almost always best-served by using whatever locking primitives are provided by the system, for example, the POSIX pthread mutex locks [Ope97, But97]. Nevertheless, studying sample implementations can be helpful, as can considering the challenges posed by extreme workloads and environments.

Listing 7.8: Sample Lock Based on Atomic Exchange
 1 typedef int xchglock_t;
 2 #define DEFINE_XCHG_LOCK(n) xchglock_t n = 0
 3
 4 void xchg_lock(xchglock_t *xp)
 5 {
 6     while (xchg(xp, 1) == 1) {
 7         while (READ_ONCE(*xp) == 1)
 8             continue;
 9     }
10 }
11
12 void xchg_unlock(xchglock_t *xp)
13 {
14     (void)xchg(xp, 0);
15 }

Lock release is carried out by the xchg_unlock() function shown on lines 12–15. Line 14 atomically exchanges the value zero ("unlocked") into the lock, thus marking it as having been released.

Quick Quiz 7.25: Why not simply store zero into the lock word on line 14 of Listing 7.8?
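The following discussion mentions ticket locks alongside test-and-set locks such as the one in Listing 7.8. For comparison, here is a minimal ticket-lock sketch, which is an illustration rather than code from the book's CodeSamples, and which uses C11 atomics instead of the Linux-kernel primitives used elsewhere in this chapter:

#include <stdatomic.h>

typedef struct {
	atomic_uint next;	/* next ticket number to hand out */
	atomic_uint serving;	/* ticket number currently being served */
} ticketlock_sketch_t;

static void sketch_ticket_lock(ticketlock_sketch_t *tp)
{
	unsigned int my_ticket = atomic_fetch_add(&tp->next, 1);

	while (atomic_load(&tp->serving) != my_ticket)
		continue;	/* spin until our ticket comes up */
}

static void sketch_ticket_unlock(ticketlock_sketch_t *tp)
{
	atomic_fetch_add(&tp->serving, 1);	/* hand off to the next waiter */
}

Unlike the test-and-set lock in Listing 7.8, this grants the lock in strict FIFO order, but all waiters still spin on the single serving field, so it shares the high-contention handoff problem described below.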
unable to use it, perhaps due to that thread being preempted or interrupted. On the other hand, it is important to avoid getting too worried about the possibility of preemption and interruption. After all, in many cases, this preemption and interruption could just as well happen just after the lock was acquired.8

8 Besides, the best way of handling high lock contention is to avoid it in the first place! There are nevertheless some situations where high lock contention is the lesser of the available evils, and in any case, studying schemes that deal with high levels of contention is a good mental exercise.

All locking implementations where waiters spin on a single memory location, including both test-and-set locks and ticket locks, suffer from performance problems at high contention levels. The problem is that the thread releasing the lock must update the value of the corresponding memory location. At low contention, this is not a problem: The corresponding cache line is very likely still local to and writeable by the thread holding the lock. In contrast, at high levels of contention, each thread attempting to acquire the lock will have a read-only copy of the cache line, and the lock holder will need to invalidate all such copies before it can carry out the update that releases the lock. In general, the more CPUs and threads there are, the greater the overhead incurred when releasing the lock under conditions of high contention.

This negative scalability has motivated a number of different queued-lock implementations [And90, GT90, MCS91, WKS94, Cra93, MLH94, TS93], some of which are used in recent versions of the Linux kernel [Cor14b]. Queued locks avoid high cache-invalidation overhead by assigning each thread a queue element. These queue elements are linked together into a queue that governs the order that the lock will be granted to the waiting threads. The key point is that each thread spins on its own queue element, so that the lock holder need only invalidate the first element from the next thread's CPU's cache. This arrangement greatly reduces the overhead of lock handoff at high levels of contention.

More recent queued-lock implementations also take the system's architecture into account, preferentially granting locks locally, while also taking steps to avoid starvation [SSVM02, RH03, RH02, JMRR02, MCM02]. Many of these can be thought of as analogous to the elevator algorithms traditionally used in scheduling disk I/O.

Unfortunately, the same scheduling logic that improves the efficiency of queued locks at high contention also increases their overhead at low contention. Beng-Hong Lim and Anant Agarwal therefore combined a simple test-and-set lock with a queued lock, using the test-and-set lock at low levels of contention and switching to the queued lock at high levels of contention [LA94], thus getting low overhead at low levels of contention and getting fairness and high throughput at high levels of contention. Browning et al. took a similar approach, but avoided the use of a separate flag, so that the test-and-set fast path uses the same sequence of instructions that would be used in a simple test-and-set lock [BMMM05]. This approach has been used in production.

Another issue that arises at high levels of contention is when the lock holder is delayed, especially when the delay is due to preemption, which can result in priority inversion, where a low-priority thread holds a lock, but is preempted by a medium-priority CPU-bound thread, which results in a high-priority process blocking while attempting to acquire the lock. The result is that the CPU-bound medium-priority process is preventing the high-priority process from running. One solution is priority inheritance [LR80], which has been widely used for real-time computing [SRL90, Cor06b], despite some lingering controversy over this practice [Yod04a, Loc02].

Another way to avoid priority inversion is to prevent preemption while a lock is held. Because preventing preemption while locks are held also improves throughput, most proprietary UNIX kernels offer some form of scheduler-conscious synchronization mechanism [KWS97], largely due to the efforts of a certain sizable database vendor. These mechanisms usually take the form of a hint that preemption should be avoided in a given region of code, with this hint typically being placed in a machine register. These hints frequently take the form of a bit set in a particular machine register, which enables extremely low per-lock-acquisition overhead for these mechanisms. In contrast, Linux avoids these hints, instead getting similar results from a mechanism called futexes [FRK02, Mol06, Ros06, Dre11].

Interestingly enough, atomic instructions are not strictly needed to implement locks [Dij65, Lam74]. An excellent exposition of the issues surrounding locking implementations based on simple loads and stores may be found in Herlihy's and Shavit's textbook [HS08, HSLS20]. The main point echoed here is that such implementations currently have little practical application, although a careful study of them can be both entertaining and enlightening. Nevertheless, with one exception described below, such study is left as an exercise for the reader.

Gamsa et al. [GKAS99, Section 5.3] describe a token-based mechanism in which a token circulates among the CPUs. When the token reaches a given CPU, it has
exclusive access to anything protected by that token. There are any number of schemes that may be used to implement the token-based mechanism, for example:

1. Maintain a per-CPU flag, which is initially zero for all but one CPU. When a CPU's flag is non-zero, it holds the token. When it finishes with the token, it zeroes its flag and sets the flag of the next CPU to one (or to any other non-zero value).

2. Maintain a per-CPU counter, which is initially set to the corresponding CPU's number, which we assume to range from zero to N − 1, where N is the number of CPUs in the system. When a CPU's counter is greater than that of the next CPU (taking counter wrap into account), the first CPU holds the token. When it is finished with the token, it sets the next CPU's counter to a value one greater than its own counter.

Quick Quiz 7.26: How can you tell if one counter is greater than another, while accounting for counter wrap?

Quick Quiz 7.27: Which is better, the counter approach or the flag approach?

This lock is unusual in that a given CPU cannot necessarily acquire it immediately, even if no other CPU is using it at the moment. Instead, the CPU must wait until the token comes around to it. This is useful in cases where CPUs need periodic access to the critical section, but can tolerate variances in token-circulation rate. Gamsa et al. [GKAS99] used it to implement a variant of read-copy update (see Section 9.5), but it could also be used to protect periodic per-CPU operations such as flushing per-CPU caches used by memory allocators [MS93], garbage-collecting per-CPU data structures, or flushing per-CPU data to shared storage (or to mass storage, for that matter).

The Linux kernel now uses queued spinlocks [Cor14b], but because of the complexity of implementations that provide good performance across the range of contention levels, the path has not always been smooth [Mar18, Dea18].

As increasing numbers of people gain familiarity with parallel hardware and parallelize increasing amounts of code, we can continue to expect more special-purpose locking primitives to appear, see for example Guerraoui et al. [GGL+19, Gui18]. Nevertheless, you should carefully consider this important safety tip: Use the standard synchronization primitives whenever humanly possible. The big advantage of the standard synchronization primitives over roll-your-own efforts is that the standard primitives are typically much less bug-prone.9

9 And yes, I have done at least my share of roll-your-own synchronization primitives. However, you will notice that my hair is much greyer than it was before I started doing that sort of work. Coincidence? Maybe. But are you really willing to risk your own hair turning prematurely grey?

7.4 Lock-Based Existence Guarantees

Existence precedes and rules essence.
Jean-Paul Sartre

A key challenge in parallel programming is to provide existence guarantees [GKAS99], so that attempts to access a given object can rely on that object being in existence throughout a given access attempt.

In some cases, existence guarantees are implicit:

1. Global variables and static local variables in the base module will exist as long as the application is running.

2. Global variables and static local variables in a loaded module will exist as long as that module remains loaded.

3. A module will remain loaded as long as at least one of its functions has an active instance.

4. A given function instance's on-stack variables will exist until that instance returns.

5. If you are executing within a given function or have been called (directly or indirectly) from that function, then the given function has an active instance.
These implicit existence guarantees are straightforward, though bugs involving implicit existence guarantees really can happen.

Quick Quiz 7.28: How can relying on implicit existence guarantees result in a bug?

But the more interesting—and troublesome—guarantee involves heap memory: A dynamically allocated data structure will exist until it is freed. The problem to be solved is to synchronize the freeing of the structure with concurrent accesses to that same structure. One way to do this is with explicit guarantees, such as locking. If a given structure may only be freed while holding a given lock, then holding that lock guarantees that structure's existence.

But this guarantee depends on the existence of the lock itself. One straightforward way to guarantee the lock's existence is to place the lock in a global variable, but global locking has the disadvantage of limiting scalability. One way of providing scalability that improves as the size of the data structure increases is to place a lock in each element of the structure. Unfortunately, putting the lock that is to protect a data element in the data element itself is subject to subtle race conditions, as shown in Listing 7.9.

Listing 7.9: Per-Element Locking Without Existence Guarantees
 1 int delete(int key)
 2 {
 3     int b;
 4     struct element *p;
 5
 6     b = hashfunction(key);
 7     p = hashtable[b];
 8     if (p == NULL || p->key != key)
 9         return 0;
10     spin_lock(&p->lock);
11     hashtable[b] = NULL;
12     spin_unlock(&p->lock);
13     kfree(p);
14     return 1;
15 }

Quick Quiz 7.29: What if the element we need to delete is not the first element of the list on line 8 of Listing 7.9?

To see one of these race conditions, consider the following sequence of events:

1. Thread 0 invokes delete(0), and reaches line 10 of the listing, acquiring the lock.

2. Thread 1 concurrently invokes delete(0), reaching line 10, but spins on the lock because Thread 0 holds it.

3. Thread 0 executes lines 11–14, removing the element from the hashtable, releasing the lock, and then freeing the element.

4. Thread 0 continues execution, and allocates memory, getting the exact block of memory that it just freed.

5. Thread 0 then initializes this block of memory as some other type of structure.

6. Thread 1's spin_lock() operation fails due to the fact that what it believes to be p->lock is no longer a spinlock.

Because there is no existence guarantee, the identity of the data element can change while a thread is attempting to acquire that element's lock on line 10!

One way to fix this example is to use a hashed set of global locks, so that each hash bucket has its own lock, as shown in Listing 7.10. This approach allows acquiring the proper lock (on line 9) before gaining a pointer to the data element (on line 10).

Listing 7.10: Per-Element Locking With Lock-Based Existence Guarantees
 1 int delete(int key)
 2 {
 3     int b;
 4     struct element *p;
 5     spinlock_t *sp;
 6
 7     b = hashfunction(key);
 8     sp = &locktable[b];
 9     spin_lock(sp);
10     p = hashtable[b];
11     if (p == NULL || p->key != key) {
12         spin_unlock(sp);
13         return 0;
14     }
15     hashtable[b] = NULL;
16     spin_unlock(sp);
17     kfree(p);
18     return 1;
19 }

Although this approach works quite well for elements contained in a single partitionable data structure such as the hash table shown in the listing, it can be problematic if a given data element can be a member of multiple hash tables or given more-complex data structures such as trees or graphs. Not only can these problems be solved, but the solutions also form the basis of lock-based software transactional memory implementations [ST95, DSS06]. However, Chapter 9 describes simpler—and faster—ways of providing existence guarantees.

7.5 Locking: Hero or Villain?

You either die a hero or you live long enough to see yourself become the villain.
Aaron Eckhart as Harvey Dent

As is often the case in real life, locking can be either hero or villain, depending on how it is used and on the problem at hand. In my experience, those writing whole applications are happy with locking, those writing parallel libraries are less happy, and those parallelizing existing sequential libraries are extremely unhappy. The following
sections discuss some reasons for these differences in viewpoints.

7.5.1 Locking For Applications: Hero!

When writing an entire application (or entire kernel), developers have full control of the design, including the synchronization design. Assuming that the design makes good use of partitioning, as discussed in Chapter 6, locking can be an extremely effective synchronization mechanism, as demonstrated by the heavy use of locking in production-quality parallel software.

Nevertheless, although such software usually bases most of its synchronization design on locking, such software also almost always makes use of other synchronization mechanisms, including special counting algorithms (Chapter 5), data ownership (Chapter 8), reference counting (Section 9.2), hazard pointers (Section 9.3), sequence locking (Section 9.4), and read-copy update (Section 9.5). In addition, practitioners use tools for deadlock detection [Cor06a], lock acquisition/release balancing [Cor04b], cache-miss analysis [The11], hardware-counter-based profiling [EGMdB11, The12b], and many more besides.

Given careful design, use of a good combination of synchronization mechanisms, and good tooling, locking works quite well for applications and kernels.

7.5.2 Locking For Parallel Libraries: Just Another Tool

Unlike applications and kernels, the designer of a library cannot know the locking design of the code that the library will be interacting with. In fact, that code might not be written for years to come. Library designers therefore have less control and must exercise more care when laying out their synchronization design.

Deadlock is of course of particular concern, and the techniques discussed in Section 7.1.1 need to be applied. One popular deadlock-avoidance strategy is therefore to ensure that the library's locks are independent subtrees of the enclosing program's locking hierarchy. However, this can be harder than it looks.

One complication was discussed in Section 7.1.1.2, namely when library functions call into application code, with qsort()'s comparison-function argument being a case in point. Another complication is the interaction with signal handlers. If an application signal handler is invoked from a signal received within the library function, deadlock can ensue just as surely as if the library function had called the signal handler directly. A final complication occurs for those library functions that can be used between a fork()/exec() pair, for example, due to use of the system() function. In this case, if your library function was holding a lock at the time of the fork(), then the child process will begin life with that lock held. Because the thread that will release the lock is running in the parent but not the child, if the child calls your library function, deadlock will ensue.

The following strategies may be used to avoid deadlock problems in these cases:

1. Don't use either callbacks or signals.

2. Don't acquire locks from within callbacks or signal handlers.

3. Let the caller control synchronization.

4. Parameterize the library API to delegate locking to caller.

5. Explicitly avoid callback deadlocks.

6. Explicitly avoid signal-handler deadlocks.

7. Avoid invoking fork().

Each of these strategies is discussed in one of the following sections.

7.5.2.1 Use Neither Callbacks Nor Signals

If a library function avoids callbacks and the application as a whole avoids signals, then any locks acquired by that library function will be leaves of the locking-hierarchy tree. This arrangement avoids deadlock, as discussed in Section 7.1.1.1. Although this strategy works extremely well where it applies, there are some applications that must use signal handlers, and there are some library functions (such as the qsort() function discussed in Section 7.1.1.2) that require callbacks.

The strategy described in the next section can often be used in these cases.

7.5.2.2 Avoid Locking in Callbacks and Signal Handlers

If neither callbacks nor signal handlers acquire locks, then they cannot be involved in deadlock cycles, which allows straightforward locking hierarchies to once again consider
library functions to be leaves on the locking-hierarchy tree. This strategy works very well for most uses of qsort, whose callbacks usually simply compare the two values passed in to them. This strategy also works wonderfully for many signal handlers, especially given that acquiring locks from within signal handlers is generally frowned upon [Gro01],10 but can fail if the application needs to manipulate complex data structures from a signal handler.

10 But the standard's words do not stop clever coders from creating their own home-brew locking primitives from atomic operations.

Here are some ways to avoid acquiring locks in signal handlers even if complex data structures must be manipulated:

1. Use simple data structures based on non-blocking synchronization, as will be discussed in Section 14.2.1.

2. If the data structures are too complex for reasonable use of non-blocking synchronization, create a queue that allows non-blocking enqueue operations. In the signal handler, instead of manipulating the complex data structure, add an element to the queue describing the required change. A separate thread can then remove elements from the queue and carry out the required changes using normal locking. There are a number of readily available implementations of concurrent queues [KLP12, Des09b, MS96].

This strategy should be enforced with occasional manual or (preferably) automated inspections of callbacks and signal handlers. When carrying out these inspections, be wary of clever coders who might have (unwisely) created home-brew locks from atomic operations.

7.5.2.3 Caller Controls Synchronization

Letting the caller control synchronization works extremely well when the library functions are operating on independent caller-visible instances of a data structure, each of which may be synchronized separately. For example, if the library functions operate on a search tree, and if the application needs a large number of independent search trees, then the application can associate a lock with each tree. The application then acquires and releases locks as needed, so that the library need not be aware of parallelism at all. Instead, the application controls the parallelism, so that locking can work very well, as was discussed in Section 7.5.1.

However, this strategy fails if the library implements a data structure that requires internal concurrency, for example, a hash table or a parallel sort. In this case, the library absolutely must control its own synchronization.

7.5.2.4 Parameterize Library Synchronization

The idea here is to add arguments to the library's API to specify which locks to acquire, how to acquire and release them, or both. This strategy allows the application to take on the global task of avoiding deadlock by specifying which locks to acquire (by passing in pointers to the locks in question) and how to acquire them (by passing in pointers to lock acquisition and release functions), but also allows a given library function to control its own concurrency by deciding where the locks should be acquired and released.

In particular, this strategy allows the lock acquisition and release functions to block signals as needed without the library code needing to be concerned with which signals need to be blocked by which locks. The separation of concerns used by this strategy can be quite effective, but in some cases the strategies laid out in the following sections can work better.

That said, passing explicit pointers to locks to external APIs must be very carefully considered, as discussed in Section 7.1.1.5. Although this practice is sometimes the right thing to do, you should do yourself a favor by looking into alternative designs first.

7.5.2.5 Explicitly Avoid Callback Deadlocks

The basic rule behind this strategy was discussed in Section 7.1.1.2: "Release all locks before invoking unknown code." This is usually the best approach because it allows the application to ignore the library's locking hierarchy: The library remains a leaf or isolated subtree of the application's overall locking hierarchy.

In cases where it is not possible to release all locks before invoking unknown code, the layered locking hierarchies described in Section 7.1.1.3 can work well. For example, if the unknown code is a signal handler, this implies that the library function block signals across all lock acquisitions, which can be complex and slow. Therefore, in cases where signal handlers (probably unwisely) acquire locks, the strategies in the next section may prove helpful.
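To make the "release all locks before invoking unknown code" rule concrete, here is a minimal sketch, not taken from this book's CodeSamples: the lib_foreach() function, its list layout, and the callback signature are all hypothetical, and the sketch assumes that no other thread removes nodes while the traversal is in progress.

#include <pthread.h>

struct lib_node {
	struct lib_node *next;
	void *data;
};

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
static struct lib_node *lib_head;

/* Invoke an application-supplied callback on each element, dropping the
 * library's lock around the callback so that the library remains a leaf
 * (or isolated subtree) of the application's locking hierarchy. */
void lib_foreach(void (*cb)(void *data))
{
	struct lib_node *p;
	void *data;

	pthread_mutex_lock(&lib_lock);
	for (p = lib_head; p != NULL; p = p->next) {
		data = p->data;
		pthread_mutex_unlock(&lib_lock); /* release before unknown code */
		cb(data);                        /* callback may take its own locks */
		pthread_mutex_lock(&lib_lock);   /* re-acquire to continue traversal */
	}
	pthread_mutex_unlock(&lib_lock);
}

A production-quality version would also need to guard against concurrent removal of the node being visited while the lock is dropped, for example by reference counting or by one of the deferral mechanisms of Chapter 9.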
7.5.2.6 Explicitly Avoid Signal-Handler Deadlocks

Suppose that a given library function is known to acquire locks, but does not block signals. Suppose further that it is necessary to invoke that function both from within and outside of a signal handler, and that it is not permissible to modify this library function. Of course, if no special action is taken, then if a signal arrives while that library function is holding its lock, deadlock can occur when the signal handler invokes that same library function, which in turn attempts to re-acquire that same lock.

Such deadlocks can be avoided as follows:

1. If the application invokes the library function from within a signal handler, then that signal must be blocked every time that the library function is invoked from outside of a signal handler.

2. If the application invokes the library function while holding a lock acquired within a given signal handler, then that signal must be blocked every time that the library function is called outside of a signal handler.

These rules can be enforced by using tools similar to the Linux kernel's lockdep lock dependency checker [Cor06a]. One of the great strengths of lockdep is that it is not fooled by human intuition [Ros11].

7.5.2.7 Library Functions Used Between fork() and exec()

As noted earlier, if a thread executing a library function is holding a lock at the time that some other thread invokes fork(), the fact that the parent's memory is copied to create the child means that this lock will be born held in the child's context. The thread that will release this lock is running in the parent, but not in the child, which means that the child's copy of this lock will never be released. Therefore, any attempt on the part of the child to invoke that same library function will result in deadlock.

A pragmatic and straightforward way of solving this problem is to fork() a child process while the process is still single-threaded, and have this child process remain single-threaded. Requests to create further child processes can then be communicated to this initial child process, which can safely carry out any needed fork() and exec() system calls on behalf of its multi-threaded parent process.

Another rather less pragmatic and straightforward solution to this problem is to have the library function check to see if the owner of the lock is still running, and if not, "breaking" the lock by re-initializing and then acquiring it. However, this approach has a couple of vulnerabilities:

1. The data structures protected by that lock are likely to be in some intermediate state, so that naively breaking the lock might result in arbitrary memory corruption.

2. If the child creates additional threads, two threads might break the lock concurrently, with the result that both threads believe they own the lock. This could again result in arbitrary memory corruption.

The pthread_atfork() function is provided to help deal with these situations. The idea is to register a triplet of functions, one to be called by the parent before the fork(), one to be called by the parent after the fork(), and one to be called by the child after the fork(). Appropriate cleanups can then be carried out at these three points.

Be warned, however, that coding of pthread_atfork() handlers is quite subtle in general. The cases where pthread_atfork() works best are cases where the data structure in question can simply be re-initialized by the child.
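By way of illustration, the following is a minimal sketch of the pthread_atfork() approach; the lib_lock, lib_prepare(), lib_parent(), lib_child(), and lib_init_atfork() names are all hypothetical, and error checking is omitted.

#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

static void lib_prepare(void)   /* parent, just before fork() */
{
	pthread_mutex_lock(&lib_lock);  /* library data is quiescent across fork() */
}

static void lib_parent(void)    /* parent, just after fork() */
{
	pthread_mutex_unlock(&lib_lock);
}

static void lib_child(void)     /* child, just after fork() */
{
	/* The child is single-threaded at this point, so, per the advice
	 * above, simply re-initialize the lock (and any protected data). */
	pthread_mutex_init(&lib_lock, NULL);
}

void lib_init_atfork(void)      /* invoke once during library initialization */
{
	pthread_atfork(lib_prepare, lib_parent, lib_child);
}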
7.5.2.8 Parallel Libraries: Discussion

Regardless of the strategy used, the description of the library's API must include a clear description of that strategy and how the caller should interact with that strategy. In short, constructing parallel libraries using locking is possible, but not as easy as constructing a parallel application.

7.5.3 Locking For Parallelizing Sequential Libraries: Villain!

With the advent of readily available low-cost multicore systems, a common task is parallelizing an existing library that was designed with only single-threaded use in mind. This all-too-common disregard for parallelism can result in a library API that is severely flawed from a parallel-programming viewpoint. Candidate flaws include:

1. Implicit prohibition of partitioning.

2. Callback functions requiring locking.

3. Object-oriented spaghetti code.

These flaws and the consequences for locking are discussed in the following sections.

7.5.3.1 Partitioning Prohibited

Suppose that you were writing a single-threaded hash-table implementation. It is easy and fast to maintain an exact count of the total number of items in the hash table, and also easy and fast to return this exact count on each addition and deletion operation. So why not?

One reason is that exact counters do not perform or scale well on multicore systems, as was seen in Chapter 5. As a result, the parallelized implementation of the hash table will not perform or scale well.

So what can be done about this? One approach is to return an approximate count, using one of the algorithms from Chapter 5. Another approach is to drop the element count altogether.

Either way, it will be necessary to inspect uses of the hash table to see why the addition and deletion operations need the exact count. Here are a few possibilities:

1. Determining when to resize the hash table. In this case, an approximate count should work quite well. It might also be useful to trigger the resizing operation from the length of the longest chain, which can be computed and maintained in a nicely partitioned per-chain manner.

2. Producing an estimate of the time required to traverse the entire hash table. An approximate count works well in this case, also.

3. For diagnostic purposes, for example, to check for items being lost when transferring them to and from the hash table. This clearly requires an exact count. However, given that this usage is diagnostic in nature, it might suffice to maintain the lengths of the hash chains, then to infrequently sum them up while locking out addition and deletion operations.

It turns out that there is now a strong theoretical basis for some of the constraints that performance and scalability place on a parallel library's APIs [AGH+11a, AGH+11b, McK11b]. Anyone designing a parallel library needs to pay close attention to those constraints.

Although it is all too easy to blame locking for what are really problems due to a concurrency-unfriendly API, doing so is not helpful. On the other hand, one has little choice but to sympathize with the hapless developer who made this choice in (say) 1985. It would have been a rare and courageous developer to anticipate the need for parallelism at that time, and it would have required an even more rare combination of brilliance and luck to actually arrive at a good parallel-friendly API.

Times change, and code must change with them. That said, there might be a huge number of users of a popular library, in which case an incompatible change to the API would be quite foolish. Adding a parallel-friendly API to complement the existing heavily used sequential-only API is usually the best course of action.

Nevertheless, human nature being what it is, we can expect our hapless developer to be more likely to complain about locking than about his or her own poor (though understandable) API design choices.

7.5.3.2 Deadlock-Prone Callbacks

Sections 7.1.1.2, 7.1.1.3, and 7.5.2 described how undisciplined use of callbacks can result in locking woes. These sections also described how to design your library function to avoid these problems, but it is unrealistic to expect a 1990s programmer with no experience in parallel programming to have followed such a design. Therefore, someone attempting to parallelize an existing callback-heavy single-threaded library will likely have many opportunities to curse locking's villainy.

If there are a very large number of uses of a callback-heavy library, it may be wise to again add a parallel-friendly API to the library in order to allow existing users to convert their code incrementally. Alternatively, some advocate use of transactional memory in these cases. While the jury is still out on transactional memory, Section 17.2 discusses its strengths and weaknesses. It is important to note that hardware transactional memory (discussed in Section 17.3) cannot help here unless the hardware transactional memory implementation provides forward-progress guarantees, which few do. Other alternatives that appear to be quite practical (if less heavily hyped) include the methods discussed in Sections 7.1.1.6 and 7.1.1.7, as well as those that will be discussed in Chapters 8 and 9.

7.5.3.3 Object-Oriented Spaghetti Code

Object-oriented programming went mainstream sometime in the 1980s or 1990s, and as a result there is a huge amount of single-threaded object-oriented code in production. Although object orientation can be a valuable software technique, undisciplined use of objects can easily result in object-oriented spaghetti code. In object-oriented spaghetti code, control flits from object to object in an essentially random manner, making the code hard to understand and making it even harder, and perhaps impossible, to accommodate a locking hierarchy.

Although many might argue that such code should be cleaned up in any case, such things are much easier to say than to do. If you are tasked with parallelizing such a beast, you can reduce the number of opportunities to curse locking by using the techniques described in Sections 7.1.1.6 and 7.1.1.7, as well as those that will be discussed in Chapters 8 and 9.
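Returning for a moment to the resize-trigger possibility called out in Section 7.5.3.1 above, the following hedged sketch shows one way per-bucket bookkeeping might replace a global exact counter. The struct ht_bucket layout and the ht_size_approx() function are hypothetical, and the spinlock_t and READ_ONCE() primitives are assumed to come from this book's userspace API shims.

struct ht_bucket {
	spinlock_t htb_lock;            /* protects this bucket's chain */
	struct ht_elem *htb_head;
	long htb_nelems;                /* length of this bucket's chain */
};

/* Approximate element count: sum the per-bucket counts without taking
 * any locks.  The result may be slightly stale, which is acceptable
 * for resize decisions and traversal-time estimates. */
long ht_size_approx(struct ht_bucket *tbl, int nbuckets)
{
	long sum = 0;
	int i;

	for (i = 0; i < nbuckets; i++)
		sum += READ_ONCE(tbl[i].htb_nelems);
	return sum;
}

Each bucket's htb_nelems is updated under that bucket's own lock, so additions and deletions remain nicely partitioned; only callers needing an exact count (the diagnostic case above) would have to lock out all buckets while summing.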
7.6 Summary
Achievement unlocked.
Unknown
It is mine, I tell you. My own. My precious. Yes, my precious.

Chapter 8

Data Ownership
One of the simplest ways to avoid the synchronization overhead that comes with locking is to parcel the data out among the threads (or, in the case of kernels, CPUs) so that a given piece of data is accessed and modified by only one of the threads. Interestingly enough, data ownership covers each of the "big three" parallel design techniques: It partitions over threads (or CPUs, as the case may be), it batches all local operations, and its elimination of synchronization operations is weakening carried to its logical extreme. It should therefore be no surprise that data ownership is heavily used: Even novices use it almost instinctively. In fact, it is so heavily used that this chapter will not introduce any new examples, but will instead refer back to those of previous chapters.

Quick Quiz 8.1: What form of data ownership is extremely difficult to avoid when creating shared-memory parallel programs (for example, using pthreads) in C or C++?

There are a number of approaches to data ownership. Section 8.1 presents the logical extreme in data ownership, where each thread has its own private address space. Section 8.2 looks at the opposite extreme, where the data is shared, but different threads own different access rights to the data. Section 8.3 describes function shipping, which is a way of allowing other threads to have indirect access to data owned by a particular thread. Section 8.4 describes how designated threads can be assigned ownership of a specified function and the related data. Section 8.5 discusses improving performance by transforming algorithms with shared data to instead use data ownership. Finally, Section 8.6 lists a few software environments that feature data ownership as a first-class citizen.

8.1 Multiple Processes

A man's home is his castle
Ancient Laws of England

Section 4.1 introduced the following example:

compute_it 1 > compute_it.1.out &
compute_it 2 > compute_it.2.out &
wait
cat compute_it.1.out
cat compute_it.2.out

This example runs two instances of the compute_it program in parallel, as separate processes that do not share memory. Therefore, all data in a given process is owned by that process, so that almost the entirety of data in the above example is owned. This approach almost entirely eliminates synchronization overhead. The resulting combination of extreme simplicity and optimal performance is obviously quite attractive.

Quick Quiz 8.2: What synchronization remains in the example shown in Section 8.1?

Quick Quiz 8.3: Is there any shared data in the example shown in Section 8.1?

This same pattern can be written in C as well as in sh, as illustrated by Listings 4.1 and 4.2.
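Those C listings are not reproduced here, but a rough, hypothetical sketch of the same two-process pattern might look as follows; it ignores the output redirection of the shell version and omits error handling for fork().

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid[2];
	int i;

	for (i = 0; i < 2; i++) {
		pid[i] = fork();
		if (pid[i] == 0) {                       /* child process */
			char arg[2] = { (char)('1' + i), '\0' };

			execlp("./compute_it", "compute_it", arg, (char *)NULL);
			perror("execlp");                /* reached only on failure */
			exit(EXIT_FAILURE);
		}
	}
	for (i = 0; i < 2; i++)
		waitpid(pid[i], NULL, 0);                /* the shell's "wait" */
	return 0;
}

As in the shell version, the two children share no memory with each other, so all of their data is owned and no synchronization beyond the final waitpid() calls is needed.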
It bears repeating that these trivial forms of parallelism are not in any way cheating or ducking responsibility, but are rather simple and elegant ways to make your code run faster. It is fast, scales well, is easy to program, easy to maintain, and gets the job done. In addition, taking this approach (where applicable) allows the developer more time to focus on other things whether these things

Quick Quiz 8.6: But none of the data in the eventual() function shown on lines 17–34 of Listing 5.5 is actually owned by the eventual() thread! In just what way is this data ownership???

of safely taking public data structures private.

In short, privatization is a powerful tool in the parallel programmer's toolbox, but it must nevertheless be used with care. Just like every other synchronization primitive, it has the potential to increase complexity while decreasing performance and scalability.

been ported to GPGPUs [Mat17, AMD20, NVi17a, NVi17b].
All things come to those who wait.
Violet Fane
Chapter 9
Deferred Processing
lists is called a hazard pointer [Mic04a].2 The value of a given data element's "virtual reference counter" can then be obtained by counting the number of hazard pointers referencing that element. Therefore, if that element has been rendered inaccessible to readers, and there are no longer any hazard pointers referencing it, that element may safely be freed.

2 Also independently invented by others [HLM02].

Of course, this means that hazard-pointer acquisition must be carried out quite carefully in order to avoid destructive races with concurrent deletion. One implementation is shown in Listing 9.4, which shows hp_try_record() on lines 1–16, hp_record() on lines 18–27, and hp_clear() on lines 29–33 (hazptr.h).

Listing 9.4: Hazard-Pointer Recording and Clearing
 1 static inline void *_h_t_r_impl(void **p,
 2                                 hazard_pointer *hp)
 3 {
 4     void *tmp;
 5
 6     tmp = READ_ONCE(*p);
 7     if (!tmp || tmp == (void *)HAZPTR_POISON)
 8         return tmp;
 9     WRITE_ONCE(hp->p, tmp);
10     smp_mb();
11     if (tmp == READ_ONCE(*p))
12         return tmp;
13     return (void *)HAZPTR_POISON;
14 }
15
16 #define hp_try_record(p, hp) _h_t_r_impl((void **)(p), hp)
17
18 static inline void *hp_record(void **p,
19                               hazard_pointer *hp)
20 {
21     void *tmp;
22
23     do {
24         tmp = hp_try_record(*p, hp);
25     } while (tmp == (void *)HAZPTR_POISON);
26     return tmp;
27 }
28
29 static inline void hp_clear(hazard_pointer *hp)
30 {
31     smp_mb();
32     WRITE_ONCE(hp->p, NULL);
33 }

The hp_try_record() macro on line 16 is simply a casting wrapper for the _h_t_r_impl() function, which attempts to store the pointer referenced by p into the hazard pointer referenced by hp. If successful, it returns the value of the stored pointer. If it fails due to that pointer being NULL, it returns NULL. Finally, if it fails due to racing with an update, it returns a special HAZPTR_POISON token.

Quick Quiz 9.6: Given that papers on hazard pointers use the bottom bits of each pointer to mark deleted elements, what is up with HAZPTR_POISON?

Line 6 reads the pointer to the object to be protected. If the test on line 7 finds that this pointer was either NULL or the special HAZPTR_POISON deleted-object token, line 8 returns the pointer's value to inform the caller of the failure. Otherwise, line 9 stores the pointer into the specified hazard pointer, and line 10 forces full ordering of that store with the reload of the original pointer on line 11. (See Chapter 15 for more information on memory ordering.) If the value of the original pointer has not changed, then the hazard pointer protects the pointed-to object, and in that case, line 12 returns a pointer to that object, which also indicates success to the caller. Otherwise, if the pointer changed between the two READ_ONCE() invocations, line 13 indicates failure.

Quick Quiz 9.7: Why does hp_try_record() in Listing 9.4 take a double indirection to the data element? Why not void * instead of void **?

The hp_record() function is quite straightforward: It repeatedly invokes hp_try_record() until the return value is something other than HAZPTR_POISON.

Quick Quiz 9.8: Why bother with hp_try_record()? Wouldn't it be easier to just use the failure-immune hp_record() function?

The hp_clear() function is even more straightforward, with an smp_mb() to force full ordering between the caller's uses of the object protected by the hazard pointer and the setting of the hazard pointer to NULL.

Once a hazard-pointer-protected object has been removed from its linked data structure, so that it is now inaccessible to future hazard-pointer readers, it is passed to hazptr_free_later(), which is shown on lines 48–56 of Listing 9.5 (hazptr.c). Lines 50 and 51 enqueue the object on a per-thread list rlist and line 52 counts the object in rcount. If line 53 sees that a sufficiently large number of objects are now queued, line 54 invokes hazptr_scan() to attempt to free some of them.

The hazptr_scan() function is shown on lines 6–46 of the listing. This function relies on a fixed maximum number of threads (NR_THREADS) and a fixed maximum number of hazard pointers per thread (K), which allows a fixed-size array of hazard pointers to be used. Because any thread might need to scan the hazard pointers, each thread maintains its own array, which is referenced by the per-thread variable gplist. If line 14 determines that this
shows the ->re_freed field used to detect use-after-free bugs, and line 21 invokes hp_try_record() to attempt to acquire a hazard pointer. If the return value is NULL, line 23 returns a not-found indication to the caller. If the call to hp_try_record() raced with deletion, line 25 branches back to line 18's retry to re-traverse the list from the beginning. The do–while loop falls through when the desired element is located, but if this element has already been freed, line 29 terminates the program. Otherwise, the element's ->iface field is returned to the caller.

Note that line 21 invokes hp_try_record() rather than the easier-to-use hp_record(), restarting the full search upon hp_try_record() failure. And such restarting is absolutely required for correctness. To see this, consider a hazard-pointer-protected linked list containing elements A, B, and C that is subjected to the following sequence of events:

1. Thread 0 stores a hazard pointer to element B (having presumably traversed to element B from element A).

2. Thread 1 removes element B from the list, which sets the pointer from element B to element C to the special HAZPTR_POISON value in order to mark the deletion. Because Thread 0 has a hazard pointer to element B, it cannot yet be freed.

3. Thread 1 removes element C from the list. Because there are no hazard pointers referencing element C, it is immediately freed.

4. Thread 0 attempts to acquire a hazard pointer to now-removed element B's successor, but hp_try_record() returns the HAZPTR_POISON value, forcing the caller to restart its traversal from the beginning of the list.

Which is a very good thing, because B's successor is the now-freed element C, which means that Thread 0's subsequent accesses might have resulted in arbitrarily horrible memory corruption, especially if the memory for element C had since been re-allocated for some other purpose. Therefore, hazard-pointer readers must typically restart the full traversal in the face of a concurrent deletion. Often the restart must go back to some global (and thus immortal) pointer, but it is sometimes possible to restart at some intermediate location if that location is guaranteed to still be live, for example, due to the current thread holding a lock, a reference count, etc.

Quick Quiz 9.9: Readers must "typically" restart? What are some exceptions?

Because algorithms using hazard pointers might be restarted at any step of their traversal through the linked data structure, such algorithms must typically take care to avoid making any changes to the data structure until after they have acquired all the hazard pointers that are required for the update in question.

Quick Quiz 9.10: But don't these restrictions on hazard pointers also apply to other forms of reference counting?

These hazard-pointer restrictions result in great benefits to readers, courtesy of the fact that the hazard pointers are stored local to each CPU or thread, which in turn allows traversals to be carried out without any writes to the data structures being traversed. Referring back to Figure 5.8 on page 71, hazard pointers enable the CPU caches to do resource replication, which in turn allows weakening of the parallel-access-control mechanism, thus boosting performance and scalability.

Another advantage of restarting hazard-pointer traversals is a reduction in minimal memory footprint: Any object not currently referenced by some hazard pointer may be immediately freed. In contrast, Section 9.5 will discuss a mechanism that avoids read-side retries (and minimizes read-side overhead), but which can result in a much larger memory footprint.

The route_add() and route_del() functions are shown in Listing 9.7. Line 10 initializes ->re_freed, line 31 poisons the ->re_next field of the newly removed object, and line 33 passes that object to the hazptr_free_later() function, which will free that object once it is safe to do so. The spinlocks work the same as in Listing 9.3.

Figure 9.3 shows the hazard-pointers-protected Pre-BSD routing algorithm's performance on the same read-only workload as for Figure 9.2. Although hazard pointers scale far better than does reference counting, hazard pointers still require readers to do writes to shared memory (albeit with much improved locality of reference), and also require a full memory barrier and retry check for each object traversed. Therefore, hazard-pointers performance is still far short of ideal. On the other hand, unlike naive approaches to concurrent reference-counting, hazard pointers not only operate correctly for workloads involving concurrent updates, but also exhibit excellent scalability. Additional performance comparisons with other mechanisms may be found in Chapter 10 and in other publications [HMBW07, McK13, Mic04a].
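The full route_lookup() appears in Listing 9.6, which is not reproduced here. To illustrate the restart requirement just described, the following is only a rough sketch of such a restarting traversal built on the hp_try_record() API of Listing 9.4; my_hazptr is assumed to be a per-thread array of at least two hazard pointers, and route_list and struct route_entry are as used by the Pre-BSD routing example.

struct route_entry *route_lookup_sketch(unsigned long addr)
{
	int offset = 0;
	struct route_entry *rep;
	struct route_entry **repp;

retry:
	repp = &route_list.re_next;
	do {
		rep = hp_try_record(repp, &my_hazptr[offset]);
		if (!rep)
			return NULL;             /* end of list: not found */
		if (rep == (struct route_entry *)HAZPTR_POISON)
			goto retry;              /* raced with deletion: restart */
		offset = !offset;                /* alternate hazard-pointer slots */
		repp = &rep->re_next;
	} while (rep->addr != addr);
	return rep;                              /* still protected by its hazard pointer */
}

Alternating between two hazard-pointer slots keeps the previous element protected while the next one is being recorded, and the goto retry path implements the restart-from-the-top behavior that the event sequence above showed to be essential.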
Listing 9.7: Hazard-Pointer Pre-BSD Routing Table Add/Delete
 1 int route_add(unsigned long addr, unsigned long interface)
 2 {
 3     struct route_entry *rep;
 4
 5     rep = malloc(sizeof(*rep));
 6     if (!rep)
 7         return -ENOMEM;
 8     rep->addr = addr;
 9     rep->iface = interface;
10     rep->re_freed = 0;
11     spin_lock(&routelock);
12     rep->re_next = route_list.re_next;
13     route_list.re_next = rep;
14     spin_unlock(&routelock);
15     return 0;
16 }
17
18 int route_del(unsigned long addr)
19 {
20     struct route_entry *rep;
21     struct route_entry **repp;
22
23     spin_lock(&routelock);
24     repp = &route_list.re_next;
25     for (;;) {
26         rep = *repp;
27         if (rep == NULL)
28             break;
29         if (rep->addr == addr) {
30             *repp = rep->re_next;
31             rep->re_next = (struct route_entry *)HAZPTR_POISON;
32             spin_unlock(&routelock);
33             hazptr_free_later(&rep->hh);
34             return 0;
35         }
36         repp = &rep->re_next;
37     }
38     spin_unlock(&routelock);
39     return -ENOENT;
40 }

Figure 9.3: Pre-BSD Routing Table Protected by Hazard Pointers

Figure 9.4: Reader And Uncooperative Sequence Lock

Quick Quiz 9.11: Figure 9.3 shows no sign of hyperthread-induced flattening at 224 threads. Why is that?

Quick Quiz 9.12: The paper "Structured Deferral: Synchronization via Procrastination" [McK13] shows that hazard pointers have near-ideal performance. Whatever happened in Figure 9.3???

The next section attempts to improve on hazard pointers by using sequence locks, which avoid both read-side writes and per-object memory barriers.

9.4 Sequence Locks

The published sequence-lock record [Eas71, Lam77] extends back as far as that of reader-writer locking, but sequence locks nevertheless remain in relative obscurity. Sequence locks are used in the Linux kernel for read-mostly data that must be seen in a consistent state by readers. However, unlike reader-writer locking, readers do not exclude writers. Instead, like hazard pointers, sequence locks force readers to retry an operation if they detect activity from a concurrent writer. As can be seen from Figure 9.4, it is important to design code using sequence locks so that readers very rarely need to retry.

Quick Quiz 9.13: Why isn't this sequence-lock discussion in Chapter 7, you know, the one on locking?
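As a hedged illustration of this retry-based reader pattern (assuming a read_seqbegin()/read_seqretry() API similar to the Linux kernel's, a seqlock_t named sl, and a statically allocated route_entry that is never freed), a reader might be structured like this:

struct route_entry cached_route;        /* protected by sl, never freed */
seqlock_t sl;

unsigned long read_cached_iface(void)
{
	unsigned long seq;
	unsigned long iface;

	do {
		seq = read_seqbegin(&sl);       /* snapshot the sequence number */
		iface = cached_route.iface;     /* copy out the protected data */
	} while (read_seqretry(&sl, seq));      /* retry if a writer intervened */
	return iface;
}

Because a writer can intervene at any time, the body of the retry loop must tolerate reading inconsistent values; it simply must not act on them until read_seqretry() approves.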
Listing 9.12: Sequence-Locked Pre-BSD Routing Table Add/Delete (BUGGY!!!)
 1 int route_add(unsigned long addr, unsigned long interface)
 2 {
 3     struct route_entry *rep;
 4
 5     rep = malloc(sizeof(*rep));
 6     if (!rep)
 7         return -ENOMEM;
 8     rep->addr = addr;
 9     rep->iface = interface;
10     rep->re_freed = 0;
11     write_seqlock(&sl);
12     rep->re_next = route_list.re_next;
13     route_list.re_next = rep;
14     write_sequnlock(&sl);
15     return 0;
16 }
17
18 int route_del(unsigned long addr)
19 {
20     struct route_entry *rep;
21     struct route_entry **repp;
22
23     write_seqlock(&sl);
24     repp = &route_list.re_next;
25     for (;;) {
26         rep = *repp;
27         if (rep == NULL)
28             break;
29         if (rep->addr == addr) {
30             *repp = rep->re_next;
31             write_sequnlock(&sl);
32             smp_mb();
33             rep->re_freed = 1;
34             free(rep);
35             return 0;
36         }
37         repp = &rep->re_next;
38     }
39     write_sequnlock(&sl);
40     return -ENOENT;
41 }

It also performs better on the read-only workload, as can be seen in Figure 9.5, though its performance is still far from ideal. Worse yet, it suffers use-after-free failures. The problem is that the reader might encounter a segmentation violation due to accessing an already-freed structure before read_seqretry() has a chance to warn of the concurrent update.

Quick Quiz 9.20: Can this bug be fixed? In other words, can you use sequence locks as the only synchronization mechanism protecting a linked list supporting concurrent addition, deletion, and lookup?

Both the read-side and write-side critical sections of a sequence lock can be thought of as transactions, and sequence locking therefore can be thought of as a limited form of transactional memory, which will be discussed in Section 17.2. The limitations of sequence locking are: (1) Sequence locking restricts updates and (2) Sequence locking does not permit traversal of pointers to objects that might be freed by updaters. These limitations are of course overcome by transactional memory, but can also be overcome by combining other synchronization primitives with sequence locking.

Sequence locks allow writers to defer readers, but not vice versa. This can result in unfairness and even starvation in writer-heavy workloads.3 On the other hand, in the absence of writers, sequence-lock readers are reasonably fast and scale linearly. It is only human to want the best of both worlds: Fast readers without the possibility of read-side failure, let alone starvation. In addition, it would also be nice to overcome sequence locking's limitations with pointers. The following section presents a synchronization mechanism with exactly these properties.

3 Dmitry Vyukov describes one way to reduce (but, sadly, not eliminate) reader starvation: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/improved-lock-free-seqlock.

9.5 Read-Copy Update (RCU)

"Free" is a very good price!
Tom Peterson

All of the mechanisms discussed in the preceding sections used one of a number of approaches to defer specific actions until they may be carried out safely. The reference counters discussed in Section 9.2 use explicit counters to defer actions that could disturb readers, which results in read-side contention and thus poor scalability. The hazard pointers covered by Section 9.3 use implicit counters in the guise of per-thread lists of pointers. This avoids read-side contention, but requires readers to do stores and conditional branches, as well as either full memory barriers in read-side primitives or real-time-unfriendly inter-processor interrupts in update-side primitives.4 The sequence lock presented in Section 9.4 also avoids read-side contention, but does not protect pointer traversals and, like hazard pointers, requires either full memory barriers in read-side primitives, or inter-processor interrupts in update-side primitives. These schemes' shortcomings raise the question of whether it is possible to do better.

4 In some important special cases, this extra work can be avoided by using link counting as exemplified by the UnboundedQueue and ConcurrentHashMap data structures implemented in Folly open-source library (https://github.com/facebook/folly).

This section introduces read-copy update (RCU), which provides an API that allows readers to be associated with
regions in the source code, rather than with expensive updates to frequently updated shared data. The remainder of this section examines RCU from a number of different perspectives. Section 9.5.1 provides the classic introduction to RCU, Section 9.5.2 covers fundamental RCU concepts, Section 9.5.3 presents the Linux-kernel API, Section 9.5.4 introduces some common RCU use cases, and finally Section 9.5.5 covers recent work related to RCU.

9.5.1 Introduction to RCU

The approaches discussed in the preceding sections have provided good scalability but decidedly non-ideal performance for the Pre-BSD routing table. Therefore, in the spirit of "only those who have gone too far know how far you can go",5 we will go all the way, looking into algorithms in which concurrent readers execute the same sequence of assembly language instructions as would a single-threaded lookup, despite the presence of concurrent updates. Of course, this laudable goal might raise serious implementability questions, but we cannot possibly succeed if we don't even try!

5 With apologies to T. S. Eliot.

9.5.1.1 Minimal Insertion and Deletion

To minimize implementability concerns, we focus on a minimal data structure, which consists of a single global pointer that is either NULL or references a single structure. Minimal though it might be, this data structure is heavily used in production [RH18]. A classic approach for insertion is shown in Figure 9.6, which shows four states with time advancing from top to bottom. The first row shows the initial state, with gptr equal to NULL. In the second row, we have allocated a structure which is uninitialized, as indicated by the question marks. In the third row, we have initialized the structure. Finally, in the fourth and final row, we have updated gptr to reference the newly allocated and initialized element.

Figure 9.6: Insertion With Concurrent Readers

We might hope that this assignment to gptr could use a simple C-language assignment statement. Unfortunately, Section 4.3.4.1 dashes these hopes. Therefore, the updater cannot use a simple C-language assignment, but must instead use smp_store_release() as shown in the figure, or, as will be seen, rcu_assign_pointer().

Similarly, one might hope that readers could use a single C-language assignment to fetch the value of gptr, and be guaranteed to either get the old value of NULL or to get the newly installed pointer, but either way see a valid result. Unfortunately, Section 4.3.4.1 dashes these hopes as well. To obtain this guarantee, readers must instead use READ_ONCE(), or, as will be seen, rcu_dereference(). However, on most modern computer systems, each of these read-side primitives can be implemented with a single load instruction, exactly the instruction that would normally be used in single-threaded code.

Reviewing Figure 9.6 from the viewpoint of readers, in the first three states all readers see gptr having the value NULL. Upon entering the fourth state, some readers might see gptr still having the value NULL while others might see it referencing the newly inserted element, but after some time, all readers will see this new element. At all times, all readers will see gptr as containing a valid pointer. Therefore, it really is possible to add new data to linked data structures while allowing concurrent readers to execute the same sequence of machine instructions that is normally used in single-threaded code. This no-cost approach to concurrent reading provides excellent performance and scalability, and also is eminently suitable for real-time use.
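A hedged code sketch of the four states of Figure 9.6 follows. It assumes a struct route with ->addr and ->iface fields, the global pointer gptr used throughout this section, and the READ_ONCE() and smp_store_release() primitives from this book's userspace API shims; the insert_route() and read_route() function names are purely illustrative.

struct route {
	unsigned long addr;
	unsigned long iface;
};
struct route *gptr;                     /* state (1): initially NULL */

int insert_route(void)
{
	struct route *p;

	p = malloc(sizeof(*p));          /* state (2): pre-initialization garbage */
	if (!p)
		return -ENOMEM;
	p->addr = 42;                    /* state (3): initialize before publication */
	p->iface = 1;
	smp_store_release(&gptr, p);     /* state (4): publish the initialized structure */
	return 0;
}

long read_route(void)
{
	struct route *q;

	q = READ_ONCE(gptr);             /* single load, as in single-threaded code */
	return q ? (long)q->iface : -1;  /* sees NULL or a fully initialized structure */
}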
Figure 9.7: Deletion With Concurrent Readers

Insertion is of course quite useful, but sooner or later, it will also be necessary to delete data. As can be seen in Figure 9.7, the first step is easy. Again taking the lessons from Section 4.3.4.1 to heart, smp_store_release() is used to NULL the pointer, thus moving from the first row to the second in the figure. At this point, pre-existing readers see the old structure with ->addr of 42 and ->iface of 1, but new readers will see a NULL pointer, that is, concurrent readers can disagree on the state, as indicated by the "2 Versions" in the figure.

Quick Quiz 9.21: Why does Figure 9.7 use smp_store_release() given that it is storing a NULL pointer? Wouldn't WRITE_ONCE() work just as well in this case, given that there is no structure initialization to order against the store of the NULL pointer?

Quick Quiz 9.22: Readers running concurrently with each other and with the procedure outlined in Figure 9.7 can disagree on the value of gptr. Isn't that just a wee bit problematic???

We get back to a single version simply by waiting for all the pre-existing readers to complete, as shown in row 3. At that point, all the pre-existing readers are done, and no later reader has a path to the old data item, so there can

9.5.1.2 Core RCU API

The full Linux-kernel API is quite extensive, with more than one hundred API members. However, this section will confine itself to six core RCU API members, which suffices for the upcoming sections introducing RCU and covering its fundamentals. The full API is covered in Section 9.5.3.

Three members of the core APIs are used by readers. The rcu_read_lock() and rcu_read_unlock() functions delimit RCU read-side critical sections. These may be nested, so that one rcu_read_lock()–rcu_read_unlock() pair can be enclosed within another. In this case, the nested set of RCU read-side critical sections act as one large critical section covering the full extent of the nested set. The third read-side API member, rcu_dereference(), fetches an RCU-protected pointer. Conceptually, rcu_dereference() simply loads from memory, but we will see in Section 9.5.2.1 that rcu_dereference() must prevent the compiler and (in one case) the CPU from reordering its load with later memory operations that dereference this pointer.

Quick Quiz 9.23: What is an RCU-protected pointer?

The other three members of the core APIs are used by updaters. The synchronize_rcu() function implements the "wait for readers" operation from Figure 9.7. The call_rcu() function is the asynchronous counterpart of synchronize_rcu() by invoking the specified function after all pre-existing RCU readers have completed. Finally, the rcu_assign_pointer() macro is used to update an RCU-protected pointer. Conceptually, this is simply an assignment statement, but we will see in Section 9.5.2.1 that rcu_assign_pointer() must prevent the compiler and the CPU from reordering this assignment to precede any prior assignments used to initialize the pointed-to structure.
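As a hedged sketch of these core API members acting on the single-pointer data structure of Section 9.5.1.1 (the read_iface() and remove_route() names are hypothetical, and the updater is assumed to be serialized by an update-side lock or by data ownership):

long read_iface(void)
{
	struct route *p;
	long ret = -1;

	rcu_read_lock();                        /* begin read-side critical section */
	p = rcu_dereference(gptr);              /* subscribe to the current version */
	if (p)
		ret = (long)p->iface;
	rcu_read_unlock();                      /* end read-side critical section */
	return ret;
}

void remove_route(void)
{
	struct route *oldp = gptr;              /* updater-side access */

	rcu_assign_pointer(gptr, NULL);         /* unpublish the structure */
	synchronize_rcu();                      /* wait for pre-existing readers */
	free(oldp);                             /* now safe to free */
}

The asynchronous alternative would replace the last two lines with a call_rcu() invocation, which requires embedding an rcu_head field in struct route and supplying a callback that does the freeing.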
Quick Quiz 9.24: What does synchronize_rcu() do if it starts at about the same time as an rcu_read_lock()?

The core RCU API is summarized in Table 9.1 for easy reference. With that, we are ready to continue this introduction to RCU with the key RCU operation, waiting for readers.

9.5.1.3 Waiting for Readers

It is tempting to base the reader-waiting functionality of synchronize_rcu() and call_rcu() on a reference counter updated by rcu_read_lock() and rcu_read_unlock(), but Figure 5.1 in Chapter 5 shows that concurrent reference counting results in extreme overhead. This extreme overhead was confirmed in the specific case of reference counters in Figure 9.2 on page 130. Hazard pointers profoundly reduce this overhead, but, as we saw in Figure 9.3 on page 134, not to zero. Nevertheless, many RCU implementations make very careful cache-local use of counters.

A second approach observes that memory synchronization is expensive, and therefore uses registers instead, namely each CPU's or thread's program counter (PC), thus imposing no overhead on readers, at least in the absence of concurrent updates. The updater polls each relevant PC, and if that PC is not within read-side code, then the corresponding CPU or thread is within a quiescent state, in turn signaling the completion of any reader that might have access to the newly removed data element. Once all CPUs' or threads' PCs have been observed to be outside of any reader, the grace period has completed. Please note that this approach poses some serious challenges, including memory ordering, functions that are sometimes invoked from readers, and ever-exciting code-motion optimizations. Nevertheless, this approach is said to be used in production [Ash15].

A third approach is to simply wait for a fixed period of time that is long enough to comfortably exceed the lifetime of any reasonable reader [Jac93, Joh95]. This can work quite well in hard real-time systems [RLPB18], but in less exotic settings, Murphy says that it is critically important to be prepared even for unreasonably long-lived readers. To see this, consider the consequences of failing to do so: A data item will be freed while the unreasonable reader is still referencing it, and that item might well be immediately reallocated, possibly even as a data item of some other type. The unreasonable reader and the unwitting reallocator would then be attempting to use the same memory for two very different purposes. The ensuing mess will at best be exceedingly difficult to debug.

A fourth approach is to wait forever, secure in the knowledge that doing so will accommodate even the most unreasonable reader. This approach is also called "leaking memory", and has a bad reputation due to the fact that memory leaks often require untimely and inconvenient reboots. Nevertheless, this is a viable strategy when the update rate and the uptime are both sharply bounded. For example, this approach could work well in a high-availability cluster where systems were periodically crashed in order to ensure that cluster really remained highly available.6 Leaking the memory is also a viable strategy in environments having garbage collectors, in which case the garbage collector can be thought of as plugging the leak [KL80]. However, if your environment lacks a garbage collector, read on!

6 The program that forces the periodic crashing is sometimes known as a "chaos monkey": https://netflix.github.io/chaosmonkey/. However, it might also be a mistake to neglect chaos caused by systems running for too long.
A fifth approach avoids the periodic crashes in favor of periodically "stopping the world", as exemplified by the traditional stop-the-world garbage collector. This approach was also heavily used during the decades before ubiquitous connectivity, when it was common practice to power systems off at the end of each working day. However, in today's always-connected always-on world, stopping the world can gravely degrade response times, which has been one motivation for the development of concurrent garbage collectors [BCR03]. Furthermore, although we need all pre-existing readers to complete, we do not need them all to complete at the same time.

This observation leads to the sixth approach, which is stopping one CPU or thread at a time. This approach has the advantage of not degrading reader response times at all, let alone gravely. Furthermore, numerous applications already have states (termed quiescent states) that can be reached only after all pre-existing readers are done. In transaction-processing systems, the time between a pair of successive transactions might be a quiescent state. In reactive systems, the state between a pair of successive events might be a quiescent state. Within non-preemptive operating-systems kernels, a context switch can be a quiescent state [MS98a]. Either way, once all CPUs and/or threads have passed through a quiescent state, the system is said to have completed a grace period, at which point all readers in existence at the start of that grace period are guaranteed to have completed. As a result, it is also guaranteed to be safe to free any removed data items that were removed prior to the start of that grace period.7

7 It is possible to do much more with RCU than simply defer reclamation of memory, but deferred reclamation is RCU's most common use case, and is therefore an excellent place to start.

Within a non-preemptive operating-system kernel, for a context switch to be a valid quiescent state, readers must be prohibited from blocking while referencing a given instance of the data structure obtained via the gptr pointer shown in Figures 9.6 and 9.7. This no-blocking constraint is consistent with similar constraints on pure spinlocks, where a CPU is forbidden from blocking while holding a spinlock. Without this constraint, all CPUs might be consumed by threads spinning attempting to acquire a spinlock held by a blocked thread. The spinning threads will not relinquish their CPUs until they acquire the lock, but the thread holding the lock cannot possibly release it until one of the spinning threads relinquishes a CPU. This is a classic deadlock situation, and this deadlock is avoided by forbidding blocking while holding a spinlock.

Figure 9.8: QSBR: Waiting for Pre-Existing Readers

Again, this same constraint is imposed on reader threads dereferencing gptr: Such threads are not allowed to block until after they are done using the pointed-to data item.

Returning to the second row of Figure 9.7, where the updater has just completed executing the smp_store_release(), imagine that CPU 0 executes a context switch. Because readers are not permitted to block while traversing the linked list, we are guaranteed that all prior readers that might have been running on CPU 0 will have completed. Extending this line of reasoning to the other CPUs, once each CPU has been observed executing a context switch, we are guaranteed that all prior readers have completed, and that there are no longer any reader threads referencing the newly removed data element. The updater can then safely free that data element, resulting in the state shown at the bottom of Figure 9.7.

This approach is termed quiescent-state-based reclamation (QSBR) [HMB06]. A QSBR schematic is shown in Figure 9.8, with time advancing from the top of the figure to the bottom. The cyan-colored boxes depict RCU read-side critical sections, each of which begins with rcu_read_lock() and ends with rcu_read_unlock(). CPU 1 does the WRITE_ONCE() that removes the current data item (presumably having previously read the pointer
value and availed itself of appropriate synchronization), then waits for readers. This wait operation results in an immediate context switch, which is a quiescent state (denoted by the pink circle), which in turn means that all prior reads on CPU 1 have completed. Next, CPU 2 does a context switch, so that all readers on CPUs 1 and 2 are now known to have completed. Finally, CPU 3 does a context switch. At this point, all readers throughout the entire system are known to have completed, so the grace period ends, permitting synchronize_rcu() to return to its caller, in turn permitting CPU 1 to free the old data

Listing 9.13: Insertion and Deletion With Concurrent Readers
 1 struct route *gptr;
 2
 3 int access_route(int (*f)(struct route *rp))
 4 {
 5     int ret = -1;
 6     struct route *rp;
 7
 8     rcu_read_lock();
 9     rp = rcu_dereference(gptr);
10     if (rp)
11         ret = f(rp);
12     rcu_read_unlock();
13     return ret;
14 }
15
2. Trip an assertion if the returned pointer is non-NULL.
Linux kernel. RCU has enjoyed heavy use both prior to and since its acceptance in the Linux kernel, as discussed in Section 9.5.5. It is therefore safe to say that RCU enjoys wide practical applicability.

The minimal example discussed in this section is a good introduction to RCU. However, effective use of RCU often requires that you think differently about your problem. It is therefore useful to examine RCU's fundamentals, a task taken up by the following section.

9.5.2 RCU Fundamentals

This section re-examines the ground covered in the previous section, but independent of any particular example or use case. People who prefer to live their lives very close to the actual code may wish to skip the underlying fundamentals presented in this section.

RCU is made up of three fundamental mechanisms, the first being used for insertion, the second being used for deletion, and the third being used to allow readers to tolerate concurrent insertions and deletions. Section 9.5.2.1 describes the publish-subscribe mechanism used for insertion, Section 9.5.2.2 describes how waiting for pre-existing RCU readers enables deletion, and Section 9.5.2.3 discusses how maintaining multiple versions of recently updated objects permits concurrent insertions and deletions. Finally, Section 9.5.2.4 summarizes RCU fundamentals.

9.5.2.1 Publish-Subscribe Mechanism

Because RCU readers are not excluded by RCU updaters, an RCU-protected data structure might change while a reader accesses it. The accessed data item might be moved, removed, or replaced. Because the data structure does not "hold still" for the reader, each reader's access can be thought of as subscribing to the current version of the RCU-protected data item. For their part, updaters can be thought of as publishing new versions.

Unfortunately, as laid out in Section 4.3.4.1 and reiterated in Section 9.5.1.1, it is unwise to use plain accesses for these publication and subscription operations. It is instead necessary to inform both the compiler and the CPU of the need for care, as can be seen from Figure 9.10, which illustrates interactions between concurrent executions of ins_route() (and its caller) and read_gptr() from Listing 9.13.

Figure 9.10: Publication/Subscription Constraints

The ins_route() column from Figure 9.10 shows ins_route()'s caller allocating a new route structure, which then contains pre-initialization garbage. The caller then initializes the newly allocated structure, and then invokes ins_route() to publish a pointer to the new route structure. Publication does not affect the contents of the structure, which therefore remain valid after publication.

The access_route() column from this same figure shows the pointer being subscribed to and dereferenced. This dereference operation absolutely must see a valid route structure rather than pre-initialization garbage because referencing garbage could result in memory corruption, crashes, and hangs. As noted earlier, avoiding such garbage means that the publish and subscribe operations must inform both the compiler and the CPU of the need to maintain the needed ordering.

Publication is carried out by rcu_assign_pointer(), which ensures that ins_route()'s caller's initialization is ordered before the actual publication operation's store of the pointer. In addition, rcu_assign_pointer() must be atomic in the sense that concurrent readers see either the old value of the pointer or the new value of the pointer, but not some mash-up of these two values. These requirements are met by the C11 store-release operation, and in fact in the Linux kernel, rcu_assign_pointer() is defined in terms of smp_store_release(), which is similar to C11 store-release.

Note that if concurrent updates are required, some sort of synchronization mechanism will be required to mediate among multiple concurrent rcu_assign_pointer() calls on the same pointer. In the Linux kernel, locking
is the mechanism of choice, but pretty much any synchronization mechanism may be used. An example of a particularly lightweight synchronization mechanism is Chapter 8's data ownership: If each pointer is owned by a particular thread, then that thread may execute rcu_assign_pointer() on that pointer with no additional synchronization overhead.

Quick Quiz 9.30: Wouldn't use of data ownership for RCU updaters mean that the updates could use exactly the same sequence of instructions as would the corresponding single-threaded code?

Subscription is carried out by rcu_dereference(), which orders the subscription operation's load from the pointer before the dereference. Similar to rcu_assign_pointer(), rcu_dereference() must be atomic in the sense that the value loaded must be that from a single store, for example, the compiler must not tear the load.9 Unfortunately, compiler support for rcu_dereference() is at best a work in progress [MWB+17, MRP+17, BM18]. In the meantime, the Linux kernel relies on volatile loads, the details of the various CPU architectures, coding restrictions [McK14e], and, on DEC Alpha [Cor02], a memory-barrier instruction. However, on other architectures, rcu_dereference() typically emits a single load instruction, just as would the equivalent single-threaded code. The coding restrictions are described in more detail in Section 15.3.2, however, the common case of field selection ("->") works quite well. Software that does not require the ultimate in read-side performance can instead use C11 acquire loads, which provide the needed ordering and more, albeit at a cost. It is hoped that lighter-weight compiler support for rcu_dereference() will appear in due course.

9 That is, the compiler must not break the load into multiple smaller loads, as described under "load tearing" in Section 4.3.4.1.

In short, use of rcu_assign_pointer() for publishing pointers and use of rcu_dereference() for subscribing to them successfully avoids the "Not OK" garbage loads depicted in Figure 9.10. These two primitives can therefore be used to add new data to linked structures without disrupting concurrent readers.

Quick Quiz 9.31: But suppose that updaters are adding and removing multiple data items from a linked list while a reader is iterating over that same list. Specifically, suppose that a list initially contains elements A, B, and C, and that an updater removes element A and then adds a new element D at the end of the list. The reader might well see {A, B, C, D}, when that sequence of elements never actually ever existed! In what alternate universe would that qualify as "not disrupting concurrent readers"???

Adding data to a linked structure without disrupting readers is a good thing, as are the cases where this can be done with no added read-side cost compared to single-threaded readers. However, in most cases it is also necessary to remove data, and this is the subject of the next section.

9.5.2.2 Wait For Pre-Existing RCU Readers

In its most basic form, RCU is a way of waiting for things to finish. Of course, there are a great many other ways of waiting for things to finish, including reference counts, reader-writer locks, events, and so on. The great advantage of RCU is that it can wait for each of (say) 20,000 different things without having to explicitly track each and every one of them, and without having to worry about the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes using explicit tracking.

In RCU's case, each of the things waited on is called an RCU read-side critical section. As noted in Table 9.1, an RCU read-side critical section starts with an rcu_read_lock() primitive, and ends with a corresponding rcu_read_unlock() primitive. RCU read-side critical sections can be nested, and may contain pretty much any code, as long as that code does not contain a quiescent state. For example, within the Linux kernel, it is illegal to sleep within an RCU read-side critical section because a context switch is a quiescent state.10 If you abide by these conventions, you can use RCU to wait for any pre-existing RCU read-side critical section to complete, and synchronize_rcu() uses indirect means to do the actual waiting [DMS+12, McK13].

10 However, a special form of RCU called SRCU [McK06] does permit general sleeping in SRCU read-side critical sections.

The relationship between an RCU read-side critical section and a later RCU grace period is an if-then relationship, as illustrated by Figure 9.11. If any portion of a given critical section precedes the beginning of a given grace period, then RCU guarantees that all of that critical section will precede the end of that grace period. In the figure, P0()'s access to x precedes P1()'s access to this same variable, and thus also precedes the grace period generated by P1()'s call to synchronize_rcu(). It is therefore guaranteed that P0()'s access to y will precede P1()'s access. In this case, if r1's final value is 0, then r2's final value is guaranteed to also be 0.
[Figure 9.11: P0() brackets r1 = x and r2 = y with rcu_read_lock() and rcu_read_unlock(), while P1() executes x = 1, synchronize_rcu(), and y = 1. Given this ordering of P0()'s read from x before P1()'s write, RCU guarantees this ordering of P0()'s read from y before P1()'s write.]
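The guarantee illustrated by Figure 9.11 can be written out as a short C sketch. This rendering of the figure's litmus test is only for illustration; the function names p0() and p1() and the variable declarations are added here and are not taken from the book's CodeSamples.

int x, y;

void p0(void)                /* reader */
{
	int r1, r2;

	rcu_read_lock();
	r1 = READ_ONCE(x);   /* If this read precedes P1()'s store to x ... */
	r2 = READ_ONCE(y);   /* ... then this read precedes P1()'s store to y. */
	rcu_read_unlock();
}

void p1(void)                /* updater */
{
	WRITE_ONCE(x, 1);
	synchronize_rcu();   /* Waits for any reader that might have seen x == 0. */
	WRITE_ONCE(y, 1);
}

In other words, if r1 ends up equal to 0, the grace period enforced by synchronize_rcu() guarantees that r2 also ends up equal to 0.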
3. Clean up, for example, free the element that was replaced above.

This more abstract procedure requires a more abstract diagram than Figures 9.11–9.13, which are specific to a particular litmus test. After all, an RCU implementation must work correctly regardless of the form of the RCU updates and the RCU read-side critical sections. Figure 9.14 fills this need, showing the four possible scenarios, with time advancing from top to bottom within each scenario. Within each scenario, an RCU reader is represented by the left-hand stack of boxes and RCU updater by the right-hand stack.

[Figure 9.14: Summary of RCU Grace-Period Ordering Guarantees. Among the scenarios shown: (3) Reader within grace period, with rcu_read_lock() and rcu_read_unlock() positioned against the updater's Remove, synchronize_rcu(), and Free Old Memory steps, and (4) Grace period within reader, marked BUG!!!]

In the first scenario, the reader starts execution before the updater starts the removal, so it is possible that this reader has a reference to the removed data element. Therefore, the updater must not free this element until after the reader completes. In the second scenario, the reader does not start execution until after the removal has completed. The [...]
This section discusses how RCU accommodates synchronization-free readers by maintaining multiple versions of data. Because these synchronization-free readers provide very weak temporal synchronization, RCU users compensate via spatial synchronization. Spatial synchronization was discussed in Chapter 6, and is heavily used in practice to obtain good performance and scalability. In this section, spatial synchronization will be used to attain a weak (but useful) form of correctness as well as excellent performance and scalability.

Figure 9.7 in Section 9.5.1.1 showed a simple variant of spatial synchronization, in which different readers running concurrently with del_route() (see Listing 9.13) might see the old route structure or an empty list, but either way get a valid result. Of course, a closer look at Figure 9.6 shows that calls to ins_route() can also result in concurrent readers seeing different versions: Either the initial empty list or the newly inserted route structure. Note that both reference counting (Section 9.2) and hazard pointers (Section 9.3) can also cause concurrent readers to see different versions, but RCU's lightweight readers make this more likely.

[Figure 9.15: Multiple RCU Data-Structure Versions]

However, maintaining multiple weakly consistent versions can provide some surprises. For example, consider Figure 9.15, in which a reader is traversing a linked list that is concurrently updated.11 In the first row of the figure, the reader is referencing data item A, and in the second row, it advances to B, having thus far seen A followed by B. In the third row, an updater removes element A and in the fourth row an updater adds element E to the end of the list. In the fifth and final row, the reader completes its traversal, having seen elements A through E.

11 RCU linked-list APIs may be found in Section 9.5.3.

Except that there was no time at which such a list existed. This situation might be even more surprising than that shown in Figure 9.7, in which different concurrent readers see different versions. In contrast, in Figure 9.15 the reader sees a version that never actually existed!

One way to resolve this strange situation is via weaker semantics. A reader traversal must encounter any data item that was present during the full traversal (B, C, and D), and might or might not encounter data items that were present for only part of the traversal (A and E). Therefore, in this particular case, it is perfectly legitimate for the reader traversal to encounter all five elements. If this outcome is problematic, another way to resolve this situation
is through use of stronger synchronization mechanisms, such as reader-writer locking, or clever use of timestamps and versioning, as discussed in Section 9.5.4.11. Of course, stronger mechanisms will be more expensive, but then again the engineering life is all about choices and tradeoffs.

Strange though this situation might seem, it is entirely consistent with the real world. As we saw in Section 3.2, the finite speed of light cannot be ignored within a computer system, and it most certainly cannot be ignored outside of this system. This in turn means that any data within the system representing state in the real world outside of the system is always and forever outdated, and thus inconsistent with the real world. Therefore, it is quite possible that the sequence {A, B, C, D, E} occurred in the real world, but due to speed-of-light delays was never represented in the computer system's memory. In this case, the reader's surprising traversal would correctly reflect reality.

As a result, algorithms operating on real-world data must account for inconsistent data, either by tolerating inconsistencies or by taking steps to exclude or reject them. In many cases, these algorithms are also perfectly capable of dealing with inconsistencies within the system.

The pre-BSD packet routing example laid out in Section 9.1 is a case in point. The contents of a routing list are set by routing protocols, and these protocols feature significant delays (seconds or even minutes) to avoid routing instabilities. Therefore, once a routing update reaches a given system, it might well have been sending packets the wrong way for quite some time. Sending a few more packets the wrong way for the few microseconds during which the update is in flight is clearly not a problem because the same higher-level protocol actions that deal with delayed routing updates will also deal with internal inconsistencies.

Nor is Internet routing the only situation tolerating inconsistencies. To repeat, any algorithm in which data within a system tracks outside-of-system state must tolerate inconsistencies, which includes security policies (often set by committees of humans), storage configuration, and WiFi access points, to say nothing of removable hardware such as microphones, headsets, cameras, mice, printers, and much else besides. Furthermore, the large number of Linux-kernel RCU API uses shown in Figure 9.9, combined with the Linux kernel's heavy use of reference counting and with increasing use of hazard pointers in other projects, demonstrates that tolerance for such inconsistencies is more common than one might imagine.

One root cause of this common-case tolerance of inconsistencies is that single-item lookups are much more common in practice than are full-data-structure traversals. After all, full-data-structure traversals are much more expensive than single-item lookups, so developers are motivated to avoid such traversals. Not only are concurrent updates less likely to affect a single-item lookup than they are a full traversal, but it is also the case that an isolated single-item lookup has no way of detecting such inconsistencies. As a result, in the common case, such inconsistencies are not just tolerable, they are in fact invisible.

In such cases, RCU readers can be considered to be fully ordered with updaters, despite the fact that these readers might be executing the exact same sequence of machine instructions that would be executed by a single-threaded program. For example, referring back to Listing 9.13 on page 142, suppose that each reader thread invokes access_route() exactly once during its lifetime, and that there is no other communication among reader and updater threads. Then each invocation of access_route() can be ordered after the ins_route() invocation that produced the route structure accessed by line 11 of the listing in access_route() and ordered before any subsequent ins_route() or del_route() invocation.

In summary, maintaining multiple versions is exactly what enables the extremely low overheads of RCU readers, and as noted earlier, many algorithms are unfazed by multiple versions. However, there are algorithms that absolutely cannot handle multiple versions. There are techniques for adapting such algorithms to RCU [McK04], for example, the use of sequence locking described in Section 13.4.2.

Exercises  These examples assumed that a mutex was held across the entire update operation, which would mean that there could be at most two versions of the list active at a given time.

Quick Quiz 9.35: How would you modify the deletion example to permit more than two versions of the list to be active?

Quick Quiz 9.36: How many RCU versions of a given list can be active at any given time?
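One concrete way to obtain the stronger semantics mentioned above is to pair RCU readers with a sequence lock, so that a reader retries its traversal whenever an update ran concurrently. The following is only a rough sketch of that combination; the sequence lock is assumed to guard the list in question, and do_traversal() and do_update() are hypothetical helpers standing in for the actual list operations.

DEFINE_SEQLOCK(list_seq);

bool consistent_lookup(void)
{
	unsigned int seq;
	bool found;

	do {
		seq = read_seqbegin(&list_seq);
		rcu_read_lock();
		found = do_traversal();            /* hypothetical RCU list traversal */
		rcu_read_unlock();
	} while (read_seqretry(&list_seq, seq));   /* retry if an update intervened */
	return found;
}

void consistent_update(void)
{
	write_seqlock(&list_seq);
	do_update();                               /* hypothetical list update */
	write_sequnlock(&list_seq);
}

Of course, and as noted above, this sort of stronger mechanism has a cost, in this case read-side retries whenever readers and updaters race.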
9.5.2.4 Summary of RCU Fundamentals

This section has described the three fundamental components of RCU-based algorithms:

1. A publish-subscribe mechanism for adding new data featuring rcu_assign_pointer() for update-side publication and rcu_dereference() for read-side subscription,

2. A way of waiting for pre-existing RCU readers to finish based on readers being delimited by rcu_read_lock() and rcu_read_unlock() on the one hand and updaters waiting via synchronize_rcu() or call_rcu() on the other (see Section 15.4.3 for a formal description), and

3. A discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.

Quick Quiz 9.37: How can RCU updaters possibly delay RCU readers, given that neither rcu_read_lock() nor rcu_read_unlock() spin or block?

These three RCU components allow data to be updated in the face of concurrent readers that might be executing the same sequence of machine instructions that would be used by a reader in a single-threaded implementation. These RCU components can be combined in different ways to implement a surprising variety of different types of RCU-based algorithms, a number of which are presented in Section 9.5.4. However, it is usually better to work at higher levels of abstraction. To this end, the next section describes the Linux-kernel API, which includes simple data structures such as lists.

9.5.3 RCU Linux-Kernel API

This section looks at RCU from the viewpoint of its Linux-kernel API.12 Section 9.5.3.2 presents RCU's wait-to-finish APIs, Section 9.5.3.3 presents RCU's publish-subscribe and version-maintenance APIs, Section 9.5.3.4 presents RCU's list-processing APIs, Section 9.5.3.5 presents RCU's diagnostic APIs, and Section 9.5.3.6 describes in which contexts RCU's various APIs may be used. Finally, Section 9.5.3.7 presents concluding remarks.

12 Userspace RCU's API is documented elsewhere [MDJ13c].

Readers who are not excited about kernel internals may wish to skip ahead to Section 9.5.4 on page 160, but preferably after reviewing the next section covering software-engineering considerations.

9.5.3.1 RCU API and Software Engineering

Readers who have looked ahead to Tables 9.2, 9.3, 9.4, and 9.5 might have noted that the full list of Linux-kernel APIs sports more than 100 members. This is in sharp (and perhaps dismaying) contrast to the mere six API members shown in Table 9.1. This situation clearly raises the question "Why so many???"

This question is answered more thoroughly in the following sections, but in the meantime the rest of this section summarizes the motivations.

There is a wise old saying to the effect of "To err is human." This means that the purpose of a significant fraction of the RCU API is to provide diagnostics, most notably in Table 9.5, but elsewhere as well.

Important causes of human error are the limits of the human brain, for example, the limited capacity of short-term memory. The toy examples shown in this book do not stress these limits. This is out of necessity: Many readers push their cognitive limits while learning new material, so the examples need to be kept simple.

These examples therefore keep rcu_dereference() invocations in the same function as the enclosing rcu_read_lock() and rcu_read_unlock() calls. In contrast, real-world software must frequently invoke these API members from different functions, and even from different translation units. The Linux kernel RCU API has therefore expanded to accommodate lockdep, which allows rcu_dereference() and friends to complain if they are not protected by rcu_read_lock(). Linux-kernel RCU also checks for some double-free errors, infinite loops in RCU read-side critical sections, and attempts to invoke quiescent states within RCU read-side critical sections.

Another way that real-world software accommodates the limits of human cognition is through abstraction. The Linux-kernel API therefore includes members that operate on lists in addition to the pointer-oriented core API of Table 9.1. The Linux kernel itself also provides RCU-protected hash tables and search trees.

Operating-systems kernels such as Linux operate near the bottom of the "iron triangle" of the software stack shown in Figure 2.3, where performance is critically important. There are thus specialized variants of a number of RCU APIs for use on fastpaths, for example, as discussed in Section 9.5.3.3, RCU_INIT_POINTER() may be used in place of rcu_assign_pointer() in cases where the
RCU-protected pointer is being assigned to NULL or when that pointer is not yet accessible by readers. Use of RCU_INIT_POINTER() allows the compiler more leeway in selecting instructions and carrying out optimizations, thus increasing performance.

On the other hand, when used incorrectly RCU_INIT_POINTER() can result in silent memory corruption, so please be careful! Yes, in some cases, the kernel can check for inappropriate use of RCU API members from a given kernel context, but the constraints of RCU_INIT_POINTER() use are not yet checkable.

Finally, within the Linux kernel, the aforementioned limits of human cognition are compounded by the variety and severity of workloads running on Linux. As of v5.16, this has given rise to no fewer than five flavors of RCU, each designed to provide different performance, scalability, response-time, and energy efficiency tradeoffs to RCU readers and writers. These RCU flavors are the subject of the next section.

9.5.3.2 RCU has a Family of Wait-to-Finish APIs

The most straightforward answer to "what is RCU" is that RCU is an API. For example, the RCU implementation used in the Linux kernel is summarized by Table 9.2, which shows the wait-for-readers portions of the RCU, "sleepable" RCU (SRCU), Tasks RCU, and generic APIs, respectively, and by Table 9.3, which shows the publish-subscribe portions of the API [McK19b].13

13 This citation covers v4.20 and later. Documentation for earlier versions of the Linux-kernel RCU API may be found elsewhere [McK08e, McK14f].

If you are new to RCU, you might consider focusing on just one of the columns in Table 9.2, each of which summarizes one member of the Linux kernel's RCU API family. For example, if you are primarily interested in understanding how RCU is used in the Linux kernel, "RCU" would be the place to start, as it is used most frequently. On the other hand, if you want to understand RCU for its own sake, "Tasks RCU" has the simplest API. You can always come back for the other columns later. If you are already familiar with RCU, these tables can serve as a useful reference.

Quick Quiz 9.38: Why do some of the cells in Table 9.2 have exclamation marks ("!")?

The "RCU" column corresponds to the consolidation of the three Linux-kernel RCU implementations [McK19c, McK19a], in which RCU read-side critical sections start with rcu_read_lock(), rcu_read_lock_bh(), or rcu_read_lock_sched() and end with rcu_read_unlock(), rcu_read_unlock_bh(), or rcu_read_unlock_sched(), respectively. Any region of code that disables bottom halves, interrupts, or preemption also acts as an RCU read-side critical section. RCU read-side critical sections may be nested. The corresponding synchronous update-side primitives, synchronize_rcu() and synchronize_rcu_expedited(), along with their synonym synchronize_net(), wait for any type of currently executing RCU read-side critical sections to complete. The length of this wait is known as a "grace period", and synchronize_rcu_expedited() is designed to reduce grace-period latency at the expense of increased CPU overhead and IPIs. The asynchronous update-side primitive, call_rcu(), invokes a specified function with a specified argument after a subsequent grace period. For example, call_rcu(p,f); will result in the "RCU callback" f(p) being invoked after a subsequent grace period. There are situations, such as when unloading a Linux-kernel module that uses call_rcu(), when it is necessary to wait for all outstanding RCU callbacks to complete [McK07e]. The rcu_barrier() primitive does this job.

Quick Quiz 9.39: How do you prevent a huge number of RCU read-side critical sections from indefinitely blocking a synchronize_rcu() invocation?

Quick Quiz 9.40: The synchronize_rcu() API waits for all pre-existing interrupt handlers to complete, right?

Finally, RCU may be used to provide type-safe memory [GC96], as described in Section 9.5.4.5. In the context of RCU, type-safe memory guarantees that a given data element will not change type during any RCU read-side critical section that accesses it. To make use of RCU-based type-safe memory, pass SLAB_TYPESAFE_BY_RCU to kmem_cache_create().

The "SRCU" column in Table 9.2 displays a specialized RCU API that permits general sleeping in SRCU read-side critical sections [McK06] delimited by srcu_read_lock() and srcu_read_unlock(). However, unlike RCU, SRCU's srcu_read_lock() returns a value that must be passed into the corresponding srcu_read_unlock(). This difference is due to the fact that the SRCU user allocates an srcu_struct for each distinct SRCU usage, so that there is no convenient place to store a per-task reader-nesting count. (Keep in mind that although the Linux kernel provides dynamically allocated per-CPU storage, there is not yet dynamically allocated per-task storage.)
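The following minimal sketch illustrates the srcu_read_lock() return value being handed to the matching srcu_read_unlock(); the srcu_struct name and the reader and updater bodies are invented for this illustration.

DEFINE_STATIC_SRCU(my_srcu);

void my_reader(void)
{
	int idx;

	idx = srcu_read_lock(&my_srcu);
	do_something_that_may_sleep();     /* hypothetical; sleeping is legal here */
	srcu_read_unlock(&my_srcu, idx);
}

void my_updater(void)
{
	remove_element();                  /* hypothetical removal */
	synchronize_srcu(&my_srcu);        /* waits only for my_srcu readers */
	free_removed_element();            /* hypothetical reclamation */
}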
v2022.09.25a
CHAPTER 9. DEFERRED PROCESSING
v2022.09.25a
Table 9.2: RCU Wait-to-Finish APIs

RCU: Original
  Read-side critical-section markers: rcu_read_lock() !, rcu_read_unlock() !, rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), rcu_read_unlock_sched() (plus anything disabling bottom halves, preemption, or interrupts)
  Update-side primitives (synchronous): synchronize_rcu(), synchronize_net(), synchronize_rcu_expedited()
  Update-side primitives (asynchronous / callback): call_rcu() !
  Update-side primitives (wait for callbacks): rcu_barrier()
  Update-side primitives (initiate / wait): get_state_synchronize_rcu(), cond_synchronize_rcu()
  Update-side primitives (free memory): kfree_rcu()
  Type-safe memory: SLAB_TYPESAFE_BY_RCU
  Read-side constraints: No blocking (only preemption)
  Read-side overhead: CPU-local accesses (barrier() on PREEMPT=n)
  Asynchronous update-side overhead: sub-microsecond
  Grace-period latency: 10s of milliseconds
  Expedited grace-period latency: 10s of microseconds

SRCU: Sleeping readers
  Initialization and cleanup: DEFINE_SRCU(), DEFINE_STATIC_SRCU(), init_srcu_struct(), cleanup_srcu_struct()
  Read-side critical-section markers: srcu_read_lock(), srcu_read_unlock()
  Update-side primitives (synchronous): synchronize_srcu(), synchronize_srcu_expedited()
  Update-side primitives (asynchronous / callback): call_srcu()
  Update-side primitives (wait for callbacks): srcu_barrier()
  Read-side constraints: No synchronize_srcu() with same srcu_struct
  Read-side overhead: Simple instructions, memory barriers
  Asynchronous update-side overhead: sub-microsecond
  Grace-period latency: Milliseconds
  Expedited grace-period latency: Microseconds

Tasks RCU: Free tracing trampolines
  Read-side critical-section markers: Voluntary context switch
  Update-side primitives (synchronous): synchronize_rcu_tasks()
  Update-side primitives (asynchronous / callback): call_rcu_tasks()
  Update-side primitives (wait for callbacks): rcu_barrier_tasks()
  Read-side constraints: No voluntary context switch
  Read-side overhead: Free
  Asynchronous update-side overhead: sub-microsecond
  Grace-period latency: Seconds
  Expedited grace-period latency: N/A

Tasks RCU Rude: Free idle-task tracing trampolines
  Read-side critical-section markers: Voluntary context switch and preempt-enable regions of code
  Update-side primitives (synchronous): synchronize_rcu_tasks_rude()
  Update-side primitives (asynchronous / callback): call_rcu_tasks_rude()
  Update-side primitives (wait for callbacks): rcu_barrier_tasks_rude()
  Read-side constraints: Neither blocking nor preemption
  Read-side overhead: CPU-local accesses (free on PREEMPT=n)
  Asynchronous update-side overhead: sub-microsecond
  Grace-period latency: Milliseconds
  Expedited grace-period latency: N/A

Tasks RCU Trace: Protect sleepable BPF programs
  Read-side critical-section markers: rcu_read_lock_trace(), rcu_read_unlock_trace()
  Update-side primitives (synchronous): synchronize_rcu_tasks_trace()
  Update-side primitives (asynchronous / callback): call_rcu_tasks_trace()
  Update-side primitives (wait for callbacks): rcu_barrier_tasks_trace()
  Read-side constraints: No RCU Tasks Trace grace period
  Read-side overhead: CPU-local accesses
  Asynchronous update-side overhead: sub-microsecond
  Grace-period latency: 10s of milliseconds
  Expedited grace-period latency: N/A
A given srcu_struct structure may be defined as a global variable with DEFINE_SRCU() if the structure must be used in multiple translation units, or with DEFINE_STATIC_SRCU() otherwise. For example, DEFINE_SRCU(my_srcu) would create a global variable named my_srcu that could be used by any file in the program. Alternatively, an srcu_struct structure may be either an on-stack variable or a dynamically allocated region of memory. In both of these non-global-variable cases, the memory must be initialized using init_srcu_struct() prior to its first use and cleaned up using cleanup_srcu_struct() after its last use (but before the underlying storage disappears).

However they are created, these distinct srcu_struct structures prevent SRCU read-side critical sections from blocking unrelated synchronize_srcu() and synchronize_srcu_expedited() invocations. Of course, use of either synchronize_srcu() or synchronize_srcu_expedited() within an SRCU read-side critical section can result in self-deadlock, so should be avoided. As with RCU, SRCU's synchronize_srcu_expedited() decreases grace-period latency compared to synchronize_srcu(), but at the expense of increased CPU overhead.

Quick Quiz 9.41: Under what conditions can synchronize_srcu() be safely used within an SRCU read-side critical section?

Similar to normal RCU, self-deadlock can be avoided using the asynchronous call_srcu() function. However, special care must be taken when using call_srcu() because a single task could register SRCU callbacks very quickly. Given that SRCU allows readers to block for arbitrary periods of time, this could consume an arbitrarily large quantity of memory. In contrast, given the synchronous synchronize_srcu() interface, a given task must finish waiting for a given grace period before it can start waiting for the next one.

Also similar to RCU, there is an srcu_barrier() function that waits for all prior call_srcu() callbacks to be invoked.

In other words, SRCU compensates for its extremely weak forward-progress guarantees by permitting the developer to restrict its scope.

The "Tasks RCU" column in Table 9.2 displays a specialized RCU API that mediates freeing of the trampolines used in Linux-kernel tracing. These trampolines are used to transfer control from a point in the code being traced to the code doing the actual tracing. It is of course necessary to ensure that all code executing within a given trampoline has finished before freeing that trampoline.

Changes to the code being traced are typically limited to a single jump or call instruction, and thus cannot accommodate the sequence of code required to implement rcu_read_lock() and rcu_read_unlock(). Nor can the trampoline contain these calls to rcu_read_lock() and rcu_read_unlock(). To see this, consider a CPU that is just about to start executing a given trampoline. Because it has not yet executed the rcu_read_lock(), that trampoline could be freed at any time, which would come as a fatal surprise to this CPU. Therefore, trampolines cannot be protected by synchronization primitives executed in either the traced code or in the trampoline itself. Which does raise the question of exactly how the trampoline is to be protected.

The key to answering this question is to note that trampoline code never contains code that either directly or indirectly does a voluntary context switch. This code might be preempted, but it will never directly or indirectly invoke schedule(). This suggests a variant of RCU having voluntary context switches and idle execution as its only quiescent states. This variant is Tasks RCU.

Tasks RCU is unusual in having no read-side marking functions, which is good given that its main use case has nowhere to put such markings. Instead, calls to schedule() serve directly as quiescent states. Updates can use synchronize_rcu_tasks() to wait for all pre-existing trampoline execution to complete, or they can use its asynchronous counterpart, call_rcu_tasks(). There is also an rcu_barrier_tasks() that waits for completion of callbacks corresponding to all prior invocations of call_rcu_tasks(). There is no synchronize_rcu_tasks_expedited() because there has not yet been a request for it, though implementing a useful variant of it would not be free of challenges.

Quick Quiz 9.42: In a kernel built with CONFIG_PREEMPT_NONE=y, won't synchronize_rcu() wait for all trampolines, given that preemption is disabled and that trampolines never directly or indirectly invoke schedule()?

The "Tasks RCU Rude" column provides a more effective variant of the toy implementation presented in Section 9.5.1.4. This variant causes each CPU to execute a context switch, so that any voluntary context switch or any preemptible region of code can serve as a quiescent state. The Tasks RCU Rude variant uses the Linux-kernel workqueues facility to force concurrent context switches, in contrast to the serial CPU-by-CPU approach taken by
the toy implementation. The API mirrors that of Tasks RCU, including the lack of explicit read-side markers.

Finally, the "Tasks RCU Trace" column provides an RCU implementation with functionality similar to that of SRCU, except with much faster read-side markers.14 However, this speed is a consequence of the fact that these markers do not execute memory-barrier instructions, which means that Tasks RCU Trace grace periods must often send IPIs to all CPUs and must always scan the entire task list, thus degrading real-time response and consuming considerable CPU time. Nevertheless, in the absence of readers, the resulting grace-period latency is reasonably short, rivaling that of RCU.

14 And thus is unusual for the Tasks RCU family for having explicit read-side markers!

9.5.3.3 RCU has Publish-Subscribe and Version-Maintenance APIs

Fortunately, the RCU publish-subscribe and version-maintenance primitives shown in Table 9.3 apply to all of the variants of RCU discussed above. This commonality can allow more code to be shared, and reduces API proliferation. The original purpose of the RCU publish-subscribe APIs was to bury memory barriers into these APIs, so that Linux kernel programmers could use RCU without needing to become expert on the memory-ordering models of each of the 20+ CPU families that Linux supports [Spr01].

These primitives operate directly on pointers, and are useful for creating RCU-protected linked data structures, such as RCU-protected arrays and trees. The special case of linked lists is handled by a separate set of APIs described in Section 9.5.3.4.

The first category publishes pointers to new data items. The rcu_assign_pointer() primitive ensures that any prior initialization remains ordered before the assignment to the pointer on weakly ordered machines. The rcu_replace_pointer() primitive updates the pointer just like rcu_assign_pointer() does, but also returns the previous value, just like rcu_dereference_protected() (see below) would, including the lockdep expression. This replacement is convenient when the updater must both publish a new pointer and free the structure referenced by the old pointer.

Quick Quiz 9.43: Normally, any pointer subject to rcu_dereference() must always be updated using one of the pointer-publish functions in Table 9.3, for example, rcu_assign_pointer(). What is an exception to this rule?

Quick Quiz 9.44: Are there any downsides to the fact that these traversal and update primitives can be used with any of the RCU API family members?

The rcu_pointer_handoff() primitive simply returns its sole argument, but is useful to tooling that checks for pointers being leaked from RCU read-side critical sections. Use of rcu_pointer_handoff() indicates to such tooling that protection of the structure in question has been handed off from RCU to some other mechanism, such as locking or reference counting.

The RCU_INIT_POINTER() macro can be used to initialize RCU-protected pointers that have not yet been exposed to readers, or alternatively, to set RCU-protected pointers to NULL. In these restricted cases, the memory-barrier instructions provided by rcu_assign_pointer() are not needed. Similarly, RCU_POINTER_INITIALIZER() provides a GCC-style structure initializer to allow easy initialization of RCU-protected pointers in structures.

The second category subscribes to pointers to data items, or, alternatively, safely traverses RCU-protected pointers. Again, simply loading these pointers using C-language accesses could result in seeing pre-initialization garbage in the pointed-to data. Similarly, loading these pointers by any means outside of an RCU read-side critical section could result in the pointed-to object being freed at any time. However, if the pointer is merely to be tested and not dereferenced, the freeing of the pointed-to object is not necessarily a problem. In this case, rcu_access_pointer() may be used. Normally, however, RCU read-side protection is required, and so the rcu_dereference() primitive uses the Linux kernel's lockdep facility [Cor06a] to verify that this rcu_dereference() invocation is under the protection of rcu_read_lock(), srcu_read_lock(), or some other RCU read-side marker. In contrast, the rcu_access_pointer() primitive does not involve lockdep, and thus will not provoke lockdep complaints when used outside of an RCU read-side critical section.

Another situation where protection is not required is when update-side code accesses the RCU-protected pointer while holding the update-side lock. The rcu_dereference_protected() API member is provided for this situation. Its first parameter is the RCU-protected pointer, and the second parameter takes a lockdep expression describing which locks must be held in order for the access to be safe. Code invoked both from readers and updaters can use rcu_dereference_check(), which also takes a lockdep expression, but which may also be
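As a rough sketch of how these update-side members fit together, the following fragment publishes a replacement structure while holding an update-side lock. The struct foo type, the gp pointer, and gp_lock are invented for this illustration rather than taken from the Linux kernel.

struct foo {
	int a;
};
struct foo __rcu *gp;                /* RCU-protected pointer */
DEFINE_SPINLOCK(gp_lock);            /* update-side lock */

int read_foo_a(void)                 /* reader: subscribe */
{
	struct foo *p;
	int a = -1;

	rcu_read_lock();
	p = rcu_dereference(gp);     /* lockdep-checked subscription */
	if (p)
		a = p->a;
	rcu_read_unlock();
	return a;
}

void update_foo(int a)               /* updater: publish and reclaim */
{
	struct foo *newp, *oldp;

	newp = kmalloc(sizeof(*newp), GFP_KERNEL);
	if (!newp)
		return;
	newp->a = a;
	spin_lock(&gp_lock);
	/* rcu_replace_pointer() combines rcu_dereference_protected()
	 * and rcu_assign_pointer(), returning the old pointer. */
	oldp = rcu_replace_pointer(gp, newp, lockdep_is_held(&gp_lock));
	spin_unlock(&gp_lock);
	if (oldp) {
		synchronize_rcu();   /* wait for pre-existing readers */
		kfree(oldp);
	}
}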
for the hash-bucket arrays of large hash tables. As before, this notation is cumbersome, so hlist structures will be abbreviated in the same way list_head-style lists are, as shown in Figure 9.17.

A variant of Linux's hlist, named hlist_nulls, provides multiple distinct NULL pointers, but otherwise uses the same layout as shown in Figure 9.18. In this variant, a ->next pointer having a zero low-order bit is considered to be a pointer. However, if the low-order bit is set to one, the upper bits identify the type of NULL pointer. This type of list is used to allow lockless readers to detect when a node has been moved from one list to another. For example, each bucket of a hash table might use its index to mark its NULL pointer. Should a reader encounter a NULL pointer not matching the index of the bucket it started from, that reader knows that an element it was traversing was moved to some other bucket during the traversal, taking that reader with it. The reader can use the is_a_nulls() function (which returns true if passed an hlist_nulls NULL pointer) to determine when it reaches the end of a list, and the get_nulls_value() function (which returns its argument's NULL-pointer identifier) to fetch the type of NULL pointer. When get_nulls_value() returns an unexpected value, the reader can take corrective action, for example, restarting its traversal from the beginning.

Quick Quiz 9.45: But what if an hlist_nulls reader gets moved to some other bucket and then back again?

More information on hlist_nulls is available in the Linux-kernel source tree, with helpful example code provided in the rculist_nulls.rst file (rculist_nulls.txt in older kernels).

Another variant of Linux's hlist incorporates bit-locking, and is named hlist_bl. This variant uses the same layout as shown in Figure 9.18, but reserves the low-order bit of the head pointer ("first" in the figure) to lock the list. This approach also reduces memory usage, as it allows what would otherwise be a separate spinlock to be stored with the pointer itself.

The API members for these linked-list variants are summarized in Table 9.4. More information is available in the Documentation/RCU directory of the Linux-kernel source tree and at Linux Weekly News [McK19b].

However, the remainder of this section expands on the use of list_replace_rcu(), given that this API member gave RCU its name. This API member is used to carry out more complex updates in which an element in the middle of the list having multiple fields is atomically updated, so that a given reader sees either the old set of values or the new set of values, but not a mixture of the two sets. For example, each node of a linked list might have integer fields ->a, ->b, and ->c, and it might be necessary to update a given node's fields from 5, 6, and 7 to 5, 2, and 3, respectively.

The code implementing this atomic update is straightforward:

15  q = kmalloc(sizeof(*p), GFP_KERNEL);
16  *q = *p;
17  q->b = 2;
18  q->c = 3;
19  list_replace_rcu(&p->list, &q->list);
20  synchronize_rcu();
21  kfree(p);

The following discussion walks through this code, using Figure 9.19 to illustrate the state changes. The triples in each element represent the values of fields ->a, ->b, and ->c, respectively. The red-shaded elements might be referenced by readers, and because readers do not synchronize directly with updaters, readers might run concurrently with this entire replacement process. Please note that backwards pointers and the link from the tail to the head are omitted for clarity.

The initial state of the list, including the pointer p, is the same as for the deletion example, as shown on the first row of the figure.

The following text describes how to replace the 5,6,7 element with 5,2,3 in such a way that any given reader sees one of these two values.

Line 15 allocates a replacement element, resulting in the state as shown in the second row of Figure 9.19. At this point, no reader can hold a reference to the newly allocated element (as indicated by its green shading), and it is uninitialized (as indicated by the question marks).

Line 16 copies the old element to the new one, resulting in the state as shown in the third row of Figure 9.19. The newly allocated element still cannot be referenced by readers, but it is now initialized.

Line 17 updates q->b to the value "2", and line 18 updates q->c to the value "3", as shown on the fourth row of Figure 9.19. Note that the newly allocated structure is still inaccessible to readers.

Now, line 19 does the replacement, so that the new element is finally visible to readers, and hence is shaded red, as shown on the fifth row of Figure 9.19. At this point, as shown below, we have two versions of the list. Pre-existing readers might see the 5,6,7 element (which is therefore now shaded yellow), but new readers will instead see the 5,2,3 element. But any given reader is [...]
Table 9.4: RCU-Protected List APIs

list: Circular doubly linked list
  Structures: struct list_head
  Initialization: INIT_LIST_HEAD_RCU()
  Full traversal: list_for_each_entry_rcu(), list_for_each_entry_lockless()
  Resume traversal: list_for_each_entry_continue_rcu(), list_for_each_entry_from_rcu()
  Stepwise traversal: list_entry_rcu(), list_entry_lockless(), list_first_or_null_rcu(), list_next_rcu(), list_next_or_null_rcu()
  Add: list_add_rcu(), list_add_tail_rcu()
  Delete: list_del_rcu()
  Replace: list_replace_rcu()
  Splice: list_splice_init_rcu(), list_splice_tail_init_rcu()

hlist: Linear doubly linked list
  Structures: struct hlist_head, struct hlist_node
  Full traversal: hlist_for_each_entry_rcu(), hlist_for_each_entry_rcu_bh(), hlist_for_each_entry_rcu_notrace()
  Resume traversal: hlist_for_each_entry_continue_rcu(), hlist_for_each_entry_continue_rcu_bh(), hlist_for_each_entry_from_rcu()
  Stepwise traversal: hlist_first_rcu(), hlist_next_rcu(), hlist_pprev_rcu()
  Add: hlist_add_before_rcu(), hlist_add_behind_rcu(), hlist_add_head_rcu(), hlist_add_tail_rcu()
  Delete: hlist_del_rcu(), hlist_del_init_rcu()
  Replace: hlist_replace_rcu()

hlist_nulls: Linear doubly linked list with marked NULL pointer, with up to 31 bits of marking
  Structures: struct hlist_nulls_head, struct hlist_nulls_node
  Full traversal: hlist_nulls_for_each_entry_rcu(), hlist_nulls_for_each_entry_safe()
  Stepwise traversal: hlist_nulls_first_rcu(), hlist_nulls_next_rcu()
  Add: hlist_nulls_add_head_rcu()
  Delete: hlist_nulls_del_rcu(), hlist_nulls_del_init_rcu()

hlist_bl: Linear doubly linked list with bit locking
  Structures: struct hlist_bl_head, struct hlist_bl_node
  Full traversal: hlist_bl_for_each_entry_rcu()
  Stepwise traversal: hlist_bl_first_rcu()
  Add: hlist_bl_add_head_rcu(), hlist_bl_set_first_rcu()
  Delete: hlist_bl_del_rcu(), hlist_bl_del_init_rcu()
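To show a few of these list primitives in context, here is a minimal sketch of an RCU-protected list with a lockless lookup and a lock-protected deletion. The structure, list, and function names are invented for this illustration and are not part of the Linux-kernel API.

struct myelem {
	struct list_head list;
	struct rcu_head rh;
	int key;
	int data;
};

LIST_HEAD(mylist);
DEFINE_SPINLOCK(mylist_lock);

int my_lookup(int key)                           /* reader: lockless traversal */
{
	struct myelem *p;
	int data = -1;

	rcu_read_lock();
	list_for_each_entry_rcu(p, &mylist, list) {
		if (p->key == key) {
			data = p->data;
			break;
		}
	}
	rcu_read_unlock();
	return data;
}

static void myelem_free(struct rcu_head *rhp)    /* deferred reclamation */
{
	kfree(container_of(rhp, struct myelem, rh));
}

void my_del(int key)                             /* updater: serialized by a lock */
{
	struct myelem *p;

	spin_lock(&mylist_lock);
	list_for_each_entry(p, &mylist, list) {
		if (p->key == key) {
			list_del_rcu(&p->list);
			call_rcu(&p->rh, myelem_free);
			break;
		}
	}
	spin_unlock(&mylist_lock);
}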
[Figure 9.19: RCU Replacement in Linked List]

9.5.3.5 RCU Has Diagnostic APIs

Table 9.5 shows RCU's diagnostic APIs.

The __rcu tag marks an RCU-protected pointer, for example, "struct foo __rcu *p;". Pointers [...]
[...]turn values allows the Linux kernel's sparse tool to detect situations where RCU-protected pointers are incorrectly [...]

Debug-object support is automatic for any rcu_head structures that are part of a structure obtained from the Linux kernel's memory allocators, but those building their own special-purpose memory allocators can use init_rcu_head() and destroy_rcu_head() at allocation and free time, respectively. Those using rcu_head structures allocated on the function-call stack (it happens!) may use init_rcu_head_on_stack() before first use and destroy_rcu_head_on_stack() after last use, but before returning from the function. Debug-object support allows detection of bugs involving passing the same rcu_head structure to call_rcu() and friends in quick succession, which is the call_rcu() counterpart to the infamous double-free class of memory-allocation bugs.

Stall-warning control is provided by rcu_cpu_stall_reset(), which allows the caller to suppress RCU CPU stall warnings for the remainder of the current grace period. RCU CPU stall warnings help pinpoint situations where an RCU read-side critical section runs for an excessive length of time, and it is useful for things like kernel debuggers to be able to suppress them, for example, when encountering a breakpoint.

Callback checking is provided by rcu_head_init() and rcu_head_after_call_rcu(). The former is invoked on an rcu_head structure before it is passed to call_rcu(), and then rcu_head_after_call_rcu() will check to see if the callback has been invoked with the specified function.

Support for lockdep [Cor06a] includes rcu_read_lock_held(), rcu_read_lock_bh_held(), rcu_read_lock_sched_held(), and srcu_read_lock_held(), each of which returns true if invoked within the corresponding type of RCU read-side critical section.

Quick Quiz 9.46: Why isn't there a rcu_read_lock_tasks_held() for Tasks RCU?

Because rcu_read_lock() cannot be used from the idle loop, and because energy-efficiency concerns have caused the idle loop to become quite ornate, rcu_is_watching() returns true if invoked in a context where use of rcu_read_lock() is legal. Note again that srcu_read_lock() may be used from idle and even offline CPUs, which means that rcu_is_watching() does not apply to SRCU.

RCU_LOCKDEP_WARN() emits a warning if lockdep is enabled and if its argument evaluates to true. For example, RCU_LOCKDEP_WARN(!rcu_read_lock_held()) would emit a warning if invoked outside of an RCU read-side critical section.

RCU_NONIDLE() may be used to force RCU to watch when executing the statement that is passed in as the sole argument. For example, RCU_NONIDLE(WARN_ON(!rcu_is_watching())) would never emit a warning. However, changes in the 2020–2021 timeframe extend RCU's reach deeper into the idle loop, which should greatly reduce or even eliminate the need for RCU_NONIDLE().

Finally, rcu_sleep_check() emits a warning if invoked within an RCU, RCU-bh, or RCU-sched read-side critical section.

9.5.3.6 Where Can RCU's APIs Be Used?

[Figure 9.20: RCU API Usage Constraints. Nested environments in which rcu_read_lock(), rcu_read_unlock(), and rcu_dereference() may be used anywhere (including NMI), rcu_assign_pointer() and call_rcu() anywhere other than NMI, and synchronize_rcu() only in Process context.]

Figure 9.20 shows which APIs may be used in which in-kernel environments. The RCU read-side primitives may be used in any environment, including NMI, the RCU mutation and asynchronous grace-period primitives may be used in any environment other than NMI, and, finally, the RCU synchronous grace-period primitives may be used only in process context. The RCU list-traversal primitives include list_for_each_entry_rcu(), hlist_for_each_entry_rcu(), etc. Similarly, the RCU list-mutation primitives include list_add_rcu(), hlist_del_rcu(), etc.

Note that primitives from other families of RCU may be substituted, for example, srcu_read_lock() may be [...]
9.5.4 RCU Usage

This section answers the question "What is RCU?" from the viewpoint of the uses to which RCU can be put. Because RCU is most frequently used to replace some existing mechanism, we look at it primarily in terms of its relationship to such mechanisms, as listed in Table 9.6 and as displayed in Figure 9.23. Following the sections listed in this table, Section 9.5.4.12 provides a summary.

9.5.4.1 RCU for Pre-BSD Routing

In contrast to the later sections, this section focuses on a very specific use case for the purpose of comparison with other mechanisms.

Listings 9.14 and 9.15 show code for an RCU-protected Pre-BSD routing table (route_rcu.c). The former shows data structures and route_lookup(), and the latter shows route_add() and route_del().

Listing 9.14: RCU Pre-BSD Routing Table Lookup

 1  struct route_entry {
 2    struct rcu_head rh;
 3    struct cds_list_head re_next;
 4    unsigned long addr;
 5    unsigned long iface;
 6    int re_freed;
 7  };
 8  CDS_LIST_HEAD(route_list);
 9  DEFINE_SPINLOCK(routelock);
10
11  unsigned long route_lookup(unsigned long addr)
12  {
13    struct route_entry *rep;
14    unsigned long ret;
15
16    rcu_read_lock();
17    cds_list_for_each_entry_rcu(rep, &route_list, re_next) {
18      if (rep->addr == addr) {
19        ret = rep->iface;
20        if (READ_ONCE(rep->re_freed))
21          abort();
22        rcu_read_unlock();
23        return ret;
24      }
25    }
26    rcu_read_unlock();
27    return ULONG_MAX;
28  }

In Listing 9.14, line 2 adds the ->rh field used by RCU reclamation, line 6 adds the ->re_freed use-after-free-check field, lines 16, 22, and 26 add RCU read-side protection, and lines 20 and 21 add the use-after-free check. In Listing 9.15, lines 11, 13, 30, 34, and 39 add update-side [...]
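Listing 9.15 itself is not reproduced here. As a rough sketch only, update-side functions consistent with the fields and locks introduced in Listing 9.14 might look as follows, using liburcu's cds_list primitives, call_rcu(), and caa_container_of(); this is an illustration rather than the book's actual Listing 9.15.

int route_add(unsigned long addr, unsigned long interface)
{
	struct route_entry *rep;

	rep = malloc(sizeof(*rep));
	if (!rep)
		return -ENOMEM;
	rep->addr = addr;
	rep->iface = interface;
	rep->re_freed = 0;
	spin_lock(&routelock);                          /* update-side lock */
	cds_list_add_rcu(&rep->re_next, &route_list);
	spin_unlock(&routelock);
	return 0;
}

static void route_cb(struct rcu_head *rhp)              /* deferred reclamation */
{
	struct route_entry *rep;

	rep = caa_container_of(rhp, struct route_entry, rh);
	WRITE_ONCE(rep->re_freed, 1);                   /* use-after-free check */
	free(rep);
}

int route_del(unsigned long addr)
{
	struct route_entry *rep;

	spin_lock(&routelock);
	cds_list_for_each_entry(rep, &route_list, re_next) {
		if (rep->addr == addr) {
			cds_list_del_rcu(&rep->re_next);
			spin_unlock(&routelock);
			call_rcu(&rep->rh, route_cb);   /* free after a grace period */
			return 0;
		}
	}
	spin_unlock(&routelock);
	return -ENOENT;
}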
9.5.4.2 Wait for Pre-Existing Things to Finish

As noted in Section 9.5.2, an important component of RCU is a way of waiting for RCU readers to finish. One of RCU's great strengths is that it allows you to wait for each of thousands of different things to finish without having to explicitly track each and every one of them, and without incurring the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes that use explicit tracking.

In this section, we will show how synchronize_sched()'s read-side counterparts (which include anything that disables preemption, along with hardware operations and primitives that disable interrupts) permit you to interact with non-maskable interrupt (NMI) handlers, which is quite difficult using locking. This approach has been called "Pure RCU" [McK04], and it is used in a few places in the Linux kernel.

The basic form of such "Pure RCU" designs is as follows:

1. Make a change, for example, to the way that the OS reacts to an NMI.

2. Wait for all pre-existing read-side critical sections to completely finish (for example, by using the synchronize_sched() primitive).16 The key observation here is that subsequent RCU read-side critical sections are guaranteed to see whatever change was made.

3. Clean up, for example, return status indicating that the change was successfully made.

16 In Linux kernel v5.1 and later, synchronize_sched() has been subsumed into synchronize_rcu().

The remainder of this section presents example code adapted from the Linux kernel. In this example, the timer_stop() function in the now-defunct oprofile facility uses synchronize_sched() to ensure that all in-flight NMI notifications have completed before freeing the associated resources. A simplified version of this code is shown in Listing 9.16.

Listing 9.16: Using RCU to Wait for NMIs to Finish

 1  struct profile_buffer {
 2    long size;
 3    atomic_t entry[0];
 4  };
 5  static struct profile_buffer *buf = NULL;
 6
 7  void nmi_profile(unsigned long pcvalue)
 8  {
 9    struct profile_buffer *p = rcu_dereference(buf);
10
11    if (p == NULL)
12      return;
13    if (pcvalue >= p->size)
14      return;
15    atomic_inc(&p->entry[pcvalue]);
16  }
17
18  void nmi_stop(void)
19  {
20    struct profile_buffer *p = buf;
21
22    if (p == NULL)
23      return;
24    rcu_assign_pointer(buf, NULL);
25    synchronize_sched();
26    kfree(p);
27  }

Lines 1–4 define a profile_buffer structure, containing a size and an indefinite array of entries. Line 5 defines a pointer to a profile buffer, which is presumably initialized elsewhere to point to a dynamically allocated region of memory.

Lines 7–16 define the nmi_profile() function, which is called from within an NMI handler. As such, it cannot be preempted, nor can it be interrupted by a normal interrupt handler. However, it is still subject to delays due to cache misses, ECC errors, and cycle stealing by other hardware threads within the same core. Line 9 gets a local pointer to the profile buffer using the rcu_dereference() primitive to ensure memory ordering on DEC Alpha, and lines 11 and 12 exit from this function if there is no profile buffer currently allocated, while lines 13 and 14 exit from this function if the pcvalue argument is out of range. Otherwise, line 15 increments the profile-buffer entry indexed by the pcvalue argument. Note that storing the size with the buffer guarantees that the range check matches the buffer, even if a large buffer is suddenly replaced by a smaller one.

Lines 18–27 define the nmi_stop() function, where the caller is responsible for mutual exclusion (for example, holding the correct lock). Line 20 fetches a pointer to the profile buffer, and lines 22 and 23 exit the function if there is no buffer. Otherwise, line 24 NULLs out the profile-buffer pointer (using the rcu_assign_pointer() primitive to maintain memory ordering on weakly ordered machines), and line 25 waits for an RCU Sched grace period to elapse, in particular, waiting for all non-preemptible regions of code, including NMI handlers, to complete. Once execution continues at line 26, we are guaranteed that any instance of nmi_profile() that obtained a pointer to the old buffer has returned. It is therefore safe to free the buffer, in this case using the kfree() primitive.
[Figure: "Common-Case Operations" versus "Maintenance Operations" timeline. With time advancing downward, common-case operations run Quickly, then Either, then Carefully, then Either, then Quickly again, as the maintenance operation proceeds through its Prepare, Maintenance, and Clean up phases.]

Listing 9.17: Phased State Change for Maintenance Operations

bool be_careful;

void cco(void)
{
	rcu_read_lock();
	if (READ_ONCE(be_careful))
		cco_carefully();
	else
		cco_quickly();
	rcu_read_unlock();
}

void maint(void)
{
	WRITE_ONCE(be_careful, true);
	synchronize_rcu();
	do_maint();
	synchronize_rcu();
	WRITE_ONCE(be_careful, false);
}
9.5.4.4 Add-Only List

Add-only data structures, exemplified by the add-only list, can be used for a surprisingly common set of use cases, perhaps most commonly the logging of changes. Add-only data structures are a pure use of RCU's underlying publish/subscribe mechanism.

An add-only variant of a pre-BSD routing table can be derived from Listings 9.14 and 9.15. Because there is no deletion, the route_del() and route_cb() functions may be dispensed with, along with the ->rh and ->re_freed fields of the route_entry structure, the rcu_read_lock() and rcu_read_unlock() invocations in the route_lookup() function, and all uses of the ->re_freed field in all remaining functions.

Of course, if there are many concurrent invocations of the route_add() function, there will be heavy contention on routelock, and if lockless techniques are used, heavy memory contention on routelist. The usual way to avoid this contention is to use a concurrency-friendly data structure such as a hash table (see Chapter 10). Alternatively, per-CPU data structures might be periodically merged into a single global data structure.

On the other hand, if there is never any deletion, extended time periods featuring many concurrent invocations of route_add() will eventually consume all available memory. Therefore, most RCU-protected data structures also implement deletion.

9.5.4.5 Type-Safe Memory

A number of lockless algorithms do not require that a given data element keep the same identity through a given RCU read-side critical section referencing it, but only if that data element retains the same type. In other words, these lockless algorithms can tolerate a given data element being freed and reallocated as the same type of structure while they are referencing it, but must prohibit a change in type. This guarantee, called "type-safe memory" in academic literature [GC96], is weaker than the existence guarantees discussed in Section 9.5.4.6, and is therefore quite a bit harder to work with. Type-safe memory algorithms in the Linux kernel make use of slab caches, specially marking these caches with SLAB_TYPESAFE_BY_RCU so that RCU is used when returning a freed-up slab to system memory. This use of RCU guarantees that any in-use element of such a slab will remain in that slab, thus retaining its type, for the duration of any pre-existing RCU read-side critical sections.

Quick Quiz 9.52: But what if there is an arbitrarily long series of RCU read-side critical sections in multiple threads, so that at any point in time there is at least one thread in the system executing in an RCU read-side critical section? Wouldn't that prevent any data from a SLAB_TYPESAFE_BY_RCU slab ever being returned to the system, possibly resulting in OOM events?

It is important to note that SLAB_TYPESAFE_BY_RCU will in no way prevent kmem_cache_alloc() from immediately reallocating memory that was just now freed via kmem_cache_free()! In fact, the SLAB_TYPESAFE_BY_RCU-protected data structure just returned by rcu_dereference() might be freed and reallocated an arbitrarily large number of times, even when under the protection of rcu_read_lock(). Instead, SLAB_TYPESAFE_BY_RCU operates by preventing kmem_cache_free() from returning a completely freed-up slab of data structures to the system until after an RCU grace period elapses. In short, although a given RCU read-side critical section might see a given SLAB_TYPESAFE_BY_RCU data element being freed and reallocated arbitrarily often, the element's type is guaranteed not to change until that critical section has completed.

These algorithms therefore typically use a validation step that checks to make sure that the newly referenced data structure really is the one that was requested [LS86, Section 2.5]. These validation checks require that portions of the data structure remain untouched by the free-reallocate process. Such validation checks are usually very hard to get right, and can hide subtle and difficult bugs.

Therefore, although type-safety-based lockless algorithms can be extremely helpful in a very few difficult situations, you should instead use existence guarantees where possible. Simpler is after all almost always better! On the other hand, type-safety-based lockless algorithms can provide improved cache locality, and thus improved performance. This improved cache locality is provided by the fact that such algorithms can immediately reallocate a newly freed block of memory. In contrast, algorithms based on existence guarantees must wait for all pre-existing readers before reallocating memory, by which time that memory may have been ejected from CPU caches.

As can be seen in Figure 9.23, RCU's type-safe-memory use case combines both the wait-to-finish and publish-subscribe components, but in the Linux kernel also includes the slab allocator's deferred reclamation specified by the SLAB_TYPESAFE_BY_RCU flag.
Listing 9.18: Existence Guarantees Enable Per-Element Locking

 1  int delete(int key)
 2  {
 3    struct element *p;
 4    int b;
 5
 6    b = hashfunction(key);
 7    rcu_read_lock();
 8    p = rcu_dereference(hashtable[b]);
 9    if (p == NULL || p->key != key) {
10      rcu_read_unlock();
11      return 0;
12    }
13    spin_lock(&p->lock);
14    if (hashtable[b] == p && p->key == key) {
15      rcu_read_unlock();
16      rcu_assign_pointer(hashtable[b], NULL);
17      spin_unlock(&p->lock);
18      synchronize_rcu();
19      kfree(p);
20      return 1;
21    }
22    spin_unlock(&p->lock);
23    rcu_read_unlock();
24    return 0;
25  }

[...] newly removed element, and line 20 indicates success. If the element is no longer the one we want, line 22 releases the lock, line 23 leaves the RCU read-side critical section, and line 24 indicates failure to delete the specified key.

Quick Quiz 9.54: Why is it OK to exit the RCU read-side critical section on line 15 of Listing 9.18 before releasing the lock on line 17?

Quick Quiz 9.55: Why not exit the RCU read-side critical section on line 23 of Listing 9.18 before releasing the lock on line 22?

Alert readers will recognize this as only a slight variation on the original wait-to-finish theme (Section 9.5.4.2), adding publish/subscribe, linked structures, a heap allocator (typically), and deferred reclamation, as shown in Figure 9.23. They might also note the deadlock-immunity advantages over the lock-based existence guarantees discussed in Section 7.4.
v2022.09.25a
9.5. READ-COPY UPDATE (RCU) 167
v2022.09.25a
168 CHAPTER 9. DEFERRED PROCESSING
10000 But please note the logscale y axis, which means that
the small separations between the traces still represent
Nanoseconds per operation significant differences. This figure shows non-preemptible
1000
rwlock RCU, but given that preemptible RCU’s read-side overhead
is only about three nanoseconds, its plot would be nearly
100
identical to Figure 9.27.
Quick Quiz 9.60: Why the larger error ranges for the
submicrosecond durations in Figure 9.27?
10 RCU
There are three traces for reader-writer locking, with the
upper trace being for 100 CPUs, the next for 10 CPUs, and
1 the lowest for 1 CPU. The greater the number of CPUs
1 10 100
and the shorter the critical sections, the greater is RCU’s
Number of CPUs (Threads)
performance advantage. These performance advantages
Figure 9.26: Performance Advantage of Preemptible are underscored by the fact that 100-CPU systems are no
RCU Over Reader-Writer Locking longer uncommon and that a number of system calls (and
thus any RCU read-side critical sections that they contain)
100000 complete within microseconds.
In addition, as is discussed in the next section, RCU
Nanoseconds per operation
v2022.09.25a
9.5. READ-COPY UPDATE (RCU) 169
Another interesting consequence of RCU’s deadlock rwlock reader spin rwlock reader
immunity is its immunity to a large class of priority rwlock reader spin rwlock reader
inversion problems. For example, low-priority RCU
rwlock reader spin rwlock reader
readers cannot prevent a high-priority RCU updater from
spin rwlock writer
acquiring the update-side lock. Similarly, a low-priority
RCU updater cannot prevent high-priority RCU readers
from entering an RCU read-side critical section. RCU reader RCU reader RCU reader
RCU reader RCU reader RCU reader
Quick Quiz 9.62: Immunity to both deadlock and priority
RCU reader RCU reader RCU reader
inversion??? Sounds too good to be true. Why should I believe
RCU updater
that this is even possible?
Time
Update Received
Realtime Latency Because RCU read-side primitives
Figure 9.28: Response Time of RCU vs. Reader-Writer
neither spin nor block, they offer excellent realtime laten-
Locking
cies. In addition, as noted earlier, this means that they are
immune to priority inversion involving the RCU read-side
primitives and locks. classic example is the networking routing table. Because
However, RCU is susceptible to more subtle priority- routing updates can take considerable time to reach a given
inversion scenarios, for example, a high-priority process system (seconds or even minutes), the system will have
blocked waiting for an RCU grace period to elapse can be been sending packets the wrong way for quite some time
blocked by low-priority RCU readers in -rt kernels. This when the update arrives. It is usually not a problem to con-
can be solved by using RCU priority boosting [McK07d, tinue sending updates the wrong way for a few additional
GMTW08]. milliseconds. Furthermore, because RCU updaters can
However, use of RCU priority boosting requires that make changes without waiting for RCU readers to finish,
rcu_read_unlock() do deboosting, which entails ac- the RCU readers might well see the change more quickly
quiring scheduler locks. Some care is therefore required than would batch-fair reader-writer-locking readers, as
within the scheduler and RCU to avoid deadlocks, which as shown in Figure 9.28.
of the v5.15 Linux kernel requires RCU to avoid invoking
the scheduler while holding any of RCU’s locks. Quick Quiz 9.63: But how many other algorithms really
This in turn means that rcu_read_unlock() is not tolerate stale and inconsistent data?
always lockless when RCU priority boosting is enabled. Once the update is received, the rwlock writer cannot
However, rcu_read_unlock() will still be lockless if proceed until the last reader completes, and subsequent
its critical section was not priority-boosted. Furthermore, readers cannot proceed until the writer completes. How-
critical sections will not be priority boosted unless they ever, these subsequent readers are guaranteed to see the
are preempted, or, in -rt kernels, they acquire non-raw new value, as indicated by the green shading of the right-
spinlocks. This means that rcu_read_unlock() will most boxes. In contrast, RCU readers and updaters do
normally be lockless from the perspective of the highest not block each other, which permits the RCU readers to
priority task running on any given CPU. see the updated values sooner. Of course, because their
execution overlaps that of the RCU updater, all of the RCU
RCU Readers and Updaters Run Concurrently Be- readers might well see updated values, including the three
cause RCU readers never spin nor block, and because readers that started before the update. Nevertheless only
updaters are not subject to any sort of rollback or abort the green-shaded rightmost RCU readers are guaranteed
semantics, RCU readers and updaters really can run con- to see the updated values.
currently. This means that RCU readers might access stale Reader-writer locking and RCU simply provide different
data, and might even see inconsistencies, either of which guarantees. With reader-writer locking, any reader that
can render conversion from reader-writer locking to RCU begins after the writer begins is guaranteed to see new
non-trivial. values, and any reader that attempts to begin while the
However, in a surprisingly large number of situations, writer is spinning might or might not see new values,
inconsistencies and stale data are not problems. The depending on the reader/writer preference of the rwlock
v2022.09.25a
170 CHAPTER 9. DEFERRED PROCESSING
implementation in question. In contrast, with RCU, any for the rule of thumb that RCU be used in read-mostly
reader that begins after the updater completes is guaranteed situations.
to see new values, and any reader that completes after As noted in Section 9.5.3, within the Linux kernel,
the updater begins might or might not see new values, shorter grace periods may be obtained via expedited grace
depending on timing. periods, for example, by invoking synchronize_rcu_
The key point here is that, although reader-writer lock- expedited() instead of synchronize_rcu(). Expe-
ing does indeed guarantee consistency within the confines dited grace periods can reduce delays to as little as a few
of the computer system, there are situations where this tens of microseconds, albeit at the expense of higher CPU
consistency comes at the price of increased inconsistency utilization and IPIs. The added IPIs can be especially
with the outside world, courtesy of the finite speed of light unwelcome in some real-time workloads.
and the non-zero size of atoms. In other words, reader-
writer locking obtains internal consistency at the price of Code: Reader-Writer Locking vs. RCU In the best
silently stale data with respect to the outside world. case, the conversion from reader-writer locking to RCU is
Note that if a value is computed while read-holding quite simple, as shown in Listings 9.19, 9.20, and 9.21,
a reader-writer lock, and then that value is used after all taken from Wikipedia [MPA+ 06].
that lock is released, then this reader-writer-locking use However, the transformation is not always this straight-
case is using stale data. After all, the quantities that this forward. This is because neither the spin_lock() nor the
value is based on could change at any time after that synchronize_rcu() in Listing 9.21 exclude the read-
lock is released. This sort of reader-writer-locking use ers in Listing 9.20. First, the spin_lock() does not
case is often easy to convert to RCU, as will be shown in interact in any way with rcu_read_lock() and rcu_
Listings 9.19, 9.20, and 9.21 and the accompanying text. read_unlock(), thus not excluding them. Second, al-
though both write_lock() and synchronize_rcu()
Low-Priority RCU Readers Can Block High-Pri- wait for pre-existing readers, only write_lock() pre-
ority Reclaimers In Realtime RCU [GMTW08] or vents subsequent readers from commencing.18 Thus,
SRCU [McK06], a preempted reader will prevent a grace synchronize_rcu() cannot exclude readers. Neverthe-
period from completing, even if a high-priority task is less, a great many situations using reader-writer locking
blocked waiting for that grace period to complete. Real- can be converted to RCU.
time RCU can avoid this problem by substituting call_ More-elaborate cases of replacing reader-writer locking
rcu() for synchronize_rcu() or by using RCU priority with RCU may be found elsewhere [Bro15a, Bro15b].
boosting [McK07d, GMTW08]. It might someday be nec-
essary to augment SRCU and RCU Tasks Trace with Semantics: Reader-Writer Locking vs. RCU Expand-
priority boosting, but not before a clear real-world need is ing on the previous section, reader-writer locking seman-
demonstrated. tics can be roughly and informally summarized by the
following three temporal constraints:
Quick Quiz 9.64: If Tasks RCU Trace might someday be
priority boosted, why not also Tasks RCU and Tasks RCU 1. Write-side acquisitions wait for any read-holders to
Rude? release the lock.
2. Writer-side acquisitions wait for any write-holder to
RCU Grace Periods Extend for Many Milliseconds release the lock.
With the exception of userspace RCU [Des09b, MDJ13c], 3. Read-side acquisitions wait for any write-holder to
expedited grace periods, and several of the “toy” release the lock.
RCU implementations described in Appendix B, RCU
grace periods extend milliseconds. Although there RCU dispenses entirely with constraint #3 and weakens
are a number of techniques to render such long de- the other two as follows:
lays harmless, including use of the asynchronous in-
1. Writers wait for any pre-existing read-holders before
terfaces (call_rcu() and call_rcu_bh()) or of the
progressing to the destructive phase of their update
polling interfaces (get_state_synchronize_rcu(),
(usually the freeing of memory).
start_poll_synchronize_rcu(), and poll_state_
synchronize_rcu()), this situation is a major reason 18 Kudos to whoever pointed this out to Paul.
v2022.09.25a
9.5. READ-COPY UPDATE (RCU) 171
v2022.09.25a
172 CHAPTER 9. DEFERRED PROCESSING
Listing 9.22: RCU Singleton Get Listing 9.23: RCU Singleton Set
1 struct myconfig { 1 void set_config(int cur_a, int cur_b)
2 int a; 2 {
3 int b; 3 struct myconfig *mcp;
4 } *curconfig; 4
5 5 mcp = malloc(sizeof(*mcp));
6 int get_config(int *cur_a, int *cur_b) 6 BUG_ON(!mcp);
7 { 7 mcp->a = cur_a;
8 struct myconfig *mcp; 8 mcp->b = cur_b;
9 9 mcp = xchg(&curconfig, mcp);
10 rcu_read_lock(); 10 if (mcp) {
11 mcp = rcu_dereference(curconfig); 11 synchronize_rcu();
12 if (!mcp) { 12 free(mcp);
13 rcu_read_unlock(); 13 }
14 return 0; 14 }
15 }
16 *cur_a = mcp->a;
17 *cur_b = mcp->b;
18 rcu_read_unlock();
19 return 1; The fields of the current structure are passed back
20 } through the cur_a and cur_b parameters to the get_
config() function defined on lines 6–20. These two
fields can be slightly out of date, but they absolutely
2. Writers synchronize with each other as needed. must be consistent with each other. The get_config()
It is of course this weakening that permits RCU imple- function provides this consistency within the RCU read-
mentations to attain excellent performance and scalability. side critical section starting on line 10 and ending on
It also allows RCU to implement the aforementioned un- either line 13 or line 18, which provides the needed
conditional read-to-write upgrade that is so attractive and temporal synchronization. Line 11 fetches the pointer to
so deadlock-prone in reader-writer locking. Code using the current myconfig structure. This structure will be
RCU can compensate for this weakening in a surprisingly used regardless of any concurrent changes due to calls to
large number of ways, but most commonly by imposing the set_config() function, thus providing the needed
spatial constraints: spatial synchronization. If line 12 determines that the
curconfig pointer was NULL, line 14 returns failure.
1. New data is placed in newly allocated memory. Otherwise, lines 16 and 17 copy out the ->a and ->b
fields and line 19 returns success. These ->a and ->b
2. Old data is freed, but only after: fields are from the same myconfig structure, and the
(a) That data has been unlinked so as to be inac- RCU read-side critical section prevents this structure from
cessible to later readers, and being freed, thus guaranteeing that these two fields are
consistent with each other.
(b) A subsequent RCU grace period has elapsed.
The structure is updated by the set_config() function
Of course, there are some reader-writer-locking use shown in Listing 9.23. Lines 5–8 allocate and initialize
cases for which RCU’s weakened semantics are inap- a new myconfig structure. Line 9 atomically exchanges
propriate, but experience in the Linux kernel indicates a pointer to this new structure with the pointer to the old
that more than 80% of reader-writer locks can in fact be structure in curconfig, while also providing full mem-
replaced by RCU. For example, a common reader-writer- ory ordering both before and after the xchg() operation,
locking use case computes some value while holding the thus providing the needed updater/reader spatial synchro-
lock and then uses that value after releasing that lock. nization on the one hand and the needed updater/updater
This use case results in stale data, and therefore often synchronization on the other. If line 10 determines that the
accommodates RCU’s weaker semantics. pointer to the old structure was in fact non-NULL, line 11
This interaction of temporal and spatial constraints is waits for a grace period (thus providing the needed read-
illustrated by the RCU singleton data structure illustrated er/updater temporal synchronization) and line 12 frees
in Figures 9.6 and 9.7. This structure is defined on the old structure, safe in the knowledge that there are no
lines 1–4 of Listing 9.22, and contains two integer fields, longer any readers still referencing it.
->a and ->b (singleton.c). The current instance of this Figure 9.29 shows an abbreviated representation of
structure is referenced by the curconfig pointer defined get_config() on the left and right and a similarly ab-
on line 4. breviated representation of set_config() in the middle.
v2022.09.25a
9.5. READ-COPY UPDATE (RCU) 173
Time
Address Space
5,
5,25 curconfig
5, 5,
9,81
Readers
rcu_read_lock();
mcp = ... mcp = kmalloc(...)
*cur_a = mcp->a; (5) mcp = xchg(&curconfig, mcp);
Time advances from top to bottom, and the address space as each reader loads the curconfig pointer only once,
of the objects referenced by curconfig advances from each reader will see a consistent view of its myconfig
left to right. The boxes with comma-separated numbers structure.
each represent a myconfig structure, with the constraint In short, when operating on a suitable linked data
that ->b is the square of ->a. Each blue dash-dotted structure, RCU combines temporal and spatial synchro-
arrow represents an interaction with the old structure (on nization in order to approximate reader-writer locking,
the left, containing “5,25”) and each green dashed arrow with RCU read-side critical sections acting as the reader-
represents an interaction with the new structure (on the writer-locking reader, as shown in Figures 9.23 and 9.29.
right, containing “9,81”). RCU’s temporal synchronization is provided by the read-
The black dotted arrows represent temporal relation- side markers, for example, rcu_read_lock() and rcu_
ships between RCU readers on the left and right and read_unlock(), as well as the update-side grace-period
the RCU grace period at center, with each arrow point- primitives, for example, synchronize_rcu() or call_
ing from an older event to a newer event. The call to rcu(). The spatial synchronization is provided by
synchronize_rcu() followed the leftmost rcu_read_ the read-side rcu_dereference() family of primitives,
lock(), and therefore that synchronize_rcu() invoca- each of which subscribes to a version published by rcu_
tion must not return until after the corresponding rcu_ assign_pointer().19 RCU’s combining of temporal
read_unlock(). In contrast, the call to synchronize_ and spatial synchronization contrasts to the schemes pre-
rcu() precedes the rightmost rcu_read_lock(), which sented in Sections 6.3.2, 6.3.3, and 7.1.4, in which tempo-
allows the return from that same synchronize_rcu() to ral and spatial synchronization are provided separately by
ignore the corresponding rcu_read_unlock(). These locking and by static data-structure layout, respectively.
temporal relationships prevent the myconfig structures Quick Quiz 9.65: Is RCU the only synchronization mecha-
from being freed while RCU readers are still accessing nism that combines temporal and spatial synchronization in
them. this way?
The two horizontal grey dashed lines represent the
period of time during which different readers get different
results, however, each reader will see one and only one of 9.5.4.10 Quasi Reference Count
the two objects. Before the first horizontal line, all readers Because grace periods are not allowed to complete while
see the leftmost myconfig structure, and after the second there is an RCU read-side critical section in progress,
horizontal line, all readers will see the rightmost structure.
Between the two lines, that is, during the grace period, 19 Preferably with both rcu_dereference() and rcu_assign_
different readers might see different objects, but as long pointer() being embedded in higher-level APIs.
v2022.09.25a
174 CHAPTER 9. DEFERRED PROCESSING
v2022.09.25a
9.5. READ-COPY UPDATE (RCU) 175
reader-writer locking, many system calls (and thus any presented by this restriction, and as to how it can best be
RCU read-side critical sections that they contain) complete handled.
in a few microseconds. However, in the common case where references are
Although traditional reference counters are usually asso- held within the confines of a single CPU or task, RCU
ciated with a specific data structure, or perhaps a specific can be used as high-performance and highly scalable
group of data structures, this approach does have some reference-counting mechanism.
disadvantages. For example, maintaining a single global As shown in Figure 9.23, quasi reference counts add
reference counter for a large variety of data structures RCU readers as individual or bulk reference counts, pos-
typically results in bouncing the cache line containing the sibly also bridging to reference counters in corner cases.
reference count. As we saw in Figures 9.30–9.32, such
cache-line bouncing can severely degrade performance. 9.5.4.11 Quasi Multi-Version Concurrency Control
In contrast, RCU’s lightweight rcu_read_lock(), RCU can also be thought of as a simplified multi-version
rcu_dereference(), and rcu_read_unlock() read- concurrency control (MVCC) mechanism with weak con-
side primitives permit extremely frequent read-side usage sistency criteria. The multi-version aspects were touched
with negligible performance degradation. Except that the upon in Section 9.5.2.3. However, in its native form,
calls to rcu_dereference() are not doing anything spe- RCU provides version consistency only within a given
cific to acquire a reference to the pointed-to object. The RCU-protected data element.
heavy lifting is instead done by the rcu_read_lock() Nevertheless, there are situations where consistency
and rcu_read_unlock() primitives and their interac- and fresh data are required across multiple data elements.
tions with RCU grace periods. Fortunately, there are a number of approaches that avoid
And ignoring those calls to rcu_dereference() per- inconsistency and stale data, including the following:
mits RCU to be thought of as a “bulk reference-counting”
mechanism, where each call to rcu_read_lock() ob- 1. Enclose RCU readers within sequence-locking read-
tains a reference on each and every RCU-protected object, ers, forcing the RCU readers to be retried should
and with little or no overhead. However, the restrictions an update occur, as described in Section 13.4.2 and
that go with RCU can be quite onerous. For example, in Section 13.4.3.
many cases, the Linux-kernel prohibition against sleeping
2. Place the data that must be consistent into a single
while in an RCU read-side critical section would defeat
element of a linked data structure, and refrain from
the entire purpose. Such cases might be better served by
updating those fields within any element visible to
the hazard pointers mechanism described in Section 9.3.
RCU readers. RCU readers gaining a reference to any
Cases where code rarely sleeps have been handled by using
such element are then guaranteed to see consistent
RCU as a reference count in the common non-sleeping
values. See Section 13.5.4 for additional details.
case and by bridging to an explicit reference counter when
sleeping is necessary. 3. Use a per-element lock that guards a “deleted” flag
Alternatively, situations where a reference must be held to allow RCU readers to reject stale data [McK04,
by a single task across a section of code that sleeps may ACMS03].
be accommodated with Sleepable RCU (SRCU) [McK06].
4. Provide an existence flag that is referenced by all data
This fails to cover the not-uncommon situation where
elements whose update is to appear atomic to RCU
a reference is “passed” from one task to another, for
readers [McK14d, McK14a, McK15b, McK16b,
example, when a reference is acquired when starting an
McK16a].
I/O and released in the corresponding completion interrupt
handler. Again, such cases might be better handled by 5. Use one of a wide range of counter-based meth-
explicit reference counters or by hazard pointers. ods [McK08a, McK10, MW11, McK14b, MSFM15,
Of course, SRCU brings restrictions of its own, namely KMK+ 19]. In these approaches, updaters maintain
that the return value from srcu_read_lock() be passed a version number and maintain links to old versions
into the corresponding srcu_read_unlock(), and that of a given piece of data. Readers take a snapshot
no SRCU primitives be invoked from hardware interrupt of the current version number, and, if necessary, tra-
handlers or from non-maskable interrupt (NMI) handlers. verse the links to find a version consistent with that
The jury is still out as to how much of a problem is snapshot.
v2022.09.25a
176 CHAPTER 9. DEFERRED PROCESSING
& CU
Re on ork
Inc W
ad sist s G
(R
routing tables. Because it may have taken many seconds
-M en
Ne U M
Ne
os t D reat
100% Reads
100% Writes
Re ns
ed
Re ons Be
ed U W
tly at !!!)
(R
C
ad iste Wel
,S a
ad iste OK
C ght
Co ork
Ne
C
across the Internet, the system has been sending packets
-M
-W nt
tal OK
ed
W ns ot B
os t D
i
e
rite Da ..)
rite ist
Co
the wrong way for quite some time. Having some small
tly ata
(R
,
-M ent st)*
n
,
CU
s
os Da
ta
l)
, ta
v2022.09.25a
9.5. READ-COPY UPDATE (RCU) 177
v2022.09.25a
178 CHAPTER 9. DEFERRED PROCESSING
without incurring the overhead of grace periods while lock” [LZC14], similar to those created by Gautham
holding locks. Shenoy [She06] and Srivatsa Bhat [Bha14]. The Liu
Maya Arbel and Hagit Attiya of Technion took a more et al. paper is interesting from a number of perspec-
rigorous approach [AA14] to an RCU-protected search tives [McK14g].
tree that, like Masstree, allows concurrent updates. This Mike Ash posted [Ash15] a description of an RCU-like
paper includes a proof of correctness, including proof primitive in Apple’s Objective-C runtime. This approach
that all operations on this tree are linearizable. Unfor- identifies read-side critical sections via designated code
tunately, this implementation achieves linearizability by ranges, thus qualifying as another method of achieving
incurring the full latency of grace-period waits while zero read-side overhead, albeit one that poses some in-
holding locks, which degrades scalability of update-only teresting practical challenges for large read-side critical
workloads. One way around this problem is to abandon sections that span multiple functions.
linearizability [HKLP12, McK14d]), however, Arbel and Pedro Ramalhete and Andreia Correia [RC15] pro-
Attiya instead created an RCU variant that reduces low- duced “Poor Man’s RCU”, which, despite using a pair of
end grace-period latency. Of course, nothing comes for reader-writer locks, manages to provide lock-free forward-
free, and this RCU variant appears to hit a scalability progress guarantees to readers [MP15a].
limit at about 32 CPUs. Although there is much to be Maya Arbel and Adam Morrison [AM15] produced
said for dropping linearizability, thus gaining both perfor- “Predicate RCU”, which works hard to reduce grace-period
mance and scalability, it is very good to see academics duration in order to efficiently support algorithms that
experimenting with alternative RCU implementations. hold update-side locks across grace periods. This results
in reduced batching of updates into grace periods and
9.5.5.2 RCU Implementations reduced scalability, but does succeed in providing short
grace periods.
Keir Fraser created a user-space RCU named EBR for
use in non-blocking synchronization and software trans- Quick Quiz 9.68: Why not just drop the lock before waiting
actional memory [Fra03, Fra04, FH07]. Interestingly for the grace period, or using something like call_rcu()
enough, this work cites Linux-kernel RCU on the one instead of waiting for a grace period?
hand, but also inspired the name QSBR for the original
non-preemptible Linux-kernel RCU implementation. Alexander Matveev (MIT), Nir Shavit (MIT and Tel-
Mathieu Desnoyers created a user-space RCU for use in Aviv University), Pascal Felber (University of Neuchâ-
tracing [Des09b, Des09a, DMS+ 12], which has seen use tel), and Patrick Marlier (also University of Neuchâ-
in a number of projects [BD13]. tel) [MSFM15] produced an RCU-like mechanism that
Researchers at Charles University in Prague have can be thought of as software transactional memory that
also been working on RCU implementations, including explicitly marks read-only transactions. Their use cases
dissertations by Andrej Podzimek [Pod10] and Adam require holding locks across grace periods, which lim-
Hraska [Hra13]. its scalability [MP15a, MP15b]. This appears to be the
Yujie Liu (Lehigh University), Victor Luchangco (Or- first academic RCU-related work to make good use of the
acle Labs), and Michael Spear (also Lehigh) [LLS13] rcutorture test suite, and also the first to have submitted
pressed scalable non-zero indicators (SNZI) [ELLM07] a performance improvement to Linux-kernel RCU, which
into service as a grace-period mechanism. The intended was accepted into v4.4.
use is to implement software transactional memory (see Alexander Matveev’s RLU was followed up by MV-
Section 17.2), which imposes linearizability requirements, RLU from Jaeho Kim et al. [KMK+ 19]. This work im-
which in turn seems to limit scalability. proves scalability over RLU by permitting multiple concur-
RCU-like mechanisms are also finding their way into rent updates, by avoiding holding locks across grace peri-
Java. Sivaramakrishnan et al. [SZJ12] use an RCU-like ods, and by using asynchronous grace periods, for example,
mechanism to eliminate the read barriers that are otherwise call_rcu() instead of synchronize_rcu(). This pa-
required when interacting with Java’s garbage collector, per also made some interesting performance-evaluation
resulting in significant performance improvements. choices that are discussed further in Section 17.2.3.3 on
Ran Liu, Heng Zhang, and Haibo Chen of Shanghai page 379.
Jiao Tong University created a specialized variant of RCU Adam Belay et al. created an RCU implementation that
that they used for an optimized “passive reader-writer guards the data structures used by TCP/IP’s address-
v2022.09.25a
9.6. WHICH TO CHOOSE? 179
resolution protocol (ARP) in their IX operating sys- the rcutorture test suite. The effort found several holes
tem [BPP+ 16]. in this suite’s coverage, one of which was hiding a real
Geoff Romer and Andrew Hunter (both at Google) bug (since fixed) in Tiny RCU.
proposed a cell-based API for RCU protection of singleton With some luck, all of this validation work will eventu-
data structures for inclusion in the C++ standard [RH18]. ally result in more and better tools for validating concurrent
Dimitrios Siakavaras et al. have applied HTM and RCU code.
to search trees [SNGK17, SBN+ 20], Christina Giannoula
et al. have used HTM and RCU to color graphs [GGK18],
and SeongJae Park et al. have used HTM and RCU to 9.6 Which to Choose?
optimize high-contention locking on NUMA systems.
Alex Kogan et al. applied RCU to the construction of
range locking for scalable address spaces [KDI20]. Choose always the way that seems the best, however
rough it may be; custom will soon render it easy and
Production uses of RCU are listed in Section 9.6.3.3.
agreeable.
In early 2017, it is commonly recognized that almost Section 9.6.1 provides a high-level overview and then Sec-
any bug is a potential security exploit, so validation and tion 9.6.2 provides a more detailed view of the differences
verification are first-class concerns. between the deferred-processing techniques presented
Researchers at Stony Brook University have produced an in this chapter. This discussion assumes a linked data
RCU-aware data-race detector [Dug10, Sey12, SRK+ 11]. structure that is large enough that readers do not hold ref-
Alexey Gotsman of IMDEA, Noam Rinetzky of Tel Aviv erences from one traversal to another, and where elements
University, and Hongseok Yang of the University of Oxford might be added to and removed from the structure at any
have published a paper [GRY12] expressing the formal location and at any time. Section 9.6.3 then points out a
semantics of RCU in terms of separation logic, and have few publicly visible production uses of hazard pointers,
continued with other aspects of concurrency. sequence locking, and RCU. This discussion should help
Joseph Tassarotti (Carnegie-Mellon University), Derek you to make an informed choice between these techniques.
Dreyer (Max Planck Institute for Software Systems), and
Viktor Vafeiadis (also MPI-SWS) [TDV15] produced a
manual formal proof of correctness of the quiescent- 9.6.1 Which to Choose? (Overview)
state-based reclamation (QSBR) variant of userspace
RCU [Des09b, DMS+ 12]. Lihao Liang (University of Table 9.7 shows a few high-level properties that distinguish
Oxford), Paul E. McKenney (IBM), Daniel Kroening, the deferred-reclamation techniques from one another.
and Tom Melham (both also Oxford) [LMKM16] used The “Readers” row summarizes the results presented in
the C bounded model checker (CBMC) [CKL04] to pro- Figure 9.22, which shows that all but reference counting
duce a mechanical proof of correctness of a significant enjoy reasonably fast and scalable readers.
portion of Linux-kernel Tree RCU. Lance Roy [Roy17] The “Number of Protected Objects” row evaluates each
used CBMC to produce a similar proof of correctness technique’s need for external storage with which to record
for a significant portion of Linux-kernel sleepable RCU reader protection. RCU relies on quiescent states, and
(SRCU) [McK06]. Finally, Michalis Kokologiannakis and thus needs no storage to represent readers, whether within
Konstantinos Sagonas (National Technical University of or outside of the object. Reference counting can use a
Athens) [KS17a, KS19] used the Nighugg tool [LSLK14] single integer within each object in the structure, and no
to produce a mechanical proof of correctness of a some- additional storage is required. Hazard pointers require
what larger portion of Linux-kernel Tree RCU. external-to-object pointers be provisioned, and that there
None of these efforts located any bugs other than bugs be sufficient pointers for each CPU or thread to track all
injected into RCU specifically to test the verification the objects being referenced at any given time. Given that
tools. In contrast, Alex Groce (Oregon State University), most hazard-pointer-based traversals require only a few
Iftekhar Ahmed, Carlos Jensen (both also OSU), and Paul hazard pointers, this is not normally a problem in practice.
E. McKenney (IBM) [GAJM15] automatically mutated Of course, sequence locks provides no pointer-traversal
Linux-kernel RCU’s source code to test the coverage of protection, which is why it is normally used on static data.
v2022.09.25a
180 CHAPTER 9. DEFERRED PROCESSING
Readers Slow and unscalable Fast and scalable Fast and scalable Fast and scalable
Memory Overhead Counter per object Pointer per No protection None
reader per object
Duration of Protection Can be long Can be long No protection User must bound
duration
Need for Traversal If object deleted If object deleted If any update Never
Retries
Quick Quiz 9.69: Why can’t users dynamically allocate the applications, then it does not matter that RCU has duration-
hazard pointers as they are needed? limit requirements because your code already meets them.
In the same vein, if readers must already write to the
The “Duration of Protection” describes constraints (if objects that they are traversing, the read-side overhead of
any) on how long a period of time a user may protect a reference counters might not be so important. Of course, if
given object. Reference counting and hazard pointers can the data to be protected is in statically allocated variables,
both protect objects for extended time periods with no then sequence locking’s inability to protect pointers is
untoward side effects, but maintaining an RCU reference irrelevant.
to even one object prevents all other RCU from being freed.
Finally, there is some work on dynamically switching
RCU readers must therefore be relatively short in order
between hazard pointers and RCU based on dynamic
to avoid running the system out of memory, with special-
sampling of delays [BGHZ16]. This defers the choice be-
purpose implementations such as SRCU, Tasks RCU, and
tween hazard pointers and RCU to runtime, and delegates
Tasks Trace RCU being exceptions to this rule. Again,
responsibility for the decision to the software.
sequence locks provide no pointer-traversal protection,
Nevertheless, this table should be of great help when
which is why it is normally used on static data.
choosing between these techniques. But those wishing
The “Need for Traversal Retries” row tells whether a
more detail should continue on to the next section.
new reference to a given object may be acquired uncon-
ditionally, as it can with RCU, or whether the reference
acquisition can fail, resulting in a retry operation, which
is the case for reference counting, hazard pointers, and 9.6.2 Which to Choose? (Details)
sequence locks. In the case of reference counting and
Table 9.8 provides more-detailed rules of thumb that
hazard pointers, retries are only required if an attempt to
can help you choose among the four deferred-processing
acquire a reference to a given object while that object is in
techniques presented in this chapter.
the process of being deleted, a topic covered in more detail
in the next section. Sequence locking must of course retry As shown in the “Existence Guarantee” row, if you
its critical section should it run concurrently with any need existence guarantees for linked data elements, you
update. must use reference counting, hazard pointers, or RCU. Se-
quence locks do not provide existence guarantees, instead
Quick Quiz 9.70: But don’t Linux-kernel kref reference providing detection of updates, retrying any read-side
counters allow guaranteed unconditional reference acquisition? critical sections that do encounter an update.
Of course, as shown in the “Updates and Readers
Of course, different rows will have different levels of Progress Concurrently” row, this detection of updates
importance in different situations. For example, if your implies that sequence locking does not permit updaters
current code is having read-side scalability problems with and readers to make forward progress concurrently. After
hazard pointers, then it does not matter that hazard pointers all, preventing such forward progress is the whole point
can require retrying reference acquisition because your of using sequence locking in the first place! This situation
current code already handles this. Similarly, if response- points the way to using sequence locking in conjunction
time considerations already limit the duration of reader with reference counting, hazard pointers, or RCU in order
traversals, as is often the case in kernels and low-level to provide both existence guarantees and update detection.
v2022.09.25a
9.6. WHICH TO CHOOSE? 181
system call.
In fact, the Linux kernel combines RCU and sequence could eliminate the read-side smp_mb() from hazard pointers?
locking in this manner during pathname lookup.
The “Contention Among Readers”, “Reader Per-
The “Reader Forward Progress Guarantee” row shows
Critical-Section Overhead”, and “Reader Per-Object Tra-
that only RCU has a bounded wait-free forward-progress
versal Overhead” rows give a rough sense of the read-side
guarantee, which means that it can carry out a finite
overhead of these techniques. The overhead of reference
traversal by executing a bounded number of instructions.
counting can be quite large, with contention among read-
ers along with a fully ordered read-modify-write atomic The “Reader Reference Acquisition” row indicates that
operation required for each and every object traversed. only RCU is capable of unconditionally acquiring refer-
Hazard pointers incur the overhead of a memory barrier for ences. The entry for sequence locks is “Unsafe” because,
each data element traversed, and sequence locks incur the again, sequence locks detect updates rather than acquiring
overhead of a pair of memory barriers for each attempt to references. Reference counting and hazard pointers both
execute the critical section. The overhead of RCU imple- require that traversals be restarted from the beginning if a
mentations vary from nothing to that of a pair of memory given acquisition fails. To see this, consider a linked list
barriers for each read-side critical section, thus providing containing objects A, B, C, and D, in that order, and the
RCU with the best performance, particularly for read-side following series of events:
critical sections that traverse many data elements. Of
course, the read-side overhead of all deferred-processing 1. A reader acquires a reference to object B.
variants can be reduced by batching, so that each read-side
operation covers more data.
2. An updater removes object B, but refrains from
freeing it because the reader holds a reference. The
Quick Quiz 9.71: But didn’t the answer to one of the quick
list now contains objects A, C, and D, and object B’s
quizzes in Section 9.3 say that pairwise asymmetric barriers
->next pointer is set to HAZPTR_POISON.
v2022.09.25a
182 CHAPTER 9. DEFERRED PROCESSING
3. The updater removes object C, so that the list now updates. However, there are situations in which the only
contains objects A and D. Because there is no blocking operation is a wait to free memory, which re-
reference to object C, it is immediately freed. sults in a situation that, for many purposes, is as good as
non-blocking [DMS+ 12].
4. The reader tries to advance to the successor of the As shown in the “Automatic Reclamation” row, only
object following the now-removed object B, but the reference counting can automate freeing of memory, and
poisoned ->next pointer prevents this. Which is even then only for non-cyclic data structures. Certain use
a good thing, because object B’s ->next pointer cases for hazard pointers and RCU can provide automatic
would otherwise point to the freelist. reclamation using link counts, which can be thought of
5. The reader must therefore restart its traversal from as reference counts, but applying only to incoming links
the head of the list. from other parts of the data structure [Mic18].
Finally, the “Lines of Code” row shows the size of
Thus, when failing to acquire a reference, a hazard- the Pre-BSD Routing Table implementations, giving a
pointer or reference-counter traversal must restart that rough idea of relative ease of use. That said, it is im-
traversal from the beginning. In the case of nested linked portant to note that the reference-counting and sequence-
data structures, for example, a tree containing linked locking implementations are buggy, and that a correct
lists, the traversal must be restarted from the outermost reference-counting implementation is considerably more
data structure. This situation gives RCU a significant complex [Val95, MS95]. For its part, a correct sequence-
ease-of-use advantage. locking implementation requires the addition of some
However, RCU’s ease-of-use advantage does not come other synchronization mechanism, for example, hazard
for free, as can be seen in the “Memory Footprint” row. pointers or RCU, so that sequence locking detects con-
RCU’s support of unconditional reference acquisition current updates and the other mechanism provides safe
means that it must avoid freeing any object reachable by a reference acquisition.
given RCU reader until that reader completes. RCU there- As more experience is gained using these techniques,
fore has an unbounded memory footprint, at least unless both separately and in combination, the rules of thumb
updates are throttled. In contrast, reference counting and laid out in this section will need to be refined. However,
hazard pointers need to retain only those data elements this section does reflect the current state of the art.
actually referenced by concurrent readers.
This tension between memory footprint and acquisition
failures is sometimes resolved within the Linux kernel by
9.6.3 Which to Choose? (Production Use)
combining use of RCU and reference counters. RCU is This section points out a few publicly visible production
used for short-lived references, which means that RCU uses of hazard pointers, sequence locking, and RCU. Ref-
read-side critical sections can be short. These short erence counting is omitted, not because it is unimportant,
RCU read-side critical sections in turn mean that the but rather because it is not only used pervasively, but heav-
corresponding RCU grace periods can also be short, which ily documented in textbooks going back a half century.
limits the memory footprint. For the few data elements that One of the hoped-for benefits of listing production uses of
need longer-lived references, reference counting is used. these other techniques is to provide examples to study—or
This means that the complexity of reference-acquisition to find bugs in, as the case may be.21
failure only needs to be dealt with for those few data
elements: The bulk of the reference acquisitions are 9.6.3.1 Production Uses of Hazard Pointers
unconditional, courtesy of RCU. See Section 13.2 for
more information on combining reference counting with In 2010, Keith Bostic added hazard pointers to
other synchronization mechanisms. WiredTiger [Bos10]. MongoDB 3.0, released in 2015,
The “Reclamation Forward Progress” row shows included WiredTiger and thus hazard pointers.
that hazard pointers can provide non-blocking up- In 2011, Samy Al Bahra added hazard pointers to the
dates [Mic04a, HLM02]. Reference counting might or Concurrency Kit library [Bah11b].
might not, depending on the implementation. However, 21 Kudos to Mathias Stearn, Matt Wilson, David Goldblatt, Live-
sequence locking cannot provide non-blocking updates, Journal user fanf, Nadav Har’El, Avi Kivity, Dmitry Vyukov, Raul
courtesy of its update-side lock. RCU updaters must Guitterez S., Twitter user @peo3, Paolo Bonzini, and Thomas Monjalon
wait on readers, which also rules out fully non-blocking for locating a great many of these use cases.
v2022.09.25a
9.7. WHAT ABOUT UPDATES? 183
In 2014, Maxim Khizhinsky added hazard pointers to In 2015, Maxim Khizhinsky added RCU to
libcds [Khi14]. libcds [Khi15].
In 2015, David Gwynne introduced shared reference Mindaugas Rasiukevicius implemented libqsbr in 2016,
pointers, a form of hazard pointers, to OpenBSD [Gwy15]. which features QSBR and epoch-based reclamation
In 2017–2018, the Rust-language arc-swap [Van18] (EBR) [Ras16], both of which are types of implemen-
and conc [cut17] crates rolled their own implementations tations of RCU.
of hazard pointers. Sheth et al. [SWS16] demonstrated the value of lever-
In 2018, Maged Michael added hazard pointers to aging Go’s garbage collector to provide RCU-like func-
Facebook’s Folly library [Mic18], where it is used heavily. tionality, and the Go programming language provides a
Value type that can provide this functionality.22
9.6.3.2 Production Uses of Sequence Locking Matt Klein describes an RCU-like mechanism that is
used in the Envoy Proxy [Kle17].
The Linux kernel added sequence locking to v2.5.60 Honnappa Nagarahalli added an RCU library to the
in 2003 [Cor03], having been generalized from an ad- Data Plane Development Kit (DPDK) in 2018 [Nag18].
hoc technique used in x86’s implementation of the Stjepan Glavina merged an epoch-based RCU imple-
gettimeofday() system call. mentation into the crossbeam set of concurrency-support
In 2011, Samy Al Bahra added sequence locking to the “crates” for the Rust language [Gla18].
Concurrency Kit library [Bah11c]. Jason Donenfeld produced an RCU implementations
Paolo Bonzini added a simple sequence-lock to the as part of his port of WireGuard to Windows NT ker-
QEMU emulator in 2013 [Bon13]. nel [Don21].
Alexis Menard abstracted a sequence-lock implementa- Finally, any garbage-collected concurrent language (not
tion in Chromium in 2016 [Men16]. just Go!) gets the update side of an RCU implementation
A simple sequence locking implementation was added at zero incremental cost.
to jemalloc() in 2018 [Gol18a]. The eigen library
also has a special-purpose queue that is managed by a 9.6.3.4 Summary of Production Uses
mechanism resembling sequence locking.
Perhaps the time will come when sequence locking, hazard
pointers, and RCU are all as heavily used and as well
9.6.3.3 Production Uses of RCU
known as are reference counters. Until that time comes,
IBM’s VM/XA is adopted passive serialization, a mecha- the current production uses of these mechanisms should
nism similar to RCU, some time in the 1980s [HOS89]. help guide the choice of mechanism as well as showing
DYNIX/ptx adopted RCU in 1993 [MS98a, SM95]. how best to apply each of them.
The Linux kernel adopted Dipankar Sarma’s implemen- The next section discusses updates, a ticklish issue for
tation of RCU in 2002 [Tor02]. many of the read-mostly mechanisms described in this
The userspace RCU project started in 2009 [Des09b]. chapter.
The Knot DNS project started using the userspace RCU
library in 2010 [Slo10]. That same year, the OSv kernel
added an RCU implementation [Kiv13], later adding an 9.7 What About Updates?
RCU-protected linked list [Kiv14b] and an RCU-protected
hash table [Kiv14a]. The only thing constant in life is change.
In 2011, Samy Al Bahra added epochs (a form
of RCU [Fra04, FH07]) to the Concurrency Kit li- François de la Rochefoucauld
brary [Bah11a].
NetBSD began using the aforementioned passive se- The deferred-processing techniques called out in this chap-
rialization with v6.0 in 2012 [The12a]. Among other ter are most directly applicable to read-mostly situations,
things, passive serialization is used in NetBSD packet which begs the question “But what about updates?” After
filter (NPF) [Ras14]. all, increasing the performance and scalability of readers
Paolo Bonzini added RCU support to the QEMU em-
ulator in 2015 via a friendly fork of the userspace RCU 22 See https://golang.org/pkg/sync/atomic/#Value, par-
v2022.09.25a
184 CHAPTER 9. DEFERRED PROCESSING
v2022.09.25a
Bad programmers worry about the code. Good
programmers worry about data structures and their
relationships.
Chapter 10 Linus Torvalds
Data Structures
Serious discussions of algorithms include time complexity using an in-memory database with each animal in the zoo
of their data structures [CLRS01]. However, for parallel represented by a data item in this database. Each animal
programs, the time complexity includes concurrency ef- has a unique name that is used as a key, with a variety of
fects because these effects can be overwhelmingly large, as data tracked for each animal.
shown in Chapter 3. In other words, a good programmer’s Births, captures, and purchases result in insertions,
data-structure relationships include those aspects related while deaths, releases, and sales result in deletions. Be-
to concurrency. cause Schrödinger’s zoo contains a large quantity of short-
Section 10.1 presents the motivating application for lived animals, including mice and insects, the database
this chapter’s data structures. Chapter 6 showed how par- must handle high update rates. Those interested in Schrö-
titioning improves scalability, so Section 10.2 discusses dinger’s animals can query them, and Schrödinger has
partitionable data structures. Chapter 9 described how noted suspiciously query rates for his cat, so much so that
deferring some actions can greatly improve both perfor- he suspects that his mice might be checking up on their
mance and scalability, a topic taken up by Section 10.3. nemesis. Whatever their source, Schrödinger’s application
Section 10.4 looks at a non-partitionable data structure, must handle high query rates to a single data element.
splitting it into read-mostly and partitionable portions,
As we will see, this simple application can be a challenge
which improves both performance and scalability. Be-
to concurrent data structures.
cause this chapter cannot delve into the details of every
concurrent data structure, Section 10.5 surveys a few of
the important ones. Although the best performance and
scalability results from design rather than after-the-fact
micro-optimization, micro-optimization is nevertheless
10.2 Partitionable Data Structures
necessary for the absolute best possible performance and
scalability, as described in Section 10.6. Finally, Sec- Finding a way to live the simple life today is the most
tion 10.7 presents a summary of this chapter. complicated task.
185
v2022.09.25a
186 CHAPTER 10. DATA STRUCTURES
Listing 10.1 (hash_bkt.c) shows a set of data struc- Figure 10.1: Hash-Table Data-Structure Diagram
tures used in a simple fixed-sized hash table using chain-
ing and per-hash-bucket locking, and Figure 10.1 dia-
grams how they fit together. The hashtab structure
ing two functions acquire and release the ->htb_lock
(lines 11–15 in Listing 10.1) contains four ht_bucket
corresponding to the specified hash value.
structures (lines 6–9 in Listing 10.1), with the ->ht_
nbuckets field controlling the number of buckets and the Listing 10.3 shows hashtab_lookup(), which returns
->ht_cmp field holding the pointer to key-comparison a pointer to the element with the specified hash and key if it
function. Each such bucket contains a list header ->htb_ exists, or NULL otherwise. This function takes both a hash
head and a lock ->htb_lock. The list headers chain value and a pointer to the key because this allows users
ht_elem structures (lines 1–4 in Listing 10.1) through of this function to use arbitrary keys and arbitrary hash
their ->hte_next fields, and each ht_elem structure functions. Line 8 maps from the hash value to a pointer
also caches the corresponding element’s hash value in the to the corresponding hash bucket. Each pass through the
->hte_hash field. The ht_elem structure is included in loop spanning lines 9–14 examines one element of the
a larger structure which might contain a complex key. bucket’s hash chain. Line 10 checks to see if the hash
Figure 10.1 shows bucket 0 containing two elements values match, and if not, line 11 proceeds to the next
and bucket 2 containing one. element. Line 12 checks to see if the actual key matches,
and if so, line 13 returns a pointer to the matching element.
Listing 10.2 shows mapping and locking functions.
If no element matches, line 15 returns NULL.
Lines 1 and 2 show the macro HASH2BKT(), which maps
from a hash value to the corresponding ht_bucket struc- Quick Quiz 10.2: But isn’t the double comparison on
ture. This macro uses a simple modulus: If more aggres- lines 10–13 in Listing 10.3 inefficient in the case where the
sive hashing is required, the caller needs to implement key fits into an unsigned long?
it when mapping from key to hash value. The remain-
v2022.09.25a
10.2. PARTITIONABLE DATA STRUCTURES 187
v2022.09.25a
188 CHAPTER 10. DATA STRUCTURES
6 250000
1.4x10
1x106
150000
800000
ideal
600000 100000
400000 50000
200000
bucket 0
0 0 50 100 150 200 250 300 350 400 450
5 10 15 20 25 Number of CPUs (Threads)
Number of CPUs (Threads)
Figure 10.4: Read-Only Hash-Table Performance For
Figure 10.2: Read-Only Hash-Table Performance For Schrödinger’s Zoo, Varying Buckets
Schrödinger’s Zoo
200000
However, as can be seen in Figure 10.4, changing the
number of buckets has almost no effect: Scalability is
150000
still abysmal. In particular, we still see a sharp dropoff at
29 CPUs and beyond. Clearly something else is going on.
100000
The problem is that this is a multi-socket system, with
CPUs 0–27 and 225–251 mapped to the first socket as
50000
shown in Figure 10.5. Test runs confined to the first
28 CPUs therefore perform quite well, but tests that in-
0
0 50 100 150 200 250 300 350 400 450 volve socket 0’s CPUs 0–27 as well as socket 1’s CPU 28
Number of CPUs (Threads) incur the overhead of passing data across socket bound-
aries. This can severely degrade performance, as was
Figure 10.3: Read-Only Hash-Table Performance For discussed in Section 3.2.1. In short, large multi-socket
Schrödinger’s Zoo, 448 CPUs systems require good locality of reference in addition to
full partitioning. The remainder of this chapter will dis-
cuss ways of providing good locality of reference within
a far short of the ideal performance level, even at only the hash table itself, but in the meantime please note that
28 CPUs. Part of this shortfall is due to the fact that the one other way to provide good locality of reference would
lock acquisitions and releases incur no cache misses on a be to place large data elements in the hash table. For
single CPU, but do incur misses on two or more CPUs. example, Schrödinger might attain excellent cache locality
by placing photographs or even videos of his animals in
And things only get worse with more CPUs, as can be each element of the hash table. But for those needing hash
seen in Figure 10.3. We do not need to show ideal perfor- tables containing small data elements, please read on!
mance: The performance for 29 CPUs and beyond is all
Quick Quiz 10.4: Given the negative scalability of the
too clearly worse than abysmal. This clearly underscores
Schrödinger’s Zoo application across sockets, why not just run
the dangers of extrapolating performance from a modest
multiple copies of the application, with each copy having a
number of CPUs. subset of the animals and confined to run on a single socket?
Of course, one possible reason for the collapse in
performance might be that more hash buckets are needed.
One key property of the Schrödinger’s-zoo runs dis-
We can test this by increasing the number of hash buckets.
cussed thus far is that they are all read-only. This makes the
v2022.09.25a
10.3. READ-MOSTLY DATA STRUCTURES 189
This makes the performance degradation due to lock-acquisition-induced cache misses all the more painful. Even though we are not updating the underlying hash table itself, we are still paying the price for writing to memory. Of course, if the hash table was never going to be updated, we could dispense entirely with mutual exclusion. This approach is quite straightforward and is left as an exercise for the reader. But even with the occasional update, avoiding writes avoids cache misses, and allows the read-mostly data to be replicated across all the caches, which in turn promotes locality of reference.

[Figure 10.5: NUMA Topology of System Under Test. Each socket's CPUs are listed; for example, socket 6 has CPUs 168–195 and 392–419, and socket 7 has CPUs 196–223 and 420–447.]

The next section therefore examines optimizations that can be carried out in read-mostly cases where updates are rare, but could happen at any time.

10.3 Read-Mostly Data Structures

Adapt the remedy to the disease.

Chinese proverb

Although partitioned data structures can offer excellent scalability, NUMA effects can result in severe degradations of both performance and scalability. In addition, the need for read-side synchronization can degrade performance in read-mostly situations. However, we can achieve both performance and scalability by using RCU, which was introduced in Section 9.5. Similar results can be achieved using hazard pointers (hazptr.c) [Mic04a], which will be included in the performance results shown in this section [McK13].

10.3.1 RCU-Protected Hash Table Implementation

For an RCU-protected hash table with per-bucket locking, updaters use locking as shown in Section 10.2, but readers use RCU. The data structures remain as shown in Listing 10.1, and the HASH2BKT(), hashtab_lock(), and hashtab_unlock() functions remain as shown in Listing 10.2. However, readers use the lighter-weight concurrency control embodied by hashtab_lock_lookup() and hashtab_unlock_lookup() shown in Listing 10.6.

Listing 10.7 shows hashtab_lookup() for the RCU-protected per-bucket-locked hash table. This is identical to that in Listing 10.3 except that cds_list_for_each_entry() is replaced by cds_list_for_each_entry_rcu(). Both of these primitives traverse the hash chain referenced by htb->htb_head but cds_list_for_each_entry_rcu() also correctly enforces memory ordering in case of concurrent insertion. This is an important difference between these two hash-table implementations: Unlike the pure per-bucket-locked implementation, the RCU-protected implementation allows lookups to run concurrently with insertions and deletions, and RCU-aware primitives like cds_list_for_each_entry_rcu() are required to correctly handle this added concurrency. Note also that hashtab_lookup()'s caller must be within an RCU read-side critical section, for example, the caller must invoke hashtab_lock_lookup() before invoking hashtab_lookup() (and of course invoke hashtab_unlock_lookup() some time afterwards).

Quick Quiz 10.5: But if elements in a hash table can be removed concurrently with lookups, doesn't that mean that a lookup could return a reference to a data element that was removed immediately after it was looked up?

Listing 10.8 shows hashtab_add() and hashtab_del(), both of which are quite similar to their counterparts in the non-RCU hash table shown in Listing 10.4.
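As a concrete illustration of the caller requirement noted above, the following sketch shows a reader wrapping hashtab_lookup() in the lighter-weight read-side primitives. This is an assumption-laden sketch, not code from the CodeSamples directory: the hashtab_lock_lookup() and hashtab_unlock_lookup() signatures are assumed to take the table and hash value, matching hashtab_lock() and hashtab_unlock() in Listing 10.2, and cat_is_present() is an invented helper.

  static int cat_is_present(struct hashtab *htp, unsigned long hash, void *key)
  {
    struct ht_elem *htep;
    int present;

    hashtab_lock_lookup(htp, hash);   /* enter RCU read-side critical section */
    htep = hashtab_lookup(htp, hash, key);
    present = (htep != NULL);         /* htep may be used only before the unlock */
    hashtab_unlock_lookup(htp, hash); /* exit RCU read-side critical section */
    return present;
  }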
Listing 10.7: RCU-Protected Hash-Table Lookup
1 struct ht_elem *hashtab_lookup(struct hashtab *htp,
2                                unsigned long hash,
3                                void *key)
4 {
5   struct ht_bucket *htb;
6   struct ht_elem *htep;
7
8   htb = HASH2BKT(htp, hash);
9   cds_list_for_each_entry_rcu(htep,
10                              &htb->htb_head,
11                              hte_next) {
12     if (htep->hte_hash != hash)
13       continue;
14     if (htp->ht_cmp(htep, key))
15       return htep;
16   }
17   return NULL;
18 }

The test suite ("hashtorture.h") contains a smoketest() function that verifies that a specific series of single-threaded additions, deletions, and lookups give the expected results.

Concurrent test runs put each updater thread in control of its portion of the elements, which allows assertions checking for the following issues:

1. A just-now-to-be-added element already being in the table according to hashtab_lookup().

2. A just-now-to-be-added element being marked as being in the table by its ->in_table flag.
[Figure 10.6: Read-Only RCU-Protected Hash-Table Performance For Schrödinger's Zoo (log scale; traces include ideal, RCU, hazptr, and bucket; x-axis: Number of CPUs (Threads))]

[Figure 10.7: Read-Only RCU-Protected Hash-Table Performance For Schrödinger's Zoo, Linear Scale (y-axis: Total Lookups per Millisecond; traces include ideal, RCU, and hazptr; x-axis: Number of CPUs (Threads))]

Figure 10.7 shows the same data on a linear scale. This drops the global-locking trace into the x-axis, but allows the non-ideal performance of RCU and hazard pointers to be more readily discerned. Both show a change in slope at 224 CPUs, and this is due to hardware multithreading. At 224 and fewer CPUs, each thread has a core to itself. In this regime, RCU does better than does hazard pointers because the latter's read-side memory barriers result in dead time within the core. In short, RCU is better able to utilize a core from a single hardware thread than is hazard pointers.

This situation changes above 224 CPUs. Because RCU is using more than half of each core's resources from a single hardware thread, RCU gains relatively little benefit from the second hardware thread in each core. The slope of the hazard-pointers trace also decreases at 224 CPUs, but less dramatically, because the second hardware thread is able to fill in the time that the first hardware thread is stalled due to memory-barrier latency. As we will see in later sections, this second-hardware-thread advantage depends on the workload.

But why is RCU's performance a factor of five less than ideal? One possibility is that the per-thread counters manipulated by rcu_read_lock() and rcu_read_unlock() are slowing things down. Figure 10.8 therefore adds the results for the QSBR variant of RCU, whose read-side primitives do nothing. And although QSBR does perform slightly better than does RCU, it is still about a factor of five short of ideal.

Figure 10.9 adds completely unsynchronized results, which works because this is a read-only benchmark with nothing to synchronize.
[Figure 10.11: Read-Side RCU-Protected Hash-Table Performance For Schrödinger's Zoo in the Presence of Updates (y-axis: Lookups per Millisecond; x-axis: Number of CPUs Doing Updates)]

[Figure 10.12: Update-Side RCU-Protected Hash-Table Performance For Schrödinger's Zoo (x-axis: Number of CPUs Doing Updates)]
[Figure 10.17: Growing a Two-List Hash Table, State (c)]
Listing 10.9: Resizable Hash-Table Data Structures
1 struct ht_elem {
2   struct rcu_head rh;
3   struct cds_list_head hte_next[2];
4 };
5
6 struct ht_bucket {
7   struct cds_list_head htb_head;
8   spinlock_t htb_lock;
9 };
10
11 struct ht {
12   long ht_nbuckets;
13   long ht_resize_cur;
14   struct ht *ht_new;
15   int ht_idx;
16   int (*ht_cmp)(struct ht_elem *htep, void *key);
17   unsigned long (*ht_gethash)(void *key);
18   void *(*ht_getkey)(struct ht_elem *htep);
19   struct ht_bucket ht_bkt[0];
20 };
21
22 struct ht_lock_state {
23   struct ht_bucket *hbp[2];
24   int hls_idx[2];
25 };
26
27 struct hashtab {
28   struct ht *ht_cur;
29   spinlock_t ht_lock;
30 };

Listing 10.10: Resizable Hash-Table Bucket Selection
1 static struct ht_bucket *
2 ht_get_bucket(struct ht *htp, void *key,
3               long *b, unsigned long *h)
4 {
5   unsigned long hash = htp->ht_gethash(key);
6
7   *b = hash % htp->ht_nbuckets;
8   if (h)
9     *h = hash;
10   return &htp->ht_bkt[*b];
11 }
12
13 static struct ht_elem *
14 ht_search_bucket(struct ht *htp, void *key)
15 {
16   long b;
17   struct ht_elem *htep;
18   struct ht_bucket *htbp;
19
20   htbp = ht_get_bucket(htp, key, &b, NULL);
21   cds_list_for_each_entry_rcu(htep,
22                               &htbp->htb_head,
23                               hte_next[htp->ht_idx]) {
24     if (htp->ht_cmp(htep, key))
25       return htep;
26   }
27   return NULL;
28 }
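The ->ht_cmp, ->ht_gethash, and ->ht_getkey fields of the ht structure in Listing 10.9 point to caller-supplied functions. As an illustration only, the following sketch shows what such functions might look like for a table whose elements are keyed by an animal's name; the zoo_elem type and the function names are hypothetical rather than the types actually used by the book's test code:

  #include <string.h>

  struct zoo_elem {
    struct ht_elem zhe;   /* must be the first field for the casts below */
    char name[32];        /* the lookup key */
  };

  static unsigned long zoo_gethash(void *key)
  {
    const char *name = key;
    unsigned long hash = 0;

    while (*name)
      hash = hash * 31 + (unsigned char)*name++;
    return hash;
  }

  static void *zoo_getkey(struct ht_elem *htep)
  {
    return ((struct zoo_elem *)htep)->name;
  }

  static int zoo_cmp(struct ht_elem *htep, void *key)
  {
    return strcmp(zoo_getkey(htep), key) == 0;  /* nonzero means "match" */
  }

Note that the comparison function is expected to return nonzero on a match, as can be seen from its use on line 24 of Listing 10.10.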
… parameter h (if non-NULL) on line 9. Line 10 then returns a reference to the corresponding bucket.

The ht_search_bucket() function searches for the specified key within the specified hash-table version. Line 20 obtains a reference to the bucket corresponding to the specified key. The loop spanning lines 21–26 searches that bucket, so that if line 24 detects a match, line 25 returns a pointer to the enclosing data element. Otherwise, if there is no match, line 27 returns NULL to indicate failure.

Quick Quiz 10.10: How does the code in Listing 10.10 protect against the resizing process progressing past the selected bucket?

This implementation of ht_get_bucket() and ht_search_bucket() permits lookups and modifications to run concurrently with a resize operation.

Read-side concurrency control is provided by RCU as was shown in Listing 10.6, but the update-side concurrency-control functions hashtab_lock_mod() and hashtab_unlock_mod() must now deal with the possibility of a concurrent resize operation as shown in Listing 10.11.

Listing 10.11: Resizable Hash-Table Update-Side Concurrency Control
1 static void
2 hashtab_lock_mod(struct hashtab *htp_master, void *key,
3                  struct ht_lock_state *lsp)
4 {
5   long b;
6   unsigned long h;
7   struct ht *htp;
8   struct ht_bucket *htbp;
9
10   rcu_read_lock();
11   htp = rcu_dereference(htp_master->ht_cur);
12   htbp = ht_get_bucket(htp, key, &b, &h);
13   spin_lock(&htbp->htb_lock);
14   lsp->hbp[0] = htbp;
15   lsp->hls_idx[0] = htp->ht_idx;
16   if (b > READ_ONCE(htp->ht_resize_cur)) {
17     lsp->hbp[1] = NULL;
18     return;
19   }
20   htp = rcu_dereference(htp->ht_new);
21   htbp = ht_get_bucket(htp, key, &b, &h);
22   spin_lock(&htbp->htb_lock);
23   lsp->hbp[1] = htbp;
24   lsp->hls_idx[1] = htp->ht_idx;
25 }
26
27 static void
28 hashtab_unlock_mod(struct ht_lock_state *lsp)
29 {
30   spin_unlock(&lsp->hbp[0]->htb_lock);
31   if (lsp->hbp[1])
32     spin_unlock(&lsp->hbp[1]->htb_lock);
33   rcu_read_unlock();
34 }

The hashtab_lock_mod() function spans lines 1–25 in the listing. Line 10 enters an RCU read-side critical section to prevent the data structures from being freed during the traversal, line 11 acquires a reference to the current hash table, and then line 12 obtains a reference to the bucket in this hash table corresponding to the key. Line 13 acquires that bucket's lock, which will prevent any concurrent resizing operation from distributing that bucket, though of course it will have no effect if that bucket has already been distributed. Lines 14–15 store the bucket pointer and pointer-set index into their respective fields in the ht_lock_state structure, which communicates the information to hashtab_add(), hashtab_del(), and hashtab_unlock_mod(). Line 16 then checks to see if a concurrent resize operation has already distributed this bucket across the new hash table, and if not, line 17 indicates that there is no already-resized hash bucket and line 18 returns with the selected hash bucket's lock held (thus preventing a concurrent resize operation from distributing this bucket) and also within an RCU read-side critical section. Deadlock is avoided because the old table's locks are always acquired before those of the new table, and because the use of RCU prevents more than two versions from existing at a given time, thus preventing a deadlock cycle.

Otherwise, a concurrent resize operation has already distributed this bucket, so line 20 proceeds to the new hash table, line 21 selects the bucket corresponding to the key, and line 22 acquires the bucket's lock. Lines 23–24 store the bucket pointer and pointer-set index into their respective fields in the ht_lock_state structure, which again communicates this information to hashtab_add(), hashtab_del(), and hashtab_unlock_mod(). Because this bucket has already been resized and because hashtab_add() and hashtab_del() affect both the old and the new ht_bucket structures, two locks are held, one on each of the two buckets. Additionally, both elements of each array in the ht_lock_state structure are used, with the [0] element pertaining to the old ht_bucket structure and the [1] element pertaining to the new structure. Once again, hashtab_lock_mod() exits within an RCU read-side critical section.

The hashtab_unlock_mod() function releases the lock(s) acquired by hashtab_lock_mod(). Line 30 releases the lock on the old ht_bucket structure. In the unlikely event that line 31 determines that a resize operation is in progress, line 32 releases the lock on the new ht_bucket structure. Either way, line 33 exits the RCU read-side critical section.
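Putting the update-side pieces together, an updater might use these functions as in the following sketch, which follows the caller requirements described in this section. The zoo_add_cat() name and its arguments are hypothetical, and error handling, memory allocation, and duplicate-key checks are omitted:

  static void zoo_add_cat(struct hashtab *htp_master, struct ht_elem *htep,
                          void *key)
  {
    struct ht_lock_state ls;

    hashtab_lock_mod(htp_master, key, &ls);  /* lock bucket(s), enter RCU */
    hashtab_add(htep, &ls);                  /* shown in Listing 10.12 */
    hashtab_unlock_mod(&ls);                 /* unlock bucket(s), exit RCU */
  }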
Quick Quiz 10.11: Suppose that one thread is inserting an element into the hash table during a resize operation. What prevents this insertion from being lost due to a subsequent resize operation completing before the insertion does?

Now that we have bucket selection and concurrency control in place, we are ready to search and update our resizable hash table. The hashtab_lookup(), hashtab_add(), and hashtab_del() functions are shown in Listing 10.12.

Listing 10.12: Resizable Hash-Table Access Functions
1 struct ht_elem *
2 hashtab_lookup(struct hashtab *htp_master, void *key)
3 {
4   struct ht *htp;
5   struct ht_elem *htep;
6
7   htp = rcu_dereference(htp_master->ht_cur);
8   htep = ht_search_bucket(htp, key);
9   return htep;
10 }
11
12 void hashtab_add(struct ht_elem *htep,
13                  struct ht_lock_state *lsp)
14 {
15   struct ht_bucket *htbp = lsp->hbp[0];
16   int i = lsp->hls_idx[0];
17
18   cds_list_add_rcu(&htep->hte_next[i], &htbp->htb_head);
19   if ((htbp = lsp->hbp[1])) {
20     cds_list_add_rcu(&htep->hte_next[!i], &htbp->htb_head);
21   }
22 }
23
24 void hashtab_del(struct ht_elem *htep,
25                  struct ht_lock_state *lsp)
26 {
27   int i = lsp->hls_idx[0];
28
29   cds_list_del_rcu(&htep->hte_next[i]);
30   if (lsp->hbp[1])
31     cds_list_del_rcu(&htep->hte_next[!i]);
32 }

The hashtab_lookup() function on lines 1–10 of the listing does hash lookups. Line 7 fetches the current hash table and line 8 searches the bucket corresponding to the specified key. Line 9 returns a pointer to the searched-for element or NULL when the search fails. The caller must be within an RCU read-side critical section.

Quick Quiz 10.12: The hashtab_lookup() function in Listing 10.12 ignores concurrent resize operations. Doesn't this mean that readers might miss an element that was previously added during a resize operation?

The hashtab_add() function on lines 12–22 of the listing adds new data elements to the hash table. Line 15 picks up the current ht_bucket structure into which the new element is to be added, and line 16 picks up the index of the pointer pair. Line 18 adds the new element to the current hash bucket. If line 19 determines that this bucket has been distributed to a new version of the hash table, then line 20 also adds the new element to the corresponding new bucket. The caller is required to handle concurrency, for example, by invoking hashtab_lock_mod() before the call to hashtab_add() and invoking hashtab_unlock_mod() afterwards.

The hashtab_del() function on lines 24–32 of the listing removes an existing element from the hash table. Line 27 picks up the index of the pointer pair and line 29 removes the specified element from the current table. If line 30 determines that this bucket has been distributed to a new version of the hash table, then line 31 also removes the specified element from the corresponding new bucket. As with hashtab_add(), the caller is responsible for concurrency control, and this concurrency control suffices for synchronizing with a concurrent resize operation.

Quick Quiz 10.13: The hashtab_add() and hashtab_del() functions in Listing 10.12 can update two hash buckets while a resize operation is progressing. This might cause poor performance if the frequency of resize operations is not negligible. Isn't it possible to reduce the cost of updates in such cases?

The actual resizing itself is carried out by hashtab_resize(), shown in Listing 10.13 on page 199. Line 16 conditionally acquires the top-level ->ht_lock, and if this acquisition fails, line 17 returns -EBUSY to indicate that a resize is already in progress. Otherwise, line 18 picks up a reference to the current hash table, and lines 19–22 allocate a new hash table of the desired size. If a new set of hash/key functions have been specified, these are used for the new table, otherwise those of the old table are preserved. If line 23 detects memory-allocation failure, line 24 releases ->ht_lock and line 25 returns a failure indication.

Line 27 picks up the current table's index and line 28 stores its inverse to the new hash table, thus ensuring that the two hash tables avoid overwriting each other's linked lists. Line 29 then starts the bucket-distribution process by installing a reference to the new table into the ->ht_new field of the old table. Line 30 ensures that all readers who are not aware of the new table complete before the resize operation continues.

Each pass through the loop spanning lines 31–42 distributes the contents of one of the old hash table's buckets into the new hash table. Line 32 picks up a reference to the old table's current bucket and line 33 acquires that bucket's spinlock.
[Figure 10.19: Overhead of Resizing Hash Tables Between 262,144 and 524,288 Buckets vs. Total Number of Elements (y-axis: Lookups per Millisecond; x-axis: Number of CPUs (Threads))]

Quick Quiz 10.14: In the hashtab_resize() function in Listing 10.13, what guarantees that the update to ->ht_new on line 29 will be seen as happening before the update to ->ht_resize_cur on line 40 from the perspective of hashtab_add() and hashtab_del()? In other words, what prevents hashtab_add() and hashtab_del() from dereferencing a NULL pointer loaded from ->ht_new?

Each pass through the loop spanning lines 34–39 adds one data element from the current old-table bucket to the corresponding new-table bucket, holding the new-table bucket's lock during the add operation. Line 40 updates ->ht_resize_cur to indicate that this bucket has been distributed. Finally, line 41 releases the old-table bucket lock.

Execution reaches line 43 once all old-table buckets have been distributed across the new table. Line 43 installs the newly created table as the current one, and line 44 waits for all old readers (who might still be referencing the old table) to complete. Then line 45 releases the resize-serialization lock, line 46 frees the old hash table, and finally line 47 returns success.

Quick Quiz 10.15: Why is there a WRITE_ONCE() on line 40 in Listing 10.13?

10.4.3 Resizable Hash Table Discussion

Figure 10.19 compares resizing hash tables to their fixed-sized counterparts for 262,144 and 2,097,152 elements in the hash table. The figure shows three traces for each element count, one for a fixed-size 262,144-bucket hash table, another for a fixed-size 524,288-bucket hash table, and a third for a resizable hash table that shifts back and forth between 262,144 and 524,288 buckets, with a one-millisecond pause between each resize operation.

The uppermost three traces are for the 262,144-element hash table.¹ The dashed trace corresponds to the two fixed-size hash tables, and the solid trace to the resizable hash table. In this case, the short hash chains cause normal lookup overhead to be so low that the overhead of resizing dominates over most of the range. In particular, the entire hash table fits into L3 cache.

The lower three traces are for the 2,097,152-element hash table. The upper dashed trace corresponds to the 262,144-bucket fixed-size hash table, the solid trace (in the middle for low CPU counts and at the bottom for high CPU counts) to the resizable hash table, and the other trace to the 524,288-bucket fixed-size hash table. The fact that there are now an average of eight elements per bucket can only be expected to produce a sharp decrease in performance, as in fact is shown in the graph. But worse yet, the hash-table elements occupy 128 MB, which overflows each socket's 39 MB L3 cache, with performance consequences analogous to those described in Section 3.2.2. The resulting cache overflow means that the memory system is involved even for a read-only benchmark, and as you can see from the sublinear portions of the lower three traces, the memory system can be a serious bottleneck.

Quick Quiz 10.16: How much of the difference in performance between the large and small hash tables shown in Figure 10.19 was due to long hash chains and how much was due to memory-system bottlenecks?

Referring to the last column of Table 3.1, we recall that the first 28 CPUs are in the first socket, on a one-CPU-per-core basis, which explains the sharp decrease in performance of the resizable hash table beyond 28 CPUs. Sharp though this decrease is, please recall that it is due to constant resizing back and forth. It would clearly be better to resize once to 524,288 buckets, or, even better, do a single eight-fold resize to 2,097,152 buckets, thus dropping the average number of elements per bucket down to the level enjoyed by the runs producing the upper three traces.

The key point from this data is that the RCU-protected resizable hash table performs and scales almost as well as its fixed-sized counterparts.

¹ You see only two traces? The dashed one is composed of two traces that differ only slightly, hence the irregular-looking dash pattern.
10.4.4 Other Resizable Hash Tables

[Figure 10.20: Shrinking a Relativistic Hash Table]
… might still be traversing the old large hash table, so in this state (f).

Growing a relativistic hash table reverses the shrinking process, but requires more grace-period steps, as shown in Figure 10.21. The initial state (a) is at the top of this figure, with time advancing from top to bottom.

[Figure 10.21: Growing a Relativistic Hash Table]
… result is state (f). A final unzipping operation (including a grace-period operation) results in the final state (g).

In short, the relativistic hash table reduces the number of per-element list pointers at the expense of additional grace periods incurred during resizing. These additional grace periods are usually not a problem because insertions, deletions, and lookups may proceed concurrently with a resize operation.

It turns out that it is possible to reduce the per-element memory overhead from a pair of pointers to a single pointer, while still retaining O(1) deletions. This is accomplished by augmenting a split-order list [SS06] with RCU protection [Des09b, MDJ13a]. The data elements in the hash table are arranged into a single sorted linked list, with each hash bucket referencing the first element in that bucket. Elements are deleted by setting low-order bits in their pointer-to-next fields, and these elements are removed from the list by later traversals that encounter them.

This RCU-protected split-order list is complex, but offers lock-free progress guarantees for all insertion, deletion, and lookup operations. Such guarantees can be important in real-time applications. An implementation is available from recent versions of the userspace RCU library [Des09b].

10.5 Other Data Structures

All life is an experiment. The more experiments you make the better.

Ralph Waldo Emerson

The preceding sections have focused on data structures that enhance concurrency due to partitionability (Section 10.2), efficient handling of read-mostly access patterns (Section 10.3), or application of read-mostly techniques to avoid non-partitionability (Section 10.4). This section gives a brief review of other data structures.

One of the hash table's greatest advantages for parallel use is that it is fully partitionable, at least while not being resized. One way of preserving the partitionability and the size independence is to use a radix tree, which is also called a trie. Tries partition the search key, using each successive key partition to traverse the next level of the trie. As such, a trie can be thought of as a set of nested hash tables, thus providing the required partitionability. One disadvantage of tries is that a sparse key space can result in inefficient use of memory. There are a number of compression techniques that may be used to work around this disadvantage, including hashing the key value to a smaller keyspace before the traversal [ON07]. Radix trees are heavily used in practice, including in the Linux kernel [Pig06].

One important special case of both a hash table and a trie is what is perhaps the oldest of data structures, the array and its multi-dimensional counterpart, the matrix. The fully partitionable nature of matrices is exploited heavily in concurrent numerical algorithms.

Self-balancing trees are heavily used in sequential code, with AVL trees and red-black trees being perhaps the most well-known examples [CLRS01]. Early attempts to parallelize AVL trees were complex and not necessarily all that efficient [Ell80]; however, more recent work on red-black trees provides better performance and scalability by using RCU for readers and hashed arrays of locks² to protect reads and updates, respectively [HW11, HW14]. It turns out that red-black trees rebalance aggressively, which works well for sequential programs, but not necessarily so well for parallel use. Recent work has therefore made use of RCU-protected “bonsai trees” that rebalance less aggressively [CKZ12], trading off optimal tree depth to gain more efficient concurrent updates.

Concurrent skip lists lend themselves well to RCU readers, and in fact represent an early academic use of a technique resembling RCU [Pug90].

Concurrent double-ended queues were discussed in Section 6.1.2, and concurrent stacks and queues have a long history [Tre86], though they do not normally offer the most impressive performance or scalability. They are nevertheless a common feature of concurrent libraries [MDJ13b]. Researchers have recently proposed relaxing the ordering constraints of stacks and queues [Sha11], with some work indicating that relaxed-ordered queues actually have better ordering properties than do strict FIFO queues [HKLP12, KLP12, HHK+13].

It seems likely that continued work with concurrent data structures will produce novel algorithms with surprising properties.

² In the guise of swissTM [DFGG11], which is a variant of software transactional memory in which the developer flags non-shared accesses.
10.6 Micro-Optimization

The devil is in the details.

Unknown

The data structures shown in this section were coded straightforwardly, with no adaptation to the underlying system's cache hierarchy. In addition, many of the implementations used pointers to functions for key-to-hash conversions and other frequent operations. Although this approach provides simplicity and portability, in many cases it does give up some performance.

The following sections touch on specialization, memory conservation, and hardware considerations. Please do not mistake these short sections for a definitive treatise on this subject. Whole books have been written on optimizing to a specific CPU, let alone to the set of CPU families in common use today.

… hash and interact with possible resize operations twice rather than just once. In a performance-conscious environment, the hashtab_lock_mod() function would also return a reference to the bucket selected, eliminating the subsequent call to ht_get_bucket().

Quick Quiz 10.17: Couldn't the hashtorture.h code be modified to accommodate a version of hashtab_lock_mod() that subsumes the ht_get_bucket() functionality?

Quick Quiz 10.18: How much do these specializations really save? Are they really worth it?

All that aside, one of the great benefits of modern hardware compared to that available when I first started learning to program back in the early 1970s is that much less specialization is required. This allows much greater productivity than was possible back in the days of four-kilobyte address spaces.
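For concreteness, the bucket-returning specialization mentioned above might look like the following sketch, which simply layers on top of the hashtab_lock_mod() of Listing 10.11 and hands back the bucket that it selected, so that a performance-conscious caller need not call ht_get_bucket() a second time. This is an illustration of the idea, not the specialized code discussed in the text:

  static struct ht_bucket *
  hashtab_lock_mod_bkt(struct hashtab *htp_master, void *key,
                       struct ht_lock_state *lsp)
  {
    hashtab_lock_mod(htp_master, key, lsp);  /* as in Listing 10.11 */
    return lsp->hbp[0];  /* bucket in the current hash table, now locked */
  }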
2. They cannot participate in the lockdep deadlock detection tooling in the Linux kernel [Cor06a].

3. They do not record lock ownership, further complicating debugging.

4. They do not participate in priority boosting in -rt kernels, which means that preemption must be disabled when holding bit spinlocks, which can degrade real-time latency.

Despite these disadvantages, bit-spinlocks are extremely useful when memory is at a premium.

One aspect of the second opportunity was covered in Section 10.4.4, which presented resizable hash tables that require only one set of bucket-list pointers in place of the pair of sets required by the resizable hash table presented in Section 10.4. Another approach would be to use singly linked bucket lists in place of the doubly linked lists used in this chapter. One downside of this approach is that deletion would then require additional overhead, either by marking the outgoing pointer for later removal or by searching the bucket list for the element being deleted.

In short, there is a tradeoff between minimal memory overhead on the one hand, and performance and simplicity on the other. Fortunately, the relatively large memories available on modern systems have allowed us to prioritize performance and simplicity over memory overhead. However, even though the year 2022's pocket-sized smartphones sport many gigabytes of memory and mid-range servers sport terabytes, it is sometimes necessary to take extreme measures to reduce memory overhead.

10.6.3 Hardware Considerations

Modern computers typically move data between CPUs and main memory in fixed-sized blocks that range in size from 32 bytes to 256 bytes. These blocks are called cache lines, and are extremely important to high performance and scalability, as was discussed in Section 3.2. One timeworn way to kill both performance and scalability is to place incompatible variables into the same cacheline. For example, suppose that a resizable hash table data element had the ht_elem structure in the same cacheline as a frequently incremented counter. The frequent incrementing would cause the cacheline to be present at the CPU doing the incrementing, but nowhere else. If other CPUs attempted to traverse the hash bucket list containing that element, they would incur expensive cache misses, degrading both performance and scalability.

Listing 10.14: Alignment for 64-Byte Cache Lines
1 struct hash_elem {
2   struct ht_elem e;
3   long __attribute__ ((aligned(64))) counter;
4 };

One way to solve this problem on systems with 64-byte cache lines is shown in Listing 10.14. Here GCC's aligned attribute is used to force the ->counter and the ht_elem structure into separate cache lines. This would allow CPUs to traverse the hash bucket list at full speed despite the frequent incrementing.

Of course, this raises the question "How did we know that cache lines are 64 bytes in size?" On a Linux system, this information may be obtained from the /sys/devices/system/cpu/cpu*/cache/ directories, and it is even possible to make the installation process rebuild the application to accommodate the system's hardware structure. However, this would be more difficult if you wanted your application to also run on non-Linux systems. Furthermore, even if you were content to run only on Linux, such a self-modifying installation poses validation challenges. For example, systems with 32-byte cachelines might work well, but performance might suffer on systems with 64-byte cachelines due to false sharing.

Fortunately, there are some rules of thumb that work reasonably well in practice, which were gathered into a 1995 paper [GKPS95].³ The first group of rules involves rearranging structures to accommodate cache geometry; a sketch applying them follows the list:

³ A number of these rules are paraphrased and expanded on here with permission from Orran Krieger.

1. Place read-mostly data far from frequently updated data. For example, place read-mostly data at the beginning of the structure and frequently updated data at the end. Place data that is rarely accessed in between.

2. If the structure has groups of fields such that each group is updated by an independent code path, separate these groups from each other. Again, it can be helpful to place rarely accessed data between the groups. In some cases, it might also make sense to place each such group into a separate structure referenced by the original structure.

3. Where possible, associate update-mostly data with a CPU, thread, or task. We saw several very effective examples of this rule of thumb in the counter implementations in Chapter 5.

4. Going one step further, partition your data on a per-CPU, per-thread, or per-task basis, as was discussed in Chapter 8.
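The following sketch illustrates rules 1 and 2 on a purely hypothetical structure; the field names are invented for this example and do not appear in the book's CodeSamples:

  struct conn_info {
    /* Read-mostly fields, written only at creation time. */
    unsigned long peer_addr;
    unsigned short peer_port;
    unsigned short flags;

    /* Rarely accessed fields placed in the middle. */
    char description[64];

    /* Frequently updated statistics at the end, on their own cache line. */
    unsigned long rx_packets __attribute__((aligned(64)));
    unsigned long tx_packets;
  };

Rule 3 would go further, making the statistics per-CPU or per-thread, as in the counter implementations in Chapter 5.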
There has been some work towards automated trace-based rearrangement of structure fields [GDZE10]. This work might well ease one of the more painstaking tasks required to get excellent performance and scalability from multithreaded software.

An additional set of rules of thumb deals with locks:

1. Given a heavily contended lock protecting data that is frequently modified, take one of the following approaches:

(a) Place the lock in a different cacheline than the data that it protects.

(b) Use a lock that is adapted for high contention, such as a queued lock.

(c) Redesign to reduce lock contention. (This approach is best, but is not always trivial.)

2. Place uncontended locks into the same cache line as the data that they protect. This approach means that the cache miss that brings the lock to the current CPU also brings its data.

3. Protect read-mostly data with hazard pointers, RCU, or, for long-duration critical sections, reader-writer locks.

10.7 Summary

Archibald MacLeish

1. Fully partitioned data structures work well on small systems, for example, single-socket systems.

2. Larger systems require locality of reference as well as full partitioning.

3. Read-mostly techniques, such as hazard pointers and RCU, provide good locality of reference for read-mostly workloads, and thus provide excellent performance and scalability even on larger systems.

4. Read-mostly techniques also work well on some types of non-partitionable data structures, such as resizable hash tables.

5. Large data structures can overflow CPU caches, reducing performance and scalability.

6. Additional performance and scalability can be obtained by specializing the data structure to a specific workload, for example, by replacing a general key with a 32-bit integer.

7. Although requirements for portability and for extreme performance often conflict, there are some data-structure-layout techniques that can strike a good balance between these two sets of requirements.

That said, performance and scalability are of little use without reliability, so the next chapter covers validation.
Chapter 11

Validation

If it is not tested, it doesn't work.

Unknown
I have had a few parallel programs work the first time, but that is only because I have written a large number of parallel programs over the past three decades. And I have had far more parallel programs that fooled me into thinking that they were working correctly the first time than actually were working the first time.

I thus need to validate my parallel programs. The basic trick behind validation is to realize that the computer knows what is wrong. It is therefore your job to force it to tell you. This chapter can therefore be thought of as a short course in machine interrogation. But you can leave the good-cop/bad-cop routine at home. This chapter covers much more sophisticated and effective methods, especially given that most computers couldn't tell a good cop from a bad cop, at least as far as we know.

A longer course may be found in many recent books on validation, as well as at least one older but valuable one [Mye79]. Validation is an extremely important topic that cuts across all forms of software, and is worth intensive study in its own right. However, this book is primarily about concurrency, so this chapter will do little more than scratch the surface of this critically important topic.

Section 11.1 introduces the philosophy of debugging. Section 11.2 discusses tracing, Section 11.3 discusses assertions, and Section 11.4 discusses static analysis. Section 11.5 describes some unconventional approaches to code review that can be helpful when the fabled 10,000 eyes happen not to be looking at your code. Section 11.6 overviews the use of probability for validating parallel software. Because performance and scalability are first-class requirements for parallel programming, Section 11.7 covers these topics. Finally, Section 11.8 gives a fanciful summary and a short list of statistical traps to avoid.

But never forget that the three best debugging tools are a thorough understanding of the requirements, a solid design, and a good night's sleep!

11.1 Introduction

Debugging is like being the detective in a crime movie where you are also the murderer.

Filipe Fortes

Section 11.1.1 discusses the sources of bugs, and Section 11.1.2 overviews the mindset required when validating software. Section 11.1.3 discusses when you should start validation, and Section 11.1.4 describes the surprisingly effective open-source regimen of code review and community testing.

11.1.1 Where Do Bugs Come From?

Bugs come from developers. The basic problem is that the human brain did not evolve with computer software in mind. Instead, the human brain evolved in concert with other human brains and with animal brains. Because of this history, the following three characteristics of computers often come as a shock to human intuition:

1. Computers lack common sense, despite huge sacrifices at the altar of artificial intelligence.

2. Computers fail to understand user intent, or more formally, computers generally lack a theory of mind.

3. Computers cannot do anything useful with a fragmentary plan, instead requiring that every detail of all possible scenarios be spelled out in full.

The first two points should be uncontroversial, as they are illustrated by any number of failed products, perhaps most famously Clippy and Microsoft Bob. By attempting to relate to users as people, these two products raised common-sense and theory-of-mind expectations that they
proved incapable of meeting. Perhaps the set of software assistants now available on smartphones will fare better, but as of 2021 reviews are mixed. That said, the developers working on them by all accounts still develop the old way: The assistants might well benefit end users, but not so much their own developers.

This human love of fragmentary plans deserves more explanation, especially given that it is a classic two-edged sword. This love of fragmentary plans is apparently due to the assumption that the person carrying out the plan will have (1) common sense and (2) a good understanding of the intent and requirements driving the plan. This latter assumption is especially likely to hold in the common case where the person doing the planning and the person carrying out the plan are one and the same: In this case, the plan will be revised almost subconsciously as obstacles arise, especially when that person has a good understanding of the problem at hand. In fact, the love of fragmentary plans has served human beings well, in part because it is better to take random actions that have some chance of locating food than to starve to death while attempting to plan the unplannable. However, the usefulness of fragmentary plans in the everyday life of which we are all experts is no guarantee of their future usefulness in stored-program computers.

Furthermore, the need to follow fragmentary plans has had important effects on the human psyche, due to the fact that throughout much of human history, life was often difficult and dangerous. It should come as no surprise that executing a fragmentary plan that has a high probability of a violent encounter with sharp teeth and claws requires almost insane levels of optimism—a level of optimism that actually is present in most human beings. These insane levels of optimism extend to self-assessments of programming ability, as evidenced by the effectiveness of (and the controversy over) code-interviewing techniques [Bra07]. In fact, the clinical term for a human being with less-than-insane levels of optimism is "clinically depressed". Such people usually have extreme difficulty functioning in their daily lives, underscoring the perhaps counter-intuitive importance of insane levels of optimism to a normal, healthy life. Furthermore, if you are not insanely optimistic, you are less likely to start a difficult but worthwhile project.¹

¹ There are some famous exceptions to this rule of thumb. Some people take on difficult or risky projects in order to at least temporarily escape from their depression. Others have nothing to lose: The project is literally a matter of life or death.

Quick Quiz 11.1: When in computing is it necessary to follow a fragmentary plan?

An important special case is the project that, while valuable, is not valuable enough to justify the time required to implement it. This special case is quite common, and one early symptom is the unwillingness of the decision-makers to invest enough to actually implement the project. A natural reaction is for the developers to produce an unrealistically optimistic estimate in order to be permitted to start the project. If the organization is strong enough and its decision-makers ineffective enough, the project might succeed despite the resulting schedule slips and budget overruns. However, if the organization is not strong enough and if the decision-makers fail to cancel the project as soon as it becomes clear that the estimates are garbage, then the project might well kill the organization. This might result in another organization picking up the project and either completing it, canceling it, or being killed by it. A given project might well succeed only after killing several organizations. One can only hope that the organization that eventually makes a success of a serial-organization-killer project maintains a suitable level of humility, lest it be killed by its next such project.

Quick Quiz 11.2: Who cares about the organization? After all, it is the project that is important!

Important though insane levels of optimism might be, they are a key source of bugs (and perhaps failure of organizations). The question is therefore "How to maintain the optimism required to start a large project while at the same time injecting enough reality to keep the bugs down to a dull roar?" The next section examines this conundrum.

11.1.2 Required Mindset

When carrying out any validation effort, keep the following definitions firmly in mind:

1. The only bug-free programs are trivial programs.

2. A reliable program has no known bugs.

From these definitions, it logically follows that any reliable non-trivial program contains at least one bug that you do not know about. Therefore, any validation effort undertaken on a non-trivial program that fails to find any bugs is itself a failure. A good validation is therefore an exercise in destruction. This means that if you are the type of person who enjoys breaking things, validation is just the job for you.
Quick Quiz 11.3: Suppose that you are writing a script that
processes the output of the time command, which looks as
follows:
real 0m0.132s
user 0m0.040s
sys 0m0.008s
The script is required to check its input for errors, and to give
appropriate diagnostics if fed erroneous time output. What
test inputs should you provide to this program to test it for use
with time output generated by single-threaded programs?
Some people might see vigorous validation as a form of torture, as depicted in Figure 11.1.³ Such people might do well to remind themselves that, Tux cartoons aside, they are really torturing an inanimate object, as shown in Figure 11.2. Rest assured that those who fail to torture their code are doomed to be tortured by it!

³ The cynics among us might question whether these people are afraid that validation will find bugs that they will then be required to fix.

However, this leaves open the question of exactly when during the project lifetime validation should start, a topic taken up by the next section.

11.1.3 When Should Validation Start?

Validation should start exactly when the project starts.

To see this, consider that tracking down a bug is much harder in a large program than in a small one. Therefore, to minimize the time and effort required to track down bugs, you should test small units of code. Although you won't find all the bugs this way, you will find a substantial fraction, and it will be much easier to find and fix the ones you do find. Testing at this level can also alert you to larger flaws in your overall design, minimizing the time you waste writing code that is broken by design.

But why wait until you have code before validating your design?⁴ Hopefully reading Chapters 3 and 4 provided you with the information required to avoid some regrettably common design flaws, but discussing your design with a colleague or even simply writing it down can help flush out additional flaws.

⁴ The old saying "First we must code, then we have incentive to think" notwithstanding.

However, it is all too often the case that waiting to start validation until you have a design is waiting too long. Mightn't your natural level of optimism have caused you to start the design before you fully understood the requirements? The answer to this question will almost always be "yes". One good way to avoid flawed requirements is to get to know your users. To really serve them well, you will have to live among them.

Quick Quiz 11.4: You are asking me to do all this validation BS before I even start coding??? That sounds like a great way to never get started!!!

First-of-a-kind projects often use different methodologies such as rapid prototyping or agile. Here, the main goal of early prototypes is not to create correct implementations, but rather to learn the project's requirements. But this does not mean that you omit validation; it instead means that you approach it differently.

One such approach takes a Darwinian view, with the validation suite eliminating code that is not fit to solve the problem at hand. From this viewpoint, a vigorous validation suite is essential to the fitness of your software. However, taking this approach to its logical conclusion is quite humbling, as it requires us developers to admit that our carefully crafted changes to the codebase are, from a Darwinian standpoint, random mutations. On the other hand, this conclusion is supported by long experience indicating that seven percent of fixes introduce at least one bug [BJ12].

How vigorous should your validation suite be? If the bugs it finds aren't threatening the very foundations of your software design, then it is not yet vigorous enough. After all, your design is just as prone to bugs as is your code, and the earlier you find and fix the bugs in your design, the less time you will waste coding those design bugs.

Quick Quiz 11.5: Are you actually suggesting that it is possible to test correctness into software??? Everyone knows that is impossible!!!

It is worth reiterating that this advice applies to first-of-a-kind projects. If you are instead doing a project in a well-explored area, you would be quite foolish to refuse to learn from previous experience. But you should still start validating right at the beginning of the project, though hopefully guided by others' hard-won knowledge of both requirements and pitfalls.

An equally important question is "When should validation stop?" The best answer is "Some time after the last change." Every change has the potential to create a bug, and thus every change must be validated. Furthermore, validation development should continue through the full lifetime of the project. After all, the Darwinian perspective above implies that bugs are adapting to your validation suite. Therefore, unless you continually improve your validation suite, your project will naturally accumulate hordes of validation-suite-immune bugs.

But life is a tradeoff, and every bit of time invested in validation suites is a bit of time that cannot be invested in directly improving the project itself. These sorts of choices are never easy, and it can be just as damaging to overinvest in validation as it can be to underinvest. But this is just one more indication that life is not easy.

Now that we have established that you should start validation when you start the project (if not earlier!), and that both validation and validation development should continue throughout the lifetime of that project, the
following sections cover a number of validation techniques and methods that have proven their worth.

11.1.4 The Open Source Way

The open-source programming methodology has proven quite effective, and includes a regimen of intense code review and testing.

I can personally attest to the effectiveness of the open-source community's intense code review. One of my first patches to the Linux kernel involved a distributed filesystem where one node might write to a given file that another node has mapped into memory. In this case, it is necessary to invalidate the affected pages from the mapping in order to allow the filesystem to maintain coherence during the write operation. I coded up a first attempt at a patch, and, in keeping with the open-source maxim "post early, post often", I posted the patch. I then considered how I was going to test it.

But before I could even decide on an overall test strategy, I got a reply to my posting pointing out a few bugs. I fixed the bugs and reposted the patch, and returned to thinking out my test strategy. However, before I had a chance to write any test code, I received a reply to my reposted patch, pointing out more bugs. This process repeated itself many times, and I am not sure that I ever got a chance to actually test the patch.

This experience brought home the truth of the open-source saying: Given enough eyeballs, all bugs are shallow [Ray99].

However, when you post some code or a given patch, it is worth asking a few questions:

1. How many of those eyeballs are actually going to look at your code?

… likely would have forgotten how the patch was supposed to work, making it much more difficult to fix them.

However, we must not forget the second tenet of open-source development, namely intensive testing. For example, a great many people test the Linux kernel. Some test patches as they are submitted, perhaps even yours. Others test the -next tree, which is helpful, but there is likely to be several weeks or even months of delay between the time that you write the patch and the time that it appears in the -next tree, by which time the patch will not be quite as fresh in your mind. Still others test maintainer trees, which often have a similar time delay.

Quite a few people don't test code until it is committed to mainline, or the master source tree (Linus's tree in the case of the Linux kernel). If your maintainer won't accept your patch until it has been tested, this presents you with a deadlock situation: Your patch won't be accepted until it is tested, but it won't be tested until it is accepted. Nevertheless, people who test mainline code are still relatively aggressive, given that many people and organizations do not test code until it has been pulled into a Linux distro.

And even if someone does test your patch, there is no guarantee that they will be running the hardware and software configuration and workload required to locate your bugs.

Therefore, even when writing code for an open-source project, you need to be prepared to develop and run your own test suite. Test development is an underappreciated and very valuable skill, so be sure to take full advantage of any existing test suites available to you. Important as test development is, we must leave further discussion of it to books dedicated to that topic. The following sections therefore discuss locating bugs in your code given that you already have a good test suite.
Much more sophisticated tools exist, with some of the 11.3 Assertions
more recent offering the ability to rewind backwards in
time from the point of failure.
No man really becomes a fool until he stops asking
These brute-force testing tools are all valuable, espe-
questions.
cially now that typical systems have more than 64K of
memory and CPUs running faster than 4 MHz. Much has Charles P. Steinmetz
been written about these tools, so this chapter will add
only a little more. Assertions are usually implemented in the following man-
However, these tools all have a serious shortcoming ner:
when you need a fastpath to tell you what is going wrong, 1 if (something_bad_is_happening())
namely, these tools often have excessive overheads. There 2 complain();
are special tracing technologies for this purpose, which
typically leverage data ownership techniques (see Chap- This pattern is often encapsulated into C-preprocessor
ter 8) to minimize the overhead of runtime data collec- macros or language intrinsics, for example, in the
tion. One example within the Linux kernel is “trace Linux kernel, this might be represented as WARN_
events” [Ros10b, Ros10c, Ros10d, Ros10a], which uses ON(something_bad_is_happening()). Of course, if
per-CPU buffers to allow data to be collected with ex- something_bad_is_happening() quite frequently, the
tremely low overhead. Even so, enabling tracing can resulting output might obscure reports of other prob-
sometimes change timing enough to hide bugs, resulting lems, in which case WARN_ON_ONCE(something_bad_
in heisenbugs, which are discussed in Section 11.6 and is_happening()) might be more appropriate.
especially Section 11.6.4. In the kernel, BPF can do Quick Quiz 11.6: How can you implement WARN_ON_
data reduction in the kernel, reducing the overhead of ONCE()?
transmitting the needed information from the kernel to
userspace [Gre19]. In userspace code, there is a huge In parallel code, one bad something that might hap-
number of tools that can help you. One good starting pen is that a function expecting to be called under a
point is Brendan Gregg’s blog.5 particular lock might be called without that lock being
Even if you avoid heisenbugs, other pitfalls await you. held. Such functions sometimes have header comments
For example, although the machine really does know all, stating something like “The caller must hold foo_lock
what it knows is almost always way more than your head when calling this function”, but such a comment does no
can hold. For this reason, high-quality test suites normally good unless someone actually reads it. An executable
come with sophisticated scripts to analyze the voluminous statement carries far more weight. The Linux kernel’s
output. But beware—scripts will only notice what you tell lockdep facility [Cor06a, Ros11] therefore provides a
them to. My rcutorture scripts are a case in point: Early lockdep_assert_held() function that checks whether
versions of those scripts were quite satisfied with a test the specified lock is held. Of course, lockdep incurs
run in which RCU grace periods stalled indefinitely. This significant overhead, and thus might not be helpful in
of course resulted in the scripts being modified to detect production.
An especially bad parallel-code something is unexpected concurrent access to data. The Kernel Concurrency Sanitizer (KCSAN) [Cor16a] uses existing markings such as READ_ONCE() and WRITE_ONCE() to determine which concurrent accesses deserve warning messages. KCSAN has a significant false-positive rate, especially from the viewpoint of developers thinking in terms of C as assembly language with additional syntax. KCSAN therefore provides a data_race() construct to forgive known-benign data races, and also the ASSERT_EXCLUSIVE_ACCESS() and ASSERT_EXCLUSIVE_WRITER() assertions to explicitly check for data races [EMV+20a, EMV+20b].
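A minimal kernel-style sketch of these KCSAN assertions; the counter, its lock, and both function names are illustrative:

#include <linux/spinlock.h>
#include <linux/compiler.h>
#include <linux/kcsan-checks.h>

static unsigned long counter;
static DEFINE_SPINLOCK(counter_lock);

static void counter_add(unsigned long n)
{
	spin_lock(&counter_lock);
	/* Writers must hold counter_lock, so KCSAN should see no other writer. */
	ASSERT_EXCLUSIVE_WRITER(counter);
	WRITE_ONCE(counter, counter + n);
	spin_unlock(&counter_lock);
}

static unsigned long counter_peek(void)
{
	/* Lockless diagnostic read: tell KCSAN this race is known and benign. */
	return data_race(counter);
}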
So what can be done in cases where checking is necessary, but where the overhead of runtime checking cannot be tolerated?
11.4 Static Analysis

Static analysis is a validation technique where one program takes a second program as input, reporting errors and vulnerabilities located in this second program. Interestingly enough, almost all programs are statically analyzed by their compilers or interpreters. These tools are far from perfect, but their ability to locate errors has improved immensely over the past few decades, in part because they now have much more than 64K bytes of memory in which to carry out their analyses.

The original UNIX lint tool [Joh77] was quite useful, though much of its functionality has since been incorporated into C compilers. There are nevertheless lint-like tools in use to this day. The sparse static analyzer [Cor04b] finds higher-level issues in the Linux kernel, including:

1. Misuse of pointers to user-space structures.

2. Assignments from too-long constants.

3. Empty switch statements.

4. Mismatched lock acquisition and release primitives.

5. Misuse of per-CPU primitives.

11.5 Code Review

11.5.1 Inspection

Traditionally, formal code inspections take place in face-to-face meetings with formally defined roles: Moderator, developer, and one or two other participants. The developer reads through the code, explaining what it is doing and why it works. The one or two other participants ask questions and raise issues, hopefully exposing the author's invalid assumptions, while the moderator's job is to resolve any resulting conflicts and take notes. This process can be extremely effective at locating bugs, particularly if all of the participants are familiar with the code at hand.

However, this face-to-face formal procedure does not necessarily work well in the global Linux kernel community. Instead, individuals review code separately and provide comments via email or IRC. The note-taking is provided by email archives or IRC logs, and moderators volunteer their services as required by the occasional flamewar. This process also works reasonably well, particularly if all of the participants are familiar with the code at hand. In fact, one advantage of the Linux kernel community approach over traditional formal inspections is the greater probability of contributions from people not familiar with the code, who might not be blinded by the author's invalid assumptions, and who might also test the code.

Quick Quiz 11.7: Just what invalid assumptions are you accusing Linux kernel hackers of harboring???
3. It is sometimes difficult to resolve flamewars when they do break out, especially when the combatants have disjoint goals, experience, and vocabulary.

Perhaps some of the needed improvements will be provided by continuous-integration-style testing, but there are many bugs more easily found by review than by testing. When reviewing, therefore, it is worthwhile to look at relevant documentation in commit logs, bug reports, and LWN articles. This documentation can help you quickly build up the required expertise.

11.5.2 Walkthroughs

1. The tester presents the test case.

2. The moderator starts the code under a debugger, using the specified test case as input.

3. Before each statement is executed, the developer is required to predict the outcome of the statement and explain why this outcome is correct.

4. If the outcome differs from that predicted by the developer, this is taken as a potential bug.

5. In parallel code, a "concurrency shark" asks what code might execute concurrently with this code, and why such concurrency is harmless.

Sadistic, certainly. Effective? Maybe. If the participants have a good understanding of the requirements, software tools, data structures, and algorithms, then walkthroughs can be extremely effective. If not, walkthroughs are often a waste of time.

11.5.3 Self-Inspection

Although developers are usually not all that effective at inspecting their own code, there are a number of situations where there is no reasonable alternative. For example, the developer might be the only person authorized to look at the code, other qualified developers might all be too busy, or the code in question might be sufficiently bizarre that the developer is unable to convince anyone else to take it seriously until after demonstrating a prototype. In these cases, the following procedure can be quite helpful, especially for complex parallel code:

1. Write a design document with requirements, diagrams for data structures, and rationale for design choices.

6. Produce proofs of correctness for any non-obvious code.

7. Use a source-code control system. Commit early; commit often.

8. Test the code fragments from the bottom up.

9. When all the code is integrated (but preferably before), do full-up functional and stress testing.

10. Once the code passes all tests, write code-level documentation, perhaps as an extension to the design document discussed above. Fix both the code and the test code as needed.

When I follow this procedure for new RCU code, there are normally only a few bugs left at the end. With a few prominent (and embarrassing) exceptions [McK11a], I usually manage to locate these bugs before others do. That said, this is getting more difficult over time as the number and variety of Linux-kernel users increases.

Quick Quiz 11.8: Why would anyone bother copying existing code in pen on paper??? Doesn't that just increase the probability of transcription errors?
Quick Quiz 11.9: This procedure is ridiculously over-engineered! How can you expect to get a reasonable amount of software written doing it this way???

Quick Quiz 11.10: What do you do if, after all the pen-on-paper copying, you find a bug while typing in the resulting code?

The above procedure works well for new code, but what if you need to inspect code that you have already written? You can of course apply the above procedure for old code in the special case where you wrote one to throw away [FPB79], but the following approach can also be helpful in less desperate circumstances:

1. Using your favorite documentation tool (LaTeX, HTML, OpenOffice, or straight ASCII), describe the high-level design of the code in question. Use lots of diagrams to illustrate the data structures and how these structures are updated.

2. Make a copy of the code, stripping away all comments.

3. Document what the code does statement by statement.

4. Fix bugs as you find them.

This works because describing the code in detail is an excellent way to spot bugs [Mye79]. This second procedure is also a good way to get your head around someone else's code, although the first step often suffices.

Although review and inspection by others is probably more efficient and effective, the above procedures can be quite helpful in cases where for whatever reason it is not feasible to involve others.

At this point, you might be wondering how to write parallel code without having to do all this boring paperwork. Here are some time-tested ways of accomplishing this:

1. Write a sequential program that scales through use of available parallel library functions.

2. Write sequential plug-ins for a parallel framework, such as map-reduce, BOINC, or a web-application server.

3. Fully partition your problems, then implement sequential program(s) that run in parallel without communication.

4. Stick to one of the application areas (such as linear algebra) where tools can automatically decompose and parallelize the problem.

5. Make extremely disciplined use of parallel-programming primitives, so that the resulting code is easily seen to be correct. But beware: It is always tempting to break the rules "just a little bit" to gain better performance or scalability. Breaking the rules often results in general breakage. That is, unless you carefully do the paperwork described in this section.

But the sad fact is that even if you do the paperwork or use one of the above ways to more-or-less safely avoid paperwork, there will be bugs. If nothing else, more users and a greater variety of users will expose more bugs more quickly, especially if those users are doing things that the original developers did not consider. The next section describes how to handle the probabilistic bugs that occur all too commonly when validating parallel software.

Quick Quiz 11.11: Wait! Why on earth would an abstract piece of software fail only sometimes???

11.6 Probability and Heisenbugs

With both heisenbugs and impressionist art, the closer you get, the less you see.
Unknown

So your parallel program fails sometimes. But you used techniques from the earlier sections to locate the problem and now have a fix in place! Congratulations!!!

Now the question is just how much testing is required in order to be certain that you actually fixed the bug, as opposed to just reducing the probability of it occurring on the one hand, having fixed only one of several related bugs on the other hand, or having made some ineffectual unrelated change on yet a third hand. In short, what is the answer to the eternal question posed by Figure 11.3?

Unfortunately, the honest answer is that an infinite amount of testing is required to attain absolute certainty.

Quick Quiz 11.12: Suppose that you had a very large number of systems at your disposal. For example, at current cloud prices, you can purchase a huge amount of CPU time at low cost. Why not use this approach to get close enough to certainty for all practical purposes?

But suppose that we are willing to give up absolute certainty in favor of high probability. Then we can bring powerful statistical tools to bear on this problem. However, this section will focus on simple statistical tools. These
tools are extremely helpful, but please note that reading this section is not a substitute for statistics classes.6

Footnote 6: Which I most highly recommend. The few statistics courses I have taken have provided value far beyond that of the time I spent on them.

For our start with simple statistical tools, we need to decide whether we are doing discrete or continuous testing. Discrete testing features well-defined individual test runs. For example, a boot-up test of a Linux kernel patch is a discrete test: The kernel either comes up or it does not. Although you might spend an hour boot-testing your kernel, the number of times you attempted to boot the kernel and the number of times the boot-up succeeded would often be of more interest than the length of time you spent testing. Functional tests tend to be discrete.

On the other hand, if my patch involved RCU, I would probably run rcutorture, which is a kernel module that, strangely enough, tests RCU. Unlike booting the kernel, where the appearance of a login prompt signals the successful end of a discrete test, rcutorture will happily continue torturing RCU until either the kernel crashes or until you tell it to stop. The duration of the rcutorture test is usually of more interest than the number of times you started and stopped it. Therefore, rcutorture is an example of a continuous test, a category that includes many stress tests.

Statistics for discrete tests are simpler and more familiar than those for continuous tests, and furthermore the statistics for discrete tests can often be pressed into service for continuous tests, though with some loss of accuracy. We therefore start with discrete tests.

11.6.1 Statistics for Discrete Tests

If the probability of a given test failing is f, then the probability of a single run succeeding is 1 - f, and the probability of all n runs succeeding, S_n, is:

    S_n = (1 - f)^n    (11.1)

The probability of failure is 1 - S_n, or:

    F_n = 1 - (1 - f)^n    (11.2)

Quick Quiz 11.13: Say what??? When I plug the earlier five-test 10 %-failure-rate example into the formula, I get 59,050 % and that just doesn't make sense!!!

So suppose that a given test has been failing 10 % of the time. How many times do you have to run the test to be 99 % sure that your supposed fix actually helped?

Another way to ask this question is "How many times would we need to run the test to cause the probability of failure to rise above 99 %?" After all, if we were to run the test enough times that the probability of seeing at least one failure becomes 99 %, if there are no failures, there is only 1 % probability of this "success" being due to dumb luck. And if we plug f = 0.1 into Eq. 11.2 and vary n, we find that 43 runs gives us a 98.92 % chance of at least one test failing given the original 10 % per-test failure rate, while 44 runs gives us a 99.03 % chance of at least one test failing. So if we run the test on our fix 44 times and see no failures, there is a 99 % probability that our fix really did help.

But repeatedly plugging numbers into Eq. 11.2 can get tedious, so let's solve for n:

    F_n = 1 - (1 - f)^n    (11.3)
    1 - F_n = (1 - f)^n    (11.4)
    log(1 - F_n) = n log(1 - f)    (11.5)

Dividing both sides by log(1 - f) gives the required number of runs, n = log(1 - F_n) / log(1 - f).
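A minimal C sketch of this computation (the function name is illustrative; link with -lm):

#include <math.h>
#include <stdio.h>

/* Error-free runs needed for confidence c, given per-run failure rate f. */
static double runs_needed(double f, double c)
{
	return log(1.0 - c) / log(1.0 - f);
}

int main(void)
{
	/* The 10% failure rate and 99% confidence used above: about 43.7 runs. */
	printf("%.1f\n", runs_needed(0.1, 0.99));
	return 0;
}

Rounding up to 44 runs agrees with the result of plugging numbers into Eq. 11.2 directly.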
the question was how long the test would need to run error-free on an alleged fix to be 99 % certain that the fix actually reduced the failure rate. In this case, m is zero, so that Eq. 11.8 reduces to:

    F_0 = e^(-λ)    (11.9)

Solving this requires setting F_0 to 0.01 and solving for λ, resulting in:

    λ = -ln 0.01 = 4.6    (11.10)

Because we get 0.3 failures per hour, the number of hours required is 4.6/0.3 = 14.3, which is within 10 % of the 13 hours calculated using the method in Section 11.6.2. Given that you normally won't know your failure rate to anywhere near 10 %, the simpler method described in Section 11.6.2 is almost always good and sufficient.

More generally, if we have n failures per unit time, and we want to be P % certain that a fix reduced the failure rate, we can use the following formula:

    T = -(1/n) ln((100 - P)/100)    (11.11)

Quick Quiz 11.15: Suppose that a bug causes a test failure three times per hour on average. How long must the test run error-free to provide 99.9 % confidence that the fix significantly reduced the probability of failure?

As before, the less frequently the bug occurs and the greater the required level of confidence, the longer the required error-free test run.

Suppose that a given test fails about once every hour, but after a bug fix, a 24-hour test run fails only twice. Assuming that the failure leading to the bug is a random occurrence, what is the probability that the small number of failures in the second run was due to random chance? In other words, how confident should we be that the fix actually had some effect on the bug? This probability may be calculated by summing Eq. 11.8 as follows:

    F_0 + F_1 + ... + F_{m-1} + F_m = e^(-λ) Σ_{i=0}^{m} λ^i / i!    (11.12)

This is the Poisson cumulative distribution function, which can be written more compactly (Eq. 11.13). Here m is the actual number of errors in the long test run (in this case, two) and λ is the expected number of errors in the long test run (in this case, 24). Plugging m = 2 and λ = 24 into this expression gives the probability of two or fewer failures as about 1.2 × 10^-8; in other words, we have a high level of confidence that the fix actually had some relationship to the bug.7

Footnote 7: Of course, this result in no way excuses you from finding and fixing the bug(s).
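A minimal C sketch of this computation, which builds each term of the sum from the previous one rather than evaluating factorials and exponentials separately (the function name is illustrative; link with -lm):

#include <math.h>
#include <stdio.h>

/* Probability of m or fewer failures when lambda failures are expected. */
static double poisson_cdf(int m, double lambda)
{
	double term = exp(-lambda);	/* The i = 0 term. */
	double sum = term;
	int i;

	for (i = 1; i <= m; i++) {
		term *= lambda / i;	/* lambda^i / i! from the previous term. */
		sum += term;
	}
	return sum;
}

int main(void)
{
	/* Two failures where 24 were expected: prints roughly 1.2e-08. */
	printf("%g\n", poisson_cdf(2, 24.0));
	return 0;
}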
Quick Quiz 11.16: Doing the summation of all the factorials and exponentials is a real pain. Isn't there an easier way?

Quick Quiz 11.17: But wait!!! Given that there has to be some number of failures (including the possibility of zero failures), shouldn't Eq. 11.13 approach the value 1 as m goes to infinity?

The Poisson distribution is a powerful tool for analyzing test results, but the fact is that in this last example there were still two remaining test failures in a 24-hour test run. Such a low failure rate results in very long test runs. The next section discusses counter-intuitive ways of improving this situation.
11.6.4 Hunting Heisenbugs

This line of thought also helps explain heisenbugs: Adding tracing and assertions can easily reduce the probability of a bug appearing, which is why extremely lightweight tracing and assertion mechanisms are so critically important.

The term "heisenbug" was inspired by the Heisenberg Uncertainty Principle from quantum physics, which states that it is impossible to exactly quantify a given particle's position and velocity at any given point in time [Hei27]. Any attempt to more accurately measure that particle's position will result in increased uncertainty of its velocity and vice versa. Similarly, attempts to track down the heisenbug cause its symptoms to radically change or even disappear completely.8 Of course, adding debugging overhead can and sometimes does make bugs more probable. But developers are more likely to remember the frustration of a disappearing heisenbug than the joy inspired by the bug becoming more easily reproduced!

If the field of physics inspired the name of this problem, it is only fair that the field of physics should inspire the solution. Fortunately, particle physics is up to the task: Why not create an anti-heisenbug to annihilate the
heisenbug? Or, perhaps more accurately, to annihilate the heisen-ness of the heisenbug? Although producing an anti-heisenbug for a given heisenbug is more an art than a science, the following sections describe a number of ways to do just that:

1. Add delay to race-prone regions (Section 11.6.4.1).

2. Increase workload intensity (Section 11.6.4.2).

3. Isolate suspicious subsystems (Section 11.6.4.3).

4. Simulate unusual events (Section 11.6.4.4).

5. Count near misses (Section 11.6.4.5).

These are followed by discussion in Section 11.6.4.6.

11.6.4.1 Add Delay

Consider the count-lossy code in Section 5.1. Adding printf() statements will likely greatly reduce or even eliminate the lost counts. However, converting the load-add-store sequence to a load-add-delay-store sequence will greatly increase the incidence of lost counts (try it!). Once you spot a bug involving a race condition, it is frequently possible to create an anti-heisenbug by adding delay in this manner.
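A minimal userspace sketch of such a load-add-delay-store sequence; the counter and the ten-millisecond poll() delay are illustrative, and the lost counts appear only when multiple threads execute this code concurrently:

#include <poll.h>

unsigned long counter;

static void lossy_inc(void)
{
	unsigned long tmp;

	tmp = counter;		/* Load. */
	tmp = tmp + 1;		/* Add. */
	poll(NULL, 0, 10);	/* Delay, holding the race window wide open. */
	counter = tmp;		/* Store, possibly overwriting other increments. */
}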
Of course, this begs the question of how to find the race condition in the first place. Although very lucky developers might accidentally create delay-based anti-heisenbugs when adding debug code, this is in general a dark art. Nevertheless, there are a number of things you can do to find your race conditions.

One approach is to recognize that race conditions often end up corrupting some of the data involved in the race. It is therefore good practice to double-check the synchronization of any corrupted data. Even if you cannot immediately recognize the race condition, adding delay before and after accesses to the corrupted data might change the failure rate. By adding and removing the delays in an organized fashion (e.g., binary search), you might learn more about the workings of the race condition.

Quick Quiz 11.18: How is this approach supposed to help if the corruption affected some unrelated pointer, which then caused the corruption???

Another important approach is to vary the software and hardware configuration and look for statistically significant differences in failure rate. For example, back in the 1990s, it was common practice to test on systems having CPUs running at different clock rates, which tended to make some types of race conditions more probable. One way of getting a similar effect today is to test on multi-socket systems, thus incurring the large delays described in Section 3.2.

However you choose to add delays, you can then look more intensively at the code implicated by those delays that make the greatest difference in failure rate. It might be helpful to test that code in isolation, for example.

One important aspect of software configuration is the history of changes, which is why git bisect is so useful. Bisection of the change history can provide very valuable clues as to the nature of the heisenbug, in this case presumably by locating a commit that shows a change in the software's response to the addition or removal of a given delay.

Quick Quiz 11.19: But I did the bisection, and ended up with a huge commit. What do I do now?

Once you locate the suspicious section of code, you can then introduce delays to attempt to increase the probability of failure. As we have seen, increasing the probability of failure makes it much easier to gain high confidence in the corresponding fix.

However, it is sometimes quite difficult to track down the problem using normal debugging techniques. The following sections present some other alternatives.

11.6.4.2 Increase Workload Intensity

It is often the case that a given test suite places relatively low stress on a given subsystem, so that a small change in timing can cause a heisenbug to disappear. One way to create an anti-heisenbug for this case is to increase the workload intensity, which has a good chance of increasing the bug's probability. If the probability is increased sufficiently, it may be possible to add lightweight diagnostics such as tracing without causing the bug to vanish.

How can you increase the workload intensity? This depends on the program, but here are some things to try:

1. Add more CPUs.

2. If the program uses networking, add more network adapters and more or faster remote systems.

3. If the program is doing heavy I/O when the problem occurs, either (1) add more storage devices, (2) use faster storage devices, for example, substitute SSDs for disks, or (3) use a RAM-based filesystem to substitute main memory for mass storage.
level of contention. If you aren't sure whether you should go large or go small, just try both.

[Figure 11.5: rcutorture error and near-miss timeline. Labels: Reader, Error, Near Miss, Grace-Period End, Time.]
rcutorture error and near miss is shown in Figure 11.5. To qualify as a full-fledged error, an RCU read-side critical section must extend from the call_rcu() that initiated a grace period, through the remainder of the previous grace period, through the entirety of the grace period initiated by the call_rcu() (denoted by the region between the jagged lines), and through the delay from the end of that grace period to the callback invocation, as indicated by the "Error" arrow. However, the formal definition of RCU prohibits RCU read-side critical sections from extending across a single grace period, as indicated by the "Near Miss" arrow. This suggests using near misses as the error condition; however, this can be problematic because different CPUs can have different opinions as to exactly where a given grace period starts and ends, as indicated by the jagged lines.11 Using the near misses as the error condition could therefore result in false positives, which need to be avoided in the automated rcutorture testing.

Footnote 11: In real life, these lines can be much more jagged because idle CPUs can be completely unaware of a great many recent grace periods.

By sheer dumb luck, rcutorture happens to include some statistics that are sensitive to the near-miss version of the grace period. As noted above, these statistics are subject to false positives due to their unsynchronized access to RCU's state variables, but these false positives turn out to be extremely rare on strongly ordered systems such as the IBM mainframe and x86, occurring less than once per thousand hours of testing.

These near misses occurred roughly once per hour, about two orders of magnitude more frequently than the actual errors. Use of these near misses allowed the bug's root cause to be identified in less than a week and a high degree of confidence in the fix to be built in less than a day. In contrast, excluding the near misses in favor of the real errors would have required months of debug and validation time.

To sum up near-miss counting, the general approach is to replace counting of infrequent failures with more-frequent near misses that are believed to be correlated with those failures. These near misses can be considered an anti-heisenbug to the real failure's heisenbug because the near misses, being more frequent, are likely to be more robust in the face of changes to your code, for example, the changes you make to add debugging code.

11.6.4.6 Heisenbug Discussion

The alert reader might have noticed that this section was fuzzy and qualitative, in stark contrast to the precise mathematics of Sections 11.6.1, 11.6.2, and 11.6.3. If you love precision and mathematics, you may be disappointed to learn that the situations to which this section applies are far more common than those to which the preceding sections apply.

In fact, the common case is that although you might have reason to believe that your code has bugs, you have no idea what those bugs are, what causes them, how likely they are to appear, or what conditions affect their probability of appearance. In this all-too-common case, statistics cannot help you.12 That is to say, statistics cannot help you directly. But statistics can be of great indirect help: if you have the humility required to admit that you make mistakes, that you can reduce the probability of these mistakes (for example, by getting enough sleep), and that the number and type of mistakes you made in the past is indicative of the number and type of mistakes that you are likely to make in the future. For example, I have a deplorable tendency to forget to write a small but critical portion of the initialization code, and frequently get most or even all of a parallel program correct, except for a stupid omission in initialization. Once I was willing to admit to myself that I am prone to this type of mistake, it was easier (but not easy!) to force myself to double-check my initialization code. Doing this allowed me to find numerous bugs ahead of time.

Footnote 12: Although if you know what your program is supposed to do and if your program is small enough (both less likely than you might think), then the formal-verification tools described in Chapter 12 can be helpful.

When your quick bug hunt morphs into a long-term quest, it is important to log everything you have tried and what happened. In the common case where the software is changing during the course of your quest, make sure to record the exact version of the software to which each log entry applies. From time to time, reread the entire log in order to make connections between clues encountered at different times. Such rereading is especially important upon encountering a surprising test result. For example, I reread my log upon realizing that what I thought was a failure of the hypervisor to schedule a vCPU was instead an interrupt storm preventing that vCPU from making forward progress on the interrupted code. If the code you are debugging is new to you, this log is also an excellent place to document the relationships between code and data structures. Keeping a log when you are furiously chasing a difficult bug might seem like needless paperwork, but it has on many occasions saved me from debugging around and around in circles, which can waste far more time than keeping a log ever could.
Using Taleb's nomenclature [Tal07], a white swan is a bug that we can reproduce. We can run a large number of tests, use ordinary statistics to estimate the bug's probability, and use ordinary statistics again to estimate our confidence in a proposed fix. An unsuspected bug is a black swan. We know nothing about it, we have no tests that have yet caused it to happen, and statistics is of no help. Studying our own behavior, especially the number and types of mistakes we make, can turn black swans into grey swans. We might not know exactly what the bugs are, but we have some idea of their number and maybe also of their type. Ordinary statistics is still of no help (at least not until we are able to reproduce one of the bugs), but robust13 testing methods can be of great help. The goal, therefore, is to use experience and good validation practices to turn the black swans grey, focused testing and analysis to turn the grey swans white, and ordinary methods to fix the white swans.

That said, thus far, we have focused solely on bugs in the parallel program's functionality. However, performance is a first-class requirement for a parallel program. Otherwise, why not write a sequential program? To repurpose Kipling, our goal when writing parallel code is to fill the unforgiving second with sixty minutes' worth of distance run. The next section therefore discusses a number of performance bugs that would be happy to thwart this Kiplingesque goal.

11.7 Performance Estimation

Quick Quiz 11.22: But if you are going to put in all the hard work of parallelizing an application, why not do it right? Why settle for anything less than optimal performance and linear scalability?

Validating a parallel program must therefore include validating its performance. But validating performance means having a workload to run and performance criteria with which to evaluate the program at hand. These needs are often met by performance benchmarks, which are discussed in the next section.

11.7.1 Benchmarking

Frequent abuse aside, benchmarks are both useful and heavily used, so it is not helpful to be too dismissive of them. Benchmarks span the range from ad hoc test jigs to international standards, but regardless of their level of formality, benchmarks serve four major purposes:

1. Providing a fair framework for comparing competing implementations.

2. Focusing competitive energy on improving implementations in ways that matter to users.

3. Serving as example uses of the implementations being benchmarked.

4. Serving as a marketing tool to highlight your software against your competitors' offerings.
Creating a benchmark that approximates the application can help overcome these obstacles. A carefully constructed benchmark can help promote performance, scalability, energy efficiency, and much else besides. However, be careful to avoid investing too much into the benchmarking effort. It is after all important to invest at least a little into the application itself [Gra91].

11.7.2 Profiling

In many cases, a fairly small portion of your software is responsible for the majority of the performance and scalability shortfall. However, developers are notoriously unable to identify the actual bottlenecks by inspection. For example, in the case of a kernel buffer allocator, all attention focused on a search of a dense array which turned out to represent only a few percent of the allocator's execution time. An execution profile collected via a logic analyzer focused attention on the cache misses that were actually responsible for the majority of the problem [MS93].

An old-school but quite effective method of tracking down performance and scalability bugs is to run your program under a debugger, then periodically interrupt it, recording the stacks of all threads at each interruption. The theory here is that if something is slowing down your program, it has to be visible in your threads' executions.

That said, there are a number of tools that will usually do a much better job of helping you to focus your attention where it will do the most good. Two popular choices are gprof and perf. To use perf on a single-process program, prefix your command with perf record, then after the command completes, type perf report. There is a lot of work on tools for performance debugging of multi-threaded programs, which should make this important job easier. Again, one good starting point is Brendan Gregg's blog.15

Footnote 15: http://www.brendangregg.com/blog/

11.7.3 Differential Profiling

Scalability problems will not necessarily be apparent unless you are running on very large systems. However, it is sometimes possible to detect impending scalability problems even when running on much smaller systems. One technique for doing this is called differential profiling.

The idea is to run your workload under two different sets of conditions. For example, you might run it on two CPUs, then run it again on four CPUs. You might instead vary the load placed on the system, the number of network adapters, the number of mass-storage devices, and so on. You then collect profiles of the two runs, and mathematically combine corresponding profile measurements. For example, if your main concern is scalability, you might take the ratio of corresponding measurements, and then sort the ratios into descending numerical order. The prime scalability suspects will then be sorted to the top of the list [McK95, McK99].

Some tools such as perf have built-in differential-profiling support.
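A toy C sketch of the ratio-and-sort step; the function names and sample percentages are invented for illustration:

#include <stdio.h>
#include <stdlib.h>

/* Fraction of profiler samples attributed to each function in two runs. */
struct prof_entry {
	const char *name;
	double samples_2cpu;
	double samples_4cpu;
};

static struct prof_entry prof[] = {
	{ "list_search",  12.0, 13.1 },
	{ "lock_acquire",  3.0, 19.7 },
	{ "do_io",        41.0, 40.2 },
};

static int cmp_ratio_desc(const void *a, const void *b)
{
	const struct prof_entry *pa = a, *pb = b;
	double ra = pa->samples_4cpu / pa->samples_2cpu;
	double rb = pb->samples_4cpu / pb->samples_2cpu;

	return (ra < rb) - (ra > rb);	/* Largest ratio first. */
}

int main(void)
{
	int i, n = sizeof(prof) / sizeof(prof[0]);

	/* Scalability suspects (largest 4-CPU/2-CPU ratio) sort to the top. */
	qsort(prof, n, sizeof(prof[0]), cmp_ratio_desc);
	for (i = 0; i < n; i++)
		printf("%-12s %5.2f\n", prof[i].name,
		       prof[i].samples_4cpu / prof[i].samples_2cpu);
	return 0;
}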
11.7.4 Microbenchmarking

Microbenchmarking can be useful when deciding which algorithms or data structures are worth incorporating into a larger body of software for deeper evaluation.

One common approach to microbenchmarking is to measure the time, run some number of iterations of the code under test, then measure the time again. The difference between the two times divided by the number of iterations gives the measured time required to execute the code under test.
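A minimal sketch of this measurement loop, including a few warm-up iterations; the iteration counts and the trivial code under test are illustrative:

#include <stdio.h>
#include <time.h>

#define WARMUP	100
#define NITER	1000000

static volatile unsigned long sink;

static void code_under_test(void)
{
	sink++;		/* Stand-in for the code being measured. */
}

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	double t1, t2;
	long i;

	for (i = 0; i < WARMUP; i++)	/* Prime caches and fault in pages. */
		code_under_test();
	t1 = now();
	for (i = 0; i < NITER; i++)
		code_under_test();
	t2 = now();
	printf("%g seconds per call\n", (t2 - t1) / NITER);
	return 0;
}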
Unfortunately, this approach to measurement allows any number of errors to creep in, including:

1. The measurement will include some of the overhead of the time measurement. This source of error can be reduced to an arbitrarily small value by increasing the number of iterations.

2. The first few iterations of the test might incur cache misses or (worse yet) page faults that might inflate the measured value. This source of error can also be reduced by increasing the number of iterations, or it can often be eliminated entirely by running a few warm-up iterations before starting the measurement period. Most systems have ways of detecting whether a given process incurred a page fault, and you should make use of this to reject runs whose performance has been thus impeded.

3. Some types of interference, for example, random memory errors, are so rare that they can be dealt with by running a number of sets of iterations of the test. If the level of interference was statistically significant, any performance outliers could be rejected statistically.

4. Any iteration of the test might be interfered with by other activity on the system. Sources of interference include other applications, system utilities
and daemons, device interrupts, firmware interrupts (including system management interrupts, or SMIs), virtualization, memory errors, and much else besides. Assuming that these sources of interference occur randomly, their effect can be minimized by reducing the number of iterations.

5. Thermal throttling can understate scalability because increasing CPU activity increases heat generation, and on systems without adequate cooling (most of them!), this can result in the CPU frequency decreasing as the number of CPUs increases.16 Of course, if you are testing an application to evaluate its expected behavior when run in production, such thermal throttling is simply a fact of life. Otherwise, if you are interested in theoretical scalability, use a system with adequate cooling or reduce the CPU clock rate to a level that the cooling system can handle.

Footnote 16: Systems with adequate cooling tend to look like gaming systems.

The first and fourth sources of interference provide conflicting advice, which is one sign that we are living in the real world. The remainder of this section looks at ways of resolving this conflict.

Quick Quiz 11.23: But what about other sources of error, for example, due to interactions between caches and memory layout?

The following sections discuss ways of dealing with these measurement errors, with Section 11.7.5 covering isolation techniques that may be used to prevent some forms of interference, and with Section 11.7.6 covering methods for detecting interference so as to reject measurement data that might have been corrupted by that interference.

11.7.5 Isolation

The Linux kernel provides a number of ways to isolate a group of CPUs from outside interference.

First, let's look at interference by other processes, threads, and tasks. The POSIX sched_setaffinity() system call may be used to move most tasks off of a given set of CPUs and to confine your tests to that same group. The Linux-specific user-level taskset command may be used for the same purpose, though both sched_setaffinity() and taskset require elevated permissions. Linux-specific control groups (cgroups) may be used for this same purpose. This approach can be quite effective at reducing interference, and is sufficient in many cases. However, it does have limitations, for example, it cannot do anything about the per-CPU kernel threads that are often used for housekeeping tasks.

One way to avoid interference from per-CPU kernel threads is to run your test at a high real-time priority, for example, by using the POSIX sched_setscheduler() system call. However, note that if you do this, you are implicitly taking on responsibility for avoiding infinite loops, because otherwise your test can prevent part of the kernel from functioning. This is an example of the Spiderman Principle: "With great power comes great responsibility." And although the default real-time throttling settings often address such problems, they might do so by causing your real-time threads to miss their deadlines.
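Pulling these process-level techniques together, the following minimal sketch confines the current process to CPUs 2 and 3 and raises it to a modest real-time priority; the CPU numbers and priority are arbitrary, and both calls normally require elevated privileges:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t set;
	struct sched_param sp = { .sched_priority = 10 };

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* Confine the test to CPUs 2 and 3. */
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* SCHED_FIFO: beware, a runaway loop can now starve the kernel. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}

	/* Run the code under test here. */
	return 0;
}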
These approaches can greatly reduce, and perhaps even eliminate, interference from processes, threads, and tasks. However, they do nothing to prevent interference from device interrupts, at least in the absence of threaded interrupts. Linux allows some control of threaded interrupts via the /proc/irq directory, which contains numerical directories, one per interrupt vector. Each numerical directory contains smp_affinity and smp_affinity_list. Given sufficient permissions, you can write a value to these files to restrict interrupts to the specified set of CPUs. For example, either "echo 3 > /proc/irq/23/smp_affinity" or "echo 0-1 > /proc/irq/23/smp_affinity_list" would confine interrupts on vector 23 to CPUs 0 and 1, at least given sufficient privileges. You can use "cat /proc/interrupts" to obtain a list of the interrupt vectors on your system, how many are handled by each CPU, and what devices use each interrupt vector.

Running a similar command for all interrupt vectors on your system would confine interrupts to CPUs 0 and 1, leaving the remaining CPUs free of interference. Or mostly free of interference, anyway. It turns out that the scheduling-clock interrupt fires on each CPU that is running in user mode.17 In addition, you must take care to ensure that the set of CPUs that you confine the interrupts to is capable of handling the load.

Footnote 17: Frederic Weisbecker leads a NO_HZ_FULL adaptive-ticks project that allows scheduling-clock interrupts to be disabled on CPUs that have only one runnable task. As of 2021, this is largely complete.

But this only handles processes and interrupts running in the same operating-system instance as the test. Suppose that you are running the test in a guest OS that is itself running on a hypervisor, for example, Linux running KVM? Although you can in theory apply the same techniques at the hypervisor level that you can at the
Listing 11.1: Using getrusage() to Detect Context Switches

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 0 if test results should be rejected. */
int runtest(void)
{
	struct rusage ru1;
	struct rusage ru2;

	if (getrusage(RUSAGE_SELF, &ru1) != 0) {
		perror("getrusage");
		abort();
	}
	/* run test here. */
	if (getrusage(RUSAGE_SELF, &ru2) != 0) {
		perror("getrusage");
		abort();
	}
	/* Accept the run only if neither voluntary nor involuntary
	 * context switches occurred during the test. */
	return (ru1.ru_nvcsw == ru2.ru_nvcsw &&
	        ru1.ru_nivcsw == ru2.ru_nivcsw);
}

guest-OS level, it is quite common for hypervisor-level operations to be restricted to authorized personnel. In addition, none of these techniques work against firmware-level interference.

Quick Quiz 11.24: Wouldn't the techniques suggested to isolate the code under test also affect that code's performance, particularly if it is running within a larger application?

Of course, if it is in fact the interference that is producing the behavior of interest, you will instead need to promote interference, in which case being unable to prevent it is not a problem. But if you really do need interference-free measurements, then instead of preventing the interference, you might need to detect the interference as described in the next section.

11.7.6 Detecting Interference

If you cannot prevent interference, perhaps you can detect it and reject results from any affected test runs. Section 11.7.6.1 describes methods of rejection involving additional measurements, while Section 11.7.6.2 describes statistics-based rejection.

11.7.6.1 Detecting Interference Via Measurement

Many systems, including Linux, provide means for determining after the fact whether some forms of interference have occurred. For example, process-based interference results in context switches, which, on Linux-based systems, are visible in /proc/<PID>/sched via the nr_switches field. Similarly, interrupt-based interference can be detected via the /proc/interrupts file.

Opening and reading files is not the way to achieve low overhead, and it is possible to get the count of context switches for a given thread by using the getrusage() system call, as shown in Listing 11.1. This same system call can be used to detect minor page faults (ru_minflt) and major page faults (ru_majflt).

Unfortunately, detecting memory errors and firmware interference is quite system-specific, as is the detection of interference due to virtualization. Although avoidance is better than detection, and detection is better than statistics, there are times when one must avail oneself of statistics, a topic addressed in the next section.

11.7.6.2 Detecting Interference Via Statistics

Any statistical analysis will be based on assumptions about the data, and performance microbenchmarks often support the following assumptions:

1. Smaller measurements are more likely to be accurate than larger measurements.

2. The measurement uncertainty of good data is known.

3. A reasonable fraction of the test runs will result in good data.

The fact that smaller measurements are more likely to be accurate than larger measurements suggests that sorting the measurements in increasing order is likely to be productive.18 The fact that the measurement uncertainty is known allows us to accept measurements within this uncertainty of each other: If the effects of interference are large compared to this uncertainty, this will ease rejection of bad data. Finally, the fact that some fraction (for example, one third) can be assumed to be good allows us to blindly accept the first portion of the sorted list, and this data can then be used to gain an estimate of the natural variation of the measured data, over and above the assumed measurement error.

Footnote 18: To paraphrase the old saying, "Sort first and ask questions later."

The approach is to take the specified number of leading elements from the beginning of the sorted list, and use these to estimate a typical inter-element delta, which in turn may be multiplied by the number of elements in the list to obtain an upper bound on permissible values. The algorithm then repeatedly considers the next element of the list. If it falls below the upper bound, and if the distance between the next element and the previous element is not too much greater than the average inter-element distance
Listing 11.2: Statistical Elimination of Interference
1 div=3
2 rel=0.01
3 tre=10
4 while test $# -gt 0
5 do
6 case "$1" in
7 --divisor)
8 shift
9 div=$1
10 ;;
11 --relerr)
12 shift
13 rel=$1
14 ;;
15 --trendbreak)
16 shift
17 tre=$1
18 ;;
19 esac
20 shift
21 done
22
23 awk -v divisor=$div -v relerr=$rel -v trendbreak=$tre '{
24 for (i = 2; i <= NF; i++)
25 d[i - 1] = $i;
26 asort(d);
27 i = int((NF + divisor - 1) / divisor);
28 delta = d[i] - d[1];
29 maxdelta = delta * divisor;
30 maxdelta1 = delta + d[i] * relerr;
31 if (maxdelta1 > maxdelta)
32 maxdelta = maxdelta1;
33 for (j = i + 1; j < NF; j++) {
34 if (j <= 2)
35 maxdiff = d[NF - 1] - d[1];
36 else
37 maxdiff = trendbreak * (d[j - 1] - d[1]) / (j - 2);
38 if (d[j] - d[1] > maxdelta && d[j] - d[j - 1] > maxdiff)
39 break;
40 }
41 n = sum = 0;
42 for (k = 1; k < j; k++) {
43 sum += d[k];
44 n++;
45 }
46 min = d[1];
47 max = d[j - 1];
48 avg = sum / n;
49 print $1, avg, min, max, n, NF - 1;
50 }'

for the portion of the list accepted thus far, then the next element is accepted and the process repeats. Otherwise, the remainder of the list is rejected.

Listing 11.2 shows a simple sh/awk script implementing this notion. Input consists of an x-value followed by an arbitrarily long list of y-values, and output consists of one line for each input line, with fields as follows:

1. The x-value.

2. The average of the selected data.

3. The minimum of the selected data.

4. The maximum of the selected data.

5. The number of selected data items.

6. The number of input data items.

This script takes three optional arguments as follows:

--divisor: Number of segments to divide the list into, for example, a divisor of four means that the first quarter of the data elements will be assumed to be good. This defaults to three.

--relerr: Relative measurement error. The script assumes that values that differ by less than this error are for all intents and purposes equal. This defaults to 0.01, which is equivalent to 1 %.

--trendbreak: Ratio of inter-element spacing constituting a break in the trend of the data. For example, if the average spacing in the data accepted so far is 1.5 and the trend-break ratio is 2.0, then a next data value that differs from the last one by more than 3.0 constitutes a break in the trend. (Unless of course, the relative error is greater than 3.0, in which case the "break" will be ignored.)

Lines 1–3 of Listing 11.2 set the default values for the parameters, and lines 4–21 parse any command-line overriding of these parameters. The awk invocation on line 23 sets the values of the divisor, relerr, and trendbreak variables to their sh counterparts. In the usual awk manner, lines 24–50 are executed on each input line. The loop spanning lines 24 and 25 copies the input y-values to the d array, which line 26 sorts into increasing order. Line 27 computes the number of trustworthy y-values by applying divisor and rounding up.

Lines 28–32 compute the maxdelta lower bound on the upper bound of y-values. To this end, line 29 multiplies the difference in values over the trusted region of data by the divisor, which projects the difference in values across the trusted region across the entire set of y-values. However, this value might well be much smaller than the relative error, so line 30 computes the absolute error (d[i] * relerr) and adds that to the difference delta across the trusted portion of the data. Lines 31 and 32 then compute the maximum of these two values.

Each pass through the loop spanning lines 33–40 attempts to add another data value to the set of good data. Lines 34–39 compute the trend-break delta, with line 34 disabling this limit if we don't yet have enough values to compute a trend, and with line 37 multiplying trendbreak by the average difference between pairs of
data values in the good set. If line 38 determines that the candidate data value would exceed the lower bound on the upper bound (maxdelta) and that the difference between the candidate data value and its predecessor exceeds the trend-break difference (maxdiff), then line 39 exits the loop: We have the full good set of data.

Lines 41–49 then compute and print statistics.

Quick Quiz 11.25: This approach is just plain weird! Why not use means and standard deviations, like we were taught in our statistics classes?

Quick Quiz 11.26: But what if all the y-values in the trusted group of data are exactly zero? Won't that cause the script to reject any non-zero value?

Although statistical interference detection can be quite useful, it should be used only as a last resort. It is far better to avoid interference in the first place (Section 11.7.5), or, failing that, detecting interference via measurement (Section 11.7.6.1).

11.8 Summary

To err is human! Stop being human!!!
Ed Nofziger

[Figure 11.6: Choose Validation Methods Wisely]

Although validation never will be an exact science, much can be gained by taking an organized approach to it, as an organized approach will help you choose the right validation tools for your job, avoiding situations like the one fancifully depicted in Figure 11.6.

A key choice is that of statistics. Although the methods described in this chapter work very well most of the time, they do have their limitations, courtesy of the Halting Problem [Tur37, Pul00]. Fortunately for us, there is a huge number of special cases in which we can not only work out whether a program will halt, but also estimate how long it will run before halting, as discussed in Section 11.7. Furthermore, in cases where a given program might or might not work correctly, we can often establish estimates for what fraction of the time it will work correctly, as discussed in Section 11.6.

Nevertheless, unthinking reliance on these estimates is brave to the point of foolhardiness. After all, we are summarizing a huge mass of complexity in code and data structures down to a single solitary number. Even though we can get away with such bravery a surprisingly large fraction of the time, abstracting all that code and data away will occasionally cause severe problems.

One possible problem is variability, where repeated runs give wildly different results. This problem is often addressed using standard deviation; however, using two numbers to summarize the behavior of a large and complex program is about as brave as using only one number. In computer programming, the surprising thing is that use of the mean, or of the mean and standard deviation, is often sufficient. Nevertheless, there are no guarantees.

One cause of variation is confounding factors. For example, the CPU time consumed by a linked-list search will depend on the length of the list. Averaging together runs with wildly different list lengths will probably not be useful, and adding a standard deviation to the mean will not be much better. The right thing to do would be to control for list length, either by holding the length constant or by measuring CPU time as a function of list length.

Of course, this advice assumes that you are aware of the confounding factors, and Murphy says that you will not be. I have been involved in projects that had confounding factors as diverse as air conditioners (which drew considerable power at startup, thus causing the voltage supplied to the computer to momentarily drop too low, sometimes resulting in failure), cache state (resulting in odd variations in performance), I/O errors (including disk errors, packet loss, and duplicate Ethernet MAC addresses), and even porpoises (which could not resist playing with an array of transponders, which could be otherwise used for high-precision acoustic positioning and navigation). And this is but one reason why a good night's sleep is such an effective debugging tool.

In short, validation always will require some measure of the behavior of the system. To be at all useful, this measure must be a severe summarization of the system, which in turn means that it can be misleading. So as the saying goes, "Be careful. It is a real world out there."
Formal Verification

Beware of bugs in the above code; I have only proved it correct, not tried it.
Donald Knuth

Parallel algorithms can be hard to write, and even harder to debug. Testing, though essential, is insufficient, as fatal race conditions can have extremely low probabilities of occurrence. Proofs of correctness can be valuable, but in the end are just as prone to human error as is the original algorithm. In addition, a proof of correctness cannot be expected to find errors in your assumptions, shortcomings in the requirements, misunderstandings of the underlying software or hardware primitives, or errors that you did not think to construct a proof for. This means that formal methods can never replace testing. Nevertheless, formal methods can be a valuable addition to your validation toolbox.

It would be very helpful to have a tool that could somehow locate all race conditions. A number of such tools exist, for example, Section 12.1 provides an introduction to the general-purpose state-space search tools Promela and Spin, Section 12.2 similarly introduces the special-purpose ppcmem and cppmem tools, Section 12.3 looks at an example axiomatic approach, Section 12.4 briefly overviews SAT solvers, Section 12.5 briefly overviews stateless model checkers, Section 12.6 sums up use of formal-verification tools for verifying parallel algorithms, and finally Section 12.7 discusses how to decide how much and what type of validation to apply to a given software project.

12.1 State-Space Search

Follow every byway / Every path you know.
"Climb Every Mountain", Rodgers & Hammerstein

This section features the general-purpose Promela and Spin tools, which may be used to carry out a full state-space search of many types of multi-threaded code. They are also used to verify data communication protocols. Section 12.1.1 introduces Promela and Spin, including a couple of warm-up exercises verifying both non-atomic and atomic increment. Section 12.1.2 describes use of Promela, including example command lines and a comparison of Promela syntax to that of C. Section 12.1.3 shows how Promela may be used to verify locking, Section 12.1.4 uses Promela to verify an unusual implementation of RCU named "QRCU", and finally Section 12.1.5 applies Promela to early versions of RCU's dyntick-idle implementation.

12.1.1 Promela and Spin

Promela is a language designed to help verify protocols, but which can also be used to verify small parallel algorithms. You recode your algorithm and correctness constraints in the C-like language Promela, and then use Spin to translate it into a C program that you can compile and run. The resulting program carries out a full state-space search of your algorithm, either verifying or finding counter-examples for assertions that you can associate with your Promela program.

This full state-space search can be extremely powerful, but can also be a two-edged sword. If your algorithm is too complex or your Promela implementation is careless, there might be more states than fit in memory. Furthermore, even given sufficient memory, the state-space search might well run for longer than the expected lifetime of the universe. Therefore, use this tool for compact but complex parallel algorithms. Attempts to naively apply it to even moderate-scale algorithms (let alone the full Linux kernel) will end badly.

Promela and Spin may be downloaded from https://spinroot.com/spin/whatispin.html.

The above site also gives links to Gerard Holzmann's excellent book [Hol03] on Promela and Spin, as well as
searchable online references starting at: https://www.spinroot.com/spin/Man/index.html.

The remainder of this section describes how to use Promela to debug parallel algorithms, starting with simple examples and progressing to more complex uses.

12.1.1.1 Warm-Up: Non-Atomic Increment

Listing 12.1: Promela Code for Non-Atomic Increment

 1 #define NUMPROCS 2
 2
 3 byte counter = 0;
 4 byte progress[NUMPROCS];
 5
 6 proctype incrementer(byte me)
 7 {
 8   int temp;
 9
10   temp = counter;
11   counter = temp + 1;
12   progress[me] = 1;
13 }
14
15 init {
16   int i = 0;
17   int sum = 0;
18
19   atomic {
20     i = 0;
21     do
22     :: i < NUMPROCS ->
23       progress[i] = 0;
24       run incrementer(i);
25       i++;
26     :: i >= NUMPROCS -> break;
27     od;
28   }
29   atomic {
30     i = 0;
31     sum = 0;
32     do
33     :: i < NUMPROCS ->
34       sum = sum + progress[i];
35       i++
36     :: i >= NUMPROCS -> break;
37     od;
38     assert(sum < NUMPROCS || counter == NUMPROCS);
39   }
40 }

Listing 12.1 demonstrates the textbook race condition resulting from non-atomic increment. Line 1 defines the number of processes to run (we will vary this to see the effect on state space), line 3 defines the counter, and line 4 is used to implement the assertion that appears on lines 29–39.

Lines 6–13 define a process that increments the counter non-atomically. The argument me is the process number,

process's completion. Because the Spin system will fully search the state space, including all possible sequences of states, there is no need for the loop that would be used for conventional stress testing.

Lines 15–40 are the initialization block, which is executed first. Lines 19–28 actually do the initialization, while lines 29–39 perform the assertion. Both are atomic blocks in order to avoid unnecessarily increasing the state space: Because they are not part of the algorithm proper, we lose no verification coverage by making them atomic.

The do-od construct on lines 21–27 implements a Promela loop, which can be thought of as a C for (;;) loop containing a switch statement that allows expressions in case labels. The condition blocks (prefixed by ::) are scanned non-deterministically, though in this case only one of the conditions can possibly hold at a given time. The first block of the do-od from lines 22–25 initializes the i-th incrementer's progress cell, runs the i-th incrementer's process, and then increments the variable i. The second block of the do-od on line 26 exits the loop once these processes have been started.

The atomic block on lines 29–39 also contains a similar do-od loop that sums up the progress counters. The assert() statement on line 38 verifies that if all processes have been completed, then all counts have been correctly recorded.

You can build and run this program as follows:

spin -a increment.spin    # Translate the model to C
cc -DSAFETY -o pan pan.c  # Compile the model
./pan                     # Run the model

This will produce output as shown in Listing 12.2. The first line tells us that our assertion was violated (as expected given the non-atomic increment!). The second line reports that a trail file was written describing how the assertion was violated. The "Warning" line reiterates that all was not well with our model. The second paragraph describes the type of state-search being carried out, in this case for assertion violations and invalid end states. The third paragraph gives state-size statistics: This small model had only 45 states. The final line shows memory usage.

The trail file may be rendered human-readable as follows:

spin -t -p increment.spin
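For comparison, here is a rough pthreads analogue of this model, a sketch written for this discussion rather than code taken from the CodeSamples directory. Each thread performs the same plain load-increment-store sequence, so both threads can read the same old value and the final count can fall short of NUMPROCS. Unlike Spin's exhaustive search, running this program repeatedly might never happen to exhibit the losing interleaving, which is exactly why the conventional stress-testing loop mentioned above is such a weak substitute.

#include <pthread.h>
#include <stdio.h>

#define NUMPROCS 2

static int counter = 0;

static void *incrementer(void *arg)
{
        int temp;

        (void)arg;
        temp = counter;      /* racy load */
        counter = temp + 1;  /* racy store: may overwrite the other thread's update */
        return NULL;
}

int main(void)
{
        pthread_t tid[NUMPROCS];
        int i;

        for (i = 0; i < NUMPROCS; i++)
                pthread_create(&tid[i], NULL, incrementer, NULL);
        for (i = 0; i < NUMPROCS; i++)
                pthread_join(tid[i], NULL);
        printf("counter = %d (may be less than %d)\n", counter, NUMPROCS);
        return 0;
}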
Listing 12.2: Non-Atomic Increment Spin Output

pan:1: assertion violated
((sum<2)||(counter==2)) (at depth 22)
pan: wrote increment.spin.trail

Table 12.1: Memory Usage of Increment Model
(columns: # incrementers, # states, total memory usage (MB))

12.1.2 How to Use Promela

Given a source file qrcu.spin, one can use the following commands:

If you see a message from ./pan saying: "error: max search depth too small", you need to increase the maximum depth by a -mN option for a complete search. The default is -m10000.
Don't forget to capture the output, especially if you are working on a remote machine.

If your model includes forward-progress checks, you will likely need to enable "weak fairness" via the -f command-line argument to ./pan. If your forward-progress checks involve accept labels, you will also need the -a argument.

spin -t -p qrcu.spin
Given trail file output by a run that encountered an error, output the sequence of steps leading to that error. The -g flag will also include the values of changed global variables, and the -l flag will also include the values of changed local variables.

Listing 12.4: Promela Code for Atomic Increment

 1 proctype incrementer(byte me)
 2 {
 3   int temp;
 4
 5   atomic {
 6     temp = counter;
 7     counter = temp + 1;
 8   }
 9   progress[me] = 1;
10 }

Listing 12.5: Atomic Increment Spin Output

(Spin Version 6.4.8 -- 2 March 2018)
    + Partial Order Reduction

Full statespace search for:
    never claim          - (none specified)
    assertion violations +
    cycle checks         - (disabled by -DSAFETY)
    invalid end states   +

State-vector 48 byte, depth reached 22, errors: 0
       52 states, stored
       21 states, matched
       73 transitions (= stored+matched)
       68 atomic steps
hash conflicts: 0 (resolved)

Stats on memory usage (in Megabytes):
    0.004  equivalent memory usage for states
           (stored*(State-vector + overhead))
    0.290  actual memory usage for states
  128.000  memory used for hash table (-w24)
    0.534  memory used for DFS stack (-m10000)
  128.730  total actual memory usage

1 As of Spin Version 6.4.6 and 6.4.8. In the online manual of Spin dated 10 July 2011, the default for exhaustive search mode is said to be -w19, which does not match the actual behavior.

12.1.2.1 Promela Peculiarities

Although all computer languages have underlying similarities, Promela will provide some surprises to people used to coding in C, C++, or Java.

1. In C, ";" terminates statements. In Promela it separates them. Fortunately, more recent versions of Spin have become much more forgiving of "extra" semicolons.

5. In C, the easiest thing to do is to maintain a loop counter to track progress and terminate the loop. In Promela, loop counters must be avoided like the
plague because they cause the state space to explode. On the other hand, there is no penalty for infinite loops in Promela as long as none of the variables monotonically increase or decrease—Promela will figure out how many passes through the loop really matter, and automatically prune execution beyond that point.

6. In C torture-test code, it is often wise to keep per-task control variables. They are cheap to read, and greatly aid in debugging the test code. In Promela, per-task control variables should be used only when there is no other alternative. To see this, consider a 5-task verification with one bit each to indicate completion. This gives 32 states. In contrast, a simple counter would have only six states, more than a five-fold reduction. That factor of five might not seem like a problem, at least not until you are struggling with a verification program possessing more than 150 million states consuming more than 10 GB of memory!

7. One of the most challenging things both in C torture-test code and in Promela is formulating good assertions. Promela also allows never claims that act like an assertion replicated between every line of code.

8. Dividing and conquering is extremely helpful in Promela in keeping the state space under control. Splitting a large model into two roughly equal halves will result in the state space of each half being roughly the square root of the whole. For example, a million-state combined model might reduce to a pair of thousand-state models. Not only will Promela handle the two smaller models much more quickly with much less memory, but the two smaller algorithms are easier for people to understand.

12.1.2.2 Promela Coding Tricks

Promela was designed to analyze protocols, so using it on parallel programs is a bit abusive. The following tricks can help you to abuse Promela safely:

1. Memory reordering. Suppose you have a pair of statements copying globals x and y to locals r1 and r2, where ordering matters (e.g., unprotected by locks), but where you have no memory barriers. This can be modeled in Promela as follows (a C sketch of the kind of code being modeled appears just after this list):

1 if
2 :: 1 -> r1 = x;
3         r2 = y
4 :: 1 -> r2 = y;
5         r1 = x
6 fi

The two branches of the if statement will be selected nondeterministically, since they both are available. Because the full state space is searched, both choices will eventually be made in all cases.

Of course, this trick will cause your state space to explode if used too heavily. In addition, it requires you to anticipate possible reorderings.

2. State reduction. If you have complex assertions, evaluate them under atomic. After all, they are not part of the algorithm. One example of a complex assertion (to be discussed in more detail later) is as shown in Listing 12.6.

Listing 12.6: Complex Promela Assertion

 1 i = 0;
 2 sum = 0;
 3 do
 4 :: i < N_QRCU_READERS ->
 5   sum = sum + (readerstart[i] == 1 &&
 6     readerprogress[i] == 1);
 7   i++
 8 :: i >= N_QRCU_READERS ->
 9   assert(sum == 0);
10   break
11 od

There is no reason to evaluate this assertion non-atomically, since it is not actually part of the algorithm. Because each statement contributes to state, we can reduce the number of useless states by enclosing it in an atomic block as shown in Listing 12.7.

3. Promela does not provide functions. You must instead use C preprocessor macros. However, you must use them carefully in order to avoid combinatorial explosion.
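As promised above, here is a minimal C sketch, written for this discussion with assumed variable names, of the kind of unordered code that the memory-reordering trick models: two plain loads with no memory barrier between them, which the compiler or CPU may perform in either order. Any correctness argument therefore has to hold for both load orders, which is exactly what the nondeterministic if/fi emulates.

int x;
int y;

void unordered_reader(void)
{
        int r1, r2;

        r1 = x;  /* no barrier here ...                         */
        r2 = y;  /* ... so this load may effectively come first */

        /* A proof must cover both the r1-then-r2 and the
         * r2-then-r1 orderings of these two loads. */
}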
Now we are ready for further examples.

12.1.3 Promela Example: Locking

Since locks are generally useful, spin_lock() and spin_unlock() macros are provided in lock.h, which may be included from multiple Promela models, as shown in Listing 12.8. The spin_lock() macro contains an infinite do-od loop spanning lines 2–11, courtesy of the
Listing 12.10: Output for Spinlock Test

(Spin Version 6.4.8 -- 2 March 2018)
    + Partial Order Reduction

Quick Quiz 12.2: What are some Promela code-style issues with this example?
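The lock.h model referenced above behaves like a simple test-and-set lock: spin_lock() repeatedly attempts to atomically change the mutex from zero to one, and spin_unlock() clears it. For comparison, the following is a minimal C11 sketch of such a lock. This is an illustrative analogue written for this discussion, not the contents of lock.h or of the book's CodeSamples.

#include <stdatomic.h>

typedef struct {
        atomic_flag locked;
} tas_spinlock_t;

#define TAS_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static inline void tas_spin_lock(tas_spinlock_t *lp)
{
        /* Spin until we are the thread that changes the flag from clear to set. */
        while (atomic_flag_test_and_set_explicit(&lp->locked, memory_order_acquire))
                continue;
}

static inline void tas_spin_unlock(tas_spinlock_t *lp)
{
        atomic_flag_clear_explicit(&lp->locked, memory_order_release);
}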
Listing 12.11: QRCU Global Variables

1 #include "lock.h"
2
3 #define N_QRCU_READERS 2
4 #define N_QRCU_UPDATERS 2
5
6 bit idx = 0;
7 byte ctr[2];
8 byte readerprogress[N_QRCU_READERS];
9 bit mutex = 0;

Listing 12.12: QRCU Reader Process

 1 proctype qrcu_reader(byte me)
 2 {
 3   int myidx;
 4
 5   do
 6   :: 1 ->
 7     myidx = idx;
 8     atomic {
 9       if
10       :: ctr[myidx] > 0 ->
11         ctr[myidx]++;
12         break
13       :: else -> skip
14       fi
15     }
16   od;
17   readerprogress[me] = 1;
18   readerprogress[me] = 2;
19   atomic { ctr[myidx]-- }
20 }

Returning to the Promela code for QRCU, the global variables are as shown in Listing 12.11. This example uses locking and includes lock.h. Both the number of readers and writers can be varied using the two #define statements, giving us not one but two ways to create combinatorial explosion. The idx variable controls which of the two elements of the ctr array will be used by readers, and the readerprogress variable allows an assertion to determine when all the readers are finished (since a QRCU update cannot be permitted to complete until all pre-existing readers have completed their QRCU read-side critical sections). The readerprogress array elements have values as follows, indicating the state of the corresponding reader:

0: Not yet started.
1: Within QRCU read-side critical section.
2: Finished with QRCU read-side critical section.

Finally, the mutex variable is used to serialize updaters' slowpaths.

QRCU readers are modeled by the qrcu_reader() process shown in Listing 12.12. A do-od loop spans lines 5–16, with a single guard of "1" on line 6 that makes it an infinite loop. Line 7 captures the current value of the global index, and lines 8–15 atomically increment it (and break from the infinite loop) if its value was non-zero (atomic_inc_not_zero()). Line 17 marks entry into the RCU read-side critical section, and line 18 marks exit from this critical section, both lines for the benefit of the assert() statement that we shall encounter later. Line 19 atomically decrements the same counter that we incremented, thereby exiting the RCU read-side critical section.

Listing 12.13: QRCU Unordered Summation

 1 #define sum_unordered \
 2   atomic { \
 3     do \
 4     :: 1 -> \
 5       sum = ctr[0]; \
 6       i = 1; \
 7       break \
 8     :: 1 -> \
 9       sum = ctr[1]; \
10       i = 0; \
11       break \
12     od; \
13   } \
14   sum = sum + ctr[i]

The C-preprocessor macro shown in Listing 12.13 sums the pair of counters so as to emulate weak memory ordering. Lines 2–13 fetch one of the counters, and line 14 fetches the other of the pair and sums them. The atomic block consists of a single do-od statement. This do-od statement (spanning lines 3–12) is unusual in that it contains two unconditional branches with guards on lines 4 and 8, which causes Promela to non-deterministically choose one of the two (but again, the full state-space search causes Promela to eventually make all possible choices in each applicable situation). The first branch fetches the zero-th counter and sets i to 1 (so that line 14 will fetch the first counter), while the second branch does the opposite, fetching the first counter and setting i to 0 (so that line 14 will fetch the second counter).

Quick Quiz 12.3: Is there a more straightforward way to code the do-od statement?

With the sum_unordered macro in place, we can now proceed to the update-side process shown in Listing 12.14. The update-side process repeats indefinitely, with the corresponding do-od loop ranging over lines 7–57. Each pass through the loop first snapshots the global readerprogress array into the local readerstart array on lines 12–21. This snapshot will be used for the assertion on line 53. Line 23 invokes sum_unordered, and then lines 24–27 re-invoke sum_unordered if the fastpath is potentially usable.
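To connect this model back to C, the following is a rough sketch of the reader-side algorithm being modeled, using hypothetical type and function names rather than any particular kernel patch. The retry loop and the increment-only-if-non-zero step correspond to lines 5–16 of Listing 12.12, and the decrement corresponds to line 19.

struct qrcu_struct {
        int completed;        /* low bit selects the current counter        */
        atomic_t ctr[2];      /* readers present in each grace-period phase */
};

static inline int qrcu_read_lock(struct qrcu_struct *qp)
{
        for (;;) {
                int idx = qp->completed & 0x1;

                /* Increment only if already non-zero; otherwise we raced
                 * with an updater draining this counter, so retry. */
                if (atomic_inc_not_zero(&qp->ctr[idx]))
                        return idx;
        }
}

static inline void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
{
        atomic_dec(&qp->ctr[idx]);  /* exit the read-side critical section */
}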
Table 12.2: Memory Usage of QRCU Model

updaters  readers     # states     depth   memory (MB) [a]
   1         1             376        95         128.7
   1         2           6,177       218         128.9
   1         3          99,728       385         132.6
   2         1          29,399       859         129.8
   2         2       1,071,181     2,352         169.6
   2         3      33,866,736    12,857       1,540.8
   3         1       2,749,453    53,809         236.6
   3         2     186,202,860   328,014      10,483.7

[a] Obtained with the compiler flag -DCOLLAPSE specified.

Listing 12.16: 3 Readers 3 Updaters QRCU Spin Output with -DMA=96

(Spin Version 6.4.6 -- 2 December 2016)
    + Partial Order Reduction
    + Graph Encoding (-DMA=96)

Full statespace search for:
    never claim          - (none specified)
    assertion violations +
    cycle checks         - (disabled by -DSAFETY)
    invalid end states   +

State-vector 96 byte, depth reached 2055621, errors: 0
MA stats: -DMA=84 is sufficient
Minimized Automaton: 56420520 nodes and 1.75128e+08 edges
9.6647071e+09 states, stored
9.7503813e+09 states, matched
1.9415088e+10 transitions (= stored+matched)
7.2047951e+09 atomic steps
[Table comparing -DCOLLAPSE and -DMA=N runs (data not recovered); columns: updaters, readers, # states, depth reached, -wN, memory (MB), runtime (s) for -DCOLLAPSE; N, memory (MB), runtime (s) for -DMA=N.]
1. See whether a smaller number of readers and updaters suffice to prove the general case.

2. Manually construct a proof of correctness.

3. Use a more capable tool.

4. Divide and conquer.

The following sections discuss each of these approaches.

12.1.4.2 How Many Readers and Updaters Are Really Needed?

One approach is to look carefully at the Promela code for qrcu_updater() and notice that the only global state change is happening under the lock. Therefore, only one updater at a time can possibly be modifying state visible to either readers or other updaters. This means that any sequences of state changes can be carried out serially by a single updater due to the fact that Promela does a full state-space search. Therefore, at most two updaters are required: One to change state and a second to become confused.

The situation with the readers is less clear-cut, as each reader does only a single read-side critical section then terminates. It is possible to argue that the useful number of readers is limited, due to the fact that the fastpath must see at most a zero and a one in the counters. This is a fruitful avenue of investigation; in fact, it leads to the full proof of correctness described in the next section.

12.1.4.3 Alternative Approach: Proof of Correctness

An informal proof [McK07c] follows:

1. If synchronize_qrcu() exits too early, then by definition there must have been at least one reader present during synchronize_qrcu()'s full execution.

2. The counter corresponding to this reader will have been at least 1 during this time interval.

3. The synchronize_qrcu() code forces at least one of the counters to be at least 1 at all times.

4. Therefore, at any given point in time, either one of the counters will be at least 2, or both of the counters will be at least one.

5. However, the synchronize_qrcu() fastpath code can read only one of the counters at a given time. It is therefore possible for the fastpath code to fetch the first counter while zero, but to race with a counter flip so that the second counter is seen as one.

6. There can be at most one reader persisting through such a race condition, as otherwise the sum would be two or greater, which would cause the updater to take the slowpath.

7. But if the race occurs on the fastpath's first read of the counters, and then again on its second read, there would have to have been two counter flips.

8. Because a given updater flips the counter only once, and because the update-side lock prevents a pair of updaters from concurrently flipping the counters, the only way that the fastpath code can race with a flip twice is if the first updater completes.

9. But the first updater will not complete until after all pre-existing readers have completed.
10. Therefore, if the fastpath races with a counter flip twice in succession, all pre-existing readers must have completed, so that it is safe to take the fastpath.

Of course, not all parallel algorithms have such simple proofs. In such cases, it may be necessary to enlist more capable tools.

12.1.4.4 Alternative Approach: More Capable Tools

Although Promela and Spin are quite useful, much more capable tools are available, particularly for verifying hardware. This means that if it is possible to translate your algorithm to the hardware-design VHDL language, as it often will be for low-level parallel algorithms, then it is possible to apply these tools to your code (for example, this was done for the first realtime RCU algorithm). However, such tools can be quite expensive.

Although the advent of commodity multiprocessing might eventually result in powerful free-software model-checkers featuring fancy state-space-reduction capabilities, this does not help much in the here and now.

As an aside, there are Spin features that support approximate searches that require fixed amounts of memory; however, I have never been able to bring myself to trust approximations when verifying parallel algorithms.

Another approach might be to divide and conquer.

12.1.4.5 Alternative Approach: Divide and Conquer

It is often possible to break down a larger parallel algorithm into smaller pieces, which can then be proven separately. For example, a 10-billion-state model might be broken into a pair of 100,000-state models. Taking this approach not only makes it easier for tools such as Promela to verify your algorithms, it can also make your algorithms easier to understand.

12.1.4.6 Is QRCU Really Correct?

Is QRCU really correct? We have a Promela-based mechanical proof and a by-hand proof that both say that it is. However, a recent paper by Alglave et al. [AKT13] says otherwise (see Section 5.1 of the paper at the bottom of page 12). Which is it?

It turns out that both are correct! When QRCU was added to a suite of formal-verification benchmarks, its memory barriers were omitted, thus resulting in a buggy version of QRCU. So the real news here is that a number of formal-verification tools incorrectly proved this buggy QRCU correct. And this is why formal-verification tools themselves should be tested using bug-injected versions of the code being verified. If a given tool cannot find the injected bugs, then that tool is clearly untrustworthy.

Quick Quiz 12.7: But different formal-verification tools are often designed to locate particular classes of bugs. For example, very few formal-verification tools will find an error in the specification. So isn't this "clearly untrustworthy" judgment a bit harsh?

Therefore, if you do intend to use QRCU, please take care. Its proofs of correctness might or might not themselves be correct. Which is one reason why formal verification is unlikely to completely replace testing, as Donald Knuth pointed out so long ago.

Quick Quiz 12.8: Given that we have two independent proofs of correctness for the QRCU algorithm described herein, and given that the proof of incorrectness covers what is known to be a different algorithm, why is there any room for doubt?

12.1.5 Promela Parable: dynticks and Preemptible RCU

In early 2008, a preemptible variant of RCU was accepted into mainline Linux in support of real-time workloads, a variant similar to the RCU implementations in the -rt patchset [Mol05] since August 2005. Preemptible RCU is needed for real-time workloads because older RCU implementations disable preemption across RCU read-side critical sections, resulting in excessive real-time latencies.

However, one disadvantage of the older -rt implementation was that each grace period requires work to be done on each CPU, even if that CPU is in a low-power "dynticks-idle" state, and thus incapable of executing RCU read-side critical sections. The idea behind the dynticks-idle state is that idle CPUs should be physically powered down in order to conserve energy. In short, preemptible RCU can disable a valuable energy-conservation feature of recent Linux kernels. Although Josh Triplett and Paul McKenney had discussed some approaches for allowing CPUs to remain in low-power state throughout an RCU grace period (thus preserving the Linux kernel's ability to conserve energy), matters did not come to a head until Steve Rostedt integrated a new dyntick implementation with preemptible RCU in the -rt patchset.

This combination caused one of Steve's systems to hang on boot, so in October, Paul coded up a dynticks-friendly modification to preemptible RCU's grace-period
processing. Steve coded up rcu_irq_enter() and rcu_irq_exit() interfaces called from the irq_enter() and irq_exit() interrupt entry/exit functions. These rcu_irq_enter() and rcu_irq_exit() functions are needed to allow RCU to reliably handle situations where a dynticks-idle CPU is momentarily powered up for an interrupt handler containing RCU read-side critical sections. With these changes in place, Steve's system booted reliably, but Paul continued inspecting the code periodically on the assumption that we could not possibly have gotten the code right on the first try.

Paul reviewed the code repeatedly from October 2007 to February 2008, and almost always found at least one bug. In one case, Paul even coded and tested a fix before realizing that the bug was illusory, and in fact in all cases, the "bug" turned out to be illusory.

Near the end of February, Paul grew tired of this game. He therefore decided to enlist the aid of Promela and Spin. The following presents a series of seven increasingly realistic Promela models, the last of which passes, consuming about 40 GB of main memory for the state space.

More important, Promela and Spin did find a very subtle bug for me!

Quick Quiz 12.9: Yeah, that's just great! Now, just what am I supposed to do if I don't happen to have a machine with 40 GB of main memory???

Still better would be to come up with a simpler and faster algorithm that has a smaller state space. Even better would be an algorithm so simple that its correctness was obvious to the casual observer!

Sections 12.1.5.1–12.1.5.4 give an overview of preemptible RCU's dynticks interface, followed by Section 12.1.6's discussion of the validation of the interface.

12.1.5.1 Introduction to Preemptible RCU and dynticks

The per-CPU dynticks_progress_counter variable is central to the interface between dynticks and preemptible RCU. This variable has an even value whenever the corresponding CPU is in dynticks-idle mode, and an odd value otherwise. A CPU exits dynticks-idle mode for the following three reasons:

1. To start running a task,

2. When entering the outermost of a possibly nested set of interrupt handlers, and

3. When entering an NMI handler.

Preemptible RCU's grace-period machinery samples the value of the dynticks_progress_counter variable in order to determine when a dynticks-idle CPU may safely be ignored.

The following three sections give an overview of the task interface, the interrupt/NMI interface, and the use of the dynticks_progress_counter variable by the grace-period machinery as of Linux kernel v2.6.25-rc4.

12.1.5.2 Task Interface

When a given CPU enters dynticks-idle mode because it has no more tasks to run, it invokes rcu_enter_nohz():

1 static inline void rcu_enter_nohz(void)
2 {
3   mb();
4   __get_cpu_var(dynticks_progress_counter)++;
5   WARN_ON(__get_cpu_var(dynticks_progress_counter) &
6           0x1);
7 }

This function simply increments dynticks_progress_counter and checks that the result is even, but first executes a memory barrier to ensure that any other CPU that sees the new value of dynticks_progress_counter will also see the completion of any prior RCU read-side critical sections.

Similarly, when a CPU that is in dynticks-idle mode prepares to start executing a newly runnable task, it invokes rcu_exit_nohz():

1 static inline void rcu_exit_nohz(void)
2 {
3   __get_cpu_var(dynticks_progress_counter)++;
4   mb();
5   WARN_ON(!(__get_cpu_var(dynticks_progress_counter) &
6             0x1));
7 }

This function again increments dynticks_progress_counter, but follows it with a memory barrier to ensure that if any other CPU sees the result of any subsequent RCU read-side critical section, then that other CPU will also see the incremented value of dynticks_progress_counter. Finally, rcu_exit_nohz() checks that the result of the increment is an odd value.

The rcu_enter_nohz() and rcu_exit_nohz() functions handle the case where a CPU enters and exits dynticks-idle mode due to task execution, but do not handle interrupts, which are covered in the following section.
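To make the intended usage concrete, the following hypothetical caller sketches how this task-level interface is meant to be used: the idle loop brackets its low-power wait with rcu_enter_nohz() and rcu_exit_nohz(). This is an illustrative sketch only; halt_until_interrupt() is an invented placeholder, and the kernel's real idle loop is considerably more involved.

static void cpu_idle_sketch(void)
{
        rcu_enter_nohz();                /* no runnable tasks: counter becomes even */
        while (!need_resched())
                halt_until_interrupt();  /* hypothetical low-power wait */
        rcu_exit_nohz();                 /* about to run a task: counter becomes odd */
        schedule();
}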
12.1.5.3 Interrupt Interface

The rcu_irq_enter() and rcu_irq_exit() functions handle interrupt/NMI entry and exit, respectively. Of
5 3 byte curr;
6 do 4 byte snap;
7 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break; 5
8 :: i < MAX_DYNTICK_LOOP_NOHZ -> 6 atomic {
9 tmp = dynticks_progress_counter; 7 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 atomic { 8 snap = dynticks_progress_counter;
11 dynticks_progress_counter = tmp + 1; 9 }
12 assert((dynticks_progress_counter & 1) == 1); 10 do
13 } 11 :: 1 ->
14 tmp = dynticks_progress_counter; 12 atomic {
15 atomic { 13 curr = dynticks_progress_counter;
16 dynticks_progress_counter = tmp + 1; 14 if
17 assert((dynticks_progress_counter & 1) == 0); 15 :: (curr == snap) && ((curr & 1) == 0) ->
18 } 16 break;
19 i++; 17 :: (curr - snap) > 2 || (snap & 1) == 0 ->
20 od; 18 break;
21 } 19 :: 1 -> skip;
20 fi;
21 }
Lines 6 and 20 define a loop. Line 7 exits the loop 22 od;
once the loop counter i has exceeded the limit MAX_ 23 snap = dynticks_progress_counter;
24 do
DYNTICK_LOOP_NOHZ. Line 8 tells the loop construct to 25 :: 1 ->
execute lines 9–19 for each pass through the loop. Be- 26 atomic {
27 curr = dynticks_progress_counter;
cause the conditionals on lines 7 and 8 are exclusive of 28 if
each other, the normal Promela random selection of true 29 :: (curr == snap) && ((curr & 1) == 0) ->
30 break;
conditions is disabled. Lines 9 and 11 model rcu_ 31 :: (curr != snap) ->
exit_nohz()’s non-atomic increment of dynticks_ 32 break;
33 :: 1 -> skip;
progress_counter, while line 12 models the WARN_ 34 fi;
ON(). The atomic construct simply reduces the Promela 35 }
36 od;
state space, given that the WARN_ON() is not strictly speak- 37 }
ing part of the algorithm. Lines 14–18 similarly model
the increment and WARN_ON() for rcu_enter_nohz().
Finally, line 19 increments the loop counter. Lines 6–9 print out the loop limit (but only into the
Each pass through the loop therefore models a CPU ex- .trail file in case of error) and models a line of code from
iting dynticks-idle mode (for example, starting to execute rcu_try_flip_idle() and its call to dyntick_save_
a task), then re-entering dynticks-idle mode (for example, progress_counter(), which takes a snapshot of the
that same task blocking). current CPU’s dynticks_progress_counter variable.
Quick Quiz 12.13: Why isn’t the memory barrier in rcu_ These two lines are executed atomically to reduce state
exit_nohz() and rcu_enter_nohz() modeled in Promela? space.
Lines 10–22 model the relevant code in rcu_
Quick Quiz 12.14: Isn’t it a bit strange to model rcu_exit_ try_flip_waitack() and its call to rcu_try_flip_
nohz() followed by rcu_enter_nohz()? Wouldn’t it be waitack_needed(). This loop is modeling the grace-
more natural to instead model entry before exit? period state machine waiting for a counter-flip acknowl-
edgement from each CPU, but only that part that interacts
The next step is to model the interface to with dynticks-idle CPUs.
RCU’s grace-period processing. For this, we Line 23 models a line from rcu_try_flip_
need to model dyntick_save_progress_counter(), waitzero() and its call to dyntick_save_progress_
rcu_try_flip_waitack_needed(), rcu_try_flip_ counter(), again taking a snapshot of the CPU’s
waitmb_needed(), as well as portions of rcu_try_ dynticks_progress_counter variable.
flip_waitack() and rcu_try_flip_waitmb(), all
from the 2.6.25-rc4 kernel. The following grace_ Finally, lines 24–36 model the relevant code in rcu_
period() Promela process models these functions as try_flip_waitack() and its call to rcu_try_flip_
they would be invoked during a single pass through pre- waitack_needed(). This loop is modeling the grace-
emptible RCU’s grace-period processing. period state-machine waiting for each CPU to execute a
memory barrier, but again only that part that interacts
1 proctype grace_period()
2 {
with dynticks-idle CPUs.
Quick Quiz 12.15: Wait a minute! In the Linux kernel, 36 :: (curr == snap) && ((curr & 1) == 0) ->
37 break;
both dynticks_progress_counter and rcu_dyntick_ 38 :: (curr != snap) ->
snapshot are per-CPU variables. So why are they instead 39 break;
being modeled as single global variables? 40 :: 1 -> skip;
41 fi;
42 }
The resulting model (dyntickRCU-base.spin), 43 od;
when run with the runspin.sh script, generates 691 44 grace_period_state = GP_DONE;
45 }
states and passes without errors, which is not at all sur-
prising given that it completely lacks the assertions that Lines 6, 10, 25, 26, 29, and 44 update this variable (com-
could find failures. The next section therefore adds safety bining atomically with algorithmic operations where fea-
assertions. sible) to allow the dyntick_nohz() process to verify the
basic RCU safety property. The form of this verification
12.1.6.2 Validating Safety is to assert that the value of the grace_period_state
variable cannot jump from GP_IDLE to GP_DONE during
A safe RCU implementation must never permit a grace a time period over which RCU readers could plausibly
period to complete before the completion of any RCU persist.
readers that started before the start of the grace period.
This is modeled by a grace_period_state variable that Quick Quiz 12.16: Given there are a pair of back-to-back
changes to grace_period_state on lines 25 and 26, how
can take on three states as follows:
can we be sure that line 25’s changes won’t be lost?
1 #define GP_IDLE 0
2 #define GP_WAITING 1 The dyntick_nohz() Promela process implements
3 #define GP_DONE 2
4 byte grace_period_state = GP_DONE; this verification as shown below:
1 proctype dyntick_nohz()
The grace_period() process sets this variable as it 2 {
3 byte tmp;
progresses through the grace-period phases, as shown 4 byte i = 0;
below: 5 bit old_gp_idle;
6
7 do
1 proctype grace_period()
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
2 {
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
3 byte curr;
10 tmp = dynticks_progress_counter;
4 byte snap;
11 atomic {
5
12 dynticks_progress_counter = tmp + 1;
6 grace_period_state = GP_IDLE;
13 old_gp_idle = (grace_period_state == GP_IDLE);
7 atomic {
14 assert((dynticks_progress_counter & 1) == 1);
8 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
15 }
9 snap = dynticks_progress_counter;
16 atomic {
10 grace_period_state = GP_WAITING;
17 tmp = dynticks_progress_counter;
11 }
18 assert(!old_gp_idle ||
12 do
19 grace_period_state != GP_DONE);
13 :: 1 ->
20 }
14 atomic {
21 atomic {
15 curr = dynticks_progress_counter;
22 dynticks_progress_counter = tmp + 1;
16 if
23 assert((dynticks_progress_counter & 1) == 0);
17 :: (curr == snap) && ((curr & 1) == 0) ->
24 }
18 break;
25 i++;
19 :: (curr - snap) > 2 || (snap & 1) == 0 ->
26 od;
20 break;
27 }
21 :: 1 -> skip;
22 fi;
23 } Line 13 sets a new old_gp_idle flag if the value of
24 od;
25 grace_period_state = GP_DONE;
the grace_period_state variable is GP_IDLE at the
26 grace_period_state = GP_IDLE; beginning of task execution, and the assertion at lines 18
27 atomic {
28 snap = dynticks_progress_counter;
and 19 fire if the grace_period_state variable has
29 grace_period_state = GP_WAITING; advanced to GP_DONE during task execution, which would
30 }
31 do
be illegal given that a single RCU read-side critical section
32 :: 1 -> could span the entire intervening time period.
33 atomic {
34 curr = dynticks_progress_counter;
The resulting model (dyntickRCU-base-s.spin),
35 if when run with the runspin.sh script, generates 964
states and passes without errors, which is reassuring. That 23 :: (curr - snap) > 2 || (snap & 1) == 0 ->
24 break;
said, although safety is critically important, it is also quite 25 :: else -> skip;
important to avoid indefinitely stalling grace periods. The 26 fi;
27 }
next section therefore covers verifying liveness. 28 od;
29 grace_period_state = GP_DONE;
30 grace_period_state = GP_IDLE;
12.1.6.3 Validating Liveness 31 atomic {
32 shouldexit = 0;
Although liveness can be difficult to prove, there is a 33 snap = dynticks_progress_counter;
34 grace_period_state = GP_WAITING;
simple trick that applies here. The first step is to make 35 }
dyntick_nohz() indicate that it is done via a dyntick_ 36 do
37 :: 1 ->
nohz_done variable, as shown on line 27 of the following: 38 atomic {
39 assert(!shouldexit);
1 proctype dyntick_nohz() 40 shouldexit = dyntick_nohz_done;
2 { 41 curr = dynticks_progress_counter;
3 byte tmp; 42 if
4 byte i = 0; 43 :: (curr == snap) && ((curr & 1) == 0) ->
5 bit old_gp_idle; 44 break;
6 45 :: (curr != snap) ->
7 do 46 break;
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break; 47 :: else -> skip;
9 :: i < MAX_DYNTICK_LOOP_NOHZ -> 48 fi;
10 tmp = dynticks_progress_counter; 49 }
11 atomic { 50 od;
12 dynticks_progress_counter = tmp + 1; 51 grace_period_state = GP_DONE;
13 old_gp_idle = (grace_period_state == GP_IDLE); 52 }
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic { We have added the shouldexit variable on line 5,
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle || which we initialize to zero on line 10. Line 17 as-
19 grace_period_state != GP_DONE); serts that shouldexit is not set, while line 18 sets
20 }
21 atomic { shouldexit to the dyntick_nohz_done variable main-
22 dynticks_progress_counter = tmp + 1; tained by dyntick_nohz(). This assertion will there-
23 assert((dynticks_progress_counter & 1) == 0);
24 } fore trigger if we attempt to take more than one pass
25 i++; through the wait-for-counter-flip-acknowledgement loop
26 od;
27 dyntick_nohz_done = 1; after dyntick_nohz() has completed execution. After
28 } all, if dyntick_nohz() is done, then there cannot be any
more state changes to force us out of the loop, so going
With this variable in place, we can add assertions to through twice in this state means an infinite loop, which
grace_period() to check for unnecessary blockage as in turn means no end to the grace period.
follows:
Lines 32, 39, and 40 operate in a similar manner for the
1 proctype grace_period() second (memory-barrier) loop.
2 {
3 byte curr;
However, running this model (dyntickRCU-base-
4 byte snap; sl-busted.spin) results in failure, as line 23 is check-
5 bit shouldexit;
6
ing that the wrong variable is even. Upon failure,
7 grace_period_state = GP_IDLE; spin writes out a “trail” file (dyntickRCU-base-sl-
8 atomic {
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
busted.spin.trail), which records the sequence of
10 shouldexit = 0; states that lead to the failure. Use the “spin -t -p -g
11 snap = dynticks_progress_counter;
12 grace_period_state = GP_WAITING;
-l dyntickRCU-base-sl-busted.spin” command
13 } to cause spin to retrace this sequence of states, print-
14 do
15 :: 1 ->
ing the statements executed and the values of vari-
16 atomic { ables (dyntickRCU-base-sl-busted.spin.trail.
17 assert(!shouldexit);
18 shouldexit = dyntick_nohz_done;
txt). Note that the line numbers do not match the listing
19 curr = dynticks_progress_counter; above due to the fact that spin takes both functions in a
20 if
21 :: (curr == snap) && ((curr & 1) == 0) ->
single file. However, the line numbers do match the full
22 break; model (dyntickRCU-base-sl-busted.spin).
We see that the dyntick_nohz() process completed at Lines 10–13 can now be combined and simplified,
step 34 (search for “34:”), but that the grace_period() resulting in the following. A similar simplification can be
process nonetheless failed to exit the loop. The value of applied to rcu_try_flip_waitmb_needed().
curr is 6 (see step 35) and that the value of snap is 5 (see
1 static inline int
step 17). Therefore the first condition on line 21 above 2 rcu_try_flip_waitack_needed(int cpu)
does not hold because “curr != snap”, and the second 3 {
4 long curr;
condition on line 23 does not hold either because snap is 5 long snap;
odd and because curr is only one greater than snap. 6
7 curr = per_cpu(dynticks_progress_counter, cpu);
So one of these two conditions has to be incorrect. Refer- 8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
ring to the comment block in rcu_try_flip_waitack_ 9 smp_mb();
10 if ((curr - snap) >= 2 || (curr & 0x1) == 0)
needed() for the first condition: 11 return 0;
12 return 1;
If the CPU remained in dynticks mode for the 13 }
entire time and didn’t take any interrupts, NMIs,
SMIs, or whatever, then it cannot be in the Making the corresponding correction in the model
middle of an rcu_read_lock(), so the next (dyntickRCU-base-sl.spin) results in a correct verifi-
rcu_read_lock() it executes must use the cation with 661 states that passes without errors. However,
new value of the counter. So we can safely it is worth noting that the first version of the liveness verifi-
pretend that this CPU already acknowledged the cation failed to catch this bug, due to a bug in the liveness
counter. verification itself. This liveness-verification bug was lo-
cated by inserting an infinite loop in the grace_period()
The first condition does match this, because if process, and noting that the liveness-verification code
“curr == snap” and if curr is even, then the corre- failed to detect this problem!
sponding CPU has been in dynticks-idle mode the entire
We have now successfully verified both safety and
time, as required. So let’s look at the comment block for
liveness conditions, but only for processes running and
the second condition:
blocking. We also need to handle interrupts, a task taken
If the CPU passed through or entered a dynticks up in the next section.
idle phase with no active irq handlers, then,
as above, we can safely pretend that this CPU 12.1.6.4 Interrupts
already acknowledged the counter.
There are a couple of ways to model interrupts in Promela:
The first part of the condition is correct, because if
curr and snap differ by two, there will be at least one 1. Using C-preprocessor tricks to insert the interrupt
even number in between, corresponding to having passed handler between each and every statement of the
completely through a dynticks-idle phase. However, the dynticks_nohz() process, or
second part of the condition corresponds to having started
in dynticks-idle mode, not having finished in this mode. 2. Modeling the interrupt handler with a separate
We therefore need to be testing curr rather than snap for process.
being an even number.
A bit of thought indicated that the second approach
The corrected C code is as follows:
would have a smaller state space, though it requires that
1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
the interrupt handler somehow run atomically with respect
3 { to the dynticks_nohz() process, but not with respect
4 long curr;
5 long snap;
to the grace_period() process.
6 Fortunately, it turns out that Promela permits you
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
to branch out of atomic statements. This trick allows
9 smp_mb(); us to have the interrupt handler set a flag, and recode
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
dynticks_nohz() to atomically check this flag and ex-
12 if ((curr - snap) > 2 || (curr & 0x1) == 0) ecute only when the flag is not set. This can be accom-
13 return 0;
14 return 1;
plished with a C-preprocessor macro that takes a label
15 } and a Promela statement as follows:
1 #define EXECUTE_MAINLINE(label, stmt) \ Quick Quiz 12.18: But what if the dynticks_nohz()
2 label: skip; \
3 atomic { \
process had “if” or “do” statements with conditions, where
4 if \ the statement bodies of these constructs needed to execute
5 :: in_dyntick_irq -> goto label; \ non-atomically?
6 :: else -> stmt; \
7 fi; \
8 } The next step is to write a dyntick_irq() process to
model an interrupt handler:
One might use this macro as follows: 1 proctype dyntick_irq()
2 {
EXECUTE_MAINLINE(stmt1, 3 byte tmp;
tmp = dynticks_progress_counter) 4 byte i = 0;
5 bit old_gp_idle;
6
7 do
Line 2 of the macro creates the specified statement label. 8 :: i >= MAX_DYNTICK_LOOP_IRQ -> break;
9 :: i < MAX_DYNTICK_LOOP_IRQ ->
Lines 3–8 are an atomic block that tests the in_dyntick_ 10 in_dyntick_irq = 1;
irq variable, and if this variable is set (indicating that the 11 if
12 :: rcu_update_flag > 0 ->
interrupt handler is active), branches out of the atomic 13 tmp = rcu_update_flag;
block back to the label. Otherwise, line 6 executes the 14 rcu_update_flag = tmp + 1;
15 :: else -> skip;
specified statement. The overall effect is that mainline 16 fi;
execution stalls any time an interrupt is active, as required. 17 if
18 :: !in_interrupt &&
19 (dynticks_progress_counter & 1) == 0 ->
20 tmp = dynticks_progress_counter;
12.1.6.5 Validating Interrupt Handlers 21 dynticks_progress_counter = tmp + 1;
22 tmp = rcu_update_flag;
The first step is to convert dyntick_nohz() to EXECUTE_ 23 rcu_update_flag = tmp + 1;
24 :: else -> skip;
MAINLINE() form, as follows: 25 fi;
26 tmp = in_interrupt;
1 proctype dyntick_nohz() 27 in_interrupt = tmp + 1;
2 { 28 old_gp_idle = (grace_period_state == GP_IDLE);
3 byte tmp; 29 assert(!old_gp_idle ||
4 byte i = 0; 30 grace_period_state != GP_DONE);
5 bit old_gp_idle; 31 tmp = in_interrupt;
6 32 in_interrupt = tmp - 1;
7 do 33 if
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break; 34 :: rcu_update_flag != 0 ->
9 :: i < MAX_DYNTICK_LOOP_NOHZ -> 35 tmp = rcu_update_flag;
10 EXECUTE_MAINLINE(stmt1, 36 rcu_update_flag = tmp - 1;
11 tmp = dynticks_progress_counter) 37 if
12 EXECUTE_MAINLINE(stmt2, 38 :: rcu_update_flag == 0 ->
13 dynticks_progress_counter = tmp + 1; 39 tmp = dynticks_progress_counter;
14 old_gp_idle = (grace_period_state == GP_IDLE); 40 dynticks_progress_counter = tmp + 1;
15 assert((dynticks_progress_counter & 1) == 1)) 41 :: else -> skip;
16 EXECUTE_MAINLINE(stmt3, 42 fi;
17 tmp = dynticks_progress_counter; 43 :: else -> skip;
18 assert(!old_gp_idle || 44 fi;
19 grace_period_state != GP_DONE)) 45 atomic {
20 EXECUTE_MAINLINE(stmt4, 46 in_dyntick_irq = 0;
21 dynticks_progress_counter = tmp + 1; 47 i++;
22 assert((dynticks_progress_counter & 1) == 0)) 48 }
23 i++; 49 od;
24 od; 50 dyntick_irq_done = 1;
25 dyntick_nohz_done = 1; 51 }
26 }
The loop from lines 7–49 models up to MAX_DYNTICK_
It is important to note that when a group of statements LOOP_IRQ interrupts, with lines 8 and 9 forming the loop
is passed to EXECUTE_MAINLINE(), as in lines 12–15, all condition and line 47 incrementing the control variable.
statements in that group execute atomically. Line 10 tells dyntick_nohz() that an interrupt handler
Quick Quiz 12.17: But what would you do if you needed is running, and line 46 tells dyntick_nohz() that this
the statements in a single EXECUTE_MAINLINE() group to handler has completed. Line 50 is used for liveness
execute non-atomically? verification, just like the corresponding line of dyntick_
nohz().
Quick Quiz 12.19: Why are lines 46 and 47 (the The implementation of grace_period() is very simi-
“in_dyntick_irq = 0;” and the “i++;”) executed atom- lar to the earlier one. The only changes are the addition of
ically? line 10 to add the new interrupt-count parameter, changes
to lines 19 and 39 to add the new dyntick_irq_done
Lines 11–25 model rcu_irq_enter(), and lines 26 variable to the liveness checks, and of course the optimiza-
and 27 model the relevant snippet of __irq_enter(). tions on lines 22 and 42.
Lines 28–30 verify safety in much the same manner as do This model (dyntickRCU-irqnn-ssl.spin) results
the corresponding lines of dynticks_nohz(). Lines 31 in a correct verification with roughly half a million states,
and 32 model the relevant snippet of __irq_exit(), and passing without errors. However, this version of the model
finally lines 33–44 model rcu_irq_exit(). does not handle nested interrupts. This topic is taken up
in the next section.
Quick Quiz 12.20: What property of interrupts is this
dynticks_irq() process unable to model?
12.1.6.6 Validating Nested Interrupt Handlers
The grace_period() process then becomes as fol- Nested interrupt handlers may be modeled by splitting the
lows: body of the loop in dyntick_irq() as follows:
1 proctype dyntick_irq()
1 proctype grace_period() 2 {
2 { 3 byte tmp;
3 byte curr; 4 byte i = 0;
4 byte snap; 5 byte j = 0;
5 bit shouldexit; 6 bit old_gp_idle;
6 7 bit outermost;
7 grace_period_state = GP_IDLE; 8
8 atomic { 9 do
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ); 10 :: i >= MAX_DYNTICK_LOOP_IRQ &&
10 printf("MDLI = %d\n", MAX_DYNTICK_LOOP_IRQ); 11 j >= MAX_DYNTICK_LOOP_IRQ -> break;
11 shouldexit = 0; 12 :: i < MAX_DYNTICK_LOOP_IRQ ->
12 snap = dynticks_progress_counter; 13 atomic {
13 grace_period_state = GP_WAITING; 14 outermost = (in_dyntick_irq == 0);
14 } 15 in_dyntick_irq = 1;
15 do 16 }
16 :: 1 -> 17 if
17 atomic { 18 :: rcu_update_flag > 0 ->
18 assert(!shouldexit); 19 tmp = rcu_update_flag;
19 shouldexit = dyntick_nohz_done && dyntick_irq_done; 20 rcu_update_flag = tmp + 1;
20 curr = dynticks_progress_counter; 21 :: else -> skip;
21 if 22 fi;
22 :: (curr - snap) >= 2 || (curr & 1) == 0 -> 23 if
23 break; 24 :: !in_interrupt &&
24 :: else -> skip; 25 (dynticks_progress_counter & 1) == 0 ->
25 fi; 26 tmp = dynticks_progress_counter;
26 } 27 dynticks_progress_counter = tmp + 1;
27 od; 28 tmp = rcu_update_flag;
28 grace_period_state = GP_DONE; 29 rcu_update_flag = tmp + 1;
29 grace_period_state = GP_IDLE; 30 :: else -> skip;
30 atomic { 31 fi;
31 shouldexit = 0; 32 tmp = in_interrupt;
32 snap = dynticks_progress_counter; 33 in_interrupt = tmp + 1;
33 grace_period_state = GP_WAITING; 34 atomic {
34 } 35 if
35 do 36 :: outermost ->
36 :: 1 -> 37 old_gp_idle = (grace_period_state == GP_IDLE);
37 atomic { 38 :: else -> skip;
38 assert(!shouldexit); 39 fi;
39 shouldexit = dyntick_nohz_done && dyntick_irq_done; 40 }
40 curr = dynticks_progress_counter; 41 i++;
41 if 42 :: j < i ->
42 :: (curr != snap) || ((curr & 1) == 0) -> 43 atomic {
43 break; 44 if
44 :: else -> skip; 45 :: j + 1 == i ->
45 fi; 46 assert(!old_gp_idle ||
46 } 47 grace_period_state != GP_DONE);
47 od; 48 :: else -> skip;
48 grace_period_state = GP_DONE; 49 fi;
49 } 50 }
19 assert(!shouldexit);
20 shouldexit = dyntick_nohz_done && static inline void rcu_enter_nohz(void)
21 dyntick_irq_done && {
22 dyntick_nmi_done; + mb();
23 curr = dynticks_progress_counter; __get_cpu_var(dynticks_progress_counter)++;
24 if - mb();
25 :: (curr - snap) >= 2 || (curr & 1) == 0 -> }
26 break;
27 :: else -> skip; static inline void rcu_exit_nohz(void)
28 fi; {
29 } - mb();
30 od; __get_cpu_var(dynticks_progress_counter)++;
31 grace_period_state = GP_DONE; + mb();
32 grace_period_state = GP_IDLE; }
33 atomic {
34 shouldexit = 0;
35 snap = dynticks_progress_counter;
36 grace_period_state = GP_WAITING; 3. Validate your code early, often, and up to the
37 } point of destruction. This effort located one sub-
38 do
39 :: 1 -> tle bug in rcu_try_flip_waitack_needed() that
40 atomic { would have been quite difficult to test or debug, as
41 assert(!shouldexit);
42 shouldexit = dyntick_nohz_done && shown by the following patch [McK08d].
43 dyntick_irq_done &&
44 dyntick_nmi_done; - if ((curr - snap) > 2 || (snap & 0x1) == 0)
45 curr = dynticks_progress_counter; + if ((curr - snap) > 2 || (curr & 0x1) == 0)
46 if
47 :: (curr != snap) || ((curr & 1) == 0) ->
48 break;
49 :: else -> skip; 4. Always verify your verification code. The usual
50 fi; way to do this is to insert a deliberate bug and verify
51 }
52 od; that the verification code catches it. Of course, if
53 grace_period_state = GP_DONE; the verification code fails to catch this bug, you
54 }
may also need to verify the bug itself, and so on,
recursing infinitely. However, if you find yourself
We have added the printf() for the new MAX_ in this position, getting a good night’s sleep can be
DYNTICK_LOOP_NMI parameter on line 11 and added an extremely effective debugging technique. You
dyntick_nmi_done to the shouldexit assignments on will then see that the obvious verify-the-verification
lines 22 and 44. technique is to deliberately insert bugs in the code
The model (dyntickRCU-irq-nmi-ssl.spin) re- being verified. If the verification fails to find them,
sults in a correct verification with several hundred million the verification clearly is buggy.
states, passing without errors.
5. Use of atomic instructions can simplify verifica-
Quick Quiz 12.21: Does Paul always write his code in this tion. Unfortunately, use of the cmpxchg atomic
painfully incremental manner? instruction would also slow down the critical IRQ
fastpath, so they are not appropriate in this case.
Listing 12.17: Variables for Simple Dynticks Interface

 1 struct rcu_dynticks {
 2   int dynticks_nesting;
 3   int dynticks;
 4   int dynticks_nmi;
 5 };
 6
 7 struct rcu_data {
 8   ...
 9   int dynticks_snap;
10   int dynticks_nmi_snap;
11   ...
12 };

variables for IRQs and NMIs, as has been done for hierarchical RCU [McK08b] as indirectly suggested by Manfred Spraul [Spr08]. This work was pulled into the mainline kernel during the v2.6.29 development cycle [McK08f].

12.1.6.10 State Variables for Simplified Dynticks Interface

Listing 12.17 shows the new per-CPU state variables. These variables are grouped into structs to allow multiple independent RCU implementations (e.g., rcu and rcu_bh) to conveniently and efficiently share dynticks state. In what follows, they can be thought of as independent per-CPU variables.

The dynticks_nesting, dynticks, and dynticks_snap variables are for the IRQ code paths, and the dynticks_nmi and dynticks_nmi_snap variables are for the NMI code paths, although the NMI code path will also reference (but not modify) the dynticks_nesting variable. These variables are used as follows:

dynticks_nesting
This counts the number of reasons that the corresponding CPU should be monitored for RCU read-side critical sections. If the CPU is in dynticks-idle mode, then this counts the IRQ nesting level, otherwise it is one greater than the IRQ nesting level.

dynticks
This counter's value is even if the corresponding CPU is in dynticks-idle mode and there are no IRQ handlers currently running on that CPU, otherwise the counter's value is odd. In other words, if this counter's value is odd, then the corresponding CPU might be in an RCU read-side critical section.

dynticks_nmi
This counter's value is odd if the corresponding CPU is in an NMI handler, but only if the NMI arrived while this CPU was in dyntick-idle mode with no IRQ handlers running. Otherwise, the counter's value will be even.

dynticks_snap
This will be a snapshot of the dynticks counter, but only if the current RCU grace period has extended for too long a duration.

dynticks_nmi_snap
This will be a snapshot of the dynticks_nmi counter, but again only if the current RCU grace period has extended for too long a duration.

If both dynticks and dynticks_nmi have taken on an even value during a given time interval, then the corresponding CPU has passed through a quiescent state during that interval.

Quick Quiz 12.22: But what happens if an NMI handler starts running before an IRQ handler completes, and if that NMI handler continues running until a second IRQ handler starts?

12.1.6.11 Entering and Leaving Dynticks-Idle Mode

Listing 12.18 shows the rcu_enter_nohz() and rcu_exit_nohz() functions, which enter and exit dynticks-idle mode, also known as "nohz" mode. These two functions are

Listing 12.18: Entering and Exiting Dynticks-Idle Mode

 1 void rcu_enter_nohz(void)
 2 {
 3   unsigned long flags;
 4   struct rcu_dynticks *rdtp;
 5
 6   smp_mb();
 7   local_irq_save(flags);
 8   rdtp = &__get_cpu_var(rcu_dynticks);
 9   rdtp->dynticks++;
10   rdtp->dynticks_nesting--;
11   WARN_ON(rdtp->dynticks & 0x1);
12   local_irq_restore(flags);
13 }
14
15 void rcu_exit_nohz(void)
16 {
17   unsigned long flags;
18   struct rcu_dynticks *rdtp;
19
20   local_irq_save(flags);
21   rdtp = &__get_cpu_var(rcu_dynticks);
22   rdtp->dynticks++;
23   rdtp->dynticks_nesting++;
24   WARN_ON(!(rdtp->dynticks & 0x1));
25   local_irq_restore(flags);
26   smp_mb();
27 }
Listing 12.19: NMIs From Dynticks-Idle Mode

 1 void rcu_nmi_enter(void)
 2 {
 3   struct rcu_dynticks *rdtp;
 4
 5   rdtp = &__get_cpu_var(rcu_dynticks);
 6   if (rdtp->dynticks & 0x1)
 7     return;
 8   rdtp->dynticks_nmi++;
 9   WARN_ON(!(rdtp->dynticks_nmi & 0x1));
10   smp_mb();
11 }
12
13 void rcu_nmi_exit(void)
14 {
15   struct rcu_dynticks *rdtp;
16
17   rdtp = &__get_cpu_var(rcu_dynticks);
18   if (rdtp->dynticks & 0x1)
19     return;
20   smp_mb();
21   rdtp->dynticks_nmi++;
22   WARN_ON(rdtp->dynticks_nmi & 0x1);
23 }

Listing 12.20: Interrupts From Dynticks-Idle Mode

 1 void rcu_irq_enter(void)
 2 {
 3   struct rcu_dynticks *rdtp;
 4
 5   rdtp = &__get_cpu_var(rcu_dynticks);
 6   if (rdtp->dynticks_nesting++)
 7     return;
 8   rdtp->dynticks++;
 9   WARN_ON(!(rdtp->dynticks & 0x1));
10   smp_mb();
11 }
12
13 void rcu_irq_exit(void)
14 {
15   struct rcu_dynticks *rdtp;
16
17   rdtp = &__get_cpu_var(rcu_dynticks);
18   if (--rdtp->dynticks_nesting)
19     return;
20   smp_mb();
21   rdtp->dynticks++;
22   WARN_ON(rdtp->dynticks & 0x1);
23   if (__get_cpu_var(rcu_data).nxtlist ||
24       __get_cpu_var(rcu_bh_data).nxtlist)
25     set_need_resched();
26 }
Listing 12.21: Saving Dyntick Progress Counters

 1 static int
 2 dyntick_save_progress_counter(struct rcu_data *rdp)
 3 {
 4   int ret;
 5   int snap;
 6   int snap_nmi;
 7
 8   snap = rdp->dynticks->dynticks;
 9   snap_nmi = rdp->dynticks->dynticks_nmi;
10   smp_mb();
11   rdp->dynticks_snap = snap;
12   rdp->dynticks_nmi_snap = snap_nmi;
13   ret = ((snap & 0x1) == 0) && ((snap_nmi & 0x1) == 0);
14   if (ret)
15     rdp->dynticks_fqs++;
16   return ret;
17 }

Listing 12.22: Checking Dyntick Progress Counters

 1 static int
 2 rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 3 {
 4   long curr;
 5   long curr_nmi;
 6   long snap;
 7   long snap_nmi;
 8
 9   curr = rdp->dynticks->dynticks;
10   snap = rdp->dynticks_snap;
11   curr_nmi = rdp->dynticks->dynticks_nmi;
12   snap_nmi = rdp->dynticks_nmi_snap;
13   smp_mb();
14   if ((curr != snap || (curr & 0x1) == 0) &&
15       (curr_nmi != snap_nmi || (curr_nmi & 0x1) == 0)) {
16     rdp->dynticks_fqs++;
17     return 1;
18   }
19   return rcu_implicit_offline_qs(rdp);
20 }
sometimes be a curse. For example, Promela does not understand memory models or any sort of reordering semantics. This section therefore describes some state-space search tools that understand memory models used by production systems, greatly simplifying the verification of weakly ordered code.

For example, Section 12.1.4 showed how to convince Promela to account for weak memory ordering. Although this approach can work well, it requires that the developer fully understand the system's memory model. Unfortunately, few (if any) developers fully understand the complex memory models of modern CPUs.

Therefore, another approach is to use a tool that already

Listing 12.23: PPCMEM Litmus Test

 1 PPC SB+lwsync-RMW-lwsync+isync-simple
 2 ""
 3 {
 4 0:r2=x; 0:r3=2; 0:r4=y; 0:r10=0; 0:r11=0; 0:r12=z;
 5 1:r2=y; 1:r4=x;
 6 }
 7  P0                 | P1           ;
 8  li r1,1            | li r1,1      ;
 9  stw r1,0(r2)       | stw r1,0(r2) ;
10  lwsync             | sync         ;
11                     | lwz r3,0(r4) ;
12  lwarx r11,r10,r12  |              ;
13  stwcx. r11,r10,r12 |              ;
14  bne Fail1          |              ;
15  isync              |              ;
16  lwz r3,0(r4)       |              ;
17  Fail1:             |              ;
18
Listing 12.24: Meaning of PPCMEM Litmus Test
1 void P0(void)
2 {
3   int r3;
4
5   x = 1; /* Lines 8 and 9 */
6   atomic_add_return(&z, 0); /* Lines 10-15 */
7   r3 = y; /* Line 16 */
8 }
9
10 void P1(void)
11 {
12   int r3;
13
14   y = 1; /* Lines 8-9 */
15   smp_mb(); /* Line 10 */
16   r3 = x; /* Line 11 */
17 }

Listing 12.25: PPCMEM Detects an Error
./ppcmem -model lwsync_read_block \
         -model coherence_points filename.litmus
...
States 6
0:r3=0; 1:r3=0;
0:r3=0; 1:r3=1;
0:r3=1; 1:r3=0;
0:r3=1; 1:r3=1;
0:r3=2; 1:r3=0;
0:r3=2; 1:r3=1;
Ok
Condition exists (0:r3=0 /\ 1:r3=0)
Hash=e2240ce2072a2610c034ccd4fc964e77
Observation SB+lwsync-RMW-lwsync+isync Sometimes 1

Listing 12.26: PPCMEM on Repaired Litmus Test
./ppcmem -model lwsync_read_block \
         -model coherence_points filename.litmus
...
States 5
0:r3=0; 1:r3=1;
0:r3=1; 1:r3=0;
0:r3=1; 1:r3=1;
0:r3=2; 1:r3=0;
0:r3=2; 1:r3=1;
No (allowed not found)
Condition exists (0:r3=0 /\ 1:r3=0)
Hash=77dd723cda9981248ea4459fcdf6097d
Observation SB+lwsync-RMW-lwsync+sync Never 0 5

“Select POWER Test” buttons at https://www.cl.cam.ac.uk/~pes20/ppcmem/). It is quite likely that one of these pre-existing litmus tests will answer your Power or Arm memory-ordering question.

12.2.2 What Does This Litmus Test Mean?
“Never”. Therefore, the model predicts that the offending execution sequence cannot happen.

Quick Quiz 12.27: Does the Arm Linux kernel have a similar bug?

Quick Quiz 12.28: Does the lwsync on line 10 in Listing 12.23 provide sufficient ordering?

12.2.4 PPCMEM Discussion

These tools promise to be of great help to people working on low-level parallel primitives that run on Arm and on Power. These tools do have some intrinsic limitations:

1. These tools are research prototypes, and as such are unsupported.

2. These tools do not constitute official statements by IBM or Arm on their respective CPU architectures. For example, both corporations reserve the right to report a bug at any time against any version of any of these tools. These tools are therefore not a substitute for careful stress testing on real hardware. Moreover, both the tools and the model that they are based on are under active development and might change at any time. On the other hand, this model was developed in consultation with the relevant hardware experts, so there is good reason to be confident that it is a robust representation of the architectures.

3. These tools currently handle a subset of the instruction set. This subset has been sufficient for my purposes, but your mileage may vary. In particular, the tool handles only word-sized accesses (32 bits), and the words accessed must be properly aligned.3 In addition, the tool does not handle some of the weaker variants of the Arm memory-barrier instructions, nor does it handle arithmetic.

4. The tools are restricted to small loop-free code fragments running on small numbers of threads. Larger examples result in state-space explosion, just as with similar tools such as Promela and spin.

5. The full state-space search does not give any indication of how each offending state was reached. That said, once you realize that the state is in fact reachable, it is usually not too hard to find that state using the interactive tool.

6. These tools are not much good for complex data structures, although it is possible to create and traverse extremely simple linked lists using initialization statements of the form “x=y; y=z; z=42;”.

7. These tools do not handle memory mapped I/O or device registers. Of course, handling such things would require that they be formalized, which does not appear to be in the offing.

8. The tools will detect only those problems for which you code an assertion. This weakness is common to all formal methods, and is yet another reason why testing remains important. In the immortal words of Donald Knuth quoted at the beginning of this chapter, “Beware of bugs in the above code; I have only proved it correct, not tried it.”

That said, one strength of these tools is that they are designed to model the full range of behaviors allowed by the architectures, including behaviors that are legal, but which current hardware implementations do not yet inflict on unwary software developers. Therefore, an algorithm that is vetted by these tools likely has some additional safety margin when running on real hardware. Furthermore, testing on real hardware can only find bugs; such testing is inherently incapable of proving a given usage correct. To appreciate this, consider that the researchers routinely ran in excess of 100 billion test runs on real hardware to validate their model. In one case, behavior that is allowed by the architecture did not occur, despite 176 billion runs [AMP+11]. In contrast, the full-state-space search allows the tool to prove code fragments correct.

It is worth repeating that formal methods and tools are no substitute for testing. The fact is that producing large reliable concurrent software artifacts, the Linux kernel for example, is quite difficult. Developers must therefore be prepared to apply every tool at their disposal towards this goal. The tools presented in this chapter are able to locate bugs that are quite difficult to produce (let alone track down) via testing. On the other hand, testing can be applied to far larger bodies of software than the tools presented in this chapter are ever likely to handle. As always, use the right tools for the job!

Of course, it is always best to avoid the need to work at this level by designing your parallel code to be easily partitioned and then using higher-level primitives (such as locks, sequence counters, atomic operations, and RCU) to get your job done more straightforwardly. And even if you absolutely must use low-level memory barriers and read-modify-write instructions to get your job done, the more conservative your use of these sharp instruments, the easier your life is likely to be.

3 But recent work focuses on mixed-size accesses [FSP+17].
Listing 12.27: IRIW Litmus Test
1 PPC IRIW.litmus
2 ""
3 (* Traditional IRIW. *)
4 {
5 0:r1=1; 0:r2=x;
6 1:r1=1; 1:r4=y;
7 2:r2=x; 2:r4=y;
8 3:r2=x; 3:r4=y;
9 }
10 P0           | P1           | P2           | P3           ;
11 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
12              |              | sync         | sync         ;
13              |              | lwz r5,0(r4) | lwz r5,0(r2) ;
14
15 exists
16 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)

Listing 12.28: Expanded IRIW Litmus Test
1 PPC IRIW5.litmus
2 ""
3 (* Traditional IRIW, but with five stores instead of *)
4 (* just one. *)
5 {
6 0:r1=1; 0:r2=x;
7 1:r1=1; 1:r4=y;
8 2:r2=x; 2:r4=y;
9 3:r2=x; 3:r4=y;
10 }
11 P0           | P1           | P2           | P3           ;
12 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
13 addi r1,r1,1 | addi r1,r1,1 | sync         | sync         ;
14 stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4) | lwz r5,0(r2) ;
15 addi r1,r1,1 | addi r1,r1,1 |              |              ;
16 stw r1,0(r2) | stw r1,0(r4) |              |              ;
17 addi r1,r1,1 | addi r1,r1,1 |              |              ;
18 stw r1,0(r2) | stw r1,0(r4) |              |              ;
19 addi r1,r1,1 | addi r1,r1,1 |              |              ;
20 stw r1,0(r2) | stw r1,0(r4) |              |              ;
21
22 exists
23 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
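By analogy with Listing 12.24, the IRIW test in Listing 12.27 corresponds roughly to the following kernel-style C code, with the sync instructions rendered as smp_mb(); the variable and function names are illustrative only.

/* Rough C rendering of Listing 12.27 (IRIW).  The exists clause
 * asks whether the two readers can see the two stores in
 * opposite orders. */
int x, y;

void P0(void) { WRITE_ONCE(x, 1); }
void P1(void) { WRITE_ONCE(y, 1); }

void P2(void)
{
  int r3 = READ_ONCE(x);
  smp_mb();              /* "sync" on line 13 of Listing 12.27 */
  int r5 = READ_ONCE(y);
  /* exists asks: 2:r3 == 1 && 2:r5 == 0 ... */
}

void P3(void)
{
  int r3 = READ_ONCE(y);
  smp_mb();              /* "sync" on line 13 of Listing 12.27 */
  int r5 = READ_ONCE(x);
  /* ... && 3:r3 == 1 && 3:r5 == 0 */
}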
The output of the herd tool when running this litmus test features Never, indicating that P0() never accesses a freed element, as expected. Also as expected, removing either synchronize_rcu() results in P1() accessing a freed element, as indicated by Sometimes in the herd output.

Quick Quiz 12.33: In Listing 12.32, why not have just one call to synchronize_rcu() immediately before line 48?

Figure 12.2: CBMC Processing Flow (trace generation if a counterexample is located, plus the verification result)
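CBMC's input is ordinary C annotated with assertions. The following toy harness (not from the Linux kernel, and unrelated to the RCU litmus test above) shows the general shape: the racy increments can lose a count, and a bounded model checker such as CBMC can, in principle, search the interleavings for an execution that violates the assertion and then emit the corresponding trace.

/* Minimal sketch of an assertion-based harness for a SAT-backed
 * checker.  The non-atomic increments below can lose counts. */
#include <assert.h>
#include <pthread.h>

int counter;

static void *incer(void *arg)
{
  counter++;            /* non-atomic, hence racy */
  return NULL;
}

int main(void)
{
  pthread_t t1, t2;

  pthread_create(&t1, NULL, incer, NULL);
  pthread_create(&t2, NULL, incer, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  assert(counter == 2);  /* a counterexample interleaving exists */
  return 0;
}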
testing remains the validation workhorse for large parallel software systems [Cor06a, Jon11, McK15d].

It is nevertheless quite possible that this will not always be the case. To see this, consider that there is estimated to be more than twenty billion instances of the Linux kernel as of 2017. Suppose that the Linux kernel has a bug that manifests on average every million years of runtime. As noted at the end of the preceding chapter, this bug will be appearing more than 50 times per day across the installed base. But the fact remains that most formal validation techniques can be used only on very small codebases. So what is a concurrency coder to do?

Think in terms of finding the first bug, the first relevant bug, the last relevant bug, and the last bug.

The first bug is normally found via inspection or compiler diagnostics. Although the increasingly sophisticated compiler diagnostics comprise a lightweight sort of formal verification, it is not common to think of them in those terms. This is in part due to an odd practitioner prejudice which says “If I am using it, it cannot be formal verification” on the one hand, and a large gap between compiler diagnostics and verification research on the other.

Although the first relevant bug might be located via inspection or compiler diagnostics, it is not unusual for these two steps to find only typos and false positives. Either way, the bulk of the relevant bugs, that is, those bugs that might actually be encountered in production, will often be found via testing.

When testing is driven by anticipated or real use cases, it is not uncommon for the last relevant bug to be located by testing. This situation might motivate a complete rejection of formal verification, however, irrelevant bugs have an annoying habit of suddenly becoming relevant at the least convenient moment possible, courtesy of black-hat attacks. For security-critical software, which appears to be a continually increasing fraction of the total, there can thus be strong motivation to find and fix the last bug. Testing is demonstrably unable to find the last bug, so there is a possible role for formal verification, assuming, that is, that formal verification proves capable of growing into that role. As this chapter has shown, current formal verification systems are extremely limited.

Quick Quiz 12.35: But shouldn't sufficiently low-level software be for all intents and purposes immune to being exploited by black hats?

Please note that formal verification is often much harder to use than is testing. This is in part a cultural statement, and there is reason to hope that formal verification will be perceived to be easier with increased familiarity. That said, very simple test harnesses can find significant bugs in arbitrarily large software systems. In contrast, the effort required to apply formal verification seems to increase dramatically as the system size increases.

I have nevertheless made occasional use of formal verification for almost 30 years by playing to formal verification's strengths, namely design-time verification of small complex portions of the overarching software construct. The larger overarching software construct is of course validated by testing.

Quick Quiz 12.36: In light of the full verification of the L4 microkernel, isn't this limited view of formal verification just a little bit obsolete?

One final approach is to consider the following two definitions from Section 11.1.2 and the consequence that they imply:

Definition: Bug-free programs are trivial programs.

Definition: Reliable programs have no known bugs.

Consequence: Any non-trivial reliable program contains at least one as-yet-unknown bug.

From this viewpoint, any advances in validation and verification can have but two effects: (1) An increase in the number of trivial programs or (2) A decrease in the number of reliable programs. Of course, the human race's increasing reliance on multicore systems and software provides extreme motivation for a very sharp increase in the number of trivial programs.

However, if your code is so complex that you find yourself relying too heavily on formal-verification tools, you should carefully rethink your design, especially if your formal-verification tools require your code to be hand-translated to a special-purpose language. For example, a complex implementation of the dynticks interface for preemptible RCU that was presented in Section 12.1.5 turned out to have a much simpler alternative implementation, as discussed in Section 12.1.6.9. All else being equal, a simpler implementation is much better than a proof of correctness for a complex implementation.

And the open challenge to those working on formal verification techniques and systems is to prove this summary wrong! To assist in this task, Verification Challenge 6 is now available [McK17]. Have at it!!!
12.7 Choosing a Validation Plan

Science is a first-rate piece of furniture for one's upper chamber, but only given common sense on the ground floor.
Oliver Wendell Holmes, updated

What sort of validation should you use for your project? As is often the case in software in particular and in engineering in general, the answer is “it depends”.

Note that neither running a test nor undertaking formal verification will change your project. At best, such efforts have an indirect effect by locating a bug that is later fixed. Nevertheless, fixing a bug might prevent inconvenience, monetary loss, property damage, or even loss of life. Clearly, this sort of indirect effect can be extremely valuable.

Unfortunately, as we have seen, it is difficult to predict whether or not a given validation effort will find important bugs. It is therefore all too easy to invest too little—or even to fail to invest at all, especially if development estimates proved overly optimistic or budgets unexpectedly tight, conditions which almost always come into play in real-world software projects.

The decision to nevertheless invest in validation is often forced by experienced people with forceful personalities. But this is no guarantee, given that other stakeholders might also have forceful personalities. Worse yet, these other stakeholders might bring stories of expensive validation efforts that nevertheless allowed embarrassing bugs to escape to the end users. So although a scarred, grey-haired, and grouchy veteran might carry the day, a more organized approach would perhaps be more useful.

Fortunately, there is a strictly financial analog to investments in validation, and that is the insurance policy.

Both insurance policies and validation efforts require consistent up-front investments, and both defend against disasters that might or might not ever happen. Furthermore, both have exclusions of various types. For example, insurance policies for coastal areas might exclude damages due to tidal waves, while on the other hand we have seen that there is not yet any validation methodology that can find each and every bug.

In addition, it is possible to over-invest in both insurance and in validation. For but one example, a validation plan that consumed the entire development budget would be just as pointless as would an insurance policy that covered the Sun going nova.

One approach is to devote a given fraction of the software budget to validation, with that fraction depending on the criticality of the software, so that safety-critical avionics software might grant a larger fraction of its budget to validation than would a homework assignment. Where available, experience from prior similar projects should be brought to bear. However, it is necessary to structure the project so that the validation investment starts when the project does, otherwise the inevitable overruns in spending on coding will crowd out the validation effort.

Staffing start-up projects with experienced people can result in overinvestment in validation efforts. Just as it is possible to go broke buying too much insurance, it is possible to kill a project by investing too much in testing. This is especially the case for first-of-a-kind projects where it is not yet clear which use cases will be important, in which case testing for all possible use cases will be a possibly fatal waste of time, energy, and funding.

However, as the tasks supported by a start-up project become more routine, users often become less forgiving of failures, thus increasing the need for validation. Managing this shift in investment can be extremely challenging, especially in the all-too-common case where the users are unwilling or unable to disclose the exact nature of their use case. It then becomes critically important to reverse-engineer the use cases from bug reports and from discussions with the users. As these use cases are better understood, use of continuous integration can help reduce the cost of finding and fixing any bugs located.

One example evolution of a software project's use of validation is shown in Figure 12.4. As can be seen in the figure, Linux-kernel RCU didn't have any validation code whatsoever until Linux kernel v2.6.15, which was released more than two years after RCU was accepted into the kernel. The test suite achieved its peak fraction of the total lines of code in Linux kernel v2.6.19–v2.6.21. This fraction decreased sharply with the acceptance of preemptible RCU for real-time applications in v2.6.25. This decrease was due to the fact that the RCU API was identical in the preemptible and non-preemptible variants of RCU. This in turn meant that the existing test suite applied to both variants, so that even though the Linux-kernel RCU code expanded significantly, there was no need to expand the tests.

Subsequent bars in Figure 12.4 show that the RCU code base expanded significantly, but that the corresponding validation code expanded even more dramatically. Linux kernel v3.5 added tests for the rcu_barrier() API, closing a long-standing hole in test coverage.
Figure 12.4: RCU and RCU Test code size (LoC) and percentage of test code (% Test) versus Linux release, v2.6.12 through v5.19
Linux kernel v3.14 added automated testing and analysis of test results, moving RCU towards continuous integration. Linux kernel v4.7 added a performance validation suite for RCU's update-side primitives. Linux kernel v4.12 added Tree SRCU, featuring improved update-side scalability, and v4.13 removed the old less-scalable SRCU implementation. Linux kernel v5.0 briefly hosted the nolibc library within the rcutorture scripting directory before it moved to its long-term home in tools/include/nolibc. Linux kernel v5.8 added the Tasks Trace and Rude flavors of RCU. Linux kernel v5.9 added the refscale.c suite of read-side performance tests. Linux kernels v5.12 and v5.13 started adding the ability to change a given CPU's callback-offloading status at runtime and also added the torture.sh acceptance-test script. Linux kernel v5.14 added distributed rcutorture. Linux kernel v5.15 added demonic vCPU placement in rcutorture testing, which was successful in locating a number of race conditions.5 Linux kernel v5.17 removed the RCU_FAST_NO_HZ Kconfig option. Numerous other changes may be found in the Linux kernel's git archives.

We have established that the validation budget varies from one project to the next, and also over the lifetime of any given project. But how should the validation investment be split between testing and formal verification?

This question is being answered naturally as compilers adopt increasingly aggressive formal-verification techniques into their diagnostics and as formal-verification tools continue to mature. In addition, the Linux-kernel lockdep and KCSAN tools illustrate the advantages of combining formal verification techniques with run-time analysis, as discussed in Section 11.3. Other combined techniques analyze traces gathered from executions [dOCdO19]. For the time being, the best practice is to focus first on testing and to reserve explicit work on formal verification for those portions of the project that are not well-served by testing, and that have exceptional needs for robustness. For example, Linux-kernel RCU relies primarily on testing, but has made occasional use of formal verification as discussed in this chapter.

In short, choosing a validation plan for concurrent software remains more an art than a science, let alone a field of engineering. However, there is every reason to expect that increasingly rigorous approaches will continue to become more prevalent.

5 The trick is to place one pair of vCPUs within the same core on one socket, while placing another pair within the same core on some other socket. As you might expect from Chapter 3, this produces different memory latencies between different pairs of vCPUs (https://paulmck.livejournal.com/62071.html).
Chapter 13

Putting It All Together

You don't learn how to shoot and then learn how to launch and then learn to do a controlled spin—you learn to launch-shoot-spin.
"Ender's Shadow", Orson Scott Card
Counting is the religion of this generation. It is its hope and its salvation.
Gertrude Stein

Although reference counting is a conceptually simple technique, many devils hide in the details when it is applied to concurrent software. After all, if the object was not subject to premature disposal, there would be no need for the reference counter in the first place. But if the object can be disposed of, what prevents disposal during the reference-acquisition process itself?

There are a number of ways to refurbish reference counters for use in concurrent software, including:

4. An existence guarantee is provided for the object, thus preventing it from being freed while some other entity might be attempting to acquire a reference. Existence guarantees are often provided by automatic garbage collectors, and, as is seen in Sections 9.3 and 9.5, by hazard pointers and RCU, respectively.

5. A type-safety guarantee is provided for the object. An additional identity check must be performed once the reference is acquired. Type-safety guarantees can be provided by special-purpose memory allocators, for example, by the SLAB_TYPESAFE_BY_RCU feature within the Linux kernel, as is seen in Section 9.5.

Of course, any mechanism that provides existence guarantees by definition also provides type-safety guarantees. This results in four general categories of reference-acquisition protection: Reference counting, hazard pointers, sequence locking, and RCU.

                               Release
Acquisition          Locks   Reference Counts   Hazard Pointers   RCU
Locks                −       CAM                M                 CA
Reference Counts     A       AM                 M                 A
Hazard Pointers      M       M                  M                 M
RCU                  CA      MCA                M                 CA

Quick Quiz 13.1: Why not implement reference-acquisition using a simple compare-and-swap operation that only acquires a reference if the reference counter is non-zero?

4. Atomic counting with a check combined with the atomic acquisition operation, and with memory barriers required only on release (“CAM”).

5. Atomic counting with a check combined with the atomic acquisition operation (“CA”).

6. Simple counting with a check combined with full memory barriers (“M”).

7. Atomic counting with a check combined with the atomic acquisition operation, and with memory barriers also required on acquisition (“MCA”).

However, because all Linux-kernel atomic operations that return a value are defined to contain memory barriers,1 all release operations contain memory barriers, and all checked acquisition operations also contain memory barriers. Therefore, cases “CA” and “MCA” are equivalent to

1 With atomic_read() and ATOMIC_INIT() being the exceptions that prove the rule.
“CAM”, so that there are sections below for only the first four cases and the sixth case: “−”, “A”, “AM”, “CAM”, and “M”. Later sections describe optimizations that can improve performance if reference acquisition and release is very frequent, and the reference count need be checked for zero only very rarely.

13.2.1 Implementation of Reference-Counting Categories

Simple counting protected by locking (“−”) is described in Section 13.2.1.1, atomic counting with no memory barriers (“A”) is described in Section 13.2.1.2, atomic counting with acquisition memory barrier (“AM”) is described in Section 13.2.1.3, and atomic counting with check and release memory barrier (“CAM”) is described in Section 13.2.1.4. Use of hazard pointers is described in Section 9.3 on page 130 and in Section 13.3.

Listing 13.1: Simple Reference-Count API
1 struct sref {
2   int refcount;
3 };
4
5 void sref_init(struct sref *sref)
6 {
7   sref->refcount = 1;
8 }
9
10 void sref_get(struct sref *sref)
11 {
12   sref->refcount++;
13 }
14
15 int sref_put(struct sref *sref,
16              void (*release)(struct sref *sref))
17 {
18   WARN_ON(release == NULL);
19   WARN_ON(release == (void (*)(struct sref *))kfree);
20
21   if (--sref->refcount == 0) {
22     release(sref);
23     return 1;
24   }
25   return 0;
26 }
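Because Listing 13.1's counter is neither atomic nor protected by memory barriers, in the “−” category it must be manipulated only while holding a lock external to the object. One plausible usage sketch follows; struct sref_obj, its lock, and the helper names are illustrative rather than part of any actual API.

/* Illustrative "-"-category use of the sref API: every manipulation
 * of ->ref happens under a lock that outlives the object itself. */
struct sref_obj {
  struct sref ref;
  int data;
};

DEFINE_SPINLOCK(sref_obj_lock);  /* external to the object */

static void sref_obj_release(struct sref *sref)
{
  free(container_of(sref, struct sref_obj, ref));
}

static void sref_obj_get(struct sref_obj *p)
{
  spin_lock(&sref_obj_lock);
  sref_get(&p->ref);
  spin_unlock(&sref_obj_lock);
}

static void sref_obj_put(struct sref_obj *p)
{
  spin_lock(&sref_obj_lock);
  sref_put(&p->ref, sref_obj_release);
  spin_unlock(&sref_obj_lock);
}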
Listing 13.2: Linux Kernel kref API
1 struct kref {
2   atomic_t refcount;
3 };
4
5 void kref_init(struct kref *kref)
6 {
7   atomic_set(&kref->refcount, 1);
8 }
9
10 void kref_get(struct kref *kref)
11 {
12   WARN_ON(!atomic_read(&kref->refcount));
13   atomic_inc(&kref->refcount);
14 }
15
16 static inline int
17 kref_sub(struct kref *kref, unsigned int count,
18          void (*release)(struct kref *kref))
19 {
20   WARN_ON(release == NULL);
21
22   if (atomic_sub_and_test((int) count,
23                           &kref->refcount)) {
24     release(kref);
25     return 1;
26   }
27   return 0;
28 }

CPUs have released their hazard pointers or exited their RCU read-side critical sections, respectively.

Quick Quiz 13.2: Why isn't it necessary to guard against cases where one CPU acquires a reference just after another CPU releases the last reference?

Listing 13.3: Linux Kernel dst_clone API
1 static inline
2 struct dst_entry * dst_clone(struct dst_entry * dst)
3 {
4   if (dst)
5     atomic_inc(&dst->__refcnt);
6   return dst;
7 }
8
9 static inline
10 void dst_release(struct dst_entry * dst)
11 {
12   if (dst) {
13     WARN_ON(atomic_read(&dst->__refcnt) < 1);
14     smp_mb__before_atomic_dec();
15     atomic_dec(&dst->__refcnt);
16   }
17 }

Quick Quiz 13.3: Suppose that just after the atomic_sub_and_test() on line 22 of Listing 13.2 is invoked, that some other CPU invokes kref_get(). Doesn't this result in that other CPU now having an illegal reference to a released object?

Quick Quiz 13.4: Suppose that kref_sub() returns zero, indicating that the release() function was not invoked. Under what conditions can the caller rely on the continued existence of the enclosing object?

Quick Quiz 13.5: Why not just pass kfree() as the release function?
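A typical way to use the kref API of Listing 13.2 is to embed the kref in the enclosing structure and supply a release function that frees that structure, roughly as sketched below; struct kref_obj and the helper names are illustrative.

/* Illustrative embedding of Listing 13.2's kref. */
struct kref_obj {
  struct kref ref;
  int data;
};

static void kref_obj_release(struct kref *kref)
{
  kfree(container_of(kref, struct kref_obj, ref));
}

/* Caller must already hold a reference, so kref_get() is safe. */
static struct kref_obj *kref_obj_get(struct kref_obj *p)
{
  kref_get(&p->ref);
  return p;
}

static void kref_obj_put(struct kref_obj *p)
{
  kref_sub(&p->ref, 1, kref_obj_release);
}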
required, it will be embedded in the mechanism used to hand the dst_entry off.

The dst_release() primitive may be invoked from any environment, and the caller might well reference elements of the dst_entry structure immediately prior to the call to dst_release(). The dst_release() primitive therefore contains a memory barrier on line 14 preventing both the compiler and the CPU from misordering accesses.

Please note that the programmer making use of dst_clone() and dst_release() need not be aware of the memory barriers, only of the rules for using these two primitives.

13.2.1.4 Atomic Counting With Check and Release Memory Barrier

Consider a situation where the caller must be able to acquire a new reference to an object to which it does not already hold a reference, but where that object's existence is guaranteed. The fact that initial reference-count acquisition can now run concurrently with reference-count release adds further complications. Suppose that a reference-count release finds that the new value of the reference count is zero, signaling that it is now safe to clean up the reference-counted object. We clearly cannot allow a reference-count acquisition to start after such clean-up has commenced, so the acquisition must include a check for a zero reference count. This check must be part of the atomic increment operation, as shown below.

Quick Quiz 13.6: Why can't the check for a zero reference count be made in a simple “if” statement with an atomic increment in its “then” clause?

The Linux kernel's fget() and fput() primitives use this style of reference counting. Simplified versions of these functions are shown in Listing 13.4.4

Listing 13.4: Linux Kernel fget/fput API
1 struct file *fget(unsigned int fd)
2 {
3   struct file *file;
4   struct files_struct *files = current->files;
5
6   rcu_read_lock();
7   file = fcheck_files(files, fd);
8   if (file) {
9     if (!atomic_inc_not_zero(&file->f_count)) {
10       rcu_read_unlock();
11       return NULL;
12     }
13   }
14   rcu_read_unlock();
15   return file;
16 }
17
18 struct file *
19 fcheck_files(struct files_struct *files, unsigned int fd)
20 {
21   struct file * file = NULL;
22   struct fdtable *fdt = rcu_dereference((files)->fdt);
23
24   if (fd < fdt->max_fds)
25     file = rcu_dereference(fdt->fd[fd]);
26   return file;
27 }
28
29 void fput(struct file *file)
30 {
31   if (atomic_dec_and_test(&file->f_count))
32     call_rcu(&file->f_u.fu_rcuhead, file_free_rcu);
33 }
34
35 static void file_free_rcu(struct rcu_head *head)
36 {
37   struct file *f;
38
39   f = container_of(head, struct file, f_u.fu_rcuhead);
40   kmem_cache_free(filp_cachep, f);
41 }

Line 4 of fget() fetches the pointer to the current process's file-descriptor table, which might well be shared with other processes. Line 6 invokes rcu_read_lock(), which enters an RCU read-side critical section. The callback function from any subsequent call_rcu() primitive will be deferred until a matching rcu_read_unlock() is reached (line 10 or 14 in this example). Line 7 looks up the file structure corresponding to the file descriptor specified by the fd argument, as will be described later. If there is an open file corresponding to the specified file descriptor, then line 9 attempts to atomically acquire a reference count. If it fails to do so, lines 10–11 exit the RCU read-side critical section and report failure. Otherwise, if the attempt is successful, lines 14–15 exit the read-side critical section and return a pointer to the file structure.

The fcheck_files() primitive is a helper function for fget(). Line 22 uses rcu_dereference() to safely fetch an RCU-protected pointer to this task's current file-descriptor table, and line 24 checks to see if the specified file descriptor is in range. If so, line 25 fetches the pointer to the file structure, again using the rcu_dereference() primitive. Line 26 then returns a pointer to the file structure or NULL in case of failure.

The fput() primitive releases a reference to a file structure. Line 31 atomically decrements the reference count, and, if the result was zero, line 32 invokes the call_rcu() primitive in order to free up the file structure (via the file_free_rcu() function specified in call_rcu()'s second argument), but only after all currently-executing RCU read-side critical sections complete, that is, after an RCU grace period has elapsed.

4 As of Linux v2.6.38. Additional O_PATH functionality was added in v2.6.39, refactoring was applied in v3.14, and mmap_sem contention was reduced in v4.1.
Once the grace period completes, the file_free_rcu() function obtains a pointer to the file structure on line 39, and frees it on line 40.

This code fragment thus demonstrates how RCU can be used to guarantee existence while an in-object reference count is being incremented.

13.2.2 Counter Optimizations

In some cases where increments and decrements are common, but checks for zero are rare, it makes sense to maintain per-CPU or per-task counters, as was discussed in Chapter 5. For example, see the paper on sleepable read-copy update (SRCU), which applies this technique to RCU [McK06]. This approach eliminates the need for atomic instructions or memory barriers on the increment and decrement primitives, but still requires that code-motion compiler optimizations be disabled. In addition, the primitives such as synchronize_srcu() that check for the aggregate reference count reaching zero can be quite slow. This underscores the fact that these techniques are designed for situations where the references are frequently acquired and released, but where it is rarely necessary to check for a zero reference count.

However, it is usually the case that use of reference counts requires writing (often atomically) to a data structure that is otherwise read only. In this case, reference counts are imposing expensive cache misses on readers. It is therefore worthwhile to look into synchronization mechanisms that do not require readers to write to the data structure being traversed. One possibility is the hazard pointers covered in Section 9.3 and another is RCU, which is covered in Section 9.5.

13.3 Hazard-Pointer Helpers

It's the little things that count, hundreds of them.
Cliff Shaw

13.3.1 Scalable Reference Count

Suppose a reference count is becoming a performance or scalability bottleneck. What can you do?

One approach is to instead use hazard pointers.

There are some differences, perhaps most notably that with hazard pointers it is extremely expensive to determine when the corresponding reference count has reached zero.

One way to work around this problem is to split the load between reference counters and hazard pointers. Each data element has a reference counter that tracks the number of other data elements referencing this element on the one hand, and readers use hazard pointers on the other.

Making this arrangement work both efficiently and correctly can be quite challenging, and so interested readers are invited to examine the UnboundedQueue and ConcurrentHashMap data structures implemented in the Folly open-source library.5

13.3.2 Long-Duration Accesses

Suppose a reader-writer-locking reader is holding the lock for so long that updates are excessively delayed. If that reader can reasonably be converted to use reference counting instead of reader-writer locking, but if performance and scalability considerations prevent use of actual reference counters, then hazard pointers provides a scalable variant of reference counting.

The key point is that where reader-writer locking readers block all updates for that lock, hazard pointers instead simply hang onto the data that is actually needed, while still allowing updates to proceed.

If the reader cannot reasonably be converted to use reference counting, the tricks in Section 13.5.8 might be helpful.

13.4 Sequence-Locking Specials

The girl who can't dance says the band can't play.
Yiddish proverb

This section looks at some special uses of sequence counters.
13.4.1 Dueling Sequence Locks

The classic sequence-locking use case enables a reader to see a consistent snapshot of a small collection of variables, for example, calibration constants for timekeeping. This works quite well in practice because calibration constants are rarely updated and, when updated, are updated quickly. Readers therefore almost never need to retry.

However, if the updater is delayed during the update, readers will also be delayed. Such delays might be due to interrupts, NMIs, or even virtual-CPU preemption.

One way to prevent updater delays from causing reader delays is to maintain two sets of calibration constants. Each set is updated in turn, but frequently enough that readers can make good use of either set. Each set has its own sequence lock (seqlock_t structure).

The updater alternates between the two sets, so that a delayed updater delays readers of at most one of the sets.

Each reader attempts to access the first set, but upon retry attempts to access the second set. If the second set also forces a retry, the reader repeats starting again from the first set. If the updater is stuck, only one of the two sets will force readers to retry, and therefore readers will succeed as soon as they attempt to access the other set.

Quick Quiz 13.7: Why don't all sequence-locking use cases replicate the data in this fashion?
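One plausible rendering of this dueling-seqlock scheme using the seqlock_t primitives named above is sketched below; struct cal_consts and the function names are illustrative, and a single updater (or externally serialized updaters) is assumed.

/* Two copies of the calibration data, each with its own seqlock_t,
 * so that a stalled updater can delay readers of at most one copy. */
struct cal_consts {
  seqlock_t lock;
  unsigned long mult;
  unsigned long shift;
};

static struct cal_consts cal[2];

/* Updater: alternate between the two copies. */
static void cal_update(unsigned long mult, unsigned long shift)
{
  static int next;
  int i = next;

  next = !next;
  write_seqlock(&cal[i].lock);
  cal[i].mult = mult;
  cal[i].shift = shift;
  write_sequnlock(&cal[i].lock);
}

/* Reader: try the first copy, switch copies on each retry. */
static void cal_read(unsigned long *mult, unsigned long *shift)
{
  int i = 0;
  unsigned int seq;

  for (;;) {
    seq = read_seqbegin(&cal[i].lock);
    *mult = cal[i].mult;
    *shift = cal[i].shift;
    if (!read_seqretry(&cal[i].lock, seq))
      return;
    i = !i;  /* at most one copy's updater can be stuck */
  }
}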
13.4.2 Correlated Data Elements

Suppose we have a hash table where we need correlated views of two or more of the elements. These elements are updated together, and we do not want to see an old version of the first element along with new versions of the other elements. For example, Schrödinger decided to add his extended family to his in-memory database along with all his animals. Although Schrödinger understands that marriages and divorces do not happen instantaneously, he is also a traditionalist. As such, he absolutely does not want his database ever to show that the bride is now married, but the groom is not, and vice versa. Plus, if you think Schrödinger is a traditionalist, you just try conversing with some of his family members! In other words, Schrödinger wants to be able to carry out a wedlock-consistent traversal of his database.

One approach is to use sequence locks (see Section 9.4), so that wedlock-related updates are carried out under the protection of write_seqlock(), while reads requiring wedlock consistency are carried out within a read_seqbegin() / read_seqretry() loop. Note that sequence locks are not a replacement for RCU protection: Sequence locks protect against concurrent modifications, but RCU is still needed to protect against concurrent deletions.

This approach works quite well when the number of correlated elements is small, the time to read these elements is short, and the update rate is low. Otherwise, updates might happen so quickly that readers might never complete. Although Schrödinger does not expect that even his least-sane relatives will marry and divorce quickly enough for this to be a problem, he does realize that this problem could well arise in other situations. One way to avoid this reader-starvation problem is to have the readers use the update-side primitives if there have been too many retries, but this can degrade both performance and scalability. Another way to avoid starvation is to have multiple sequence locks, in Schrödinger's case, perhaps one per species.

In addition, if the update-side primitives are used too frequently, poor performance and scalability will result due to lock contention. One way to avoid this is to maintain a per-element sequence lock, and to hold both spouses' locks when updating their marital status. Readers can do their retry looping on either of the spouses' locks to gain a stable view of any change in marital status involving both members of the pair. This avoids contention due to high marriage and divorce rates, but complicates gaining a stable view of all marital statuses during a single scan of the database.

If the element groupings are well-defined and persistent, which marital status is hoped to be, then one approach is to add pointers to the data elements to link together the members of a given group. Readers can then traverse these pointers to access all the data elements in the same group as the first one located.

This technique is used heavily in the Linux kernel, perhaps most notably in the dcache subsystem [Bro15b]. Note that it is likely that similar schemes also work with hazard pointers.

This approach provides sequential consistency to successful readers, each of which will either see the effects of a given update or not, with any partial updates resulting in a read-side retry. Sequential consistency is an extremely strong guarantee, incurring equally strong restrictions and equally high overheads. In this case, we saw that readers might be starved on the one hand, or might need to acquire the update-side lock on the other. Although this works very well in cases where updates are infrequent, it unnecessarily forces read-side retries even when the update does not affect any of the data that a retried reader accesses.
Section 13.5.4 therefore covers a much weaker form of consistency that not only avoids reader starvation, but also avoids any form of read-side retry. The next section instead presents a weaker form of consistency that can be provided with much lower probabilities of reader starvation.

13.4.3 Atomic Move

Suppose that individual data elements are moved from one data structure to another, and that readers look up only single data structures. However, when a data element moves, readers must never see it as being in both structures at the same time and must also never see it as missing from both structures at the same time. At the same time, any reader seeing the element in its new location must never subsequently see it in its old location. In addition, the move may be implemented by inserting a new copy of the old data element into the destination location.

For example, consider a hash table that supports an atomic-to-readers rename operation. Expanding on Schrödinger's zoo, suppose that an animal's name changes, for example, each of the brides in Schrödinger's traditionalist family might change their last name to match that of their groom.

But changing their name might change the hash value, and might also require that the bride's element move from one hash chain to another. The consistency set forth above requires that if a reader successfully looks up the new name, then any subsequent lookup of the old name by that reader must result in failure. Similarly, if a reader's lookup of the old name results in lookup failure, then any subsequent lookup of the new name by that reader must succeed. In short, a given reader should not see a bride momentarily blinking out of existence, nor should that reader look up a bride under her new name and then later look up that bride under her old name.

This consistency guarantee could be enforced with a single global sequence lock as described in Section 13.4.2, but this can result in reader starvation even for readers that are not looking up a bride who is currently undergoing a name change. This guarantee could also be enforced by requiring that readers acquire a per-hash-chain lock, but reviewing Figure 10.2 shows that this results in poor performance and scalability, even for single-socket systems.

Another more reader-friendly way to implement this is to use RCU and to place a sequence lock on each element. Readers looking up a given element act as sequence-lock readers across their full set of accesses to that element. Note that these sequence-lock operations will order each reader's lookups.

Renaming an element can then proceed roughly as follows (a rough code sketch appears at the end of this section):

1. Acquire a global lock protecting rename operations.

2. Allocate and initialize a copy of the element with the new name.

3. Write-acquire the sequence lock on the element with the old name, which has the side effect of ordering this acquisition with the following insertion. Concurrent lookups of the old name will now repeatedly retry.

4. Insert the copy of the element with the new name. Lookups of the new name will now succeed.

5. Execute smp_wmb() to order the prior insertion with the subsequent removal.

6. Remove the element with the old name. Concurrent lookups of the old name will now fail.

7. Write-release the sequence lock if necessary, for example, if required by lock dependency checkers.

8. Release the global lock.

Thus, readers looking up the old name will retry until the new name is available, at which point their final retry will fail. Any subsequent lookups of the new name will succeed. Any reader succeeding in looking up the new name is guaranteed that any subsequent lookup of the old name will fail, perhaps after a series of retries.

Quick Quiz 13.8: Is it possible to write-acquire the sequence lock on the new element before it is inserted instead of acquiring that of the old element before it is removed?

Quick Quiz 13.9: Is it possible to avoid the global lock?

It is of course possible to instead implement this procedure somewhat more efficiently using simple flags. However, this can be thought of as a simplified variant of sequence locking that relies on the fact that a given element's sequence lock is never write-acquired more than once.
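The eight-step procedure above might be sketched as follows using Linux-kernel-style primitives. The structure, the hash table, and the name_hash() helper are illustrative, error handling and copying of other payload fields are omitted, and this is a sketch of the idea rather than production code.

/* Rough sketch of the rename steps.  Each element carries its own
 * seqlock; readers use read_seqbegin()/read_seqretry() around their
 * lookups, and RCU protects the hash-chain traversals themselves. */
struct zoo_node {
  seqlock_t seq;
  char name[40];
  struct hlist_node hnode;
  struct rcu_head rh;
};

DEFINE_MUTEX(rename_lock);         /* step 1's global lock */
DEFINE_HASHTABLE(zoo_hash, 10);

static void zoo_rename(struct zoo_node *old, const char *newname)
{
  struct zoo_node *new;

  mutex_lock(&rename_lock);                                /* 1 */
  new = kzalloc(sizeof(*new), GFP_KERNEL);                 /* 2 */
  seqlock_init(&new->seq);
  strscpy(new->name, newname, sizeof(new->name));
  write_seqlock(&old->seq);        /* 3: old-name lookups now retry */
  hash_add_rcu(zoo_hash, &new->hnode, name_hash(newname)); /* 4 */
  smp_wmb();                       /* 5: order insertion before removal */
  hash_del_rcu(&old->hnode);       /* 6: old-name lookups now fail */
  write_sequnlock(&old->seq);                              /* 7 */
  mutex_unlock(&rename_lock);                              /* 8 */
  kfree_rcu(old, rh);              /* reclaim old element after a grace period */
}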
13.4.4 Upgrade to Writer

As discussed in Section 9.5.4.9, RCU permits readers to upgrade to writers. This capability can be quite useful when a reader scanning an RCU-protected data structure notices that the current element needs to be updated. What happens when you try this trick with sequence locking?

It turns out that this sequence-locking trick is actually used in the Linux kernel, for example, by the sdma_flush() function in drivers/infiniband/hw/hfi1/sdma.c. The effect is to doom the enclosing reader to retry. This trick is therefore used when the reader detects some condition that requires a retry.

13.5 RCU Rescues

sum. In particular, when a given thread exits, we absolutely cannot lose the exiting thread's count, nor can we double-count it. Such an error could result in inaccuracies equal to the full precision of the result, in other words, such an error would make the result completely useless. And in fact, one of the purposes of final_mutex is to ensure that threads do not come and go in the middle of read_count() execution.

Therefore, if we are to dispense with final_mutex, we will need to come up with some other method for ensuring consistency. One approach is to place the total count for all previously exited threads and the array of pointers to the per-thread counters into a single structure. Such a structure, once made available to read_count(), is held constant, ensuring that read_count() sees consistent data.
Listing 13.5: RCU and Per-Thread Statistical Counters
1 struct countarray {
2   unsigned long total;
3   unsigned long *counterp[NR_THREADS];
4 };
5
6 unsigned long __thread counter = 0;
7 struct countarray *countarrayp = NULL;
8 DEFINE_SPINLOCK(final_mutex);
9
10 __inline__ void inc_count(void)
11 {
12   WRITE_ONCE(counter, counter + 1);
13 }
14
15 unsigned long read_count(void)
16 {
17   struct countarray *cap;
18   unsigned long *ctrp;
19   unsigned long sum;
20   int t;
21
22   rcu_read_lock();
23   cap = rcu_dereference(countarrayp);
24   sum = READ_ONCE(cap->total);
25   for_each_thread(t) {
26     ctrp = READ_ONCE(cap->counterp[t]);
27     if (ctrp != NULL) sum += *ctrp;
28   }
29   rcu_read_unlock();
30   return sum;
31 }
32
33 void count_init(void)
34 {
35   countarrayp = malloc(sizeof(*countarrayp));
36   if (countarrayp == NULL) {
37     fprintf(stderr, "Out of memory\n");
38     exit(EXIT_FAILURE);
39   }
40   memset(countarrayp, '\0', sizeof(*countarrayp));
41 }
42
43 void count_register_thread(unsigned long *p)
44 {
45   int idx = smp_thread_id();
46
47   spin_lock(&final_mutex);
48   countarrayp->counterp[idx] = &counter;
49   spin_unlock(&final_mutex);
50 }
51
52 void count_unregister_thread(int nthreadsexpected)
53 {
54   struct countarray *cap;
55   struct countarray *capold;
56   int idx = smp_thread_id();
57
58   cap = malloc(sizeof(*countarrayp));
59   if (cap == NULL) {
60     fprintf(stderr, "Out of memory\n");
61     exit(EXIT_FAILURE);
62   }
63   spin_lock(&final_mutex);
64   *cap = *countarrayp;
65   WRITE_ONCE(cap->total, cap->total + counter);
66   cap->counterp[idx] = NULL;
67   capold = countarrayp;
68   rcu_assign_pointer(countarrayp, cap);
69   spin_unlock(&final_mutex);
70   synchronize_rcu();
71   free(capold);
72 }

Lines 43–50 show the count_register_thread() function, which is invoked by each newly created thread. Line 45 picks up the current thread's index, line 47 acquires final_mutex, line 48 installs a pointer to this thread's counter, and line 49 releases final_mutex.

Quick Quiz 13.11: Hey!!! Line 48 of Listing 13.5 modifies a value in a pre-existing countarray structure! Didn't you say that this structure, once made available to read_count(), remained constant???

Lines 52–72 show count_unregister_thread(), which is invoked by each thread just before it exits. Lines 58–62 allocate a new countarray structure, line 63 acquires final_mutex and line 69 releases it. Line 64 copies the contents of the current countarray into the newly allocated version, line 65 adds the exiting thread's counter to the new structure's ->total, and line 66 NULLs the exiting thread's counterp[] array element. Line 67 then retains a pointer to the current (soon to be old) countarray structure, and line 68 uses rcu_assign_pointer() to install the new version of the countarray structure. Line 70 waits for a grace period to elapse, so that any threads that might be concurrently executing in read_count(), and thus might have references to the old countarray structure, will be allowed to exit their RCU read-side critical sections, thus dropping any such references. Line 71 can then safely free the old countarray structure.

Quick Quiz 13.12: Given the fixed-size counterp array, exactly how does this code avoid a fixed upper bound on the number of threads???

13.5.1.3 Discussion

Quick Quiz 13.13: Wow! Listing 13.5 contains 70 lines of code, compared to only 42 in Listing 5.4. Is this extra complexity really worth it?

Use of RCU enables exiting threads to wait until other threads are guaranteed to be done using the exiting threads' __thread variables. This allows the read_count() function to dispense with locking, thereby providing excellent performance and scalability for both the inc_count() and read_count() functions. However, this performance and scalability come at the cost of some increase in code complexity. It is hoped that compiler and library writers employ user-level RCU [Des09b] to provide safe cross-thread access to __thread variables, greatly reducing the complexity seen by users of __thread variables.
13.5.2 RCU and Counters for Removable I/O Devices

Section 5.4.6 showed a fanciful pair of code fragments for dealing with counting I/O accesses to removable devices. These code fragments suffered from high overhead on the fastpath (starting an I/O) due to the need to acquire a reader-writer lock.

This section shows how RCU may be used to avoid this overhead.

The code for performing an I/O is quite similar to the original, with an RCU read-side critical section being

Listing 13.6: RCU-Protected Variable-Length Array
1 struct foo {
2   int length;
3   char *a;
4 };

Listing 13.7: Improved RCU-Protected Variable-Length Array
1 struct foo_a {
2   int length;
3   char a[0];
4 };
5
6 struct foo {
7   struct foo_a *fa;
8 };
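Listing 13.7 places the length and the array behind a single pointer so that both can be published to RCU readers together. A hedged sketch of the corresponding update and read paths follows; foo_resize() and foo_read() are illustrative names, and the update side is assumed to hold whatever lock serializes updates.

/* Illustrative update and read paths for Listing 13.7: the length
 * and the array contents are published together via one pointer. */
static void foo_resize(struct foo *fp, int newlen)
{
  struct foo_a *newfa, *oldfa;

  newfa = kzalloc(sizeof(*newfa) + newlen, GFP_KERNEL);
  newfa->length = newlen;
  /* ... fill in newfa->a[] as needed ... */
  oldfa = fp->fa;                 /* caller holds the update-side lock */
  rcu_assign_pointer(fp->fa, newfa);
  synchronize_rcu();              /* wait for pre-existing readers */
  kfree(oldfa);
}

static char foo_read(struct foo *fp, int i)
{
  struct foo_a *fa;
  char c = 0;

  rcu_read_lock();
  fa = rcu_dereference(fp->fa);
  if (i < fa->length)             /* length always matches this array */
    c = fa->a[i];
  rcu_read_unlock();
  return c;
}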
Listing 13.8: Uncorrelated Measurement Fields
1 struct animal {
2   char name[40];
3   double age;
4   double meas_1;
5   double meas_2;
6   double meas_3;
7   char photo[0]; /* large bitmap. */
8 };

Listing 13.9: Correlated Measurement Fields
1 struct measurement {
2   double meas_1;
3   double meas_2;
4   double meas_3;
5 };
6
7 struct animal {
8   char name[40];
9   double age;
10   struct measurement *mp;
11   char photo[0]; /* large bitmap. */
12 };

4. CPU 1 shrinks the array to be of length 8, and assigns

that here readers need only see a consistent view of a given single data element, not the consistent view of a group of data elements that was

7 Why would such a quantity be useful? Beats me! But group
This state machine and pseudocode show how to get the effect of a call_rcu_cancel() in those rare situations needing such semantics.

13.5.8 Long-Duration Accesses Two

Suppose a reader-writer-locking reader is holding the lock for so long that updates are excessively delayed. Suppose further that this reader cannot reasonably be converted to use reference counting (otherwise, see Section 13.3.2).

If that reader can be reasonably converted to use RCU, that might solve the problem. The reason is that RCU readers do not completely block updates, but rather block only the cleanup portions of those updates (including memory reclamation). Therefore, if the system has ample memory, converting the reader-writer lock to RCU may suffice.

However, converting to RCU does not always suffice. For example, the code might traverse an extremely large linked data structure within a single RCU read-side critical section, which might so greatly extend the RCU grace period that the system runs out of memory. These situations can be handled in a couple of different ways: (1) Use SRCU instead of RCU and (2) Acquire a reference to exit the RCU reader.

13.5.8.1 Use SRCU

In the Linux kernel, RCU is global. In other words, any long-running RCU reader anywhere in the kernel will delay the current RCU grace period. If the long-running RCU reader is traversing a small data structure, that small amount of data is delaying freeing of all other data structures, which can result in memory exhaustion.

One way to avoid this problem is to use SRCU for that long-running RCU reader's data structure, with its own srcu_struct structure. The resulting long-running SRCU readers will then delay only that srcu_struct structure's grace periods, and not those of RCU, thus avoiding memory exhaustion. For more details, see the SRCU API in Section 9.5.3.

Unfortunately, this approach does have some drawbacks. For one thing, SRCU readers are not subject to priority boosting, which can result in additional delays to low-priority SRCU readers on busy systems. Worse yet, defining a separate srcu_struct structure reduces the number of RCU updaters, which in turn increases the grace-period overhead per updater. This means that giving each current Linux-kernel RCU use case its own srcu_struct structure could multiply system-wide grace-period overhead by the number of such structures.

Therefore, it is often better to acquire some sort of non-RCU reference on the needed data to permit a momentary exit from the RCU read-side critical section, as described in the next section.

13.5.8.2 Acquire a Reference

If the RCU read-side critical section is too long, shorten it!

In some cases, this can be done trivially. For example, code that scans all of the hash chains of a statically allocated array of hash buckets can just as easily scan each hash chain within its own critical section.

This works because hash chains are normally quite short, and by design. When traversing long linked structures, it is necessary to have some way of stopping in the middle and resuming later.

For example, in Linux kernel v5.16, the khugepaged_scan_file() function checks to see if some other task needs the current CPU using need_resched(), and if so invokes xas_pause() to adjust the traversal's iterator appropriately, and then invokes cond_resched_rcu() to yield the CPU. In turn, the cond_resched_rcu() function invokes rcu_read_unlock(), cond_resched(), and finally rcu_read_lock() to drop out of the RCU read-side critical section in order to yield the CPU.

Of course, where feasible, another approach would be to switch to a data structure such as a hash table that is more friendly to momentarily dropping out of an RCU read-side critical section.

Quick Quiz 13.16: But how would this work with a resizable hash table, such as the one described in Section 10.4?
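As a concrete, hedged illustration of acquiring a reference in order to momentarily leave the RCU read-side critical section, a long list scan might look roughly like the following; struct elem, elem_get(), elem_put(), and process() are illustrative names standing in for whatever element type and non-RCU reference mechanism (reference count or hazard pointer) is actually used.

/* Illustrative long traversal that periodically leaves its RCU
 * read-side critical section to yield the CPU. */
static void scan_long_list(struct list_head *head)
{
  struct elem *p;

  rcu_read_lock();
  list_for_each_entry_rcu(p, head, node) {
    process(p);
    if (need_resched()) {
      elem_get(p);          /* keep p around across the gap */
      rcu_read_unlock();
      cond_resched();       /* grace periods may now complete */
      rcu_read_lock();
      elem_put(p);
      /* p might have been removed from the list meanwhile, so a
       * production version must revalidate p or restart the scan. */
    }
  }
  rcu_read_unlock();
}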
Chapter 14

Advanced Synchronization

If a little knowledge is a dangerous thing, just think what you could do with a lot of knowledge!
Unknown
This chapter covers synchronization techniques used for lockless algorithms and parallel real-time systems.

Although lockless algorithms can be quite helpful when faced with extreme requirements, they are no panacea. For example, as noted at the end of Chapter 5, you should thoroughly apply partitioning, batching, and well-tested packaged weak APIs (see Chapters 8 and 9) before even thinking about lockless algorithms.

But after doing all that, you still might find yourself needing the advanced techniques described in this chapter. To that end, Section 14.1 summarizes techniques used thus far for avoiding locks and Section 14.2 gives a brief overview of non-blocking synchronization. Memory ordering is also quite important, but it warrants its own chapter, namely Chapter 15.

The second form of advanced synchronization provides the stronger forward-progress guarantees needed for parallel real-time computing, which is the topic of Section 14.3.

14.1 Avoiding Locks

One particularly impressive example of a lockless technique is the statistical counters algorithm described in Section 5.2, which avoids not only locks, but also read-modify-write atomic operations, memory barriers, and even cache misses for counter increments.
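As a reminder of how lightweight such a lockless fastpath can be, here is a minimal user-space sketch in the spirit of those Section 5.2 counters. It is not taken from the book's CodeSamples: it uses C11 relaxed atomics in place of the book's READ_ONCE()/WRITE_ONCE(), and the MAX_THREADS bound, the assumed 64-byte cache line, and the function names are choices made for this example.

    #include <stdatomic.h>

    #define MAX_THREADS 64
    #define CACHE_LINE 64   /* assumed cache-line size */

    /* One counter per thread, padded to avoid false sharing. */
    struct counter_s {
        _Atomic unsigned long v;
        char pad[CACHE_LINE - sizeof(_Atomic unsigned long)];
    };
    static struct counter_s counter[MAX_THREADS];

    /* Fastpath: only thread tid ever writes counter[tid], so a relaxed
     * load/store pair suffices; no lock and no atomic read-modify-write. */
    static inline void inc_count(int tid)
    {
        unsigned long c = atomic_load_explicit(&counter[tid].v,
                                               memory_order_relaxed);
        atomic_store_explicit(&counter[tid].v, c + 1, memory_order_relaxed);
    }

    /* Slowpath: sum all per-thread counters.  The result is approximate
     * because other threads may be incrementing concurrently. */
    static unsigned long read_count(void)
    {
        unsigned long sum = 0;

        for (int i = 0; i < MAX_THREADS; i++)
            sum += atomic_load_explicit(&counter[i].v, memory_order_relaxed);
        return sum;
    }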
Other examples we have covered include:

1. The fastpaths through a number of other counting algorithms in Chapter 5.

2. The fastpath through resource allocator caches in Section 6.4.3.

3. The maze solver in Section 6.5.

4. The data-ownership techniques in Chapter 8.

5. The reference-counting, hazard-pointer, and RCU techniques in Chapter 9.
14.2 Non-Blocking Synchronization

Never worry about theory as long as the machinery does what it's supposed to do.

Robert A. Heinlein

The term non-blocking synchronization (NBS) [Her90] describes eight classes of linearizable algorithms with differing forward-progress guarantees [ACHS13], which are as follows:

1. Bounded population-oblivious wait-free synchronization: Every thread will make progress within a specific finite period of time, where this period of time is independent of the number of threads [HS08]. This level is widely considered to be even less achievable than bounded wait-free synchronization.

2. Bounded wait-free synchronization: Every thread will make progress within a specific finite period of time [Her91]. This level is widely considered to be unachievable, which might be why Alistarh et al. omitted it [ACHS13].

3. Wait-free synchronization: Every thread will make progress in finite time [Her93].

4. Lock-free synchronization: At least one thread will make progress in finite time [Her93].

5. Obstruction-free synchronization: Every thread will make progress in finite time in the absence of contention [HLM03].

6. Clash-free synchronization: At least one thread will make progress in finite time in the absence of contention [ACHS13].

7. Starvation-free synchronization: Every thread will make progress in finite time in the absence of failures [ACHS13].

8. Deadlock-free synchronization: At least one thread will make progress in finite time in the absence of failures [ACHS13].

NBS class 1 was formulated some time before 2015, classes 2, 3, and 4 were first formulated in the early 1990s, class 5 was first formulated in the early 2000s, and class 6 was first formulated in 2013. The final two classes have seen informal use for a great many decades, but were reformulated in 2013.

Quick Quiz 14.1: Given that there will always be a sharply limited number of CPUs available, is population obliviousness really useful?

In theory, any parallel algorithm can be cast into wait-free form, but there is a relatively small subset of NBS algorithms that are in common use. A few of these are listed in the following section.

14.2.1 Simple NBS

Perhaps the simplest NBS algorithm is atomic update of an integer counter using fetch-and-add (atomic_add_return()) primitives. This section lists a few additional commonly used NBS algorithms in roughly increasing order of complexity.

14.2.1.1 NBS Sets

One simple NBS algorithm implements a set of integers in an array. Here the array index indicates a value that might be a member of the set and the array element indicates whether or not that value actually is a set member. The linearizability criterion for NBS algorithms requires that reads from and updates to the array either use atomic instructions or be accompanied by memory barriers, but in the not-uncommon case where linearizability is not important, simple volatile loads and stores suffice, for example, using READ_ONCE() and WRITE_ONCE().

An NBS set may also be implemented using a bitmap, where each value that might be a member of the set corresponds to one bit. Reads and updates must normally be carried out via atomic bit-manipulation instructions, although compare-and-swap (cmpxchg() or CAS) instructions can also be used.
14.2.1.2 NBS Counters

The statistical counters algorithm discussed in Section 5.2 can be considered to be bounded-wait-free, but only by using a cute definitional trick in which the sum is considered to be approximate rather than exact.1 Given sufficiently wide error bounds that are a function of the length of time that the read_count() function takes to sum the counters, it is not possible to prove that any non-linearizable behavior occurred. This definitely (if a bit artificially) classifies the statistical-counters algorithm as bounded wait-free. This algorithm is probably the most heavily used NBS algorithm in the Linux kernel.

1 Citation needed. I heard of this trick verbally from Mark Moir.
14.2.1.3 Half-NBS Queue

Listing 14.1: NBS Enqueue Algorithm
 1 static inline bool
 2 ___cds_wfcq_append(struct cds_wfcq_head *head,
 3                    struct cds_wfcq_tail *tail,
 4                    struct cds_wfcq_node *new_head,
 5                    struct cds_wfcq_node *new_tail)
 6 {
 7   struct cds_wfcq_node *old_tail;
 8
 9   old_tail = uatomic_xchg(&tail->p, new_tail);
10   CMM_STORE_SHARED(old_tail->next, new_head);
11   return old_tail != &head->node;
12 }
13
14 static inline bool
15 _cds_wfcq_enqueue(struct cds_wfcq_head *head,
16                   struct cds_wfcq_tail *tail,
17                   struct cds_wfcq_node *new_tail)
18 {
19   return ___cds_wfcq_append(head, tail,
20                             new_tail, new_tail);
21 }

Another common NBS algorithm is the atomic queue where elements are enqueued using an atomic exchange instruction [MS98b], followed by a store into the ->next pointer of the new element's predecessor, as shown in Listing 14.1, which shows the userspace-RCU library implementation [Des09b]. Line 9 updates the tail pointer to reference the new element while returning a reference to its predecessor, which is stored in local variable old_tail. Line 10 then updates the predecessor's ->next pointer to reference the newly added element, and finally line 11 returns an indication as to whether or not the queue was initially empty.

Although mutual exclusion is required to dequeue a single element (so that dequeue is blocking), it is possible to carry out a non-blocking removal of the entire contents of the queue. What is not possible is to dequeue any given element in a non-blocking manner: The enqueuer might have failed between lines 9 and 10 of the listing, so that the element in question is only partially enqueued. This results in a half-NBS algorithm where enqueues are NBS but dequeues are blocking. This algorithm is nevertheless heavily used in practice, in part because most production software is not required to tolerate arbitrary fail-stop errors.

Quick Quiz 14.2: Wait! In order to dequeue all elements, both the ->head and ->tail pointers must be changed, which cannot be done atomically on typical computer systems. So how is this supposed to work???

14.2.1.4 NBS Stack

Listing 14.2: NBS Stack Algorithm
 1 struct node_t {
 2   value_t val;
 3   struct node_t *next;
 4 };
 5
 6 // LIFO list structure
 7 struct node_t* top;
 8
 9 void list_push(value_t v)
10 {
11   struct node_t *newnode = malloc(sizeof(*newnode));
12   struct node_t *oldtop;
13
14   newnode->val = v;
15   oldtop = READ_ONCE(top);
16   do {
17     newnode->next = oldtop;
18     oldtop = cmpxchg(&top, newnode->next, newnode);
19   } while (newnode->next != oldtop);
20 }
21
22
23 void list_pop_all(void (foo)(struct node_t *p))
24 {
25   struct node_t *p = xchg(&top, NULL);
26
27   while (p) {
28     struct node_t *next = p->next;
29
30     foo(p);
31     free(p);
32     p = next;
33   }
34 }

Listing 14.2 shows the LIFO push algorithm, which boasts lock-free push and bounded wait-free pop (lifo-push.c), forming an NBS stack. The origins of this algorithm are unknown, but it was referred to in a patent granted in 1975 [BS75]. This patent was filed in 1973, a few months before your editor saw his first computer, which had but one CPU.

Lines 1–4 show the node_t structure, which contains an arbitrary value and a pointer to the next structure on the stack, and line 7 shows the top-of-stack pointer.

The list_push() function spans lines 9–20. Line 11 allocates a new node and line 14 initializes it. Line 17 initializes the newly allocated node's ->next pointer, and line 18 attempts to push it on the stack. If line 19 detects cmpxchg() failure, another pass through the loop retries. Otherwise, the new node has been successfully pushed, and this function returns to its caller. Note that line 19 resolves races in which two concurrent instances of
list_push() attempt to push onto the stack. The cmpxchg() will succeed for one and fail for the other, causing the other to retry, thereby selecting an arbitrary order for the two nodes on the stack.

The list_pop_all() function spans lines 23–34. The xchg() statement on line 25 atomically removes all nodes on the stack, placing the head of the resulting list in local variable p and setting top to NULL. This atomic operation serializes concurrent calls to list_pop_all(): One of them will get the list, and the other a NULL pointer, at least assuming that there were no concurrent calls to list_push().

An instance of list_pop_all() that obtains a non-empty list in p processes this list in the loop spanning lines 27–33. Line 28 prefetches the ->next pointer, line 30 invokes the function referenced by foo() on the current node, line 31 frees the current node, and line 32 sets up p for the next pass through the loop.
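As a concrete illustration of how this API might be exercised, the following is a minimal, hypothetical usage sketch. It assumes the node_t, list_push(), and list_pop_all() definitions from Listing 14.2, along with the book's cmpxchg() and xchg() wrappers, and it further assumes that value_t is an arithmetic type; the print_node() name is invented for this example.

    #include <stdio.h>

    /* Hypothetical callback passed to list_pop_all(): print the value.
     * Listing 14.2's line 31 frees the node afterwards, so the callback
     * must not free it itself. */
    void print_node(struct node_t *p)
    {
        printf("popped value %ld\n", (long)p->val);
    }

    int main(void)
    {
        /* Each push is lock-free even when other threads push concurrently. */
        list_push(1);
        list_push(2);
        list_push(3);

        /* Atomically detach the whole stack and process it.
         * Values appear in LIFO order: 3, 2, 1. */
        list_pop_all(print_node);
        return 0;
    }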
But suppose that a pair of list_push() instances run concurrently with a list_pop_all() with a list initially containing a single Node A. Here is one way that this scenario might play out:

1. The first list_push() instance pushes a new Node B, executing through line 17, having just stored a pointer to Node A into Node B's ->next pointer.

2. The list_pop_all() instance runs to completion, setting top to NULL and freeing Node A.

3. The second list_push() instance runs to completion, pushing a new Node C, but happens to allocate the memory that used to belong to Node A.

4. The first list_push() instance executes the cmpxchg() on line 18. Because new Node C has the same address as the newly freed Node A, this cmpxchg() succeeds and this list_push() instance runs to completion.

Note that both pushes and the popall all ran successfully despite the reuse of Node A's memory. This is an unusual property: Most data structures require protection against what is often called the ABA problem.

But this property holds only for algorithms written in assembly language. The sad fact is that most languages (including C and C++) do not support pointers to lifetime-ended objects, such as the pointer to the old Node A contained in Node B's ->next pointer. In fact, compilers are within their rights to assume that if two pointers (call them p and q) were returned from two different calls to malloc(), then those pointers must not be equal. Real compilers really will generate the constant false in response to a p==q comparison. A pointer to an object that has been freed, but whose memory has been reallocated for a compatibly typed object is termed a zombie pointer.

Many concurrent applications avoid this problem by carefully hiding the memory allocator from the compiler, thus preventing the compiler from making inappropriate assumptions. This obfuscatory approach currently works in practice, but might well one day fall victim to increasingly aggressive optimizers. There is work underway in both the C and C++ standards committees to address this problem [MMS19, MMM+ 20]. In the meantime, please exercise great care when coding ABA-tolerant algorithms.

Quick Quiz 14.3: So why not ditch antique languages like C and C++ for something more modern?

14.2.2 Applicability of NBS Benefits

The most heavily cited NBS benefits stem from its forward-progress guarantees, its tolerance of fail-stop bugs, and from its linearizability. Each of these is discussed in one of the following sections.

14.2.2.1 NBS Forward Progress Guarantees

NBS's forward-progress guarantees have caused many to suggest its use in real-time systems, and NBS algorithms are in fact used in a great many such systems. However, it is important to note that forward-progress guarantees are largely orthogonal to those that form the basis of real-time programming:

1. Real-time forward-progress guarantees usually have some definite time associated with them, for example, "scheduling latency must be less than 100 microseconds." In contrast, the most popular forms of NBS only guarantee that progress will be made in finite time, with no definite bound.

2. Real-time forward-progress guarantees are often probabilistic, as in the soft-real-time guarantee that "at least 99.9 % of the time, scheduling latency must be less than 100 microseconds." In contrast, many of NBS's forward-progress guarantees are unconditional.

3. Real-time forward-progress guarantees are often conditioned on environmental constraints, for example, only being honored: (1) For the highest-priority
tasks, (2) When each CPU spends at least a certain fraction of its time idle, and (3) When I/O rates are below some specified maximum. In contrast, NBS's forward-progress guarantees are often unconditional, although recent NBS work accommodates conditional guarantees [ACHS13].

4. An important component of a real-time program's environment is the scheduler. NBS algorithms assume a worst-case demonic scheduler, though for whatever reason, not a scheduler so demonic that it simply refuses to ever run the application housing the NBS algorithm. In contrast, real-time systems assume that the scheduler is doing its level best to satisfy any scheduling constraints it knows about, and, in the absence of such constraints, its level best to honor process priorities and to provide fair scheduling to processes of the same priority. Non-demonic schedulers allow real-time programs to use simpler algorithms than those required for NBS [ACHS13, Bra11].

5. NBS forward-progress guarantee classes assume that a number of underlying operations are lock-free or even wait-free, when in fact these operations are blocking on common-case computer systems.

6. NBS forward-progress guarantees are often achieved by subdividing operations. For example, in order to avoid a blocking dequeue operation, an NBS algorithm might substitute a non-blocking polling operation. This is fine in theory, but not helpful in practice to real-world programs that require an element to propagate through the queue in a timely fashion.

7. Real-time forward-progress guarantees usually apply only in the absence of software bugs. In contrast, many classes of NBS guarantees apply even in the face of fail-stop bugs.

8. NBS forward-progress guarantee classes imply linearizability. In contrast, real-time forward progress guarantees are often independent of ordering constraints such as linearizability.

Quick Quiz 14.4: Why does anyone care about demonic schedulers?

To reiterate, despite these differences, a number of NBS algorithms are extremely useful in real-time programs.

14.2.2.2 NBS Underlying Operations

An NBS algorithm can be truly non-blocking only if the underlying operations that it uses are also non-blocking. In a surprising number of cases, this is not the case in practice.

For example, non-blocking algorithms often allocate memory. In theory, this is fine, given the existence of lock-free memory allocators [Mic04b]. But in practice, most environments must eventually obtain memory from operating-system kernels, which commonly use locking. Therefore, unless all the memory that will ever be needed is somehow preallocated, a "non-blocking" algorithm that allocates memory will not be non-blocking when running on common-case real-world computer systems.

This same point clearly also applies to algorithms performing I/O operations or otherwise interacting with their environment.

Perhaps surprisingly, this point also applies to ostensibly non-blocking algorithms that do only plain loads and stores, such as the counters discussed in Section 14.2.1.2. And at first glance, those loads and stores that can be compiled into single load and store instructions, respectively, would seem to be not just non-blocking, but bounded population-oblivious wait free.

Except that load and store instructions are not necessarily either fast or deterministic. For example, as noted in Chapter 3, cache misses can consume thousands of CPU cycles. Worse yet, the measured cache-miss latencies can be a function of the number of CPUs, as illustrated in Figure 5.1. It is only reasonable to assume that these latencies also depend on the details of the system's interconnect. In addition, given that hardware vendors generally do not publish upper bounds for cache-miss latencies, it seems brave to assume that memory-reference instructions are in fact wait-free in modern computer systems. And the antique systems for which such bounds are available suffer from profound overall slowness.

Furthermore, hardware is not the only source of slowness for memory-reference instructions. For example, when running on typical computer systems, both loads and stores can result in page faults. Which cause in-kernel page-fault handlers to be invoked. Which might acquire locks, or even do I/O, potentially even using something like network file system (NFS). All of which are most emphatically blocking operations.

Nor are page faults the only kernel-induced hazard. A given CPU might be interrupted at any time, and the interrupt handler might run for some time. During this time, the user-mode ostensibly non-blocking algorithm
will not be running at all. This situation raises interesting questions about the forward-progress guarantees provided by system calls relying on interrupts, for example, the membarrier() system call.

Things do look bleak, but the non-blocking nature of such algorithms can be at least partially redeemed using a number of approaches:

code making use of that algorithm. In such cases, not only has nothing been gained by this trick, but this trick has increased the complexity of all users of this algorithm.

With concurrent algorithms as elsewhere, maximizing a specific metric is no substitute for thinking carefully about the needs of one's users.
14.2.2.5 NBS Linearizability

It is important to note that linearizability can be quite useful, especially when analyzing concurrent code made up of strict locking and fully ordered atomic operations.2 Furthermore, this handling of fully ordered atomic operations automatically covers simple NBS algorithms.

2 For example, the Linux kernel's value-returning atomic operations.

However, the linearization points of a complex NBS algorithm are often buried deep within that algorithm, and thus not visible to users of a library function implementing a part of such an algorithm. Therefore, any claims that users benefit from the linearizability properties of complex NBS algorithms should be regarded with deep suspicion [HKLP12].

It is sometimes asserted that linearizability is necessary for developers to produce proofs of correctness for their concurrent code. However, such proofs are the exception rather than the rule, and modern developers who do produce proofs often use modern proof techniques that do not depend on linearizability. Furthermore, developers frequently use modern proof techniques that do not require a full specification, given that developers often learn their specification after the fact, one bug at a time. A few such proof techniques were discussed in Chapter 12.3

3 A memorable verbal discussion with an advocate of linearizability resulted in the question: "So the reason linearizability is important is to rescue 1980s proof techniques?" The advocate immediately replied in the affirmative, then spent some time disparaging a particular modern proof technique. Oddly enough, that technique was one of those successfully applied to Linux-kernel RCU.

It is often asserted that linearizability maps well to sequential specifications, which are said to be more natural than are concurrent specifications [RR20]. But this assertion fails to account for our highly concurrent objective universe. This universe can only be expected to select for ability to cope with concurrency, especially for those participating in team sports or overseeing small children. In addition, given that the teaching of sequential computing is still believed to be somewhat of a black art [PBCE20], it is reasonable to expect that teaching of concurrent computing is in a similar state of disarray. Therefore, focusing on only one proof technique is unlikely to be a good way forward.

Again, please understand that linearizability is quite useful in many situations. Then again, so is that venerable tool, the hammer. But there comes a point in the field of computing where one should put down the hammer and pick up a keyboard. Similarly, it appears that there are times when linearizability is not the best tool for the job.

To their credit, there are some linearizability advocates who are aware of some of its shortcomings [RR20]. There are also proposals to extend linearizability, for example, interval-linearizability, which is intended to handle the common case of operations that require non-zero time to complete [CnRR18]. It remains to be seen whether these proposals will result in theories able to handle modern concurrent software artifacts, especially given that several of the proof techniques discussed in Chapter 12 already handle many modern concurrent software artifacts.

14.2.3 NBS Discussion

It is possible to create fully non-blocking queues [MS96], however, such queues are much more complex than the half-NBS algorithm outlined above. The lesson here is to carefully consider your actual requirements. Relaxing irrelevant requirements can often result in great improvements in simplicity, performance, and scalability.

Recent research points to another important way to relax requirements. It turns out that systems providing fair scheduling can enjoy most of the benefits of wait-free synchronization even when running algorithms that provide only non-blocking synchronization, both in theory [ACHS13] and in practice [AB13]. Because most schedulers used in production do in fact provide fairness, the more-complex algorithms providing wait-free synchronization usually provide no practical advantages over simpler and faster non-wait-free algorithms.

Interestingly enough, fair scheduling is but one beneficial constraint that is often respected in practice. Other sets of constraints can permit blocking algorithms to achieve deterministic real-time response. For example, given: (1) Fair locks granted in FIFO order within a given priority level, (2) Priority inversion avoidance (for example, priority inheritance [TS95, WTS96] or priority ceiling), (3) A bounded number of threads, (4) Bounded critical section durations, (5) Bounded load, and (6) Absence of fail-stop bugs, lock-based applications can provide deterministic response times [Bra11, SM04a]. This approach of course blurs the distinction between blocking and wait-free synchronization, which is all to the good. Hopefully theoretical frameworks will continue to improve their ability to describe software actually used in practice.

Those who feel that theory should lead the way are referred to the inimitable Peter Denning, who said of operating systems: "Theory follows practice" [Den15], or to the eminent Tony Hoare, who said of the whole of engineering: "In all branches of engineering science, the
engineering starts before the science; indeed, without the early products of engineering, there would be nothing for the scientist to study!" [Mor07]. Of course, once an appropriate body of theory becomes available, it is wise to make use of it. However, note well that the first appropriate body of theory is often one thing and the first proposed body of theory quite another.

Quick Quiz 14.5: It seems like the various members of the NBS hierarchy are rather useless. So why bother with them at all???

Proponents of NBS algorithms sometimes call out real-time computing as an important NBS beneficiary. The next section looks more deeply at the forward-progress needs of real-time systems.

14.3 Parallel Real-Time Computing

One always has time enough if one applies it well.

Johann Wolfgang von Göthe

An important emerging area in computing is that of parallel real-time computing. Section 14.3.1 looks at a number of definitions of "real-time computing", moving beyond the usual sound bites to more meaningful criteria. Section 14.3.2 surveys the sorts of applications that need real-time response. Section 14.3.3 notes that parallel real-time computing is upon us, and discusses when and why parallel real-time computing can be useful. Section 14.3.4 gives a brief overview of how parallel real-time systems may be implemented, with Sections 14.3.5 and 14.3.6 focusing on operating systems and applications, respectively. Finally, Section 14.3.7 outlines how to decide whether or not your application needs real-time facilities.

14.3.1 What is Real-Time Computing?

One traditional way of classifying real-time computing is into the categories of hard real time and soft real time, where the macho hard real-time applications never miss their deadlines, but the wimpy soft real-time applications miss their deadlines quite often.

14.3.1.1 Soft Real Time

It should be easy to see problems with this definition of soft real time. For one thing, by this definition, any piece of software could be said to be a soft real-time application: "My application computes million-point Fourier transforms in half a picosecond." "No way!!! The clock cycle on this system is more than three hundred picoseconds!" "Ah, but it is a soft real-time application!" If the term "soft real time" is to be of any use whatsoever, some limits are clearly required.

We might therefore say that a given soft real-time application must meet its response-time requirements at least some fraction of the time, for example, we might say that it must execute in less than 20 microseconds 99.9 % of the time.

This of course raises the question of what is to be done when the application fails to meet its response-time requirements. The answer varies with the application, but one possibility is that the system being controlled has sufficient stability and inertia to render harmless the occasional late control action. Another possibility is that the application has two ways of computing the result, a fast and deterministic but inaccurate method on the one hand and a very accurate method with unpredictable compute time on the other. One reasonable approach would be to start both methods in parallel, and if the accurate method fails to finish in time, kill it and use the answer from the fast but inaccurate method. One candidate for the fast but inaccurate method is to take no control action during the current time period, and another candidate is to take the same control action as was taken during the preceding time period.

In short, it does not make sense to talk about soft real time without some measure of exactly how soft it is.
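To make the two-methods strategy concrete, the following is a minimal user-space sketch, not taken from the book's CodeSamples: the accurate method runs in a helper thread, the caller waits on a semaphore with a 20-millisecond deadline (an arbitrary figure chosen for this example), and the fast fallback answer is used if the deadline passes. All function names are invented placeholders, and a production version would also cancel or otherwise quiesce the straggling computation.

    #include <pthread.h>
    #include <semaphore.h>
    #include <time.h>

    static sem_t done_sem;
    static double accurate_answer;

    /* Placeholder for the accurate-but-unpredictable computation. */
    static void *accurate_method(void *arg)
    {
        accurate_answer = 42.0;   /* imagine something expensive here */
        sem_post(&done_sem);
        return NULL;
    }

    /* Placeholder for the fast-but-inaccurate fallback, for example,
     * repeating the previous control action. */
    static double fast_method(void)
    {
        return 0.0;
    }

    double compute_with_deadline(void)
    {
        pthread_t tid;
        struct timespec deadline;

        sem_init(&done_sem, 0, 0);
        pthread_create(&tid, NULL, accurate_method, NULL);

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 20 * 1000 * 1000;   /* 20 ms budget */
        if (deadline.tv_nsec >= 1000000000) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000;
        }

        if (sem_timedwait(&done_sem, &deadline) == 0) {
            pthread_join(tid, NULL);
            return accurate_answer;   /* finished in time */
        }
        pthread_detach(tid);          /* missed the deadline: fall back */
        return fast_method();
    }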
14.3.1.2 Hard Real Time

In contrast, the definition of hard real time is quite definite. After all, a given system either always meets its deadlines or it doesn't.

Unfortunately, a strict application of this definition would mean that there can never be any hard real-time systems. The reason for this is fancifully depicted in Figure 14.1. Yes, you could construct a more robust system, perhaps with redundancy. But your adversary can always get a bigger hammer.

Then again, perhaps it is unfair to blame the software for what is clearly not just a hardware problem, but a bona fide big-iron hardware problem at that.4 This suggests that we define hard real-time software as software that will always meet its deadlines, but only in the absence of a hardware failure. Unfortunately, failure is not always an

4 Or, given modern hammers, a big-steel problem.
294 CHAPTER 14. ADVANCED SYNCHRONIZATION
that he wisely handed off to his colleagues designing the There are also situations where a minimum level of
hardware. In effect, my colleague imposed an atmospheric- energy is required, for example, through the power leads
composition constraint on the environment immediately of the system and through the devices through which the
surrounding the computer, a constraint that the hardware system is to communicate with that portion of the outside
designers met through use of physical seals. world that is to be monitored or controlled.
Quick Quiz 14.6: But what about battery-powered systems?
Another old college friend worked on a computer- They don’t require energy flowing into the system as a whole.
controlled system that sputtered ingots of titanium using
an industrial-strength arc in a vacuum. From time to time, A number of systems are intended to operate in envi-
the arc would decide that it was bored with its path through ronments with impressive levels of shock and vibration,
the ingot of titanium and choose a far shorter and more en- for example, engine control systems. More strenuous
tertaining path to ground. As we all learned in our physics requirements may be found when we move away from
classes, a sudden shift in the flow of electrons creates an continuous vibrations to intermittent shocks. For example,
electromagnetic wave, with larger shifts in larger flows during my undergraduate studies, I encountered an old
creating higher-power electromagnetic waves. And in this Athena ballistics computer, which was designed to con-
case, the resulting electromagnetic pulses were sufficient tinue operating normally even if a hand grenade went off
to induce a quarter of a volt potential difference in the nearby.5 And finally, the “black boxes” used in airliners
leads of a small “rubber ducky” antenna located more than must continue operating before, during, and after a crash.
400 meters away. This meant that nearby conductors expe- Of course, it is possible to make hardware more robust
rienced higher voltages, courtesy of the inverse-square law. against environmental shocks and insults. Any number of
This included those conductors making up the computer ingenious mechanical shock-absorbing devices can reduce
controlling the sputtering process. In particular, the volt- the effects of shock and vibration, multiple layers of shield-
age induced on that computer’s reset line was sufficient to ing can reduce the effects of low-energy electromagnetic
actually reset the computer, mystifying everyone involved. radiation, error-correction coding can reduce the effects
This situation was addressed using hardware, including of high-energy radiation, various potting and sealing tech-
some elaborate shielding and a fiber-optic network with niques can reduce the effect of air quality, and any number
the lowest bitrate I have ever heard of, namely 9600 baud. of heating and cooling systems can counter the effects of
Less spectacular electromagnetic environments can often temperature. In extreme cases, triple modular redundancy
be handled by software through use of error detection and can reduce the probability that a fault in one part of the
correction codes. That said, it is important to remember system will result in incorrect behavior from the overall
that although error detection and correction codes can
reduce failure rates, they normally cannot reduce them 5 Decades later, the acceptance tests for some types of computer
all the way down to zero, which can present yet another systems involve large detonations, and some types of communications
obstacle to achieving hard real-time response. networks must deal with what is delicately termed “ballistic jamming.”
system. However, all of these methods have one thing in common: Although they can reduce the probability of failure, they cannot reduce it to zero.

These environmental challenges are often met via robust hardware, however, the workload and application constraints in the next two sections are often handled in software.

Workload Constraints   Just as with people, it is often possible to prevent a real-time system from meeting its deadlines by overloading it. For example, if the system is being interrupted too frequently, it might not have sufficient CPU bandwidth to handle its real-time application. A hardware solution to this problem might limit the rate at which interrupts were delivered to the system. Possible software solutions include disabling interrupts for some time if they are being received too frequently, resetting the device generating too-frequent interrupts, or even avoiding interrupts altogether in favor of polling.

Overloading can also degrade response times due to queueing effects, so it is not unusual for real-time systems to overprovision CPU bandwidth, so that a running system has (say) 80 % idle time. This approach also applies to storage and networking devices. In some cases, separate storage and networking hardware might be reserved for the sole use of high-priority portions of the real-time application. In short, it is not unusual for this hardware to be mostly idle, given that response time is more important than throughput in real-time systems.

Quick Quiz 14.7: But given the results from queueing theory, won't low utilization merely improve the average response time rather than improving the worst-case response time? And isn't worst-case response time all that most real-time systems really care about?

Of course, maintaining sufficiently low utilization requires great discipline throughout the design and implementation. There is nothing quite like a little feature creep to destroy deadlines.

Application Constraints   It is easier to provide bounded response time for some operations than for others. For example, it is quite common to see response-time specifications for interrupts and for wake-up operations, but quite rare for (say) filesystem unmount operations. One reason for this is that it is quite difficult to bound the amount of work that a filesystem-unmount operation might need to do, given that the unmount is required to flush all of that filesystem's in-memory data to mass storage.

This means that real-time applications must be confined to operations for which bounded latencies can reasonably be provided. Other operations must either be pushed out into the non-real-time portions of the application or forgone entirely.

There might also be constraints on the non-real-time portions of the application. For example, is the non-real-time application permitted to use the CPUs intended for the real-time portion? Are there time periods during which the real-time portion of the application is expected to be unusually busy, and if so, is the non-real-time portion of the application permitted to run at all during those times? Finally, by what amount is the real-time portion of the application permitted to degrade the throughput of the non-real-time portion?

Real-World Real-Time Specifications   As can be seen from the preceding sections, a real-world real-time specification needs to include constraints on the environment, on the workload, and on the application itself. In addition, for the operations that the real-time portion of the application is permitted to make use of, there must be constraints on the hardware and software implementing those operations.

For each such operation, these constraints might include a maximum response time (and possibly also a minimum response time) and a probability of meeting that response time. A probability of 100 % indicates that the corresponding operation must provide hard real-time service.

In some cases, both the response times and the required probabilities of meeting them might vary depending on the parameters to the operation in question. For example, a network operation over a local LAN would be much more likely to complete in (say) 100 microseconds than would that same network operation over a transcontinental WAN. Furthermore, a network operation over a copper or fiber LAN might have an extremely high probability of completing without time-consuming retransmissions, while that same networking operation over a lossy WiFi network might have a much higher probability of missing tight deadlines. Similarly, a read from a tightly coupled solid-state disk (SSD) could be expected to complete much more quickly than that same read to an old-style USB-connected rotating-rust disk drive.6

6 Important safety tip: Worst-case response times from USB devices can be extremely long. Real-time systems should therefore take care to place any USB devices well away from critical paths.

Some real-time applications pass through different phases of operation. For example, a real-time system
controlling a plywood lathe that peels a thin sheet of wood (called "veneer") from a spinning log must: (1) Load the log into the lathe, (2) Position the log on the lathe's chucks so as to expose the largest cylinder contained within that log to the blade, (3) Start spinning the log, (4) Continuously vary the knife's position so as to peel the log into veneer, (5) Remove the remaining core of the log that is too small to peel, and (6) Wait for the next log. Each of these six phases of operation might well have its own set of deadlines and environmental constraints, for example, one would expect phase 4's deadlines to be much more severe than those of phase 6, as in milliseconds rather than seconds. One might therefore expect that low-priority work would be performed in phase 6 rather than in phase 4. In any case, careful choices of hardware, drivers, and software configuration would be required to support phase 4's more severe requirements.

A key advantage of this phase-by-phase approach is that the latency budgets can be broken down, so that the application's various components can be developed independently, each with its own latency budget. Of course, as with any other kind of budget, there will likely be the occasional conflict as to which component gets which fraction of the overall budget, and as with any other kind of budget, strong leadership and a sense of shared goals can help to resolve these conflicts in a timely fashion. And, again as with other kinds of technical budget, a strong validation effort is required in order to ensure proper focus on latencies and to give early warning of latency problems. A successful validation effort will almost always include a good test suite, which might be unsatisfying to the theorists, but has the virtue of helping to get the job done. As a point of fact, as of early 2021, most real-world real-time systems use an acceptance test rather than formal proofs.

However, the widespread use of test suites to validate real-time systems does have a very real disadvantage, namely that real-time software is validated only on specific configurations of hardware and software. Adding additional configurations requires additional costly and time-consuming testing. Perhaps the field of formal verification will advance sufficiently to change this situation, but as of early 2021, rather large advances are required.

Quick Quiz 14.8: Formal verification is already quite capable, benefiting from decades of intensive study. Are additional advances really required, or is this just a practitioner's excuse to continue to lazily ignore the awesome power of formal verification?

In addition to latency requirements for the real-time portions of the application, there will likely be performance and scalability requirements for the non-real-time portions of the application. These additional requirements reflect the fact that ultimate real-time latencies are often attained by degrading scalability and average performance.

Software-engineering requirements can also be important, especially for large applications that must be developed and maintained by large teams. These requirements often favor increased modularity and fault isolation.

This is a mere outline of the work that would be required to specify deadlines and environmental constraints for a production real-time system. It is hoped that this outline clearly demonstrates the inadequacy of the sound-bite-based approach to real-time computing.

14.3.2 Who Needs Real-Time?

It is possible to argue that all computing is in fact real-time computing. For one example, when you purchase a birthday gift online, you expect the gift to arrive before the recipient's birthday. And in fact even turn-of-the-millennium web services observed sub-second response constraints [Boh01], and requirements have not eased with the passage of time [DHJ+ 07]. It is nevertheless useful to focus on those real-time applications whose response-time requirements cannot be achieved straightforwardly by non-real-time systems and applications. Of course, as hardware costs decrease and bandwidths and memory sizes increase, the line between real-time and non-real-time will continue to shift, but such progress is by no means a bad thing.

Quick Quiz 14.9: Differentiating real-time from non-real-time based on what can "be achieved straightforwardly by non-real-time systems and applications" is a travesty! There is absolutely no theoretical basis for such a distinction!!! Can't we do better than that???

Real-time computing is used in industrial-control applications, ranging from manufacturing to avionics; scientific applications, perhaps most spectacularly in the adaptive optics used by large Earth-bound telescopes to de-twinkle starlight; military applications, including the afore-mentioned avionics; and financial-services applications, where the first computer to recognize an opportunity is likely to reap most of the profit. These four areas could be characterized as "in search of production", "in search of life", "in search of death", and "in search of money".

Financial-services applications differ subtly from applications in the other three categories in that money is
[Figure: comparison of approximate real-time response capabilities, ranging from scripting languages at roughly 1 s, through the Linux 2.4 kernel (roughly 100 ms), real-time Java with and without garbage collection, the Linux 2.6.x/3.x and 4.x/5.x kernels, the Linux -rt patchset, and specialty RTOSes without MMUs, down to hand-coded assembly, custom digital hardware (roughly 10 ns), and custom analog hardware (roughly 100 ps).]

[Figure: RTOS processes and Linux processes running alongside a Linux kernel whose labeled components include RCU read-side critical sections, spinlock critical sections, interrupt handlers, scheduling, the clock interrupt, and interrupt- and preempt-disabled regions.]
maintenance with changes in both hardware and kernel. Furthermore, each such RTOS often has its own system-call interface and set of system libraries, which can balkanize both ecosystems and developers. In fact, these problems seem to be what drove the combination of RTOSes with Linux, as this approach allowed access to the full real-time capabilities of the RTOS, while allowing the application's non-real-time code full access to Linux's open-source ecosystem.

Although pairing RTOSes with the Linux kernel was a clever and useful short-term response during the time that the Linux kernel had minimal real-time capabilities, it also motivated adding real-time capabilities to the Linux kernel. Progress towards this goal is shown in Figure 14.7. The upper row shows a diagram of the Linux kernel with preemption disabled, thus having essentially no real-time capabilities. The middle row shows a set of diagrams showing the increasing real-time capabilities of the mainline Linux kernel with preemption enabled. Finally, the bottom row shows a diagram of the Linux kernel with the -rt patchset applied, maximizing real-time capabilities. Functionality from the -rt patchset is added to mainline, hence the increasing capabilities of the mainline Linux kernel over time. Nevertheless, the most demanding real-time applications continue to use the -rt patchset.

The non-preemptible kernel shown at the top of Figure 14.7 is built with CONFIG_PREEMPT=n, so that execution within the Linux kernel cannot be preempted. This means that the kernel's real-time response latency is bounded below by the longest code path in the Linux kernel, which is indeed long. However, user-mode execution is preemptible, so that one of the real-time Linux processes shown in the upper right may preempt any of the non-real-time Linux processes shown in the upper left anytime the non-real-time process is executing in user mode.

The middle row of Figure 14.7 shows three stages (from left to right) in the development of Linux's preemptible kernels. In all three stages, most process-level code within the Linux kernel can be preempted. This of course greatly improves real-time response latency, but preemption is still disabled within RCU read-side critical sections, spinlock critical sections, interrupt handlers, interrupt-disabled code regions, and preempt-disabled code regions, as indicated by the red boxes in the left-most diagram in the middle row of the figure. The advent of preemptible RCU allowed RCU read-side critical sections to be preempted, as shown in the central diagram, and the advent of threaded interrupt handlers allowed device-interrupt handlers to be preempted, as shown in the right-most diagram. Of course, a great deal of other real-time functionality was added during this time, however, it cannot be as easily represented on this diagram. It will instead be discussed in Section 14.3.5.1.

The bottom row of Figure 14.7 shows the -rt patchset, which features threaded (and thus preemptible) interrupt handlers for many devices, which also allows the corresponding "interrupt-disabled" regions of these drivers to be preempted. These drivers instead use locking to coordinate the process-level portions of each driver with its threaded interrupt handlers. Finally, in some cases, disabling of preemption is replaced by disabling of migration. These measures result in excellent response times in many systems running the -rt patchset [RMF19, dOCdO19].

A final approach is simply to get everything out of the way of the real-time process, clearing all other processing off of any CPUs that this process needs, as shown in Figure 14.8. This was implemented in the 3.10 Linux kernel via the CONFIG_NO_HZ_FULL Kconfig parameter [Cor13, Wei12]. It is important to note that this approach requires at least one housekeeping CPU to do background processing, for example running kernel daemons. However, when there is only one runnable task on a given non-housekeeping CPU, scheduling-clock interrupts are shut off on that CPU, removing an important source of interference and OS jitter. With a few exceptions, the kernel does not force other processing off of the non-housekeeping CPUs, but instead simply provides better performance when only one runnable task is present on a given CPU. Any number of userspace tools may be used to force a given CPU to have no more than one runnable task. If configured properly, a non-trivial undertaking, CONFIG_NO_HZ_FULL offers real-time threads levels of performance that come close to those of bare-metal systems [ACA+ 18].
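By way of illustration, one hypothetical configuration along these lines boots a CONFIG_NO_HZ_FULL=y kernel with command-line parameters such as nohz_full=3 rcu_nocbs=3 (the CPU number 3 is an arbitrary choice for this example) and then has the real-time application pin itself onto that CPU, for example as in the following sketch. The move_to_isolated_cpu() name is invented, and the application itself must still ensure that it is the only runnable task there.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Pin the calling thread to CPU 3, assumed to have been booted with
     * nohz_full=3 rcu_nocbs=3 so that it is an adaptive-tick CPU. */
    static void move_to_isolated_cpu(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(3, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }
    }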
There has of course been much debate over which of these approaches is best for real-time systems, and this debate has been going on for quite some time [Cor04a, Cor04c]. As usual, the answer seems to be "It depends," as discussed in the following sections. Section 14.3.5.1 considers event-driven real-time systems, and Section 14.3.5.2 considers real-time systems that use a CPU-bound polling loop.

14.3.5.1 Event-Driven Real-Time Support

The operating-system support required for event-driven real-time applications is quite extensive, however, this section will focus on only a few items, namely timers,
[Figure 14.7: Linux-kernel real-time configurations, from the non-preemptible CONFIG_PREEMPT=n kernel, through three increasingly preemptible mainline configurations, to the -rt patchset. Each diagram shows RT and non-RT Linux processes above kernel components labeled RCU read-side critical sections, spinlock critical sections, interrupt handlers, scheduling, the clock interrupt, and interrupt- and preempt-disabled regions.]
[Figure 14.8: clearing other processing off of the CPUs needed by the real-time process: RT Linux processes run on dedicated CPUs while the remaining Linux processes are confined to the housekeeping CPUs.]
[Figure 14.12: without threading, an interrupt forces mainline code to wait for a long-running interrupt handler before the return from interrupt; the long latency degrades response time. Figure 14.13: with threaded interrupts, the interrupt handler merely wakes a preemptible IRQ thread, so mainline code resumes after only a short latency, improving response time.]
Threaded interrupts are used to address a significant source of degraded real-time latencies, namely long-running interrupt handlers, as shown in Figure 14.12. These latencies can be especially problematic for devices that can deliver a large number of events with a single interrupt, which means that the interrupt handler will run for an extended period of time processing all of these events. Worse yet are devices that can deliver new events to a still-running interrupt handler, as such an interrupt handler might well run indefinitely, thus indefinitely degrading real-time latencies.

One way of addressing this problem is the use of threaded interrupts shown in Figure 14.13. Interrupt handlers run in the context of a preemptible IRQ thread, which runs at a configurable priority. The device interrupt handler then runs for only a short time, just long enough to make the IRQ thread aware of the new event. As shown in the figure, threaded interrupts can greatly improve real-time latencies, in part because interrupt handlers running in the context of the IRQ thread may be preempted by high-priority real-time threads.

However, there is no such thing as a free lunch, and there are downsides to threaded interrupts. One downside is increased interrupt latency. Instead of immediately running the interrupt handler, the handler's execution is deferred until the IRQ thread gets around to running it. Of course, this is not a problem unless the device generating the interrupt is on the real-time application's critical path.

Another downside is that poorly written high-priority real-time code might starve the interrupt handler, for example, preventing networking code from running, in turn making it very difficult to debug the problem. Developers must therefore take great care when writing high-priority real-time code. This has been dubbed the Spiderman principle: With great power comes great responsibility.

Priority inheritance is used to handle priority inversion, which can be caused by, among other things, locks acquired by preemptible interrupt handlers [SRL90]. Suppose that a low-priority thread holds a lock, but is preempted by a group of medium-priority threads, at least one such thread per CPU. If an interrupt occurs, a high-priority IRQ thread will preempt one of the medium-priority threads, but only until it decides to acquire the lock held by the low-priority thread. Unfortunately, the low-priority thread cannot release the lock until it starts running, which
functions enter an RCU read-side critical section and both the rt_read_unlock() and the rt_write_unlock() functions exit that critical section. This is necessary because non-realtime kernels' reader-writer locking functions disable preemption across their critical sections, and there really are reader-writer locking use cases that rely on the fact that synchronize_rcu() will therefore wait for all pre-existing reader-writer-lock critical sections to complete. Let this be a lesson to you: Understanding what your users really need is critically important to correct operation, not just to performance. Not only that, but what your users really need changes over time.

This has the side-effect that all of a -rt kernel's reader-writer locking critical sections are subject to RCU priority boosting. This provides at least a partial solution to the problem of reader-writer lock readers being preempted for extended periods of time.

It is also possible to avoid reader-writer lock priority inversion by converting the reader-writer lock to RCU, as briefly discussed in the next section.

Preemptible RCU can sometimes be used as a replacement for reader-writer locking [MW07, MBWW12, McK14f], as was discussed in Section 9.5. Where it can be used, it permits readers and updaters to run concurrently, which prevents low-priority readers from inflicting any sort of priority-inversion scenario on high-priority updaters. However, for this to be useful, it is necessary to be able to preempt long-running RCU read-side critical sections [GMTW08]. Otherwise, long RCU read-side critical sections would result in excessive real-time latencies.

A preemptible RCU implementation was therefore added to the Linux kernel. This implementation avoids the need to individually track the state of each and every task in the kernel by keeping lists of tasks that have been preempted within their current RCU read-side critical sections. A grace period is permitted to end: (1) Once all CPUs have completed any RCU read-side critical sections that were in effect before the start of the current grace period and (2) Once all tasks that were preempted while in one of those pre-existing critical sections have removed themselves from their lists. A simplified version of this implementation is shown in Listing 14.3.

Listing 14.3: Preemptible Linux-Kernel RCU
 1 void __rcu_read_lock(void)
 2 {
 3   current->rcu_read_lock_nesting++;
 4   barrier();
 5 }
 6
 7 void __rcu_read_unlock(void)
 8 {
 9   barrier();
10   if (!--current->rcu_read_lock_nesting)
11     barrier();
12   if (READ_ONCE(current->rcu_read_unlock_special.s)) {
13     rcu_read_unlock_special(t);
14   }
15 }

The __rcu_read_lock() function spans lines 1–5 and the __rcu_read_unlock() function spans lines 7–15.

Line 3 of __rcu_read_lock() increments a per-task count of the number of nested rcu_read_lock() calls, and line 4 prevents the compiler from reordering the subsequent code in the RCU read-side critical section to precede the rcu_read_lock().

Line 9 of __rcu_read_unlock() prevents the compiler from reordering the code in the critical section with the remainder of this function. Line 10 decrements the nesting count and checks to see if it has become zero, in other words, if this corresponds to the outermost rcu_read_unlock() of a nested set. If so, line 11 prevents the compiler from reordering this nesting update with line 12's check for special handling. If special handling is required, then the call to rcu_read_unlock_special() on line 13 carries it out.

There are several types of special handling that can be required, but we will focus on that required when the RCU read-side critical section has been preempted. In this case, the task must remove itself from the list that it was added to when it was first preempted within its RCU read-side critical section. However, it is important to note that these lists are protected by locks, which means that rcu_read_unlock() is no longer lockless. However, the highest-priority threads will not be preempted, and therefore, for those highest-priority threads, rcu_read_unlock() will never attempt to acquire any locks. In addition, if implemented carefully, locking can be used to synchronize real-time software [Bra11, SM04a].

Quick Quiz 14.11: Suppose that preemption occurs just after the load from t->rcu_read_unlock_special.s on line 12 of Listing 14.3. Mightn't that result in the task failing to invoke rcu_read_unlock_special(), thus failing to remove itself from the list of tasks blocking the current grace period, in turn causing that grace period to extend indefinitely?

Another important real-time feature of RCU, whether preemptible or not, is the ability to offload RCU callback execution to a kernel thread. To use this, your kernel must be built with CONFIG_RCU_NOCB_CPU=y and booted with the rcu_nocbs= kernel boot parameter specifying which CPUs are to be offloaded. Alternatively, any CPU speci-
specified by the nohz_full= kernel boot parameter described in Section 14.3.5.2 will also have its RCU callbacks offloaded.

In short, this preemptible RCU implementation enables real-time response for read-mostly data structures without the delays inherent to priority boosting of large numbers of readers, and also without delays due to callback invocation.

Preemptible spinlocks are an important part of the -rt patchset due to the long-duration spinlock-based critical sections in the Linux kernel. This functionality has not yet reached mainline: Although they are a conceptually simple substitution of sleeplocks for spinlocks, they have proven relatively controversial. In addition, the real-time functionality that is already in the mainline Linux kernel suffices for a great many use cases, which slowed the -rt patchset's development rate in the early 2010s [Edg13, Edg14]. However, preemptible spinlocks are absolutely necessary to the task of achieving real-time latencies down in the tens of microseconds. Fortunately, the Linux Foundation organized an effort to fund moving the remaining code from the -rt patchset to mainline.

Per-CPU variables are used heavily in the Linux kernel for performance reasons. Unfortunately for real-time applications, many use cases for per-CPU variables require coordinated update of multiple such variables, which is normally provided by disabling preemption, which in turn degrades real-time latencies. Real-time applications clearly need some other way of coordinating per-CPU variable updates.

One alternative is to supply per-CPU spinlocks, which as noted above are actually sleeplocks, so that their critical sections can be preempted and so that priority inheritance is provided. In this approach, code updating groups of per-CPU variables must acquire the current CPU's spinlock, carry out the update, then release whichever lock is acquired, keeping in mind that a preemption might have resulted in a migration to some other CPU. However, this approach introduces both overhead and deadlocks.

Another alternative, which is used in the -rt patchset as of early 2021, is to convert preemption disabling to migration disabling. This ensures that a given kernel thread remains on its CPU through the duration of the per-CPU-variable update, but could also allow some other kernel thread to intersperse its own update of those same variables, courtesy of preemption. There are cases such as statistics gathering where this is not a problem. In the surprisingly rare case where such mid-update preemption is a problem, the use case at hand must properly synchronize the updates, perhaps through a set of per-CPU locks specific to that use case. Although introducing locks again introduces the possibility of deadlock, the per-use-case nature of these locks makes any such deadlocks easier to manage and avoid.

Closing event-driven remarks. There are of course any number of other Linux-kernel components that are critically important to achieving world-class real-time latencies, for example, deadline scheduling [dO18b, dO18a]. However, those listed in this section give a good feeling for the workings of the Linux kernel augmented by the -rt patchset.

14.3.5.2 Polling-Loop Real-Time Support

At first glance, use of a polling loop might seem to avoid all possible operating-system interference problems. After all, if a given CPU never enters the kernel, the kernel is completely out of the picture. And the traditional approach to keeping the kernel out of the way is simply not to have a kernel, and many real-time applications do indeed run on bare metal, particularly those running on eight-bit microcontrollers.

One might hope to get bare-metal performance on a modern operating-system kernel simply by running a single CPU-bound user-mode thread on a given CPU, avoiding all causes of interference. Although the reality is of course more complex, it is becoming possible to do just that, courtesy of the NO_HZ_FULL implementation led by Frederic Weisbecker [Cor13, Wei12] that was accepted into version 3.10 of the Linux kernel. Nevertheless, considerable care is required to properly set up such an environment, as it is necessary to control a number of possible sources of OS jitter. The discussion below covers the control of several sources of OS jitter, including device interrupts, kernel threads and daemons, scheduler real-time throttling (this is a feature, not a bug!), timers, non-real-time device drivers, in-kernel global synchronization, scheduling-clock interrupts, page faults, and finally, non-real-time hardware and firmware.

Interrupts are an excellent source of large amounts of OS jitter. Unfortunately, in most cases interrupts are absolutely required in order for the system to communicate with the outside world. One way of resolving this conflict between OS jitter and maintaining contact with the outside world is to reserve a small number of housekeeping CPUs, and to force all interrupts to these CPUs. The
Documentation/IRQ-affinity.txt file in the Linux source tree describes how to direct device interrupts to specified CPUs, which as of early 2021 involves something like the following:

$ echo 0f > /proc/irq/44/smp_affinity

This command would confine interrupt #44 to CPUs 0–3. Note that scheduling-clock interrupts require special handling, and are discussed later in this section.

A second source of OS jitter is due to kernel threads and daemons. Individual kernel threads, such as RCU's grace-period kthreads (rcu_bh, rcu_preempt, and rcu_sched), may be forced onto any desired CPUs using the taskset command, the sched_setaffinity() system call, or cgroups.

Per-CPU kthreads are often more challenging, sometimes constraining hardware configuration and workload layout. Preventing OS jitter from these kthreads requires either that certain types of hardware not be attached to real-time systems, that all interrupts and I/O initiation take place on housekeeping CPUs, that special kernel Kconfig or boot parameters be selected in order to direct work away from the worker CPUs, or that worker CPUs never enter the kernel. Specific per-kthread advice may be found in the Linux kernel source Documentation directory at kernel-per-CPU-kthreads.txt.

A third source of OS jitter in the Linux kernel for CPU-bound threads running at real-time priority is the scheduler itself. This is an intentional debugging feature, designed to ensure that important non-realtime work is allotted at least 50 milliseconds out of each second, even if there is an infinite-loop bug in your real-time application. However, when you are running a polling-loop-style real-time application, you will need to disable this debugging feature. This can be done as follows:

$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us

You will of course need to be running as root to execute this command, and you will also need to carefully consider the aforementioned Spiderman principle. One way to minimize the risks is to offload interrupts and kernel threads/daemons from all CPUs running CPU-bound real-time threads, as described in the paragraphs above. In addition, you should carefully read the material in the Documentation/scheduler directory. The material in the sched-rt-group.rst file is particularly important, especially if you are using the cgroups real-time features enabled by the CONFIG_RT_GROUP_SCHED Kconfig parameter.

A fourth source of OS jitter comes from timers. In most cases, keeping a given CPU out of the kernel will prevent timers from being scheduled on that CPU. One important exception is recurring timers, where a given timer handler posts a later occurrence of that same timer. If such a timer gets started on a given CPU for any reason, that timer will continue to run periodically on that CPU, inflicting OS jitter indefinitely. One crude but effective way to offload recurring timers is to use CPU hotplug to offline all worker CPUs that are to run CPU-bound real-time application threads, online these same CPUs, then start your real-time application.

A fifth source of OS jitter is provided by device drivers that were not intended for real-time use. For an old canonical example, in 2005, the VGA driver would blank the screen by zeroing the frame buffer with interrupts disabled, which resulted in tens of milliseconds of OS jitter. One way of avoiding device-driver-induced OS jitter is to carefully select devices that have been used heavily in real-time systems, and which have therefore had their real-time bugs fixed. Another way is to confine the device's interrupts and all code using that device to designated housekeeping CPUs. A third way is to test the device's ability to support real-time workloads and fix any real-time bugs.8

8 If you take this approach, please submit your fixes upstream so that others can benefit. After all, when you need to port your application to a later version of the Linux kernel, you will be one of those "others".

A sixth source of OS jitter is provided by some in-kernel full-system synchronization algorithms, perhaps most notably the global TLB-flush algorithm. This can be avoided by avoiding memory-unmapping operations, and especially avoiding unmapping operations within the kernel. As of early 2021, the way to avoid in-kernel unmapping operations is to avoid unloading kernel modules.

A seventh source of OS jitter is provided by scheduling-clock interrupts and RCU callback invocation. These may be avoided by building your kernel with the NO_HZ_FULL Kconfig parameter enabled, and then booting with the nohz_full= parameter specifying the list of worker CPUs that are to run real-time threads. For example, nohz_full=2-7 would designate CPUs 2, 3, 4, 5, 6, and 7 as worker CPUs, thus leaving CPUs 0 and 1 as housekeeping CPUs. The worker CPUs would not incur scheduling-clock interrupts as long as there is no more than one runnable task on each worker CPU, and each worker CPU's RCU callbacks would be invoked on one of the housekeeping CPUs. A CPU that has suppressed scheduling-clock interrupts due to there only being one
runnable task on that CPU is said to be in adaptive ticks mode or in nohz_full mode. It is important to ensure that you have designated enough housekeeping CPUs to handle the housekeeping load imposed by the rest of the system, which requires careful benchmarking and tuning.
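As a concrete sketch (the CPU lists are hypothetical and workload-specific), a system dedicating CPUs 2–7 to the real-time workload might be booted with kernel parameters along the following lines, leaving CPUs 0 and 1 as housekeeping CPUs:

    nohz_full=2-7 rcu_nocbs=2-7

As noted in Section 14.3.5.1, CPUs listed in nohz_full= have their RCU callbacks offloaded even without an explicit rcu_nocbs=, so the second parameter merely makes the offloading explicit. Whatever lists are chosen must then be validated by the benchmarking and tuning called out above.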
An eighth source of OS jitter is page faults. Because most Linux implementations use an MMU for memory protection, real-time applications running on these systems can be subject to page faults. Use the mlock() and mlockall() system calls to pin your application's pages into memory, thus avoiding major page faults. Of course, the Spiderman principle applies, because locking down too much memory may prevent the system from getting other work done.
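Because mlockall() is a standard POSIX system call, pinning the entire application is a one-liner best issued early in main(), before any real-time deadlines are in effect. A minimal sketch (error handling reduced to a message):

    #include <stdio.h>
    #include <sys/mman.h>

    /* Pin all current and future pages, avoiding page-fault-induced jitter. */
    static void pin_all_memory(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall"); /* may require CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK */
    }

Keep the Spiderman principle in mind: pinning everything can starve the rest of the system of memory.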
A ninth source of OS jitter is unfortunately the hardware and firmware. It is therefore important to use systems that have been designed for real-time use.

Unfortunately, this list of OS-jitter sources can never be complete, as it will change with each new version of the kernel. This makes it necessary to be able to track down additional sources of OS jitter. Given a CPU N running a CPU-bound usermode thread, the commands shown in Listing 14.4 will produce a list of all the times that this CPU entered the kernel. Of course, the N on line 5 must be replaced with the number of the CPU in question, and the 1 on line 2 may be increased to show additional levels of function call within the kernel. The resulting trace can help track down the source of the OS jitter.

Listing 14.4: Locating Sources of OS Jitter
1 cd /sys/kernel/debug/tracing
2 echo 1 > max_graph_depth
3 echo function_graph > current_tracer
4 # run workload
5 cat per_cpu/cpuN/trace

As always, there is no free lunch, and NO_HZ_FULL is no exception. As noted earlier, NO_HZ_FULL makes kernel/user transitions more expensive due to the need for delta process accounting and the need to inform kernel subsystems (such as RCU) of the transitions. As a rough rule of thumb, NO_HZ_FULL helps with many types of real-time and heavy-compute workloads, but hurts other workloads that feature high rates of system calls and I/O [ACA+18]. Additional limitations, tradeoffs, and configuration advice may be found in Documentation/timers/no_hz.rst.

As you can see, obtaining bare-metal performance when running CPU-bound real-time threads on a general-purpose OS such as Linux requires painstaking attention to detail. Automation would of course help, and some automation has been applied, but given the relatively small number of users, automation can be expected to appear relatively slowly. Nevertheless, the ability to gain near-bare-metal performance while running a general-purpose operating system promises to ease construction of some types of real-time systems.

14.3.6 Implementing Parallel Real-Time Applications

Developing real-time applications is a wide-ranging topic, and this section can only touch on a few aspects. To this end, Section 14.3.6.1 looks at a few software components commonly used in real-time applications, Section 14.3.6.2 provides a brief overview of how polling-loop-based applications may be implemented, Section 14.3.6.3 gives a similar overview of streaming applications, and Section 14.3.6.4 briefly covers event-based applications.

14.3.6.1 Real-Time Components

As in all areas of engineering, a robust set of components is essential to productivity and reliability. This section is not a full catalog of real-time software components—such a catalog would fill multiple books—but rather a brief overview of the types of components available.

A natural place to look for real-time software components would be algorithms offering wait-free synchronization [Her91], and in fact lockless algorithms are very important to real-time computing. However, wait-free synchronization only guarantees forward progress in finite time. Although a century is finite, this is unhelpful when your deadlines are measured in microseconds, let alone milliseconds.

Nevertheless, there are some important wait-free algorithms that do provide bounded response time, including atomic test and set, atomic exchange, atomic fetch-and-add, single-producer/single-consumer FIFO queues based on circular arrays, and numerous per-thread partitioned algorithms. In addition, recent research has confirmed the observation that algorithms with lock-free guarantees9 also provide the same latencies in practice (in the wait-free sense), assuming a stochastically fair scheduler and absence of fail-stop bugs [ACHS13]. This means that many non-wait-free stacks and queues are nevertheless appropriate for real-time use.

9 Wait-free algorithms guarantee that all threads make progress in finite time, while lock-free algorithms only guarantee that at least one thread will make progress in finite time. See Section 14.2 for more details.
Quick Quiz 14.12: But isn't correct operation despite fail-stop bugs a valuable fault-tolerance property?

In practice, locking is often used in real-time programs, theoretical concerns notwithstanding. However, under more severe constraints, lock-based algorithms can also provide bounded latencies [Bra11]. These constraints include:

1. Fair scheduler. In the common case of a fixed-priority scheduler, the bounded latencies are provided only to the highest-priority threads.

2. Sufficient bandwidth to support the workload. An implementation rule supporting this constraint might be "There will be at least 50 % idle time on all CPUs during normal operation," or, more formally, "The offered load will be sufficiently low to allow the workload to be schedulable at all times."

3. No fail-stop bugs.

4. FIFO locking primitives with bounded acquisition, handoff, and release latencies. Again, in the common case of a locking primitive that is FIFO within priorities, the bounded latencies are provided only to the highest-priority threads.

5. Some way of preventing unbounded priority inversion. The priority-ceiling and priority-inheritance disciplines mentioned earlier in this chapter suffice.

6. Bounded nesting of lock acquisitions. We can have an unbounded number of locks, but only as long as a given thread never acquires more than a few of them (ideally only one of them) at a time.

7. Bounded number of threads. In combination with the earlier constraints, this constraint means that there will be a bounded number of threads waiting on any given lock.

8. Bounded time spent in any given critical section. Given a bounded number of threads waiting on any given lock and a bounded critical-section duration, the wait time will be bounded.

Quick Quiz 14.13: I couldn't help but spot the word "includes" before this list. Are there other constraints?

This result opens a vast cornucopia of algorithms and data structures for use in real-time software—and validates long-standing real-time practice.

Of course, a careful and simple application design is also extremely important. The best real-time components in the world cannot make up for a poorly thought-out design. For parallel real-time applications, synchronization overheads clearly must be a key component of the design.

14.3.6.2 Polling-Loop Applications

Many real-time applications consist of a single CPU-bound loop that reads sensor data, computes a control law, and writes control output. If the hardware registers providing sensor data and taking control output are mapped into the application's address space, this loop might be completely free of system calls. But beware of the Spiderman principle: With great power comes great responsibility, in this case the responsibility to avoid bricking the hardware by making inappropriate references to the hardware registers.

This arrangement is often run on bare metal, without the benefits of (or the interference from) an operating system. However, increasing hardware capability and increasing levels of automation motivate increasing software functionality, for example, user interfaces, logging, and reporting, all of which can benefit from an operating system.

One way of gaining much of the benefit of running on bare metal while still having access to the full features and functions of a general-purpose operating system is to use the Linux kernel's NO_HZ_FULL capability, described in Section 14.3.5.2.

14.3.6.3 Streaming Applications

One type of big-data real-time application takes input from numerous sources, processes it internally, and outputs alerts and summaries. These streaming applications are often highly parallel, processing different information sources concurrently.

One approach for implementing streaming applications is to use dense-array circular FIFOs to connect different processing steps [Sut13]. Each such FIFO has only a single thread producing into it and a (presumably different) single thread consuming from it. Fan-in and fan-out points use threads rather than data structures, so if the output of several FIFOs needed to be merged, a separate thread would input from them and output to another FIFO for which this separate thread was the sole producer. Similarly, if the output of a given FIFO needed to be split, a separate thread would input from this FIFO and output to several FIFOs as needed.
This discipline might seem restrictive, but it allows communication among threads with minimal synchronization overhead, and minimal synchronization overhead is important when attempting to meet tight latency constraints. This is especially true when the amount of processing for each step is small, so that the synchronization overhead is significant compared to the processing overhead.

The individual threads might be CPU-bound, in which case the advice in Section 14.3.6.2 applies. On the other hand, if the individual threads block waiting for data from their input FIFOs, the advice of the next section applies.

14.3.6.4 Event-Driven Applications

We will use fuel injection into a mid-sized industrial engine as a fanciful example for event-driven applications. Under normal operating conditions, this engine requires that the fuel be injected within a one-degree interval surrounding top dead center. If we assume a 1,500-RPM rotation rate, we have 25 rotations per second, or about 9,000 degrees of rotation per second, which translates to 111 microseconds per degree. We therefore need to schedule the fuel injection to within a time interval of about 100 microseconds.

Suppose that a timed wait was to be used to initiate fuel injection, although if you are building an engine, I hope you supply a rotation sensor. We need to test the timed-wait functionality, perhaps using the test program shown in Listing 14.5. Unfortunately, if we run this program, we can get unacceptable timer jitter, even in a -rt kernel.

Listing 14.5: Timed-Wait Test Program
 1 if (clock_gettime(CLOCK_REALTIME, &timestart) != 0) {
 2   perror("clock_gettime 1");
 3   exit(-1);
 4 }
 5 if (nanosleep(&timewait, NULL) != 0) {
 6   perror("nanosleep");
 7   exit(-1);
 8 }
 9 if (clock_gettime(CLOCK_REALTIME, &timeend) != 0) {
10   perror("clock_gettime 2");
11   exit(-1);
12 }

One problem is that POSIX CLOCK_REALTIME is, oddly enough, not intended for real-time use. Instead, it means "realtime" as opposed to the amount of CPU time consumed by a process or thread. For real-time use, you should instead use CLOCK_MONOTONIC. However, even with this change, results are still unacceptable.

Another problem is that the thread must be raised to a real-time priority by using the sched_setscheduler() system call. But even this change is insufficient, because we can still see page faults. We also need to use the mlockall() system call to pin the application's memory, preventing page faults. With all of these changes, results might finally be acceptable.

In other situations, further adjustments might be needed. It might be necessary to affinity time-critical threads onto their own CPUs, and it might also be necessary to affinity interrupts away from those CPUs. It might be necessary to carefully select hardware and drivers, and it will very likely be necessary to carefully select kernel configuration. As can be seen from this example, real-time computing can be quite unforgiving.

14.3.6.5 The Role of RCU

Suppose that you are writing a parallel real-time application that needs to access data that is subject to gradual change, perhaps due to changes in temperature, humidity, and barometric pressure. The real-time response constraints on this program are so severe that it is not permissible to spin or block, thus ruling out locking, nor is it permissible to use a retry loop, thus ruling out sequence locks and hazard pointers. Fortunately, the temperature and pressure are normally controlled, so that a default hard-coded set of data is usually sufficient.

However, the temperature, humidity, and pressure occasionally deviate too far from the defaults, and in such situations it is necessary to provide data that replaces the defaults. Because the temperature, humidity, and pressure change gradually, providing the updated values is not a matter of urgency, though it must happen within a few minutes. The program is to use a global pointer imaginatively named cur_cal that normally references default_cal, which is a statically allocated and initialized structure that contains the default calibration values in fields imaginatively named a, b, and c. Otherwise, cur_cal points to a dynamically allocated structure providing the current calibration values.

Listing 14.6 shows how RCU can be used to solve this problem. Lookups are deterministic, as shown in calc_control() on lines 9–15, consistent with real-time requirements. Updates are more complex, as shown by update_cal() on lines 17–35.

Quick Quiz 14.14: Given that real-time systems are often used for safety-critical applications, and given that runtime memory allocation is forbidden in many safety-critical situations, what is with the call to malloc()???
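Listing 14.6 itself did not survive the extraction here, so the following is only a rough sketch of the approach the prose describes, reusing the pointer, field, and function names given above; the default values, the control-law arithmetic, and the error handling are placeholders rather than the book's code, and an environment providing kernel-style rcu_read_lock(), rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() (for example, a userspace RCU library) is assumed.

    struct calibration {
        short a;
        short b;
        short c;
    };

    static struct calibration default_cal = { 1, 1, 1 };  /* placeholder defaults */
    static struct calibration *cur_cal = &default_cal;

    /* Deterministic lookup: no spinning, blocking, or retry loops. */
    short calc_control(short t, short h, short press)
    {
        struct calibration *p;
        short ret;

        rcu_read_lock();
        p = rcu_dereference(cur_cal);
        ret = p->a * t + p->b * h + p->c * press;  /* placeholder control law */
        rcu_read_unlock();
        return ret;
    }

    /* Non-real-time update: publish a new structure, then free the old one. */
    int update_cal(short a, short b, short c)
    {
        struct calibration *p;
        struct calibration *old_p = cur_cal;  /* updates assumed serialized */

        p = malloc(sizeof(*p));
        if (!p)
            return -1;
        p->a = a;
        p->b = b;
        p->c = c;
        rcu_assign_pointer(cur_cal, p);
        synchronize_rcu();  /* wait for pre-existing calc_control() readers */
        if (old_p != &default_cal)
            free(old_p);
        return 0;
    }

The point of the pattern is that calc_control()'s cost is bounded regardless of concurrent updates, while update_cal() pays the potentially lengthy but decidedly non-real-time cost of synchronize_rcu().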
Chapter 15: Advanced Synchronization: Memory Ordering

The art of progress is to preserve order amid change and to preserve change amid order.

Causality and sequencing are deeply intuitive, and hackers often have a strong grasp of these concepts. These intuitions can be quite helpful when writing, analyzing, and debugging not only sequential code, but also parallel code that makes use of standard mutual-exclusion mechanisms such as locking. Unfortunately, these intuitions break down completely in the face of code that fails to use such mechanisms. One example of such code implements the standard mutual-exclusion mechanisms themselves, while another example implements fast paths that use weaker synchronization. Insults to intuition notwithstanding, some argue that weakness is a virtue [Alg13]. Virtue or vice, this chapter will help you gain an understanding of memory ordering that, with practice, will be sufficient to implement synchronization primitives and performance-critical fast paths.

Section 15.1 will demonstrate that real computer systems can reorder memory references, give some reasons why they do so, and provide some information on how to prevent undesired reordering. Sections 15.2 and 15.3 will cover the types of pain that hardware and compilers, respectively, can inflict on unwary parallel programmers. Section 15.4 gives an overview of the benefits of modeling memory ordering at higher levels of abstraction. Section 15.5 follows up with more detail on a few representative hardware platforms. Finally, Section 15.6 provides some useful rules of thumb.

Quick Quiz 15.1: This chapter has been rewritten since the first edition. Did memory ordering change all that since 2014?

15.1 Ordering: Why and How?

Nothing is orderly till people take hold of it. Everything in creation lies around loose.

Henry Ward Beecher, updated

One motivation for memory ordering can be seen in the trivial-seeming litmus test in Listing 15.1 (C-SB+o-o+o-o.litmus), which at first glance might appear to guarantee that the exists clause never triggers.1 After all, if 0:r2=0 as shown in the exists clause,2 we might hope that Thread P0()'s load from x1 into r2 must have happened before Thread P1()'s store to x1, which might raise further hopes that Thread P1()'s load from x0 into r2 must happen after Thread P0()'s store to x0, so that 1:r2=2, thus not triggering the exists clause. The example is symmetric, so similar reasoning might lead us to hope that 1:r2=0 guarantees that 0:r2=2. Unfortunately, the lack of memory barriers dashes these hopes. The CPU is within its rights to reorder the statements within both Thread P0() and Thread P1(), even on relatively strongly ordered systems such as x86.

Quick Quiz 15.2: The compiler can also reorder Thread P0()'s and Thread P1()'s memory accesses in Listing 15.1, right?

This willingness to reorder can be confirmed using tools such as litmus7 [AMT14], which found that the counter-intuitive ordering happened 314 times out of 100,000,000

1 Purists would instead insist that the exists clause is never satisfied, but we use "trigger" here by analogy with assertions.
2 That is, Thread P0()'s instance of local variable r2 equals zero.
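Listing 15.1 itself was lost in this extraction. Judging from its file name and from the later description of Listing 15.2 as identical except for the added smp_mb() invocations, it is presumably equivalent to the following sketch:

C C-SB+o-o+o-o

{}

P0(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x0, 2);
  r2 = READ_ONCE(*x1);
}

P1(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x1, 2);
  r2 = READ_ONCE(*x0);
}

exists (1:r2=0 /\ 0:r2=0)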
Table 15.1: Store-Buffering Sequence of Events (no barriers)

                     CPU 0                                        CPU 1
  Instruction        Store Buffer  Cache      Instruction        Store Buffer  Cache
1 (Initial state)                  x1==0      (Initial state)                  x0==0
2 x0 = 2;            x0==2         x1==0      x1 = 2;            x1==2         x0==0
3 r2 = x1; (0)       x0==2         x1==0      r2 = x0; (0)       x1==2         x0==0
4 (Read-invalidate)  x0==2         x0==0      (Read-invalidate)  x1==2         x1==0
5 (Finish store)                   x0==2      (Finish store)                   x1==2
as they often are on x86. Since these standard synchronization primitives preserve the illusion of ordering, your path of least resistance is to simply use these primitives, thus allowing you to stop reading this section.

However, if you need to implement the synchronization primitives themselves, or if you are simply interested in understanding how memory ordering works, read on! The first stop on the journey is Listing 15.2 (C-SB+o-mb-o+o-mb-o.litmus), which places an smp_mb() Linux-kernel full memory barrier between the store and load in both P0() and P1(), but is otherwise identical to Listing 15.1. These barriers prevent the counter-intuitive outcome from happening on 100,000,000 trials on my x86 laptop. Interestingly enough, the added overhead due to these barriers causes the legal outcome where both loads return the value two to happen more than 800,000 times, as opposed to only 167 times for the barrier-free code in Listing 15.1.

Listing 15.2: Memory Ordering: Store-Buffering Litmus Test
 1 C C-SB+o-mb-o+o-mb-o
 2
 3 {}
 4
 5 P0(int *x0, int *x1)
 6 {
 7   int r2;
 8
 9   WRITE_ONCE(*x0, 2);
10   smp_mb();
11   r2 = READ_ONCE(*x1);
12 }
13
14 P1(int *x0, int *x1)
15 {
16   int r2;
17
18   WRITE_ONCE(*x1, 2);
19   smp_mb();
20   r2 = READ_ONCE(*x0);
21 }
22
23 exists (1:r2=0 /\ 0:r2=0)

These barriers have a profound effect on ordering, as can be seen in Table 15.2. Although the first two rows are the same as in Table 15.1 and although the smp_mb() instructions on row 3 do not change state in and of themselves, they do cause the stores to complete (rows 4 and 5) before the loads (row 6), which rules out the counter-intuitive outcome shown in Table 15.1. Note that variables x0 and x1 each still have more than one value on row 2, however, as promised earlier, the smp_mb() invocations straighten things out in the end.

Although full barriers such as smp_mb() have extremely strong ordering guarantees, their strength comes at a high price in terms of foregone hardware and compiler optimizations. A great many situations can be handled with much weaker ordering guarantees that use much cheaper memory-ordering instructions, or, in some cases, no memory-ordering instructions at all.

Table 15.3 provides a cheatsheet of the Linux kernel's ordering primitives and their guarantees. Each row corresponds to a primitive or category of primitives that might or might not provide ordering, with the columns labeled "Prior Ordered Operation" and "Subsequent Ordered Operation" being the operations that might (or might not) be ordered against. Cells containing "Y" indicate that ordering is supplied unconditionally, while other characters indicate that ordering is supplied only partially or conditionally. Blank cells indicate that no ordering is supplied.

The "Store" row also covers the store portion of an atomic RMW operation. In addition, the "Load" row covers the load component of a successful value-returning _relaxed() RMW atomic operation, although the combined "_relaxed() RMW operation" line provides a convenient combined reference in the value-returning case. A CPU executing unsuccessful value-returning atomic RMW operations must invalidate the corresponding variable from all other CPUs' caches. Therefore, unsuccessful value-returning atomic RMW operations have many of the properties of a store, which means that the "_relaxed() RMW operation" line also applies to unsuccessful value-returning atomic RMW operations.

The *_acquire row covers smp_load_acquire(), cmpxchg_acquire(), xchg_acquire(), and so on; the *_release row covers smp_store_release(), rcu_assign_pointer(), cmpxchg_release(), xchg_release(), and so on; and the "Successful full-strength non-void RMW" row covers atomic_add_return(), atomic_add_unless(), atomic_dec_and_test(), cmpxchg(), xchg(), and so on. The "Successful" qualifiers apply to primitives such as atomic_add_unless(), cmpxchg_acquire(), and cmpxchg_release(), which have no effect on either memory or on ordering when they indicate failure, as indicated by the earlier "_relaxed() RMW operation" row.

Column "C" indicates cumulativity and propagation, as explained in Sections 15.2.7.1 and 15.2.7.2. In the meantime, this column can usually be ignored when there are at most two threads involved.

Quick Quiz 15.5: The rows in Table 15.3 seem quite random and confused. Whatever is the conceptual basis of this table???
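As a small illustration of the *_release and *_acquire rows (a sketch, not one of this chapter's numbered listings; do_something() is a hypothetical consumer of the data), consider the classic message-passing pattern:

    int data;
    int ready;

    void producer(void)
    {
        WRITE_ONCE(data, 42);
        smp_store_release(&ready, 1);   /* orders the data store before the flag store */
    }

    void consumer(void)
    {
        if (smp_load_acquire(&ready))   /* pairs with the release store above */
            do_something(READ_ONCE(data));  /* guaranteed to observe 42 */
    }

The release store and the acquire load must be used as a pair: either one in isolation provides no useful ordering guarantee to the other thread.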
Table 15.2: Store-Buffering Sequence of Events With Barriers

                     CPU 0                                        CPU 1
  Instruction        Store Buffer  Cache      Instruction        Store Buffer  Cache
1 (Initial state)                  x1==0      (Initial state)                  x0==0
2 x0 = 2;            x0==2         x1==0      x1 = 2;            x1==2         x0==0
3 smp_mb();          x0==2         x1==0      smp_mb();          x1==2         x0==0
4 (Read-invalidate)  x0==2         x0==0      (Read-invalidate)  x1==2         x1==0
5 (Finish store)                   x0==2      (Finish store)                   x1==2
6 r2 = x1; (2)                     x1==2      r2 = x0; (2)                     x0==2
Quick Quiz 15.6: Why is Table 15.3 missing smp_mb__after_unlock_lock() and smp_mb__after_spinlock()?

Quick Quiz 15.7: But how can I know that a given project can be designed and coded within the confines of these rules of thumb?

A given thread sees its own accesses in order. This rule assumes that loads and stores from/to shared variables use READ_ONCE() and WRITE_ONCE(), respectively. Otherwise, the compiler can profoundly scramble4 your code, and sometimes the CPU can do a bit of scrambling as well, as discussed in Section 15.5.4.

4 Many compiler writers prefer the word "optimize".

Ordering has conditional if-then semantics. Figure 15.3 illustrates this for memory barriers. Assuming that both memory barriers are strong enough, if CPU 1's access Y1 happens after CPU 0's access Y0, then CPU 1's access X1 is guaranteed to happen after CPU 0's access X0. When in doubt as to which memory barriers are strong enough, smp_mb() will always do the job, albeit at a price.

[Figure 15.3: Memory Barriers Provide Conditional If-Then Ordering]

Quick Quiz 15.8: How can you tell which memory barriers are strong enough for a given use case?

Listing 15.2 is a case in point. The smp_mb() on lines 10 and 19 serve as the barriers, the store to x0 on line 9 as X0, the load from x1 on line 11 as Y0, the store to x1 on line 18 as Y1, and the load from x0 on line 20 as X1. Applying the if-then rule step by step, we know that the store to x1 on line 18 happens after the load from x1 on line 11 if P0()'s local variable r2 is set to the value zero. The if-then rule would then state that the load from x0 on line 20 happens after the store to x0 on line 9. In other words, P1()'s local variable r2 is guaranteed to end up with the value two only if P0()'s local variable r2 ends up with the value zero. This underscores the point that memory ordering guarantees are conditional, not absolute.

Although Figure 15.3 specifically mentions memory barriers, this same if-then rule applies to the rest of the Linux kernel's ordering operations.

Ordering operations must be paired. If you carefully order the operations in one thread, but then fail to do so in another thread, then there is no ordering. Both threads must provide ordering for the if-then rule to apply.5

5 In Section 15.2.7.2, pairing will be generalized to cycles.

Ordering operations almost never speed things up. If you find yourself tempted to add a memory barrier in an attempt to force a prior store to be flushed to memory faster, resist! Adding ordering usually slows things down. Of course, there are situations where adding instructions speeds things up, as was shown by Figure 9.22 on page 161, but careful benchmarking is required in such cases. And even then, it is quite possible that although you sped things up a little bit on your system, you might well have slowed
things down significantly on your users' systems. Or on your future system.

Ordering operations are not magic. When your program is failing due to some race condition, it is often tempting to toss in a few memory-ordering operations in an attempt to barrier your bugs out of existence. A far better reaction is to use higher-level primitives in a carefully designed manner. With concurrent programming, it is almost always better to design your bugs out of existence than to hack them down to lower probabilities.

These are only rough rules of thumb. Although these rules of thumb cover the vast majority of situations seen in actual practice, as with any set of rules of thumb, they do have their limits. The next section will demonstrate some of these limits by introducing trick-and-trap litmus tests that are intended to insult your intuition while increasing your understanding. These litmus tests will also illuminate many of the concepts represented by the Linux-kernel memory-ordering cheat sheet shown in Table 15.3, and can be automatically analyzed given proper tooling [AMM+18]. Section 15.6 will circle back to this cheat sheet, presenting a more sophisticated set of rules of thumb in light of learnings from all the intervening tricks and traps.

Quick Quiz 15.9: Wait!!! Where do I find this tooling that automatically analyzes litmus tests???

15.2 Tricks and Traps

Knowing where the trap is—that's the first step in evading it.

Duke Leto Atreides, "Dune", Frank Herbert

Now that you know that hardware can reorder memory accesses and that you can prevent it from doing so, the next step is to get you to admit that your intuition has a problem. This painful task is taken up by Section 15.2.1, which presents some code demonstrating that scalar variables can take on multiple values simultaneously, and by Sections 15.2.2 through 15.2.7, which show a series of intuitively correct code fragments that fail miserably on real hardware. Once your intuition has made it through the grieving process, later sections will summarize the basic rules that memory ordering follows.

But first, let's take a quick look at just how many values a single variable might have at a single point in time.

15.2.1 Variables With Multiple Values

It is natural to think of a variable as taking on a well-defined sequence of values in a well-defined, global order. Unfortunately, the next stop on the journey says "goodbye" to this comforting fiction. Hopefully, you already started to say "goodbye" in response to row 2 of Tables 15.1 and 15.2, and if so, the purpose of this section is to drive this point home.

To this end, consider the program fragment shown in Listing 15.3. This code fragment is executed in parallel by several CPUs. Line 1 sets a shared variable to the current CPU's ID, line 2 initializes several variables from a gettb() function that delivers the value of a fine-grained hardware "timebase" counter that is synchronized among all CPUs (not available from all CPU architectures, unfortunately!), and the loop from lines 3–8 records the length of time that the variable retains the value that this CPU assigned to it. Of course, one of the CPUs will "win", and would thus never exit the loop if not for the check on lines 6–7.

Listing 15.3: Software Logic Analyzer
1 state.variable = mycpu;
2 lasttb = oldtb = firsttb = gettb();
3 while (state.variable == mycpu) {
4   lasttb = oldtb;
5   oldtb = gettb();
6   if (lasttb - firsttb > 1000)
7     break;
8 }

Quick Quiz 15.10: What assumption is the code fragment in Listing 15.3 making that might not be valid on real hardware?

Upon exit from the loop, firsttb will hold a timestamp taken shortly after the assignment and lasttb will hold a timestamp taken before the last sampling of the shared variable that still retained the assigned value, or a value equal to firsttb if the shared variable had changed before entry into the loop. This allows us to plot each CPU's view of the value of state.variable over a 532-nanosecond time period, as shown in Figure 15.4. This data was collected in 2006 on a 1.5 GHz POWER5 system with 8 cores, each containing a pair of hardware threads. CPUs 1, 2, 3, and 4 recorded the values, while CPU 0 controlled the test. The timebase counter period was about 5.32 ns, sufficiently fine-grained to allow observations of intermediate cache states.
[Figure: per-CPU observations of the value of state.variable, CPUs 1–15, plotted against timebase ticks 0–500.]

[Figure: per-CPU observations of the value of state.variable, CPUs 1–15, plotted against timebase ticks 0–45.]
explicit ordering, for example, the smp_rmb() shown
21 exists (1:r2=2 /\ 0:r2=2)
Listing 15.8: Message-Passing Litmus Test, No Writer Ordering Listing 15.9: Message-Passing Address-Dependency Litmus
(No Ordering) Test (No Ordering Before v4.15)
1 C C-MP+o-o+o-rmb-o 1 C C-MP+o-wmb-o+o-ad-o
2 2
3 {} 3 {
4 4 y=1;
5 P0(int* x0, int* x1) { 5 x1=y;
6 WRITE_ONCE(*x0, 2); 6 }
7 WRITE_ONCE(*x1, 2); 7
8 } 8 P0(int* x0, int** x1) {
9 9 WRITE_ONCE(*x0, 2);
10 P1(int* x0, int* x1) { 10 smp_wmb();
11 int r2; 11 WRITE_ONCE(*x1, x0);
12 int r3; 12 }
13 13
14 r2 = READ_ONCE(*x1); 14 P1(int** x1) {
15 smp_rmb(); 15 int *r2;
16 r3 = READ_ONCE(*x0); 16 int r3;
17 } 17
18 18 r2 = READ_ONCE(*x1);
19 exists (1:r2=2 /\ 1:r3=0) 19 r3 = READ_ONCE(*r2);
20 }
21
22 exists (1:r2=x0 /\ 1:r3=1)
store is ready at hand. Therefore, portable code must
enforce any required ordering, for example, as shown
in Listing 15.7 (C-LB+o-r+a-o.litmus). The smp_ by a later memory-reference instruction. This means that
store_release() and smp_load_acquire() guaran- the exact same sequence of instructions used to traverse
tee that the exists clause on line 21 never triggers. a linked data structure in single-threaded code provides
weak but extremely useful ordering in concurrent code.
Listing 15.9 (C-MP+o-wmb-o+o-addr-o.litmus)
15.2.2.3 Store Followed By Store
shows a linked variant of the message-passing pattern.
Listing 15.8 (C-MP+o-o+o-rmb-o.litmus) once again The head pointer is x1, which initially references the int
shows the classic message-passing litmus test, with the variable y (line 5), which is in turn initialized to the value
smp_rmb() providing ordering for P1()’s loads, but with- 1 (line 4). P0() updates head pointer x1 to reference x0
out any ordering for P0()’s stores. Again, the rela- (line 11), but only after initializing it to 2 (line 9) and
tively strongly ordered architectures do enforce ordering, forcing ordering (line 10). P1() picks up the head pointer
but weakly ordered architectures do not necessarily do x1 (line 18), and then loads the referenced value (line 19).
so [AMP+ 11], which means that the exists clause can There is thus an address dependency from the load on
trigger. One situation in which such reordering could be line 18 to the load on line 19. In this case, the value
beneficial is when the store buffer is full, another store is returned by line 18 is exactly the address used by line 19,
ready to execute, but the cacheline needed by the oldest but many variations are possible, including field access
store is not yet available. In this situation, allowing stores using the C-language -> operator, addition, subtraction,
to complete out of order would allow execution to proceed. and array indexing.6
Therefore, portable code must explicitly order the stores, One might hope that line 18’s load from the head pointer
for example, as shown in Listing 15.5, thus preventing the would be ordered before line 19’s dereference, which is in
exists clause from triggering. fact the case on Linux v4.15 and later. However, prior to
v4.15, this was not the case on DEC Alpha, which could
Quick Quiz 15.14: Why should strongly ordered systems
in effect use a speculated value for the dependent load, as
pay the performance price of unnecessary smp_rmb() and
smp_wmb() invocations? Shouldn’t weakly ordered systems
described in more detail in Section 15.5.1. Therefore, on
shoulder the full cost of their misordering choices??? older versions of Linux, Listing 15.9’s exists clause can
trigger.
Listing 15.10 shows how to make this work reliably
on pre-v4.15 Linux kernels running on DEC Alpha,
15.2.3 Address Dependencies by replacing line 19’s READ_ONCE() with lockless_
An address dependency occurs when the value returned 6 But note that in the Linux kernel, the address dependency must be
by a load instruction is used to compute the address used carried through the pointer to the array, not through the array index.
Listing 15.10: Enforced Ordering of Message-Passing Address- Listing 15.12: Load-Buffering Data-Dependency Litmus Test
Dependency Litmus Test (Before v4.15) 1 C C-LB+o-r+o-data-o
1 C C-MP+o-wmb-o+ld-addr-o 2
2
3 {}
3 { 4
Listing 15.15: Cache-Coherent IRIW Litmus Test CPU 0 CPU 1 CPU 2 CPU 3
1 C C-CCIRIW+o+o+o-o+o-o
2
3 {}
4
5 P0(int *x)
6 { Memory Memory
7 WRITE_ONCE(*x, 1);
8 }
9
10 P1(int *x) Figure 15.7: Global System Bus And Multi-Copy Atom-
11 {
12 WRITE_ONCE(*x, 2);
icity
13 }
14
15 P2(int *x)
16 { Listing 15.15 (C-CCIRIW+o+o+o-o+o-o.litmus)
17 int r1; shows a litmus test that tests for cache coherence, where
18 int r2;
19
“IRIW” stands for “independent reads of independent
20 r1 = READ_ONCE(*x); writes”. Because this litmus test uses only one vari-
21 r2 = READ_ONCE(*x);
22 } able, P2() and P3() must agree on the order of P0()’s
23 and P1()’s stores. In other words, if P2() believes that
24 P3(int *x)
25 { P0()’s store came first, then P3() had better not believe
26 int r3; that P1()’s store came first. And in fact the exists
27 int r4;
28
clause on line 33 will trigger if this situation arises.
29 r3 = READ_ONCE(*x);
30 r4 = READ_ONCE(*x); Quick Quiz 15.19: But in Listing 15.15, wouldn’t be just
31 } as bad if P2()’s r1 and r2 obtained the values 2 and 1,
32
33 exists(2:r1=1 /\ 2:r2=2 /\ 3:r3=2 /\ 3:r4=1)
respectively, while P3()’s r3 and r4 obtained the values 1
and 2, respectively?
example, xchg()) for all the stores will provide sequentially consistent
8 Recall that SC stands for sequentially consistent. ordering, but this has not yet been proven either way.
if only a subset of CPUs are doing stores, the other CPUs 5 P0(int *x)
6 {
will agree on the order of stores, hence the “other” in 7 WRITE_ONCE(*x, 1);
“other-multicopy atomicity”. Unlike multicopy-atomic 8 }
9
platforms, within other-multicopy-atomic platforms, the 10 P1(int *x, int* y)
CPU doing the store is permitted to observe its store 11 {
12 int r1;
early, which allows its later loads to obtain the newly 13
stored value directly from the store buffer, which improves 14 r1 = READ_ONCE(*x);
15 WRITE_ONCE(*y, r1);
performance. 16 }
17
Quick Quiz 15.20: Can you give a specific example showing 18 P2(int *x, int* y)
different behavior for multicopy atomic on the one hand and 19 {
20 int r2;
other-multicopy atomic on the other? 21 int r3;
22
ity, IBM mainframe provides full multicopy atomicity, and PPC provides
no multicopy atomicity at all. More detail is shown in Table 15.5.
buffer, thus potentially immediately seeing a value stored Listing 15.17: WRC Litmus Test With Release
by CPU 0. In contrast, CPUs 2 and 3 will have to wait for 1 C C-WRC+o+o-r+a-o
2
the corresponding cache line to carry this new value to 3 {}
them. 4
5 P0(int *x)
Quick Quiz 15.21: Then who would even think of designing 6 {
7 WRITE_ONCE(*x, 1);
a system with shared store buffers??? 8 }
9
Table 15.4 shows one sequence of events that can result 10 P1(int *x, int* y)
11 {
in the exists clause in Listing 15.16 triggering. This 12 int r1;
sequence of events will depend critically on P0() and 13
14 r1 = READ_ONCE(*x);
P1() sharing both cache and a store buffer in the manner 15 smp_store_release(y, r1);
shown in Figure 15.8. 16 }
17
Quick Quiz 15.22: But just how is it fair that P0() and P1() 18 P2(int *x, int* y)
19 {
must share a store buffer and a cache, but P2() gets one each 20 int r2;
of its very own??? 21 int r3;
22
23 r2 = smp_load_acquire(y);
Row 1 shows the initial state, with the initial value of y 24 r3 = READ_ONCE(*x);
in P0()’s and P1()’s shared cache, and the initial value 25 }
26
of x in P2()’s cache. 27 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
Row 2 shows the immediate effect of P0() executing
its store on line 7. Because the cacheline containing x is
not in P0()’s and P1()’s shared cache, the new value (1) Note well that the exists clause on line 28 has trig-
is stored in the shared store buffer. gered. The values of r1 and r2 are both the value one, and
Row 3 shows two transitions. First, P0() issues a read- the final value of r3 the value zero. This strange result oc-
invalidate operation to fetch the cacheline containing x so curred because P0()’s new value of x was communicated
that it can flush the new value for x out of the shared store to P1() long before it was communicated to P2().
buffer. Second, P1() loads from x (line 14), an operation
Quick Quiz 15.23: Referring to Table 15.4, why on earth
that completes immediately because the new value of x is
would P0()’s store take so long to complete when P1()’s store
immediately available from the shared store buffer. complete so quickly? In other words, does the exists clause
Row 4 also shows two transitions. First, it shows the on line 28 of Listing 15.16 really trigger on real systems?
immediate effect of P1() executing its store to y (line 15),
placing the new value into the shared store buffer. Second, This counter-intuitive result happens because although
it shows the start of P2()’s load from y (line 23). dependencies do provide ordering, they provide it only
Row 5 continues the tradition of showing two transitions. within the confines of their own thread. This three-thread
First, it shows P1() complete its store to y, flushing from example requires stronger ordering, which is the subject
the shared store buffer to the cache. Second, it shows of Sections 15.2.7.1 through 15.2.7.4.
P2() request the cacheline containing y.
Row 6 shows P2() receive the cacheline containing y, 15.2.7.1 Cumulativity
allowing it to finish its load into r2, which takes on the
value 1. The three-thread example shown in Listing 15.16 re-
Row 7 shows P2() execute its smp_rmb() (line 24), quires cumulative ordering, or cumulativity. A cumulative
thus keeping its two loads ordered. memory-ordering operation orders not just any given ac-
Row 8 shows P2() execute its load from x, which cess preceding it, but also earlier accesses by any thread
immediately returns with the value zero from P2()’s to that same variable.
cache. Dependencies do not provide cumulativity, which is
Row 9 shows P2() finally responding to P0()’s request why the “C” column is blank for the READ_ONCE()
for the cacheline containing x, which was made way back row of Table 15.3 on page 317. However, as indi-
up on row 3. cated by the “C” in their “C” column, release opera-
Finally, row 10 shows P0() finish its store, flushing its tions do provide cumulativity. Therefore, Listing 15.17
value of x from the shared store buffer to the shared cache. (C-WRC+o+o-r+a-o.litmus) substitutes a release oper-
CPU 0
Store x=1 ... cumulativity guarantees CPU 0's store before CPU 1's store
CPU 1
... and given this link ... Load r1=x .... memory barriers guarantee this order ...
Release store
y=r1
CPU 2
Acquire load
Given this link ...
r2=y
Memory
Barrier
Load r3=x
CPU 0 WRITE_ONCE(z, 1); P1() (lines 16 and 17), P2() (lines 24, 25, and 26), and
1 back to P0() (line 7). The exists clause delineates
CPU 1
0
fr z=
CPU 2 z=
this cycle: The 1:r1=1 indicates that the smp_load_
CPU 3 r1 = READ_ONCE(z) == 0;
acquire() on line 16 returned the value stored by the
smp_store_release() on line 8, the 1:r2=0 indicates
Time that the WRITE_ONCE() on line 24 came too late to affect
the value returned by the READ_ONCE() on line 17, and
Figure 15.10: Load-to-Store is Counter-Temporal
finally the 2:r3=0 indicates that the WRITE_ONCE() on
line 7 came too late to affect the value returned by the
READ_ONCE() on line 26. In this case, the fact that the
Quick Quiz 15.24: But it is not necessary to worry about
propagation unless there are at least three threads in the litmus
exists clause can trigger means that the cycle is said to
test, right? be allowed. In contrast, in cases where the exists clause
cannot trigger, the cycle is said to be prohibited.
This situation might seem completely counter-intuitive, But what if we need to keep the exists clause on line 29
but keep in mind that the speed of light is finite and of Listing 15.18? One solution is to replace P0()’s smp_
computers are of non-zero size. It therefore takes time for store_release() with an smp_mb(), which Table 15.3
the effect of the P2()’s store to z to propagate to P1(), shows to have not only cumulativity, but also propagation.
which in turn means that it is possible that P1()’s read The result is shown in Listing 15.19 (C-W+RWC+o-mb-
from z happens much later in time, but nevertheless still o+a-o+o-mb-o.litmus).
sees the old value of zero. This situation is depicted in Quick Quiz 15.25: But given that smp_mb() has the prop-
Figure 15.10: Just because a load sees the old value does agation property, why doesn’t the smp_mb() on line 25 of
not mean that this load executed at an earlier time than Listing 15.18 prevent the exists clause from triggering?
did the store of the new value.
Note that Listing 15.18 also shows the limitations of For completeness, Figure 15.11 shows that the “winning”
memory-barrier pairing, given that there are not two but store among a group of stores to the same variable is not
three processes. These more complex litmus tests can necessarily the store that started last. This should not
instead be said to have cycles, where memory-barrier come as a surprise to anyone who carefully examined
pairing is the special case of a two-thread cycle. The Figure 15.5 on page 321. One way to rationalize the
cycle in Listing 15.18 goes through P0() (lines 7 and 8), counter-temporal properties of both load-to-store and
Listing 15.19: W+WRC Litmus Test With More Barriers Listing 15.20: 2+2W Litmus Test With Write Barriers
1 C C-W+RWC+o-mb-o+a-o+o-mb-o 1 C C-2+2W+o-wmb-o+o-wmb-o
2 2
3 {} 3 {}
4 4
5 P0(int *x, int *y) 5 P0(int *x0, int *x1)
6 { 6 {
7 WRITE_ONCE(*x, 1); 7 WRITE_ONCE(*x0, 1);
8 smp_mb(); 8 smp_wmb();
9 WRITE_ONCE(*y, 1); 9 WRITE_ONCE(*x1, 2);
10 } 10 }
11 11
12 P1(int *y, int *z) 12 P1(int *x0, int *x1)
13 { 13 {
14 int r1; 14 WRITE_ONCE(*x1, 1);
15 int r2; 15 smp_wmb();
16 16 WRITE_ONCE(*x0, 2);
17 r1 = smp_load_acquire(y); 17 }
18 r2 = READ_ONCE(*z); 18
19 } 19 exists (x0=1 /\ x1=1)
20
21 P2(int *z, int *x)
22 {
23 int r3; CPU 0 WRITE_ONCE(x, 1);
1
24 CPU 1 =
25 WRITE_ONCE(*z, 1); X
26 smp_mb(); CPU 2 0 rf
r3 = READ_ONCE(*x);
=
27
CPU 3
X r1 = READ_ONCE(x);
28 }
29
30 exists(1:r1=1 /\ 1:r2=0 /\ 2:r3=0) Time
store-to-store ordering is to clearly distinguish between As shown in Figure 15.12, on platforms without user-
the temporal order in which the store instructions executed visible speculation, if a load returns the value from a
on the one hand, and the order in which the corresponding particular store, then, courtesy of the finite speed of light
cacheline visited the CPUs that executed those instructions and the non-zero size of modern computing systems, the
on the other. It is the cacheline-visitation order that defines store absolutely has to have executed at an earlier time
the externally visible ordering of the actual stores. This than did the load. This means that carefully constructed
cacheline-visitation order is not directly visible to the programs can rely on the passage of time itself as a
code executing the store instructions, which results in the memory-ordering operation.
counter-intuitive counter-temporal nature of load-to-store Of course, just the passage of time by itself is not
and store-to-store ordering.11 enough, as was seen in Listing 15.6 on page 322, which
Quick Quiz 15.26: But for litmus tests having only ordered has nothing but store-to-load links and, because it provides
stores, as shown in Listing 15.20 (C-2+2W+o-wmb-o+o-wmb- absolutely no ordering, still can trigger its exists clause.
o.litmus), research shows that the cycle is prohibited, even However, as long as each thread provides even the weakest
in weakly ordered systems such as Arm and Power [SSA+ 11]. possible ordering, exists clause would not be able to
trigger. For example, Listing 15.21 (C-LB+a-o+o-data-
11 In some hardware-multithreaded systems, the store would become o+o-data-o.litmus) shows P0() ordered with an smp_
visible to other CPUs in that same core as soon as the store reached the load_acquire() and both P1() and P2() ordered with
shared store buffer. As a result, such systems are non-multicopy atomic. data dependencies. These orderings, which are close to
v2022.09.25a
332 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.21: LB Litmus Test With One Acquire Listing 15.22: Long LB Release-Acquire Chain
1 C C-LB+a-o+o-data-o+o-data-o 1 C C-LB+a-r+a-r+a-r+a-r
2 2
3 {} 3 {}
4 4
5 P0(int *x0, int *x1) 5 P0(int *x0, int *x1)
6 { 6 {
7 int r2; 7 int r2;
8 8
9 r2 = smp_load_acquire(x0); 9 r2 = smp_load_acquire(x0);
10 WRITE_ONCE(*x1, 2); 10 smp_store_release(x1, 2);
11 } 11 }
12 12
13 P1(int *x1, int *x2) 13 P1(int *x1, int *x2)
14 { 14 {
15 int r2; 15 int r2;
16 16
17 r2 = READ_ONCE(*x1); 17 r2 = smp_load_acquire(x1);
18 WRITE_ONCE(*x2, r2); 18 smp_store_release(x2, 2);
19 } 19 }
20 20
21 P2(int *x2, int *x0) 21 P2(int *x2, int *x3)
22 { 22 {
23 int r2; 23 int r2;
24 24
25 r2 = READ_ONCE(*x2); 25 r2 = smp_load_acquire(x2);
26 WRITE_ONCE(*x0, r2); 26 smp_store_release(x3, 2);
27 } 27 }
28 28
29 exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2) 29 P3(int *x3, int *x0)
30 {
31 int r2;
32
the top of Table 15.3, suffice to prevent the exists clause 33 r2 = smp_load_acquire(x3);
34 smp_store_release(x0, 2);
from triggering. 35 }
36
Quick Quiz 15.27: Can you construct a litmus test like that 37 exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2 /\ 3:r2=2)
in Listing 15.21 that uses only dependencies?
An important use of time for ordering memory accesses if P3()’s READ_ONCE() returns zero, this cumulativity
is covered in the next section. will force the READ_ONCE() to be ordered before P0()’s
smp_store_release(). In addition, the release-acquire
15.2.7.4 Release-Acquire Chains chain (lines 8, 15, 16, 23, 24, and 32) forces P3()’s
READ_ONCE() to be ordered after P0()’s smp_store_
A minimal release-acquire chain was shown in Listing 15.7 release(). Because P3()’s READ_ONCE() cannot be
on page 322, but these chains can be much longer, as shown both before and after P0()’s smp_store_release(),
in Listing 15.22 (C-LB+a-r+a-r+a-r+a-r.litmus). either or both of two things must be true:
The longer the release-acquire chain, the more order-
ing is gained from the passage of time, so that no matter 1. P3()’s READ_ONCE() came after P0()’s WRITE_
how many threads are involved, the corresponding exists ONCE(), so that the READ_ONCE() returned the value
clause cannot trigger. two, so that the exists clause’s 3:r2=0 is false.
Although release-acquire chains are inherently store-to-
load creatures, it turns out that they can tolerate one load- 2. The release-acquire chain did not form, that is, one
to-store step, despite such steps being counter-temporal, or more of the exists clause’s 1:r2=2, 2:r2=2, or
as shown in Figure 15.10 on page 330. For example, List- 3:r1=2 is false.
ing 15.23 (C-ISA2+o-r+a-r+a-r+a-o.litmus) shows
a three-step release-acquire chain, but where P3()’s final Either way, the exists clause cannot trigger, despite
access is a READ_ONCE() from x0, which is accessed via this litmus test containing a notorious load-to-store link
WRITE_ONCE() by P0(), forming a non-temporal load-to- between P3() and P0(). But never forget that release-
store link between these two processes. However, because acquire chains can tolerate only one load-to-store link, as
P0()’s smp_store_release() (line 8) is cumulative, was seen in Listing 15.18.
v2022.09.25a
15.2. TRICKS AND TRAPS 333
17 } 15 r2 = smp_load_acquire(x1);
18
16 smp_store_release(x2, 2);
19 P2(int *x2, int *x3) 17 }
20 { 18
25 } 23 r2 = smp_load_acquire(x2);
26
24 smp_store_release(x3, 2);
27 P3(int *x3, int *x0) 25 }
28 { 26
33 r2 = READ_ONCE(*x0); 31 r2 = smp_load_acquire(x3);
34 } 32 WRITE_ONCE(*x0, 3);
35
33 }
36 exists (1:r2=2 /\ 2:r2=2 /\ 3:r1=2 /\ 3:r2=0) 34
35 exists (1:r2=2 /\ 2:r2=2 /\ 3:r2=2 /\ x0=2)
v2022.09.25a
334 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
v2022.09.25a
15.3. COMPILE-TIME CONSTERNATION 335
Plain accesses, as in plain-access C-language assign- accesses, as described in Section 4.3.4.4. After all, if
ment statements such as “r1 = a” or “b = 1” are sub- there are no data races, then each and every one of the
ject to the shared-variable shenanigans described in Sec- compiler optimizations mentioned above is perfectly safe.
tion 4.3.4.1. Ways of avoiding these shenanigans are But for code containing data races, this list is subject to
described in Sections 4.3.4.2–4.3.4.4 starting on page 43: change without notice as compiler optimizations continue
becoming increasingly aggressive.
1. Plain accesses can tear, for example, the compiler In short, use of READ_ONCE(), WRITE_ONCE(),
could choose to access an eight-byte pointer one barrier(), volatile, and other primitives called out
byte at a time. Tearing of aligned machine-sized in Table 15.3 on page 317 are valuable tools in preventing
accesses can be prevented by using READ_ONCE() the compiler from optimizing your parallel algorithm out
and WRITE_ONCE(). of existence. Compilers are starting to provide other mech-
anisms for avoiding load and store tearing, for example,
2. Plain loads can fuse, for example, if the results of
memory_order_relaxed atomic loads and stores, how-
an earlier load from that same object are still in a
ever, work is still needed [Cor16b]. In addition, compiler
machine register, the compiler might opt to reuse
issues aside, volatile is still needed to avoid fusing and
the value in that register instead of reloading from
invention of accesses, including C11 atomic accesses.
memory. Load fusing can be prevented by using
Please note that, it is possible to overdo use of READ_
READ_ONCE() or by enforcing ordering between the
ONCE() and WRITE_ONCE(). For example, if you have
two loads using barrier(), smp_rmb(), and other
prevented a given variable from changing (perhaps by
means shown in Table 15.3.
holding the lock guarding all updates to that variable),
3. Plain stores can fuse, so that a store can be omit- there is no point in using READ_ONCE(). Similarly, if you
ted entirely if there is a later store to that same have prevented any other CPUs or threads from reading a
variable. Store fusing can be prevented by using given variable (perhaps because you are initializing that
WRITE_ONCE() or by enforcing ordering between variable before any other CPU or thread has access to it),
the two stores using barrier(), smp_wmb(), and there is no point in using WRITE_ONCE(). However, in
other means shown in Table 15.3. my experience, developers need to use things like READ_
ONCE() and WRITE_ONCE() more often than they think
4. Plain accesses can be reordered in surprising ways that they do, and the overhead of unnecessary uses is quite
by modern optimizing compilers. This reordering low. In contrast, the penalty for failing to use them when
can be prevented by enforcing ordering as called out needed can be quite high.
above.
5. Plain loads can be invented, for example, register 15.3.2 Address- and Data-Dependency Dif-
pressure might cause the compiler to discard a previ- ficulties
ously loaded value from its register, and then reload
it later on. Invented loads can be prevented by using The low overheads of the address and data dependen-
READ_ONCE() or by enforcing ordering as called out cies discussed in Sections 15.2.3 and 15.2.4, respectively,
above between the load and a later use of its value makes their use extremely attractive. Unfortunately, com-
using barrier(). pilers do not understand either address or data dependen-
cies, although there are efforts underway to teach them,
6. Stores can be invented before a plain store, for ex- or at the very least, standardize the process of teaching
ample, by using the stored-to location as temporary them [MWB+ 17, MRP+ 17]. In the meantime, it is neces-
storage. This can be prevented by use of WRITE_ sary to be very careful in order to prevent your compiler
ONCE(). from breaking your dependencies.
Quick Quiz 15.30: Why not place a barrier() call im- 15.3.2.1 Give your dependency chain a good start
mediately before a plain store to prevent the compiler from
inventing stores? The load that heads your dependency chain must use
proper ordering, for example rcu_dereference() or
Please note that all of these shared-memory shenanigans READ_ONCE(). Failure to follow this rule can have serious
can instead be avoided by avoiding data races on plain side effects:
v2022.09.25a
336 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.28: Breakable Dependencies With Comparisons 1. Although it is permissible to compute offsets from
1 int reserve_int; a pointer, these offsets must not result in total can-
2 int *gp;
3 int *p; cellation. For example, given a char pointer cp,
4 cp-(uintptr_t)cp will cancel and can allow the
5 p = rcu_dereference(gp);
6 if (p == &reserve_int) compiler to break your dependency chain. On the
7 handle_reserve(p); other hand, canceling offset values with each other
8 do_something_with(*p); /* buggy! */
is perfectly safe and legal. For example, if a and b
are equal, cp+a-b is an identity function, including
Listing 15.29: Broken Dependencies With Comparisons preserving the dependency.
1 int reserve_int;
2 int *gp; 2. Comparisons can break dependencies. Listing 15.28
3 int *p;
4 shows how this can happen. Here global pointer gp
5 p = rcu_dereference(gp); points to a dynamically allocated integer, but if mem-
6 if (p == &reserve_int) {
7 handle_reserve(&reserve_int); ory is low, it might instead point to the reserve_int
8 do_something_with(reserve_int); /* buggy! */ variable. This reserve_int case might need spe-
9 } else {
10 do_something_with(*p); /* OK! */ cial handling, as shown on lines 6 and 7 of the listing.
11 } But the compiler could reasonably transform this
code into the form shown in Listing 15.29, espe-
cially on systems where instructions with absolute
1. On DEC Alpha, a dependent load might not be addresses run faster than instructions using addresses
ordered with the load heading the dependency chain, supplied in registers. However, there is clearly no
as described in Section 15.5.1. ordering between the pointer load on line 5 and the
dereference on line 8. Please note that this is simply
2. If the load heading the dependency chain is a C11 non-
an example: There are a great many other ways to
volatile memory_order_relaxed load, the com-
break dependency chains with comparisons.
piler could omit the load, for example, by using
a value that it loaded in the past. Quick Quiz 15.31: Why can’t you simply dereference the
pointer before comparing it to &reserve_int on line 6 of
3. If the load heading the dependency chain is a plain Listing 15.28?
load, the compiler can omit the load, again by using
a value that it loaded in the past. Worse yet, it could Quick Quiz 15.32: But it should be safe to compare two
load twice instead of once, so that different parts of pointer variables, right? After all, the compiler doesn’t know
your code use different values—and compilers really the value of either, so how can it possibly learn anything from
do this, especially when under register pressure. the comparison?
4. The value loaded by the head of the dependency Note that a series of inequality comparisons might,
chain must be a pointer. In theory, yes, you could when taken together, give the compiler enough information
load an integer, perhaps to use it as an array index. In to determine the exact value of the pointer, at which point
practice, the compiler knows too much about integers, the dependency is broken. Furthermore, the compiler
and thus has way too many opportunities to break might be able to combine information from even a single
your dependency chain [MWB+ 17]. inequality comparison with other information to learn the
exact value, again breaking the dependency. Pointers to
elements in arrays are especially susceptible to this latter
15.3.2.2 Avoid arithmetic dependency breakage form of dependency breakage.
Although it is just fine to do some arithmetic operations on
a pointer in your dependency chain, you need to be careful 15.3.2.3 Safe comparison of dependent pointers
to avoid giving the compiler too much information. After
It turns out that there are several safe ways to compare
all, if the compiler learns enough to determine the exact
dependent pointers:
value of the pointer, it can use that exact value instead of
the pointer itself. As soon as the compiler does that, the 1. Comparisons against the NULL pointer. In this case,
dependency is broken and all ordering is lost. all the compiler can learn is that the pointer is NULL,
v2022.09.25a
15.3. COMPILE-TIME CONSTERNATION 337
in which case you are not allowed to dereference it Listing 15.30: Broken Dependencies With Pointer Comparisons
anyway. 1 struct foo {
2 int a;
3 int b;
2. The dependent pointer is never dereferenced, whether 4 int c;
before or after the comparison. 5 };
6 struct foo *gp1;
7 struct foo *gp2;
3. The dependent pointer is compared to a pointer that 8
references objects that were last modified a very long 9 void updater(void)
10 {
time ago, where the only unconditionally safe value 11 struct foo *p;
of “a very long time ago” is “at compile time”. The 12
13 p = malloc(sizeof(*p));
key point is that something other than the address or 14 BUG_ON(!p);
data dependency guarantees ordering. 15 p->a = 42;
16 p->b = 43;
17 p->c = 44;
4. Comparisons between two pointers, each of which 18 rcu_assign_pointer(gp1, p);
19 WRITE_ONCE(p->b, 143);
carries an appropriate dependency. For example, you 20 WRITE_ONCE(p->c, 144);
have a pair of pointers, each carrying a dependency, 21 rcu_assign_pointer(gp2, p);
22 }
to data structures each containing a lock, and you 23
want to avoid deadlock by acquiring the locks in 24 void reader(void)
25 {
address order. 26 struct foo *p;
27 struct foo *q;
5. The comparison is not-equal, and the compiler does 28 int r1, r2 = 0;
29
not have enough other information to deduce the 30 p = rcu_dereference(gp2);
value of the pointer carrying the dependency. 31 if (p == NULL)
32 return;
33 r1 = READ_ONCE(p->b);
Pointer comparisons can be quite tricky, and so it 34 q = rcu_dereference(gp1);
is well worth working through the example shown in 35 if (p == q) {
36 r2 = READ_ONCE(p->c);
Listing 15.30. This example uses a simple struct foo 37 }
shown on lines 1–5 and two global pointers, gp1 and 38 do_something_with(r1, r2);
39 }
gp2, shown on lines 6 and 7, respectively. This example
uses two threads, namely updater() on lines 9–22 and
reader() on lines 24–39. they carry different dependencies. This means that the
The updater() thread allocates memory on line 13, compiler might well transform line 36 to instead be r2
and complains bitterly on line 14 if none is available. = q->c, which might well cause the value 44 to be loaded
Lines 15–17 initialize the newly allocated structure, and instead of the expected value 144.
then line 18 assigns the pointer to gp1. Lines 19 and 20
then update two of the structure’s fields, and does so after Quick Quiz 15.33: But doesn’t the condition in line 35
line 18 has made those fields visible to readers. Please supply a control dependency that would keep line 36 ordered
note that unsynchronized update of reader-visible fields after line 34?
often constitutes a bug. Although there are legitimate use
In short, great care is required to ensure that dependency
cases doing just this, such use cases require more care
chains in your source code are still dependency chains in
than is exercised in this example.
the compiler-generated assembly code.
Finally, line 21 assigns the pointer to gp2.
The reader() thread first fetches gp2 on line 30, with
lines 31 and 32 checking for NULL and returning if so. 15.3.3 Control-Dependency Calamities
Line 33 fetches field ->b and line 34 fetches gp1. If
line 35 sees that the pointers fetched on lines 30 and 34 are The control dependencies described in Section 15.2.5 are
equal, line 36 fetches p->c. Note that line 36 uses pointer attractive due to their low overhead, but are also especially
p fetched on line 30, not pointer q fetched on line 34. tricky because current compilers do not understand them
But this difference might not matter. An equals com- and can easily break them. The rules and examples in this
parison on line 35 might lead the compiler to (incorrectly) section are intended to help you prevent your compiler’s
conclude that both pointers are equivalent, when in fact ignorance from breaking your code.
v2022.09.25a
338 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
1 q = READ_ONCE(x);
2 barrier();
This will not have the desired effect because there is no 3 WRITE_ONCE(y, 1); /* BUG: No ordering!!! */
actual data dependency, but rather a control dependency 4 if (q) {
5 do_something();
that the CPU may short-circuit by attempting to predict 6 } else {
the outcome in advance, so that other CPUs see the load 7 do_something_else();
8 }
from y as having happened before the load from x. In
such a case what’s actually required is:
Now there is no conditional between the load from x and
1 q = READ_ONCE(x); the store to y, which means that the CPU is within its rights
2 if (q) {
3 <read barrier> to reorder them: The conditional is absolutely required,
4 q = READ_ONCE(y); and must be present in the assembly code even after all
5 }
compiler optimizations have been applied. Therefore,
if you need ordering in this example, you need explicit
However, stores are not speculated. This means that memory-ordering operations, for example, a release store:
ordering is provided for load-store control dependencies,
as in the following example: 1 q = READ_ONCE(x);
2 if (q) {
3 smp_store_release(&y, 1);
1 q = READ_ONCE(x);
4 do_something();
2 if (q)
5 } else {
3 WRITE_ONCE(y, 1);
6 smp_store_release(&y, 1);
7 do_something_else();
8 }
Control dependencies pair normally with other types
of ordering operations. That said, please note that neither
READ_ONCE() nor WRITE_ONCE() are optional! Without The initial READ_ONCE() is still required to prevent the
the READ_ONCE(), the compiler might fuse the load from x compiler from guessing the value of x. In addition, you
with other loads from x. Without the WRITE_ONCE(), the need to be careful what you do with the local variable q,
compiler might fuse the store to y with other stores to y. otherwise the compiler might be able to guess its value
Either can result in highly counter-intuitive effects on and again remove the needed conditional. For example:
ordering. 1 q = READ_ONCE(x);
Worse yet, if the compiler is able to prove (say) that 2 if (q % MAX) {
the value of variable x is always non-zero, it would be 3 WRITE_ONCE(y, 1);
4 do_something();
well within its rights to optimize the original example by 5 } else {
eliminating the “if” statement as follows: 6 WRITE_ONCE(y, 2);
7 do_something_else();
8 }
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
If MAX is defined to be 1, then the compiler knows that
It is tempting to try to enforce ordering on identical (q%MAX) is equal to zero, in which case the compiler
stores on both branches of the “if” statement as follows: is within its rights to transform the above code into the
following:
1 q = READ_ONCE(x);
2 if (q) { 1 q = READ_ONCE(x);
3 barrier(); 2 WRITE_ONCE(y, 2);
4 WRITE_ONCE(y, 1); 3 do_something_else();
5 do_something();
v2022.09.25a
15.3. COMPILE-TIME CONSTERNATION 339
Given this transformation, the CPU is not required to Listing 15.31: LB Litmus Test With Control Dependency
respect the ordering between the load from variable x and 1 C C-LB+o-cgt-o+o-cgt-o
2
the store to variable y. It is tempting to add a barrier() 3 {}
to constrain the compiler, but this does not help. The 4
5 P0(int *x, int *y)
conditional is gone, and the barrier() won’t bring it 6 {
back. Therefore, if you are relying on this ordering, you 7 int r1;
8
should make sure that MAX is greater than one, perhaps as 9 r1 = READ_ONCE(*x);
follows: 10 if (r1 > 0)
11 WRITE_ONCE(*y, 1);
12 }
1 q = READ_ONCE(x); 13
2 BUILD_BUG_ON(MAX <= 1); 14 P1(int *x, int *y)
3 if (q % MAX) { 15 {
4 WRITE_ONCE(y, 1); 16 int r2;
5 do_something(); 17
6 } else { 18 r2 = READ_ONCE(*y);
7 WRITE_ONCE(y, 2); 19 if (r2 > 0)
8 do_something_else(); 20 WRITE_ONCE(*x, 1);
9 } 21 }
22
23 exists (0:r1=1 /\ 1:r2=1)
Please note once again that the stores to y differ. If they
were identical, as noted earlier, the compiler could pull
this store outside of the “if” statement. also cannot reorder the writes to y with the condition.
You must also avoid excessive reliance on boolean Unfortunately for this line of reasoning, the compiler
short-circuit evaluation. Consider this example: might compile the two writes to y as conditional-move
instructions, as in this fanciful pseudo-assembly language:
1 q = READ_ONCE(x);
2 if (q || 1 > 0)
3 WRITE_ONCE(y, 1); 1 ld r1,x
2 cmp r1,$0
3 cmov,ne r4,$1
4 cmov,eq r4,$2
Because the first condition cannot fault and the second 5 st r4,y
condition is always true, the compiler can transform this 6 st $1,z
example as following, defeating the control dependency:
A weakly ordered CPU would have no dependency of
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1); any sort between the load from x and the store to z. The
control dependencies would extend only to the pair of cmov
This example underscores the need to ensure that the instructions and the store depending on them. In short,
compiler cannot out-guess your code. More generally, control dependencies apply only to the stores in the “then”
although READ_ONCE() does force the compiler to actually and “else” of the “if” in question (including functions
emit code for a given load, it does not force the compiler invoked by those two clauses), and not necessarily to code
to use the value loaded. following that “if”.
In addition, control dependencies apply only to the then- Finally, control dependencies do not provide cumula-
clause and else-clause of the if-statement in question. In tivity.13 This is demonstrated by two related litmus tests,
particular, it does not necessarily apply to code following namely Listings 15.31 and 15.32 with the initial values
the if-statement: of x and y both being zero.
The exists clause in the two-thread example of
1 q = READ_ONCE(x); Listing 15.31 (C-LB+o-cgt-o+o-cgt-o.litmus) will
2 if (q) {
3 WRITE_ONCE(y, 1); never trigger. If control dependencies guaranteed cumu-
4 } else { lativity (which they do not), then adding a thread to the
5 WRITE_ONCE(y, 2);
6 } example as in Listing 15.32 (C-WWC+o-cgt-o+o-cgt-
7 WRITE_ONCE(z, 1); /* BUG: No ordering. */ o+o.litmus) would guarantee the related exists clause
never to trigger.
It is tempting to argue that there in fact is ordering
because the compiler cannot reorder volatile accesses and 13 Refer to Section 15.2.7.1 for the meaning of cumulativity.
v2022.09.25a
340 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.32: WWC Litmus Test With Control Dependency is needed, precede both of them with smp_mb() or
(Cumulativity?) use smp_store_release(). Please note that it is
1 C C-WWC+o-cgt-o+o-cgt-o+o not sufficient to use barrier() at beginning of each
2
3 {} leg of the “if” statement because, as shown by the
4
5 P0(int *x, int *y)
example above, optimizing compilers can destroy the
6 { control dependency while respecting the letter of the
7 int r1; barrier() law.
8
9 r1 = READ_ONCE(*x);
10 if (r1 > 0) 4. Control dependencies require at least one run-time
11 WRITE_ONCE(*y, 1); conditional between the prior load and the subsequent
12 }
13 store, and this conditional must involve the prior load.
14 P1(int *x, int *y) If the compiler is able to optimize the conditional
15 {
16 int r2; away, it will have also optimized away the ordering.
17
18 r2 = READ_ONCE(*y);
Careful use of READ_ONCE() and WRITE_ONCE()
19 if (r2 > 0) can help to preserve the needed conditional.
20 WRITE_ONCE(*x, 1);
21 } 5. Control dependencies require that the compiler
22
23 P2(int *x) avoid reordering the dependency into nonexistence.
24 { Careful use of READ_ONCE(), atomic_read(), or
25 WRITE_ONCE(*x, 2);
26 } atomic64_read() can help to preserve your control
27
28 exists (0:r1=2 /\ 1:r2=1 /\ x=2)
dependency.
6. Control dependencies apply only to the “then” and
“else” of the “if” containing the control dependency,
But because control dependencies do not provide cu- including any functions that these two clauses call.
mulativity, the exists clause in the three-thread litmus Control dependencies do not apply to code following
test can trigger. If you need the three-thread example to the end of the “if” statement containing the control
provide ordering, you will need smp_mb() between the dependency.
load and store in P0(), that is, just before or just after
the “if” statements. Furthermore, the original two-thread 7. Control dependencies pair normally with other types
example is very fragile and should be avoided. of memory-ordering operations.
Quick Quiz 15.34: Can’t you instead add an smp_mb() to 8. Control dependencies do not provide cumulativity. If
P1() in Listing 15.32?
you need cumulativity, use something that provides
The following list of rules summarizes the lessons of it, such as smp_store_release() or smp_mb().
this section:
Again, many popular languages were designed with
1. Compilers do not understand control dependencies, single-threaded use in mind. Successful multithreaded use
so it is your job to make sure that the compiler cannot of these languages requires you to pay special attention to
break your code. your memory references and dependencies.
v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 341
provide a deeper understanding of the synchronization write before the unlock to be reordered with a read after the
primitives themselves. Section 15.4.1 takes a look at mem- lock?
ory allocation, Section 15.4.2 examines the surprisingly
Therefore, the ordering required by conventional uses of
varied semantics of locking, and Section 15.4.3 digs more
memory allocation can be provided solely by non-fastpath
deeply into RCU.
locking, allowing the fastpath to remain synchronization-
free.
15.4.1 Memory Allocation
15.4.2 Locking
Section 6.4.3.2 touched upon memory allocation, and
this section expands upon the relevant memory-ordering Locking is a well-known synchronization primitive with
issues. which the parallel-programming community has had
The key requirement is that any access executed on a decades of experience. As such, locking’s semantics
given block of memory before freeing that block must be are quite simple.
ordered before any access executed after that same block That is, they are quite simple until you start trying to
is reallocated. It would after all be a cruel and unusual mathematically model them.
memory-allocator bug if a store preceding the free were to The simple part is that any CPU or thread holding a
be reordered after another store following the reallocation! given lock is guaranteed to see any accesses executed by
However, it would also be cruel and unusual to require CPUs or threads while they were previously holding that
developers to use READ_ONCE() and WRITE_ONCE() to same lock. Similarly, any CPU or thread holding a given
access dynamically allocated memory. Full ordering must lock is guaranteed not to see accesses that will be executed
therefore be provided for plain accesses, in spite of all the by other CPUs or threads while subsequently holding that
shared-variable shenanigans called out in Section 4.3.4.1. same lock. And what else is there?
As it turns out, quite a bit:
Of course, each CPU sees its own accesses in order and
the compiler always has fully accounted for intra-CPU 1. Are CPUs, threads, or compilers allowed to pull
shenanigans. These facts are what enables the lockless fast- memory accesses into a given lock-based critical
paths in memblock_alloc() and memblock_free(), section?
which are shown in Listings 6.10 and 6.11, respectively.
However, this is also why the developer is responsible 2. Will a CPU or thread holding a given lock also see
for providing appropriate ordering (for example, by using accesses executed by CPUs and threads before they
smp_store_release()) when publishing a pointer to last acquired that same lock, and vice versa?
a newly allocated block of memory. After all, in the
3. Suppose that a given CPU or thread executes one
CPU-local case, the allocator has not necessarily provided
access (call it “A”), releases a lock, reacquires that
any ordering.
same lock, then executes another access (call it “B”).
However, the allocator must provide ordering when Is some other CPU or thread not holding that lock
rebalancing its per-thread pools. This ordering is guaranteed to see A and B in order?
provided by the calls to spin_lock() and spin_
unlock() from memblock_alloc() and memblock_ 4. As above, but with the lock reacquisition carried out
free(). For any block that has migrated from one by some other CPU or thread?
thread to another, the old thread will have executed spin_
5. As above, but with the lock reacquisition being some
unlock(&globalmem.mutex) after placing the block in
other lock?
the globalmem pool, and the new thread will have exe-
cuted spin_lock(&globalmem.mutex) before moving 6. What ordering guarantees are provided by spin_
that block to its per-thread pool. This spin_unlock() is_locked()?
and spin_lock() ensures that both the old and new
threads see the old thread’s accesses as having happened The reaction to some or even all of these questions
before those of the new thread. might well be “Why would anyone do that?” However,
any complete mathematical definition of locking must
Quick Quiz 15.35: But doesn’t PowerPC have weak unlock-
have answers to all of these questions. Therefore, the
lock ordering properties within the Linux kernel, allowing a
following sections address these questions.
v2022.09.25a
342 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.33: Prior Accesses Into Critical Section means that neither spin_lock() nor spin_unlock()
1 C Lock-before-into are required to act as a full memory barrier.
2
3 {} However, other environments might make other choices.
4
For example, locking implementations that run only on
5 P0(int *x, int *y, spinlock_t *sp)
6 { the x86 CPU family will have lock-acquisition primitives
7 int r1; that fully order the lock acquisition with any prior and
8
9 WRITE_ONCE(*x, 1); any subsequent accesses. Therefore, on such systems the
10 spin_lock(sp); ordering shown in Listing 15.33 comes for free. There
11 r1 = READ_ONCE(*y);
12 spin_unlock(sp); are x86 lock-release implementations that are weakly
13 } ordered, thus failing to provide the ordering shown in
14
15 P1(int *x, int *y) Listing 15.34, but an implementation could nevertheless
16 { choose to guarantee this ordering.
17 int r1;
18 For their part, weakly ordered systems might well
19 WRITE_ONCE(*y, 1); choose to execute the memory-barrier instructions re-
20 smp_mb();
21 r1 = READ_ONCE(*x); quired to guarantee both orderings, possibly simpli-
22 } fying code making advanced use of combinations of
23
24 exists (0:r1=0 /\ 1:r1=0) locked and lockless accesses. However, as noted earlier,
LKMM chooses not to provide these additional order-
ings, in part to avoid imposing performance penalties on
Listing 15.34: Subsequent Accesses Into Critical Section
1 C Lock-after-into
the simpler and more prevalent locking use cases. In-
2 stead, the smp_mb__after_spinlock() and smp_mb__
3 {}
4
after_unlock_lock() primitives are provided for those
5 P0(int *x, int *y, spinlock_t *sp) more complex use cases, as discussed in Section 15.5.
6 {
7 int r1;
Thus far, this section has discussed only hardware
8 reordering. Can the compiler also reorder memory refer-
9 spin_lock(sp);
10 WRITE_ONCE(*x, 1);
ences into lock-based critical sections?
11 spin_unlock(sp); The answer to this question in the context of the Linux
12 r1 = READ_ONCE(*y);
13 }
kernel is a resounding “No!” One reason for this other-
14 wise inexplicable favoring of hardware reordering over
15 P1(int *x, int *y)
16 {
compiler optimizations is that the hardware will avoid
17 int r1; reordering a page-faulting access into a lock-based crit-
18
19 WRITE_ONCE(*y, 1);
ical section. In contrast, compilers have no clue about
20 smp_mb(); page faults, and would therefore happily reorder a page
21 r1 = READ_ONCE(*x);
22 }
fault into a critical section, which could crash the kernel.
23 The compiler is also unable to reliably determine which
24 exists (0:r1=0 /\ 1:r1=0)
accesses will result in cache misses, so that compiler re-
ordering into critical sections could also result in excessive
lock contention. Therefore, the Linux kernel prohibits the
15.4.2.1 Accesses Into Critical Sections? compiler (but not the CPU) from moving accesses into
lock-based critical sections.
Can memory accesses be reordered into lock-based critical
sections?
15.4.2.2 Accesses Outside of Critical Section?
Within the context of the Linux-kernel memory model,
the simple answer is “yes”. This may be verified by If a given CPU or thread holds a given lock, it is guaranteed
running the litmus tests shown in Listings 15.33 and 15.34 to see accesses executed during all prior critical sections
(C-Lock-before-into.litmus and C-Lock-after- for that same lock. Similarly, such a CPU or thread is
into.litmus, respectively), both of which will yield the guaranteed not to see accesses that will be executed during
Sometimes result. This result indicates that the exists all subsequent critical sections for that same lock.
clause can be satisfied, that is, that the final value of But what about accesses preceding prior critical sections
both P0()’s and P1()’s r1 variable can be zero. This and following subsequent critical sections?
v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 343
Listing 15.35: Accesses Outside of Critical Sections Listing 15.36: Accesses Between Same-CPU Critical Sections
1 C Lock-outside-across 1 C Lock-across-unlock-lock-1
2 2
3 {} 3 {}
4 4
5 P0(int *x, int *y, spinlock_t *sp) 5 P0(int *x, int *y, spinlock_t *sp)
6 { 6 {
7 int r1; 7 int r1;
8 8
9 WRITE_ONCE(*x, 1); 9 spin_lock(sp);
10 spin_lock(sp); 10 WRITE_ONCE(*x, 1);
11 r1 = READ_ONCE(*y); 11 spin_unlock(sp);
12 spin_unlock(sp); 12 spin_lock(sp);
13 } 13 r1 = READ_ONCE(*y);
14 14 spin_unlock(sp);
15 P1(int *x, int *y, spinlock_t *sp) 15 }
16 { 16
17 int r1; 17 P1(int *x, int *y, spinlock_t *sp)
18 18 {
19 spin_lock(sp); 19 int r1;
20 WRITE_ONCE(*y, 1); 20
21 spin_unlock(sp); 21 WRITE_ONCE(*y, 1);
22 r1 = READ_ONCE(*x); 22 smp_mb();
23 } 23 r1 = READ_ONCE(*x);
24 24 }
25 exists (0:r1=0 /\ 1:r1=0) 25
26 exists (0:r1=0 /\ 1:r1=0)
v2022.09.25a
344 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.37: Accesses Between Different-CPU Critical Sec- concluded that spin_is_locked() should be used only
tions for debugging. Part of the reason for this is that even a fully
1 C Lock-across-unlock-lock-2
2
ordered spin_is_locked() might return true because
3 {} some other CPU or thread was just about to release the
4
5 P0(int *x, spinlock_t *sp) lock in question. In this case, there is little that can be
6 { learned from that return value of true, which means
7 spin_lock(sp);
8 WRITE_ONCE(*x, 1); that reliable use of spin_is_locked() is surprisingly
9 spin_unlock(sp); complex. Other approaches almost always work better,
10 }
11
for example, use of explicit shared variables or the spin_
12 P1(int *x, int *y, spinlock_t *sp) trylock() primitive.
13 {
14 int r1; This situation resulted in the current state, namely that
15 int r2; spin_is_locked() provides no ordering guarantees,
16
17 spin_lock(sp); except that if it returns false, the current CPU or thread
18 r1 = READ_ONCE(*x); cannot be holding the corresponding lock.
19 r2 = READ_ONCE(*y);
20 spin_unlock(sp);
21 }
Quick Quiz 15.37: But if spin_is_locked() returns
22 false, don’t we also know that no other CPU or thread is
23 P2(int *x, int *y, spinlock_t *sp) holding the corresponding lock?
24 {
25 int r1;
26
27 WRITE_ONCE(*y, 1);
28 smp_mb(); 15.4.2.5 Why Mathematically Model Locking?
29 r1 = READ_ONCE(*x);
30 }
31 Given all these possible choices, why model locking in
32 exists (1:r1=1 /\ 1:r2=0 /\ 2:r1=0) general? Why not simply model a simple implementation?
One reason is modeling performance, as shown in
Table E.4 on page 534. Directly modeling locking in
hold that lock or either smp_mb__after_spinlock() general is orders of magnitude faster than emulating even
or smp_mb__after_unlock_lock() must be executed a trivial implementation. This should be no surprise, given
just after P1()’s lock acquisition. the combinatorial explosion experienced by present-day
Given that ordering is not guaranteed when both crit- formal-verification tools with increases in the number of
ical sections are protected by the same lock, there is no memory accesses executed by the code being modeled.
hope of any ordering guarantee when different locks are
Another reason is that a trivial implementation might
used. However, readers are encouraged to construct the
needlessly constrain either real implementations or real
corresponding litmus test and see this for themselves.
use cases. In contrast, modeling a platonic lock allows
This situation can seem counter-intuitive, but it is rare
the widest variety of implementations while providing
for code to care. This approach also allows certain weakly
specific guidance to locks’ users.
ordered systems to implement more efficient locks.
v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 345
If happens before ... Listing 15.39: RCU Fundamental Property and Reordering
1 C C-SB+o-rcusync-o+i-rl-o-o-rul
2
rcu_read_lock() call_rcu()
3 {}
4
rcu_read_unlock() 5 P0(uintptr_t *x0, uintptr_t *x1)
6 {
10 }
11
12 P1(uintptr_t *x0, uintptr_t *x1)
13 {
14 rcu_read_lock();
15 uintptr_t r2 = READ_ONCE(*x0);
16 WRITE_ONCE(*x1, 2);
17 rcu_read_unlock();
18 }
19
rcu_read_lock() 20 exists (1:r2=0 /\ 0:r2=0)
fundamental grace-period guarantee, as can be seen by within RCU read-side critical sections.
16 Several of which were introduced to Paul by Jade Alglave during
early work on LKMM, and a few more of which came from other LKMM
14 For more detail, please see Figures 9.11–9.13 starting on page 146. participants [AMM+ 18].
v2022.09.25a
346 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.40: RCU Readers Provide No Lock-Like Ordering Listing 15.42: RCU Updaters Provide Full Ordering
1 C C-LB+rl-o-o-rul+rl-o-o-rul 1 C C-SB+o-rcusync-o+o-rcusync-o
2 2
3 {} 3 {}
4 4
5 P0(uintptr_t *x0, uintptr_t *x1) 5 P0(uintptr_t *x0, uintptr_t *x1)
6 { 6 {
7 rcu_read_lock(); 7 WRITE_ONCE(*x0, 2);
8 uintptr_t r1 = READ_ONCE(*x0); 8 synchronize_rcu();
9 WRITE_ONCE(*x1, 1); 9 uintptr_t r2 = READ_ONCE(*x1);
10 rcu_read_unlock(); 10 }
11 } 11
12 12 P1(uintptr_t *x0, uintptr_t *x1)
13 P1(uintptr_t *x0, uintptr_t *x1) 13 {
14 { 14 WRITE_ONCE(*x1, 2);
15 rcu_read_lock(); 15 synchronize_rcu();
16 uintptr_t r1 = READ_ONCE(*x1); 16 uintptr_t r2 = READ_ONCE(*x0);
17 WRITE_ONCE(*x0, 1); 17 }
18 rcu_read_unlock(); 18
19 } 19 exists (1:r2=0 /\ 0:r2=0)
20
21 exists (0:r1=1 /\ 1:r1=1)
v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 347
Listing 15.43: What Happens Before RCU Readers? Listing 15.45: What Happens With Empty RCU Readers?
1 C C-SB+o-rcusync-o+o-rl-o-rul 1 C C-SB+o-rcusync-o+o-rl-rul-o
2 2
3 {} 3 {}
4 4
5 P0(uintptr_t *x0, uintptr_t *x1) 5 P0(uintptr_t *x0, uintptr_t *x1)
6 { 6 {
7 WRITE_ONCE(*x0, 2); 7 WRITE_ONCE(*x0, 2);
8 synchronize_rcu(); 8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1); 9 uintptr_t r2 = READ_ONCE(*x1);
10 } 10 }
11 11
12 P1(uintptr_t *x0, uintptr_t *x1) 12 P1(uintptr_t *x0, uintptr_t *x1)
13 { 13 {
14 WRITE_ONCE(*x1, 2); 14 WRITE_ONCE(*x1, 2);
15 rcu_read_lock(); 15 rcu_read_lock();
16 uintptr_t r2 = READ_ONCE(*x0); 16 rcu_read_unlock();
17 rcu_read_unlock(); 17 uintptr_t r2 = READ_ONCE(*x0);
18 } 18 }
19 19
20 exists (1:r2=0 /\ 0:r2=0) 20 exists (1:r2=0 /\ 0:r2=0)
Listing 15.44: What Happens After RCU Readers? Listing 15.44 (C-SB+o-rcusync-o+rl-o-rul-o.
1 C C-SB+o-rcusync-o+rl-o-rul-o
2
litmus) is similar, but instead looks at accesses after
3 {} the RCU read-side critical section. This test’s cycle is
4
5 P0(uintptr_t *x0, uintptr_t *x1)
also forbidden, as can be checked with the herd tool. The
6 { reasoning is similar to that for Listing 15.43, and is left as
7 WRITE_ONCE(*x0, 2);
8 synchronize_rcu();
an exercise for the reader.
9 uintptr_t r2 = READ_ONCE(*x1); Listing 15.45 (C-SB+o-rcusync-o+o-rl-rul-o.
10 }
11
litmus) takes things one step farther, moving P1()’s
12 P1(uintptr_t *x0, uintptr_t *x1) WRITE_ONCE() to precede the RCU read-side critical
13 {
14 rcu_read_lock(); section and moving P1()’s READ_ONCE() to follow it,
15 WRITE_ONCE(*x1, 2); resulting in an empty RCU read-side critical section.
16 rcu_read_unlock();
17 uintptr_t r2 = READ_ONCE(*x0); Perhaps surprisingly, despite the empty critical section,
18 } RCU nevertheless still manages to forbid the cycle. This
19
20 exists (1:r2=0 /\ 0:r2=0) can again be checked using the herd tool. Furthermore,
the reasoning is once again similar to that for Listing 15.43,
Recapping, if P1()’s WRITE_ONCE() follows the end of
a given grace period, then P1()’s RCU read-side critical
place memory-barrier instructions in rcu_read_lock() section—and everything following it—must follow the
and rcu_read_unlock() will preserve the ordering of beginning of that same grace period. Similarly, if P1()’s
P1()’s two accesses all the way down to the hardware READ_ONCE() precedes the beginning of a given grace
level. On the other hand, RCU implementations that rely period, then P1()’s RCU read-side critical section—and
on interrupt-based state machines will also fully preserve everything preceding it—must precede the end of that
this ordering relative to the grace period due to the fact that same grace period. In both cases, the critical section’s
interrupts take place at a precise location in the execution emptiness is irrelevant.
of the interrupted code.
Quick Quiz 15.38: Wait a minute! In QSBR implementations
This in turn means that if the WRITE_ONCE() follows of RCU, no code is emitted for rcu_read_lock() and rcu_
the end of a given RCU grace period, then the accesses read_unlock(). This means that the RCU read-side critical
within and following that RCU read-side critical section section in Listing 15.45 isn’t just empty, it is completely
must follow the beginning of that same grace period. nonexistent!!! So how can something that doesn’t exist at all
Similarly, if the READ_ONCE() precedes the beginning of possibly have any effect whatsoever on ordering???
the grace period, everything within and preceding that
critical section must precede the end of that same grace This situation leads to the question of what hap-
period. pens if rcu_read_lock() and rcu_read_unlock() are
v2022.09.25a
348 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.46: What Happens With No RCU Readers? Listing 15.47: One RCU Grace Period and Two Readers
1 C C-SB+o-rcusync-o+o-o 1 C C-SB+o-rcusync-o+rl-o-o-rul+rl-o-o-rul
2 2
3 {} 3 {}
4 4
5 P0(uintptr_t *x0, uintptr_t *x1) 5 P0(uintptr_t *x0, uintptr_t *x1)
6 { 6 {
7 WRITE_ONCE(*x0, 2); 7 WRITE_ONCE(*x0, 2);
8 synchronize_rcu(); 8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1); 9 uintptr_t r2 = READ_ONCE(*x1);
10 } 10 }
11 11
12 P1(uintptr_t *x0, uintptr_t *x1) 12 P1(uintptr_t *x1, uintptr_t *x2)
13 { 13 {
14 WRITE_ONCE(*x1, 2); 14 rcu_read_lock();
15 uintptr_t r2 = READ_ONCE(*x0); 15 WRITE_ONCE(*x1, 2);
16 } 16 uintptr_t r2 = READ_ONCE(*x2);
17 17 rcu_read_unlock();
18 exists (1:r2=0 /\ 0:r2=0) 18 }
19
20 P2(uintptr_t *x2, uintptr_t *x0)
21 {
omitted entirely, as shown in Listing 15.46 (C-SB+o- 22 rcu_read_lock();
23 WRITE_ONCE(*x2, 2);
rcusync-o+o-o.litmus). As can be checked with 24 uintptr_t r2 = READ_ONCE(*x0);
herd, this litmus test’s cycle is allowed, that is, both 25 rcu_read_unlock();
26 }
instances of r2 can have final values of zero. 27
This might seem strange in light of the fact that empty 28 exists (2:r2=0 /\ 0:r2=0 /\ 1:r2=0)
v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 349
rcu_read_lock();
r2 = READ_ONCE(x0);
WRITE_ONCE(x0, 2);
synchronize_rcu();
rcu_read_lock();
r2 = READ_ONCE(x2);
WRITE_ONCE(x2, 2);
rcu_read_unlock();
r2 = READ_ONCE(x1);
WRITE_ONCE(x1, 2);
rcu_read_unlock();
Figure 15.14: Cycle for One RCU Grace Period and Two RCU Readers
WRITE_ONCE(x0, 2);
synchronize_rcu(); rcu_read_lock();
r2 = READ_ONCE(x0);
r2 = READ_ONCE(x1);
WRITE_ONCE(x1, 2);
synchronize_rcu(); rcu_read_lock();
r2 = READ_ONCE(x3);
WRITE_ONCE(x3, 2);
rcu_read_unlock();
r2 = READ_ONCE(x2);
WRITE_ONCE(x2, 2);
rcu_read_unlock();
Figure 15.15: No Cycle for Two RCU Grace Periods and Two RCU Readers
v2022.09.25a
350 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
Listing 15.48: Two RCU Grace Periods and Two Readers there are at least as many RCU grace periods as there are
1 C C-SB+o-rcusync-o+o-rcusync-o+rl-o-o-rul+rl-o-o-rul RCU read-side critical sections, the cycle is forbidden.18
2
3 {}
4
5 P0(uintptr_t *x0, uintptr_t *x1) 15.4.3.5 RCU and Other Ordering Mechanisms
6 {
7 WRITE_ONCE(*x0, 2); But what about litmus tests that combine RCU with other
8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1);
ordering mechanisms?
10 } The general rule is that it takes only one mechanism to
11
12 P1(uintptr_t *x1, uintptr_t *x2)
forbid a cycle.
13 { For example, refer back to Listing 15.40. Applying
14 WRITE_ONCE(*x1, 2);
15 synchronize_rcu();
the general rule from the previous section, because this
16 uintptr_t r2 = READ_ONCE(*x2); litmus test has two RCU read-side critical sections and
17 }
18
no RCU grace periods, the cycle is allowed. But what
19 P2(uintptr_t *x2, uintptr_t *x3) if P0()’s WRITE_ONCE() is replaced by an smp_store_
20 {
21 rcu_read_lock();
release() and P1()’s READ_ONCE() is replaced by an
22 WRITE_ONCE(*x2, 2); smp_load_acquire()?
23 uintptr_t r2 = READ_ONCE(*x3);
24 rcu_read_unlock();
RCU would still allow the cycle, but the release-acquire
25 } pair would forbid it. Because it only takes one mechanism
26
27 P3(uintptr_t *x0, uintptr_t *x3)
to forbid a cycle, the release-acquire pair would prevail so
28 { that the cycle would be forbidden.
29 rcu_read_lock();
30 WRITE_ONCE(*x3, 2);
For another example, refer back to Listing 15.47. Be-
31 uintptr_t r2 = READ_ONCE(*x0); cause this litmus test has two RCU readers but only one
32 rcu_read_unlock();
33 }
grace period, its cycle is allowed. But suppose that an
34 smp_mb() was placed between P1()’s pair of accesses.
35 exists (3:r2=0 /\ 0:r2=0 /\ 1:r2=0 /\ 2:r2=0)
In this new litmus test, because of the addition of the smp_
mb(), P2()’s as well as P1()’s critical sections would
5. Therefore, P2()’s read from x0 can precede P0()’s extend beyond the end of P0()’s grace period, which in
write, thus allowing the cycle to form. turn would prevent P2()’s read from x0 from preceding
P0()’s write, as depicted by the red dashed arrow in Fig-
But what happens when another grace period is added? ure 15.16. In this case, RCU and the full memory barrier
This situation is shown in Listing 15.48, an SB litmus work together to forbid the cycle, with RCU preserving
test in which P0() and P1() have RCU grace periods ordering between P0() and both P1() and P2(), and
and P2() and P3() have RCU readers. Again, the CPUs with the smp_mb() preserving ordering between P1()
can reorder the accesses within RCU read-side critical and P2().
sections, as shown in Figure 15.15. For this cycle to form, Quick Quiz 15.40: What would happen if the smp_mb()
P2()’s critical section must end after P1()’s grace period was instead added between P2()’s accesses in Listing 15.47?
and P3()’s must end after the beginning of that same
grace period, which happens to also be after the end of
P0()’s grace period. Therefore, P3()’s critical section In short, where RCU’s semantics were once purely prag-
must start after the beginning of P0()’s grace period, matic, they are now fully formalized [MW05, DMS+ 12,
which in turn means that P3()’s read from x0 cannot GRY13, AMM+ 18].
possibly precede P0()’s write. Therefore, the cycle is It is hoped that detailed semantics for higher-level
forbidden because RCU read-side critical sections cannot primitives will enable more capable static analysis and
span full RCU grace periods. model checking.
However, a closer look at Figure 15.15 makes it clear
that adding a third reader would allow the cycle. This
is because this third reader could end before the end of 18 Interestingly enough, Alan Stern proved that within the context
P0()’s grace period, and thus start before the beginning of of LKMM, the two-part fundamental property of RCU expressed in
that same grace period. This in turn suggests the general Section 9.5.2 actually implies this seemingly more general result, which
rule, which is: In these sorts of RCU-only litmus tests, if is called the RCU axiom [AMM+ 18].
v2022.09.25a
15.5. HARDWARE SPECIFICS 351
WRITE_ONCE(x0, 2);
synchronize_rcu(); rcu_read_lock();
r2 = READ_ONCE(x0);
rcu_read_lock();
r2 = READ_ONCE(x1);
WRITE_ONCE(x1, 2);
smp_mb();
r2 = READ_ONCE(x2);
WRITE_ONCE(x2, 2);
rcu_read_unlock();
rcu_read_unlock();
Figure 15.16: Cycle for One RCU Grace Period, Two RCU Readers, and Memory Barrier
15.5 Hardware Specifics and which is explained in more detail in Section 15.5.1.
The short version is that Alpha requires memory barriers
for readers as well as updaters of linked data structures,
Rock beats paper! however, these memory barriers are provided by the Alpha
Derek Williams architecture-specific code in v4.15 and later Linux kernels.
The next row, “Non-Sequentially Consistent”, indicates
Each CPU family has its own peculiar approach to memory whether the CPU’s normal load and store instructions are
ordering, which can make portability a challenge, as constrained by sequential consistency. Almost all are not
indicated by Table 15.5. constrained in this way for performance reasons.
In fact, some software environments simply prohibit The next two rows cover multicopy atomicity, which
direct use of memory-ordering operations, restricting the was defined in Section 15.2.7. The first is full-up (and
programmer to mutual-exclusion primitives that incor- rare) multicopy atomicity, and the second is the weaker
porate them to the extent that they are required. Please other-multicopy atomicity.
note that this section is not intended to be a reference The next row, “Non-Cache Coherent”, covers accesses
manual covering all (or even most) aspects of each CPU from multiple threads to a single variable, which was
family, but rather a high-level overview providing a rough discussed in Section 15.2.6.
comparison. For full details, see the reference manual for The final three rows cover instruction-level choices and
the CPU of interest. issues. The first row indicates how each CPU implements
Getting back to Table 15.5, the first group of rows load-acquire and store-release, the second row classifies
look at memory-ordering properties and the second group CPUs by atomic-instruction type, and the third and final
looks at instruction properties. row indicates whether a given CPU has an incoherent
The first three rows indicate whether a given CPU al- instruction cache and pipeline. Such CPUs require special
lows the four possible combinations of loads and stores instructions be executed for self-modifying code.
to be reordered, as discussed in Section 15.1 and Sec- The common “just say no” approach to memory-
tions 15.2.2.1–15.2.2.3. The next row (“Atomic Instruc- ordering operations can be eminently reasonable where
tions Reordered With Loads or Stores?”) indicates whether it applies, but there are environments, such as the Linux
a given CPU allows loads and stores to be reordered with kernel, where direct use of memory-ordering operations
atomic instructions. is required. Therefore, Linux provides a carefully cho-
The fifth and sixth rows cover reordering and depen- sen least-common-denominator set of memory-ordering
dencies, which was covered in Sections 15.2.3–15.2.5 primitives, which are as follows:
v2022.09.25a
352 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
CPU Family
SPARC TSO
Armv7-A/R
z Systems
POWER
Itanium
Armv8
Alpha
MIPS
x86
Property
Instructions Load-Acquire/Store-Release? F F i I F b
Atomic RMW Instruction Type? L L L C L L C C C
Incoherent Instruction Cache/Pipeline? Y Y Y Y Y Y Y Y Y
Key: Load-Acquire/Store-Release?
b: Lightweight memory barrier
F: Full memory barrier
i: Instruction with lightweight ordering
I: Instruction with heavyweight ordering
Atomic RMW Instruction Type?
C: Compare-and-exchange instruction
L: Load-linked/store-conditional instruction
smp_mb() (full memory barrier) that orders both loads smp_mb__after_atomic() that forces ordering of ac-
and stores. This means that loads and stores preced- cesses preceding an earlier RMW atomic operation
ing the memory barrier will be committed to memory against accesses following the smp_mb__after_
before any loads and stores following the memory atomic(). This is also a noop on systems that fully
barrier. order atomic RMW operations.
smp_rmb() (read memory barrier) that orders only loads. smp_mb__after_spinlock() that forces ordering of
accesses preceding a lock acquisition against ac-
smp_wmb() (write memory barrier) that orders only cesses following the smp_mb__after_spinlock().
stores. This is also a noop on systems that fully order lock
acquisitions.
smp_mb__before_atomic() that forces ordering of ac-
cesses preceding the smp_mb__before_atomic() mmiowb() that forces ordering on MMIO writes that
against accesses following a later RMW atomic op- are guarded by global spinlocks, and is more
eration. This is a noop on systems that fully order thoroughly described in a 2016 LWN article on
atomic RMW operations. MMIO [MDR16].
v2022.09.25a
15.5. HARDWARE SPECIFICS 353
The smp_mb(), smp_rmb(), and smp_wmb() primitives Listing 15.49: Insert and Lock-Free Search (No Ordering)
also force the compiler to eschew any optimizations that 1 struct el *insert(long key, long data)
2 {
would have the effect of reordering memory optimizations 3 struct el *p;
across the barriers. 4 p = kmalloc(sizeof(*p), GFP_ATOMIC);
5 spin_lock(&mutex);
Quick Quiz 15.41: What happens to code between an atomic 6 p->next = head.next;
operation and an smp_mb__after_atomic()? 7 p->key = key;
8 p->data = data;
9 smp_store_release(&head.next, p);
These primitives generate code only in SMP kernels, 10 spin_unlock(&mutex);
11 }
however, several have UP versions (mb(), rmb(), and 12
wmb(), respectively) that generate a memory barrier even 13 struct el *search(long searchkey)
14 {
in UP kernels. The smp_ versions should be used in most 15 struct el *p;
cases. However, these latter primitives are useful when 16 p = READ_ONCE_OLD(head.next);
17 while (p != &head) {
writing drivers, because MMIO accesses must remain 18 /* Prior to v4.15, BUG ON ALPHA!!! */
ordered even in UP kernels. In absence of memory- 19 if (p->key == searchkey) {
20 return (p);
ordering operations, both CPUs and compilers would 21 }
happily rearrange these accesses, which at best would 22 p = READ_ONCE_OLD(p->next);
23 };
make the device act strangely, and could crash your kernel 24 return (NULL);
or even damage your hardware. 25 }
So most kernel programmers need not worry about the memory-ordering peculiarities of each and every CPU, as long as they stick to these interfaces. If you are working deep in a given CPU's architecture-specific code, of course, all bets are off.

Furthermore, all of Linux's locking primitives (spinlocks, reader-writer locks, semaphores, RCU, ...) include any needed ordering primitives. So if you are working with code that uses these primitives properly, you need not worry about Linux's memory-ordering primitives.

That said, deep knowledge of each CPU's memory-consistency model can be very helpful when debugging, to say nothing of when writing architecture-specific code or synchronization primitives.

Besides, they say that a little knowledge is a very dangerous thing. Just imagine the damage you could do with a lot of knowledge! For those who wish to understand more about individual CPUs' memory consistency models, the next sections describe those of a few popular and prominent CPUs. Although there is no substitute for actually reading a given CPU's documentation, these sections do give a good overview.

15.5.1 Alpha

It may seem strange to say much of anything about a CPU whose end of life has long since passed, but Alpha is interesting because it is the only mainstream CPU that reorders dependent loads, and has thus had outsized influence on concurrency APIs, including within the Linux kernel. The need for core Linux-kernel code to accommodate Alpha ended with version v4.15 of the Linux kernel, and all traces of this accommodation were removed in v5.9 with the removal of the smp_read_barrier_depends() and read_barrier_depends() APIs. This section is nevertheless retained in the Second Edition because here in early 2021 there are quite a few Linux kernel hackers still working on pre-v4.15 versions of the Linux kernel. In addition, the modifications to READ_ONCE() that permitted these APIs to be removed have not necessarily propagated to all userspace projects that might still support Alpha.

The dependent-load difference between Alpha and the other CPUs is illustrated by the code shown in Listing 15.49. This smp_store_release() guarantees that the element initialization in lines 6–8 is executed before the element is added to the list on line 9, so that the lock-free search will work correctly. That is, it makes this guarantee on all CPUs except Alpha.

Listing 15.49: Insert and Lock-Free Search (No Ordering)
 1 struct el *insert(long key, long data)
 2 {
 3   struct el *p;
 4   p = kmalloc(sizeof(*p), GFP_ATOMIC);
 5   spin_lock(&mutex);
 6   p->next = head.next;
 7   p->key = key;
 8   p->data = data;
 9   smp_store_release(&head.next, p);
10   spin_unlock(&mutex);
11 }
12
13 struct el *search(long searchkey)
14 {
15   struct el *p;
16   p = READ_ONCE_OLD(head.next);
17   while (p != &head) {
18     /* Prior to v4.15, BUG ON ALPHA!!! */
19     if (p->key == searchkey) {
20       return (p);
21     }
22     p = READ_ONCE_OLD(p->next);
23   };
24   return (NULL);
25 }

Given the pre-v4.15 implementation of READ_ONCE(), indicated by READ_ONCE_OLD() in the listing, Alpha actually allows the code on line 19 of Listing 15.49 to see the old garbage values that were present before the initialization on lines 6–8.

Figure 15.17 shows how this can happen on an aggressively parallel machine with partitioned caches, so that alternating cache lines are processed by the different partitions of the caches. For example, the load of head.next on line 16 of Listing 15.49 might access cache bank 0, and the load of p->key on line 19 and of p->next on line 22 might access cache bank 1. On Alpha, the smp_store_release() will guarantee that the cache invalidations performed by lines 6–8 of Listing 15.49 (for p->next,
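For contrast, here is a sketch (not from the original text) of the same search using a v4.15-or-later READ_ONCE(), which supplies the dependency ordering that Alpha needs for dependent loads; struct el and head are assumed to be as in Listing 15.49:

struct el *search(long searchkey)
{
	struct el *p;

	/* A modern READ_ONCE() includes the ordering Alpha requires. */
	p = READ_ONCE(head.next);
	while (p != &head) {
		if (p->key == searchkey)
			return p;
		p = READ_ONCE(p->next);
	}
	return NULL;
}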
15.5.2 Armv7-A/R

The Arm family of CPUs is extremely popular in embedded applications, particularly for power-constrained applications such as cellphones. Its memory model is similar to that of POWER (see Section 15.5.6), but Arm uses a different set of memory-barrier instructions [ARM10]:

DMB (data memory barrier) causes the specified type of operations to appear to have completed before any subsequent operations of the same type. The “type” of operations can be all operations or can be restricted to only writes (similar to the Alpha wmb and the POWER eieio instructions). In addition, Arm allows cache coherence to have one of three scopes: Single processor, a subset of the processors (“inner”) and global (“outer”).

DSB (data synchronization barrier) causes the specified type of operations to actually complete before any subsequent operations (of any type) are executed.

[Figure 15.18: Half Memory Barrier]
15.5.3 Armv8

However, Armv8 goes well beyond the C11 memory model by mandating that the combination of a store-release and load-acquire act as a full barrier under certain circumstances. For example, in Armv8, given a store followed by a store-release followed by a load-acquire followed by a load, all to different variables and all from a single CPU, all CPUs would agree that the initial store preceded the final load. Interestingly enough, most TSO architectures (including x86 and the mainframe) do not make this guarantee, as the two loads could be reordered before the two stores.

Armv8 is one of only two architectures that needs the smp_mb__after_spinlock() primitive to be a full barrier, due to its relatively weak lock-acquisition implementation in the Linux kernel.

Armv8 also has the distinction of being the first CPU whose vendor publicly defined its memory ordering with an executable formal model [ARM17].

15.5.4 Itanium

Itanium offers a weak consistency model, so that in the absence of explicit memory-barrier instructions or dependencies, Itanium is within its rights to arbitrarily reorder memory references [Int02a]. Itanium has a memory-fence instruction named mf, but also has “half-memory fence” modifiers to loads, stores, and to some of its atomic instructions [Int02b]. The acq modifier prevents subsequent memory-reference instructions from being reordered before the acq, but permits prior memory-reference instructions to be reordered after the acq, similar to the Armv8 load-acquire instructions. Similarly, the rel modifier prevents prior memory-reference instructions from being reordered after the rel, but allows subsequent memory-reference instructions to be reordered before the rel.

These half-memory fences are useful for critical sections, since it is safe to push operations into a critical section, but can be fatal to allow them to bleed out. However, as one of the few CPUs with this property, Itanium at one time defined Linux's semantics of memory ordering associated with lock acquisition and release.20 Oddly enough, actual Itanium hardware is rumored to implement both load-acquire and store-release instructions as full barriers. Nevertheless, Itanium was the first mainstream CPU to introduce the concept (if not the reality) of load-acquire and store-release into its instruction set.

20 PowerPC is now the architecture with this dubious privilege.

Quick Quiz 15.44: Given that hardware can have a half memory barrier, why don't locking primitives allow the compiler to move memory-reference instructions into lock-based critical sections?

The Itanium mf instruction is used for the smp_rmb(), smp_mb(), and smp_wmb() primitives in the Linux kernel. Despite persistent rumors to the contrary, the “mf” mnemonic stands for “memory fence”.

Itanium also offers a global total order for release operations, including the mf instruction. This provides the notion of transitivity, where if a given code fragment sees a given access as having happened, any later code fragment will also see that earlier access as having happened. Assuming, that is, that all the code fragments involved correctly use memory barriers.

Finally, Itanium is the only architecture supporting the Linux kernel that can reorder normal loads to the same variable. The Linux kernel avoids this issue because READ_ONCE() emits a volatile load, which is compiled as a ld,acq instruction, which forces ordering of all READ_ONCE() invocations by a given CPU, including those to the same variable.

15.5.5 MIPS

The MIPS memory model [Wav16, page 479] appears to resemble that of Arm, Itanium, and POWER, being weakly ordered by default, but respecting dependencies. MIPS has a wide variety of memory-barrier instructions, but ties them not to hardware considerations, but rather to the use cases provided by the Linux kernel and the C++11 standard [Smi19] in a manner similar to the Armv8 additions:

SYNC
  Full barrier for a number of hardware operations in addition to memory references, which is used to implement the v4.13 Linux kernel's smp_mb() for OCTEON systems.

SYNC_WMB
  Write memory barrier, which can be used on OCTEON systems to implement the smp_wmb() primitive in the v4.13 Linux kernel via the syncw mnemonic. Other systems use plain sync.

SYNC_MB
  Full memory barrier, but only for memory operations. This may be used to implement the C++ atomic_thread_fence(memory_order_seq_cst).
POWER features “cumulativity”, which can be used to obtain transitivity. When used properly, any code seeing the results of an earlier code fragment will also see the accesses that this earlier code fragment itself saw. Much more detail is available from McKenney and Silvera [MS09].

POWER respects control dependencies in much the same way that Arm does, with the exception that the POWER isync instruction is substituted for the Arm ISB instruction.

Like Armv8, POWER requires smp_mb__after_spinlock() to be a full memory barrier. In addition, POWER is the only architecture requiring smp_mb__after_unlock_lock() to be a full memory barrier. In both cases, this is because of the weak ordering properties of POWER's locking primitives, due to the use of the lwsync instruction to provide ordering for both acquisition and release.

Many members of the POWER architecture have incoherent instruction caches, so that a store to memory will not necessarily be reflected in the instruction cache. Thankfully, few people write self-modifying code these days, but JITs and compilers do it all the time. Furthermore, recompiling a recently run program looks just like self-modifying code from the CPU's viewpoint. The icbi instruction (instruction cache block invalidate) invalidates a specified cache line from the instruction cache, and may be used in these situations.

15.5.7 SPARC TSO

Although SPARC's TSO (total-store order) is used by both Linux and Solaris, the architecture also defines PSO (partial store order) and RMO (relaxed-memory order). Any program that runs in RMO will also run in either PSO or TSO, and similarly, a program that runs in PSO will also run in TSO. Moving a shared-memory parallel program in the other direction may require careful insertion of memory barriers.

Although SPARC's PSO and RMO modes are not used much these days, they did give rise to a very flexible memory-barrier instruction [SPA94] that permits fine-grained control of ordering:

StoreStore orders preceding stores before subsequent stores. (This option is used by the Linux smp_wmb() primitive.)

LoadStore orders preceding loads before subsequent stores.

StoreLoad orders preceding stores before subsequent loads.

LoadLoad orders preceding loads before subsequent loads. (This option is used by the Linux smp_rmb() primitive.)

Sync fully completes all preceding operations before starting any subsequent operations.

MemIssue completes preceding memory operations before subsequent memory operations, important for some instances of memory-mapped I/O.

Lookaside does the same as MemIssue, but only applies to preceding stores and subsequent loads, and even then only for stores and loads that access the same memory location.

So, why is “membar #MemIssue” needed? Because a “membar #StoreLoad” could permit a subsequent load to get its value from a store buffer, which would be disastrous if the write was to an MMIO register that induced side effects on the value to be read. In contrast, “membar #MemIssue” would wait until the store buffers were flushed before permitting the loads to execute, thereby ensuring that the load actually gets its value from the MMIO register. Drivers could instead use “membar #Sync”, but the lighter-weight “membar #MemIssue” is preferred in cases where the additional functions of the more-expensive “membar #Sync” are not required.

The “membar #Lookaside” is a lighter-weight version of “membar #MemIssue”, which is useful when writing to a given MMIO register affects the value that will next be read from that register. However, the heavier-weight “membar #MemIssue” must be used when a write to a given MMIO register affects the value that will next be read from some other MMIO register.

SPARC requires a flush instruction be used between the time that the instruction stream is modified and the time that any of these instructions are executed [SPA94]. This is needed to flush any prior value for that location from the SPARC's instruction cache. Note that flush takes an address, and will flush only that address from the instruction cache. On SMP systems, all CPUs' caches are flushed, but there is no convenient way to determine when the off-CPU flushes complete, though there is a reference to an implementation note.

But again, the Linux kernel runs SPARC in TSO mode, so all of the above membar variants are strictly of historical interest. In particular, the smp_mb() primitive only needs to use #StoreLoad because the other three reorderings are prohibited by TSO.

15.5.8 x86

Although newer x86 implementations accommodate self-modifying code without any special instructions, to be fully compatible with past and potential future x86 implementations, a given CPU must execute a jump instruction or a serializing instruction (e.g., cpuid) between modifying the code and executing it [Int11, Section 8.1.3].
key point is that although loads and stores are conceptually simple, on real multicore hardware significant periods of time are required for their effects to become visible to all other threads.

The simple and intuitive case occurs when one thread loads a value that some other thread stored. This straightforward cause-and-effect case exhibits temporal behavior, so that the software can safely assume that the store instruction completed before the load instruction started. In real life, the load instruction might well have started quite some time before the store instruction did, but all modern hardware must carefully hide such cases from the software. Again, software will see the expected temporal cause-and-effect behavior when one thread loads a value that some other thread stores, as discussed in Section 15.2.7.3.

However, hardware is under no obligation to provide temporal cause-and-effect illusions when one thread's store overwrites a value either loaded or stored by some other thread. It is quite possible that, from the software's viewpoint, an earlier store will overwrite a later store's value, but only if those two stores were executed by different threads. Similarly, a later load might well read a value overwritten by an earlier store, but again only if that load and store were executed by different threads. This counter-intuitive behavior occurs due to the need to buffer stores in order to achieve adequate performance, as discussed in Section 15.2.7.2.

As a result, situations where one thread reads a value written by some other thread can make do with far weaker ordering than can situations where one thread overwrites a value loaded or stored by some other thread. These differences are captured by the following rules of thumb.

The first rule of thumb is that memory-ordering operations are only required where there is a possibility of interaction between at least two variables shared among at least two threads. In light of the intervening material, this single sentence encapsulates much of Section 15.1.3's basic rules of thumb, for example, keeping in mind that "memory-barrier pairing" is a two-thread special case of "cycle". And, as always, if a single-threaded program will provide sufficient performance, why bother with parallelism?21 After all, avoiding parallelism also avoids the added cost of memory-ordering operations.

21 Hobbyists and researchers should of course feel free to ignore this and many other cautions.

The second rule of thumb involves load-buffering situations: If all thread-to-thread communication in a given cycle uses store-to-load links (that is, the next thread's load returns the value stored by the previous thread), minimal ordering suffices. Minimal ordering includes dependencies and acquires as well as all stronger ordering operations.

The third rule of thumb involves release-acquire chains: If all but one of the links in a given cycle is a store-to-load link, it is sufficient to use release-acquire pairs for each of those store-to-load links, as illustrated by Listings 15.23 and 15.24. You can replace a given acquire with a dependency in environments permitting this, keeping in mind that the C11 standard's memory model does not fully respect dependencies. Therefore, a dependency leading to a load must be headed by a READ_ONCE() or an rcu_dereference(): A plain C-language load is not sufficient. In addition, carefully review Sections 15.3.2 and 15.3.3, because a dependency broken by your compiler will not order anything. The two threads sharing the sole non-store-to-load link can usually substitute WRITE_ONCE() plus smp_wmb() for smp_store_release() on the one hand, and READ_ONCE() plus smp_rmb() for smp_load_acquire() on the other (a sketch of this substitution appears at the end of this section). However, the wise developer will check such substitutions carefully, for example, using the herd tool as described in Section 12.3.

Quick Quiz 15.45: Why is it necessary to use heavier-weight ordering for load-to-store and store-to-store links, but not for store-to-load links? What on earth makes store-to-load links so special???

The fourth and final rule of thumb identifies where full memory barriers (or stronger) are required: If a given cycle contains two or more non-store-to-load links (that is, a total of two or more links that are either load-to-store or store-to-store links), you will need at least one full barrier between each pair of non-store-to-load links in that cycle, as illustrated by Listing 15.19 as well as in the answer to Quick Quiz 15.24. Full barriers include smp_mb(), successful full-strength non-void atomic RMW operations, and other atomic RMW operations in conjunction with either smp_mb__before_atomic() or smp_mb__after_atomic(). Any of RCU's grace-period-wait primitives (synchronize_rcu() and friends) also act as full barriers, but at far greater expense than smp_mb(). With strength comes expense, though full barriers usually hurt performance more than they hurt scalability.

Recapping the rules:

1. Memory-ordering operations are required only if at least two variables are shared by at least two threads.

2. If all links in a cycle are store-to-load links, then minimal ordering suffices.
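As a hedged illustration of the third rule of thumb's substitution, the following two-thread message-passing sketch uses WRITE_ONCE() plus smp_wmb() in place of smp_store_release(), and READ_ONCE() plus smp_rmb() in place of smp_load_acquire(). The variable and function names are invented for this example, and any such substitution should be checked, for example with herd, before being relied upon:

#include <linux/compiler.h>
#include <asm/barrier.h>

static int data;
static int ready;

/* Producer: in this pattern, equivalent to
 *   WRITE_ONCE(data, 42); smp_store_release(&ready, 1); */
static void producer(void)
{
	WRITE_ONCE(data, 42);
	smp_wmb();		/* order store to data before store to ready */
	WRITE_ONCE(ready, 1);
}

/* Consumer: in this pattern, equivalent to
 *   if (smp_load_acquire(&ready)) r1 = READ_ONCE(data); */
static int consumer(void)
{
	int r1 = -1;

	if (READ_ONCE(ready)) {
		smp_rmb();	/* order load of ready before load of data */
		r1 = READ_ONCE(data);	/* guaranteed to observe 42 here */
	}
	return r1;
}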
Ease of Use

Creating a perfect API is like committing the perfect crime. There are at least fifty things that can go wrong, and if you are a genius, you might be able to anticipate twenty-five of them.
16.1 What is Easy?

When someone says "I want a programming language in which I need only say what I wish done," give them a lollipop.

Alan J. Perlis, updated

If you are tempted to look down on ease-of-use requirements, please consider that an ease-of-use bug in Linux-kernel RCU resulted in an exploitable Linux-kernel security bug in a use of RCU [McK19a]. It is therefore clearly important that even in-kernel APIs be easy to use.

Unfortunately, "easy" is a relative term. For example, many people would consider a 15-hour airplane flight to be a bit of an ordeal—unless they stopped to consider alternative modes of transportation, especially swimming. This means that creating an easy-to-use API requires that you understand your intended users well enough to know what is easy for them. Which might or might not have anything to do with what is easy for you.

The following question illustrates this point: "Given a randomly chosen person among everyone alive today, what one change would improve that person's life?"

There is no single change that would be guaranteed to help everyone's life. After all, there is an extremely wide range of people, with a correspondingly wide range of needs, wants, desires, and aspirations. A starving person might need food, but additional food might well hasten the death of a morbidly obese person. The high level of excitement so fervently desired by many young people might well be fatal to someone recovering from a heart attack. Information critical to the success of one person might contribute to the failure of someone suffering from information overload. In short, if you are working on a project to help people you know nothing about, you should not be surprised when those people find fault with your project.

If you really want to help a given group of people, there is simply no substitute for working closely with them over an extended period of time, as in years. Nevertheless, there are some simple things that you can do to increase the odds of your users being happy with your software, and some of these things are covered in the next section.

16.2 Rusty Scale for API Design

Finding the appropriate measurement is thus not a mathematical exercise. It is a risk-taking judgment.

Peter Drucker

This section is adapted from portions of Rusty Russell's 2003 Ottawa Linux Symposium keynote address [Rus03, Slides 39–57]. Rusty's key point is that the goal should not be merely to make an API easy to use, but rather to make the API hard to misuse. To that end, Rusty proposed his "Rusty Scale" in decreasing order of this important hard-to-misuse property.

The following list attempts to generalize the Rusty Scale beyond the Linux kernel:

1. It is impossible to get wrong. Although this is the standard to which all API designers should strive, only the mythical dwim()1 command manages to come close.

2. The compiler or linker won't let you get it wrong.

3. The compiler or linker will warn you if you get it wrong. BUILD_BUG_ON() is your users' friend.

1 The dwim() function is an acronym that expands to "do what I mean".
4. The simplest use is the correct one.

5. The name tells you how to use it. But names can be two-edged swords. Although rcu_read_lock() is plain enough for someone converting code from reader-writer locking, it might cause some consternation for someone converting code from reference counting.

6. Do it right or it will always break at runtime. WARN_ON_ONCE() is your users' friend.

7. Follow common convention and you will get it right. The malloc() library function is a good example. Although it is easy to get memory allocation wrong, a great many projects do manage to get it right, at least most of the time. Using malloc() in conjunction with Valgrind [The11] moves malloc() almost up to the "do it right or it will always break at runtime" point on the scale.

8. Read the documentation and you will get it right.

9. Read the implementation and you will get it right.

10. Read the right mailing-list archive and you will get it right.

11. Read the right mailing-list archive and you will get it wrong.

12. Read the implementation and you will get it wrong. The original non-CONFIG_PREEMPT implementation of rcu_read_lock() [McK07a] is an infamous example of this point on the scale.

13. Read the documentation and you will get it wrong. For example, the DEC Alpha wmb instruction's documentation [Cor02] fooled a number of developers into thinking that this instruction had much stronger memory-order semantics than it actually does. Later documentation clarified this point [Com01, Pug00], moving the wmb instruction up to the "read the documentation and you will get it right" point on the scale.

14. Follow common convention and you will get it wrong. The printf() statement is an example of this point on the scale because developers almost always fail to check printf()'s error return.

15. Do it right and it will break at runtime.

16. The name tells you how not to use it.

17. The obvious use is wrong. The Linux kernel smp_mb() function is an example of this point on the scale. Many developers assume that this function has much stronger ordering semantics than it actually possesses. Chapter 15 contains the information needed to avoid this mistake, as does the Linux-kernel source tree's Documentation and tools/memory-model directories.

18. The compiler or linker will warn you if you get it right.

19. The compiler or linker won't let you get it right.

20. It is impossible to get right. The gets() function is a famous example of this point on the scale. In fact, gets() can perhaps best be described as an unconditional buffer-overflow security hole.

16.3 Shaving the Mandelbrot Set

Simplicity does not precede complexity, but follows it.

Alan J. Perlis

The set of useful programs resembles the Mandelbrot set (shown in Figure 16.1) in that it does not have a clear-cut smooth boundary—if it did, the halting problem would be solvable. But we need APIs that real people can use, not ones that require a Ph.D. dissertation be completed for each and every potential use. So, we "shave the Mandelbrot set",2 restricting the use of the API to an easily described subset of the full set of potential uses.

Such shaving may seem counterproductive. After all, if an algorithm works, why shouldn't it be used?

To see why at least some shaving is absolutely necessary, consider a locking design that avoids deadlock, but in perhaps the worst possible way. This design uses a circular doubly linked list, which contains one element for each thread in the system along with a header element. When a new thread is spawned, the parent thread must insert a new element into this list, which requires some sort of synchronization.

One way to protect the list is to use a global lock. However, this might be a bottleneck if threads were being created and deleted frequently.3 Another approach would

2 Due to Josh Triplett.
3 Those of you with strong operating-system backgrounds, please suspend disbelief. Those unable to suspend disbelief are encouraged to provide better examples.
Conflicting Visions of the Future

Prediction is very difficult, especially about the future.

17.1 The Future of CPU Technology Ain't What it Used to Be

A great future behind him.

Years past always seem so simple and innocent when viewed through the lens of many years of experience. And the early 2000s were for the most part innocent of the impending failure of Moore's Law to continue delivering the then-traditional increases in CPU clock frequency. Oh, there were the occasional warnings about the limits of technology, but such warnings had been sounded for decades. With that in mind, consider the following scenarios:

1. Uniprocessor Über Alles (Figure 17.1),

2. Multithreaded Mania (Figure 17.2),

3. More of the Same (Figure 17.3),

4. Crash Dummies Slamming into the Memory Wall (Figure 17.4),

5. Astounding Accelerators (Figure 17.5).

Each of these scenarios is covered in the following sections.

17.1.1 Uniprocessor Über Alles

As was said in 2004 [McK04]:
In this scenario, the combination of Moore's-Law increases in CPU clock rate and continued progress in horizontally scaled computing render SMP systems irrelevant. This scenario is therefore dubbed "Uniprocessor Über Alles", literally, uniprocessors above all else.

These uniprocessor systems would be subject only to instruction overhead, since memory barriers, cache thrashing, and contention do not affect single-CPU systems. In this scenario, RCU is useful only for niche applications, such as interacting with NMIs. It is not clear that an operating system lacking RCU would see the need to adopt it, although operating systems that already implement RCU might continue to do so.

However, recent progress with multithreaded CPUs seems to indicate that this scenario is quite unlikely.

Unlikely indeed! But the larger software community was reluctant to accept the fact that they would need to embrace parallelism, and so it was some time before this community concluded that the "free lunch" of Moore's-Law-induced CPU core-clock frequency increases was well and truly finished. Never forget: Belief is an emotion, not necessarily the result of a rational technical thought process!

17.1.2 Multithreaded Mania

A less-extreme variant of Uniprocessor Über Alles features uniprocessors with hardware multithreading, and in fact multithreaded CPUs are now standard for many desktop and laptop computer systems. The most aggressively multithreaded CPUs share all levels of cache hierarchy, thereby eliminating CPU-to-CPU memory latency, in turn greatly reducing the performance penalty for traditional synchronization mechanisms. However, a multithreaded CPU would still incur overhead due to contention and to pipeline stalls caused by memory barriers. Furthermore, because all hardware threads share all levels of cache, the cache available to a given hardware thread is a fraction of what it would be on an equivalent single-threaded CPU, which can degrade performance for applications with large cache footprints. There is also some possibility that the restricted amount of cache available will cause RCU-based algorithms to incur performance penalties due to their grace-period-induced additional memory consumption. Investigating this possibility is future work.

However, in order to avoid such performance degradation, a number of multithreaded CPUs and multi-CPU chips partition at least some of the levels of cache on a per-hardware-thread basis. This increases the amount of cache available to each hardware thread, but re-introduces memory latency for cachelines that are passed from one hardware thread to another.

And we all know how this story has played out, with multiple multi-threaded cores on a single die plugged into a single socket, with varying degrees of optimization for lower numbers of active threads per core. The question then becomes whether or not future shared-memory systems will always fit into a single socket.

17.1.3 More of the Same

Again from 2004 [McK04]:

This scenario actually represents a change, since to have more of the same, interconnect performance must begin keeping up with the Moore's-Law increases in core CPU performance. In this scenario, overhead due to pipeline stalls, memory latency, and contention remains significant, and RCU retains the high level of applicability that it enjoys today.

And the change has been the ever-increasing levels of integration that Moore's Law is still providing. But longer term, which will it be? More CPUs per die? Or more I/O, cache, and memory?

Servers seem to be choosing the former, while embedded systems on a chip (SoCs) continue choosing the latter.
[Figure: "Instructions per Memory Reference Time" versus year (1982–2002), showing spinlock and RCU curves.]
workloads. In the event, the SLAB_TYPESAFE_BY_RCU has been pressed into service in a number of instances where these cache-warmth issues would otherwise be problematic, as has sequence locking. On the other hand, this passage also failed to anticipate that RCU would be used to reduce scheduling latency or for security.

Much of the data generated for this book was collected on an eight-socket system with 28 cores per socket and two hardware threads per core, for a total of 448 hardware threads. The idle-system memory latencies are less than one microsecond, which are no worse than those of similar-sized systems of the year 2004. Some claim that these latencies approach a microsecond only because of the x86 CPU family's relatively strong memory ordering, but it may be some time before that particular argument is settled.

17.1.5 Astounding Accelerators

The potential of hardware accelerators was not quite as clear in 2004 as it is in 2021, so this section has no quote. However, the November 2020 Top 500 list [MDSS20] features a great many accelerators, so one could argue that this section is a view of the present rather than of the future. The same could be said of most of the preceding sections.

Hardware accelerators are being put to many other uses, including encryption, compression, and machine learning.

In short, beware of prognostications, including those in the remainder of this chapter.

17.2 Transactional Memory

Everything should be as simple as it can be, but not simpler.

Albert Einstein, by way of Louis Zukofsky

The idea of using transactions outside of databases goes back many decades [Lom77, Kni86, HM93], with the key difference between database and non-database transactions being that non-database transactions drop the "D" in the "ACID"1 properties defining database transactions. The idea of supporting memory-based transactions, or "transactional memory" (TM), in hardware is more recent [HM93], but unfortunately, support for such transactions in commodity hardware was not immediately forthcoming, despite other somewhat similar proposals being put forward [SSHT93]. Not long after, Shavit and Touitou proposed a software-only implementation of transactional memory (STM) that was capable of running on commodity hardware, give or take memory-ordering issues [ST95]. This proposal languished for many years, perhaps due to the fact that the research community's attention was absorbed by non-blocking synchronization (see Section 14.2).

1 Atomicity, consistency, isolation, and durability.

But by the turn of the century, TM started receiving more attention [MT01, RG01], and by the middle of the decade, the level of interest can only be termed "incandescent" [Her05, Gro07], with only a few voices of caution [BLM05, MMW07].

The basic idea behind TM is to execute a section of code atomically, so that other threads see no intermediate state. As such, the semantics of TM could be implemented by simply replacing each transaction with a recursively acquirable global lock acquisition and release, albeit with abysmal performance and scalability. Much of the complexity inherent in TM implementations, whether hardware or software, is efficiently detecting when concurrent transactions can safely run in parallel. Because this detection is done dynamically, conflicting transactions can be aborted or "rolled back", and in some implementations, this failure mode is visible to the programmer.

Because transaction roll-back is increasingly unlikely as transaction size decreases, TM might become quite attractive for small memory-based operations, such as linked-list manipulations used for stacks, queues, hash tables, and search trees. However, it is currently much more difficult to make the case for large transactions, particularly those containing non-memory operations such as I/O and process creation. The following sections look at current challenges to the grand vision of "Transactional Memory Everywhere" [McK09b]. Section 17.2.1 examines the challenges faced interacting with the outside world, Section 17.2.2 looks at interactions with process modification primitives, Section 17.2.3 explores interactions with other synchronization primitives, and finally Section 17.2.4 closes with some discussion.

17.2.1 Outside World

In the wise words of Donald Knuth:

Many computer users feel that input and output are not actually part of "real programming," they are merely things that (unfortunately) must
be done in order to get information in and out of the machine.

Whether or not we believe that input and output are "real programming", the fact is that software absolutely must deal with the outside world. This section therefore critiques transactional memory's outside-world capabilities, focusing on I/O operations, time delays, and persistent storage.

17.2.1.1 I/O Operations

One can execute I/O operations within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. What happens when you attempt to execute an I/O operation from within a transaction?

The underlying problem is that transactions may be rolled back, for example, due to conflicts. Roughly speaking, this requires that all operations within any given transaction be revocable, so that executing the operation twice has the same effect as executing it once. Unfortunately, I/O is in general the prototypical irrevocable operation, making it difficult to include general I/O operations in transactions. In fact, general I/O is irrevocable: Once you have pushed the proverbial button launching the nuclear warheads, there is no turning back.

Here are some options for handling of I/O within transactions:

1. Restrict I/O within transactions to buffered I/O with in-memory buffers. These buffers may then be included in the transaction in the same way that any other memory location might be included. This seems to be the mechanism of choice, and it does work well in many common cases of situations such as stream I/O and mass-storage I/O. However, special handling is required in cases where multiple record-oriented output streams are merged onto a single file from multiple processes, as might be done using the "a+" option to fopen() or the O_APPEND flag to open(). In addition, as will be seen in the next section, common networking operations cannot be handled via buffering.

2. Prohibit I/O within transactions, so that any attempt to execute an I/O operation aborts the enclosing transaction (and perhaps multiple nested transactions). This approach seems to be the conventional TM approach for unbuffered I/O, but requires that TM interoperate with other synchronization primitives tolerating I/O.

3. Prohibit I/O within transactions, but enlist the compiler's aid in enforcing this prohibition.

4. Permit only one special irrevocable transaction [SMS08] to proceed at any given time, thus allowing irrevocable transactions to contain I/O operations.2 This works in general, but severely limits the scalability and performance of I/O operations. Given that scalability and performance is a first-class goal of parallelism, this approach's generality seems a bit self-limiting. Worse yet, use of irrevocability to tolerate I/O operations seems to greatly restrict use of manual transaction-abort operations.3 Finally, if there is an irrevocable transaction manipulating a given data item, any other transaction manipulating that same data item cannot have non-blocking semantics.

5. Create new hardware and protocols such that I/O operations can be pulled into the transactional substrate. In the case of input operations, the hardware would need to correctly predict the result of the operation, and to abort the transaction if the prediction failed.

2 In earlier literature, irrevocable transactions are termed inevitable transactions.
3 This difficulty was pointed out by Michael Factor. To see the problem, think through what TM should do in response to an attempt to abort a transaction after it has executed an irrevocable operation.

I/O operations are a well-known weakness of TM, and it is not clear that the problem of supporting I/O in transactions has a reasonable general solution, at least if "reasonable" is to include usable performance and scalability. Nevertheless, continued time and attention to this problem will likely produce additional progress.

17.2.1.2 RPC Operations

One can execute RPCs within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. What happens when you attempt to execute an RPC from within a transaction?

If both the RPC request and its response are to be contained within the transaction, and if some part of the transaction depends on the result returned by the response, then it is not possible to use the memory-buffer tricks that
can be used in the case of buffered I/O. Any attempt to take this buffering approach would deadlock the transaction, as the request could not be transmitted until the transaction was guaranteed to succeed, but the transaction's success might not be knowable until after the response is received, as is the case in the following example:

necessary to roll all but one of them back, with consequent degradation of performance and scalability. This approach nevertheless might be valuable given long-running transactions ending with an RPC. This approach must still restrict manual transaction-abort operations.
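The example referred to above is not reproduced in this extraction; a minimal, hypothetical sketch of the problematic pattern—a transaction whose later memory accesses depend on an RPC response—might look as follows. The begin_trans(), end_trans(), rpc_request(), and rpc_response() names are placeholders, not an actual TM or RPC API:

/* Hypothetical TM and RPC primitives, for illustration only. */
int a[16];

void transactional_rpc(void)
{
	int i;

	begin_trans();
	rpc_request();       /* cannot be buffered: the response is needed below */
	i = rpc_response();  /* transaction cannot commit before this returns */
	a[i]++;              /* memory access depends on the RPC response */
	end_trans();
}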
“elegant” solutions, fails to survive first contact with legacy code. Such code, which might well have important time delays in critical sections, would fail upon being transactionalized.

2. Abort transactions upon encountering a time-delay operation. This is attractive, but it is unfortunately not always possible to automatically detect a time-delay operation. Is that tight loop carrying out a critical computation, or is it simply waiting for time to elapse?

3. Enlist the compiler to prohibit time delays within transactions.

4. Let the time delays execute normally. Unfortunately, some TM implementations publish modifications only at commit time, which could defeat the purpose of the time delay.

It is not clear that there is a single correct answer. TM implementations featuring weak atomicity that publish changes immediately within the transaction (rolling these changes back upon abort) might be reasonably well served by the last alternative. Even in this case, the code (or possibly even hardware) at the other end of the transaction may require a substantial redesign to tolerate aborted transactions. This need for redesign would make it more difficult to apply transactional memory to legacy code.

17.2.1.4 Persistence

There are many different types of locking primitives. One interesting distinction is persistence, in other words, whether the lock can exist independently of the address space of the process using the lock.

Non-persistent locks include pthread_mutex_lock(), pthread_rwlock_rdlock(), and most kernel-level locking primitives. If the memory locations instantiating a non-persistent lock's data structures disappear, so does the lock. For typical use of pthread_mutex_lock(), this means that when the process exits, all of its locks vanish. This property can be exploited in order to trivialize lock cleanup at program shutdown time, but makes it more difficult for unrelated applications to share locks, as such sharing requires the applications to share memory.

Quick Quiz 17.1: But suppose that an application exits while holding a pthread_mutex_lock() that happens to be located in a file-mapped region of memory?

Persistent locks help avoid the need to share memory among unrelated applications. Persistent locking APIs include the flock family, lockf(), System V semaphores, or the O_CREAT flag to open(). These persistent APIs can be used to protect large-scale operations spanning runs of multiple applications, and, in the case of O_CREAT even surviving operating-system reboot. If need be, locks can even span multiple computer systems via distributed lock managers and distributed filesystems—and persist across reboots of any or all of those computer systems.

Persistent locks can be used by any application, including applications written using multiple languages and software environments. In fact, a persistent lock might well be acquired by an application written in C and released by an application written in Python.

How could a similar persistent functionality be provided for TM?

1. Restrict persistent transactions to special-purpose environments designed to support them, for example, SQL. This clearly works, given the decades-long history of database systems, but does not provide the same degree of flexibility provided by persistent locks.

2. Use snapshot facilities provided by some storage devices and/or filesystems. Unfortunately, this does not handle network communication, nor does it handle I/O to devices that do not provide snapshot capabilities, for example, memory sticks.

3. Build a time machine.

4. Avoid the problem entirely by using existing persistent facilities, presumably avoiding such use within transactions.

Of course, the fact that it is called transactional memory should give us pause, as the name itself conflicts with the concept of a persistent transaction. It is nevertheless worthwhile to consider this possibility as an important test case probing the inherent limitations of transactional memory.

17.2.2 Process Modification

Processes are not eternal: They are created and destroyed, their memory mappings are modified, they are linked to dynamic libraries, and they are debugged. These sections look at how transactional memory can handle an ever-changing execution environment.
17.2.2.1 Multithreaded Transactions

It is perfectly legal to create processes and threads while holding a lock or, for that matter, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. Not only is it legal, but it is quite simple, as can be seen from the following code fragment:

pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
    pthread_create(&tid[i], ...);
for (i = 0; i < ncpus; i++)
    pthread_join(tid[i], ...);
pthread_mutex_unlock(...);

This pseudo-code fragment uses pthread_create() to spawn one thread per CPU, then uses pthread_join() to wait for each to complete, all under the protection of pthread_mutex_lock(). The effect is to execute a lock-based critical section in parallel, and one could obtain a similar effect using fork() and wait(). Of course, the critical section would need to be quite large to justify the thread-spawning overhead, but there are many examples of large critical sections in production software.
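As a self-contained illustration of the pseudo-code above, the following sketch fills in the elided arguments with a trivial thread function; the child() function, its argument, and the fallback CPU count are made up for this example and are not part of the original text:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NCPUS_FALLBACK 4	/* used if sysconf() fails; arbitrary */

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical per-thread work done inside the lock-based critical section. */
static void *child(void *arg)
{
	printf("child %ld running inside the critical section\n", (long)arg);
	return NULL;
}

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t *tid;
	long i;

	if (ncpus < 1)
		ncpus = NCPUS_FALLBACK;
	tid = malloc(sizeof(*tid) * ncpus);
	if (!tid)
		abort();

	pthread_mutex_lock(&mutex);	/* critical section executed in parallel */
	for (i = 0; i < ncpus; i++)
		pthread_create(&tid[i], NULL, child, (void *)i);
	for (i = 0; i < ncpus; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_unlock(&mutex);

	free(tid);
	return 0;
}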
What might TM do about thread spawning within a transaction?

1. Declare pthread_create() to be illegal within transactions, preferably by aborting the transaction. Alternatively, enlist the compiler to enforce pthread_create()-free transactions.

2. Permit pthread_create() to be executed within a transaction, but only the parent thread will be considered to be part of the transaction. This approach seems to be reasonably compatible with existing and posited TM implementations, but seems to be a trap for the unwary. This approach raises further questions, such as how to handle conflicting child-thread accesses.

3. Convert the pthread_create()s to function calls. This approach is also an attractive nuisance, as it does not handle the not-uncommon cases where the child threads communicate with one another. In addition, it does not permit concurrent execution of the body of the transaction.

4. Extend the transaction to cover the parent and all child threads. This approach raises interesting questions about the nature of conflicting accesses, given that the parent and children are presumably permitted to conflict with each other, but not with other threads. It also raises interesting questions as to what should happen if the parent thread does not wait for its children before committing the transaction. Even more interesting, what happens if the parent conditionally executes pthread_join() based on the values of variables participating in the transaction? The answers to these questions are reasonably straightforward in the case of locking. The answers for TM are left as an exercise for the reader.

Given that parallel execution of transactions is commonplace in the database world, it is perhaps surprising that current TM proposals do not provide for it. On the other hand, the example above is a fairly sophisticated use of locking that is not normally found in simple textbook examples, so perhaps its omission is to be expected. That said, some researchers are using transactions to autoparallelize code [RKM+ 10], and there are rumors that other TM researchers are investigating fork/join parallelism within transactions, so perhaps this topic will soon be addressed more thoroughly.

17.2.2.2 The exec() System Call

One can execute an exec() system call within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. The exact semantics depends on the type of primitive.

In the case of non-persistent primitives (including pthread_mutex_lock(), pthread_rwlock_rdlock(), and userspace RCU), if the exec() succeeds, the whole address space vanishes, along with any locks being held. Of course, if the exec() fails, the address space still lives, so any associated locks would also still live. A bit strange perhaps, but well defined.

On the other hand, persistent primitives (including the flock family, lockf(), System V semaphores, and the O_CREAT flag to open()) would survive regardless of whether the exec() succeeded or failed, so that the exec()ed program might well release them.

Quick Quiz 17.2: What about non-persistent primitives represented by data structures in mmap() regions of memory? What happens when there is an exec() within a critical section of such a primitive?
What happens when you attempt to execute an exec() system call from within a transaction?

1. Disallow exec() within transactions, so that the enclosing transactions abort upon encountering the exec(). This is well defined, but clearly requires non-TM synchronization primitives for use in conjunction with exec().

2. Disallow exec() within transactions, with the compiler enforcing this prohibition. There is a draft specification for TM in C++ that takes this approach, allowing functions to be decorated with the transaction_safe and transaction_unsafe attributes.4 This approach has some advantages over aborting the transaction at runtime, but again requires non-TM synchronization primitives for use in conjunction with exec(). One disadvantage is the need to decorate a great many library functions with transaction_safe and transaction_unsafe attributes.

3. Treat the transaction in a manner similar to non-persistent locking primitives, so that the transaction survives if exec() fails, and silently commits if the exec() succeeds. The case where only some of the variables affected by the transaction reside in mmap()ed memory (and thus could survive a successful exec() system call) is left as an exercise for the reader.

4. Abort the transaction (and the exec() system call) if the exec() system call would have succeeded, but allow the transaction to continue if the exec() system call would fail. This is in some sense the "correct" approach, but it would require considerable work for a rather unsatisfying result.

The exec() system call is perhaps the strangest example of an obstacle to universal TM applicability, as it is not completely clear what approach makes sense, and some might argue that this is merely a reflection of the perils of real-life interaction with exec(). That said, the two options prohibiting exec() within transactions are perhaps the most logical of the group.

Similar issues surround the exit() and kill() system calls, as well as a longjmp() or an exception that would exit the transaction. (Where did the longjmp() or exception come from?)

4 Thanks to Mark Moir for pointing me at this spec, and to Michael Wong for having pointed me at an earlier revision some time back.

17.2.2.3 Dynamic Linking and Loading

Lock-based critical section, code holding a hazard pointer, sequence-locking read-side critical sections, and userspace-RCU read-side critical sections can (separately or in combination) legitimately contain code that invokes dynamically linked and loaded functions, including C/C++ shared libraries and Java class libraries. Of course, the code contained in these libraries is by definition unknowable at compile time. So, what happens if a dynamically loaded function is invoked within a transaction?

This question has two parts: (a) How do you dynamically link and load a function within a transaction and (b) What do you do about the unknowable nature of the code within this function? To be fair, item (b) poses some challenges for locking and userspace-RCU as well, at least in theory. For example, the dynamically linked function might introduce a deadlock for locking or might (erroneously) introduce a quiescent state into a userspace-RCU read-side critical section. The difference is that while the class of operations permitted in locking and userspace-RCU critical sections is well-understood, there appears to still be considerable uncertainty in the case of TM. In fact, different implementations of TM seem to have different restrictions.

So what can TM do about dynamically linked and loaded library functions? Options for part (a), the actual loading of the code, include the following:

1. Treat the dynamic linking and loading in a manner similar to a page fault, so that the function is loaded and linked, possibly aborting the transaction in the process. If the transaction is aborted, the retry will find the function already present, and the transaction can thus be expected to proceed normally.

2. Disallow dynamic linking and loading of functions from within transactions.

For part (b), the inability to detect TM-unfriendly operations in a not-yet-loaded function, possibilities include the following:

1. Just execute the code: If there are any TM-unfriendly operations in the function, simply abort the transaction. Unfortunately, this approach makes it impossible for the compiler to determine whether a given group of transactions may be safely composed. One way to permit composability regardless is irrevocable transactions, however, current implementations permit only a single irrevocable transaction to proceed
at any given time, which can severely limit performance and scalability. Irrevocable transactions also tend to restrict use of manual transaction-abort operations. Finally, if there is an irrevocable transaction manipulating a given data item, any other transaction manipulating that same data item cannot have non-blocking semantics.

2. Decorate the function declarations indicating which functions are TM-friendly. These decorations can then be enforced by the compiler’s type system. Of course, for many languages, this requires language extensions to be proposed, standardized, and implemented, with the corresponding time delays, and also with the corresponding decoration of a great many otherwise uninvolved library functions. That said, the standardization effort is already in progress [ATS09].

3. As above, disallow dynamic linking and loading of functions from within transactions.

I/O operations are of course a known weakness of TM, and dynamic linking and loading can be thought of as yet another special case of I/O. Nevertheless, the proponents of TM must either solve this problem, or resign themselves to a world where TM is but one tool of several in the parallel programmer’s toolbox. (To be fair, a number of TM proponents have long since resigned themselves to a world containing more than just TM.)

17.2.2.4 Memory-Mapping Operations

It is perfectly legal to execute memory-mapping operations (including mmap(), shmat(), and munmap() [Gro01]) within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. What happens when you attempt to execute such an operation from within a transaction? More to the point, what happens if the memory region being remapped contains some variables participating in the current thread’s transaction? And what if this memory region contains variables participating in some other thread’s transaction?
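The lock-based case mentioned above is unremarkable, as the following minimal sketch illustrates (POSIX mmap() under a pthread mutex, with error handling omitted and illustrative names only). The open question is what the transactional analog of such code should do.

	#include <pthread.h>
	#include <stddef.h>
	#include <sys/mman.h>

	static pthread_mutex_t region_lock = PTHREAD_MUTEX_INITIALIZER;
	static void *region;

	void map_region(size_t len)
	{
		pthread_mutex_lock(&region_lock);	/* Lock-based critical section... */
		region = mmap(NULL, len, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		pthread_mutex_unlock(&region_lock);	/* ...containing a memory-mapping operation. */
	}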
It should not be necessary to consider cases where the TM system’s metadata is remapped, given that most locking primitives do not define the outcome of remapping their lock variables.

Here are some TM memory-mapping options:

1. Memory remapping is illegal within a transaction, and will result in the transaction being aborted. This does simplify things somewhat, but also requires that TM interoperate with synchronization primitives that do tolerate remapping from within their critical sections.

2. Memory remapping is illegal within a transaction, and the compiler is enlisted to enforce this prohibition.

3. Memory mapping is legal within a transaction, but aborts all other transactions having variables in the region mapped over.

4. Memory mapping is legal within a transaction, but the mapping operation will fail if the region being mapped overlaps with the current transaction’s footprint.

5. All memory-mapping operations, whether within or outside a transaction, check the region being mapped against the memory footprint of all transactions in the system. If there is overlap, then the memory-mapping operation fails.

6. The effect of memory-mapping operations that overlap the memory footprint of any transaction in the system is determined by the TM conflict manager, which might dynamically determine whether to fail the memory-mapping operation or abort any conflicting transactions.

It is interesting to note that munmap() leaves the relevant region of memory unmapped, which could have additional interesting implications.5

5 This difference between mapping and unmapping was noted by

17.2.2.5 Debugging

The usual debugging operations such as breakpoints work normally within lock-based critical sections and from userspace-RCU read-side critical sections. However, in initial transactional-memory hardware implementations [DLMN09] an exception within a transaction will abort that transaction, which in turn means that breakpoints abort all enclosing transactions.

So how can transactions be debugged?

1. Use software emulation techniques within transactions containing breakpoints. Of course, it might be necessary to emulate all transactions any time a breakpoint is set within the scope of any transaction.
If the runtime system is unable to determine whether or not a given breakpoint is within the scope of a transaction, then it might be necessary to emulate all transactions just to be on the safe side. However, this approach might impose significant overhead, which might in turn obscure the bug being pursued.

2. Use only hardware TM implementations that are capable of handling breakpoint exceptions. Unfortunately, as of this writing (March 2021), all such implementations are research prototypes.

3. Use only software TM implementations, which are (very roughly speaking) more tolerant of exceptions than are the simpler of the hardware TM implementations. Of course, software TM tends to have higher overhead than hardware TM, so this approach may not be acceptable in all situations.

4. Program more carefully, so as to avoid having bugs in the transactions in the first place. As soon as you figure out how to do this, please do let everyone know the secret!

There is some reason to believe that transactional memory will deliver productivity improvements compared to other synchronization mechanisms, but it does seem quite possible that these improvements could easily be lost if traditional debugging techniques cannot be applied to transactions. This seems especially true if transactional memory is to be used by novices on large transactions. In contrast, macho “top-gun” programmers might be able to dispense with such debugging aids, especially for small transactions.

Therefore, if transactional memory is to deliver on its productivity promises to novice programmers, the debugging problem does need to be solved.

17.2.3 Synchronization

If transactional memory someday proves that it can be everything to everyone, it will not need to interact with any other synchronization mechanism. Until then, it will need to work with synchronization mechanisms that can do what it cannot, or that work more naturally in a given situation. The following sections outline the current challenges in this area.

17.2.3.1 Locking

It is commonplace to acquire locks while holding other locks, which works quite well, at least as long as the usual well-known software-engineering techniques are employed to avoid deadlock. It is not unusual to acquire locks from within RCU read-side critical sections, which eases deadlock concerns because RCU read-side primitives cannot participate in lock-based deadlock cycles. It is also possible to acquire locks while holding hazard pointers and within sequence-lock read-side critical sections. But what happens when you attempt to acquire a lock from within a transaction?

In theory, the answer is trivial: Simply manipulate the data structure representing the lock as part of the transaction, and everything works out perfectly. In practice, a number of non-obvious complications [VGS08] can arise, depending on implementation details of the TM system. These complications can be resolved, but at the cost of a 45 % increase in overhead for locks acquired outside of transactions and a 300 % increase in overhead for locks acquired within transactions. Although these overheads might be acceptable for transactional programs containing small amounts of locking, they are often completely unacceptable for production-quality lock-based programs wishing to use the occasional transaction.

1. Use only locking-friendly TM implementations. Unfortunately, the locking-unfriendly implementations have some attractive properties, including low overhead for successful transactions and the ability to accommodate extremely large transactions.

2. Use TM only “in the small” when introducing TM to lock-based programs, thereby accommodating the limitations of locking-friendly TM implementations.

3. Set aside locking-based legacy systems entirely, reimplementing everything in terms of transactions. This approach has no shortage of advocates, but this requires that all the issues described in this series be resolved. During the time it takes to resolve these issues, competing synchronization mechanisms will of course also have the opportunity to improve.

4. Use TM strictly as an optimization in lock-based systems, as was done by the TxLinux [RHP+ 07] group and by a great many transactional lock elision projects [PD11, Kle14, FIMR16, PMDY20]. This approach seems sound, but leaves the locking design constraints (such as the need to avoid deadlock) firmly in place.

5. Strive to reduce the overhead imposed on locking primitives.
The fact that there could possibly be a problem interfacing TM and locking came as a surprise to many, which underscores the need to try out new mechanisms and primitives in real-world production software. Fortunately, the advent of open source means that a huge quantity of such software is now freely available to everyone, including researchers.

17.2.3.2 Reader-Writer Locking

It is commonplace to read-acquire reader-writer locks while holding other locks, which just works, at least as long as the usual well-known software-engineering techniques are employed to avoid deadlock. Read-acquiring reader-writer locks from within RCU read-side critical sections also works, and doing so eases deadlock concerns because RCU read-side primitives cannot participate in lock-based deadlock cycles. It is also possible to acquire locks while holding hazard pointers and within sequence-lock read-side critical sections. But what happens when you attempt to read-acquire a reader-writer lock from within a transaction?

Unfortunately, the straightforward approach to read-acquiring the traditional counter-based reader-writer lock within a transaction defeats the purpose of the reader-writer lock. To see this, consider a pair of transactions concurrently attempting to read-acquire the same reader-writer lock. Because read-acquisition involves modifying the reader-writer lock’s data structures, a conflict will result, which will roll back one of the two transactions. This behavior is completely inconsistent with the reader-writer lock’s goal of allowing concurrent readers.
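To make the source of that conflict concrete, consider the toy counter-based read-acquisition function below, a minimal sketch using C11 atomics whose structure and function names are illustrative rather than those of any real library. The atomic_fetch_add() is a store to memory shared by all readers, so two transactions executing it at the same time are guaranteed to conflict.

	#include <stdatomic.h>

	struct toy_rwlock {
		atomic_int readers;	/* Number of read-holders. */
		atomic_int writer;	/* Nonzero while write-held. */
	};

	void toy_read_acquire(struct toy_rwlock *rwl)
	{
		for (;;) {
			atomic_fetch_add(&rwl->readers, 1);	/* Writes shared state! */
			if (!atomic_load(&rwl->writer))
				return;				/* No writer: read-held. */
			atomic_fetch_sub(&rwl->readers, 1);	/* Writer present: back out and retry. */
		}
	}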
Here are some options available to TM:

1. Use per-CPU or per-thread reader-writer locking [HW92], which allows a given CPU (or thread, respectively) to manipulate only local data when read-acquiring the lock. This would avoid the conflict between the two transactions concurrently read-acquiring the lock, permitting both to proceed, as intended. Unfortunately, (1) the write-acquisition overhead of per-CPU/thread locking can be extremely high, (2) the memory overhead of per-CPU/thread locking can be prohibitive, and (3) this transformation is available only when you have access to the source code in question. Other more-recent scalable reader-writer locks [LLO09] might avoid some or all of these problems.

2. Use TM only “in the small” when introducing TM to lock-based programs, thereby avoiding read-acquiring reader-writer locks from within transactions.

3. Set aside locking-based legacy systems entirely, reimplementing everything in terms of transactions. This approach has no shortage of advocates, but this requires that all the issues described in this series be resolved. During the time it takes to resolve these issues, competing synchronization mechanisms will of course also have the opportunity to improve.

4. Use TM strictly as an optimization in lock-based systems, as was done by the TxLinux [RHP+ 07] group, and as has been done by more recent work using TM to elide reader-writer locks [FIMR16]. This approach seems sound, at least on POWER8 CPUs [LGW+ 15], but leaves the locking design constraints (such as the need to avoid deadlock) firmly in place.

Of course, there might well be other non-obvious issues surrounding combining TM with reader-writer locking, as there in fact were with exclusive locking.

17.2.3.3 Deferred Reclamation

This section focuses mainly on RCU. Similar issues and possible resolutions arise when combining TM with other deferred-reclamation mechanisms such as reference counters and hazard pointers. In the text below, known differences are specifically called out.

Reference counting, hazard pointers, and RCU are all heavily used, as noted in Sections 9.5.5 and 9.6.3. This means that any TM implementation that chooses not to surmount each and every challenge called out in this section needs to interoperate cleanly and efficiently with all of these synchronization mechanisms.

The TxLinux group from the University of Texas at Austin appears to be the group to take on the challenge of RCU/TM interoperability [RHP+ 07]. Because they applied TM to the Linux 2.6 kernel, which uses RCU, they had no choice but to integrate TM and RCU, with TM taking the place of locking for RCU updates. Unfortunately, although the paper does state that the RCU implementation’s locks (e.g., rcu_ctrlblk.lock) were converted to transactions, it is silent about what was done with those locks used by RCU-based updates (for example, dcache_lock).

More recently, Dimitrios Siakavaras et al. have applied HTM and RCU to search trees [SNGK17, SBN+ 20], Christina Giannoula et al. have used HTM and RCU to color graphs [GGK18], and SeongJae Park et al. have
used HTM and RCU to optimize high-contention locking on NUMA systems [PMDY20].

It is important to note that RCU permits readers and updaters to run concurrently, further permitting RCU readers to access data that is in the act of being updated. Of course, this property of RCU, whatever its performance, scalability, and real-time-response benefits might be, flies in the face of the underlying atomicity properties of TM, although the POWER8 CPU family’s suspended-transaction facility [LGW+ 15] makes it an exception to this rule.

So how should TM-based updates interact with concurrent RCU readers? Some possibilities are as follows:

1. RCU readers abort concurrent conflicting TM updates. This is in fact the approach taken by the TxLinux project. This approach does preserve RCU semantics, and also preserves RCU’s read-side performance, scalability, and real-time-response properties, but it does have the unfortunate side-effect of unnecessarily aborting conflicting updates. In the worst case, a long sequence of RCU readers could potentially starve all updaters, which could in theory result in system hangs. In addition, not all TM implementations offer the strong atomicity required to implement this approach, and for good reasons.

2. RCU readers that run concurrently with conflicting TM updates get old (pre-transaction) values from any conflicting RCU loads. This preserves RCU semantics and performance, and also prevents RCU-update starvation. However, not all TM implementations can provide timely access to old values of variables that have been tentatively updated by an in-flight transaction. In particular, log-based TM implementations that maintain old values in the log (thus providing excellent TM commit performance) are not likely to be happy with this approach. Perhaps the rcu_dereference() primitive can be leveraged to permit RCU to access the old values within a greater range of TM implementations, though performance might still be an issue. Nevertheless, there are popular TM implementations that have been integrated with RCU in this manner [PW07, HW11, HW14].

3. If an RCU reader executes an access that conflicts with an in-flight transaction, then that RCU access is delayed until the conflicting transaction either commits or aborts. This approach preserves RCU semantics, but not RCU’s performance or real-time response, particularly in the presence of long-running transactions. In addition, not all TM implementations are capable of delaying conflicting accesses. Nevertheless, this approach seems eminently reasonable for hardware TM implementations that support only small transactions.

4. RCU readers are converted to transactions. This approach pretty much guarantees that RCU is compatible with any TM implementation, but it also imposes TM’s rollbacks on RCU read-side critical sections, destroying RCU’s real-time response guarantees, and also degrading RCU’s read-side performance. Furthermore, this approach is infeasible in cases where any of the RCU read-side critical sections contains operations that the TM implementation in question is incapable of handling. This approach is more difficult to apply to hazard pointers and reference counters, which do not have a sharply defined notion of a reader as a section of code.

5. Many update-side uses of RCU modify a single pointer to publish a new data structure. In some of these cases, RCU can safely be permitted to see a transactional pointer update that is subsequently rolled back, as long as the transaction respects memory ordering and as long as the roll-back process uses call_rcu() to free up the corresponding structure (see the sketch following this list). Unfortunately, not all TM implementations respect memory barriers within a transaction. Apparently, the thought is that because transactions are supposed to be atomic, the ordering of the accesses within the transaction is not supposed to matter.

6. Prohibit use of TM in RCU updates. This is guaranteed to work, but restricts use of TM.
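The sketch below fleshes out possibility 5. It assumes the userspace-RCU API (rcu_assign_pointer(), call_rcu()) together with a hypothetical txn_begin()/txn_commit() TM interface in which txn_begin() returns false when the transaction is rolled back, setjmp()-style. It is an illustration of the idea rather than code for any existing TM system, and error handling is omitted.

	#include <stdbool.h>
	#include <stdlib.h>
	#include <urcu.h>	/* Userspace RCU: rcu_assign_pointer(), call_rcu(). */

	struct foo {
		struct rcu_head rh;	/* Must be first: see free_foo_cb(). */
		int data;
	};

	static struct foo *gp;		/* RCU-protected pointer. */

	/* Hypothetical TM interface, assumed for this sketch only. */
	bool txn_begin(void);
	void txn_commit(void);

	static void free_foo_cb(struct rcu_head *rhp)
	{
		free(rhp);	/* Valid because rh is the first field of struct foo. */
	}

	void publish_foo(int data)
	{
		struct foo *newp = malloc(sizeof(*newp));
		struct foo *oldp;

		newp->data = data;
		if (txn_begin()) {
			oldp = gp;
			rcu_assign_pointer(gp, newp);		/* Readers may now see newp. */
			txn_commit();
			call_rcu(&oldp->rh, free_foo_cb);	/* Committed: retire the old version. */
		} else {
			/* Rolled back: readers may have glimpsed newp, so defer its free. */
			call_rcu(&newp->rh, free_foo_cb);
		}
	}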
It seems likely that additional approaches will be uncovered, especially given the advent of user-level RCU and hazard-pointer implementations.6 It is interesting to note that many of the better performing and scaling STM implementations make use of RCU-like techniques internally [Fra04, FH07, GYW+ 19, KMK+ 19].

6 Kudos to the TxLinux group, Maged Michael, and Josh Triplett for coming up with a number of the above alternatives.

Quick Quiz 17.3: MV-RLU looks pretty good! Doesn’t it beat RCU hands down?

17.2.3.4 Extra-Transactional Accesses

Within a lock-based critical section, it is perfectly legal to manipulate variables that are concurrently accessed or
even modified outside that lock’s critical section, with one common example being statistical counters. The same thing is possible within RCU read-side critical sections, and is in fact the common case.
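For example, the following minimal sketch (C11 atomics plus pthreads, with illustrative names) increments per-thread statistical counters with no lock held, while the read-out path takes a lock only to serialize concurrent summers. This is exactly the sort of extra-critical-section access under discussion.

	#include <pthread.h>
	#include <stdatomic.h>

	#define MAX_THREADS 128

	static atomic_ulong counter[MAX_THREADS];	/* One counter per thread. */
	static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

	void count_event(int tid)	/* Lockless fastpath, runs outside the lock. */
	{
		atomic_fetch_add_explicit(&counter[tid], 1, memory_order_relaxed);
	}

	unsigned long read_count(void)	/* Slowpath, serialized by sum_lock. */
	{
		unsigned long sum = 0;

		pthread_mutex_lock(&sum_lock);
		for (int i = 0; i < MAX_THREADS; i++)
			sum += atomic_load_explicit(&counter[i], memory_order_relaxed);
		pthread_mutex_unlock(&sum_lock);
		return sum;
	}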
Given mechanisms such as the so-called “dirty reads” that are prevalent in production database systems, it is not surprising that extra-transactional accesses have received serious attention from the proponents of TM, with the concept of weak atomicity [BLM06] being but one case in point.

Here are some extra-transactional options:

1. Conflicts due to extra-transactional accesses always abort transactions. This is strong atomicity.

2. Conflicts due to extra-transactional accesses are ignored, so only conflicts among transactions can abort transactions. This is weak atomicity.

3. Transactions are permitted to carry out non-transactional operations in special cases, such as when allocating memory or interacting with lock-based critical sections.

4. Produce hardware extensions that permit some operations (for example, addition) to be carried out concurrently on a single variable by multiple transactions.

5. Introduce weak semantics to transactional memory. One approach is the combination with RCU described in Section 17.2.3.3, while Gramoli and Guerraoui survey a number of other weak-transaction approaches [GG14], for example, restricted partitioning of large “elastic” transactions into smaller transactions, thus reducing conflict probabilities (albeit with tepid performance and scalability). Perhaps further experience will show that some uses of extra-transactional accesses can be replaced by weak transactions.

It appears that transactions were conceived in a vacuum, with no interaction required with any other synchronization mechanism. If so, it is no surprise that much confusion and complexity arises when combining transactions with non-transactional accesses. But unless transactions are to be confined to small updates to isolated data structures, or alternatively to be confined to new programs that do not interact with the huge body of existing parallel code, then transactions absolutely must be so combined if they are to have large-scale practical impact in the near term.

17.2.4 Discussion

The obstacles to universal TM adoption lead to the following conclusions:

1. One interesting property of TM is the fact that transactions are subject to rollback and retry. This property underlies TM’s difficulties with irreversible operations, including unbuffered I/O, RPCs, memory-mapping operations, time delays, and the exec() system call. This property also has the unfortunate consequence of introducing all the complexities inherent in the possibility of failure, often in a developer-visible manner.

2. Another interesting property of TM, noted by Shpeisman et al. [SATG+ 09], is that TM intertwines the synchronization with the data it protects. This property underlies TM’s issues with I/O, memory-mapping operations, extra-transactional accesses, and debugging breakpoints. In contrast, conventional synchronization primitives, including locking and RCU, maintain a clear separation between the synchronization primitives and the data that they protect.

3. One of the stated goals of many workers in the TM area is to ease parallelization of large sequential programs. As such, individual transactions are commonly expected to execute serially, which might do much to explain TM’s issues with multithreaded transactions.

Quick Quiz 17.4: Given things like spin_trylock(), how does it make any sense at all to claim that TM introduces the concept of failure???

What should TM researchers and developers do about all of this?

One approach is to focus on TM in the small, focusing on small transactions where hardware assist potentially provides substantial advantages over other synchronization primitives and on small programs where there is some evidence for increased productivity for a combined TM-locking approach [PAT11]. Sun took the small-transaction approach with its Rock research CPU [DLMN09]. Some TM researchers seem to agree with these two small-is-beautiful approaches [SSHT93], others have much higher hopes for TM, and yet others hint that high TM aspirations might be TM’s worst enemy [Att10, Section 6]. It is nonetheless quite possible that TM will be able to take on
larger problems, and this section has listed a few of the issues that must be resolved if TM is to achieve this lofty goal.

Of course, everyone involved should treat this as a learning experience. It would seem that TM researchers have a great deal to learn from practitioners who have successfully built large software systems using traditional synchronization primitives.

And vice versa.

Quick Quiz 17.5: What is there to learn? Why not just use TM for memory-based data structures and locking for those rare cases featuring the many silly corner cases listed in this silly section???

But for the moment, the current state of STM can best be summarized with a series of cartoons. First, Figure 17.9 shows the STM vision. As always, the reality is a bit more nuanced, as fancifully depicted by Figures 17.10, 17.11, and 17.12.7 Less fanciful STM retrospectives are also available [Duf10a, Duf10b].

Figure 17.10: The STM Reality: Conflicts

Figure 17.11: The STM Reality: Irrevocable Operations

Some commercially available hardware supports restricted variants of HTM, which are addressed in the following section.

7 … STM systems for real-time use [And19, NA18], albeit without any performance results, and with some indications that real-time hybrid STM/HTM systems must choose between fast common-case performance and worst-case forward-progress guarantees [AKK+ 14, SBV10].
17.3 Hardware Transactional Memory
running on, a subsequent execution of the same instance of that synchronization primitive on some other CPU will result in a cache miss. These communications cache misses severely degrade both the performance and scalability of conventional synchronization mechanisms [ABD+ 97, Section 4.2.3].

In contrast, HTM synchronizes by using the CPU’s cache, avoiding the need for a separate synchronization data structure and resultant cache misses. HTM’s advantage is greatest in cases where a lock data structure is placed in a separate cache line, in which case, converting a given critical section to an HTM transaction can reduce that critical section’s overhead by a full cache miss. These savings can be quite significant for the common case of short critical sections, at least for those situations where the elided lock does not share a cache line with an oft-written variable protected by that lock.

Quick Quiz 17.6: Why would it matter that oft-written variables shared the cache line with the lock variable?

17.3.1.2 Dynamic Partitioning of Data Structures

A major obstacle to the use of some conventional synchronization mechanisms is the need to statically partition data structures. There are a number of data structures that are trivially partitionable, with the most prominent example being hash tables, where each hash chain constitutes a partition. Allocating a lock for each hash chain then trivially parallelizes the hash table for operations confined to a given chain.9 Partitioning is similarly trivial for arrays, radix trees, skiplists, and several other data structures.

9 And it is also easy to extend this scheme to operations accessing multiple hash chains by having such operations acquire the locks for all relevant chains in hash order.

However, partitioning for many types of trees and graphs is quite difficult, and the results are often quite complex [Ell80]. Although it is possible to use two-phased locking and hashed arrays of locks to partition general data structures, other techniques have proven preferable [Mil06], as will be discussed in Section 17.3.3. Given its avoidance of synchronization cache misses, HTM is therefore a very real possibility for large non-partitionable data structures, at least assuming relatively small updates.

Quick Quiz 17.7: Why are relatively small updates important to HTM performance and scalability?

17.3.1.3 Practical Value

Some evidence of HTM’s practical value has been demonstrated in a number of hardware platforms, including Sun Rock [DLMN09], Azul Vega [Cli09], IBM Blue Gene/Q [Mer11], Intel Haswell TSX [RD12], and IBM System z [JSG12].

Expected practical benefits include:

1. Lock elision for in-memory data access and update [MT01, RG02].

2. Concurrent access and small random updates to large non-partitionable data structures.

However, HTM also has some very real shortcomings, which will be discussed in the next section.

17.3.2 HTM Weaknesses WRT Locking

The concept of HTM is quite simple: A group of accesses and updates to memory occurs atomically. However, as is the case with many simple ideas, complications arise when you apply it to real systems in the real world. These complications are as follows:

1. Transaction-size limitations.

2. Conflict handling.

3. Aborts and rollbacks.

4. Lack of forward-progress guarantees.

5. Irrevocable operations.

6. Semantic differences.

Each of these complications is covered in the following sections, followed by a summary.

17.3.2.1 Transaction-Size Limitations

The transaction-size limitations of current HTM implementations stem from the use of the processor caches to hold the data affected by the transaction. Although this allows a given CPU to make the transaction appear atomic to other CPUs by executing the transaction within the confines of its cache, it also means that any transaction that does not fit cannot commit. Furthermore, events that change execution context, such as interrupts, system calls, exceptions, traps, and context switches either must abort any ongoing transaction on the CPU in question or must
further restrict transaction size due to the cache footprint of the other execution context.

Of course, modern CPUs tend to have large caches, and the data required for many transactions would fit easily in a one-megabyte cache. Unfortunately, with caches, sheer size is not all that matters. The problem is that most caches can be thought of as hash tables implemented in hardware. However, hardware caches do not chain their buckets (which are normally called sets), but rather provide a fixed number of cachelines per set. The number of elements provided for each set in a given cache is termed that cache’s associativity.

Although cache associativity varies, the eight-way associativity of the level-0 cache on the laptop I am typing this on is not unusual. What this means is that if a given transaction needed to touch nine cache lines, and if all nine cache lines mapped to the same set, then that transaction cannot possibly complete, never mind how many megabytes of additional space might be available in that cache. Yes, given randomly selected data elements in a given data structure, the probability of that transaction being able to commit is quite high, but there can be no guarantee [McK11c].
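The arithmetic behind this limitation is easy to demonstrate. The following sketch assumes a hypothetical cache with 64-byte lines, 64 sets, and 8 ways (numbers chosen purely for illustration): nine addresses spaced one way-stride apart all land in set 0, overflowing that set even though the rest of the cache is empty.

	#include <stdio.h>

	#define CACHE_LINE_SIZE 64	/* Assumed line size in bytes. */
	#define NUM_SETS        64	/* Assumed number of sets. */

	static unsigned long set_index(unsigned long addr)
	{
		return (addr / CACHE_LINE_SIZE) % NUM_SETS;	/* Hardware "hash function". */
	}

	int main(void)
	{
		for (int i = 0; i < 9; i++) {
			unsigned long addr = (unsigned long)i * CACHE_LINE_SIZE * NUM_SETS;

			printf("address %#lx maps to set %lu\n", addr, set_index(addr));
		}
		return 0;	/* All nine addresses map to set 0: one too many for 8 ways. */
	}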
There has been some research work to alleviate this limitation. Fully associative victim caches would alleviate the associativity constraints, but there are currently stringent performance and energy-efficiency constraints on the sizes of victim caches. That said, HTM victim caches for unmodified cache lines can be quite small, as they need to retain only the address: The data itself can be written to memory or shadowed by other caches, while the address itself is sufficient to detect a conflicting write [RD12].

Unbounded-transactional-memory (UTM) schemes [AAKL06, MBM+ 06] use DRAM as an extremely large victim cache, but integrating such schemes into a production-quality cache-coherence mechanism is still an unsolved problem. In addition, use of DRAM as a victim cache may have unfortunate performance and energy-efficiency consequences, particularly if the victim cache is to be fully associative. Finally, the “unbounded” aspect of UTM assumes that all of DRAM could be used as a victim cache, while in reality the large but still fixed amount of DRAM assigned to a given CPU would limit the size of that CPU’s transactions.

Other schemes use a combination of hardware and software transactional memory [KCH+ 06] and one could imagine using STM as a fallback mechanism for HTM. However, to the best of my knowledge, currently available systems do not implement any of these research ideas, and perhaps for good reason.

17.3.2.2 Conflict Handling

The first complication is the possibility of conflicts. For example, suppose that transactions A and B are defined as follows:

	Transaction A	Transaction B
	x = 1;		y = 2;
	y = 3;		x = 4;

Suppose that each transaction executes concurrently on its own processor. If transaction A stores to x at the same time that transaction B stores to y, neither transaction can progress. To see this, suppose that transaction A executes its store to y. Then transaction A will be interleaved within transaction B, in violation of the requirement that transactions execute atomically with respect to each other. Allowing transaction B to execute its store to x similarly violates the atomic-execution requirement. This situation is termed a conflict, which happens whenever two concurrent transactions access the same variable where at least one of the accesses is a store. The system is therefore obligated to abort one or both of the transactions in order to allow execution to progress. The choice of exactly which transaction to abort is an interesting topic that will very likely retain the ability to generate Ph.D. dissertations for some time to come, see for example [ATC+ 11].10 For the purposes of this section, we can assume that the system makes a random choice.

10 Liu’s and Spear’s paper entitled “Toxic Transactions” [LS11] is

Another complication is conflict detection, which is comparatively straightforward, at least in the simplest case. When a processor is executing a transaction, it marks every cache line touched by that transaction. If the processor’s cache receives a request involving a cache line that has been marked as touched by the current transaction, a potential conflict has occurred. More sophisticated systems might try to order the current processor’s transaction to precede that of the processor sending the request, and optimizing this process will likely also retain the ability to generate Ph.D. dissertations for quite some time. However this section assumes a very simple conflict-detection strategy.

However, for HTM to work effectively, the probability of conflict must be quite low, which in turn requires that the data structures be organized so as to maintain a sufficiently low probability of conflict. For example, a
red-black tree with simple insertion, deletion, and search operations fits this description, but a red-black tree that maintains an accurate count of the number of elements in the tree does not.11 For another example, a red-black tree that enumerates all elements in the tree in a single transaction will have high conflict probabilities, degrading performance and scalability. As a result, many serial programs will require some restructuring before HTM can work effectively. In some cases, practitioners will prefer to take the extra steps (in the red-black-tree case, perhaps switching to a partitionable data structure such as a radix tree or a hash table), and just use locking, particularly until such time as HTM is readily available on all relevant architectures [Cli09].

11 The need to update the count would result in additions to and deletions from the tree conflicting with each other, resulting in strong non-commutativity [AGH+ 11a, AGH+ 11b, McK11b].

Quick Quiz 17.8: How could a red-black tree possibly efficiently enumerate all elements of the tree regardless of choice of synchronization mechanism???

Furthermore, the potential for conflicting accesses among concurrent transactions can result in failure. Handling such failure is discussed in the next section.

17.3.2.3 Aborts and Rollbacks

Because any transaction might be aborted at any time, it is important that transactions contain no statements that cannot be rolled back. This means that transactions cannot do I/O, system calls, or debugging breakpoints (no single stepping in the debugger for HTM transactions!!!). Instead, transactions must confine themselves to accessing normal cached memory. Furthermore, on some systems, interrupts, exceptions, traps, TLB misses, and other events will also abort transactions. Given the number of bugs that have resulted from improper handling of error conditions, it is fair to ask what impact aborts and rollbacks have on ease of use.

Quick Quiz 17.9: But why can’t a debugger emulate single stepping by setting breakpoints at successive lines of the transaction, relying on the retry to retrace the steps of the earlier instances of the transaction?

Of course, aborts and rollbacks raise the question of whether HTM can be useful for hard real-time systems. Do the performance benefits of HTM outweigh the costs of the aborts and rollbacks, and if so under what conditions? Can transactions use priority boosting? Or should transactions for high-priority threads instead preferentially abort those of low-priority threads? If so, how is the hardware efficiently informed of priorities? The literature on real-time use of HTM is quite sparse, perhaps because there are more than enough problems in making HTM work well in non-real-time environments.

Because current HTM implementations might deterministically abort a given transaction, software must provide fallback code. This fallback code must use some other form of synchronization, for example, locking. If a lock-based fallback is ever used, then all the limitations of locking, including the possibility of deadlock, reappear. One can of course hope that the fallback isn’t used often, which might allow simpler and less deadlock-prone locking designs to be used. But this raises the question of how the system transitions from using the lock-based fallbacks back to transactions.12 One approach is to use a test-and-test-and-set discipline [MT02], so that everyone holds off until the lock is released, allowing the system to start from a clean slate in transactional mode at that point. However, this could result in quite a bit of spinning, which might not be wise if the lock holder has blocked or been preempted. Another approach is to allow transactions to proceed in parallel with a thread holding a lock [MT02], but this raises difficulties in maintaining atomicity, especially if the reason that the thread is holding the lock is because the corresponding transaction would not fit into cache.

12 The possibility of an application getting stuck in fallback mode has been termed the “lemming effect”, a term that Dave Dice has been credited with coining.
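The elision-plus-fallback pattern just described can be sketched as follows, assuming Intel RTM intrinsics (_xbegin(), _xend(), _xabort(), and _xtest() from <immintrin.h>, compiled with -mrtm) layered over a toy test-and-test-and-set spinlock. The transactional path reads ("subscribes to") the lock so that any fallback acquisition aborts all elided critical sections. This is a simplified illustration, not a production-quality elision library.

	#include <immintrin.h>
	#include <stdatomic.h>

	typedef struct { atomic_int locked; } spinlock_t;

	static int spin_is_locked(spinlock_t *lp)
	{
		return atomic_load(&lp->locked);
	}

	static void spin_lock(spinlock_t *lp)		/* Test-and-test-and-set fallback. */
	{
		while (atomic_exchange(&lp->locked, 1))
			while (atomic_load(&lp->locked))
				continue;		/* Spin until release, then retry. */
	}

	static void spin_unlock(spinlock_t *lp)
	{
		atomic_store(&lp->locked, 0);
	}

	void elided_lock(spinlock_t *lp)
	{
		if (_xbegin() == _XBEGIN_STARTED) {
			if (!spin_is_locked(lp))	/* Subscribe: lock is now in the read set. */
				return;			/* Run the critical section transactionally. */
			_xabort(0xff);			/* Lock held: abort and take the fallback. */
		}
		spin_lock(lp);				/* Fallback path on abort. */
	}

	void elided_unlock(spinlock_t *lp)
	{
		if (_xtest())				/* Still inside a transaction? */
			_xend();			/* Commit the elided critical section. */
		else
			spin_unlock(lp);
	}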
Finally, dealing with the possibility of aborts and rollbacks seems to put an additional burden on the developer, who must correctly handle all combinations of possible error conditions.

It is clear that users of HTM must put considerable validation effort into testing both the fallback code paths and transition from fallback code back to transactional code. Nor is there any reason to believe that the validation requirements of HTM hardware are any less daunting.

17.3.2.4 Lack of Forward-Progress Guarantees

Even though transaction size, conflicts, and aborts/rollbacks can all cause transactions to abort, one might hope that sufficiently small and short-duration transactions could be guaranteed to eventually succeed. This would permit a transaction to be unconditionally retried, in the same way that compare-and-swap (CAS) and load-linked/store-conditional (LL/SC) operations are unconditionally retried
in code that uses these instructions to implement atomic operations.

Unfortunately, other than low-clock-rate academic research prototypes [SBV10], currently available HTM implementations refuse to make any sort of forward-progress guarantee. As noted earlier, HTM therefore cannot be used to avoid deadlock on those systems. Hopefully future implementations of HTM will provide some sort of forward-progress guarantees. Until that time, HTM must be used with extreme caution in real-time applications. The one exception to this gloomy picture as of 2021 is the IBM mainframe, which provides constrained transactions [JSG12]. The constraints are quite severe, and are presented in Section 17.3.5.1. It will be interesting to see if HTM forward-progress guarantees migrate from the mainframe to commodity CPU families.

changes in configuration. But if this empty critical section is translated to a transaction, the result is a no-op. The guarantee that all prior critical sections have terminated is lost. In other words, transactional lock elision preserves the data-protection semantics of locking, but loses locking’s time-based messaging semantics.

Quick Quiz 17.10: But why would anyone need an empty lock-based critical section???

Quick Quiz 17.11: Can’t transactional lock elision trivially handle locking’s time-based messaging semantics by simply choosing not to elide empty lock-based critical sections?

Quick Quiz 17.12: Given modern hardware [MOZ09], how can anyone possibly expect parallel software relying on timing to work?
Listing 17.1: Exploiting Priority Boosting
 1 void boostee(void)
 2 {
 3   int i = 0;
 4
 5   acquire_lock(&boost_lock[i]);
 6   for (;;) {
 7     acquire_lock(&boost_lock[!i]);
 8     release_lock(&boost_lock[i]);
 9     i = i ^ 1;
10     do_something();
11   }
12 }
13
14 void booster(void)
15 {
16   int i = 0;
17
18   for (;;) {
19     usleep(500); /* sleep 0.5 ms. */
20     acquire_lock(&boost_lock[i]);
21     release_lock(&boost_lock[i]);
22     i = i ^ 1;
23   }
24 }

This arrangement requires that boostee() acquire its first lock on line 5 before the system becomes busy, but this is easily arranged, even on modern hardware.

Unfortunately, this arrangement can break down in the presence of transactional lock elision. The boostee() function’s overlapping critical sections become one infinite transaction, which will sooner or later abort, for example, on the first time that the thread running the boostee() function is preempted. At this point, boostee() will fall back to locking, but given its low priority and that the quiet initialization period is now complete (which after all is why boostee() was preempted), this thread might never again get a chance to run.

And if the boostee() thread is not holding the lock, then the booster() thread’s empty critical section on lines 20 and 21 of Listing 17.1 will become an empty transaction that has no effect, so that boostee() never runs. This example illustrates some of the subtle consequences of transactional memory’s rollback-and-retry semantics.

Given that experience will likely uncover additional subtle semantic differences, application of HTM-based lock elision to large programs should be undertaken with caution. That said, where it does apply, HTM-based lock elision can eliminate the cache misses associated with the lock variable, which has resulted in tens of percent performance increases in large real-world software systems as of early 2015. We can therefore expect to see substantial use of this technique on hardware providing reliable support for it.

Quick Quiz 17.14: So a bunch of people set out to supplant locking, and they mostly end up just optimizing locking???

17.3.2.7 Summary

Although it seems likely that HTM will have compelling use cases, current implementations have serious transaction-size limitations, conflict-handling complications, abort-and-rollback issues, and semantic differences that will require careful handling. HTM’s current situation relative to locking is summarized in Table 17.1. As can be seen, although the current state of HTM alleviates some serious shortcomings of locking,13 it does so by introducing a significant number of shortcomings of its own. These shortcomings are acknowledged by leaders in the TM community [MS12].14

13 In fairness, it is important to emphasize that locking’s shortcomings do have well-known and heavily used engineering solutions, including deadlock detectors [Cor06a], a wealth of data structures that have been adapted to locking, and a long history of augmentation, as discussed in Section 17.3.3. In addition, if locking really were as horrible as a quick skim of many academic papers might reasonably lead one to believe, where did all the large lock-based parallel programs (both FOSS and proprietary) come from, anyway?

14 In addition, in early 2011, I was invited to deliver a critique of some of the assumptions underlying transactional memory [McK11e]. The audience was surprisingly non-hostile, though perhaps they were taking it easy on me due to the fact that I was heavily jet-lagged while giving the presentation.

In addition, this is not the whole story. Locking is not normally used by itself, but is instead typically augmented by other synchronization mechanisms, including reference counting, atomic operations, non-blocking data structures, hazard pointers [Mic04a, HLM02], and RCU [MS98a, MAK+ 01, HMBW07, McK12b]. The next section looks at how such augmentation changes the equation.

17.3.3 HTM Weaknesses WRT Locking When Augmented

Practitioners have long used reference counting, atomic operations, non-blocking data structures, hazard pointers, and RCU to avoid some of the shortcomings of locking. For example, deadlock can be avoided in many cases by using reference counts, hazard pointers, or RCU to protect data structures, particularly for read-only critical sections [Mic04a, HLM02, DMS+ 12, GMTW08, HMBW07]. These approaches also reduce the need to partition data structures, as was seen in Chapter 10. RCU further provides contention-free bounded wait-free read-side primitives [MS98a, DMS+ 12], while
Table 17.1: Comparison of Locking and HTM (Advantage, Disadvantage, Strong Disadvantage)
Table 17.2: Comparison of Locking (Augmented by RCU or Hazard Pointers) and HTM (Advantage, Disadvantage, Strong Disadvantage)
hazard pointers provide lock-free read-side primitives [Mic02, HLM02, Mic04a]. Adding these considerations to Table 17.1 results in the updated comparison between augmented locking and HTM shown in Table 17.2. A summary of the differences between the two tables is as follows:

1. Use of non-blocking read-side mechanisms alleviates deadlock issues.

2. Read-side mechanisms such as hazard pointers and RCU can operate efficiently on non-partitionable data.

3. Hazard pointers and RCU do not contend with each other or with updaters, allowing excellent performance and scalability for read-mostly workloads.

4. Hazard pointers and RCU provide forward-progress guarantees (lock freedom and bounded wait-freedom, respectively).

5. Privatization operations for hazard pointers and RCU are straightforward.

For those with good eyesight, Table 17.3 combines Tables 17.1 and 17.2.

Quick Quiz 17.15: Tables 17.1 and 17.2 state that hardware is only starting to become available. But hasn’t HTM hardware support been widely available for almost a full decade?

Although it will likely be some time before HTM’s area of applicability can be as crisply delineated as that shown for RCU in Figure 9.33 on page 176, that is no reason not to start moving in that direction.

HTM seems best suited to update-heavy workloads involving relatively small changes to disparate portions of relatively large in-memory data structures running on large multiprocessors, as this meets the size restrictions of current HTM implementations while minimizing the probability of conflicts and attendant aborts and rollbacks. This scenario is also one that is relatively difficult to handle given current synchronization primitives.

Use of locking in conjunction with HTM seems likely to overcome HTM’s difficulties with irrevocable operations, while use of RCU or hazard pointers might alleviate HTM’s transaction-size limitations for read-only operations that traverse large fractions of the data structure [PMDY20]. Current HTM implementations unconditionally abort an update transaction that conflicts with an RCU or hazard-pointer reader, but perhaps future HTM implementations will interoperate more smoothly with these synchronization mechanisms. In the meantime, the probability of an update conflicting with a large RCU or hazard-pointer read-side critical section should be much smaller than the probability of conflicting with the equivalent read-only transaction.15 Nevertheless, it is quite possible that a steady stream of RCU or hazard-pointer readers might starve updaters due to a corresponding steady stream of conflicts. This vulnerability could be eliminated (at significant hardware cost and complexity) by giving extra-transactional reads the pre-transaction copy of the memory location being loaded.

15 It is quite ironic that strictly transactional mechanisms are appearing in shared-memory systems at just about the time that NoSQL databases are relaxing the traditional database-application reliance on strict transactions. Nevertheless, HTM has in fact realized the ease-of-use promise of TM, albeit for black-hat attacks on the Linux kernel’s address-space randomization defense mechanism [JLK16a, JLK16b].

The fact that HTM transactions must have fallbacks might in some cases force static partitionability of data structures back onto HTM. This limitation might be alleviated if future HTM implementations provide forward-progress guarantees, which might eliminate the need for fallback code in some cases, which in turn might allow HTM to be used efficiently in situations with higher conflict probabilities.

In short, although HTM is likely to have important uses and applications, it is another tool in the parallel programmer’s toolbox, not a replacement for the toolbox in its entirety.

17.3.5 Potential Game Changers

1. Forward-progress guarantees.

2. Transaction-size increases.

3. Improved debugging support.

4. Weak atomicity.

These are expanded upon in the following sections.
Table 17.3: Comparison of Locking (Plain and Augmented) and HTM (Advantage, Disadvantage, Strong Disadvantage)
Basic Idea
  Locking: Allow only one thread at a time to access a given set of objects.
  Locking with Userspace RCU or Hazard Pointers: Allow only one thread at a time to access a given set of objects.
  Hardware Transactional Memory: Cause a given operation over a set of objects to execute atomically.

Scope
  Locking: Handles all operations.
  Locking with Userspace RCU or Hazard Pointers: Handles all operations.
  Hardware Transactional Memory: Handles revocable operations. Irrevocable operations force fallback (typically to locking).

Composability
  Locking: Limited by deadlock.
  Locking with Userspace RCU or Hazard Pointers: Readers limited only by grace-period-wait operations. Updaters limited by deadlock. Readers reduce deadlock.
  Hardware Transactional Memory: Limited by irrevocable operations, transaction size, and deadlock. (Assuming lock-based fallback code.)

Scalability & Performance
  Locking: Data must be partitionable to avoid lock contention. Partitioning must typically be fixed at design time. Locking primitives typically result in expensive cache misses and memory-barrier instructions. Contention effects are focused on acquisition and release, so that the critical section runs at full speed. Privatization operations are simple, intuitive, performant, and scalable.
  Locking with Userspace RCU or Hazard Pointers: Data must be partitionable to avoid lock contention among updaters; partitioning not needed for readers. Partitioning for updaters must typically be fixed at design time; partitioning not needed for readers. Updater locking primitives typically result in expensive cache misses and memory-barrier instructions. Update-side contention effects are focused on acquisition and release, so that the critical section runs at full speed; readers do not contend with updaters or with each other. Read-side primitives are typically bounded wait-free with low overhead (lock-free with low overhead for hazard pointers). Privatization operations are simple, intuitive, performant, and scalable when data is visible only to updaters; privatization operations are expensive (though still intuitive and scalable) for reader-visible data.
  Hardware Transactional Memory: Data must be partitionable to avoid conflicts. Dynamic adjustment of partitioning carried out automatically down to cacheline boundaries; partitioning required for fallbacks (less important for rare fallbacks). Transactions begin/end instructions typically do not result in cache misses, but do have memory-ordering and overhead consequences. Contention aborts conflicting transactions, even if they have been running for a long time. Read-only transactions subject to conflicts and rollbacks; no forward-progress guarantees other than those supplied by fallback code. Privatized data contributes to transaction size.

Hardware Support
  Locking: Commodity hardware suffices. Performance is insensitive to cache-geometry details.
  Locking with Userspace RCU or Hazard Pointers: Commodity hardware suffices. Performance is insensitive to cache-geometry details.
  Hardware Transactional Memory: New hardware required (and is starting to become available). Performance depends critically on cache geometry.

Software Support
  Locking: APIs exist, large body of code and experience, debuggers operate naturally.
  Locking with Userspace RCU or Hazard Pointers: APIs exist, large body of code and experience, debuggers operate naturally.
  Hardware Transactional Memory: APIs emerging, little experience outside of DBMS, breakpoints mid-transaction can be problematic.

Interaction With Other Mechanisms
  Locking: Long experience of successful interaction.
  Locking with Userspace RCU or Hazard Pointers: Long experience of successful interaction.
  Hardware Transactional Memory: Just beginning investigation of interaction.

Practical Apps
  Locking: Yes.
  Locking with Userspace RCU or Hazard Pointers: Yes.
  Hardware Transactional Memory: Yes.
3. If a given 4K page contains a constrained transaction’s code, then that page may not contain that transaction’s data.

4. The maximum number of assembly instructions that may be executed is 32.

5. Backwards branches are forbidden.

Nevertheless, these constraints support a number of important data structures, including linked lists, stacks, queues, and arrays. Constrained HTM therefore seems likely to become an important tool in the parallel programmer’s toolbox.

Note that these forward-progress guarantees need not be absolute. For example, suppose that a use of HTM uses a global lock as fallback. Assuming that the fallback mechanism has been carefully designed to avoid the “lemming effect” discussed in Section 17.3.2.3, then if HTM rollbacks are sufficiently infrequent, the global lock will not be a bottleneck. That said, the larger the system, the longer the critical sections, and the longer the time required to recover from the “lemming effect”, the more rare “sufficiently infrequent” needs to be.

17.3.5.2 Transaction-Size Increases

Forward-progress guarantees are important, but as we saw, they will be conditional guarantees based on transaction size and duration. There has been some progress, for example, some commercially available HTM implementations use approximation techniques to support extremely large HTM read sets [RD12]. For another example, POWER8 HTM supports suspended transactions, which avoid adding irrelevant accesses to the suspended transaction’s read and write sets [LGW+ 15]. This capability has been used to produce a high performance reader-writer lock [FIMR16].

It is important to note that even small-sized guarantees will be quite useful. For example, a guarantee of two cache lines is sufficient for a stack, queue, or dequeue. However, larger data structures require larger guarantees, for example, traversing a tree in order requires a guarantee equal to the number of nodes in the tree. Therefore, even modest increases in the size of the guarantee also increase the usefulness of HTM, thereby increasing the need for CPUs to either provide it or provide good-and-sufficient workarounds.

17.3.5.3 Improved Debugging Support

Another inhibitor to transaction size is the need to debug the transactions. The problem with current mechanisms is that a single-step exception aborts the enclosing transaction. There are a number of workarounds for this issue, including emulating the processor (slow!), substituting STM for HTM (slow and slightly different semantics!), playback techniques using repeated retries to emulate forward progress (strange failure modes!), and full support of debugging HTM transactions (complex!).

Should one of the HTM vendors produce an HTM system that allows straightforward use of classical debugging techniques within transactions, including breakpoints, single stepping, and print statements, this will make HTM much more compelling. Some transactional-memory researchers started to recognize this problem in 2013, with at least one proposal involving hardware-assisted debugging facilities [GKP13]. Of course, this proposal depends on readily available hardware gaining such facilities [Hay20, Int20b]. Worse yet, some cutting-edge debugging facilities are incompatible with HTM [OHOC20].

17.3.5.4 Weak Atomicity

Given that HTM is likely to face some sort of size limitations for the foreseeable future, it will be necessary for HTM to interoperate smoothly with other mechanisms. HTM’s interoperability with read-mostly mechanisms such as hazard pointers and RCU would be improved if extra-transactional reads did not unconditionally abort transactions with conflicting writes—instead, the read could simply be provided with the pre-transaction value. In this way, hazard pointers and RCU could be used to allow HTM to handle larger data structures and to reduce conflict probabilities.

This is not necessarily simple, however. The most straightforward way of implementing this requires an additional state in each cache line and on the bus, which is a non-trivial added expense. The benefit that goes along with this expense is permitting large-footprint readers without the risk of starving updaters due to continual conflicts. An alternative approach, applied to great effect to binary search trees by Siakavaras et al. [SNGK17], is to use RCU for read-only traversals and HTM only for the actual updates themselves. This combination outperformed other transactional-memory techniques by up to 220 %, a speedup similar to that observed by Howard and Walpole [HW11] when they combined RCU with STM. In both cases, the weak atomicity is implemented in software rather than in hardware. It would nevertheless be interesting to see what additional speedups could be obtained by implementing weak atomicity in both hardware and software.
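A rough sketch of this RCU-plus-HTM division of labor appears below, assuming the userspace-RCU API and Intel RTM intrinsics; the list type, the global fallback lock, and the fallback handling are simplified for illustration. A production version would, as in the earlier elision sketch, also subscribe transactions to the fallback lock.

	#include <immintrin.h>
	#include <pthread.h>
	#include <urcu.h>

	struct node {
		int key;
		struct node *next;
	};

	struct list {
		struct node *first;
	};

	static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

	int lookup(struct list *head, int key)		/* RCU-protected read-only traversal. */
	{
		struct node *p;
		int found = 0;

		rcu_read_lock();
		for (p = rcu_dereference(head->first); p; p = rcu_dereference(p->next))
			if (p->key == key) {
				found = 1;
				break;
			}
		rcu_read_unlock();
		return found;
	}

	void insert_after(struct node *prev, struct node *newp)	/* Tiny transactional update. */
	{
		newp->next = prev->next;
		if (_xbegin() == _XBEGIN_STARTED) {
			rcu_assign_pointer(prev->next, newp);
			_xend();
		} else {				/* Fall back to locking on abort. */
			pthread_mutex_lock(&fallback_lock);
			rcu_assign_pointer(prev->next, newp);
			pthread_mutex_unlock(&fallback_lock);
		}
	}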
17.4 Formal Regression Testing?
their capabilities continue to grow, could well become excellent additions to regression suites. The Coverity static-analysis tool also inputs C programs, and of very large size, including the Linux kernel. Of course, Coverity’s static analysis is quite simple compared to that of cbmc and Nidhugg. On the other hand, Coverity had an all-encompassing definition of “C program” that posed special challenges [BBC+ 10]. Amazon Web Services uses a variety of formal-verification tools, including cbmc, and applies some of these tools to regression testing [Coo18]. Google uses a number of relatively simple static analysis tools directly on large Java code bases, which are arguably less diverse than C code bases [SAE+ 18]. Facebook uses more aggressive forms of formal verification against its code bases, including analysis of concurrency [DFLO19, O’H19], though not yet on the Linux kernel. Finally, Microsoft has long used static analysis on its code bases [LBD+ 04].

Given this list, it is clearly possible to create sophisticated formal-verification tools that directly consume production-quality source code.

However, one shortcoming of taking C code as input is that it assumes that the compiler is correct. An alternative approach is to take the binary produced by the C compiler as input, thereby accounting for any relevant compiler bugs. This approach has been used in a number of verification efforts, perhaps most notably by the SEL4 project [SM13].

Quick Quiz 17.17: Given the groundbreaking nature of the various verifiers used in the SEL4 project, why doesn’t this chapter cover them in more depth?

However, verifying directly from either the source or the binary has the advantage of eliminating human translation errors, which is critically important for reliable regression testing.

This is not to say that tools with special-purpose languages are useless. On the contrary, they can be quite helpful for design-time verification, as was discussed in Chapter 12. However, such tools are not particularly helpful for automated regression testing, which is in fact the topic of this section.

17.4.2 Environment

It is critically important that formal-verification tools correctly model their environment. One all-too-common omission is the memory model, where a great many formal-verification tools, including Promela/spin, are restricted to sequential consistency. The QRCU experience related in Section 12.1.4.6 is an important cautionary tale.

Promela and spin assume sequential consistency, which is not a good match for modern computer systems, as was seen in Chapter 15. In contrast, one of the great strengths of PPCMEM and herd is their detailed modeling of various CPU families’ memory models, including x86, Arm, Power, and, in the case of herd, a Linux-kernel memory model [AMM+ 18], which was accepted into Linux-kernel version v4.17.
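For a concrete sense of what these memory-model-aware tools consume, here is a classic message-passing litmus test in the C-like format used by the Linux-kernel memory model. It is a generic illustration rather than one of this book's own tests, and can be checked with a command along the lines of "herd7 -conf linux-kernel.cfg MP+fencewmb+fencermb.litmus". The exists clause asks whether P1() can observe the flag y as set while missing the payload x, which the two barriers forbid.

	C MP+fencewmb+fencermb

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		smp_wmb();
		WRITE_ONCE(*y, 1);
	}

	P1(int *x, int *y)
	{
		int r1;
		int r2;

		r1 = READ_ONCE(*y);
		smp_rmb();
		r2 = READ_ONCE(*x);
	}

	exists (1:r1=1 /\ 1:r2=0)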
book uses more aggressive forms of formal verifica- In the longer term, it would be helpful for formal-
tion against its code bases, including analysis of con- verification tools to include I/O [MDR16], but it may be
currency [DFLO19, O’H19], though not yet on the Linux some time before this comes to pass.
kernel. Finally, Microsoft has long used static analysis on Nevertheless, tools that fail to match the environment
its code bases [LBD+ 04]. can still be useful. For example, a great many concur-
Given this list, it is clearly possible to create sophis- rency bugs would still be bugs on a mythical sequentially
ticated formal-verification tools that directly consume consistent system, and these bugs could be located by a
production-quality source code. tool that over-approximates the system’s memory model
However, one shortcoming of taking C code as input is with sequential consistency. Nevertheless, these tools
that it assumes that the compiler is correct. An alternative will fail to find bugs involving missing memory-ordering
approach is to take the binary produced by the C compiler directives, as noted in the aforementioned cautionary tale
as input, thereby accounting for any relevant compiler bugs. of Section 12.1.4.6.
This approach has been used in a number of verification
efforts, perhaps most notably by the SEL4 project [SM13].
Quick Quiz 17.17: Given the groundbreaking nature of the
17.4.3 Overhead
various verifiers used in the SEL4 project, why doesn’t this Almost all hard-core formal-verification tools are expo-
chapter cover them in more depth?
nential in nature, which might seem discouraging until
However, verifying directly from either the source or you consider that many of the most interesting software
binary both have the advantage of eliminating human questions are in fact undecidable. However, there are
translation errors, which is critically important for reliable differences in degree, even among exponentials.
regression testing. PPCMEM by design is unoptimized, in order to provide
This is not to say that tools with special-purpose lan- greater assurance that the memory models of interest are
guages are useless. On the contrary, they can be quite accurately represented. The herd tool optimizes more
helpful for design-time verification, as was discussed in aggressively, as described in Section 12.3, and is thus
Chapter 12. However, such tools are not particularly orders of magnitude faster than PPCMEM. Nevertheless,
helpful for automated regression testing, which is in fact both PPCMEM and herd target very small litmus tests
the topic of this section. rather than larger bodies of code.
In contrast, Promela/spin, cbmc, and Nidhugg
are designed for (somewhat) larger bodies of code.
17.4.2 Environment Promela/spin was used to verify the Curiosity rover’s
It is critically important that formal-verification tools filesystem [GHH+ 14] and, as noted earlier, both cbmc and
correctly model their environment. One all-too-common Nidhugg were appled to Linux-kernel RCU.
omission is the memory model, where a great many formal- If advances in heuristics continue at the rate of the past
verification tools, including Promela/spin, are restricted three decades, we can look forward to large reductions in
to sequential consistency. The QRCU experience related overhead for formal verification. That said, combinatorial
in Section 12.1.4.6 is an important cautionary tale. explosion is still combinatorial explosion, which would be
expected to sharply limit the size of programs that could be verified, with or without continued improvements in heuristics.

However, the flip side of combinatorial explosion is Philip II of Macedon's timeless advice: "Divide and rule." If a large program can be divided and the pieces verified, the result can be combinatorial implosion [McK11e]. One natural place to divide is on API boundaries, for example, those of locking primitives. One verification pass can then verify that the locking implementation is correct, and additional verification passes can verify correct use of the locking APIs.

The performance benefits of this approach can be demonstrated using the Linux-kernel memory model [AMM+18]. This model provides spin_lock() and spin_unlock() primitives, but these primitives can also be emulated using cmpxchg_acquire() and smp_store_release(), as shown in Listing 17.2 (C-SB+l-o-o-u+l-o-o-*u.litmus and C-SB+l-o-o-u+l-o-o-u*-C.litmus). Table 17.4 compares the performance and scalability of using the model's spin_lock() and spin_unlock() against emulating these primitives as shown in the listing. The difference is not insignificant: At four processes, the model is more than two orders of magnitude faster than emulation!

Listing 17.2: Emulating Locking with cmpxchg_acquire()
1 C C-SB+l-o-o-u+l-o-o-u-C
2
3 {}
4
5 P0(int *sl, int *x0, int *x1)
6 {
7 int r2;
8 int r1;
9
10 r2 = cmpxchg_acquire(sl, 0, 1);
11 WRITE_ONCE(*x0, 1);
12 r1 = READ_ONCE(*x1);
13 smp_store_release(sl, 0);
14 }
15
16 P1(int *sl, int *x0, int *x1)
17 {
18 int r2;
19 int r1;
20
21 r2 = cmpxchg_acquire(sl, 0, 1);
22 WRITE_ONCE(*x1, 1);
23 r1 = READ_ONCE(*x0);
24 smp_store_release(sl, 0);
25 }
26
27 filter (0:r2=0 /\ 1:r2=0)
28 exists (0:r1=0 /\ 1:r1=0)

Table 17.4: Emulating Locking: Performance (s)

# Threads    Locking    cmpxchg_acquire
    2          0.004          0.022
    3          0.041          0.743
    4          0.374         59.565
    5          4.905

Quick Quiz 17.18: Why bother with a separate filter command on line 27 of Listing 17.2 instead of just adding the condition to the exists clause? And wouldn't it be simpler to use xchg_acquire() instead of cmpxchg_acquire()?

It would of course be quite useful for tools to automatically divide up large programs, verify the pieces, and then verify the combinations of pieces. In the meantime, verification of large programs will require significant manual intervention. This intervention will preferably be mediated by scripting, the better to reliably carry out repeated verifications on each release, and preferably eventually in a manner well-suited for continuous integration. And Facebook's Infer tool has taken important steps towards doing just that, via compositionality and abstraction [BGOS18, DFLO19].

In any case, we can expect formal-verification capabilities to continue to increase over time, and any such increases will in turn increase the applicability of formal verification to regression testing.

17.4.4 Locate Bugs

Any software artifact of any size contains bugs. Therefore, a formal-verification tool that reports only the presence or absence of bugs is not particularly useful. What is needed is a tool that gives at least some information as to where the bug is located and the nature of that bug.

The cbmc output includes a traceback mapping back to the source code, similar to Promela/spin's, as does Nidhugg. Of course, these tracebacks can be quite long, and analyzing them can be quite tedious. However, doing so is usually quite a bit faster and more pleasant than locating bugs the old-fashioned way.

In addition, one of the simplest tests of formal-verification tools is bug injection. After all, not only could any of us write printf("VERIFIED\n"), but the plain fact is that developers of formal-verification tools are just as bug-prone as are the rest of us. Therefore, formal-verification tools that just proclaim that a bug exists are fundamentally less trustworthy because it is more difficult to verify them on real-world code.

All that aside, people writing formal-verification tools are permitted to leverage existing tools. For example, a
tool designed to determine only the presence or absence of a serious but rare bug might leverage bisection. If an old version of the program under test did not contain the bug, but a new version did, then bisection could be used to quickly locate the commit that inserted the bug, which might be sufficient information to find and fix the bug. Of course, this sort of strategy would not work well for common bugs because in this case bisection would fail due to all commits having at least one instance of the common bug.

Therefore, the execution traces provided by many formal-verification tools will continue to be valuable, particularly for complex and difficult-to-understand bugs. In addition, recent work applies incorrectness-logic formalism reminiscent of the traditional Hoare logic used for full-up correctness proofs, but with the sole purpose of finding bugs [O'H19].

17.4.5 Minimal Scaffolding

In the old days, formal-verification researchers demanded a full specification against which the software would be verified. Unfortunately, a mathematically rigorous specification might well be larger than the actual code, and each line of specification is just as likely to contain bugs as is each line of code. A formal verification effort proving that the code faithfully implemented the specification would be a proof of bug-for-bug compatibility between the two, which might not be all that helpful.

Worse yet, the requirements for a number of software artifacts, including Linux-kernel RCU, are empirical in nature [McK15h, McK15e, McK15f].16 For this common type of software, a complete specification is a polite fiction. Nor are complete specifications any less fictional for hardware, as was made clear by the late-2017 Meltdown and Spectre side-channel attacks [Hor18].

16 Or, in formal-verification parlance, Linux-kernel RCU has an

This situation might cause one to give up all hope of formal verification of real-world software and hardware artifacts, but it turns out that there is quite a bit that can be done. For example, design and coding rules can act as a partial specification, as can assertions contained in the code. And in fact formal-verification tools such as cbmc and Nidhugg both check for assertions that can be triggered, implicitly treating these assertions as part of the specification. However, the assertions are also part of the code, which makes it less likely that they will become obsolete, especially if the code is also subjected to stress tests.17 The cbmc tool also checks for array-out-of-bound references, thus implicitly adding them to the specification. The aforementioned incorrectness logic can also be thought of as using an implicit bugs-not-present specification [O'H19].

This implicit-specification approach makes quite a bit of sense, particularly if you look at formal verification not as a full proof of correctness, but rather an alternative form of validation with a different set of strengths and weaknesses than the common case, that is, testing. From this viewpoint, software will always have bugs, and therefore any tool of any kind that helps to find those bugs is a very good thing indeed.

17.4.6 Relevant Bugs

Finding bugs—and fixing them—is of course the whole point of any type of validation effort. Clearly, false positives are to be avoided. But even in the absence of false positives, there are bugs and there are bugs.

For example, suppose that a software artifact had exactly 100 remaining bugs, each of which manifested on average once every million years of runtime. Suppose further that an omniscient formal-verification tool located all 100 bugs, which the developers duly fixed. What happens to the reliability of this software artifact?

The answer is that the reliability decreases.

To see this, keep in mind that historical experience indicates that about 7 % of fixes introduce a new bug [BJ12]. Therefore, fixing the 100 bugs, which had a combined mean time to failure (MTBF) of about 10,000 years, will introduce seven more bugs. Historical statistics indicate that each new bug will have an MTBF much less than 70,000 years. This in turn suggests that the combined MTBF of these seven new bugs will most likely be much less than 10,000 years, which in turn means that the well-intentioned fixing of the original 100 bugs actually decreased the reliability of the overall software.
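The arithmetic behind this example can be made explicit. Assuming independent bugs with constant failure rates (an assumption of this illustration, not something stated in the text), the combined MTBF is the reciprocal of the sum of the individual failure rates:

\[ \mathrm{MTBF}_{\mathrm{combined}} = \left( \sum_{i=1}^{n} \frac{1}{\mathrm{MTBF}_i} \right)^{-1} \]

For the original 100 bugs this gives (100 / 10^6 years)^-1 = 10,000 years. If each of the seven fix-induced bugs had an MTBF of exactly 70,000 years, their combined MTBF would already be 70,000 / 7 = 10,000 years, so bugs with MTBFs much less than 70,000 years necessarily drag the overall MTBF well below the original 10,000 years.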
Quick Quiz 17.19: How do we know that the MTBFs of known bugs is a good estimate of the MTBFs of bugs that have not yet been located?

Quick Quiz 17.20: But the formal-verification tools should immediately find all the bugs introduced by the fixes, so why is this a problem?

Worse yet, imagine another software artifact with one bug that fails once every day on average and 99 more
that fail every million years each. Suppose that a formal-verification tool located the 99 million-year bugs, but failed to find the one-day bug. Fixing the 99 bugs located will take time and effort, decrease reliability, and do nothing at all about the pressing each-day failure that is likely causing embarrassment and perhaps much worse besides.

Therefore, it would be best to have a validation tool that preferentially located the most troublesome bugs. However, as noted in Section 17.4.4, it is permissible to leverage additional tools. One powerful tool is none other than plain old testing. Given knowledge of the bug, it should be possible to construct specific tests for it, possibly also using some of the techniques described in Section 11.6.4 to increase the probability of the bug manifesting. These techniques should allow calculation of a rough estimate of the bug's raw failure rate, which could in turn be used to prioritize bug-fix efforts.

Quick Quiz 17.21: But many formal-verification tools can only find one bug at a time, so that each bug must be fixed before the tool can locate the next. How can bug-fix efforts be prioritized given such a tool?

There has been some recent formal-verification work that prioritizes executions having fewer preemptions, under the reasonable assumption that smaller numbers of preemptions are more likely.

Identifying relevant bugs might sound like too much to ask, but it is what is really required if we are to actually increase software reliability.

17.4.7 Formal Regression Scorecard

Table 17.5 shows a rough-and-ready scorecard for the formal-verification tools covered in this chapter. Shorter wavelengths are better than longer wavelengths.

Promela requires hand translation and supports only sequential consistency, so its first two cells are red. It has reasonable overhead (for formal verification, anyway) and provides a traceback, so its next two cells are yellow. Despite requiring hand translation, Promela handles assertions in a natural way, so its fifth cell is green.

PPCMEM usually requires hand translation due to the small size of litmus tests that it supports, so its first cell is orange. It handles several memory models, so its second cell is green. Its overhead is quite high, so its third cell is red. It provides a graphical display of relations among operations, which is not as helpful as a traceback, but is still quite useful, so its fourth cell is yellow. It requires constructing an exists clause and cannot take intra-process assertions, so its fifth cell is also yellow.

The herd tool has size restrictions similar to those of PPCMEM, so herd's first cell is also orange. It supports a wide variety of memory models, so its second cell is blue. It has reasonable overhead, so its third cell is yellow. Its bug-location and assertion capabilities are quite similar to those of PPCMEM, so herd also gets yellow for the next two cells.

The cbmc tool inputs C code directly, so its first cell is blue. It supports a few memory models, so its second cell is yellow. It has reasonable overhead, so its third cell is also yellow; however, perhaps SAT-solver performance will continue improving. It provides a traceback, so its fourth cell is green. It takes assertions directly from the C code, so its fifth cell is blue.

Nidhugg also inputs C code directly, so its first cell is also blue. It supports only a couple of memory models, so its second cell is orange. Its overhead is quite low (for formal verification), so its third cell is green. It provides a traceback, so its fourth cell is green. It takes assertions directly from the C code, so its fifth cell is blue.

So what about the sixth and final row? It is too early to tell how any of the tools do at finding the right bugs, so they are all yellow with question marks.

Quick Quiz 17.22: How would testing stack up in the scorecard shown in Table 17.5?

Quick Quiz 17.23: But aren't there a great many more formal-verification systems than are shown in Table 17.5?

Once again, please note that this table rates these tools for use in regression testing. Just because many of them are a poor fit for regression testing does not at all mean that they are useless; in fact, many of them have proven their worth many times over.18 Just not for regression testing.

However, this might well change. After all, formal verification tools made impressive strides in the 2010s. If that progress continues, formal verification might well become an indispensable tool in the parallel programmer's validation toolbox.

18 For but one example, Promela was used to verify the file system of none other than the Curiosity Rover. Was your formal verification tool used on software that currently runs on Mars???
17.5 Functional Programming for Parallelism

The curious failure of functional programming for parallel applications.
Malte Skarupke

When I took my first-ever functional-programming class in the early 1980s, the professor asserted that the side-effect-free functional-programming style was well-suited to trivial parallelization and analysis. Thirty years later, this assertion remains, but mainstream production use of parallel functional languages is minimal, a state of affairs that might not be entirely unrelated to the professor's additional assertion that programs should neither maintain state nor do I/O. There is niche use of functional languages such as Erlang, and multithreaded support has been added to several other functional languages, but mainstream production usage remains the province of procedural languages such as C, C++, Java, and Fortran (usually augmented with OpenMP, MPI, or coarrays).

This situation naturally leads to the question "If analysis is the goal, why not transform the procedural language into a functional language before doing the analysis?" There are of course a number of objections to this approach, of which I list but three:

1. Procedural languages often make heavy use of global variables, which can be updated independently by different functions, or, worse yet, by multiple threads. Note that Haskell's monads were invented to deal with single-threaded global state, and that multithreaded access to global state inflicts additional violence on the functional model.

2. Multithreaded procedural languages often use synchronization primitives such as locks, atomic operations, and transactions, which inflict added violence upon the functional model.

3. Procedural languages can alias function arguments, for example, by passing a pointer to the same structure via two different arguments to the same invocation of a given function. This can result in the function unknowingly updating that structure via two different (and possibly overlapping) code sequences, which greatly complicates analysis.
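As a concrete illustration of the third objection, consider the following function, in which both parameters may legally point to the same structure. The type and function names are invented for this sketch.

struct point {
	int x;
	int y;
};

/* If a and b alias the same structure, the first statement silently
 * changes the b->y that the second statement reads, so the function's
 * behavior depends on aliasing that is invisible in its signature. */
void shift(struct point *a, struct point *b)
{
	a->x += b->y;
	b->y += a->x;
}

/* shift(&p, &p) therefore updates p via two overlapping code
 * sequences, which is exactly what complicates analysis. */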
Of course, given the importance of global state, synchronization primitives, and aliasing, clever functional-programming experts have proposed any number of attempts to reconcile the functional-programming model to them, monads being but one case in point.

Another approach is to compile the parallel procedural program into a functional program, then to use functional-programming tools to analyze the result. But it is possible to do much better than this, given that any real computation is a large finite-state machine with finite input that runs for a finite time interval. This means that any real program can be transformed into an expression, albeit possibly an impractically large one [DHK12].

However, a number of the low-level kernels of parallel algorithms transform into expressions that are small enough to fit easily into the memories of modern computers. If such an expression is coupled with an assertion, checking to see if the assertion would ever fire becomes a satisfiability problem. Even though satisfiability problems are NP-complete, they can often be solved in much less time than would be required to generate the full state space. In addition, the solution time appears to be only weakly dependent on the underlying memory model, so that algorithms running on weakly ordered systems can also be checked [AKT13].

The general approach is to transform the program into single-static-assignment (SSA) form, so that each assignment to a variable creates a separate version of that variable.
This applies to assignments from all the active threads, so that the resulting expression embodies all possible
executions of the code in question. The addition of an
assertion entails asking whether any combination of inputs
and initial values can result in the assertion firing, which,
as noted above, is exactly the satisfiability problem.
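The following minimal single-threaded sketch shows the shape of this transformation; the variable names and the assertion are invented for illustration, and the multithreaded case simply adds a version for each assignment from each thread, as described in the surrounding text.

#include <assert.h>

int check(int a0, int b0)
{
	/* Original fragment, written in terms of the function inputs: */
	int x = a0 + b0;
	if (x > 0)
		x = x - 1;
	assert(x != 42);

	/* The same fragment in SSA form: each assignment defines a new
	 * version of x, and a conditional selection plays the role of
	 * the phi node merging the two branches. */
	int x1 = a0 + b0;
	int x2 = x1 - 1;
	int x3 = (x1 > 0) ? x2 : x1;
	return x3;
}

/* The assertion can fire if and only if the formula
 *   x1 == a0 + b0 && x2 == x1 - 1 &&
 *   x3 == ((x1 > 0) ? x2 : x1) && x3 == 42
 * is satisfiable for some inputs a0 and b0, and that formula is
 * exactly what is handed to the satisfiability solver. */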
One possible objection is that it does not gracefully
handle arbitrary looping constructs. However, in many
cases, this can be handled by unrolling the loop a finite
number of times. In addition, perhaps some loops will
also prove amenable to collapse via inductive methods.
Another possible objection is that spinlocks involve
arbitrarily long loops, and any finite unrolling would fail
to capture the full behavior of the spinlock. It turns out
that this objection is easily overcome. Instead of modeling
a full spinlock, model a trylock that attempts to obtain
the lock, and aborts if it fails to immediately do so. The
assertion must then be crafted so as to avoid firing in
cases where a spinlock aborted due to the lock not being
immediately available. Because the logic expression is
independent of time, all possible concurrency behaviors
will be captured via this approach.
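Below is a minimal sketch of this trylock-based modeling trick, using the same Linux-kernel-memory-model primitives as Listing 17.2. The lock variable and the critical-section placeholder are invented for this sketch.

int sl;	/* 0: lock free, 1: lock held */

void thread(void)
{
	int r;

	r = cmpxchg_acquire(&sl, 0, 1);	/* trylock: no loop to unroll */
	if (r != 0)
		return;	/* acquisition failed: such executions are excluded
			 * from the assertion, for example via a filter
			 * clause like the one in Listing 17.2 */
	/* critical section goes here */
	smp_store_release(&sl, 0);
}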
A final objection is that this technique is unlikely to
be able to handle a full-sized software artifact such as
the millions of lines of code making up the Linux kernel.
This is likely the case, but the fact remains that exhaustive
validation of each of the much smaller parallel primitives
within the Linux kernel would be quite valuable. And
in fact the researchers spearheading this approach have
applied it to non-trivial real-world code, including the
Tree RCU implementation in the Linux kernel [LMKM16,
KS17a].
It remains to be seen how widely applicable this tech-
nique is, but it is one of the more interesting innovations
in the field of formal verification. Although it might well
be that the functional-programming advocates are at long
last correct in their assertion of the inevitable dominance
of functional programming, it is clearly the case that
this long-touted methodology is starting to see credible
competition on its formal-verification home turf. There
is therefore continued reason to doubt the inevitability of
functional-programming dominance.
17.6 Summary
This chapter has taken a quick tour of a number of possible
futures, including multicore, transactional memory, formal
verification as a regression test, and concurrent functional
programming. Any of these futures might come true, but it is more likely that, as in the past, the future will be far stranger than we can possibly imagine.
History is the sum total of things that could have been avoided.

Looking Forward and Back
You have arrived at the end of this book, well done! I hope that your journey was a pleasant but challenging and worthwhile one.

For your editor and contributors, this is the end of the journey to the Second Edition, but for those willing to join in, it is also the start of the journey to the Third Edition. Either way, it is good to recap this past journey.

Chapter 1 covered what this book is about, along with some alternatives for those interested in something other than low-level parallel programming.

Chapter 2 covered parallel-programming challenges and high-level approaches for addressing them. It also touched on ways of avoiding these challenges while nevertheless still gaining most of the benefits of parallelism.

Chapter 3 gave a high-level overview of multicore hardware, especially those aspects that pose challenges for concurrent software. This chapter puts the blame for these challenges where it belongs, very much on the laws of physics and rather less on intransigent hardware architects and designers. However, there might be some things that hardware architects and engineers can do, and this chapter discusses a few of them. In the meantime, software architects and engineers must do their part to meet these challenges, as discussed in the rest of the book.

Chapter 4 gave a quick overview of the tools of the low-level concurrency trade. Chapter 5 then demonstrated use of those tools—and, more importantly, use of parallel-programming design techniques—on the simple but surprisingly challenging task of concurrent counting. So challenging, in fact, that a number of concurrent counting algorithms are in common use, each specialized for a different use case.

Chapter 6 dug more deeply into the most important parallel-programming design technique, namely partitioning the problem at the highest possible level. This chapter also overviewed a number of points in this design space.

Chapter 7 expounded on that parallel-programming workhorse (and villain), locking. This chapter covered a number of types of locking and presented some engineering solutions to many well-known and aggressively advertised shortcomings of locking.

Chapter 8 discussed the uses of data ownership, where synchronization is supplied by the association of a given data item with a specific thread. Where it applies, this approach combines excellent performance and scalability with profound simplicity.

Chapter 9 showed how a little procrastination can greatly improve performance and scalability, while in a surprisingly large number of cases also simplifying the code. A number of the mechanisms presented in this chapter take advantage of the ability of CPU caches to replicate read-only data, thus sidestepping the laws of physics that cruelly limit the speed of light and the smallness of atoms.

Chapter 10 looked at concurrent data structures, with emphasis on hash tables, which have a long and honorable history in parallel programs.

Chapter 11 dug into code-review and testing methods, and Chapter 12 overviewed formal verification. Whichever side of the formal-verification/testing divide you might be on, if code has not been thoroughly validated, it does not work. And that goes at least double for concurrent code.

Chapter 13 presented a number of situations where combining concurrency mechanisms with each other or with other design tricks can greatly ease parallel programmers' lives. Chapter 14 looked at advanced synchronization methods, including lockless programming, non-blocking synchronization, and parallel real-time computing. Chapter 15 dug into the critically important topic of memory ordering, presenting techniques and tools to help you not only solve memory-ordering problems, but also to avoid them completely. Chapter 16 presented a brief overview of the surprisingly important topic of ease of use.
Last, but definitely not least, Chapter 17 expounded on a number of conflicting visions of the future, including CPU-technology trends, transactional memory, hardware transactional memory, use of formal verification in regression testing, and the long-standing prediction that the future of parallel programming belongs to functional-programming languages.

But now that we have recapped the contents of this Second Edition, how did this book get started?

Paul's parallel-programming journey started in earnest in 1990, when he joined Sequent Computer Systems, Inc. Sequent used an apprenticeship-like program in which newly hired engineers were placed in cubicles surrounded by experienced engineers, who mentored them, reviewed their code, and gave copious quantities of advice on a variety of topics. A few of the newly hired engineers were greatly helped by the fact that there were no on-chip caches in those days, which meant that logic analyzers could easily display a given CPU's instruction stream and memory accesses, complete with accurate timing information. Of course, the downside of this transparency was that CPU core clock frequencies were 100 times slower than those of the twenty-first century. Between apprenticeship and hardware performance transparency, these newly hired engineers became productive parallel programmers within two or three months, and some were doing ground-breaking work within a couple of years.

Sequent understood that its ability to quickly train new engineers in the mysteries of parallelism was unusual, so it produced a slim volume that crystalized the company's parallel-programming wisdom [Seq88], which joined a pair of groundbreaking papers that had been written a few years earlier [BK85, Inm85]. People already steeped in these mysteries saluted this book and these papers, but novices were usually unable to benefit much from them, invariably making highly creative and quite destructive errors that were not explicitly prohibited by either the book or the papers.1 This situation of course caused Paul to start thinking in terms of writing an improved book, but his efforts during this time were limited to internal training materials and to published papers.

By the time Sequent was acquired by IBM in 1999, many of the world's largest database instances ran on Sequent hardware. But times change, and by 2001 many of Sequent's parallel programmers had shifted their focus to the Linux kernel. After some initial reluctance, the Linux kernel community embraced concurrency both enthusiastically and effectively [BWCM+10, McK12a], with many excellent innovations and improvements from throughout the community. The thought of writing a book occurred to Paul from time to time, but life was flowing fast, so he made no progress on this project.

In 2006, Paul was invited to a conference on Linux scalability, and was granted the privilege of asking the last question of a panel of esteemed parallel-programming experts. Paul began his question by noting that in the 15 years from 1991 to 2006, the price of a parallel system had dropped from that of a house to that of a mid-range bicycle, and it was clear that there was much more room for additional dramatic price decreases over the next 15 years extending to the year 2021. He also noted that decreasing price should result in greater familiarity and faster progress in solving parallel-programming problems. This led to his question: "In the year 2021, why wouldn't parallel programming have become routine?"

The first panelist seemed quite disdainful of anyone who would ask such an absurd question, and quickly responded with a soundbite answer. To which Paul gave a soundbite response. They went back and forth for some time, for example, the panelist's sound-bite answer "Deadlock" provoked Paul's sound-bite response "Lock dependency checker".

The panelist eventually ran out of soundbites, improvising a final "People like you should be hit over the head with a hammer!"

Paul's response was of course "You will have to get in line for that!"

Paul turned his attention to the next panelist, who seemed torn between agreeing with the first panelist and not wishing to have to deal with Paul's series of responses. He therefore gave a short non-committal speech. And so it went through the rest of the panel.

Until it was the turn of the last panelist, who was someone you might have heard of who goes by the name of Linus Torvalds. Linus noted that three years earlier (that is, 2003), the initial version of any concurrency-related patch was usually quite poor, having design flaws and many bugs. And even when it was cleaned up enough to be accepted, bugs still remained. Linus contrasted this with the then-current situation in 2006, in which he said that it was not unusual for the first version of a concurrency-related patch to be well-designed with few or even no bugs. He then suggested that if tools continued to improve, then maybe parallel programming would become routine by the year 2021.2

1 "But why on earth would you do that???" "Well, why not?"

2 Tools have in fact continued to improve, including fuzzers, lock dependency checkers, static analyzers, formal verification, memory
Ask me no questions, and I’ll tell you no fibs.
Important Questions
The following sections discuss some important questions relating to SMP programming. Each section also shows how to avoid worrying about the corresponding question, which can be extremely important if your goal is to simply get your SMP code working as quickly and painlessly as possible—which is an excellent goal, by the way!

Although the answers to these questions are often less intuitive than they would be in a single-threaded setting, with a bit of work, they are not that difficult to understand. If you managed to master recursion, there is nothing here that should pose an overwhelming challenge.

With that, here are the questions:

1. Why aren't parallel programs always faster? (Appendix A.1)

For more information on this question, see Chapter 3, Section 5.1, and especially Chapter 6, each of which presents ways of slowing down your code by ineptly parallelizing it. Of course, much of this book deals with ways of ensuring that your parallel programs really are faster than their sequential counterparts.

However, never forget that parallel programs can be quite fast while at the same time being quite simple, with the example in Section 4.1 being a case in point. Also never forget that parallel execution is but one optimization of many, and there are programs for which other optimizations produce better results.
[Figure A.1: What Time Is It? (cartoon: "What time is it?" "Uh. When did you ask?")]

A.3 What Time Is It?

A key issue with timekeeping on multicore computer systems is illustrated by Figure A.1. One problem is that it takes time to read out the time. An instruction might read from a hardware clock, and might have to go off-core (or worse yet, off-socket) to complete this read operation. It might also be necessary to do some computation on the value read out, for example, to convert it to the desired format, to apply network time protocol (NTP) adjustments, and so on. So does the time eventually returned correspond to the beginning of the resulting time interval, the end, or somewhere in between?

Worse yet, the thread reading the time might be interrupted or preempted. Furthermore, there will likely be some computation between reading out the time and the actual use of the time that has been read out. Both of these possibilities further extend the interval of uncertainty.

One approach is to read the time twice, and take the arithmetic mean of the two readings, perhaps one on each side of the operation being timestamped. The difference between the two readings is then a measure of uncertainty of the time at which the intervening operation occurred.
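A minimal user-space sketch of this two-readings approach, using the POSIX CLOCK_MONOTONIC clock; the helper names are invented for this example.

#include <time.h>

static double now_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Timestamp an operation as the midpoint of two clock readings,
 * reporting half of the difference as the uncertainty. */
void timestamped_op(void (*op)(void), double *when, double *uncertainty)
{
	double t1 = now_seconds();
	double t2;

	op();
	t2 = now_seconds();
	*when = (t1 + t2) / 2.0;
	*uncertainty = (t2 - t1) / 2.0;
}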
Of course, in many cases, the exact time is not necessary. For example, when printing the time for the benefit of a human user, we can rely on slow human reflexes to render internal hardware and software delays irrelevant. Similarly, if a server needs to timestamp the response to a client, any time between the reception of the request and the transmission of the response will do equally well.

There is an old saying that those who have but one clock always know the time, but those who have several clocks can never be sure. And there was a time when the typical low-end computer's sole software-visible clock was its program counter, but those days are long gone. This is not a bad thing, considering that on modern computer systems, the program counter is a truly horrible clock [MOZ09].

In addition, different clocks provide different tradeoffs of performance, accuracy, precision, and ordering. For example, in the Linux kernel, the jiffies counter1 provides high-speed access to a coarse-grained counter (at best one-millisecond accuracy and precision) that imposes almost no ordering on either the compiler or the hardware. In contrast, the x86 HPET hardware provides an accurate and precise clock, but at the price of slow access. The x86 time-stamp counter has a checkered past, but is more recently held out as providing a good combination of precision, accuracy, and performance. Unfortunately, for all of these counters, ordering against all effects of prior and subsequent code requires expensive memory-barrier instructions. And this expense appears to be an unavoidable consequence of the complex superscalar nature of modern computer systems.

In addition, all of these clock sources provide their own timebase, so that (for example) the jiffies counter on one system is not necessarily compatible with that of another. Not only did they start counting at different times, but they might well be counting at different rates. This brings up the topic of synchronizing a given system's counters with some real-world notion of time, but that topic is beyond the scope of this book.

In short, time is a slippery topic that causes untold confusion to parallel programmers and to their code.

1 The jiffies variable is a location in normal memory that is incremented by software in response to events such as the scheduling-clock interrupt.

A.4 What Does "After" Mean?

"After" is an intuitive, but surprisingly difficult concept. An important non-intuitive issue is that code can be delayed at any point for any amount of time. Consider a producing and a consuming thread that communicate using a global struct with a timestamp "t" and integer fields "a", "b", and "c". The producer loops recording the current time (in seconds since 1970 in decimal), then updating the values of "a", "b", and "c", as shown in Listing A.1. The consumer code loops, also recording the current time, but also copying the producer's timestamp along with the
software is also unreliable. Is there a happy medium with both robust reliability on the one hand and powerful performance augmented by scintillating scalability on the other?

The answer, as is so often the case, is "it depends". One approach is to construct a strongly ordered system, then examine its performance and scalability. If these suffice, the system is good and sufficient, and no more need be done. Otherwise, undertake careful analysis (see Section 11.7) and attack each bottleneck until the system's performance is good and sufficient.

This approach can work very well, especially in contrast to the all-too-common approach of optimizing random components of the system in the hope of achieving significant system-wide benefits. However, starting with strong ordering can also be quite wasteful, given that weakening ordering of the system's bottleneck can require that large portions of the rest of the system be redesigned and rewritten to accommodate the weakening. Worse yet, eliminating one bottleneck often exposes another, which in turn needs to be weakened and which in turn can result in wholesale redesigns and rewrites of other parts of the system. Perhaps even worse is the approach, also common, of starting with a fast but unreliable system and then playing whack-a-mole with an endless succession of concurrency bugs, though in the latter case, Chapters 11 and 12 are always there for you.

It would be better to have design-time tools to determine which portions of the system could use weak ordering, and at the same time, which portions actually benefit from weak ordering. These tasks are taken up by the following sections.

A.5.1 Where is the Defining Data?

One way to do this is to keep firmly in mind that the region of consistency engendered by strong ordering cannot extend out past the boundaries of the system.2 Portions of the system whose role is to track the state of the outside world can usually feature weak ordering, given that speed-of-light delays will force the within-system state to lag that of the outside world. There is often no point in incurring large overheads to force a consistent view of data that is inherently out of date. In these cases, the methods of Chapter 9 can be quite helpful, as can some of the data structures described in Chapter 10.

2 Which might well be a distributed system.

Nevertheless, it is wise to adopt some meaningful semantics that are visible to those accessing the data, for example, a given function's return value might be:

1. Some value between the conceptual value at the time of the call to the function and the conceptual value at the time of the return from that function. For example, see the statistical counters discussed in Section 5.2, keeping in mind that such counters are normally monotonic, at least between consecutive overflows.

2. The actual value at some time between the call to and the return from that function. For example, see the single-variable atomic counter shown in Listing 5.2.

3. If the values used by that function remain unchanged during the time between that function's call and return, the expected value, otherwise some approximation to the expected value. Precise specification of the bounds on the approximation can be quite challenging. For example, consider a function combining values from different elements of an RCU-protected linked data structure, as described in Section 10.3.

Weaker ordering usually implies weaker semantics, and you should be able to give some sort of promise to your users as to how this weakening affects them. At the same time, unless the caller holds a lock across both the function call and the use of any values computed by that function, even fully ordered implementations normally cannot do any better than the semantics given by the options above.

Quick Quiz A.3: But if fully ordered implementations cannot offer stronger guarantees than the better performing and more scalable weakly ordered implementations, why bother with full ordering?

Some might argue that useful computing deals only with the outside world, and therefore that all computing can use weak ordering. Such arguments are incorrect. For example, the value of your bank account is defined within your bank's computers, and people often prefer exact computations involving their account balances, especially those who might suspect that any such approximations would be in the bank's favor.

In short, although data tracking external state can be an attractive candidate for weakly ordered access, please think carefully about exactly what is being tracked and what is doing the tracking.
A.5.2 Consistent Data Used Consistently?

Another hint that weakening is safe can appear in the guise of data that is computed while holding a lock, but then used after the lock is released. The computed result clearly becomes at best an approximation as soon as the lock is released, which suggests computing an approximate result in the first place, possibly permitting use of weaker ordering. To this end, Chapter 5 covers numerous approximate methods for counting.

Great care is required, however. Is the use of data following lock release a hint that weak-ordering optimizations might be helpful? Or is it instead a bug in which the lock was released too soon?
A.5.3 Is the Problem Partitionable?

Suppose that the system holds the defining instance of the data, or that using a computed value past lock release proved to be a bug. What then?

One approach is to partition the system, as discussed in Chapter 6. Partitioning can provide excellent scalability and, in its more extreme form, per-CPU performance rivaling that of a sequential program, as discussed in Chapter 8. Partial partitioning is often mediated by locking, which is the subject of Chapter 7.

A.5.4 None of the Above?

The previous sections described the easier ways to gain performance and scalability, sometimes using weaker ordering and sometimes not. But the plain fact is that multicore systems are under no compunction to make life easy. But perhaps the advanced topics covered in Chapters 14 and 15 will prove helpful.

But please proceed with care, as it is all too easy to destabilize your codebase optimizing non-bottlenecks. Once again, Section 11.7 can help. It might also be worth your time to review other portions of this book, as it contains much information on handling a number of tricky situations.

A.6 What is the Difference Between "Concurrent" and "Parallel"?

From a classic computing perspective, "concurrent" and "parallel" are clearly synonyms. However, this has not stopped many people from drawing distinctions between the two, and it turns out that these distinctions can be understood from a couple of different perspectives.

The first perspective treats "parallel" as an abbreviation for "data parallel", and treats "concurrent" as pretty much everything else. From this perspective, in parallel computing, each partition of the overall problem can proceed completely independently, with no communication with other partitions. In this case, little or no coordination among partitions is required. In contrast, concurrent computing might well have tight interdependencies, in the form of contended locks, transactions, or other synchronization mechanisms.

Quick Quiz A.4: Suppose a portion of a program uses RCU read-side primitives as its only synchronization mechanism. Is this parallelism or concurrency?

This of course begs the question of why such a distinction matters, which brings us to the second perspective, that of the underlying scheduler. Schedulers come in a wide range of complexities and capabilities, and as a rough rule of thumb, the more tightly and irregularly a set of parallel processes communicate, the higher the level of sophistication required from the scheduler. As such, parallel computing's avoidance of interdependencies means that parallel-computing programs run well on the least-capable schedulers. In fact, a pure parallel-computing program can run successfully after being arbitrarily subdivided and interleaved onto a uniprocessor.3 In contrast, concurrent-computing programs might well require extreme subtlety on the part of the scheduler.

3 Yes, this does mean that data-parallel-computing programs are best-suited for sequential execution. Why did you ask?

One could argue that we should simply demand a reasonable level of competence from the scheduler, so that we could simply ignore any distinctions between parallelism and concurrency. Although this is often a good strategy, there are important situations where efficiency, performance, and scalability concerns sharply limit the level of competence that the scheduler can reasonably offer. One important example is when the scheduler is implemented in hardware, as it often is in SIMD units or GPGPUs. Another example is a workload where the units of work are quite short, so that even a software-based scheduler must make hard choices between subtlety on the one hand and efficiency on the other.

Now, this second perspective can be thought of as making the workload match the available scheduler, with parallel workloads able to use simple schedulers and concurrent workloads requiring sophisticated schedulers.
The only difference between men and boys is the price of their toys.
M. Hébert

Appendix B: "Toy" RCU Implementations
The toy RCU implementations in this appendix are designed not for high performance, practicality, or any kind of production use,1 but rather for clarity. Nevertheless, you will need a thorough understanding of Chapters 2, 3, 4, 6, and 9 for even these toy RCU implementations to be easily understandable.

1 However, production-quality user-level RCU implementations are available [Des09b, DMS+12].

This appendix provides a series of RCU implementations in order of increasing sophistication, from the viewpoint of solving the existence-guarantee problem. Appendix B.1 presents a rudimentary RCU implementation based on simple locking, while Appendices B.2 through B.9 present a series of simple RCU implementations based on locking, reference counters, and free-running counters. Finally, Appendix B.10 provides a summary and a list of desirable RCU properties.

B.1 Lock-Based RCU

Perhaps the simplest RCU implementation leverages locking, as shown in Listing B.1 (rcu_lock.h and rcu_lock.c).

Listing B.1: Lock-Based RCU Implementation
1 static void rcu_read_lock(void)
2 {
3 spin_lock(&rcu_gp_lock);
4 }
5
6 static void rcu_read_unlock(void)
7 {
8 spin_unlock(&rcu_gp_lock);
9 }
10
11 void synchronize_rcu(void)
12 {
13 spin_lock(&rcu_gp_lock);
14 spin_unlock(&rcu_gp_lock);
15 }

In this implementation, rcu_read_lock() acquires a global spinlock, rcu_read_unlock() releases it, and synchronize_rcu() acquires it then immediately releases it.

Because synchronize_rcu() does not return until it has acquired (and released) the lock, it cannot return until all prior RCU read-side critical sections have completed, thus faithfully implementing RCU semantics. Of course, only one RCU reader may be in its read-side critical section at a time, which almost entirely defeats the purpose of RCU. In addition, the lock operations in rcu_read_lock() and rcu_read_unlock() are extremely heavyweight, with read-side overhead ranging from about 100 nanoseconds on a single POWER5 CPU up to more than 17 microseconds on a 64-CPU system. Worse yet, these same lock operations permit rcu_read_lock() to participate in deadlock cycles. Furthermore, in the absence of recursive locks, RCU read-side critical sections cannot be nested, and, finally, although concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.

Quick Quiz B.1: Why wouldn't any deadlock in the RCU implementation in Listing B.1 also be a deadlock in any other RCU implementation?

Quick Quiz B.2: Why not simply use reader-writer locks in the RCU implementation in Listing B.1 in order to allow RCU readers to proceed in parallel?

It is hard to imagine this implementation being useful in a production setting, though it does have the virtue of being implementable in almost any user-level application. Furthermore, similar implementations having one lock per CPU or using reader-writer locks have been used in production in the 2.4 Linux kernel.
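As a usage sketch, the following hypothetical reader and updater show how these three primitives provide an existence guarantee for a dynamically allocated structure. The structure, the gptr pointer, and the use of READ_ONCE()/WRITE_ONCE() are assumptions made for this sketch rather than part of Listing B.1 itself.

#include <stdlib.h>

struct foo {
	int a;
};

struct foo *gptr;	/* RCU-protected pointer, possibly NULL */

/* Reader: *p cannot be freed until after rcu_read_unlock(). */
int reader(void)
{
	struct foo *p;
	int ret = -1;

	rcu_read_lock();
	p = READ_ONCE(gptr);
	if (p)
		ret = p->a;
	rcu_read_unlock();
	return ret;
}

/* Updater: unpublish the structure, wait for pre-existing readers,
 * and only then free it. */
void retire(void)
{
	struct foo *p = gptr;

	WRITE_ONCE(gptr, NULL);
	synchronize_rcu();
	free(p);
}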
Listing B.2: Per-Thread Lock-Based RCU Implementation
1 static void rcu_read_lock(void)
2 {
3 spin_lock(&__get_thread_var(rcu_gp_lock));
4 }
5
6 static void rcu_read_unlock(void)
7 {
8 spin_unlock(&__get_thread_var(rcu_gp_lock));
9 }
10
11 void synchronize_rcu(void)
12 {
13 int t;
14
15 for_each_running_thread(t) {
16 spin_lock(&per_thread(rcu_gp_lock, t));
17 spin_unlock(&per_thread(rcu_gp_lock, t));
18 }
19 }

Listing B.3: RCU Implementation Using Single Global Reference Counter
1 atomic_t rcu_refcnt;
2
3 static void rcu_read_lock(void)
4 {
5 atomic_inc(&rcu_refcnt);
6 smp_mb();
7 }
8
9 static void rcu_read_unlock(void)
10 {
11 smp_mb();
12 atomic_dec(&rcu_refcnt);
13 }
14
15 void synchronize_rcu(void)
16 {
17 smp_mb();
18 while (atomic_read(&rcu_refcnt) != 0) {
19 poll(NULL, 0, 10);
20 }
21 smp_mb();
22 }

A modified version of this one-lock-per-CPU approach,
allel execution of RCU read-side critical sections. In happy contrast to the per-thread lock-based implementation shown in Appendix B.2, it also allows them to be nested. In addition, the rcu_read_lock() primitive cannot possibly participate in deadlock cycles, as it never spins nor blocks.

Quick Quiz B.6: But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section?

However, this implementation still has some serious shortcomings. First, the atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight, with read-side overhead ranging from about 100 nanoseconds on a single POWER5 CPU up to almost 40 microseconds on a 64-CPU system. This means that the RCU read-side critical sections have to be extremely long in order to get any real read-side parallelism. On the other hand, in the absence of readers, grace periods elapse in about 40 nanoseconds, many orders of magnitude faster than production-quality implementations in the Linux kernel.

Quick Quiz B.7: How can the grace period possibly elapse in 40 nanoseconds when synchronize_rcu() contains a 10-millisecond delay?

Therefore, it is still hard to imagine this implementation being useful in a production setting, though it has a bit more potential than the lock-based mechanism, for example, as an RCU implementation suitable for a high-stress debugging environment. The next section describes

Listing B.4: RCU Global Reference-Count Pair Data
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 atomic_t rcu_refcnt[2];
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);

Listing B.5: RCU Read-Side Using Global Reference-Count Pair
1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = atomic_read(&rcu_idx);
9 __get_thread_var(rcu_read_idx) = i;
10 atomic_inc(&rcu_refcnt[i]);
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 atomic_dec(&rcu_refcnt[i]);
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }

Design It is the two-element rcu_refcnt[] array that provides the freedom from starvation. The key point is that synchronize_rcu() is only required to wait for pre-existing readers. If a new reader starts after a given instance of synchronize_rcu() has already
begun execution, then that instance of synchronize_rcu() need not wait on that new reader. At any given time, when a given reader enters its RCU read-side critical section via rcu_read_lock(), it increments the element of the rcu_refcnt[] array indicated by the rcu_idx variable. When that same reader exits its RCU read-side critical section via rcu_read_unlock(), it decrements whichever element it incremented, ignoring any possible subsequent changes to the rcu_idx value.

This arrangement means that synchronize_rcu() can avoid starvation by complementing the value of rcu_idx, as in rcu_idx = !rcu_idx. Suppose that the old value of rcu_idx was zero, so that the new value is one. New readers that arrive after the complement operation will increment rcu_refcnt[1], while the old readers that previously incremented rcu_refcnt[0] will decrement rcu_refcnt[0] when they exit their RCU read-side critical sections. This means that the value of rcu_refcnt[0] will no longer be incremented, and thus will be monotonically decreasing.2 This means that all that synchronize_rcu() need do is wait for the value of rcu_refcnt[0] to reach zero.

2 There is a race condition that this "monotonically decreasing" statement ignores. This race condition will be dealt with by the code for synchronize_rcu(). In the meantime, I suggest suspending disbelief.

With this background, we are ready to look at the implementation of the actual primitives.

Implementation The rcu_read_lock() primitive atomically increments the member of the rcu_refcnt[] pair indexed by rcu_idx, and keeps a snapshot of this index in the per-thread variable rcu_read_idx. The rcu_read_unlock() primitive then atomically decrements whichever counter of the pair that the corresponding rcu_read_lock() incremented. However, because only one value of rcu_idx is remembered per thread, additional measures must be taken to permit nesting. These additional measures use the per-thread rcu_nesting variable to track nesting.

To make all this work, line 6 of rcu_read_lock() in Listing B.5 picks up the current thread's instance of rcu_nesting, and if line 7 finds that this is the outermost rcu_read_lock(), then lines 8–10 pick up the current value of rcu_idx, save it in this thread's instance of rcu_read_idx, and atomically increment the selected element of rcu_refcnt. Regardless of the value of rcu_nesting, line 12 increments it. Line 13 executes a memory barrier to ensure that the RCU read-side critical section does not bleed out before the rcu_read_lock() code.

Similarly, the rcu_read_unlock() function executes a memory barrier at line 21 to ensure that the RCU read-side critical section does not bleed out after the rcu_read_unlock() code. Line 22 picks up this thread's instance of rcu_nesting, and if line 23 finds that this is the outermost rcu_read_unlock(), then lines 24 and 25 pick up this thread's instance of rcu_read_idx (saved by the outermost rcu_read_lock()) and atomically decrement the selected element of rcu_refcnt. Regardless of the nesting level, line 27 decrements this thread's instance of rcu_nesting.

Listing B.6: RCU Update Using Global Reference-Count Pair
1 void synchronize_rcu(void)
2 {
3 int i;
4
5 smp_mb();
6 spin_lock(&rcu_gp_lock);
7 i = atomic_read(&rcu_idx);
8 atomic_set(&rcu_idx, !i);
9 smp_mb();
10 while (atomic_read(&rcu_refcnt[i]) != 0) {
11 poll(NULL, 0, 10);
12 }
13 smp_mb();
14 atomic_set(&rcu_idx, i);
15 smp_mb();
16 while (atomic_read(&rcu_refcnt[!i]) != 0) {
17 poll(NULL, 0, 10);
18 }
19 spin_unlock(&rcu_gp_lock);
20 smp_mb();
21 }

Listing B.6 (rcu_rcpg.c) shows the corresponding synchronize_rcu() implementation. Lines 6 and 19 acquire and release rcu_gp_lock in order to prevent more than one concurrent instance of synchronize_rcu(). Lines 7 and 8 pick up the value of rcu_idx and complement it, respectively, so that subsequent instances of rcu_read_lock() will use a different element of rcu_refcnt than did preceding instances. Lines 10–12 then wait for the prior element of rcu_refcnt to reach zero, with the memory barrier on line 9 ensuring that the check of rcu_refcnt is not reordered to precede the complementing of rcu_idx. Lines 13–18 repeat this process, and line 20 ensures that any subsequent reclamation operations are not reordered to precede the checking of rcu_refcnt.

Quick Quiz B.9: Why the memory barrier on line 5 of synchronize_rcu() in Listing B.6 given that there is a spin-lock acquisition immediately after?
v2022.09.25a
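To see how these primitives fit together, consider the canonical RCU update pattern. The following sketch assumes the rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() primitives from Listings B.5 and B.6, together with the READ_ONCE(), WRITE_ONCE(), and smp_mb() macros used throughout this appendix; the struct and function names are illustrative only, and a single updater is assumed.

#include <stdlib.h>

struct config {
  int threshold;
};

struct config *global_config;  /* RCU-protected pointer, initialized elsewhere. */

/* Reader: may run concurrently with update_config(). */
int over_threshold(int value)
{
  int ret;

  rcu_read_lock();
  ret = value > READ_ONCE(global_config)->threshold;
  rcu_read_unlock();
  return ret;
}

/* Updater: publish a new version, wait for pre-existing readers,
 * and only then free the old version. */
void update_config(int new_threshold)
{
  struct config *newp = malloc(sizeof(*newp));
  struct config *oldp = global_config;  /* single updater assumed */

  if (newp == NULL)
    return;
  newp->threshold = new_threshold;
  smp_mb();                        /* Initialize before publishing. */
  WRITE_ONCE(global_config, newp); /* Publish the new version. */
  synchronize_rcu();               /* Wait for pre-existing readers. */
  free(oldp);                      /* No reader can still reference oldp. */
}

Because synchronize_rcu() need wait only for readers that started before the pointer was updated, the free() cannot pull memory out from under a reader still traversing the old structure.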
Quick Quiz B.10: Why is the counter flipped twice in Listing B.6? Shouldn’t a single flip-and-wait cycle be sufficient?

This implementation avoids the update-starvation issues that could occur in the single-counter implementation shown in Listing B.3.

Discussion   There are still some serious shortcomings. First, the atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight. In fact, they are more complex than those of the single-counter variant shown in Listing B.3, with the read-side primitives consuming about 150 nanoseconds on a single POWER5 CPU and almost 40 microseconds on a 64-CPU system. The update-side synchronize_rcu() primitive is more costly as well, ranging from about 200 nanoseconds on a single POWER5 CPU to more than 40 microseconds on a 64-CPU system. This means that the RCU read-side critical sections have to be extremely long in order to get any real read-side parallelism.

Second, if there are many concurrent rcu_read_lock() and rcu_read_unlock() operations, there will be extreme memory contention on the rcu_refcnt elements, resulting in expensive cache misses. This further extends the RCU read-side critical-section duration required to provide parallel read-side access. These first two shortcomings defeat the purpose of RCU in most situations.

Third, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.

Finally, despite the fact that concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.

Quick Quiz B.11: Given that atomic increment and decrement are so expensive, why not just use a non-atomic increment on line 10 and a non-atomic decrement on line 25 of Listing B.5?

Despite these shortcomings, one could imagine this variant of RCU being used on small tightly coupled multiprocessors, perhaps as a memory-conserving implementation that maintains API compatibility with more complex implementations. However, it would not likely scale well beyond a few CPUs.

The next section describes yet another variation on the reference-counting scheme that provides greatly improved read-side performance and scalability.

B.5 Scalable Counter-Based RCU

Listing B.8 (rcu_rcpl.h) shows the read-side primitives of an RCU implementation that uses per-thread pairs of reference counters. This implementation is quite similar to that shown in Listing B.5, the only difference being that rcu_refcnt is now a per-thread array (as shown in Listing B.7). As with the algorithm in the previous section, use of this two-element array prevents readers from starving updaters. One benefit of the per-thread rcu_refcnt[] array is that the rcu_read_lock() and rcu_read_unlock() primitives no longer perform atomic operations.

Quick Quiz B.12: Come off it! We can see the atomic_read() primitive in rcu_read_lock()!!! So why are you trying to pretend that rcu_read_lock() contains no atomic operations???

Listing B.7: RCU Per-Thread Reference-Count Pair Data
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);

Listing B.8: RCU Read-Side Using Per-Thread Reference-Count Pair
 1 static void rcu_read_lock(void)
 2 {
 3   int i;
 4   int n;
 5
 6   n = __get_thread_var(rcu_nesting);
 7   if (n == 0) {
 8     i = atomic_read(&rcu_idx);
 9     __get_thread_var(rcu_read_idx) = i;
10     __get_thread_var(rcu_refcnt)[i]++;
11   }
12   __get_thread_var(rcu_nesting) = n + 1;
13   smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18   int i;
19   int n;
20
21   smp_mb();
22   n = __get_thread_var(rcu_nesting);
23   if (n == 1) {
24     i = __get_thread_var(rcu_read_idx);
25     __get_thread_var(rcu_refcnt)[i]--;
26   }
27   __get_thread_var(rcu_nesting) = n - 1;
28 }

Listing B.9 (rcu_rcpl.c) shows the implementation of synchronize_rcu(), along with a helper function named flip_counter_and_wait().
Listing B.9: RCU Update Using Per-Thread Reference-Count Pair
 1 static void flip_counter_and_wait(int i)
 2 {
 3   int t;
 4
 5   atomic_set(&rcu_idx, !i);
 6   smp_mb();
 7   for_each_thread(t) {
 8     while (per_thread(rcu_refcnt, t)[i] != 0) {
 9       poll(NULL, 0, 10);
10     }
11   }
12   smp_mb();
13 }
14
15 void synchronize_rcu(void)
16 {
17   int i;
18
19   smp_mb();
20   spin_lock(&rcu_gp_lock);
21   i = atomic_read(&rcu_idx);
22   flip_counter_and_wait(i);
23   flip_counter_and_wait(!i);
24   spin_unlock(&rcu_gp_lock);
25   smp_mb();
26 }

The synchronize_rcu() function resembles that shown in Listing B.6, except that the repeated counter flip is replaced by a pair of calls on lines 22 and 23 to the new helper function.

The new flip_counter_and_wait() function updates the rcu_idx variable on line 5, executes a memory barrier on line 6, then lines 7–11 spin on each thread’s prior rcu_refcnt element, waiting for it to go to zero. Once all such elements have gone to zero, it executes another memory barrier on line 12 and returns.

This RCU implementation imposes important new requirements on its software environment, namely, (1) that it be possible to declare per-thread variables, (2) that these per-thread variables be accessible from other threads, and (3) that it is possible to enumerate all threads. These requirements can be met in almost all software environments, but often result in fixed upper bounds on the number of threads. More-complex implementations might avoid such bounds, for example, by using expandable hash tables. Such implementations might dynamically track threads, for example, by adding them on their first call to rcu_read_lock().

Quick Quiz B.13: Great, if we have 𝑁 threads, we can have 2𝑁 ten-millisecond waits (one set per flip_counter_and_wait() invocation, and even that assumes that we wait only once for each thread). Don’t we need the grace period to complete much more quickly?
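One plausible way to meet these three requirements, assuming a fixed upper bound on the number of threads, is sketched below. The macro names mirror those used in the listings, but the definitions themselves are illustrative rather than the book’s CodeSamples implementations, and they assume GCC-style __thread and __typeof__ support.

#define NR_THREADS 128  /* fixed upper bound on threads */

/* Per-thread variables live in arrays indexed by a small integer
 * thread ID, so that other threads can access them and
 * for_each_thread() can enumerate them. */
#define DEFINE_PER_THREAD(type, name) \
        struct { __typeof__(type) v; } __per_thread_##name[NR_THREADS]
#define per_thread(name, t)     __per_thread_##name[(t)].v
#define __get_thread_var(name)  per_thread(name, smp_thread_id())
#define for_each_thread(t) \
        for ((t) = 0; (t) < NR_THREADS; (t)++)

static __thread int __my_thread_id;  /* assigned at thread creation */
static inline int smp_thread_id(void)
{
  return __my_thread_id;
}

/* Example instantiations matching Listings B.7 and B.13. */
DEFINE_PER_THREAD(int [2], rcu_refcnt);
DEFINE_PER_THREAD(long, rcu_reader_gp);

A thread-registration step (not shown) would assign __my_thread_id as each thread starts; more elaborate schemes can avoid the fixed NR_THREADS bound, for example by using the expandable hash tables mentioned above.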
This implementation still has several shortcomings. First, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.

Second, synchronize_rcu() must now examine a number of variables that increases linearly with the number of threads, imposing substantial overhead on applications with large numbers of threads.

Third, as before, although concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.

Finally, as noted in the text, the need for per-thread variables and for enumerating threads may be problematic in some software environments.

That said, the read-side primitives scale very nicely, requiring about 115 nanoseconds regardless of whether running on a single-CPU or a 64-CPU POWER5 system. As noted above, the synchronize_rcu() primitive does not scale, ranging in overhead from almost a microsecond on a single POWER5 CPU up to almost 200 microseconds on a 64-CPU system. This implementation could conceivably form the basis for a production-quality user-level RCU implementation.

The next section describes an algorithm permitting more efficient concurrent RCU updates.

B.6 Scalable Counter-Based RCU With Shared Grace Periods

Listing B.11 (rcu_rcpls.h) shows the read-side primitives for an RCU implementation using per-thread reference-count pairs, as before, but permitting updates to share grace periods. The main difference from the earlier implementation shown in Listing B.8 is that rcu_idx is now a long that counts freely, so that line 8 of Listing B.11 must mask off the low-order bit. We also switched from using atomic_read() and atomic_set() to using READ_ONCE(). The data is also quite similar, as shown in Listing B.10, with rcu_idx now being a long instead of an atomic_t.

Listing B.10: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 long rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
Listing B.11: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update
 1 static void rcu_read_lock(void)
 2 {
 3   int i;
 4   int n;
 5
 6   n = __get_thread_var(rcu_nesting);
 7   if (n == 0) {
 8     i = READ_ONCE(rcu_idx) & 0x1;
 9     __get_thread_var(rcu_read_idx) = i;
10     __get_thread_var(rcu_refcnt)[i]++;
11   }
12   __get_thread_var(rcu_nesting) = n + 1;
13   smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18   int i;
19   int n;
20
21   smp_mb();
22   n = __get_thread_var(rcu_nesting);
23   if (n == 1) {
24     i = __get_thread_var(rcu_read_idx);
25     __get_thread_var(rcu_refcnt)[i]--;
26   }
27   __get_thread_var(rcu_nesting) = n - 1;
28 }

Listing B.12: RCU Shared Update Using Per-Thread Reference-Count Pair
 1 static void flip_counter_and_wait(int ctr)
 2 {
 3   int i;
 4   int t;
 5
 6   WRITE_ONCE(rcu_idx, ctr + 1);
 7   i = ctr & 0x1;
 8   smp_mb();
 9   for_each_thread(t) {
10     while (per_thread(rcu_refcnt, t)[i] != 0) {
11       poll(NULL, 0, 10);
12     }
13   }
14   smp_mb();
15 }
16
17 void synchronize_rcu(void)
18 {
19   int ctr;
20   int oldctr;
21
22   smp_mb();
23   oldctr = READ_ONCE(rcu_idx);
24   smp_mb();
25   spin_lock(&rcu_gp_lock);
26   ctr = READ_ONCE(rcu_idx);
27   if (ctr - oldctr >= 3) {
28     spin_unlock(&rcu_gp_lock);
29     smp_mb();
30     return;
31   }
32   flip_counter_and_wait(ctr);
33   if (ctr - oldctr < 2)
34     flip_counter_and_wait(ctr + 1);
35   spin_unlock(&rcu_gp_lock);
36   smp_mb();
37 }

Listing B.12 (rcu_rcpls.c) shows the implementation of synchronize_rcu() and its helper function flip_counter_and_wait(). These are similar to those in Listing B.9. The differences in flip_counter_and_wait() include:

1. Line 6 uses WRITE_ONCE() instead of atomic_set(), and increments rather than complementing.

2. A new line 7 masks the counter down to its bottom bit.

The changes to synchronize_rcu() are more pervasive:

1. There is a new oldctr local variable (declared on line 20) that captures the pre-lock-acquisition value of rcu_idx.

2. Line 23 uses READ_ONCE() instead of atomic_read().

3. Lines 27–30 check to see if at least three counter flips were performed by other threads while the lock was being acquired, and, if so, release the lock, execute a memory barrier, and return. In this case, there were two full waits for the counters to go to zero, so those other threads already did all the required work.

4. At lines 33–34, flip_counter_and_wait() is only invoked a second time if there were fewer than two counter flips while the lock was being acquired. On the other hand, if there were two counter flips, some other thread did one full wait for all the counters to go to zero, so only one more is required.

With this approach, if an arbitrarily large number of threads invoke synchronize_rcu() concurrently, with one CPU for each thread, there will be a total of only three waits for counters to go to zero.

Despite the improvements, this implementation of RCU still has a few shortcomings. First, as before, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.

Second, each updater still acquires rcu_gp_lock, even if there is no work to be done. This can result in a severe scalability limitation if there are large numbers of concurrent updates. There are ways of avoiding this, as was done in a production-quality real-time implementation of RCU for the Linux kernel [McK07a].

Third, this implementation requires per-thread variables and the ability to enumerate threads, which again can be problematic in some software environments.
Finally, on 32-bit machines, a given update thread might be preempted long enough for the rcu_idx counter to overflow. This could cause such a thread to force an unnecessary pair of counter flips. However, even if each grace period took only one microsecond, the offending thread would need to be preempted for more than an hour, in which case an extra pair of counter flips is likely the least of your worries.

As with the implementation described in Appendix B.3, the read-side primitives scale extremely well, incurring roughly 115 nanoseconds of overhead regardless of the number of CPUs.

Listing B.13: Data for Free-Running Counter Using RCU
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_gp);
4 DEFINE_PER_THREAD(long, rcu_reader_gp_snap);

Listing B.14 (lines 1–7): Free-Running Counter Using RCU
1 static inline void rcu_read_lock(void)
2 {
3   __get_thread_var(rcu_reader_gp) =
4     READ_ONCE(rcu_gp_ctr) + 1;
5   smp_mb();
6 }
7
on pre-existing RCU read-side critical sections. Line 19 executes a memory barrier to prevent prior manipulations of RCU-protected data structures from being reordered (by either the CPU or the compiler) to follow the increment on line 21. Line 20 acquires the rcu_gp_lock (and line 30 releases it) in order to prevent multiple synchronize_rcu() instances from running concurrently. Line 21 then increments the global rcu_gp_ctr variable by two, so that all pre-existing RCU read-side critical sections will have corresponding per-thread rcu_reader_gp variables with values less than that of rcu_gp_ctr, modulo the machine’s word size. Recall also that threads with even-numbered values of rcu_reader_gp are not in an RCU read-side critical section, so that lines 23–29 scan the rcu_reader_gp values until they all are either even (line 24) or are greater than the global rcu_gp_ctr (lines 25–26). Line 27 blocks for a short period of time to wait for a pre-existing RCU read-side critical section, but this can be replaced with a spin-loop if grace-period latency is of the essence. Finally, the memory barrier at line 31 ensures that any subsequent destruction will not be reordered into the preceding loop.

Quick Quiz B.16: Why are the memory barriers on lines 19 and 31 of Listing B.14 needed? Aren’t the memory barriers inherent in the locking primitives on lines 20 and 30 sufficient?

This approach achieves much better read-side performance, incurring roughly 63 nanoseconds of overhead regardless of the number of POWER5 CPUs. Updates incur more overhead, ranging from about 500 nanoseconds on a single POWER5 CPU to more than 100 microseconds on 64 such CPUs.

Quick Quiz B.17: Couldn’t the update-side batching optimization described in Appendix B.6 be applied to the implementation shown in Listing B.14?

This implementation suffers from some serious shortcomings in addition to the high update-side overhead noted earlier. First, it is no longer permissible to nest RCU read-side critical sections, a topic that is taken up in the next section. Second, if a reader is preempted at line 3 of Listing B.14 after fetching from rcu_gp_ctr but before storing to rcu_reader_gp, and if the rcu_gp_ctr counter then runs through more than half but less than all of its possible values, then synchronize_rcu() will ignore the subsequent RCU read-side critical section. Third and finally, this implementation requires that the enclosing software environment be able to enumerate threads and maintain per-thread variables.

Quick Quiz B.18: Is the possibility of readers being preempted in lines 3–4 of Listing B.14 a real problem, in other words, is there a real sequence of events that could lead to failure? If not, why not? If so, what is the sequence of events, and how can the failure be addressed?

B.8 Nestable RCU Based on Free-Running Counter

Listing B.16 (rcu_nest.h and rcu_nest.c) shows an RCU implementation based on a single global free-running counter, but that permits nesting of RCU read-side critical sections. This nestability is accomplished by reserving the low-order bits of the global rcu_gp_ctr to count nesting, using the definitions shown in Listing B.15. This is a generalization of the scheme in Appendix B.7, which can be thought of as having a single low-order bit reserved for counting nesting depth. Two C-preprocessor macros are used to arrange this, RCU_GP_CTR_NEST_MASK and RCU_GP_CTR_BOTTOM_BIT. These are related: RCU_GP_CTR_NEST_MASK = RCU_GP_CTR_BOTTOM_BIT - 1. The RCU_GP_CTR_BOTTOM_BIT macro contains a single bit that is positioned just above the bits reserved for counting nesting, and the RCU_GP_CTR_NEST_MASK has all one bits covering the region of rcu_gp_ctr used to count nesting. Obviously, these two C-preprocessor macros must reserve enough of the low-order bits of the counter to permit the maximum required nesting of RCU read-side critical sections, and this implementation reserves seven bits, for a maximum RCU read-side critical-section nesting depth of 127, which should be well in excess of that needed by most applications.

Listing B.15: Data for Nestable RCU Using a Free-Running Counter
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 #define RCU_GP_CTR_SHIFT 7
3 #define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
4 #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
5 #define MAX_GP_ADV_DISTANCE (RCU_GP_CTR_NEST_MASK << 8)
6 unsigned long rcu_gp_ctr = 0;
7 DEFINE_PER_THREAD(unsigned long, rcu_reader_gp);

The resulting rcu_read_lock() implementation is still reasonably straightforward. Line 6 places a pointer to this thread’s instance of rcu_reader_gp into the local variable rrgp, minimizing the number of expensive calls to the pthreads thread-local-state API. Line 7 records the current value of rcu_reader_gp into another local variable tmp, and line 8 checks to see if the low-order bits are zero, which would indicate that this is the outermost rcu_read_lock().
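The bit manipulation performed by these macros can be seen in isolation in the following self-contained sketch, which borrows the definitions from Listing B.15 and decomposes a hypothetical rcu_reader_gp value into its grace-period and nesting components (the printed labels are illustrative only):

#include <stdio.h>

#define RCU_GP_CTR_SHIFT 7
#define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)

int main(void)
{
  /* Hypothetical snapshot: the grace-period portion has been
   * advanced three times, and the reader is nested two deep. */
  unsigned long rcu_reader_gp = 3 * RCU_GP_CTR_BOTTOM_BIT + 2;

  printf("nesting depth:      %lu\n",
         rcu_reader_gp & RCU_GP_CTR_NEST_MASK);  /* prints 2 */
  printf("grace-period count: %lu\n",
         rcu_reader_gp >> RCU_GP_CTR_SHIFT);     /* prints 3 */
  printf("in a read-side critical section? %s\n",
         (rcu_reader_gp & RCU_GP_CTR_NEST_MASK) ? "yes" : "no");
  return 0;
}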
Listing B.16: Nestable RCU Using a Free-Running Counter
 1 static void rcu_read_lock(void)
 2 {
 3   unsigned long tmp;
 4   unsigned long *rrgp;
 5
 6   rrgp = &__get_thread_var(rcu_reader_gp);
 7   tmp = *rrgp;
 8   if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
 9     tmp = READ_ONCE(rcu_gp_ctr);
10   tmp++;
11   WRITE_ONCE(*rrgp, tmp);
12   smp_mb();
13 }
14
15 static void rcu_read_unlock(void)
16 {
17   smp_mb();
18   __get_thread_var(rcu_reader_gp)--;
19 }
20
21 void synchronize_rcu(void)
22 {
23   int t;
24
25   smp_mb();
26   spin_lock(&rcu_gp_lock);
27   WRITE_ONCE(rcu_gp_ctr, rcu_gp_ctr +
28     RCU_GP_CTR_BOTTOM_BIT);
29   smp_mb();
30   for_each_thread(t) {
31     while (rcu_gp_ongoing(t) &&
32       ((READ_ONCE(per_thread(rcu_reader_gp, t)) -
33       rcu_gp_ctr) < 0)) {
34       poll(NULL, 0, 10);
35     }
36   }
37   spin_unlock(&rcu_gp_lock);
38   smp_mb();
39 }

Interestingly enough, despite their rcu_read_lock() differences, the implementation of rcu_read_unlock() is broadly similar to that shown in Appendix B.7. Line 17 executes a memory barrier in order to prevent the RCU read-side critical section from bleeding out into code following the call to rcu_read_unlock(), and line 18 decrements this thread’s instance of rcu_reader_gp, which has the effect of decrementing the nesting count contained in rcu_reader_gp’s low-order bits. Debugging versions of this primitive would check (before decrementing!) that these low-order bits were non-zero.

The implementation of synchronize_rcu() is quite similar to that shown in Appendix B.7. There are two differences. The first is that lines 27 and 28 add RCU_GP_CTR_BOTTOM_BIT to the global rcu_gp_ctr instead of adding the constant “2”, and the second is that the comparison on line 31 has been abstracted out to a separate function, where it checks the bits indicated by RCU_GP_CTR_NEST_MASK instead of unconditionally checking the low-order bit.

This approach achieves read-side performance almost equal to that shown in Appendix B.7, incurring roughly 65 nanoseconds of overhead regardless of the number of POWER5 CPUs. Updates again incur more overhead, ranging from about 600 nanoseconds on a single POWER5 CPU to more than 100 microseconds on 64 such CPUs.
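The grace-period comparisons used by synchronize_rcu() in this appendix subtract a per-thread snapshot from the free-running rcu_gp_ctr and test the sign of the difference, so that the test keeps working even after the counter wraps around, modulo the machine’s word size. The following self-contained sketch illustrates the idiom; the helper name is hypothetical, and the explicit cast makes the signed interpretation of the difference visible:

#include <stdio.h>

/* Returns true if reader_gp was taken from an older value of
 * gp_ctr, even if the counter has since wrapped around. */
static int snapshot_is_older(unsigned long reader_gp, unsigned long gp_ctr)
{
  return (long)(reader_gp - gp_ctr) < 0;
}

int main(void)
{
  printf("%d\n", snapshot_is_older(3, 5));          /* 1: older snapshot */
  printf("%d\n", snapshot_is_older(7, 5));          /* 0: newer snapshot */

  /* Near wraparound: the counter has wrapped past zero, but the
   * old snapshot still compares as older. */
  printf("%d\n", snapshot_is_older(~0UL - 1, 2));   /* 1: older snapshot */
  return 0;
}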
B.9 RCU Based on Quiescent States

Listing B.17: Data for Quiescent-State-Based RCU
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_qs_gp);

Listing B.18: Quiescent-State-Based RCU Read Side
 1 static void rcu_read_lock(void)
 2 {
 3 }
 4
 5 static void rcu_read_unlock(void)
 6 {
 7 }
 8
 9 static void rcu_quiescent_state(void)
10 {
11   smp_mb();
12   __get_thread_var(rcu_reader_qs_gp) =
13     READ_ONCE(rcu_gp_ctr) + 1;
14   smp_mb();
15 }
16
17 static void rcu_thread_offline(void)
18 {
19   smp_mb();
20   __get_thread_var(rcu_reader_qs_gp) =
21     READ_ONCE(rcu_gp_ctr);
22   smp_mb();
23 }
24
25 static void rcu_thread_online(void)
26 {
27   rcu_quiescent_state();
28 }

Listing B.18 (rcu_qs.h) shows the read-side primitives used to construct a user-level implementation of RCU based on quiescent states, with the data shown in Listing B.17. As can be seen from lines 1–7 in the listing, the rcu_read_lock() and rcu_read_unlock() primitives do nothing, and can in fact be expected to be inlined and optimized away, as they are in server builds of the Linux kernel. This is due to the fact that quiescent-state-based RCU implementations approximate the extents of RCU read-side critical sections using the aforementioned quiescent states. Each of these quiescent states contains a call to rcu_quiescent_state(), which is shown on lines 9–15 in the listing. Threads entering extended quiescent states (for example, when blocking) may instead call rcu_thread_offline() (lines 17–23) when entering an extended quiescent state and then call rcu_thread_online() (lines 25–28) when leaving it. As such, rcu_thread_online() is analogous to rcu_read_lock() and rcu_thread_offline() is analogous to rcu_read_unlock(). In addition, rcu_quiescent_state() can be thought of as a rcu_thread_online() immediately followed by a rcu_thread_offline().3 It is illegal to invoke rcu_quiescent_state(), rcu_thread_offline(), or rcu_thread_online() from an RCU read-side critical section.

3 Although the code in the listing is consistent with rcu_quiescent_state() being the same as rcu_thread_online() immediately followed by rcu_thread_offline(), this relationship is obscured by performance optimizations.

In rcu_quiescent_state(), line 11 executes a memory barrier to prevent any code prior to the quiescent state (including possible RCU read-side critical sections) from being reordered into the quiescent state. Lines 12–13 pick up a copy of the global rcu_gp_ctr, using READ_ONCE() to ensure that the compiler does not employ any optimizations that would result in rcu_gp_ctr being fetched more than once, then add one to the value fetched and store it into the per-thread rcu_reader_qs_gp variable, so that any concurrent instance of synchronize_rcu() will see an odd-numbered value, thus becoming aware that a new RCU read-side critical section has started. Instances of synchronize_rcu() that are waiting on older RCU read-side critical sections will thus know to ignore this new one. Finally, line 14 executes a memory barrier, which prevents subsequent code (including a possible RCU read-side critical section) from being reordered with lines 12–13.

Quick Quiz B.22: Doesn’t the additional memory barrier shown on line 14 of Listing B.18 greatly increase the overhead of rcu_quiescent_state?

Some applications might use RCU only occasionally, but use it very heavily when they do use it. Such applications might choose to use rcu_thread_online() when starting to use RCU and rcu_thread_offline() when no longer using RCU. The time between a call to rcu_thread_offline() and a subsequent call to rcu_thread_online() is an extended quiescent state, so that RCU will not expect explicit quiescent states to be registered during this time.

The rcu_thread_offline() function simply sets the per-thread rcu_reader_qs_gp variable to the current value of rcu_gp_ctr, which has an even-numbered value. Any concurrent instances of synchronize_rcu() will thus know to ignore this thread.

Quick Quiz B.23: Why are the two memory barriers on lines 11 and 14 of Listing B.18 needed?
The rcu_thread_online() function simply invokes rcu_quiescent_state(), thus marking the end of the extended quiescent state.

Listing B.19 (rcu_qs.c) shows the implementation of synchronize_rcu(), which is quite similar to that of the preceding sections.

Listing B.19: RCU Update Side Using Quiescent States
 1 void synchronize_rcu(void)
 2 {
 3   int t;
 4
 5   smp_mb();
 6   spin_lock(&rcu_gp_lock);
 7   WRITE_ONCE(rcu_gp_ctr, rcu_gp_ctr + 2);
 8   smp_mb();
 9   for_each_thread(t) {
10     while (rcu_gp_ongoing(t) &&
11       ((per_thread(rcu_reader_qs_gp, t)
12       - rcu_gp_ctr) < 0)) {
13       poll(NULL, 0, 10);
14     }
15   }
16   spin_unlock(&rcu_gp_lock);
17   smp_mb();
18 }

This implementation has blazingly fast read-side primitives, with an rcu_read_lock()–rcu_read_unlock() round trip incurring an overhead of roughly 50 picoseconds. The synchronize_rcu() overhead ranges from about 600 nanoseconds on a single-CPU POWER5 system up to more than 100 microseconds on a 64-CPU system.

Quick Quiz B.24: To be sure, the clock frequencies of POWER systems in 2008 were quite high, but even a 5 GHz clock frequency is insufficient to allow loops to be executed in 50 picoseconds! What is going on here?

However, this implementation requires that each thread either invoke rcu_quiescent_state() periodically or invoke rcu_thread_offline() for extended quiescent states. The need to invoke these functions periodically can make this implementation difficult to use in some situations, such as for certain types of library functions.
Quick Quiz B.25: Why would the fact that the code is in a library make any difference for how easy it is to use the RCU implementation shown in Listings B.18 and B.19?

Quick Quiz B.26: But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section? This should be a deadlock, but how can a primitive that generates absolutely no code possibly participate in a deadlock cycle?

In addition, this implementation does not permit concurrent calls to synchronize_rcu() to share grace periods. That said, one could easily imagine a production-quality RCU implementation based on this version of RCU.

B.10 Summary of Toy RCU Implementations

If you made it this far, congratulations! You should now have a much clearer understanding not only of RCU itself, but also of the requirements of enclosing software environments and applications. Those wishing an even deeper understanding are invited to read descriptions of production-quality RCU implementations [DMS+12, McK07a, McK08b, McK09a].

The preceding sections listed some desirable properties of the various RCU primitives. The following list is provided for easy reference for those wishing to create a new RCU implementation.

1. There must be read-side primitives (such as rcu_read_lock() and rcu_read_unlock()) and grace-period primitives (such as synchronize_rcu() and call_rcu()), such that any RCU read-side critical section in existence at the start of a grace period has completed by the end of the grace period.

2. RCU read-side primitives should have minimal overhead. In particular, expensive operations such as cache misses, atomic instructions, memory barriers, and branches should be avoided.

3. RCU read-side primitives should have O(1) computational complexity to enable real-time use. (This implies that readers run concurrently with updaters.)

4. RCU read-side primitives should be usable in all contexts (in the Linux kernel, they are permitted everywhere except in the idle loop). An important special case is that RCU read-side primitives be usable within an RCU read-side critical section, in other words, that it be possible to nest RCU read-side critical sections.

5. RCU read-side primitives should be unconditional, with no failure returns. This property is extremely important, as failure checking increases complexity and complicates testing and validation.

6. Any operation other than a quiescent state (and thus a grace period) should be permitted in an RCU read-side critical section. In particular, irrevocable operations such as I/O should be permitted.
Order! Order in the court!

Unknown

Appendix C

Why Memory Barriers?

The following sections:

1. Present the structure of a cache,

2. Describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,

3. Outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.

C.1 Cache Structure

Modern CPUs are much faster than are modern memory systems. A 2006 CPU might be capable of executing ten instructions per nanosecond, but will require many tens of nanoseconds to fetch a data item from main memory. This disparity in speed—more than two orders of magnitude—has resulted in the multi-megabyte caches found on modern CPUs. These caches are associated with the CPUs as shown in Figure C.1, and can typically be accessed in a few cycles.1

1 It is standard practice to use multiple levels of cache, with a small level-one cache close to the CPU with single-cycle access time, and a larger level-two cache with a longer access time, perhaps roughly ten clock cycles. Higher-performance CPUs often have three or even four levels of cache.

Data flows among the CPUs’ caches and memory in fixed-length blocks called “cache lines”, which are normally a power of two in size, ranging from 16 to 256 bytes. When a given data item is first accessed by a given CPU, it will be absent from that CPU’s cache, meaning that a “cache miss” (or, more specifically, a “startup” or “warmup” cache miss) has occurred. The cache miss means that the CPU will have to wait (or be “stalled”) for hundreds of cycles while the item is fetched from memory. However, the item will be loaded into that CPU’s cache,
so that subsequent accesses will find it in the cache and therefore run at full speed.

After some time, the CPU’s cache will fill, and subsequent misses will likely need to eject an item from the cache in order to make room for the newly fetched item. Such a cache miss is termed a “capacity miss”, because it is caused by the cache’s limited capacity. However, most caches can be forced to eject an old item to make room for a new item even when they are not yet full. This is due to the fact that large caches are implemented as hardware hash tables with fixed-size hash buckets (or “sets”, as CPU designers call them) and no chaining, as shown in Figure C.2.

This cache has sixteen “sets” and two “ways” for a total of 32 “lines”, each entry containing a single 256-byte “cache line”, which is a 256-byte-aligned block of memory. This cache line size is a little on the large side, but makes the hexadecimal arithmetic much simpler. In hardware parlance, this is a two-way set-associative cache, and is analogous to a software hash table with sixteen buckets, where each bucket’s hash chain is limited to at most two elements. The size (32 cache lines in this case) and the associativity (two in this case) are collectively called the cache’s “geometry”. Since this cache is implemented in hardware, the hash function is extremely simple: Extract four bits from the memory address.

Figure C.2: CPU Cache Structure
      Way 0        Way 1
0x0   0x12345000
0x1   0x12345100
0x2   0x12345200
0x3   0x12345300
0x4   0x12345400
0x5   0x12345500
0x6   0x12345600
0x7   0x12345700
0x8   0x12345800
0x9   0x12345900
0xA   0x12345A00
0xB   0x12345B00
0xC   0x12345C00
0xD   0x12345D00
0xE   0x12345E00   0x43210E00
0xF

In Figure C.2, each box corresponds to a cache entry, which can contain a 256-byte cache line. However, a cache entry can be empty, as indicated by the empty boxes in the figure. The rest of the boxes are flagged with the memory address of the cache line that they contain. Since the cache lines must be 256-byte aligned, the low eight bits of each address are zero, and the choice of hardware hash function means that the next-higher four bits match the hash line number.

The situation depicted in the figure might arise if the program’s code were located at address 0x43210E00 through 0x43210EFF, and this program accessed data sequentially from 0x12345000 through 0x12345EFF. Suppose that the program were now to access location 0x12345F00. This location hashes to line 0xF, and both ways of this line are empty, so the corresponding 256-byte line can be accommodated. If the program were to access location 0x1233000, which hashes to line 0x0, the corresponding 256-byte cache line can be accommodated in way 1. However, if the program were to access location 0x1233E00, which hashes to line 0xE, one of the existing lines must be ejected from the cache to make room for the new cache line. If this ejected line were accessed later, a cache miss would result. Such a cache miss is termed an “associativity miss”.
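The set-selection arithmetic just described is simple enough to verify directly. The sketch below (function names are illustrative, not from the book’s CodeSamples) reproduces the hash used by the cache of Figure C.2:

#include <stdio.h>

#define CACHE_LINE_SIZE 256UL   /* bytes per cache line */
#define CACHE_SETS      16UL    /* hash buckets ("sets") */

/* The low eight bits select the byte within the line; the next
 * four bits select the set. */
static unsigned long cache_set(unsigned long addr)
{
  return (addr / CACHE_LINE_SIZE) % CACHE_SETS;
}

/* Everything above the set bits identifies the line ("tag"). */
static unsigned long cache_tag(unsigned long addr)
{
  return addr / (CACHE_LINE_SIZE * CACHE_SETS);
}

int main(void)
{
  printf("0x12345F00 -> set 0x%lx\n", cache_set(0x12345F00));  /* 0xf */
  printf("0x43210E00 -> set 0x%lx\n", cache_set(0x43210E00));  /* 0xe */
  printf("0x1233E00  -> set 0x%lx\n", cache_set(0x1233E00));   /* 0xe */
  printf("0x12345E00 and 0x1233E00 share a set but differ in tag: 0x%lx vs 0x%lx\n",
         cache_tag(0x12345E00), cache_tag(0x1233E00));
  return 0;
}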
Thus far, we have been considering only cases where a CPU reads a data item. What happens when it does a write? Because it is important that all CPUs agree on the value of a given data item, before a given CPU writes to that data item, it must first cause it to be removed, or “invalidated”, from other CPUs’ caches. Once this invalidation has completed, the CPU may safely modify the data item. If the data item was present in this CPU’s cache, but was read-only, this process is termed a “write miss”. Once a given CPU has completed invalidating a given data item from other CPUs’ caches, that CPU may repeatedly write (and read) that data item.

Later, if one of the other CPUs attempts to access the data item, it will incur a cache miss, this time because the first CPU invalidated the item in order to write to it. This type of cache miss is termed a “communication miss”, since it is usually due to several CPUs using the data items to communicate (for example, a lock is a data item that is used to communicate among CPUs using a mutual-exclusion algorithm).

Clearly, much care must be taken to ensure that all CPUs maintain a coherent view of the data. With all this fetching, invalidating, and writing, it is easy to imagine data being lost or (perhaps worse) different CPUs having conflicting values for the same data item in their respective caches. These problems are prevented by “cache-coherency protocols”, described in the next section.
Quick Quiz C.4: If SMP machines are really using message passing anyway, why bother with SMP at all?

C.2.3 MESI State Diagram

A given cache line’s state changes as protocol messages are sent and received, as shown in Figure C.3. The transition arcs in this figure are as follows:

Transition (a):
A cache line is written back to memory, but the CPU retains it in its cache and further retains the right to modify it.

Transition (g):
Some other CPU reads a data item in this cache line, and it is supplied either from this CPU’s cache or from memory. In either case, this CPU retains a read-only copy. This transition is initiated by the reception of a “read” message, and this CPU responds with a “read response” message containing the requested data.

Transition (h):
This CPU realizes that it will soon need to write to some data item in this cache line, and thus transmits an “invalidate” message. The CPU cannot complete
the transition until it receives a full set of “invalidate acknowledge” responses, indicating that no other CPU has this cacheline in its cache. In other words, this CPU is the only CPU caching it.

Transition (i):
Some other CPU does an atomic read-modify-write operation on a data item in a cache line held only in this CPU’s cache, so this CPU invalidates it from its cache. This transition is initiated by the reception of a “read invalidate” message, and this CPU responds with both a “read response” and an “invalidate acknowledge” message.

Transition (j):
This CPU does a store to a data item in a cache line that was not in its cache, and thus transmits a “read invalidate” message. The CPU cannot complete the transition until it receives the “read response” and a full set of “invalidate acknowledge” messages. The cache line will presumably transition to “modified” state via transition (b) as soon as the actual store completes.

Transition (k):
This CPU loads a data item in a cache line that was not in its cache. The CPU transmits a “read” message, and completes the transition upon receiving the corresponding “read response”.

Transition (l):
Some other CPU does a store to a data item in this cache line, but holds this cache line in read-only state due to its being held in other CPUs’ caches (such as the current CPU’s cache). This transition is initiated by the reception of an “invalidate” message, and this CPU responds with an “invalidate acknowledge” message.

Quick Quiz C.5: How does the hardware handle the delayed transitions described above?

C.2.4 MESI Protocol Example

Let’s now look at this from the perspective of a cache line’s worth of data, initially residing in memory at address 0, as it travels through the various single-line direct-mapped caches in a four-CPU system. Table C.1 shows this flow of data, with the first column showing the sequence of operations, the second the CPU performing the operation, the third the operation being performed, the next four the state of each CPU’s cache line (memory address followed by MESI state), and the final two columns whether the corresponding memory contents are up to date (“V”) or not (“I”).

Initially, the CPU cache lines in which the data would reside are in the “invalid” state, and the data is valid in memory. When CPU 0 loads the data at address 0, it enters the “shared” state in CPU 0’s cache, and is still valid in memory. CPU 3 also loads the data at address 0, so that it is in the “shared” state in both CPUs’ caches, and is still valid in memory. Next CPU 0 loads some other cache line (at address 8), which forces the data at address 0 out of its cache via an invalidation, replacing it with the data at address 8. CPU 2 now does a load from address 0, but this CPU realizes that it will soon need to store to it, and so it uses a “read invalidate” message in order to gain an exclusive copy, invalidating it from CPU 3’s cache (though the copy in memory remains up to date). Next CPU 2 does its anticipated store, changing the state to “modified”. The copy of the data in memory is now out of date. CPU 1 does an atomic increment, using a “read invalidate” to snoop the data from CPU 2’s cache and invalidate it, so that the copy in CPU 1’s cache is in the “modified” state (and the copy in memory remains out of date). Finally, CPU 1 reads the cache line at address 8, which uses a “writeback” message to push address 0’s data back out to memory.

Note that we end with data in some of the CPUs’ caches.

Quick Quiz C.6: What sequence of operations would put the CPUs’ caches all back into the “invalid” state?

C.3 Stores Result in Unnecessary Stalls

Although the cache structure shown in Figure C.1 provides good performance for repeated reads and writes from a given CPU to a given item of data, its performance for the first write to a given cache line is quite poor. To see this, consider Figure C.4, which shows a timeline of a write by CPU 0 to a cacheline held in CPU 1’s cache. Since CPU 0 must wait for the cache line to arrive before it can write to it, CPU 0 must stall for an extended period of time.3

3 The time required to transfer a cache line from one CPU’s cache to another’s is typically a few orders of magnitude more than that required to execute a simple register-to-register instruction.

But there is no real reason to force CPU 0 to stall for so long—after all, regardless of what data happens to be in the cache line, CPU 0 is going to unconditionally overwrite it.
[Figure C.4 (timeline): CPU 0’s “write” triggers an “invalidate” message to CPU 1, and CPU 0 stalls until CPU 1’s “acknowledgement” arrives.]

[Figure C.5: CPU 0 and CPU 1, each with its own store buffer and cache, connected by an interconnect to memory.]
C.3.2 Store Forwarding

To see the first complication, a violation of self-consistency, consider the following code with variables “a” and “b” both initially zero, and with the cache line containing variable “a” initially owned by CPU 1 and that containing “b” initially owned by CPU 0:

1 a = 1;
2 b = a + 1;
3 assert(b == 2);

One would not expect the assertion to fail. However, if one were foolish enough to use the very simple architecture shown in Figure C.5, one would be surprised. Such a system could potentially see the following sequence of events:

1. CPU 0 starts executing the a = 1.

2. CPU 0 looks “a” up in the cache, and finds that it is missing.

3. CPU 0 therefore sends a “read invalidate” message in order to get exclusive ownership of the cache line containing “a”.

4. CPU 0 records the store to “a” in its store buffer.

5. CPU 1 receives the “read invalidate” message, and responds by transmitting the cache line and removing that cacheline from its cache.

6. CPU 0 starts executing the b = a + 1.

7. CPU 0 receives the cache line from CPU 1, which still has a value of zero for “a”.

8. CPU 0 loads “a” from its cache, finding the value zero.

9. CPU 0 applies the entry from its store buffer to the newly arrived cache line, setting the value of “a” in its cache to one.

10. CPU 0 adds one to the value zero loaded for “a” above, and stores it into the cache line containing “b” (which we will assume is already owned by CPU 0).

11. CPU 0 executes assert(b == 2), which fails.

The problem is that we have two copies of “a”, one in the cache and the other in the store buffer.

This example breaks a very important guarantee, namely that each CPU will always see its own operations as if they happened in program order. Breaking this guarantee is violently counter-intuitive to software types, so much so that the hardware guys took pity and implemented “store forwarding”, where each CPU refers to (or “snoops”) its store buffer as well as its cache when performing loads, as shown in Figure C.6. In other words, a given CPU’s stores are directly forwarded to its subsequent loads, without having to pass through the cache.

[Figure C.6: Caches With Store Forwarding (as Figure C.5, but with each CPU snooping its own store buffer as well as its cache).]

With store forwarding in place, item 8 in the above sequence would have found the correct value of 1 for “a” in the store buffer, so that the final value of “b” would have been 2, as one would hope.
C.3.3 Store Buffers and Memory Barriers

To see the second complication, a violation of global memory ordering, consider the following code sequences with variables “a” and “b” initially zero:

1 void foo(void)
2 {
3   a = 1;
4   b = 1;
5 }
6
7 void bar(void)
8 {
9   while (b == 0) continue;
10  assert(a == 1);
11 }

Suppose CPU 0 executes foo() and CPU 1 executes bar(). Suppose further that the cache line containing “a” resides only in CPU 1’s cache, and that the cache line containing “b” is owned by CPU 0. Then the sequence of operations might be as follows:

1. CPU 0 executes a = 1. The cache line is not in CPU 0’s cache, so CPU 0 places the new value of “a” in its store buffer and transmits a “read invalidate” message.

2. CPU 1 executes while (b == 0) continue, but the cache line containing “b” is not in its cache. It therefore transmits a “read” message.

3. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the “modified” or the “exclusive” state), so it stores the new value of “b” in its cache line.

8. CPU 1 receives the “read invalidate” message, and transmits the cache line containing “a” to CPU 0 and invalidates this cache line from its own cache. But it is too late.

9. CPU 0 receives the cache line containing “a” and applies the buffered store just in time to fall victim to CPU 1’s failed assertion.

Quick Quiz C.9: In step 1 above, why does CPU 0 need to issue a “read invalidate” rather than a simple “invalidate”? After all, foo() will overwrite the variable a in any case, so why should it care about the old value of a?

Quick Quiz C.10: In step 9 above, did bar() read a stale value from a, or did its reads of b and a get reordered?

The hardware designers cannot help directly here, since the CPUs have no idea which variables are related, let alone how they might be related. Therefore, the hardware designers provide memory-barrier instructions to allow the software to tell the CPU about such relations. The program fragment must be updated to contain the memory barrier:

1 void foo(void)
2 {
3   a = 1;
4   smp_mb();
5   b = 1;
6 }
7
8 void bar(void)
9 {
10  while (b == 0) continue;
11  assert(a == 1);
12 }
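For readers who would like to experiment with this pattern, the fragment just shown can also be written as a self-contained program using C11 atomics, with atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb(). This is an illustration of the same idea, not the book’s CodeSamples code, and the thread wrappers are hypothetical:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

atomic_int a, b;  /* both initially zero */

void *foo(void *arg)
{
  (void)arg;
  atomic_store_explicit(&a, 1, memory_order_relaxed);
  atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() */
  atomic_store_explicit(&b, 1, memory_order_relaxed);
  return NULL;
}

void *bar(void *arg)
{
  (void)arg;
  while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
    continue;
  atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() */
  assert(atomic_load_explicit(&a, memory_order_relaxed) == 1);
  return NULL;
}

int main(void)
{
  pthread_t tf, tb;

  pthread_create(&tf, NULL, foo, NULL);
  pthread_create(&tb, NULL, bar, NULL);
  pthread_join(tf, NULL);
  pthread_join(tb, NULL);
  return 0;
}

Removing the two fences leaves the assertion free to fire on weakly ordered hardware, for exactly the store-buffer reasons described above.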
2. CPU 1 executes while (b == 0) continue, but the cache line containing “b” is not in its cache. It therefore transmits a “read” message.

3. CPU 0 executes smp_mb(), and marks all current store-buffer entries (namely, the a = 1).

4. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the “modified” or the “exclusive” state), but there is a marked entry in the store buffer. Therefore, rather than store the new value of “b” in the cache line, it instead places it in the store buffer (but in an unmarked entry).

5. CPU 0 receives the “read” message, and transmits the cache line containing the original value of “b” to CPU 1. It also marks its own copy of this cache line as “shared”.

6. CPU 1 receives the cache line containing “b” and installs it in its cache.

7. CPU 1 can now load the value of “b”, but since it finds that the value of “b” is still 0, it repeats the while statement. The new value of “b” is safely hidden in CPU 0’s store buffer.

8. CPU 1 receives the “read invalidate” message, and transmits the cache line containing “a” to CPU 0 and invalidates this cache line from its own cache.

9. CPU 0 receives the cache line containing “a” and applies the buffered store, placing this line into the “modified” state.

10. Since the store to “a” was the only entry in the store buffer that was marked by the smp_mb(), CPU 0 can also store the new value of “b”—except for the fact that the cache line containing “b” is now in “shared” state.

11. CPU 0 therefore sends an “invalidate” message to CPU 1.

12. CPU 1 receives the “invalidate” message, invalidates the cache line containing “b” from its cache, and sends an “acknowledgement” message to CPU 0.

13. CPU 1 executes while (b == 0) continue, but the cache line containing “b” is not in its cache. It therefore transmits a “read” message to CPU 0.

14. CPU 0 receives the “acknowledgement” message, and puts the cache line containing “b” into the “exclusive” state. CPU 0 now stores the new value of “b” into the cache line.

15. CPU 0 receives the “read” message, and transmits the cache line containing the new value of “b” to CPU 1. It also marks its own copy of this cache line as “shared”.

16. CPU 1 receives the cache line containing “b” and installs it in its cache.

17. CPU 1 can now load the value of “b”, and since it finds that the value of “b” is 1, it exits the while loop and proceeds to the next statement.

18. CPU 1 executes the assert(a == 1), but the cache line containing “a” is no longer in its cache. Once it gets this cache line from CPU 0, it will be working with the up-to-date value of “a”, and the assertion therefore passes.

Quick Quiz C.11: After step 15 in Appendix C.3.3, both CPUs might drop the cache line containing the new value of “b”. Wouldn’t that cause this new value to be lost?

As you can see, this process involves no small amount of bookkeeping. Even something intuitively simple, like “load the value of a” can involve lots of complex steps in silicon.

C.4 Store Sequences Result in Unnecessary Stalls

Unfortunately, each store buffer must be relatively small, which means that a CPU executing a modest sequence of stores can fill its store buffer (for example, if all of them result in cache misses). At that point, the CPU must once again wait for invalidations to complete in order to drain its store buffer before it can continue executing. This same situation can arise immediately after a memory barrier, when all subsequent store instructions must wait for invalidations to complete, regardless of whether or not these stores result in cache misses.

This situation can be improved by making invalidate acknowledge messages arrive more quickly. One way of accomplishing this is to use per-CPU queues of invalidate messages, or “invalidate queues”.
C.4.2 Invalidate Queues and Invalidate Acknowledge

Figure C.7 shows a system with invalidate queues. A CPU with an invalidate queue may acknowledge an invalidate message as soon as it is placed in the queue, instead of having to wait until the corresponding line is actually invalidated. Of course, the CPU must refer to its invalidate queue when preparing to transmit invalidation messages—if an entry for the corresponding cache line is in the invalidate queue, the CPU cannot immediately transmit the invalidate message; it must instead wait until the invalidate-queue entry has been processed.

[Figure C.7: Caches With Invalidate Queues (each CPU with a store buffer, cache, and invalidate queue, connected by an interconnect to memory).]

Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. As long as the corresponding data structures are not highly contended, the CPU will rarely be inconvenienced by such a promise.

However, the fact that invalidate messages can be buffered in the invalidate queue provides additional opportunity for memory-misordering, as discussed in the next section.

C.4.3 Invalidate Queues and Memory Barriers

Let us suppose that CPUs queue invalidation requests, but respond to them immediately. This approach minimizes the cache-invalidation latency seen by CPUs doing stores, but can defeat memory barriers, as seen in the following example.

Suppose the values of “a” and “b” are initially zero, that “a” is replicated read-only (MESI “shared” state), and that “b” is owned by CPU 0 (MESI “exclusive” or “modified” state). Then suppose that CPU 0 executes foo() while CPU 1 executes function bar() in the following code fragment:

1 void foo(void)
2 {
3   a = 1;
4   smp_mb();
5   b = 1;
6 }
7
8 void bar(void)
9 {
10  while (b == 0) continue;
11  assert(a == 1);
12 }

Then the sequence of operations might be as follows:

1. CPU 0 executes a = 1. The corresponding cache line is read-only in CPU 0’s cache, so CPU 0 places the new value of “a” in its store buffer and transmits an “invalidate” message in order to flush the corresponding cache line from CPU 1’s cache.

2. CPU 1 executes while (b == 0) continue, but the cache line containing “b” is not in its cache. It therefore transmits a “read” message.
Quick Quiz C.12: In step 1 of the first scenario in Appendix C.4.3, why is an “invalidate” sent instead of a “read invalidate” message? Doesn’t CPU 0 need the values of the other variables that share this cache line with “a”?

There is clearly not much point in accelerating invalidation responses if doing so causes memory barriers to effectively be ignored. However, the memory-barrier instructions can interact with the invalidate queue, so that when a given CPU executes a memory barrier, it marks all the entries currently in its invalidate queue, and forces any subsequent load to wait until all marked entries have been applied to the CPU’s cache. Therefore, we can add a memory barrier to function bar as follows:
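(A minimal sketch of the resulting fragment, assuming the added barrier is the same full smp_mb() used in the examples above; Appendix C.5 shows how it can be weakened to separate read and write barriers.)

void foo(void)
{
  a = 1;
  smp_mb();
  b = 1;
}

void bar(void)
{
  while (b == 0) continue;
  smp_mb();  /* added: order the read of "b" before the read of "a" */
  assert(a == 1);
}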
5. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the “modified” or the “exclusive” state), so it stores the new value of “b” in its cache line.

6. CPU 0 receives the “read” message, and transmits the cache line containing the now-updated value of “b” to CPU 1, also marking the line as “shared” in its own cache.

7. CPU 1 receives the cache line containing “b” and installs it in its cache.

8. CPU 1 can now finish executing while (b == 0) continue, and since it finds that the value of “b” is 1, it proceeds to the next statement, which is now a memory barrier.

9. CPU 1 must now stall until it processes all pre-existing messages in its invalidation queue.

10. CPU 1 now processes the queued “invalidate” message, and invalidates the cache line containing “a” from its own cache.

11. CPU 1 executes the assert(a == 1), and, since the cache line containing “a” is no longer in CPU 1’s cache, it transmits a “read” message.

12. CPU 0 responds to this “read” message with the cache line containing the new value of “a”.

13. CPU 1 receives this cache line, which contains a value of 1 for “a”, so that the assertion does not trigger.

With much passing of MESI messages, the CPUs arrive at the correct answer. This section illustrates why CPU designers must be extremely careful with their cache-coherence protocols.

C.5 Read and Write Memory Barriers

In the previous section, memory barriers were used to mark entries in both the store buffer and the invalidate queue. But in our code fragment, foo() had no reason to do anything with the invalidate queue, and bar() similarly had no reason to do anything with the store buffer.

Many CPU architectures therefore provide weaker memory-barrier instructions that do only one or the other of these two. Roughly speaking, a “read memory barrier” marks only the invalidate queue (and snoops entries in the store buffer) and a “write memory barrier” marks only the store buffer, while a full-fledged memory barrier does all of the above.

The software-visible effect of these hardware mechanisms is that a read memory barrier orders only loads on the CPU that executes it, so that all loads preceding the read memory barrier will appear to have completed before any load following the read memory barrier. Similarly, a write memory barrier orders only stores, again on the CPU that executes it, and again so that all stores preceding the write memory barrier will appear to have completed before any store following the write memory barrier. A full-fledged memory barrier orders both loads and stores, but again only on the CPU executing the memory barrier.

Quick Quiz C.15: But can’t full memory barriers impose global ordering? After all, isn’t that needed to provide the ordering shown in Listing 12.27?

If we update foo and bar to use read and write memory barriers, they appear as follows:

1 void foo(void)
2 {
3   a = 1;
4   smp_wmb();
5   b = 1;
6 }
7
8 void bar(void)
9 {
10  while (b == 0) continue;
11  smp_rmb();
12  assert(a == 1);
13 }
C.6.1 Ordering-Hostile Architecture
store buffer) and a “write memory barrier” marks only the
store buffer, while a full-fledged memory barrier does all A number of ordering-hostile computer systems have been
of the above. produced over the decades, but the nature of the hostility
The software-visible effect of these hardware mecha- has always been extremely subtle, and understanding it
nisms is that a read memory barrier orders only loads on has required detailed knowledge of the specific hardware.
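Before turning to that architecture, it may help to see the foo()/bar() pairing above expressed in portable userspace code. The following is a minimal sketch, not taken from the book's CodeSamples: it uses C11 atomics, and a release store paired with an acquire load provides at least the ordering that the smp_wmb()/smp_rmb() pair supplies in the listing above.

#include <assert.h>
#include <stdatomic.h>

int a;              /* Payload, published by foo() and read by bar(). */
atomic_int b;       /* Flag: zero initially, set to 1 once "a" is ready. */

void foo(void)
{
        a = 1;
        /* Release store: orders the store to "a" before the store to "b". */
        atomic_store_explicit(&b, 1, memory_order_release);
}

void bar(void)
{
        /* Acquire load: spin until the flag is set. */
        while (atomic_load_explicit(&b, memory_order_acquire) == 0)
                continue;
        /* The acquire/release pairing guarantees that "a" is seen as 1. */
        assert(a == 1);
}

If bar() observes b == 1, its acquire load synchronizes with foo()'s release store, so the subsequent read of "a" cannot return the pre-initialization value.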
Quick Quiz C.16: Does the guarantee that each CPU sees its own memory accesses in order also guarantee that each user-level thread will see its own memory accesses in order? Why or why not?

Imagine a large non-uniform cache architecture (NUCA) system that, in order to provide fair allocation of interconnect bandwidth to CPUs in a given node, provided per-CPU queues in each node's interconnect interface, as shown in Figure C.8. Although a given CPU's accesses are ordered as specified by memory barriers executed by that CPU, the relative order of a given pair of CPUs' accesses could be severely reordered, as we will see.5

Listing C.1 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. Each of "a", "b", and "c" is initially zero.

Suppose CPU 0 recently experienced many cache misses, so that its message queue is full, but that CPU 1 has been running exclusively within the cache, so that its message queue is empty. Then CPU 0's assignment to "a" and "b" will appear in Node 0's cache immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0's prior traffic. In contrast, CPU 1's assignment to "c" will sail through CPU 1's previously empty queue. Therefore, CPU 2 might well see CPU 1's assignment to "c" before it sees CPU 0's assignment to "a", causing the assertion to fire, despite the memory barriers.

4 Readers preferring a detailed look at real hardware architectures
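Listing C.1 itself is not reproduced in this extract. The following sketch is merely an assumption consistent with the description above, in the style of Listing C.2 below: CPU 0 stores to "a" and "b", CPU 1 waits and then stores to "c", and CPU 2 asserts. Variable names and barrier placement are guesses, and the actual listing may differ in detail.

  CPU 0          CPU 1               CPU 2
  a = 1;
  smp_wmb();     while (b == 0);
  b = 1;         c = 1;              z = c;
                                     smp_rmb();
                                     x = a;
                                     assert(z == 0 || x == 1);

With the per-CPU interconnect queues described above, CPU 2 can observe z == 1 while still seeing x == 0, firing the assertion despite the barriers.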
Listing C.2: Memory Barrier Example 2

  CPU 0        CPU 1               CPU 2
  a = 1;       while (a == 0);
               smp_mb();           y = b;
               b = 1;              smp_rmb();
                                   x = a;
                                   assert(y == 0 || x == 1);

[...] if any, are required to enable the code to work correctly, in other words, to prevent the assertion from firing?

Quick Quiz C.19: If CPU 2 executed an assert(e==0||c==1) in the example in Listing C.3, would this assert ever trigger?
C.8 Advice to Hardware Designers

There are any number of things that hardware designers can do to make the lives of software people difficult. Here is a list of a few such things that we have encountered in the past, presented here in the hope that it might help prevent future such problems:

1. I/O devices that ignore cache coherence.

   This charming misfeature can result in DMAs from memory missing recent changes to the output buffer, or, just as bad, cause input buffers to be overwritten by the contents of CPU caches just after the DMA completes. To make your system work in the face of such misbehavior, you must carefully flush the CPU caches of any location in any DMA buffer before presenting that buffer to the I/O device. Otherwise, a store from one of the CPUs might not be accounted for in the data DMAed out through the device. This is a form of data corruption, which is an extremely serious bug.

   Similarly, you need to invalidate6 the CPU caches corresponding to any location in any DMA buffer after the DMA to that buffer completes. Otherwise, a given CPU might see the old data still residing in its cache instead of the newly DMAed data that it was supposed to see. This is another form of data corruption.

   And even then, you need to be very careful to avoid pointer bugs, as even a misplaced read to an input buffer can result in corrupting the data input! One way to avoid this is to invalidate all of the caches of all of the CPUs once the DMA completes, but it is much easier and more efficient if the device DMA participates in the cache-coherence protocol, making all of this flushing and invalidating unnecessary.

2. External busses that fail to transmit cache-coherence data.

   This is an even more painful variant of the above problem, but causes groups of devices—and even memory itself—to fail to respect cache coherence. It is my painful duty to inform you that as embedded systems move to multicore architectures, we will no doubt see a fair number of such problems arise. By the year 2021, there were some efforts to address these problems with new interconnect standards, with some debate as to how effective these standards will really be [Won19].

3. Device interrupts that ignore cache coherence.

   This might sound innocent enough—after all, interrupts aren't memory references, are they? But imagine a CPU with a split cache, one bank of which is extremely busy, therefore holding onto the last cacheline of the input buffer. If the corresponding I/O-complete interrupt reaches this CPU, then that CPU's memory reference to the last cache line of the buffer could return old data, again resulting in data corruption, but in a form that will be invisible in a later crash dump. By the time the system gets around to dumping the offending input buffer, the DMA will most likely have completed.

4. Inter-processor interrupts (IPIs) that ignore cache coherence.

   This can be problematic if the IPI reaches its destination before all of the cache lines in the corresponding message buffer have been committed to memory.

5. Context switches that get ahead of cache coherence.

   If memory accesses can complete too wildly out of order, then context switches can be quite harrowing. If the task flits from one CPU to another before all the memory accesses visible to the source CPU make it to the destination CPU, then the task could easily

6 Why not flush? If there is a difference, then a CPU must have incorrectly stored to the DMA buffer in the midst of the DMA operation.
De gustibus non est disputandum.
Latin maxim
Appendix D
Style Guide
This appendix is a collection of style guides which is intended as a reference to improve consistency in perfbook. It also contains several suggestions and their experimental examples.

Appendix D.1 describes basic punctuation and spelling rules. Appendix D.2 explains rules related to unit symbols. Appendix D.3 summarizes LaTeX-specific conventions.

D.1 Paul's Conventions

Following is the list of Paul's conventions assembled from his answers to Akira's questions regarding perfbook's punctuation policy.

• (On punctuation and quotations) Despite being American myself, for this sort of book, the UK approach is better because it removes ambiguities like the following:

      Type "ls -a," look for the file ".," and file a bug if you don't see it.

  The following is much more clear:

      Type "ls -a", look for the file ".", and file a bug if you don't see it.

• American English spelling: "color" rather than "colour".

• Oxford comma: "a, b, and c" rather than "a, b and c". This is arbitrary. Cases where the Oxford comma results in ambiguity should be reworded, for example, by introducing numbering: "a, b, and c and d" should be "(1) a, (2) b, and (3) c and d".

• Italic for emphasis. Use sparingly.

• \co{} for identifiers, \url{} for URLs, \path{} for filenames.

• Dates should use an unambiguous format. Never "mm/dd/yy" or "dd/mm/yy", but rather "July 26, 2016" or "26 July 2016" or "26-Jul-2016" or "2016/07/26". I tend to use yyyy.mm.ddA for filenames, for example.

• North American rules on periods and abbreviations. For example, neither of the following can reasonably be interpreted as two sentences:

      – Say hello, to Mr. Jones.
      – If it looks like she sprained her ankle, call Dr. Smith and then tell her to keep the ankle iced and elevated.

  An ambiguous example:

      If I take the cow, the pig, the horse, etc. George will be upset.

  can be written with more words:

      If I take the cow, the pig, the horse, or much of anything else, George will be upset.

  or:

      If I take the cow, the pig, the horse, etc., George will be upset.

• I don't like ampersand ("&") in headings, but will sometimes use it if doing so prevents a line break in that heading.

• When mentioning words, I use quotations. When introducing a new word, I use \emph{}.
Following is a convention regarding punctuation in LaTeX sources.

• Place a newline after a colon (:) and the end of a sentence. This avoids the whole one-space/two-space food fight and also has the advantage of more clearly showing changes to single sentences in the middle of long paragraphs.

D.2 NIST Style Guide

D.2.1 Unit Symbol

D.2.1.1 SI Unit Symbol

The NIST style guide [Nat19, Chapter 5] states the following rules (rephrased for perfbook).

• When SI unit symbols such as "ns", "MHz", and "K" (kelvin) are used behind numerical values, narrow spaces should be placed between the values and the symbols. A narrow space can be coded in LaTeX by the sequence "\,". For example,

      "2.4 GHz", rather than "2.4GHz".

• Even when the value is used in an adjectival sense, a narrow space should be placed. For example,

      "a 10 ms interval", rather than "a 10-ms interval" nor "a 10ms interval".

The symbol of micro (µ: 10⁻⁶) can be typeset easily by the help of the "gensymb" LaTeX package. A macro "\micro" can be used in both text and math modes. To typeset the symbol of "microsecond", you can do so by "\micro s". For example,

      10 µs

Note that math mode "\mu" is italic by default and should not be used as a prefix. An improper example:

      10 𝜇s (math mode "\mu")

D.2.1.2 Non-SI Unit Symbol

Although the NIST style guide does not cover non-SI unit symbols such as "KB", "MB", and "GB", the same rule should be followed.

Example:

      "A 240 GB hard drive", rather than "a 240-GB hard drive" nor "a 240GB hard drive".

Strictly speaking, the NIST guide requires us to use the binary prefixes "Ki", "Mi", or "Gi" to represent powers of 2¹⁰. However, we accept the JEDEC conventions to use "K", "M", and "G" as binary prefixes in describing memory capacity [JED].

An acceptable example:

      "8 GB of main memory", meaning "8 GiB of main memory".

Also, it is acceptable to use just "K", "M", or "G" as abbreviations appended to a numerical value, e.g., "4K entries". In such cases, no space before an abbreviation is required. For example,

      "8K entries", rather than "8 K entries".

If you put a space in between, the symbol looks like a unit symbol and is confusing. Note that "K" and "k" represent 2¹⁰ and 10³, respectively. "M" can represent either 2²⁰ or 10⁶, and "G" can represent either 2³⁰ or 10⁹. These ambiguities should not be confusing in discussing approximate order.

D.2.1.3 Degree Symbol

The angular-degree symbol (°) does not require any space in front of it. The NIST style guide clearly states so. The symbol of degree can also be typeset easily by the help of the gensymb package. A macro "\degree" can be used in both text and math modes.

Example:

      45°, rather than 45 °.

D.2.1.4 Percent Symbol

The NIST style guide treats the percent symbol (%) the same as SI unit symbols.

      50 % possibility, rather than 50% possibility.

D.2.1.5 Font Style

Quote from NIST check list [Nata, #6]:

      Variables and quantity symbols are in italic type. Unit symbols are in roman type. Numbers should generally be written in roman type. These rules apply irrespective of the typeface used in the surrounding text.
Listing D.1: LaTeX Source of Sample Code Snippet (Current)
 1 \begin{listing}
 2 \begin{fcvlabel}[ln:base1]
 3 \begin{VerbatimL}[commandchars=\$\[\]]
 4 /*
 5  * Sample Code Snippet
 6  */
 7 #include <stdio.h>
 8 int main(void)
 9 {
10   printf("Hello world!\n"); $lnlbl[printf]
11   return 0; $lnlbl[return]
12 }
13 \end{VerbatimL}
14 \end{fcvlabel}
15 \caption{Sample Code Snippet}
16 \label{lst:app:styleguide:Sample Code Snippet}
17 \end{listing}

Listing D.2: Sample Code Snippet
1 /*
2  * Sample Code Snippet
3  */
4 #include <stdio.h>
5 int main(void)
6 {
7   printf("Hello world!\n");
8   return 0;
9 }

[...] is for inline snippets without line count. They are defined in the preamble as shown below:

\DefineVerbatimEnvironment{VerbatimL}{Verbatim}%
{fontsize=\scriptsize,numbers=left,numbersep=5pt,%
xleftmargin=9pt,obeytabs=true,tabsize=2}
\AfterEndEnvironment{VerbatimL}{\vspace*{-9pt}}
\DefineVerbatimEnvironment{VerbatimN}{Verbatim}%
{fontsize=\scriptsize,numbers=left,numbersep=3pt,%
xleftmargin=5pt,xrightmargin=5pt,obeytabs=true,%
tabsize=2,frame=single}
\DefineVerbatimEnvironment{VerbatimU}{Verbatim}%
{fontsize=\scriptsize,numbers=none,xleftmargin=5pt,%
xrightmargin=5pt,obeytabs=true,tabsize=2,%
samepage=true,frame=single}

The LaTeX source of a sample code snippet is shown in Listing D.1 and is typeset as shown in Listing D.2.

Labels to lines are specified by the "$lnlbl[]" command. The characters specified by the "commandchars" option to the VerbatimL environment are used by the fancyvrb package to substitute "\lnlbl{}" for "$lnlbl[]". Those characters should be selected so that they don't appear elsewhere in the code snippet.

Labels "printf" and "return" in Listing D.2 can be referred to as shown below:

\begin{fcvref}[ln:base1]
\Clnref{printf, return} can be referred
to from text.
\end{fcvref}

The above code results in the paragraph below:

      Lines 7 and 8 can be referred to from text.

Macros "\lnlbl{}" and "\lnref{}" are defined in the preamble as follows:

\newcommand{\lnlblbase}{}
\newcommand{\lnlbl}[1]{%
\phantomsection\label{\lnlblbase:#1}}
\newcommand{\lnrefbase}{}
\newcommand{\lnref}[1]{\ref{\lnrefbase:#1}}

Environments "fcvlabel" and "fcvref" are defined as shown below:

\newenvironment{fcvlabel}[1][]{%
\renewcommand{\lnlblbase}{#1}%
\ignorespaces}{\ignorespacesafterend}
\newenvironment{fcvref}[1][]{%
\renewcommand{\lnrefbase}{#1}%
\ignorespaces}{\ignorespacesafterend}

The main part of the LaTeX source shown on lines 2–14 of Listing D.1 can be extracted from a code sample of Listing D.3 by a perl script utilities/fcvextract.pl. All the relevant rules of extraction are described as recipes in the top level Makefile and a script to generate dependencies (utilities/gen_snippet_d.pl).

As you can see, Listing D.3 has meta commands in comments of C (C++ style). Those meta commands are interpreted by utilities/fcvextract.pl, which distinguishes the type of comment style by the suffix of the code sample's file name.

Meta commands which can be used in code samples are listed below:

• \begin{snippet}[<options>]
• \end{snippet}
• \lnlbl{<label string>}
• \fcvexclude
• \fcvblank

"<options>" to the \begin{snippet} meta command is a comma-separated list of options shown below:

• labelbase=<label base string>
• keepcomment=yes
• gobbleblank=yes
• commandchars=\X\Y\Z

The "labelbase" option is mandatory and the string given to it will be passed to the
"\begin{fcvlabel}[<label base string>]" command as shown on line 2 of Listing D.1. The "keepcomment=yes" option tells fcvextract.pl to keep comment blocks. Otherwise, comment blocks in C source code will be omitted. The "gobbleblank=yes" option will remove empty or blank lines in the resulting snippet. The "commandchars" option is given to the VerbatimL environment as is. At the moment, it is also mandatory and must come at the end of the options listed above. Other types of options, if any, are also passed to the VerbatimL environment.

The "\lnlbl" commands are converted along the way to reflect the escape-character choice.1 Source lines with "\fcvexclude" are removed. "\fcvblank" can be used to keep blank lines when the "gobbleblank=yes" option is specified.

1 Characters forming comments around the "\lnlbl" commands are also gobbled up regardless of the "keepcomment" setting.

There can be multiple pairs of \begin{snippet} and \end{snippet} as long as they have unique "labelbase" strings.

Our naming scheme of "labelbase" for a unique name space is as follows:

ln:<Chapter/Subdirectory>:<File Name>:<Function Name>

Litmus tests, which are handled by "herdtools7" commands such as "litmus7" and "herd7", were problematic in this scheme. Those commands have particular rules of where comments can be placed and restrictions on permitted characters in comments. They also forbid a couple of tokens to appear in comments. (Tokens in comments might sound strange, but they do have such restrictions.) For example, the first token in a litmus test must be one of "C", "PPC", "X86", "LISA", etc., which indicates the flavor of the test. This means no comment is allowed at the beginning of a litmus test.

Similarly, several tokens such as "exists", "filter", and "locations" indicate the end of a litmus test's body. Once one of them appears in a litmus test, comments should be of OCaml style ("(* ... *)"). Those tokens keep the same meaning even when they appear in comments!

The pair of characters "{" and "}" also has special meaning in the C-flavor tests. They are used to separate portions of a litmus test. The first pair of "{" and "}" encloses the initialization part. Comments in this part should also be in the OCaml form. You can't use "{" and "}" in comments in litmus tests, either.

Examples of disallowed comments in a litmus test are shown below:

 1 // Comment at first
 2 C C-sample
 3 // Comment with { and } characters
 4 {
 5 x=2; // C style comment in initialization
 6 }
 7
 8 P0(int *x)
 9 {
10 int r1;
11
12 r1 = READ_ONCE(*x); // Comment with "exists"
13 }
14
15 [...]
16
17 exists (0:r1=0) // C++ style comment after test body

To avoid parse errors, meta commands in litmus tests (C flavor) are embedded in the following way.

 1 C C-SB+o-o+o-o
 2 //\begin[snippet][labelbase=ln:base,commandchars=\%\@\$]
 3
 4 {
 5 1:r2=0 (*\lnlbl[initr2]*)
 6 }
 7
 8 P0(int *x0, int *x1) //\lnlbl[P0:b]
 9 {
10 int r2;
11
12 WRITE_ONCE(*x0, 2);
13 r2 = READ_ONCE(*x1);
14 } //\lnlbl[P0:e]
15
16 P1(int *x0, int *x1)
17 {
18 int r2;
19
20 WRITE_ONCE(*x1, 2);
21 r2 = READ_ONCE(*x0);
22 }
23
24 //\end[snippet]
25 exists (1:r2=0 /\ 0:r2=0) (* \lnlbl[exists_] *)

The example above is converted to the following intermediate code by a script utilities/reorder_ltms.pl.2 The intermediate code can be handled by the common script utilities/fcvextract.pl.

2 Currently, only C flavor litmus tests are supported.

 1 // Do not edit!
 2 // Generated by utillities/reorder_ltms.pl
 3 //\begin{snippet}[labelbase=ln:base,commandchars=\%\@\$]
 4 C C-SB+o-o+o-o
 5
 6 {
 7 1:r2=0 //\lnlbl{initr2}
 8 }
 9
10 P0(int *x0, int *x1) //\lnlbl{P0:b}
11 {
12 int r2;
13
14 WRITE_ONCE(*x0, 2);
15 r2 = READ_ONCE(*x1);
16 } //\lnlbl{P0:e}
17
18 P1(int *x0, int *x1)
19 {
20 int r2;
21
22 WRITE_ONCE(*x1, 2);
23 r2 = READ_ONCE(*x0);
24 }
25
26 exists (1:r2=0 /\ 0:r2=0) \lnlbl{exists_}
27 //\end{snippet}

Note that each litmus test's source file can contain at most one pair of \begin[snippet] and \end[snippet] because of the restriction on comments.

D.3.1.2 Code Snippet (Obsolete)

Sample LaTeX source of a code snippet coded using the "verbatimbox" package is shown in Listing D.4 and is typeset as shown in Listing D.5.

Listing D.4: LaTeX Source of Sample Code Snippet (Obsolete)
 1 \begin{listing}
 2 { \scriptsize
 3 \begin{verbbox}[\LstLineNo]
 4 /*
 5  * Sample Code Snippet
 6  */
 7 #include <stdio.h>
 8 int main(void)
 9 {
10   printf("Hello world!\n");
11   return 0;
12 }
13 \end{verbbox}
14 }
15 \centering
16 \theverbbox
17 \caption{Sample Code Snippet (Obsolete)}
18 \label{lst:app:styleguide:Sample Code Snippet (Obsolete)}
19 \end{listing}

Listing D.5: Sample Code Snippet
1 /*
2  * Sample Code Snippet
3  */
4 #include <stdio.h>
5 int main(void)
6 {
7   printf("Hello world!\n");
8   return 0;
9 }

The auto-numbering feature of verbbox is enabled by the "\LstLineNo" macro specified in the option to verbbox (line 3 in Listing D.4). The macro is defined in the preamble of perfbook.tex as follows:

\newcommand{\LstLineNo}
{\makebox[5ex][r]{\arabic{VerbboxLineNo}\hspace{2ex}}}

The "verbatim" environment is used for listings with too many lines to fit in a column. It is also used to avoid overwhelming LaTeX with a lot of floating objects. They are being converted to the scheme using the VerbatimN environment.

D.3.1.3 Identifier

We use the "\co{}" macro for inline identifiers. ("co" stands for "code".) By putting them into \co{}, underscore characters in their names are free of escaping in the LaTeX source. It is convenient to search for them in source files. Also, the \co{} macro has a capability to permit line breaks at particular sequences of letters. The current definition permits a line break at an underscore (_), two consecutive underscores (__), a white space, or an operator ->.

D.3.1.4 Identifier inside Table and Heading

Although the \co{} command is convenient for inlining within text, it is fragile because of its capability of line break. When it is used inside a "tabular" environment or its derivative such as "tabularx", it confuses the column width estimation of those environments. Furthermore,
• Reference to a Chapter or a Section:

      Please refer to Appendix D.2.

• Calling out CPU number or Thread name:

      High-frequency radio wave, high-frequency radio wave, high-frequency radio wave, high-frequency radio wave, high-frequency radio wave, high-frequency radio wave.

By using the shortcut "\-/" provided by the "extdash" package, hyphenation in elements of compound words is enabled in perfbook.5

Example with "\-/":

      High-frequency radio wave, high-frequency radio wave, high-frequency radio wave, high-frequency radio wave, high-frequency radio wave, high-frequency radio wave.

D.3.4.2 Non Breakable Hyphen

We want hyphenated compound terms such as "x-coordinate", "y-coordinate", etc. not to be broken at the hyphen following a single letter.

To make a hyphen unbreakable, we can use the shortcut "\=/", also provided by the "extdash" package.

Example without a shortcut:

      x-, y-, and z-coordinates; x-, y-, and z-coordinates; x-, y-, and z-coordinates; x-, y-, and z-coordinates; x-, y-, and z-coordinates; x-, y-, and z-coordinates;

Note that "\=/" enables hyphenation in elements of compound words just as "\-/" does.

D.3.4.3 Em Dash

Em dashes are used to indicate parenthetic expressions. In perfbook, em dashes are placed without spaces around them. In LaTeX source, an em dash is represented by "---".

Example (quote from Appendix C.1):

      This disparity in speed—more than two orders of magnitude—has resulted in the multi-megabyte caches found on modern CPUs.

D.3.4.4 En Dash

In the LaTeX convention, en dashes (–) are used for ranges of (mostly) numbers. Past revisions of perfbook didn't follow this rule and used plain dashes (-) for such cases. Now that \clnrefrange, \crefrange, and their variants, which generate en dashes, are used for ranges of cross-references, the remaining couple of tens of simple dashes of other types of ranges have been converted to en dashes for consistency.

Example with a simple dash:
      Lines 4–12 in Listing D.4 are the contents of the verbbox environment. The box is output by the \theverbbox macro on line 16.

D.3.4.5 Numerical Minus Sign

Numerical minus signs should be coded as math-mode minus signs, namely "$-$".6 For example,

      −30, rather than -30.

6 This rule assumes that math mode uses the same upright glyph as text mode. Our default font choice meets the assumption.

D.3.5 Punctuation

D.3.5.1 Ellipsis

In monospace fonts, ellipses can be expressed by a series of periods. For example:

      Great ... So how do I fix it?

However, in proportional fonts, the series of periods is printed with tight spaces as follows:

      Great ... So how do I fix it?

Standard LaTeX defines the \dots macro for this purpose. However, it has a kludge in the evenness of spaces. The "ellipsis" package redefines the \dots macro to fix the issue.7 By using \dots, the above example is typeset as the following:

      Great . . . So how do I fix it?

7 To be exact, it is the \textellipsis macro that is redefined. The

Note that the "xspace" option specified to the "ellipsis" package adjusts the spaces after ellipses depending on what follows them. For example:

• He said, "I . . . really don't remember . . ."
• Sequence A: (one, two, three, . . .)
• Sequence B: (4, 5, . . . , 𝑛)

As you can see, extra space is placed before the comma. The \dots macro can also be used in math mode:

• Sequence C: (1, 2, 3, 5, 8, . . . )
• Sequence D: (10, 12, . . . , 20)

The \ldots macro behaves the same as the \dots macro.

D.3.5.2 Full Stop

LaTeX treats a full stop in front of a white space as the end of a sentence and puts a slightly wider skip by default (double spacing). There is an exception to this rule, i.e. where the full stop is next to a capital letter, LaTeX assumes it represents an abbreviation and puts a normal skip.

To make LaTeX use proper skips, one needs to annotate such exceptions. For example, given the following LaTeX source:

\begin{quote}
Lock~1 is owned by CPU~A.
Lock~2 is owned by CPU~B. (Bad.)

Lock~1 is owned by CPU~A\@.
Lock~2 is owned by CPU~B\@. (Good.)
\end{quote}

the output will be as follows:

      Lock 1 is owned by CPU A. Lock 2 is owned by CPU B. (Bad.)

      Lock 1 is owned by CPU A. Lock 2 is owned by CPU B. (Good.)

On the other hand, where a full stop follows a lowercase letter, e.g. as in "Mr. Smith", a wider skip will follow in the output unless it is properly hinted. Such hinting can be done in one of several ways. Given the following source,

\begin{itemize}[nosep]
\item Mr. Smith (bad)
\item Mr.~Smith (good)
\item Mr.\ Smith (good)
\item Mr.\@ Smith (good)
\end{itemize}

the result will look as follows:

• Mr. Smith (bad)
• Mr. Smith (good)
• Mr. Smith (good)
• Mr. Smith (good)

D.3.6 Floating Object Format

D.3.6.1 Ruled Line in Table

They say that tables drawn by using ruled lines of plain LaTeX look ugly.8 Vertical lines should be avoided and
Figure D.1: Timer Wheel at 1 kHz
Figure D.2: Timer Wheel at 100 kHz
Table D.4: CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10 GHz

  Operation                  Cost (ns)    Ratio (cost/clock)
  Clock period                     0.5                   1.0
  Best-case CAS                    7.0                  14.6
  Best-case lock                  15.4                  32.3
  Blind CAS                        7.2                  15.2
  CAS                             18.0                  37.7
  Blind CAS (off-core)            47.5                  99.8
  CAS (off-core)                 101.9                 214.0
  Blind CAS (off-socket)         148.8                 312.5
  CAS (off-socket)               442.9                 930.1
  Comms Fabric                   5,000                10,500
  Global Comms             195,000,000           409,500,000

Table D.5: Synchronization and Reference Counting

                               Release
  Acquisition         Locks   Reference Counts   Hazard Pointers   RCU
  Locks               −       CA MR              M                 CA
  Reference Counts    A       A MR               M                 A
  Hazard Pointers     M       M                  M                 M
  RCU                 CA      MA CA              M                 CA

  Key: A: Atomic counting
       C: Check combined with the atomic acquisition operation
       M: Full memory barriers required
       MR: Memory barriers required only on release
       MA: Memory barriers required on acquire
                     CPU 0                                          CPU 1
    Instruction          Store Buffer   Cache     Instruction          Store Buffer   Cache
 1  (Initial state)                     x1==0     (Initial state)                     x0==0
 2  x0 = 2;              x0==2          x1==0     x1 = 2;              x1==2          x0==0
 3  r2 = x1; (0)         x0==2          x1==0     r2 = x0; (0)         x1==2          x0==0
 4  (Read-invalidate)    x0==2          x0==0     (Read-invalidate)    x1==2          x1==0
 5  (Finish store)                      x0==2     (Finish store)                      x1==2
The Answer to the Ultimate Question of Life, The
Universe, and Everything.

Appendix E
Answers to Quick Quizzes
what on earth they are talking about. q

carefully. Why do you ask?
3. The project contains heavily used APIs that were designed without regard to parallelism [AGH+ 11a, CKZ+ 13]. Some of the more ornate features of the System V message-queue API form a case in point. Of course, if your project has been around for a few decades, and its developers did not have access to parallel hardware, it undoubtedly has at least its share of such APIs.

4. The project was implemented without regard to parallelism. Given that there are a great many techniques that work extremely well in a sequential environment, but that fail miserably in parallel environments, if your project ran only on sequential hardware for most of its lifetime, then your project undoubtedly has at least its share of parallel-unfriendly code.

5. The project was implemented without regard to good software-development practice. The cruel truth is that shared-memory parallel environments are often much less forgiving of sloppy development practices than are sequential environments. You may be well-served to clean up the existing design and code prior to attempting parallelization.

6. The people who originally did the development on your project have since moved on, and the people remaining, while well able to maintain it or add small features, are unable to make "big animal" changes. In this case, unless you can work out a very simple way to parallelize your project, you will probably be best off leaving it sequential. That said, there are a number of simple approaches that you might use to parallelize your project, including running multiple instances of it, using a parallel implementation of some heavily used library function, or making use of some other parallel project, such as a database.

One can argue that many of these obstacles are non-technical in nature, but that does not make them any less real. In short, parallelization of a large body of code can be a large and complex effort. As with any large and complex effort, it makes sense to do your homework beforehand. q

E.3 Hardware and its Habits

Quick Quiz 3.1: p.17
Why should parallel programmers bother learning low-level properties of the hardware? Wouldn't it be easier, better, and more elegant to remain at a higher level of abstraction?

Answer:
It might well be easier to ignore the detailed properties of the hardware, but in most cases it would be quite foolish to do so. If you accept that the only purpose of parallelism is to increase performance, and if you further accept that performance depends on detailed properties of the hardware, then it logically follows that parallel programmers are going to need to know at least a few hardware properties.

This is the case in most engineering disciplines. Would you want to use a bridge designed by an engineer who did not understand the properties of the concrete and steel making up that bridge? If not, why would you expect a parallel programmer to be able to develop competent parallel software without at least some understanding of the underlying hardware? q

Quick Quiz 3.2: p.20
What types of machines would allow atomic operations on multiple data elements?

Answer:
One answer to this question is that it is often possible to pack multiple elements of data into a single machine word, which can then be manipulated atomically.

A more trendy answer would be machines supporting transactional memory [Lom77, Kni86, HM93]. By early 2014, several mainstream systems provided limited hardware transactional memory implementations, which is covered in more detail in Section 17.3. The jury is still out on the applicability of software transactional memory [MMW07, PW07, RHP+ 07, CBM+ 08, DFGG11, MS12], which is covered in Section 17.2. q

Quick Quiz 3.3: p.20
So have CPU designers also greatly reduced the overhead of cache misses?

Answer:
Unfortunately, not so much. There has been some reduction given constant numbers of CPUs, but the finite speed of light and the atomic nature of matter limits their ability to reduce cache-miss overhead for larger systems. Section 3.3 discusses some possible avenues for future progress. q
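As a concrete illustration of the first answer to Quick Quiz 3.2 above, the following minimal sketch packs two 32-bit values into a single 64-bit word so that both can be updated by one atomic compare-and-swap. It is not from the book's CodeSamples; the structure, field names, and update rule are illustrative assumptions.

#include <stdatomic.h>
#include <stdint.h>

struct pair {                  /* Two logical data elements ...        */
        uint32_t head;
        uint32_t count;
};

union packed_pair {            /* ... packed into one machine word.    */
        struct pair p;
        uint64_t w;
};

static _Atomic uint64_t shared_pair;

/* Atomically update both fields, or neither. */
static void pair_update(uint32_t new_head)
{
        union packed_pair old, new;

        old.w = atomic_load(&shared_pair);
        do {
                new.p = old.p;
                new.p.head = new_head;
                new.p.count++;
        } while (!atomic_compare_exchange_weak(&shared_pair, &old.w, new.w));
}

On failure, atomic_compare_exchange_weak() refreshes old.w with the current value, so the loop recomputes the new pair from up-to-date contents before retrying.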
Table E.1: Performance of Synchronization Mechanisms on 16-CPU 2.8 GHz Intel X5550 (Nehalem) System

  Operation                Cost (ns)    Ratio (cost/clock)
  Clock period                   0.4                   1.0
  Same-CPU CAS                  12.2                  33.8
  Same-CPU lock                 25.6                  71.2
  Blind CAS                     12.9                  35.8
  CAS                            7.0                  19.4
  Off-Core
    Blind CAS                   31.2                  86.6
    CAS                         31.2                  86.5
  Off-Socket
    Blind CAS                   92.4                 256.7
    CAS                         95.9                 266.4
  Off-System
    Comms Fabric               2,600                 7,220
    Global Comms         195,000,000           542,000,000

Integration of hardware threads in a single core and of multiple cores on a die has improved latencies greatly, at least within the confines of a single core or single die. There has been some improvement in overall system latency, but only by about a factor of two. Unfortunately, neither the speed of light nor the atomic nature of matter has changed much in the past few years [Har16]. Therefore, spatial and temporal locality are first-class concerns for concurrent software, even when running on relatively small systems.

Section 3.3 looks at what else hardware designers might be able to do to ease the plight of parallel programmers. q

Quick Quiz 3.8: p.24
These numbers are insanely large! How can I possibly get my head around them?

Answer:
Get a roll of toilet paper. In the USA, each roll will normally have somewhere around 350–500 sheets. Tear off one sheet to represent a single clock cycle, setting it aside. Now unroll the rest of the roll.

The resulting pile of toilet paper will likely represent a single CAS cache miss.

For the more-expensive inter-system communications latencies, use several rolls (or multiple cases) of toilet paper to represent the communications latency.

Important safety tip: Make sure to account for the needs of those you live with when appropriating toilet paper, especially in 2020 or during a similar time when store shelves are free of toilet paper and much else besides.

Furthermore, for those working on kernel code, a CPU disabling interrupts across a cache miss is analogous to you holding your breath while unrolling a roll of toilet paper. How many rolls of toilet paper can you unroll while holding your breath? You might wish to avoid disabling interrupts across that many cache misses.3 q

3 Kudos to Matthew Wilcox for this holding-breath analogy.

Quick Quiz 3.9: p.25
But individual electrons don't move anywhere near that fast, even in conductors!!! The electron drift velocity in a conductor under semiconductor voltage levels is on the order of only one millimeter per second. What gives???

Answer:
Electron drift velocity tracks the long-term movement of individual electrons. It turns out that individual electrons bounce around quite randomly, so that their instantaneous speed is very high, but over the long term, they don't move very far. In this, electrons resemble long-distance commuters, who might spend most of their time traveling at full highway speed, but over the long term go nowhere. These commuters' speed might be 70 miles per hour (113 kilometers per hour), but their long-term drift velocity relative to the planet's surface is zero.

Therefore, we should pay attention not to the electrons' drift velocity, but to their instantaneous velocities. However, even their instantaneous velocities are nowhere near a significant fraction of the speed of light. Nevertheless, the measured velocity of electric waves in conductors is a substantial fraction of the speed of light, so we still have a mystery on our hands.

The other trick is that electrons interact with each other at significant distances (from an atomic perspective, anyway), courtesy of their negative charge. This interaction is carried out by photons, which do move at the speed of light. So even with electricity's electrons, it is photons doing most of the fast footwork.

Extending the commuter analogy, a driver might use a smartphone to inform other drivers of an accident or congestion, thus allowing a change in traffic flow to propagate
much faster than the instantaneous velocity of the individual cars. Summarizing the analogy between electricity and traffic flow:

1. The (very low) drift velocity of an electron is similar to the long-term velocity of a commuter, both being very nearly zero.

2. The (still rather low) instantaneous velocity of an electron is similar to the instantaneous velocity of a car in traffic. Both are much higher than the drift velocity, but quite small compared to the rate at which changes propagate.

3. The (much higher) propagation velocity of an electric wave is primarily due to photons transmitting electromagnetic force among the electrons. Similarly, traffic patterns can change quite quickly due to communication among drivers. Not that this is necessarily of much help to the drivers already stuck in traffic, any more than it is to the electrons already pooled in a given capacitor.

Of course, to fully understand this topic, you should read up on electrodynamics. q

Table E.2: CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20 GHz

  Operation               Cost (ns)    Ratio (cost/clock)    CPUs
  Clock period                  0.5                   1.0
  Same-CPU CAS                  6.2                  13.6    0
  Same-CPU lock                13.5                  29.6    0
  In-core blind CAS             6.5                  14.3    6
  In-core CAS                  16.2                  35.6    6
  Off-core blind CAS           22.2                  48.8    1–5,7–11
  Off-core CAS                 53.6                 117.9    1–5,7–11
  Off-System
    Comms Fabric              5,000                11,000
    Global Comms        195,000,000           429,000,000

Quick Quiz 3.10: p.28
Given that distributed-systems communication is so horribly expensive, why does anyone bother with such systems?

Answer:
There are a number of reasons:

1. Shared-memory multiprocessor systems have strict size limits. If you need more than a few thousand CPUs, you have no choice but to use a distributed system.

2. Large shared-memory systems tend to be more expensive per unit computation than their smaller counterparts.

3. Large shared-memory systems tend to have much longer cache-miss latencies than do smaller systems. To see this, compare Table 3.1 on page 23 with Table E.2.

4. The distributed-systems communications operations do not necessarily use much CPU, so that computation can proceed in parallel with message transfer.

5. Many important problems are "embarrassingly parallel", so that extremely large quantities of processing may be enabled by a very small number of messages. SETI@HOME [Uni08b] was but one example of such an application. These sorts of applications can make good use of networks of computers despite extremely long communications latencies.

Thus, large shared-memory systems tend to be used for applications that benefit from faster latencies than can be provided by distributed computing, and particularly for those applications that benefit from a large shared memory.

It is likely that continued work on parallel applications will increase the number of embarrassingly parallel applications that can run well on machines and/or clusters having long communications latencies, reductions in cost
being the driving force that it is. That said, greatly reduced hardware latencies would be an extremely welcome development, both for single-system and for distributed computing. q

Quick Quiz 3.11: p.28
OK, if we are going to have to apply distributed-

Quick Quiz 4.2: p.29
But this silly shell script isn't a real parallel program! Why bother with such trivia???

Answer:
Because you should never forget the simple stuff!

Please keep in mind that the title of this book is "Is Parallel Programming Hard, And, If So, What Can You Do About It?". One of the most effective things you can do about it is to avoid forgetting the simple stuff! After all, if you choose to do parallel programming the hard way, you have no one but yourself to blame. q

Quick Quiz 4.3: p.29
Is there a simpler way to create a parallel shell script? If so, how? If not, why not?

Answer:
One straightforward approach is the shell pipeline:

grep $pattern1 | sed -e 's/a/b/' | sort

These limitations require that script-based parallelism use coarse-grained parallelism, with each unit of work having execution time of at least tens of milliseconds, and preferably much longer.

Those requiring finer-grained parallelism are well advised to think hard about their problem to see if it can be expressed in a coarse-grained form. If not, they should consider using other parallel-programming environments, such as those discussed in Section 4.2. q

Quick Quiz 4.5: p.30
Why does this wait() primitive need to be so compli-
cated? Why not just make it work like the shell-script wait does?

Answer:
Some parallel applications need to take special action when specific children exit, and therefore need to wait for each child individually. In addition, some parallel applications need to detect the reason that the child died. As we saw in Listing 4.2, it is not hard to build a waitall() function out of the wait() function, but it would be impossible to do the reverse. Once the information about a specific child is lost, it is lost. q

Quick Quiz 4.6: p.31
Isn't there a lot more to fork() and wait() than discussed here?

Answer:
Indeed there is, and it is quite possible that this section will be expanded in future versions to include messaging features (such as UNIX pipes, TCP/IP, and shared file I/O) and memory mapping (such as mmap() and shmget()). In the meantime, there are any number of textbooks that cover these primitives in great detail, and the truly motivated can read manpages, existing parallel applications using these primitives, as well as the source code of the Linux-kernel implementations themselves.

It is important to note that the parent process in Listing 4.3 waits until after the child terminates to do its printf(). Using printf()'s buffered I/O concurrently to the same file from multiple processes is non-trivial, and is best avoided. If you really need to do concurrent buffered I/O, consult the documentation for your OS. For UNIX/Linux systems, Stewart Weiss's lecture notes provide a good introduction with informative examples [Wei13]. q

Quick Quiz 4.7: p.31
If the mythread() function in Listing 4.4 can simply return, why bother with pthread_exit()?

Quick Quiz 4.8: p.32
If the C language makes no guarantees in presence of a data race, then why does the Linux kernel have so many data races? Are you trying to tell me that the Linux kernel is completely broken???

Answer:
Ah, but the Linux kernel is written in a carefully selected superset of the C language that includes special GNU extensions, such as asms, that permit safe execution even in presence of data races. In addition, the Linux kernel does not run on a number of platforms where data races would be especially problematic. For an example, consider embedded systems with 32-bit pointers and 16-bit busses. On such a system, a data race involving a store to and a load from a given pointer might well result in the load returning the low-order 16 bits of the old value of the pointer concatenated with the high-order 16 bits of the new value of the pointer.

Nevertheless, even in the Linux kernel, data races can be quite dangerous and should be avoided where feasible [Cor12]. q

Quick Quiz 4.9: p.32
What if I want several threads to hold the same lock at the same time?

Answer:
The first thing you should do is to ask yourself why you would want to do such a thing. If the answer is "because I have a lot of data that is read by many threads, and only occasionally updated", then POSIX reader-writer locks might be what you are looking for. These are introduced in Section 4.2.4.

Another way to get the effect of multiple threads holding the same lock is for one thread to acquire the lock, and then use pthread_create() to create the other threads. The question of why this would ever be a good idea is left to the reader. q
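To make the reader-writer-lock suggestion in the answer above concrete, here is a minimal sketch, not taken from the book's CodeSamples, in which two reader threads can indeed hold the same pthread_rwlock_t at the same time, while the writer excludes them both:

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
static int shared_data;

static void *reader(void *arg)
{
        int value;

        (void)arg;
        /* Many readers may hold the lock concurrently. */
        pthread_rwlock_rdlock(&rwl);
        value = shared_data;
        pthread_rwlock_unlock(&rwl);
        printf("read %d\n", value);
        return NULL;
}

static void *writer(void *arg)
{
        (void)arg;
        /* A writer excludes readers as well as other writers. */
        pthread_rwlock_wrlock(&rwl);
        shared_data++;
        pthread_rwlock_unlock(&rwl);
        return NULL;
}

int main(void)
{
        pthread_t r1, r2, w;

        pthread_create(&r1, NULL, reader, NULL);
        pthread_create(&r2, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(r1, NULL);
        pthread_join(r2, NULL);
        pthread_join(w, NULL);
        return 0;
}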
casts are quite a bit uglier and harder to get right than are simple pointer casts. q

on a multiprocessor, one would normally expect lock_reader() to acquire the lock first. Nevertheless, there are no guarantees, especially on a busy system. q
Quick Quiz 4.17: p.35
Instead of using READ_ONCE() everywhere, why not just declare goflag as volatile on line 10 of Listing 4.8?

Answer:
A volatile declaration is in fact a reasonable alternative in this particular case. However, use of READ_ONCE() has the benefit of clearly flagging to the reader that goflag is subject to concurrent reads and updates. Note that READ_ONCE() is especially useful in cases where most of the accesses are protected by a lock (and thus not subject to change), but where a few of the accesses are made outside of the lock. Using a volatile declaration in this case would make it harder for the reader to note the special accesses outside of the lock, and would also make it harder for the compiler to generate good code under the lock. q

Quick Quiz 4.18: p.35
READ_ONCE() only affects the compiler, not the CPU. Don't we also need memory barriers to make sure that the change in goflag's value propagates to the CPU in a timely fashion in Listing 4.8?

Answer:
It depends. If the per-thread variable was accessed only from its thread, and never from a signal handler, then no. Otherwise, it is quite possible that READ_ONCE() is needed. We will see examples of both situations in Section 5.4.4.

This leads to the question of how one thread can gain access to another thread's __thread variable, and the answer is that the second thread must store a pointer to its __thread variable somewhere that the first thread has access to. One common approach is to maintain a linked list with one element per thread, and to store the address of each thread's __thread variable in the corresponding element. q

Quick Quiz 4.20: p.35
Isn't comparing against single-CPU throughput a bit harsh?

Answer:
Not at all. In fact, this comparison was, if anything, overly lenient. A more balanced comparison would be against single-CPU throughput with the locking primitives commented out. q
should be faster. So why should anyone worry about reader-writer locks being slow?

Answer:
In general, newer hardware is improving. However, it will need to improve several orders of magnitude to permit reader-writer locks to achieve ideal performance on 448 CPUs. Worse yet, the greater the number of CPUs, the larger the required performance improvement. The performance problems of reader-writer locking are therefore very likely to be with us for quite some time to come. q

Quick Quiz 4.23: p.36
Is it really necessary to have both sets of primitives?

implementor. q

Quick Quiz 4.24: p.36
Given that these atomic operations will often be able to generate single atomic instructions that are directly supported by the underlying instruction set, shouldn't they be the fastest possible way to get things done?

Answer:
Unfortunately, no. See Chapter 5 for some stark counterexamples. q

Quick Quiz 4.25: p.36
What happened to ACCESS_ONCE()?

Answer:
In the 2018 v4.15 release, the Linux kernel's ACCESS_ONCE() was replaced by READ_ONCE() and WRITE_ONCE() for reads and writes, respectively [Cor12, Cor14a, Rut17]. ACCESS_ONCE() was introduced as a helper in RCU code, but was promoted to core API soon afterward [McK07b, Tor08]. The Linux kernel's READ_ONCE() and WRITE_ONCE() have evolved into complex forms that look quite different than the original ACCESS_ONCE() implementation due to the need to support access-once semantics for large structures, but with the possibility of load/store tearing if the structure cannot be loaded and stored with a single machine instruction. q

On such machines, two threads might simultaneously load the value of counter, each increment it, and each store the result. The new value of counter will then only be one greater than before, despite two threads each incrementing it. q

Quick Quiz 4.28: p.40
What is wrong with loading Listing 4.14's global_ptr up to three times?

Answer:
Suppose that global_ptr is initially non-NULL, but that
some other thread sets global_ptr to NULL. Suppose further that line 1 of the transformed code (Listing 4.15) executes just before global_ptr is set to NULL and line 2 just after. Then line 1 will conclude that global_ptr is non-NULL, line 2 will conclude that it is less than high_address, so that line 3 passes do_low() a NULL pointer, which do_low() just might not be prepared to deal with.

Your editor made exactly this mistake in the DYNIX/ptx kernel's memory allocator in the early 1990s. Tracking down the bug consumed a holiday weekend not just for your editor, but also for several of his colleagues. In short, this is not a new problem, nor is it likely to go away on its own. q

Quick Quiz 4.29: p.41
Why does it matter whether do_something() and do_something_else() in Listing 4.18 are inline functions?

Answer:
Because gp is not a static variable, if either do_something() or do_something_else() were separately compiled, the compiler would have to assume that either or both of these two functions might change the value of gp. This possibility would force the compiler to reload gp on line 15, thus avoiding the NULL-pointer dereference. q

Quick Quiz 4.30: p.43
Ouch! So can't the compiler invent a store to a normal variable pretty much any time it likes?

Answer:
Thankfully, the answer is no. This is because the compiler is forbidden from introducing data races. The case of inventing a store just before a normal store is quite special: It is not possible for some other entity, be it CPU, thread, signal handler, or interrupt handler, to be able to see the invented store unless the code already has a data race, even without the invented store. And if the code already has a data race, it already invokes the dreaded spectre of undefined behavior, which allows the compiler to generate pretty much whatever code it wants, regardless of the wishes of the developer.

But if the original store is volatile, as in WRITE_ONCE(), for all the compiler knows, there might be a side effect associated with the store that could signal some other thread, allowing data-race-free access to the variable. By inventing the store, the compiler might be introducing a data race, which it is not permitted to do.

In the case of volatile and atomic variables, the compiler is specifically forbidden from inventing writes. q

Quick Quiz 4.31: p.45
But aren't full memory barriers very heavyweight? Isn't there a cheaper way to enforce the ordering needed in Listing 4.29?

Answer:
As is often the case, the answer is "it depends". However, if only two threads are accessing the status and other_task_ready variables, then the smp_store_release() and smp_load_acquire() functions discussed in Section 4.3.5 will suffice. q

Quick Quiz 4.32: p.46
What needs to happen if an interrupt or signal handler might itself be interrupted?

Answer:
Then that interrupt handler must follow the same rules that are followed by other interrupted code. Only those handlers that cannot be themselves interrupted or that access no variables shared with an interrupting handler may safely use plain accesses, and even then only if those variables cannot be concurrently accessed by some other CPU or thread. q

Quick Quiz 4.33: p.46
How could you work around the lack of a per-thread-variable API on systems that do not provide it?

Answer:
One approach would be to create an array indexed by smp_thread_id(), and another would be to use a hash table to map from smp_thread_id() to an array index—which is in fact what this set of APIs does in pthread environments.

Another approach would be for the parent to allocate a structure containing fields for each desired per-thread variable, then pass this to the child during thread creation. However, this approach can impose large software-engineering costs in large systems. To see this, imagine if all global variables in a large system had to be declared in a single file, regardless of whether or not they were C static variables! q
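The first workaround suggested in the answer above can be sketched as follows. This is a minimal illustration rather than the book's API: it assumes a small, dense per-thread ID such as the one returned by the book's smp_thread_id(), and an arbitrary NR_THREADS bound.

#define NR_THREADS 128     /* Illustrative bound, not from the book. */

/* One slot per thread stands in for a per-thread variable. */
static long per_thread_slot[NR_THREADS];

/* Return a pointer to the calling thread's slot, given its ID. */
static inline long *my_slot(int me)
{
        return &per_thread_slot[me];
}

/* Each thread normally updates only its own slot ... */
static void inc_my_slot(int me)
{
        (*my_slot(me))++;
}

/* ... but any thread can locate another thread's slot by ID, which is
 * exactly the capability a per-thread-variable API must provide. */
static long read_slot_of(int thr)
{
        return per_thread_slot[thr];
}

A production version would pad each slot to a cache line to avoid false sharing, and would use READ_ONCE()/WRITE_ONCE() or atomics for any cross-thread accesses.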
Answer:
Hint: Yet again, the act of updating the counter must be blazingly fast and scalable in order to avoid slowing down I/O operations, but because the counter is read out only when the user wishes to remove the device, the counter read-out operation can be extremely slow. Furthermore, there is no need to be able to read out the counter at all unless the user has already indicated a desire to remove the device. In addition, the value read out need not be accurate except that it absolutely must distinguish perfectly between non-zero and zero values, and even then only when the device is in the process of being removed. However, once it has read out a zero value, it must act to keep the value at zero until it has taken some action to prevent subsequent threads from gaining access to the device being removed. See Section 5.4.6. q

Quick Quiz 5.6: p.50
One thing that could be simpler is ++ instead of that concatenation of READ_ONCE() and WRITE_ONCE(). Why all that extra typing???

Answer:
See Section 4.3.4.1 on page 40 for more information on how the compiler can cause trouble, as well as how READ_ONCE() and WRITE_ONCE() can avoid this trouble. q

Quick Quiz 5.7: p.50
But can't a smart compiler prove that line 5 of Listing 5.1 is equivalent to the ++ operator and produce an x86 add-to-memory instruction? And won't the CPU cache cause this to be atomic?

Answer:
Although the ++ operator could be atomic, there is no requirement that it be so unless it is applied to a C11 _Atomic variable. And indeed, in the absence of _Atomic, GCC often chooses to load the value to a register, increment the register, then store the value to memory, which is decidedly non-atomic.

Furthermore, note the volatile casts in READ_ONCE() and WRITE_ONCE(), which tell the compiler that the location might well be an MMIO device register. Because MMIO registers are not cached, it would be unwise for the compiler to assume that the increment operation is atomic. q

Quick Quiz 5.8: p.50
The 8-figure accuracy on the number of failures indicates that you really did test this. Why would it be necessary to test such a trivial program, especially when the bug is easily seen by inspection?

Answer:
Not only are there very few trivial parallel programs, but most days I am not so sure that there are many trivial sequential programs, either.

No matter how small or simple the program, if you haven't tested it, it does not work. And even if you have tested it, Murphy's Law says that there will be at least a few bugs still lurking.

Furthermore, while proofs of correctness certainly do have their place, they never will replace testing, including the counttorture.h test setup used here. After all, proofs are only as good as the assumptions that they are based on. Finally, proofs can be every bit as buggy as are programs! q

Quick Quiz 5.9: p.50
Why doesn't the horizontal dashed line on the x axis meet the diagonal line at 𝑥 = 1?

Answer:
Because of the overhead of the atomic operation. The dashed line on the x axis represents the overhead of a single non-atomic increment. After all, an ideal algorithm would not only scale linearly, it would also incur no performance penalty compared to single-threaded code.

This level of idealism may seem severe, but if it is good enough for Linus Torvalds, it is good enough for you. q

Quick Quiz 5.10: p.50
But atomic increment is still pretty fast. And incrementing a single variable in a tight loop sounds pretty unrealistic to me, after all, most of the program's execution should be devoted to actually doing work, not accounting for the work it has done! Why should I care about making this go faster?

Answer:
In many cases, atomic increment will in fact be fast enough for you. In those cases, you should by all means use atomic increment. That said, there are many real-world situations where more elaborate counting algorithms are required. The canonical example of such a situation is counting packets and bytes in highly optimized networking stacks,
where it is all too easy to find much of the execution time going into these sorts of accounting tasks, especially on large multiprocessors.

In addition, as noted at the beginning of this chapter, counting provides an excellent view of the issues encountered in shared-memory parallel programs. q

(Figure: CPUs 0–3, each with its own cache, connected via interconnects and a system interconnect to memory.)
v2022.09.25a
E.5. COUNTING 479
example, using C11’s _Thread_local facility, as shown in Section 5.2.3. q

Quick Quiz 5.14: p.52
What other nasty optimizations could GCC apply?

Answer:
See Sections 4.3.4.1 and 15.3 for more information. One nasty optimization would be to apply common subexpression elimination to successive calls to the read_count() function, which might come as a surprise to code expecting changes in the values returned from successive calls to that function. q

Quick Quiz 5.15: p.52
How does the per-thread counter variable in Listing 5.3 get initialized?

Answer:
The C standard specifies that the initial value of global variables is zero, unless they are explicitly initialized, thus implicitly initializing all the instances of counter to zero. Besides, in the common case where the user is interested only in differences between consecutive reads from statistical counters, the initial value is irrelevant. q

Quick Quiz 5.16: p.52
How is the code in Listing 5.3 supposed to permit more than one counter?

Answer:
Indeed, this toy example does not support more than one counter. Modifying it so that it can provide multiple counters is left as an exercise to the reader. q

Quick Quiz 5.17: p.52
The read operation takes time to sum up the per-thread values, and during that time, the counter could well be changing. This means that the value returned by read_count() in Listing 5.3 will not necessarily be exact. Assume that the counter is being incremented at rate r counts per unit time, and that read_count()’s execution consumes Δ units of time. What is the expected error in the return value?

Answer:
Let’s do worst-case analysis first, followed by a less conservative analysis.
In the worst case, the read operation completes immediately, but is then delayed for Δ time units before returning, in which case the worst-case error is simply rΔ.
This worst-case behavior is rather unlikely, so let us instead consider the case where the reads from each of the N counters are spaced equally over the time period Δ. There will be N + 1 intervals of duration Δ/(N + 1) between the N reads. The error due to the delay after the read from the last thread’s counter will be given by rΔ/(N(N + 1)), the second-to-last thread’s counter by 2rΔ/(N(N + 1)), the third-to-last by 3rΔ/(N(N + 1)), and so on. The total error is given by the sum of the errors due to the reads from each thread’s counter, which is:

\[ \frac{r\Delta}{N(N+1)} \sum_{i=1}^{N} i \tag{E.1} \]

Expressing the summation in closed form yields:

\[ \frac{r\Delta}{N(N+1)} \, \frac{N(N+1)}{2} \tag{E.2} \]

Canceling yields the intuitively expected result:

\[ \frac{r\Delta}{2} \tag{E.3} \]

It is important to remember that error continues accumulating as the caller executes code making use of the count returned by the read operation. For example, if the caller spends time t executing some computation based on the result of the returned count, the worst-case error will have increased to r(Δ + t).
The expected error will have similarly increased to:

\[ r \left( \frac{\Delta}{2} + t \right) \tag{E.4} \]

Of course, it is sometimes unacceptable for the counter to continue incrementing during the read operation. Section 5.4.6 discusses a way to handle this situation.
Thus far, we have been considering a counter that is only increased, never decreased. If the counter value is being changed by r counts per unit time, but in either direction, we should expect the error to reduce. However, the worst case is unchanged because although the counter could move in either direction, the worst case is when the read operation completes immediately, but then is delayed for Δ time units, during which time all the changes in the counter’s value move it in the same direction, again giving us an absolute error of rΔ.
There are a number of ways to compute the average error, based on a variety of assumptions about the patterns of increments and decrements. For simplicity, let’s assume that the f fraction of the operations are decrements, and that the error of interest is the deviation from the counter’s long-term trend line. Under this assumption, if f is less than or equal to 0.5, each decrement will be canceled by an increment, so that 2f of the operations will cancel each other, leaving 1 − 2f of the operations being uncanceled increments. On the other hand, if f is greater than 0.5, 1 − f of the decrements are canceled by increments, so that the counter moves in the negative direction by −1 + 2(1 − f), which simplifies to 1 − 2f, so that the counter moves an average of 1 − 2f per operation in either case. Therefore, the long-term movement of the counter is given by (1 − 2f)r. Plugging this into Eq. E.3 yields:

\[ \frac{(1 - 2f)\, r\Delta}{2} \tag{E.5} \]

All that aside, in most uses of statistical counters, the error in the value returned by read_count() is irrelevant. This irrelevance is due to the fact that the time required for read_count() to execute is normally extremely small compared to the time interval between successive calls to read_count(). q
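To put Eq. E.3 on a concrete footing, suppose (purely as an illustrative assumption) that the counter is incremented at r = 1,000,000 counts per second and that read_count() takes Δ = 10 microseconds to sum the per-thread counters. The worst-case error is then rΔ = 10 counts and the expected error is rΔ/2 = 5 counts, which is negligible for a counter whose value is in the millions.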
Quick Quiz 5.18: p.53
Doesn’t that explicit counterp array in Listing 5.4 reimpose an arbitrary limit on the number of threads? Why doesn’t the C language provide a per_thread() interface, similar to the Linux kernel’s per_cpu() primitive, to allow threads to more easily access each others’ per-thread variables?

Answer:
Why indeed?
To be fair, user-mode thread-local storage faces some challenges that the Linux kernel gets to ignore. When a user-level thread exits, its per-thread variables all disappear, which complicates the problem of per-thread-variable access, particularly before the advent of user-level RCU (see Section 9.5). In contrast, in the Linux kernel, when a CPU goes offline, that CPU’s per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its per-thread variables suddenly come into existence. In contrast, in the Linux kernel, all per-CPU variables are mapped and initialized at boot time, regardless of whether the corresponding CPU exists yet, or indeed, whether the corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time maximum bound on the number of CPUs, namely, CONFIG_NR_CPUS, along with a typically tighter boot-time bound of nr_cpu_ids. In contrast, in user space, there is not necessarily a hard-coded upper limit on the number of threads.
Of course, both environments must handle dynamically loaded code (dynamic libraries in user space, kernel modules in the Linux kernel), which increases the complexity of per-thread variables.
These complications make it significantly harder for user-space environments to provide access to other threads’ per-thread variables. Nevertheless, such access is highly useful, and it is hoped that it will someday appear.
In the meantime, textbook examples such as this one can use arrays whose limits can be easily adjusted by the user. Alternatively, such arrays can be dynamically allocated and expanded as needed at runtime. Finally, variable-length data structures such as linked lists can be used, as is done in the userspace RCU library [Des09b, DMS+12]. This last approach can also reduce false sharing in some cases. q

Quick Quiz 5.19: p.53
Doesn’t the check for NULL on line 19 of Listing 5.4 add extra branch mispredictions? Why not have a variable set permanently to zero, and point unused counter-pointers to that variable rather than setting them to NULL?

Answer:
This is a reasonable strategy. Checking for the performance difference is left as an exercise for the reader. However, please keep in mind that the fastpath is not read_count(), but rather inc_count(). q

Quick Quiz 5.20: p.53
Why on earth do we need something as heavyweight as a lock guarding the summation in the function read_count() in Listing 5.4?

Answer:
Remember, when a thread exits, its per-thread variables disappear. Therefore, if we attempt to access a given thread’s per-thread variables after that thread exits, we will get a segmentation fault. The lock coordinates summation and thread exit, preventing this scenario.
Of course, we could instead read-acquire a reader-writer lock, but Chapter 9 will introduce even lighter-weight mechanisms for implementing the required coordination.
Another approach would be to use an array instead of a per-thread variable, which, as Alexey Roytman notes, would eliminate the tests against NULL. However, array accesses are often slower than accesses to per-thread variables, and use of an array would imply a fixed upper bound on the number of threads. Also, note that neither tests nor locks are needed on the inc_count() fastpath. q

Quick Quiz 5.21: p.53
Why on earth do we need to acquire the lock in count_register_thread() in Listing 5.4? It is a single properly aligned machine-word store to a location that no other thread is modifying, so it should be atomic anyway, right?

Answer:
This lock could in fact be omitted, but better safe than sorry, especially given that this function is executed only at thread startup, and is therefore not on any critical path. Now, if we were testing on machines with thousands of CPUs, we might need to omit the lock, but on machines

Listing E.1: Per-Thread Statistical Counters With Lockless Summation
 1 unsigned long __thread counter = 0;
 2 unsigned long *counterp[NR_THREADS] = { NULL };
 3 int finalthreadcount = 0;
 4 DEFINE_SPINLOCK(final_mutex);
 5
 6 static __inline__ void inc_count(void)
 7 {
 8   WRITE_ONCE(counter, counter + 1);
 9 }
10
11 static __inline__ unsigned long read_count(void)
12   /* need to tweak counttorture! */
13 {
14   int t;
15   unsigned long sum = 0;
16
17   for_each_thread(t)
18     if (READ_ONCE(counterp[t]) != NULL)
19       sum += READ_ONCE(*counterp[t]);
20   return sum;
21 }
22
23 void count_register_thread(unsigned long *p)
24 {
25   WRITE_ONCE(counterp[smp_thread_id()], &counter);
26 }
27
28 void count_unregister_thread(int nthreadsexpected)
29 {
30   spin_lock(&final_mutex);
31   finalthreadcount++;
32   spin_unlock(&final_mutex);
33   while (READ_ONCE(finalthreadcount) < nthreadsexpected)
34     poll(NULL, 0, 1);
35 }
extremely important to note that this zeroing cannot be delayed too long or overflow of the smaller per-thread variables will result. This approach therefore imposes real-time requirements on the underlying system, and in turn must be used with extreme care.
In contrast, if all variables are the same size, overflow of any variable is harmless because the eventual sum will be modulo the word size. q

Quick Quiz 5.24: p.55
Won’t the single global thread in the function eventual() of Listing 5.5 be just as severe a bottleneck as a global lock would be?

Answer:
In this case, no. What will happen instead is that as the number of threads increases, the estimate of the counter value returned by read_count() will become more inaccurate. q

Quick Quiz 5.25: p.55
Won’t the estimate returned by read_count() in Listing 5.5 become increasingly inaccurate as the number of threads rises?

Answer:
Yes. If this proves problematic, one fix is to provide multiple eventual() threads, each covering its own subset of the other threads. In more extreme cases, a tree-like hierarchy of eventual() threads might be required. q

Quick Quiz 5.26: p.55
Given that in the eventually-consistent algorithm shown in Listing 5.5 both reads and updates have extremely low overhead and are extremely scalable, why would anyone bother with the implementation described in Section 5.2.2, given its costly read-side code?

Answer:
The thread executing eventual() consumes CPU time. As more of these eventually-consistent counters are added, the resulting eventual() threads will eventually consume all available CPUs. This implementation therefore suffers a different sort of scalability limitation, with the scalability limit being in terms of the number of eventually consistent counters rather than in terms of the number of threads or CPUs.
Of course, it is possible to make other tradeoffs. For example, a single thread could be created to handle all eventually-consistent counters, which would limit the overhead to a single CPU, but would result in increasing update-to-read latencies as the number of counters increased. Alternatively, that single thread could track the update rates of the counters, visiting the frequently-updated counters more frequently. In addition, the number of threads handling the counters could be set to some fraction of the total number of CPUs, and perhaps also adjusted at runtime. Finally, each counter could specify its latency, and deadline-scheduling techniques could be used to provide the required latencies to each counter.
There are no doubt many other tradeoffs that could be made. q

Quick Quiz 5.27: p.55
What is the accuracy of the estimate returned by read_count() in Listing 5.5?

Answer:
A straightforward way to evaluate this estimate is to use the analysis derived in Quick Quiz 5.17, but set Δ to the interval between the beginnings of successive runs of the eventual() thread. Handling the case where a given counter has multiple eventual() threads is left as an exercise for the reader. q

Quick Quiz 5.28: p.55
What fundamental difference is there between counting packets and counting the total number of bytes in the packets, given that the packets vary in size?

Answer:
When counting packets, the counter is only incremented by the value one. On the other hand, when counting bytes, the counter might be incremented by largish numbers.
Why does this matter? Because in the increment-by-one case, the value returned will be exact in the sense that the counter must necessarily have taken on that value at some point in time, even if it is impossible to say precisely when that point occurred. In contrast, when counting bytes, two different threads might return values that are inconsistent with any global ordering of operations.
To see this, suppose that thread 0 adds the value three to its counter, thread 1 adds the value five to its counter, and threads 2 and 3 sum the counters. If the system is “weakly ordered” or if the compiler uses aggressive optimizations, thread 2 might find the sum to be three and thread 3 might find the sum to be five. The only possible global orders of
the sequence of values of the counter are 0,3,8 and 0,5,8, and neither order is consistent with the results obtained.
If you missed this one, you are not alone. Michael Scott used this question to stump Paul E. McKenney during Paul’s Ph.D. defense. q

Quick Quiz 5.29: p.55
Given that the reader must sum all the threads’ counters, this counter-read operation could take a long time given large numbers of threads. Is there any way that the increment operation can remain fast and scalable while allowing readers to also enjoy not only reasonable performance and scalability, but also good accuracy?

Answer:
One approach would be to maintain a global approximation to the value, similar to the approach described in Section 5.2.4. Updaters would increment their per-thread variable, but when it reached some predefined limit, atomically add it to a global variable, then zero their per-thread variable. This would permit a tradeoff between average increment overhead and accuracy of the value read out. In particular, it would allow sharp bounds on the read-side inaccuracy.
Another approach makes use of the fact that readers often care only about certain transitions in value, not in the exact value. This approach is examined in Section 5.3.
The reader is encouraged to think up and try out other approaches, for example, using a combining tree. q
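The first of these approaches can be sketched in a few lines. The following is a minimal illustration rather than one of the book's CodeSamples; the FLUSH_LIMIT threshold and the function names are arbitrary assumptions, and GCC's __thread storage class and C11 atomics are used:

#include <stdatomic.h>

#define FLUSH_LIMIT 1000                      /* arbitrary per-thread batch size */

static atomic_ulong global_count;             /* global approximation */
static __thread unsigned long local_count;    /* per-thread contribution */

static inline void inc_count(void)
{
	if (++local_count >= FLUSH_LIMIT) {
		/* Fold the local batch into the global counter. */
		atomic_fetch_add(&global_count, local_count);
		local_count = 0;
	}
}

static inline unsigned long read_count(void)
{
	/* Cheap read; off by at most FLUSH_LIMIT - 1 counts per thread. */
	return atomic_load(&global_count);
}

With this sketch, readers do a single load, while the read-side error is sharply bounded by FLUSH_LIMIT − 1 counts per updating thread.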
Quick Quiz 5.30: p.57
Why does Listing 5.7 provide add_count() and sub_count() instead of the inc_count() and dec_count() interfaces shown in Section 5.2?

Answer:
Because structures come in different sizes. Of course, a limit counter corresponding to a specific size of structure might still be able to use inc_count() and dec_count(). q

Quick Quiz 5.31: p.57
What is with the strange form of the condition on line 3 of Listing 5.7? Why not the more intuitive form of the fastpath shown in Listing 5.8?

Answer:
Two words. “Integer overflow.”
Try the formulation in Listing 5.8 with counter equal to 10 and delta equal to ULONG_MAX. Then try it again with the code shown in Listing 5.7.
A good understanding of integer overflow will be required for the rest of this example, so if you have never dealt with integer overflow before, please try several examples to get the hang of it. Integer overflow can sometimes be more difficult to get right than parallel algorithms! q
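To see the failure mode concretely, consider this self-contained sketch (the variable and function names here are illustrative, not necessarily those of Listings 5.7 and 5.8): with counter equal to 10 and delta equal to ULONG_MAX, the intuitive test wraps around and wrongly succeeds, whereas the subtraction-based test correctly fails.

#include <limits.h>
#include <stdbool.h>

/* Intuitive but broken: counter + delta wraps modulo ULONG_MAX + 1, so
 * 10 + ULONG_MAX == 9, which is (incorrectly) within the limit. */
static bool fastpath_broken(unsigned long counter, unsigned long delta,
			    unsigned long limit)
{
	return counter + delta <= limit;   /* wraps, falsely returns true */
}

/* Overflow-safe: the subtraction cannot wrap because counter <= limit. */
static bool fastpath_safe(unsigned long counter, unsigned long delta,
			  unsigned long limit)
{
	return delta <= limit - counter;   /* correctly returns false */
}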
Quick Quiz 5.32: p.58
Why does globalize_count() zero the per-thread variables, only to later call balance_count() to refill them in Listing 5.7? Why not just leave the per-thread variables non-zero?

Answer:
That is in fact what an earlier version of this code did. But addition and subtraction are extremely cheap, and handling all of the special cases that arise is quite complex. Again, feel free to try it yourself, but beware of integer overflow! q

Quick Quiz 5.33: p.58
Given that globalreserve counted against us in add_count(), why doesn’t it count for us in sub_count() in Listing 5.7?

Answer:
The globalreserve variable tracks the sum of all threads’ countermax variables. The sum of these threads’ counter variables might be anywhere from zero to globalreserve. We must therefore take a conservative approach, assuming that all threads’ counter variables are full in add_count() and that they are all empty in sub_count().
But remember this question, as we will come back to it later. q

Quick Quiz 5.34: p.58
Suppose that one thread invokes add_count() shown in Listing 5.7, and then another thread invokes sub_count(). Won’t sub_count() return failure even though the value of the counter is non-zero?

Answer:
Indeed it will! In many cases, this will be a problem, as discussed in Section 5.3.3, and in those cases the algorithms from Section 5.4 will likely be preferable. q

Quick Quiz 5.35: p.58
Why have both add_count() and sub_count() in Listing 5.7? Why not simply pass a negative number to add_count()?

Answer:
Given that add_count() takes an unsigned long as its argument, it is going to be a bit tough to pass it a negative number. And unless you have some anti-matter memory, there is little point in allowing negative numbers when counting the number of structures in use!
All kidding aside, it would of course be possible to combine add_count() and sub_count(), however, the if conditions on the combined function would be more complex than in the current pair of functions, which would in turn mean slower execution of these fast paths. q

Quick Quiz 5.36: p.59
Why set counter to countermax / 2 in line 15 of Listing 5.9? Wouldn’t it be simpler to just take countermax counts?

Answer:
First, it really is reserving countermax counts (see line 14), however, it adjusts so that only half of these are actually in use by the thread at the moment. This allows the thread to carry out at least countermax / 2 increments or decrements before having to refer back to globalcount again.
Note that the accounting in globalcount remains accurate, thanks to the adjustment in line 18. q

Quick Quiz 5.37: p.59
In Figure 5.6, even though a quarter of the remaining count up to the limit is assigned to thread 0, only an eighth of the remaining count is consumed, as indicated by the uppermost dotted line connecting the center and the rightmost configurations. Why is that?

Answer:
The reason this happened is that thread 0’s counter was set to half of its countermax. Thus, of the quarter assigned to thread 0, half of that quarter (one eighth) came from globalcount, leaving the other half (again, one eighth) to come from the remaining count.
There are two purposes for taking this approach: (1) To allow thread 0 to use the fastpath for decrements as well as increments and (2) To reduce the inaccuracies if all threads are monotonically incrementing up towards the limit. To see this last point, step through the algorithm and watch what it does. q

Quick Quiz 5.38: p.61
Why is it necessary to atomically manipulate the thread’s counter and countermax variables as a unit? Wouldn’t it be good enough to atomically manipulate them individually?

Answer:
This might well be possible, but great care is required. Note that removing counter without first zeroing countermax could result in the corresponding thread increasing counter immediately after it was zeroed, completely negating the effect of zeroing the counter.
The opposite ordering, namely zeroing countermax and then removing counter, can also result in a non-zero counter. To see this, consider the following sequence of events:

1. Thread A fetches its countermax, and finds that it is non-zero.

2. Thread B zeroes Thread A’s countermax.

3. Thread B removes Thread A’s counter.

4. Thread A, having found that its countermax is non-zero, proceeds to add to its counter, resulting in a non-zero value for counter.

Again, it might well be possible to atomically manipulate countermax and counter as separate variables, but it is clear that great care is required. It is also quite likely that doing so will slow down the fastpath.
Exploring these possibilities is left as an exercise for the reader. q
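The approach that Listing 5.12 takes to this problem is to pack both quantities into a single machine word so that they can be read and stolen with a single atomic operation. The following is a simplified sketch of that packing idea only; the field widths, names, and helper functions are illustrative assumptions rather than the book's exact code:

#include <stdatomic.h>

#define CM_BITS 16                          /* low CM_BITS bits hold countermax */

static atomic_uint counterandmax;           /* counter and countermax, packed */

static inline void split(unsigned int cam, unsigned int *c, unsigned int *cm)
{
	*c = cam >> CM_BITS;                /* counter in the high bits */
	*cm = cam & ((1u << CM_BITS) - 1);  /* countermax in the low bits */
}

static inline unsigned int merge(unsigned int c, unsigned int cm)
{
	return (c << CM_BITS) | cm;
}

/* Zero both fields in one shot, returning the old packed value, so that
 * no window exists in which one field is zero and the other is not. */
static inline unsigned int steal_both(void)
{
	return atomic_exchange(&counterandmax, merge(0, 0));
}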
Quick Quiz 5.39: p.61
In what way does line 7 of Listing 5.12 violate the C standard?

Answer:
It assumes eight bits per byte. This assumption does hold for all current commodity microprocessors that can be easily assembled into shared-memory multiprocessors, but certainly does not hold for all computer systems that have ever run C code. (What could you do instead in order to comply with the C standard? What drawbacks would it have?) q
Quick Quiz 5.40: p.62
Given that there is only one counterandmax variable, why bother passing in a pointer to it on line 18 of Listing 5.12?

Answer:
There is only one counterandmax variable per thread. Later, we will see code that needs to pass other threads’ counterandmax variables to split_counterandmax(). q

Quick Quiz 5.41: p.62
Why does merge_counterandmax() in Listing 5.12 return an int rather than storing directly into an atomic_t?

Answer:
Later, we will see that we need the int return to pass to the atomic_cmpxchg() primitive. q

Quick Quiz 5.42: p.62
Yecch! Why the ugly goto on line 11 of Listing 5.13? Haven’t you heard of the break statement???

Answer:
Replacing the goto with a break would require keeping a flag to determine whether or not line 15 should return, which is not the sort of thing you want on a fastpath. If you really hate the goto that much, your best bet would be to pull the fastpath into a separate function that returned success or failure, with “failure” indicating a need for the slowpath. This is left as an exercise for goto-hating readers. q

Quick Quiz 5.43: p.62
Why would the atomic_cmpxchg() primitive at lines 13–14 of Listing 5.13 ever fail? After all, we picked up its old value on line 9 and have not changed it!

Answer:
Later, we will see how the flush_local_count() function in Listing 5.15 might update this thread’s counterandmax variable concurrently with the execution of the fastpath on lines 8–14 of Listing 5.13. q

Quick Quiz 5.44: p.63
What stops a thread from simply refilling its counterandmax variable immediately after flush_local_count() on line 14 of Listing 5.15 empties it?

Answer:
This other thread cannot refill its counterandmax until the caller of flush_local_count() releases the gblcnt_mutex. By that time, the caller of flush_local_count() will have finished making use of the counts, so there will be no problem with this other thread refilling, assuming that the value of globalcount is large enough to permit a refill. q

Quick Quiz 5.45: p.63
What prevents concurrent execution of the fastpath of either add_count() or sub_count() from interfering with the counterandmax variable while flush_local_count() is accessing it on line 27 of Listing 5.15?

Answer:
Nothing. Consider the following three cases:

1. If flush_local_count()’s atomic_xchg() executes before the split_counterandmax() of either fastpath, then the fastpath will see a zero counter and countermax, and will thus transfer to the slowpath (unless of course delta is zero).

2. If flush_local_count()’s atomic_xchg() executes after the split_counterandmax() of either fastpath, but before that fastpath’s atomic_cmpxchg(), then the atomic_cmpxchg() will fail, causing the fastpath to restart, which reduces to case 1 above.

3. If flush_local_count()’s atomic_xchg() executes after the atomic_cmpxchg() of either fastpath, then the fastpath will (most likely) complete successfully before flush_local_count() zeroes the thread’s counterandmax variable.

Either way, the race is resolved correctly. q

Quick Quiz 5.46: p.64
Given that the atomic_set() primitive does a simple store to the specified atomic_t, how can line 21 of balance_count() in Listing 5.16 work correctly in the face of concurrent flush_local_count() updates to this variable?
Answer:
The caller of both balance_count() and flush_local_count() holds gblcnt_mutex, so only one may be executing at a given time. q

Quick Quiz 5.47: p.64
But signal handlers can be migrated to some other CPU while running. Doesn’t this possibility require that atomic instructions and memory barriers are required to reliably communicate between a thread and a signal handler that interrupts that thread?

Answer:
No. If the signal handler is migrated to another CPU, then the interrupted thread is also migrated along with it. q

Quick Quiz 5.48: p.65
In Figure 5.7, why is the REQ theft state colored red?

Answer:
To indicate that only the fastpath is permitted to change the theft state, and that if the thread remains in this state for too long, the thread running the slowpath will resend the POSIX signal. q

Quick Quiz 5.49: p.65
In Figure 5.7, what is the point of having separate REQ and ACK theft states? Why not simplify the state machine by collapsing them into a single REQACK state? Then whichever of the signal handler or the fastpath gets there first could set the state to READY.

(c) The thread receives the signal, which also notes the REQACK state, and, because there is no fastpath in effect, sets the state to READY.

(d) The slowpath notes the READY state, steals the count, and sets the state to IDLE, and completes.

(e) The fastpath sets the state to READY, disabling further fastpath execution for this thread.

The basic problem here is that the combined REQACK state can be referenced by both the signal handler and the fastpath. The clear separation maintained by the four-state setup ensures orderly state transitions.
That said, you might well be able to make a three-state setup work correctly. If you do succeed, compare carefully to the four-state setup. Is the three-state solution really preferable, and why or why not? q

Quick Quiz 5.50: p.65
In Listing 5.18’s function flush_local_count_sig(), why are there READ_ONCE() and WRITE_ONCE() wrappers around the uses of the theft per-thread variable?

Answer:
The first one (on line 11) can be argued to be unnecessary. The last two (lines 14 and 16) are important. If these are removed, the compiler would be within its rights to rewrite lines 14–16 as follows:

14 theft = THEFT_READY;
15 if (counting) {
16   theft = THEFT_ACK;
17 }
Answer:
Because many operating systems over several decades have had the property of losing the occasional signal. Whether this is a feature or a bug is debatable, but irrelevant. The obvious symptom from the user’s viewpoint will not be a kernel bug, but rather a user application hanging.
Your user application hanging! q

Quick Quiz 5.55: p.68
Not only are POSIX signals slow, sending one to each thread simply does not scale. What would you do if you had (say) 10,000 threads and needed the read side to be fast?

Answer:
One approach is to use the techniques shown in Section 5.2.4, summarizing an approximation to the overall counter value in a single variable. Another approach would be to use multiple threads to carry out the reads, with each such thread interacting with a specific subset of the updating threads. q

Answer:
Strange, perhaps, but true! Almost enough to make you think that the name “reader-writer lock” was poorly chosen, isn’t it? q

Quick Quiz 5.59: p.68
What other issues would need to be accounted for in a real system?

Answer:
A huge number!
Here are a few to start with:

1. There could be any number of devices, so that the global variables are inappropriate, as are the lack of arguments to functions like do_io().

2. Polling loops can be problematic in real systems, wasting CPU time and energy. In many cases, an event-driven design is far better, for example, where the last completing I/O wakes up the device-removal thread.
3. The I/O might fail, and so do_io() will likely need a return value.

4. If the device fails, the last I/O might never complete. In such cases, there might need to be some sort of timeout to allow error recovery.

5. Both add_count() and sub_count() can fail, but their return values are not checked.

6. Reader-writer locks do not scale well. One way of avoiding the high read-acquisition costs of reader-writer locks is presented in Chapters 7 and 9. q

Quick Quiz 5.60: p.69
On the count_stat.c row of Table 5.1, we see that the read-side scales linearly with the number of threads. How is that possible given that the more threads there are, the more per-thread counters must be summed up?

Answer:
The read-side code must scan the entire fixed-size array, regardless of the number of threads, so there is no difference in performance. In contrast, in the last two algorithms, readers must do more work when there are more threads. In addition, the last two algorithms interpose an additional level of indirection because they map from integer thread ID to the corresponding _Thread_local variable. q

Quick Quiz 5.61: p.69
Even on the fourth row of Table 5.1, the read-side performance of these statistical counter implementations is pretty horrible. So why bother with them?

Answer:
“Use the right tool for the job.”
As can be seen from Figure 5.1, single-variable atomic increment need not apply for any job involving heavy use of parallel updates. In contrast, the algorithms shown in the top half of Table 5.1 do an excellent job of handling update-heavy situations. Of course, if you have a read-mostly situation, you should use something else, for example, an eventually consistent design featuring a single atomically incremented variable that can be read out using a single load, similar to the approach used in Section 5.2.4. q

Quick Quiz 5.62: p.69
Given the performance data shown in the bottom half of Table 5.1, we should always prefer signals over atomic operations, right?

Answer:
That depends on the workload. Note that on a 64-core system, you need more than one hundred non-atomic operations (with roughly a 40-nanosecond performance gain) to make up for even one signal (with almost a 5-microsecond performance loss). Although there is no shortage of workloads with far greater read intensity, you will need to consider your particular workload.
In addition, although memory barriers have historically been expensive compared to ordinary instructions, you should check this on the specific hardware you will be running. The properties of computer hardware do change over time, and algorithms must change accordingly. q
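To check the arithmetic behind that claim, suppose each read-side signal costs roughly 5 µs and each avoided atomic operation saves roughly 40 ns, as in the measurements summarized above. The break-even point is then about 5,000 ns / 40 ns ≈ 125 non-atomic operations per signal, which is indeed "more than one hundred".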
Quick Quiz 5.63: p.69
Can advanced techniques be applied to address the lock contention for readers seen in the bottom half of Table 5.1?

Answer:
One approach is to give up some update-side performance, as is done with scalable non-zero indicators (SNZI) [ELLM07]. There are a number of other ways one might go about this, and these are left as exercises for the reader. Any number of approaches that apply hierarchy, which replace frequent global-lock acquisitions with local lock acquisitions corresponding to lower levels of the hierarchy, should work quite well. q

Quick Quiz 5.64: p.70
The ++ operator works just fine for 1,000-digit numbers! Haven’t you heard of operator overloading???

Answer:
In the C++ language, you might well be able to use ++ on a 1,000-digit number, assuming that you had access to a class implementing such numbers. But as of 2021, the C language does not permit operator overloading. q

Quick Quiz 5.65: p.70
But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space?
Answer:
Indeed, multiple processes with separate address spaces can be an excellent way to exploit parallelism, as the proponents of the fork-join methodology and the Erlang language would be very quick to tell you. However, there are also some advantages to shared-memory parallelism:

E.6 Partitioning and Synchronization Design
Quick Quiz 6.10: p.79
But in the case where data is flowing in only one direction, the algorithm shown in Listing 6.3 will have both ends attempting to acquire the same lock whenever the consuming end empties its underlying double-ended queue. Doesn’t that mean that sometimes this algorithm fails to provide concurrent access to both ends of the queue even when the queue contains an arbitrarily large number of elements?

Answer:
Indeed it does!
But the same is true of other algorithms claiming this property. For example, in solutions using software transactional memory mechanisms based on hashed arrays of locks, the leftmost and rightmost elements’ addresses will sometimes happen to hash to the same lock. These hash collisions will also prevent concurrent access. For another example, solutions using hardware transactional memory mechanisms with software fallbacks [YHLR13, Mer11, JSG12] often use locking within those software fallbacks, and thus suffer (albeit hopefully rarely) from whatever concurrency limitations that these locking solutions suffer from.
Therefore, as of 2021, all practical solutions to the concurrent double-ended queue problem fail to provide full concurrency in at least some circumstances, including the compound double-ended queue. q

Quick Quiz 6.11: p.80
Why are there not one but two solutions to the double-ended queue problem?

Answer:
There are actually at least three. The third, by Dominik Dingel, makes interesting use of reader-writer locking, and may be found in lockrwdeq.c. q

Quick Quiz 6.12: p.80
The tandem double-ended queue runs about twice as fast as the hashed double-ended queue, even when I increase the size of the hash table to an insanely large number. Why is that?

Answer:
The hashed double-ended queue’s locking design only permits one thread at a time at each end, and further requires two lock acquisitions for each operation. The tandem double-ended queue also permits one thread at a time at each end, and in the common case requires only one lock acquisition per operation. Therefore, the tandem double-ended queue should be expected to outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple concurrent operations at each end? If so, how? If not, why not? q

Quick Quiz 6.13: p.80
Is there a significantly better way of handling concurrency for double-ended queues?

Answer:
One approach is to transform the problem to be solved so that multiple double-ended queues can be used in parallel, allowing the simpler single-lock double-ended queue to be used, and perhaps also replace each double-ended queue with a pair of conventional single-ended queues. Without such “horizontal scaling”, the speedup is limited to 2.0. In contrast, horizontal-scaling designs can achieve very large speedups, and are especially attractive if there are multiple threads working either end of the queue, because in this multiple-thread case the dequeue simply cannot provide strong ordering guarantees. After all, the fact that a given thread removed an item first in no way implies that it will process that item first [HKLP12]. And if there are no guarantees, we may as well obtain the performance benefits that come with refusing to provide these guarantees.
Regardless of whether or not the problem can be transformed to use multiple queues, it is worth asking whether work can be batched so that each enqueue and dequeue operation corresponds to larger units of work. This batching approach decreases contention on the queue data structures, which increases both performance and scalability, as will be seen in Section 6.3. After all, if you must incur high synchronization overheads, be sure you are getting your money’s worth.
Other researchers are working on other ways to take advantage of limited ordering guarantees in queues [KLP12]. q

Quick Quiz 6.14: p.82
Don’t all these problems with critical sections mean that we should just always use non-blocking synchronization [Her90], which don’t have critical sections?

E.7 Locking
2. The code does not check for overflow. On the other hand, this bug is nullified by the previous bug: 32 bits worth of seconds is more than 50 years. q

Quick Quiz 7.17: p.109
Wouldn’t it be better just to use a good parallel design so that lock contention was low enough to avoid unfairness?

Answer:
It would be better in some sense, but there are situations where it can be appropriate to use designs that sometimes result in high lock contentions.
For example, imagine a system that is subject to a rare error condition. It might well be best to have a simple error-handling design that has poor performance and scalability for the duration of the rare error condition, as opposed to a complex and difficult-to-debug design that is helpful only when one of those rare error conditions is in effect.
That said, it is usually worth putting some effort into attempting to produce a design that is both simple as well as efficient during error conditions, for example by partitioning the problem. q

Quick Quiz 7.18: p.109
How might the lock holder be interfered with?

Answer:
If the data protected by the lock is in the same cache line as the lock itself, then attempts by other CPUs to acquire the lock will result in expensive cache misses on the part of the CPU holding the lock. This is a special case of false sharing, which can also occur if a pair of variables protected by different locks happen to share a cache line. In contrast, if the lock is in a different cache line than the data that it protects, the CPU holding the lock will usually suffer a cache miss only on first access to a given variable.
Of course, the downside of placing the lock and data into separate cache lines is that the code will incur two cache misses rather than only one in the uncontended case. As always, choose wisely! q

Quick Quiz 7.19: p.110
Does it ever make sense to have an exclusive lock acquisition immediately followed by a release of that same lock, that is, an empty critical section?

Answer:
Empty lock-based critical sections are rarely used, but they do have their uses. The point is that the semantics of exclusive locks have two components: (1) The familiar data-protection semantic and (2) A messaging semantic, where releasing a given lock notifies a waiting acquisition of that same lock. An empty critical section uses the messaging component without the data-protection component.
The rest of this answer provides some example uses of empty critical sections, however, these examples should be considered “gray magic.”8 As such, empty critical sections are almost never used in practice. Nevertheless, pressing on into this gray area . . .
One historical use of empty critical sections appeared in the networking stack of the 2.4 Linux kernel through use of a read-side-scalable reader-writer lock called brlock for “big reader lock”. This use case is a way of approximating the semantics of read-copy update (RCU), which is discussed in Section 9.5. And in fact this Linux-kernel use case has been replaced with RCU.
The empty-lock-critical-section idiom can also be used to reduce lock contention in some situations. For example, consider a multithreaded user-space application where each thread processes units of work maintained in a per-thread list, where threads are prohibited from touching each others’ lists [McK12e]. There could also be updates that require that all previously scheduled units of work have completed before the update can progress. One way to handle this is to schedule a unit of work on each thread, so that when all of these units of work complete, the update may proceed.
In some applications, threads can come and go. For example, each thread might correspond to one user of the application, and thus be removed when that user logs out or otherwise disconnects. In many applications, threads cannot depart atomically: They must instead explicitly unravel themselves from various portions of the application using a specific sequence of actions. One specific action will be refusing to accept further requests from other threads, and another specific action will be disposing of any remaining units of work on its list, for example, by placing these units of work in a global work-item-disposal list to be taken by one of the remaining threads. (Why not just drain the thread’s work-item list by executing each item? Because a given work item might generate more work items, so that the list could not be drained in a timely fashion.)

8 Thanks to Alexey Roytman for this description.
If the application is to perform and scale well, a good locking design is required. One common solution is to have a global lock (call it G) protecting the entire process of departing (and perhaps other things as well), with finer-grained locks protecting the individual unraveling operations.
Now, a departing thread must clearly refuse to accept further requests before disposing of the work on its list, because otherwise additional work might arrive after the disposal action, which would render that disposal action ineffective. So simplified pseudocode for a departing thread might be as follows:

1. Acquire lock G.

2. Acquire the lock guarding communications.

3. Refuse further communications from other threads.

4. Release the lock guarding communications.

5. Acquire the lock guarding the global work-item-disposal list.

6. Move all pending work items to the global work-item-disposal list.

7. Release the lock guarding the global work-item-disposal list.

8. Release lock G.

Of course, a thread that needs to wait for all pre-existing work items will need to take departing threads into account. To see this, suppose that this thread starts waiting for all pre-existing work items just after a departing thread has refused further communications from other threads. How can this thread wait for the departing thread’s work items to complete, keeping in mind that threads are not allowed to access each others’ lists of work items?
One straightforward approach is for this thread to acquire G and then the lock guarding the global work-item-disposal list, then move the work items to its own list. The thread then releases both locks, places a work item on the end of its own list, and then waits for all of the work items that it placed on each thread’s list (including its own) to complete.
This approach does work well in many cases, but if special processing is required for each work item as it is pulled in from the global work-item-disposal list, the result could be excessive contention on G. One way to avoid that contention is to acquire G and then immediately release it. Then the process of waiting for all prior work items looks something like the following:

1. Set a global counter to one and initialize a condition variable to zero.

2. Send a message to all threads to cause them to atomically increment the global counter, and then to enqueue a work item. The work item will atomically decrement the global counter, and if the result is zero, it will set a condition variable to one.

3. Acquire G, which will wait on any currently departing thread to finish departing. Because only one thread may depart at a time, all the remaining threads will have already received the message sent in the preceding step.

4. Release G.

5. Acquire the lock guarding the global work-item-disposal list.

6. Move all work items from the global work-item-disposal list to this thread’s list, processing them as needed along the way.

7. Release the lock guarding the global work-item-disposal list.

8. Enqueue an additional work item onto this thread’s list. (As before, this work item will atomically decrement the global counter, and if the result is zero, it will set a condition variable to one.)

9. Wait for the condition variable to take on the value one.

Once this procedure completes, all pre-existing work items are guaranteed to have completed. The empty critical sections are using locking for messaging as well as for protection of data. q
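A minimal sketch of the "acquire it, then immediately release it" step using POSIX threads might look as follows. This is an illustration of the idiom only, not code from the book; global_lock stands in for the lock called G above:

#include <pthread.h>

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;  /* "G" */

/* Empty critical section: no data is touched while the lock is held.
 * The only effect is to wait until any thread currently departing (and
 * therefore holding G) has finished, that is, the lock is being used
 * purely for its messaging semantic. */
static void wait_for_departing_threads(void)
{
	pthread_mutex_lock(&global_lock);
	pthread_mutex_unlock(&global_lock);
}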
Quick Quiz 7.20: p.112
Is there any other way for the VAX/VMS DLM to emulate a reader-writer lock?

Answer:
There are in fact several. One way would be to use the null, protected-read, and exclusive modes. Another way would be to use the null, protected-read, and concurrent-write modes. A third way would be to use the null, concurrent-read, and exclusive modes. q
Answer:
This can be a legitimate implementation, but only if this store is preceded by a memory barrier and makes use of WRITE_ONCE(). The memory barrier is not required when the xchg() operation is used because this operation implies a full memory barrier due to the fact that it returns a value. q

Answer:
Conditionally acquiring a single global lock does work very well, but only for relatively small numbers of CPUs. To see why it is problematic in systems with many hundreds of CPUs, look at Figure 5.1. q

Answer:
How indeed? This just shows that in concurrency, just as in life, one should take care to learn exactly what winning entails before playing the game. q

Answer:
In the C language, the following macro correctly handles this:

#define ULONG_CMP_LT(a, b) \
        (ULONG_MAX / 2 < (a) - (b))
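The macro treats unsigned values as points on a circle, so the comparison remains correct even after a counter wraps. Here is a small illustration of its use (the timestamp values are made up for the example):

#include <limits.h>
#include <stdio.h>

#define ULONG_CMP_LT(a, b) \
        (ULONG_MAX / 2 < (a) - (b))

int main(void)
{
	unsigned long before_wrap = ULONG_MAX - 5;  /* timestamp just before wrap */
	unsigned long after_wrap = 10;              /* timestamp just after wrap */

	/* Plain "<" gets this backwards, but the modular comparison does not. */
	printf("plain:   %d\n", before_wrap < after_wrap);               /* prints 0 */
	printf("modular: %d\n", ULONG_CMP_LT(before_wrap, after_wrap));  /* prints 1 */
	return 0;
}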
Answer:
Suppose that the lock is held and that several threads are attempting to acquire the lock. In this situation, if these threads all loop on the atomic exchange operation, they will ping-pong the cache line containing the lock among themselves, imposing load on the interconnect. In contrast, if these threads are spinning in the inner loop on lines 7–8, they will each spin within their own caches, placing negligible load on the interconnect. q

Quick Quiz 7.27: p.116
Which is better, the counter approach or the flag approach?

Answer:
The flag approach will normally suffer fewer cache misses, but a better answer is to try both and see which works best for your particular workload. q
Answer:
Here are some bugs resulting from improper use of implicit existence guarantees:

1. A program writes the address of a global variable to a file, then a later instance of that same program reads that address and attempts to dereference it. This can fail due to address-space randomization, to say nothing of recompilation of the program.

2. A module can record the address of one of its variables in a pointer located in some other module, then attempt to dereference that pointer after the module has been unloaded.

3. A function can record the address of one of its on-stack variables into a global pointer, which some other function might attempt to dereference after that function has returned.

I am sure that you can come up with additional possibilities. q

E.8 Data Ownership

Answer:
The creation of the threads via the sh & operator and the joining of threads via the sh wait command.
Of course, if the processes explicitly share memory, for example, using the shmget() or mmap() system calls, explicit synchronization might well be needed when accessing or updating the shared memory. The processes might also synchronize using any of the following interprocess communications mechanisms:

1. System V semaphores.

2. System V message queues.

3. UNIX-domain sockets.

4. Networking protocols, including TCP/IP, UDP, and a whole host of others.

5. File locking.

6. Use of the open() system call with the O_CREAT and O_EXCL flags.

7. Use of the rename() system call.
3. Shared-memory mailboxes.

4. UNIX-domain sockets.

5. TCP/IP or UDP, possibly augmented by any number of higher-level protocols, including RPC, HTTP, XML, SOAP, and so on.

Compilation of a complete list is left as an exercise to sufficiently single-minded readers, who are warned that the list will be extremely long. q

Quick Quiz 8.6: p.125
But none of the data in the eventual() function shown on lines 17–34 of Listing 5.5 is actually owned by the eventual() thread! In just what way is this data ownership???

Answer:
The key phrase is “owns the rights to the data”. In this case, the rights in question are the rights to access the per-thread counter variable defined on line 1 of the listing. This situation is similar to that described in Section 8.2.

E.9 Deferred Processing

Quick Quiz 9.1: p.129
Why bother with a use-after-free check?

Answer:
To greatly increase the probability of finding bugs. A small torture-test program (routetorture.h) that allocates and frees only one type of structure can tolerate a surprisingly large amount of use-after-free misbehavior. See Figure 11.4 on page 217 and the related discussion in Section 11.6.4 starting on page 218 for more on the importance of increasing the probability of finding bugs. q

Quick Quiz 9.2: p.129
Why doesn’t route_del() in Listing 9.3 use reference counts to protect the traversal to the element to be freed?

Answer:
Because the traversal is already protected by the lock, so no additional protection is required. q
Quick Quiz 9.3: p.130
Why the break in the “ideal” line at 224 CPUs in Fig-
ble indirection to the data element? Why not void * instead of void **?

Answer:
Because hp_try_record() must check for concurrent modifications. To do that job, it needs a pointer to a pointer to the element, so that it can check for a modification to the pointer to the element. q
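To illustrate why the extra level of indirection is needed, here is a hedged, self-contained sketch of that check. This is not the book's hp_try_record(); the function and parameter names are invented for the example, and READ_ONCE(), WRITE_ONCE(), and smp_mb() are the book's usual primitives:

#include <stdbool.h>

/* Attempt to protect the object referenced by *pp using the hazard
 * pointer *hp.  The double indirection lets us re-read *pp to detect
 * an update that raced with publication of the hazard pointer. */
static bool try_protect(void **pp, void **hp, void **objp)
{
	void *p = READ_ONCE(*pp);	/* snapshot the pointer */

	WRITE_ONCE(*hp, p);		/* publish the hazard pointer */
	smp_mb();			/* order publication before the recheck */
	if (READ_ONCE(*pp) != p)	/* did an updater change *pp meanwhile? */
		return false;		/* yes: caller must restart the traversal */
	*objp = p;			/* no: p is safe until *hp is cleared */
	return true;
}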
Quick Quiz 9.8: p.131
Why bother with hp_try_record()? Wouldn’t it be easier to just use the failure-immune hp_record() function?

Answer:
It might be easier in some sense, but as will be seen in the Pre-BSD routing example, there are situations for which hp_record() simply does not work. q

Quick Quiz 9.9: p.133
Readers must “typically” restart? What are some exceptions?

Answer:
If the pointer emanates from a global variable or is otherwise not subject to being freed, then hp_record() may be used to repeatedly attempt to record the hazard pointer, even in the face of concurrent deletions.
In certain cases, restart can be avoided by using link counting as exemplified by the UnboundedQueue and ConcurrentHashMap data structures implemented in Folly open-source library.9 q

Quick Quiz 9.10: p.133
But don’t these restrictions on hazard pointers also apply to other forms of reference counting?

Answer:
Yes and no. These restrictions apply only to reference-counting mechanisms whose reference acquisition can fail. q

Quick Quiz 9.11: p.134
Figure 9.3 shows no sign of hyperthread-induced flattening at 224 threads. Why is that?

Answer:
Modern microprocessors are complicated beasts, so significant skepticism is appropriate for any simple answer. That aside, the most likely reason is the full memory barriers required by hazard-pointers readers. Any delays resulting from those memory barriers would make time available to the other hardware thread sharing the core, resulting in greater scalability at the expense of per-hardware-thread performance. q

Quick Quiz 9.12: p.134
The paper “Structured Deferral: Synchronization via Procrastination” [McK13] shows that hazard pointers have near-ideal performance. Whatever happened in Figure 9.3???

Answer:
First, Figure 9.3 has a linear y-axis, while most of the graphs in the “Structured Deferral” paper have logscale y-axes. Next, that paper uses lightly-loaded hash tables, while Figure 9.3’s uses a 10-element simple linked list, which means that hazard pointers face a larger memory-barrier penalty in this workload than in that of the “Structured Deferral” paper. Finally, that paper used an older modest-sized x86 system, while a much newer and larger system was used to generate the data shown in Figure 9.3.
In addition, use of pairwise asymmetric barriers [Mic08, Cor10b, Cor18] has been proposed to eliminate the read-side hazard-pointer memory barriers on systems supporting this notion [Gol18b], which might improve the performance of hazard pointers beyond what is shown in the figure.
As always, your mileage may vary. Given the difference in performance, it is clear that hazard pointers give you the best performance either for very large data structures (where the memory-barrier overhead will at least partially overlap cache-miss penalties) and for data structures such as hash tables where a lookup operation needs a minimal number of hazard pointers. q
Answer:
The sequence-lock mechanism is really a combination of two separate synchronization mechanisms, sequence counts and locking. In fact, the sequence-count mechanism is available separately in the Linux kernel via the write_seqcount_begin() and write_seqcount_end() primitives.
However, the combined write_seqlock() and write_sequnlock() primitives are used much more heavily in the Linux kernel. More importantly, many more people will understand what you mean if you say “sequence lock” than if you say “sequence count”.
So this section is entitled “Sequence Locks” so that people will understand what it is about just from the title, and it appears in the “Deferred Processing” chapter because (1) of the emphasis on the “sequence count” aspect of “sequence locks” and (2) because a “sequence lock” is much more than merely a lock. q

Quick Quiz 9.14: p.135
Why not have read_seqbegin() in Listing 9.10 check for the low-order bit being set, and retry internally, rather than allowing a doomed read to start?

Answer:
That would be a legitimate implementation. However, if the workload is read-mostly, it would likely increase the overhead of the common-case successful read, which could be counter-productive. However, given a sufficiently large fraction of updates and sufficiently high-overhead readers, having the check internal to read_seqbegin() might be preferable. q

Answer:
In older versions of the Linux kernel, no.
In very new versions of the Linux kernel, line 16 could use smp_load_acquire() instead of READ_ONCE(), which in turn would allow the smp_mb() on line 17 to be dropped. Similarly, line 41 could use an smp_store_release(), for example, as follows:

smp_store_release(&slp->seq, READ_ONCE(slp->seq) + 1);

This would allow the smp_mb() on line 40 to be dropped. q

Quick Quiz 9.17: p.136
What prevents sequence-locking updaters from starving readers?

Answer:
Nothing. This is one of the weaknesses of sequence locking, and as a result, you should use sequence locking only in read-mostly situations. Unless of course read-side starvation is acceptable in your situation, in which case, go wild with the sequence-locking updates! q

Quick Quiz 9.18: p.136
What if something else serializes writers, so that the lock is not needed?

Answer:
In this case, the ->lock field could be omitted, as it is in seqcount_t in the Linux kernel. q
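As a reminder of how readers cope with the updater-induced retries discussed above, a typical read_seqbegin()/read_seqretry() loop has the following shape. This is a hedged sketch of the usage pattern rather than a reproduction of Listing 9.10; demo_seqlock, demo_data, and do_something_with() are placeholder names:

/* Reader retry loop: keep re-running the read-side critical section
 * until it ran entirely between updates, that is, until the sequence
 * number was even and unchanged across the section. */
unsigned long seq;

do {
	seq = read_seqbegin(&demo_seqlock);
	/* Read (but never modify) the protected data here. */
	do_something_with(demo_data);
} while (read_seqretry(&demo_seqlock, seq));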
3. Other threads repeatedly invoke write_seqlock() and write_sequnlock(), until the value of ->seq overflows back to the value that Thread 0 fetched.

4. Thread 0 resumes execution, completing its read-side critical section with inconsistent data.

5. Thread 0 invokes read_seqretry(), which incorrectly concludes that Thread 0 has seen a consistent view of the data protected by the sequence lock.

The Linux kernel uses sequence locking for things that are updated rarely, with time-of-day information being a case in point. This information is updated at most once per millisecond, so that seven weeks would be required to overflow the counter. If a kernel thread was preempted for seven weeks, the Linux kernel's soft-lockup code would be emitting warnings every two minutes for that entire time.
In contrast, with a 64-bit counter, more than five centuries would be required to overflow, even given an update every nanosecond. Therefore, this implementation uses a type for ->seq that is 64 bits on 64-bit systems. q
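As a rough sanity check of that five-centuries figure (my own arithmetic, assuming one counter tick per nanosecond):

2^{64}\,\text{ns} \approx 1.8 \times 10^{19}\,\text{ns} \approx 1.8 \times 10^{10}\,\text{s} \approx 584\,\text{years}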
Quick Quiz 9.20: p.137
Can this bug be fixed? In other words, can you use sequence locks as the only synchronization mechanism protecting a linked list supporting concurrent addition, deletion, and lookup?

Answer:
One trivial way of accomplishing this is to surround all accesses, including the read-only accesses, with write_seqlock() and write_sequnlock(). Of course, this solution also prohibits all read-side parallelism, resulting in massive lock contention, and furthermore could just as easily be implemented using simple locking.
If you do come up with a solution that uses read_seqbegin() and read_seqretry() to protect read-side accesses, make sure that you correctly handle the following sequence of events:

1. CPU 0 is traversing the linked list, and picks up a pointer to list element A.

2. CPU 1 removes element A from the list and frees it.

3. CPU 2 allocates an unrelated data structure, and gets the memory formerly occupied by element A. In this unrelated data structure, the memory previously used for element A's ->next pointer is now occupied by a floating-point number.

4. CPU 0 picks up what used to be element A's ->next pointer, gets random bits, and therefore gets a segmentation fault.

One way to protect against this sort of problem requires use of "type-safe memory", which will be discussed in Section 9.5.4.5. Roughly similar solutions are possible using the hazard pointers discussed in Section 9.3. But in either case, you would be using some other synchronization mechanism in addition to sequence locks! q

Quick Quiz 9.21: p.139
Why does Figure 9.7 use smp_store_release() given that it is storing a NULL pointer? Wouldn't WRITE_ONCE() work just as well in this case, given that there is no structure initialization to order against the store of the NULL pointer?

Answer:
Yes, it would.
Because a NULL pointer is being assigned, there is nothing to order against, so there is no need for smp_store_release(). In contrast, when assigning a non-NULL pointer, it is necessary to use smp_store_release() in order to ensure that initialization of the pointed-to structure is carried out before assignment of the pointer. In short, WRITE_ONCE() would work, and would save a little bit of CPU time on some architectures. However, as we will see, software-engineering concerns will motivate use of a special rcu_assign_pointer() that is quite similar to smp_store_release(). q

Quick Quiz 9.22: p.139
Readers running concurrently with each other and with the procedure outlined in Figure 9.7 can disagree on the value of gptr. Isn't that just a wee bit problematic???

Answer:
Not necessarily.
As hinted at in Sections 3.2.3 and 3.3, speed-of-light delays mean that a computer's data is always stale compared to whatever external reality that data is intended to model.
Real-world algorithms therefore absolutely must tolerate inconsistencies between external reality and the in-computer data reflecting that reality. Many of those algorithms are also able to tolerate some degree of inconsistency within the in-computer data. Section 10.3.4 discusses this point in more detail.
Please note that this need to tolerate inconsistent and stale data is not limited to RCU. It also applies to reference counting, hazard pointers, sequence locks, and even to some locking use cases. For example, if you compute some quantity while holding a lock, but use that quantity after releasing that lock, you might well be using stale data. After all, the data that quantity is based on might change arbitrarily as soon as the lock is released.
So yes, RCU readers can see stale and inconsistent data, but no, this is not necessarily problematic. And, when needed, there are RCU usage patterns that avoid both staleness and inconsistency [ACMS03]. q

Quick Quiz 9.23: p.139
What is an RCU-protected pointer?

Answer:
A pointer to RCU-protected data. RCU-protected data is in turn a block of dynamically allocated memory whose freeing will be deferred such that an RCU grace period will elapse between the time that there were no longer any RCU-reader-accessible pointers to that block and the time that that block is freed. This ensures that no RCU readers will have access to that block at the time that it is freed.
RCU-protected pointers must be handled carefully. For example, any reader that intends to dereference an RCU-protected pointer must use rcu_dereference() (or stronger) to load that pointer. In addition, any updater must use rcu_assign_pointer() (or stronger) to store to that pointer. q
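For concreteness, here is a minimal Linux-kernel-style sketch of that division of labor. The struct foo type, the gfp pointer, and the function names are illustrative assumptions rather than code from the book's listings, and the updater is assumed to be otherwise serialized (for example, by a lock):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int a;
};
static struct foo __rcu *gfp;   /* hypothetical RCU-protected pointer */

/* Reader: load the pointer with rcu_dereference() within a read-side
 * critical section, and use it only inside that section. */
static int reader(void)
{
	struct foo *p;
	int ret = -1;

	rcu_read_lock();
	p = rcu_dereference(gfp);
	if (p)
		ret = p->a;
	rcu_read_unlock();
	return ret;
}

/* Updater: fully initialize the new structure, publish it with
 * rcu_assign_pointer(), then free the old version after a grace period. */
static void updater(int newval)
{
	struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
	struct foo *oldp;

	if (!newp)
		return;
	newp->a = newval;
	oldp = rcu_dereference_protected(gfp, 1);  /* caller assumed serialized */
	rcu_assign_pointer(gfp, newp);
	synchronize_rcu();                          /* wait for pre-existing readers */
	kfree(oldp);
}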
Quick Quiz 9.24: p.140
What does synchronize_rcu() do if it starts at about the same time as an rcu_read_lock()?

Answer:
If a synchronize_rcu() cannot prove that it started before a given rcu_read_lock(), then it must wait for the corresponding rcu_read_unlock(). q

Quick Quiz 9.25: p.142
In Figure 9.8, the last of CPU 3's readers that could possibly have access to the old data item ended before the grace period even started! So why would anyone bother waiting until CPU 3's later context switch???

Answer:
Because that waiting is exactly what enables readers to use the same sequence of instructions that is appropriate for single-threaded situations. In other words, this additional "redundant" waiting enables excellent read-side performance, scalability, and real-time response. q

Quick Quiz 9.26: p.142
What is the point of rcu_read_lock() and rcu_read_unlock() in Listing 9.13? Why not just let the quiescent states speak for themselves?

Answer:
Recall that readers are not permitted to pass through a quiescent state. For example, within the Linux kernel, RCU readers are not permitted to execute a context switch. Use of rcu_read_lock() and rcu_read_unlock() enables debug checks for improperly placed quiescent states, making it easy to find bugs that would otherwise be difficult to find, intermittent, and quite destructive. q

Quick Quiz 9.27: p.142
What is the point of rcu_dereference(), rcu_assign_pointer() and RCU_INIT_POINTER() in Listing 9.13? Why not just use READ_ONCE(), smp_store_release(), and WRITE_ONCE(), respectively?

Answer:
The RCU-specific APIs do have similar semantics to the suggested replacements, but also enable static-analysis debugging checks that complain if an RCU-specific API is invoked on a non-RCU pointer and vice versa. q

Quick Quiz 9.28: p.143
But what if the old structure needs to be freed, but the caller of ins_route() cannot block, perhaps due to performance considerations or perhaps because the caller is executing within an RCU read-side critical section?

Answer:
A call_rcu() function, which is described in Section 9.5.2.2, permits asynchronous grace-period waits. q
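As a reference point, an asynchronous free along those lines might look like the following minimal Linux-kernel-style sketch. The struct route_entry layout and the function names are illustrative assumptions based on the answer's description, not the book's actual ins_route() code:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct route_entry {
	struct rcu_head rh;      /* storage used by call_rcu() */
	/* ... other fields ... */
};

/* Invoked by the RCU subsystem after a grace period has elapsed. */
static void route_entry_free_cb(struct rcu_head *rhp)
{
	struct route_entry *rep = container_of(rhp, struct route_entry, rh);

	kfree(rep);
}

/* Instead of blocking in synchronize_rcu(), hand the old structure to
 * call_rcu() and return immediately, even from within a read-side
 * critical section. */
static void route_entry_defer_free(struct route_entry *rep)
{
	call_rcu(&rep->rh, route_entry_free_cb);
}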
Quick Quiz 9.29: p.143
Doesn't Section 9.4's seqlock also permit readers and updaters to make useful concurrent forward progress?
Answer:
Yes and no. Although seqlock readers can run concurrently with seqlock writers, whenever this happens, the read_seqretry() primitive will force the reader to retry. This means that any work done by a seqlock reader running concurrently with a seqlock updater will be discarded and then redone upon retry. So seqlock readers can run concurrently with updaters, but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in the presence of concurrent RCU updaters.
However, both reference counters (Section 9.2) and hazard pointers (Section 9.3) really do permit useful concurrent forward progress for both updaters and readers, just at somewhat greater cost. Please see Section 9.6 for a comparison of these different solutions to the deferred-reclamation problem. q

Quick Quiz 9.30: p.145
Wouldn't use of data ownership for RCU updaters mean that the updates could use exactly the same sequence of instructions as would the corresponding single-threaded code?

Answer:
Sometimes, for example, on TSO systems such as x86 or the IBM mainframe where a store-release operation emits a single store instruction. However, weakly ordered systems must also emit a memory barrier or perhaps a store-release instruction. In addition, removing data requires quite a bit of additional work because it is necessary to wait for pre-existing readers before freeing the removed data. q

Quick Quiz 9.31: p.145
But suppose that updaters are adding and removing multiple data items from a linked list while a reader is iterating over that same list. Specifically, suppose that a list initially contains elements A, B, and C, and that an updater removes element A and then adds a new element D at the end of the list. The reader might well see {A, B, C, D}, when that sequence of elements never actually existed! In what alternate universe would that qualify as "not disrupting concurrent readers"???

Answer:
In the universe where an iterating reader is only required to traverse elements that were present throughout the full duration of the iteration. In the example, that would be elements B and C. Because elements A and D were each present for only part of the iteration, the reader is permitted to iterate over them, but not obliged to. Note that this supports the common case where the reader is simply looking up a single item, and does not know or care about the presence or absence of other items.
If stronger consistency is required, then higher-cost synchronization mechanisms are required, for example, sequence locking or reader-writer locking. But if stronger consistency is not required (and it very often is not), then why pay the higher cost? q

Quick Quiz 9.32: p.146
What other final values of r1 and r2 are possible in Figure 9.11?

Answer:
The r1 == 0 && r2 == 0 possibility was called out in the text. Given that r1 == 0 implies r2 == 0, we know that r1 == 0 && r2 == 1 is forbidden. The following discussion will show that both r1 == 1 && r2 == 1 and r1 == 1 && r2 == 0 are possible. q

Quick Quiz 9.33: p.146
What would happen if the order of P0()'s two accesses was reversed in Figure 9.12?

Answer:
Absolutely nothing would change. The fact that P0()'s loads from x and y are in the same RCU read-side critical section suffices; their order is irrelevant. q

Quick Quiz 9.34: p.147
What would happen if P0()'s accesses in Figures 9.11–9.13 were stores?

Answer:
The exact same ordering rules would apply, that is, (1) if any part of P0()'s RCU read-side critical section preceded the beginning of P1()'s grace period, all of P0()'s RCU read-side critical section would precede the end of P1()'s grace period, and (2) if any part of P0()'s RCU read-side critical section followed the end of P1()'s grace period, all of P0()'s RCU read-side critical section would follow the beginning of P1()'s grace period.
It might seem strange to have RCU read-side critical sections containing writes, but this capability is not only permitted, but also highly useful. For example, the Linux kernel frequently carries out an RCU-protected traversal of a linked data structure and then acquires a reference to
Listing E.2: Concurrent RCU Deletion
 1 spin_lock(&mylock);
 2 p = search(head, key);
 3 if (p == NULL)
 4   spin_unlock(&mylock);
 5 else {
 6   list_del_rcu(&p->list);
 7   spin_unlock(&mylock);
 8   synchronize_rcu();
 9   kfree(p);
10 }

1. Thread A traverses the list, obtaining a reference to Element C.

2. Thread B replaces Element C with a new Element F, then waits for its synchronize_rcu() call to return.

3. Thread C traverses the list, obtaining a reference to Element F.
Figure E.5: Pre-BSD Routing Table Protected by RCU (lookups per millisecond as a function of the number of CPUs (threads), with ideal, RCU, seqlock, and hazptr traces)

QSBR With Non-Initial rcu_head

same as the sequential variant. And the answer, as can be seen in Figure E.5, is that this causes RCU QSBR's performance to decrease to where it is still very nearly ideal, but no longer super-ideal. q

Quick Quiz 9.48: p.161
Given RCU QSBR's read-side performance, why bother with any other flavor of userspace RCU?

Answer:
Because RCU QSBR places constraints on the overall application that might not be tolerable, for example, requiring that each and every thread in the application regularly pass through a quiescent state. Among other things, this means that RCU QSBR is not helpful to library writers, who might be better served by other flavors of userspace RCU [MDJ13c]. q

Quick Quiz 9.49: p.164
Suppose that the nmi_profile() function was preemptible. What would need to change to make this example work correctly?

Answer:
One approach would be to use rcu_read_lock() and rcu_read_unlock() in nmi_profile(), and to replace the synchronize_sched() with synchronize_rcu(), perhaps as shown in Listing E.5.
But why on earth would an NMI handler be preemptible??? q

Listing E.5: Using RCU to Wait for Mythical Preemptible NMIs to Finish
 1 struct profile_buffer {
 2   long size;
 3   atomic_t entry[0];
 4 };
 5 static struct profile_buffer *buf = NULL;
 6
 7 void nmi_profile(unsigned long pcvalue)
 8 {
 9   struct profile_buffer *p;
10
11   rcu_read_lock();
12   p = rcu_dereference(buf);
13   if (p == NULL) {
14     rcu_read_unlock();
15     return;
16   }
17   if (pcvalue >= p->size) {
18     rcu_read_unlock();
19     return;
20   }
21   atomic_inc(&p->entry[pcvalue]);
22   rcu_read_unlock();
23 }
24
25 void nmi_stop(void)
26 {
27   struct profile_buffer *p = buf;
28
29   if (p == NULL)
30     return;
31   rcu_assign_pointer(buf, NULL);
32   synchronize_rcu();
33   kfree(p);
34 }

Quick Quiz 9.50: p.164
What is the point of the second call to synchronize_rcu() in function maint() in Listing 9.17? Isn't it OK for any cco() invocations in the clean-up phase to invoke either cco_carefully() or cco_quickly()?

Answer:
The problem is that there is no ordering between the cco() function's load from be_careful and any memory loads executed by the cco_quickly() function. Because there is no ordering, without that second call to synchronize_rcu(), memory ordering could cause loads in cco_quickly() to overlap with stores by do_maint().
Another alternative would be to compensate for the removal of that second call to synchronize_rcu() by changing the READ_ONCE() to smp_load_acquire() and the WRITE_ONCE() to smp_store_release(), thus restoring the needed ordering. q
Quick Quiz 9.56: p.167
WTF? How the heck do you expect me to believe that RCU can have less than a 300-picosecond overhead when the clock period at 2.10 GHz is almost 500 picoseconds?

Answer:
First, consider that the inner loop used to take this measurement is as follows:

1 for (i = nloops; i >= 0; i--) {
2   rcu_read_lock();
3   rcu_read_unlock();
4 }

Next, consider the effective definitions of rcu_read_lock() and rcu_read_unlock():

1 #define rcu_read_lock()   barrier()
2 #define rcu_read_unlock() barrier()

These definitions constrain compiler code-movement optimizations involving memory references, but emit no instructions in and of themselves. However, if the loop variable is maintained in a register, the accesses to i will not count as memory references. Furthermore, the compiler can do loop unrolling, allowing the resulting code to "execute" multiple passes through the loop body simply by incrementing i by some value larger than the value 1.
So the "measurement" of 267 picoseconds is simply the fixed overhead of the timing measurements divided by the number of passes through the inner loop containing the calls to rcu_read_lock() and rcu_read_unlock(), plus the code to manipulate i divided by the loop-unrolling factor. And therefore, this measurement really is in error, in fact, it exaggerates the overhead by an arbitrary number of orders of magnitude. After all, in terms of machine instructions emitted, the actual overheads of rcu_read_lock() and of rcu_read_unlock() are each precisely zero.
It is not every day that a timing measurement of 267 picoseconds turns out to be an overestimate! q

Answer:
Excellent memory!!! The overhead in some early releases was in fact roughly 100 femtoseconds.
What happened was that RCU usage spread more broadly through the Linux kernel, including into code that takes page faults. Back at that time, rcu_read_lock() and rcu_read_unlock() were complete no-ops in CONFIG_PREEMPT=n kernels. Unfortunately, that situation allowed the compiler to reorder page-faulting memory accesses into RCU read-side critical sections. Of course, page faults can block, which destroys those critical sections.
Nor was this a theoretical problem: A failure actually manifested in 2019. Herbert Xu tracked down this failure, and Linus Torvalds therefore queued a commit to upgrade rcu_read_lock() and rcu_read_unlock() to unconditionally include a call to barrier() [Tor19]. And although barrier() emits no code, it does constrain compiler optimizations. And so the price of widespread RCU usage is slightly higher rcu_read_lock() and rcu_read_unlock() overhead. As such, Linux-kernel RCU has proven to be a victim of its own success.
Of course, it is also the case that the older results were obtained on a different system than were those shown in Figure 9.25. So which change had the most effect, Linus's commit or the change in the system? This question is left as an exercise for the reader. q

Quick Quiz 9.58: p.167
Why is there such large variation for the RCU trace in Figure 9.25?

Answer:
Keep in mind that this is a log-log plot, so those large-seeming RCU variances in reality span only a few hundred picoseconds. And that is such a short time that anything could cause it. However, given that the variance decreases with both small and large numbers of CPUs, one hypothesis is that the variation is due to migrations from one CPU to another.
Yes, these measurements were taken with interrupts disabled, but they were also taken within a guest OS, so that preemption was still possible at the hypervisor level. In addition, the system featured hyperthreading and
a single hardware thread running this RCU workload is able to consume more than half of the core's resources. Therefore, the overall throughput varies depending on how many of a given guest OS's CPUs share cores. Attempting to reduce these variations by running the guest OSes at real-time priority (as suggested by Joel Fernandes) is left as an exercise for the reader. q

Quick Quiz 9.59: p.168
Given that the system had no fewer than 448 hardware threads, why only 192 CPUs?

Answer:
Because the script (rcuscale.sh) that generates this data spawns a guest operating system for each set of points gathered, and on this particular system, both qemu and KVM limit the number of CPUs that may be configured into a given guest OS. Yes, it would have been possible to run a few more CPUs, but 192 is a nice round number from a binary perspective, given that 256 is infeasible. q

Quick Quiz 9.60: p.168
Why the larger error ranges for the submicrosecond durations in Figure 9.27?

Answer:
Because smaller disturbances result in greater relative errors for smaller measurements. Also, the Linux kernel's ndelay() nanosecond-scale primitive is (as of 2020) less accurate than is the udelay() primitive used for the data for durations of a microsecond or more. It is instructive to compare to the zero-length case shown in Figure 9.25. q

Quick Quiz 9.61: p.168
Is there an exception to this deadlock immunity, and if so, what sequence of events could lead to deadlock?

Answer:
One way to cause a deadlock cycle involving RCU read-side primitives is via the following (illegal) sequence of statements:

rcu_read_lock();
synchronize_rcu();
rcu_read_unlock();

The synchronize_rcu() cannot return until all pre-existing RCU read-side critical sections complete, but is enclosed in an RCU read-side critical section that cannot complete until the synchronize_rcu() returns. The result is a classic self-deadlock: you get the same effect when attempting to write-acquire a reader-writer lock while read-holding it.
Note that this self-deadlock scenario does not apply to RCU QSBR, because the context switch performed by the synchronize_rcu() would act as a quiescent state for this CPU, allowing a grace period to complete. However, this is if anything even worse, because data used by the RCU read-side critical section might be freed as a result of the grace period completing. Plus the Linux kernel's lockdep facility will yell at you.
In short, do not invoke synchronous RCU update-side primitives from within an RCU read-side critical section. In addition, within the Linux kernel, RCU uses the scheduler and the scheduler uses RCU. In some cases, both RCU and the scheduler must take care to avoid deadlock. q
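The reader-writer-lock analogy can be made concrete with a short user-space sketch. The program below is my own illustration of the same self-acquisition pattern, not code from the book's CodeSamples; POSIX leaves the outcome unspecified, so it typically either hangs or returns EDEADLK:

#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
	int ret;

	pthread_rwlock_rdlock(&lock);        /* analogous to rcu_read_lock() */
	ret = pthread_rwlock_wrlock(&lock);  /* analogous to synchronize_rcu():
	                                        waits for the read hold above */
	printf("wrlock returned %d\n", ret); /* typically never reached */
	pthread_rwlock_unlock(&lock);
	return 0;
}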
into a given guest OS. Yes, it would have been possible
to run a few more CPUs, but 192 is a nice round number Quick Quiz 9.62: p.169
from a binary perspective, given that 256 is infeasible. q Immunity to both deadlock and priority inversion???
Sounds too good to be true. Why should I believe that
Quick Quiz 9.60: p.168 this is even possible?
Why the larger error ranges for the submicrosecond
durations in Figure 9.27? Answer:
It really does work. After all, if it didn’t work, the Linux
Answer: kernel would not run. q
Because smaller disturbances result in greater relative
errors for smaller measurements. Also, the Linux kernel’s Quick Quiz 9.63: p.169
ndelay() nanosecond-scale primitive is (as of 2020) less But how many other algorithms really tolerate stale and
accurate than is the udelay() primitive used for the data inconsistent data?
for durations of a microsecond or more. It is instructive to
compare to the zero-length case shown in Figure 9.25. q Answer:
Quite a few!
p.168 Please keep in mind that the finite speed of light means
Quick Quiz 9.61:
that data reaching a given computer system is at least
Is there an exception to this deadlock immunity, and if
slightly stale at the time that it arrives, and extremely
so, what sequence of events could lead to deadlock?
stale in the case of astronomical data. The finite speed of
Answer: light also places a sharp limit on the consistency of data
One way to cause a deadlock cycle involving RCU read- arriving from different sources of via different paths.
side primitives is via the following (illegal) sequence of You might as well face the fact that the laws of physics
statements: are incompatible with naive notions of perfect freshness
and consistency. q
rcu_read_lock();
synchronize_rcu();
rcu_read_unlock(); p.170
Quick Quiz 9.64:
If Tasks RCU Trace might someday be priority boosted,
The synchronize_rcu() cannot return until all pre- why not also Tasks RCU and Tasks RCU Rude?
existing RCU read-side critical sections complete, but is
enclosed in an RCU read-side critical section that cannot Answer:
complete until the synchronize_rcu() returns. The Maybe, but these are less likely.
In the case of Tasks RCU, recall that the quiescent state is a voluntary context switch. Thus, all tasks not blocked after a voluntary context switch might need to be boosted, and the mechanics of deboosting would not likely be at all pretty.
In the case of Tasks RCU Rude, as was the case with the old RCU Sched, any preemptible region of code is a quiescent state. Thus, the only tasks that might need boosting are those currently running with preemption disabled. But boosting the priority of a preemption-disabled task has no effect. It therefore seems doubly unlikely that priority boosting will ever be introduced to Tasks RCU Rude, at least in its current form. q

Quick Quiz 9.65: p.173
Is RCU the only synchronization mechanism that combines temporal and spatial synchronization in this way?

Answer:
Not at all.
Hazard pointers can be considered to combine temporal and spatial synchronization in a similar manner. Referring to Listing 9.4, the hp_record() function's acquisition of a reference provides both spatial and temporal synchronization, subscribing to a version and marking the start of a reference, respectively. This function therefore combines the effects of RCU's rcu_read_lock() and rcu_dereference(). Referring now to Listing 9.5, the hp_clear() function's release of a reference provides temporal synchronization marking the end of a reference, and is thus similar to RCU's rcu_read_unlock(). The hazptr_free_later() function's retiring of a hazard-pointer-protected object provides temporal synchronization, similar to RCU's call_rcu(). The primitives used to mutate a hazard-pointer-protected structure provide spatial synchronization, similar to RCU's rcu_assign_pointer().
Alternatively, one could instead come at hazard pointers by analogy with reference counting. q

Quick Quiz 9.66: p.174
But wait! This is exactly the same code that might be used when thinking of RCU as a replacement for reader-writer locking! What gives?

Answer:
This is an effect of the Law of Toy Examples: Beyond a certain point, the code fragments look the same. The only difference is in how we think about the code. For example, what does an atomic_inc() operation do? It might be acquiring another explicit reference to an object to which we already have a reference, it might be incrementing an often-read/seldom-updated statistical counter, it might be checking into an HPC-style barrier, or any of a number of other things.
However, these differences can be extremely important. For but one example of the importance, consider that if we think of RCU as a restricted reference counting scheme, we would never be fooled into thinking that the updates would exclude the RCU read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement for reader-writer locking, for example, when you are replacing reader-writer locking with RCU. q

Quick Quiz 9.67: p.176
Which of these use cases best describes the Pre-BSD routing example in Section 9.5.4.1?

Answer:
Pre-BSD routing could be argued to fit into either quasi reader-writer lock, quasi reference count, or quasi multi-version concurrency control. The code is the same either way. This is similar to things like atomic_inc(), another tool that can be put to a great many uses. q

Quick Quiz 9.68: p.178
Why not just drop the lock before waiting for the grace period, or use something like call_rcu() instead of waiting for a grace period?

Answer:
The authors wished to support linearizable tree operations, so that concurrent additions to, deletions from, and searches of the tree would appear to execute in some globally agreed-upon order. In their search trees, this requires holding locks across grace periods. (It is probably better to drop linearizability as a requirement in most cases, but linearizability is a surprisingly popular (and costly!) requirement.) q

Quick Quiz 9.69: p.180
Why can't users dynamically allocate the hazard pointers as they are needed?

Answer:
They can, but at the expense of additional reader-traversal
overhead and, in some environments, the need to handle memory-allocation failure. q

Quick Quiz 9.70: p.180
But don't Linux-kernel kref reference counters allow guaranteed unconditional reference acquisition?

Answer:
Yes they do, but the guarantee only applies unconditionally in cases where a reference is already held. With this in mind, please review the paragraph at the beginning of Section 9.6, especially the part saying "large enough that readers do not hold references from one traversal to another". q

Quick Quiz 9.71: p.181
But didn't the answer to one of the quick quizzes in Section 9.3 say that pairwise asymmetric barriers could eliminate the read-side smp_mb() from hazard pointers?

Answer:
Yes, it did. However, doing this could be argued to change hazard pointers' "Reclamation Forward Progress" row (discussed later) from lock-free to blocking because a CPU spinning with interrupts disabled in the kernel would prevent the update-side portion of the asymmetric barrier from completing. In the Linux kernel, such blocking could in theory be prevented by building the kernel with CONFIG_NO_HZ_FULL, designating the relevant CPUs as nohz_full at boot time, ensuring that only one thread was ever runnable on a given CPU at a given time, and avoiding ever calling into the kernel. Alternatively, you could ensure that the kernel was free of any bugs that might cause CPUs to spin with interrupts disabled.
Given that CPUs spinning in the Linux kernel with interrupts disabled seems to be rather rare, one might counter-argue that asymmetric-barrier hazard-pointer updates are non-blocking in practice, if not in theory. q

E.10 Data Structures

Quick Quiz 10.1: p.186
But chained hash tables are but one type of many. Why the focus on chained hash tables?

Answer:
Chained hash tables are completely partitionable, and thus well-suited to concurrent use. There are other completely-partitionable hash tables, for example, the split-ordered list [SS06], but they are considerably more complex. We therefore start with chained hash tables. q

Quick Quiz 10.2: p.186
But isn't the double comparison on lines 10–13 in Listing 10.3 inefficient in the case where the key fits into an unsigned long?

Answer:
Indeed it is! However, hash tables quite frequently store information with keys such as character strings that do not necessarily fit into an unsigned long. Simplifying the hash-table implementation for the case where keys always fit into unsigned longs is left as an exercise for the reader. q

Quick Quiz 10.3: p.188
Instead of simply increasing the number of hash buckets, wouldn't it be better to cache-align the existing hash buckets?

Answer:
The answer depends on a great many things. If the hash table has a large number of elements per bucket, it would clearly be better to increase the number of hash buckets. On the other hand, if the hash table is lightly loaded, the answer depends on the hardware, the effectiveness of the hash function, and the workload. Interested readers are encouraged to experiment. q

Quick Quiz 10.4: p.188
Given the negative scalability of the Schrödinger's Zoo application across sockets, why not just run multiple copies of the application, with each copy having a subset of the animals and confined to run on a single socket?

Answer:
You can do just that! In fact, you can extend this idea to large clustered systems, running one copy of the application on each node of the cluster. This practice is called "sharding", and is heavily used in practice by large web-based retailers [DHJ+07].
However, if you are going to shard on a per-socket basis within a multisocket system, why not buy separate smaller and cheaper single-socket systems, and then run one shard of the database on each of those systems? q
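Returning to the partitionability point in Quick Quiz 10.1's answer, a minimal sketch of a per-bucket-locked chained hash table might look as follows. The structure layout and function names are illustrative assumptions, not the book's Listing 10.x code:

#include <pthread.h>

struct ht_elem {
	struct ht_elem *next;
	unsigned long key;
};

/* Each bucket has its own lock, so operations on different buckets
 * proceed in parallel without touching any shared state. */
struct ht_bucket {
	pthread_mutex_t lock;
	struct ht_elem *chain;
};

struct hashtab {
	unsigned long nbuckets;
	struct ht_bucket *buckets;
};

static struct ht_bucket *ht_get_bucket(struct hashtab *htp, unsigned long key)
{
	return &htp->buckets[key % htp->nbuckets];
}

/* Add an element, touching only its bucket's lock. */
static void ht_add(struct hashtab *htp, struct ht_elem *htep)
{
	struct ht_bucket *htbp = ht_get_bucket(htp, htep->key);

	pthread_mutex_lock(&htbp->lock);
	htep->next = htbp->chain;
	htbp->chain = htep;
	pthread_mutex_unlock(&htbp->lock);
}

/* Look up a key, again touching only one bucket.  Returning the element
 * after dropping the lock assumes that elements are never freed, which is
 * precisely the problem that RCU and hazard pointers address. */
static struct ht_elem *ht_lookup(struct hashtab *htp, unsigned long key)
{
	struct ht_bucket *htbp = ht_get_bucket(htp, key);
	struct ht_elem *htep;

	pthread_mutex_lock(&htbp->lock);
	for (htep = htbp->chain; htep != NULL; htep = htep->next)
		if (htep->key == key)
			break;
	pthread_mutex_unlock(&htbp->lock);
	return htep;
}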
also use RCU-aware list-manipulation primitives. Finally, this is why the caller of hashtab_del() must wait for a grace period (e.g., by calling synchronize_rcu()) before freeing the removed element. This will ensure that all RCU readers that might reference the newly removed element have completed before that element is freed. q

Quick Quiz 10.6: p.190
The hashtorture.h file contains more than 1,000 lines! Is that a comprehensive test or what???

Answer:
What.
The hashtorture.h tests are a good start and suffice for a textbook algorithm. If this code was to be used in production, much more testing would be required:

1. Have some subset of elements that always reside in the table, and verify that lookups always find these elements regardless of the number and type of concurrent updates in flight.

2. Pair an updater with one or more readers, verifying that after an element is added, once a reader successfully looks up that element, all later lookups succeed. The definition of "later" will depend on the table's consistency requirements.

3. Pair an updater with one or more readers, verifying that after an element is deleted, once a reader's lookup of that element fails, all later lookups also fail.

There are many more tests where those came from, the exact nature of which depends on the details of the requirements on your particular hash table. q

Figure E.6: Read-Only RCU-Protected Hash-Table Performance For Schrödinger's Zoo at 448 CPUs, Varying Table Size (lookups as a function of hash-table size, in buckets and maximum elements)

Quick Quiz 10.7: p.192
How can we be so sure that the hash-table size is at fault here, especially given that Figure 10.4 on page 188 shows that varying hash-table size has almost no effect? Might the problem instead be something like false sharing?

Answer:
Excellent question!
False sharing requires writes, which are not featured in the unsynchronized and RCU runs of this lookup-only benchmark. The problem is therefore not false sharing.
Still unconvinced? Then look at the log-log plot in Figure E.6, which shows performance for 448 CPUs as a function of the hash-table size, that is, number of buckets and maximum number of elements. A hash table of size 1,024 has 1,024 buckets and contains at most 1,024 elements, with the average occupancy being 512 elements. Because this is a read-only benchmark, the actual occupancy is always equal to the average occupancy.
This figure shows near-ideal performance below about 8,000 elements, that is, when the hash table comprises less than 1 MB of data. This near-ideal performance is consistent with that for the pre-BSD routing table shown in Figure 9.21 on page 161, even at 448 CPUs. However, the performance drops significantly (this is a log-log plot) at about 8,000 elements, which is where the 1,048,576-byte L2 cache overflows. Performance falls off a cliff (even on this log-log plot) at about 300,000 elements, where the 40,370,176-byte L3 cache overflows. This demonstrates that the memory-system bottleneck is profound, degrading performance by well in excess of an order of magnitude for the large hash tables. This should not be a surprise,
as the size-8,388,608 hash table occupies about 1 GB of memory, overflowing the L3 caches by a factor of 25.
The reason that Figure 10.4 on page 188 shows little effect is that its data was gathered from bucket-locked hash tables, where locking overhead and contention drowned out cache-capacity effects. In contrast, both RCU and hazard-pointers readers avoid stores to shared data, which means that the cache-capacity effects come to the fore.
Still not satisfied? Find a multi-socket system and run this code, making use of whatever performance-counter hardware is available. This hardware should allow you to track down the precise cause of any slowdowns exhibited on your particular system. The experience gained by doing this exercise will be extremely valuable, giving you a significant advantage over those whose understanding of this issue is strictly theoretical.10 q

Quick Quiz 10.8: p.192
The memory system is a serious bottleneck on this big system. Why bother putting 448 CPUs on a system without giving them enough memory bandwidth to do something useful???

Answer:
It would indeed be a bad idea to use this large and expensive system for a workload consisting solely of simple hash-table lookups of small data elements. However, this system is extremely useful for a great many workloads that feature more processing and less memory accessing. For example, some in-memory databases run extremely well on this class of system, albeit when running much more complex sets of queries than performed by the benchmarks in this chapter. For example, such systems might be processing images or video streams stored in each element, providing further performance benefits due to the fact that the resulting sequential memory accesses will make better use of the available memory bandwidth than will a pure pointer-following workload.
But let this be a lesson to you. Modern computer systems come in a great many shapes and sizes, and great care is frequently required to select one that suits your application. And perhaps even more frequently, significant care and work is required to adjust your application to the specific computer systems at hand. q

was made quite clear in Section 10.2.3. Would extrapolating up from 448 CPUs be any safer?

Answer:
In theory, no, it isn't any safer, and a useful exercise would be to run these programs on larger systems. In practice, there are only a very few systems with more than 448 CPUs, in contrast to the huge number having more than 28 CPUs. This means that although it is dangerous to extrapolate beyond 448 CPUs, there is very little need to do so.
In addition, other testing has shown that RCU read-side primitives offer consistent performance and scalability up to at least 1024 CPUs. However, it is useful to review Figure E.6 and its associated commentary. You see, unlike the 448-CPU system that provided this data, the system enjoying linear scalability up to 1024 CPUs boasted excellent memory bandwidth. q

Quick Quiz 10.10: p.197
How does the code in Listing 10.10 protect against the resizing process progressing past the selected bucket?

Answer:
It does not provide any such protection. That is instead the job of the update-side concurrency-control functions described next. q

Quick Quiz 10.11: p.198
Suppose that one thread is inserting an element into the hash table during a resize operation. What prevents this insertion from being lost due to a subsequent resize operation completing before the insertion does?

Answer:
The second resize operation will not be able to move beyond the bucket into which the insertion is taking place due to the insertion holding the lock(s) on one or both of the hash buckets in the hash tables. Furthermore, the insertion operation takes place within an RCU read-side critical section. As we will see when we examine the hashtab_resize() function, this means that each resize operation uses synchronize_rcu() invocations to wait for the insertion's read-side critical section to complete. q
that readers might miss an element that was previously added during a resize operation?

Answer:
No. As we will see soon, the hashtab_add() and hashtab_del() functions keep the old hash table up-to-date while a resize operation is in progress. q

Quick Quiz 10.13: p.198
The hashtab_add() and hashtab_del() functions in Listing 10.12 can update two hash buckets while a resize operation is progressing. This might cause poor performance if the frequency of resize operations is not negligible. Isn't it possible to reduce the cost of updates in such cases?

Answer:
Yes, at least assuming that a slight increase in the cost of hashtab_lookup() is acceptable. One approach is shown in Listings E.6 and E.7 (hash_resize_s.c).
This version of hashtab_add() adds an element to either the old bucket if it is not resized yet, or to the new bucket if it has been resized, and hashtab_del() removes the specified element from any buckets into which it has been inserted. The hashtab_lookup() function searches the new bucket if the search of the old bucket fails, which has the disadvantage of adding overhead to the lookup fastpath. The alternative hashtab_lock_mod() returns the locking state of the new bucket in ->hbp[0] and ->hls_idx[0] if a resize operation is in progress, instead of the perhaps more natural choice of ->hbp[1] and ->hls_idx[1]. However, this less-natural choice has the advantage of simplifying hashtab_add().
Further analysis of the code is left as an exercise for the reader. q

Listing E.6: Resizable Hash-Table Access Functions (Fewer Updates)
 1 struct ht_elem *
 2 hashtab_lookup(struct hashtab *htp_master, void *key)
 3 {
 4   struct ht *htp;
 5   struct ht_elem *htep;
 6
 7   htp = rcu_dereference(htp_master->ht_cur);
 8   htep = ht_search_bucket(htp, key);
 9   if (htep)
10     return htep;
11   htp = rcu_dereference(htp->ht_new);
12   if (!htp)
13     return NULL;
14   return ht_search_bucket(htp, key);
15 }
16
17 void hashtab_add(struct ht_elem *htep,
18                  struct ht_lock_state *lsp)
19 {
20   struct ht_bucket *htbp = lsp->hbp[0];
21   int i = lsp->hls_idx[0];
22
23   htep->hte_next[!i].prev = NULL;
24   cds_list_add_rcu(&htep->hte_next[i], &htbp->htb_head);
25 }
26
27 void hashtab_del(struct ht_elem *htep,
28                  struct ht_lock_state *lsp)
29 {
30   int i = lsp->hls_idx[0];
31
32   if (htep->hte_next[i].prev) {
33     cds_list_del_rcu(&htep->hte_next[i]);
34     htep->hte_next[i].prev = NULL;
35   }
36   if (lsp->hbp[1] && htep->hte_next[!i].prev) {
37     cds_list_del_rcu(&htep->hte_next[!i]);
38     htep->hte_next[!i].prev = NULL;
39   }
40 }

Quick Quiz 10.14: p.200
In the hashtab_resize() function in Listing 10.13, what guarantees that the update to ->ht_new on line 29 will be seen as happening before the update to ->ht_resize_cur on line 40 from the perspective of hashtab_add() and hashtab_del()? In other words, what prevents hashtab_add() and hashtab_del() from dereferencing a NULL pointer loaded from ->ht_new?

Answer:
The synchronize_rcu() on line 30 of Listing 10.13 ensures that all pre-existing RCU readers have completed between the time that we install the new hash-table reference on line 29 and the time that we update ->ht_resize_cur on line 40. This means that any reader that sees a non-negative value of ->ht_resize_cur cannot have started before the assignment to ->ht_new, and thus must be able to see the reference to the new hash table.
And this is why the update-side hashtab_add() and hashtab_del() functions must be enclosed in RCU read-side critical sections, courtesy of hashtab_lock_mod() and hashtab_unlock_mod() in Listing 10.11. q

Quick Quiz 10.15: p.200
Why is there a WRITE_ONCE() on line 40 in Listing 10.13?

Answer:
Together with the READ_ONCE() on line 16 in hashtab_lock_mod() of Listing 10.11, it tells the compiler that the non-initialization accesses to ->ht_resize_cur must
single CPU, showing that cache capacity is degrading the memory system.
E.11 Validation
11. Do you have a set of test cases in which one of the times has a positive minutes value but a negative seconds value?

12. Do you have a set of test cases in which one of the times omits the "m" or the "s"?

13. Do you have a set of test cases in which one of the times is non-numeric? (For example, "Go Fish".)

14. Do you have a set of test cases in which one of the lines is omitted? (For example, where there is a "real" value and a "sys" value, but no "user" value.)

15. Do you have a set of test cases where one of the lines is duplicated? Or duplicated, but with a different time value for the duplicate?

16. Do you have a set of test cases where a given line has more than one time value? (For example, "real 0m0.132s 0m0.008s".)

17. Do you have a set of test cases containing random characters?

18. In all test cases involving invalid input, did you generate all permutations?

19. For each test case, do you have an expected outcome for that test?

If you did not generate test data for a substantial number of the above cases, you will need to cultivate a more destructive attitude in order to have a chance of generating high-quality tests.
Of course, one way to economize on destructiveness is to generate the tests with the to-be-tested source code at hand, which is called white-box testing (as opposed to black-box testing). However, this is no panacea: You will find that it is all too easy to find your thinking limited by what the program can handle, thus failing to generate truly destructive inputs. q

Quick Quiz 11.4: p.210
You are asking me to do all this validation BS before I even start coding??? That sounds like a great way to never get started!!!

Answer:
If it is your project, for example, a hobby, do what you like. Any time you waste will be your own, and you have no one else to answer to for it. And there is a good chance that the time will not be completely wasted. For example, if you are embarking on a first-of-a-kind project, the requirements are in some sense unknowable anyway. In this case, the best approach might be to quickly prototype a number of rough solutions, try them out, and see what works best.
On the other hand, if you are being paid to produce a system that is broadly similar to existing systems, you owe it to your users, your employer, and your future self to validate early and often. q

Quick Quiz 11.5: p.210
Are you actually suggesting that it is possible to test correctness into software??? Everyone knows that is impossible!!!

Answer:
Please note that the text used the word "validation" rather than the word "testing". The word "validation" includes formal methods as well as testing, for more on which please see Chapter 12.
But as long as we are bringing up things that everyone should know, let's remind ourselves that Darwinian evolution is not about correctness, but rather about survival. As is software. My goal as a developer is not that my software be attractive from a theoretical viewpoint, but rather that it survive whatever its users throw at it.
Although the notion of correctness does have its uses, its fundamental limitation is that the specification against which correctness is judged will also have bugs. This means nothing more nor less than that traditional correctness proofs prove that the code in question contains the intended set of bugs!
Alternative definitions of correctness instead focus on the lack of problematic properties, for example, proving that the software has no use-after-free bugs, no NULL pointer dereferences, no array-out-of-bounds references, and so on. Make no mistake, finding and eliminating such classes of bugs can be highly useful. But the fact remains that the lack of certain classes of bugs does nothing to demonstrate fitness for any specific purpose.
Therefore, usage-driven validation remains critically important.
Besides, it is also impossible to verify correctness into your software, especially given the problematic need to verify both the verifier and the specification. q
Quick Quiz 11.14: p.217
In Eq. 11.6, are the logarithms base-10, base-2, or base-e?

Answer:
It does not matter. You will get the same answer no matter what base of logarithms you use because the result is a pure ratio of logarithms. The only constraint is that you use the same base for both the numerator and the denominator. q

Quick Quiz 11.15: p.218
Suppose that a bug causes a test failure three times per hour on average. How long must the test run error-free to provide 99.9 % confidence that the fix significantly reduced the probability of failure?

Answer:
We set n to 3 and P to 99.9 in Eq. 11.11, resulting in:

T = -\frac{1}{3} \ln \frac{100 - 99.9}{100} = 2.3    (E.9)

If the test runs without failure for 2.3 hours, we can be 99.9 % certain that the fix reduced the probability of failure. q

Quick Quiz 11.16: p.218
Doing the summation of all the factorials and exponentials is a real pain. Isn't there an easier way?

Answer:
One approach is to use the open-source symbolic manipulation program named "maxima". Once you have installed this program, which is a part of many Linux distributions, you can run it and give the load(distrib); command followed by any number of bfloat(cdf_poisson(m,l)); commands, where the m is replaced by the desired value of m (the actual number of failures in the actual test) and the l is replaced by the desired value of λ (the expected number of failures in the actual test).
In particular, the bfloat(cdf_poisson(2,24)); command results in 1.181617112359357b-8, which matches the value given by Eq. 11.13.
Another approach is to recognize that in this real world, it is not all that useful to compute (say) the duration of a test having two or fewer errors that would give a 76.8 % confidence of a 349.2x improvement in reliability. Instead, human beings tend to focus on specific values, for example, a 95 % confidence of a 10x improvement. People also greatly prefer error-free test runs, and so should you because doing so reduces your required test durations. Therefore, it is quite possible that the values in Table E.3 will suffice. Simply look up the desired confidence and degree of improvement, and the resulting number will give you the required error-free test duration in terms of the expected time for a single error to appear. So if your pre-fix testing suffered one failure per hour, and the powers that be require a 95 % confidence of a 10x improvement, you need a 30-hour error-free run.
Alternatively, you can use the rough-and-ready method described in Section 11.6.2. q

Table E.3: Human-Friendly Poisson-Function Display

                         Improvement
  Certainty (%)      Any      10x      100x
  90.0               2.3     23.0     230.0
  95.0               3.0     30.0     300.0
  99.0               4.6     46.1     460.5
  99.9               6.9     69.1     690.7
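Readers without maxima at hand can check that value directly from the Poisson CDF definition. The following standalone program is my own illustrative sketch, not part of the book's CodeSamples:

#include <math.h>
#include <stdio.h>

/* Cumulative Poisson probability of seeing m or fewer events
 * when lambda events are expected. */
static double cdf_poisson(int m, double lambda)
{
	double term = exp(-lambda);   /* i = 0 term */
	double sum = term;
	int i;

	for (i = 1; i <= m; i++) {
		term *= lambda / i;   /* builds lambda^i / i! incrementally */
		sum += term;
	}
	return sum;
}

int main(void)
{
	/* Two or fewer failures when 24 are expected. */
	printf("%.15e\n", cdf_poisson(2, 24.0));   /* prints roughly 1.18e-8 */
	return 0;
}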
Quick Quiz 11.17: p.218
But wait!!! Given that there has to be some number of failures (including the possibility of zero failures), shouldn't Eq. 11.13 approach the value 1 as m goes to infinity?

Answer:
Indeed it should. And it does.
To see this, note that e^{-λ} does not depend on i, which means that it can be pulled out of the summation as follows:

e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^i}{i!}    (E.10)

The remaining summation is exactly the Taylor series for e^{λ}, yielding:

e^{-\lambda} e^{\lambda}    (E.11)

The two exponentials are reciprocals, and therefore cancel, resulting in exactly 1, as required. q

Quick Quiz 11.18: p.219
How is this approach supposed to help if the corruption
Answer:
A huge commit? Shame on you! This is but one reason why you are supposed to keep the commits small.
And that is your answer: Break up the commit into bite-sized pieces and bisect the pieces. In my experience, the act of breaking up the commit is often sufficient to make the bug painfully obvious. q

Quick Quiz 11.20: p.220
Why don't conditional-locking primitives provide this spurious-failure functionality?

Answer:
There are locking algorithms that depend on conditional-locking primitives telling them the truth. For example, if conditional-lock failure signals that some other thread is already working on a given job, spurious failure might cause that job to never get done, possibly resulting in a hang. q

Quick Quiz 11.21: p.222
That is ridiculous!!! After all, isn't getting the correct answer later than one would like better than getting an incorrect answer???

Answer:
Although I do heartily salute your spirit and aspirations, you are forgetting that there may be high costs due to delays in the program's completion. For an extreme example, suppose that a 40 % performance shortfall from a single-threaded application is causing one person to die each day. Suppose further that in a day you could hack together a quick and dirty parallel program that ran 50 % faster on an eight-CPU system than the sequential version, but that an optimal parallel program would require four months of painstaking design, coding, debugging, and tuning.
It is safe to say that more than 100 people would prefer the quick and dirty version. q

Quick Quiz 11.23: p.224
But what about other sources of error, for example, due to interactions between caches and memory layout?

Answer:
Changes in memory layout can indeed result in unrealistic decreases in execution time. For example, suppose that a given microbenchmark almost always overflows the L0 cache's associativity, but with just the right memory layout, it all fits. If this is a real concern, consider running
your microbenchmark using huge pages (or within the kernel or on bare metal) in order to completely control the memory layout.
But note that there are many different possible memory-layout bottlenecks. Benchmarks sensitive to memory bandwidth (such as those involving matrix arithmetic) should spread the running threads across the available cores and sockets to maximize memory parallelism. They should also spread the data across NUMA nodes, memory controllers, and DRAM chips to the extent possible. In contrast, benchmarks sensitive to memory latency (including most poorly scaling applications) should instead maximize locality, filling each core and socket in turn before adding another one. q

Quick Quiz 11.24: p.225
Wouldn't the techniques suggested to isolate the code under test also affect that code's performance, particularly if it is running within a larger application?

Answer:
Indeed it might, although in most microbenchmarking efforts you would extract the code under test from the enclosing application. Nevertheless, if for some reason you must keep the code under test within the application, you will very likely need to use the techniques discussed in Section 11.7.6. q

Quick Quiz 11.25: p.227
This approach is just plain weird! Why not use means and standard deviations, like we were taught in our statistics classes?

Answer:
Because mean and standard deviation were not designed to do this job. To see this, try applying mean and standard deviation to the following data set, given a 1 % relative error in measurement:

49,548.4 49,549.4 49,550.2 49,550.9 49,550.9
49,551.0 49,551.5 49,552.1 49,899.0 49,899.3
49,899.7 49,899.8 49,900.1 49,900.4 52,244.9
53,333.3 53,333.3 53,706.3 53,706.3 54,084.5

The problem is that mean and standard deviation do not rest on any sort of measurement-error assumption, and they will therefore see the difference between the values near 49,500 and those near 49,900 as being statistically significant, when in fact they are well within the bounds of estimated measurement error.
Of course, it is possible to create a script similar to that in Listing 11.2 that uses standard deviation rather than absolute difference to get a similar effect, and this is left as an exercise for the interested reader. Be careful to avoid divide-by-zero errors arising from strings of identical data values! q

Quick Quiz 11.26: p.227
But what if all the y-values in the trusted group of data are exactly zero? Won't that cause the script to reject any non-zero value?

Answer:
Indeed it will! But if your performance measurements often produce a value of exactly zero, perhaps you need to take a closer look at your performance-measurement code.
Note that many approaches based on mean and standard deviation will have similar problems with this sort of dataset. q

E.12 Formal Verification

Quick Quiz 12.1: p.236
Why is there an unreached statement in locker? After all, isn't this a full state-space search?

Answer:
The locker process is an infinite loop, so control never reaches the end of this process. However, since there are no monotonically increasing variables, Promela is able to model this infinite loop with a small number of states. q

Quick Quiz 12.2: p.236
What are some Promela code-style issues with this example?

Answer:
There are several:

1. The declaration of sum should be moved to within the init block, since it is not used anywhere else.

2. The assertion code should be moved outside of the initialization loop. The initialization loop can then be placed in an atomic block, greatly reducing the state space (by how much?).
3. The atomic block covering the assertion code should be extended to include the initialization of sum and j, and also to cover the assertion. This also reduces the state space (again, by how much?). q

Answer:
Because those operations are for the benefit of the assertion only. They are not part of the algorithm itself. There is therefore no harm in marking them atomic, and so marking them greatly reduces the state space that must be searched by the Promela model. q

5. The second updater fetches the value of ctr[1], which is now one.

6. The second updater now incorrectly concludes that it is safe to proceed on the fastpath, despite the fact that the original reader has not yet completed. q

Quick Quiz 12.7: p.241
But different formal-verification tools are often designed to locate particular classes of bugs. For example, very few formal-verification tools will find an error in the specification. So isn't this "clearly untrustworthy" judgment a bit harsh?
Answer:
There is always room for doubt. In this case, it is important to keep in mind that the two proofs of correctness preceded the formalization of real-world memory models, raising the possibility that these two proofs are based on incorrect memory-ordering assumptions. Furthermore, since both proofs were constructed by the same person, it is quite possible that they contain a common error. Again, there is always room for doubt. q

Listing E.8: Spin Output Diff of -DCOLLAPSE and -DMA=88
@@ -1,6 +1,6 @@
 (Spin Version 6.4.6 -- 2 December 2016)
 	+ Partial Order Reduction
-	+ Compression
+	+ Graph Encoding (-DMA=88)

 Full statespace search for:
 	never claim         - (none specified)
@@ -9,27 +9,22 @@
 	invalid end states	+
Answer:
This fails in the presence of NMIs. To see this, suppose an NMI was received just after rcu_irq_enter() incremented rcu_update_flag, but before it incremented dynticks_progress_counter. The instance of rcu_irq_enter() invoked by the NMI would see that the original value of rcu_update_flag was non-zero, and would therefore refrain from incrementing dynticks_progress_counter. This would leave the RCU grace-period machinery no clue that the NMI handler was executing on this CPU, so that any RCU read-side critical sections in the NMI handler would lose their RCU protection.
The possibility of NMI handlers, which, by definition cannot be masked, does complicate this code. q

Quick Quiz 12.11: p.243
But if line 7 finds that we are the outermost interrupt, wouldn't we always need to increment dynticks_progress_counter?

Answer:
Not if we interrupted a running task! In that case, dynticks_progress_counter would have already been incremented by rcu_exit_nohz(), and there would be no need to increment it again. q

Answer:
It probably would be more natural, but we will need this particular order for the liveness checks that we will add later. q

Quick Quiz 12.15: p.246
Wait a minute! In the Linux kernel, both dynticks_progress_counter and rcu_dyntick_snapshot are per-CPU variables. So why are they instead being modeled as single global variables?

Answer:
Because the grace-period code processes each CPU's dynticks_progress_counter and rcu_dyntick_snapshot variables separately, we can collapse the state onto a single CPU. If the grace-period code were instead to do something special given specific values on specific CPUs, then we would indeed need to model multiple CPUs. But fortunately, we can safely confine ourselves to two CPUs, the one running the grace-period processing and the one entering and leaving dynticks-idle mode. q

Quick Quiz 12.16: p.246
Given there are a pair of back-to-back changes to grace_period_state on lines 25 and 26, how can we be sure that line 25's changes won't be lost?
entering an IRQ or NMI handler, and setting it upon exit?

Answer:
Although this approach would be functionally correct, it would result in excessive IRQ entry/exit overhead on large machines. In contrast, the approach laid out in this section allows each CPU to touch only per-CPU data on IRQ and NMI entry/exit, resulting in much lower IRQ entry/exit overhead, especially on large machines. q

Quick Quiz 12.24: p.257
But x86 has strong memory ordering, so why formalize its memory model?

Answer:
Actually, academics consider the x86 memory model to be weak because it can allow prior stores to be reordered with subsequent loads. From an academic viewpoint, a strong memory model is one that allows absolutely no reordering, so that all threads agree on the order of all operations visible to them.
Plus it really is the case that developers are sometimes confused about x86 memory ordering. q

Quick Quiz 12.25: p.257
Why does line 8 of Listing 12.23 initialize the registers? Why not instead initialize them on lines 4 and 5?

Answer:
Either way works. However, in general, it is better to use initialization than explicit instructions. The explicit instructions are used in this example to demonstrate their use. In addition, many of the litmus tests available on the tool's web site (https://www.cl.cam.ac.uk/~pes20/ppcmem/) were automatically generated, and that generation process emits explicit initialization instructions. q

terminating the model with the initial value of 2 in P0's r3 register, which will not trigger the exists assertion. There is some debate about whether this trick is universally applicable, but I have not seen an example where it fails. q

Quick Quiz 12.27: p.259
Does the Arm Linux kernel have a similar bug?

Answer:
Arm does not have this particular bug because it places smp_mb() before and after the atomic_add_return() function's assembly-language implementation. PowerPC no longer has this bug; it has long since been fixed [Her11]. q

Quick Quiz 12.28: p.259
Does the lwsync on line 10 in Listing 12.23 provide sufficient ordering?

Answer:
It depends on the semantics required. The rest of this answer assumes that the assembly language for P0 in Listing 12.23 is supposed to implement a value-returning atomic operation.
As is discussed in Chapter 15, the Linux kernel's memory consistency model requires value-returning atomic RMW operations to be fully ordered on both sides. The ordering provided by lwsync is insufficient for this purpose, and so sync should be used instead. This change has since been made [Fen15] in response to an email thread discussing a couple of other litmus tests [McK15g]. Finding any other bugs that the Linux kernel might have is left as an exercise for the reader.
In other environments providing weaker semantics, lwsync might be sufficient. But not for the Linux kernel's value-returning atomic operations! q
                 filter               exists
             cmpxchg     xchg     cmpxchg      xchg
2    0.004     0.022    0.027       0.039     0.058
3    0.041     0.743    0.968       1.653     3.203
4    0.374    59.565   74.818     151.962   500.960
5    4.905

Answer:
Yes, it usually is a critical bug. However, in this case, the updater has been cleverly constructed to properly handle such pointer leaks. But please don't make a habit of doing this sort of thing, and especially don't do this without having put a lot of thought into making some more conventional approach work. q
At one time, Paul E. McKenney felt that Linux-kernel RCU was immune to such exploits, but the advent of Row Hammer showed him otherwise. After all, if the black hats can hit the system's DRAM, they can hit any and all low-level software, even including RCU.
And in 2018, this possibility passed from the realm of theoretical speculation into the hard and fast realm of objective reality [McK19a]. q

Quick Quiz 12.36: p.265
In light of the full verification of the L4 microkernel, isn't this limited view of formal verification just a little bit obsolete?

Answer:
Unfortunately, no.
The first full verification of the L4 microkernel was a tour de force, with a large number of Ph.D. students hand-verifying code at a very slow per-student rate. This level of effort could not be applied to most software projects because the rate of change is just too great. Furthermore, although the L4 microkernel is a large software artifact from the viewpoint of formal verification, it is tiny compared to a great number of projects, including LLVM, GCC, the Linux kernel, Hadoop, MongoDB, and a great many others. In addition, this verification did have limits, as the researchers freely admit, to their credit: https://docs.sel4.systems/projects/sel4/frequently-asked-questions.html#does-sel4-have-zero-bugs.
Although formal verification is finally starting to show some promise, including more-recent L4 verifications involving greater levels of automation, it currently has no chance of completely displacing testing in the foreseeable future. And although I would dearly love to be proven wrong on this point, please note that such proof will be in the form of a real tool that verifies real software, not in the form of a large body of rousing rhetoric.
Perhaps someday formal verification will be used heavily for validation, including for what is now known as regression testing. Section 17.4 looks at what would be required to make this possibility a reality. q

E.13 Putting It All Together

Quick Quiz 13.1: p.270
Why not implement reference-acquisition using a simple compare-and-swap operation that only acquires a reference if the reference counter is non-zero?

Answer:
Although this can resolve the race between the release of the last reference and acquisition of a new reference, it does absolutely nothing to prevent the data structure from being freed and reallocated, possibly as some completely different type of structure. It is quite likely that the "simple compare-and-swap operation" would give undefined results if applied to the differently typed structure.
In short, use of atomic operations such as compare-and-swap absolutely requires either type-safety or existence guarantees.
But what if it is absolutely necessary to let the type change?
One approach is for each such type to have the reference counter at the same location, so that as long as the reallocation results in an object from this group of types, all is well. If you do this in C, make sure you comment the reference counter in each structure in which it appears. In C++, use inheritance and templates. q
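For concreteness, the operation that this quiz asks about might be sketched as follows. This is a hypothetical helper (the name and the use of Linux-kernel atomics are assumptions, similar in spirit to the kernel's kref_get_unless_zero()), and, as the answer emphasizes, it is safe only when type-safety or existence guarantees prevent the object from being freed and reallocated out from under the loop:

/* Hypothetical sketch: acquire a reference only if the counter is non-zero.
 * Returns true on success.  Safe only given type-safety or existence
 * guarantees, as discussed in the answer above. */
static bool get_ref_unless_zero(atomic_t *refcnt)
{
	int old = atomic_read(refcnt);

	do {
		if (old == 0)
			return false;	/* Last reference already released. */
	} while (!atomic_try_cmpxchg(refcnt, &old, old + 1));
	return true;
}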
Quick Quiz 13.2: p.272
Why isn't it necessary to guard against cases where one CPU acquires a reference just after another CPU releases the last reference?

Answer:
Because a CPU must already hold a reference in order to legally acquire another reference. Therefore, if one CPU releases the last reference, there had better not be any CPU acquiring a new reference! q

Quick Quiz 13.3: p.272
Suppose that just after the atomic_sub_and_test() on line 22 of Listing 13.2 is invoked, that some other CPU invokes kref_get(). Doesn't this result in that other CPU now having an illegal reference to a released object?

Answer:
This cannot happen if these functions are used correctly. It is illegal to invoke kref_get() unless you already
Answer:
Yes, and one way to do this would be to use per-hash-chain locks. The updater could acquire lock(s) corresponding to both the old and the new element, acquiring them in address order. In this case, the insertion and removal operations would of course need to refrain from acquiring and releasing these same per-hash-chain locks. This complexity can be worthwhile if rename operations are frequent, and of course can allow rename operations to execute concurrently. q

Answer:
Suppose that the "if" condition completed, finding the reference counter value equal to one. Suppose that a release operation executes, decrementing the reference counter to zero and therefore starting cleanup operations. But now the "then" clause can increment the counter back to a value of one, allowing the object to be used after it has been cleaned up.
This use-after-cleanup bug is every bit as bad as a full-fledged use-after-free bug. q
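For illustration, the broken check-then-increment sequence that this answer describes might look as follows. This is a hypothetical fragment using Linux-kernel atomics; the structure and field names are assumptions:

/* BROKEN: the check and the increment are not a single atomic step. */
if (atomic_read(&p->refcnt) == 1) {	/* "if" condition sees the value one. */
	/* A concurrent release may now drop refcnt to zero and start cleanup. */
	atomic_inc(&p->refcnt);		/* "then" clause resurrects the counter... */
	do_something_with(p);		/* ...resulting in use after cleanup. */
}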
v2022.09.25a
E.13. PUTTING IT ALL TOGETHER 537
Listing E.9: Localized Correlated Measurement Fields
1 struct measurement {
2 	double meas_1;
3 	double meas_2;
4 	double meas_3;
5 };
6
7 struct animal {
8 	char name[40];
9 	double age;
10 	struct measurement *mp;
11 	struct measurement meas;
12 	char photo[0]; /* large bitmap. */
13 };

Quick Quiz 13.10: p.277
Why on earth did we need that global lock in the first place?

Answer:
A given thread's __thread variables vanish when that thread exits. It is therefore necessary to synchronize any operation that accesses other threads' __thread variables with thread exit. Without such synchronization, accesses to the __thread variables of a just-exited thread will result in segmentation faults. q
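One way to arrange that synchronization is sketched below. This is a hypothetical userspace fragment (names such as counterp[] and final_lock are assumptions, not the listing that the quiz refers to) in which readers sum other threads' __thread counters under the global lock, and an exiting thread folds its count into a global total under that same lock before its __thread storage vanishes:

#include <pthread.h>

#define NR_THREADS 128

__thread unsigned long counter;		/* Vanishes when its thread exits. */
unsigned long *counterp[NR_THREADS];	/* Pointers to the threads' counters. */
unsigned long finalcount;		/* Counts from already-exited threads. */
pthread_mutex_t final_lock = PTHREAD_MUTEX_INITIALIZER;	/* The global lock. */

unsigned long read_count(void)
{
	unsigned long sum;
	int t;

	pthread_mutex_lock(&final_lock);	/* Exclude concurrent thread exit. */
	sum = finalcount;
	for (t = 0; t < NR_THREADS; t++)
		if (counterp[t] != NULL)
			sum += *counterp[t];
	pthread_mutex_unlock(&final_lock);
	return sum;
}

void count_unregister_thread(int idx)		/* Call just before thread exit. */
{
	pthread_mutex_lock(&final_lock);
	finalcount += counter;			/* Preserve this thread's count... */
	counterp[idx] = NULL;			/* ...then forget the dying pointer. */
	pthread_mutex_unlock(&final_lock);
}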
2. Use rcu_assign_pointer() to point ->mp to this new structure.

3. Wait for a grace period to elapse, for example using either synchronize_rcu() or call_rcu().

4. Copy the measurements from the new measurement structure into the embedded ->meas field.

5. Use rcu_assign_pointer() to point ->mp back to the old embedded ->meas field.

6. After another grace period elapses, free up the new measurement structure.

This approach uses a heavier weight update procedure to eliminate the extra cache miss in the common case. The extra cache miss will be incurred only while an update is actually in progress. q
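Assuming a first step (not shown in the extract above) that allocates and fills in the new measurement structure, the procedure might be sketched as follows. This is a hypothetical rendering of the steps above, and it assumes that readers access the measurements only via rcu_dereference() of the ->mp field within RCU read-side critical sections:

/* Hypothetical sketch of the update steps listed above. */
void animal_update_measurements(struct animal *ap, struct measurement *new_mp)
{
	/* Step 1 (assumed): new_mp has been allocated and filled in. */
	rcu_assign_pointer(ap->mp, new_mp);	/* Step 2: publish the new structure. */
	synchronize_rcu();			/* Step 3: wait out pre-existing readers. */
	ap->meas = *new_mp;			/* Step 4: no reader uses ->meas now. */
	rcu_assign_pointer(ap->mp, &ap->meas);	/* Step 5: point back to ->meas. */
	synchronize_rcu();			/* Step 6: wait again... */
	free(new_mp);				/* ...then free the new structure. */
}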
Quick Quiz 13.15: p.280
But how does this scan work while a resizable hash table is being resized? In that case, neither the old nor the new hash table is guaranteed to contain all the elements in the hash table!

Answer:
True, resizable hash tables as described in Section 10.4 cannot be fully scanned while being resized. One simple way around this is to acquire the hashtab structure's ->ht_lock while scanning, but this prevents more than one scan from proceeding concurrently.
Another approach is for updates to mutate the old hash table as well as the new one while resizing is in progress. This would allow scans to find all elements in the old hash table. Implementing this is left as an exercise for the reader. q

Quick Quiz 13.16: p.283
But how would this work with a resizable hash table, such as the one described in Section 10.4?

Answer:
In this case, more care is required because the hash table might well be resized during the time that we momentarily exited the RCU read-side critical section. Worse yet, the resize operation can be expected to free the old hash buckets, leaving us pointing to the freelist.
But it is not sufficient to prevent the old hash buckets from being freed. It is also necessary to ensure that those buckets continue to be updated.
One way to handle this is to have a reference count on each set of buckets, which is initially set to the value one. A full-table scan would acquire a reference at the beginning of the scan (but only if the reference is non-zero) and release it at the end of the scan. The resizing would populate the new buckets, release the reference, wait for a grace period, and then wait for the reference to go to zero. Once the reference was zero, the resizing could let updaters forget about the old hash buckets and then free it.
Actual implementation is left to the interested reader, who will gain much insight from this task. q

E.14 Advanced Synchronization

Quick Quiz 14.1: p.286
Given that there will always be a sharply limited number of CPUs available, is population obliviousness really useful?

Answer:
Given the surprisingly limited scalability of any number of NBS algorithms, population obliviousness can be surprisingly useful. Nevertheless, the overall point of the question is valid. It is not normally helpful for an algorithm to scale beyond the size of the largest system it is ever going to run on. q

Quick Quiz 14.2: p.287
Wait! In order to dequeue all elements, both the ->head and ->tail pointers must be changed, which cannot be done atomically on typical computer systems. So how is this supposed to work???

Answer:
One pointer at a time!
First, atomically exchange the ->head pointer with NULL. If the return value from the atomic exchange operation is NULL, the queue was empty and you are done. And if someone else attempts a dequeue-all at this point, they will get back a NULL pointer.
Otherwise, atomically exchange the ->tail pointer with a pointer to the now-NULL ->head pointer. The return value from the atomic exchange operation is a pointer to the ->next field of the eventual last element on the list.
Producing and testing actual code is left as an exercise for the interested and enthusiastic reader, as are strategies for handling half-enqueued elements. q
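Although the answer leaves the actual code as an exercise, a rough userspace sketch of the two exchanges (using GCC's __atomic builtins; the structure layout and names are assumptions) might look as follows. Handling of half-enqueued elements is deliberately omitted, as noted in the comments:

struct qnode {
	struct qnode *next;
	/* Payload would go here. */
};

struct dq {
	struct qnode *head;	/* First element, or NULL when empty. */
	struct qnode **tail;	/* &head when empty, else &last->next. */
};

/* Enqueue: link in the new node, then publish it through the tail pointer. */
void dq_enqueue(struct dq *q, struct qnode *n)
{
	struct qnode **prev;

	n->next = NULL;
	prev = __atomic_exchange_n(&q->tail, &n->next, __ATOMIC_ACQ_REL);
	__atomic_store_n(prev, n, __ATOMIC_RELEASE);
}

/* Dequeue all elements: one pointer at a time, as described above. */
struct qnode *dq_dequeue_all(struct dq *q)
{
	struct qnode *list;

	list = __atomic_exchange_n(&q->head, NULL, __ATOMIC_ACQ_REL);
	if (list == NULL)
		return NULL;	/* Queue was empty, or another dequeue-all won. */
	/* Point ->tail back at the now-NULL ->head pointer.  The old ->tail
	 * value references the ->next field of the eventual last element; a
	 * concurrent enqueue might not have filled that field in yet, so
	 * production code must handle such half-enqueued elements. */
	(void)__atomic_exchange_n(&q->tail, &q->head, __ATOMIC_ACQ_REL);
	return list;
}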
Answer:
That won't help unless the more-modern languages proponents are energetic enough to write their own compiler backends. The usual practice of re-using existing backends also reuses charming properties such as refusal to support pointers to lifetime-ended objects. q

Quick Quiz 14.4: p.289
Why does anyone care about demonic schedulers?

Answer:
A demonic scheduler is one way to model an insanely overloaded system. After all, if you have an algorithm that you can prove runs reasonably given a demonic scheduler, mere overload should be no problem, right?
On the other hand, it is only reasonable to ask if a demonic scheduler is really the best way to model overload conditions. And perhaps it is time for more accurate models. For one thing, a system might be overloaded in any of a number of ways. After all, an NBS algorithm that works fine on a demonic scheduler might or might not do well in out-of-memory conditions, when mass storage fills, or when the network is congested.
Except that systems' core counts have been increasing, which means that an overloaded system is quite likely to be running more than one concurrent program.12 In that case, even if a demonic scheduler is not so demonic as to inject idle cycles while there are runnable tasks, it is easy to imagine such a scheduler consistently favoring the other program over yours. If both programs could consume all available CPU, then this scheduler might not run your program at all.
One way to avoid these issues is to simply avoid overload conditions. This is often the preferred approach in production, where load balancers direct traffic away from overloaded systems. And if all systems are overloaded, it is not unheard of to simply shed load, that is, to drop the low-priority incoming requests. Nor is this approach limited to computing, as those who have suffered through a rolling blackout can attest. But load-shedding is often considered a bad thing by those whose load is being shed. As always, choose wisely! q

12 As a point of reference, back in the mid-1990s, Paul witnessed a 16-CPU system running about 20 instances of a certain high-end proprietary database.

Answer:
One advantage of the members of the NBS hierarchy is that they are reasonably simple to define and use from a theoretical viewpoint. We can hope that work done in the NBS arena will help lay the groundwork for analysis of real-world forward-progress guarantees for concurrent real-time programs. However, as of 2022 it appears that trace-based methodologies are in the lead [dOCdO19].
So why bother learning about NBS at all?
Because a great many people know of it, and are vaguely aware that it is somehow related to real-time computing. Their response to your carefully designed real-time constraints might well be of the form "Bah, just use wait-free algorithms!". In the all-too-common case where they are very convincing to your management, you will need to understand NBS in order to bring the discussion back to reality. I hope that this section has provided you with the required depth of understanding.
Another thing to note is that learning about the NBS hierarchy is probably no more harmful than learning about transfinite numbers or the computational-complexity hierarchy. In all three cases, it is important to avoid over-applying the theory. Which is in and of itself good practice! q

Quick Quiz 14.6: p.294
But what about battery-powered systems? They don't require energy flowing into the system as a whole.

Answer:
Sooner or later, the battery must be recharged, which requires energy to flow into the system. q

Quick Quiz 14.7: p.295
But given the results from queueing theory, won't low utilization merely improve the average response time rather than improving the worst-case response time? And isn't worst-case response time all that most real-time systems really care about?

Answer:
Yes, but . . .
Those queueing-theory results assume infinite "calling populations", which in the Linux kernel might correspond to an infinite number of tasks. As of early 2021, no real
Answer:
That is a real problem, and it is solved in RCU's scheduler hook. If that scheduler hook sees that the value of t->rcu_read_lock_nesting is negative, it invokes rcu_read_unlock_special() if needed before allowing the context switch to complete. q

Quick Quiz 14.12: p.309
But isn't correct operation despite fail-stop bugs a valuable fault-tolerance property?

Answer:
Yes and no.
Yes in that non-blocking algorithms can provide fault tolerance in the face of fail-stop bugs, but no in that this is grossly insufficient for practical fault tolerance. For example, suppose you had a wait-free queue, and further suppose that a thread has just dequeued an element. If that thread now succumbs to a fail-stop bug, the element it has just dequeued is effectively lost. True fault tolerance requires way more than mere non-blocking properties, and is beyond the scope of this book. q

Quick Quiz 14.13: p.309
I couldn't help but spot the word "includes" before this list. Are there other constraints?

Answer:
Indeed there are, and lots of them. However, they tend to be specific to a given situation, and many of them can be thought of as refinements of some of the constraints listed above. For example, the many constraints on choices of data structure will help meet the "Bounded time spent in any given critical section" constraint. q

Quick Quiz 14.14: p.310
Given that real-time systems are often used for safety-critical applications, and given that runtime memory allocation is forbidden in many safety-critical situations, what is with the call to malloc()???

Answer:
In early 2016, projects forbidding runtime memory allocation were also not at all interested in multithreaded computing. So the runtime memory allocation is not an additional obstacle to safety criticality.
However, by 2020 runtime memory allocation in multicore real-time systems was gaining some traction. q

Quick Quiz 14.15: p.311
Don't you need some kind of synchronization to protect update_cal()?

Answer:
Indeed you do, and you could use any of a number of techniques discussed earlier in this book. One of those techniques is use of a single updater thread, which would result in exactly the code shown in update_cal() in Listing 14.6. q

E.15 Advanced Synchronization: Memory Ordering

Quick Quiz 15.1: p.313
This chapter has been rewritten since the first edition. Did memory ordering change all that since 2014?

Answer:
The earlier memory-ordering section had its roots in a pair of Linux Journal articles [McK05a, McK05b] dating back to 2005. Since then, the C and C++ memory models [Bec11] have been formalized (and critiqued [BS14, BD14, VBC+15, BMN+15, LVK+17, BGV17]), executable formal memory models for computer systems have become the norm [MSS12, McK11d, SSA+11, AMP+11, AKNT13, AKT13, AMT14, MS14, FSP+17, ARM17], and there is even a memory model for the Linux kernel [AMM+17a, AMM+17b, AMM+18], along with a paper describing differences between the C11 and Linux memory models [MWPF18].
Given all this progress since 2005, it was high time for a full rewrite! q

Quick Quiz 15.2: p.313
The compiler can also reorder Thread P0()'s and Thread P1()'s memory accesses in Listing 15.1, right?

Answer:
In general, compiler optimizations carry out more extensive and profound reorderings than CPUs can. However, in this case, the volatile accesses in READ_ONCE() and WRITE_ONCE() prevent the compiler from reordering. And also from doing much else as well, so the examples in this section will be making heavy use of READ_ONCE()
and WRITE_ONCE(). See Section 15.3 for more detail on the need for READ_ONCE() and WRITE_ONCE(). q

Quick Quiz 15.3: p.315
But wait!!! On row 2 of Table 15.1 both x0 and x1 each have two values at the same time, namely zero and two. How can that possibly work???

Answer:
There is an underlying cache-coherence protocol that straightens things out, which is discussed in Appendix C.2. But if you think that a given variable having two values at the same time is surprising, just wait until you get to Section 15.2.1! q

Quick Quiz 15.4: p.315
But don't the values also need to be flushed from the cache to main memory?

Answer:
Perhaps surprisingly, not necessarily! On some systems, if the two variables are being used heavily, they might be bounced back and forth between the CPUs' caches and never land in main memory. q

Quick Quiz 15.5: p.316
The rows in Table 15.3 seem quite random and confused. Whatever is the conceptual basis of this table???

Answer:
The rows correspond roughly to hardware mechanisms of increasing power and overhead.
The WRITE_ONCE() row captures the fact that accesses to a single variable are always fully ordered, as indicated by the "SV" column. Note that all other operations providing ordering against accesses to multiple variables also provide this same-variable ordering.
The READ_ONCE() row captures the fact that (as of 2021) compilers and CPUs do not indulge in user-visible speculative stores, so that any store whose address, data, or execution depends on a prior load is guaranteed to happen after that load completes. However, this guarantee assumes that these dependencies have been constructed carefully, as described in Sections 15.3.2 and 15.3.3.
The "_relaxed() RMW operation" row captures the fact that a value-returning _relaxed() RMW has done a load and a store, which are every bit as good as a READ_ONCE() and a WRITE_ONCE(), respectively.
The *_dereference() row captures the address and data dependency ordering provided by rcu_dereference() and friends. Again, these dependencies must be constructed carefully, as described in Section 15.3.2.
The "Successful *_acquire()" row captures the fact that many CPUs have special "acquire" forms of loads and of atomic RMW instructions, and that many other CPUs have lightweight memory-barrier instructions that order prior loads against subsequent loads and stores.
The "Successful *_release()" row captures the fact that many CPUs have special "release" forms of stores and of atomic RMW instructions, and that many other CPUs have lightweight memory-barrier instructions that order prior loads and stores against subsequent stores.
The smp_rmb() row captures the fact that many CPUs have lightweight memory-barrier instructions that order prior loads against subsequent loads. Similarly, the smp_wmb() row captures the fact that many CPUs have lightweight memory-barrier instructions that order prior stores against subsequent stores.
None of the ordering operations thus far require prior stores to be ordered against subsequent loads, which means that these operations need not interfere with store buffers, whose main purpose in life is in fact to reorder prior stores against subsequent loads. The lightweight nature of these operations is precisely due to their policy of store-buffer non-interference. However, as noted earlier, it is sometimes necessary to interfere with the store buffer in order to prevent prior stores from being reordered against later stores, which brings us to the remaining rows in this table.
The smp_mb() row corresponds to the full memory barrier available on most platforms, with Itanium being the exception that proves the rule. However, even on Itanium, smp_mb() provides full ordering with respect to READ_ONCE() and WRITE_ONCE(), as discussed in Section 15.5.4.
The "Successful full-strength non-void RMW" row captures the fact that on some platforms (such as x86) atomic RMW instructions provide full ordering both before and after. The Linux kernel therefore requires that full-strength non-void atomic RMW operations provide full ordering in cases where these operations succeed. (Full-strength atomic RMW operations' names do not end in _relaxed, _acquire, or _release.) As noted earlier, the case where these operations do not succeed is covered by the "_relaxed() RMW operation" row.
However, the Linux kernel does not require that either void or _relaxed() atomic RMW operations provide any ordering whatsoever, with the canonical example being atomic_inc(). Therefore, these operations, along with failing non-void atomic RMW operations, may be preceded by smp_mb__before_atomic() and followed by smp_mb__after_atomic() to provide full ordering for any accesses preceding or following both. No ordering need be provided for accesses between the smp_mb__before_atomic() (or, similarly, the smp_mb__after_atomic()) and the atomic RMW operation, as indicated by the "a" entries on the smp_mb__before_atomic() and smp_mb__after_atomic() rows of the table.
In short, the structure of this table is dictated by the properties of the underlying hardware, which are constrained by nothing other than the laws of physics, which were covered back in Chapter 3. That is, the table is not random, although it is quite possible that you are confused. q
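As a hypothetical kernel-style illustration of the preceding paragraph (the variables are assumptions, not drawn from Table 15.3 itself):

static atomic_t cnt;
static int x, y;

void ordered_void_rmw_example(void)
{
	int r1;

	WRITE_ONCE(x, 1);
	smp_mb__before_atomic();	/* Order the prior WRITE_ONCE()... */
	atomic_inc(&cnt);		/* ...around this void RMW, which by
					 * itself provides no ordering... */
	smp_mb__after_atomic();		/* ...and against the READ_ONCE() below. */
	r1 = READ_ONCE(y);
	(void)r1;
}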
Quick Quiz 15.6: p.318
Why is Table 15.3 missing smp_mb__after_unlock_lock() and smp_mb__after_spinlock()?

Answer:
These two primitives are rather specialized, and at present seem difficult to fit into Table 15.3. The smp_mb__after_unlock_lock() primitive is intended to be placed immediately after a lock acquisition, and ensures that all CPUs see all accesses in prior critical sections as happening before all accesses following the smp_mb__after_unlock_lock() and also before all accesses in later critical sections. Here "all CPUs" includes those CPUs not holding that lock, and "prior critical sections" includes all prior critical sections for the lock in question as well as all prior critical sections for all other locks that were released by the same CPU that executed the smp_mb__after_unlock_lock().
The smp_mb__after_spinlock() provides the same guarantees as does smp_mb__after_unlock_lock(), but also provides additional visibility guarantees for other accesses performed by the CPU that executed the smp_mb__after_spinlock(). Given any store S performed prior to any earlier lock acquisition and any load L performed after the smp_mb__after_spinlock(), all CPUs will see S as happening before L. In other words, if a CPU performs a store S, acquires a lock, executes an smp_mb__after_spinlock(), then performs a load L, all CPUs will see S as happening before L. q

Quick Quiz 15.7: p.318
But how can I know that a given project can be designed and coded within the confines of these rules of thumb?

Answer:
Much of the purpose of the remainder of this chapter is to answer exactly that question! q

Quick Quiz 15.8: p.318
How can you tell which memory barriers are strong enough for a given use case?

Answer:
Ah, that is a deep question whose answer requires most of the rest of this chapter. But the short answer is that smp_mb() is almost always strong enough, albeit at some cost. q

Quick Quiz 15.9: p.319
Wait!!! Where do I find this tooling that automatically analyzes litmus tests???

Answer:
Get version v4.17 (or later) of the Linux-kernel source code, then follow the instructions in tools/memory-model/README to install the needed tools. Then follow the further instructions to run these tools on the litmus test of your choice. q

Quick Quiz 15.10: p.319
What assumption is the code fragment in Listing 15.3 making that might not be valid on real hardware?

Answer:
The code assumes that as soon as a given CPU stops seeing its own value, it will immediately see the final agreed-upon value. On real hardware, some of the CPUs might well see several intermediate results before converging on the final value. The actual code used to produce the data in the figures discussed later in this section was therefore somewhat more complex. q

Quick Quiz 15.11: p.320
How could CPUs possibly have different views of the value of a single variable at the same time?

Answer:
As discussed in Section 15.1.1, many CPUs have store
buffers that record the values of recent stores, which do not become globally visible until the corresponding cache line makes its way to the CPU. Therefore, it is quite possible for each CPU to see its own value for a given variable (in its own store buffer) at a single point in time—and for main memory to hold yet another value. One of the reasons that memory barriers were invented was to allow software to deal gracefully with situations like this one.
Fortunately, software rarely cares about the fact that multiple CPUs might see multiple values for the same variable. q

mance price of unnecessary smp_rmb() and smp_wmb() invocations? Shouldn't weakly ordered systems shoulder the full cost of their misordering choices???

Answer:
That is in fact exactly what happens. On strongly ordered systems, smp_rmb() and smp_wmb() emit no instructions, but instead just constrain the compiler. Thus, in this case, weakly ordered systems do in fact shoulder the full cost of their memory-ordering choices. q
the required memory barriers supplied by READ_ONCE() (DEC Alpha in Linux kernel v4.15 and later), or supplied by rcu_dereference() (DEC Alpha in Linux kernel v4.14 and earlier). q

Quick Quiz 15.16: p.324
SB, MP, LB, and now S. Where do all these litmus-test abbreviations come from and how can anyone keep track of them?

Answer:
The best scorecard is the infamous test6.pdf [SSA+11]. Unfortunately, not all of the abbreviations have catchy expansions like SB (store buffering), MP (message passing), and LB (load buffering), but at least the list of abbreviations is readily available. q

Quick Quiz 15.17: p.325
But wait!!! Line 17 of Listing 15.12 uses READ_ONCE(), which marks the load as volatile, which means that the compiler absolutely must emit the load instruction even if the value is later multiplied by zero. So how can the compiler possibly break this data dependency?

Listing E.10: Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic
1 C C-MP-OMCA+o-o-o+o-rmb-o
2
3 {}
4
5 P0(int *x, int *y)
6 {
7 	int r0;
8
9 	WRITE_ONCE(*x, 1);
10 	r0 = READ_ONCE(*x);
11 	WRITE_ONCE(*y, r0);
12 }
13
14 P1(int *x, int *y)
15 {
16 	int r1;
17 	int r2;
18
19 	r1 = READ_ONCE(*y);
20 	smp_rmb();
21 	r2 = READ_ONCE(*x);
22 }
23
24 exists (1:r1=1 /\ 1:r2=0)

Answer:
Yes, it would. Feel free to modify the exists clause to check for that outcome and see what happens. q
Quick Quiz 15.21: p.328
Then who would even think of designing a system with shared store buffers???

Answer:
This is in fact a very natural design for any system having multiple hardware threads per core. Natural from a hardware point of view, that is! q

Quick Quiz 15.22: p.328
But just how is it fair that P0() and P1() must share a store buffer and a cache, but P2() gets one each of its very own???

Answer:
Presumably there is a P3(), as is in fact shown in Figure 15.8, that shares P2()'s store buffer and cache. But not necessarily. Some platforms allow different cores to disable different numbers of threads, allowing the hardware to adjust to the needs of the workload at hand. For example, a single-threaded critical-path portion of the workload might be assigned to a core with only one thread enabled, thus allowing the single thread running that portion of the workload to use the entire capabilities of that core. Other more highly parallel but cache-miss-prone portions of the workload might be assigned to cores with all hardware threads enabled to provide improved throughput. This improved throughput could be due to the fact that while one hardware thread is stalled on a cache miss, the other hardware threads can make forward progress.
In such cases, performance requirements override quaint human notions of fairness. q

Quick Quiz 15.23: p.328
Referring to Table 15.4, why on earth would P0()'s store take so long to complete when P1()'s store completes so quickly? In other words, does the exists clause on line 28 of Listing 15.16 really trigger on real systems?

Answer:
You need to face the fact that it really can trigger. Akira Yokosawa used the litmus7 tool to run this litmus test on a POWER8 system. Out of 1,000,000,000 runs, 4 triggered the exists clause. Thus, triggering the exists clause is not merely a one-in-a-million occurrence, but rather a one-in-a-hundred-million occurrence. But it nevertheless really does trigger on real systems. q

Listing E.11: R Litmus Test With Write Memory Barrier (No Ordering)
1 C C-R+o-wmb-o+o-mb-o
2
3 {}
4
5 P0(int *x0, int *x1)
6 {
7 	WRITE_ONCE(*x0, 1);
8 	smp_wmb();
9 	WRITE_ONCE(*x1, 1);
10 }
11
12 P1(int *x0, int *x1)
13 {
14 	int r2;
15
16 	WRITE_ONCE(*x1, 2);
17 	smp_mb();
18 	r2 = READ_ONCE(*x0);
19 }
20
21 exists (1:r2=0 /\ x1=2)

Quick Quiz 15.24: p.330
But it is not necessary to worry about propagation unless there are at least three threads in the litmus test, right?

Answer:
Wrong.
Listing E.11 (C-R+o-wmb-o+o-mb-o.litmus) shows a two-thread litmus test that requires propagation due to the fact that it only has store-to-store and load-to-store links between its pair of threads. Even though P0() is fully ordered by the smp_wmb() and P1() is fully ordered by the smp_mb(), the counter-temporal nature of the links means that the exists clause on line 21 really can trigger. To prevent this triggering, the smp_wmb() on line 8 must become an smp_mb(), bringing propagation into play twice, once for each non-temporal link. q

Quick Quiz 15.25: p.330
But given that smp_mb() has the propagation property, why doesn't the smp_mb() on line 25 of Listing 15.18 prevent the exists clause from triggering?

Answer:
As a rough rule of thumb, the smp_mb() barrier's propagation property is sufficient to maintain ordering through only one load-to-store link between processes. Unfortunately, Listing 15.18 has not one but two load-to-store links, with the first being from the READ_ONCE() on line 17 to the WRITE_ONCE() on line 24 and the second being from the READ_ONCE() on line 26 to the WRITE_ONCE() on line 7.
Therefore, preventing the exists clause from triggering should be expected to require not one but two instances

Listing E.12: 2+2W Litmus Test (No Ordering)
1 C C-2+2W+o-o+o-o
2
3 {}
4
5 P0(int *x0, int *x1)
6 {
7 	WRITE_ONCE(*x0, 1);
8 	WRITE_ONCE(*x1, 2);
9 }
10
11 P1(int *x0, int *x1)
12 {
13 	WRITE_ONCE(*x1, 1);
14 	WRITE_ONCE(*x0, 2);
15 }
16
17 exists (x0=1 /\ x1=1)

Answer:
This litmus test is indeed a very interesting curiosity. Its ordering apparently occurs naturally given typical weakly ordered hardware design, which would normally be considered a great gift from the relevant laws of physics and cache-coherency-protocol mathematics.
Unfortunately, no one has been able to come up with a software use case for this gift that does not have a much better alternative implementation. Therefore, neither the C11 nor the Linux kernel memory models provide any guarantee corresponding to Listing 15.20. This means that the exists clause on line 19 can trigger.
Of course, without the barrier, there are no ordering guarantees, even on real weakly ordered hardware, as shown in Listing E.12 (C-2+2W+o-o+o-o.litmus). q

Quick Quiz 15.27: p.332
Can you construct a litmus test like that in Listing 15.21 that uses only dependencies?

Listing E.13: LB Litmus Test With No Acquires
1 C C-LB+o-data-o+o-data-o+o-data-o
2
3 {
4 	x1=1;
5 	x2=2;
6 }
7
8 P0(int *x0, int *x1)
9 {
10 	int r2;
11
12 	r2 = READ_ONCE(*x0);
13 	WRITE_ONCE(*x1, r2);
14 }
15
16 P1(int *x1, int *x2)
17 {
18 	int r2;
19
20 	r2 = READ_ONCE(*x1);
21 	WRITE_ONCE(*x2, r2);
22 }
23

Quick Quiz 15.28: p.333
Suppose we have a short release-acquire chain along with one load-to-store link and one store-to-store link, like that shown in Listing 15.25. Given that there is only one of each type of non-store-to-load link, the exists cannot trigger, right?

Answer:
Wrong. It is the number of non-store-to-load links that matters. If there is only one non-store-to-load link, a release-acquire chain can prevent the exists clause from triggering. However, if there is more than one non-store-to-load link, be they store-to-store, load-to-store, or any combination thereof, it is necessary to have at least one full barrier (smp_mb() or better) between each non-store-to-load link. In Listing 15.25, preventing the exists clause from triggering therefore requires an additional full barrier between either P0()'s or P1()'s accesses. q
Quick Quiz 15.29: p.334
There are store-to-load links, load-to-store links, and store-to-store links. But what about load-to-load links?

Answer:
The problem with the concept of load-to-load links is that if the two loads from the same variable return the same value, there is no way to determine their ordering. The only way to determine their ordering is if they return different values, in which case there had to have been an intervening store. And that intervening store means that there is no load-to-load link, but rather a load-to-store link followed by a store-to-load link. q

Quick Quiz 15.30: p.335
Why not place a barrier() call immediately before a plain store to prevent the compiler from inventing stores?

Answer:
Because it would not work. Although the compiler would be prevented from inventing a store prior to the barrier(), nothing would prevent it from inventing a store between that barrier() and the plain store. q

Quick Quiz 15.31: p.336
Why can't you simply dereference the pointer before comparing it to &reserve_int on line 6 of Listing 15.28?

Listing E.14: Breakable Dependencies With Non-Constant Comparisons
1 int *gp1;
2 int *p;
3 int *q;
4
5 p = rcu_dereference(gp1);
6 q = get_a_pointer();
7 if (p == q)
8 	handle_equality(p);
9 do_something_with(*p);

Listing E.15: Broken Dependencies With Non-Constant Comparisons
1 int *gp1;
2 int *p;
3 int *q;
4
5 p = rcu_dereference(gp1);
6 q = get_a_pointer();
7 if (p == q) {
8 	handle_equality(q);
9 	do_something_with(*q);
10 } else {
11 	do_something_with(*p);
12 }

Listing E.14. The compiler is within its rights to transform this code into that shown in Listing E.15, and might well make this transformation due to register pressure if handle_equality() was inlined and needed a lot of registers. Line 9 of this transformed code uses q, which although equal to p, is not necessarily tagged by the hardware as carrying a dependency. Therefore, this transformed code does not necessarily guarantee that line 9 is ordered after line 5. q
Answer:
Not given the Linux kernel memory model. (Try it!)
However, you can instead replace P0()’s WRITE_ONCE()
with smp_store_release(), which usually has less
overhead than does adding an smp_mb(). q
Answer:
No. By the time that the code inspects the return value
from spin_is_locked(), some other CPU or thread
might well have acquired the corresponding lock. q
Listing E.17: Userspace RCU Code Reordering
1 static inline int rcu_gp_ongoing(unsigned long *ctr)
2 {
3 	unsigned long v;
4
5 	v = LOAD_SHARED(*ctr);
6 	return v && (v != rcu_gp_ctr);
7 }
8
9 static void update_counter_and_wait(void)
10 {
11 	struct rcu_reader *index;
12
13 	STORE_SHARED(rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
14 	barrier();
15 	list_for_each_entry(index, &registry, node) {
16 		while (rcu_gp_ongoing(&index->ctr))
17 			msleep(10);
18 	}
19 }
20
21 void synchronize_rcu(void)
22 {
23 	unsigned long was_online;
24
25 	was_online = rcu_reader.ctr;
26 	smp_mb();
27 	if (was_online)
28 		STORE_SHARED(rcu_reader.ctr, 0);
29 	mutex_lock(&rcu_gp_lock);
30 	update_counter_and_wait();
31 	mutex_unlock(&rcu_gp_lock);
32 	if (was_online)
33 		STORE_SHARED(rcu_reader.ctr, LOAD_SHARED(rcu_gp_ctr));
34 	smp_mb();
35 }

Quick Quiz 15.44: p.356
Given that hardware can have a half memory barrier, why don't locking primitives allow the compiler to move memory-reference instructions into lock-based critical sections?

Answer:
In fact, as we saw in Section 15.5.3 and will see in Section 15.5.6, hardware really does implement partial memory-ordering instructions and it also turns out that these really are used to construct locking primitives. However, these locking primitives use full compiler barriers, thus preventing the compiler from reordering memory-reference instructions both out of and into the corresponding critical section.
To see why the compiler is forbidden from doing reordering that is permitted by hardware, consider the sample code in Listing E.17. This code is based on the userspace RCU update-side code [DMS+12, Supplementary Materials Figure 5].
Suppose that the compiler reordered lines 27 and 28 into the critical section starting at line 29. Now suppose that two updaters start executing synchronize_rcu() at about the same time. Then consider the following sequence of events:

1. CPU 0 acquires the lock at line 29.

2. Line 27 determines that CPU 0 was online, so it clears its own counter at line 28. (Recall that lines 27 and 28 have been reordered by the compiler to follow line 29.)

3. CPU 0 invokes update_counter_and_wait() from line 30.

4. CPU 0 invokes rcu_gp_ongoing() on itself at line 16, and line 5 sees that CPU 0 is in a quiescent state. Control therefore returns to update_counter_and_wait(), and line 15 advances to CPU 1.

5. CPU 1 invokes synchronize_rcu(), but because CPU 0 already holds the lock, CPU 1 blocks waiting for this lock to become available. Because the compiler reordered lines 27 and 28 to follow line 29, CPU 1 does not clear its own counter, despite having been online.

6. CPU 0 invokes rcu_gp_ongoing() on CPU 1 at line 16, and line 5 sees that CPU 1 is not in a quiescent state. The while loop at line 16 therefore never exits.

So the compiler's reordering results in a deadlock. In contrast, hardware reordering is temporary, so that CPU 1 might undertake its first attempt to acquire the mutex on line 29 before executing lines 27 and 28, but it will eventually execute lines 27 and 28. Because hardware reordering only results in a short delay, it can be tolerated. On the other hand, because compiler reordering results in a deadlock, it must be prohibited.
Some research efforts have used hardware transactional memory to allow compilers to safely reorder more aggressively, but the overhead of hardware transactions has thus far made such optimizations unattractive. q

Quick Quiz 15.45: p.360
Why is it necessary to use heavier-weight ordering for load-to-store and store-to-store links, but not for store-to-load links? What on earth makes store-to-load links so special???

Answer:
Recall that load-to-store and store-to-store links can be counter-temporal, as illustrated by Figures 15.10 and 15.11
in Section 15.2.7.2. This counter-temporal nature of load-to-store and store-to-store links necessitates strong ordering.
In contrast, store-to-load links are temporal, as illustrated by Listings 15.12 and 15.13. This temporal nature of store-to-load links permits use of minimal ordering. q

to work in a given situation. However, even in these cases, it may be very worthwhile to spend a little time trying to come up with a simpler algorithm! After all, if you managed to invent the first algorithm to do some task, it shouldn't be that hard to go on to invent a simpler one. q
of this book to see that the venerable asynchronous call_rcu() primitive enables RCU to perform and scale quite well with large numbers of updaters. Furthermore, in Section 3.7 of their paper, the authors admit that asynchronous grace periods are important to MV-RLU scalability. A fair comparison would also allow RCU the benefits of asynchrony.

2. They use a poorly tuned 1,000-bucket hash table containing 10,000 elements. In addition, their 448 hardware threads need considerably more than 1,000 buckets to avoid the lock contention that they correctly state limits RCU performance in their benchmarks. A useful comparison would feature a properly tuned hash table.

3. Their RCU hash table used per-bucket locks, which they call out as a bottleneck, which is not a surprise given the long hash chains and small ratio of buckets to threads. A number of their competing mechanisms instead use lockfree techniques, thus avoiding the per-bucket-lock bottleneck, which cynics might claim sheds some light on the authors' otherwise inexplicable choice of poorly tuned hash tables. The first graph in the middle row of the authors' Figure 4 shows what RCU can achieve if not hobbled by artificial bottlenecks, as does the first portion of the second graph in that same row.

presents an all-too-rare example of good scalability combined with strong read-side coherence. They are also to be congratulated on overcoming the traditional academic prejudice against asynchronous grace periods, which greatly aided their scalability.
Interestingly enough, RLU and RCU take different approaches to avoid the inherent limitations of STM noted by Hagit Attiya et al. [AHM09]. RCU avoids providing strict serializability and RLU avoids providing invisible read-only transactions, both thus avoiding the limitations. q

Quick Quiz 17.4: p.381
Given things like spin_trylock(), how does it make any sense at all to claim that TM introduces the concept of failure???

Answer:
When using locking, spin_trylock() is a choice, with a corresponding failure-free choice being spin_lock(), which is used in the common case, as in there are more than 100 times as many calls to spin_lock() than to spin_trylock() in the v5.11 Linux kernel. When using TM, the only failure-free choice is the irrevocable transaction, which is not used in the common case. In fact, the irrevocable transaction is not even available in all TM implementations. q
CPUs. These invalidations will generate large numbers of conflicts and retries, perhaps even degrading performance and scalability compared to locking. q

The program is now in the else-clause instead of the then-clause.
This is not what I call an easy-to-use debugger. q
On the other hand, it is possible for a non-empty lock-based critical section to be relying on both the data-protection and time-based and messaging semantics of locking. Using transactional lock elision in such a case would be incorrect, and would result in bugs. q

Quick Quiz 17.12: p.387
Given modern hardware [MOZ09], how can anyone possibly expect parallel software relying on timing to work?

Answer:
The short answer is that on commonplace commodity hardware, synchronization designs based on any sort of fine-grained timing are foolhardy and cannot be expected to operate correctly under all conditions.
That said, there are systems designed for hard real-time use that are much more deterministic. In the (very unlikely) event that you are using such a system, here is a toy example showing how time-based synchronization can work. Again, do not try this on commodity microprocessors, as they have highly nondeterministic performance characteristics.
This example uses multiple worker threads along with a control thread. Each worker thread corresponds to an outbound data feed, and records the current time (for example, from the clock_gettime() system call) in a per-thread my_timestamp variable after executing each unit of work. The real-time nature of this example results in the following set of constraints:

1. It is a fatal error for a given worker thread to fail to update its timestamp for a time period of more than MAX_LOOP_TIME.

2. Locks are used sparingly to access and update global state.

3. Locks are granted in strict FIFO order within a given thread priority.

When worker threads complete their feed, they must disentangle themselves from the rest of the application and place a status value in a per-thread my_status variable that is initialized to −1. Threads do not exit; they instead are placed on a thread pool to accommodate later processing requirements. The control thread assigns (and re-assigns) worker threads as needed, and also maintains a histogram of thread statuses. The control thread runs at a real-time priority no higher than that of the worker threads.
Worker threads' code is as follows:

1 int my_status = -1;  /* Thread local. */
2
3 while (continue_working()) {
4 	enqueue_any_new_work();
5 	wp = dequeue_work();
6 	do_work(wp);
7 	my_timestamp = clock_gettime(...);
8 }
9
10 acquire_lock(&departing_thread_lock);
11
12 /*
13  * Disentangle from application, might
14  * acquire other locks, can take much longer
15  * than MAX_LOOP_TIME, especially if many
16  * threads exit concurrently.
17  */
18 my_status = get_return_status();
19 release_lock(&departing_thread_lock);
20
21 /* thread awaits repurposing. */

The control thread's code is as follows:

1 for (;;) {
2 	for_each_thread(t) {
3 		ct = clock_gettime(...);
4 		d = ct - per_thread(my_timestamp, t);
5 		if (d >= MAX_LOOP_TIME) {
6 			/* thread departing. */
7 			acquire_lock(&departing_thread_lock);
8 			release_lock(&departing_thread_lock);
9 			i = per_thread(my_status, t);
10 			status_hist[i]++; /* Bug if TLE! */
11 		}
12 	}
13 	/* Repurpose threads as needed. */
14 }

Line 5 uses the passage of time to deduce that the thread has exited, executing lines 6 and 10 if so. The empty lock-based critical section on lines 7 and 8 guarantees that any thread in the process of exiting completes (remember that locks are granted in FIFO order!).
Once again, do not try this sort of thing on commodity microprocessors. After all, it is difficult enough to get this right on systems specifically designed for hard real-time use! q

Quick Quiz 17.13: p.387
But the boostee() function in Listing 17.1 alternatively acquires its locks in reverse order! Won't this result in deadlock?

Answer:
No deadlock will result. To arrive at deadlock, two different threads must each acquire the two locks in opposite orders, which does not happen in this example. However, deadlock detectors such as lockdep [Cor06a] will flag this as a false positive. q
Quick Quiz 17.14: p.388
So a bunch of people set out to supplant locking, and they mostly end up just optimizing locking???

Table E.5: Emulating Locking: Performance Comparison (s) (cmpxchg_acquire() vs. xchg_acquire())
Answer:
Here are a few reasons for such gaps:

1. The consumer might be preempted for long time periods.

Quick Quiz A.3: p.411
But if fully ordered implementations cannot offer stronger guarantees than the better performing and more scalable weakly ordered implementations, why bother with full ordering?

Answer:
Because strongly ordered implementations are sometimes able to provide greater consistency among sets of calls to functions accessing a given data structure. For example, compare the atomic counter of Listing 5.2 to the statistical counter of Section 5.2. Suppose that one thread is adding the value 3 and another is adding the value 5, while two other threads are concurrently reading the counter's value. With atomic counters, it is not possible for one of the readers to obtain the value 3 while the other obtains the value 5. With statistical counters, this outcome really can happen. In fact, in some computing environments, this outcome can happen even on relatively strongly ordered hardware such as x86.
Therefore, if your users happen to need this admittedly unusual level of consistency, you should avoid weakly ordered statistical counters. q

Quick Quiz A.5: p.413
In what part of the second (scheduler-based) perspective would the lock-based single-thread-per-CPU workload be considered "concurrent"?

E.19 "Toy" RCU Implementations

Quick Quiz B.1: p.415
Why wouldn't any deadlock in the RCU implementation in Listing B.1 also be a deadlock in any other RCU implementation?

Answer:
Suppose the functions foo() and bar() in Listing E.18 are invoked concurrently from different CPUs. Then foo() will acquire my_lock on line 3, while bar() will acquire rcu_gp_lock on line 13.
When foo() advances to line 4, it will attempt to acquire rcu_gp_lock, which is held by bar(). Then when bar() advances to line 14, it will attempt to acquire my_lock, which is held by foo().
Each function is then waiting for a lock that the other holds, a classic deadlock.
Other RCU implementations neither spin nor block in rcu_read_lock(), hence avoiding deadlocks. q
Listing E.18: Deadlock in Lock-Based RCU Implementation
1 void foo(void)
2 {
3 	spin_lock(&my_lock);
4 	rcu_read_lock();
5 	do_something();
6 	rcu_read_unlock();
7 	do_something_else();
8 	spin_unlock(&my_lock);
9 }
10
11 void bar(void)
12 {
13 	rcu_read_lock();
14 	spin_lock(&my_lock);
15 	do_some_other_thing();
16 	spin_unlock(&my_lock);
17 	do_whatever();
18 	rcu_read_unlock();
19 }

Quick Quiz B.2: p.415
Why not simply use reader-writer locks in the RCU implementation in Listing B.1 in order to allow RCU readers to proceed in parallel?

Answer:
One could in fact use reader-writer locks in this manner. However, textbook reader-writer locks suffer from memory contention, so that the RCU read-side critical sections would need to be quite long to actually permit parallel execution [McK03].
On the other hand, use of a reader-writer lock that is read-acquired in rcu_read_lock() would avoid the deadlock condition noted above. q
This is indeed an advantage, but do not forget that rcu_
p.416 dereference() and rcu_assign_pointer() are still
Quick Quiz B.3:
required, which means volatile manipulation for rcu_
Wouldn’t it be cleaner to acquire all the locks, and
dereference() and memory barriers for rcu_assign_
then release them all in the loop from lines 15–18 of
pointer(). Of course, many Alpha CPUs require mem-
Listing B.2? After all, with this change, there would be
ory barriers for both primitives. q
a point in time when there were no readers, simplifying
things greatly.
Quick Quiz B.6: p.417
Answer:
But what if you hold a lock across a call to
Making this change would re-introduce the deadlock, so
synchronize_rcu(), and then acquire that same lock
no, it would not be cleaner. q
within an RCU read-side critical section?
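The scenario in Quick Quiz B.6 can be sketched as follows. This is an illustrative fragment in the style of Listing E.18, not code from the book's CodeSamples; the lock name and helper functions are assumptions:

static DEFINE_SPINLOCK(my_lock);   /* assumed, as in Listing E.18 */

void updater(void)
{
    spin_lock(&my_lock);
    synchronize_rcu();      /* must wait for all pre-existing readers */
    spin_unlock(&my_lock);
}

void reader(void)
{
    rcu_read_lock();
    spin_lock(&my_lock);    /* may block on updater(), which is waiting on this reader */
    do_something();
    spin_unlock(&my_lock);
    rcu_read_unlock();
}

If updater() reaches synchronize_rcu() after reader() has entered its RCU read-side critical section but before reader() acquires my_lock, each waits on the other, regardless of how rcu_read_lock() is implemented.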
Quick Quiz B.7: (p.417)
How can the grace period possibly elapse in 40 nanoseconds when synchronize_rcu() contains a 10-millisecond delay?

Answer:
The update-side test was run in absence of readers, so the poll() system call was never invoked. In addition, the actual code has this poll() system call commented out, the better to evaluate the true overhead of the update-side code. Any production uses of this code would be better served by using the poll() system call, but then again, production uses would be even better served by other implementations shown later in this section. q

Quick Quiz B.8: (p.417)
Why not simply make rcu_read_lock() wait when a concurrent synchronize_rcu() has been waiting too long in the RCU implementation in Listing B.3? Wouldn't that prevent synchronize_rcu() from starving?

Answer:
Although this would in fact eliminate the starvation, it would also mean that rcu_read_lock() would spin or block waiting for the writer, which is in turn waiting on readers. If one of these readers is attempting to acquire a lock that the spinning/blocking rcu_read_lock() holds, we again have deadlock.

In short, the cure is worse than the disease. See Appendix B.4 for a proper cure. q

Quick Quiz B.9: (p.418)
Why the memory barrier on line 5 of synchronize_rcu() in Listing B.6 given that there is a spin-lock acquisition immediately after?

Answer:
The spin-lock acquisition only guarantees that the spin-lock's critical section will not "bleed out" to precede the acquisition. It in no way guarantees that code preceding the spin-lock acquisition won't be reordered into the critical section. Such reordering could cause a removal from an RCU-protected list to be reordered to follow the complementing of rcu_idx, which could allow a newly starting RCU read-side critical section to see the recently removed data element. (A sketch of this hazard appears after the answer to Quick Quiz B.10 below.)

Exercise for the reader: Use a tool such as Promela/spin to determine which (if any) of the memory barriers in Listing B.6 are really needed. See Chapter 12 for information on using these tools. The first correct and complete response will be credited. q

Quick Quiz B.10: (p.419)
Why is the counter flipped twice in Listing B.6? Shouldn't a single flip-and-wait cycle be sufficient?

Answer:
Both flips are absolutely required. To see this, consider the following sequence of events:

1. Line 8 of rcu_read_lock() in Listing B.5 picks up rcu_idx, finding its value to be zero.

2. Line 8 of synchronize_rcu() in Listing B.6 complements the value of rcu_idx, setting its value to one.

3. Lines 10–12 of synchronize_rcu() find that the value of rcu_refcnt[0] is zero, and synchronize_rcu() thus returns. (Recall that the question is asking what happens if lines 13–20 are omitted.)

4. Lines 9 and 10 of rcu_read_lock() store the value zero to this thread's instance of rcu_read_idx and increment rcu_refcnt[0], respectively. Execution then proceeds into the RCU read-side critical section.

5. Another instance of synchronize_rcu() again complements rcu_idx, this time setting its value to zero. Because rcu_refcnt[1] is zero, synchronize_rcu() returns immediately. (Recall that rcu_read_lock() incremented rcu_refcnt[0], not rcu_refcnt[1]!)

6. The grace period that started in step 5 has been allowed to end, despite the fact that the RCU read-side critical section that started beforehand in step 4 has not completed. This violates RCU semantics, and could allow the update to free a data element that the RCU read-side critical section was still referencing.

Exercise for the reader: What happens if rcu_read_lock() is preempted for a very long time (hours!) just after line 8? Does this implementation operate correctly in that case? Why or why not? The first correct and complete response will be credited. q
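The hazard described in the answer to Quick Quiz B.9 can be sketched as follows. This is an illustrative fragment rather than the book's Listing B.6; the list element, rcu_idx, and rcu_gp_lock names are assumptions based on the answer's wording:

struct foo {
    struct list_head list;
    int data;
};

static DEFINE_SPINLOCK(rcu_gp_lock);  /* assumed names */
static int rcu_idx;

void remove_then_wait(struct foo *p)
{
    list_del(&p->list);        /* removal that readers must not miss       */

    /* What the grace-period code must do next: */
    smp_mb();                  /* order the removal before ...             */
    spin_lock(&rcu_gp_lock);   /* ... the lock acquisition and ...         */
    rcu_idx = !rcu_idx;        /* ... the index flip that readers key off  */
    spin_unlock(&rcu_gp_lock);
    /* ... wait for readers, flip again, wait again ... */
}

Acquiring rcu_gp_lock keeps the critical section from leaking out ahead of the acquisition, but without the smp_mb() the list removal could still be reordered into the critical section to follow the flip of rcu_idx, letting a newly starting reader find the just-removed element.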
Answer:
Indeed it could, with a few modifications. This work is left as an exercise for the reader. q

Quick Quiz B.18: (p.423)
Is the possibility of readers being preempted in lines 3–4 of Listing B.14 a real problem, in other words, is there a real sequence of events that could lead to failure? If not, why not? If so, what is the sequence of events, and how can the failure be addressed?

Answer:
It is a real problem, there is a sequence of events leading to failure, and there are a number of possible ways of addressing it. For more details, see the Quick Quizzes near the end of Appendix B.8. The reason for locating the discussion there is to (1) give you more time to think about it, and (2) because the nesting support added in that section greatly reduces the time required to overflow the counter. q

Quick Quiz B.19: (p.424)
Why not simply maintain a separate per-thread nesting-level variable, as was done in the previous section, rather than having all this complicated bit manipulation?

Answer:
The apparent simplicity of the separate per-thread variable is a red herring. This approach incurs much greater complexity in the guise of careful ordering of operations, especially if signal handlers are to be permitted to contain RCU read-side critical sections. But don't take my word for it, code it up and see what you end up with! q

Quick Quiz B.20: (p.424)
Given the algorithm shown in Listing B.16, how could you double the time required to overflow the global rcu_gp_ctr?

Answer:
One way would be to replace the magnitude comparison on lines 32 and 33 with an inequality check of the per-thread rcu_reader_gp variable against rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT. q

Answer:
It can indeed be fatal. To see this, consider the following sequence of events:

1. Thread 0 enters rcu_read_lock(), determines that it is not nested, and therefore fetches the value of the global rcu_gp_ctr. Thread 0 is then preempted for an extremely long time (before storing to its per-thread rcu_reader_gp variable).

2. Other threads repeatedly invoke synchronize_rcu(), so that the new value of the global rcu_gp_ctr is now RCU_GP_CTR_BOTTOM_BIT less than it was when thread 0 fetched it.

3. Thread 0 now starts running again, and stores into its per-thread rcu_reader_gp variable. The value it stores is RCU_GP_CTR_BOTTOM_BIT+1 greater than that of the global rcu_gp_ctr.

4. Thread 0 acquires a reference to RCU-protected data element A.

5. Thread 1 now removes the data element A that thread 0 just acquired a reference to.

6. Thread 1 invokes synchronize_rcu(), which increments the global rcu_gp_ctr by RCU_GP_CTR_BOTTOM_BIT. It then checks all of the per-thread rcu_reader_gp variables, but thread 0's value (incorrectly) indicates that it started after thread 1's call to synchronize_rcu(), so thread 1 does not wait for thread 0 to complete its RCU read-side critical section.

7. Thread 1 then frees up data element A, which thread 0 is still referencing.

Note that this scenario can also occur in the implementation presented in Appendix B.7.

One strategy for fixing this problem is to use 64-bit counters so that the time required to overflow them would exceed the useful lifetime of the computer system. Note that non-antique members of the 32-bit x86 CPU family allow atomic manipulation of 64-bit counters via the cmpxchg8b instruction.

Another strategy is to limit the rate at which grace periods are permitted to occur in order to achieve a similar
effect. For example, synchronize_rcu() could record the last time that it was invoked, and any subsequent invocation would then check this time and block as needed to force the desired spacing. For example, if the low-order four bits of the counter were reserved for nesting, and if grace periods were permitted to occur at most ten times per second, then it would take more than 300 days for the counter to overflow. However, this approach is not helpful if there is any possibility that the system will be fully loaded with CPU-bound high-priority real-time threads for the full 300 days. (A remote possibility, perhaps, but best to consider it ahead of time.)

A third approach is to administratively abolish real-time threads from the system in question. In this case, the preempted process will age up in priority, thus getting to run long before the counter had a chance to overflow. Of course, this approach is less than helpful for real-time applications.

A fourth approach would be for rcu_read_lock() to recheck the value of the global rcu_gp_ctr after storing to its per-thread rcu_reader_gp counter, retrying if the new value of the global rcu_gp_ctr is inappropriate. This works, but introduces non-deterministic execution time into rcu_read_lock(). On the other hand, if your application is being preempted long enough for the counter to overflow, you have no hope of deterministic execution time in any case!

A fifth approach is for the grace-period process to wait for all readers to become aware of the new grace period. This works nicely in theory, but hangs if a reader blocks indefinitely outside of an RCU read-side critical section.

A final approach is, oddly enough, to use a single-bit grace-period counter and for each call to synchronize_rcu() to take two passes through its algorithm. This is the approach used by userspace RCU [Des09b], and is described in detail in the journal article and supplementary materials [DMS+12, Appendix D]. q

However, this memory barrier is absolutely required so that other threads will see the store on lines 12–13 before any subsequent RCU read-side critical sections executed by the caller. q

Quick Quiz B.23: (p.425)
Why are the two memory barriers on lines 11 and 14 of Listing B.18 needed?

Answer:
The memory barrier on line 11 ensures that any RCU read-side critical sections that might precede the call to rcu_thread_offline() won't be reordered by either the compiler or the CPU to follow the assignment on lines 12–13. The memory barrier on line 14 is, strictly speaking, unnecessary, as it is illegal to have any RCU read-side critical sections following the call to rcu_thread_offline(). q

Quick Quiz B.24: (p.426)
To be sure, the clock frequencies of POWER systems in 2008 were quite high, but even a 5 GHz clock frequency is insufficient to allow loops to be executed in 50 picoseconds! What is going on here?

Answer:
Since the measurement loop contains a pair of empty functions, the compiler optimizes it away. The measurement loop takes 1,000 passes between each call to rcu_quiescent_state(), so this measurement is roughly one thousandth of the overhead of a single call to rcu_quiescent_state(). q

Quick Quiz B.25: (p.426)
Why would the fact that the code is in a library make any difference for how easy it is to use the RCU implementation shown in Listings B.18 and B.19?
E.20 Why Memory Barriers?
Answer:
It might, if large-scale multiprocessors were in fact implemented that way. Larger multiprocessors, particularly NUMA machines, tend to use so-called "directory-based" cache-coherence protocols to avoid this and other problems. q

Quick Quiz C.4: (p.432)
If SMP machines are really using message passing anyway, why bother with SMP at all?

Answer:
There has been quite a bit of controversy on this topic over the past few decades. One answer is that the cache-coherence protocols are quite simple, and therefore can be implemented directly in hardware, gaining bandwidths and latencies unattainable by software message passing. Another answer is that the real truth is to be found in economics due to the relative prices of large SMP machines and those of clusters of smaller SMP machines. A third answer is that the SMP programming model is easier to use than that of distributed systems, but a rebuttal might note the appearance of HPC clusters and MPI. And so the argument continues. q

Quick Quiz C.5: (p.433)
How does the hardware handle the delayed transitions described above?

Answer:
Usually by adding additional states, though these additional states need not be actually stored with the cache line, due to the fact that only a few lines at a time will be transitioning. The need to delay transitions is but one issue that results in real-world cache-coherence protocols being much more complex than the over-simplified MESI protocol described in this appendix. Hennessy and Patterson's classic introduction to computer architecture [HP95] covers many of these issues. q

Quick Quiz C.6: (p.433)
What sequence of operations would put the CPUs' caches all back into the "invalid" state?

Answer:
There is no such sequence, at least in the absence of special "flush my cache" instructions in the CPU's instruction set. Most CPUs do have such instructions. q

Quick Quiz C.7: (p.434)
But then why do uniprocessors also have store buffers?

Answer:
Because the purpose of store buffers is not just to hide acknowledgement latencies in multiprocessor cache-coherence protocols, but to hide memory latencies in general. Because memory is much slower than is cache on uniprocessors, store buffers on uniprocessors can help to hide write-miss memory latencies. q

Quick Quiz C.8: (p.434)
So store-buffer entries are variable length? Isn't that difficult to implement in hardware?

Answer:
Here are two ways for hardware to easily handle variable-length stores.

First, each store-buffer entry could be a single byte wide. Then a 64-bit store would consume eight store-buffer entries. This approach is simple and flexible, but one disadvantage is that each entry would need to replicate much of the address that was stored to.

Second, each store-buffer entry could be double the size of a cache line, with half of the bits containing the values stored, and the other half indicating which bits had been stored to. So, assuming a 32-bit cache line, a single-byte store of 0x5a to the low-order byte of a given cache line would result in 0xXXXXXX5a for the first half and 0x000000ff for the second half, where the values labeled X are arbitrary because they would be ignored. This approach allows multiple consecutive stores corresponding to a given cache line to be merged into a single store-buffer entry, but is space-inefficient for random stores of single bytes. (A sketch of this second scheme appears below.)

Much more complex and efficient schemes are of course used by actual hardware designers. q

Quick Quiz C.9: (p.436)
In step 1 above, why does CPU 0 need to issue a "read invalidate" rather than a simple "invalidate"? After all, foo() will overwrite the variable a in any case, so why should it care about the old value of a?

Answer:
Because the cache line in question contains more data than just the variable a. Issuing "invalidate" instead of the needed "read invalidate" would cause that other data to be
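A rough software model of the second scheme in the answer to Quick Quiz C.8, assuming a 32-bit cache line; the structure and function names are illustrative, not anything a real CPU exposes:

#include <stdint.h>

/* One store-buffer entry: the stored bits plus a mask recording which
 * bits of the cache line have actually been stored to. */
struct sb_entry {
    uint32_t value;  /* e.g., 0xXXXXXX5a after storing 0x5a to byte 0     */
    uint32_t mask;   /* e.g., 0x000000ff: only the low-order byte is valid */
};

/* Merge a later store into an existing entry for the same cache line. */
static void sb_merge(struct sb_entry *e, uint32_t value, uint32_t mask)
{
    e->value = (e->value & ~mask) | (value & mask);
    e->mask |= mask;
}

Consecutive stores to the same cache line accumulate in one entry via sb_merge(), which is the space saving the answer describes; a single-byte store still occupies a full entry, which is the inefficiency it mentions.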
the trick. Except that the common-case super-scalar CPU is executing many instructions at once, and not necessarily even in the expected order. So what would "immediate" even mean? The answer is clearly "not much".

Nevertheless, for simpler CPUs that execute instructions serially, flushing the invalidation queue might be a reasonable implementation strategy. q

Quick Quiz C.15: (p.440)
But can't full memory barriers impose global ordering? After all, isn't that needed to provide the ordering shown in Listing 12.27?

Answer:
Sort of.

Note well that this litmus test has not one but two full memory-barrier instructions, namely the two sync instructions executed by P2 and P3.

It is the interaction of those two instructions that provides the global ordering, not just their individual execution. For example, each of those two sync instructions might stall waiting for all CPUs to process their invalidation queues before allowing subsequent instructions to execute. q

Quick Quiz C.16: (p.441)
Does the guarantee that each CPU sees its own memory accesses in order also guarantee that each user-level thread will see its own memory accesses in order? Why or why not?

Answer:
No. Consider the case where a thread migrates from one CPU to another, and where the destination CPU perceives the source CPU's recent memory operations out of order. To preserve user-mode sanity, kernel hackers must use memory barriers in the context-switch path. However, the locking already required to safely do a context switch should automatically provide the memory barriers needed to cause the user-level task to see its own accesses in order. That said, if you are designing a super-optimized scheduler, either in the kernel or at user level, please keep this scenario in mind! q

between CPU 1's "while" and assignment to "c"? Why or why not?

Answer:
No. Such a memory barrier would only force ordering local to CPU 1. It would have no effect on the relative ordering of CPU 0's and CPU 1's accesses, so the assertion could still fail. However, all mainstream computer systems provide one mechanism or another to provide "transitivity", which provides intuitive causal ordering: If B saw the effects of A's accesses, and C saw the effects of B's accesses, then C must also see the effects of A's accesses. In short, hardware designers have taken at least a little pity on software developers. q

Quick Quiz C.18: (p.442)
Suppose that lines 3–5 for CPUs 1 and 2 in Listing C.3 are in an interrupt handler, and that CPU 2's line 9 runs at process level. In other words, the code in all three columns of the table runs on the same CPU, but the first two columns run in an interrupt handler, and the third column runs at process level, so that the code in the third column can be interrupted by the code in the first two columns. What changes, if any, are required to enable the code to work correctly, in other words, to prevent the assertion from firing?

Answer:
The assertion must ensure that the load of "e" precedes that of "a". In the Linux kernel, the barrier() primitive may be used to accomplish this in much the same way that the memory barrier was used in the assertions in the previous examples. For example, the assertion can be modified as follows:

r1 = e;
barrier();
assert(r1 == 0 || a == 1);

No changes are needed to the code in the first two columns, because interrupt handlers run atomically from the perspective of the interrupted code. q
Answer:
The result depends on whether the CPU supports “transi-
tivity”. In other words, CPU 0 stored to “e” after seeing
CPU 1’s store to “c”, with a memory barrier between
CPU 0’s load from “c” and store to “e”. If some other
CPU sees CPU 0’s store to “e”, is it also guaranteed to
see CPU 1’s store?
All CPUs I am aware of claim to provide transitivity. q
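A hedged litmus-test sketch of the scenario in the answer above, using the answer's variables "c" and "e" (both initially zero) and Linux-kernel-style accessors; the CPU numbering follows the answer:

int c, e;

void cpu1(void)
{
    WRITE_ONCE(c, 1);
}

void cpu0(void)
{
    while (READ_ONCE(c) == 0)  /* CPU 0 sees CPU 1's store to "c" ... */
        continue;
    smp_mb();                  /* ... with a full barrier before ...  */
    WRITE_ONCE(e, 1);          /* ... its own store to "e"            */
}

void cpu2(void)
{
    int r1, r2;

    r1 = READ_ONCE(e);
    smp_mb();
    r2 = READ_ONCE(c);
    /* Transitivity: if r1 == 1, then r2 must also be 1. */
}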
Glossary

Dictionaries are inherently circular in nature.
Acquire Load: A read from memory that has acquire semantics. Normal use cases pair an acquire load with a release store, in which case if the load returns the value stored, then all code executed by the loading CPU after that acquire load will see the effects of all memory-reference instructions executed by the storing CPU prior to that release store. Acquiring a lock provides similar memory-ordering semantics, hence the "acquire" in "acquire load". (See also "memory barrier" and "release store".)

Amdahl's Law: If sufficient numbers of CPUs are used to run a job that has both a sequential portion and a concurrent portion, performance and scalability will be limited by the overhead of the sequential portion.

Associativity: The number of cache lines that can be held simultaneously in a given cache, when all of these cache lines hash identically in that cache. A cache that could hold four cache lines for each possible hash value would be termed a "four-way set-associative" cache, while a cache that could hold only one cache line for each possible hash value would be termed a "direct-mapped" cache. A cache whose associativity was equal to its capacity would be termed a "fully associative" cache. Fully associative caches have the advantage of eliminating associativity misses, but, due to hardware limitations, fully associative caches are normally quite limited in size. The associativity of the large caches found on modern microprocessors typically ranges from two-way to eight-way.

Associativity Miss: A cache miss incurred because the corresponding CPU has recently accessed more data hashing to a given set of the cache than will fit in that set. Fully associative caches are not subject to associativity misses (or, equivalently, in fully associative caches, associativity and capacity misses are identical).

Atomic: An operation is considered "atomic" if it is not possible to observe any intermediate state. For example, on most CPUs, a store to a properly aligned pointer is atomic, because other CPUs will see either the old value or the new value, but are guaranteed not to see some mixed value containing some pieces of the new and old values.

Atomic Read-Modify-Write Operation: An atomic operation that both reads and writes memory is considered an atomic read-modify-write operation, or atomic RMW operation for short. Although the value written usually depends on the value read, atomic_xchg() is the exception that proves this rule.

Bounded Wait Free: A forward-progress guarantee in which every thread makes progress within a specific finite period of time, the specific time being the bound.

Bounded Population-Oblivious Wait Free: A forward-progress guarantee in which every thread makes progress within a specific finite period of time, the specific time being the bound, where this bound is independent of the number of threads.

Cache: In modern computer systems, CPUs have caches in which to hold frequently used data. These caches can be thought of as hardware hash tables with very simple hash functions, but in which each hash bucket (termed a "set" by hardware types) can hold only a limited number of data items. The number of data items that can be held by each of a cache's hash buckets is termed the cache's "associativity". These data items are normally called "cache lines", which can be thought of as fixed-length blocks of data that circulate among the CPUs and memory.

Cache Coherence: A property of most modern SMP machines where all CPUs will observe a sequence of values for a given variable that is consistent with at least one global order of values for that variable. Cache coherence also guarantees that at the end of
a group of stores to a given variable, all CPUs will agree on the final value for that variable. Note that cache coherence applies only to the series of values taken on by a single variable. In contrast, the memory-consistency model for a given machine describes the order in which loads and stores to groups of variables will appear to occur. See Section 15.2.6 for more information.

Cache-Coherence Protocol: A communications protocol, normally implemented in hardware, that enforces memory consistency and ordering, preventing different CPUs from seeing inconsistent views of data held in their caches.

Cache Geometry: The size and associativity of a cache is termed its geometry. Each cache may be thought of as a two-dimensional array, with rows of cache lines ("sets") that have the same hash value, and columns of cache lines ("ways") in which every cache line has a different hash value. The associativity of a given cache is its number of columns (hence the name "way"—a two-way set-associative cache has two "ways"), and the size of the cache is its number of rows multiplied by its number of columns.

Cache Line: (1) The unit of data that circulates among the CPUs and memory, usually a moderate power of two in size. Typical cache-line sizes range from 16 to 256 bytes.
(2) A physical location in a CPU cache capable of holding one cache-line unit of data.
(3) A physical location in memory capable of holding one cache-line unit of data, but that is also aligned on a cache-line boundary. For example, the address of the first word of a cache line in memory will end in 0x00 on systems with 256-byte cache lines.

Cache Miss: A cache miss occurs when data needed by the CPU is not in that CPU's cache. The data might be missing because of a number of reasons, including: (1) This CPU has never accessed the data before ("startup" or "warmup" miss), (2) This CPU has recently accessed more data than would fit in its cache, so that some of the older data had to be removed ("capacity" miss), (3) This CPU has recently accessed more data in a given set (in hardware-cache terminology, the word "set" is used in the same way that the word "bucket" is used when discussing software caches) than that set could hold ("associativity" miss), (4) Some other CPU has written to the data (or some other data in the same cache line) since this CPU has accessed it ("communication miss"), or (5) This CPU attempted to write to a cache line that is currently read-only, possibly due to that line being replicated in other CPUs' caches.

Capacity Miss: A cache miss incurred because the corresponding CPU has recently accessed more data than will fit into the cache.

CAS: Compare-and-swap operation, which is an atomic operation that takes a pointer, an old value, and a new value. If the pointed-to value is equal to the old value, it is atomically replaced with the new value. There is some variety in CAS APIs. One variation returns the actual pointed-to value, so that the caller compares the CAS return value to the specified old value, with equality indicating a successful CAS operation. Another variation returns a boolean success indication, in which case a pointer to the old value may be passed in, and if so, the old value is updated in the CAS failure case.

Clash Free: A forward-progress guarantee in which, in the absence of contention, at least one thread makes progress within a finite period of time.

Code Locking: A simple locking design in which a "global lock" is used to protect a set of critical sections, so that access by a given thread to that set is granted or denied based only on the set of threads currently occupying the set of critical sections, not based on what data the thread intends to access. The scalability of a code-locked program is limited by the code; increasing the size of the data set will normally not increase scalability (in fact, will typically decrease scalability by increasing "lock contention"). Contrast with "data locking".

Communication Miss: A cache miss incurred because some other CPU has written to the cache line since the last time this CPU accessed it.

Concurrent: In this book, a synonym of parallel. Please see Appendix A.6 on page 412 for a discussion of the recent distinction between these two terms.

Critical Section: A section of code guarded by some synchronization mechanism, so that its execution is constrained by that primitive. For example, if a set of critical sections are guarded by the same global lock, then only one of those critical sections may be
executing at a given time. If a thread is executing in one such critical section, any other threads must wait until the first thread completes before executing any of the critical sections in the set.

Data Locking: A scalable locking design in which each instance of a given data structure has its own lock. If each thread is using a different instance of the data structure, then all of the threads may be executing in the set of critical sections simultaneously. Data locking has the advantage of automatically scaling to increasing numbers of CPUs as the number of instances of data grows. Contrast with "code locking".

Data Race: A race condition in which several CPUs or threads access a variable concurrently, and in which at least one of those accesses is a store and at least one of those accesses is a plain access. It is important to note that while the presence of data races often indicates the presence of bugs, the absence of data races in no way implies the absence of bugs. (See "Plain access" and "Race condition".)

Deadlock: A failure mode in which each of several threads is unable to make progress until some other thread makes progress. For example, if two threads acquire a pair of locks in opposite orders, deadlock can result. More information is provided in Section 7.1.1.

Deadlock Free: A forward-progress guarantee in which, in the absence of failures, at least one thread makes progress within a finite period of time.

Direct-Mapped Cache: A cache with only one way, so that it may hold only one cache line with a given hash value.

Efficiency: A measure of effectiveness normally expressed as a ratio of some metric actually achieved to some maximum value. The maximum value might be a theoretical maximum, but in parallel programming is often based on the corresponding measured single-threaded metric.

Embarrassingly Parallel: A problem or algorithm where adding threads does not significantly increase the overall cost of the computation, resulting in linear speedups as threads are added (assuming sufficient CPUs are available).

Energy Efficiency: Shorthand for "energy-efficient use" in which the goal is to carry out a given computation with reduced energy consumption. Sublinear scalability can be an obstacle to energy-efficient use of a multicore system.

Epoch-Based Reclamation (EBR): An RCU implementation style put forward by Keir Fraser [Fra03, Fra04, FH07].

Existence Guarantee: An existence guarantee is provided by a synchronization mechanism that prevents a given dynamically allocated object from being freed for the duration of that guarantee. For example, RCU provides existence guarantees for the duration of RCU read-side critical sections. A similar but strictly weaker guarantee is provided by type-safe memory.

Exclusive Lock: An exclusive lock is a mutual-exclusion mechanism that permits only one thread at a time into the set of critical sections guarded by that lock.

False Sharing: If two CPUs each frequently write to one of a pair of data items, but the pair of data items are located in the same cache line, this cache line will be repeatedly invalidated, "ping-ponging" back and forth between the two CPUs' caches. This is a common cause of "cache thrashing", also called "cacheline bouncing" (the latter most commonly in the Linux community). False sharing can dramatically reduce both performance and scalability.

Forward-Progress Guarantee: Algorithms or programs that guarantee that execution will progress at some rate under specified conditions. Academic forward-progress guarantees are grouped into a formal hierarchy shown in Section 14.2. A wide variety of practical forward-progress guarantees are provided by real-time systems, as discussed in Section 14.3.

Fragmentation: A memory pool that has a large amount of unused memory, but not laid out to permit satisfying a relatively small request, is said to be fragmented. External fragmentation occurs when the space is divided up into small fragments lying between allocated blocks of memory, while internal fragmentation occurs when specific requests or types of requests have been allotted more memory than they actually requested.

Fully Associative Cache: A fully associative cache contains only one set, so that it can hold any subset of memory that fits within its capacity.
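As a concrete illustration of the "False Sharing" entry above, a hedged sketch in Linux-kernel style; the structures are illustrative, not taken from the book:

/* Both counters share one cache line, so two CPUs incrementing them
 * concurrently will bounce that line back and forth. */
struct shared_bad {
    unsigned long count0;   /* written by CPU 0 */
    unsigned long count1;   /* written by CPU 1 */
};

/* Giving each counter its own cache line eliminates the false sharing
 * at the cost of memory. */
struct shared_good {
    unsigned long count0 ____cacheline_aligned_in_smp;
    unsigned long count1 ____cacheline_aligned_in_smp;
};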
Grace Period: A grace period is any contiguous time interval such that any RCU read-side critical section that began before the start of that interval has completed before the end of that same interval. Many RCU implementations define a grace period to be a time interval during which each thread has passed through at least one quiescent state. Since RCU read-side critical sections by definition cannot contain quiescent states, these two definitions are almost always interchangeable.

Hardware Transactional Memory (HTM): A transactional-memory system based on hardware instructions provided for this purpose, as discussed in Section 17.3. (See "Transactional memory".)

Hazard Pointer: A scalable counterpart to a reference counter in which an object's reference count is represented implicitly by a count of the number of special hazard pointers referencing that object.

Heisenbug: A timing-sensitive bug that disappears from sight when you add print statements or tracing in an attempt to track it down.

Hot Spot: A data structure that is very heavily used, resulting in high levels of contention on the corresponding lock. One example of this situation would be a hash table with a poorly chosen hash function.

Humiliatingly Parallel: A problem or algorithm where adding threads significantly decreases the overall cost of the computation, resulting in large superlinear speedups as threads are added (assuming sufficient CPUs are available).

Immutable: In this book, a synonym for read-only.

Invalidation: When a CPU wishes to write to a data item, it must first ensure that this data item is not present in any other CPUs' caches. If necessary, the item is removed from the other CPUs' caches via "invalidation" messages from the writing CPU to any CPUs having a copy in their caches.

IPI: Inter-processor interrupt, which is an interrupt sent from one CPU to another. IPIs are used heavily in the Linux kernel, for example, within the scheduler to alert CPUs that a high-priority process is now runnable.

IRQ: Interrupt request, often used as an abbreviation for "interrupt" within the Linux kernel community, as in "irq handler".

Latency: The wall-clock time required for a given operation to complete.

Linearizable: A sequence of operations is "linearizable" if there is at least one global ordering of the sequence that is consistent with the observations of all CPUs and/or threads. Linearizability is much prized by many researchers, but less useful in practice than one might expect [HKLP12].

Livelock: A failure mode in which each of several threads is able to execute, but in which a repeating series of failed operations prevents any of the threads from making any useful forward progress. For example, incorrect use of conditional locking (for example, spin_trylock() in the Linux kernel) can result in livelock. More information is provided in Section 7.1.2.

Lock: A software abstraction that can be used to guard critical sections, as such, an example of a "mutual exclusion mechanism". An "exclusive lock" permits only one thread at a time into the set of critical sections guarded by that lock, while a "reader-writer lock" permits any number of reading threads, or but one writing thread, into the set of critical sections guarded by that lock. (Just to be clear, the presence of a writer thread in any of a given reader-writer lock's critical sections will prevent any reader from entering any of that lock's critical sections and vice versa.)

Lock Contention: A lock is said to be suffering contention when it is being used so heavily that there is often a CPU waiting on it. Reducing lock contention is often a concern when designing parallel algorithms and when implementing parallel programs.

Lock Free: A forward-progress guarantee in which at least one thread makes progress within a finite period of time.

Marked Access: A source-code memory access that uses a special function or macro, such as READ_ONCE(), WRITE_ONCE(), atomic_inc(), and so on, in order to protect that access from compiler and/or hardware optimizations. In contrast, a plain access simply mentions the name of the object being accessed, so that in the following, line 2 is the plain-access equivalent of line 1:

1 WRITE_ONCE(a, READ_ONCE(b) + READ_ONCE(c));
2 a = b + c;
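As an illustration of the conditional-locking failure mentioned in the "Livelock" entry above, a hedged sketch in which the lock and work-function names are assumptions:

static DEFINE_SPINLOCK(lock_a);
static DEFINE_SPINLOCK(lock_b);

/* Thread 0 takes lock_a then tries lock_b; thread 1 does the reverse.
 * If the two threads stay in lockstep, every spin_trylock() fails and
 * both loop forever: each thread keeps executing, yet neither makes
 * useful forward progress. */
void thread0_work(void)
{
    for (;;) {
        spin_lock(&lock_a);
        if (spin_trylock(&lock_b))
            break;              /* got both locks */
        spin_unlock(&lock_a);   /* back off and retry */
    }
    do_work();                  /* assumed work function */
    spin_unlock(&lock_b);
    spin_unlock(&lock_a);
}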
Memory: From the viewpoint of memory models, the main memory, caches, and store buffers in which values might be stored. However, this term is often used to denote the main memory itself, excluding caches and store buffers.

Memory Barrier: A compiler directive that might also include a special memory-barrier instruction. The purpose of a memory barrier is to order memory-reference instructions that execute before the memory barrier to precede those that will execute following that memory barrier. (See also "read memory barrier" and "write memory barrier".)

Memory Consistency: A set of properties that impose constraints on the order in which accesses to groups of variables appear to occur. Memory-consistency models range from sequential consistency, a very constraining model popular in academic circles, through process consistency, release consistency, and weak consistency.

MESI Protocol: The cache-coherence protocol featuring modified, exclusive, shared, and invalid (MESI) states, so that this protocol is named after the states that the cache lines in a given cache can take on. A modified line has been recently written to by this CPU, and is the sole representative of the current value of the corresponding memory location. An exclusive cache line has not been written to, but this CPU has the right to write to it at any time, as the line is guaranteed not to be replicated into any other CPU's cache (though the corresponding location in main memory is up to date). A shared cache line is (or might be) replicated in some other CPUs' caches, meaning that this CPU must interact with those other CPUs before writing to this cache line. An invalid cache line contains no value, instead representing "empty space" in the cache into which data from memory might be loaded.

Moore's Law: A 1965 empirical projection by Gordon Moore that transistor density increases exponentially over time [Moo65].

Mutual-Exclusion Mechanism: A software abstraction that regulates threads' access to "critical sections" and corresponding data.

NMI: Non-maskable interrupt. As the name indicates, this is an extremely high-priority interrupt that cannot be masked. These are used for hardware-specific purposes such as profiling. The advantage of using NMIs for profiling is that it allows you to profile code that runs with interrupts disabled.

Non-Blocking: A group of academic forward-progress guarantees that includes bounded population-oblivious wait free, bounded wait free, wait free, lock free, obstruction free, clash free, starvation free, and deadlock free. See Section 14.2 for more information.

Non-Blocking Synchronization (NBS): The use of algorithms, mechanisms, or techniques that provide non-blocking forward-progress guarantees. NBS is often used in a more restrictive sense of providing one of the stronger forward-progress guarantees, usually wait free or lock free, but sometimes also obstruction free. (See "Non-blocking".)

NUCA: Non-uniform cache architecture, where groups of CPUs share caches and/or store buffers. CPUs in a group can therefore exchange cache lines with each other much more quickly than they can with CPUs in other groups. Systems comprised of CPUs with hardware threads will generally have a NUCA architecture.

NUMA: Non-uniform memory architecture, where memory is split into banks and each such bank is "close" to a group of CPUs, the group being termed a "NUMA node". An example NUMA machine is Sequent's NUMA-Q system, where each group of four CPUs had a bank of memory nearby. The CPUs in a given group can access their memory much more quickly than another group's memory.

NUMA Node: A group of closely placed CPUs and associated memory within a larger NUMA machine.

Obstruction Free: A forward-progress guarantee in which, in the absence of contention, every thread makes progress within a finite period of time.

Overhead: Operations that must be executed, but which do not contribute directly to the work that must be accomplished. For example, lock acquisition and release is normally considered to be overhead, and specifically to be synchronization overhead.

Parallel: In this book, a synonym of concurrent. Please see Appendix A.6 on page 412 for a discussion of the recent distinction between these two terms.
Performance: Rate at which work is done, expressed as work per unit time. If this work is fully serialized, then the performance will be the reciprocal of the mean latency of the work items.

Pipelined CPU: A CPU with a pipeline, which is an internal flow of instructions within the CPU that is in some way similar to an assembly line, with many of the same advantages and disadvantages. In the 1960s through the early 1980s, pipelined CPUs were the province of supercomputers, but started appearing in microprocessors (such as the 80486) in the late 1980s.

Plain Access: A source-code memory access that simply mentions the name of the object being accessed. (See "Marked access".)

Process Consistency: A memory-consistency model in which each CPU's stores appear to occur in program order, but in which different CPUs might see accesses from more than one CPU as occurring in different orders.

Program Order: The order in which a given thread's instructions would be executed by a now-mythical "in-order" CPU that completely executed each instruction before proceeding to the next instruction. (The reason such CPUs are now the stuff of ancient myths and legends is that they were extremely slow. These dinosaurs were one of the many victims of Moore's-Law-driven increases in CPU clock frequency. Some claim that these beasts will roam the earth once again, others vehemently disagree.)

Quiescent State: In RCU, a point in the code where there can be no references held to RCU-protected data structures, which is normally any point outside of an RCU read-side critical section. Any interval of time during which all threads pass through at least one quiescent state each is termed a "grace period".

Quiescent-State-Based Reclamation (QSBR): An RCU implementation style characterized by explicit quiescent states. In QSBR implementations, read-side markers (rcu_read_lock() and rcu_read_unlock() in the Linux kernel) are no-ops [MS98a, SM95]. Hooks in other parts of the software (for example, the Linux-kernel scheduler) provide the quiescent states.

Race Condition: Any situation where multiple CPUs or threads can interact, though this term is often used in cases where such interaction is undesirable. (See "Data race".)

RCU-Protected Data: A block of dynamically allocated memory whose freeing will be deferred such that an RCU grace period will elapse between the time that there were no longer any RCU-reader-accessible pointers to that block and the time that that block is freed. This ensures that no RCU readers will have access to that block at the time that it is freed.

RCU-Protected Pointer: A pointer to RCU-protected data. Such pointers must be handled carefully, for example, any reader that intends to dereference an RCU-protected pointer must use rcu_dereference() (or stronger) to load that pointer, and any updater must use rcu_assign_pointer() (or stronger) to store to that pointer. More information is provided in Section 15.3.2.

RCU Read-Side Critical Section: A section of code protected by RCU, for example, beginning with rcu_read_lock() and ending with rcu_read_unlock(). (See "Read-side critical section".)

Read-Copy Update (RCU): A synchronization mechanism that can be thought of as a replacement for reader-writer locking or reference counting. RCU provides extremely low-overhead access for readers, while writers incur additional overhead maintaining old versions for the benefit of pre-existing readers. Readers neither block nor spin, and thus cannot participate in deadlocks, however, they also can see stale data and can run concurrently with updates. RCU is thus best-suited for read-mostly situations where stale data can either be tolerated (as in routing tables) or avoided (as in the Linux kernel's System V IPC implementation).

Read Memory Barrier: A memory barrier that is only guaranteed to affect the ordering of load instructions, that is, reads from memory. (See also "memory barrier" and "write memory barrier".)

Read Mostly: Read-mostly data is (again, as the name implies) rarely updated. However, it might be updated at any time.

Read Only: Read-only data is, as the name implies, never updated except by beginning-of-time initialization. In this book, a synonym for immutable.
Read-Side Critical Section: A section of code guarded by read-acquisition of some reader-writer synchronization mechanism. For example, if one set of critical sections are guarded by read-acquisition of a given global reader-writer lock, while a second set of critical sections are guarded by write-acquisition of that same reader-writer lock, then the first set of critical sections will be the read-side critical sections for that lock. Any number of threads may concurrently execute the read-side critical sections, but only if no thread is executing one of the write-side critical sections. (See also "RCU read-side critical section".)

Reader-Writer Lock: A reader-writer lock is a mutual-exclusion mechanism that permits any number of reading threads, or but one writing thread, into the set of critical sections guarded by that lock. Threads attempting to write must wait until all pre-existing reading threads release the lock, and, similarly, if there is a pre-existing writer, any threads attempting to write must wait for the writer to release the lock. A key concern for reader-writer locks is "fairness": Can an unending stream of readers starve a writer or vice versa?

Real Time: A situation in which getting the correct result is not sufficient, but where this result must also be obtained within a given amount of time.

Reference Count: A counter that tracks the number of users of a given object or entity. Reference counters provide existence guarantees and are sometimes used to implement garbage collectors.

Release Store: A write to memory that has release semantics. Normal use cases pair an acquire load with a release store, in which case if the load returns the value stored, then all code executed by the loading CPU after that acquire load will see the effects of all memory-reference instructions executed by the storing CPU prior to that release store. Releasing a lock provides similar memory-ordering semantics, hence the "release" in "release store". (See also "acquire load" and "memory barrier".)

Scalability: A measure of how effectively a given system is able to utilize additional resources. For parallel computing, the additional resources are usually additional CPUs.

Sequence Lock: A reader-writer synchronization mechanism in which readers retry their operations if a writer was present.

Sequential Consistency: A memory-consistency model where all memory references appear to occur in an order consistent with a single global order, and where each CPU's memory references appear to all CPUs to occur in program order.

Software Transactional Memory (STM): A transactional-memory system capable of running on computer systems without special hardware support. (See "Transactional memory".)

Starvation: A condition where at least one CPU or thread is unable to make progress due to an unfortunate series of resource-allocation decisions, as discussed in Section 7.1.2. For example, in a multisocket system, CPUs on one socket having privileged access to the data structure implementing a given lock could prevent CPUs on other sockets from ever acquiring that lock.

Starvation Free: A forward-progress guarantee in which, in the absence of failures, every thread makes progress within a finite period of time.

Store Buffer: A small set of internal registers used by a given CPU to record pending stores while the corresponding cache lines are making their way to that CPU. Also called "store queue".

Store Forwarding: An arrangement where a given CPU refers to its store buffer as well as its cache so as to ensure that the software sees the memory operations performed by this CPU as if they were carried out in program order.

Superscalar CPU: A scalar (non-vector) CPU capable of executing multiple instructions concurrently. This is a step up from a pipelined CPU that executes multiple instructions in an assembly-line fashion—in a superscalar CPU, each stage of the pipeline would be capable of handling more than one instruction. For example, if the conditions were exactly right, the Intel Pentium Pro CPU from the mid-1990s could execute two (and sometimes three) instructions per clock cycle. Thus, a 200 MHz Pentium Pro CPU could "retire", or complete the execution of, up to 400 million instructions per second.
Synchronization: Means for avoiding destructive interactions among CPUs or threads. Synchronization mechanisms include atomic RMW operations, memory barriers, locking, reference counting, hazard pointers, sequence locking, RCU, non-blocking synchronization, and transactional memory.

Teachable: A topic, concept, method, or mechanism that teachers believe that they understand completely and are therefore comfortable teaching.

Throughput: A performance metric featuring work items completed per unit time.

Transactional Lock Elision (TLE): The use of transactional memory to emulate locking. Synchronization is instead carried out by conflicting accesses to the data to be protected by the lock. In some cases, this can increase performance because TLE avoids contention on the lock word [PD11, Kle14, FIMR16, PMDY20].

Transactional Memory (TM): A synchronization mechanism that gathers groups of memory accesses so as to execute them atomically from the viewpoint of transactions on other CPUs or threads, discussed in Sections 17.2 and 17.3.

Type-Safe Memory: Type-safe memory [GC96] is provided by a synchronization mechanism that prevents a given dynamically allocated object from changing to an incompatible type. Note that the object might well be freed and then reallocated, but the reallocated object is guaranteed to be of a compatible type. Within the Linux kernel, type-safe memory is provided within RCU read-side critical sections for memory allocated from slabs marked with the SLAB_TYPESAFE_BY_RCU flag. The strictly stronger existence guarantee also prevents freeing of the protected object.

Unbounded Transactional Memory (UTM): A transactional-memory system based on hardware instructions provided for this purpose, but with special hardware or software capabilities that allow a given transaction to have a very large memory footprint. Such a system would at least partially avoid HTM's transaction-size limitations called out in Section 17.3.2.1. (See "Hardware transactional memory".)

Unfairness: A condition where the progress of at least one CPU or thread is impeded by an unfortunate series of resource-allocation decisions, as discussed in Section 7.1.2. Extreme levels of unfairness are termed "starvation".

Unteachable: A topic, concept, method, or mechanism that the teacher does not understand well and is therefore uncomfortable teaching.

Vector CPU: A CPU that can apply a single instruction to multiple items of data concurrently. In the 1960s through the 1980s, only supercomputers had vector capabilities, but the advent of MMX in x86 CPUs and VMX in PowerPC CPUs brought vector processing to the masses.

Wait Free: A forward-progress guarantee in which every thread makes progress within a finite period of time.

Write Memory Barrier: A memory barrier that is only guaranteed to affect the ordering of store instructions, that is, writes to memory. (See also "memory barrier" and "read memory barrier".)

Write Miss: A cache miss incurred because the corresponding CPU attempted to write to a cache line that is read-only, most likely due to its being replicated in other CPUs' caches.

Write Mostly: Write-mostly data is (yet again, as the name implies) frequently updated.

Write-Side Critical Section: A section of code guarded by write-acquisition of some reader-writer synchronization mechanism. For example, if one set of critical sections are guarded by write-acquisition of a given global reader-writer lock, while a second set of critical sections are guarded by read-acquisition of that same reader-writer lock, then the first set of critical sections will be the write-side critical sections for that lock. Only one thread may execute in the write-side critical section at a time, and even then only if there are no threads executing concurrently in any of the corresponding read-side critical sections.
Bibliography
[AA14] Maya Arbel and Hagit Attiya. Concurrent updates with RCU: Search tree as
an example. In Proceedings of the 2014 ACM Symposium on Principles of
Distributed Computing, PODC ’14, page 196–205, Paris, France, 2014. ACM.
[AAKL06] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, and Charles E.
Leiserson. Unbounded transactional memory. IEEE Micro, pages 59–69,
January-February 2006.
[AB13] Samy Al Bahra. Nonblocking algorithms and scalable multicore programming.
Commun. ACM, 56(7):50–61, July 2013.
[ABD+ 97] Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat,
Monika R. Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T. Vande-
voorde, Carl A. Waldspurger, and William E. Weihl. Continuous profiling:
Where have all the cycles gone? In Proceedings of the 16th ACM Symposium
on Operating Systems Principles, pages 1–14, New York, NY, October 1997.
[ACA+ 18] A. Aljuhni, C. E. Chow, A. Aljaedi, S. Yusuf, and F. Torres-Reyes. Towards
understanding application performance and system behavior with the full
dynticks feature. In 2018 IEEE 8th Annual Computing and Communication
Workshop and Conference (CCWC), pages 394–401, 2018.
[ACHS13] Dan Alistarh, Keren Censor-Hillel, and Nir Shavit. Are lock-free concurrent
algorithms practically wait-free?, December 2013. ArXiv:1311.3200v2.
[ACMS03] Andrea Arcangeli, Mingming Cao, Paul E. McKenney, and Dipankar Sarma.
Using read-copy update techniques for System V IPC in the Linux 2.5 kernel.
In Proceedings of the 2003 USENIX Annual Technical Conference (FREENIX
Track), pages 297–310, San Antonio, Texas, USA, June 2003. USENIX
Association.
[Ada11] Andrew Adamatzky. Slime mould solves maze in one pass . . . assisted by
gradient of chemo-attractants, August 2011. arXiv:1108.4956.
[ADF+ 19] Jade Alglave, Will Deacon, Boqun Feng, David Howells, Daniel Lustig, Luc
Maranget, Paul E. McKenney, Andrea Parri, Nicholas Piggin, Alan Stern,
Akira Yokosawa, and Peter Zijlstra. Who’s afraid of a big bad optimizing
compiler?, July 2019. Linux Weekly News.
[Adv02] Advanced Micro Devices. AMD x86-64 Architecture Programmer’s Manual
Volumes 1–5, 2002.
[AGH+ 11a] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M.
Michael, and Martin Vechev. Laws of order: Expensive synchronization in
concurrent algorithms cannot be eliminated. In 38th ACM SIGACT-SIGPLAN
[AMM+ 18] Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and Alan Stern.
Frightening small children and disconcerting grown-ups: Concurrency in the
Linux kernel. In Proceedings of the Twenty-Third International Conference
on Architectural Support for Programming Languages and Operating Systems,
ASPLOS ’18, pages 405–418, Williamsburg, VA, USA, 2018. ACM.
[AMP+ 11] Jade Alglave, Luc Maranget, Pankaj Pawan, Susmit Sarkar, Peter Sewell, Derek
Williams, and Francesco Zappa Nardelli. PPCMEM/ARMMEM: A tool for
exploring the POWER and ARM memory models, June 2011. https://www.
cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf.
[AMT14] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats: Modelling,
simulation, testing, and data-mining for weak memory. In Proceedings of
the 35th ACM SIGPLAN Conference on Programming Language Design and
Implementation, PLDI ’14, pages 40–40, Edinburgh, United Kingdom, 2014.
ACM.
[And90] T. E. Anderson. The performance of spin lock alternatives for shared-memory
multiprocessors. IEEE Transactions on Parallel and Distributed Systems,
1(1):6–16, January 1990.
[And91] Gregory R. Andrews. Concurrent Programming, Principles, and Practices.
Benjamin Cummins, 1991.
[And19] Jim Anderson. Software transactional memory for real-time systems, August
2019. https://www.cs.unc.edu/~anderson/projects/rtstm.html.
[ARM10] ARM Limited. ARM Architecture Reference Manual: ARMv7-A and ARMv7-R
Edition, 2010.
[ARM17] ARM Limited. ARM Architecture Reference Manual (ARMv8, for ARMv8-A
architecture profile), 2017.
[Ash15] Mike Ash. Concurrent memory deallocation in the objective-c runtime, May
2015. mikeash.com: just this guy, you know?
[ATC+ 11] Ege Akpinar, Sasa Tomic, Adrian Cristal, Osman Unsal, and Mateo Valero. A
comprehensive study of conflict resolution policies in hardware transactional
memory. In TRANSACT 2011, New Orleans, LA, USA, June 2011. ACM
SIGPLAN.
[ATS09] Ali-Reza Adl-Tabatabai and Tatiana Shpeisman. Draft specification of transac-
tional language constructs for C++, August 2009. URL: https://software.
intel.com/sites/default/files/ee/47/21569 (may need to append
.pdf to view after download).
[Att10] Hagit Attiya. The inherent complexity of transactional memory and what to
do about it. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium
on Principles of Distributed Computing, PODC ’10, pages 1–5, Zurich,
Switzerland, 2010. ACM.
[BA01] Jeff Bonwick and Jonathan Adams. Magazines and vmem: Extending the slab
allocator to many CPUs and arbitrary resources. In USENIX Annual Technical
Conference, General Track 2001, pages 15–33, 2001.
[Bah11a] Samy Al Bahra. ck_epoch: Support per-object destructors, Oc-
tober 2011. https://github.com/concurrencykit/ck/commit/
10ffb2e6f1737a30e2dcf3862d105ad45fcd60a4.
[BJ12] Rex Black and Capers Jones. Economics of software quality: An interview
with Capers Jones, part 1 of 2 (podcast transcript), January 2012. https:
//www.informit.com/articles/article.aspx?p=1824791.
[BK85] Bob Beck and Bob Kasten. VLSI assist in building a multiprocessor UNIX
system. In USENIX Conference Proceedings, pages 255–275, Portland, OR,
June 1985. USENIX Association.
[BLM05] C. Blundell, E. C. Lewis, and M. Martin. Deconstructing transactional seman-
tics: The subtleties of atomicity. In Annual Workshop on Duplicating, De-
constructing, and Debunking (WDDD), June 2005. Available: http://acg.
cis.upenn.edu/papers/wddd05_atomic_semantics.pdf [Viewed Feb-
ruary 28, 2021].
[BLM06] C. Blundell, E. C. Lewis, and M. Martin. Subtleties of transactional
memory and atomicity semantics. Computer Architecture Letters, 5(2),
2006. Available: http://acg.cis.upenn.edu/papers/cal06_atomic_
semantics.pdf [Viewed February 28, 2021].
[BM18] JF Bastien and Paul E. McKenney. P0750r1: Consume, February
2018. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/
2018/p0750r1.html.
[BMMM05] Luke Browning, Thomas Mathews, Paul E. McKenney, and James Moody.
Apparatus, method, and computer program product for converting simple locks
in a multiprocessor system. US Patent 6,842,809, Assigned to International
Business Machines Corporation, Washington, DC, January 2005.
[BMN+ 15] Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod,
and Peter Sewell. The problem of programming language concurrency
semantics. In Jan Vitek, editor, Programming Languages and Systems, volume
9032 of Lecture Notes in Computer Science, pages 283–307. Springer Berlin
Heidelberg, 2015.
[BMP08] R. F. Berry, P. E. McKenney, and F. N. Parr. Responsive systems: An
introduction. IBM Systems Journal, 47(2):197–206, April 2008.
[Boe05] Hans-J. Boehm. Threads cannot be implemented as a library. SIGPLAN Not.,
40(6):261–268, June 2005.
[Boe09] Hans-J. Boehm. Transactional memory should be an implementation technique,
not a programming interface. In HOTPAR 2009, page 6, Berkeley, CA, USA,
March 2009. Available: https://www.usenix.org/event/hotpar09/
tech/full_papers/boehm/boehm.pdf [Viewed May 24, 2009].
[Boe20] Hans Boehm. “Undefined behavior” and the concurrency memory
model, August 2020. http://www.open-std.org/jtc1/sc22/wg21/
docs/papers/2020/p2215r0.pdf.
[Boh01] Kristoffer Bohmann. Response time still matters, July 2001. URL: http:
//www.bohmann.dk/articles/response_time_still_matters.html
[broken, November 2016].
[Bon13] Paolo Bonzini. seqlock: introduce read-write seqlock, Sep-
tember 2013. https://git.qemu.org/?p=qemu.git;a=commit;h=
ea753d81e8b085d679f13e4a6023e003e9854d51.
[Bon15] Paolo Bonzini. rcu: add rcu library, February
2015. https://git.qemu.org/?p=qemu.git;a=commit;h=
7911747bd46123ef8d8eef2ee49422bb8a4b274f.
[CnRR18] Armando Castañeda, Sergio Rajsbaum, and Michel Raynal. Unifying con-
current objects and distributed tasks: Interval-linearizability. J. ACM, 65(6),
November 2018.
[Com01] Compaq Computer Corporation. Shared memory, threads, inter-
process communication, August 2001. Zipped archive: wiz_
2637.txt in https://www.digiater.nl/openvms/freeware/v70/
ask_the_wizard/wizard.zip.
[Coo18] Byron Cook. Formal reasoning about the security of amazon web services. In
Hana Chockler and Georg Weissenbacher, editors, Computer Aided Verifica-
tion, pages 38–47, Cham, 2018. Springer International Publishing.
[Cor02] Compaq Computer Corporation. Alpha Architecture Reference Manual. Digital
Press, fourth edition, 2002.
[Cor03] Jonathan Corbet. Driver porting: mutual exclusion with seqlocks, February
2003. https://lwn.net/Articles/22818/.
[Cor04a] Jonathan Corbet. Approaches to realtime Linux, October 2004. URL:
https://lwn.net/Articles/106010/.
[Cor04b] Jonathan Corbet. Finding kernel problems automatically, June 2004. https:
//lwn.net/Articles/87538/.
[Cor04c] Jonathan Corbet. Realtime preemption, part 2, October 2004. URL: https:
//lwn.net/Articles/107269/.
[Cor06a] Jonathan Corbet. The kernel lock validator, May 2006. Available: https:
//lwn.net/Articles/185666/ [Viewed: March 26, 2010].
[Cor06b] Jonathan Corbet. Priority inheritance in the kernel, April 2006. Available:
https://lwn.net/Articles/178253/ [Viewed June 29, 2009].
[Cor10a] Jonathan Corbet. Dcache scalability and RCU-walk, December 2010. Avail-
able: https://lwn.net/Articles/419811/ [Viewed May 29, 2017].
[Cor10b] Jonathan Corbet. sys_membarrier(), January 2010. https://lwn.net/
Articles/369567/.
[Cor11] Jonathan Corbet. How to ruin linus’s vacation, July 2011. Available: https:
//lwn.net/Articles/452117/ [Viewed May 29, 2017].
[Cor12] Jonathan Corbet. ACCESS_ONCE(), August 2012. https://lwn.net/
Articles/508991/.
[Cor13] Jonathan Corbet. (Nearly) full tickless operation in 3.10, May 2013. https:
//lwn.net/Articles/549580/.
[Cor14a] Jonathan Corbet. ACCESS_ONCE() and compiler bugs, December 2014.
https://lwn.net/Articles/624126/.
[Cor14b] Jonathan Corbet. MCS locks and qspinlocks, March 2014. https://lwn.
net/Articles/590243/.
[Cor14c] Jonathan Corbet. Relativistic hash tables, part 1: Algorithms, September
2014. https://lwn.net/Articles/612021/.
[Cor14d] Jonathan Corbet. Relativistic hash tables, part 2: Implementation, September
2014. https://lwn.net/Articles/612100/.
[Cor16a] Jonathan Corbet. Finding race conditions with KCSAN, June 2016. https:
//lwn.net/Articles/691128/.
[Cor16b] Jonathan Corbet. Time to move to C11 atomics?, June 2016. https:
//lwn.net/Articles/691128/.
[Cor18] Jonathan Corbet. membarrier(2), October 2018. https://man7.org/
linux/man-pages/man2/membarrier.2.html.
[Cra93] Travis Craig. Building FIFO and priority-queuing spin locks from atomic swap.
Technical Report 93-02-02, University of Washington, Seattle, Washington,
February 1993.
[CRKH05] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux
Device Drivers. O’Reilly Media, Inc., third edition, 2005. URL: https:
//lwn.net/Kernel/LDD3/.
[CSG99] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer
Architecture: a Hardware/Software Approach. Morgan Kaufman, 1999.
[cut17] crates.io user ticki. conc v0.5.0: Hazard-pointer-based concurrent memory
reclamation, August 2017. https://crates.io/crates/conc.
[Dat82] C. J. Date. An Introduction to Database Systems, volume 1. Addison-Wesley
Publishing Company, 1982.
[DBA09] Saeed Dehnadi, Richard Bornat, and Ray Adams. Meta-analysis of the effect of
consistency on success in early learning of programming. In PPIG 2009, pages
1–13, University of Limerick, Ireland, June 2009. Psychology of Programming
Interest Group.
[DCW+ 11] Luke Dalessandro, Francois Carouge, Sean White, Yossi Lev, Mark Moir,
Michael L. Scott, and Michael F. Spear. Hybrid NOrec: A case study in the
effectiveness of best effort hardware transactional memory. In Proceedings of
the 16th International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), ASPLOS ’11, pages 39–52,
Newport Beach, CA, USA, 2011. ACM.
[Dea18] Will Deacon. [PATCH 00/10] kernel/locking: qspinlock improve-
ments, April 2018. https://lkml.kernel.org/r/1522947547-24081-
[email protected].
[Dea19] Will Deacon. Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates,
August 2019. https://lore.kernel.org/lkml/20190821103200.
kpufwtviqhpbuv2n@willie-the-truck/.
[Den15] Peter Denning. Perspectives on OS foundations. In SOSP History Day 2015,
SOSP ’15, pages 3:1–3:46, Monterey, California, 2015. ACM.
[Dep06] Department of Computing and Information Systems, University of Melbourne.
CSIRAC, 2006. https://cis.unimelb.edu.au/about/csirac/.
[Des09a] Mathieu Desnoyers. Low-Impact Operating System Tracing. PhD
thesis, Ecole Polytechnique de Montréal, December 2009. Available:
https://lttng.org/files/thesis/desnoyers-dissertation-
2009-12-v27.pdf [Viewed February 27, 2021].
[Des09b] Mathieu Desnoyers. [RFC git tree] userspace RCU (urcu) for Linux, February
2009. https://liburcu.org.
[DFGG11] Aleksandar Dragovejic, Pascal Felber, Vincent Gramoli, and Rachid Guerraoui.
Why STM can be more than a research toy. Communications of the ACM,
pages 70–77, April 2011.
[DFLO19] Dino Distefano, Manuel Fähndrich, Francesco Logozzo, and Peter W. O’Hearn.
Scaling static analyses at facebook. Commun. ACM, 62(8):62–70, July 2019.
[DHJ+ 07] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakula-
pati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter
Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value
store. SIGOPS Oper. Syst. Rev., 41(6):205–220, October 2007.
[DHK12] Vijay D’Silva, Leopold Haller, and Daniel Kroening. Satisfiability solvers are
static analyzers. In Static Analysis Symposium (SAS), volume 7460 of LNCS,
pages 317–333. Springer, 2012.
[DHL+ 08] Dave Dice, Maurice Herlihy, Doug Lea, Yossi Lev, Victor Luchangco, Wayne
Mesard, Mark Moir, Kevin Moore, and Dan Nussbaum. Applications of the
adaptive transactional memory test platform. In 3rd ACM SIGPLAN Workshop
on Transactional Computing, pages 1–10, Salt Lake City, UT, USA, February
2008.
[Dij65] E. W. Dijkstra. Solution of a problem in concurrent programming control.
Communications of the ACM, 8(9):569, September 1965.
[Dij68] Edsger W. Dijkstra. Letters to the editor: Go to statement considered harmful.
Commun. ACM, 11(3):147–148, March 1968.
[Dij71] Edsger W. Dijkstra. Hierarchical ordering of sequential processes. Acta
Informatica, 1(2):115–138, 1971. Available: https://www.cs.utexas.
edu/users/EWD/ewd03xx/EWD310.PDF [Viewed January 13, 2008].
[DKS89] Alan Demers, Srinivasan Keshav, and Scott Shenker. Analysis and simulation
of a fair queuing algorithm. SIGCOMM ’89, pages 1–12, 1989.
[DLM+ 10] Dave Dice, Yossi Lev, Virendra J. Marathe, Mark Moir, Dan Nussbaum,
and Marek Oleszewski. Simplifying concurrent algorithms by exploiting
hardware transactional memory. In Proceedings of the 22nd ACM symposium
on Parallelism in algorithms and architectures, SPAA ’10, pages 325–334,
Thira, Santorini, Greece, 2010. ACM.
[DLMN09] Dave Dice, Yossi Lev, Mark Moir, and Dan Nussbaum. Early experience with
a commercial hardware transactional memory implementation. In Fourteenth
International Conference on Architectural Support for Programming Lan-
guages and Operating Systems (ASPLOS ’09), pages 157–168, Washington,
DC, USA, March 2009.
[DMD13] Mathieu Desnoyers, Paul E. McKenney, and Michel R. Dagenais. Multi-core
systems modeling for formal verification of parallel algorithms. SIGOPS Oper.
Syst. Rev., 47(2):51–65, July 2013.
[DMLP79] Richard A. De Millo, Richard J. Lipton, and Alan J. Perlis. Social processes
and proofs of theorems and programs. Commun. ACM, 22(5):271–280, May
1979.
[DMS+ 12] Mathieu Desnoyers, Paul E. McKenney, Alan Stern, Michel R. Dagenais, and
Jonathan Walpole. User-level implementations of read-copy update. IEEE
Transactions on Parallel and Distributed Systems, 23:375–382, 2012.
[dO18a] Daniel Bristot de Oliveira. Deadline scheduler part 2 – details and usage,
January 2018. URL: https://lwn.net/Articles/743946/.
[dO18b] Daniel Bristot de Oliveira. Deadline scheduling part 1 – overview and theory,
January 2018. URL: https://lwn.net/Articles/743740/.
[dOCdO19] Daniel Bristot de Oliveira, Tommaso Cucinotta, and Rômulo Silva de Oliveira.
Modeling the behavior of threads in the PREEMPT_RT Linux kernel using
automata. SIGBED Rev., 16(3):63–68, November 2019.
[Don21] Jason Donenfeld. Introduce WireGuardNT, August 2021. Git
commit: https://git.zx2c4.com/wireguard-nt/commit/?id=
d64c53776d7f72751d7bd580ead9846139c8f12f.
[Dov90] Ken F. Dove. A high capacity TCP/IP in parallel STREAMS. In UKUUG
Conference Proceedings, London, June 1990.
[Dow20] Travis Downs. Gathering intel on Intel AVX-512 transitions, Jan-
uary 2020. https://travisdowns.github.io/blog/2020/01/17/
avxfreq1.html.
[Dre11] Ulrich Drepper. Futexes are tricky. Technical Report FAT2011, Red Hat, Inc.,
Raleigh, NC, USA, November 2011.
[DSS06] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In Proc.
International Symposium on Distributed Computing. Springer Verlag, 2006.
[Duf10a] Joe Duffy. A (brief) retrospective on transactional memory,
January 2010. http://joeduffyblog.com/2010/01/03/a-brief-
retrospective-on-transactional-memory/.
[Duf10b] Joe Duffy. More thoughts on transactional memory, May
2010. http://joeduffyblog.com/2010/05/16/more-thoughts-on-
transactional-memory/.
[Dug10] Abhinav Duggal. Stopping data races using redflag. Master’s thesis, Stony
Brook University, 2010.
[Eas71] William B. Easton. Process synchronization without long-term interlock. In
Proceedings of the Third ACM Symposium on Operating Systems Principles,
SOSP ’71, pages 95–100, Palo Alto, California, USA, 1971. Association for
Computing Machinery.
[Edg13] Jake Edge. The future of realtime Linux, November 2013. URL: https:
//lwn.net/Articles/572740/.
[Edg14] Jake Edge. The future of the realtime patch set, October 2014. URL:
https://lwn.net/Articles/617140/.
[EGCD03] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC language specifica-
tions v1.1, May 2003. URL: http://upc.gwu.edu [broken, February 27,
2021].
[EGMdB11] Stephane Eranian, Eric Gouriou, Tipp Moseley, and Willem de Bruijn. Linux
kernel profiling with perf, June 2011. https://perf.wiki.kernel.org/
index.php/Tutorial.
[Ell80] Carla Schlatter Ellis. Concurrent search and insertion in AVL trees. IEEE
Transactions on Computers, C-29(9):811–817, September 1980.
[ELLM07] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable
NonZero indicators. In Proceedings of the twenty-sixth annual ACM symposium
on Principles of distributed computing, PODC ’07, pages 13–22, Portland,
Oregon, USA, 2007. ACM.
[EMV+ 20a] Marco Elver, Paul E. McKenney, Dmitry Vyukov, Andrey Konovalov, Alexan-
der Potapenko, Kostya Serebryany, Alan Stern, Andrea Parri, Akira Yokosawa,
Peter Zijlstra, Will Deacon, Daniel Lustig, Boqun Feng, Joel Fernandes,
Jade Alglave, and Luc Maranget. Concurrency bugs should fear the big bad
data-race detector (part 1), April 2020. Linux Weekly News.
[EMV+ 20b] Marco Elver, Paul E. McKenney, Dmitry Vyukov, Andrey Konovalov, Alexan-
der Potapenko, Kostya Serebryany, Alan Stern, Andrea Parri, Akira Yokosawa,
Peter Zijlstra, Will Deacon, Daniel Lustig, Boqun Feng, Joel Fernandes,
Jade Alglave, and Luc Maranget. Concurrency bugs should fear the big bad
data-race detector (part 2), April 2020. Linux Weekly News.
[Eng68] Douglas Engelbart. The demo, December 1968. URL: http://thedemo.
org/.
[ENS05] Ryan Eccles, Blair Nonneck, and Deborah A. Stacey. Exploring parallel
programming knowledge in the novice. In HPCS ’05: Proceedings of the
19th International Symposium on High Performance Computing Systems and
Applications, pages 97–102, Guelph, Ontario, Canada, 2005. IEEE Computer
Society.
[Eri08] Christer Ericson. Aiding pathfinding with cellular automata, June 2008.
http://realtimecollisiondetection.net/blog/?p=57.
[ES90] Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference
Manual. Addison Wesley, 1990.
[ES05] Ryan Eccles and Deborah A. Stacey. Understanding the parallel programmer.
In HPCS ’05: Proceedings of the 19th International Symposium on High
Performance Computing Systems and Applications, pages 156–160, Guelph,
Ontario, Canada, 2005. IEEE Computer Society.
[ETH11] ETH Zurich. Parallel solver for a perfect maze, March
2011. URL: http://nativesystems.inf.ethz.ch/pub/Main/
WebHomeLecturesParallelProgrammingExercises/pp2011hw04.pdf
[broken, November 2016].
[Eva11] Jason Evans. Scalable memory allocation using jemalloc, Janu-
ary 2011. https://engineering.fb.com/2011/01/03/core-data/
scalable-memory-allocation-using-jemalloc/.
[Fel50] W. Feller. An Introduction to Probability Theory and its Applications. John
Wiley, 1950.
[Fen73] J. Fennel. Instruction selection in a two-program counter instruction unit.
Technical Report US Patent 3,728,692, Assigned to International Business
Machines Corp, Washington, DC, April 1973.
[Fen15] Boqun Feng. powerpc: Make value-returning atomics fully ordered, November
2015. Git commit: https://git.kernel.org/linus/49e9cf3f0c04.
[FH07] Keir Fraser and Tim Harris. Concurrent programming without locks. ACM
Trans. Comput. Syst., 25(2):1–61, 2007.
[FIMR16] Pascal Felber, Shady Issa, Alexander Matveev, and Paolo Romano. Hardware
read-write lock elision. In Proceedings of the Eleventh European Conference on
Computer Systems, EuroSys ’16, London, United Kingdom, 2016. Association
for Computing Machinery.
[Fos10] Ron Fosner. Scalable multithreaded programming with tasks. MSDN Magazine,
2010(11):60–69, November 2010. http://msdn.microsoft.com/en-us/
magazine/gg309176.aspx.
[FPB79] Frederick P. Brooks, Jr. The Mythical Man-Month. Addison-Wesley, 1979.
[Fra03] Keir Anthony Fraser. Practical Lock-Freedom. PhD thesis, King’s College,
University of Cambridge, 2003.
[Fra04] Keir Fraser. Practical lock-freedom. Technical Report UCAM-CL-TR-579,
University of Cambridge, Computer Laboratory, February 2004.
[FRK02] Hubertus Francke, Rusty Russell, and Matthew Kirkwood. Fuss, futexes
and furwocks: Fast userlevel locking in linux. In Ottawa Linux Symposium,
pages 479–495, June 2002. Available: https://www.kernel.org/doc/
ols/2002/ols2002-pages-479-495.pdf [Viewed May 22, 2011].
[FSP+ 17] Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc
Maranget, Kathryn E. Gray, Ali Sezgin, Mark Batty, and Peter Sewell.
Mixed-size concurrency: ARM, POWER, C/C++11, and SC. SIGPLAN Not.,
52(1):429–442, January 2017.
[GAJM15] Alex Groce, Iftekhar Ahmed, Carlos Jensen, and Paul E. McKenney. How
verified is my code? falsification-driven verification (t). In Proceedings of
the 2015 30th IEEE/ACM International Conference on Automated Software
Engineering (ASE), ASE ’15, pages 737–748, Washington, DC, USA, 2015.
IEEE Computer Society.
[Gar90] Arun Garg. Parallel STREAMS: a multi-processor implementation. In
USENIX Conference Proceedings, pages 163–176, Berkeley CA, February
1990. USENIX Association. Available: https://archive.org/details/
1990-proceedings-winter-dc/page/163/mode/2up.
[Gar07] Bryan Gardiner. IDF: Gordon Moore predicts end of Moore’s law (again),
September 2007. Available: https://www.wired.com/2007/09/idf-
gordon-mo-1/ [Viewed: February 27, 2021].
[GC96] Michael Greenwald and David R. Cheriton. The synergy between non-blocking
synchronization and operating system structure. In Proceedings of the Second
Symposium on Operating Systems Design and Implementation, pages 123–136,
Seattle, WA, October 1996. USENIX Association.
[GDZE10] Olga Golovanevsky, Alon Dayan, Ayal Zaks, and David Edelsohn. Trace-based
data layout optimizations for multi-core processors. In Proceedings of the 5th
International Conference on High Performance Embedded Architectures and
Compilers, HiPEAC’10, pages 81–95, Pisa, Italy, 2010. Springer-Verlag.
[GG14] Vincent Gramoli and Rachid Guerraoui. Democratizing transactional pro-
gramming. Commun. ACM, 57(1):86–93, January 2014.
[GGK18] Christina Giannoula, Georgios Goumas, and Nectarios Koziris. Combining
HTM with RCU to speed up graph coloring on multicore platforms. In Rio
Yokota, Michèle Weiland, David Keyes, and Carsten Trinitis, editors, High
Performance Computing, pages 350–369, Cham, 2018. Springer International
Publishing.
[GGL+ 19] Rachid Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, and
Vasileios Trigonakis. Lock–unlock: Is that all? a pragmatic analysis of
locking in software systems. ACM Trans. Comput. Syst., 36(1):1:1–1:149,
March 2019.
[Gra91] Jim Gray. The Benchmark Handbook for Database and Transaction Processing
Systems. Morgan Kaufmann, 1991.
[Gra02] Jim Gray. Super-servers: Commodity computer clusters pose a software chal-
lenge, April 2002. Available: http://research.microsoft.com/en-
us/um/people/gray/papers/superservers(4t_computers).doc
[Viewed: June 23, 2004].
[Gre19] Brendan Gregg. BPF Performance Tools: Linux System and Application
Observability. Addison-Wesley Professional, 1st edition, 2019.
[Gri00] Scott Griffen. Internet pioneers: Doug englebart, May 2000. Available:
https://www.ibiblio.org/pioneers/englebart.html [Viewed No-
vember 28, 2008].
[Gro01] The Open Group. Single UNIX specification, July 2001. http://www.
opengroup.org/onlinepubs/007908799/index.html.
[Gro07] Dan Grossman. The transactional memory / garbage collection analogy. In
OOPSLA ’07: Proceedings of the 22nd annual ACM SIGPLAN conference on
Object oriented programming systems and applications, pages 695–706, Mont-
real, Quebec, Canada, October 2007. ACM. Available: https://homes.cs.
washington.edu/~djg/papers/analogy_oopsla07.pdf [Viewed Feb-
ruary 27, 2021].
[GRY12] Alexey Gotsman, Noam Rinetzky, and Hongseok Yang. Verify-
ing highly concurrent algorithms with grace (extended version), July
2012. https://software.imdea.org/~gotsman/papers/recycling-
esop13-ext.pdf.
[GRY13] Alexey Gotsman, Noam Rinetzky, and Hongseok Yang. Verifying concurrent
memory reclamation algorithms with grace. In ESOP’13: European Sympo-
sium on Programming, pages 249–269, Rome, Italy, 2013. Springer-Verlag.
[GT90] Gary Graunke and Shreekant Thakkar. Synchronization algorithms for shared-
memory multiprocessors. IEEE Computer, 23(6):60–69, June 1990.
[Gui18] Hugo Guiroux. Understanding the performance of mutual exclusion algorithms
on modern multicore machines. PhD thesis, Université Grenoble Alpes, 2018.
https://hugoguiroux.github.io/assets/these.pdf.
[Gwy15] David Gwynne. introduce srp, which according to the manpage i wrote is
short for “shared reference pointers”., July 2015. https://github.com/
openbsd/src/blob/HEAD/sys/kern/kern_srp.c.
[GYW+ 19] Jinyu Gu, Qianqian Yu, Xiayang Wang, Zhaoguo Wang, Binyu Zang, Haibing
Guan, and Haibo Chen. Pisces: A scalable and efficient persistent transactional
memory. In Proceedings of the 2019 USENIX Conference on Usenix Annual
Technical Conference, USENIX ATC ’19, pages 913–928, Renton, WA, USA,
2019. USENIX Association.
[Har16] "No Bugs" Hare. Infographics: Operation costs in CPU clock cycles, Sep-
tember 2016. http://ithare.com/infographics-operation-costs-
in-cpu-clock-cycles/.
[Hay20] Timothy Hayes. A shift to concurrency, October 2020. https:
//community.arm.com/developer/research/b/articles/posts/
arms-transactional-memory-extension-support-.
[HCS+ 05] Lorin Hochstein, Jeff Carver, Forrest Shull, Sima Asgari, and Victor Basili. Par-
allel programmer productivity: A case study of novice parallel programmers.
In SC ’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing,
page 35, Seattle, WA, USA, 2005. IEEE Computer Society.
[Hei27] W. Heisenberg. Über den anschaulichen Inhalt der quantentheoretischen
Kinematik und Mechanik. Zeitschrift für Physik, 43(3-4):172–198, 1927.
English translation in “Quantum theory and measurement” by Wheeler and
Zurek.
[Her90] Maurice P. Herlihy. A methodology for implementing highly concurrent
data structures. In Proceedings of the 2nd ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, pages 197–206, Seattle,
WA, USA, March 1990.
[Her91] Maurice Herlihy. Wait-free synchronization. ACM TOPLAS, 13(1):124–149,
January 1991.
[Her93] Maurice Herlihy. A methodology for implementing highly concurrent data ob-
jects. ACM Transactions on Programming Languages and Systems, 15(5):745–
770, November 1993.
[Her05] Maurice Herlihy. The transactional manifesto: software engineering and
non-blocking synchronization. In PLDI ’05: Proceedings of the 2005 ACM
SIGPLAN conference on Programming language design and implementation,
pages 280–280, Chicago, IL, USA, 2005. ACM Press.
[Her11] Benjamin Herrenschmidt. powerpc: Fix atomic_xxx_return barrier seman-
tics, November 2011. Git commit: https://git.kernel.org/linus/
b97021f85517.
[HHK+ 13] A. Haas, T.A. Henzinger, C.M. Kirsch, M. Lippautz, H. Payer, A. Sezgin,
and A. Sokolova. Distributed queues in shared memory—multicore perfor-
mance and scalability through quantitative relaxation. In Proc. International
Conference on Computing Frontiers, Ischia, Italy, 2013. ACM.
[HKLP12] Andreas Haas, Christoph M. Kirsch, Michael Lippautz, and Hannes Payer.
How FIFO is your concurrent FIFO queue? In Proceedings of the Workshop
on Relaxing Synchronization for Multicore and Manycore Scalability, Tucson,
AZ USA, October 2012.
[HL86] Frederick S. Hillier and Gerald J. Lieberman. Introduction to Operations
Research. Holden-Day, 1986.
[HLM02] Maurice Herlihy, Victor Luchangco, and Mark Moir. The repeat offender
problem: A mechanism for supporting dynamic-sized, lock-free data structures.
In Proceedings of 16th International Symposium on Distributed Computing,
pages 339–353, Toulouse, France, October 2002.
[HLM03] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-free syn-
chronization: Double-ended queues as an example. In Proceedings of the 23rd
IEEE International Conference on Distributed Computing Systems (ICDCS),
pages 73–82, Providence, RI, May 2003. The Institute of Electrical and
Electronics Engineers, Inc.
[HM93] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural
support for lock-free data structures. In ISCA ’93: Proceeding of the 20th
Annual International Symposium on Computer Architecture, pages 289–300,
San Diego, CA, USA, May 1993.
[HMB06] Thomas E. Hart, Paul E. McKenney, and Angela Demke Brown. Making lock-
less synchronization fast: Performance implications of memory reclamation.
In 20th IEEE International Parallel and Distributed Processing Symposium,
Rhodes, Greece, April 2006. Available: http://www.rdrop.com/users/
paulmck/RCU/hart_ipdps06.pdf [Viewed April 28, 2008].
[HMBW07] Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan
Walpole. Performance of memory reclamation for lockless synchronization. J.
Parallel Distrib. Comput., 67(12):1270–1285, 2007.
[HMDZ06] David Howells, Paul E. McKenney, Will Deacon, and Peter Zijlstra. Linux
kernel memory barriers, March 2006. https://www.kernel.org/doc/
Documentation/memory-barriers.txt.
[Hoa74] C. A. R. Hoare. Monitors: An operating system structuring concept. Commu-
nications of the ACM, 17(10):549–557, October 1974.
[Hol03] Gerard J. Holzmann. The Spin Model Checker: Primer and Reference Manual.
Addison-Wesley, Boston, MA, USA, 2003.
[Hor18] Jann Horn. Reading privileged memory with a side-channel, Jan-
uary 2018. https://googleprojectzero.blogspot.com/2018/01/
reading-privileged-memory-with-side.html.
[HOS89] James P. Hennessy, Damian L. Osisek, and Joseph W. Seigh II. Passive
serialization in a multitasking environment. Technical Report US Patent
4,809,168, Assigned to International Business Machines Corp, Washington,
DC, February 1989.
[How12] Phil Howard. Extending Relativistic Programming to Multiple Writers. PhD
thesis, Portland State University, 2012.
[HP95] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach. Morgan Kaufman, 1995.
[HP11] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach, Fifth Edition. Morgan Kaufman, 2011.
[HP17] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach, Sixth Edition. Morgan Kaufman, 2017.
[Hra13] Adam Hraška. Read-copy-update for helenos. Master’s thesis, Charles
University in Prague, Faculty of Mathematics and Physics, Department of
Distributed and Dependable Systems, 2013.
[HS08] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming.
Morgan Kaufmann, Burlington, MA, USA, 2008.
[HSLS20] Maurice Herlihy, Nir Shavit, Victor Luchangco, and Michael Spear. The Art
of Multiprocessor Programming, 2nd Edition. Morgan Kaufmann, Burlington,
MA, USA, 2020.
[HW90] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a correctness
condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463–
492, July 1990.
[HW92] Wilson C. Hsieh and William E. Weihl. Scalable reader-writer locks for
parallel systems. In Proceedings of the 6th International Parallel Processing
Symposium, pages 216–230, Beverly Hills, CA, USA, March 1992.
[Int20c] International Business Machines Corporation. Power ISA™ Version 3.1, 2020.
[Int21] Intel. Performance monitoring impact of Intel® Transactional
Synchronization Extension memory ordering issue, June 2021.
https://www.intel.com/content/dam/support/us/en/documents/
processors/Performance-Monitoring-Impact-of-TSX-Memory-
Ordering-Issue-604224.pdf.
[Jac88] Van Jacobson. Congestion avoidance and control. In SIGCOMM ’88, pages
314–329, August 1988.
[Jac93] Van Jacobson. Avoid read-side locking via delayed free, September 1993.
private communication.
[Jac08] Daniel Jackson. MapReduce course, January 2008. Available: https:
//sites.google.com/site/mriap2008/ [Viewed January 3, 2013].
[JED] JEDEC. mega (M) (as a prefix to units of semiconductor storage capacity)
[online].
[Jef14] Alan Jeffrey. Jmm revision status, July 2014. https://mail.openjdk.
java.net/pipermail/jmm-dev/2014-July/000072.html.
[JJKD21] Ralf Jung, Jacques-Henri Jourdan, Robbert Krebbers, and Derek Dreyer. Safe
systems programming in Rust. Commun. ACM, 64(4):144–152, March 2021.
[JLK16a] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel ad-
dress space layout randomization (KASLR) with Intel TSX, July
2016. Black Hat USA 2016 https://www.blackhat.com/us-
16/briefings.html#breaking-kernel-address-space-layout-
randomization-kaslr-with-intel-tsx.
[JLK16b] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel address space
layout randomization with Intel TSX. In Proceedings of the 2016 ACM
SIGSAC Conference on Computer and Communications Security, CCS ’16,
pages 380–392, Vienna, Austria, 2016. ACM.
[JMRR02] Benedict Joseph Jackson, Paul E. McKenney, Ramakrishnan Rajamony, and
Ronald Lynn Rockhold. Scalable interruptible queue locks for shared-memory
multiprocessor. US Patent 6,473,819, Assigned to International Business
Machines Corporation, Washington, DC, October 2002.
[Joh77] Stephen Johnson. Lint, a C program checker, December 1977. Computer
Science Technical Report 65, Bell Laboratories.
[Joh95] Aju John. Dynamic vnodes – design and implementation. In USENIX Winter
1995, pages 11–23, New Orleans, LA, January 1995. USENIX Associa-
tion. Available: https://www.usenix.org/publications/library/
proceedings/neworl/full_papers/john.a [Viewed October 1, 2010].
[Jon11] Dave Jones. Trinity: A system call fuzzer. In 13th Ottawa Linux Symposium,
Ottawa, Canada, June 2011. Project repository: https://github.com/
kernelslacker/trinity.
[JSG12] Christian Jacobi, Timothy Slegel, and Dan Greiner. Transactional mem-
ory architecture and implementation for IBM System z. In Proceedings of
the 45th Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO 45, pages 25–36, Vancouver B.C. Canada, December 2012. Presenta-
tion slides: https://www.microarch.org/micro45/talks-posters/
3-jacobi-presentation.pdf.
[Kaa15] Frans Kaashoek. Parallel computing and the os. In SOSP History Day, October
2015.
[KCH+ 06] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and
Anthony Nguyen. Hybrid transactional memory. In Proceedings of the
ACM SIGPLAN 2006 Symposium on Principles and Practice of Parallel
Programming, New York, New York, United States, 2006. ACM SIGPLAN.
[KDI20] Alex Kogan, Dave Dice, and Shady Issa. Scalable range locks for scalable
address spaces and beyond. In Proceedings of the Fifteenth European
Conference on Computer Systems, EuroSys ’20, Heraklion, Greece, 2020.
Association for Computing Machinery.
[Kel17] Michael J. Kelly. How might the manufacturability of the hardware at
device level impact on exascale computing?, 2017. Keynote speech at
Multicore World 2017, URL: https://openparallel.com/multicore-
world-2017/program-2017/abstracts2017/.
[Ken20] Chris Kennelly. TCMalloc overview, February 2020. https://google.
github.io/tcmalloc/overview.html.
[KFC11] KFC. Memristor processor solves mazes, March 2011. URL: https:
//www.technologyreview.com/2011/03/03/196572/memristor-
processor-solves-mazes/.
[Khi14] Maxim Khizhinsky. Memory management schemes, June 2014.
https://kukuruku.co/post/lock-free-data-structures-the-
inside-memory-management-schemes/.
[Khi15] Max Khiszinsky. Lock-free data structures. the inside. RCU, February
2015. https://kukuruku.co/post/lock-free-data-structures-
the-inside-rcu/.
[Kis14] Jan Kiszka. Real-time virtualization - how crazy are we? In Linux Plumbers
Conference, Duesseldorf, Germany, October 2014. URL: https://blog.
linuxplumbersconf.org/2014/ocw/proposals/1935.
[Kiv13] Avi Kivity. rcu: add basic read-copy-update implementation, Au-
gust 2013. https://github.com/cloudius-systems/osv/commit/
94b69794fb9e6c99d78ca9a58ddaee1c31256b43.
[Kiv14a] Avi Kivity. rcu hashtable, July 2014. https:
//github.com/cloudius-systems/osv/commit/
7fa2728e5d03b2174b4a39d94b21940d11926e90.
[Kiv14b] Avi Kivity. rcu: introduce an rcu list type, April 2014.
https://github.com/cloudius-systems/osv/commit/
4e46586093aeaf339fef8e08d123a6f6b0abde5b.
[KL80] H. T. Kung and Philip L. Lehman. Concurrent manipulation of binary search
trees. ACM Transactions on Database Systems, 5(3):354–382, September
1980.
[Kle14] Andi Kleen. Scaling existing lock-based applications with lock elision.
Commun. ACM, 57(3):52–56, March 2014.
[Kle17] Matt Klein. Envoy threading model, July 2017. https://blog.
envoyproxy.io/envoy-threading-model-a8d44b922310.
[KLP12] Christoph M. Kirsch, Michael Lippautz, and Hannes Payer. Fast and scalable
k-FIFO queues. Technical Report 2012-04, University of Salzburg, Salzburg,
Austria, June 2012.
[KM13] Konstantin Khlebnikov and Paul E. McKenney. RCU: non-atomic assignment
to long/pointer variables in gcc, January 2013. https://lore.kernel.
org/lkml/[email protected]/.
[KMK+ 19] Jaeho Kim, Ajit Mathew, Sanidhya Kashyap, Madhava Krishnan Ramanathan,
and Changwoo Min. Mv-rlu: Scaling read-log-update with multi-versioning.
In Proceedings of the Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS ’19,
pages 779–792, Providence, RI, USA, 2019. ACM.
[Kni86] Tom Knight. An architecture for mostly functional languages. In Proceedings
of the 1986 ACM Conference on LISP and Functional Programming, LFP ’86,
pages 105–112, Cambridge, Massachusetts, USA, 1986. ACM.
[Kni08] John U. Knickerbocker. 3D chip technology. IBM Journal of Research and
Development, 52(6), November 2008. URL: http://www.research.ibm.
com/journal/rd52-6.html [Link to each article is broken as of November
2016; Available via https://ieeexplore.ieee.org/xpl/tocresult.
jsp?isnumber=5388557].
[Knu73] Donald Knuth. The Art of Computer Programming. Addison-Wesley, 1973.
[Kra17] Vlad Krasnov. On the dangers of Intel’s frequency scaling, No-
vember 2017. https://blog.cloudflare.com/on-the-dangers-of-
intels-frequency-scaling/.
[KS08] Daniel Kroening and Ofer Strichman. Decision Procedures: An Algorithmic
Point of View. Springer Publishing Company, Incorporated, 1 edition, 2008.
[KS17a] Michalis Kokologiannakis and Konstantinos Sagonas. Stateless model
checking of the linux kernel’s hierarchical read-copy update (Tree RCU).
Technical report, National Technical University of Athens, January 2017.
https://github.com/michalis-/rcu/blob/master/rcupaper.pdf.
[KS17b] Michalis Kokologiannakis and Konstantinos Sagonas. Stateless model check-
ing of the Linux kernel’s hierarchical read-copy-update (Tree RCU). In
Proceedings of International SPIN Symposium on Model Checking of Soft-
ware, SPIN 2017, New York, NY, USA, July 2017. ACM.
[KS19] Michalis Kokologiannakis and Konstantinos Sagonas. Stateless model check-
ing of the linux kernel’s read—copy update (RCU). Int. J. Softw. Tools Technol.
Transf., 21(3):287–306, June 2019.
[KWS97] Leonidas Kontothanassis, Robert W. Wisniewski, and Michael L. Scott.
Scheduler-conscious synchronization. ACM Transactions on Computer Sys-
tems, 15(1):3–40, February 1997.
[LA94] Beng-Hong Lim and Anant Agarwal. Reactive synchronization algorithms
for multiprocessors. In Proceedings of the sixth international conference
on Architectural support for programming languages and operating systems,
ASPLOS VI, pages 25–35, San Jose, California, USA, October 1994. ACM.
[Lam74] Leslie Lamport. A new solution of Dijkstra’s concurrent programming
problem. Communications of the ACM, 17(8):453–455, August 1974.
[Lam77] Leslie Lamport. Concurrent reading and writing. Commun. ACM, 20(11):806–
811, November 1977.
[Lam02] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for
Hardware and Software Engineers. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 2002.
[Lar21] Michael Larabel. Intel to disable TSX by default on more CPUs with new
microcode, June 2021. https://www.phoronix.com/scan.php?page=
news_item&px=Intel-TSX-Off-New-Microcode.
[LBD+ 04] James R. Larus, Thomas Ball, Manuvir Das, Robert DeLine, Manuel Fahndrich,
Jon Pincus, Sriram K. Rajamani, and Ramanathan Venkatapathy. Righting
software. IEEE Softw., 21(3):92–100, May 2004.
[Lea97] Doug Lea. Concurrent Programming in Java: Design Principles and Patterns.
Addison Wesley Longman, Reading, MA, USA, 1997.
[Lem18] Daniel Lemire. AVX-512: when and how to use these new instructions, Sep-
tember 2018. https://lemire.me/blog/2018/09/07/avx-512-when-
and-how-to-use-these-new-instructions/.
[LGW+ 15] H. Q. Le, G. L. Guthrie, D. E. Williams, M. M. Michael, B. G. Frey, W. J.
Starke, C. May, R. Odaira, and T. Nakaike. Transactional memory support in
the ibm power8 processor. IBM J. Res. Dev., 59(1):8:1–8:14, January 2015.
[LHF05] Michael Lyons, Bill Hay, and Brad Frey. PowerPC storage model and AIX
programming, November 2005. http://www.ibm.com/developerworks/
systems/articles/powerpc.html.
[Lis88] Barbara Liskov. Distributed programming in Argus. Commun. ACM,
31(3):300–312, 1988.
[LLO09] Yossi Lev, Victor Luchangco, and Marek Olszewski. Scalable reader-writer
locks. In SPAA ’09: Proceedings of the twenty-first annual symposium on
Parallelism in algorithms and architectures, pages 101–110, Calgary, AB,
Canada, 2009. ACM.
[LLS13] Yujie Liu, Victor Luchangco, and Michael Spear. Mindicators: A scalable
approach to quiescence. In Proceedings of the 2013 IEEE 33rd International
Conference on Distributed Computing Systems, ICDCS ’13, pages 206–215,
Washington, DC, USA, 2013. IEEE Computer Society.
[LMKM16] Lihao Liang, Paul E. McKenney, Daniel Kroening, and Tom Melham. Ver-
ification of the tree-based hierarchical read-copy update in the Linux ker-
nel. Technical report, Cornell University Library, October 2016. https:
//arxiv.org/abs/1610.03052.
[LMKM18] Lihao Liang, Paul E. McKenney, Daniel Kroening, and Tom Melham. Verifi-
cation of tree-based hierarchical Read-Copy Update in the Linux Kernel. In
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE),
Dresden, Germany, March 2018.
[Loc02] Doug Locke. Priority inheritance: The real story, July 2002. URL:
http://www.linuxdevices.com/articles/AT5698775833.html [bro-
ken, November 2016], page capture available at https://www.math.unipd.
it/%7Etullio/SCD/2007/Materiale/Locke.pdf.
[MCS91] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable syn-
chronization on shared-memory multiprocessors. Transactions of Computer
Systems, 9(1):21–65, February 1991.
[MD92] Paul E. McKenney and Ken F. Dove. Efficient demultiplexing of incoming tcp
packets. In SIGCOMM ’92, Proceedings of the Conference on Communications
Architecture & Protocols, pages 269–279, Baltimore, MD, August 1992.
Association for Computing Machinery.
[MDJ13a] Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan. URCU-protected
hash tables, November 2013. https://lwn.net/Articles/573431/.
[MDJ13b] Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan. URCU-protected
queues and stacks, November 2013. https://lwn.net/Articles/
573433/.
[MDJ13c] Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan. User-space RCU,
November 2013. https://lwn.net/Articles/573424/.
[MDR16] Paul E. McKenney, Will Deacon, and Luis R. Rodriguez. Semantics of MMIO
mapping attributes across architectures, August 2016. https://lwn.net/
Articles/698014/.
[MDSS20] Hans Meuer, Jack Dongarra, Erich Strohmaier, and Horst Simon. Top 500: The
list, November 2020. Available: https://top500.org/lists/ [Viewed
March 6, 2021].
[Men16] Alexis Menard. Move OneWriterSeqLock and SharedMemorySe-
qLockBuffer from content/ to device/base/synchronization, September
2016. https://source.chromium.org/chromium/chromium/src/+/
b39a3082846d5877a15e8b7e18d66cb142abe8af.
[Mer11] Rick Merritt. IBM plants transactional memory in CPU, August 2011.
EE Times https://www.eetimes.com/ibm-plants-transactional-
memory-in-cpu/.
[Met99] Panagiotis Takis Metaxas. Fast dithering on a data-parallel computer. In
Proceedings of the IASTED International Conference on Parallel and Distrib-
uted Computing and Systems, pages 570–576, Cambridge, MA, USA, 1999.
IASTED.
[MG92] Paul E. McKenney and Gary Graunke. Efficient buffer allocation on shared-
memory multiprocessors. In IEEE Workshop on the Architecture and Imple-
mentation of High Performance Communication Subsystems, pages 194–199,
Tucson, AZ, February 1992. The Institute of Electrical and Electronics Engi-
neers, Inc.
[MGM+ 09] Paul E. McKenney, Manish Gupta, Maged M. Michael, Phil Howard, Joshua
Triplett, and Jonathan Walpole. Is parallel programming hard, and if so,
why? Technical Report TR-09-02, Portland State University, Portland, OR,
USA, February 2009. URL: https://archives.pdx.edu/ds/psu/10386
[Viewed February 13, 2021].
[MHS12] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip coherence
is here to stay. Communications of the ACM, 55(7):78–89, July 2012.
[Mic02] Maged M. Michael. Safe memory reclamation for dynamic lock-free objects
using atomic reads and writes. In Proceedings of the 21st Annual ACM
Symposium on Principles of Distributed Computing, pages 21–30, August
2002.
[MS93] Paul E. McKenney and Jack Slingwine. Efficient kernel memory allocation
on shared-memory multiprocessors. In USENIX Conference Proceedings,
pages 295–306, Berkeley CA, February 1993. USENIX Association. Avail-
able: http://www.rdrop.com/users/paulmck/scalability/paper/
mpalloc.pdf [Viewed January 30, 2005].
[MS95] Maged M. Michael and Michael L. Scott. Correction of a memory management
method for lock-free data structures, December 1995. Technical Report TR599.
[MS96] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking
and blocking concurrent queue algorithms. In Proc. of the Fifteenth ACM
Symposium on Principles of Distributed Computing, pages 267–275, May
1996.
[MS98a] Paul E. McKenney and John D. Slingwine. Read-copy update: Using execution
history to solve concurrency problems. In Parallel and Distributed Computing
and Systems, pages 509–518, Las Vegas, NV, October 1998.
[MS98b] Maged M. Michael and Michael L. Scott. Nonblocking algorithms and
preemption-safe locking on multiprogrammed shared memory multiprocessors.
J. Parallel Distrib. Comput., 51(1):1–26, 1998.
[MS01] Paul E. McKenney and Dipankar Sarma. Read-copy update mutual exclusion
in Linux, February 2001. Available: http://lse.sourceforge.net/
locking/rcu/rcupdate_doc.html [Viewed October 18, 2004].
[MS08] MySQL AB and Sun Microsystems. MySQL Downloads, November 2008.
Available: http://dev.mysql.com/downloads/ [Viewed November 26,
2008].
[MS09] Paul E. McKenney and Raul Silvera. Example POWER im-
plementation for C/C++ memory model, February 2009. Avail-
able: http://www.rdrop.com/users/paulmck/scalability/paper/
N2745r.2009.02.27a.html [Viewed: April 5, 2009].
[MS12] Alexander Matveev and Nir Shavit. Towards a fully pessimistic STM model.
In TRANSACT 2012, San Jose, CA, USA, February 2012. ACM SIGPLAN.
[MS14] Paul E. McKenney and Alan Stern. Axiomatic validation of memory barri-
ers and atomic instructions, August 2014. https://lwn.net/Articles/
608550/.
[MS18] Luc Maranget and Alan Stern. lock.cat, May 2018. https://github.com/
torvalds/linux/blob/master/tools/memory-model/lock.cat.
[MSA+ 02] Paul E. McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen, Orran
Krieger, and Rusty Russell. Read-copy update. In Ottawa Linux Symposium,
pages 338–367, June 2002. Available: https://www.kernel.org/doc/
ols/2002/ols2002-pages-338-367.pdf [Viewed February 14, 2021].
[MSFM15] Alexander Matveev, Nir Shavit, Pascal Felber, and Patrick Marlier. Read-log-
update: A lightweight synchronization mechanism for concurrent program-
ming. In Proceedings of the 25th Symposium on Operating Systems Principles,
SOSP ’15, pages 168–183, Monterey, California, 2015. ACM.
[MSK01] Paul E. McKenney, Jack Slingwine, and Phil Krueger. Experience with an
efficient parallel kernel memory allocator. Software – Practice and Experience,
31(3):235–257, March 2001.
[RKM+ 10] Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I.
August. Speculative parallelization using software multi-threaded transactions.
SIGARCH Comput. Archit. News, 38(1):65–76, 2010.
[RLPB18] Yuxin Ren, Guyue Liu, Gabriel Parmer, and Björn Brandenburg. Scalable
memory reclamation for multi-core, real-time systems. In Proceedings of the
2018 IEEE Real-Time and Embedded Technology and Applications Symposium
(RTAS), page 12, Porto, Portugal, April 2018. IEEE.
[RMF19] Federico Reghenzani, Giuseppe Massari, and William Fornaciari. The real-
time Linux kernel: A survey on PREEMPT_RT. ACM Comput. Surv.,
52(1):18:1–18:36, February 2019.
[Ros06] Steven Rostedt. Lightweight PI-futexes, June 2006. Available: https:
//www.kernel.org/doc/html/latest/locking/pi-futex.html
[Viewed February 14, 2021].
[Ros10a] Steven Rostedt. tracing: Harry Potter and the Deathly Macros, December
2010. Available: https://lwn.net/Articles/418710/ [Viewed: August
28, 2011].
[Ros10b] Steven Rostedt. Using the TRACE_EVENT() macro (part 1), March 2010.
Available: https://lwn.net/Articles/379903/ [Viewed: August 28,
2011].
[Ros10c] Steven Rostedt. Using the TRACE_EVENT() macro (part 2), March 2010.
Available: https://lwn.net/Articles/381064/ [Viewed: August 28,
2011].
[Ros10d] Steven Rostedt. Using the TRACE_EVENT() macro (part 3), April 2010.
Available: https://lwn.net/Articles/383362/ [Viewed: August 28,
2011].
[Ros11] Steven Rostedt. lockdep: How to read its cryptic output, September 2011.
http://www.linuxplumbersconf.org/2011/ocw/sessions/153.
[Roy17] Lance Roy. rcutorture: Add CBMC-based formal verification for
SRCU, January 2017. URL: https://www.spinics.net/lists/kernel/
msg2421833.html.
[RR20] Sergio Rajsbaum and Michel Raynal. Mastering concurrent computing through
sequential thinking. Commun. ACM, 63(1):78–87, January 2020.
[RSB+ 97] Rajeev Rastogi, S. Seshadri, Philip Bohannon, Dennis W. Leinbaugh, Abraham
Silberschatz, and S. Sudarshan. Logical and physical versioning in main
memory databases. In Proceedings of the 23rd International Conference on
Very Large Data Bases, VLDB ’97, pages 86–95, San Francisco, CA, USA,
August 1997. Morgan Kaufmann Publishers Inc.
[RTY+ 87] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron,
David Black, William Bolosky, and Jonathan Chew. Machine-independent
virtual memory management for paged uniprocessor and multiprocessor
architectures. In 2nd Symposium on Architectural Support for Programming
Languages and Operating Systems, pages 31–39, Palo Alto, CA, October
1987. Association for Computing Machinery.
[Rus00a] Rusty Russell. Re: modular net drivers, June 2000. URL: http://oss.
sgi.com/projects/netdev/archive/2000-06/msg00250.html [bro-
ken, February 15, 2021].
[Rus00b] Rusty Russell. Re: modular net drivers, June 2000. URL: http://oss.
sgi.com/projects/netdev/archive/2000-06/msg00254.html [bro-
ken, February 15, 2021].
[Rus03] Rusty Russell. Hanging out with smart people: or... things I learned being a
kernel monkey, July 2003. 2003 Ottawa Linux Symposium Keynote https://
ozlabs.org/~rusty/ols-2003-keynote/ols-keynote-2003.html.
[Rut17] Mark Rutland. compiler.h: Remove ACCESS_ONCE(), November 2017. Git
commit: https://git.kernel.org/linus/b899a850431e.
[SAE+ 18] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and
Ciera Jaspan. Lessons from building static analysis tools at google. Commun.
ACM, 61(4):58–66, March 2018.
[SAH+ 03] Craig A. N. Soules, Jonathan Appavoo, Kevin Hui, Dilma Da Silva, Gre-
gory R. Ganger, Orran Krieger, Michael Stumm, Robert W. Wisniewski, Marc
Auslander, Michal Ostrowski, Bryan Rosenburg, and Jimi Xenidis. System
support for online reconfiguration. In Proceedings of the 2003 USENIX
Annual Technical Conference, pages 141–154, San Antonio, Texas, USA, June
2003. USENIX Association.
[SATG+ 09] Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Robert Geva, Yang Ni, and
Adam Welc. Towards transactional memory semantics for C++. In SPAA ’09:
Proceedings of the twenty-first annual symposium on Parallelism in algorithms
and architectures, pages 49–58, Calgary, AB, Canada, 2009. ACM.
[SBN+ 20] Dimitrios Siakavaras, Panagiotis Billis, Konstantinos Nikas, Georgios Goumas,
and Nectarios Koziris. Efficient concurrent range queries in b+-trees using
rcu-htm. In Proceedings of the 32nd ACM Symposium on Parallelism in
Algorithms and Architectures, SPAA ’20, pages 571–573, Virtual Event, USA,
2020. Association for Computing Machinery.
[SBV10] Martin Schoeberl, Florian Brandner, and Jan Vitek. RTTM: Real-time
transactional memory. In Proceedings of the 2010 ACM Symposium on
Applied Computing, pages 326–333, 01 2010.
[Sch35] E. Schrödinger. Die gegenwärtige Situation in der Quantenmechanik. Natur-
wissenschaften, 23:807–812; 823–828; 844–849, November 1935.
[Sch94] Curt Schimmel. UNIX Systems for Modern Architectures: Symmetric
Multiprocessing and Caching for Kernel Programmers. Addison-Wesley
Longman Publishing Co., Inc., Boston, MA, USA, 1994.
[Sco06] Michael Scott. Programming Language Pragmatics. Morgan Kaufmann,
Burlington, MA, USA, 2006.
[Sco13] Michael L. Scott. Shared-Memory Synchronization. Morgan & Claypool, San
Rafael, CA, USA, 2013.
[Sco15] Michael Scott. Programming Language Pragmatics, 4th Edition. Morgan
Kaufmann, Burlington, MA, USA, 2015.
[Seq88] Sequent Computer Systems, Inc. Guide to Parallel Programming, 1988.
[Sew] Peter Sewell. Relaxed-memory concurrency. Available: https://www.cl.
cam.ac.uk/~pes20/weakmemory/ [Viewed: February 15, 2021].
[Sey12] Justin Seyster. Runtime Verification of Kernel-Level Concurrency Using
Compiler-Based Instrumentation. PhD thesis, Stony Brook University, 2012.
[SF95] Janice M. Stone and Robert P. Fitzgerald. Storage in the PowerPC. IEEE
Micro, 15(2):50–58, April 1995.
[Sha11] Nir Shavit. Data structures in the multicore age. Commun. ACM, 54(3):76–84,
March 2011.
[She06] Gautham R. Shenoy. [patch 4/5] lock_cpu_hotplug: Redesign - lightweight
implementation of lock_cpu_hotplug, October 2006. Available: https:
//lkml.org/lkml/2006/10/26/73 [Viewed January 26, 2009].
[SHW11] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory
Consistency and Cache Coherence. Synthesis Lectures on Computer
Architecture. Morgan & Claypool, 2011.
[Slo10] Lubos Slovak. First steps for utilizing userspace RCU library,
July 2010. https://gitlab.labs.nic.cz/knot/knot-dns/commit/
f67acc0178ee9a781d7a63fb041b5d09eb5fb4a2.
[SM95] John D. Slingwine and Paul E. McKenney. Apparatus and method for
achieving reduced overhead mutual exclusion and maintaining coherency in
a multiprocessor system utilizing execution history and thread monitoring.
Technical Report US Patent 5,442,758, Assigned to International Business
Machines Corp, Washington, DC, August 1995.
[SM97] John D. Slingwine and Paul E. McKenney. Method for maintaining data
coherency using thread activity summaries in a multicomputer system. Technical
Report US Patent 5,608,893, Assigned to International Business Machines
Corp, Washington, DC, March 1997.
[SM98] John D. Slingwine and Paul E. McKenney. Apparatus and method for
achieving reduced overhead mutual exclusion and maintaining coherency in
a multiprocessor system utilizing execution history and thread monitoring.
Technical Report US Patent 5,727,209, Assigned to International Business
Machines Corp, Washington, DC, March 1998.
[SM04a] Dipankar Sarma and Paul E. McKenney. Issues with selected scalability
features of the 2.6 kernel. In Ottawa Linux Symposium, page 16, July
2004. https://www.kernel.org/doc/ols/2004/ols2004v2-pages-
195-208.pdf.
[SM04b] Dipankar Sarma and Paul E. McKenney. Making RCU safe for deep sub-
millisecond response realtime applications. In Proceedings of the 2004
USENIX Annual Technical Conference (FREENIX Track), pages 182–191,
Boston, MA, USA, June 2004. USENIX Association.
[SM13] Thomas Sewell and Toby Murray. Above and beyond: seL4 noninterference
and binary verification, May 2013. https://cps-vo.org/node/7706.
[Smi19] Richard Smith. Working draft, standard for programming language C++,
January 2019. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2019/n4800.pdf.
[SMS08] Michael Spear, Maged Michael, and Michael Scott. Inevitability mechanisms
for software transactional memory. In 3rd ACM SIGPLAN Workshop on
Transactional Computing, Salt Lake City, Utah, February 2008.
ACM. Available: http://www.cs.rochester.edu/u/scott/papers/
2008_TRANSACT_inevitability.pdf [Viewed January 10, 2009].
[ST87] William E. Snaman and David W. Thiel. The VAX/VMS distributed lock
manager. Digital Technical Journal, 5:29–44, September 1987.
[ST95] Nir Shavit and Dan Touitou. Software transactional memory. In Proceedings
of the 14th Annual ACM Symposium on Principles of Distributed Computing,
pages 204–213, Ottawa, Ontario, Canada, August 1995.
[Ste92] W. Richard Stevens. Advanced Programming in the UNIX Environment.
Addison Wesley, 1992.
[Ste13] W. Richard Stevens. Advanced Programming in the UNIX Environment, 3rd
Edition. Addison Wesley, 2013.
[Sut08] Herb Sutter. Effective concurrency, 2008. Series in Dr. Dobb’s Journal.
[Sut13] Adrian Sutton. Concurrent programming with the Disruptor, January 2013.
Presentation at Linux.conf.au 2013, URL: https://www.youtube.com/
watch?v=ItpT_vmRHyI.
[SW95] Richard L. Sites and Richard T. Witek. Alpha AXP Architecture. Digital Press,
second edition, 1995.
[SWS16] Harshal Sheth, Aashish Welling, and Nihar Sheth. Read-copy update in
a garbage collected environment, 2016. MIT PRIMES
program: https://math.mit.edu/research/highschool/primes/
materials/2016/conf/10-1%20Sheth-Welling-Sheth.pdf.
[SZJ12] KC Sivaramakrishnan, Lukasz Ziarek, and Suresh Jagannathan. Eliminating
read barriers through procrastination and cleanliness. In Proceedings of the
2012 International Symposium on Memory Management, ISMM ’12, pages
49–60, Beijing, China, 2012. ACM.
[Tal07] Nassim Nicholas Taleb. The Black Swan. Random House, 2007.
[TDV15] Joseph Tassarotti, Derek Dreyer, and Victor Vafeiadis. Verifying read-copy-
update in a logic for weak memory. In Proceedings of the 36th annual ACM
SIGPLAN conference on Programming Language Design and Implementation,
PLDI ’15, pages 110–120, New York, NY, USA, June 2015. ACM.
[The08] The Open MPI Project. Open MPI, November 2008. Available: http:
//www.open-mpi.org/software/ [Viewed November 26, 2008].
[The11] The Valgrind Developers. Valgrind, November 2011. http://www.
valgrind.org/.
[The12a] The NetBSD Foundation. pserialize(9), October 2012. http://netbsd.
gw.com/cgi-bin/man-cgi?pserialize+9+NetBSD-current.
[The12b] The OProfile Developers. OProfile, April 2012. http://oprofile.
sourceforge.net.
[TMW11] Josh Triplett, Paul E. McKenney, and Jonathan Walpole. Resizable, scalable,
concurrent hash tables via relativistic programming. In Proceedings of the
2011 USENIX Annual Technical Conference, pages 145–158, Portland, OR,
USA, June 2011. The USENIX Association.
[Tor01] Linus Torvalds. Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of
lists with insertion, October 2001. URL: https://lkml.org/lkml/2001/
10/13/105, https://lkml.org/lkml/2001/10/13/82.
[WTS96] Cai-Dong Wang, Hiroaki Takada, and Ken Sakamura. Priority inheritance
spin locks for multiprocessor real-time systems. In Proceedings of the 2nd
International Symposium on Parallel Architectures, Algorithms, and Networks,
ISPAN ’96, pages 70–76, Beijing, China, 1996. IEEE Computer Society.
[xen14] xenomai.org. Xenomai, December 2014. URL: http://xenomai.org/.
[Xu10] Herbert Xu. bridge: Add core IGMP snooping support, February
2010. Available: https://marc.info/?t=126719855400006&r=1&w=2
[Viewed March 20, 2011].
[YHLR13] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar.
Performance evaluation of Intel® Transactional Synchronization Extensions
for high-performance computing. In Proceedings of SC13: International
Conference for High Performance Computing, Networking, Storage and
Analysis, SC ’13, pages 19:1–19:11, Denver, Colorado, 2013. ACM.
[Yod04a] Victor Yodaiken. Against priority inheritance, September 2004. Avail-
able: https://www.yodaiken.com/papers/inherit.pdf [Viewed May
26, 2007].
[Yod04b] Victor Yodaiken. Temporal inventory and real-time synchronization in
RTLinuxPro, September 2004. URL: https://www.yodaiken.com/papers/
sync.pdf.
[Zel11] Cyril Zeller. CUDA C/C++ basics: Supercomputing 2011 tutorial, November
2011. https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-
basics.pdf.
[Zha89] Lixia Zhang. A New Architecture for Packet Switching Network Protocols.
PhD thesis, Massachusetts Institute of Technology, July 1989.
[Zij14] Peter Zijlstra. Another go at speculative page faults, October 2014. https:
//lkml.org/lkml/2014/10/20/620.
If I have seen further it is by standing on the
shoulders of giants.
Akira Yokosawa is this book’s LaTeX advisor, which perhaps most notably includes
the care and feeding of the style guide laid out in Appendix D. This work includes
table layout, listings, fonts, rendering of math, acronyms, bibliography formatting,
epigraphs, hyperlinks, and paper size. Akira also perfected the cross-referencing of
quick quizzes, allowing easy and exact navigation between quick quizzes and their
answers. He also added build options that permit quick quizzes to be hidden and to be
gathered at the end of each chapter, textbook style.

This role also includes the build system, which Akira has optimized and made much
more user-friendly. His enhancements have included automating response to
bibliography changes, automatically determining which source files are present, and
automatically generating listings (with automatically generated hyperlinked
line-number references) from the source files.

Reviewers

• Alan Stern (Chapter 15).
• Andy Whitcroft (Section 9.5.2, Section 9.5.3).
• Artem Bityutskiy (Chapter 15, Appendix C).
• Dave Keck (Appendix C).
• David S. Horner (Section 12.1.5).
• Richard Woodruff (Appendix C).
• Suparna Bhattacharya (Chapter 12).
• Vara Prasad (Section 12.1.5).

Reviewers whose feedback took the extremely welcome form of a patch are credited in
the git logs.

Machine Owners

Readers might have noticed some graphs showing scalability data out to several
hundred CPUs, courtesy of my current employer, with special thanks to Paul Saab,
Yashar Bayani, Joe Boyd, and Kyle McMartin.

From back in my time at IBM, a great debt of thanks goes to Martin Bligh, who
originated the Advanced Build and Test (ABAT) system at IBM’s Linux Technology
Center, as well as to Andy Whitcroft, Dustin Kirkland, and many others who extended
this system. Many thanks go also to a great number of machine owners: Andrew
Theurer, Andy Whitcroft, Anton Blanchard, Chris McDermott, Cody Schaefer, Darrick
Wong, David “Shaggy” Kleikamp, Jon M. Tollefson, Jose R. Santos, Marvin Heffler,
Nathan Lynch, Nishanth Aravamudan, Tim Pepper, and Tony Breeds.
5. Section 9.5.3 (“RCU Linux-Kernel API”) on page 150 originally appeared in
Linux Weekly News [McK08e].

6. Section 9.5.4 (“RCU Usage”) on page 160 originally appeared in Linux Weekly
News [McK08g].

7. Section 9.5.5 (“RCU Related Work”) on page 177 originally appeared in Linux
Weekly News [McK14g].

8. Section 9.5.5 (“RCU Related Work”) on page 177 originally appeared in Linux
Weekly News [MP15a].

9. Chapter 12 (“Formal Verification”) on page 229 originally appeared in Linux
Weekly News [McK07f, MR08, McK11d].

10. Section 12.3 (“Axiomatic Approaches”) on page 260 originally appeared in Linux
Weekly News [MS14].

11. Section 13.5.4 (“Correlated Fields”) on page 280 originally appeared in Oregon
Graduate Institute [McK04].

12. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 313
originally appeared in the Linux kernel [HMDZ06].

13. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 313
originally appeared in Linux Weekly News [AMM+17a, AMM+17b].

14. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 313
originally appeared in ASPLOS ’18 [AMM+18].

15. Section 15.3.2 (“Address- and Data-Dependency Difficulties”) on page 335
originally appeared in the Linux kernel [McK14e].

16. Section 15.5 (“Memory-Barrier Instructions For Specific CPUs”) on page 351
originally appeared in Linux Journal [McK05a, McK05b].

4. Figure 3.5 (p 19) by Melissa Broussard.
5. Figure 3.6 (p 20) by Melissa Broussard.
6. Figure 3.7 (p 20) by Melissa Broussard.
7. Figure 3.8 (p 21) by Melissa Broussard.
8. Figure 3.9 (p 21) by Melissa Broussard.
9. Figure 3.11 (p 25) by Melissa Broussard.
10. Figure 5.3 (p 51) by Melissa Broussard.
11. Figure 6.1 (p 73) by Kornilios Kourtis.
12. Figure 6.2 (p 74) by Melissa Broussard.
13. Figure 6.3 (p 74) by Kornilios Kourtis.
14. Figure 6.4 (p 75) by Kornilios Kourtis.
15. Figure 6.13 (p 84) by Melissa Broussard.
16. Figure 6.14 (p 85) by Melissa Broussard.
17. Figure 6.15 (p 85) by Melissa Broussard.
18. Figure 7.1 (p 102) by Melissa Broussard.
19. Figure 7.2 (p 102) by Melissa Broussard.
20. Figure 10.13 (p 194) by Melissa Broussard.
21. Figure 10.14 (p 194) by Melissa Broussard.
22. Figure 11.1 (p 209) by Melissa Broussard.
23. Figure 11.2 (p 209) by Melissa Broussard.
24. Figure 11.3 (p 216) by Melissa Broussard.
25. Figure 11.6 (p 227) by Melissa Broussard.
26. Figure 14.1 (p 293) by Melissa Broussard.
27. Figure 14.2 (p 293) by Melissa Broussard.
28. Figure 14.3 (p 294) by Melissa Broussard.
30. Figure 14.11 (p 302) by Melissa Broussard.
31. Figure 14.14 (p 304) by Melissa Broussard.
32. Figure 14.15 (p 311) by Sarah McKenney.
33. Figure 14.16 (p 311) by Sarah McKenney.
34. Figure 15.2 (p 315) by Melissa Broussard.
35. Figure 15.5 (p 321) by Akira Yokosawa.
36. Figure 15.18 (p 355) by Melissa Broussard.
37. Figure 16.2 (p 365) by Melissa Broussard.
38. Figure 17.1 (p 367) by Melissa Broussard.
39. Figure 17.2 (p 368) by Melissa Broussard.
40. Figure 17.3 (p 368) by Melissa Broussard.
41. Figure 17.4 (p 368) by Melissa Broussard.
42. Figure 17.5 (p 368) by Melissa Broussard, remixed.
43. Figure 17.9 (p 382) by Melissa Broussard.
44. Figure 17.10 (p 382) by Melissa Broussard.
45. Figure 17.11 (p 382) by Melissa Broussard.
46. Figure 17.12 (p 383) by Melissa Broussard.
47. Figure 18.1 (p 405) by Melissa Broussard.
48. Figure A.1 (p 408) by Melissa Broussard.
49. Figure E.2 (p 489) by Kornilios Kourtis.

Other Support

We owe thanks to many CPU architects for patiently explaining the instruction- and
memory-reordering features of their CPUs, particularly Wayne Cardoza, Ed Silha,
Anton Blanchard, Tim Slegel, Juergen Probst, Ingo Adlung, Ravi Arimilli, Cathy May,
Derek Williams, H. Peter Anvin, Andy Glew, Leonid Yegoshin, Richard Grisenthwaite,
and Will Deacon. Wayne deserves special thanks for his patience in explaining Alpha’s
reordering of dependent loads, a lesson that Paul resisted quite strenuously!

The bibtex-generation service of the Association for Computing Machinery has saved
us a huge amount of time and effort compiling the bibliography, for which we are
grateful. Thanks are also due to Stamatis Karnouskos, who convinced me to drag my
antique bibliography database kicking and screaming into the 21st century. Any
technical work of this sort owes thanks to the many individuals and organizations that
keep the Internet and the World Wide Web up and running, and this one is no exception.

Portions of this material are based upon work supported by the National Science
Foundation under Grant No. CNS-0719851.
Acronyms
CAS compare and swap, 22, 27, 36, 46, 258, 270, 386,
468, 535, 570
CBMC C bounded model checker, 179, 263, 395, 529
Index
Bold: Major reference.
Underline: Definition.
Acquire load, 46, 145, 324, 569 write, 430, 576 Dining philosophers problem, 73
Ahmed, Iftekhar, 179 Cache-coherence protocol, 431, 570 Direct-mapped cache, see Cache,
Alglave, Jade, 241, 257, 260, 345, 348 Cache-invalidation latency, see Latency, direct-mapped
Amdahl’s Law, 7, 81, 97, 569 cache-invalidation Dreyer, Derek, 179
Anti-Heisenbug, see Heisenbug, anti- Cache-miss latency, see Latency, Dufour, Laurent, 177
Arbel, Maya, 178 cache-miss
Ash, Mike, 178 Capacity miss, see Cache miss, capacity Efficiency, 9, 80, 86, 115, 412, 571
Associativity, see Cache associativity Chen, Haibo, 178 energy, 25, 223, 571
Associativity miss, see Cache miss, Chien, Andrew, 4 Embarrassingly parallel, 12, 86, 93, 571
associativity Clash free, 286, 570 Epoch-based reclamation (EBR), 178,
Atomic, 19, 28, 36, 37, 46, 50, 55, 61, 569 Clements, Austin, 177 183, 571
Atomic read-modify-write operation, 316, Code locking, see Locking, code Exclusive lock, see Lock, exclusive
317, 432, 569 Communication miss, see Cache miss, Existence guarantee, 116, 165, 166, 180,
Attiya, Hagit, 178, 553 communication 270, 492, 571
Compare and swap (CAS), 22, 27, 36,
Belay, Adam, 178 258, 270, 386, 468, 535, 570 False sharing, 24, 78, 97, 192, 205, 480,
Bhat, Srivatsa, 178 Concurrent, 412, 570 498, 518, 571
Bonzini, Paolo, 4 Consistency Felber, Pascal, 178
Bornat, Richard, 3 memory, 353, 573 Forward-progress guarantee, 121, 178,
Bounded population-oblivious wait free, process, 574 181, 286, 571
see Wait free, bounded sequential, 275, 396, 575 Fragmentation, 92, 571
population-oblivious weak, 356 Fraser, Keir, 178, 571
Bounded wait free, see Wait free, bounded Corbet, Jonathan, 3 Full memory barrier, see Memory barrier,
Butenhof, David R., 3 Correia, Andreia, 178 full
Critical section, 20, 35, 81, 84, 89, 109, Fully associative cache, see Cache, fully
C bounded model checker (CBMC), 179, 116, 570 associative
263, 395, 529 RCU read-side, 139, 145, 574
Cache, 569 read-side, 111, 135, 575 Generality, 8, 10, 27, 80
direct-mapped, 433, 571 write-side, 576 Giannoula, Christina, 179, 379
fully associative, 385, 571 Gotsman, Alexey, 179
Cache associativity, 385, 430, 569 Data locking, see Locking, data Grace period, 140, 151, 182, 190, 212,
Cache coherence, 326, 355, 385, 569 Data race, 32, 40, 101, 212, 335, 571 241, 261, 274, 305, 344, 370, 415,
Cache geometry, 430, 570 Deacon, Will, 41 572
Cache line, 21, 50, 115, 205, 315, 328, Deadlock, 7, 15, 76, 101, 141, 197, 306, Grace-period latency, see Latency,
353, 383, 429, 570 337, 364, 376, 386, 571 grace-period
Cache miss, 570 Deadlock cycle, 415, 417 Groce, Alex, 179
associativity, 430, 569 Deadlock free, 286, 571
capacity, 430, 570 Desnoyers, Mathieu, 177, 178 Hardware transactional memory (HTM),
communication, 430, 570 Dijkstra, Edsger W., 1, 74 383, 384, 554, 556, 572
Sarkar, Susmit, 257 Store buffer, 575 Wait free, 286, 576
Scalability, 9, 412, 575 Store forwarding, 575 bounded, 181, 286, 569
Scheduling latency, see Latency, Superscalar CPU, 575 bounded population-oblivious, 286,
scheduling Sutter, Herb, 4 569
Schimmel, Curt, 4 Synchronization, 576 Walpole, Jon, 177
Schmidt, Douglas C., 3 Weak consistency, see Consistency, weak
Scott, Michael, 3 Tassarotti, Joseph, 179 Weiss, Stewart, 3
Sequence lock, see Lock, sequence Teachable, 576 Weizenbaum, Joseph, 177
Sequential consistency, see Consistency, Throughput, 576 Williams, Anthony, 3
sequential Torvalds, Linus, 477, 514, 548 Williams, Derek, 257
Sewell, Peter, 257 Transactional lock elision (TLE), 387, Write memory barrier, see Memory
410, 576 barrier, write
Shavit, Nir, 3, 178
Transactional memory (TM), 576 Write miss, see Cache miss, write
Shenoy, Gautham, 178
Triplett, Josh, 177 Write mostly, 576
Siakavaras, Dimitrios, 179, 379
Tu, Stephen, 177 Write-side critical section, see Critical
Sivaramakrishnan, KC, 178
Type-safe memory, 151, 165, 270, 576 section, write-side
Software transactional memory (STM),
385, 554, 575 Unbounded transactional memory (UTM), Xu, Herbert, 177, 195, 514
Sorin, Daniel, 4 385, 576
Spear, Michael, 3, 178 Uncertainty principle, 218 Yang, Hongseok, 179
Spin, 229 Unfairness, 101, 109, 114, 137, 576
Starvation, 73, 101, 108, 114, 137, 275, Unteachable, 576 Zeldovich, Nickolai, 177
304, 380, 417, 496, 575 Zhang, Heng, 178
Starvation free, 286, 575 Vafeiadis, Viktor, 179 Zheng, Wenting, 177
Stevens, W. Richard, 3 Vector CPU, 576 Zijlstra, Peter, 177
API Index
(c): Cxx standard, (g): GCC extension, (k): Linux kernel,
(kh): Linux kernel historic, (pf): perfbook CodeSamples,
(px): POSIX, (ur): userspace RCU.
list_replace_rcu() (k), 156, 158 rcu_dereference_protected() (k), smp_load_acquire() (k), 46, 475
lockless_dereference() (kh), 324, 154 smp_mb() (k), 45
544 rcu_dereference_raw() (k), 155 smp_store_release() (k), 43, 46, 475
rcu_dereference_raw_notrace() (k), smp_thread_id() (pf), 38, 39, 475
NR_THREADS (pf), 38 155 smp_wmb() (k), 43
rcu_head (k), 159 spin_lock() (k), 39, 40
per_cpu() (k), 46 rcu_head_after_call_rcu() (k), 159 spin_lock_init() (k), 39
per_thread() (pf), 47, 52 rcu_head_init() (k), 159 spin_trylock() (k), 39, 106
pthread_atfork() (px), 120 rcu_init() (ur), 37 spin_unlock() (k), 39, 40
pthread_cond_wait() (px), 106 RCU_INIT_POINTER() (k), 154 spinlock_t (k), 39
pthread_create() (px), 31 rcu_is_watching() (k), 159 srcu_barrier() (k), 153
pthread_exit() (px), 31 RCU_LOCKDEP_WARN() (k), 159 srcu_read_lock() (k), 151
pthread_getspecific() (px), 37 RCU_NONIDLE() (k), 159 srcu_read_lock_held() (k), 159
pthread_join() (px), 31 rcu_pointer_handoff() (k), 154 srcu_read_unlock() (k), 151
pthread_key_create() (px), 37 RCU_POINTER_INITIALIZER() (k), 154 srcu_struct (k), 151
pthread_key_delete() (px), 37 rcu_read_lock() (k), 139, 151 struct task_struct (k), 38
pthread_kill() (px), 67 rcu_read_lock_bh() (k), 151 synchronize_irq() (k), 510
pthread_mutex_init() (px), 32 rcu_read_lock_bh_held() (k), 159 synchronize_net() (k), 151
PTHREAD_MUTEX_INITIALIZER (px), 32 rcu_read_lock_held() (k), 159 synchronize_rcu() (k), 139, 151
pthread_mutex_lock() (px), 32, 107 rcu_read_lock_sched() (k), 151 synchronize_rcu_expedited() (k),
pthread_mutex_t (px), 32, 34, 106 rcu_read_lock_sched_held() (k), 151
pthread_mutex_unlock() (px), 32 159 synchronize_rcu_tasks() (k), 153
pthread_rwlock_init() (px), 34 rcu_read_unlock() (k), 139, 151 synchronize_srcu() (k), 153
PTHREAD_RWLOCK_INITIALIZER (px), rcu_read_unlock_bh() (k), 151 synchronize_srcu_expedited() (k),
34 rcu_read_unlock_sched() (k), 151 153
pthread_rwlock_rdlock() (px), 34 rcu_register_thread() (ur), 37
this_cpu_ptr() (k), 46
pthread_rwlock_t (px), 34 rcu_replace_pointer() (k), 154
thread_id_t (pf), 38
pthread_rwlock_unlock() (px), 34 rcu_sleep_check() (k), 159
pthread_rwlock_wrlock() (px), 34 rcu_unregister_thread() (ur), 38
unlikely() (k), 43
pthread_setspecific() (px), 37 READ_ONCE() (k), 33, 35–37, 41, 42,
pthread_t (px), 37 44–46, 472–474 vfork() (px), 47, 476
volatile (c), 43, 44, 48
rcu_access_pointer() (k), 154 schedule() (k), 153
rcu_assign_pointer() (k), 139, 154 schedule_timeout_ wait() (px), 30, 31, 39, 47, 474
rcu_barrier() (k), 151 interruptible() (k), wait_all_threads() (pf), 38, 39
rcu_barrier_tasks() (k), 153 38 wait_thread() (pf), 38, 39
rcu_cpu_stall_reset() (k), 159 sig_atomic_t (c), 42 waitall() (px), 30
rcu_dereference() (k), 139, 154 SLAB_TYPESAFE_BY_RCU (k), 151 WRITE_ONCE() (k), 33, 36, 41, 42, 44–46,
rcu_dereference_check() (k), 154 smp_init() (pf), 37 472, 474, 475