
arXiv:1701.00854v5 [cs.DC] 26 Sep 2022

Is Parallel Programming Hard, And, If So,
What Can You Do About It?

Edited by:

Paul E. McKenney
Facebook
[email protected]

September 25, 2022


Release v2022.09.25a


Legal Statement
This work represents the views of the editor and the authors and does not necessarily
represent the view of their respective employers.

Trademarks:
• IBM, z Systems, and PowerPC are trademarks or registered trademarks of Inter-
national Business Machines Corporation in the United States, other countries, or
both.
• Linux is a registered trademark of Linus Torvalds.
• Intel, Itanium, Intel Core, and Intel Xeon are trademarks of Intel Corporation or
its subsidiaries in the United States, other countries, or both.
• Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or
elsewhere.
• SPARC is a registered trademark of SPARC International, Inc. Products bearing
SPARC trademarks are based on an architecture developed by Sun Microsystems,
Inc.
• Other company, product, and service names may be trademarks or service marks
of such companies.
The non-source-code text and images in this document are provided under the terms
of the Creative Commons Attribution-Share Alike 3.0 United States license.1 In brief,
you may use the contents of this document for any purpose, personal, commercial, or
otherwise, so long as attribution to the authors is maintained. Likewise, the document
may be modified, and derivative works and translations made available, so long as
such modifications and derivations are offered to the public on equal terms as the
non-source-code text and images in the original document.
Source code is covered by various versions of the GPL.2 Some of this code is
GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later. See
the comment headers of the individual source files within the CodeSamples directory in
the git archive3 for the exact licenses. If you are unsure of the license for a given code
fragment, you should assume GPLv2-only.
Combined work © 2005–2022 by Paul E. McKenney. Each individual contribution
is copyright by its contributor at the time of contribution, as recorded in the git archive.

1 https://creativecommons.org/licenses/by-sa/3.0/us/
2 https://www.gnu.org/licenses/gpl-2.0.html
3 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

Contents

1 How To Use This Book 1


1.1 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Quick Quizzes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Alternatives to This Book . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Sample Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Whose Book Is This? . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Introduction 7
2.1 Historic Parallel Programming Difficulties . . . . . . . . . . . . . . . 7
2.2 Parallel Programming Goals . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Productivity . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Alternatives to Parallel Programming . . . . . . . . . . . . . . . . . . 12
2.3.1 Multiple Instances of a Sequential Application . . . . . . . . 12
2.3.2 Use Existing Parallel Software . . . . . . . . . . . . . . . . . 12
2.3.3 Performance Optimization . . . . . . . . . . . . . . . . . . . 13
2.4 What Makes Parallel Programming Hard? . . . . . . . . . . . . . . . 13
2.4.1 Work Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Parallel Access Control . . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Resource Partitioning and Replication . . . . . . . . . . . . . 15
2.4.4 Interacting With Hardware . . . . . . . . . . . . . . . . . . . 15
2.4.5 Composite Capabilities . . . . . . . . . . . . . . . . . . . . . 15
2.4.6 How Do Languages and Environments Assist With These Tasks? 16
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Hardware and its Habits 17


3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Pipelined CPUs . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Memory References . . . . . . . . . . . . . . . . . . . . . . 19
3.1.3 Atomic Operations . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.4 Memory Barriers . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.5 Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.6 I/O Operations . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Overheads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Hardware System Architecture . . . . . . . . . . . . . . . . . 21
3.2.2 Costs of Operations . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Hardware Optimizations . . . . . . . . . . . . . . . . . . . . 24


3.3 Hardware Free Lunch? . . . . . . . . . . . . . . . . . . . . . . . . . 25


3.3.1 3D Integration . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Novel Materials and Processes . . . . . . . . . . . . . . . . . 26
3.3.3 Light, Not Electrons . . . . . . . . . . . . . . . . . . . . . . 26
3.3.4 Special-Purpose Accelerators . . . . . . . . . . . . . . . . . 27
3.3.5 Existing Parallel Software . . . . . . . . . . . . . . . . . . . 27
3.4 Software Design Implications . . . . . . . . . . . . . . . . . . . . . . 27

4 Tools of the Trade 29


4.1 Scripting Languages . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 POSIX Multiprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 POSIX Process Creation and Destruction . . . . . . . . . . . 30
4.2.2 POSIX Thread Creation and Destruction . . . . . . . . . . . 31
4.2.3 POSIX Locking . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.4 POSIX Reader-Writer Locking . . . . . . . . . . . . . . . . . 34
4.2.5 Atomic Operations (GCC Classic) . . . . . . . . . . . . . . . 36
4.2.6 Atomic Operations (C11) . . . . . . . . . . . . . . . . . . . . 36
4.2.7 Atomic Operations (Modern GCC) . . . . . . . . . . . . . . 37
4.2.8 Per-Thread Variables . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Alternatives to POSIX Operations . . . . . . . . . . . . . . . . . . . 37
4.3.1 Organization and Initialization . . . . . . . . . . . . . . . . . 37
4.3.2 Thread Creation, Destruction, and Control . . . . . . . . . . . 38
4.3.3 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.4 Accessing Shared Variables . . . . . . . . . . . . . . . . . . 40
4.3.5 Atomic Operations . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.6 Per-CPU Variables . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 The Right Tool for the Job: How to Choose? . . . . . . . . . . . . . . 47

5 Counting 49
5.1 Why Isn’t Concurrent Counting Trivial? . . . . . . . . . . . . . . . . 49
5.2 Statistical Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Array-Based Implementation . . . . . . . . . . . . . . . . . . 51
5.2.3 Per-Thread-Variable-Based Implementation . . . . . . . . . . 52
5.2.4 Eventually Consistent Implementation . . . . . . . . . . . . . 54
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Approximate Limit Counters . . . . . . . . . . . . . . . . . . . . . . 55
5.3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.2 Simple Limit Counter Implementation . . . . . . . . . . . . . 56
5.3.3 Simple Limit Counter Discussion . . . . . . . . . . . . . . . 59
5.3.4 Approximate Limit Counter Implementation . . . . . . . . . 60
5.3.5 Approximate Limit Counter Discussion . . . . . . . . . . . . 61
5.4 Exact Limit Counters . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 Atomic Limit Counter Implementation . . . . . . . . . . . . 61
5.4.2 Atomic Limit Counter Discussion . . . . . . . . . . . . . . . 64
5.4.3 Signal-Theft Limit Counter Design . . . . . . . . . . . . . . 64
5.4.4 Signal-Theft Limit Counter Implementation . . . . . . . . . . 65
5.4.5 Signal-Theft Limit Counter Discussion . . . . . . . . . . . . 67
5.4.6 Applying Exact Limit Counters . . . . . . . . . . . . . . . . 68
5.5 Parallel Counting Discussion . . . . . . . . . . . . . . . . . . . . . . 68


5.5.1 Parallel Counting Validation . . . . . . . . . . . . . . . . . . 69


5.5.2 Parallel Counting Performance . . . . . . . . . . . . . . . . . 69
5.5.3 Parallel Counting Specializations . . . . . . . . . . . . . . . 69
5.5.4 Parallel Counting Lessons . . . . . . . . . . . . . . . . . . . 70

6 Partitioning and Synchronization Design 73


6.1 Partitioning Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1.1 Dining Philosophers Problem . . . . . . . . . . . . . . . . . 73
6.1.2 Double-Ended Queue . . . . . . . . . . . . . . . . . . . . . . 75
6.1.3 Partitioning Example Discussion . . . . . . . . . . . . . . . . 80
6.2 Design Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3 Synchronization Granularity . . . . . . . . . . . . . . . . . . . . . . 82
6.3.1 Sequential Program . . . . . . . . . . . . . . . . . . . . . . . 82
6.3.2 Code Locking . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.3 Data Locking . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.4 Data Ownership . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.5 Locking Granularity and Performance . . . . . . . . . . . . . 86
6.4 Parallel Fastpath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.1 Reader/Writer Locking . . . . . . . . . . . . . . . . . . . . . 89
6.4.2 Hierarchical Locking . . . . . . . . . . . . . . . . . . . . . . 89
6.4.3 Resource Allocator Caches . . . . . . . . . . . . . . . . . . . 89
6.5 Beyond Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5.1 Work-Queue Parallel Maze Solver . . . . . . . . . . . . . . . 93
6.5.2 Alternative Parallel Maze Solver . . . . . . . . . . . . . . . . 95
6.5.3 Maze Validation . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5.4 Performance Comparison I . . . . . . . . . . . . . . . . . . . 96
6.5.5 Alternative Sequential Maze Solver . . . . . . . . . . . . . . 98
6.5.6 Performance Comparison II . . . . . . . . . . . . . . . . . . 98
6.5.7 Future Directions and Conclusions . . . . . . . . . . . . . . . 99
6.6 Partitioning, Parallelism, and Optimization . . . . . . . . . . . . . . . 99

7 Locking 101
7.1 Staying Alive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.1 Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.1.2 Livelock and Starvation . . . . . . . . . . . . . . . . . . . . 108
7.1.3 Unfairness . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.1.4 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2 Types of Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2.1 Exclusive Locks . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2.2 Reader-Writer Locks . . . . . . . . . . . . . . . . . . . . . . 110
7.2.3 Beyond Reader-Writer Locks . . . . . . . . . . . . . . . . . . 111
7.2.4 Scoped Locking . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Locking Implementation Issues . . . . . . . . . . . . . . . . . . . . . 114
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange . . . 114
7.3.2 Other Exclusive-Locking Implementations . . . . . . . . . . 114
7.4 Lock-Based Existence Guarantees . . . . . . . . . . . . . . . . . . . 116
7.5 Locking: Hero or Villain? . . . . . . . . . . . . . . . . . . . . . . . 117
7.5.1 Locking For Applications: Hero! . . . . . . . . . . . . . . . 118
7.5.2 Locking For Parallel Libraries: Just Another Tool . . . . . . . 118


7.5.3 Locking For Parallelizing Sequential Libraries: Villain! . . . 120


7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

8 Data Ownership 123


8.1 Multiple Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2 Partial Data Ownership and pthreads . . . . . . . . . . . . . . . . . . 124
8.3 Function Shipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.4 Designated Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.5 Privatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.6 Other Uses of Data Ownership . . . . . . . . . . . . . . . . . . . . . 125

9 Deferred Processing 127


9.1 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
9.2 Reference Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.3 Hazard Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
9.4 Sequence Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.5 Read-Copy Update (RCU) . . . . . . . . . . . . . . . . . . . . . . . 137
9.5.1 Introduction to RCU . . . . . . . . . . . . . . . . . . . . . . 138
9.5.2 RCU Fundamentals . . . . . . . . . . . . . . . . . . . . . . . 144
9.5.3 RCU Linux-Kernel API . . . . . . . . . . . . . . . . . . . . 150
9.5.4 RCU Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.5.5 RCU Related Work . . . . . . . . . . . . . . . . . . . . . . . 177
9.6 Which to Choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
9.6.1 Which to Choose? (Overview) . . . . . . . . . . . . . . . . . 179
9.6.2 Which to Choose? (Details) . . . . . . . . . . . . . . . . . . 180
9.6.3 Which to Choose? (Production Use) . . . . . . . . . . . . . . 182
9.7 What About Updates? . . . . . . . . . . . . . . . . . . . . . . . . . . 183

10 Data Structures 185


10.1 Motivating Application . . . . . . . . . . . . . . . . . . . . . . . . . 185
10.2 Partitionable Data Structures . . . . . . . . . . . . . . . . . . . . . . 185
10.2.1 Hash-Table Design . . . . . . . . . . . . . . . . . . . . . . . 186
10.2.2 Hash-Table Implementation . . . . . . . . . . . . . . . . . . 186
10.2.3 Hash-Table Performance . . . . . . . . . . . . . . . . . . . . 187
10.3 Read-Mostly Data Structures . . . . . . . . . . . . . . . . . . . . . . 189
10.3.1 RCU-Protected Hash Table Implementation . . . . . . . . . . 189
10.3.2 RCU-Protected Hash Table Validation . . . . . . . . . . . . . 190
10.3.3 RCU-Protected Hash Table Performance . . . . . . . . . . . . 190
10.3.4 RCU-Protected Hash Table Discussion . . . . . . . . . . . . 193
10.4 Non-Partitionable Data Structures . . . . . . . . . . . . . . . . . . . 194
10.4.1 Resizable Hash Table Design . . . . . . . . . . . . . . . . . . 194
10.4.2 Resizable Hash Table Implementation . . . . . . . . . . . . . 195
10.4.3 Resizable Hash Table Discussion . . . . . . . . . . . . . . . 200
10.4.4 Other Resizable Hash Tables . . . . . . . . . . . . . . . . . . 201
10.5 Other Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.6 Micro-Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.6.1 Specialization . . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.6.2 Bits and Bytes . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.6.3 Hardware Considerations . . . . . . . . . . . . . . . . . . . . 205
10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206


11 Validation 207
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.1.1 Where Do Bugs Come From? . . . . . . . . . . . . . . . . . 207
11.1.2 Required Mindset . . . . . . . . . . . . . . . . . . . . . . . . 208
11.1.3 When Should Validation Start? . . . . . . . . . . . . . . . . . 210
11.1.4 The Open Source Way . . . . . . . . . . . . . . . . . . . . . 211
11.2 Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
11.3 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
11.4 Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.5 Code Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.5.1 Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.5.2 Walkthroughs . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.5.3 Self-Inspection . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.6 Probability and Heisenbugs . . . . . . . . . . . . . . . . . . . . . . . 215
11.6.1 Statistics for Discrete Testing . . . . . . . . . . . . . . . . . . 216
11.6.2 Statistics Abuse for Discrete Testing . . . . . . . . . . . . . . 217
11.6.3 Statistics for Continuous Testing . . . . . . . . . . . . . . . . 217
11.6.4 Hunting Heisenbugs . . . . . . . . . . . . . . . . . . . . . . 218
11.7 Performance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 222
11.7.1 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . 222
11.7.2 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
11.7.3 Differential Profiling . . . . . . . . . . . . . . . . . . . . . . 223
11.7.4 Microbenchmarking . . . . . . . . . . . . . . . . . . . . . . 223
11.7.5 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.7.6 Detecting Interference . . . . . . . . . . . . . . . . . . . . . 225
11.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

12 Formal Verification 229


12.1 State-Space Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.1.1 Promela and Spin . . . . . . . . . . . . . . . . . . . . . . . . 229
12.1.2 How to Use Promela . . . . . . . . . . . . . . . . . . . . . . 231
12.1.3 Promela Example: Locking . . . . . . . . . . . . . . . . . . 234
12.1.4 Promela Example: QRCU . . . . . . . . . . . . . . . . . . . 236
12.1.5 Promela Parable: dynticks and Preemptible RCU . . . . . . . 241
12.1.6 Validating Preemptible RCU and dynticks . . . . . . . . . . . 244
12.2 Special-Purpose State-Space Search . . . . . . . . . . . . . . . . . . 256
12.2.1 Anatomy of a Litmus Test . . . . . . . . . . . . . . . . . . . 257
12.2.2 What Does This Litmus Test Mean? . . . . . . . . . . . . . . 258
12.2.3 Running a Litmus Test . . . . . . . . . . . . . . . . . . . . . 258
12.2.4 PPCMEM Discussion . . . . . . . . . . . . . . . . . . . . . 259
12.3 Axiomatic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 260
12.3.1 Axiomatic Approaches and Locking . . . . . . . . . . . . . . 261
12.3.2 Axiomatic Approaches and RCU . . . . . . . . . . . . . . . . 261
12.4 SAT Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.5 Stateless Model Checkers . . . . . . . . . . . . . . . . . . . . . . . . 264
12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
12.7 Choosing a Validation Plan . . . . . . . . . . . . . . . . . . . . . . . 266


13 Putting It All Together 269


13.1 Counter Conundrums . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.1.1 Counting Updates . . . . . . . . . . . . . . . . . . . . . . . . 269
13.1.2 Counting Lookups . . . . . . . . . . . . . . . . . . . . . . . 269
13.2 Refurbish Reference Counting . . . . . . . . . . . . . . . . . . . . . 270
13.2.1 Implementation of Reference-Counting Categories . . . . . . 271
13.2.2 Counter Optimizations . . . . . . . . . . . . . . . . . . . . . 274
13.3 Hazard-Pointer Helpers . . . . . . . . . . . . . . . . . . . . . . . . . 274
13.3.1 Scalable Reference Count . . . . . . . . . . . . . . . . . . . 274
13.3.2 Long-Duration Accesses . . . . . . . . . . . . . . . . . . . . 274
13.4 Sequence-Locking Specials . . . . . . . . . . . . . . . . . . . . . . . 274
13.4.1 Dueling Sequence Locks . . . . . . . . . . . . . . . . . . . . 275
13.4.2 Correlated Data Elements . . . . . . . . . . . . . . . . . . . 275
13.4.3 Atomic Move . . . . . . . . . . . . . . . . . . . . . . . . . . 276
13.4.4 Upgrade to Writer . . . . . . . . . . . . . . . . . . . . . . . 277
13.5 RCU Rescues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
13.5.1 RCU and Per-Thread-Variable-Based Statistical Counters . . . 277
13.5.2 RCU and Counters for Removable I/O Devices . . . . . . . . 279
13.5.3 Array and Length . . . . . . . . . . . . . . . . . . . . . . . . 279
13.5.4 Correlated Fields . . . . . . . . . . . . . . . . . . . . . . . . 280
13.5.5 Update-Friendly Traversal . . . . . . . . . . . . . . . . . . . 280
13.5.6 Scalable Reference Count Two . . . . . . . . . . . . . . . . . 281
13.5.7 Retriggered Grace Periods . . . . . . . . . . . . . . . . . . . 281
13.5.8 Long-Duration Accesses Two . . . . . . . . . . . . . . . . . 283

14 Advanced Synchronization 285


14.1 Avoiding Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
14.2 Non-Blocking Synchronization . . . . . . . . . . . . . . . . . . . . . 286
14.2.1 Simple NBS . . . . . . . . . . . . . . . . . . . . . . . . . . 286
14.2.2 Applicability of NBS Benefits . . . . . . . . . . . . . . . . . 288
14.2.3 NBS Discussion . . . . . . . . . . . . . . . . . . . . . . . . 291
14.3 Parallel Real-Time Computing . . . . . . . . . . . . . . . . . . . . . 292
14.3.1 What is Real-Time Computing? . . . . . . . . . . . . . . . . 292
14.3.2 Who Needs Real-Time? . . . . . . . . . . . . . . . . . . . . 296
14.3.3 Who Needs Parallel Real-Time? . . . . . . . . . . . . . . . . 297
14.3.4 Implementing Parallel Real-Time Systems . . . . . . . . . . . 297
14.3.5 Implementing Parallel Real-Time Operating Systems . . . . . 298
14.3.6 Implementing Parallel Real-Time Applications . . . . . . . . 308
14.3.7 Real Time vs. Real Fast: How to Choose? . . . . . . . . . . . 311

15 Advanced Synchronization: Memory Ordering 313


15.1 Ordering: Why and How? . . . . . . . . . . . . . . . . . . . . . . . 313
15.1.1 Why Hardware Misordering? . . . . . . . . . . . . . . . . . . 314
15.1.2 How to Force Ordering? . . . . . . . . . . . . . . . . . . . . 315
15.1.3 Basic Rules of Thumb . . . . . . . . . . . . . . . . . . . . . 318
15.2 Tricks and Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
15.2.1 Variables With Multiple Values . . . . . . . . . . . . . . . . 319
15.2.2 Memory-Reference Reordering . . . . . . . . . . . . . . . . 320
15.2.3 Address Dependencies . . . . . . . . . . . . . . . . . . . . . 323
15.2.4 Data Dependencies . . . . . . . . . . . . . . . . . . . . . . . 324


15.2.5 Control Dependencies . . . . . . . . . . . . . . . . . . . . . 325


15.2.6 Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . 326
15.2.7 Multicopy Atomicity . . . . . . . . . . . . . . . . . . . . . . 326
15.3 Compile-Time Consternation . . . . . . . . . . . . . . . . . . . . . . 334
15.3.1 Memory-Reference Restrictions . . . . . . . . . . . . . . . . 334
15.3.2 Address- and Data-Dependency Difficulties . . . . . . . . . . 335
15.3.3 Control-Dependency Calamities . . . . . . . . . . . . . . . . 337
15.4 Higher-Level Primitives . . . . . . . . . . . . . . . . . . . . . . . . . 340
15.4.1 Memory Allocation . . . . . . . . . . . . . . . . . . . . . . . 341
15.4.2 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
15.4.3 RCU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
15.5 Hardware Specifics . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
15.5.1 Alpha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
15.5.2 Armv7-A/R . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
15.5.3 Armv8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
15.5.4 Itanium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
15.5.5 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
15.5.6 POWER / PowerPC . . . . . . . . . . . . . . . . . . . . . . . 357
15.5.7 SPARC TSO . . . . . . . . . . . . . . . . . . . . . . . . . . 358
15.5.8 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
15.5.9 z Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
15.6 Where is Memory Ordering Needed? . . . . . . . . . . . . . . . . . . 359

16 Ease of Use 363


16.1 What is Easy? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
16.2 Rusty Scale for API Design . . . . . . . . . . . . . . . . . . . . . . . 363
16.3 Shaving the Mandelbrot Set . . . . . . . . . . . . . . . . . . . . . . . 364

17 Conflicting Visions of the Future 367


17.1 The Future of CPU Technology Ain’t What it Used to Be . . . . . . . 367
17.1.1 Uniprocessor Über Alles . . . . . . . . . . . . . . . . . . . . 367
17.1.2 Multithreaded Mania . . . . . . . . . . . . . . . . . . . . . . 369
17.1.3 More of the Same . . . . . . . . . . . . . . . . . . . . . . . . 369
17.1.4 Crash Dummies Slamming into the Memory Wall . . . . . . . 370
17.1.5 Astounding Accelerators . . . . . . . . . . . . . . . . . . . . 371
17.2 Transactional Memory . . . . . . . . . . . . . . . . . . . . . . . . . 371
17.2.1 Outside World . . . . . . . . . . . . . . . . . . . . . . . . . 371
17.2.2 Process Modification . . . . . . . . . . . . . . . . . . . . . . 374
17.2.3 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . 378
17.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
17.3 Hardware Transactional Memory . . . . . . . . . . . . . . . . . . . . 383
17.3.1 HTM Benefits WRT Locking . . . . . . . . . . . . . . . . . 383
17.3.2 HTM Weaknesses WRT Locking . . . . . . . . . . . . . . . 384
17.3.3 HTM Weaknesses WRT Locking When Augmented . . . . . 388
17.3.4 Where Does HTM Best Fit In? . . . . . . . . . . . . . . . . . 391
17.3.5 Potential Game Changers . . . . . . . . . . . . . . . . . . . . 391
17.3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
17.4 Formal Regression Testing? . . . . . . . . . . . . . . . . . . . . . . . 395
17.4.1 Automatic Translation . . . . . . . . . . . . . . . . . . . . . 395
17.4.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 396


17.4.3 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396


17.4.4 Locate Bugs . . . . . . . . . . . . . . . . . . . . . . . . . . 397
17.4.5 Minimal Scaffolding . . . . . . . . . . . . . . . . . . . . . . 398
17.4.6 Relevant Bugs . . . . . . . . . . . . . . . . . . . . . . . . . 398
17.4.7 Formal Regression Scorecard . . . . . . . . . . . . . . . . . 399
17.5 Functional Programming for Parallelism . . . . . . . . . . . . . . . . 400
17.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401

18 Looking Forward and Back 403

A Important Questions 407


A.1 Why Aren’t Parallel Programs Always Faster? . . . . . . . . . . . . . 407
A.2 Why Not Remove Locking? . . . . . . . . . . . . . . . . . . . . . . . 407
A.3 What Time Is It? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
A.4 What Does “After” Mean? . . . . . . . . . . . . . . . . . . . . . . . 408
A.5 How Much Ordering Is Needed? . . . . . . . . . . . . . . . . . . . . 410
A.5.1 Where is the Defining Data? . . . . . . . . . . . . . . . . . . 411
A.5.2 Consistent Data Used Consistently? . . . . . . . . . . . . . . 412
A.5.3 Is the Problem Partitionable? . . . . . . . . . . . . . . . . . . 412
A.5.4 None of the Above? . . . . . . . . . . . . . . . . . . . . . . . 412
A.6 What is the Difference Between “Concurrent” and “Parallel”? . . . . . 412
A.7 Why Is Software Buggy? . . . . . . . . . . . . . . . . . . . . . . . . 413

B “Toy” RCU Implementations 415


B.1 Lock-Based RCU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
B.2 Per-Thread Lock-Based RCU . . . . . . . . . . . . . . . . . . . . . . 416
B.3 Simple Counter-Based RCU . . . . . . . . . . . . . . . . . . . . . . 416
B.4 Starvation-Free Counter-Based RCU . . . . . . . . . . . . . . . . . . 417
B.5 Scalable Counter-Based RCU . . . . . . . . . . . . . . . . . . . . . . 419
B.6 Scalable Counter-Based RCU With Shared Grace Periods . . . . . . . 420
B.7 RCU Based on Free-Running Counter . . . . . . . . . . . . . . . . . 422
B.8 Nestable RCU Based on Free-Running Counter . . . . . . . . . . . . 423
B.9 RCU Based on Quiescent States . . . . . . . . . . . . . . . . . . . . 425
B.10 Summary of Toy RCU Implementations . . . . . . . . . . . . . . . . 426

C Why Memory Barriers? 429


C.1 Cache Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
C.2 Cache-Coherence Protocols . . . . . . . . . . . . . . . . . . . . . . . 431
C.2.1 MESI States . . . . . . . . . . . . . . . . . . . . . . . . . . 431
C.2.2 MESI Protocol Messages . . . . . . . . . . . . . . . . . . . . 431
C.2.3 MESI State Diagram . . . . . . . . . . . . . . . . . . . . . . 432
C.2.4 MESI Protocol Example . . . . . . . . . . . . . . . . . . . . 433
C.3 Stores Result in Unnecessary Stalls . . . . . . . . . . . . . . . . . . . 433
C.3.1 Store Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . 434
C.3.2 Store Forwarding . . . . . . . . . . . . . . . . . . . . . . . . 435
C.3.3 Store Buffers and Memory Barriers . . . . . . . . . . . . . . 436
C.4 Store Sequences Result in Unnecessary Stalls . . . . . . . . . . . . . 437
C.4.1 Invalidate Queues . . . . . . . . . . . . . . . . . . . . . . . . 438
C.4.2 Invalidate Queues and Invalidate Acknowledge . . . . . . . . 438
C.4.3 Invalidate Queues and Memory Barriers . . . . . . . . . . . . 438


C.5 Read and Write Memory Barriers . . . . . . . . . . . . . . . . . . . 440


C.6 Example Memory-Barrier Sequences . . . . . . . . . . . . . . . . . . 440
C.6.1 Ordering-Hostile Architecture . . . . . . . . . . . . . . . . . 440
C.6.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
C.6.3 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
C.6.4 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
C.7 Are Memory Barriers Forever? . . . . . . . . . . . . . . . . . . . . . 442
C.8 Advice to Hardware Designers . . . . . . . . . . . . . . . . . . . . . 443

D Style Guide 445


D.1 Paul’s Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
D.2 NIST Style Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
D.2.1 Unit Symbol . . . . . . . . . . . . . . . . . . . . . . . . . . 446
D.2.2 NIST Guide Yet To Be Followed . . . . . . . . . . . . . . . . 447
D.3 LaTeX Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . 447
D.3.1 Monospace Font . . . . . . . . . . . . . . . . . . . . . . . . 447
D.3.2 Cross-reference . . . . . . . . . . . . . . . . . . . . . . . . . 451
D.3.3 Non Breakable Spaces . . . . . . . . . . . . . . . . . . . . . 451
D.3.4 Hyphenation and Dashes . . . . . . . . . . . . . . . . . . . . 452
D.3.5 Punctuation . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
D.3.6 Floating Object Format . . . . . . . . . . . . . . . . . . . . . 453
D.3.7 Improvement Candidates . . . . . . . . . . . . . . . . . . . . 454

E Answers to Quick Quizzes 461


E.1 How To Use This Book . . . . . . . . . . . . . . . . . . . . . . . . . 461
E.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
E.3 Hardware and its Habits . . . . . . . . . . . . . . . . . . . . . . . . . 466
E.4 Tools of the Trade . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
E.5 Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
E.6 Partitioning and Synchronization Design . . . . . . . . . . . . . . . . 489
E.7 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
E.8 Data Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
E.9 Deferred Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
E.10 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
E.11 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
E.12 Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
E.13 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . 535
E.14 Advanced Synchronization . . . . . . . . . . . . . . . . . . . . . . . 538
E.15 Advanced Synchronization: Memory Ordering . . . . . . . . . . . . 541
E.16 Ease of Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
E.17 Conflicting Visions of the Future . . . . . . . . . . . . . . . . . . . . 552
E.18 Important Questions . . . . . . . . . . . . . . . . . . . . . . . . . . 557
E.19 “Toy” RCU Implementations . . . . . . . . . . . . . . . . . . . . . . 558
E.20 Why Memory Barriers? . . . . . . . . . . . . . . . . . . . . . . . . . 564

Glossary 569

Bibliography 579


Credits 623
LaTeX Advisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Machine Owners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Original Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Figure Credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
Other Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625

Acronyms 627

Index 629

API Index 633

Chapter 1

How To Use This Book

If you would only recognize that life is hard, things would be so much easier for you.

Louis D. Brandeis

The purpose of this book is to help you program shared-memory parallel systems without risking your sanity.1 Nevertheless, you should think of the information in this book as a foundation on which to build, rather than as a completed cathedral. Your mission, if you choose to accept, is to help make further progress in the exciting field of parallel programming—progress that will in time render this book obsolete.

Parallel programming in the 21st century is no longer focused solely on science, research, and grand-challenge projects. And this is all to the good, because it means that parallel programming is becoming an engineering discipline. Therefore, as befits an engineering discipline, this book examines specific parallel-programming tasks and describes how to approach them. In some surprisingly common cases, these tasks can be automated.

This book is written in the hope that presenting the engineering discipline underlying successful parallel-programming projects will free a new generation of parallel hackers from the need to slowly and painstakingly reinvent old wheels, enabling them to instead focus their energy and creativity on new frontiers. However, what you get from this book will be determined by what you put into it. It is hoped that simply reading this book will be helpful, and that working the Quick Quizzes will be even more helpful. However, the best results come from applying the techniques taught in this book to real-life problems. As always, practice makes perfect.

But no matter how you approach it, we sincerely hope that parallel programming brings you at least as much fun, excitement, and challenge as it has brought to us!

1 Or, perhaps more accurately, without much greater risk to your sanity than that incurred by non-parallel programming. Which, come to think of it, might not be saying all that much.

1.1 Roadmap

Cat: Where are you going?
Alice: Which way should I go?
Cat: That depends on where you are going.
Alice: I don’t know.
Cat: Then it doesn’t matter which way you go.

Lewis Carroll, Alice in Wonderland

This book is a handbook of widely applicable and heavily used design techniques, rather than a collection of optimal algorithms with tiny areas of applicability. You are currently reading Chapter 1, but you knew that already. Chapter 2 gives a high-level overview of parallel programming.

Chapter 3 introduces shared-memory parallel hardware. After all, it is difficult to write good parallel code unless you understand the underlying hardware. Because hardware constantly evolves, this chapter will always be out of date. We will nevertheless do our best to keep up. Chapter 4 then provides a very brief overview of common shared-memory parallel-programming primitives.

Chapter 5 takes an in-depth look at parallelizing one of the simplest problems imaginable, namely counting. Because almost everyone has an excellent grasp of counting, this chapter is able to delve into many important parallel-programming issues without the distractions of more-typical computer-science problems. My impression is that this chapter has seen the greatest use in parallel-programming coursework.

Chapter 6 introduces a number of design-level methods of addressing the issues identified in Chapter 5. It turns out that it is important to address parallelism at the design level when feasible: To paraphrase Dijkstra [Dij68], “retrofitted parallelism considered grossly suboptimal” [McK12c].


The next three chapters examine three important approaches to synchronization. Chapter 7 covers locking, which is still not only the workhorse of production-quality parallel programming, but is also widely considered to be parallel programming’s worst villain. Chapter 8 gives a brief overview of data ownership, an often overlooked but remarkably pervasive and powerful approach. Finally, Chapter 9 introduces a number of deferred-processing mechanisms, including reference counting, hazard pointers, sequence locking, and RCU.

Chapter 10 applies the lessons of previous chapters to hash tables, which are heavily used due to their excellent partitionability, which (usually) leads to excellent performance and scalability.

As many have learned to their sorrow, parallel programming without validation is a sure path to abject failure. Chapter 11 covers various forms of testing. It is of course impossible to test reliability into your program after the fact, so Chapter 12 follows up with a brief overview of a couple of practical approaches to formal verification.

Chapter 13 contains a series of moderate-sized parallel programming problems. The difficulty of these problems varies, but they should be appropriate for someone who has mastered the material in the previous chapters.

Chapter 14 looks at advanced synchronization methods, including non-blocking synchronization and parallel real-time computing, while Chapter 15 covers the advanced topic of memory ordering. Chapter 16 follows up with some ease-of-use advice. Chapter 17 looks at a few possible future directions, including shared-memory parallel system design, software and hardware transactional memory, and functional programming for parallelism. Finally, Chapter 18 reviews the material in this book and its origins.

This chapter is followed by a number of appendices. The most popular of these appears to be Appendix C, which delves even further into memory ordering. Appendix E contains the answers to the infamous Quick Quizzes, which are discussed in the next section.

1.2 Quick Quizzes

Undertake something difficult, otherwise you will never grow.

Abbreviated from Ronald E. Osburn

“Quick quizzes” appear throughout this book, and the answers may be found in Appendix E starting on page 461. Some of them are based on material in which that quick quiz appears, but others require you to think beyond that section, and, in some cases, beyond the realm of current knowledge. As with most endeavors, what you get out of this book is largely determined by what you are willing to put into it. Therefore, readers who make a genuine effort to solve a quiz before looking at the answer find their effort repaid handsomely with increased understanding of parallel programming.

Quick Quiz 1.1: Where are the answers to the Quick Quizzes found?

Quick Quiz 1.2: Some of the Quick Quiz questions seem to be from the viewpoint of the reader rather than the author. Is that really the intent?

Quick Quiz 1.3: These Quick Quizzes are just not my cup of tea. What can I do about it?

In short, if you need a deep understanding of the material, then you should invest some time into answering the Quick Quizzes. Don’t get me wrong, passively reading the material can be quite valuable, but gaining full problem-solving capability really does require that you practice solving problems. Similarly, gaining full code-production capability really does require that you practice producing code.

Quick Quiz 1.4: If passively reading this book doesn’t get me full problem-solving and code-production capabilities, what on earth is the point???

I learned this the hard way during coursework for my late-in-life Ph.D. I was studying a familiar topic, and was surprised at how few of the chapter’s exercises I could answer off the top of my head.2 Forcing myself to answer the questions greatly increased my retention of the material. So with these Quick Quizzes I am not asking you to do anything that I have not been doing myself.

Finally, the most common learning disability is thinking that you already understand the material at hand. The quick quizzes can be an extremely effective cure.

2 So I suppose that it was just as well that my professors refused to let me waive that class!


1.3 Alternatives to This Book

Between two evils I always pick the one I never tried before.

Mae West

As Knuth learned the hard way, if you want your book to be finite, it must be focused. This book focuses on shared-memory parallel programming, with an emphasis on software that lives near the bottom of the software stack, such as operating-system kernels, parallel data-management systems, low-level libraries, and the like. The programming language used by this book is C.

If you are interested in other aspects of parallelism, you might well be better served by some other book. Fortunately, there are many alternatives available to you:

1. If you prefer a more academic and rigorous treatment of parallel programming, you might like Herlihy’s and Shavit’s textbook [HS08, HSLS20]. This book starts with an interesting combination of low-level primitives at high levels of abstraction from the hardware, and works its way through locking and simple data structures including lists, queues, hash tables, and counters, culminating with transactional memory, all in Java. Michael Scott’s textbook [Sco13] approaches similar material with more of a software-engineering focus, and, as far as I know, is the first formally published academic textbook with a section devoted to RCU.

Herlihy, Shavit, Luchangco, and Spear did catch up in their second edition [HSLS20] by adding short sections on hazard pointers and on RCU, with the latter in the guise of EBR.3 They also include a brief history of both, albeit with an abbreviated history of RCU that picks up almost a year after it was accepted into the Linux kernel and more than 20 years after Kung’s and Lehman’s landmark paper [KL80]. Those wishing a deeper view of the history may find it in this book’s Section 9.5.5.

However, readers who might otherwise suspect a hostile attitude towards RCU on the part of this textbook’s first author should refer to the last full sentence on the first page of one of his papers [BGHZ16]. This sentence reads “QSBR [a particular class of RCU implementations] is fast and can be applied to virtually any data structure.” These are clearly not the words of someone who is hostile towards RCU.

3 Albeit an implementation that contains a reader-preemption bug noted by Richard Bornat.

2. If you would like an academic treatment of parallel programming from a programming-language-pragmatics viewpoint, you might be interested in the concurrency chapter from Scott’s textbook [Sco06, Sco15] on programming-language pragmatics.

3. If you are interested in an object-oriented patternist treatment of parallel programming focusing on C++, you might try Volumes 2 and 4 of Schmidt’s POSA series [SSRB00, BHS07]. Volume 4 in particular has some interesting chapters applying this work to a warehouse application. The realism of this example is attested to by the section entitled “Partitioning the Big Ball of Mud”, in which the problems inherent in parallelism often take a back seat to getting one’s head around a real-world application.

4. If you want to work with Linux-kernel device drivers, then Corbet’s, Rubini’s, and Kroah-Hartman’s “Linux Device Drivers” [CRKH05] is indispensable, as is the Linux Weekly News web site (https://lwn.net/). There is a large number of books and resources on the more general topic of Linux kernel internals.

5. If your primary focus is scientific and technical computing, and you prefer a patternist approach, you might try Mattson et al.’s textbook [MSM05]. It covers Java, C/C++, OpenMP, and MPI. Its patterns are admirably focused first on design, then on implementation.

6. If your primary focus is scientific and technical computing, and you are interested in GPUs, CUDA, and MPI, you might check out Norm Matloff’s “Programming on Parallel Machines” [Mat17]. Of course, the GPU vendors have quite a bit of additional information [AMD20, Zel11, NVi17a, NVi17b].

7. If you are interested in POSIX Threads, you might take a look at David R. Butenhof’s book [But97]. In addition, W. Richard Stevens’s book [Ste92, Ste13] covers UNIX and POSIX, and Stewart Weiss’s lecture notes [Wei13] provide a thorough and accessible introduction with a good set of examples.

8. If you are interested in C++11, you might like Anthony Williams’s “C++ Concurrency in Action: Practical Multithreading” [Wil12, Wil19].


9. If you are interested in C++, but in a Windows environment, you might try Herb Sutter’s “Effective Concurrency” series in Dr. Dobb’s Journal [Sut08]. This series does a reasonable job of presenting a commonsense approach to parallelism.

10. If you want to try out Intel Threading Building Blocks, then perhaps James Reinders’s book [Rei07] is what you are looking for.

11. Those interested in learning how various types of multi-processor hardware cache organizations affect the implementation of kernel internals should take a look at Curt Schimmel’s classic treatment of this subject [Sch94].

12. If you are looking for a hardware view, Hennessy’s and Patterson’s classic textbook [HP17, HP11] is well worth a read. A “Readers Digest” version of this tome geared for scientific and technical workloads (bashing big arrays) may be found in Andrew Chien’s textbook [Chi22]. If you are looking for an academic textbook on memory ordering, that of Daniel Sorin et al. [SHW11, NSHW20] is highly recommended. For a memory-ordering tutorial from a Linux-kernel viewpoint, Paolo Bonzini’s LWN series is a good place to start [Bon21a, Bon21e, Bon21c, Bon21b, Bon21d, Bon21f].

13. Finally, those using Java might be well-served by Doug Lea’s textbooks [Lea97, GPB+07].

However, if you are interested in principles of parallel design for low-level software, especially software written in C, read on!

1.4 Sample Source Code

Use the source, Luke!

Unknown Star Wars fan

This book discusses its fair share of source code, and in many cases this source code may be found in the CodeSamples directory of this book’s git tree. For example, on UNIX systems, you should be able to type the following:

find CodeSamples -name rcu_rcpls.c -print

This command will locate the file rcu_rcpls.c, which is called out in Appendix B. Non-UNIX systems have their own well-known ways of locating files by filename.

1.5 Whose Book Is This?

If you become a teacher, by your pupils you’ll be taught.

Oscar Hammerstein II

As the cover says, the editor is one Paul E. McKenney. However, the editor does accept contributions via the [email protected] email list. These contributions can be in pretty much any form, with popular approaches including text emails, patches against the book’s LaTeX source, and even git pull requests. Use whatever form works best for you.

To create patches or git pull requests, you will need the LaTeX source to the book, which is at git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git. You will of course also need git and LaTeX, which are available as part of most mainstream Linux distributions. Other packages may be required, depending on the distribution you use. The required list of packages for a few popular distributions is listed in the file FAQ-BUILD.txt in the LaTeX source to the book.

To create and display a current LaTeX source tree of this book, use the list of Linux commands shown in Listing 1.1. In some environments, the evince command that displays perfbook.pdf may need to be replaced, for example, with acroread. The git clone command need only be used the first time you create a PDF; subsequently, you can run the commands shown in Listing 1.2 to pull in any updates and generate an updated PDF. The commands in Listing 1.2 must be run within the perfbook directory created by the commands shown in Listing 1.1.

Listing 1.1: Creating an Up-To-Date PDF
git clone git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
cd perfbook
# You may need to install a font. See item 1 in FAQ.txt.
make                     # -jN for parallel build
evince perfbook.pdf &    # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf & # One-column version for e-readers
make help                # Display other build options

Listing 1.2: Generating an Updated PDF
git remote update
git checkout origin/master
make                     # -jN for parallel build
evince perfbook.pdf &    # Two-column version
make perfbook-1c.pdf
evince perfbook-1c.pdf & # One-column version for e-readers

PDFs of this book are sporadically posted at https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html and at http://www.rdrop.com/users/paulmck/perfbook/.


The actual process of contributing patches and sending git pull requests is similar to that of the Linux kernel, which is documented here: https://www.kernel.org/doc/html/latest/process/submitting-patches.html. One important requirement is that each patch (or commit, in the case of a git pull request) must contain a valid Signed-off-by: line, which has the following format:

Signed-off-by: My Name <[email protected]>

Please see https://lkml.org/lkml/2007/1/15/219 for an example patch with a Signed-off-by: line. Note well that the Signed-off-by: line has a very specific meaning, namely that you are certifying that:

(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or

(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

This is quite similar to the Developer’s Certificate of Origin (DCO) 1.1 used by the Linux kernel. You must use your real name: I unfortunately cannot accept pseudonymous or anonymous contributions.
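If you are new to this process, the following minimal sketch shows one way (among several) to produce such a patch from within the perfbook source tree. It assumes that git has already been configured with your real name and email address, and the edit step is of course only a placeholder:

cd perfbook
# ... edit the LaTeX source or CodeSamples files as desired ...
git add -u                     # Stage your modifications to tracked files
git commit -s                  # The -s flag appends your Signed-off-by: line
git format-patch origin/master # Write 0001-*.patch files, one per commit

The resulting patch files may then be emailed to the list, or the corresponding commits may instead be offered as a git pull request.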
The language of this book is American English; however, the open-source nature of this book permits translations, and I personally encourage them. The open-source licenses covering this book additionally allow you to sell your translation, if you wish. I do request that you send me a copy of the translation (hardcopy if available), but this is a request made as a professional courtesy, and is not in any way a prerequisite to the permission that you already have under the Creative Commons and GPL licenses. Please see the FAQ.txt file in the source tree for a list of translations currently in progress. I consider a translation effort to be “in progress” once at least one chapter has been fully translated.

There are many styles under the “American English” rubric. The style for this particular book is documented in Appendix D.

As noted at the beginning of this section, I am this book’s editor. However, if you choose to contribute, it will be your book as well. In that spirit, I offer you Chapter 2, our introduction.

Chapter 2

Introduction

If parallel programming is so hard, why are there so many parallel programs?

Unknown

Parallel programming has earned a reputation as one of the most difficult areas a hacker can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race conditions, non-determinism, Amdahl’s-Law limits to scaling, and excessive realtime latencies. And these perils are quite real; we authors have accumulated uncounted years of experience along with the resulting emotional scars, grey hairs, and hair loss.

However, new technologies that are difficult to use at introduction invariably become easier over time. For example, the once-rare ability to drive a car is now commonplace in many countries. This dramatic change came about for two basic reasons: (1) Cars became cheaper and more readily available, so that more people had the opportunity to learn to drive, and (2) Cars became easier to operate due to automatic transmissions, automatic chokes, automatic starters, greatly improved reliability, and a host of other technological improvements.

The same is true for many other technologies, including computers. It is no longer necessary to operate a keypunch in order to program. Spreadsheets allow most non-programmers to get results from their computers that would have required a team of specialists a few decades ago. Perhaps the most compelling example is web-surfing and content creation, which since the early 2000s has been easily done by untrained, uneducated people using various now-commonplace social-networking tools. As recently as 1968, such content creation was a far-out research project [Eng68], described at the time as “like a UFO landing on the White House lawn” [Gri00].

Therefore, if you wish to argue that parallel programming will remain as difficult as it is currently perceived by many to be, it is you who bears the burden of proof, keeping in mind the many centuries of counter-examples in many fields of endeavor.

2.1 Historic Parallel Programming Difficulties

Not the power to remember, but its very opposite, the power to forget, is a necessary condition for our existence.

Sholem Asch

As indicated by its title, this book takes a different approach. Rather than complain about the difficulty of parallel programming, it instead examines the reasons why parallel programming is difficult, and then works to help the reader to overcome these difficulties. As will be seen, these difficulties have historically fallen into several categories, including:

1. The historic high cost and relative rarity of parallel systems.

2. The typical researcher’s and practitioner’s lack of experience with parallel systems.

3. The paucity of publicly accessible parallel code.

4. The lack of a widely understood engineering discipline of parallel programming.

5. The high overhead of communication relative to that of processing, even in tightly coupled shared-memory computers.

Many of these historic difficulties are well on the way to being overcome. First, over the past few decades, the cost of parallel systems has decreased from many multiples of that of a house to that of a modest meal, courtesy of Moore’s Law [Moo65].


Papers calling out the advantages of multicore CPUs were published as early as 1996 [ONH+96]. IBM introduced simultaneous multithreading into its high-end POWER family in 2000, and multicore in 2001. Intel introduced hyperthreading into its commodity Pentium line in November 2000, and both AMD and Intel introduced dual-core CPUs in 2005. Sun followed with the multicore/multi-threaded Niagara in late 2005. In fact, by 2008, it was becoming difficult to find a single-CPU desktop system, with single-core CPUs being relegated to netbooks and embedded devices. By 2012, even smartphones were starting to sport multiple CPUs. By 2020, safety-critical software standards started addressing concurrency.

Second, the advent of low-cost and readily available multicore systems means that the once-rare experience of parallel programming is now available to almost all researchers and practitioners. In fact, parallel systems have long been within the budget of students and hobbyists. We can therefore expect greatly increased levels of invention and innovation surrounding parallel systems, and that increased familiarity will over time make the once prohibitively expensive field of parallel programming much more friendly and commonplace.

Third, in the 20th century, large systems of highly parallel software were almost always closely guarded proprietary secrets. In happy contrast, the 21st century has seen numerous open-source (and thus publicly available) parallel software projects, including the Linux kernel [Tor03], database systems [Pos08, MS08], and message-passing systems [The08, Uni08a]. This book will draw primarily from the Linux kernel, but will provide much material suitable for user-level applications.

Fourth, even though the large-scale parallel-programming projects of the 1980s and 1990s were almost all proprietary projects, these projects have seeded other communities with cadres of developers who understand the engineering discipline required to develop production-quality parallel code. A major purpose of this book is to present this engineering discipline.

Unfortunately, the fifth difficulty, the high cost of communication relative to that of processing, remains largely in force. This difficulty has been receiving increasing attention during the new millennium. However, according to Stephen Hawking, the finite speed of light and the atomic nature of matter will limit progress in this area [Gar07, Moo03]. Fortunately, this difficulty has been in force since the late 1980s, so that the aforementioned engineering discipline has evolved practical and effective strategies for handling it. In addition, hardware designers are increasingly aware of these issues, so perhaps future hardware will be more friendly to parallel software, as discussed in Section 3.3.

Quick Quiz 2.1: Come on now!!! Parallel programming has been known to be exceedingly hard for many decades. You seem to be hinting that it is not so hard. What sort of game are you playing?

However, even though parallel programming might not be as hard as is commonly advertised, it is often more work than is sequential programming.

Quick Quiz 2.2: How could parallel programming ever be as easy as sequential programming?

It therefore makes sense to consider alternatives to parallel programming. However, it is not possible to reasonably consider parallel-programming alternatives without understanding parallel-programming goals. This topic is addressed in the next section.

2.2 Parallel Programming Goals

If you don’t know where you are going, you will end up somewhere else.

Yogi Berra

The three major goals of parallel programming (over and above those of sequential programming) are as follows:

1. Performance.

2. Productivity.

3. Generality.

Unfortunately, given the current state of the art, it is possible to achieve at best two of these three goals for any given parallel program. These three goals therefore form the iron triangle of parallel programming, a triangle upon which overly optimistic hopes all too often come to grief.1

Quick Quiz 2.3: Oh, really??? What about correctness, maintainability, robustness, and so on?

Quick Quiz 2.4: And if correctness, maintainability, and robustness don’t make the list, why do productivity and generality?

1 Kudos to Michael Wong for naming the iron triangle.

Figure 2.1: MIPS/Clock-Frequency Trend for Intel CPUs

Quick Quiz 2.5: Given that parallel programs are much harder to prove correct than are sequential programs, again, shouldn’t correctness really be on the list?

Quick Quiz 2.6: What about just having fun?

Each of these goals is elaborated upon in the following sections.

2.2.1 Performance

Performance is the primary goal behind most parallel-programming effort. After all, if performance is not a concern, why not do yourself a favor: Just write sequential code, and be happy? It will very likely be easier and you will probably get done much more quickly.

Quick Quiz 2.7: Are there no cases where parallel programming is about something other than performance?

Note that “performance” is interpreted broadly here, including for example scalability (performance per CPU) and efficiency (performance per watt).

That said, the focus of performance has shifted from hardware to parallel software. This change in focus is due to the fact that, although Moore’s Law continues to deliver increases in transistor density, it has ceased to provide the traditional single-threaded performance increases. This can be seen in Figure 2.1,2 which shows that writing single-threaded code and simply waiting a year or two for the CPUs to catch up may no longer be an option. Given the recent trends on the part of all major manufacturers towards multicore/multithreaded systems, parallelism is the way to go for those wanting to avail themselves of the full performance of their systems.

Quick Quiz 2.8: Why not instead rewrite programs from inefficient scripting languages to C or C++?

Even so, the first goal is performance rather than scalability, especially given that the easiest way to attain linear scalability is to reduce the performance of each CPU [Tor01]. Given a four-CPU system, which would you prefer? A program that provides 100 transactions per second on a single CPU, but does not scale at all? Or a program that provides 10 transactions per second on a single CPU, but scales perfectly? The first program seems like a better bet, though the answer might change if you happened to have a 32-CPU system.

That said, just because you have multiple CPUs is not necessarily in and of itself a reason to use them all, especially given the recent decreases in price of multi-CPU systems. The key point to understand is that parallel programming is primarily a performance optimization, and, as such, it is one potential optimization of many. If your program is fast enough as currently written, there is no reason to optimize, either by parallelizing it or by applying any of a number of potential sequential optimizations.3 By the same token, if you are looking to apply parallelism as an optimization to a sequential program, then you will need to compare parallel algorithms to the best sequential algorithms. This may require some care, as far too many publications ignore the sequential case when analyzing the performance of parallel algorithms.

2 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one or more instructions per clock, and MIPS (millions of instructions per second, usually from the old Dhrystone benchmark) for older CPUs requiring multiple clocks to execute even the simplest instruction. The reason for shifting between these two measures is that the newer CPUs’ ability to retire multiple instructions per clock is typically limited by memory-system performance. Furthermore, the benchmarks commonly used on the older CPUs are obsolete, and it is difficult to run the newer benchmarks on systems containing the old CPUs, in part because it is hard to find working instances of the old CPUs.

3 Of course, if you are a hobbyist whose primary interest is writing parallel software, that is more than enough reason to parallelize whatever software you are interested in.

2.2.2 Productivity

Quick Quiz 2.9: Why all this prattling on about non-technical issues??? And not just any non-technical issue, but productivity of all things? Who cares?

Productivity has been becoming increasingly important in recent decades. To see this, consider that the price of early computers was tens of millions of dollars at a time when engineering salaries were but a few thousand dollars a year. If dedicating a team of ten engineers to such a machine would improve its performance, even by only 10 %, then their salaries would be repaid many times over.

One such machine was the CSIRAC, the oldest still-intact stored-program computer, which was put into operation in 1949 [Mus04, Dep06]. Because this machine was built before the transistor era, it was constructed of 2,000 vacuum tubes, ran with a clock frequency of 1 kHz, consumed 30 kW of power, and weighed more than three metric tons. Given that this machine had but 768 words of RAM, it is safe to say that it did not suffer from the productivity issues that often plague today’s large-scale software projects.

Today, it would be quite difficult to purchase a machine with so little computing power. Perhaps the closest equivalents are 8-bit embedded microprocessors exemplified by the venerable Z80 [Wik08], but even the old Z80 had a CPU clock frequency more than 1,000 times faster than the CSIRAC. The Z80 CPU had 8,500 transistors, and could be purchased in 2008 for less than $2 US per unit in 1,000-unit quantities. In stark contrast to the CSIRAC, software-development costs are anything but insignificant for the Z80.

The CSIRAC and the Z80 are two points in a long-term trend, as can be seen in Figure 2.2. This figure plots an approximation to computational power per die over the past four decades, showing an impressive six-order-of-magnitude increase over a period of forty years. Note that the advent of multicore CPUs has permitted this increase to continue apace despite the clock-frequency wall encountered in 2003, albeit courtesy of dies supporting more than 50 hardware threads each.

One of the inescapable consequences of the rapid decrease in the cost of hardware is that software productivity becomes increasingly important. It is no longer sufficient merely to make efficient use of the hardware: It is now necessary to make extremely efficient use of software developers as well. This has long been the case for sequential hardware, but parallel hardware has become a low-cost commodity only recently. Therefore, only recently has high productivity become critically important when creating parallel software.

Figure 2.2: MIPS per Die for Intel CPUs

Quick Quiz 2.10: Given how cheap parallel systems have become, how can anyone afford to pay people to program them?

Perhaps at one time, the sole purpose of parallel software was performance. Now, however, productivity is gaining the spotlight.

2.2.3 Generality

One way to justify the high cost of developing parallel software is to strive for maximal generality. All else being equal, the cost of a more-general software artifact can be spread over more users than that of a less-general one. In fact, this economic force explains much of the maniacal focus on portability, which can be seen as an important special case of generality.4

Unfortunately, generality often comes at the cost of performance, productivity, or both. For example, portability is often achieved via adaptation layers, which inevitably exact a performance penalty. To see this more generally, consider the following popular parallel programming environments:

C/C++ “Locking Plus Threads”: This category, which includes POSIX Threads (pthreads) [Ope97], Windows Threads, and numerous operating-system kernel environments, offers excellent performance (at least within the confines of a single SMP system) and also offers good generality. Pity about the relatively low productivity.

4 Kudos to Michael Wong for pointing this out.

Java: This general purpose and inherently multithreaded programming environment is widely believed to offer much higher productivity than C or C++, courtesy of the automatic garbage collector and the rich set of class libraries. However, its performance, though greatly improved in the early 2000s, lags that of C and C++.

MPI: This Message Passing Interface [MPI08] powers the largest scientific and technical computing clusters in the world and offers unparalleled performance and scalability. In theory, it is general purpose, but it is mainly used for scientific and technical computing. Its productivity is believed by many to be even lower than that of C/C++ “locking plus threads” environments.

OpenMP: This set of compiler directives can be used to parallelize loops. It is thus quite specific to this task, and this specificity often limits its performance. It is, however, much easier to use than MPI or C/C++ “locking plus threads.”

SQL: Structured Query Language [Int92] is specific to relational database queries. However, its performance is quite good as measured by the Transaction Processing Performance Council (TPC) benchmark results [Tra01]. Productivity is excellent; in fact, this parallel programming environment enables people to make good use of a large parallel system despite having little or no knowledge of parallel programming concepts.

The nirvana of parallel programming environments, one that offers world-class performance, productivity, and generality, simply does not yet exist. Until such a nirvana appears, it will be necessary to make engineering tradeoffs among performance, productivity, and generality. One such tradeoff is depicted by the green “iron triangle”5 shown in Figure 2.3, which shows how productivity becomes increasingly important at the upper layers of the system stack, while performance and generality become increasingly important at the lower layers of the system stack. The huge development costs incurred at the lower layers must be spread over equally huge numbers of users (hence the importance of generality), and performance lost in lower layers cannot easily be recovered further up the stack. In the upper layers of the stack, there might be very few users for a given specific application, in which case productivity concerns are paramount. This explains the tendency towards “bloatware” further up the stack: Extra hardware is often cheaper than extra developers. This book is intended for developers working near the bottom of the stack, where performance and generality are of greatest concern.

It is important to note that a tradeoff between productivity and generality has existed for centuries in many fields. For but one example, a nailgun is more productive than a hammer for driving nails, but in contrast to the nailgun, a hammer can be used for many things besides driving nails. It should therefore be no surprise to see similar tradeoffs appear in the field of parallel computing.

Figure 2.3: Software Layers and Performance, Productivity, and Generality (stack, from top to bottom: Application, Middleware (e.g., DBMS), System Libraries, Container, Operating System Kernel, Hypervisor, Firmware, Hardware)

Figure 2.4: Tradeoff Between Productivity and Generality (special-purpose environments productive for users 1–4 surrounding a general-purpose environment near the hardware/abstraction center)

5 Kudos to Michael Wong for coining “iron triangle.”

This tradeoff is shown schematically in Figure 2.4. Here, users 1, 2, 3, and 4 have specific jobs that they need the computer to help them with. The most productive possible language or environment for a given user is one that simply does that user’s job, without requiring any programming, configuration, or other setup.

Quick Quiz 2.11: This is a ridiculously unachievable ideal! Why not focus on something that is achievable in practice?

Unfortunately, a system that does the job required by user 1 is unlikely to do user 2’s job. In other words, the most productive languages and environments are domain-specific, and thus by definition lacking generality.

Another option is to tailor a given programming language or environment to the hardware system (for example, low-level languages such as assembly, C, C++, or Java) or to some abstraction (for example, Haskell, Prolog, or Snobol), as is shown by the circular region near the center of Figure 2.4. These languages can be considered to be general in the sense that they are equally ill-suited to the jobs required by users 1, 2, 3, and 4. In other words, their generality comes at the expense of decreased productivity when compared to domain-specific languages and environments. Worse yet, a language that is tailored to a given abstraction is likely to suffer from performance and scalability problems unless and until it can be efficiently mapped to real hardware.

Is there no escape from the iron triangle’s three conflicting goals of performance, productivity, and generality?

It turns out that there often is an escape, for example, using the alternatives to parallel programming discussed in the next section. After all, parallel programming can be a great deal of fun, but it is not always the best tool for the job.

2.3 Alternatives to Parallel Programming

Experiment is folly when experience shows the way.

Roger M. Babson

In order to properly consider alternatives to parallel programming, you must first decide on what exactly you expect the parallelism to do for you. As seen in Section 2.2, the primary goals of parallel programming are performance, productivity, and generality. Because this book is intended for developers working on performance-critical code near the bottom of the software stack, the remainder of this section focuses primarily on performance improvement.

It is important to keep in mind that parallelism is but one way to improve performance. Other well-known approaches include the following, in roughly increasing order of difficulty:

1. Run multiple instances of a sequential application.

2. Make the application use existing parallel software.

3. Optimize the serial application.

These approaches are covered in the following sections.

2.3.1 Multiple Instances of a Sequential Application

Running multiple instances of a sequential application can allow you to do parallel programming without actually doing parallel programming. There are a large number of ways to approach this, depending on the structure of the application.

If your program is analyzing a large number of different scenarios, or is analyzing a large number of independent data sets, one easy and effective approach is to create a single sequential program that carries out a single analysis, then use any of a number of scripting environments (for example the bash shell) to run a number of instances of that sequential program in parallel. In some cases, this approach can be easily extended to a cluster of machines.

This approach may seem like cheating, and in fact some denigrate such programs as “embarrassingly parallel”. And in fact, this approach does have some potential disadvantages, including increased memory consumption, waste of CPU cycles recomputing common intermediate results, and increased copying of data. However, it is often extremely productive, garnering extreme performance gains with little or no added effort.

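To make the flavor of this approach concrete, here is a minimal C sketch (not part of the original text) that forks one child per independent data set and has each child exec a sequential analysis program. The program name ./analyze and the input.N file names are purely hypothetical stand-ins; a few lines of bash (for i in $(seq 0 7); do ./analyze input.$i & done; wait) would accomplish the same thing.

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define NINSTANCES 8  /* Hypothetical: one instance per independent data set. */

  int main(void)
  {
    for (int i = 0; i < NINSTANCES; i++) {
      pid_t pid = fork();

      if (pid == 0) {  /* Child: run one sequential analysis. */
        char arg[32];

        snprintf(arg, sizeof(arg), "input.%d", i);
        execlp("./analyze", "analyze", arg, (char *)NULL);
        perror("execlp");  /* Reached only if the exec fails. */
        exit(EXIT_FAILURE);
      } else if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
      }
    }
    while (wait(NULL) > 0)  /* Parent: wait for all instances to finish. */
      continue;
    return 0;
  }

Each child is an unmodified sequential program, so all of the parallelism lives in the operating system’s scheduler rather than in the analysis code itself.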
2.3.2 Use Existing Parallel Software

There is no longer any shortage of parallel software environments that can present a single-threaded programming environment, including relational databases [Dat82], web-application servers, and map-reduce environments. For example, a common design provides a separate process for each user, each of which generates SQL from user queries. This per-user SQL is run against a common relational database, which automatically runs the users’ queries concurrently. The per-user programs are responsible only for the user interface, with the relational database taking full responsibility for the difficult issues surrounding parallelism and persistence.

In addition, there are a growing number of parallel library functions, particularly for numeric computation. Even better, some libraries take advantage of special-purpose hardware such as vector units and general-purpose graphical processing units (GPGPUs).

Taking this approach often sacrifices some performance, at least when compared to carefully hand-coding a fully parallel application. However, such sacrifice is often well repaid by a huge reduction in development effort.

Quick Quiz 2.12: Wait a minute! Doesn’t this approach simply shift the development effort from you to whoever wrote the existing parallel software you are using?

2.3.3 Performance Optimization

Up through the early 2000s, CPU clock frequencies doubled every 18 months. It was therefore usually more important to create new functionality than to carefully optimize performance. Now that Moore’s Law is “only” increasing transistor density instead of increasing both transistor density and per-transistor performance, it might be a good time to rethink the importance of performance optimization. After all, new hardware generations no longer bring significant single-threaded performance improvements. Furthermore, many performance optimizations can also conserve energy.

From this viewpoint, parallel programming is but another performance optimization, albeit one that is becoming much more attractive as parallel systems become cheaper and more readily available. However, it is wise to keep in mind that the speedup available from parallelism is limited to roughly the number of CPUs (but see Section 6.5 for an interesting exception). In contrast, the speedup available from traditional single-threaded software optimizations can be much larger. For example, replacing a long linked list with a hash table or a search tree can improve performance by many orders of magnitude. This highly optimized single-threaded program might run much faster than its unoptimized parallel counterpart, making parallelization unnecessary. Of course, a highly optimized parallel program would be even better, aside from the added development effort required.

Furthermore, different programs might have different performance bottlenecks. For example, if your program spends most of its time waiting on data from your disk drive, using multiple CPUs will probably just increase the time wasted waiting for the disks. In fact, if the program was reading from a single large file laid out sequentially on a rotating disk, parallelizing your program might well make it a lot slower due to the added seek overhead. You should instead optimize the data layout so that the file can be smaller (thus faster to read), split the file into chunks which can be accessed in parallel from different drives, cache frequently accessed data in main memory, or, if possible, reduce the amount of data that must be read.

Quick Quiz 2.13: What other bottlenecks might prevent additional CPUs from providing additional performance?

Parallelism can be a powerful optimization technique, but it is not the only such technique, nor is it appropriate for all situations. Of course, the easier it is to parallelize your program, the more attractive parallelization becomes as an optimization. Parallelization has a reputation of being quite difficult, which leads to the question “exactly what makes parallel programming so difficult?”

2.4 What Makes Parallel Programming Hard?

Real difficulties can be overcome; it is only the imaginary ones that are unconquerable.

Theodore N. Vail

It is important to note that the difficulty of parallel programming is as much a human-factors issue as it is a set of technical properties of the parallel programming problem. We do need human beings to be able to tell parallel systems what to do, otherwise known as programming. But parallel programming involves two-way communication, with a program’s performance and scalability being the communication from the machine to the human. In short, the human writes a program telling the computer what to do, and the computer critiques this program via the resulting performance and scalability. Therefore, appeals to abstractions or to mathematical analyses will often be of severely limited utility.

In the Industrial Revolution, the interface between human and machine was evaluated by human-factor studies, then called time-and-motion studies. Although there have

been a few human-factor studies examining parallel programming [ENS05, ES05, HCS+05, SS94], these studies have been extremely narrowly focused, and hence unable to demonstrate any general results. Furthermore, given that the normal range of programmer productivity spans more than an order of magnitude, it is unrealistic to expect an affordable study to be capable of detecting (say) a 10 % difference in productivity. Although the multiple-order-of-magnitude differences that such studies can reliably detect are extremely valuable, the most impressive improvements tend to be based on a long series of 10 % improvements. We must therefore take a different approach.

One such approach is to carefully consider the tasks that parallel programmers must undertake that are not required of sequential programmers. We can then evaluate how well a given programming language or environment assists the developer with these tasks. These tasks fall into the four categories shown in Figure 2.5, each of which is covered in the following sections.

Figure 2.5: Categories of Tasks Required of Parallel Programmers (Work Partitioning; Parallel Access Control; Resource Partitioning and Replication; Interacting With Hardware)

2.4.1 Work Partitioning

Work partitioning is absolutely required for parallel execution: If there is but one “glob” of work, then it can be executed by at most one CPU at a time, which is by definition sequential execution. However, partitioning the work requires great care. For example, uneven partitioning can result in sequential execution once the small partitions have completed [Amd67]. In less extreme cases, load balancing can be used to fully utilize available hardware and restore performance and scalability.

Although partitioning can greatly improve performance and scalability, it can also increase complexity. For example, partitioning can complicate handling of global errors and events: A parallel program may need to carry out non-trivial synchronization in order to safely process such global events. More generally, each partition requires some sort of communication: After all, if a given thread did not communicate at all, it would have no effect and would thus not need to be executed. However, because communication incurs overhead, careless partitioning choices can result in severe performance degradation.

Furthermore, the number of concurrent threads must often be controlled, as each such thread occupies common resources, for example, space in CPU caches. If too many threads are permitted to execute concurrently, the CPU caches will overflow, resulting in a high cache-miss rate, which in turn degrades performance. Conversely, large numbers of threads are often required to overlap computation and I/O so as to fully utilize I/O devices.

Quick Quiz 2.14: Other than CPU cache capacity, what might require limiting the number of concurrent threads?

Finally, permitting threads to execute concurrently greatly increases the program’s state space, which can make the program difficult to understand and debug, degrading productivity. All else being equal, smaller state spaces having more regular structure are more easily understood, but this is a human-factors statement as much as it is a technical or mathematical statement. Good parallel designs might have extremely large state spaces, but nevertheless be easy to understand due to their regular structure, while poor designs can be impenetrable despite having a comparatively small state space. The best designs exploit embarrassing parallelism, or transform the problem to one having an embarrassingly parallel solution. In either case, “embarrassingly parallel” is in fact an embarrassment of riches. The current state of the art enumerates good designs; more work is required to make more general judgments on state-space size and structure.
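As a concrete (and deliberately simplified) illustration of work partitioning, the following sketch is not from the original text: it statically partitions an array summation across POSIX threads, with the array contents, sizes, and names chosen arbitrarily for illustration.

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4
  #define N 1000000

  static double data[N];
  static double partial[NTHREADS];  /* One result slot per thread. */

  /* Each thread sums its own contiguous partition of data[]. */
  static void *sum_partition(void *arg)
  {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;

    for (long i = lo; i < hi; i++)
      s += data[i];
    partial[id] = s;
    return NULL;
  }

  int main(void)
  {
    pthread_t tid[NTHREADS];
    double sum = 0.0;

    for (long i = 0; i < N; i++)
      data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
      pthread_create(&tid[t], NULL, sum_partition, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
      pthread_join(tid[t], NULL);
    for (long t = 0; t < NTHREADS; t++)
      sum += partial[t];  /* Combine the per-thread results. */
    printf("sum = %f\n", sum);
    return 0;
  }

The partitions are even, the combining step is trivial, and the threads never touch each other’s data while running, which is what makes this example embarrassingly parallel; uneven partitions, global events, or adjacent partial[] slots sharing a cache line would each reintroduce the complications discussed above.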

2.4.2 Parallel Access Control

Given a single-threaded sequential program, that single thread has full access to all of the program’s resources. These resources are most often in-memory data structures, but can be CPUs, memory (including caches), I/O devices, computational accelerators, files, and much else besides.

The first parallel-access-control issue is whether the form of access to a given resource depends on that resource’s location. For example, in many message-passing environments, local-variable access is via expressions and assignments, while remote-variable access uses an entirely different syntax, usually involving messaging. The POSIX Threads environment [Ope97], Structured Query Language (SQL) [Int92], and partitioned global address-space (PGAS) environments such as Universal Parallel C (UPC) [EGCD03, CBF13] offer implicit access, while Message Passing Interface (MPI) [MPI08] offers explicit access because access to remote data requires explicit messaging.

The other parallel-access-control issue is how threads coordinate access to the resources. This coordination is carried out by the very large number of synchronization mechanisms provided by various parallel languages and environments, including message passing, locking, transactions, reference counting, explicit timing, shared atomic variables, and data ownership. Many traditional parallel-programming concerns such as deadlock, livelock, and transaction rollback stem from this coordination. This framework can be elaborated to include comparisons of these synchronization mechanisms, for example locking vs. transactional memory [MMW07], but such elaboration is beyond the scope of this section. (See Sections 17.2 and 17.3 for more information on transactional memory.)

Quick Quiz 2.15: Just what is “explicit timing”???

2.4.3 Resource Partitioning and Replication

The most effective parallel algorithms and systems exploit resource parallelism, so much so that it is usually wise to begin parallelization by partitioning your write-intensive resources and replicating frequently accessed read-mostly resources. The resource in question is most frequently data, which might be partitioned over computer systems, mass-storage devices, NUMA nodes, CPU cores (or dies or hardware threads), pages, cache lines, instances of synchronization primitives, or critical sections of code. For example, partitioning over locking primitives is termed “data locking” [BK85].
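As a hedged illustration of data locking (the code and names below are invented for this sketch, not taken from the original text), the following hash table partitions its lock by bucket, so that insertions into different buckets can proceed in parallel:

  #include <pthread.h>
  #include <stdlib.h>

  #define NBUCKETS 1024

  struct node {
    struct node *next;
    unsigned long key;
  };

  struct bucket {
    pthread_mutex_t lock;  /* One lock per bucket: "data locking". */
    struct node *head;
  };

  static struct bucket table[NBUCKETS];

  static void table_init(void)
  {
    for (int i = 0; i < NBUCKETS; i++) {
      pthread_mutex_init(&table[i].lock, NULL);
      table[i].head = NULL;
    }
  }

  /* Insert a key while holding only that key's bucket lock. */
  static void table_insert(unsigned long key)
  {
    struct bucket *b = &table[key % NBUCKETS];
    struct node *p = malloc(sizeof(*p));

    if (!p)
      abort();
    p->key = key;
    pthread_mutex_lock(&b->lock);
    p->next = b->head;
    b->head = p;
    pthread_mutex_unlock(&b->lock);
  }

Threads operating on different buckets contend only for their own bucket’s lock, which is exactly the partitioning over locking primitives that the text refers to.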
Resource partitioning is frequently application dependent. For example, numerical applications frequently partition matrices by row, column, or sub-matrix, while commercial applications frequently partition write-intensive data structures and replicate read-mostly data structures. Thus, a commercial application might assign the data for a given customer to a given few computers out of a large cluster. An application might statically partition data, or dynamically change the partitioning over time.

Figure 2.6: Ordering of Parallel-Programming Tasks

Resource partitioning is extremely effective, but it can be quite challenging for complex multilinked data structures.

2.4.4 Interacting With Hardware

Hardware interaction is normally the domain of the operating system, the compiler, libraries, or other software-environment infrastructure. However, developers working with novel hardware features and components will often need to work directly with such hardware. In addition, direct access to the hardware can be required when squeezing the last drop of performance out of a given system. In this case, the developer may need to tailor or configure the application to the cache geometry, system topology, or interconnect protocol of the target hardware.

In some cases, hardware may be considered to be a resource which is subject to partitioning or access control, as described in the previous sections.

2.4.5 Composite Capabilities

Although these four capabilities are fundamental, good engineering practice uses composites of these capabilities. For example, the data-parallel approach first partitions the data so as to minimize the need for inter-partition communication, partitions the code accordingly, and finally maps data partitions and threads so as to maximize throughput while minimizing inter-thread communication, as shown in Figure 2.6. The developer can then consider each partition separately, greatly reducing the size of the relevant state space, in turn increasing productivity. Even though some problems are non-partitionable, clever transformations into forms permitting partitioning

can sometimes greatly enhance both performance and scalability [Met99].

2.4.6 How Do Languages and Environments Assist With These Tasks?

Although many environments require the developer to deal manually with these tasks, there are long-standing environments that bring significant automation to bear. The poster child for these environments is SQL, many implementations of which automatically parallelize single large queries and also automate concurrent execution of independent queries and updates.

These four categories of tasks must be carried out in all parallel programs, but that of course does not necessarily mean that the developer must manually carry out these tasks. We can expect to see ever-increasing automation of these four tasks as parallel systems continue to become cheaper and more readily available.

Quick Quiz 2.16: Are there any other obstacles to parallel programming?

2.5 Discussion

Until you try, you don’t know what you can’t do.

Henry James

This section has given an overview of the difficulties with, goals of, and alternatives to parallel programming. This overview was followed by a discussion of what can make parallel programming hard, along with a high-level approach for dealing with parallel programming’s difficulties. Those who still insist that parallel programming is impossibly difficult should review some of the older guides to parallel programming [Seq88, Bir89, BK85, Inm85]. The following quote from Andrew Birrell’s monograph [Bir89] is especially telling:

Writing concurrent programs has a reputation for being exotic and difficult. I believe it is neither. You need a system that provides you with good primitives and suitable libraries, you need a basic caution and carefulness, you need an armory of useful techniques, and you need to know of the common pitfalls. I hope that this paper has helped you towards sharing my belief.

The authors of these older guides were well up to the parallel programming challenge back in the 1980s. As such, there are simply no excuses for refusing to step up to the parallel-programming challenge here in the 21st century!

We are now ready to proceed to the next chapter, which dives into the relevant properties of the parallel hardware underlying our parallel software.

Chapter 3

Hardware and its Habits

Premature abstraction is the root of all evil.

A cast of thousands

Most people intuitively understand that passing messages between systems is more expensive than performing simple calculations within the confines of a single system. But it is also the case that communicating among threads within the confines of a single shared-memory system can also be quite expensive. This chapter therefore looks at the cost of synchronization and communication within a shared-memory system. These few pages can do no more than scratch the surface of shared-memory parallel hardware design; readers desiring more detail would do well to start with a recent edition of Hennessy’s and Patterson’s classic text [HP17, HP95].

Quick Quiz 3.1: Why should parallel programmers bother learning low-level properties of the hardware? Wouldn’t it be easier, better, and more elegant to remain at a higher level of abstraction?

3.1 Overview

Mechanical Sympathy: Hardware and software working together in harmony.

Martin Thompson

Careless reading of computer-system specification sheets might lead one to believe that CPU performance is a footrace on a clear track, as illustrated in Figure 3.1, where the race always goes to the swiftest.

Although there are a few CPU-bound benchmarks that approach the ideal case shown in Figure 3.1, the typical program more closely resembles an obstacle course than a race track. This is because the internal architecture of CPUs has changed dramatically over the past few decades, courtesy of Moore’s Law. These changes are described in the following sections.

Figure 3.1: CPU Performance at its Best

3.1.1 Pipelined CPUs

In the 1980s, the typical microprocessor fetched an instruction, decoded it, and executed it, typically taking at least three clock cycles to complete one instruction before even starting the next. In contrast, CPUs of the late 1990s and of the 2000s execute many instructions simultaneously, using pipelines; superscalar techniques; out-of-order instruction and data handling; speculative execution, and more [HP17, HP11] in order to optimize the flow of instructions and data through the CPU. Some cores have more than one hardware thread, which is variously called simultaneous multithreading (SMT) or hyperthreading (HT) [Fen73], each of which appears as an independent CPU to software, at least from a functional viewpoint. These modern hardware features can greatly improve performance, as illustrated by Figure 3.2.

Achieving full performance with a CPU having a long pipeline requires highly predictable control flow through


the program. Suitable control flow can be provided by a program that executes primarily in tight loops, for example, arithmetic on large matrices or vectors. The CPU can then correctly predict that the branch at the end of the loop will be taken in almost all cases, allowing the pipeline to be kept full and the CPU to execute at full speed.

Figure 3.2: CPUs Old and New

Figure 3.3: CPU Meets a Pipeline Flush

However, branch prediction is not always so easy. For example, consider a program with many loops, each of which iterates a small but random number of times. For another example, consider an old-school object-oriented program with many virtual objects that can reference many different real objects, all with different implementations for frequently invoked member functions, resulting in many calls through pointers. In these cases, it is difficult or even impossible for the CPU to predict where the next branch might lead. Then either the CPU must stall waiting for execution to proceed far enough to be certain where that branch leads, or it must guess and then proceed using speculative execution. Although guessing works extremely well for programs with predictable control flow, for unpredictable branches (such as those in binary search) the guesses will frequently be wrong. A wrong guess can be expensive because the CPU must discard any speculatively executed instructions following the corresponding branch, resulting in a pipeline flush. If pipeline flushes appear too frequently, they drastically reduce overall performance, as fancifully depicted in Figure 3.3.

This gets even worse in the increasingly common case of hyperthreading (or SMT, if you prefer), especially on a pipelined superscalar out-of-order CPU featuring speculative execution. In this increasingly common case, all the hardware threads sharing a core also share that core’s resources, including registers, cache, execution units, and so on. The instructions are often decoded into micro-operations, and use of the shared execution units and the hundreds of hardware registers is often coordinated by a micro-operation scheduler. A rough diagram of such a two-threaded core is shown in Figure 3.4, and more accurate (and thus more complex) diagrams are available in textbooks and scholarly papers.1 Therefore, the execution of one hardware thread can be and often is perturbed by the actions of other hardware threads sharing that core.

Figure 3.4: Rough View of Modern Micro-Architecture (two hardware threads feeding decoded micro-operations through a shared micro-op scheduler to hundreds of registers and shared execution units)

Even if only one hardware thread is active (for example, in old-school CPU designs where there is only one thread), counterintuitive results are quite common. Execution units often have overlapping capabilities, so that a CPU’s

1 Here is one example for a late-2010s Intel core: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server).

choice of execution unit can result in pipeline stalls due to contention for that execution unit from later instructions. In theory, this contention is avoidable, but in practice CPUs must choose very quickly and without the benefit of clairvoyance. In particular, adding an instruction to a tight loop can sometimes actually cause execution to speed up.

Unfortunately, pipeline flushes and shared-resource contention are not the only hazards in the obstacle course that modern CPUs must run. The next section covers the hazards of referencing memory.

3.1.2 Memory References

In the 1980s, it often took less time for a microprocessor to load a value from memory than it did to execute an instruction. More recently, microprocessors might execute hundreds or even thousands of instructions in the time required to access memory. This disparity is due to the fact that Moore’s Law has increased CPU performance at a much greater rate than it has decreased memory latency, in part due to the rate at which memory sizes have grown. For example, a typical 1970s minicomputer might have 4 KB (yes, kilobytes, not megabytes, let alone gigabytes or terabytes) of main memory, with single-cycle access.2 Present-day CPU designers still can construct a 4 KB memory with single-cycle access, even on systems with multi-GHz clock frequencies. And in fact they frequently do construct such memories, but they now call them “level-0 caches”, plus they can be quite a bit bigger than 4 KB.

Although the large caches found on modern microprocessors can do quite a bit to help combat memory-access latencies, these caches require highly predictable data-access patterns to successfully hide those latencies. Unfortunately, common operations such as traversing a linked list have extremely unpredictable memory-access patterns—after all, if the pattern was predictable, us software types would not bother with the pointers, right? Therefore, as shown in Figure 3.5, memory references often pose severe obstacles to modern CPUs.

Figure 3.5: CPU Meets a Memory Reference

Thus far, we have only been considering obstacles that can arise during a given CPU’s execution of single-threaded code. Multi-threading presents additional obstacles to the CPU, as described in the following sections.

2 It is only fair to add that each of these single cycles lasted no less than 1.6 microseconds.

3.1.3 Atomic Operations

One such obstacle is atomic operations. The problem here is that the whole idea of an atomic operation conflicts with the piece-at-a-time assembly-line operation of a CPU pipeline. To hardware designers’ credit, modern CPUs use a number of extremely clever tricks to make such operations look atomic even though they are in fact being executed piece-at-a-time, with one common trick being to identify all the cachelines containing the data to be atomically operated on, ensure that these cachelines are owned by the CPU executing the atomic operation, and only then proceed with the atomic operation while ensuring that these cachelines remained owned by this CPU. Because all the data is private to this CPU, other CPUs are unable to interfere with the atomic operation despite the piece-at-a-time nature of the CPU’s pipeline. Needless to say, this sort of trick can require that the pipeline must be delayed or even flushed in order to perform the setup operations that permit a given atomic operation to complete correctly.

In contrast, when executing a non-atomic operation, the CPU can load values from cachelines as they appear and place the results in the store buffer, without the need to wait for cacheline ownership. Although there are a number of hardware optimizations that can sometimes hide cache latencies, the resulting effect on performance is all too often as depicted in Figure 3.6.
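For readers who want to see what such an operation looks like in source code, here is a minimal C11 sketch (not from the original text; the counter name is an arbitrary choice for illustration):

  #include <stdatomic.h>

  static atomic_long counter;

  /*
   * Atomically increment a single element of data.  The hardware
   * tricks described above make this read-modify-write appear to
   * happen all at once, even though the pipeline executes it
   * piece-at-a-time.
   */
  static void count_event(void)
  {
    atomic_fetch_add(&counter, 1);
  }

As the next paragraph notes, such operations usually apply to only a single datum, which is why ordering updates to multiple data elements requires the memory barriers described next.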

Figure 3.6: CPU Meets an Atomic Operation

Unfortunately, atomic operations usually apply only to single elements of data. Because many parallel algorithms require that ordering constraints be maintained between updates of multiple data elements, most CPUs provide memory barriers. These memory barriers also serve as performance-sapping obstacles, as described in the next section.

Quick Quiz 3.2: What types of machines would allow atomic operations on multiple data elements?

3.1.4 Memory Barriers

Memory barriers will be considered in more detail in Chapter 15 and Appendix C. In the meantime, consider the following simple lock-based critical section:

  spin_lock(&mylock);
  a = a + 1;
  spin_unlock(&mylock);

If the CPU were not constrained to execute these statements in the order shown, the effect would be that the variable “a” would be incremented without the protection of “mylock”, which would certainly defeat the purpose of acquiring it. To prevent such destructive reordering, locking primitives contain either explicit or implicit memory barriers. Because the whole purpose of these memory barriers is to prevent reorderings that the CPU would otherwise undertake in order to increase performance, memory barriers almost always reduce performance, as depicted in Figure 3.7.

As with atomic operations, CPU designers have been working hard to reduce memory-barrier overhead, and have made substantial progress.

Figure 3.7: CPU Meets a Memory Barrier

3.1.5 Cache Misses

An additional multi-threading obstacle to CPU performance is the “cache miss”. As noted earlier, modern CPUs sport large caches in order to reduce the performance penalty that would otherwise be incurred due to high memory latencies. However, these caches are actually counter-productive for variables that are frequently shared among CPUs. This is because when a given CPU wishes to modify the variable, it is most likely the case that some other CPU has modified it recently. In this case, the variable will be in that other CPU’s cache, but not in this CPU’s cache, which will therefore incur an expensive cache miss (see Appendix C.1 for more detail). Such cache misses form a major obstacle to CPU performance, as shown in Figure 3.8.

Quick Quiz 3.3: So have CPU designers also greatly reduced the overhead of cache misses?

3.1.6 I/O Operations

A cache miss can be thought of as a CPU-to-CPU I/O operation, and as such is one of the cheapest I/O operations available. I/O operations involving networking, mass storage, or (worse yet) human beings pose much greater obstacles than the internal obstacles called out in the prior sections, as illustrated by Figure 3.9.

Figure 3.8: CPU Meets a Cache Miss

Figure 3.9: CPU Waits for I/O Completion

This is one of the differences between shared-memory and distributed-system parallelism: Shared-memory parallel programs must normally deal with no obstacle worse than a cache miss, while a distributed parallel program will typically incur the larger network communication latencies. In both cases, the relevant latencies can be thought of as a cost of communication—a cost that would be absent in a sequential program. Therefore, the ratio between the overhead of the communication to that of the actual work being performed is a key design parameter. A major goal of parallel hardware design is to reduce this ratio as needed to achieve the relevant performance and scalability goals. In turn, as will be seen in Chapter 6, a major goal of parallel software design is to reduce the frequency of expensive operations like communications cache misses.

Of course, it is one thing to say that a given operation is an obstacle, and quite another to show that the operation is a significant obstacle. This distinction is discussed in the following sections.

3.2 Overheads

Don’t design bridges in ignorance of materials, and don’t design low-level software in ignorance of the underlying hardware.

Unknown

This section presents actual overheads of the obstacles to performance listed out in the previous section. However, it is first necessary to get a rough view of hardware system architecture, which is the subject of the next section.

3.2.1 Hardware System Architecture

Figure 3.10 shows a rough schematic of an eight-core computer system. Each die has a pair of CPU cores, each with its cache, as well as an interconnect allowing the pair of CPUs to communicate with each other. The system interconnect allows the four dies to communicate with each other and with main memory.

Data moves through this system in units of “cache lines”, which are power-of-two fixed-size aligned blocks of memory, usually ranging from 32 to 256 bytes in size. When a CPU loads a variable from memory to one of its registers, it must first load the cacheline containing that variable into its cache. Similarly, when a CPU stores a value from one of its registers into memory, it must also

load the cacheline containing that variable into its cache, but must also ensure that no other CPU has a copy of that cacheline.

Figure 3.10: System Hardware Architecture (eight CPUs, each with its own cache, paired onto four dies whose interconnects attach to a system interconnect and to memory; the speed-of-light round-trip distance in vacuum for a 1.8 GHz clock period is 8 cm)

For example, if CPU 0 were to write to a variable whose cacheline resided in CPU 7’s cache, the following over-simplified sequence of events might ensue:

1. CPU 0 checks its local cache, and does not find the cacheline. It therefore records the write in its store buffer.

2. A request for this cacheline is forwarded to CPU 0’s and 1’s interconnect, which checks CPU 1’s local cache, and does not find the cacheline.

3. This request is forwarded to the system interconnect, which checks with the other three dies, learning that the cacheline is held by the die containing CPU 6 and 7.

4. This request is forwarded to CPU 6’s and 7’s interconnect, which checks both CPUs’ caches, finding the value in CPU 7’s cache.

5. CPU 7 forwards the cacheline to its interconnect, and also flushes the cacheline from its cache.

6. CPU 6’s and 7’s interconnect forwards the cacheline to the system interconnect.

7. The system interconnect forwards the cacheline to CPU 0’s and 1’s interconnect.

8. CPU 0’s and 1’s interconnect forwards the cacheline to CPU 0’s cache.

9. CPU 0 can now complete the write, updating the relevant portions of the newly arrived cacheline from the value previously recorded in the store buffer.

Quick Quiz 3.4: This is a simplified sequence of events? How could it possibly be any more complex?

Quick Quiz 3.5: Why is it necessary to flush the cacheline from CPU 7’s cache?

This simplified sequence is just the beginning of a discipline called cache-coherency protocols [HP95, CSG99, MHS12, SHW11], which is discussed in more detail in Appendix C. As can be seen in the sequence of events triggered by a CAS operation, a single instruction can cause considerable protocol traffic, which can significantly degrade your parallel program’s performance.

Fortunately, if a given variable is being frequently read during a time interval during which it is never updated, that variable can be replicated across all CPUs’ caches. This replication permits all CPUs to enjoy extremely fast access to this read-mostly variable. Chapter 9 presents synchronization mechanisms that take full advantage of this important hardware read-mostly optimization.

3.2.2 Costs of Operations

The overheads of some common operations important to parallel programs are displayed in Table 3.1. This system’s clock period rounds to 0.5 ns. Although it is not unusual for modern microprocessors to be able to retire multiple instructions per clock period, the operations’ costs are nevertheless normalized to a clock period in the third column, labeled “Ratio”. The first thing to note about this table is the large values of many of the ratios.

The same-CPU compare-and-swap (CAS) operation consumes about seven nanoseconds, a duration more than ten times that of the clock period. CAS is an atomic operation in which the hardware compares the contents of the specified memory location to a specified “old” value, and if they compare equal, stores a specified “new” value, in which case the CAS operation succeeds. If they compare unequal, the memory location keeps its (unexpected) value, and the CAS operation fails. The operation is atomic in that the hardware guarantees that the memory location will not be changed between the compare and the store. CAS functionality is provided by the lock;cmpxchg instruction on x86.
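As a hedged C11 sketch (not from the original text) of the two ways CAS tends to be used, and which the discussion following Table 3.1 distinguishes as “blind” and non-blind CAS, consider lock acquisition and atomic increment; on x86, the compare-exchange calls below typically compile down to the lock;cmpxchg instruction mentioned above. The variable names are arbitrary.

  #include <stdatomic.h>

  static atomic_int lck;   /* 0 = unlocked, 1 = locked. */
  static atomic_long val;

  /*
   * "Blind" CAS: specify the expected old value (zero) without
   * loading it first, so acquiring the lock touches the memory
   * location only once.
   */
  static int trylock(void)
  {
    int expected = 0;

    return atomic_compare_exchange_strong(&lck, &expected, 1);
  }

  /*
   * Non-blind CAS: load the old value, compute the new value, then
   * use CAS to install it only if the location has not changed.
   */
  static void atomic_inc(void)
  {
    long old, new;

    do {
      old = atomic_load(&val);
      new = old + 1;
    } while (!atomic_compare_exchange_strong(&val, &old, new));
  }

The second pattern makes two accesses to the location (the load and the CAS), which is why the non-blind rows of Table 3.1 cost more than their blind counterparts.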

Table 3.1: CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs @
2.10 GHz
Ratio
Operation Cost (ns) (cost/clock) CPUs
Clock period 0.5 1.0
Same-CPU CAS 7.0 14.6 0
Same-CPU lock 15.4 32.3 0
In-core blind CAS 7.2 15.2 224
In-core CAS 18.0 37.7 224
Off-core blind CAS 47.5 99.8 1–27,225–251
Off-core CAS 101.9 214.0 1–27,225–251
Off-socket blind CAS 148.8 312.5 28–111,252–335
Off-socket CAS 442.9 930.1 28–111,252–335
Cross-interconnect blind CAS 336.6 706.8 112–223,336–447
Cross-interconnect CAS 944.8 1,984.2 112–223,336–447
Off-System
Comms Fabric 5,000 10,500
Global Comms 195,000,000 409,500,000

The “same-CPU” prefix means that the CPU now performing the CAS operation on a given variable was also the last CPU to access this variable, so that the corresponding cacheline is already held in that CPU’s cache. Similarly, the same-CPU lock operation (a “round trip” pair consisting of a lock acquisition and release) consumes more than fifteen nanoseconds, or more than thirty clock cycles. The lock operation is more expensive than CAS because it requires two atomic operations on the lock data structure, one for acquisition and the other for release.

In-core operations involving interactions between the hardware threads sharing a single core are about the same cost as same-CPU operations. This should not be too surprising, given that these two hardware threads also share the full cache hierarchy.

In the case of the blind CAS, the software specifies the old value without looking at the memory location. This approach is appropriate when attempting to acquire a lock. If the unlocked state is represented by zero and the locked state is represented by the value one, then a CAS operation on the lock that specifies zero for the old value and one for the new value will acquire the lock if it is not already held. The key point is that there is only one access to the memory location, namely the CAS operation itself.

In contrast, a normal CAS operation’s old value is derived from some earlier load. For example, to implement an atomic increment, the current value of that location is loaded and that value is incremented to produce the new value. Then in the CAS operation, the value actually loaded would be specified as the old value and the incremented value as the new value. If the value had not been changed between the load and the CAS, this would increment the memory location. However, if the value had in fact changed, then the old value would not match, causing a miscompare that would result in the CAS operation failing. The key point is that there are now two accesses to the memory location, the load and the CAS.

Thus, it is not surprising that in-core blind CAS consumes only about seven nanoseconds, while in-core CAS consumes about 18 nanoseconds. The non-blind case’s extra load does not come for free. That said, the overhead of these operations is similar to single-CPU CAS and lock, respectively.

Quick Quiz 3.6: Table 3.1 shows CPU 0 sharing a core with CPU 224. Shouldn’t that instead be CPU 1???

A blind CAS involving CPUs in different cores but on the same socket consumes almost fifty nanoseconds, or almost one hundred clock cycles. The code used for this cache-miss measurement passes the cache line back and forth between a pair of CPUs, so this cache miss is satisfied not from memory, but rather from the other CPU’s cache. A non-blind CAS operation, which as noted earlier must look at the old value of the variable as well as store a new value, consumes over one hundred nanoseconds, or more than two hundred clock cycles.
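The distinction between blind and non-blind CAS can also be sketched in code. The following fragment uses the GCC __sync_bool_compare_and_swap() primitive described in Section 4.2.5; the mylock and mycount variables are hypothetical, and a production-quality lock would need considerably more care than this bare spin loop:

    unsigned long mylock;   /* 0: unlocked, 1: locked (illustrative only). */
    unsigned long mycount;

    void toy_lock_acquire(void)     /* Blind CAS: old value assumed, not loaded. */
    {
            while (!__sync_bool_compare_and_swap(&mylock, 0UL, 1UL))
                    continue;       /* One access per attempt: the CAS itself. */
    }

    void toy_atomic_inc(void)       /* Non-blind CAS: load, then CAS. */
    {
            unsigned long old;

            do {
                    old = mycount;  /* First access: the load. */
                    /* Second access: the CAS itself. */
            } while (!__sync_bool_compare_and_swap(&mycount, old, old + 1));
    }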

Table 3.2: Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10 GHz

  Level   Scope    Line Size   Sets     Ways   Size
  L0      Core     64          64       8      32K
  L1      Core     64          64       8      32K
  L2      Core     64          1024     16     1024K
  L3      Socket   64          57,344   11     39,424K

Think about this a bit. In the time required to do one CAS operation, the CPU could have executed more than two hundred normal instructions. This should demonstrate the limitations not only of fine-grained locking, but of any other synchronization mechanism relying on fine-grained global agreement.

If the pair of CPUs are on different sockets, the operations are considerably more expensive. A blind CAS operation consumes almost 150 nanoseconds, or more than three hundred clock cycles. A normal CAS operation consumes more than 400 nanoseconds, or almost one thousand clock cycles.

Worse yet, not all pairs of sockets are created equal. This particular system appears to be constructed as a pair of four-socket components, with additional latency penalties when the CPUs reside in different components. In this case, a blind CAS operation consumes more than three hundred nanoseconds, or more than seven hundred clock cycles. A CAS operation consumes almost a full microsecond, or almost two thousand clock cycles.

Quick Quiz 3.7: Surely the hardware designers could be persuaded to improve this situation! Why have they been content with such abysmal performance for these single-instruction operations?

Unfortunately, the high speed of within-core and within-socket communication does not come for free. First, there are only two CPUs within a given core and only 56 within a given socket, compared to 448 across the system. Second, as shown in Table 3.2, the in-core caches are quite small compared to the in-socket caches, which are in turn quite small compared to the 1.4 TB of memory configured on this system. Third, again referring to the figure, the caches are organized as a hardware hash table with a limited number of items per bucket. For example, the raw size of the L3 cache (“Size”) is almost 40 MB, but each bucket (“Line”) can only hold 11 blocks of memory (“Ways”), each of which can be at most 64 bytes (“Line Size”). This means that only 12 bytes of memory (admittedly at carefully chosen addresses) are required to overflow this 40 MB cache. On the other hand, equally careful choice of addresses might make good use of the entire 40 MB. Spatial locality of reference is clearly extremely important, as is spreading the data across memory.

I/O operations are even more expensive. As shown in the “Comms Fabric” row, high performance (and expensive!) communications fabric, such as InfiniBand or any number of proprietary interconnects, has a latency of roughly five microseconds for an end-to-end round trip, during which time more than ten thousand instructions might have been executed. Standards-based communications networks often require some sort of protocol processing, which further increases the latency. Of course, geographic distance also increases latency, with the speed-of-light through optical fiber latency around the world coming to roughly 195 milliseconds, or more than 400 million clock cycles, as shown in the “Global Comms” row.

Quick Quiz 3.8: These numbers are insanely large! How can I possibly get my head around them?

3.2.3 Hardware Optimizations

It is only natural to ask how the hardware is helping, and the answer is “Quite a bit!”

One hardware optimization is large cachelines. This can provide a big performance boost, especially when software is accessing memory sequentially. For example, given a 64-byte cacheline and software accessing 64-bit variables, the first access will still be slow due to speed-of-light delays (if nothing else), but the remaining seven can be quite fast. However, this optimization has a dark side, namely false sharing, which happens when different variables in the same cacheline are being updated by different CPUs, resulting in a high cache-miss rate. Software can use the alignment directives available in many compilers to avoid false sharing, and adding such directives is a common step in tuning parallel software.
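For example, one common form of such an alignment directive is shown in the following sketch, which gives each CPU's counter its own cacheline so that different CPUs' updates cannot falsely share a line. The structure name and both constants are made up for this illustration, and the 64-byte cacheline size is an assumption:

    #define CACHE_LINE_SIZE 64      /* Assumed cacheline size. */
    #define MY_NR_CPUS 448          /* Illustrative CPU count. */

    /* Without the attribute, eight of these would share one cacheline. */
    struct aligned_counter {
            unsigned long count;
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    struct aligned_counter counters[MY_NR_CPUS];

C11 provides alignas() in <stdalign.h> for the same purpose.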
A second related hardware optimization is cache prefetching, in which the hardware reacts to consecutive accesses by prefetching subsequent cachelines, thereby evading speed-of-light delays for these subsequent cachelines. Of course, the hardware must use simple heuristics to determine when to prefetch, and these heuristics can be fooled by the complex data-access patterns in many applications. Fortunately, some CPU families allow for this by providing special prefetch instructions. Unfortunately, the

is trying its best to exceed the speed of light. The next


section discusses some additional things that the hardware
engineers might (or might not) be able to do, depending on
how well recent research translates to practice. Software’s
contribution to this noble goal is outlined in the remaining
chapters of this book.

3.3 Hardware Free Lunch?

Figure 3.11: Hardware and Software: On Same Side The great trouble today is that there are too many
people looking for someone else to do something for
them. The solution to most of our troubles is to be
found in everyone doing something for themselves.
effectiveness of these instructions in the general case is
subject to some dispute. Henry Ford, updated
A third hardware optimization is the store buffer, which
allows a string of store instructions to execute quickly The major reason that concurrency has been receiving so
even when the stores are to non-consecutive addresses much focus over the past few years is the end of Moore’s-
and when none of the needed cachelines are present in Law induced single-threaded performance increases (or
the CPU’s cache. The dark side of this optimization is “free lunch” [Sut08]), as shown in Figure 2.1 on page 9.
memory misordering, for which see Chapter 15. This section briefly surveys a few ways that hardware
A fourth hardware optimization is speculative execution, designers might bring back the “free lunch”.
which can allow the hardware to make good use of the store However, the preceding section presented some substan-
buffers without resulting in memory misordering. The tial hardware obstacles to exploiting concurrency. One
dark side of this optimization can be energy inefficiency severe physical limitation that hardware designers face
and lowered performance if the speculative execution goes is the finite speed of light. As noted in Figure 3.10 on
awry and must be rolled back and retried. Worse yet, the page 22, light can manage only about an 8-centimeter
advent of Spectre and Meltdown [Hor18] made it apparent round trip in a vacuum during the duration of a 1.8 GHz
that hardware speculation can also enable side-channel clock period. This distance drops to about 3 centimeters
attacks that defeat memory-protection hardware so as to for a 5 GHz clock. Both of these distances are relatively
allow unprivileged processes to read memory that they small compared to the size of a modern computer system.
should not have access to. It is clear that the combination To make matters even worse, electric waves in silicon
of speculative execution and cloud computing needs more move from three to thirty times more slowly than does light
than a bit of rework! in a vacuum, and common clocked logic constructs run
A fifth hardware optimization is large caches, allowing still more slowly, for example, a memory reference may
individual CPUs to operate on larger datasets without need to wait for a local cache lookup to complete before
incurring expensive cache misses. Although large caches the request may be passed on to the rest of the system.
can degrade energy efficiency and cache-miss latency, the Furthermore, relatively low speed and high power drivers
ever-growing cache sizes on production microprocessors are required to move electrical signals from one silicon
attests to the power of this optimization. die to another, for example, to communicate between a
A final hardware optimization is read-mostly replication, CPU and main memory.
in which data that is frequently read but rarely updated is
present in all CPUs’ caches. This optimization allows the Quick Quiz 3.9: But individual electrons don’t move any-
where near that fast, even in conductors!!! The electron drift
read-mostly data to be accessed exceedingly efficiently,
velocity in a conductor under semiconductor voltage levels is
and is the subject of Chapter 9. on the order of only one millimeter per second. What gives???
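As an aside, the speed-of-light figures used in this chapter follow from simple arithmetic. Taking c to be roughly 3 × 10^8 m/s, the farthest point that a signal can reach and return from within a single period of a clock running at frequency f is:

    d = c / (2 f)
    d(1.8 GHz) = (3 × 10^8 m/s) / (2 × 1.8 × 10^9 /s) ≈ 8.3 cm
    d(5 GHz)   = (3 × 10^8 m/s) / (2 × 5 × 10^9 /s)   = 3 cm

Electric signals in silicon are slower still, as noted elsewhere in this chapter.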
In short, hardware and software engineers are really
on the same side, with both trying to make computers
go fast despite the best efforts of the laws of physics, as There are nevertheless some technologies (both hard-
fancifully depicted in Figure 3.11 where our data stream ware and software) that might help improve matters:

increases to which some people have become accustomed.
That said, they may be necessary steps on the path to the
late Jim Gray’s “smoking hairy golf balls” [Gra02].

3.3.2 Novel Materials and Processes


Stephen Hawking is said to have claimed that semicon-
Figure 3.12: Latency Benefit of 3D Integration (1) The finite speed of light and (2) The atomic nature of
matter [Gar07]. It is possible that semiconductor man-
ufacturers are approaching these limits, but there are
1. 3D integration, nevertheless a few avenues of research and development
focused on working around these fundamental limits.
2. Novel materials and processes, One workaround for the atomic nature of matter is
3. Substituting light for electricity, so-called “high-K dielectric” materials, which allow larger
devices to mimic the electrical properties of infeasibly
4. Special-purpose accelerators, and small devices. These materials pose some severe fab-
rication challenges, but nevertheless may help push the
5. Existing parallel software. frontiers out a bit farther. Another more-exotic work-
around stores multiple bits in a single electron, relying
Each of these is described in one of the following
on the fact that a given electron can exist at a number
sections.
of energy levels. It remains to be seen if this particular
approach can be made to work reliably in production
3.3.1 3D Integration semiconductor devices.
3-dimensional integration (3DI) is the practice of bonding Another proposed workaround is the “quantum dot”
very thin silicon dies to each other in a vertical stack. approach that allows much smaller device sizes, but which
This practice provides potential benefits, but also poses is still in the research stage.
significant fabrication challenges [Kni08]. One challenge is that many recent hardware-device-
Perhaps the most important benefit of 3DI is decreased level breakthroughs require very tight control of which
path length through the system, as shown in Figure 3.12. atoms are placed where [Kel17]. It therefore seems likely
A 3-centimeter silicon die is replaced with a stack of four that whoever finds a good way to hand-place atoms on
1.5-centimeter dies, in theory decreasing the maximum each of the billions of devices on a chip will have most
path through the system by a factor of two, keeping in excellent bragging rights, if nothing else!
mind that each layer is quite thin. In addition, given proper
attention to design and placement, long horizontal electri- 3.3.3 Light, Not Electrons
cal connections (which are both slow and power hungry)
can be replaced by short vertical electrical connections, Although the speed of light would be a hard limit, the fact
which are both faster and more power efficient. is that semiconductor devices are limited by the speed of
However, delays due to levels of clocked logic will not be electricity rather than that of light, given that electric waves
decreased by 3D integration, and significant manufactur- in semiconductor materials move at between 3 % and 30 %
ing, testing, power-supply, and heat-dissipation problems of the speed of light in a vacuum. The use of copper
must be solved for 3D integration to reach production connections on silicon devices is one way to increase the
while still delivering on its promise. The heat-dissipation speed of electricity, and it is quite possible that additional
problems might be solved using semiconductors based advances will push closer still to the actual speed of
on diamond, which is a good conductor for heat, but an light. In addition, there have been some experiments with
electrical insulator. That said, it remains difficult to grow tiny optical fibers as interconnects within and between
large single diamond crystals, to say nothing of slicing chips, based on the fact that the speed of light in glass is
them into wafers. In addition, it seems unlikely that any of more than 60 % of the speed of light in a vacuum. One
these technologies will be able to deliver the exponential obstacle to such optical fibers is the inefficiency of conversion

between electricity and light and vice versa, resulting in audio for several minutes—with its CPU fully powered
both power-consumption and heat-dissipation problems. off the entire time. The purpose of these accelerators
That said, absent some fundamental advances in the is to improve energy efficiency and thus extend battery
field of physics, any exponential increases in the speed of life: Special purpose hardware can often compute more
data flow will be sharply limited by the actual speed of efficiently than can a general-purpose CPU. This is an-
light in a vacuum. other example of the principle called out in Section 2.2.3:
Generality is almost never free.
Nevertheless, given the end of Moore’s-Law-induced
3.3.4 Special-Purpose Accelerators single-threaded performance increases, it seems safe to as-
sume that increasing varieties of special-purpose hardware
A general-purpose CPU working on a specialized problem
will appear.
is often spending significant time and energy doing work
that is only tangentially related to the problem at hand.
For example, when taking the dot product of a pair of 3.3.5 Existing Parallel Software
vectors, a general-purpose CPU will normally use a loop
(possibly unrolled) with a loop counter. Decoding the Although multicore CPUs seem to have taken the com-
instructions, incrementing the loop counter, testing this puting industry by surprise, the fact remains that shared-
counter, and branching back to the top of the loop are in memory parallel computer systems have been commer-
some sense wasted effort: The real goal is instead to multi- cially available for more than a quarter century. This is
ply corresponding elements of the two vectors. Therefore, more than enough time for significant parallel software to
a specialized piece of hardware designed specifically to make its appearance, and it indeed has. Parallel operating
multiply vectors could get the job done more quickly and systems are quite commonplace, as are parallel threading
with less energy consumed. libraries, parallel relational database management sys-
tems, and parallel numerical software. Use of existing
This is in fact the motivation for the vector instructions
parallel software can go a long ways towards solving any
present in many commodity microprocessors. Because
parallel-software crisis we might encounter.
these instructions operate on multiple data items simulta-
Perhaps the most common example is the parallel re-
neously, they would permit a dot product to be computed
lational database management system. It is not unusual
with less instruction-decode and loop overhead.
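To illustrate, the following sketch contrasts a scalar dot-product loop with a version written using GCC's vector extensions, which typically compile down to the sort of vector instructions just mentioned. The four-wide float vector, and the assumption that the length is a multiple of four, are for illustration only:

    /* Scalar dot product: loop-control overhead on every element. */
    float dot_scalar(const float *a, const float *b, unsigned long n)
    {
            float sum = 0.0;
            unsigned long i;

            for (i = 0; i < n; i++)
                    sum += a[i] * b[i];
            return sum;
    }

    /* Same computation, four floats at a time via GCC vector extensions. */
    typedef float v4f __attribute__((vector_size(16)));

    float dot_vector(const v4f *a, const v4f *b, unsigned long n4)
    {
            v4f sum = { 0.0, 0.0, 0.0, 0.0 };
            unsigned long i;

            for (i = 0; i < n4; i++)
                    sum += a[i] * b[i];     /* Element-wise multiply and add. */
            return sum[0] + sum[1] + sum[2] + sum[3];
    }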
for single-threaded programs, often written in high-level
Similarly, specialized hardware can more efficiently
scripting languages, to access a central relational database
encrypt and decrypt, compress and decompress, encode
concurrently. In the resulting highly parallel system, only
and decode, and many other tasks besides. Unfortunately,
the database need actually deal directly with parallelism.
this efficiency does not come for free. A computer system
A very nice trick when it works!
incorporating this specialized hardware will contain more
transistors, which will consume some power even when
not in use. Software must be modified to take advantage of
this specialized hardware, and this specialized hardware
3.4 Software Design Implications
must be sufficiently generally useful that the high up-front
hardware-design costs can be spread over enough users to One ship drives east and another west
make the specialized hardware affordable. In part due to While the self-same breezes blow;
these sorts of economic considerations, specialized hard- ’Tis the set of the sail and not the gail
ware has thus far appeared only for a few application areas, That bids them where to go.
including graphics processing (GPUs), vector processors Ella Wheeler Wilcox
(MMX, SSE, and VMX instructions), and, to a lesser ex-
tent, encryption. And even in these areas, it is not always The values of the ratios in Table 3.1 are critically important,
easy to realize the expected performance gains, for exam- as they limit the efficiency of a given parallel application.
ple, due to thermal throttling [Kra17, Lem18, Dow20]. To see this, suppose that the parallel application uses CAS
Unlike the server and PC arena, smartphones have long operations to communicate among threads. These CAS
used a wide variety of hardware accelerators. These hard- operations will typically involve a cache miss, that is,
ware accelerators are often used for media decoding, so assuming that the threads are communicating primarily
much so that a high-end MP3 player might be able to play with each other rather than with themselves. Suppose

further that the unit of work corresponding to each CAS 3. The bad news is that the overhead of cache misses is
communication operation takes 300 ns, which is sufficient still high, especially on large systems.
time to compute several floating-point transcendental
functions. Then about half of the execution time will be The remainder of this book describes ways of handling
consumed by the CAS communication operations! This this bad news.
in turn means that a two-CPU system running such a In particular, Chapter 4 will cover some of the low-
parallel program would run no faster than a sequential level tools used for parallel programming, Chapter 5 will
implementation running on a single CPU. investigate problems and solutions to parallel counting,
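One way to make this arithmetic explicit, using the 300 ns figures above and ignoring all other overhead: if each thread alternates T_w = 300 ns of useful work with a T_c ≈ 300 ns CAS communication operation, then the speedup over the sequential implementation on N CPUs is roughly:

    S = N × T_w / (T_w + T_c) = 2 × 300 / (300 + 300) = 1

that is, no faster than a single CPU, just as stated above. Reducing T_c, or doing more work per communication, is what raises S.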
The situation is even worse in the distributed-system and Chapter 6 will discuss design disciplines that promote
case, where the latency of a single communications oper- performance and scalability.
ation might take as long as thousands or even millions of
floating-point operations. This illustrates how important
it is for communications operations to be extremely infre-
quent and to enable very large quantities of processing.
Quick Quiz 3.10: Given that distributed-systems communi-
cation is so horribly expensive, why does anyone bother with
such systems?

The lesson should be quite clear: Parallel algorithms


must be explicitly designed with these hardware properties
firmly in mind. One approach is to run nearly independent
threads. The less frequently the threads communicate,
whether by atomic operations, locks, or explicit messages,
the better the application’s performance and scalability
will be. This approach will be touched on in Chapter 5,
explored in Chapter 6, and taken to its logical extreme in
Chapter 8.
Another approach is to make sure that any sharing be
read-mostly, which allows the CPUs’ caches to replicate
the read-mostly data, in turn allowing all CPUs fast access.
This approach is touched on in Section 5.2.4, and explored
more deeply in Chapter 9.
In short, achieving excellent parallel performance and
scalability means striving for embarrassingly parallel al-
gorithms and implementations, whether by careful choice
of data structures and algorithms, use of existing paral-
lel applications and environments, or transforming the
problem into an embarrassingly parallel form.
Quick Quiz 3.11: OK, if we are going to have to apply
distributed-programming techniques to shared-memory par-
allel programs, why not just always use these distributed
techniques and dispense with shared memory?

So, to sum up:


1. The good news is that multicore systems are inexpen-
sive and readily available.
2. More good news: The overhead of many synchro-
nization operations is much lower than it was on
parallel systems from the early 2000s.

You are only as good as your tools, and your tools are
only as good as you are.

Chapter 4 Unknown

Tools of the Trade

This chapter provides a brief introduction to some basic


tools of the parallel-programming trade, focusing mainly
on those available to user applications running on op-
erating systems similar to Linux. Section 4.1 begins compute_it 1 > compute_it 2 >
with scripting languages, Section 4.2 describes the multi- compute_it.1.out & compute_it.2.out &
process parallelism supported by the POSIX API and
touches on POSIX threads, Section 4.3 presents analogous
wait
operations in other environments, and finally, Section 4.4
helps to choose the tool that will get the job done.
Quick Quiz 4.1: You call these tools??? They look more cat compute_it.1.out
like low-level synchronization primitives to me!

Please note that this chapter provides but a brief intro- cat compute_it.2.out
duction. More detail is available from the references (and
from the Internet), and more information will be provided Figure 4.1: Execution Diagram for Parallel Shell Execu-
in later chapters. tion

4.1 Scripting Languages character directing the shell to run the two instances of
the program in the background. Line 3 waits for both
The supreme excellence is simplicity. instances to complete, and lines 4 and 5 display their
output. The resulting execution is as shown in Figure 4.1:
Henry Wadsworth Longfellow, simplified The two instances of compute_it execute in parallel,
wait completes after both of them do, and then the two
The Linux shell scripting languages provide simple but instances of cat execute sequentially.
effective ways of managing parallelism. For example,
suppose that you had a program compute_it that you Quick Quiz 4.2: But this silly shell script isn’t a real parallel
needed to run twice with two different sets of arguments. program! Why bother with such trivia???
This can be accomplished using UNIX shell scripting as
follows: Quick Quiz 4.3: Is there a simpler way to create a parallel
1 compute_it 1 > compute_it.1.out & shell script? If so, how? If not, why not?
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out For another example, the make software-build scripting
5 cat compute_it.2.out language provides a -j option that specifies how much par-
allelism should be introduced into the build process. Thus,
Lines 1 and 2 launch two instances of this program, typing make -j4 when building a Linux kernel specifies
redirecting their output to two separate files, with the & that up to four build steps be executed concurrently.
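In the same spirit, a shell loop can fan compute_it out over several inputs and then collect the results. The fragment below is a sketch that simply extends the two-instance example to four instances, assuming compute_it accepts similar numeric arguments:

    for i in 1 2 3 4
    do
            compute_it $i > compute_it.$i.out &
    done
    wait
    cat compute_it.1.out compute_it.2.out compute_it.3.out compute_it.4.out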


It is hoped that these simple examples convince you Listing 4.1: Using the fork() Primitive
that parallel programming need not always be complex or 1 pid = fork();
2 if (pid == 0) {
difficult. 3 /* child */
4 } else if (pid < 0) {
Quick Quiz 4.4: But if script-based parallel programming is 5 /* parent, upon error */
so easy, why bother with anything else? 6 perror("fork");
7 exit(EXIT_FAILURE);
8 } else {
9 /* parent, pid == child ID */
10 }
4.2 POSIX Multiprocessing
Listing 4.2: Using the wait() Primitive
1 static __inline__ void waitall(void)
A camel is a horse designed by committee. 2 {
3 int pid;
Unknown 4 int status;
5
6 for (;;) {
This section scratches the surface of the POSIX environ- 7 pid = wait(&status);
ment, including pthreads [Ope97], as this environment is 8 if (pid == -1) {
9 if (errno == ECHILD)
readily available and widely implemented. Section 4.2.1 10 break;
provides a glimpse of the POSIX fork() and related 11 perror("wait");
12 exit(EXIT_FAILURE);
primitives, Section 4.2.2 touches on thread creation and 13 }
destruction, Section 4.2.3 gives a brief overview of POSIX 14 }
15 }
locking, and, finally, Section 4.2.4 describes a specific
lock which can be used for data that is read by many
threads and only occasionally updated. noted earlier, the child may terminate via the exit()
primitive. Otherwise, this is the parent, which checks for
4.2.1 POSIX Process Creation and De- an error return from the fork() primitive on line 4, and
struction prints an error and exits on lines 5–7 if so. Otherwise,
the fork() has executed successfully, and the parent
Processes are created using the fork() primitive, they therefore executes line 9 with the variable pid containing
may be destroyed using the kill() primitive, they may the process ID of the child.
destroy themselves using the exit() primitive. A process The parent process may use the wait() primitive to
executing a fork() primitive is said to be the “parent” wait for its children to complete. However, use of this
of the newly created process. A parent may wait on its primitive is a bit more complicated than its shell-script
children using the wait() primitive. counterpart, as each invocation of wait() waits for but one
Please note that the examples in this section are quite child process. It is therefore customary to wrap wait()
simple. Real-world applications using these primitives into a function similar to the waitall() function shown
might need to manipulate signals, file descriptors, shared in Listing 4.2 (api-pthreads.h), with this waitall()
memory segments, and any number of other resources. In function having semantics similar to the shell-script wait
addition, some applications need to take specific actions command. Each pass through the loop spanning lines 6–14
if a given child terminates, and might also need to be waits on one child process. Line 7 invokes the wait()
concerned with the reason that the child terminated. These primitive, which blocks until a child process exits, and
issues can of course add substantial complexity to the code. returns that child’s process ID. If the process ID is instead
For more information, see any of a number of textbooks −1, this indicates that the wait() primitive was unable to
on the subject [Ste92, Wei13]. wait on a child. If so, line 9 checks for the ECHILD errno,
If fork() succeeds, it returns twice, once for the which indicates that there are no more child processes, so
parent and again for the child. The value returned from that line 10 exits the loop. Otherwise, lines 11 and 12
fork() allows the caller to tell the difference, as shown in print an error and exit.
Listing 4.1 (forkjoin.c). Line 1 executes the fork()
Quick Quiz 4.5: Why does this wait() primitive need to be
primitive, and saves its return value in local variable pid.
so complicated? Why not just make it work like the shell-script
Line 2 checks to see if pid is zero, in which case, this wait does?
is the child, which continues on to execute line 3. As


Listing 4.3: Processes Created Via fork() Do Not Share Listing 4.4: Threads Created Via pthread_create() Share
Memory Memory
1 int x = 0; 1 int x = 0;
2 2
3 int main(int argc, char *argv[]) 3 void *mythread(void *arg)
4 { 4 {
5 int pid; 5 x = 1;
6 6 printf("Child process set x=1\n");
7 pid = fork(); 7 return NULL;
8 if (pid == 0) { /* child */ 8 }
9 x = 1; 9
10 printf("Child process set x=1\n"); 10 int main(int argc, char *argv[])
11 exit(EXIT_SUCCESS); 11 {
12 } 12 int en;
13 if (pid < 0) { /* parent, upon error */ 13 pthread_t tid;
14 perror("fork"); 14 void *vp;
15 exit(EXIT_FAILURE); 15
16 } 16 if ((en = pthread_create(&tid, NULL,
17 17 mythread, NULL)) != 0) {
18 /* parent */ 18 fprintf(stderr, "pthread_create: %s\n", strerror(en));
19 19 exit(EXIT_FAILURE);
20 waitall(); 20 }
21 printf("Parent process sees x=%d\n", x); 21
22 22 /* parent */
23 return EXIT_SUCCESS; 23
24 } 24 if ((en = pthread_join(tid, &vp)) != 0) {
25 fprintf(stderr, "pthread_join: %s\n", strerror(en));
26 exit(EXIT_FAILURE);
27 }
It is critically important to note that the parent and child 28 printf("Parent process sees x=%d\n", x);
29
do not share memory. This is illustrated by the program 30 return EXIT_SUCCESS;
shown in Listing 4.3 (forkjoinvar.c), in which the 31 }

child sets a global variable x to 1 on line 9, prints a


message on line 10, and exits on line 11. The parent
continues at line 20, where it waits on the child, and on that is to be invoked by the new thread, and the last
line 21 finds that its copy of the variable x is still zero. NULL argument is the argument that will be passed to
The output is thus as follows: mythread().
In this example, mythread() simply returns, but it
Child process set x=1 could instead call pthread_exit().
Parent process sees x=0
Quick Quiz 4.7: If the mythread() function in Listing 4.4
can simply return, why bother with pthread_exit()?
Quick Quiz 4.6: Isn’t there a lot more to fork() and wait()
than discussed here?
The pthread_join() primitive, shown on line 24, is
The finest-grained parallelism requires shared memory, analogous to the fork-join wait() primitive. It blocks
and this is covered in Section 4.2.2. That said, shared- until the thread specified by the tid variable completes
memory parallelism can be significantly more complex execution, either by invoking pthread_exit() or by re-
than fork-join parallelism. turning from the thread’s top-level function. The thread’s
exit value will be stored through the pointer passed as
the second argument to pthread_join(). The thread’s
4.2.2 POSIX Thread Creation and De- exit value is either the value passed to pthread_exit()
struction or the value returned by the thread’s top-level function,
To create a thread within an existing process, invoke the depending on how the thread in question exits.
pthread_create() primitive, for example, as shown The program shown in Listing 4.4 produces output
on lines 16 and 17 of Listing 4.4 (pcreate.c). The as follows, demonstrating that memory is in fact shared
first argument is a pointer to a pthread_t in which to between the two threads:
store the ID of the thread to be created, the second NULL
argument is a pointer to an optional pthread_attr_t, the Child process set x=1
Parent process sees x=1
third argument is the function (in this case, mythread())


Note that this program carefully makes sure that only


one of the threads stores a value to variable x at a time.
Any situation in which one thread might be storing a
value to a given variable while some other thread either
loads from or stores to that same variable is termed a data
race. Because the C language makes no guarantee that
Listing 4.5: Demonstration of Exclusive Locks
the results of a data race will be in any way reasonable, 1 pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
we need some way of safely accessing and modifying data 2 pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
3
concurrently, such as the locking primitives discussed in 4 int x = 0;
the following section. 5
6 void *lock_reader(void *arg)
But your data races are benign, you say? Well, maybe 7 {
they are. But please do everyone (yourself included) a 8 int en;
9 int i;
big favor and read Section 4.3.4.1 very carefully. As 10 int newx = -1;
compilers optimize more and more aggressively, there are 11 int oldx = -1;
12 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
fewer and fewer truly benign data races. 13
14 if ((en = pthread_mutex_lock(pmlp)) != 0) {
Quick Quiz 4.8: If the C language makes no guarantees in 15 fprintf(stderr, "lock_reader:pthread_mutex_lock: %s\n",
presence of a data race, then why does the Linux kernel have 16 strerror(en));
so many data races? Are you trying to tell me that the Linux 17 exit(EXIT_FAILURE);
18 }
kernel is completely broken??? 19 for (i = 0; i < 100; i++) {
20 newx = READ_ONCE(x);
21 if (newx != oldx) {
22 printf("lock_reader(): x = %d\n", newx);
}
4.2.3 POSIX Locking 23
24 oldx = newx;
25 poll(NULL, 0, 1);
The POSIX standard allows the programmer to avoid 26 }
data races via “POSIX locking”. POSIX locking fea- 27 if ((en = pthread_mutex_unlock(pmlp)) != 0) {
28 fprintf(stderr, "lock_reader:pthread_mutex_unlock: %s\n",
tures a number of primitives, the most fundamental 29 strerror(en));
of which are pthread_mutex_lock() and pthread_ 30 exit(EXIT_FAILURE);
31 }
mutex_unlock(). These primitives operate on locks, 32 return NULL;
which are of type pthread_mutex_t. These locks may be 33 }
34
declared statically and initialized with PTHREAD_MUTEX_ 35 void *lock_writer(void *arg)
INITIALIZER, or they may be allocated dynamically and 36 {
37 int en;
initialized using the pthread_mutex_init() primitive. 38 int i;
The demonstration code in this section will take the former 39 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
40
course. 41 if ((en = pthread_mutex_lock(pmlp)) != 0) {
The pthread_mutex_lock() primitive “acquires” the 42 fprintf(stderr, "lock_writer:pthread_mutex_lock: %s\n",
43 strerror(en));
specified lock, and the pthread_mutex_unlock() “re- 44 exit(EXIT_FAILURE);
leases” the specified lock. Because these are “exclusive” 45 }
46 for (i = 0; i < 3; i++) {
locking primitives, only one thread at a time may “hold” 47 WRITE_ONCE(x, READ_ONCE(x) + 1);
a given lock at a given time. For example, if a pair of 48 poll(NULL, 0, 5);
49 }
threads attempt to acquire the same lock concurrently, 50 if ((en = pthread_mutex_unlock(pmlp)) != 0) {
one of the pair will be “granted” the lock first, and the 51 fprintf(stderr, "lock_writer:pthread_mutex_unlock: %s\n",
52 strerror(en));
other will wait until the first thread releases the lock. A 53 exit(EXIT_FAILURE);
simple and reasonably useful programming model permits 54 }
55 return NULL;
a given data item to be accessed only while holding the 56 }
corresponding lock [Hoa74].
Quick Quiz 4.9: What if I want several threads to hold the
same lock at the same time?

This exclusive-locking property is demonstrated using


the code shown in Listing 4.5 (lock.c). Line 1 defines


and initializes a POSIX lock named lock_a, while line 2 Listing 4.6: Demonstration of Same Exclusive Lock
similarly defines and initializes a lock named lock_b. 1 printf("Creating two threads using same lock:\n");
2 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
Line 4 defines and initializes a shared variable x. 3 if (en != 0) {
4 fprintf(stderr, "pthread_create: %s\n", strerror(en));
Lines 6–33 define a function lock_reader() which 5 exit(EXIT_FAILURE);
repeatedly reads the shared variable x while holding the 6 }
7 en = pthread_create(&tid2, NULL, lock_writer, &lock_a);
lock specified by arg. Line 12 casts arg to a pointer to a 8 if (en != 0) {
pthread_mutex_t, as required by the pthread_mutex_ 9 fprintf(stderr, "pthread_create: %s\n", strerror(en));
10 exit(EXIT_FAILURE);
lock() and pthread_mutex_unlock() primitives. 11 }
12 if ((en = pthread_join(tid1, &vp)) != 0) {
Quick Quiz 4.10: Why not simply make the argument to 13 fprintf(stderr, "pthread_join: %s\n", strerror(en));
lock_reader() on line 6 of Listing 4.5 be a pointer to a 14 exit(EXIT_FAILURE);
15 }
pthread_mutex_t? 16 if ((en = pthread_join(tid2, &vp)) != 0) {
17 fprintf(stderr, "pthread_join: %s\n", strerror(en));
18 exit(EXIT_FAILURE);
Quick Quiz 4.11: What is the READ_ONCE() on lines 20 19 }
and 47 and the WRITE_ONCE() on line 47 of Listing 4.5?
Listing 4.7: Demonstration of Different Exclusive Locks
Lines 14–18 acquire the specified pthread_mutex_t, 1 printf("Creating two threads w/different locks:\n");
2 x = 0;
checking for errors and exiting the program if any occur. 3 en = pthread_create(&tid1, NULL, lock_reader, &lock_a);
Lines 19–26 repeatedly check the value of x, printing 4 if (en != 0) {
5 fprintf(stderr, "pthread_create: %s\n", strerror(en));
the new value each time that it changes. Line 25 sleeps 6 exit(EXIT_FAILURE);
for one millisecond, which allows this demonstration 7 }
8 en = pthread_create(&tid2, NULL, lock_writer, &lock_b);
to run nicely on a uniprocessor machine. Lines 27–31 9 if (en != 0) {
release the pthread_mutex_t, again checking for errors 10 fprintf(stderr, "pthread_create: %s\n", strerror(en));
11 exit(EXIT_FAILURE);
and exiting the program if any occur. Finally, line 32 12 }
returns NULL, again to match the function type required 13 if ((en = pthread_join(tid1, &vp)) != 0) {
14 fprintf(stderr, "pthread_join: %s\n", strerror(en));
by pthread_create(). 15 exit(EXIT_FAILURE);
16 }
Quick Quiz 4.12: Writing four lines of code for each 17 if ((en = pthread_join(tid2, &vp)) != 0) {
18 fprintf(stderr, "pthread_join: %s\n", strerror(en));
acquisition and release of a pthread_mutex_t sure seems 19 exit(EXIT_FAILURE);
painful! Isn’t there a better way? 20 }

Lines 35–56 of Listing 4.5 show lock_writer(),


which periodically updates the shared variable x while Because both threads are using the same lock, the lock_
holding the specified pthread_mutex_t. As with lock_ reader() thread cannot see any of the intermediate values
reader(), line 39 casts arg to a pointer to pthread_ of x produced by lock_writer() while holding the lock.
mutex_t, lines 41–45 acquire the specified lock, and Quick Quiz 4.13: Is “x = 0” the only possible output from
lines 50–54 release it. While holding the lock, lines 46–49 the code fragment shown in Listing 4.6? If so, why? If not,
increment the shared variable x, sleeping for five millisec- what other output could appear, and why?
onds between each increment. Finally, lines 50–54 release
the lock. Listing 4.7 shows a similar code fragment, but this time
Listing 4.6 shows a code fragment that runs lock_ using different locks: lock_a for lock_reader() and
reader() and lock_writer() as threads using the same lock_b for lock_writer(). The output of this code
lock, namely, lock_a. Lines 2–6 create a thread running fragment is as follows:
lock_reader(), and then lines 7–11 create a thread Creating two threads w/different locks:
running lock_writer(). Lines 12–19 wait for both lock_reader(): x = 0
lock_reader(): x = 1
threads to complete. The output of this code fragment is lock_reader(): x = 2
as follows: lock_reader(): x = 3

Creating two threads using same lock: Because the two threads are using different locks, they
lock_reader(): x = 0
do not exclude each other, and can run concurrently. The


lock_reader() function can therefore see the interme-


diate values of x stored by lock_writer().
Quick Quiz 4.14: Using different locks could cause quite
a bit of confusion, what with threads seeing each others’
intermediate states. So should well-written parallel programs
restrict themselves to using a single lock in order to avoid this
kind of confusion?

Quick Quiz 4.15: In the code shown in Listing 4.7, is


lock_reader() guaranteed to see all the values produced by
Listing 4.8: Measuring Reader-Writer Lock Scalability
lock_writer()? Why or why not?
1 pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
2 unsigned long holdtime = 0;
3 unsigned long thinktime = 0;
Quick Quiz 4.16: Wait a minute here!!! Listing 4.6 didn’t 4 long long *readcounts;
initialize shared variable x, so why does it need to be initialized 5 int nreadersrunning = 0;
in Listing 4.7? 6
7 #define GOFLAG_INIT 0
8 #define GOFLAG_RUN 1
Although there is quite a bit more to POSIX exclusive 9 #define GOFLAG_STOP 2
10 char goflag = GOFLAG_INIT;
locking, these primitives provide a good start and are in 11
fact sufficient in a great many situations. The next section 12 void *reader(void *arg)
13 {
takes a brief look at POSIX reader-writer locking. 14 int en;
15 int i;
16 long long loopcnt = 0;
4.2.4 POSIX Reader-Writer Locking 17 long me = (long)arg;
18
19 __sync_fetch_and_add(&nreadersrunning, 1);
The POSIX API provides a reader-writer lock, which 20 while (READ_ONCE(goflag) == GOFLAG_INIT) {
is represented by a pthread_rwlock_t. As with 21 continue;
22 }
pthread_mutex_t, pthread_rwlock_t may be stat- 23 while (READ_ONCE(goflag) == GOFLAG_RUN) {
ically initialized via PTHREAD_RWLOCK_INITIALIZER 24 if ((en = pthread_rwlock_rdlock(&rwl)) != 0) {
25 fprintf(stderr,
or dynamically initialized via the pthread_rwlock_ 26 "pthread_rwlock_rdlock: %s\n", strerror(en));
init() primitive. The pthread_rwlock_rdlock() 27 exit(EXIT_FAILURE);
28 }
primitive read-acquires the specified pthread_rwlock_ 29 for (i = 1; i < holdtime; i++) {
t, the pthread_rwlock_wrlock() primitive write- 30 wait_microseconds(1);
31 }
acquires it, and the pthread_rwlock_unlock() prim- 32 if ((en = pthread_rwlock_unlock(&rwl)) != 0) {
itive releases it. Only a single thread may write-hold a 33 fprintf(stderr,
34 "pthread_rwlock_unlock: %s\n", strerror(en));
given pthread_rwlock_t at any given time, but multiple 35 exit(EXIT_FAILURE);
threads may read-hold a given pthread_rwlock_t, at 36 }
37 for (i = 1; i < thinktime; i++) {
least while there is no thread currently write-holding it. 38 wait_microseconds(1);
As you might expect, reader-writer locks are designed 39 }
40 loopcnt++;
for read-mostly situations. In these situations, a reader- 41 }
writer lock can provide greater scalability than can an 42 readcounts[me] = loopcnt;
43 return NULL;
exclusive lock because the exclusive lock is by definition 44 }
limited to a single thread holding the lock at any given time,
while the reader-writer lock permits an arbitrarily large
number of readers to concurrently hold the lock. How-
ever, in practice, we need to know how much additional
scalability is provided by reader-writer locks.
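Before measuring that scalability, it may help to see the primitives in use. The following sketch protects a hypothetical lookup table with a single pthread_rwlock_t, read-acquiring it for lookups and write-acquiring it for updates; table_lookup() and table_update() are stand-ins, and error checking is omitted for brevity:

    pthread_rwlock_t tablelock = PTHREAD_RWLOCK_INITIALIZER;

    int reader_lookup(int key)
    {
            int value;

            pthread_rwlock_rdlock(&tablelock);   /* Many readers may hold this. */
            value = table_lookup(key);           /* Hypothetical helper. */
            pthread_rwlock_unlock(&tablelock);
            return value;
    }

    void writer_update(int key, int value)
    {
            pthread_rwlock_wrlock(&tablelock);   /* Excludes readers and writers. */
            table_update(key, value);            /* Hypothetical helper. */
            pthread_rwlock_unlock(&tablelock);
    }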
Listing 4.8 (rwlockscale.c) shows one way of mea-
suring reader-writer lock scalability. Line 1 shows the
definition and initialization of the reader-writer lock, line 2
shows the holdtime argument controlling the time each
thread holds the reader-writer lock, line 3 shows the


thinktime argument controlling the time between the release of the reader-writer lock and the next acquisition, line 4 defines the readcounts array into which each reader thread places the number of times it acquired the lock, and line 5 defines the nreadersrunning variable, which determines when all reader threads have started running.

[Figure 4.2 appears here: Reader-Writer Lock Scalability vs. Microseconds in Critical Section on 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz. Log-scale plot of critical-section performance versus number of CPUs (threads), with one trace per critical-section duration from 1 us to 10,000 us plus an ideal line.]

Lines 7–10 define goflag, which synchronizes the start and the end of the test. This variable is initially set to GOFLAG_INIT, then set to GOFLAG_RUN after all the reader threads have started, and finally set to GOFLAG_STOP to terminate the test run.

Lines 12–44 define reader(), which is the reader thread. Line 19 atomically increments the nreadersrunning variable to indicate that this thread is now running, and lines 20–22 wait for the test to start. The READ_ONCE() primitive forces the compiler to fetch goflag on each pass through the loop—the compiler would otherwise be within its rights to assume that the
value of goflag would never change.

Quick Quiz 4.17: Instead of using READ_ONCE() every- on the graph). The actual value plotted is:
where, why not just declare goflag as volatile on line 10
of Listing 4.8?

    L_N / (N L_1)                                   (4.1)
Quick Quiz 4.18: READ_ONCE() only affects the compiler, where 𝑁 is the number of threads in the current run, 𝐿 𝑁 is
not the CPU. Don’t we also need memory barriers to make the total number of lock acquisitions by all 𝑁 threads in the
sure that the change in goflag’s value propagates to the CPU current run, and 𝐿 1 is the number of lock acquisitions in
in a timely fashion in Listing 4.8? a single-threaded run. Given ideal hardware and software
scalability, this value will always be 1.0.
Quick Quiz 4.19: Would it ever be necessary to use READ_ As can be seen in the figure, reader-writer locking
ONCE() when accessing a per-thread variable, for example, a scalability is decidedly non-ideal, especially for smaller
variable declared using GCC’s __thread storage class? sizes of critical sections. To see why read-acquisition can
be so slow, consider that all the acquiring threads must
The loop spanning lines 23–41 carries out the perfor- update the pthread_rwlock_t data structure. Therefore,
mance test. Lines 24–28 acquire the lock, lines 29–31 if all 448 executing threads attempt to read-acquire the
hold the lock for the specified number of microseconds, reader-writer lock concurrently, they must update this
lines 32–36 release the lock, and lines 37–39 wait for the underlying pthread_rwlock_t one at a time. One lucky
specified number of microseconds before re-acquiring the thread might do so almost immediately, but the least-lucky
lock. Line 40 counts this lock acquisition. thread must wait for all the other 447 threads to do their
Line 42 moves the lock-acquisition count to this thread’s updates. This situation will only get worse as you add
element of the readcounts[] array, and line 43 returns, CPUs. Note also the logscale y-axis. Even though the
terminating this thread. 10,000 microsecond trace appears quite ideal, it has in fact
degraded by about 10 % from ideal.
Figure 4.2 shows the results of running this test on a
224-core Xeon system with two hardware threads per core Quick Quiz 4.20: Isn’t comparing against single-CPU
for a total of 448 software-visible CPUs. The thinktime throughput a bit harsh?
parameter was zero for all these tests, and the holdtime
parameter set to values ranging from one microsecond Quick Quiz 4.21: But one microsecond is not a particularly
(“1us” on the graph) to 10,000 microseconds (“10000us” small size for a critical section. What do I do if I need a much


smaller critical section, for example, one containing only a few Listing 4.9: Compiler Barrier Primitive (for GCC)
instructions? #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) \
({ typeof(x) ___x = ACCESS_ONCE(x); ___x; })
Quick Quiz 4.22: The system used is a few years old, and #define WRITE_ONCE(x, val) \
new hardware should be faster. So why should anyone worry do { ACCESS_ONCE(x) = (val); } while (0)
#define barrier() __asm__ __volatile__("": : :"memory")
about reader-writer locks being slow?

Despite these limitations, reader-writer locking is quite


useful in many cases, for example when the readers must is “universal” in the sense that any atomic operation
do high-latency file or network I/O. There are alternatives, on a single location can be implemented in terms of
some of which will be presented in Chapters 5 and 9. compare-and-swap, though the earlier operations are often
more efficient where they apply. The compare-and-swap
operation is also capable of serving as the basis for a
4.2.5 Atomic Operations (GCC Classic) wider set of atomic operations, though the more elaborate
of these often suffer from complexity, scalability, and
Figure 4.2 shows that the overhead of reader-writer locking
performance problems [Her90].
is most severe for the smallest critical sections, so it would
be nice to have some other way of protecting tiny critical Quick Quiz 4.24: Given that these atomic operations will
sections. One such way uses atomic operations. We have often be able to generate single atomic instructions that are
seen an atomic operation already, namely the __sync_ directly supported by the underlying instruction set, shouldn’t
fetch_and_add() primitive on line 19 of Listing 4.8. they be the fastest possible way to get things done?
This primitive atomically adds the value of its second
The __sync_synchronize() primitive issues a
argument to the value referenced by its first argument,
“memory barrier”, which constrains both the compiler’s
returning the old value (which was ignored in this case).
and the CPU’s ability to reorder operations, as discussed in
If a pair of threads concurrently execute __sync_fetch_
Chapter 15. In some cases, it is sufficient to constrain the
and_add() on the same variable, the resulting value of
compiler’s ability to reorder operations, while allowing the
the variable will include the result of both additions.
CPU free rein, in which case the barrier() primitive may
The GNU C compiler offers a number of addi-
be used. In some cases, it is only necessary to ensure that
tional atomic operations, including __sync_fetch_and_
the compiler avoids optimizing away a given memory read,
sub(), __sync_fetch_and_or(), __sync_fetch_
in which case the READ_ONCE() primitive may be used,
and_and(), __sync_fetch_and_xor(), and __sync_
as it was on line 20 of Listing 4.5. Similarly, the WRITE_
fetch_and_nand(), all of which return the old value.
ONCE() primitive may be used to prevent the compiler
If you instead need the new value, you can instead
from optimizing away a given memory write. These last
use the __sync_add_and_fetch(), __sync_sub_
three primitives are not provided directly by GCC, but may
and_fetch(), __sync_or_and_fetch(), __sync_
be implemented straightforwardly as shown in Listing 4.9,
and_and_fetch(), __sync_xor_and_fetch(), and
and all three are discussed at length in Section 4.3.4. Al-
__sync_nand_and_fetch() primitives.
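As a concrete example, the following sketch uses __sync_fetch_and_add() for a simple statistics counter and uses __sync_val_compare_and_swap() in a loop to build an atomic "store maximum" operation, which is not provided directly; the variable and function names are illustrative:

    unsigned long counter;
    unsigned long maxval;

    void count_event(void)
    {
            (void)__sync_fetch_and_add(&counter, 1);    /* Old value ignored. */
    }

    /* Atomically advance maxval to v if v is larger. */
    void update_max(unsigned long v)
    {
            unsigned long old = maxval;

            while (old < v) {
                    unsigned long seen;

                    seen = __sync_val_compare_and_swap(&maxval, old, v);
                    if (seen == old)
                            break;          /* CAS succeeded. */
                    old = seen;             /* Lost a race: retry with the value seen. */
            }
    }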
ternatively, READ_ONCE(x) has much in common with
Quick Quiz 4.23: Is it really necessary to have both sets of the GCC intrinsic __atomic_load_n(&x, __ATOMIC_
primitives? RELAXED) and WRITE_ONCE() has much in common
with the GCC intrinsic __atomic_store_n(&x, v,
The classic compare-and-swap operation is provided __ATOMIC_RELAXED).
by a pair of primitives, __sync_bool_compare_and_
swap() and __sync_val_compare_and_swap(). Both Quick Quiz 4.25: What happened to ACCESS_ONCE()?
of these primitives atomically update a location to a new
value, but only if its prior value was equal to the specified
old value. The first variant returns 1 if the operation 4.2.6 Atomic Operations (C11)
succeeded and 0 if it failed, for example, if the prior value
was not equal to the specified old value. The second The C11 standard added atomic operations, in-
variant returns the prior value of the location, which, if cluding loads (atomic_load()), stores (atomic_
equal to the specified old value, indicates that the operation store()), memory barriers (atomic_thread_fence()
succeeded. Either of the compare-and-swap operation and atomic_signal_fence()), and read-modify-


write atomics. The read-modify-write atom- __thread is much easier to use than the POSIX thead-
ics include atomic_fetch_add(), atomic_fetch_ specific data, and so __thread is usually preferred for
sub(), atomic_fetch_and(), atomic_fetch_xor(), code that is to be built only with GCC or other compilers
atomic_exchange(), atomic_compare_exchange_ supporting __thread.
strong(), and atomic_compare_exchange_weak(). Fortunately, the C11 standard introduced a _Thread_
These operate in a manner similar to those described local keyword that can be used in place of __thread. In
in Section 4.2.5, but with the addition of memory-order the fullness of time, this new keyword should combine the
arguments to _explicit variants of all of the opera- ease of use of __thread with the portability of POSIX
tions. Without memory-order arguments, all the atomic thread-specific data.
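A minimal sketch of a per-thread variable declared with __thread (or, equivalently, with C11's _Thread_local) follows; the counter and function names are made up for this illustration:

    __thread unsigned long my_counter;      /* One instance per thread. */

    void count_something(void)
    {
            my_counter++;   /* No locking needed: this thread's copy only. */
    }

Summing such per-thread counters across threads is one of the topics of Chapter 5.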
operations are fully ordered, and the arguments per-
mit weaker orderings. For example, “atomic_load_
explicit(&a, memory_order_relaxed)” is vaguely 4.3 Alternatives to POSIX Opera-
similar to the Linux kernel’s “READ_ONCE()”.1 tions
4.2.7 Atomic Operations (Modern GCC) The strategic marketing paradigm of Open Source is
One restriction of the C11 atomics is that they apply a massively parallel drunkard’s walk filtered by a
Darwinistic process.
only to special atomic types, which can be problematic.
The GNU C compiler therefore provides atomic intrin- Bruce Perens
sics, including __atomic_load(), __atomic_load_
n(), __atomic_store(), __atomic_store_n(), __ Unfortunately, threading operations, locking primitives,
atomic_thread_fence(), etc. These intrinsics offer and atomic operations were in reasonably wide use long
the same semantics as their C11 counterparts, but may before the various standards committees got around to
be used on plain non-atomic objects. Some of these in- them. As a result, there is considerable variation in how
trinsics may be passed a memory-order argument from these operations are supported. It is still quite common to
this list: __ATOMIC_RELAXED, __ATOMIC_CONSUME, find these operations implemented in assembly language,
__ATOMIC_ACQUIRE, __ATOMIC_RELEASE, __ATOMIC_ either for historical reasons or to obtain better perfor-
ACQ_REL, and __ATOMIC_SEQ_CST. mance in specialized circumstances. For example, GCC’s
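As a short sketch of these interfaces, the following increments a plain (non-_Atomic) counter with the GCC __atomic intrinsics and an _Atomic counter with the C11 functions; the variable names are illustrative:

    #include <stdatomic.h>

    unsigned long gcc_counter;      /* Plain object: GCC intrinsics. */
    atomic_ulong c11_counter;       /* C11 atomic type. */

    void count_once(void)
    {
            __atomic_fetch_add(&gcc_counter, 1, __ATOMIC_RELAXED);
            atomic_fetch_add_explicit(&c11_counter, 1, memory_order_relaxed);
    }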
__sync_ family of primitives all provide full memory-
4.2.8 Per-Thread Variables ordering semantics, which in the past motivated many
developers to create their own implementations for situa-
Per-thread variables, also called thread-specific data, tions where the full memory ordering semantics are not
thread-local storage, and other less-polite names, are used required. The following sections show some alternatives
extremely heavily in concurrent code, as will be explored from the Linux kernel and some historical primitives used
in Chapters 5 and 8. POSIX supplies the pthread_key_ by this book’s sample code.
create() function to create a per-thread variable (and
return the corresponding key), pthread_key_delete()
4.3.1 Organization and Initialization
to delete the per-thread variable corresponding to key,
pthread_setspecific() to set the value of the current Although many environments do not require any special
thread’s variable corresponding to the specified key, and initialization code, the code samples in this book start
pthread_getspecific() to return that value. with a call to smp_init(), which initializes a mapping
A number of compilers (including GCC) provide a __ from pthread_t to consecutive integers. The userspace
thread specifier that may be used in a variable definition RCU library2 similarly requires a call to rcu_init().
to designate that variable as being per-thread. The name of Although these calls can be hidden in environments (such
the variable may then be used normally to access the value as that of GCC) that support constructors, most of the
of the current thread’s instance of that variable. Of course, RCU flavors supported by the userspace RCU library also
require each thread invoke rcu_register_thread()

1 Memory ordering is described in more detail in Chapter 15 and

Appendix C. 2 See Section 9.5 for more information on RCU.


Listing 4.10: Thread API
  int smp_thread_id(void)
  thread_id_t create_thread(void *(*func)(void *), void *arg)
  for_each_thread(t)
  for_each_running_thread(t)
  void *wait_thread(thread_id_t tid)
  void wait_all_threads(void)

In the case of the Linux kernel, it is a philosophical question as to whether the kernel does not require calls to special initialization code or whether the kernel's boot-time code is in fact the required initialization code.

4.3.2 Thread Creation, Destruction, and Control

The Linux kernel uses struct task_struct pointers to track kthreads, kthread_create() to create them, kthread_should_stop() to externally suggest that they stop (which has no POSIX equivalent),3 kthread_stop() to wait for them to stop, and schedule_timeout_interruptible() for a timed wait. There are quite a few additional kthread-management APIs, but this provides a good start, as well as good search terms.

The CodeSamples API focuses on "threads", which are a locus of control.4 Each such thread has an identifier of type thread_id_t, and no two threads running at a given time will have the same identifier. Threads share everything except for per-thread local state,5 which includes program counter and stack.

The thread API is shown in Listing 4.10, and members are described in the following sections.

4.3.2.1 create_thread()

The create_thread() primitive creates a new thread, starting the new thread's execution at the function func specified by create_thread()'s first argument, and passing it the argument specified by create_thread()'s second argument. This newly created thread will terminate when it returns from the starting function specified by func. The create_thread() primitive returns the thread_id_t corresponding to the newly created child thread.

This primitive will abort the program if more than NR_THREADS threads are created, counting the one implicitly created by running the program. NR_THREADS is a compile-time constant that may be modified, though some systems may have an upper bound for the allowable number of threads.

4.3.2.2 smp_thread_id()

Because the thread_id_t returned from create_thread() is system-dependent, the smp_thread_id() primitive returns a thread index corresponding to the thread making the request. This index is guaranteed to be less than the maximum number of threads that have been in existence since the program started, and is therefore useful for bitmasks, array indices, and the like.

4.3.2.3 for_each_thread()

The for_each_thread() macro loops through all threads that exist, including all threads that would exist if created. This macro is useful for handling the per-thread variables introduced in Section 4.2.8.

4.3.2.4 for_each_running_thread()

The for_each_running_thread() macro loops through only those threads that currently exist. It is the caller's responsibility to synchronize with thread creation and deletion if required.

4.3.2.5 wait_thread()

The wait_thread() primitive waits for completion of the thread specified by the thread_id_t passed to it. This in no way interferes with the execution of the specified thread; instead, it merely waits for it. Note that wait_thread() returns the value that was returned by the corresponding thread.

4.3.2.6 wait_all_threads()

The wait_all_threads() primitive waits for completion of all currently running threads. It is the caller's responsibility to synchronize with thread creation and deletion if required. However, this primitive is normally used to clean up at the end of a run, so such synchronization is normally not needed.

3 POSIX environments can work around the lack of kthread_should_stop() by using a properly synchronized boolean flag in conjunction with pthread_join().
4 There are many other names for similar software constructs, including "process", "task", "fiber", "event", "execution agent", and so on. Similar design principles apply to all of them.
5 How is that for a circular definition?


Listing 4.11: Example Child Thread
  1 void *thread_test(void *arg)
  2 {
  3   int myarg = (intptr_t)arg;
  4
  5   printf("child thread %d: smp_thread_id() = %d\n",
  6          myarg, smp_thread_id());
  7   return NULL;
  8 }

Listing 4.12: Example Parent Thread
   1 int main(int argc, char *argv[])
   2 {
   3   int i;
   4   int nkids = 1;
   5
   6   smp_init();
   7
   8   if (argc > 1) {
   9     nkids = strtoul(argv[1], NULL, 0);
  10     if (nkids > NR_THREADS) {
  11       fprintf(stderr, "nkids = %d too large, max = %d\n",
  12               nkids, NR_THREADS);
  13       usage(argv[0]);
  14     }
  15   }
  16   printf("Parent thread spawning %d threads.\n", nkids);
  17
  18   for (i = 0; i < nkids; i++)
  19     create_thread(thread_test, (void *)(intptr_t)i);
  20
  21   wait_all_threads();
  22
  23   printf("All spawned threads completed.\n");
  24
  25   exit(0);
  26 }

4.3.2.7 Example Usage

Listing 4.11 (threadcreate.c) shows an example hello-world-like child thread. As noted earlier, each thread is allocated its own stack, so each thread has its own private arg argument and myarg variable. Each child simply prints its argument and its smp_thread_id() before exiting. Note that the return statement on line 7 terminates the thread, returning a NULL to whoever invokes wait_thread() on this thread.

The parent program is shown in Listing 4.12. It invokes smp_init() to initialize the threading system on line 6, parses arguments on lines 8–15, and announces its presence on line 16. It creates the specified number of child threads on lines 18–19, and waits for them to complete on line 21. Note that wait_all_threads() discards the threads' return values, as in this case they are all NULL, which is not very interesting.

Quick Quiz 4.26: What happened to the Linux-kernel equivalents to fork() and wait()?

Listing 4.13: Locking API
  void spin_lock_init(spinlock_t *sp);
  void spin_lock(spinlock_t *sp);
  int spin_trylock(spinlock_t *sp);
  void spin_unlock(spinlock_t *sp);

4.3.3 Locking

A good starting subset of the Linux kernel's locking API is shown in Listing 4.13, each API element being described in the following sections. This book's CodeSamples locking API closely follows that of the Linux kernel.

4.3.3.1 spin_lock_init()

The spin_lock_init() primitive initializes the specified spinlock_t variable, and must be invoked before this variable is passed to any other spinlock primitive.

4.3.3.2 spin_lock()

The spin_lock() primitive acquires the specified spinlock, if necessary, waiting until the spinlock becomes available. In some environments, such as pthreads, this waiting will involve blocking, while in others, such as the Linux kernel, it might involve a CPU-bound spin loop. The key point is that only one thread may hold a spinlock at any given time.

4.3.3.3 spin_trylock()

The spin_trylock() primitive acquires the specified spinlock, but only if it is immediately available. It returns true if it was able to acquire the spinlock and false otherwise.

4.3.3.4 spin_unlock()

The spin_unlock() primitive releases the specified spinlock, allowing other threads to acquire it.

4.3.3.5 Example Usage

A spinlock named mutex may be used to protect a variable counter as follows:

  spin_lock(&mutex);
  counter++;
  spin_unlock(&mutex);
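In userspace, this locking API might plausibly be implemented as a thin wrapper around POSIX primitives. The following sketch is only an illustration, not necessarily the mapping actually used by the CodeSamples; it wraps pthread_mutex_t:

    #include <pthread.h>

    typedef pthread_mutex_t spinlock_t;        /* Illustrative mapping only. */

    void spin_lock_init(spinlock_t *sp)
    {
        pthread_mutex_init(sp, NULL);          /* Default mutex attributes. */
    }

    void spin_lock(spinlock_t *sp)
    {
        pthread_mutex_lock(sp);                /* May block rather than spin. */
    }

    int spin_trylock(spinlock_t *sp)
    {
        return pthread_mutex_trylock(sp) == 0; /* True if the lock was acquired. */
    }

    void spin_unlock(spinlock_t *sp)
    {
        pthread_mutex_unlock(sp);
    }

POSIX also provides pthread_spinlock_t and pthread_spin_lock() for applications that really do want a busy-waiting lock.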


Listing 4.14: Living Dangerously Early 1990s Style
  1 ptr = global_ptr;
  2 if (ptr != NULL && ptr < high_address)
  3   do_low(ptr);

Listing 4.15: C Compilers Can Invent Loads
  1 if (global_ptr != NULL &&
  2     global_ptr < high_address)
  3   do_low(global_ptr);

Quick Quiz 4.27: What problems could occur if the variable counter were incremented without the protection of mutex?

However, the spin_lock() and spin_unlock() primitives do have performance consequences, as will be seen in Chapter 10.

4.3.4 Accessing Shared Variables

It was not until 2011 that the C standard defined semantics for concurrent read/write access to shared variables. However, concurrent C code was being written at least a quarter century earlier [BK85, Inm85]. This raises the question as to what today's greybeards did back in long-past pre-C11 days. A short answer to this question is "they lived dangerously".

At least they would have been living dangerously had they been using 2021 compilers. In (say) the early 1990s, compilers did fewer optimizations, in part because there were fewer compiler writers and in part due to the relatively small memories of that era. Nevertheless, problems did arise, as shown in Listing 4.14, which the compiler is within its rights to transform into Listing 4.15. As you can see, the temporary on line 1 of Listing 4.14 has been optimized away, so that global_ptr will be loaded up to three times.

Quick Quiz 4.28: What is wrong with loading Listing 4.14's global_ptr up to three times?

Section 4.3.4.1 describes additional problems caused by plain accesses, and Sections 4.3.4.2 and 4.3.4.3 describe some pre-C11 solutions. Of course, where practical, direct C-language memory references should be replaced by the primitives described in Section 4.2.5 or (especially) Section 4.2.6. Use these primitives to avoid data races, that is, ensure that if there are multiple concurrent C-language accesses to a given variable, all of those accesses are loads.

4.3.4.1 Shared-Variable Shenanigans

Given code that does plain loads and stores,6 the compiler is within its rights to assume that the affected variables are neither accessed nor modified by any other thread. This assumption allows the compiler to carry out a large number of transformations, including load tearing, store tearing, load fusing, store fusing, code reordering, invented loads, invented stores, store-to-load transformations, and dead-code elimination, all of which work just fine in single-threaded code. But concurrent code can be broken by each of these transformations, or shared-variable shenanigans, as described below.

6 That is, normal loads and stores instead of C11 atomics, inline assembly, or volatile accesses.

Load tearing occurs when the compiler uses multiple load instructions for a single access. For example, the compiler could in theory compile the load from global_ptr (see line 1 of Listing 4.14) as a series of one-byte loads. If some other thread was concurrently setting global_ptr to NULL, the result might have all but one byte of the pointer set to zero, thus forming a "wild pointer". Stores using such a wild pointer could corrupt arbitrary regions of memory, resulting in rare and difficult-to-debug crashes.

Worse yet, on (say) an 8-bit system with 16-bit pointers, the compiler might have no choice but to use a pair of 8-bit instructions to access a given pointer. Because the C standard must support all manner of systems, the standard cannot rule out load tearing in the general case.

Store tearing occurs when the compiler uses multiple store instructions for a single access. For example, one thread might store 0x12345678 to a four-byte integer variable at the same time another thread stored 0xabcdef00. If the compiler used 16-bit stores for either access, the result might well be 0x1234ef00, which could come as quite a surprise to code loading from this integer. Nor is this a strictly theoretical issue. For example, there are CPUs that feature small immediate instruction fields, and on such CPUs, the compiler might split a 64-bit store into two 32-bit stores in order to reduce the overhead of explicitly forming the 64-bit constant in a register, even on a 64-bit CPU. There are historical reports of this actually happening in the wild (e.g. [KM13]), but there is also a recent report [Dea19].7

7 Note that this tearing can happen even on properly aligned and machine-word-sized accesses, and in this particular case, even for volatile stores. Some might argue that this behavior constitutes a bug in the compiler, but either way it illustrates the perceived value of store tearing from a compiler-writer viewpoint.
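These hazards are usually avoided by marked accesses. For illustration, one plausible volatile-based rendering of READ_ONCE() and WRITE_ONCE() for userspace code is sketched below; it is similar in spirit to, but not necessarily identical to, the definitions used by the Linux kernel and by this book's CodeSamples:

    /* Illustrative sketch only: forces a single volatile access of the   */
    /* variable's full size, discouraging tearing, fusing, and invention. */
    #define READ_ONCE(x) \
            ({ typeof(x) ___x = *(volatile typeof(x) *)&(x); ___x; })
    #define WRITE_ONCE(x, val) \
            do { *(volatile typeof(x) *)&(x) = (val); } while (0)

Given these, the store in the load-tearing example might be written as WRITE_ONCE(global_ptr, NULL) and the loads as READ_ONCE(global_ptr), as Section 4.3.4.2 will show.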


Listing 4.16: Inviting Load Fusing Listing 4.18: C Compilers Can Fuse Non-Adjacent Loads
1 while (!need_to_stop) 1 int *gp;
2 do_something_quickly(); 2
3 void t0(void)
4 {
5 WRITE_ONCE(gp, &myvar);
Listing 4.17: C Compilers Can Fuse Loads 6 }
1 if (!need_to_stop) 7
2 for (;;) { 8 void t1(void)
3 do_something_quickly(); 9 {
4 do_something_quickly(); 10 p1 = gp;
5 do_something_quickly(); 11 do_something(p1);
6 do_something_quickly(); 12 p2 = READ_ONCE(gp);
7 do_something_quickly(); 13 if (p2) {
8 do_something_quickly(); 14 do_something_else();
9 do_something_quickly(); 15 p3 = *gp;
10 do_something_quickly(); 16 }
11 do_something_quickly(); 17 }
12 do_something_quickly();
13 do_something_quickly();
14 do_something_quickly();
15 do_something_quickly(); t1() run concurrently, and do_something() and do_
16 do_something_quickly();
17 do_something_quickly(); something_else() are inline functions. Line 1 declares
18 do_something_quickly(); pointer gp, which C initializes to NULL by default. At
19 }
some point, line 5 of t0() stores a non-NULL pointer to gp.
Meanwhile, t1() loads from gp three times on lines 10,
Of course, the compiler simply has no choice but to tear 12, and 15. Given that line 13 finds that gp is non-NULL,
some stores in the general case, given the possibility of one might hope that the dereference on line 15 would be
code using 64-bit integers running on a 32-bit system. But guaranteed never to fault. Unfortunately, the compiler is
for properly aligned machine-sized stores, WRITE_ONCE() within its rights to fuse the read on lines 10 and 15, which
will prevent store tearing. means that if line 10 loads NULL and line 12 loads &myvar,
line 15 could load NULL, resulting in a fault.8 Note that
Load fusing occurs when the compiler uses the result
the intervening READ_ONCE() does not prevent the other
of a prior load from a given variable instead of repeating
two loads from being fused, despite the fact that all three
the load. Not only is this sort of optimization just fine in
are loading from the same variable.
single-threaded code, it is often just fine in multithreaded
code. Unfortunately, the word “often” hides some truly Quick Quiz 4.29: Why does it matter whether do_
annoying exceptions. something() and do_something_else() in Listing 4.18
For example, suppose that a real-time system needs to are inline functions?
invoke a function named do_something_quickly() Store fusing can occur when the compiler notices a
repeatedly until the variable need_to_stop was set, pair of successive stores to a given variable with no
and that the compiler can see that do_something_ intervening loads from that variable. In this case, the
quickly() does not store to need_to_stop. One (un- compiler is within its rights to omit the first store. This is
safe) way to code this is shown in Listing 4.16. The never a problem in single-threaded code, and in fact it is
compiler might reasonably unroll this loop sixteen times usually not a problem in correctly written concurrent code.
in order to reduce the per-invocation of the backwards After all, if the two stores are executed in quick succession,
branch at the end of the loop. Worse yet, because the there is very little chance that some other thread could
compiler knows that do_something_quickly() does load the value from the first store.
not store to need_to_stop, the compiler could quite However, there are exceptions, for example as shown
reasonably decide to check this variable only once, re- in Listing 4.19. The function shut_it_down() stores
sulting in the code shown in Listing 4.17. Once entered, to the shared variable status on lines 3 and 8, and so
the loop on lines 2–19 will never exit, regardless of how assuming that neither start_shutdown() nor finish_
many times some other thread stores a non-zero value to shutdown() access status, the compiler could reason-
need_to_stop. The result will at best be consternation, ably remove the store to status on line 3. Unfortunately,
and might well also include severe physical damage. this would mean that work_until_shut_down() would
The compiler can fuse loads across surprisingly large
spans of code. For example, in Listing 4.18, t0() and 8 Will Deacon reports that this happened in the Linux kernel.


Listing 4.19: C Compilers Can Fuse Stores Listing 4.20: Inviting an Invented Store
1 void shut_it_down(void) 1 if (condition)
2 { 2 a = 1;
3 status = SHUTTING_DOWN; /* BUGGY!!! */ 3 else
4 start_shutdown(); 4 do_a_bunch_of_stuff();
5 while (!other_task_ready) /* BUGGY!!! */
6 continue;
7 finish_shutdown();
8 status = SHUT_DOWN; /* BUGGY!!! */ Listing 4.21: Compiler Invents an Invited Store
9 do_something_else(); 1 a = 1;
10 } 2 if (!condition) {
11 3 a = 0;
12 void work_until_shut_down(void) 4 do_a_bunch_of_stuff();
13 { 5 }
14 while (status != SHUTTING_DOWN) /* BUGGY!!! */
15 do_more_work();
16 other_task_ready = 1; /* BUGGY!!! */
17 }
Invented loads were illustrated by the code in List-
ings 4.14 and 4.15, in which the compiler optimized away
never exit its loop spanning lines 14 and 15, and thus would a temporary variable, thus loading from a shared variable
never set other_task_ready, which would in turn mean more often than intended.
that shut_it_down() would never exit its loop spanning Invented loads can be a performance hazard. These
lines 5 and 6, even if the compiler chooses not to fuse the hazards can occur when a load of variable in a “hot”
successive loads from other_task_ready on line 5. cacheline is hoisted out of an if statement. These hoisting
And there are more problems with the code in List- optimizations are not uncommon, and can cause significant
ing 4.19, including code reordering. increases in cache misses, and thus significant degradation
Code reordering is a common compilation technique of both performance and scalability.
used to combine common subexpressions, reduce register
Invented stores can occur in a number of situations.
pressure, and improve utilization of the many functional
For example, a compiler emitting code for work_until_
units available on modern superscalar microprocessors.
shut_down() in Listing 4.19 might notice that other_
It is also another reason why the code in Listing 4.19 is
task_ready is not accessed by do_more_work(), and
buggy. For example, suppose that the do_more_work()
stored to on line 16. If do_more_work() was a complex
function on line 15 does not access other_task_ready.
inline function, it might be necessary to do a register spill,
Then the compiler would be within its rights to move the
in which case one attractive place to use for temporary
assignment to other_task_ready on line 16 to precede
storage is other_task_ready. After all, there are no
line 14, which might be a great disappointment for anyone
accesses to it, so what is the harm?
hoping that the last call to do_more_work() on line 15
happens before the call to finish_shutdown() on line 7. Of course, a non-zero store to this variable at just the
It might seem futile to prevent the compiler from chang- wrong time would result in the while loop on line 5 termi-
ing the order of accesses in cases where the underlying nating prematurely, again allowing finish_shutdown()
hardware is free to reorder them. However, modern ma- to run concurrently with do_more_work(). Given that
chines have exact exceptions and exact interrupts, mean- the entire point of this while appears to be to prevent
ing that any interrupt or exception will appear to have such concurrency, this is not a good thing.
happened at a specific place in the instruction stream. Using a stored-to variable as a temporary might seem
This means that the handler will see the effect of all outlandish, but it is permitted by the standard. Neverthe-
prior instructions, but won’t see the effect of any subse- less, readers might be justified in wanting a less outlandish
quent instructions. READ_ONCE() and WRITE_ONCE() example, which is provided by Listings 4.20 and 4.21.
can therefore be used to control communication between A compiler emitting code for Listing 4.20 might know
interrupted code and interrupt handlers, independent of that the value of a is initially zero, which might be a strong
the ordering provided by the underlying hardware.9 temptation to optimize away one branch by transforming
this code to that in Listing 4.21. Here, line 1 uncondi-
tionally stores 1 to a, then resets the value back to zero
9 That said, the various standards committees would prefer that
on line 3 if condition was not set. This transforms the
you use atomics or variables of type sig_atomic_t, instead of READ_
ONCE() and WRITE_ONCE(). if-then-else into an if-then, saving one branch.
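In userspace, the interrupt handlers mentioned above correspond roughly to signal handlers. The following sketch (an illustration only, reusing this section's need_to_stop variable and do_something_quickly() function, which is assumed to be defined elsewhere) shows marked accesses being used to communicate between a SIGINT handler and the interrupted main loop, though as footnote 9 notes, the standards committees would prefer sig_atomic_t or C11 atomics:

    #include <signal.h>

    int need_to_stop;                     /* Shared with the signal handler. */

    void handle_sigint(int sig)
    {
        (void)sig;
        WRITE_ONCE(need_to_stop, 1);      /* Marked store from the "interrupt" context. */
    }

    int main(void)
    {
        signal(SIGINT, handle_sigint);
        while (!READ_ONCE(need_to_stop))  /* Marked load prevents load fusing. */
            do_something_quickly();
        return 0;
    }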


Listing 4.22: Inviting a Store-to-Load Conversion surprises will be at all pleasant. Elimination of store-only
1 r1 = p; variables is especially dangerous in cases where external
2 if (unlikely(r1))
3 do_something_with(r1); code locates the variable via symbol tables: The compiler
4 barrier(); is necessarily ignorant of such external-code accesses,
5 p = NULL;
and might thus eliminate a variable that the external code
relies upon.
Listing 4.23: Compiler Converts a Store to a Load Reliable concurrent code clearly needs a way to cause
1 r1 = p;
2 if (unlikely(r1)) the compiler to preserve the number, order, and type of
3 do_something_with(r1); important accesses to shared memory, a topic taken up by
4 barrier();
5 if (p != NULL) Sections 4.3.4.2 and 4.3.4.3, which are up next.
6 p = NULL;

4.3.4.2 A Volatile Solution


Quick Quiz 4.30: Ouch! So can’t the compiler invent a store Although it is now much maligned, before the advent of
to a normal variable pretty much any time it likes? C11 and C++11 [Bec11], the volatile keyword was an
indispensable tool in the parallel programmer’s toolbox.
Finally, pre-C11 compilers could invent writes to unre- This raises the question of exactly what volatile means,
lated variables that happened to be adjacent to written-to a question that is not answered with excessive precision
variables [Boe05, Section 4.2]. This variant of invented even by more recent versions of this standard [Smi19].11
stores has been outlawed by the prohibition against com- This version guarantees that “Accesses through volatile
piler optimizations that invent data races. glvalues are evaluated strictly according to the rules of
Store-to-load transformations can occur when the the abstract machine”, that volatile accesses are side
compiler notices that a plain store might not actually effects, that they are one of the four forward-progress indi-
change the value in memory. For example, consider cators, and that their exact semantics are implementation-
Listing 4.22. Line 1 fetches p, but the “if” statement defined. Perhaps the clearest guidance is provided by this
on line 2 also tells the compiler that the developer thinks non-normative note:
that p is usually zero.10 The barrier() statment on
line 4 forces the compiler to forget the value of p, but volatile is a hint to the implementation to
one could imagine a compiler choosing to remember the avoid aggressive optimization involving the ob-
hint—or getting an additional hint via feedback-directed ject because the value of the object might be
optimization. Doing so would cause the compiler to changed by means undetectable by an implemen-
realize that line 5 is often an expensive no-op. tation. Furthermore, for some implementations,
Such a compiler might therefore guard the store of NULL volatile might indicate that special hardware
with a check, as shown on lines 5 and 6 of Listing 4.23. instructions are required to access the object.
Although this transformation is often desirable, it could be See 6.8.1 for detailed semantics. In general, the
problematic if the actual store was required for ordering. semantics of volatile are intended to be the
For example, a write memory barrier (Linux kernel smp_ same in C++ as they are in C.
wmb()) would order the store, but not the load. This
This wording might be reassuring to those writing low-
situation might suggest use of smp_store_release()
level code, except for the fact that compiler writers are
over smp_wmb().
free to completely ignore non-normative notes. Parallel
Dead-code elimination can occur when the compiler
programmers might instead reassure themselves that com-
notices that the value from a load is never used, or when a
piler writers would like to avoid breaking device drivers
variable is stored to, but never loaded from. This can of
(though perhaps only after a few “frank and open” discus-
course eliminate an access to a shared variable, which can
sions with device-driver developers), and device drivers
in turn defeat a memory-ordering primitive, which could
impose at least the following constraints [MWPF18]:
cause your concurrent code to act in surprising ways.
Experience thus far indicates that relatively few such 1. Implementations are forbidden from tearing an
10 The unlikely() function provides this hint to the compiler,
aligned volatile access when machine instructions of
and different compilers provide different ways of implementing 11 JF Bastien thoroughly documented the history and use cases for

unlikely(). the volatile keyword in C++ [Bas18].


Listing 4.24: Avoiding Danger, 2018 Style Listing 4.26: Preventing Store Fusing and Invented Stores
1 ptr = READ_ONCE(global_ptr); 1 void shut_it_down(void)
2 if (ptr != NULL && ptr < high_address) 2 {
3 do_low(ptr); 3 WRITE_ONCE(status, SHUTTING_DOWN); /* BUGGY!!! */
4 start_shutdown();
5 while (!READ_ONCE(other_task_ready)) /* BUGGY!!! */
Listing 4.25: Preventing Load Fusing 6 continue;
7 finish_shutdown();
1 while (!READ_ONCE(need_to_stop)) 8 WRITE_ONCE(status, SHUT_DOWN); /* BUGGY!!! */
2 do_something_quickly(); 9 do_something_else();
10 }
11
12 void work_until_shut_down(void)
that access’s size and type are available.12
Concur- 13 {
14 while (READ_ONCE(status) != SHUTTING_DOWN) /* BUGGY!!! */
rent code relies on this constraint to avoid unneces- 15 do_more_work();
sary load and store tearing. 16 WRITE_ONCE(other_task_ready, 1); /* BUGGY!!! */
17 }
2. Implementations must not assume anything about the
semantics of a volatile access, nor, for any volatile Listing 4.27: Disinviting an Invented Store
access that returns a value, about the possible set of 1 if (condition)
values that might be returned.13 Concurrent code 2 WRITE_ONCE(a, 1);
3 else
relies on this constraint to avoid optimizations that 4 do_a_bunch_of_stuff();
are inapplicable given that other processors might be
concurrently accessing the location in question.
shown in Listing 4.19, with the result shown in List-
3. Aligned machine-sized non-mixed-size volatile ac- ing 4.26. However, this does nothing to prevent code
cesses interact naturally with volatile assembly-code reordering, which requires some additional tricks taught
sequences before and after. This is necessary because in Section 4.3.4.3.
some devices must be accessed using a combina-
Finally, WRITE_ONCE() can be used to prevent the store
tion of volatile MMIO accesses and special-purpose
invention shown in Listing 4.20, with the resulting code
assembly-language instructions. Concurrent code
shown in Listing 4.27.
relies on this constraint in order to achieve the desired
To summarize, the volatile keyword can prevent
ordering properties from combinations of volatile ac-
load tearing and store tearing in cases where the loads
cesses and other means discussed in Section 4.3.4.3.
and stores are machine-sized and properly aligned. It
Concurrent code also relies on the first two constraints can also prevent load fusing, store fusing, invented loads,
to avoid undefined behavior that could result due to data and invented stores. However, although it does prevent
races if any of the accesses to a given object was either the compiler from reordering volatile accesses with
non-atomic or non-volatile, assuming that all accesses are each other, it does nothing to prevent the CPU from
aligned and machine-sized. The semantics of mixed-size reordering these accesses. Furthermore, it does nothing
accesses to the same locations are more complex, and are to prevent either compiler or CPU from reordering non-
left aside for the time being. volatile accesses with each other or with volatile
So how does volatile stack up against the earlier accesses. Preventing these types of reordering requires
examples? the techniques described in the next section.
Using READ_ONCE() on line 1 of Listing 4.14 avoids
invented loads, resulting in the code shown in Listing 4.24. 4.3.4.3 Assembling the Rest of a Solution
As shown in Listing 4.25, READ_ONCE() can also pre-
vent the loop unrolling in Listing 4.17. Additional ordering has traditionally been provided by
READ_ONCE() and WRITE_ONCE() can also be used recourse to assembly language, for example, GCC asm
to prevent the store fusing and invented stores that were directives. Oddly enough, these directives need not ac-
tually contain assembly language, as exemplified by the
12 Note that this leaves unspecified what to do with 128-bit loads and
barrier() macro shown in Listing 4.9.
stores on CPUs having 128-bit CAS but not 128-bit loads and stores.
In the barrier() macro, the __asm__ introduces the
13 This is strongly implied by the implementation-defined semantics asm directive, the __volatile__ prevents the compiler
called out above. from optimizing the asm away, the empty string specifies


Listing 4.28: Preventing C Compilers From Fusing Loads Ordering is also provided by some read-modify-write
1 while (!need_to_stop) { atomic operations, some of which are presented in Sec-
2 barrier();
3 do_something_quickly(); tion 4.3.5. In the general case, memory ordering can be
4 barrier(); quite subtle, as discussed in Chapter 15. The next section
5 }
covers an alternative to memory ordering, namely limiting
or even entirely avoiding data races.
Listing 4.29: Preventing Reordering
1 void shut_it_down(void)
2 { 4.3.4.4 Avoiding Data Races
3 WRITE_ONCE(status, SHUTTING_DOWN);
4 smp_mb(); “Doctor, it hurts my head when I think about
5 start_shutdown();
6 while (!READ_ONCE(other_task_ready)) concurrently accessing shared variables!”
7 continue;
8 smp_mb(); “Then stop concurrently accessing shared vari-
9 finish_shutdown(); ables!!!”
10 smp_mb();
11 WRITE_ONCE(status, SHUT_DOWN);
12 do_something_else(); The doctor’s advice might seem unhelpful, but one
13 } time-tested way to avoid concurrently accessing shared
14
15 void work_until_shut_down(void) variables is access those variables only when holding a
16 { particular lock, as will be discussed in Chapter 7. Another
17 while (READ_ONCE(status) != SHUTTING_DOWN) {
18 smp_mb(); way is to access a given “shared” variable only from a
19 do_more_work(); given CPU or thread, as will be discussed in Chapter 8. It
20 }
21 smp_mb(); is possible to combine these two approaches, for example,
22 WRITE_ONCE(other_task_ready, 1); a given variable might be modified only by a given CPU or
23 }
thread while holding a particular lock, and might be read
either from that same CPU or thread on the one hand, or
that no actual instructions are to be emitted, and the from some other CPU or thread while holding that same
final "memory" tells the compiler that this do-nothing lock on the other. In all of these situations, all accesses to
asm can arbitrarily change memory. In response, the the shared variables may be plain C-language accesses.
compiler will avoid moving any memory references across Here is a list of situations allowing plain loads and stores
the barrier() macro. This means that the real-time- for some accesses to a given variable, while requiring
destroying loop unrolling shown in Listing 4.17 can be markings (such as READ_ONCE() and WRITE_ONCE()) for
prevented by adding barrier() calls as shown on lines 2 other accesses to that same variable:
and 4 of Listing 4.28. These two lines of code prevent the
1. A shared variable is only modified by a given owning
compiler from pushing the load from need_to_stop into
CPU or thread, but is read by other CPUs or threads.
or past do_something_quickly() from either direction.
All stores must use WRITE_ONCE(). The owning
However, this does nothing to prevent the CPU from
CPU or thread may use plain loads. Everything else
reordering the references. In many cases, this is not
must use READ_ONCE() for loads.
a problem because the hardware can only do a certain
amount of reordering. However, there are cases such 2. A shared variable is only modified while holding a
as Listing 4.19 where the hardware must be constrained. given lock, but is read by code not holding that lock.
Listing 4.26 prevented store fusing and invention, and All stores must use WRITE_ONCE(). CPUs or threads
Listing 4.29 further prevents the remaining reordering holding the lock may use plain loads. Everything
by addition of smp_mb() on lines 4, 8, 10, 18, and 21. else must use READ_ONCE() for loads.
The smp_mb() macro is similar to barrier() shown in
Listing 4.9, but with the empty string replaced by a string 3. A shared variable is only modified while holding a
containing the instruction for a full memory barrier, for given lock by a given owning CPU or thread, but is
example, "mfence" on x86 or "sync" on PowerPC. read by other CPUs or threads or by code not holding
that lock. All stores must use WRITE_ONCE(). The
Quick Quiz 4.31: But aren’t full memory barriers very owning CPU or thread may use plain loads, as may
heavyweight? Isn’t there a cheaper way to enforce the ordering
any CPU or thread holding the lock. Everything else
needed in Listing 4.29?
must use READ_ONCE() for loads.


4. A shared variable is only accessed by a given CPU Listing 4.30: Per-Thread-Variable API
or thread and by a signal or interrupt handler running DEFINE_PER_THREAD(type, name)
DECLARE_PER_THREAD(type, name)
in that CPU’s or thread’s context. The handler can per_thread(name, thread)
use plain loads and stores, as can any code that __get_thread_var(name)
init_per_thread(name, v)
has prevented the handler from being invoked, that
is, code that has blocked signals and/or interrupts.
All other code must use READ_ONCE() and WRITE_ nothing happens unless the original value of the atomic
ONCE(). variable is different than the value specified (these are very
5. A shared variable is only accessed by a given CPU handy for managing reference counters, for example).
or thread and by a signal or interrupt handler running An atomic exchange operation is provided by atomic_
in that CPU’s or thread’s context, and the handler xchg(), and the celebrated compare-and-swap (CAS)
always restores the values of any variables that it operation is provided by atomic_cmpxchg(). Both
has written before return. The handler can use plain of these return the old value. Many additional atomic
loads and stores, as can any code that has prevented RMW primitives are available in the Linux kernel, see
the handler from being invoked, that is, code that the Documentation/atomic_t.txt file in the Linux-
has blocked signals and/or interrupts. All other code kernel source tree.14
can use plain loads, but must use WRITE_ONCE() This book’s CodeSamples API closely follows that of
to prevent store tearing, store fusing, and invented the Linux kernel.
stores.
4.3.6 Per-CPU Variables
Quick Quiz 4.32: What needs to happen if an interrupt or
signal handler might itself be interrupted? The Linux kernel uses DEFINE_PER_CPU() to define a
per-CPU variable, this_cpu_ptr() to form a reference
In most other cases, loads from and stores to a shared to this CPU’s instance of a given per-CPU variable, per_
variable must use READ_ONCE() and WRITE_ONCE() or cpu() to access a specified CPU’s instance of a given
stronger, respectively. But it bears repeating that neither per-CPU variable, along with many other special-purpose
READ_ONCE() nor WRITE_ONCE() provide any ordering per-CPU operations.
guarantees other than within the compiler. See the above Listing 4.30 shows this book’s per-thread-variable API,
Section 4.3.4.3 or Chapter 15 for information on such which is patterned after the Linux kernel’s per-CPU-
guarantees. variable API. This API provides the per-thread equivalent
Examples of many of these data-race-avoidance patterns of global variables. Although this API is, strictly speaking,
are presented in Chapter 5. not necessary,15 it can provide a good userspace analogy
to Linux kernel code.
4.3.5 Atomic Operations Quick Quiz 4.33: How could you work around the lack of a
The Linux kernel provides a wide variety of atomic opera- per-thread-variable API on systems that do not provide it?
tions, but those defined on type atomic_t provide a good
start. Normal non-tearing reads and stores are provided by
atomic_read() and atomic_set(), respectively. Ac- 4.3.6.1 DEFINE_PER_THREAD()
quire load is provided by smp_load_acquire() and The DEFINE_PER_THREAD() primitive defines a per-
release store by smp_store_release(). thread variable. Unfortunately, it is not possible to pro-
Non-value-returning fetch-and-add operations are pro- vide an initializer in the way permitted by the Linux
vided by atomic_add(), atomic_sub(), atomic_ kernel’s DEFINE_PER_CPU() primitive, but there is an
inc(), and atomic_dec(), among others. An atomic init_per_thread() primitive that permits easy runtime
decrement that returns a reached-zero indication is pro- initialization.
vided by both atomic_dec_and_test() and atomic_
sub_and_test(). An atomic add that returns the
new value is provided by atomic_add_return().
Both atomic_add_unless() and atomic_inc_not_ 14 As of Linux kernel v5.11.
zero() provide conditional atomic operations, where 15 You could instead use __thread or _Thread_local.
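As a concrete illustration of these atomic operations, consider a reference counter in the Linux-kernel style that the CodeSamples API follows. This sketch is not from the book's code; struct foo, the foo_*() helpers, and the use of free() are hypothetical:

    struct foo {
        atomic_t refcount;
        /* ... other fields ... */
    };

    void foo_init(struct foo *p)
    {
        atomic_set(&p->refcount, 1);           /* Start with one reference. */
    }

    void foo_get(struct foo *p)
    {
        atomic_inc(&p->refcount);              /* Acquire an additional reference. */
    }

    int foo_get_unless_zero(struct foo *p)
    {
        return atomic_inc_not_zero(&p->refcount); /* Fails if the count is already zero. */
    }

    void foo_put(struct foo *p)
    {
        if (atomic_dec_and_test(&p->refcount)) /* True when the count reaches zero. */
            free(p);                           /* Last reference dropped: safe to free. */
    }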


4.3.6.2 DECLARE_PER_THREAD() Again, it is possible to gain a similar effect using other


mechanisms, but per-thread variables combine conve-
The DECLARE_PER_THREAD() primitive is a declaration nience and high performance, as will be shown in more
in the C sense, as opposed to a definition. Thus, a detail in Section 5.2.
DECLARE_PER_THREAD() primitive may be used to ac-
cess a per-thread variable defined in some other file.
4.4 The Right Tool for the Job: How
4.3.6.3 per_thread()
to Choose?
The per_thread() primitive accesses the specified
thread’s variable.
If you get stuck, change your tools; it may free your
thinking.
4.3.6.4 __get_thread_var()
Paul Arden, abbreviated
The __get_thread_var() primitive accesses the cur-
rent thread’s variable. As a rough rule of thumb, use the simplest tool that will
get the job done. If you can, simply program sequentially.
4.3.6.5 init_per_thread() If that is insufficient, try using a shell script to mediate
parallelism. If the resulting shell-script fork()/exec()
The init_per_thread() primitive sets all threads’ in- overhead (about 480 microseconds for a minimal C pro-
stances of the specified variable to the specified value. The gram on an Intel Core Duo laptop) is too large, try using
Linux kernel accomplishes this via normal C initialization, the C-language fork() and wait() primitives. If the
relying in clever use of linker scripts and code executed overhead of these primitives (about 80 microseconds for
during the CPU-online process. a minimal child process) is still too large, then you might
need to use the POSIX threading primitives, choosing the
4.3.6.6 Usage Example appropriate locking and/or atomic-operation primitives.
If the overhead of the POSIX threading primitives (typi-
Suppose that we have a counter that is incremented very cally sub-microsecond) is too great, then the primitives
frequently but read out quite rarely. As will become clear introduced in Chapter 9 may be required. Of course, the
in Section 5.2, it is helpful to implement such a counter actual overheads will depend not only on your hardware,
using a per-thread variable. Such a variable can be defined but most critically on the manner in which you use the
as follows: primitives. Furthermore, always remember that inter-
DEFINE_PER_THREAD(int, counter); process communication and message-passing can be good
alternatives to shared-memory multithreaded execution,
especially when your code makes good use of the design
The counter must be initialized as follows: principles called out in Chapter 6.
init_per_thread(counter, 0); Quick Quiz 4.34: Wouldn’t the shell normally use vfork()
rather than fork()?
A thread can increment its instance of this counter as
Because concurrency was added to the C standard
follows:
several decades after the C language was first used to
p_counter = &__get_thread_var(counter); build concurrent systems, there are a number of ways
WRITE_ONCE(*p_counter, *p_counter + 1); of concurrently accessing shared variables. All else
being equal, the C11 standard operations described in
The value of the counter is then the sum of its instances. Section 4.2.6 should be your first stop. If you need to
A snapshot of the value of the counter can thus be collected access a given shared variable both with plain accesses and
as follows: atomically, then the modern GCC atomics described in
Section 4.2.7 might work well for you. If you are working
for_each_thread(t) on an old codebase that uses the classic GCC __sync
sum += READ_ONCE(per_thread(counter, t));
API, then you should review Section 4.2.5 as well as the


relevant GCC documentation. If you are working on the


Linux kernel or similar codebase that combines use of the
volatile keyword with inline assembly, or if you need
dependencies to provide ordering, look at the material
presented in Section 4.3.4 as well as that in Chapter 15.
Whatever approach you take, please keep in mind that
randomly hacking multi-threaded code is a spectacularly
bad idea, especially given that shared-memory parallel sys-
tems use your own intelligence against you: The smarter
you are, the deeper a hole you will dig for yourself before
you realize that you are in trouble [Pok16]. Therefore,
it is necessary to make the right design choices as well
as the correct choice of individual primitives, as will be
discussed at length in subsequent chapters.
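For reference, the fork()/wait() rung of that ladder might look like the following minimal sketch, in which do_child_work() is a placeholder for the parallel work and error handling is omitted:

    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void do_child_work(void)
    {
        /* Placeholder for the work done by the child process. */
    }

    int main(void)
    {
        pid_t pid = fork();

        if (pid == 0) {          /* Child: do the work, then exit. */
            do_child_work();
            exit(EXIT_SUCCESS);
        }
        waitpid(pid, NULL, 0);   /* Parent: wait for the child to finish. */
        return 0;
    }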

As easy as 1, 2, 3!

Unknown
Chapter 5

Counting

Counting is perhaps the simplest and most natural thing number of structures in use exceeds an exact limit (again, say
a computer can do. However, counting efficiently and 10,000). Suppose further that these structures are short-lived,
scalably on a large shared-memory multiprocessor can and that the limit is rarely exceeded, that there is almost always
be quite challenging. Furthermore, the simplicity of the at least one structure in use, and suppose further still that it is
underlying concept of counting allows us to explore the necessary to know exactly when this counter reaches zero, for
example, in order to free up some memory that is not required
fundamental issues of concurrency without the distractions
unless there is at least one structure in use.
of elaborate data structures or complex synchronization
primitives. Counting therefore provides an excellent
Quick Quiz 5.5: Removable I/O device access-count prob-
introduction to parallel programming.
lem. Suppose that you need to maintain a reference count on
This chapter covers a number of special cases for which a heavily used removable mass-storage device, so that you can
there are simple, fast, and scalable counting algorithms. tell the user when it is safe to remove the device. As usual, the
But first, let us find out how much you already know about user indicates a desire to remove the device, and the system
concurrent counting. tells the user when it is safe to do so.
Quick Quiz 5.1: Why should efficient and scalable counting
Section 5.1 shows why counting is non-trivial. Sec-
be hard??? After all, computers have special hardware for the
tions 5.2 and 5.3 investigate network-packet counting
sole purpose of doing counting!!!
and approximate structure-allocation limits, respectively.
Section 5.4 takes on exact structure-allocation limits. Fi-
Quick Quiz 5.2: Network-packet counting problem. Sup-
pose that you need to collect statistics on the number of nally, Section 5.5 presents performance measurements
networking packets transmitted and received. Packets might and discussion.
be transmitted or received by any CPU on the system. Suppose Sections 5.1 and 5.2 contain introductory material,
further that your system is capable of handling millions of while the remaining sections are more advanced.
packets per second per CPU, and that a systems-monitoring
package reads the count every five seconds. How would you
implement this counter? 5.1 Why Isn’t Concurrent Counting
Quick Quiz 5.3: Approximate structure-allocation limit Trivial?
problem. Suppose that you need to maintain a count of the
number of structures allocated in order to fail any allocations
Seek simplicity, and distrust it.
once the number of structures in use exceeds a limit (say,
10,000). Suppose further that the structures are short-lived, Alfred North Whitehead
the limit is rarely exceeded, and a “sloppy” approximate limit
is acceptable. Let’s start with something simple, for example, the
straightforward use of arithmetic shown in Listing 5.1
Quick Quiz 5.4: Exact structure-allocation limit problem. (count_nonatomic.c). Here, we have a counter on
Suppose that you need to maintain a count of the number of
line 1, we increment it on line 5, and we read out its value
structures allocated in order to fail any allocations once the
on line 10. What could be simpler?



Listing 5.1: Just Count! 100000


1 unsigned long counter = 0;

Time Per Increment (ns)


2
10000
3 static __inline__ void inc_count(void)
4 {
5 WRITE_ONCE(counter, READ_ONCE(counter) + 1); 1000
6 }
7
8 static __inline__ unsigned long read_count(void) 100
9 {
10 return READ_ONCE(counter);
11 } 10

Listing 5.2: Just Count Atomically! 1

10

100
1 atomic_t counter = ATOMIC_INIT(0);
2
3 static __inline__ void inc_count(void) Number of CPUs (Threads)
4 {
5 atomic_inc(&counter);
6 }
Figure 5.1: Atomic Increment Scalability on x86
7
8 static __inline__ long read_count(void)
9 {
10 return atomic_read(&counter); times slower than non-atomic increment, even when only
11 } a single thread is incrementing.1
This poor performance should not be a surprise, given
the discussion in Chapter 3, nor should it be a surprise
Quick Quiz 5.6: One thing that could be simpler is ++ instead that the performance of atomic increment gets slower
of that concatenation of READ_ONCE() and WRITE_ONCE(). as the number of CPUs and threads increase, as shown
Why all that extra typing??? in Figure 5.1. In this figure, the horizontal dashed line
resting on the x axis is the ideal performance that would
This approach has the additional advantage of being be achieved by a perfectly scalable algorithm: With
blazingly fast if you are doing lots of reading and almost such an algorithm, a given increment would incur the
no incrementing, and on small systems, the performance same overhead that it would in a single-threaded program.
is excellent. Atomic increment of a single global variable is clearly
There is just one large fly in the ointment: This approach decidedly non-ideal, and gets multiple orders of magnitude
can lose counts. On my six-core x86 laptop, a short run worse with additional CPUs.
invoked inc_count() 285,824,000 times, but the final
Quick Quiz 5.9: Why doesn’t the horizontal dashed line on
value of the counter was only 35,385,525. Although the x axis meet the diagonal line at 𝑥 = 1?
approximation does have a large place in computing, loss
of 87 % of the counts is a bit excessive. Quick Quiz 5.10: But atomic increment is still pretty fast.
Quick Quiz 5.7: But can’t a smart compiler prove that line 5 And incrementing a single variable in a tight loop sounds pretty
of Listing 5.1 is equivalent to the ++ operator and produce an unrealistic to me, after all, most of the program’s execution
x86 add-to-memory instruction? And won’t the CPU cache should be devoted to actually doing work, not accounting for
cause this to be atomic? the work it has done! Why should I care about making this go
faster?
Quick Quiz 5.8: The 8-figure accuracy on the number of
For another perspective on global atomic increment,
failures indicates that you really did test this. Why would it be
necessary to test such a trivial program, especially when the
consider Figure 5.2. In order for each CPU to get a
bug is easily seen by inspection? chance to increment a given global variable, the cache
line containing that variable must circulate among all
The straightforward way to count accurately is to use 1 Interestingly enough, non-atomically incrementing a counter will
atomic operations, as shown in Listing 5.2 (count_ advance the counter more quickly than atomically incrementing the
atomic.c). Line 1 defines an atomic variable, line 5 counter. Of course, if your only goal is to make the counter increase
quickly, an easier approach is to simply assign a large value to the counter.
atomically increments it, and line 10 reads it out. Be- Nevertheless, there is likely to be a role for algorithms that use carefully
cause this is atomic, it keeps perfect count. However, it is relaxed notions of correctness in order to gain greater performance and
slower: On my six-core x86 laptop, it is more than twenty scalability [And91, ACMS03, Rin13, Ung11].


CPU 0 CPU 1 CPU 2 CPU 3 5.2 Statistical Counters


Cache Cache Cache Cache
Interconnect Interconnect Facts are stubborn things, but statistics are pliable.

Mark Twain
Memory System Interconnect Memory
This section covers the common special case of statistical
counters, where the count is updated extremely frequently
Interconnect Interconnect and the value is read out rarely, if ever. These will be used
Cache Cache Cache Cache to solve the network-packet counting problem posed in
CPU 4 CPU 5 CPU 6 CPU 7 Quick Quiz 5.2.

Figure 5.2: Data Flow For Global Atomic Increment 5.2.1 Design
Statistical counting is typically handled by providing a
counter per thread (or CPU, when running in the kernel),
so that each thread updates its own counter, as was fore-
shadowed in Section 4.3.6 on page 46. The aggregate
value of the counters is read out by simply summing up
all of the threads’ counters, relying on the commutative
and associative properties of addition. This is an example
of the Data Ownership pattern that will be introduced in
Section 6.3.4 on page 86.
One one thousand. Quick Quiz 5.12: But doesn’t the fact that C’s “integers” are
Two one thousand.
Three one thousand...
limited in size complicate things?

5.2.2 Array-Based Implementation


One way to provide per-thread variables is to allocate
an array with one element per thread (presumably cache
Figure 5.3: Waiting to Count aligned and padded to avoid false sharing).
Quick Quiz 5.13: An array??? But doesn’t that limit the
number of threads?

Such an array can be wrapped into per-thread primitives,


the CPUs, as shown by the red arrows. Such circulation as shown in Listing 5.3 (count_stat.c). Line 1 defines
will take significant time, resulting in the poor perfor- an array containing a set of per-thread counters of type
mance seen in Figure 5.1, which might be thought of as unsigned long named, creatively enough, counter.
shown in Figure 5.3. The following sections discuss high- Lines 3–8 show a function that increments the counters,
performance counting, which avoids the delays inherent using the __get_thread_var() primitive to locate the
in such circulation. currently running thread’s element of the counter array.
Because this element is modified only by the correspond-
ing thread, non-atomic increment suffices. However, this
code uses WRITE_ONCE() to prevent destructive compiler
Quick Quiz 5.11: But why can’t CPU designers simply
optimizations. For but one example, the compiler is within
ship the addition operation to the data, avoiding the need to
circulate the cache line containing the global variable being its rights to use a to-be-stored-to location as temporary
incremented? storage, thus writing what would be for all intents and
purposes garbage to that location just before doing the
desired store. This could of course be rather confusing


Listing 5.3: Array-Based Per-Thread Statistical Counters


CPU 0 CPU 1 CPU 2 CPU 3
1 DEFINE_PER_THREAD(unsigned long, counter);
2 Cache Cache Cache Cache
3 static __inline__ void inc_count(void)
4 { Interconnect Interconnect
5 unsigned long *p_counter = &__get_thread_var(counter);
6
7 WRITE_ONCE(*p_counter, *p_counter + 1);
8 } Memory System Interconnect Memory
9
10 static __inline__ unsigned long read_count(void)
11 {
12 int t;
Interconnect Interconnect
13 unsigned long sum = 0; Cache Cache Cache Cache
14
15 for_each_thread(t) CPU 4 CPU 5 CPU 6 CPU 7
16 sum += READ_ONCE(per_thread(counter, t));
17 return sum;
18 } Figure 5.4: Data Flow For Per-Thread Increment

to anything attempting to read out the count. The use the network-packet counting problem presented at the
of WRITE_ONCE() prevents this optimization and others beginning of this chapter.
besides. Quick Quiz 5.17: The read operation takes time to sum
Quick Quiz 5.14: What other nasty optimizations could up the per-thread values, and during that time, the counter
GCC apply? could well be changing. This means that the value returned
by read_count() in Listing 5.3 will not necessarily be exact.
Lines 10–18 show a function that reads out the aggregate Assume that the counter is being incremented at rate 𝑟 counts
per unit time, and that read_count()’s execution consumes
value of the counter, using the for_each_thread()
𝛥 units of time. What is the expected error in the return value?
primitive to iterate over the list of currently running
threads, and using the per_thread() primitive to fetch
the specified thread’s counter. This code also uses READ_ However, many implementations provide cheaper mech-
ONCE() to ensure that the compiler doesn’t optimize these anisms for per-thread data that are free from arbitrary
loads into oblivion. For but one example, a pair of array-size limits. This is the topic of the next section.
consecutive calls to read_count() might be inlined, and
an intrepid optimizer might notice that the same locations 5.2.3 Per-Thread-Variable-Based Imple-
were being summed and thus incorrectly conclude that it
would be simply wonderful to sum them once and use the
mentation
resulting value twice. This sort of optimization might be The C language, since C11, features a _Thread_local
rather frustrating to people expecting later read_count() storage class that provides per-thread storage.2 This can be
calls to account for the activities of other threads. The use used as shown in Listing 5.4 (count_end.c) to implement
of READ_ONCE() prevents this optimization and others a statistical counter that not only scales well and avoids
besides. arbitrary thread-number limits, but that also incurs little
Quick Quiz 5.15: How does the per-thread counter variable or no performance penalty to incrementers compared to
in Listing 5.3 get initialized? simple non-atomic increment.
Lines 1–4 define needed variables: counter is the
per-thread counter variable, the counterp[] array allows
Quick Quiz 5.16: How is the code in Listing 5.3 supposed
to permit more than one counter? threads to access each others’ counters, finalcount ac-
cumulates the total as individual threads exit, and final_
This approach scales linearly with increasing number mutex coordinates between threads accumulating the total
of updater threads invoking inc_count(). As is shown value of the counter and exiting threads.
by the green arrows on each CPU in Figure 5.4, the
reason for this is that each CPU can make rapid progress 2 GCC provides its own __thread storage class, which was used
incrementing its thread’s variable, without any expensive in previous versions of this book. The two methods for specifying a
cross-system communication. As such, this section solves thread-local variable are interchangeable when using GCC.


Listing 5.4: Per-Thread Statistical Counters counter-pointers to that variable rather than setting them to
1 unsigned long _Thread_local counter = 0; NULL?
2 unsigned long *counterp[NR_THREADS] = { NULL };
3 unsigned long finalcount = 0;
4 DEFINE_SPINLOCK(final_mutex); Quick Quiz 5.20: Why on earth do we need something as
5
heavyweight as a lock guarding the summation in the function
6 static inline void inc_count(void)
7 { read_count() in Listing 5.4?
8 WRITE_ONCE(counter, counter + 1);
9 } Lines 25–32 show the count_register_thread()
10
11 static inline unsigned long read_count(void) function, which must be called by each thread before its
12 { first use of this counter. This function simply sets up this
13 int t;
14 unsigned long sum; thread’s element of the counterp[] array to point to its
15
16 spin_lock(&final_mutex);
per-thread counter variable.
17 sum = finalcount;
18 for_each_thread(t)
Quick Quiz 5.21: Why on earth do we need to acquire the
19 if (counterp[t] != NULL) lock in count_register_thread() in Listing 5.4? It is a
20 sum += READ_ONCE(*counterp[t]); single properly aligned machine-word store to a location that
21 spin_unlock(&final_mutex);
22 return sum; no other thread is modifying, so it should be atomic anyway,
23 } right?
24
25 void count_register_thread(unsigned long *p)
26 { Lines 34–42 show the count_unregister_
27 int idx = smp_thread_id(); thread() function, which must be called prior to exit
28
29 spin_lock(&final_mutex); by each thread that previously called count_register_
30 counterp[idx] = &counter; thread(). Line 38 acquires the lock, and line 41 releases
31 spin_unlock(&final_mutex);
32 } it, thus excluding any calls to read_count() as well as
33 other calls to count_unregister_thread(). Line 39
34 void count_unregister_thread(int nthreadsexpected)
35 { adds this thread’s counter to the global finalcount,
36 int idx = smp_thread_id(); and then line 40 NULLs out its counterp[] array entry.
37
38 spin_lock(&final_mutex); A subsequent call to read_count() will see the exiting
39 finalcount += counter; thread’s count in the global finalcount, and will
40 counterp[idx] = NULL;
41 spin_unlock(&final_mutex); skip the exiting thread when sequencing through the
42 } counterp[] array, thus obtaining the correct total.
This approach gives updaters almost exactly the same
performance as a non-atomic add, and also scales linearly.
Quick Quiz 5.18: Doesn’t that explicit counterp array On the other hand, concurrent reads contend for a sin-
in Listing 5.4 reimpose an arbitrary limit on the number
gle global lock, and therefore perform poorly and scale
of threads? Why doesn’t the C language provide a per_
abysmally. However, this is not a problem for statistical
thread() interface, similar to the Linux kernel’s per_cpu()
primitive, to allow threads to more easily access each others’ counters, where incrementing happens often and readout
per-thread variables? happens almost never. Of course, this approach is consid-
erably more complex than the array-based scheme, due to
The inc_count() function used by updaters is quite the fact that a given thread’s per-thread variables vanish
simple, as can be seen on lines 6–9. when that thread exits.
The read_count() function used by readers is a bit Quick Quiz 5.22: Fine, but the Linux kernel doesn’t have
more complex. Line 16 acquires a lock to exclude exiting to acquire a lock when reading out the aggregate value of
threads, and line 21 releases it. Line 17 initializes the per-CPU counters. So why should user-space code need to do
sum to the count accumulated by those threads that have this???
already exited, and lines 18–20 sum the counts being
Both the array-based and _Thread_local-based ap-
accumulated by threads currently running. Finally, line 22
proaches offer excellent update-side performance and
returns the sum.
scalability. However, these benefits result in large read-
Quick Quiz 5.19: Doesn’t the check for NULL on line 19 side expense for large numbers of threads. The next
of Listing 5.4 add extra branch mispredictions? Why not
section shows one way to reduce read-side expense while
have a variable set permanently to zero, and point unused
still retaining the update-side scalability.


5.2.4 Eventually Consistent Implementation

One way to retain update-side scalability while greatly


improving read-side performance is to weaken consis-
tency requirements. The counting algorithm in the pre- Listing 5.5: Array-Based Per-Thread Eventually Consistent
vious section is guaranteed to return a value between the Counters
value that an ideal counter would have taken on near the 1 DEFINE_PER_THREAD(unsigned long, counter);
2 unsigned long global_count;
beginning of read_count()’s execution and that near 3 int stopflag;
the end of read_count()’s execution. Eventual consis- 4
5 static __inline__ void inc_count(void)
tency [Vog09] provides a weaker guarantee: In absence 6 {
of calls to inc_count(), calls to read_count() will 7 unsigned long *p_counter = &__get_thread_var(counter);
8
eventually return an accurate count. 9 WRITE_ONCE(*p_counter, *p_counter + 1);
10 }
We exploit eventual consistency by maintaining a global 11
counter. However, updaters only manipulate their per- 12 static __inline__ unsigned long read_count(void)
13 {
thread counters. A separate thread is provided to transfer 14 return READ_ONCE(global_count);
counts from the per-thread counters to the global counter. 15 }
16
Readers simply access the value of the global counter. If 17 void *eventual(void *arg)
updaters are active, the value used by the readers will 18 {
19 int t;
be out of date, however, once updates cease, the global 20 unsigned long sum;
counter will eventually converge on the true value—hence 21
22 while (READ_ONCE(stopflag) < 3) {
this approach qualifies as eventually consistent. 23 sum = 0;
24 for_each_thread(t)
The implementation is shown in Listing 5.5 (count_ 25 sum += READ_ONCE(per_thread(counter, t));
stat_eventual.c). Lines 1–2 show the per-thread vari- 26 WRITE_ONCE(global_count, sum);
27 poll(NULL, 0, 1);
able and the global variable that track the counter’s value, 28 if (READ_ONCE(stopflag)) {
and line 3 shows stopflag which is used to coordinate 29 smp_mb();
30 WRITE_ONCE(stopflag, stopflag + 1);
termination (for the case where we want to terminate 31 }
the program with an accurate counter value). The inc_ 32 }
33 return NULL;
count() function shown on lines 5–10 is similar to its 34 }
counterpart in Listing 5.3. The read_count() function 35
36 void count_init(void)
shown on lines 12–15 simply returns the value of the 37 {
global_count variable. 38 int en;
39 pthread_t tid;
However, the count_init() function on lines 36–46 40
41 en = pthread_create(&tid, NULL, eventual, NULL);
creates the eventual() thread shown on lines 17–34, 42 if (en != 0) {
which cycles through all the threads, summing the per- 43 fprintf(stderr, "pthread_create: %s\n", strerror(en));
44 exit(EXIT_FAILURE);
thread local counter and storing the sum to the global_ 45 }
count variable. The eventual() thread waits an arbi- 46 }
47
trarily chosen one millisecond between passes. 48 void count_cleanup(void)
49 {
The count_cleanup() function on lines 48–54 coor- 50 WRITE_ONCE(stopflag, 1);
51 while (READ_ONCE(stopflag) < 3)
dinates termination. The calls to smp_mb() here and in 52 poll(NULL, 0, 1);
eventual() ensure that all updates to global_count are 53 smp_mb();
54 }
visible to code following the call to count_cleanup().
This approach gives extremely fast counter read-out
while still supporting linear counter-update scalability.
However, this excellent read-side performance and update-
side scalability comes at the cost of the additional thread
running eventual().
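As a rough back-of-the-envelope estimate (an assumption of this discussion, not a bound stated in the text): if the aggregate counter is incremented at rate r counts per unit time, eventual() sleeps for a period T between passes (one millisecond here), and a single summation pass takes Δ units of time, then the value returned by read_count() can lag the true count by roughly

\[ \text{error} \;\lesssim\; r \, (T + \Delta), \]

which is normally quite acceptable for statistical counters, and which is also why this approach is described as only eventually consistent.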


Quick Quiz 5.23: Why doesn’t inc_count() in Listing 5.5 5.3 Approximate Limit Counters
need to use atomic instructions? After all, we now have
multiple threads accessing the per-thread counters!
An approximate answer to the right problem is worth
a good deal more than an exact answer to an
approximate problem.
Quick Quiz 5.24: Won’t the single global thread in the func-
tion eventual() of Listing 5.5 be just as severe a bottleneck John Tukey
as a global lock would be?
Another special case of counting involves limit-checking.
For example, as noted in the approximate structure-
Quick Quiz 5.25: Won’t the estimate returned by read_ allocation limit problem in Quick Quiz 5.3, suppose that
count() in Listing 5.5 become increasingly inaccurate as the you need to maintain a count of the number of structures
number of threads rises?
allocated in order to fail any allocations once the number
of structures in use exceeds a limit, in this case, 10,000.
Suppose further that these structures are short-lived, that
Quick Quiz 5.26: Given that in the eventually-consistent
this limit is rarely exceeded, and that this limit is approx-
algorithm shown in Listing 5.5 both reads and updates have
extremely low overhead and are extremely scalable, why imate in that it is OK to exceed it sometimes by some
would anyone bother with the implementation described in bounded amount (see Section 5.4 if you instead need the
Section 5.2.2, given its costly read-side code? limit to be exact).

5.3.1 Design
Quick Quiz 5.27: What is the accuracy of the estimate
returned by read_count() in Listing 5.5? One possible design for limit counters is to divide the
limit of 10,000 by the number of threads, and give each
thread a fixed pool of structures. For example, given 100
threads, each thread would manage its own pool of 100
structures. This approach is simple, and in some cases
works well, but it does not handle the common case where
5.2.5 Discussion a given structure is allocated by one thread and freed by
another [MS93]. On the one hand, if a given thread takes
These three implementations show that it is possible credit for any structures it frees, then the thread doing
to obtain near-uniprocessor performance for statistical most of the allocating runs out of structures, while the
counters, despite running on a parallel machine. threads doing most of the freeing have lots of credits that
they cannot use. On the other hand, if freed structures
Quick Quiz 5.28: What fundamental difference is there are credited to the CPU that allocated them, it will be
between counting packets and counting the total number of necessary for CPUs to manipulate each others’ counters,
bytes in the packets, given that the packets vary in size? which will require expensive atomic instructions or other
means of communicating between threads.3
In short, for many important workloads, we cannot fully
Quick Quiz 5.29: Given that the reader must sum all the partition the counter. Given that partitioning the counters
threads’ counters, this counter-read operation could take a long was what brought the excellent update-side performance
time given large numbers of threads. Is there any way that for the three schemes discussed in Section 5.2, this might
the increment operation can remain fast and scalable while
be grounds for some pessimism. However, the eventually
allowing readers to also enjoy not only reasonable performance
and scalability, but also good accuracy?
consistent algorithm presented in Section 5.2.4 provides
an interesting hint. Recall that this algorithm kept two sets
of books, a per-thread counter variable for updaters and a
Given what has been presented in this section, you
should now be able to answer the Quick Quiz about 3 That said, if each structure will always be freed by the same CPU
statistical counters for networking near the beginning of (or thread) that allocated it, then this simple partitioning approach works
this chapter. extremely well.


global_count variable for readers, with an eventual() Listing 5.6: Simple Limit Counter Variables
thread that periodically updated global_count to be 1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
eventually consistent with the values of the per-thread 3 unsigned long globalcountmax = 10000;
counter. The per-thread counter perfectly partitioned 4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
the counter value, while global_count kept the full 6 unsigned long *counterp[NR_THREADS] = { NULL };
value. 7 DEFINE_SPINLOCK(gblcnt_mutex);

For limit counters, we can use a variation on this theme


where we partially partition the counter. For example,
consider four threads with each having not only a per- globalcountmax limit. This means that the value of a
thread counter, but also a per-thread maximum value given thread’s countermax variable can be set based on
(call it countermax). this difference. When far from the limit, the countermax
But then what happens if a given thread needs to per-thread variables are set to large values to optimize for
increment its counter, but counter is equal to its performance and scalability, while when close to the limit,
countermax? The trick here is to move half of that these same variables are set to small values to minimize
thread’s counter value to a globalcount, then incre- the error in the checks against the globalcountmax limit.
ment counter. For example, if a given thread’s counter This design is an example of parallel fastpath, which is
and countermax variables were both equal to 10, we do an important design pattern in which the common case
the following: executes with no expensive instructions and no interactions
between threads, but where occasional use is also made
1. Acquire a global lock. of a more conservatively designed (and higher overhead)
global algorithm. This design pattern is covered in more
2. Add five to globalcount. detail in Section 6.4.

3. To balance out the addition, subtract five from this


thread’s counter. 5.3.2 Simple Limit Counter Implementa-
tion
4. Release the global lock.
Listing 5.6 shows both the per-thread and global variables
5. Increment this thread’s counter, resulting in a value used by this implementation. The per-thread counter
of six. and countermax variables are the corresponding thread’s
local counter and the upper bound on that counter, re-
Although this procedure still requires a global lock, spectively. The globalcountmax variable on line 3
that lock need only be acquired once for every five in- contains the upper bound for the aggregate counter, and
crement operations, greatly reducing that lock’s level of the globalcount variable on line 4 is the global counter.
contention. We can reduce this contention as low as we The sum of globalcount and each thread’s counter
wish by increasing the value of countermax. However, gives the aggregate value of the overall counter. The
the corresponding penalty for increasing the value of globalreserve variable on line 5 is at least the sum of
countermax is reduced accuracy of globalcount. To all of the per-thread countermax variables. The relation-
see this, note that on a four-CPU system, if countermax ship among these variables is shown by Figure 5.5:
is equal to ten, globalcount will be in error by at most
40 counts. In contrast, if countermax is increased to 1. The sum of globalcount and globalreserve
100, globalcount might be in error by as much as 400 must be less than or equal to globalcountmax.
counts.
2. The sum of all threads’ countermax values must be
This raises the question of just how much we care about
less than or equal to globalreserve.
globalcount’s deviation from the aggregate value of
the counter, where this aggregate value is the sum of 3. Each thread’s counter must be less than or equal to
globalcount and each thread’s counter variable. The that thread’s countermax.
answer to this question depends on how far the aggregate
value is from the counter’s limit (call it globalcountmax). Each element of the counterp[] array references the
The larger the difference between these two values, the corresponding thread’s counter variable, and, finally, the
larger countermax can be without risk of exceeding the gblcnt_mutex spinlock guards all of the global variables,


countermax 3 Listing 5.7: Simple Limit Counter Add, Subtract, and Read
globalreserve

counter 3 1 static __inline__ int add_count(unsigned long delta)


2 {
3 if (countermax - counter >= delta) {
countermax 2 counter 2 WRITE_ONCE(counter, counter + delta);
globalcountmax

4
5 return 1;
6 }
countermax 1 counter 1
7 spin_lock(&gblcnt_mutex);
8 globalize_count();
countermax 0 9 if (globalcountmax -
counter 0
10 globalcount - globalreserve < delta) {
11 spin_unlock(&gblcnt_mutex);
12 return 0;
globalcount

13 }
14 globalcount += delta;
15 balance_count();
16 spin_unlock(&gblcnt_mutex);
17 return 1;
18 }
19
20 static __inline__ int sub_count(unsigned long delta)
21 {
22 if (counter >= delta) {
Figure 5.5: Simple Limit Counter Variable Relationships 23 WRITE_ONCE(counter, counter - delta);
24 return 1;
25 }
26 spin_lock(&gblcnt_mutex);
in other words, no thread is permitted to access or modify 27 globalize_count();
any of the global variables unless it has acquired gblcnt_ 28 if (globalcount < delta) {
29 spin_unlock(&gblcnt_mutex);
mutex. 30 return 0;
Listing 5.7 shows the add_count(), sub_count(), 31 }
32 globalcount -= delta;
and read_count() functions (count_lim.c). 33 balance_count();
34 spin_unlock(&gblcnt_mutex);
Quick Quiz 5.30: Why does Listing 5.7 provide add_ 35 return 1;
count() and sub_count() instead of the inc_count() and 36 }
37
dec_count() interfaces show in Section 5.2? 38 static __inline__ unsigned long read_count(void)
39 {
Lines 1–18 show add_count(), which adds the speci- 40 int t;
41 unsigned long sum;
fied value delta to the counter. Line 3 checks to see if 42

there is room for delta on this thread’s counter, and, if 43 spin_lock(&gblcnt_mutex);


44 sum = globalcount;
so, line 4 adds it and line 5 returns success. This is the 45 for_each_thread(t)
add_counter() fastpath, and it does no atomic opera- 46 if (counterp[t] != NULL)
47 sum += READ_ONCE(*counterp[t]);
tions, references only per-thread variables, and should not 48 spin_unlock(&gblcnt_mutex);
incur any cache misses. 49 return sum;
50 }
Quick Quiz 5.31: What is with the strange form of the
condition on line 3 of Listing 5.7? Why not the more intuitive
form of the fastpath shown in Listing 5.8?

If the test on line 3 fails, we must access global variables,


and thus must acquire gblcnt_mutex on line 7, which we
release on line 11 in the failure case or on line 16 in the suc- Listing 5.8: Intuitive Fastpath
cess case. Line 8 invokes globalize_count(), shown 3 if (counter + delta <= countermax) {
in Listing 5.9, which clears the thread-local variables, 4 WRITE_ONCE(counter, counter + delta);
5 return 1;
adjusting the global variables as needed, thus simplifying 6 }
global processing. (But don’t take my word for it, try
coding it yourself!) Lines 9 and 10 check to see if addition
of delta can be accommodated, with the meaning of


the expression preceding the less-than sign shown in Fig- Listing 5.9: Simple Limit Counter Utility Functions
ure 5.5 as the difference in height of the two red (leftmost) 1 static __inline__ void globalize_count(void)
2 {
bars. If the addition of delta cannot be accommodated, 3 globalcount += counter;
then line 11 (as noted earlier) releases gblcnt_mutex 4 counter = 0;
5 globalreserve -= countermax;
and line 12 returns indicating failure. 6 countermax = 0;
Otherwise, we take the slowpath. Line 14 adds delta 7 }
8
to globalcount, and then line 15 invokes balance_ 9 static __inline__ void balance_count(void)
count() (shown in Listing 5.9) in order to update both the 10 {
11 countermax = globalcountmax -
global and the per-thread variables. This call to balance_ 12 globalcount - globalreserve;
count() will usually set this thread’s countermax to 13 countermax /= num_online_threads();
14 globalreserve += countermax;
re-enable the fastpath. Line 16 then releases gblcnt_ 15 counter = countermax / 2;
mutex (again, as noted earlier), and, finally, line 17 returns 16 if (counter > globalcount)
17 counter = globalcount;
indicating success. 18 globalcount -= counter;
19 }
Quick Quiz 5.32: Why does globalize_count() zero the 20

per-thread variables, only to later call balance_count() to 21 void count_register_thread(void)


22 {
refill them in Listing 5.7? Why not just leave the per-thread 23 int idx = smp_thread_id();
variables non-zero? 24
25 spin_lock(&gblcnt_mutex);
26 counterp[idx] = &counter;
Lines 20–36 show sub_count(), which subtracts the 27 spin_unlock(&gblcnt_mutex);
specified delta from the counter. Line 22 checks to see if 28 }
29
the per-thread counter can accommodate this subtraction, 30 void count_unregister_thread(int nthreadsexpected)
and, if so, line 23 does the subtraction and line 24 returns 31 {
32 int idx = smp_thread_id();
success. These lines form sub_count()’s fastpath, and, 33
as with add_count(), this fastpath executes no costly 34 spin_lock(&gblcnt_mutex);
35 globalize_count();
operations. 36 counterp[idx] = NULL;
If the fastpath cannot accommodate subtraction of 37 spin_unlock(&gblcnt_mutex);
38 }
delta, execution proceeds to the slowpath on lines 26–35.
Because the slowpath must access global state, line 26 ac-
quires gblcnt_mutex, which is released either by line 29
re-enabling the fastpath). Then line 34 releases gblcnt_
(in case of failure) or by line 34 (in case of success).
mutex, and line 35 returns success.
Line 27 invokes globalize_count(), shown in List-
ing 5.9, which again clears the thread-local variables, Quick Quiz 5.35: Why have both add_count() and sub_
adjusting the global variables as needed. Line 28 checks count() in Listing 5.7? Why not simply pass a negative
to see if the counter can accommodate subtracting delta, number to add_count()?
and, if not, line 29 releases gblcnt_mutex (as noted
earlier) and line 30 returns failure. Lines 38–50 show read_count(), which returns the
aggregate value of the counter. It acquires gblcnt_
Quick Quiz 5.33: Given that globalreserve counted mutex on line 43 and releases it on line 48, excluding
against us in add_count(), why doesn’t it count for us in
global operations from add_count() and sub_count(),
sub_count() in Listing 5.7?
and, as we will see, also excluding thread creation and
exit. Line 44 initializes local variable sum to the value of
Quick Quiz 5.34: Suppose that one thread invokes add_
globalcount, and then the loop spanning lines 45–47
count() shown in Listing 5.7, and then another thread in-
vokes sub_count(). Won’t sub_count() return failure even
sums the per-thread counter variables. Line 49 then
though the value of the counter is non-zero? returns the sum.
Listing 5.9 shows a number of utility functions used by
If, on the other hand, line 28 finds that the counter the add_count(), sub_count(), and read_count()
can accommodate subtracting delta, we complete the primitives shown in Listing 5.7.
slowpath. Line 32 does the subtraction and then line 33 in- Lines 1–7 show globalize_count(), which zeros
vokes balance_count() (shown in Listing 5.9) in order the current thread’s per-thread counters, adjusting the
to update both global and per-thread variables (hopefully global variables appropriately. It is important to note that


this function does not change the aggregate value of the by the bottommost dotted line connecting the leftmost
counter, but instead changes how the counter’s current and center configurations. In other words, the sum of
value is represented. Line 3 adds the thread’s counter globalcount and the four threads’ counter variables is
variable to globalcount, and line 4 zeroes counter. the same in both configurations. Similarly, this change did
Similarly, line 5 subtracts the per-thread countermax not affect the sum of globalcount and globalreserve,
from globalreserve, and line 6 zeroes countermax. It as indicated by the upper dotted line.
is helpful to refer to Figure 5.5 when reading both this The rightmost configuration shows the relationship
function and balance_count(), which is next. of these counters after balance_count() is executed,
Lines 9–19 show balance_count(), which is roughly again by thread 0. One-quarter of the remaining count,
speaking the inverse of globalize_count(). This func- denoted by the vertical line extending up from all three
tion’s job is to set the current thread’s countermax vari- configurations, is added to thread 0’s countermax and
able to the largest value that avoids the risk of the counter half of that to thread 0’s counter. The amount added to
exceeding the globalcountmax limit. Changing the thread 0’s counter is also subtracted from globalcount
current thread’s countermax variable of course requires in order to avoid changing the overall value of the counter
corresponding adjustments to counter, globalcount (which is again the sum of globalcount and the three
and globalreserve, as can be seen by referring back to threads’ counter variables), again as indicated by the
Figure 5.5. By doing this, balance_count() maximizes lowermost of the two dotted lines connecting the center and
use of add_count()’s and sub_count()’s low-overhead rightmost configurations. The globalreserve variable
fastpaths. As with globalize_count(), balance_ is also adjusted so that this variable remains equal to the
count() is not permitted to change the aggregate value sum of the four threads’ countermax variables. Because
of the counter. thread 0’s counter is less than its countermax, thread 0
Lines 11–13 compute this thread’s share of that por- can once again increment the counter locally.
tion of globalcountmax that is not already covered by Quick Quiz 5.37: In Figure 5.6, even though a quarter of the
either globalcount or globalreserve, and assign the remaining count up to the limit is assigned to thread 0, only an
computed quantity to this thread’s countermax. Line 14 eighth of the remaining count is consumed, as indicated by the
makes the corresponding adjustment to globalreserve. uppermost dotted line connecting the center and the rightmost
Line 15 sets this thread’s counter to the middle of the configurations. Why is that?
range from zero to countermax. Line 16 checks to
see whether globalcount can in fact accommodate this Lines 21–28 show count_register_thread(),
value of counter, and, if not, line 17 decreases counter which sets up state for newly created threads. This
accordingly. Finally, in either case, line 18 makes the function simply installs a pointer to the newly created
corresponding adjustment to globalcount. thread’s counter variable into the corresponding entry of
the counterp[] array under the protection of gblcnt_
Quick Quiz 5.36: Why set counter to countermax / 2
mutex.
in line 15 of Listing 5.9? Wouldn’t it be simpler to just take
countermax counts? Finally, lines 30–38 show count_unregister_
thread(), which tears down state for a soon-to-be-exiting
It is helpful to look at a schematic depicting how the thread. Line 34 acquires gblcnt_mutex and line 37 re-
relationship of the counters changes with the execution of leases it. Line 35 invokes globalize_count() to clear
first globalize_count() and then balance_count(), out this thread’s counter state, and line 36 clears this
as shown in Figure 5.6. Time advances from left to right, thread’s entry in the counterp[] array.
with the leftmost configuration roughly that of Figure 5.5.
The center configuration shows the relationship of these
5.3.3 Simple Limit Counter Discussion
same counters after globalize_count() is executed by
thread 0. As can be seen from the figure, thread 0’s This type of counter is quite fast when aggregate val-
counter (“c 0” in the figure) is added to globalcount, ues are near zero, with some overhead due to the com-
while the value of globalreserve is reduced by this same parison and branch in both add_count()’s and sub_
amount. Both thread 0’s counter and its countermax count()’s fastpaths. However, the use of a per-thread
(“cm 0” in the figure) are reduced to zero. The other three countermax reserve means that add_count() can fail
threads’ counters are unchanged. Note that this change even when the aggregate value of the counter is nowhere
did not affect the overall value of the counter, as indicated near globalcountmax. Similarly, sub_count() can fail


Figure 5.6: Schematic of Globalization and Balancing (three configurations, left to right: the initial state, the state after thread 0 executes globalize_count(), and the state after thread 0 executes balance_count(); each shows the per-thread counter "c" and countermax "cm" values stacked above globalcount, with globalreserve spanning the countermax values)

even when the aggregate value of the counter is nowhere


near zero.
In many cases, this is unacceptable. Even if the Listing 5.10: Approximate Limit Counter Variables
1 unsigned long __thread counter = 0;
globalcountmax is intended to be an approximate limit, 2 unsigned long __thread countermax = 0;
there is usually a limit to exactly how much approxima- 3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
tion can be tolerated. One way to limit the degree of 5 unsigned long globalreserve = 0;
approximation is to impose an upper limit on the value 6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);
of the per-thread countermax instances. This task is 8 #define MAX_COUNTERMAX 100
undertaken in the next section.

5.3.4 Approximate Limit Counter Implementation


Listing 5.11: Approximate Limit Counter Balancing
1 static void balance_count(void)
2 {
Because this implementation (count_lim_app.c) is 3 countermax = globalcountmax -
quite similar to that in the previous section (Listings 5.6, 4 globalcount - globalreserve;
5 countermax /= num_online_threads();
5.7, and 5.9), only the changes are shown here. List- 6 if (countermax > MAX_COUNTERMAX)
ing 5.10 is identical to Listing 5.6, with the addition of 7 countermax = MAX_COUNTERMAX;
8 globalreserve += countermax;
MAX_COUNTERMAX, which sets the maximum permissible 9 counter = countermax / 2;
value of the per-thread countermax variable. 10 if (counter > globalcount)
11 counter = globalcount;
Similarly, Listing 5.11 is identical to the balance_ 12 globalcount -= counter;
13 }
count() function in Listing 5.9, with the addition of
lines 6 and 7, which enforce the MAX_COUNTERMAX limit
on the per-thread countermax variable.
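Returning to the structure-allocation problem that motivates these limit counters, the counter would typically be wrapped around the allocator along the following lines. This is a hypothetical sketch (struct foo, foo_alloc(), and foo_free() are illustrative names, and malloc() stands in for whatever allocator is actually in use); it relies only on the add_count() and sub_count() return-value conventions of Listing 5.7:

#include <stdlib.h>

struct foo {
    int payload;                 /* Hypothetical structure contents. */
};

struct foo *foo_alloc(void)
{
    struct foo *p;

    if (!add_count(1))
        return NULL;             /* Limit (approximately) reached. */
    p = malloc(sizeof(*p));
    if (p == NULL)
        sub_count(1);            /* Allocation failed, so give the count back. */
    return p;
}

void foo_free(struct foo *p)
{
    free(p);
    sub_count(1);
}

Because add_count() and sub_count() normally stay on their fastpaths, the limit check adds only a comparison and a branch to the allocation and free paths.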


5.3.5 Approximate Limit Counter Discussion
Listing 5.12: Atomic Limit Counter Variables and Access Functions
1 atomic_t __thread counterandmax = ATOMIC_INIT(0);
2 unsigned long globalcountmax = 1 << 25;
These changes greatly reduce the limit inaccuracy seen in 3 unsigned long globalcount = 0;
the previous version, but present another problem: Any 4 unsigned long globalreserve = 0;
5 atomic_t *counterp[NR_THREADS] = { NULL };
given value of MAX_COUNTERMAX will cause a workload- 6 DEFINE_SPINLOCK(gblcnt_mutex);
dependent fraction of accesses to fall off the fastpath. As 7 #define CM_BITS (sizeof(atomic_t) * 4)
8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
the number of threads increase, non-fastpath execution 9

will become both a performance and a scalability problem. 10 static __inline__ void
11 split_counterandmax_int(int cami, int *c, int *cm)
However, we will defer this problem and turn instead to 12 {
counters with exact limits. 13 *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
14 *cm = cami & MAX_COUNTERMAX;
15 }
16
17 static __inline__ void
5.4 Exact Limit Counters 18
19
split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)
{
20 unsigned int cami = atomic_read(cam);
21
Exactitude can be expensive. Spend wisely. 22 *old = cami;
23 split_counterandmax_int(cami, c, cm);
Unknown 24 }
25
26 static __inline__ int merge_counterandmax(int c, int cm)
To solve the exact structure-allocation limit problem noted 27 {
28 unsigned int cami;
in Quick Quiz 5.4, we need a limit counter that can 29
tell exactly when its limits are exceeded. One way of 30 cami = (c << CM_BITS) | cm;
31 return ((int)cami);
implementing such a limit counter is to cause threads 32 }
that have reserved counts to give them up. One way to
do this is to use atomic instructions. Of course, atomic
instructions will slow down the fastpath, but on the other variable is of type atomic_t, which has an underlying
hand, it would be silly not to at least give them a try. representation of int.
Lines 2–6 show the definitions for globalcountmax,
5.4.1 Atomic Limit Counter Implementa- globalcount, globalreserve, counterp, and
gblcnt_mutex, all of which take on roles similar to
tion
their counterparts in Listing 5.10. Line 7 defines CM_
Unfortunately, if one thread is to safely remove counts BITS, which gives the number of bits in each half of
from another thread, both threads will need to atomically counterandmax, and line 8 defines MAX_COUNTERMAX,
manipulate that thread’s counter and countermax vari- which gives the maximum value that may be held in either
ables. The usual way to do this is to combine these two half of counterandmax.
variables into a single variable, for example, given a 32-bit Quick Quiz 5.39: In what way does line 7 of Listing 5.12
variable, using the high-order 16 bits to represent counter violate the C standard?
and the low-order 16 bits to represent countermax.
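For example, assuming a 32-bit atomic_t, CM_BITS is 16 and MAX_COUNTERMAX is 0xffff, so a counter of 5 paired with a countermax of 10 would be handled as follows (a worked illustration, not code from the book):

int c, cm;
int packed = merge_counterandmax(5, 10);    /* (5 << 16) | 10 == 0x0005000a. */

split_counterandmax_int(packed, &c, &cm);   /* Recovers c == 5 and cm == 10. */

Packing both fields into a single int is what later allows them to be manipulated atomically as a unit.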
Lines 10–15 show the split_counterandmax_
Quick Quiz 5.38: Why is it necessary to atomically manip-
ulate the thread’s counter and countermax variables as a int() function, which, when given the underlying int
unit? Wouldn’t it be good enough to atomically manipulate from the atomic_t counterandmax variable, splits it
them individually? into its counter (c) and countermax (cm) components.
Line 13 isolates the most-significant half of this int,
The variables and access functions for a simple atomic placing the result as specified by argument c, and line 14
limit counter are shown in Listing 5.12 (count_lim_ isolates the least-significant half of this int, placing the
atomic.c). The counter and countermax variables in result as specified by argument cm.
earlier algorithms are combined into the single variable Lines 17–24 show the split_counterandmax() func-
counterandmax shown on line 1, with counter in the tion, which picks up the underlying int from the spec-
upper half and countermax in the lower half. This ified variable on line 20, stores it as specified by the


old argument on line 22, and then invokes split_


counterandmax_int() to split it on line 23.
Quick Quiz 5.40: Given that there is only one
counterandmax variable, why bother passing in a pointer Listing 5.13: Atomic Limit Counter Add and Subtract
to it on line 18 of Listing 5.12? 1 int add_count(unsigned long delta)
2 {
3 int c;
Lines 26–32 show the merge_counterandmax() func- 4 int cm;
tion, which can be thought of as the inverse of split_ 5 int old;
6 int new;
counterandmax(). Line 30 merges the counter and 7

countermax values passed in c and cm, respectively, and 8 do {


9 split_counterandmax(&counterandmax, &old, &c, &cm);
returns the result. 10 if (delta > MAX_COUNTERMAX || c + delta > cm)
11 goto slowpath;
Quick Quiz 5.41: Why does merge_counterandmax() in 12 new = merge_counterandmax(c + delta, cm);
Listing 5.12 return an int rather than storing directly into an 13 } while (atomic_cmpxchg(&counterandmax,
atomic_t? 14 old, new) != old);
15 return 1;
16 slowpath:
Listing 5.13 shows the add_count() and sub_ 17 spin_lock(&gblcnt_mutex);
count() functions. 18 globalize_count();
19 if (globalcountmax - globalcount -
Lines 1–32 show add_count(), whose fastpath spans 20 globalreserve < delta) {
lines 8–15, with the remainder of the function being the 21 flush_local_count();
22 if (globalcountmax - globalcount -
slowpath. Lines 8–14 of the fastpath form a compare-and- 23 globalreserve < delta) {
swap (CAS) loop, with the atomic_cmpxchg() primitive 24 spin_unlock(&gblcnt_mutex);
25 return 0;
on lines 13–14 performing the actual CAS. Line 9 splits 26 }
the current thread’s counterandmax variable into its 27 }
28 globalcount += delta;
counter (in c) and countermax (in cm) components, 29 balance_count();
while placing the underlying int into old. Line 10 30 spin_unlock(&gblcnt_mutex);
31 return 1;
checks whether the amount delta can be accommodated 32 }
locally (taking care to avoid integer overflow), and if not, 33
34 int sub_count(unsigned long delta)
line 11 transfers to the slowpath. Otherwise, line 12 35 {
combines an updated counter value with the original 36 int c;
37 int cm;
countermax value into new. The atomic_cmpxchg() 38 int old;
primitive on lines 13–14 then atomically compares this 39 int new;
40
thread’s counterandmax variable to old, updating its 41 do {
value to new if the comparison succeeds. If the comparison 42 split_counterandmax(&counterandmax, &old, &c, &cm);
43 if (delta > c)
succeeds, line 15 returns success, otherwise, execution 44 goto slowpath;
continues in the loop at line 8. 45 new = merge_counterandmax(c - delta, cm);
46 } while (atomic_cmpxchg(&counterandmax,
Quick Quiz 5.42: Yecch! Why the ugly goto on line 11 of 47 old, new) != old);
48 return 1;
Listing 5.13? Haven’t you heard of the break statement??? 49 slowpath:
50 spin_lock(&gblcnt_mutex);
51 globalize_count();
52 if (globalcount < delta) {
Quick Quiz 5.43: Why would the atomic_cmpxchg() 53 flush_local_count();
primitive at lines 13–14 of Listing 5.13 ever fail? After all, we 54 if (globalcount < delta) {
picked up its old value on line 9 and have not changed it! 55 spin_unlock(&gblcnt_mutex);
56 return 0;
57 }
Lines 16–31 of Listing 5.13 show add_count()’s 58 }
slowpath, which is protected by gblcnt_mutex, which 59 globalcount -= delta;
60 balance_count();
is acquired on line 17 and released on lines 24 and 30. 61 spin_unlock(&gblcnt_mutex);
Line 18 invokes globalize_count(), which moves this 62 return 1;
63 }
thread’s state to the global counters. Lines 19–20 check
whether the delta value can be accommodated by the
current global state, and, if not, line 21 invokes flush_
local_count() to flush all threads’ local state to the


Listing 5.14: Atomic Limit Counter Read Listing 5.15: Atomic Limit Counter Utility Functions 1
1 unsigned long read_count(void) 1 static void globalize_count(void)
2 { 2 {
3 int c; 3 int c;
4 int cm; 4 int cm;
5 int old; 5 int old;
6 int t; 6
7 unsigned long sum; 7 split_counterandmax(&counterandmax, &old, &c, &cm);
8 8 globalcount += c;
9 spin_lock(&gblcnt_mutex); 9 globalreserve -= cm;
10 sum = globalcount; 10 old = merge_counterandmax(0, 0);
11 for_each_thread(t) 11 atomic_set(&counterandmax, old);
12 if (counterp[t] != NULL) { 12 }
13 split_counterandmax(counterp[t], &old, &c, &cm); 13
14 sum += c; 14 static void flush_local_count(void)
15 } 15 {
16 spin_unlock(&gblcnt_mutex); 16 int c;
17 return sum; 17 int cm;
18 } 18 int old;
19 int t;
20 int zero;
21
22 if (globalreserve == 0)
global counters, and then lines 22–23 recheck whether 23 return;
delta can be accommodated. If, after all that, the addition 24 zero = merge_counterandmax(0, 0);
25 for_each_thread(t)
of delta still cannot be accommodated, then line 24 26 if (counterp[t] != NULL) {
releases gblcnt_mutex (as noted earlier), and then line 25 27 old = atomic_xchg(counterp[t], zero);
28 split_counterandmax_int(old, &c, &cm);
returns failure. 29 globalcount += c;
Otherwise, line 28 adds delta to the global counter, 30 globalreserve -= cm;
31 }
line 29 spreads counts to the local state if appropriate, 32 }
line 30 releases gblcnt_mutex (again, as noted earlier),
and finally, line 31 returns success.
Lines 34–63 of Listing 5.13 show sub_count(), which local variable zero to a combined zeroed counter and
is structured similarly to add_count(), having a fastpath countermax. The loop spanning lines 25–31 sequences
on lines 41–48 and a slowpath on lines 49–62. A line-by- through each thread. Line 26 checks to see if the current
line analysis of this function is left as an exercise to the thread has counter state, and, if so, lines 27–30 move that
reader. state to the global counters. Line 27 atomically fetches
Listing 5.14 shows read_count(). Line 9 acquires the current thread’s state while replacing it with zero.
gblcnt_mutex and line 16 releases it. Line 10 initializes Line 28 splits this state into its counter (in local variable
local variable sum to the value of globalcount, and the c) and countermax (in local variable cm) components.
loop spanning lines 11–15 adds the per-thread counters to Line 29 adds this thread’s counter to globalcount,
this sum, isolating each per-thread counter using split_ while line 30 subtracts this thread’s countermax from
counterandmax on line 13. Finally, line 17 returns the globalreserve.
sum. Quick Quiz 5.44: What stops a thread from simply refilling its
Listings 5.15 and 5.16 show the utility func- counterandmax variable immediately after flush_local_
tions globalize_count(), flush_local_count(), count() on line 14 of Listing 5.15 empties it?
balance_count(), count_register_thread(), and
count_unregister_thread(). The code for Quick Quiz 5.45: What prevents concurrent execution of
globalize_count() is shown on lines 1–12 of List- the fastpath of either add_count() or sub_count() from
ing 5.15, and is similar to that of previous algorithms, interfering with the counterandmax variable while flush_
local_count() is accessing it on line 27 of Listing 5.15?
with the addition of line 7, which is now required to split
out counter and countermax from counterandmax.
The code for flush_local_count(), which moves Lines 1–22 on Listing 5.16 show the code for
all threads’ local counter state to the global counter, is balance_count(), which refills the calling thread’s local
shown on lines 14–32. Line 22 checks to see if the value counterandmax variable. This function is quite similar
of globalreserve permits any per-thread counts, and, to that of the preceding algorithms, with changes required
if not, line 23 returns. Otherwise, line 24 initializes to handle the merged counterandmax variable. Detailed


Listing 5.16: Atomic Limit Counter Utility Functions 2


1 static void balance_count(void) IDLE
2 {
3 int c;
4 int cm;
5 int old;
unsigned long limit; need no
6 flushed
7 flush count
8 limit = globalcountmax - globalcount -
9 globalreserve;
10 limit /= num_online_threads(); !counting
11 if (limit > MAX_COUNTERMAX) REQ READY
12 cm = MAX_COUNTERMAX;
13 else
14 cm = limit;
15 globalreserve += cm; done
16 c = cm / 2; counting
counting
17 if (c > globalcount)
18 c = globalcount;
19 globalcount -= c;
20 old = merge_counterandmax(c, cm);
21 atomic_set(&counterandmax, old);
ACK
22 }
23
24 void count_register_thread(void) Figure 5.7: Signal-Theft State Machine
25 {
26 int idx = smp_thread_id();
27
28 spin_lock(&gblcnt_mutex); with better write-side performance. One such algorithm
29 counterp[idx] = &counterandmax;
30 spin_unlock(&gblcnt_mutex); uses a signal handler to steal counts from other threads.
31 } Because signal handlers run in the context of the signaled
32
33 void count_unregister_thread(int nthreadsexpected) thread, atomic operations are not necessary, as shown in
34 { the next section.
35 int idx = smp_thread_id();
36 Quick Quiz 5.47: But signal handlers can be migrated to
37 spin_lock(&gblcnt_mutex);
38 globalize_count(); some other CPU while running. Doesn’t this possibility require
39 counterp[idx] = NULL; that atomic instructions and memory barriers are required to
40 spin_unlock(&gblcnt_mutex); reliably communicate between a thread and a signal handler
41 }
that interrupts that thread?

analysis of the code is left as an exercise for the reader, as


it is with the count_register_thread() function start- 5.4.3 Signal-Theft Limit Counter Design
ing on line 24 and the count_unregister_thread()
Even though per-thread state will now be manipulated
function starting on line 33.
only by the corresponding thread, there will still need
Quick Quiz 5.46: Given that the atomic_set() primitive to be synchronization with the signal handlers. This
does a simple store to the specified atomic_t, how can line 21 synchronization is provided by the state machine shown
of balance_count() in Listing 5.16 work correctly in face of in Figure 5.7.
concurrent flush_local_count() updates to this variable? The state machine starts out in the IDLE state, and when
add_count() or sub_count() find that the combination
The next section qualitatively evaluates this design. of the local thread’s count and the global count cannot
accommodate the request, the corresponding slowpath sets
each thread’s theft state to REQ (unless that thread has
5.4.2 Atomic Limit Counter Discussion no count, in which case it transitions directly to READY).
This is the first implementation that actually allows the Only the slowpath, which holds the gblcnt_mutex lock,
counter to be run all the way to either of its limits, but it is permitted to transition from the IDLE state, as indicated
does so at the expense of adding atomic operations to the by the green color.4 The slowpath then sends a signal
fastpaths, which slow down the fastpaths significantly on
some systems. Although some workloads might tolerate 4 For those with black-and-white versions of this book, IDLE and

this slowdown, it is worthwhile looking for algorithms READY are green, REQ is red, and ACK is blue.


to each thread, and the corresponding signal handler Listing 5.17: Signal-Theft Limit Counter Data
checks the corresponding thread’s theft and counting 1 #define THEFT_IDLE 0
2 #define THEFT_REQ 1
variables. If the theft state is not REQ, then the signal 3 #define THEFT_ACK 2
handler is not permitted to change the state, and therefore 4 #define THEFT_READY 3
5
simply returns. Otherwise, if the counting variable is set, 6 int __thread theft = THEFT_IDLE;
indicating that the current thread’s fastpath is in progress, 7 int __thread counting = 0;
8 unsigned long __thread counter = 0;
the signal handler sets the theft state to ACK, otherwise 9 unsigned long __thread countermax = 0;
to READY. 10 unsigned long globalcountmax = 10000;
11 unsigned long globalcount = 0;
If the theft state is ACK, only the fastpath is permitted 12 unsigned long globalreserve = 0;
to change the theft state, as indicated by the blue color. 13 unsigned long *counterp[NR_THREADS] = { NULL };
14 unsigned long *countermaxp[NR_THREADS] = { NULL };
When the fastpath completes, it sets the theft state to 15 int *theftp[NR_THREADS] = { NULL };
READY. 16 DEFINE_SPINLOCK(gblcnt_mutex);
17 #define MAX_COUNTERMAX 100
Once the slowpath sees a thread’s theft state is
READY, the slowpath is permitted to steal that thread’s
count. The slowpath then sets that thread’s theft state to Quick Quiz 5.50: In Listing 5.18’s function flush_local_
IDLE. count_sig(), why are there READ_ONCE() and WRITE_
Quick Quiz 5.48: In Figure 5.7, why is the REQ theft state ONCE() wrappers around the uses of the theft per-thread
colored red? variable?

Lines 21–49 show flush_local_count(), which is


Quick Quiz 5.49: In Figure 5.7, what is the point of having called from the slowpath to flush all threads’ local counts.
separate REQ and ACK theft states? Why not simplify the
The loop spanning lines 26–34 advances the theft state
state machine by collapsing them into a single REQACK state?
for each thread that has local count, and also sends that
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY. thread a signal. Line 27 skips any non-existent threads.
Otherwise, line 28 checks to see if the current thread
holds any local count, and, if not, line 29 sets the thread’s
theft state to READY and line 30 skips to the next thread.
5.4.4 Signal-Theft Limit Counter Imple- Otherwise, line 32 sets the thread’s theft state to REQ
mentation and line 33 sends the thread a signal.

Listing 5.17 (count_lim_sig.c) shows the data struc- Quick Quiz 5.51: In Listing 5.18, why is it safe for line 28 to
directly access the other thread’s countermax variable?
tures used by the signal-theft based counter implemen-
tation. Lines 1–7 define the states and values for the
per-thread theft state machine described in the preceding Quick Quiz 5.52: In Listing 5.18, why doesn’t line 33 check
for the current thread sending itself a signal?
section. Lines 8–17 are similar to earlier implementa-
tions, with the addition of lines 14 and 15 to allow remote
Quick Quiz 5.53: The code shown in Listings 5.17 and 5.18
access to a thread’s countermax and theft variables,
works with GCC and POSIX. What would be required to make
respectively.
it also conform to the ISO C standard?
Listing 5.18 shows the functions responsible for migrat-
ing counts between per-thread variables and the global The loop spanning lines 35–48 waits until each thread
variables. Lines 1–7 show globalize_count(), which reaches READY state, then steals that thread’s count.
is identical to earlier implementations. Lines 9–19 show Lines 36–37 skip any non-existent threads, and the loop
flush_local_count_sig(), which is the signal han- spanning lines 38–42 waits until the current thread’s
dler used in the theft process. Lines 11 and 12 check to theft state becomes READY. Line 39 blocks for a
see if the theft state is REQ, and, if not returns without millisecond to avoid priority-inversion problems, and if
change. Line 13 executes a memory barrier to ensure line 40 determines that the thread’s signal has not yet
that the sampling of the theft variable happens before any arrived, line 41 resends the signal. Execution reaches
change to that variable. Line 14 sets the theft state to line 43 when the thread’s theft state becomes READY,
ACK, and, if line 15 sees that this thread’s fastpaths are so lines 43–46 do the thieving. Line 47 then sets the
not running, line 16 sets the theft state to READY. thread’s theft state back to IDLE.


Listing 5.19: Signal-Theft Limit Counter Add Function


1 int add_count(unsigned long delta)
2 {
3 int fastpath = 0;
Listing 5.18: Signal-Theft Limit Counter Value-Migration Func- 4
tions 5 WRITE_ONCE(counting, 1);
1 static void globalize_count(void) 6 barrier();
2 { 7 if (READ_ONCE(theft) <= THEFT_REQ &&
3 globalcount += counter; 8 countermax - counter >= delta) {
4 counter = 0; 9 WRITE_ONCE(counter, counter + delta);
5 globalreserve -= countermax; 10 fastpath = 1;
6 countermax = 0; 11 }
7 } 12 barrier();
8
13 WRITE_ONCE(counting, 0);
9 static void flush_local_count_sig(int unused) 14 barrier();
10 { 15 if (READ_ONCE(theft) == THEFT_ACK) {
11 if (READ_ONCE(theft) != THEFT_REQ) 16 smp_mb();
12 return; 17 WRITE_ONCE(theft, THEFT_READY);
13 smp_mb(); 18 }
14 WRITE_ONCE(theft, THEFT_ACK); 19 if (fastpath)
15 if (!counting) { 20 return 1;
16 WRITE_ONCE(theft, THEFT_READY); 21 spin_lock(&gblcnt_mutex);
17 } 22 globalize_count();
18 smp_mb(); 23 if (globalcountmax - globalcount -
19 } 24 globalreserve < delta) {
20
25 flush_local_count();
21 static void flush_local_count(void) 26 if (globalcountmax - globalcount -
22 { 27 globalreserve < delta) {
23 int t; 28 spin_unlock(&gblcnt_mutex);
24 thread_id_t tid; 29 return 0;
25
30 }
26 for_each_tid(t, tid) 31 }
27 if (theftp[t] != NULL) { 32 globalcount += delta;
28 if (*countermaxp[t] == 0) { 33 balance_count();
29 WRITE_ONCE(*theftp[t], THEFT_READY); 34 spin_unlock(&gblcnt_mutex);
30 continue; 35 return 1;
31 } 36 }
32 WRITE_ONCE(*theftp[t], THEFT_REQ);
33 pthread_kill(tid, SIGUSR1);
34 }
35 for_each_tid(t, tid) { Quick Quiz 5.54: In Listing 5.18, why does line 41 resend
36 if (theftp[t] == NULL) the signal?
37 continue;
38 while (READ_ONCE(*theftp[t]) != THEFT_READY) {
39 poll(NULL, 0, 1); Lines 51–63 show balance_count(), which is similar
40 if (READ_ONCE(*theftp[t]) == THEFT_REQ)
41 pthread_kill(tid, SIGUSR1); to that of earlier examples.
42 } Listing 5.19 shows the add_count() function. The
43 globalcount += *counterp[t];
44 *counterp[t] = 0; fastpath spans lines 5–20, and the slowpath lines 21–35.
45 globalreserve -= *countermaxp[t]; Line 5 sets the per-thread counting variable to 1 so that
46 *countermaxp[t] = 0;
47 WRITE_ONCE(*theftp[t], THEFT_IDLE); any subsequent signal handlers interrupting this thread will
48 } set the theft state to ACK rather than READY, allowing
49 }
50
this fastpath to complete properly. Line 6 prevents the
51 static void balance_count(void) compiler from reordering any of the fastpath body to
52 {
53 countermax = globalcountmax - globalcount - precede the setting of counting. Lines 7 and 8 check
54 globalreserve; to see if the per-thread data can accommodate the add_
55 countermax /= num_online_threads();
56 if (countermax > MAX_COUNTERMAX) count() and if there is no ongoing theft in progress, and
57 countermax = MAX_COUNTERMAX; if so line 9 does the fastpath addition and line 10 notes
58 globalreserve += countermax;
59 counter = countermax / 2; that the fastpath was taken.
60 if (counter > globalcount) In either case, line 12 prevents the compiler from
61 counter = globalcount;
62 globalcount -= counter; reordering the fastpath body to follow line 13, which
63 } permits any subsequent signal handlers to undertake theft.
Line 14 again disables compiler reordering, and then
line 15 checks to see if the signal handler deferred the
theft state-change to READY, and, if so, line 16 executes


Listing 5.20: Signal-Theft Limit Counter Subtract Function Listing 5.22: Signal-Theft Limit Counter Initialization Func-
1 int sub_count(unsigned long delta) tions
2 { 1 void count_init(void)
3 int fastpath = 0; 2 {
4 3 struct sigaction sa;
5 WRITE_ONCE(counting, 1); 4
6 barrier(); 5 sa.sa_handler = flush_local_count_sig;
7 if (READ_ONCE(theft) <= THEFT_REQ && 6 sigemptyset(&sa.sa_mask);
8 counter >= delta) { 7 sa.sa_flags = 0;
9 WRITE_ONCE(counter, counter - delta); 8 if (sigaction(SIGUSR1, &sa, NULL) != 0) {
10 fastpath = 1; 9 perror("sigaction");
11 } 10 exit(EXIT_FAILURE);
12 barrier(); 11 }
13 WRITE_ONCE(counting, 0); 12 }
14 barrier(); 13
15 if (READ_ONCE(theft) == THEFT_ACK) { 14 void count_register_thread(void)
16 smp_mb(); 15 {
17 WRITE_ONCE(theft, THEFT_READY); 16 int idx = smp_thread_id();
18 } 17
19 if (fastpath) 18 spin_lock(&gblcnt_mutex);
20 return 1; 19 counterp[idx] = &counter;
21 spin_lock(&gblcnt_mutex); 20 countermaxp[idx] = &countermax;
22 globalize_count(); 21 theftp[idx] = &theft;
23 if (globalcount < delta) { 22 spin_unlock(&gblcnt_mutex);
24 flush_local_count(); 23 }
25 if (globalcount < delta) { 24
26 spin_unlock(&gblcnt_mutex); 25 void count_unregister_thread(int nthreadsexpected)
27 return 0; 26 {
28 } 27 int idx = smp_thread_id();
29 } 28
30 globalcount -= delta; 29 spin_lock(&gblcnt_mutex);
31 balance_count(); 30 globalize_count();
32 spin_unlock(&gblcnt_mutex); 31 counterp[idx] = NULL;
33 return 1; 32 countermaxp[idx] = NULL;
34 } 33 theftp[idx] = NULL;
34 spin_unlock(&gblcnt_mutex);
35 }
Listing 5.21: Signal-Theft Limit Counter Read Function
1 unsigned long read_count(void)
2 { Lines 1–12 of Listing 5.22 show count_init(), which
3 int t;
4 unsigned long sum; set up flush_local_count_sig() as the signal han-
5
dler for SIGUSR1, enabling the pthread_kill() calls
6 spin_lock(&gblcnt_mutex);
7 sum = globalcount; in flush_local_count() to invoke flush_local_
8 for_each_thread(t) count_sig(). The code for thread registry and unregistry
9 if (counterp[t] != NULL)
10 sum += READ_ONCE(*counterp[t]); is similar to that of earlier examples, so its analysis is left
11 spin_unlock(&gblcnt_mutex); as an exercise for the reader.
12 return sum;
13 }

5.4.5 Signal-Theft Limit Counter Discus-


sion
a memory barrier to ensure that any CPU that sees line 17
setting state to READY also sees the effects of line 9. If The signal-theft implementation runs more than eight
the fastpath addition at line 9 was executed, then line 20 times as fast as the atomic implementation on my six-core
returns success. x86 laptop. Is it always preferable?
Otherwise, we fall through to the slowpath starting at The signal-theft implementation would be vastly prefer-
line 21. The structure of the slowpath is similar to those able on Pentium-4 systems, given their slow atomic in-
of earlier examples, so its analysis is left as an exercise structions, but the old 80386-based Sequent Symmetry
to the reader. Similarly, the structure of sub_count() systems would do much better with the shorter path length
on Listing 5.20 is the same as that of add_count(), so of the atomic implementation. However, this increased
the analysis of sub_count() is also left as an exercise update-side performance comes at the price of higher
for the reader, as is the analysis of read_count() in read-side overhead: Those POSIX signals are not free. If
Listing 5.21. ultimate performance is of the essence, you will need to


measure them both on the system that your application is to be deployed on.

Quick Quiz 5.55: Not only are POSIX signals slow, sending one to each thread simply does
not scale. What would you do if you had (say) 10,000 threads and needed the read side to
be fast?

This is but one reason why high-quality APIs are so important: They permit
implementations to be changed as required by ever-changing hardware performance
characteristics.

Quick Quiz 5.56: What if you want an exact limit counter to be exact only for its lower
limit, but to allow the upper limit to be inexact?

5.4.6 Applying Exact Limit Counters

Although the exact limit counter implementations presented in this section can be very
useful, they are not much help if the counter’s value remains near zero at all times, as
it might when counting the number of outstanding accesses to an I/O device. The high
overhead of such near-zero counting is especially painful given that we normally don’t
care how many references there are. As noted in the removable I/O device access-count
problem posed by Quick Quiz 5.5, the number of accesses is irrelevant except in those
rare cases when someone is actually trying to remove the device.

One simple solution to this problem is to add a large “bias” (for example, one billion)
to the counter in order to ensure that the value is far enough from zero that the counter
can operate efficiently. When someone wants to remove the device, this bias is subtracted
from the counter value. Counting the last few accesses will be quite inefficient, but the
important point is that the many prior accesses will have been counted at full speed.

Quick Quiz 5.57: What else had you better have done when using a biased counter?
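To make the bias concrete, here is a minimal sketch, assuming the add_count() and
sub_count() exact limit-counter primitives from the preceding sections. The wrapper name
and the use of abort() on failure are illustrative only; the bias value corresponds to
the mybias used in the device-removal code shown later in this section.

#include <stdlib.h>

/* Exact limit-counter API from the preceding sections. */
extern int add_count(unsigned long delta);
extern int sub_count(unsigned long delta);

static const unsigned long mybias = 1000 * 1000 * 1000;  /* one billion */

/* Hypothetical hook invoked when the I/O device is created. */
static void device_count_init(void)
{
	if (!add_count(mybias))   /* keep the counter well away from zero */
		abort();          /* limit set too small for the chosen bias */
}

The matching sub_count(mybias) appears in the device-removal sequence shown below, at
which point the counter once again reflects only the I/O operations still in flight.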
Although a biased counter can be quite helpful and useful, it is only a partial solution
to the removable I/O device access-count problem called out on page 49. When attempting
to remove a device, we must not only know the precise number of current I/O accesses, we
also need to prevent any future accesses from starting. One way to accomplish this is to
read-acquire a reader-writer lock when updating the counter, and to write-acquire that
same reader-writer lock when checking the counter. Code for doing I/O might be as
follows:

 1 read_lock(&mylock);
 2 if (removing) {
 3   read_unlock(&mylock);
 4   cancel_io();
 5 } else {
 6   add_count(1);
 7   read_unlock(&mylock);
 8   do_io();
 9   sub_count(1);
10 }

Line 1 read-acquires the lock, and either line 3 or 7 releases it. Line 2 checks to see
if the device is being removed, and, if so, line 3 releases the lock and line 4 cancels
the I/O, or takes whatever action is appropriate given that the device is to be removed.
Otherwise, line 6 increments the access count, line 7 releases the lock, line 8 performs
the I/O, and line 9 decrements the access count.

Quick Quiz 5.58: This is ridiculous! We are read-acquiring a reader-writer lock to update
the counter? What are you playing at???

The code to remove the device might be as follows:

 1 write_lock(&mylock);
 2 removing = 1;
 3 sub_count(mybias);
 4 write_unlock(&mylock);
 5 while (read_count() != 0) {
 6   poll(NULL, 0, 1);
 7 }
 8 remove_device();

Line 1 write-acquires the lock and line 4 releases it. Line 2 notes that the device is
being removed, and the loop spanning lines 5–7 waits for any I/O operations to complete.
Finally, line 8 does any additional processing needed to prepare for device removal.

Quick Quiz 5.59: What other issues would need to be accounted for in a real system?

5.5 Parallel Counting Discussion

This idea that there is generality in the specific is of far-reaching importance.

Douglas R. Hofstadter

This chapter has presented the reliability, performance, and scalability problems with
traditional counting primitives. The C-language ++ operator is not guaranteed to function
reliably in multithreaded code, and atomic operations to a
single variable neither perform nor scale well. This chapter therefore presented a number
of counting algorithms that perform and scale extremely well in certain special cases.

It is well worth reviewing the lessons from these counting algorithms. To that end,
Section 5.5.1 overviews requisite validation, Section 5.5.2 summarizes performance and
scalability, Section 5.5.3 discusses the need for specialization, and finally,
Section 5.5.4 enumerates lessons learned and calls attention to later chapters that will
expand on these lessons.

5.5.1 Parallel Counting Validation

Many of the algorithms in this section are quite simple, so much so that it is tempting
to declare them to be correct by construction or by inspection. Unfortunately, it is all
too easy for those carrying out the construction or the inspection to become
overconfident, tired, confused, or just plain sloppy, all of which can result in bugs.
And early implementations of these limit counters have in fact contained bugs, in some
cases aided and abetted by the complexities inherent in maintaining a 64-bit count on a
32-bit system. Therefore, validation is not optional, even for the simple algorithms
presented in this chapter.

The statistical counters are tested for acting like counters (“counttorture.h”), that
is, that the aggregate sum in the counter changes by the sum of the amounts added by the
various update-side threads.

The limit counters are also tested for acting like counters (“limtorture.h”), and
additionally checked for their ability to accommodate the specified limit.

Both of these test suites produce performance data that is used in Section 5.5.2.
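As an illustration of this kind of aggregate-sum check, here is a minimal sketch, not the
actual counttorture.h harness: several updaters each contribute a known number of
increments, and the result is then compared against read_count(). The inc_count() and
read_count() interfaces are those of the statistical counters from earlier in this
chapter; the thread and update counts are arbitrary, and any per-thread registration that
a given implementation requires (for example, count_register_thread()) is omitted for
brevity.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

extern void inc_count(void);             /* statistical-counter update */
extern unsigned long read_count(void);   /* statistical-counter read   */

#define NTHREADS 4
#define NUPDATES 1000000UL

static void *updater(void *arg)
{
	unsigned long i;

	for (i = 0; i < NUPDATES; i++)
		inc_count();
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		if (pthread_create(&tid[i], NULL, updater, NULL) != 0)
			abort();
	for (i = 0; i < NTHREADS; i++)
		if (pthread_join(tid[i], NULL) != 0)
			abort();

	/* The aggregate must equal the sum of all per-thread updates. */
	if (read_count() != NTHREADS * NUPDATES) {
		fprintf(stderr, "counter mismatch: %lu\n", read_count());
		exit(EXIT_FAILURE);
	}
	printf("counter OK: %lu\n", read_count());
	return 0;
}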
Although this level of validation is good and sufficient for textbook implementations
such as these, it would be wise to apply additional validation before putting similar
algorithms into production. Chapter 11 describes additional approaches to testing, and
given the simplicity of most of these counting algorithms, most of the techniques
described in Chapter 12 can also be quite helpful.

5.5.2 Parallel Counting Performance

The top half of Table 5.1 shows the performance of the four parallel statistical counting
algorithms. All four algorithms provide near-perfect linear scalability for updates. The
per-thread-variable implementation (count_end.c) is significantly faster on updates than
the array-based implementation (count_stat.c), but is slower at reads on large numbers of
cores, and suffers severe lock contention when there are many parallel readers. This
contention can be addressed using the deferred-processing techniques introduced in
Chapter 9, as shown on the count_end_rcu.c row of Table 5.1. Deferred processing also
shines on the count_stat_eventual.c row, courtesy of eventual consistency.

Quick Quiz 5.60: On the count_stat.c row of Table 5.1, we see that the read-side scales
linearly with the number of threads. How is that possible given that the more threads
there are, the more per-thread counters must be summed up?

Quick Quiz 5.61: Even on the fourth row of Table 5.1, the read-side performance of these
statistical counter implementations is pretty horrible. So why bother with them?

The bottom half of Table 5.1 shows the performance of the parallel limit-counting
algorithms. Exact enforcement of the limits incurs a substantial update-side performance
penalty, although on this x86 system that penalty can be reduced by substituting signals
for atomic operations. All of these implementations suffer from read-side lock contention
in the face of concurrent readers.

Quick Quiz 5.62: Given the performance data shown in the bottom half of Table 5.1, we
should always prefer signals over atomic operations, right?

Quick Quiz 5.63: Can advanced techniques be applied to address the lock contention for
readers seen in the bottom half of Table 5.1?

In short, this chapter has demonstrated a number of counting algorithms that perform and
scale extremely well in a number of special cases. But must our parallel counting be
confined to special cases? Wouldn’t it be better to have a general algorithm that
operated efficiently in all cases? The next section looks at these questions.

5.5.3 Parallel Counting Specializations

The fact that these algorithms only work well in their respective special cases might be
considered a major problem with parallel programming in general. After all, the
C-language ++ operator works just fine in single-threaded code, and not just for special
cases, but in general, right?

This line of reasoning does contain a grain of truth, but is in essence misguided. The
problem is not parallelism as such, but rather scalability. To understand this, first
consider the C-language ++ operator. The fact is that it
Table 5.1: Statistical/Limit Counter Performance on x86

                                                         Reads (ns)
  Algorithm                        Updates   --------------------------------------
  (count_*.c)     Section  Exact?     (ns)   1 CPU   8 CPUs   64 CPUs   420 CPUs
  stat              5.2.2              6.3     294      303       315        612
  stat_eventual     5.2.4              6.4       1        1         1          1
  end               5.2.3              2.9     301    6,309   147,594    239,683
  end_rcu          13.5.1              2.9     454      481       508      2,317
  lim               5.3.2       N      3.2     435    6,678   156,175    239,422
  lim_app           5.3.4       N      2.4     485    7,041   173,108    239,682
  lim_atomic        5.4.1       Y     19.7     513    7,085   199,957    239,450
  lim_sig           5.4.4       Y      4.7     519    6,805   120,000    238,811

does not work in general, only for a restricted range of numbers. If you need to deal
with 1,000-digit decimal numbers, the C-language ++ operator will not work for you.

Quick Quiz 5.64: The ++ operator works just fine for 1,000-digit numbers! Haven’t you
heard of operator overloading???

This problem is not specific to arithmetic. Suppose you need to store and query data.
Should you use an ASCII file? XML? A relational database? A linked list? A dense array?
A B-tree? A radix tree? Or one of the plethora of other data structures and environments
that permit data to be stored and queried? It depends on what you need to do, how fast
you need it done, and how large your data set is—even on sequential systems.

Similarly, if you need to count, your solution will depend on how large of numbers you
need to work with, how many CPUs need to be manipulating a given number concurrently, how
the number is to be used, and what level of performance and scalability you will need.

Nor is this problem specific to software. The design for a bridge meant to allow people
to walk across a small brook might be as simple as a single wooden plank. But you would
probably not use a plank to span the kilometers-wide mouth of the Columbia River, nor
would such a design be advisable for bridges carrying concrete trucks. In short, just as
bridge design must change with increasing span and load, so must software design change
as the number of CPUs increases. That said, it would be good to automate this process, so
that the software adapts to changes in hardware configuration and in workload. There has
in fact been some research into this sort of automation [AHS+03, SAH+03], and the Linux
kernel does some boot-time reconfiguration, including limited binary rewriting. This sort
of adaptation will become increasingly important as the number of CPUs on mainstream
systems continues to increase.

In short, as discussed in Chapter 3, the laws of physics constrain parallel software just
as surely as they constrain mechanical artifacts such as bridges. These constraints force
specialization, though in the case of software it might be possible to automate the
choice of specialization to fit the hardware and workload in question.

Of course, even generalized counting is quite specialized. We need to do a great number
of other things with computers. The next section relates what we have learned from
counters to topics taken up later in this book.

5.5.4 Parallel Counting Lessons

The opening paragraph of this chapter promised that our study of counting would provide
an excellent introduction to parallel programming. This section makes explicit
connections between the lessons from this chapter and the material presented in a number
of later chapters.

The examples in this chapter have shown that an important scalability and performance
tool is partitioning. The counters might be fully partitioned, as in the statistical
counters discussed in Section 5.2, or partially partitioned as in the limit counters
discussed in Sections 5.3 and 5.4. Partitioning will be considered in far greater depth
in Chapter 6, and partial parallelization in particular in Section 6.4, where it is
called parallel fastpath.

Quick Quiz 5.65: But if we are going to have to partition everything, why bother with
shared-memory multithreading? Why not just partition the problem completely and run as
multiple processes, each in its own address space?
The partially partitioned counting algorithms used locking to guard the global data, and
locking is the subject of Chapter 7. In contrast, the partitioned data tended to be fully
under the control of the corresponding thread, so that no synchronization whatsoever was
required. This data ownership will be introduced in Section 6.3.4 and discussed in more
detail in Chapter 8.

Because integer addition and subtraction are extremely cheap compared to typical
synchronization operations, achieving reasonable scalability requires synchronization
operations be used sparingly. One way of achieving this is to batch the addition and
subtraction operations, so that a great many of these cheap operations are handled by a
single synchronization operation. Batching optimizations of one sort or another are used
by each of the counting algorithms listed in Table 5.1.
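A rough sketch of this batching idea, not any particular listing from this chapter: each
thread accumulates updates locally and takes the global lock only once per batch. The
batch size and the names are arbitrary assumptions for this example.

#include <pthread.h>

#define BATCH_SIZE 1000

static unsigned long global_count;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread unsigned long local_count;   /* per-thread batch */

/* Count one event, paying for synchronization once per BATCH_SIZE events. */
static inline void count_event(void)
{
	if (++local_count >= BATCH_SIZE) {
		pthread_mutex_lock(&global_lock);
		global_count += local_count;
		pthread_mutex_unlock(&global_lock);
		local_count = 0;
	}
}

Readers of global_count must of course tolerate an error of up to BATCH_SIZE per thread,
which is the same sort of tradeoff made by the per-thread countermax in the limit
counters earlier in this chapter.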
Finally, the eventually consistent statistical counter discussed in Section 5.2.4 showed
how deferring activity (in that case, updating the global counter) can provide
substantial performance and scalability benefits. This approach allows common case code
to use much cheaper synchronization operations than would otherwise be possible.
Chapter 9 will examine a number of additional ways that deferral can improve performance,
scalability, and even real-time response.

Summarizing the summary:

1. Partitioning promotes performance and scalability.

2. Partial partitioning, that is, partitioning applied only to common code paths, works
   almost as well.

3. Partial partitioning can be applied to code (as in Section 5.2’s statistical counters’
   partitioned updates and non-partitioned reads), but also across time (as in
   Section 5.3’s and Section 5.4’s limit counters running fast when far from the limit,
   but slowly when close to the limit).

4. Partitioning across time often batches updates locally in order to reduce the number
   of expensive global operations, thereby decreasing synchronization overhead, in turn
   improving performance and scalability. All the algorithms shown in Table 5.1 make
   heavy use of batching.

5. Read-only code paths should remain read-only: Spurious synchronization writes to
   shared memory kill performance and scalability, as seen in the count_end.c row of
   Table 5.1.

6. Judicious use of delay promotes performance and scalability, as seen in Section 5.2.4.

7. Parallel performance and scalability is usually a balancing act: Beyond a certain
   point, optimizing some code paths will degrade others. The count_stat.c and
   count_end_rcu.c rows of Table 5.1 illustrate this point.

8. Different levels of performance and scalability will affect algorithm and
   data-structure design, as do a large number of other factors. Figure 5.1 illustrates
   this point: Atomic increment might be completely acceptable for a two-CPU system, but
   nevertheless be completely inadequate for an eight-CPU system.

Figure 5.8: Optimization and the Four Parallel-Programming Tasks (diagram relating the
“Work Partitioning”, “Parallel Access Control”, “Resource Partitioning and Replication”,
and “Interacting With Hardware” tasks to the batch, weaken, and partition optimizations)

Summarizing still further, we have the “big three” methods of increasing performance and
scalability, namely (1) partitioning over CPUs or threads, (2) batching so that more work
can be done by each expensive synchronization operation, and (3) weakening
synchronization operations where feasible. As a rough rule of thumb, you should apply
these methods in this order, as was noted earlier in the discussion of Figure 2.6 on
page 15. The partitioning optimization applies to the “Resource Partitioning and
Replication” bubble, the batching optimization to the “Work Partitioning” bubble, and the
weakening optimization to the “Parallel Access Control” bubble, as shown in Figure 5.8.
Of course, if you are using special-purpose hardware such as digital signal processors
(DSPs), field-programmable gate arrays (FPGAs), or general-purpose graphical processing
units (GPGPUs), you may need to pay close attention to the “Interacting With Hardware”
bubble throughout the design process. For example, the structure of a GPGPU’s hardware
threads and memory
connectivity might richly reward very careful partitioning and batching design decisions.

In short, as noted at the beginning of this chapter, the simplicity of counting has
allowed us to explore many fundamental concurrency issues without the distraction of
complex synchronization primitives or elaborate data structures. Such synchronization
primitives and data structures are covered in later chapters.
Chapter 6

Partitioning and Synchronization Design

Divide and rule.

Philip II of Macedon

This chapter describes how to design software to take advantage of modern commodity
multicore systems by using idioms, or “design patterns” [Ale79, GHJV95, SSRB00], to
balance performance, scalability, and response time. Correctly partitioned problems lead
to simple, scalable, and high-performance solutions, while poorly partitioned problems
result in slow and complex solutions. This chapter will help you design partitioning into
your code, with some discussion of batching and weakening as well. The word “design” is
very important: You should partition first, batch second, weaken third, and code fourth.
Changing this order often leads to poor performance and scalability along with great
frustration.1

Footnote 1: That other great dodge around the Laws of Physics, read-only replication, is
covered in Chapter 9.

To this end, Section 6.1 presents partitioning exercises, Section 6.2 reviews
partitionability design criteria, Section 6.3 discusses synchronization granularity
selection, Section 6.4 overviews important parallel-fastpath design patterns that provide
speed and scalability on common-case fastpaths while using simpler less-scalable
“slow path” fallbacks for unusual situations, and finally Section 6.5 takes a brief look
beyond partitioning.

6.1 Partitioning Exercises

Whenever a theory appears to you as the only possible one, take this as a sign that you
have neither understood the theory nor the problem which it was intended to solve.

Karl Popper

Although partitioning is more widely understood than it was in the early 2000s, its value
is still underappreciated. Section 6.1.1 therefore takes a more highly parallel look at
the classic Dining Philosophers problem and Section 6.1.2 revisits the double-ended
queue.

6.1.1 Dining Philosophers Problem

Figure 6.1 shows a diagram of the classic Dining Philosophers problem [Dij71]. This
problem features five philosophers who do nothing but think and eat a “very difficult
kind of spaghetti” which requires two forks to eat.2 A given philosopher is permitted to
use only the forks to his or her immediate right and left, but will not put a given fork
down until sated.

Footnote 2: But feel free to instead think in terms of chopsticks.

Figure 6.1: Dining Philosophers Problem

The object is to construct an algorithm that, quite literally, prevents starvation. One
starvation scenario would be if all of the philosophers picked up their leftmost forks
simultaneously. Because none of them will put down

their fork until after they finished eating, and because none of them may pick up their
second fork until at least one of them has finished eating, they all starve. Please note
that it is not sufficient to allow at least one philosopher to eat. As Figure 6.2 shows,
starvation of even a few of the philosophers is to be avoided.

Figure 6.2: Partial Starvation Is Also Bad

Dijkstra’s solution used a global semaphore, which works fine assuming negligible
communications delays, an assumption that became invalid in the late 1980s or early
1990s.3 More recent solutions number the forks as shown in Figure 6.3. Each philosopher
picks up the lowest-numbered fork next to his or her plate, then picks up the other fork.
The philosopher sitting in the uppermost position in the diagram thus picks up the
leftmost fork first, then the rightmost fork, while the rest of the philosophers instead
pick up their rightmost fork first. Because two of the philosophers will attempt to pick
up fork 1 first, and because only one of those two philosophers will succeed, there will
be five forks available to four philosophers. At least one of these four will have two
forks, and will thus be able to eat.

Footnote 3: It is all too easy to denigrate Dijkstra from the viewpoint of the year 2021,
more than 50 years after the fact. If you still feel the need to denigrate Dijkstra, my
advice is to publish something, wait 50 years, and then see how well your ideas stood the
test of time.

Figure 6.3: Dining Philosophers Problem, Textbook Solution (the five forks are numbered 1
through 5 around the table)

This general technique of numbering resources and acquiring them in numerical order is
heavily used as a deadlock-prevention technique. However, it is easy to imagine a
sequence of events that will result in only one philosopher eating at a time even though
all are hungry:

1. P2 picks up fork 1, preventing P1 from taking a fork.

2. P3 picks up fork 2.

3. P4 picks up fork 3.

4. P5 picks up fork 4.

5. P5 picks up fork 5 and eats.

6. P5 puts down forks 4 and 5.

7. P4 picks up fork 4 and eats.

In short, this algorithm can result in only one philosopher eating at a given time, even
when all five philosophers are hungry, despite the fact that there are more than enough
forks for two philosophers to eat concurrently. It should be possible to do better than
this!
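The resource-numbering rule relied on above is the same lock-ordering discipline
routinely applied in C code. A minimal sketch, with names invented for this example,
acquires any pair of locks in a fixed global order (here, by address):

#include <pthread.h>

/* Acquire two "forks" (locks) in a fixed global order to prevent
 * deadlock: the lower-addressed lock is always taken first. */
static void acquire_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if (a > b) {
		pthread_mutex_t *tmp = a;

		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

static void release_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	pthread_mutex_unlock(b);
}

Because every thread acquires its pair in the same global order, no cycle of waiters can
form, although, as the scenario above shows, freedom from deadlock by itself says nothing
about how much concurrency is actually achieved.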
One approach is shown in Figure 6.4, which includes four philosophers rather than five to
better illustrate the

partition technique. Here the upper and rightmost philosophers share a pair of forks,
while the lower and leftmost philosophers share another pair of forks. If all
philosophers are simultaneously hungry, at least two will always be able to eat
concurrently. In addition, as shown in the figure, the forks can now be bundled so that
the pair are picked up and put down simultaneously, simplifying the acquisition and
release algorithms.

Figure 6.4: Dining Philosophers Problem, Partitioned

Quick Quiz 6.1: Is there a better solution to the Dining Philosophers Problem?

Quick Quiz 6.2: How would you validate an algorithm alleged to solve the Dining
Philosophers Problem?

This is an example of “horizontal parallelism” [Inm85] or “data parallelism”, so named
because there is no dependency among the pairs of philosophers. In a horizontally
parallel data-processing system, a given item of data would be processed by only one of a
replicated set of software components.

Quick Quiz 6.3: And in just what sense can this “horizontal parallelism” be said to be
“horizontal”?

6.1.2 Double-Ended Queue

A double-ended queue is a data structure containing a list of elements that may be
inserted or removed from either end [Knu73]. It has been claimed that a lock-based
implementation permitting concurrent operations on both ends of the double-ended queue is
difficult [Gro07]. This section shows how a partitioning design strategy can result in a
reasonably simple implementation, looking at three general approaches in the following
sections. But first, how should we validate a concurrent double-ended queue?

6.1.2.1 Double-Ended Queue Validation

A good place to start is with invariants. For example, if elements are pushed onto one
end of a double-ended queue and popped off of the other, the order of those elements must
be preserved. Similarly, if elements are pushed onto one end of the queue and popped off
of that same end, the order of those elements must be reversed. Any element popped from
the queue must have been most recently pushed onto that queue, and if the queue is
emptied, all elements pushed onto it must have already been popped from it.

The beginnings of a test suite for concurrent double-ended queues (“deqtorture.h”)
provide the following checks, with a sketch of the first invariant shown after this list:

1. Element-ordering checks provided by CHECK_SEQUENCE_PAIR().

2. Checks that elements popped were most recently pushed, provided by melee().

3. Checks that elements pushed are popped before the queue is emptied, also provided by
   melee().
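Here is a minimal sketch of the first of these invariants, written against a hypothetical
simplified integer interface rather than the list-based interfaces developed in the
following sections: elements pushed onto the right-hand end must emerge from the
left-hand end in the same order.

#include <assert.h>

/* Hypothetical simplified interface, for illustration only. */
extern void deq_push_r(int value);
extern int deq_pop_l(void);   /* returns -1 if the queue is empty */

/* Push n elements on the right, pop them from the left, and
 * verify that they emerge in the same order. */
static void check_fifo_order(int n)
{
	int i;

	for (i = 0; i < n; i++)
		deq_push_r(i);
	for (i = 0; i < n; i++)
		assert(deq_pop_l() == i);
	assert(deq_pop_l() == -1);   /* queue must now be empty */
}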
This suite includes both sequential and concurrent tests. Although this suite is good and
sufficient for textbook code, you should test considerably more thoroughly for code
intended for production use. Chapters 11 and 12 cover a large array of validation tools
and techniques.

But with a prototype test suite in place, we are ready to look at the double-ended-queue
algorithms in the next sections.

6.1.2.2 Left- and Right-Hand Locks

One seemingly straightforward approach would be to use a doubly linked list with a
left-hand lock for left-hand-end enqueue and dequeue operations along with a right-hand
lock for right-hand-end operations, as shown in Figure 6.5. However, the problem with
this approach is that the two locks’ domains must overlap when there are fewer than four
elements on the list. This overlap is due to the fact that removing any given element
affects not only that element, but also its left- and right-hand neighbors. These domains
are indicated by color in the figure, with blue with
v2022.09.25a
76 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

Lock L Lock R
same double-ended queue, as we can unconditionally left-
enqueue elements to the left-hand queue and right-enqueue
Header L Header R
elements to the right-hand queue. The main complication
arises when dequeuing from an empty queue, in which
Lock L Lock R
case it is necessary to:
Header L 0 Header R
1. If holding the right-hand lock, release it and acquire
the left-hand lock.
Lock L Lock R
2. Acquire the right-hand lock.
Header L 0 1 Header R

3. Rebalance the elements across the two queues.

Lock L Lock R 4. Remove the required element if there is one.


Header L 0 1 2 Header R
5. Release both locks.

Lock L Lock R Quick Quiz 6.4: In this compound double-ended queue


implementation, what should be done if the queue has become
Header L 0 1 2 3 Header R
non-empty while releasing and reacquiring the lock?
Figure 6.5: Double-Ended Queue With Left- and Right-
The resulting code (locktdeq.c) is quite straightfor-
Hand Locks
ward. The rebalancing operation might well shuttle a given
element back and forth between the two queues, wasting
Lock L Lock R time and possibly requiring workload-dependent heuris-
tics to obtain optimal performance. Although this might
DEQ L DEQ R
well be the best approach in some cases, it is interesting
Figure 6.6: Compound Double-Ended Queue to try for an algorithm with greater determinism.

6.1.2.4 Hashed Double-Ended Queue


downward stripes indicating the domain of the left-hand
lock, red with upward stripes indicating the domain of the One of the simplest and most effective ways to determinis-
right-hand lock, and purple (with no stripes) indicating tically partition a data structure is to hash it. It is possible
overlapping domains. Although it is possible to create to trivially hash a double-ended queue by assigning each
an algorithm that works this way, the fact that it has no element a sequence number based on its position in the list,
fewer than five special cases should raise a big red flag, so that the first element left-enqueued into an empty queue
especially given that concurrent activity at the other end of is numbered zero and the first element right-enqueued
the list can shift the queue from one special case to another into an empty queue is numbered one. A series of ele-
at any time. It is far better to consider other designs. ments left-enqueued into an otherwise-idle queue would
be assigned decreasing numbers (−1, −2, −3, . . .), while
a series of elements right-enqueued into an otherwise-idle
6.1.2.3 Compound Double-Ended Queue
queue would be assigned increasing numbers (2, 3, 4, . . .).
One way of forcing non-overlapping lock domains is A key point is that it is not necessary to actually represent
shown in Figure 6.6. Two separate double-ended queues a given element’s number, as this number will be implied
are run in tandem, each protected by its own lock. This by its position in the queue.
means that elements must occasionally be shuttled from Given this approach, we assign one lock to guard the
one of the double-ended queues to the other, in which case left-hand index, one to guard the right-hand index, and one
both locks must be held. A simple lock hierarchy may lock for each hash chain. Figure 6.7 shows the resulting
be used to avoid deadlock, for example, always acquiring data structure given four hash chains. Note that the lock
the left-hand lock before acquiring the right-hand lock. domains do not overlap, and that deadlock is avoided by
This will be much simpler than applying two locks to the acquiring the index locks before the chain locks, and by

v2022.09.25a
6.1. PARTITIONING EXERCISES 77

R4 R5 R6 R7

L0 R1 R2 R3
DEQ 0 DEQ 1 DEQ 2 DEQ 3
L−4 L−3 L−2 L−1
Lock 0 Lock 1 Lock 2 Lock 3
L−8 L−7 L−6 L−5

Index L Index R Figure 6.9: Hashed Double-Ended Queue With 16 Ele-


Lock L Lock R ments

Figure 6.7: Hashed Double-Ended Queue Listing 6.1: Lock-Based Parallel Double-Ended Queue Data
Structure
1 struct pdeq {
2 spinlock_t llock;
3 int lidx;
R1 4 spinlock_t rlock;
5 int ridx;
6 struct deq bkt[PDEQ_N_BKTS];
7 };

DEQ 0 DEQ 1 DEQ 2 DEQ 3

never acquiring more than one lock of a given type (index


Index L Index R
or chain) at a time.
Each hash chain is itself a double-ended queue, and
Enq 3R
in this example, each holds every fourth element. The
uppermost portion of Figure 6.8 shows the state after a
single element (“R1 ”) has been right-enqueued, with the
R4 R1 R2 R3 right-hand index having been incremented to reference
hash chain 2. The middle portion of this same figure
shows the state after three more elements have been
right-enqueued. As you can see, the indexes are back to
DEQ 0 DEQ 1 DEQ 2 DEQ 3
their initial states (see Figure 6.7), however, each hash
chain is now non-empty. The lower portion of this figure
Index L Index R shows the state after three additional elements have been
left-enqueued and an additional element has been right-
Enq 3L1R
enqueued.
From the last state shown in Figure 6.8, a left-dequeue
operation would return element “L−2 ” and leave the left-
R4 R5 R2 R3 hand index referencing hash chain 2, which would then
contain only a single element (“R2 ”). In this state, a
left-enqueue running concurrently with a right-enqueue
would result in lock contention, but the probability of
L0 R1 L −2 L −1
such contention can be reduced to arbitrarily low levels
by using a larger hash table.
Figure 6.9 shows how 16 elements would be organized
DEQ 0 DEQ 1 DEQ 2 DEQ 3 in a four-hash-bucket parallel double-ended queue. Each
underlying single-lock double-ended queue holds a one-
Index L Index R quarter slice of the full parallel double-ended queue.
Listing 6.1 shows the corresponding C-language data
Figure 6.8: Hashed Double-Ended Queue After Inser- structure, assuming an existing struct deq that provides
tions a trivially locked double-ended-queue implementation.
This data structure contains the left-hand lock on line 2,

v2022.09.25a
78 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

Listing 6.2: Lock-Based Parallel Double-Ended Queue Imple- Lines 1–13 show pdeq_pop_l(), which left-dequeues
mentation and returns an element if possible, returning NULL other-
1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)
2 { wise. Line 6 acquires the left-hand spinlock, and line 7
3 struct cds_list_head *e; computes the index to be dequeued from. Line 8 dequeues
4 int i;
5
the element, and, if line 9 finds the result to be non-NULL,
6 spin_lock(&d->llock); line 10 records the new left-hand index. Either way, line 11
7 i = moveright(d->lidx);
8 e = deq_pop_l(&d->bkt[i]); releases the lock, and, finally, line 12 returns the element
9 if (e != NULL) if there was one, or NULL otherwise.
10 d->lidx = i;
11 spin_unlock(&d->llock); Lines 29–38 show pdeq_push_l(), which left-
12 return e; enqueues the specified element. Line 33 acquires the
13 }
14
left-hand lock, and line 34 picks up the left-hand in-
15 struct cds_list_head *pdeq_pop_r(struct pdeq *d) dex. Line 35 left-enqueues the specified element onto
16 {
17 struct cds_list_head *e; the double-ended queue indexed by the left-hand index.
18 int i; Line 36 then updates the left-hand index and line 37
19
20 spin_lock(&d->rlock); releases the lock.
21 i = moveleft(d->ridx); As noted earlier, the right-hand operations are com-
22 e = deq_pop_r(&d->bkt[i]);
23 if (e != NULL) pletely analogous to their left-handed counterparts, so
24 d->ridx = i; their analysis is left as an exercise for the reader.
25 spin_unlock(&d->rlock);
26 return e; Quick Quiz 6.5: Is the hashed double-ended queue a good
27 }
28
solution? Why or why not?
29 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
30 {
31 int i;
32 6.1.2.5 Compound Double-Ended Queue Revisited
33 spin_lock(&d->llock);
34 i = d->lidx; This section revisits the compound double-ended queue,
35 deq_push_l(e, &d->bkt[i]);
36 d->lidx = moveleft(d->lidx); using a trivial rebalancing scheme that moves all the
37 spin_unlock(&d->llock); elements from the non-empty queue to the now-empty
38 }
39 queue.
40 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
41 { Quick Quiz 6.6: Move all the elements to the queue that
42 int i; became empty? In what possible universe is this brain-dead
43
44 spin_lock(&d->rlock); solution in any way optimal???
45 i = d->ridx;
46 deq_push_r(e, &d->bkt[i]); In contrast to the hashed implementation presented in
47 d->ridx = moveright(d->ridx);
48 spin_unlock(&d->rlock); the previous section, the compound implementation will
49 } build on a sequential implementation of a double-ended
queue that uses neither locks nor atomic operations.
Listing 6.3 shows the implementation. Unlike the
the left-hand index on line 3, the right-hand lock on line 4 hashed implementation, this compound implementation is
(which is cache-aligned in the actual implementation), the asymmetric, so that we must consider the pdeq_pop_l()
right-hand index on line 5, and, finally, the hashed array and pdeq_pop_r() implementations separately.
of simple lock-based double-ended queues on line 6. A
high-performance implementation would of course use Quick Quiz 6.7: Why can’t the compound parallel double-
padding or special alignment directives to avoid false ended queue implementation be symmetric?
sharing. The pdeq_pop_l() implementation is shown on
Listing 6.2 (lockhdeq.c) shows the implementation of lines 1–16 of the figure. Line 5 acquires the left-hand lock,
the enqueue and dequeue functions.4 Discussion will focus which line 14 releases. Line 6 attempts to left-dequeue
on the left-hand operations, as the right-hand operations an element from the left-hand underlying double-ended
are trivially derived from them. queue, and, if successful, skips lines 8–13 to simply return
4 One could easily create a polymorphic implementation in any this element. Otherwise, line 8 acquires the right-hand
number of languages, but doing so is left as an exercise for the reader. lock, line 9 left-dequeues an element from the right-hand

v2022.09.25a
6.1. PARTITIONING EXERCISES 79

queue, and line 10 moves any remaining elements on the


right-hand queue to the left-hand queue, line 11 initializes
the right-hand queue, and line 12 releases the right-hand
lock. The element, if any, that was dequeued on line 9
will be returned.
The pdeq_pop_r() implementation is shown on
Listing 6.3: Compound Parallel Double-Ended Queue Imple- lines 18–38 of the figure. As before, line 22 acquires
mentation the right-hand lock (and line 36 releases it), and line 23
1 struct cds_list_head *pdeq_pop_l(struct pdeq *d) attempts to right-dequeue an element from the right-hand
2 {
3 struct cds_list_head *e;
queue, and, if successful, skips lines 25–35 to simply
4 return this element. However, if line 24 determines that
5 spin_lock(&d->llock);
6 e = deq_pop_l(&d->ldeq);
there was no element to dequeue, line 25 releases the
7 if (e == NULL) { right-hand lock and lines 26–27 acquire both locks in
8 spin_lock(&d->rlock);
9 e = deq_pop_l(&d->rdeq);
the proper order. Line 28 then attempts to right-dequeue
10 cds_list_splice(&d->rdeq.chain, &d->ldeq.chain); an element from the right-hand list again, and if line 29
11 CDS_INIT_LIST_HEAD(&d->rdeq.chain);
12 spin_unlock(&d->rlock);
determines that this second attempt has failed, line 30
13 } right-dequeues an element from the left-hand queue (if
14 spin_unlock(&d->llock);
15 return e;
there is one available), line 31 moves any remaining ele-
16 } ments from the left-hand queue to the right-hand queue,
17
18 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
and line 32 initializes the left-hand queue. Either way,
19 { line 34 releases the left-hand lock.
20 struct cds_list_head *e;
21 Quick Quiz 6.8: Why is it necessary to retry the right-
22 spin_lock(&d->rlock); dequeue operation on line 28 of Listing 6.3?
23 e = deq_pop_r(&d->rdeq);
24 if (e == NULL) {
25 spin_unlock(&d->rlock); Quick Quiz 6.9: Surely the left-hand lock must sometimes be
26 spin_lock(&d->llock);
27 spin_lock(&d->rlock); available!!! So why is it necessary that line 25 of Listing 6.3
28 e = deq_pop_r(&d->rdeq); unconditionally release the right-hand lock?
29 if (e == NULL) {
30 e = deq_pop_r(&d->ldeq);
31 cds_list_splice(&d->ldeq.chain, &d->rdeq.chain); The pdeq_push_l() implementation is shown on
32 CDS_INIT_LIST_HEAD(&d->ldeq.chain);
33 }
lines 40–45 of Listing 6.3. Line 42 acquires the left-
34 spin_unlock(&d->llock); hand spinlock, line 43 left-enqueues the element onto the
35 }
36 spin_unlock(&d->rlock);
left-hand queue, and finally line 44 releases the lock. The
37 return e; pdeq_push_r() implementation (shown on lines 47–52)
38 }
39
is quite similar.
40 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
41 { Quick Quiz 6.10: But in the case where data is flowing in
42 spin_lock(&d->llock); only one direction, the algorithm shown in Listing 6.3 will
43 deq_push_l(e, &d->ldeq); have both ends attempting to acquire the same lock whenever
44 spin_unlock(&d->llock);
45 } the consuming end empties its underlying double-ended queue.
46 Doesn’t that mean that sometimes this algorithm fails to provide
47 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
48 {
concurrent access to both ends of the queue even when the
49 spin_lock(&d->rlock); queue contains an arbitrarily large number of elements?
50 deq_push_r(e, &d->rdeq);
51 spin_unlock(&d->rlock);
52 }
6.1.2.6 Double-Ended Queue Discussion
The compound implementation is somewhat more com-
plex than the hashed variant presented in Section 6.1.2.4,
but is still reasonably simple. Of course, a more intelligent
rebalancing scheme could be arbitrarily complex, but the
simple scheme shown here has been shown to perform well

v2022.09.25a
80 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

compared to software alternatives [DCW+ 11] and even parallelism”. The synchronization overhead in this case
compared to algorithms using hardware assist [DLM+ 10]. is nearly (or even exactly) zero. In contrast, the double-
Nevertheless, the best we can hope for from such a scheme ended queue implementations are examples of “vertical
is 2x scalability, as at most two threads can be holding the parallelism” or “pipelining”, given that data moves from
dequeue’s locks concurrently. This limitation also applies one thread to another. The tighter coordination required
to algorithms based on non-blocking synchronization, for pipelining in turn requires larger units of work to obtain
such as the compare-and-swap-based dequeue algorithm a given level of efficiency.
of Michael [Mic03].5
Quick Quiz 6.12: The tandem double-ended queue runs
Quick Quiz 6.11: Why are there not one but two solutions about twice as fast as the hashed double-ended queue, even
to the double-ended queue problem? when I increase the size of the hash table to an insanely large
number. Why is that?
In fact, as noted by Dice et al. [DLM+ 10], an unsynchro-
nized single-threaded double-ended queue significantly
Quick Quiz 6.13: Is there a significantly better way of
outperforms any of the parallel implementations they stud-
handling concurrency for double-ended queues?
ied. Therefore, the key point is that there can be significant
overhead enqueuing to or dequeuing from a shared queue, These two examples show just how powerful partition-
regardless of implementation. This should come as no ing can be in devising parallel algorithms. Section 6.3.5
surprise in light of the material in Chapter 3, given the looks briefly at a third example, matrix multiply. However,
strict first-in-first-out (FIFO) nature of these queues. all three of these examples beg for more and better design
Furthermore, these strict FIFO queues are strictly FIFO criteria for parallel programs, a topic taken up in the next
only with respect to linearization points [HW90]6 that section.
are not visible to the caller, in fact, in these examples, the
linearization points are buried in the lock-based critical
sections. These queues are not strictly FIFO with respect
to (say) the times at which the individual operations 6.2 Design Criteria
started [HKLP12]. This indicates that the strict FIFO
property is not all that valuable in concurrent programs, One pound of learning requires ten pounds of
and in fact, Kirsch et al. present less-strict queues that commonsense to apply it.
provide improved performance and scalability [KLP12].7
Persian proverb
All that said, if you are pushing all the data used by your
concurrent program through a single queue, you really
need to rethink your overall design. One way to obtain the best performance and scalability is
to simply hack away until you converge on the best possible
parallel program. Unfortunately, if your program is other
6.1.3 Partitioning Example Discussion than microscopically tiny, the space of possible parallel
The optimal solution to the dining philosophers problem programs is so huge that convergence is not guaranteed in
given in the answer to the Quick Quiz in Section 6.1.1 is the lifetime of the universe. Besides, what exactly is the
an excellent example of “horizontal parallelism” or “data “best possible parallel program”? After all, Section 2.2
called out no fewer than three parallel-programming goals
5 This paper is interesting in that it showed that special double-
of performance, productivity, and generality, and the best
compare-and-swap (DCAS) instructions are not needed for lock-free possible performance will likely come at a cost in terms
implementations of double-ended queues. Instead, the common compare- of productivity and generality. We clearly need to be able
and-swap (e.g., x86 cmpxchg) suffices. to make higher-level choices at design time in order to
6 In short, a linearization point is a single point within a given
arrive at an acceptably good parallel program before that
function where that function can be said to have taken effect. In this
lock-based implementation, the linearization points can be said to be program becomes obsolete.
anywhere within the critical section that does the work. However, more detailed design criteria are required to
7 Nir Shavit produced relaxed stacks for roughly the same rea-
actually produce a real-world design, a task taken up in
sons [Sha11]. This situation leads some to believe that the linearization
points are useful to theorists rather than developers, and leads others
this section. This being the real world, these criteria often
to wonder to what extent the designers of such data structures and conflict to a greater or lesser degree, requiring that the
algorithms were considering the needs of their users. designer carefully balance the resulting tradeoffs.

v2022.09.25a
6.2. DESIGN CRITERIA 81

As such, these criteria may be thought of as the the sequential program, although large state spaces
“forces” acting on the design, with particularly good having regular structures can in some cases be easily
tradeoffs between these forces being called “design pat- understood. A parallel programmer must consider
terns” [Ale79, GHJV95]. synchronization primitives, messaging, locking de-
The design criteria for attaining the three parallel- sign, critical-section identification, and deadlock in
programming goals are speedup, contention, overhead, the context of this larger state space.
read-to-write ratio, and complexity: This greater complexity often translates to higher
Speedup: As noted in Section 2.2, increased performance development and maintenance costs. Therefore, bud-
is the major reason to go to all of the time and trouble getary constraints can limit the number and types
required to parallelize it. Speedup is defined to be the of modifications made to an existing program, since
ratio of the time required to run a sequential version a given degree of speedup is worth only so much
of the program to the time required to run a parallel time and trouble. Worse yet, added complexity can
version. actually reduce performance and scalability.
Contention: If more CPUs are applied to a parallel pro- Therefore, beyond a certain point, there may be
gram than can be kept busy by that program, the potential sequential optimizations that are cheaper
excess CPUs are prevented from doing useful work and more effective than parallelization. As noted
by contention. This may be lock contention, memory in Section 2.2.1, parallelization is but one perfor-
contention, or a host of other performance killers. mance optimization of many, and is furthermore an
optimization that applies most readily to CPU-based
Work-to-Synchronization Ratio: A uniprocessor, sin- bottlenecks.
gle-threaded, non-preemptible, and non-interrupt-
ible8 version of a given parallel program would not These criteria will act together to enforce a maximum
need any synchronization primitives. Therefore, speedup. The first three criteria are deeply interrelated,
any time consumed by these primitives (including so the remainder of this section analyzes these interrela-
communication cache misses as well as message tionships.9
latency, locking primitives, atomic instructions, and Note that these criteria may also appear as part of the
memory barriers) is overhead that does not contrib- requirements specification. For example, speedup may act
ute directly to the useful work that the program is as a relative desideratum (“the faster, the better”) or as an
intended to accomplish. Note that the important absolute requirement of the workload (“the system must
measure is the relationship between the synchroniza- support at least 1,000,000 web hits per second”). Classic
tion overhead and the overhead of the code in the design pattern languages describe relative desiderata as
critical section, with larger critical sections able to forces and absolute requirements as context.
tolerate greater synchronization overhead. The work- An understanding of the relationships between these
to-synchronization ratio is related to the notion of design criteria can be very helpful when identifying ap-
synchronization efficiency. propriate design tradeoffs for a parallel program.
Read-to-Write Ratio: A data structure that is rarely up- 1. The less time a program spends in exclusive-lock
dated may often be replicated rather than partitioned, critical sections, the greater the potential speedup.
and furthermore may be protected with asymmet- This is a consequence of Amdahl’s Law [Amd67]
ric synchronization primitives that reduce readers’ because only one CPU may execute within a given
synchronization overhead at the expense of that of exclusive-lock critical section at a given time.
writers, thereby reducing overall synchronization More specifically, for unbounded linear scalability,
overhead. Corresponding optimizations are possible the fraction of time that the program spends in a
for frequently updated data structures, as discussed given exclusive critical section must decrease as the
in Chapter 5. number of CPUs increases. For example, a program
Complexity: A parallel program is more complex than will not scale to 10 CPUs unless it spends much
an equivalent sequential program because the paral-
9 A real-world parallel system will be subject to many additional
lel program has a much larger state space than does
design criteria, such as data-structure layout, memory size, memory-
8 Either by masking interrupts or by being oblivious to them. hierarchy latencies, bandwidth limitations, and I/O issues.

v2022.09.25a
82 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

less than one tenth of its time in the most-restrictive Sequential


exclusive-lock critical section. Program

Partition Batch
2. Contention effects consume the excess CPU and/or Code
wallclock time when the actual speedup is less than Locking
the number of available CPUs. The larger the gap Partition Batch
between the number of CPUs and the actual speedup,
Data
the less efficiently the CPUs will be used. Similarly, Locking
the greater the desired efficiency, the smaller the
achievable speedup. Own Disown
Data
Ownership
3. If the available synchronization primitives have high
overhead compared to the critical sections that they Figure 6.10: Design Patterns and Lock Granularity
guard, the best way to improve speedup is to reduce
the number of times that the primitives are invoked.
This can be accomplished by batching critical sec- 6.3 Synchronization Granularity
tions, using data ownership (see Chapter 8), using
asymmetric primitives (see Chapter 9), or by using a
Doing little things well is a step toward doing big
coarse-grained design such as code locking. things better.

Harry F. Banks
4. If the critical sections have high overhead compared
to the primitives guarding them, the best way to Figure 6.10 gives a pictorial view of different levels of
improve speedup is to increase parallelism by moving synchronization granularity, each of which is described
to reader/writer locking, data locking, asymmetric, in one of the following sections. These sections focus
or data ownership. primarily on locking, but similar granularity issues arise
with all forms of synchronization.
5. If the critical sections have high overhead compared
to the primitives guarding them and the data structure 6.3.1 Sequential Program
being guarded is read much more often than modified, If the program runs fast enough on a single processor,
the best way to increase parallelism is to move to and has no interactions with other processes, threads, or
reader/writer locking or asymmetric primitives. interrupt handlers, you should remove the synchronization
primitives and spare yourself their overhead and complex-
ity. Some years back, there were those who would argue
6. Many changes that improve SMP performance, for
that Moore’s Law would eventually force all programs
example, reducing lock contention, also improve
into this category. However, as can be seen in Figure 6.11,
real-time latencies [McK05c].
the exponential increase in single-threaded performance
halted in about 2003. Therefore, increasing performance
will increasingly require parallelism.10 Given that back
Quick Quiz 6.14: Don’t all these problems with critical
in 2006 Paul typed the first version of this sentence on
sections mean that we should just always use non-blocking
synchronization [Her90], which don’t have critical sections? a dual-core laptop, and further given that many of the
graphs added in 2020 were generated on a system with
10 This plot shows clock frequencies for newer CPUs theoretically

It is worth reiterating that contention has many guises, capable of retiring one or more instructions per clock, and MIPS for
older CPUs requiring multiple clocks to execute even the simplest
including lock contention, memory contention, cache instruction. The reason for taking this approach is that the newer CPUs’
overflow, thermal throttling, and much else besides. This ability to retire multiple instructions per clock is typically limited by
chapter looks primarily at lock and memory contention. memory-system performance.

v2022.09.25a
6.3. SYNCHRONIZATION GRANULARITY 83

10000
CPU Clock Frequency / MIPS

1000
1x106
100
100000 Ethernet

Relative Performance
10 10000

1000
1
100 x86 CPUs

0.1 10
1975
1980
1985
1990
1995
2000
2005
2010
2015
2020
1

Year 0.1

1970
1975
1980
1985
1990
1995
2000
2005
2010
2015
2020
Figure 6.11: MIPS/Clock-Frequency Trend for Intel
CPUs
Year

Figure 6.12: Ethernet Bandwidth vs. Intel x86 CPU


56 hardware threads per socket, parallelism is well and Performance
truly here. It is also important to note that Ethernet band-
width is continuing to grow, as shown in Figure 6.12. This
growth will continue to motivate multithreaded servers in
order to handle the communications load.
Please note that this does not mean that you should
code each and every program in a multi-threaded manner.
Again, if a program runs quickly enough on a single
processor, spare yourself the overhead and complexity of
SMP synchronization primitives. The simplicity of the Listing 6.4: Sequential-Program Hash Table Search
1 struct hash_table
hash-table lookup code in Listing 6.4 underscores this 2 {
point.11 A key point is that speedups due to parallelism 3 long nbuckets;
4 struct node **buckets;
are normally limited to the number of CPUs. In contrast, 5 };
speedups due to sequential optimizations, for example, 6
7 typedef struct node {
careful choice of data structure, can be arbitrarily large. 8 unsigned long key;
9 struct node *next;
Quick Quiz 6.15: What should you do to validate a hash 10 } node_t;
table? 11
12 int hash_search(struct hash_table *h, long key)
13 {
On the other hand, if you are not in this happy situation, 14 struct node *cur;
read on! 15
16 cur = h->buckets[key % h->nbuckets];
17 while (cur != NULL) {
if (cur->key >= key) {
6.3.2 Code Locking 18
19 return (cur->key == key);
20 }
Code locking is quite simple due to the fact that is uses 21 cur = cur->next;
}
only global locks.12 It is especially easy to retrofit an 22
23 return 0;
11 The 24 }
examples in this section are taken from Hart et al. [HMB06],
adapted for clarity by gathering related code from multiple files.
12 If your program instead has locks in data structures, or, in the case

of Java, uses classes with synchronized instances, you are instead using
“data locking”, described in Section 6.3.3.

v2022.09.25a
84 CHAPTER 6. PARTITIONING AND SYNCHRONIZATION DESIGN

Listing 6.5: Code-Locking Hash Table Search
 1 spinlock_t hash_lock;
 2
 3 struct hash_table
 4 {
 5   long nbuckets;
 6   struct node **buckets;
 7 };
 8
 9 typedef struct node {
10   unsigned long key;
11   struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16   struct node *cur;
17   int retval;
18
19   spin_lock(&hash_lock);
20   cur = h->buckets[key % h->nbuckets];
21   while (cur != NULL) {
22     if (cur->key >= key) {
23       retval = (cur->key == key);
24       spin_unlock(&hash_lock);
25       return retval;
26     }
27     cur = cur->next;
28   }
29   spin_unlock(&hash_lock);
30   return 0;
31 }

It is especially easy to retrofit an existing program to use code locking in order to run it on a multiprocessor. If the program has only a single shared resource, code locking will even give optimal performance. However, many of the larger and more complex programs require much of the execution to occur in critical sections, which in turn causes code locking to sharply limit their scalability.

Therefore, you should use code locking on programs that spend only a small fraction of their execution time in critical sections or from which only modest scaling is required. In addition, programs that primarily use the more scalable approaches described in later sections often use code locking to handle rare error cases or significant state transitions. In these cases, code locking will provide a relatively simple program that is very similar to its sequential counterpart, as can be seen in Listing 6.5. However, note that the simple return of the comparison in hash_search() in Listing 6.4 has now become three statements due to the need to release the lock before returning.

Note that the hash_lock acquisition and release statements on lines 19, 24, and 29 are mediating ownership of the hash table among the CPUs wishing to concurrently access that hash table. Another way of looking at this is that hash_lock is partitioning time, thus giving each requesting CPU its own partition of time during which it owns this hash table. In addition, in a well-designed algorithm, there should be ample partitions of time during which no CPU owns this hash table.

Quick Quiz 6.16: "Partitioning time"? Isn't that an odd turn of phrase?

Unfortunately, code locking is particularly prone to "lock contention", where multiple CPUs need to acquire the lock concurrently. SMP programmers who have taken care of groups of small children (or groups of older people who are acting like children) will immediately recognize the danger of having only one of something, as illustrated in Figure 6.13.

Figure 6.13: Lock Contention

One solution to this problem, named "data locking", is described in the next section.

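Before moving on, note that updates under code locking follow the same pattern as the search in Listing 6.5. The following is a minimal sketch of a hypothetical hash_insert(), which is not part of the original example code; it assumes that the caller has allocated and initialized the node and that duplicate keys need not be rejected.

/* Hypothetical code-locked insertion in the spirit of Listing 6.5. */
void hash_insert(struct hash_table *h, node_t *np)
{
	struct node **headp;

	spin_lock(&hash_lock);   /* One global lock serializes all updates. */
	headp = &h->buckets[np->key % h->nbuckets];
	np->next = *headp;       /* Push onto the head of the bucket list. */
	*headp = np;
	spin_unlock(&hash_lock);
}

Because every update serializes on the single hash_lock, the code stays nearly as simple as its sequential counterpart, but that single lock is also the source of the contention problems just described.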
6.3.3 Data Locking

Many data structures may be partitioned, with each partition of the data structure having its own lock. Then the critical sections for each part of the data structure can execute in parallel, although only one instance of the critical section for a given part could be executing at a given time. You should use data locking when contention must be reduced, and where synchronization overhead is not limiting speedups. Data locking reduces contention by distributing the instances of the overly-large critical section across multiple data structures, for example, maintaining per-hash-bucket critical sections in a hash table, as shown in Listing 6.6. The increased scalability again results in a slight increase in complexity in the form of an additional data structure, the struct bucket.

Listing 6.6: Data-Locking Hash Table Search
 1 struct hash_table
 2 {
 3   long nbuckets;
 4   struct bucket **buckets;
 5 };
 6
 7 struct bucket {
 8   spinlock_t bucket_lock;
 9   node_t *list_head;
10 };
11
12 typedef struct node {
13   unsigned long key;
14   struct node *next;
15 } node_t;
16
17 int hash_search(struct hash_table *h, long key)
18 {
19   struct bucket *bp;
20   struct node *cur;
21   int retval;
22
23   bp = h->buckets[key % h->nbuckets];
24   spin_lock(&bp->bucket_lock);
25   cur = bp->list_head;
26   while (cur != NULL) {
27     if (cur->key >= key) {
28       retval = (cur->key == key);
29       spin_unlock(&bp->bucket_lock);
30       return retval;
31     }
32     cur = cur->next;
33   }
34   spin_unlock(&bp->bucket_lock);
35   return 0;
36 }

Figure 6.14: Data Locking

In contrast with the contentious situation shown in Figure 6.13, data locking helps promote harmony, as illustrated by Figure 6.14—and in parallel programs, this almost always translates into increased performance and scalability. For this reason, data locking was heavily used by Sequent in its kernels [BK85, Inm85, Gar90, Dov90, MD92, MG92, MS93].

Another way of looking at this is to think of each ->bucket_lock as mediating ownership not of the entire hash table as was done for code locking, but only for the bucket corresponding to that ->bucket_lock. Each lock still partitions time, but the per-bucket-locking technique also partitions the address space, so that the overall technique can be said to partition spacetime. If the number of buckets is large enough, this partitioning of space should with high probability permit a given CPU immediate access to a given hash bucket.

However, as those who have taken care of small children can again attest, even providing enough to go around is no guarantee of tranquillity. The analogous situation can arise in SMP programs. For example, the Linux kernel maintains a cache of files and directories (called "dcache"). Each entry in this cache has its own lock, but the entries corresponding to the root directory and its direct descendants are much more likely to be traversed than are more obscure entries. This can result in many CPUs contending for the locks of these popular entries, resulting in a situation not unlike that shown in Figure 6.15.

Figure 6.15: Data Locking and Skew

In many cases, algorithms can be designed to reduce the instance of data skew, and in some cases eliminate it entirely (for example, in the Linux kernel's dcache [MSS04, Cor10a, Bro15a, Bro15b, Bro15c]).
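For comparison with the code-locked sketch earlier, an update under data locking need acquire only the corresponding bucket's lock. The following hypothetical hash_insert(), again not part of the original example code, uses the struct bucket of Listing 6.6.

/* Hypothetical data-locked insertion in the spirit of Listing 6.6. */
void hash_insert(struct hash_table *h, node_t *np)
{
	struct bucket *bp = h->buckets[np->key % h->nbuckets];

	spin_lock(&bp->bucket_lock);   /* Serialize only this bucket. */
	np->next = bp->list_head;      /* Push onto the head of the bucket list. */
	bp->list_head = np;
	spin_unlock(&bp->bucket_lock);
}

Updates to different buckets can now proceed in parallel, so that contention is limited to updates that happen to hash to the same bucket, which is exactly the skew issue discussed above.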
Data locking is often used for partitionable data structures such as hash tables, as well as in situations where multiple entities are each represented by an instance of a given data structure. The Linux-kernel task list is an example of the latter, each task structure having its own alloc_lock and pi_lock.

A key challenge with data locking on dynamically allocated structures is ensuring that the structure remains in existence while the lock is being acquired [GKAS99]. The code in Listing 6.6 finesses this challenge by placing the locks in the statically allocated hash buckets, which are never freed. However, this trick would not work if the hash table were resizeable, so that the locks were now dynamically allocated. In this case, there would need to be some means to prevent the hash bucket from being freed during the time that its lock was being acquired.

Quick Quiz 6.17: What are some ways of preventing a structure from being freed while its lock is being acquired?

6.3.4 Data Ownership

Data ownership partitions a given data structure over the threads or CPUs, so that each thread/CPU accesses its subset of the data structure without any synchronization overhead whatsoever. However, if one thread wishes to access some other thread's data, the first thread is unable to do so directly. Instead, the first thread must communicate with the second thread, so that the second thread performs the operation on behalf of the first, or, alternatively, migrates the data to the first thread.

Data ownership might seem arcane, but it is used very frequently:

1. Any variables accessible by only one CPU or thread (such as auto variables in C and C++) are owned by that CPU or process.

2. An instance of a user interface owns the corresponding user's context. It is very common for applications interacting with parallel database engines to be written as if they were entirely sequential programs. Such applications own the user interface and the user's current action. Explicit parallelism is thus confined to the database engine itself.

3. Parametric simulations are often trivially parallelized by granting each thread ownership of a particular region of the parameter space. There are also computing frameworks designed for this type of problem [Uni08a].

If there is significant sharing, communication between the threads or CPUs can result in significant complexity and overhead. Furthermore, if the most-heavily used data happens to be that owned by a single CPU, that CPU will be a "hot spot", sometimes with results resembling that shown in Figure 6.15. However, in situations where no sharing is required, data ownership achieves ideal performance, and with code that can be as simple as the sequential-program case shown in Listing 6.4. Such situations are often referred to as "embarrassingly parallel", and, in the best case, resemble the situation previously shown in Figure 6.14.

Another important instance of data ownership occurs when the data is read-only, in which case, all threads can "own" it via replication.

Where data locking partitions both the address space (with one hash bucket per partition) and time (using per-bucket locks), data ownership partitions only the address space. The reason that data ownership need not partition time is because a given thread or CPU is assigned permanent ownership of a given address-space partition.

Quick Quiz 6.18: But won't system boot and shutdown (or application startup and shutdown) be partitioning time, even for data ownership?

Data ownership will be presented in more detail in Chapter 8.
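As a concrete illustration of data ownership, the following minimal sketch, which is hypothetical and not part of the book's example code, gives each thread sole ownership of one element of a counter array. The fastpath increments therefore need neither locks nor atomic operations.

/* Hypothetical illustration of data ownership via per-thread elements. */
#define NR_THREADS 4

/*
 * counts[i] is owned by thread i.  Real code would pad or align each
 * element to avoid false sharing between adjacent counters.
 */
static unsigned long counts[NR_THREADS];

struct count_arg {
	int me;                 /* Index of the owning thread. */
	unsigned long nloops;
};

static void *count_worker(void *arg)
{
	struct count_arg *cap = arg;
	unsigned long i;

	for (i = 0; i < cap->nloops; i++)
		counts[cap->me]++;  /* Owned data: no locks, no atomics. */
	return NULL;
}

A thread needing the aggregate count can simply sum the array after the workers have been joined, at which point no synchronization is required because the owners are no longer running.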
6.3.5 Locking Granularity and Performance

This section looks at locking granularity and performance from a mathematical synchronization-efficiency viewpoint. Readers who are uninspired by mathematics might choose to skip this section.

The approach is to use a crude queueing model for the efficiency of synchronization mechanisms that operate on a single shared global variable, based on an M/M/1 queue. M/M/1 queuing models are based on an exponentially distributed "inter-arrival rate" λ and an exponentially distributed "service rate" μ. The inter-arrival rate λ can be thought of as the average number of synchronization operations per second that the system would process if the synchronization were free, in other words, λ is an inverse measure of the overhead of each non-synchronization unit of work. For example, if each unit of work was a transaction, and if each transaction took one millisecond to process, excluding synchronization overhead, then λ would be 1,000 transactions per second.

The service rate μ is defined similarly, but for the average number of synchronization operations per second that the system would process if the overhead of each transaction was zero, and ignoring the fact that CPUs must wait on each other to complete their synchronization operations, in other words, μ can be roughly thought of as the synchronization overhead in absence of contention. For example, suppose that each transaction's synchronization operation involves an atomic increment instruction, and that a computer system is able to do a private-variable atomic increment every 5 nanoseconds on each CPU (see Figure 5.1).13 The value of μ is therefore about 200,000,000 atomic increments per second.

13 Of course, if there are 8 CPUs all incrementing the same shared variable, then each CPU must wait at least 35 nanoseconds for each of the other CPUs to do its increment before consuming an additional 5 nanoseconds doing its own increment. In fact, the wait will be longer due to the need to move the variable from one CPU to another.

Of course, the value of λ increases as increasing numbers of CPUs increment a shared variable because each CPU is capable of processing transactions independently (again, ignoring synchronization):

    λ = n λ0    (6.1)

Here, n is the number of CPUs and λ0 is the transaction-processing capability of a single CPU. Note that the expected time for a single CPU to execute a single transaction in the absence of contention is 1/λ0.

Because the CPUs have to "wait in line" behind each other to get their chance to increment the single shared variable, we can use the M/M/1 queueing-model expression for the expected total waiting time:

    T = 1 / (μ − λ)    (6.2)

Substituting the above value of λ:

    T = 1 / (μ − n λ0)    (6.3)

Now, the efficiency is just the ratio of the time required to process a transaction in absence of synchronization (1/λ0) to the time required including synchronization (T + 1/λ0):

    e = (1/λ0) / (T + 1/λ0)    (6.4)

Substituting the above value for T and simplifying:

    e = (μ/λ0 − n) / (μ/λ0 − (n − 1))    (6.5)

But the value of μ/λ0 is just the ratio of the time required to process the transaction (absent synchronization overhead) to that of the synchronization overhead itself (absent contention). If we call this ratio f, we have:

    e = (f − n) / (f − (n − 1))    (6.6)

Figure 6.16: Synchronization Efficiency (efficiency e versus number of CPUs/threads n for f = 10, 25, 50, 75, and 100)

Figure 6.16 plots the synchronization efficiency e as a function of the number of CPUs/threads n for a few values of the overhead ratio f. For example, again using the 5-nanosecond atomic increment, the f = 10 line corresponds to each CPU attempting an atomic increment every 50 nanoseconds, and the f = 100 line corresponds to each CPU attempting an atomic increment every 500 nanoseconds, which in turn corresponds to some hundreds (perhaps thousands) of instructions. Given that each trace drops off sharply with increasing numbers of CPUs or threads, we can conclude that synchronization mechanisms based on atomic manipulation of a single global shared variable will not scale well if used heavily on current commodity hardware. This is an abstract mathematical depiction of the forces leading to the parallel counting algorithms that were discussed in Chapter 5. Your real-world mileage may differ.

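To make Equation 6.6 concrete, the following hypothetical helper, which is not part of the book's example code, evaluates it for given values of f and n.

/* Hypothetical evaluation of Equation 6.6: e = (f - n) / (f - (n - 1)). */
static double sync_efficiency(double f, double n)
{
	return (f - n) / (f - (n - 1.0));
}

/*
 * For f = 100: n = 10 yields 90/91 (about 0.99), n = 90 yields
 * 10/11 (about 0.91), and n = 100 yields 0, illustrating how
 * efficiency collapses as n approaches f.
 */

These numbers are simply the formula behind the f = 100 trace in Figure 6.16, which stays near 1.0 at low thread counts and falls off steeply as n approaches f.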
Nevertheless, the concept of efficiency is useful, and even in cases having little or no formal synchronization. Consider for example a matrix multiply, in which the columns of one matrix are multiplied (via "dot product") by the rows of another, resulting in an entry in a third matrix. Because none of these operations conflict, it is possible to partition the columns of the first matrix among a group of threads, with each thread computing the corresponding columns of the result matrix. The threads can therefore operate entirely independently, with no synchronization overhead whatsoever, as is done in matmul.c. One might therefore expect a perfect efficiency of 1.0.

Figure 6.17: Matrix Multiply Efficiency (efficiency versus number of CPUs/threads for 64-by-64 through 1024-by-1024 matrices)

However, Figure 6.17 tells a different story, especially for a 64-by-64 matrix multiply, which never gets above an efficiency of about 0.3, even when running single-threaded, and drops sharply as more threads are added.14 The 128-by-128 matrix does better, but still fails to demonstrate much performance increase with added threads. The 256-by-256 matrix does scale reasonably well, but only up to a handful of CPUs. The 512-by-512 matrix multiply's efficiency is measurably less than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix multiply deviates noticeably from perfection at a few tens of threads. Nevertheless, this figure clearly demonstrates the performance and scalability benefits of batching: If you must incur synchronization overhead, you may as well get your money's worth.

14 In contrast to the smooth traces of Figure 6.16, the wide error bars and jagged traces of Figure 6.17 give evidence of its real-world nature.

Quick Quiz 6.19: How can a single-threaded 64-by-64 matrix multiply possibly have an efficiency of less than 1.0? Shouldn't all of the traces in Figure 6.17 have efficiency of exactly 1.0 when running on one thread?

Quick Quiz 6.20: How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!!

Quick Quiz 6.21: What did you do to validate this matrix multiply algorithm?

Given these inefficiencies, it is worthwhile to look into more-scalable approaches such as the data locking described in Section 6.3.3 or the parallel-fastpath approach discussed in the next section.

6.4 Parallel Fastpath

There are two ways of meeting difficulties: You alter the difficulties, or you alter yourself to meet them.
Phyllis Bottome

Fine-grained (and therefore usually higher-performance) designs are typically more complex than are coarser-grained designs. In many cases, most of the overhead is incurred by a small fraction of the code [Knu73]. So why not focus effort on that small fraction?

This is the idea behind the parallel-fastpath design pattern, to aggressively parallelize the common-case code path without incurring the complexity that would be required to aggressively parallelize the entire algorithm. You must understand not only the specific algorithm you wish to parallelize, but also the workload that the algorithm will be subjected to. Great creativity and design effort is often required to construct a parallel fastpath.

Parallel fastpath combines different patterns (one for the fastpath, one elsewhere) and is therefore a template pattern. The following instances of parallel fastpath occur often enough to warrant their own patterns, as depicted in Figure 6.18:

Figure 6.18: Parallel-Fastpath Design Patterns (Reader/Writer Locking, RCU, Hierarchical Locking, and Allocator Caches)
1. Reader/Writer Locking (described below in Section 6.4.1).

2. Read-copy update (RCU), which may be used as a high-performance replacement for reader/writer locking, is introduced in Section 9.5. Other alternatives include hazard pointers (Section 9.3) and sequence locking (Section 9.4). These alternatives will not be discussed further in this chapter.

3. Hierarchical Locking ([McK96a]), which is touched upon in Section 6.4.2.

4. Resource Allocator Caches ([McK96a, MS93]). See Section 6.4.3 for more detail.

6.4.1 Reader/Writer Locking

If synchronization overhead is negligible (for example, if the program uses coarse-grained parallelism with large critical sections), and if only a small fraction of the critical sections modify data, then allowing multiple readers to proceed in parallel can greatly increase scalability. Writers exclude both readers and each other. There are many implementations of reader-writer locking, including the POSIX implementation described in Section 4.2.4. Listing 6.7 shows how the hash search might be implemented using reader-writer locking.

Listing 6.7: Reader-Writer-Locking Hash Table Search
 1 rwlock_t hash_lock;
 2
 3 struct hash_table
 4 {
 5   long nbuckets;
 6   struct node **buckets;
 7 };
 8
 9 typedef struct node {
10   unsigned long key;
11   struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16   struct node *cur;
17   int retval;
18
19   read_lock(&hash_lock);
20   cur = h->buckets[key % h->nbuckets];
21   while (cur != NULL) {
22     if (cur->key >= key) {
23       retval = (cur->key == key);
24       read_unlock(&hash_lock);
25       return retval;
26     }
27     cur = cur->next;
28   }
29   read_unlock(&hash_lock);
30   return 0;
31 }

Reader/writer locking is a simple instance of asymmetric locking. Snaman [ST87] describes a more ornate six-mode asymmetric locking design used in several clustered systems. Locking in general and reader-writer locking in particular is described extensively in Chapter 7.

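Listing 6.7 is written in terms of the read_lock() and read_unlock() primitives used throughout this book. In a standalone pthreads program, the POSIX reader-writer lock mentioned above could be used directly, as in the following hypothetical sketch, which reuses the declarations from Listing 6.7 and is not part of the original example code.

/* Hypothetical POSIX reader-writer locking for the hash search. */
#include <pthread.h>

static pthread_rwlock_t hash_rwlock = PTHREAD_RWLOCK_INITIALIZER;

int hash_search_posix(struct hash_table *h, long key)
{
	struct node *cur;
	int retval = 0;

	pthread_rwlock_rdlock(&hash_rwlock);  /* Readers may proceed concurrently. */
	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			break;
		}
		cur = cur->next;
	}
	pthread_rwlock_unlock(&hash_rwlock);
	return retval;
}

An update would instead acquire pthread_rwlock_wrlock(), which excludes both readers and other writers.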
6.4.2 Hierarchical Locking

The idea behind hierarchical locking is to have a coarse-grained lock that is held only long enough to work out which fine-grained lock to acquire. Listing 6.8 shows how our hash-table search might be adapted to do hierarchical locking, but also shows the great weakness of this approach: We have paid the overhead of acquiring a second lock, but we only hold it for a short time. In this case, the data-locking approach would be simpler and likely perform better.

Listing 6.8: Hierarchical-Locking Hash Table Search
 1 struct hash_table
 2 {
 3   long nbuckets;
 4   struct bucket **buckets;
 5 };
 6
 7 struct bucket {
 8   spinlock_t bucket_lock;
 9   node_t *list_head;
10 };
11
12 typedef struct node {
13   spinlock_t node_lock;
14   unsigned long key;
15   struct node *next;
16 } node_t;
17
18 int hash_search(struct hash_table *h, long key)
19 {
20   struct bucket *bp;
21   struct node *cur;
22   int retval;
23
24   bp = h->buckets[key % h->nbuckets];
25   spin_lock(&bp->bucket_lock);
26   cur = bp->list_head;
27   while (cur != NULL) {
28     if (cur->key >= key) {
29       spin_lock(&cur->node_lock);
30       spin_unlock(&bp->bucket_lock);
31       retval = (cur->key == key);
32       spin_unlock(&cur->node_lock);
33       return retval;
34     }
35     cur = cur->next;
36   }
37   spin_unlock(&bp->bucket_lock);
38   return 0;
39 }

Quick Quiz 6.22: In what situation would hierarchical locking work well?

6.4.3 Resource Allocator Caches

This section presents a simplified schematic of a parallel fixed-block-size memory allocator. More detailed descriptions may be found in the literature [MG92, MS93, BA01, MSK01, Eva11, Ken20] or in the Linux kernel [Tor03].

6.4.3.1 Parallel Resource Allocation Problem

The basic problem facing a parallel memory allocator is the tension between the need to provide extremely fast memory allocation and freeing in the common case and the need to efficiently distribute memory in the face of unfavorable allocation and freeing patterns.

To see this tension, consider a straightforward application of data ownership to this problem—simply carve up memory so that each CPU owns its share. For example, suppose that a system with 12 CPUs has 64 gigabytes of memory, for example, the laptop I am using right now. We could simply assign each CPU a five-gigabyte region of memory, and allow each CPU to allocate from its own region, without the need for locking and its complexities and overheads. Unfortunately, this scheme fails when CPU 0 only allocates memory and CPU 1 only frees it, as happens in simple producer-consumer workloads.

The other extreme, code locking, suffers from excessive lock contention and overhead [MS93].

6.4.3.2 Parallel Fastpath for Resource Allocation

The commonly used solution uses parallel fastpath with each CPU owning a modest cache of blocks, and with a large code-locked shared pool for additional blocks. To prevent any given CPU from monopolizing the memory blocks, we place a limit on the number of blocks that can be in each CPU's cache. In a two-CPU system, the flow of memory blocks will be as shown in Figure 6.19: When a given CPU is trying to free a block when its pool is full, it sends blocks to the global pool, and, similarly, when that CPU is trying to allocate a block when its pool is empty, it retrieves blocks from the global pool.

Figure 6.19: Allocator Cache Schematic (CPU 0 and CPU 1 pools, each owned by its CPU, exchanging overflow and empty transfers with a code-locked global pool)

6.4.3.3 Data Structures

The actual data structures for a "toy" implementation of allocator caches are shown in Listing 6.9 ("smpalloc.c").

Listing 6.9: Allocator-Cache Data Structures
 1 #define TARGET_POOL_SIZE 3
 2 #define GLOBAL_POOL_SIZE 40
 3
 4 struct globalmempool {
 5   spinlock_t mutex;
 6   int cur;
 7   struct memblock *pool[GLOBAL_POOL_SIZE];
 8 } globalmem;
 9
10 struct perthreadmempool {
11   int cur;
12   struct memblock *pool[2 * TARGET_POOL_SIZE];
13 };
14
15 DEFINE_PER_THREAD(struct perthreadmempool, perthreadmem);

The "Global Pool" of Figure 6.19 is implemented by globalmem of type struct globalmempool, and the two CPU pools by the per-thread variable perthreadmem of type struct perthreadmempool. Both of these data structures have arrays of pointers to blocks in their pool fields, which are filled from index zero upwards. Thus, if globalmem.pool[3] is NULL, then the remainder of the array from index 4 up must also be NULL. The cur fields contain the index of the highest-numbered full element of the pool array, or −1 if all elements are empty. All elements from globalmem.pool[0] through globalmem.pool[globalmem.cur] must be full, and all the rest must be empty.15

15 Both pool sizes (TARGET_POOL_SIZE and GLOBAL_POOL_SIZE) are unrealistically small, but this small size makes it easier to single-step the program in order to get a feel for its operation.
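The invariant just described can be captured in a small debug check. The following hypothetical helper is not part of smpalloc.c, and it assumes that the caller holds globalmem.mutex or that the program is still single-threaded.

/* Hypothetical check of the global-pool invariant described above. */
#include <assert.h>

static void globalmem_check_invariant(void)
{
	int i;
	int nfull = 0;

	for (i = 0; i < GLOBAL_POOL_SIZE; i++) {
		if (globalmem.pool[i] != NULL) {
			nfull++;
			assert(i <= globalmem.cur);  /* Full slots only at or below cur. */
		} else {
			assert(i > globalmem.cur);   /* Empty slots only above cur. */
		}
	}
	assert(globalmem.cur == nfull - 1);          /* cur is one less than the full count. */
}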
Figure 6.20: Allocator Pool Schematic

The operation of the pool data structures is illustrated by Figure 6.20, with the six boxes representing the array of pointers making up the pool field, and the number preceding them representing the cur field. The shaded boxes represent non-NULL pointers, while the empty boxes represent NULL pointers. An important, though potentially confusing, invariant of this data structure is that the cur field is always one smaller than the number of non-NULL pointers.

6.4.3.4 Allocation Function

Listing 6.10: Allocator-Cache Allocator Function
 1 struct memblock *memblock_alloc(void)
 2 {
 3   int i;
 4   struct memblock *p;
 5   struct perthreadmempool *pcpp;
 6
 7   pcpp = &__get_thread_var(perthreadmem);
 8   if (pcpp->cur < 0) {
 9     spin_lock(&globalmem.mutex);
10     for (i = 0; i < TARGET_POOL_SIZE &&
11          globalmem.cur >= 0; i++) {
12       pcpp->pool[i] = globalmem.pool[globalmem.cur];
13       globalmem.pool[globalmem.cur--] = NULL;
14     }
15     pcpp->cur = i - 1;
16     spin_unlock(&globalmem.mutex);
17   }
18   if (pcpp->cur >= 0) {
19     p = pcpp->pool[pcpp->cur];
20     pcpp->pool[pcpp->cur--] = NULL;
21     return p;
22   }
23   return NULL;
24 }

The allocation function memblock_alloc() may be seen in Listing 6.10. Line 7 picks up the current thread's per-thread pool, and line 8 checks to see if it is empty. If so, lines 9–16 attempt to refill it from the global pool under the spinlock acquired on line 9 and released on line 16. Lines 10–14 move blocks from the global to the per-thread pool until either the local pool reaches its target size (half full) or the global pool is exhausted, and line 15 sets the per-thread pool's count to the proper value.

In either case, line 18 checks for the per-thread pool still being empty, and if not, lines 19–21 remove a block and return it. Otherwise, line 23 tells the sad tale of memory exhaustion.

6.4.3.5 Free Function

Listing 6.11: Allocator-Cache Free Function
 1 void memblock_free(struct memblock *p)
 2 {
 3   int i;
 4   struct perthreadmempool *pcpp;
 5
 6   pcpp = &__get_thread_var(perthreadmem);
 7   if (pcpp->cur >= 2 * TARGET_POOL_SIZE - 1) {
 8     spin_lock(&globalmem.mutex);
 9     for (i = pcpp->cur; i >= TARGET_POOL_SIZE; i--) {
10       globalmem.pool[++globalmem.cur] = pcpp->pool[i];
11       pcpp->pool[i] = NULL;
12     }
13     pcpp->cur = i;
14     spin_unlock(&globalmem.mutex);
15   }
16   pcpp->pool[++pcpp->cur] = p;
17 }

Listing 6.11 shows the memory-block free function. Line 6 gets a pointer to this thread's pool, and line 7 checks to see if this per-thread pool is full.

If so, lines 8–15 empty half of the per-thread pool into the global pool, with lines 8 and 14 acquiring and releasing the spinlock. Lines 9–12 implement the loop moving blocks from the local to the global pool, and line 13 sets the per-thread pool's count to the proper value. In either case, line 16 then places the newly freed block into the per-thread pool.

Quick Quiz 6.23: Doesn't this resource-allocator design resemble that of the approximate limit counters covered in Section 5.3?

6.4.3.6 Performance

Rough performance results16 are shown in Figure 6.21, running on a dual-core Intel x86 running at 1 GHz (4300 bogomips per CPU) with at most six blocks allowed in each CPU's cache. In this micro-benchmark, each thread repeatedly allocates a group of blocks and then frees all the blocks in that group, with the number of blocks in the group being the "allocation run length" displayed on the x-axis. The y-axis shows the number of successful allocation/free pairs per microsecond—failed allocations are not counted. The "X"s are from a two-thread run, while the "+"s are from a single-threaded run.

16 This data was not collected in a statistically meaningful way, and therefore should be viewed with great skepticism and suspicion. Good data-collection and -reduction practice is discussed in Chapter 11. That said, repeated runs gave similar results, and these results match more careful evaluations of similar algorithms.
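The per-thread loop of this micro-benchmark might be sketched as follows. This is a simplified, hypothetical rendition rather than the actual test harness, and MAX_RUN is simply an assumed upper bound on the run length.

/* Hypothetical sketch of the allocation-run-length micro-benchmark. */
#define MAX_RUN 32  /* Assumed upper bound on the allocation run length. */

void *memblock_bench(void *arg)
{
	int runlength = *(int *)arg;
	struct memblock *blocks[MAX_RUN];
	int i;

	for (;;) {  /* A real harness would also count pairs and check for test end. */
		for (i = 0; i < runlength; i++)
			blocks[i] = memblock_alloc();
		for (i = 0; i < runlength; i++)
			if (blocks[i] != NULL)      /* Failed allocations are not counted. */
				memblock_free(blocks[i]);
	}
	return NULL;
}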
Figure 6.21: Allocator Cache Performance (allocations/frees per microsecond versus allocation run length)

Note that run lengths up to six scale linearly and give excellent performance, while run lengths greater than six show poor performance and almost always also show negative scaling. It is therefore quite important to size TARGET_POOL_SIZE sufficiently large, which fortunately is usually quite easy to do in actual practice [MSK01], especially given today's large memories. For example, in most systems, it is quite reasonable to set TARGET_POOL_SIZE to 100, in which case allocations and frees are guaranteed to be confined to per-thread pools at least 99 % of the time.

As can be seen from the figure, the situations where the common-case data-ownership applies (run lengths up to six) provide greatly improved performance compared to the cases where locks must be acquired. Avoiding synchronization in the common case will be a recurring theme through this book.

Quick Quiz 6.24: In Figure 6.21, there is a pattern of performance rising with increasing run length in groups of three samples, for example, for run lengths 10, 11, and 12. Why?

Quick Quiz 6.25: Allocation failures were observed in the two-thread tests at run lengths of 19 and greater. Given the global-pool size of 40 and the per-thread target pool size s of three, number of threads n equal to two, and assuming that the per-thread pools are initially empty with none of the memory in use, what is the smallest allocation run length m at which failures can occur? (Recall that each thread repeatedly allocates m blocks of memory, and then frees the m blocks of memory.) Alternatively, given n threads each with pool size s, and where each thread repeatedly first allocates m blocks of memory and then frees those m blocks, how large must the global pool size be? Note: Obtaining the correct answer will require you to examine the smpalloc.c source code, and very likely single-step it as well. You have been warned!

6.4.3.7 Validation

Validation of this simple allocator spawns a specified number of threads, with each thread repeatedly allocating a specified number of memory blocks and then deallocating them. This simple regimen suffices to exercise both the per-thread caches and the global pool, as can be seen in Figure 6.21.

Much more aggressive validation is required for memory allocators that are to be used in production. The test suites for tcmalloc [Ken20] and jemalloc [Eva11] are instructive, as are the tests for the Linux kernel's memory allocator.

6.4.3.8 Real-World Design

The toy parallel resource allocator was quite simple, but real-world designs expand on this approach in a number of ways.

First, real-world allocators are required to handle a wide range of allocation sizes, as opposed to the single size shown in this toy example. One popular way to do this is to offer a fixed set of sizes, spaced so as to balance external and internal fragmentation, such as in the late-1980s BSD memory allocator [MK88]. Doing this would mean that the "globalmem" variable would need to be replicated on a per-size basis, and that the associated lock would similarly be replicated, resulting in data locking rather than the toy program's code locking.

Second, production-quality systems must be able to repurpose memory, meaning that they must be able to coalesce blocks into larger structures, such as pages [MS93]. This coalescing will also need to be protected by a lock, which again could be replicated on a per-size basis.
Third, coalesced memory must be returned to the underlying memory system, and pages of memory must also be allocated from the underlying memory system. The locking required at this level will depend on that of the underlying memory system, but could well be code locking. Code locking can often be tolerated at this level, because this level is so infrequently reached in well-designed systems [MSK01].

Concurrent userspace allocators face similar challenges [Ken20, Eva11].

Despite this real-world design's greater complexity, the underlying idea is the same—repeated application of parallel fastpath, as shown in Table 6.1.

Table 6.1: Schematic of Real-World Parallel Allocator

  Level              Locking          Purpose
  Per-thread pool    Data ownership   High-speed allocation
  Global block pool  Data locking     Distributing blocks among threads
  Coalescing         Data locking     Combining blocks into pages
  System memory      Code locking     Memory from/to system

6.5 Beyond Partitioning

It is all right to aim high if you have plenty of ammunition.
Hawley R. Everhart

This chapter has discussed how data partitioning can be used to design simple linearly scalable parallel programs. Section 6.3.4 hinted at the possibilities of data replication, which will be used to great effect in Section 9.5.

The main goal of applying partitioning and replication is to achieve linear speedups, in other words, to ensure that the total amount of work required does not increase significantly as the number of CPUs or threads increases. A problem that can be solved via partitioning and/or replication, resulting in linear speedups, is embarrassingly parallel. But can we do better?

To answer this question, let us examine the solution of labyrinths and mazes. Of course, labyrinths and mazes have been objects of fascination for millennia [Wik12], so it should come as no surprise that they are generated and solved using computers, including biological computers [Ada11], GPGPUs [Eri08], and even discrete hardware [KFC11]. Parallel solution of mazes is sometimes used as a class project in universities [ETH11, Uni10] and as a vehicle to demonstrate the benefits of parallel-programming frameworks [Fos10].

Common advice is to use a parallel work-queue algorithm (PWQ) [ETH11, Fos10]. This section evaluates this advice by comparing PWQ against a sequential algorithm (SEQ) and also against an alternative parallel algorithm, in all cases solving randomly generated square mazes. Section 6.5.1 discusses PWQ, Section 6.5.2 discusses an alternative parallel algorithm, Section 6.5.4 analyzes its anomalous performance, Section 6.5.5 derives an improved sequential algorithm from the alternative parallel algorithm, Section 6.5.6 makes further performance comparisons, and finally Section 6.5.7 presents future directions and concluding remarks.

6.5.1 Work-Queue Parallel Maze Solver

PWQ is based on SEQ, which is shown in Listing 6.12 (pseudocode for maze_seq.c). The maze is represented by a 2D array of cells and a linear-array-based work queue named ->visited.

Listing 6.12: SEQ Pseudocode
 1 int maze_solve(maze *mp, cell sc, cell ec)
 2 {
 3   cell c = sc;
 4   cell n;
 5   int vi = 0;
 6
 7   maze_try_visit_cell(mp, c, c, &n, 1);
 8   for (;;) {
 9     while (!maze_find_any_next_cell(mp, c, &n)) {
10       if (++vi >= mp->vi)
11         return 0;
12       c = mp->visited[vi].c;
13     }
14     do {
15       if (n == ec) {
16         return 1;
17       }
18       c = n;
19     } while (maze_find_any_next_cell(mp, c, &n));
20     c = mp->visited[vi].c;
21   }
22 }

Line 7 visits the initial cell, and each iteration of the loop spanning lines 8–21 traverses passages headed by one cell. The loop spanning lines 9–13 scans the ->visited[] array for a visited cell with an unvisited neighbor, and the loop spanning lines 14–19 traverses one fork of the

Listing 6.13: SEQ Helper Pseudocode


1 2 3
1 int maze_try_visit_cell(struct maze *mp, cell c, cell t,
2 cell *n, int d)
3 {
4 if (!maze_cells_connected(mp, c, t) || 2 3 4
5 (*celladdr(mp, t) & VISITED))
6 return 0;
7 *n = t;
8 mp->visited[mp->vi] = t; 3 4 5
9 mp->vi++;
10 *celladdr(mp, t) |= VISITED | d;
11 return 1; Figure 6.22: Cell-Number Solution Tracking
12 }
13
14 int maze_find_any_next_cell(struct maze *mp, cell c,
15 cell *n) 1
16 {
0.9
17 int d = (*celladdr(mp, c) & DISTANCE) + 1;
18 0.8
19 if (maze_try_visit_cell(mp, c, prevcol(c), n, d))
0.7 PWQ
20 return 1;
if (maze_try_visit_cell(mp, c, nextcol(c), n, d))

Probability
21 0.6
22 return 1; 0.5 SEQ
23 if (maze_try_visit_cell(mp, c, prevrow(c), n, d))
24 return 1; 0.4
25 if (maze_try_visit_cell(mp, c, nextrow(c), n, d)) 0.3
26 return 1;
27 return 0; 0.2
28 } 0.1
0
0 20 40 60 80 100 120 140
CDF of Solution Time (ms)
submaze headed by that neighbor. Line 20 initializes for
the next pass through the outer loop. Figure 6.23: CDF of Solution Times For SEQ and PWQ
The pseudocode for maze_try_visit_cell() is
shown on lines 1–12 of Listing 6.13 (maze.c). Line 4
checks to see if cells c and t are adjacent and connected,
The parallel work-queue solver is a straightforward
while line 5 checks to see if cell t has not yet been vis-
parallelization of the algorithm shown in Listings 6.12
ited. The celladdr() function returns the address of the
and 6.13. Line 10 of Listing 6.12 must use fetch-and-
specified cell. If either check fails, line 6 returns failure.
add, and the local variable vi must be shared among the
Line 7 indicates the next cell, line 8 records this cell in the
various threads. Lines 5 and 10 of Listing 6.13 must be
next slot of the ->visited[] array, line 9 indicates that
combined into a CAS loop, with CAS failure indicating
this slot is now full, and line 10 marks this cell as visited
a loop in the maze. Lines 8–9 of this listing must use
and also records the distance from the maze start. Line 11
fetch-and-add to arbitrate concurrent attempts to record
then returns success.
cells in the ->visited[] array.
The pseudocode for maze_find_any_next_cell()
is shown on lines 14–28 of Listing 6.13 (maze.c). Line 17 This approach does provide significant speedups on a
picks up the current cell’s distance plus 1, while lines 19, dual-CPU Lenovo W500 running at 2.53 GHz, as shown
21, 23, and 25 check the cell in each direction, and in Figure 6.23, which shows the cumulative distribution
lines 20, 22, 24, and 26 return true if the corresponding functions (CDFs) for the solution times of the two al-
cell is a candidate next cell. The prevcol(), nextcol(), gorithms, based on the solution of 500 different square
prevrow(), and nextrow() each do the specified array- 500-by-500 randomly generated mazes. The substantial
index-conversion operation. If none of the cells is a overlap of the projection of the CDFs onto the x-axis will
candidate, line 27 returns false. be addressed in Section 6.5.4.
The path is recorded in the maze by counting the number Interestingly enough, the sequential solution-path track-
of cells from the starting point, as shown in Figure 6.22, ing works unchanged for the parallel algorithm. However,
where the starting cell is in the upper left and the ending this uncovers a significant weakness in the parallel algo-
cell is in the lower right. Starting at the ending cell and rithm: At most one thread may be making progress along
following consecutively decreasing cell numbers traverses the solution path at any given time. This weakness is
the solution. addressed in the next section.


Listing 6.14: Partitioned Parallel Solver Pseudocode Listing 6.15: Partitioned Parallel Helper Pseudocode
1 int maze_solve_child(maze *mp, cell *visited, cell sc) 1 int maze_try_visit_cell(struct maze *mp, int c, int t,
2 { 2 int *n, int d)
3 cell c; 3 {
4 cell n; 4 cell_t t;
5 int vi = 0; 5 cell_t *tp;
6 6 int vi;
7 myvisited = visited; myvi = &vi; 7
8 c = visited[vi]; 8 if (!maze_cells_connected(mp, c, t))
9 do { 9 return 0;
10 while (!maze_find_any_next_cell(mp, c, &n)) { 10 tp = celladdr(mp, t);
11 if (visited[++vi].row < 0) 11 do {
12 return 0; 12 t = READ_ONCE(*tp);
13 if (READ_ONCE(mp->done)) 13 if (t & VISITED) {
14 return 1; 14 if ((t & TID) != mytid)
15 c = visited[vi]; 15 mp->done = 1;
16 } 16 return 0;
17 do { 17 }
18 if (READ_ONCE(mp->done)) 18 } while (!CAS(tp, t, t | VISITED | myid | d));
19 return 1; 19 *n = t;
20 c = n; 20 vi = (*myvi)++;
21 } while (maze_find_any_next_cell(mp, c, &n)); 21 myvisited[vi] = t;
22 c = visited[vi]; 22 return 1;
23 } while (!READ_ONCE(mp->done)); 23 }
24 return 1;
25 }

suffices [Smi19]. Finally, the maze_find_any_next_


cell() function must use compare-and-swap to mark a
6.5.2 Alternative Parallel Maze Solver
cell as visited, however no constraints on ordering are
Youthful maze solvers are often urged to start at both ends, required beyond those provided by thread creation and
and this advice has been repeated more recently in the join.
context of automated maze solving [Uni10]. This advice The pseudocode for maze_find_any_next_cell()
amounts to partitioning, which has been a powerful paral- is identical to that shown in Listing 6.13, but the pseu-
lelization strategy in the context of parallel programming docode for maze_try_visit_cell() differs, and is
for both operating-system kernels [BK85, Inm85] and shown in Listing 6.15. Lines 8–9 check to see if the
applications [Pat10]. This section applies this strategy, cells are connected, returning failure if not. The loop
using two child threads that start at opposite ends of the spanning lines 11–18 attempts to mark the new cell visited.
solution path, and takes a brief look at the performance Line 13 checks to see if it has already been visited, in
and scalability consequences. which case line 16 returns failure, but only after line 14
The partitioned parallel algorithm (PART), shown in checks to see if we have encountered the other thread, in
Listing 6.14 (maze_part.c), is similar to SEQ, but has which case line 15 indicates that the solution has been
a few important differences. First, each child thread located. Line 19 updates to the new cell, lines 20 and 21
has its own visited array, passed in by the parent as update this thread’s visited array, and line 22 returns
shown on line 1, which must be initialized to all [−1, −1]. success.
Line 7 stores a pointer to this array into the per-thread Performance testing revealed a surprising anomaly,
variable myvisited to allow access by helper functions, shown in Figure 6.24. The median solution time for PART
and similarly stores a pointer to the local visit index. (17 milliseconds) is more than four times faster than that
Second, the parent visits the first cell on each child’s of SEQ (79 milliseconds), despite running on only two
behalf, which the child retrieves on line 8. Third, the threads.
maze is solved as soon as one child locates a cell that has The first reaction to such a dramatic performance anom-
been visited by the other child. When maze_try_visit_ aly is to check for bugs, which suggests stringent validation
cell() detects this, it sets a ->done field in the maze be applied. This is the topic of the next section.
structure. Fourth, each child must therefore periodically
check the ->done field, as shown on lines 13, 18, and 23. 6.5.3 Maze Validation
The READ_ONCE() primitive must disable any compiler
optimizations that might combine consecutive loads or Much of the validation effort comprised consistency
that might reload the value. A C++1x volatile relaxed load checks, which can be located by searching for ABORT()


Figure 6.24: CDF of Solution Times For SEQ, PWQ, and PART
Figure 6.25: CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios

in CodeSamples/SMPdesign/maze/*.c. Examples
checks include:

1. Maze solution steps that end up outside of the maze.

2. Mazes that suddenly have zero or fewer rows or


columns.

3. Newly created mazes with unreachable cells. Figure 6.26: Reason for Small Visit Percentages

4. Mazes that have no solution.


6.5.4 Performance Comparison I
5. Discontinuous maze solutions. Although the algorithms were in fact finding valid so-
lutions to valid mazes, the plot of CDFs in Figure 6.24
6. Attempts to start the maze solver outside of the maze. assumes independent data points. This is not the case:
The performance tests randomly generate a maze, and
7. Mazes whose solution path is longer than the number then run all solvers on that maze. It therefore makes sense
of cells in the maze. to plot the CDF of the ratios of solution times for each
generated maze, as shown in Figure 6.25, greatly reduc-
8. Subsolutions by different threads cross each other. ing the CDFs’ overlap. This plot reveals that for some
mazes, PART is more than forty times faster than SEQ. In
9. Memory-allocation failure. contrast, PWQ is never more than about two times faster
than SEQ. A forty-times speedup on two threads demands
10. System-call failure. explanation. After all, this is not merely embarrassingly
parallel, where partitionability means that adding threads
Additional manual validation was applied by Paul’s does not increase the overall computational cost. It is in-
wife, who greatly enjoys solving puzzles. stead humiliatingly parallel: Adding threads significantly
However, if this maze software was to be used in pro- reduces the overall computational cost, resulting in large
duction, whatever that might mean, it would be wise to algorithmic superlinear speedups.
construct an independent maze fsck program. Never- Further investigation showed that PART sometimes
theless, the mazes and solutions all proved to be quite visited fewer than 2 % of the maze’s cells, while SEQ
valid. The next section therefore more deeply analyzes and PWQ never visited fewer than about 9 %. The reason
the scalability anomaly called out in Section 6.5.2. for this difference is shown by Figure 6.26. If the thread


Figure 6.27: Correlation Between Visit Percentage and Solution Time
Figure 6.29: Effect of Compiler Optimization (-O3)

each cell with more than two neighbors. Each such cell
can result in contention in PWQ, because one thread can
enter but two threads can exit, which hurts performance,
as noted earlier in this chapter. In contrast, PART can
incur such contention but once, namely when the solution
is located. Of course, SEQ never contends.
Although PART’s speedup is impressive, we should
not neglect sequential optimizations. Figure 6.29 shows
that SEQ, when compiled with -O3, is about twice as
Figure 6.28: PWQ Potential Contention Points
fast as unoptimized PWQ, approaching the performance
of unoptimized PART. Compiling all three algorithms
with -O3 gives results similar to (albeit faster than) those
traversing the solution from the upper left reaches the
shown in Figure 6.25, except that PWQ provides almost
circle, the other thread cannot reach the upper-right portion
no speedup compared to SEQ, in keeping with Amdahl’s
of the maze. Similarly, if the other thread reaches the
Law [Amd67]. However, if the goal is to double per-
square, the first thread cannot reach the lower-left portion
formance compared to unoptimized SEQ, as opposed to
of the maze. Therefore, PART will likely visit a small
achieving optimality, compiler optimizations are quite
fraction of the non-solution-path cells. In short, the
attractive.
superlinear speedups are due to threads getting in each
others’ way. This is a sharp contrast with decades of Cache alignment and padding often improves perfor-
experience with parallel programming, where workers mance by reducing false sharing. However, for these maze-
have struggled to keep threads out of each others’ way. solution algorithms, aligning and padding the maze-cell
Figure 6.27 confirms a strong correlation between cells array degrades performance by up to 42 % for 1000x1000
visited and solution time for all three methods. The mazes. Cache locality is more important than avoiding
slope of PART’s scatterplot is smaller than that of SEQ, false sharing, especially for large mazes. For smaller
indicating that PART’s pair of threads visits a given 20-by-20 or 50-by-50 mazes, aligning and padding can
fraction of the maze faster than can SEQ’s single thread. produce up to a 40 % performance improvement for PART,
PART’s scatterplot is also weighted toward small visit but for these small sizes, SEQ performs better anyway
percentages, confirming that PART does less total work, because there is insufficient time for PART to make up for
hence the observed humiliating parallelism. the overhead of thread creation and destruction.
The fraction of cells visited by PWQ is similar to that In short, the partitioned parallel maze solver is an
of SEQ. In addition, PWQ’s solution time is greater than interesting example of an algorithmic superlinear speedup.
that of PART, even for equal visit fractions. The reason If “algorithmic superlinear speedup” causes cognitive
for this is shown in Figure 6.28, which has a red circle on dissonance, please proceed to the next section.


Figure 6.30: Partitioned Coroutines
Figure 6.31: Varying Maze Size vs. SEQ
Figure 6.32: Varying Maze Size vs. COPART
Figure 6.33: Mean Speedup vs. Number of Threads, 1000x1000 Maze
6.5.5 Alternative Sequential Maze Solver
The presence of algorithmic superlinear speedups sug- 90-percent-confidence error bars. PART shows superlin-
gests simulating parallelism via co-routines, for example, ear scalability against SEQ and modest scalability against
manually switching context between threads on each pass COPART for 100-by-100 and larger mazes. PART exceeds
through the main do-while loop in Listing 6.14. This theoretical energy-efficiency breakeven against COPART
context switching is straightforward because the context at roughly the 200-by-200 maze size, given that power
consists only of the variables c and vi: Of the numer- consumption rises as roughly the square of the frequency
ous ways to achieve the effect, this is a good tradeoff for high frequencies [Mud01], so that 1.4x scaling on two
between context-switch overhead and visit percentage. threads consumes the same energy as a single thread at
As can be seen in Figure 6.30, this coroutine algorithm equal solution speeds. In contrast, PWQ shows poor scala-
(COPART) is quite effective, with the performance on one bility against both SEQ and COPART unless unoptimized:
thread being within about 30 % of PART on two threads Figures 6.31 and 6.32 were generated using -O3.
(maze_2seq.c). Figure 6.33 shows the performance of PWQ and PART
relative to COPART. For PART runs with more than
6.5.6 Performance Comparison II two threads, the additional threads were started evenly
spaced along the diagonal connecting the starting and
Figures 6.31 and 6.32 show the effects of varying maze ending cells. Simplified link-state routing [BG87] was
size, comparing both PWQ and PART running on two used to detect early termination on PART runs with more
threads against either SEQ or COPART, respectively, with than two threads (the solution is flagged when a thread is


connected to both beginning and end). PWQ performs design-time application of parallelism is likely to be a
quite poorly, but PART hits breakeven at two threads and fruitful field of study. This section took the problem
again at five threads, achieving modest speedups beyond of solving mazes from mildly scalable to humiliatingly
five threads. Theoretical energy efficiency breakeven is parallel and back again. It is hoped that this experience will
within the 90-percent-confidence interval for seven and motivate work on parallelism as a first-class design-time
eight threads. The reasons for the peak at two threads whole-application optimization technique, rather than as
are (1) the lower complexity of termination detection in a grossly suboptimal after-the-fact micro-optimization to
the two-thread case and (2) the fact that there is a lower be retrofitted into existing programs.
probability of the third and subsequent threads making
useful forward progress: Only the first two threads are
guaranteed to start on the solution line. This disappointing 6.6 Partitioning, Parallelism, and
performance compared to results in Figure 6.32 is due to Optimization
the less-tightly integrated hardware available in the larger
and older Xeon system running at 2.66 GHz.
Knowledge is of no value unless you put it into
practice.
6.5.7 Future Directions and Conclusions Anton Chekhov

Much future work remains. First, this section applied Most important, although this chapter has demonstrated
only one technique used by human maze solvers. Oth- that applying parallelism at the design level gives excellent
ers include following walls to exclude portions of the results, this final section shows that this is not enough.
maze and choosing internal starting points based on the For search problems such as maze solution, this section
locations of previously traversed paths. Second, different has shown that search strategy is even more important
choices of starting and ending points might favor different than parallel design. Yes, for this particular type of maze,
algorithms. Third, although placement of the PART algo- intelligently applying parallelism identified a superior
rithm’s first two threads is straightforward, there are any search strategy, but this sort of luck is no substitute for a
number of placement schemes for the remaining threads. clear focus on search strategy itself.
Optimal placement might well depend on the starting As noted back in Section 2.2, parallelism is but one
and ending points. Fourth, study of unsolvable mazes potential optimization of many. A successful design needs
and cyclic mazes is likely to produce interesting results. to focus on the most important optimization. Much though
Fifth, the lightweight C++11 atomic operations might I might wish to claim otherwise, that optimization might
improve performance. Sixth, it would be interesting to or might not be parallelism.
compare the speedups for three-dimensional mazes (or of However, for the many cases where parallelism is the
even higher-order mazes). Finally, for mazes, humiliating right optimization, the next section covers that synchro-
parallelism indicated a more-efficient sequential imple- nization workhorse, locking.
mentation using coroutines. Do humiliatingly parallel
algorithms always lead to more-efficient sequential imple-
mentations, or are there inherently humiliatingly parallel
algorithms for which coroutine context-switch overhead
overwhelms the speedups?
This section demonstrated and analyzed parallelization
of maze-solution algorithms. A conventional work-queue-
based algorithm did well only when compiler optimiza-
tions were disabled, suggesting that some prior results
obtained using high-level/overhead languages will be in-
validated by advances in optimization.
This section gave a clear example where approaching
parallelism as a first-class optimization technique rather
than as a derivative of a sequential algorithm paves the
way for an improved sequential algorithm. High-level

Chapter 7

Locking

Locking is the worst general-purpose synchronization mechanism except for all those other mechanisms that have been tried from time to time.
With apologies to the memory of Winston Churchill and to whoever he was quoting

In recent concurrency research, locking often plays the role 5. Locking works extremely well for some software
of villain. Locking stands accused of inciting deadlocks, artifacts and extremely poorly for others. Developers
convoying, starvation, unfairness, data races, and all man- who have worked on artifacts for which locking works
ner of other concurrency sins. Interestingly enough, the well can be expected to have a much more positive
role of workhorse in production-quality shared-memory opinion of locking than those who have worked on
parallel software is also played by locking. This chapter artifacts for which locking works poorly, as will be
will look into this dichotomy between villain and hero, as discussed in Section 7.5.
fancifully depicted in Figures 7.1 and 7.2.
There are a number of reasons behind this Jekyll-and- 6. All good stories need a villain, and locking has a long
Hyde dichotomy: and honorable history serving as a research-paper
whipping boy.
1. Many of locking’s sins have pragmatic design solu-
tions that work well in most cases, for example: Quick Quiz 7.1: Just how can serving as a whipping boy be
considered to be in any way honorable???
(a) Use of lock hierarchies to avoid deadlock.
(b) Deadlock-detection tools, for example, the This chapter will give an overview of a number of ways
Linux kernel’s lockdep facility [Cor06a]. to avoid locking’s more serious sins.
(c) Locking-friendly data structures, such as arrays,
hash tables, and radix trees, which will be 7.1 Staying Alive
covered in Chapter 10.

2. Some of locking’s sins are problems only at high I work to stay alive.
levels of contention, levels reached only by poorly
Bette Davis
designed programs.

3. Some of locking’s sins are avoided by using other Given that locking stands accused of deadlock and starva-
synchronization mechanisms in concert with locking. tion, one important concern for shared-memory parallel
These other mechanisms include statistical counters developers is simply staying alive. The following sections
(see Chapter 5), reference counters (see Section 9.2), therefore cover deadlock, livelock, starvation, unfairness,
hazard pointers (see Section 9.3), sequence-locking and inefficiency.
readers (see Section 9.4), RCU (see Section 9.5),
and simple non-blocking data structures (see Sec- 7.1.1 Deadlock
tion 14.2).
Deadlock occurs when each member of a group of threads
4. Until quite recently, almost all large shared-memory is holding at least one lock while at the same time waiting
parallel programs were developed in secret, so that it on a lock held by a member of that same group. This
was not easy to learn of these pragmatic solutions. happens even in groups containing a single thread when

101

v2022.09.25a
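The single-thread case is easy to demonstrate. The following minimal sketch uses a default, and thus non-recursive, POSIX mutex rather than this book's locking primitives; the second acquisition typically hangs forever (strictly speaking, relocking a default mutex is undefined behavior, and an error-checking mutex would instead return EDEADLK):

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  int main(void)
  {
    pthread_mutex_lock(&lock);   /* First acquisition succeeds. */
    printf("acquired once\n");
    pthread_mutex_lock(&lock);   /* Second acquisition never returns... */
    printf("never printed\n");   /* ...so this line is never reached. */
    return 0;
  }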
Figure 7.1: Locking: Villain or Slob?

Figure 7.2: Locking: Workhorse or Hero?

Without some sort of external intervention, deadlock is forever. No thread can acquire the lock it is waiting on until that lock is released by the thread holding it, but the thread holding it cannot release it until the holding thread acquires the lock that it is in turn waiting on.

We can create a directed-graph representation of a deadlock scenario with nodes for threads and locks, as shown in Figure 7.3. An arrow from a lock to a thread indicates that the thread holds the lock, for example, Thread B holds Locks 2 and 4. An arrow from a thread to a lock indicates that the thread is waiting on the lock, for example, Thread B is waiting on Lock 3.

Figure 7.3: Deadlock Cycle

A deadlock scenario will always contain at least one deadlock cycle. In Figure 7.3, this cycle is Thread B, Lock 3, Thread C, Lock 4, and back to Thread B.

Quick Quiz 7.2: But the definition of lock-based deadlock only said that each thread was holding at least one lock and waiting on another lock that was held by some thread. How do you know that there is a cycle?

Although there are some software environments such as database systems that can recover from an existing deadlock, this approach requires either that one of the threads be killed or that a lock be forcibly stolen from one of the threads. This killing and forcible stealing works well for transactions, but is often problematic for kernel and application-level use of locking: Dealing with the resulting partially updated structures can be extremely complex, hazardous, and error-prone.

Therefore, kernels and applications should instead avoid deadlocks. Deadlock-avoidance strategies include locking hierarchies (Section 7.1.1.1), local locking hierarchies (Section 7.1.1.2), layered locking hierarchies (Section 7.1.1.3), temporal locking hierarchies (Section 7.1.1.4), strategies for dealing with APIs containing pointers to locks (Section 7.1.1.5), conditional locking (Section 7.1.1.6), acquiring all needed locks first (Section 7.1.1.7), single-lock-at-a-time designs (Section 7.1.1.8), and strategies for signal/interrupt handlers (Section 7.1.1.9). Although there is no deadlock-avoidance strategy that works perfectly for all situations, there is a good selection of tools to choose from.
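Before turning to these strategies, it may help to see the kind of cycle shown in Figure 7.3 in miniature. Reduced to two threads and rendered with POSIX primitives rather than the spin_lock() calls used in this chapter's listings, a sketch of such a cycle might look as follows (lock_a, lock_b, and the sleep() calls that widen the race window are illustrative choices, not a recipe from any real application):

  #include <pthread.h>
  #include <unistd.h>

  static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

  static void *thread_ab(void *arg)
  {
    pthread_mutex_lock(&lock_a);
    sleep(1);                     /* Widen the race window. */
    pthread_mutex_lock(&lock_b);  /* Waits forever once thread_ba holds lock_b. */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
  }

  static void *thread_ba(void *arg)
  {
    pthread_mutex_lock(&lock_b);
    sleep(1);
    pthread_mutex_lock(&lock_a);  /* Waits forever once thread_ab holds lock_a. */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
  }

  int main(void)
  {
    pthread_t t1, t2;

    pthread_create(&t1, NULL, thread_ab, NULL);
    pthread_create(&t2, NULL, thread_ba, NULL);
    pthread_join(t1, NULL);       /* Never returns: the two threads deadlock. */
    pthread_join(t2, NULL);
    return 0;
  }

With the sleep() calls removed, the deadlock becomes intermittent rather than near-certain, which is exactly what makes such cycles so troublesome in practice.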
7.1.1.1 Locking Hierarchies

Locking hierarchies order the locks and prohibit acquiring locks out of order. In Figure 7.3, we might order the locks numerically, thus forbidding a thread from acquiring a given lock if it already holds a lock with the same or a higher number. Thread B has violated this hierarchy because it is attempting to acquire Lock 3 while holding Lock 4. This violation permitted the deadlock to occur.

Again, to apply a locking hierarchy, order the locks and prohibit out-of-order lock acquisition. For different types of locks, it is helpful to have a carefully considered hierarchy from one type to the next. For many instances of the same type of lock, for example, a per-node lock in a search tree, the traditional approach is to carry out lock acquisition in order of the addresses of the locks to be acquired. Either way, in a large program, it is wise to use tools such as the Linux-kernel lockdep [Cor06a] to enforce your locking hierarchy.

7.1.1.2 Local Locking Hierarchies

However, the global nature of locking hierarchies makes them difficult to apply to library functions. After all, when a program using a given library function has not yet been written, how can the poor library-function implementor possibly follow the yet-to-be-defined locking hierarchy?

One special (but common) case is when the library function does not invoke any of the caller's code. In this case, the caller's locks will never be acquired while holding any of the library's locks, so that there cannot be a deadlock cycle containing locks from both the library and the caller.

Quick Quiz 7.3: Are there any exceptions to this rule, so that there really could be a deadlock cycle containing locks from both the library and the caller, even given that the library code never invokes any of the caller's functions?

But suppose that a library function does invoke the caller's code. For example, qsort() invokes a caller-provided comparison function. Now, normally this comparison function will operate on unchanging local data, so that it need not acquire locks, as shown in Figure 7.4.

Figure 7.4: No qsort() Compare-Function Locking

But maybe someone is crazy enough to sort a collection whose keys are changing, thus requiring that the comparison function acquire locks, which might result in deadlock, as shown in Figure 7.5. How can the library function avoid this deadlock?

The golden rule in this case is "Release all locks before invoking unknown code." To follow this rule, the qsort() function must release all of its locks before invoking the comparison function. Thus qsort() will not be holding any of its locks while the comparison function acquires any of the caller's locks, thus avoiding deadlock.

Quick Quiz 7.4: But if qsort() releases all its locks before invoking the comparison function, how can it protect against races with other qsort() threads?
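In miniature, and with invented names (lib_apply() is a hypothetical library routine, not an actual qsort() implementation), the golden rule looks like this: copy whatever the callback needs, drop every library lock, and only then call the unknown code:

  #include <pthread.h>

  /* Hypothetical library-internal state, invented for illustration. */
  static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
  static int lib_data;

  /* Apply a caller-supplied callback to the library's data. */
  void lib_apply(void (*cb)(int))
  {
    int snapshot;

    pthread_mutex_lock(&lib_lock);
    snapshot = lib_data;             /* Copy what the callback needs... */
    pthread_mutex_unlock(&lib_lock); /* ...release all library locks... */
    cb(snapshot);                    /* ...so unknown code runs lock-free here. */
  }

The price of this discipline is that the callback sees a snapshot rather than live data, a tradeoff each library must weigh for itself.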
To see the benefits of local locking hierarchies, compare Figures 7.5 and 7.6. In both figures, application functions foo() and bar() invoke qsort() while holding Locks A and B, respectively. Because this is a parallel implementation of qsort(), it acquires Lock C. Function foo() passes function cmp() to qsort(), and cmp() acquires Lock B. Function bar() passes a simple integer-comparison function (not shown) to qsort(), and this simple function does not acquire any locks.

Now, if qsort() holds Lock C while calling cmp() in violation of the golden release-all-locks rule above, as shown in Figure 7.5, deadlock can occur. To see this, suppose that one thread invokes foo() while a second thread concurrently invokes bar(). The first thread will acquire Lock A and the second thread will acquire Lock B. If the first thread's call to qsort() acquires Lock C, then it will be unable to acquire Lock B when it calls cmp(). But the first thread holds Lock C, so the second thread's call to qsort() will be unable to acquire it, and thus unable to release Lock B, resulting in deadlock.

In contrast, if qsort() releases Lock C before invoking the comparison function, which is unknown code from qsort()'s perspective, then deadlock is avoided as shown in Figure 7.6.

Figure 7.5: Without qsort() Local Locking Hierarchy

Figure 7.6: Local Locking Hierarchy for qsort()

If each module releases all locks before invoking unknown code, then deadlock is avoided if each module separately avoids deadlock. This rule therefore greatly simplifies deadlock analysis and greatly improves modularity.

7.1.1.3 Layered Locking Hierarchies

Unfortunately, it might not be possible for qsort() to release all of its locks before invoking the comparison function. In this case, we cannot construct a local locking hierarchy by releasing all locks before invoking unknown code. However, we can instead construct a layered locking hierarchy, as shown in Figure 7.7. Here, the cmp() function uses a new Lock D that is acquired after all of Locks A, B, and C, avoiding deadlock. We therefore have three layers to the global deadlock hierarchy, the first containing Locks A and B, the second containing Lock C, and the third containing Lock D.

Figure 7.7: Layered Locking Hierarchy for qsort()
Please note that it is not typically possible to mechanically change cmp() to use the new Lock D. Quite the opposite: It is often necessary to make profound design-level modifications. Nevertheless, the effort required for such modifications is normally a small price to pay in order to avoid deadlock. More to the point, this potential deadlock should preferably be detected at design time, before any code has been generated!

For another example where releasing all locks before invoking unknown code is impractical, imagine an iterator over a linked list, as shown in Listing 7.1 (locked_list.c). The list_start() function acquires a lock on the list and returns the first element (if there is one), and list_next() either returns a pointer to the next element in the list or releases the lock and returns NULL if the end of the list has been reached.

Listing 7.1: Concurrent List Iterator
 1 struct locked_list {
 2   spinlock_t s;
 3   struct cds_list_head h;
 4 };
 5
 6 struct cds_list_head *list_start(struct locked_list *lp)
 7 {
 8   spin_lock(&lp->s);
 9   return list_next(lp, &lp->h);
10 }
11
12 struct cds_list_head *list_next(struct locked_list *lp,
13                                 struct cds_list_head *np)
14 {
15   struct cds_list_head *ret;
16
17   ret = np->next;
18   if (ret == &lp->h) {
19     spin_unlock(&lp->s);
20     ret = NULL;
21   }
22   return ret;
23 }

Listing 7.2 shows how this list iterator may be used. Lines 1–4 define the list_ints element containing a single integer, and lines 6–17 show how to iterate over the list. Line 11 locks the list and fetches a pointer to the first element, line 13 provides a pointer to our enclosing list_ints structure, line 14 prints the corresponding integer, and line 15 moves to the next element. This is quite simple, and hides all of the locking.

Listing 7.2: Concurrent List Iterator Usage
 1 struct list_ints {
 2   struct cds_list_head n;
 3   int a;
 4 };
 5
 6 void list_print(struct locked_list *lp)
 7 {
 8   struct cds_list_head *np;
 9   struct list_ints *ip;
10
11   np = list_start(lp);
12   while (np != NULL) {
13     ip = cds_list_entry(np, struct list_ints, n);
14     printf("\t%d\n", ip->a);
15     np = list_next(lp, np);
16   }
17 }

That is, the locking remains hidden as long as the code processing each list element does not itself acquire a lock that is held across some other call to list_start() or list_next(), which results in deadlock. We can avoid the deadlock by layering the locking hierarchy to take the list-iterator locking into account.

This layered approach can be extended to an arbitrarily large number of layers, but each added layer increases the complexity of the locking design. Such increases in complexity are particularly inconvenient for some types of object-oriented designs, in which control passes back and forth among a large group of objects in an undisciplined manner.1 This mismatch between the habits of object-oriented design and the need to avoid deadlock is an important reason why parallel programming is perceived by some to be so difficult.

1 One name for this is "object-oriented spaghetti code."

Some alternatives to highly layered locking hierarchies are covered in Chapter 9.

7.1.1.4 Temporal Locking Hierarchies

One way to avoid deadlock is to defer acquisition of one of the conflicting locks. This approach is used in Linux-kernel RCU, whose call_rcu() function is invoked by the Linux-kernel scheduler while holding its locks. This means that call_rcu() cannot always safely invoke the scheduler to do a wakeup, for example, in order to wake up an RCU kthread in order to start the new grace period that is required by the callback queued by call_rcu().

Quick Quiz 7.5: What do you mean "cannot always safely invoke the scheduler"? Either call_rcu() can or cannot safely invoke the scheduler, right?

However, grace periods last for many milliseconds, so waiting another millisecond before starting a new grace period is not normally a problem. Therefore, if call_rcu() detects a possible deadlock with the scheduler, it arranges to start the new grace period later, either within a timer
handler or within the scheduler-clock interrupt handler, depending on configuration. Because no scheduler locks are held across either handler, deadlock is successfully avoided.

The overall approach is thus to adhere to a locking hierarchy by deferring lock acquisition to an environment in which no locks are held.

7.1.1.5 Locking Hierarchies and Pointers to Locks

Although there are some exceptions, an external API containing a pointer to a lock is very often a misdesigned API. Handing an internal lock to some other software component is after all the antithesis of information hiding, which is in turn a key design principle.

Quick Quiz 7.6: Name one common situation where a pointer to a lock is passed into a function.

One exception is functions that hand off some entity, where the caller's lock must be held until the handoff is complete, but where the lock must be released before the function returns. One example of such a function is the POSIX pthread_cond_wait() function, where passing a pointer to a pthread_mutex_t prevents hangs due to lost wakeups.

Quick Quiz 7.7: Doesn't the fact that pthread_cond_wait() first releases the mutex and then re-acquires it eliminate the possibility of deadlock?

In short, if you find yourself exporting an API with a pointer to a lock as an argument or as the return value, do yourself a favor and carefully reconsider your API design. It might well be the right thing to do, but experience indicates that this is unlikely.

7.1.1.6 Conditional Locking

But suppose that there is no reasonable locking hierarchy. This can happen in real life, for example, in some types of layered network protocol stacks where packets flow in both directions, for example, in implementations of distributed lock managers. In the networking case, it might be necessary to hold the locks from both layers when passing a packet from one layer to another. Given that packets travel both up and down the protocol stack, this is an excellent recipe for deadlock, as illustrated in Listing 7.3. Here, a packet moving down the stack towards the wire must acquire the next layer's lock out of order. Given that packets moving up the stack away from the wire are acquiring the locks in order, the lock acquisition in line 4 of the listing can result in deadlock.

Listing 7.3: Protocol Layering and Deadlock
 1 spin_lock(&lock2);
 2 layer_2_processing(pkt);
 3 nextlayer = layer_1(pkt);
 4 spin_lock(&nextlayer->lock1);
 5 layer_1_processing(pkt);
 6 spin_unlock(&lock2);
 7 spin_unlock(&nextlayer->lock1);

Listing 7.4: Avoiding Deadlock Via Conditional Locking
 1 retry:
 2 spin_lock(&lock2);
 3 layer_2_processing(pkt);
 4 nextlayer = layer_1(pkt);
 5 if (!spin_trylock(&nextlayer->lock1)) {
 6   spin_unlock(&lock2);
 7   spin_lock(&nextlayer->lock1);
 8   spin_lock(&lock2);
 9   if (layer_1(pkt) != nextlayer) {
10     spin_unlock(&nextlayer->lock1);
11     spin_unlock(&lock2);
12     goto retry;
13   }
14 }
15 layer_1_processing(pkt);
16 spin_unlock(&lock2);
17 spin_unlock(&nextlayer->lock1);

One way to avoid deadlocks in this case is to impose a locking hierarchy, but when it is necessary to acquire a lock out of order, acquire it conditionally, as shown in Listing 7.4. Instead of unconditionally acquiring the layer-1 lock, line 5 conditionally acquires the lock using the spin_trylock() primitive. This primitive acquires the lock immediately if the lock is available (returning non-zero), and otherwise returns zero without acquiring the lock.

If spin_trylock() was successful, line 15 does the needed layer-1 processing. Otherwise, line 6 releases the lock, and lines 7 and 8 acquire them in the correct order. Unfortunately, there might be multiple networking devices on the system (e.g., Ethernet and WiFi), so that the layer_1() function must make a routing decision. This decision might change at any time, especially if the system is mobile.2 Therefore, line 9 must recheck the decision, and if it has changed, must release the locks and start over.

2 And, in contrast to the 1900s, mobility is the common case.

Quick Quiz 7.8: Can the transformation from Listing 7.3 to Listing 7.4 be applied universally?

Quick Quiz 7.9: But the complexity in Listing 7.4 is well worthwhile given that it avoids deadlock, right?
7.1.1.7 Acquire Needed Locks First

In an important special case of conditional locking, all needed locks are acquired before any processing is carried out, where the needed locks might be identified by hashing the addresses of the data structures involved. In this case, processing need not be idempotent: If it turns out to be impossible to acquire a given lock without first releasing one that was already acquired, just release all the locks and try again. Only once all needed locks are held will any processing be carried out.

However, this procedure can result in livelock, which will be discussed in Section 7.1.2.

Quick Quiz 7.10: When using the "acquire needed locks first" approach described in Section 7.1.1.7, how can livelock be avoided?

A related approach, two-phase locking [BHG87], has seen long production use in transactional database systems. In the first phase of a two-phase locking transaction, locks are acquired but not released. Once all needed locks have been acquired, the transaction enters the second phase, where locks are released, but not acquired. This locking approach allows databases to provide serializability guarantees for their transactions, in other words, to guarantee that all values seen and produced by the transactions are consistent with some global ordering of all the transactions. Many such systems rely on the ability to abort transactions, although this can be simplified by avoiding making any changes to shared data until all needed locks are acquired. Livelock and deadlock are issues in such systems, but practical solutions may be found in any of a number of database textbooks.

7.1.1.8 Single-Lock-at-a-Time Designs

In some cases, it is possible to avoid nesting locks, thus avoiding deadlock. For example, if a problem is perfectly partitionable, a single lock may be assigned to each partition. Then a thread working on a given partition need only acquire the one corresponding lock. Because no thread ever holds more than one lock at a time, deadlock is impossible.

However, there must be some mechanism to ensure that the needed data structures remain in existence during the time that neither lock is held. One such mechanism is discussed in Section 7.4 and several others are presented in Chapter 9.

7.1.1.9 Signal/Interrupt Handlers

Deadlocks involving signal handlers are often quickly dismissed by noting that it is not legal to invoke pthread_mutex_lock() from within a signal handler [Ope97]. However, it is possible (though often unwise) to hand-craft locking primitives that can be invoked from signal handlers. Besides which, almost all operating-system kernels permit locks to be acquired from within interrupt handlers, which are analogous to signal handlers.

The trick is to block signals (or disable interrupts, as the case may be) when acquiring any lock that might be acquired within a signal (or an interrupt) handler. Furthermore, if holding such a lock, it is illegal to attempt to acquire any lock that is ever acquired outside of a signal handler without blocking signals.

Quick Quiz 7.11: Suppose Lock A is never acquired within a signal handler, but Lock B is acquired both from thread context and by signal handlers. Suppose further that Lock A is sometimes acquired with signals unblocked. Why is it illegal to acquire Lock A holding Lock B?

If a lock is acquired by the handlers for several signals, then each and every one of these signals must be blocked whenever that lock is acquired, even when that lock is acquired within a signal handler.

Quick Quiz 7.12: How can you legally block signals within a signal handler?

Unfortunately, blocking and unblocking signals can be expensive in some operating systems, notably including Linux, so performance concerns often mean that locks acquired in signal handlers are only acquired in signal handlers, and that lockless synchronization mechanisms are used to communicate between application code and signal handlers.

Or that signal handlers are avoided completely except for handling fatal errors.

Quick Quiz 7.13: If acquiring locks in signal handlers is such a bad idea, why even discuss ways of making it safe?
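For the thread-context side of this advice, a minimal sketch might block the relevant signal for the duration of the critical section. The handler_lock and handler_data names are invented for illustration, and an actual handler would need a hand-crafted async-signal-safe lock rather than pthread_mutex_lock(); only the thread-context discipline is shown here:

  #include <pthread.h>
  #include <signal.h>

  /* Hypothetical lock that a hand-crafted SIGUSR1 handler also acquires. */
  static pthread_mutex_t handler_lock = PTHREAD_MUTEX_INITIALIZER;
  static int handler_data;

  void update_handler_data(int value)
  {
    sigset_t mask, omask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);
    /* Block the signal first, so the handler cannot run on this
     * thread while handler_lock is held. */
    pthread_sigmask(SIG_BLOCK, &mask, &omask);
    pthread_mutex_lock(&handler_lock);
    handler_data = value;
    pthread_mutex_unlock(&handler_lock);
    /* Restore the original signal mask only after releasing the lock. */
    pthread_sigmask(SIG_SETMASK, &omask, NULL);
  }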
7.1.1.10 Discussion

There are a large number of deadlock-avoidance strategies available to the shared-memory parallel programmer, but there are sequential programs for which none of them is a good fit. This is one of the reasons that expert programmers have more than one tool in their toolbox: Locking is a powerful concurrency tool, but there are jobs better addressed with other tools.

Quick Quiz 7.14: Given an object-oriented application that passes control freely among a group of objects such that there is no straightforward locking hierarchy,a layered or otherwise, how can this application be parallelized?

a Also known as "object-oriented spaghetti code."

Nevertheless, the strategies described in this section have proven quite useful in many settings.

7.1.2 Livelock and Starvation

Although conditional locking can be an effective deadlock-avoidance mechanism, it can be abused. Consider for example the beautifully symmetric example shown in Listing 7.5. This example's beauty hides an ugly livelock. To see this, consider the following sequence of events:

Listing 7.5: Abusing Conditional Locking
 1 void thread1(void)
 2 {
 3 retry:
 4   spin_lock(&lock1);
 5   do_one_thing();
 6   if (!spin_trylock(&lock2)) {
 7     spin_unlock(&lock1);
 8     goto retry;
 9   }
10   do_another_thing();
11   spin_unlock(&lock2);
12   spin_unlock(&lock1);
13 }
14
15 void thread2(void)
16 {
17 retry:
18   spin_lock(&lock2);
19   do_a_third_thing();
20   if (!spin_trylock(&lock1)) {
21     spin_unlock(&lock2);
22     goto retry;
23   }
24   do_a_fourth_thing();
25   spin_unlock(&lock1);
26   spin_unlock(&lock2);
27 }

1. Thread 1 acquires lock1 on line 4, then invokes do_one_thing().

2. Thread 2 acquires lock2 on line 18, then invokes do_a_third_thing().

3. Thread 1 attempts to acquire lock2 on line 6, but fails because Thread 2 holds it.

4. Thread 2 attempts to acquire lock1 on line 20, but fails because Thread 1 holds it.

5. Thread 1 releases lock1 on line 7, then jumps to retry at line 3.

6. Thread 2 releases lock2 on line 21, and jumps to retry at line 17.

7. The livelock dance repeats from the beginning.

Quick Quiz 7.15: How can the livelock shown in Listing 7.5 be avoided?

Livelock can be thought of as an extreme form of starvation where a group of threads starves, rather than just one of them.3

3 Try not to get too hung up on the exact definitions of terms like livelock, starvation, and unfairness. Anything that causes a group of threads to fail to make adequate forward progress is a bug that needs to be fixed, and debating names doesn't fix bugs.

Livelock and starvation are serious issues in software transactional memory implementations, and so the concept of contention manager has been introduced to encapsulate these issues. In the case of locking, simple exponential backoff can often address livelock and starvation. The idea is to introduce exponentially increasing delays before each retry, as shown in Listing 7.6.

Listing 7.6: Conditional Locking and Exponential Backoff
 1 void thread1(void)
 2 {
 3   unsigned int wait = 1;
 4 retry:
 5   spin_lock(&lock1);
 6   do_one_thing();
 7   if (!spin_trylock(&lock2)) {
 8     spin_unlock(&lock1);
 9     sleep(wait);
10     wait = wait << 1;
11     goto retry;
12   }
13   do_another_thing();
14   spin_unlock(&lock2);
15   spin_unlock(&lock1);
16 }
17
18 void thread2(void)
19 {
20   unsigned int wait = 1;
21 retry:
22   spin_lock(&lock2);
23   do_a_third_thing();
24   if (!spin_trylock(&lock1)) {
25     spin_unlock(&lock2);
26     sleep(wait);
27     wait = wait << 1;
28     goto retry;
29   }
30   do_a_fourth_thing();
31   spin_unlock(&lock1);
32   spin_unlock(&lock2);
33 }
Quick Quiz 7.16: What problems can you spot in the code in Listing 7.6?

For better results, backoffs should be bounded, and even better high-contention results are obtained via queued locking [And90], which is discussed more in Section 7.3.2. Of course, best of all is to use a good parallel design that avoids these problems by maintaining low lock contention.

7.1.3 Unfairness

Unfairness can be thought of as a less-severe form of starvation, where a subset of threads contending for a given lock are granted the lion's share of the acquisitions. This can happen on machines with shared caches or NUMA characteristics, for example, as shown in Figure 7.8. If CPU 0 releases a lock that all the other CPUs are attempting to acquire, the interconnect shared between CPUs 0 and 1 means that CPU 1 will have an advantage over CPUs 2–7. Therefore CPU 1 will likely acquire the lock. If CPU 1 holds the lock long enough for CPU 0 to be requesting the lock by the time CPU 1 releases it and vice versa, the lock can shuttle between CPUs 0 and 1, bypassing CPUs 2–7.

Figure 7.8: System Architecture and Lock Unfairness (annotated with the speed-of-light round-trip distance in vacuum for a 1.8 GHz clock period, 8 cm)

Quick Quiz 7.17: Wouldn't it be better just to use a good parallel design so that lock contention was low enough to avoid unfairness?

7.1.4 Inefficiency

Locks are implemented using atomic instructions and memory barriers, and often involve cache misses. As we saw in Chapter 3, these instructions are quite expensive, roughly two orders of magnitude greater overhead than simple instructions. This can be a serious problem for locking: If you protect a single instruction with a lock, you will increase the overhead by a factor of one hundred. Even assuming perfect scalability, one hundred CPUs would be required to keep up with a single CPU executing the same code without locking.

This situation is not confined to locking. Figure 7.9 shows how this same principle applies to the age-old activity of sawing wood. As can be seen in the figure, sawing a board converts a small piece of that board (the width of the saw blade) into sawdust. Of course, locks partition time instead of sawing wood,4 but just like sawing wood, using locks to partition time wastes some of that time due to lock overhead and (worse yet) lock contention. One important difference is that if someone saws a board into too-small pieces, the resulting conversion of most of that board into sawdust will be immediately obvious. In contrast, it is not always obvious that a given lock acquisition is wasting excessive amounts of time.

Figure 7.9: Saw Kerf

4 That is, locking is temporal synchronization. Mechanisms that synchronize both temporally and spatially are described in Chapter 9.

And this situation underscores the importance of the synchronization-granularity tradeoff discussed in Section 6.3, especially Figure 6.16: Too coarse a granularity will limit scalability, while too fine a granularity will result in excessive synchronization overhead.

Acquiring a lock might be expensive, but once held, the CPU's caches are an effective performance booster, at least for large critical sections. In addition, once a lock is held, the data protected by that lock can be accessed by the lock holder without interference from other threads.

Quick Quiz 7.18: How might the lock holder be interfered with?
The Rust programming language takes lock/data association a step further by allowing the developer to make a compiler-visible association between a lock and the data that it protects [JJKD21]. When such an association has been made, attempts to access the data without the benefit of the corresponding lock will result in a compile-time diagnostic. The hope is that this will greatly reduce the frequency of this class of bugs. Of course, this approach does not apply straightforwardly to cases where the data to be locked is distributed throughout the nodes of some data structure or when that which is locked is purely abstract, for example, when a small subset of state-machine transitions is to be protected by a given lock. For this reason, Rust allows locks to be associated with types rather than data items or even to be associated with nothing at all. This last option permits Rust to emulate traditional locking use cases, but is not popular among Rust developers. Perhaps the Rust community will come up with other mechanisms tailored to other locking use cases.

7.2 Types of Locks

Only locks in life are what you think you know, but don't. Accept your ignorance and try something new.

Dennis Vickers

There are a surprising number of types of locks, more than this short chapter can possibly do justice to. The following sections discuss exclusive locks (Section 7.2.1), reader-writer locks (Section 7.2.2), multi-role locks (Section 7.2.3), and scoped locking (Section 7.2.4).

7.2.1 Exclusive Locks

Exclusive locks are what they say they are: Only one thread may hold the lock at a time. The holder of such a lock thus has exclusive access to all data protected by that lock, hence the name.

Of course, this all assumes that this lock is held across all accesses to data purportedly protected by the lock. Although there are some tools that can help (see for example Section 12.3.1), the ultimate responsibility for ensuring that the lock is always acquired when needed rests with the developer.

Quick Quiz 7.19: Does it ever make sense to have an exclusive lock acquisition immediately followed by a release of that same lock, that is, an empty critical section?

It is important to note that unconditionally acquiring an exclusive lock has two effects: (1) Waiting for all prior holders of that lock to release it and (2) Blocking any other acquisition attempts until the lock is released. As a result, at lock acquisition time, any concurrent acquisitions of that lock must be partitioned into prior holders and subsequent holders. Different types of exclusive locks use different partitioning strategies [Bra11, GGL+19], for example:

1. Strict FIFO, with acquisitions starting earlier acquiring the lock earlier.

2. Approximate FIFO, with acquisitions starting sufficiently earlier acquiring the lock earlier.

3. FIFO within priority level, with higher-priority threads acquiring the lock earlier than any lower-priority threads attempting to acquire the lock at about the same time, but so that some FIFO ordering applies for threads of the same priority.

4. Random, so that the new lock holder is chosen randomly from all threads attempting acquisition, regardless of timing.

5. Unfair, so that a given acquisition might never acquire the lock (see Section 7.1.3).

Unfortunately, locking implementations with stronger guarantees typically incur higher overhead, motivating the wide variety of locking implementations in production use. For example, real-time systems often require some degree of FIFO ordering within priority level, and much else besides (see Section 14.3.5.1), while non-realtime systems subject to high contention might require only enough ordering to avoid starvation, and finally, non-realtime systems designed to avoid contention might not need fairness at all.

7.2.2 Reader-Writer Locks

Reader-writer locks [CHP71] permit any number of readers to hold the lock concurrently on the one hand or a single writer to hold the lock on the other. In theory, then, reader-writer locks should allow excellent scalability for data that is read often and written rarely. In practice, the scalability will depend on the reader-writer lock implementation.
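As a concrete baseline, POSIX supplies pthread_rwlock_t. The following minimal sketch (the routedb_* names are invented for illustration) shows the usual read-mostly usage pattern:

  #include <pthread.h>

  /* Hypothetical read-mostly data protected by a reader-writer lock. */
  static pthread_rwlock_t routedb_lock = PTHREAD_RWLOCK_INITIALIZER;
  static int routedb_version;

  int routedb_read_version(void)
  {
    int v;

    pthread_rwlock_rdlock(&routedb_lock);  /* Many readers may hold this. */
    v = routedb_version;
    pthread_rwlock_unlock(&routedb_lock);
    return v;
  }

  void routedb_bump_version(void)
  {
    pthread_rwlock_wrlock(&routedb_lock);  /* Writers get exclusive access. */
    routedb_version++;
    pthread_rwlock_unlock(&routedb_lock);
  }

Of course, for a read-side critical section this short, the acquisition and release overhead will dominate any read-side parallelism, which is exactly the implementation caveat discussed next.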
The classic reader-writer lock implementation involves a set of counters and flags that are manipulated atomically. This type of implementation suffers from the same problem as does exclusive locking for short critical sections: The overhead of acquiring and releasing the lock is about two orders of magnitude greater than the overhead of a simple instruction. Of course, if the critical section is long enough, the overhead of acquiring and releasing the lock becomes negligible. However, because only one thread at a time can be manipulating the lock, the required critical-section size increases with the number of CPUs.

It is possible to design a reader-writer lock that is much more favorable to readers through use of per-thread exclusive locks [HW92]. To read, a thread acquires only its own lock. To write, a thread acquires all locks. In the absence of writers, each reader incurs only atomic-instruction and memory-barrier overhead, with no cache misses, which is quite good for a locking primitive. Unfortunately, writers must incur cache misses as well as atomic-instruction and memory-barrier overhead—multiplied by the number of threads.

In short, reader-writer locks can be quite useful in a number of situations, but each type of implementation does have its drawbacks. The canonical use case for reader-writer locking involves very long read-side critical sections, preferably measured in hundreds of microseconds or even milliseconds.

As with exclusive locks, a reader-writer lock acquisition cannot complete until all prior conflicting holders of that lock have released it. If a lock is read-held, then read acquisitions can complete immediately, but write acquisitions must wait until there are no longer any readers holding the lock. If a lock is write-held, then all acquisitions must wait until the writer releases the lock. Again as with exclusive locks, different reader-writer lock implementations provide different degrees of FIFO ordering to readers on the one hand and to writers on the other.

But suppose a large number of readers hold the lock and a writer is waiting to acquire the lock. Should readers be allowed to continue to acquire the lock, possibly starving the writer? Similarly, suppose that a writer holds the lock and that a large number of both readers and writers are waiting to acquire the lock. When the current writer releases the lock, should it be given to a reader or to another writer? If it is given to a reader, how many readers should be allowed to acquire the lock before the next writer is permitted to do so?

There are many possible answers to these questions, with different levels of complexity, overhead, and fairness. Different implementations might have different costs, for example, some types of reader-writer locks incur extremely large latencies when switching from read-holder to write-holder mode. Here are a few possible approaches:

1. Reader-preference implementations unconditionally favor readers over writers, possibly allowing write acquisitions to be indefinitely blocked.

2. Batch-fair implementations ensure that when both readers and writers are acquiring the lock, both have reasonable access via batching. For example, the lock might admit five readers per CPU, then two writers, then five more readers per CPU, and so on.

3. Writer-preference implementations unconditionally favor writers over readers, possibly allowing read acquisitions to be indefinitely blocked.

Of course, these distinctions matter only under conditions of high lock contention.

Please keep the waiting/blocking dual nature of locks firmly in mind. This will be revisited in Chapter 9's discussion of scalable high-performance special-purpose alternatives to locking.

7.2.3 Beyond Reader-Writer Locks

Reader-writer locks and exclusive locks differ in their admission policy: Exclusive locks allow at most one holder, while reader-writer locks permit an arbitrary number of read-holders (but only one write-holder). There is a very large number of possible admission policies, one of which is that of the VAX/VMS distributed lock manager (DLM) [ST87], which is shown in Table 7.1. Blank cells indicate compatible modes, while cells containing "X" indicate incompatible modes.

Table 7.1: VAX/VMS Distributed Lock Manager Policy
(columns abbreviated: NL = Null (Not Held), CR = Concurrent Read, CW = Concurrent Write, PR = Protected Read, PW = Protected Write, EX = Exclusive)

                         NL   CR   CW   PR   PW   EX
  Null (Not Held)
  Concurrent Read                                  X
  Concurrent Write                       X    X    X
  Protected Read                    X         X    X
  Protected Write                   X    X    X    X
  Exclusive                    X    X    X    X    X
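Encoded as a lookup table in C, Table 7.1 might look as follows. This is only a sketch of the admission policy itself, not any DLM API, and the abbreviated mode names are chosen here for brevity:

  /* Modes of Table 7.1, in table order. */
  enum dlm_mode { NL, CR, CW, PR, PW, EX, DLM_NMODES };

  /* compatible[held][requested]: 1 if the request can be granted immediately. */
  static const int compatible[DLM_NMODES][DLM_NMODES] = {
    /*         NL CR CW PR PW EX */
    /* NL */ {  1, 1, 1, 1, 1, 1 },
    /* CR */ {  1, 1, 1, 1, 1, 0 },
    /* CW */ {  1, 1, 1, 0, 0, 0 },
    /* PR */ {  1, 1, 0, 1, 0, 0 },
    /* PW */ {  1, 1, 0, 0, 0, 0 },
    /* EX */ {  1, 0, 0, 0, 0, 0 },
  };

  int dlm_may_grant(enum dlm_mode held, enum dlm_mode requested)
  {
    return compatible[held][requested];
  }

Note that restricting this matrix to the NL and EX rows and columns yields an exclusive lock, a point the following discussion of the six modes returns to.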
The VAX/VMS DLM uses six modes. For purposes of comparison, exclusive locks use two modes (not held and held), while reader-writer locks use three modes (not held, read held, and write held).

The first mode is null, or not held. This mode is compatible with all other modes, which is to be expected: If a thread is not holding a lock, it should not prevent any other thread from acquiring that lock.

The second mode is concurrent read, which is compatible with every other mode except for exclusive. The concurrent-read mode might be used to accumulate approximate statistics on a data structure, while permitting updates to proceed concurrently.

The third mode is concurrent write, which is compatible with null, concurrent read, and concurrent write. The concurrent-write mode might be used to update approximate statistics, while still permitting reads and concurrent updates to proceed concurrently.

The fourth mode is protected read, which is compatible with null, concurrent read, and protected read. The protected-read mode might be used to obtain a consistent snapshot of the data structure, while permitting reads but not updates to proceed concurrently.

The fifth mode is protected write, which is compatible with null and concurrent read. The protected-write mode might be used to carry out updates to a data structure that could interfere with protected readers but which could be tolerated by concurrent readers.

The sixth and final mode is exclusive, which is compatible only with null. The exclusive mode is used when it is necessary to exclude all other accesses.

It is interesting to note that exclusive locks and reader-writer locks can be emulated by the VAX/VMS DLM. Exclusive locks would use only the null and exclusive modes, while reader-writer locks might use the null, protected-read, and protected-write modes.

Quick Quiz 7.20: Is there any other way for the VAX/VMS DLM to emulate a reader-writer lock?

Although the VAX/VMS DLM policy has seen widespread production use for distributed databases, it does not appear to be used much in shared-memory applications. One possible reason for this is that the greater communication overheads of distributed databases can hide the greater overhead of the VAX/VMS DLM's more-complex admission policy.

Nevertheless, the VAX/VMS DLM is an interesting illustration of just how flexible the concepts behind locking can be. It also serves as a very simple introduction to the locking schemes used by modern DBMSes, which can have more than thirty locking modes, compared to VAX/VMS's six.

7.2.4 Scoped Locking

The locking primitives discussed thus far require explicit acquisition and release primitives, for example, spin_lock() and spin_unlock(), respectively. Another approach is to use the object-oriented resource-acquisition-is-initialization (RAII) pattern [ES90].5 This pattern is often applied to auto variables in languages like C++, where the corresponding constructor is invoked upon entry to the object's scope, and the corresponding destructor is invoked upon exit from that scope. This can be applied to locking by having the constructor acquire the lock and the destructor free it.

5 Though more clearly expressed at https://www.stroustrup.com/bs_faq2.html#finally.

This approach can be quite useful, in fact in 1990 I was convinced that it was the only type of locking that was needed.6 One very nice property of RAII locking is that you don't need to carefully release the lock on each and every code path that exits that scope, a property that can eliminate a troublesome set of bugs.

6 My later work with parallelism at Sequent Computer Systems very quickly disabused me of this misguided notion.

However, RAII locking also has a dark side. RAII makes it quite difficult to encapsulate lock acquisition and release, for example, in iterators. In many iterator implementations, you would like to acquire the lock in the iterator's "start" function and release it in the iterator's "stop" function. RAII locking instead requires that the lock acquisition and release take place in the same level of scoping, making such encapsulation difficult or even impossible.

Strict RAII locking also prohibits overlapping critical sections, due to the fact that scopes must nest. This prohibition makes it difficult or impossible to express a number of useful constructs, for example, locking trees that mediate between multiple concurrent attempts to assert an event. Of an arbitrarily large group of concurrent attempts, only one need succeed, and the best strategy for the remaining attempts is for them to fail as quickly and painlessly as possible. Otherwise, lock contention becomes pathological on large systems (where "large" is many hundreds of CPUs). Therefore, C++17 [Smi19] has escapes from strict RAII in its unique_lock class, which allows the scope of the critical section to be controlled to roughly the same extent as can be achieved with explicit lock acquisition and release primitives.
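C has no constructors or destructors, but the flavor of scoped locking can be approximated using the GCC/Clang cleanup attribute. The following sketch is an illustration of the pattern rather than a recommended interface; the SCOPED_LOCK macro and lock_guard type are invented here:

  #include <pthread.h>

  /* Invented guard type: the lock is released when the guard leaves scope. */
  struct lock_guard { pthread_mutex_t *lock; };

  static inline void lock_guard_release(struct lock_guard *g)
  {
    pthread_mutex_unlock(g->lock);
  }

  #define SCOPED_LOCK(name, lockp) \
    struct lock_guard name __attribute__((cleanup(lock_guard_release))) = \
      { .lock = (lockp) }; \
    pthread_mutex_lock((lockp))

  static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
  static int count;

  void count_increment(void)
  {
    SCOPED_LOCK(guard, &count_lock);  /* Acquired here... */
    count++;
  }                                   /* ...released automatically here. */

As with C++ RAII, the release is tied to the enclosing scope, which is precisely what makes iterator-style and hierarchical locking awkward, as the following example shows.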
Example strict-RAII-unfriendly data structures from Linux-kernel RCU are shown in Figure 7.10. Here, each CPU is assigned a leaf rcu_node structure, and each rcu_node structure has a pointer to its parent (named, oddly enough, ->parent), up to the root rcu_node structure, which has a NULL ->parent pointer. The number of child rcu_node structures per parent can vary, but is typically 32 or 64. Each rcu_node structure also contains a lock named ->fqslock.

Figure 7.10: Locking Hierarchy

The general approach is a tournament, where a given CPU conditionally acquires its leaf rcu_node structure's ->fqslock, and, if successful, attempts to acquire that of the parent, then releases that of the child. In addition, at each level, the CPU checks a global gp_flags variable, and if this variable indicates that some other CPU has asserted the event, the first CPU drops out of the competition. This acquire-then-release sequence continues until either the gp_flags variable indicates that someone else won the tournament, one of the attempts to acquire an ->fqslock fails, or the root rcu_node structure's ->fqslock has been acquired. If the root rcu_node structure's ->fqslock is acquired, a function named do_force_quiescent_state() is invoked.

Simplified code to implement this is shown in Listing 7.7. The purpose of this function is to mediate between CPUs who have concurrently detected a need to invoke the do_force_quiescent_state() function. At any given time, it only makes sense for one instance of do_force_quiescent_state() to be active, so if there are multiple concurrent callers, we need at most one of them to actually invoke do_force_quiescent_state(), and we need the rest to (as quickly and painlessly as possible) give up and leave.

Listing 7.7: Conditional Locking to Reduce Contention
 1 void force_quiescent_state(struct rcu_node *rnp_leaf)
 2 {
 3   int ret;
 4   struct rcu_node *rnp = rnp_leaf;
 5   struct rcu_node *rnp_old = NULL;
 6
 7   for (; rnp != NULL; rnp = rnp->parent) {
 8     ret = (READ_ONCE(gp_flags)) ||
 9           !raw_spin_trylock(&rnp->fqslock);
10     if (rnp_old != NULL)
11       raw_spin_unlock(&rnp_old->fqslock);
12     if (ret)
13       return;
14     rnp_old = rnp;
15   }
16   if (!READ_ONCE(gp_flags)) {
17     WRITE_ONCE(gp_flags, 1);
18     do_force_quiescent_state();
19     WRITE_ONCE(gp_flags, 0);
20   }
21   raw_spin_unlock(&rnp_old->fqslock);
22 }

To this end, each pass through the loop spanning lines 7–15 attempts to advance up one level in the rcu_node hierarchy. If the gp_flags variable is already set (line 8) or if the attempt to acquire the current rcu_node structure's ->fqslock is unsuccessful (line 9), then local variable ret is set to 1. If line 10 sees that local variable rnp_old is non-NULL, meaning that we hold rnp_old's ->fqslock, line 11 releases this lock (but only after the attempt has been made to acquire the parent rcu_node structure's ->fqslock). If line 12 sees that either line 8 or 9 saw a reason to give up, line 13 returns to the caller. Otherwise, we must have acquired the current rcu_node structure's ->fqslock, so line 14 saves a pointer to this structure in local variable rnp_old in preparation for the next pass through the loop.

If control reaches line 16, we won the tournament, and now hold the root rcu_node structure's ->fqslock. If line 16 still sees that the global variable gp_flags is zero, line 17 sets gp_flags to one, line 18 invokes do_force_quiescent_state(), and line 19 resets gp_flags back to zero. Either way, line 21 releases the root rcu_node structure's ->fqslock.

Quick Quiz 7.21: The code in Listing 7.7 is ridiculously complicated! Why not conditionally acquire a single global lock?
Quick Quiz 7.22: Wait a minute! If we "win" the tournament on line 16 of Listing 7.7, we get to do all the work of do_force_quiescent_state(). Exactly how is that a win, really?

This function illustrates the not-uncommon pattern of hierarchical locking. This pattern is difficult to implement using strict RAII locking,7 just like the iterator encapsulation noted earlier, and so explicit lock/unlock primitives (or C++17-style unique_lock escapes) will be required for the foreseeable future.

7 Which is why many RAII locking implementations provide a way to leak the lock out of the scope that it was acquired and into the scope in which it is to be released. However, some object must mediate the scope leaking, which can add complexity compared to non-RAII explicit locking primitives.

7.3 Locking Implementation Issues

When you translate a dream into reality, it's never a full implementation. It is easier to dream than to do.

Shai Agassi

Developers are almost always best-served by using whatever locking primitives are provided by the system, for example, the POSIX pthread mutex locks [Ope97, But97]. Nevertheless, studying sample implementations can be helpful, as can considering the challenges posed by extreme workloads and environments.

7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange

This section reviews the implementation shown in Listing 7.8. The data structure for this lock is just an int, as shown on line 1, but could be any integral type. The initial value of this lock is zero, meaning "unlocked", as shown on line 2.

Listing 7.8: Sample Lock Based on Atomic Exchange
 1 typedef int xchglock_t;
 2 #define DEFINE_XCHG_LOCK(n) xchglock_t n = 0
 3
 4 void xchg_lock(xchglock_t *xp)
 5 {
 6   while (xchg(xp, 1) == 1) {
 7     while (READ_ONCE(*xp) == 1)
 8       continue;
 9   }
10 }
11
12 void xchg_unlock(xchglock_t *xp)
13 {
14   (void)xchg(xp, 0);
15 }

Quick Quiz 7.23: Why not rely on the C language's default initialization of zero instead of using the explicit initializer shown on line 2 of Listing 7.8?

Lock acquisition is carried out by the xchg_lock() function shown on lines 4–10. This function uses a nested loop, with the outer loop repeatedly atomically exchanging the value of the lock with the value one (meaning "locked"). If the old value was already the value one (in other words, someone else already holds the lock), then the inner loop (lines 7–8) spins until the lock is available, at which point the outer loop makes another attempt to acquire the lock.

Quick Quiz 7.24: Why bother with the inner loop on lines 7–8 of Listing 7.8? Why not simply repeatedly do the atomic exchange operation on line 6?

Lock release is carried out by the xchg_unlock() function shown on lines 12–15. Line 14 atomically exchanges the value zero ("unlocked") into the lock, thus marking it as having been released.

Quick Quiz 7.25: Why not simply store zero into the lock word on line 14 of Listing 7.8?

This lock is a simple example of a test-and-set lock [SR84], but very similar mechanisms have been used extensively as pure spinlocks in production.

7.3.2 Other Exclusive-Locking Implementations

There are a great many other possible implementations of locking based on atomic instructions, many of which are reviewed in the classic paper by Mellor-Crummey and Scott [MCS91]. These implementations represent different points in a multi-dimensional design tradeoff [GGL+19, Gui18, McK96b]. For example, the atomic-exchange-based test-and-set lock presented in the previous section works well when contention is low and has the advantage of small memory footprint. It avoids giving the lock to threads that cannot use it, but as a result can suffer from unfairness or even starvation at high contention levels.

In contrast, ticket lock [MCS91], which was once used in the Linux kernel, avoids unfairness at high contention levels. However, as a consequence of its strict FIFO discipline, it can grant the lock to a thread that is currently unable to use it, perhaps due to that thread being preempted or interrupted. On the other hand, it is important to avoid getting too worried about the possibility of preemption and interruption. After all, in many cases, this preemption and interruption could just as well happen just after the lock was acquired.8

8 Besides, the best way of handling high lock contention is to avoid it in the first place! There are nevertheless some situations where high lock contention is the lesser of the available evils, and in any case, studying schemes that deal with high levels of contention is a good mental exercise.
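For concreteness, here is a minimal ticket-lock sketch using C11 atomics. It illustrates the strict-FIFO hand-off just described; it is not the Linux kernel's since-retired implementation, and a production version would at least add a pause instruction or backoff to the spin loop:

  #include <stdatomic.h>

  /* Minimal ticket lock: strictly FIFO. */
  struct ticket_lock {
    atomic_uint next;     /* Next ticket to hand out. */
    atomic_uint serving;  /* Ticket currently being served. */
  };

  #define TICKET_LOCK_INITIALIZER { 0, 0 }

  static inline void ticket_lock(struct ticket_lock *l)
  {
    unsigned int my = atomic_fetch_add(&l->next, 1);  /* Take a ticket. */

    while (atomic_load(&l->serving) != my)
      continue;                                       /* Spin until called. */
  }

  static inline void ticket_unlock(struct ticket_lock *l)
  {
    /* Hand the lock to the next ticket holder, in FIFO order. */
    atomic_fetch_add(&l->serving, 1);
  }

Note that each waiter spins on the single shared serving field, which leads directly to the cache-invalidation problem discussed next.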
All locking implementations where waiters spin on a single memory location, including both test-and-set locks and ticket locks, suffer from performance problems at high contention levels. The problem is that the thread releasing the lock must update the value of the corresponding memory location. At low contention, this is not a problem: The corresponding cache line is very likely still local to and writeable by the thread holding the lock. In contrast, at high levels of contention, each thread attempting to acquire the lock will have a read-only copy of the cache line, and the lock holder will need to invalidate all such copies before it can carry out the update that releases the lock. In general, the more CPUs and threads there are, the greater the overhead incurred when releasing the lock under conditions of high contention.

This negative scalability has motivated a number of different queued-lock implementations [And90, GT90, MCS91, WKS94, Cra93, MLH94, TS93], some of which are used in recent versions of the Linux kernel [Cor14b]. Queued locks avoid high cache-invalidation overhead by assigning each thread a queue element. These queue elements are linked together into a queue that governs the order that the lock will be granted to the waiting threads. The key point is that each thread spins on its own queue element, so that the lock holder need only invalidate the first element from the next thread's CPU's cache. This arrangement greatly reduces the overhead of lock handoff at high levels of contention.

More recent queued-lock implementations also take the system's architecture into account, preferentially granting locks locally, while also taking steps to avoid starvation [SSVM02, RH03, RH02, JMRR02, MCM02]. Many of these can be thought of as analogous to the elevator algorithms traditionally used in scheduling disk I/O.

Unfortunately, the same scheduling logic that improves the efficiency of queued locks at high contention also increases their overhead at low contention. Beng-Hong Lim and Anant Agarwal therefore combined a simple test-and-set lock with a queued lock, using the test-and-set lock at low levels of contention and switching to the queued lock at high levels of contention [LA94], thus getting low overhead at low levels of contention and getting fairness and high throughput at high levels of contention. Browning et al. took a similar approach, but avoided the use of a separate flag, so that the test-and-set fast path uses the same sequence of instructions that would be used in a simple test-and-set lock [BMMM05]. This approach has been used in production.

Another issue that arises at high levels of contention is when the lock holder is delayed, especially when the delay is due to preemption, which can result in priority inversion, where a low-priority thread holds a lock, but is preempted by a medium priority CPU-bound thread, which results in a high-priority process blocking while attempting to acquire the lock. The result is that the CPU-bound medium-priority process is preventing the high-priority process from running. One solution is priority inheritance [LR80], which has been widely used for real-time computing [SRL90, Cor06b], despite some lingering controversy over this practice [Yod04a, Loc02].

Another way to avoid priority inversion is to prevent preemption while a lock is held. Because preventing preemption while locks are held also improves throughput, most proprietary UNIX kernels offer some form of scheduler-conscious synchronization mechanism [KWS97], largely due to the efforts of a certain sizable database vendor. These mechanisms usually take the form of a hint that preemption should be avoided in a given region of code, with this hint typically being placed in a machine register. These hints frequently take the form of a bit set in a particular machine register, which enables extremely low per-lock-acquisition overhead for these mechanisms. In contrast, Linux avoids these hints, instead getting similar results from a mechanism called futexes [FRK02, Mol06, Ros06, Dre11].

Interestingly enough, atomic instructions are not strictly needed to implement locks [Dij65, Lam74]. An excellent exposition of the issues surrounding locking implementations based on simple loads and stores may be found in Herlihy's and Shavit's textbook [HS08, HSLS20]. The main point echoed here is that such implementations currently have little practical application, although a careful study of them can be both entertaining and enlightening. Nevertheless, with one exception described below, such study is left as an exercise for the reader.

Gamsa et al. [GKAS99, Section 5.3] describe a token-based mechanism in which a token circulates among the CPUs. When the token reaches a given CPU, it has
exclusive access to anything protected by that token. There are any number of schemes that may be used to implement the token-based mechanism, for example:

1. Maintain a per-CPU flag, which is initially zero for all but one CPU. When a CPU's flag is non-zero, it holds the token. When it finishes with the token, it zeroes its flag and sets the flag of the next CPU to one (or to any other non-zero value).

2. Maintain a per-CPU counter, which is initially set to the corresponding CPU's number, which we assume to range from zero to N − 1, where N is the number of CPUs in the system. When a CPU's counter is greater than that of the next CPU (taking counter wrap into account), the first CPU holds the token. When it is finished with the token, it sets the next CPU's counter to a value one greater than its own counter.

Quick Quiz 7.26: How can you tell if one counter is greater than another, while accounting for counter wrap?

Quick Quiz 7.27: Which is better, the counter approach or the flag approach?

This lock is unusual in that a given CPU cannot necessarily acquire it immediately, even if no other CPU is using it at the moment. Instead, the CPU must wait until the token comes around to it. This is useful in cases where CPUs need periodic access to the critical section, but can tolerate variances in token-circulation rate. Gamsa et al. [GKAS99] used it to implement a variant of read-copy update (see Section 9.5), but it could also be used to protect periodic per-CPU operations such as flushing per-CPU caches used by memory allocators [MS93], garbage-collecting per-CPU data structures, or flushing per-CPU data to shared storage (or to mass storage, for that matter).

The Linux kernel now uses queued spinlocks [Cor14b], but because of the complexity of implementations that provide good performance across the range of contention levels, the path has not always been smooth [Mar18, Dea18].

As increasing numbers of people gain familiarity with parallel hardware and parallelize increasing amounts of code, we can continue to expect more special-purpose locking primitives to appear, see for example Guerraoui et al. [GGL+19, Gui18]. Nevertheless, you should carefully consider this important safety tip: Use the standard synchronization primitives whenever humanly possible. The big advantage of the standard synchronization primitives over roll-your-own efforts is that the standard primitives are typically much less bug-prone.9

9 And yes, I have done at least my share of roll-your-own synchronization primitives. However, you will notice that my hair is much greyer than it was before I started doing that sort of work. Coincidence? Maybe. But are you really willing to risk your own hair turning prematurely grey?

7.4 Lock-Based Existence Guarantees

Existence precedes and rules essence.

Jean-Paul Sartre

A key challenge in parallel programming is to provide existence guarantees [GKAS99], so that attempts to access a given object can rely on that object being in existence throughout a given access attempt. In some cases, existence guarantees are implicit:

1. Global variables and static local variables in the base module will exist as long as the application is running.

2. Global variables and static local variables in a loaded module will exist as long as that module remains loaded.

3. A module will remain loaded as long as at least one of its functions has an active instance.

4. A given function instance's on-stack variables will exist until that instance returns.

5. If you are executing within a given function or have been called (directly or indirectly) from that function, then the given function has an active instance.

Listing 7.9: Per-Element Locking Without Existence Guarantees
 1 int delete(int key)
 2 {
 3   int b;
 4   struct element *p;
 5
 6   b = hashfunction(key);
 7   p = hashtable[b];
 8   if (p == NULL || p->key != key)
 9     return 0;
10   spin_lock(&p->lock);
11   hashtable[b] = NULL;
12   spin_unlock(&p->lock);
13   kfree(p);
14   return 1;
15 }
7.5. LOCKING: HERO OR VILLAIN? 117

These implicit existence guarantees are straightforward, Listing 7.10: Per-Element Locking With Lock-Based Existence
though bugs involving implicit existence guarantees really Guarantees
1 int delete(int key)
can happen. 2 {
3 int b;
Quick Quiz 7.28: How can relying on implicit existence 4 struct element *p;
guarantees result in a bug? 5 spinlock_t *sp;
6
7 b = hashfunction(key);
But the more interesting—and troublesome—guarantee 8 sp = &locktable[b];
9 spin_lock(sp);
involves heap memory: A dynamically allocated data 10 p = hashtable[b];
structure will exist until it is freed. The problem to be 11 if (p == NULL || p->key != key) {
12 spin_unlock(sp);
solved is to synchronize the freeing of the structure with 13 return 0;
concurrent accesses to that same structure. One way to 14 }
15 hashtable[b] = NULL;
do this is with explicit guarantees, such as locking. If a 16 spin_unlock(sp);
given structure may only be freed while holding a given 17 kfree(p);
18 return 1;
lock, then holding that lock guarantees that structure’s 19 }
existence.
But this guarantee depends on the existence of the lock
itself. One straightforward way to guarantee the lock’s Because there is no existence guarantee, the identity of
existence is to place the lock in a global variable, but the data element can change while a thread is attempting
global locking has the disadvantage of limiting scalability. to acquire that element’s lock on line 10!
One way of providing scalability that improves as the size One way to fix this example is to use a hashed set
of the data structure increases is to place a lock in each of global locks, so that each hash bucket has its own
element of the structure. Unfortunately, putting the lock lock, as shown in Listing 7.10. This approach allows
that is to protect a data element in the data element itself is acquiring the proper lock (on line 9) before gaining a
subject to subtle race conditions, as shown in Listing 7.9. pointer to the data element (on line 10). Although this
Quick Quiz 7.29: What if the element we need to delete is approach works quite well for elements contained in a
not the first element of the list on line 8 of Listing 7.9? single partitionable data structure such as the hash table
shown in the listing, it can be problematic if a given
To see one of these race conditions, consider the fol- data element can be a member of multiple hash tables
lowing sequence of events: or given more-complex data structures such as trees or
graphs. Not only can these problems be solved, but
1. Thread 0 invokes delete(0), and reaches line 10 the solutions also form the basis of lock-based software
of the listing, acquiring the lock. transactional memory implementations [ST95, DSS06].
However, Chapter 9 describes simpler—and faster—ways
2. Thread 1 concurrently invokes delete(0), reaching of providing existence guarantees.
line 10, but spins on the lock because Thread 0 holds
it.

3. Thread 0 executes lines 11–14, removing the element 7.5 Locking: Hero or Villain?
from the hashtable, releasing the lock, and then
freeing the element. You either die a hero or you live long enough to see
yourself become the villain.
4. Thread 0 continues execution, and allocates memory,
getting the exact block of memory that it just freed. Aaron Eckhart as Harvey Dent

5. Thread 0 then initializes this block of memory as As is often the case in real life, locking can be either
some other type of structure. hero or villain, depending on how it is used and on the
problem at hand. In my experience, those writing whole
6. Thread 1’s spin_lock() operation fails due to the applications are happy with locking, those writing parallel
fact that what it believes to be p->lock is no longer libraries are less happy, and those parallelizing existing
a spinlock. sequential libraries are extremely unhappy. The following

v2022.09.25a
118 CHAPTER 7. LOCKING

sections discuss some reasons for these differences in deadlock can ensue just as surely as if the library function
viewpoints. had called the signal handler directly. A final complication
occurs for those library functions that can be used between
a fork()/exec() pair, for example, due to use of the
7.5.1 Locking For Applications: Hero! system() function. In this case, if your library function
When writing an entire application (or entire kernel), was holding a lock at the time of the fork(), then the
developers have full control of the design, including the child process will begin life with that lock held. Because
synchronization design. Assuming that the design makes the thread that will release the lock is running in the parent
good use of partitioning, as discussed in Chapter 6, locking but not the child, if the child calls your library function,
can be an extremely effective synchronization mechanism, deadlock will ensue.
as demonstrated by the heavy use of locking in production- The following strategies may be used to avoid deadlock
quality parallel software. problems in these cases:
Nevertheless, although such software usually bases
most of its synchronization design on locking, such soft- 1. Don’t use either callbacks or signals.
ware also almost always makes use of other synchro- 2. Don’t acquire locks from within callbacks or signal
nization mechanisms, including special counting algo- handlers.
rithms (Chapter 5), data ownership (Chapter 8), reference
counting (Section 9.2), hazard pointers (Section 9.3), 3. Let the caller control synchronization.
sequence locking (Section 9.4), and read-copy update
(Section 9.5). In addition, practitioners use tools for 4. Parameterize the library API to delegate locking to
deadlock detection [Cor06a], lock acquisition/release bal- caller.
ancing [Cor04b], cache-miss analysis [The11], hardware-
5. Explicitly avoid callback deadlocks.
counter-based profiling [EGMdB11, The12b], and many
more besides. 6. Explicitly avoid signal-handler deadlocks.
Given careful design, use of a good combination of
synchronization mechanisms, and good tooling, locking 7. Avoid invoking fork().
works quite well for applications and kernels.
Each of these strategies is discussed in one of the
following sections.
7.5.2 Locking For Parallel Libraries: Just
Another Tool 7.5.2.1 Use Neither Callbacks Nor Signals
Unlike applications and kernels, the designer of a library If a library function avoids callbacks and the application
cannot know the locking design of the code that the library as a whole avoids signals, then any locks acquired by that
will be interacting with. In fact, that code might not be library function will be leaves of the locking-hierarchy
written for years to come. Library designers therefore tree. This arrangement avoids deadlock, as discussed in
have less control and must exercise more care when laying Section 7.1.1.1. Although this strategy works extremely
out their synchronization design. well where it applies, there are some applications that
Deadlock is of course of particular concern, and the must use signal handlers, and there are some library
techniques discussed in Section 7.1.1 need to be applied. functions (such as the qsort() function discussed in
One popular deadlock-avoidance strategy is therefore to Section 7.1.1.2) that require callbacks.
ensure that the library’s locks are independent subtrees of The strategy described in the next section can often be
the enclosing program’s locking hierarchy. However, this used in these cases.
can be harder than it looks.
One complication was discussed in Section 7.1.1.2, 7.5.2.2 Avoid Locking in Callbacks and Signal Han-
namely when library functions call into application code, dlers
with qsort()’s comparison-function argument being a
case in point. Another complication is the interaction If neither callbacks nor signal handlers acquire locks, then
with signal handlers. If an application signal handler is they cannot be involved in deadlock cycles, which allows
invoked from a signal received within the library function, straightforward locking hierarchies to once again consider

v2022.09.25a
7.5. LOCKING: HERO OR VILLAIN? 119

library functions to be leaves on the locking-hierarchy tree. example, a hash table or a parallel sort. In this case, the
This strategy works very well for most uses of qsort, library absolutely must control its own synchronization.
whose callbacks usually simply compare the two values
passed in to them. This strategy also works wonderfully 7.5.2.4 Parameterize Library Synchronization
for many signal handlers, especially given that acquiring
locks from within signal handlers is generally frowned The idea here is to add arguments to the library’s API to
upon [Gro01],10 but can fail if the application needs to specify which locks to acquire, how to acquire and release
manipulate complex data structures from a signal handler. them, or both. This strategy allows the application to
Here are some ways to avoid acquiring locks in sig- take on the global task of avoiding deadlock by specifying
nal handlers even if complex data structures must be which locks to acquire (by passing in pointers to the
manipulated: locks in question) and how to acquire them (by passing
in pointers to lock acquisition and release functions),
1. Use simple data structures based on non-blocking syn- but also allows a given library function to control its
chronization, as will be discussed in Section 14.2.1. own concurrency by deciding where the locks should be
acquired and released.
2. If the data structures are too complex for reasonable
In particular, this strategy allows the lock acquisition
use of non-blocking synchronization, create a queue
and release functions to block signals as needed without
that allows non-blocking enqueue operations. In the
the library code needing to be concerned with which
signal handler, instead of manipulating the complex
signals need to be blocked by which locks. The separation
data structure, add an element to the queue describing
of concerns used by this strategy can be quite effective,
the required change. A separate thread can then
but in some cases the strategies laid out in the following
remove elements from the queue and carry out the
sections can work better.
required changes using normal locking. There are
That said, passing explicit pointers to locks to external
a number of readily available implementations of
APIs must be very carefully considered, as discussed in
concurrent queues [KLP12, Des09b, MS96].
Section 7.1.1.5. Although this practice is sometimes the
This strategy should be enforced with occasional manual right thing to do, you should do yourself a favor by looking
or (preferably) automated inspections of callbacks and into alternative designs first.
signal handlers. When carrying out these inspections, be
wary of clever coders who might have (unwisely) created 7.5.2.5 Explicitly Avoid Callback Deadlocks
home-brew locks from atomic operations.
The basic rule behind this strategy was discussed in Sec-
tion 7.1.1.2: “Release all locks before invoking unknown
7.5.2.3 Caller Controls Synchronization code.” This is usually the best approach because it allows
Letting the caller control synchronization works extremely the application to ignore the library’s locking hierarchy:
well when the library functions are operating on indepen- The library remains a leaf or isolated subtree of the appli-
dent caller-visible instances of a data structure, each of cation’s overall locking hierarchy.
which may be synchronized separately. For example, if In cases where it is not possible to release all locks before
the library functions operate on a search tree, and if the invoking unknown code, the layered locking hierarchies
application needs a large number of independent search described in Section 7.1.1.3 can work well. For example, if
trees, then the application can associate a lock with each the unknown code is a signal handler, this implies that the
tree. The application then acquires and releases locks as library function block signals across all lock acquisitions,
needed, so that the library need not be aware of parallelism which can be complex and slow. Therefore, in cases
at all. Instead, the application controls the parallelism, where signal handlers (probably unwisely) acquire locks,
so that locking can work very well, as was discussed in the strategies in the next section may prove helpful.
Section 7.5.1.
However, this strategy fails if the library implements 7.5.2.6 Explicitly Avoid Signal-Handler Deadlocks
a data structure that requires internal concurrency, for
Suppose that a given library function is known to acquire
10 But the standard’s words do not stop clever coders from creating locks, but does not block signals. Suppose further that it
their own home-brew locking primitives from atomic operations. is necessary to invoke that function both from within and

v2022.09.25a
120 CHAPTER 7. LOCKING

outside of a signal handler, and that it is not permissible 2. If the child creates additional threads, two threads
to modify this library function. Of course, if no special might break the lock concurrently, with the result
action is taken, then if a signal arrives while that library that both threads believe they own the lock. This
function is holding its lock, deadlock can occur when the could again result in arbitrary memory corruption.
signal handler invokes that same library function, which
in turn attempts to re-acquire that same lock. The pthread_atfork() function is provided to help
Such deadlocks can be avoided as follows: deal with these situations. The idea is to register a triplet of
functions, one to be called by the parent before the fork(),
1. If the application invokes the library function from one to be called by the parent after the fork(), and one
within a signal handler, then that signal must be to be called by the child after the fork(). Appropriate
blocked every time that the library function is invoked cleanups can then be carried out at these three points.
from outside of a signal handler. Be warned, however, that coding of pthread_
2. If the application invokes the library function while atfork() handlers is quite subtle in general. The cases
holding a lock acquired within a given signal handler, where pthread_atfork() works best are cases where
then that signal must be blocked every time that the the data structure in question can simply be re-initialized
library function is called outside of a signal handler. by the child.

These rules can be enforced by using tools similar to the 7.5.2.8 Parallel Libraries: Discussion
Linux kernel’s lockdep lock dependency checker [Cor06a].
One of the great strengths of lockdep is that it is not fooled Regardless of the strategy used, the description of the
by human intuition [Ros11]. library’s API must include a clear description of that
strategy and how the caller should interact with that
7.5.2.7 Library Functions Used Between fork() and strategy. In short, constructing parallel libraries using
exec() locking is possible, but not as easy as constructing a
parallel application.
As noted earlier, if a thread executing a library function is
holding a lock at the time that some other thread invokes
7.5.3 Locking For Parallelizing Sequential
fork(), the fact that the parent’s memory is copied to
create the child means that this lock will be born held in Libraries: Villain!
the child’s context. The thread that will release this lock With the advent of readily available low-cost multicore
is running in the parent, but not in the child, which means systems, a common task is parallelizing an existing library
that the child’s copy of this lock will never be released. that was designed with only single-threaded use in mind.
Therefore, any attempt on the part of the child to invoke This all-too-common disregard for parallelism can result
that same library function will result in deadlock. in a library API that is severely flawed from a parallel-
A pragmatic and straightforward way of solving this programming viewpoint. Candidate flaws include:
problem is to fork() a child process while the process is
still single-threaded, and have this child process remain 1. Implicit prohibition of partitioning.
single-threaded. Requests to create further child processes
2. Callback functions requiring locking.
can then be communicated to this initial child process,
which can safely carry out any needed fork() and exec() 3. Object-oriented spaghetti code.
system calls on behalf of its multi-threaded parent process.
Another rather less pragmatic and straightforward solu- These flaws and the consequences for locking are dis-
tion to this problem is to have the library function check cussed in the following sections.
to see if the owner of the lock is still running, and if not,
“breaking” the lock by re-initializing and then acquiring it. 7.5.3.1 Partitioning Prohibited
However, this approach has a couple of vulnerabilities:
Suppose that you were writing a single-threaded hash-
1. The data structures protected by that lock are likely table implementation. It is easy and fast to maintain an
to be in some intermediate state, so that naively exact count of the total number of items in the hash table,
breaking the lock might result in arbitrary memory and also easy and fast to return this exact count on each
corruption. addition and deletion operation. So why not?

v2022.09.25a
7.5. LOCKING: HERO OR VILLAIN? 121

One reason is that exact counters do not perform or Nevertheless, human nature being what it is, we can
scale well on multicore systems, as was seen in Chapter 5. expect our hapless developer to be more likely to complain
As a result, the parallelized implementation of the hash about locking than about his or her own poor (though
table will not perform or scale well. understandable) API design choices.
So what can be done about this? One approach is to
return an approximate count, using one of the algorithms 7.5.3.2 Deadlock-Prone Callbacks
from Chapter 5. Another approach is to drop the element
count altogether. Sections 7.1.1.2, 7.1.1.3, and 7.5.2 described how undisci-
Either way, it will be necessary to inspect uses of the plined use of callbacks can result in locking woes. These
hash table to see why the addition and deletion operations sections also described how to design your library function
need the exact count. Here are a few possibilities: to avoid these problems, but it is unrealistic to expect a
1990s programmer with no experience in parallel program-
1. Determining when to resize the hash table. In this ming to have followed such a design. Therefore, someone
case, an approximate count should work quite well. It attempting to parallelize an existing callback-heavy single-
might also be useful to trigger the resizing operation threaded library will likely have many opportunities to
from the length of the longest chain, which can be curse locking’s villainy.
computed and maintained in a nicely partitioned If there are a very large number of uses of a callback-
per-chain manner. heavy library, it may be wise to again add a parallel-
friendly API to the library in order to allow existing
2. Producing an estimate of the time required to traverse
users to convert their code incrementally. Alternatively,
the entire hash table. An approximate count works
some advocate use of transactional memory in these cases.
well in this case, also.
While the jury is still out on transactional memory, Sec-
3. For diagnostic purposes, for example, to check for tion 17.2 discusses its strengths and weaknesses. It is
items being lost when transferring them to and from important to note that hardware transactional memory
the hash table. This clearly requires an exact count. (discussed in Section 17.3) cannot help here unless the
However, given that this usage is diagnostic in na- hardware transactional memory implementation provides
ture, it might suffice to maintain the lengths of the forward-progress guarantees, which few do. Other alter-
hash chains, then to infrequently sum them up while natives that appear to be quite practical (if less heavily
locking out addition and deletion operations. hyped) include the methods discussed in Sections 7.1.1.6
and 7.1.1.7, as well as those that will be discussed in
It turns out that there is now a strong theoretical basis for Chapters 8 and 9.
some of the constraints that performance and scalability
place on a parallel library’s APIs [AGH+ 11a, AGH+ 11b, 7.5.3.3 Object-Oriented Spaghetti Code
McK11b]. Anyone designing a parallel library needs to
pay close attention to those constraints. Object-oriented programming went mainstream sometime
Although it is all too easy to blame locking for what in the 1980s or 1990s, and as a result there is a huge amount
are really problems due to a concurrency-unfriendly API, of single-threaded object-oriented code in production.
doing so is not helpful. On the other hand, one has little Although object orientation can be a valuable software
choice but to sympathize with the hapless developer who technique, undisciplined use of objects can easily result
made this choice in (say) 1985. It would have been a in object-oriented spaghetti code. In object-oriented
rare and courageous developer to anticipate the need for spaghetti code, control flits from object to object in an
parallelism at that time, and it would have required an even essentially random manner, making the code hard to
more rare combination of brilliance and luck to actually understand and even harder, and perhaps impossible, to
arrive at a good parallel-friendly API. accommodate a locking hierarchy.
Times change, and code must change with them. That Although many might argue that such code should
said, there might be a huge number of users of a popular be cleaned up in any case, such things are much easier
library, in which case an incompatible change to the API to say than to do. If you are tasked with parallelizing
would be quite foolish. Adding a parallel-friendly API such a beast, you can reduce the number of opportunities
to complement the existing heavily used sequential-only to curse locking by using the techniques described in
API is usually the best course of action. Sections 7.1.1.6 and 7.1.1.7, as well as those that will be

v2022.09.25a
122 CHAPTER 7. LOCKING

discussed in Chapters 8 and 9. This situation appears to


be the use case that inspired transactional memory, so
it might be worth a try as well. That said, the choice
of synchronization mechanism should be made in light
of the hardware habits discussed in Chapter 3. After
all, if the overhead of the synchronization mechanism
is orders of magnitude more than that of the operations
being protected, the results are not going to be pretty.
And that leads to a question well worth asking in
these situations: Should the code remain sequential? For
example, perhaps parallelism should be introduced at the
process level rather than the thread level. In general, if a
task is proving extremely hard, it is worth some time spent
thinking about not only alternative ways to accomplish
that particular task, but also alternative tasks that might
better solve the problem at hand.

7.6 Summary

Achievement unlocked.

Unknown

Locking is perhaps the most widely used and most gen-


erally useful synchronization tool. However, it works
best when designed into an application or library from
the beginning. Given the large quantity of pre-existing
single-threaded code that might need to one day run in
parallel, locking should therefore not be the only tool in
your parallel-programming toolbox. The next few chap-
ters will discuss other tools, and how they can best be
used in concert with locking and with each other.

v2022.09.25a
It is mine, I tell you. My own. My precious. Yes, my
precious.

Gollum in “The Fellowship of the Ring”,


Chapter 8 J.R.R. Tolkien

Data Ownership

One of the simplest ways to avoid the synchronization 8.1 Multiple Processes
overhead that comes with locking is to parcel the data
out among the threads (or, in the case of kernels, CPUs)
so that a given piece of data is accessed and modified A man’s home is his castle
by only one of the threads. Interestingly enough, data Ancient Laws of England
ownership covers each of the “big three” parallel design
techniques: It partitions over threads (or CPUs, as the case Section 4.1 introduced the following example:
may be), it batches all local operations, and its elimination
of synchronization operations is weakening carried to its 1 compute_it 1 > compute_it.1.out &
logical extreme. It should therefore be no surprise that 2 compute_it 2 > compute_it.2.out &
3 wait
data ownership is heavily used: Even novices use it almost 4 cat compute_it.1.out
instinctively. In fact, it is so heavily used that this chapter 5 cat compute_it.2.out
will not introduce any new examples, but will instead refer
back to those of previous chapters. This example runs two instances of the compute_it
program in parallel, as separate processes that do not
share memory. Therefore, all data in a given process
Quick Quiz 8.1: What form of data ownership is extremely is owned by that process, so that almost the entirety of
difficult to avoid when creating shared-memory parallel pro- data in the above example is owned. This approach
grams (for example, using pthreads) in C or C++? almost entirely eliminates synchronization overhead. The
resulting combination of extreme simplicity and optimal
performance is obviously quite attractive.
Quick Quiz 8.2: What synchronization remains in the
There are a number of approaches to data ownership. example shown in Section 8.1?
Section 8.1 presents the logical extreme in data ownership,
where each thread has its own private address space. Sec- Quick Quiz 8.3: Is there any shared data in the example
tion 8.2 looks at the opposite extreme, where the data is shown in Section 8.1?
shared, but different threads own different access rights to
the data. Section 8.3 describes function shipping, which is This same pattern can be written in C as well as in sh,
a way of allowing other threads to have indirect access to as illustrated by Listings 4.1 and 4.2.
data owned by a particular thread. Section 8.4 describes It bears repeating that these trivial forms of parallelism
how designated threads can be assigned ownership of are not in any way cheating or ducking responsibility, but
a specified function and the related data. Section 8.5 are rather simple and elegant ways to make your code
discusses improving performance by transforming algo- run faster. It is fast, scales well, is easy to program, easy
rithms with shared data to instead use data ownership. to maintain, and gets the job done. In addition, taking
Finally, Section 8.6 lists a few software environments that this approach (where applicable) allows the developer
feature data ownership as a first-class citizen. more time to focus on other things whether these things

123

v2022.09.25a
124 CHAPTER 8. DATA OWNERSHIP

might involve applying sophisticated single-threaded opti- 8.3 Function Shipping


mizations to compute_it on the one hand, or applying
sophisticated parallel-programming patterns to portions
of the code where this approach is inapplicable. What is If the mountain will not come to Muhammad, then
Muhammad must go to the mountain.
not to like?
The next section discusses the use of data ownership in Essays, Francis Bacon
shared-memory parallel programs.
The previous section described a weak form of data owner-
ship where threads reached out to other threads’ data. This
8.2 Partial Data Ownership and can be thought of as bringing the data to the functions that
need it. An alternative approach is to send the functions
pthreads to the data.
Such an approach is illustrated in Section 5.4.3 be-
Give thy mind more to what thou hast than to what ginning on page 64, in particular the flush_local_
thou hast not. count_sig() and flush_local_count() functions in
Listing 5.18 on page 66.
Marcus Aurelius Antoninus The flush_local_count_sig() function is a signal
handler that acts as the shipped function. The pthread_
Concurrent counting (see Chapter 5) uses data ownership kill() function in flush_local_count() sends the
heavily, but adds a twist. Threads are not allowed to modify signal—shipping the function—and then waits until the
data owned by other threads, but they are permitted to shipped function executes. This shipped function has the
read it. In short, the use of shared memory allows more not-unusual added complication of needing to interact
nuanced notions of ownership and access rights. with any concurrently executing add_count() or sub_
For example, consider the per-thread statistical counter count() functions (see Listing 5.19 on page 66 and
implementation shown in Listing 5.4 on page 53. Here, Listing 5.20 on page 67).
inc_count() updates only the corresponding thread’s
instance of counter, while read_count() accesses, but Quick Quiz 8.5: What mechanisms other than POSIX signals
may be used for function shipping?
does not modify, all threads’ instances of counter.
Quick Quiz 8.4: Does it ever make sense to have partial data
ownership where each thread reads only its own instance of a
per-thread variable, but writes to other threads’ instances? 8.4 Designated Thread
Partial data ownership is also common within the Linux
kernel. For example, a given CPU might be permitted to Let a man practice the profession which he best
knows.
read a given set of its own per-CPU variables only with
interrupts disabled, another CPU might be permitted to Cicero
read that same set of the first CPU’s per-CPU variables
only when holding the corresponding per-CPU lock. Then The earlier sections describe ways of allowing each thread
that given CPU would be permitted to update this set to keep its own copy or its own portion of the data. In
of its own per-CPU variables if it both has interrupts contrast, this section describes a functional-decomposition
disabled and holds its per-CPU lock. This arrangement approach, where a special designated thread owns the
can be thought of as a reader-writer lock that allows each rights to the data that is required to do its job. The
CPU very low-overhead access to its own set of per-CPU eventually consistent counter implementation described in
variables. There are a great many variations on this theme. Section 5.2.4 provides an example. This implementation
For its own part, pure data ownership is also both has a designated thread that runs the eventual() function
common and useful, for example, the per-thread memory- shown on lines 17–34 of Listing 5.5. This eventual()
allocator caches discussed in Section 6.4.3 starting on thread periodically pulls the per-thread counts into the
page 89. In this algorithm, each thread’s cache is com- global counter, so that accesses to the global counter will,
pletely private to that thread. as the name says, eventually converge on the actual value.

v2022.09.25a
8.6. OTHER USES OF DATA OWNERSHIP 125

Quick Quiz 8.6: But none of the data in the eventual() In short, privatization is a powerful tool in the parallel
function shown on lines 17–34 of Listing 5.5 is actually owned programmer’s toolbox, but it must nevertheless be used
by the eventual() thread! In just what way is this data with care. Just like every other synchronization primitive,
ownership??? it has the potential to increase complexity while decreasing
performance and scalability.

8.5 Privatization 8.6 Other Uses of Data Ownership


There is, of course, a difference between what a man
seizes and what he really possesses. Everything comes to us that belongs to us if we
create the capacity to receive it.
Pearl S. Buck
Rabindranath Tagore
One way of improving the performance and scalability of
a shared-memory parallel program is to transform it so as Data ownership works best when the data can be parti-
to convert shared data to private data that is owned by a tioned so that there is little or no need for cross thread
particular thread. access or update. Fortunately, this situation is reasonably
An excellent example of this is shown in the answer common, and in a wide variety of parallel-programming
to one of the Quick Quizzes in Section 6.1.1, which environments.
uses privatization to produce a solution to the Dining Examples of data ownership include:
Philosophers problem with much better performance and
1. All message-passing environments, such as
scalability than that of the standard textbook solution.
MPI [MPI08] and BOINC [Uni08a].
The original problem has five philosophers sitting around
the table with one fork between each adjacent pair of 2. Map-reduce [Jac08].
philosophers, which permits at most two philosophers to
eat concurrently. 3. Client-server systems, including RPC, web services,
We can trivially privatize this problem by providing and pretty much any system with a back-end database
an additional five forks, so that each philosopher has server.
his or her own private pair of forks. This allows all
4. Shared-nothing database systems.
five philosophers to eat concurrently, and also offers a
considerable reduction in the spread of certain types of 5. Fork-join systems with separate per-process address
disease. spaces.
In other cases, privatization imposes costs. For example,
consider the simple limit counter shown in Listing 5.7 on 6. Process-based parallelism, such as the Erlang lan-
page 57. This is an example of an algorithm where threads guage.
can read each others’ data, but are only permitted to update
7. Private variables, for example, C-language on-stack
their own data. A quick review of the algorithm shows
auto variables, in threaded environments.
that the only cross-thread accesses are in the summation
loop in read_count(). If this loop is eliminated, we 8. Many parallel linear-algebra algorithms, especially
move to the more-efficient pure data ownership, but at the those well-suited for GPGPUs.1
cost of a less-accurate result from read_count().
9. Operating-system kernels adapted for networking,
Quick Quiz 8.7: Is it possible to obtain greater accuracy
where each connection (also called flow [DKS89,
while still maintaining full privacy of the per-thread data?
Zha89, McK90]) is assigned to a specific thread. One
Partial privatization is also possible, with some synchro- recent example of this approach is the IX operating
nization requirements, but less than in the fully shared system [BPP+ 16]. IX does have some shared data
case. Some partial-privatization possibilities were ex- structures, which use synchronization mechanisms
plored in Section 4.3.4.4. Chapter 9 will introduce a to be described in Section 9.5.
temporal component to data ownership by providing ways 1 But note that a great many other classes of applications have also

of safely taking public data structures private. been ported to GPGPUs [Mat17, AMD20, NVi17a, NVi17b].

v2022.09.25a
126 CHAPTER 8. DATA OWNERSHIP

Data ownership is perhaps the most underappreciated


synchronization mechanism in existence. When used
properly, it delivers unrivaled simplicity, performance,
and scalability. Perhaps its simplicity costs it the respect
that it deserves. Hopefully a greater appreciation for
the subtlety and power of data ownership will lead to
greater level of respect, to say nothing of leading to
greater performance and scalability coupled with reduced
complexity.

v2022.09.25a
All things come to those who wait.

Violet Fane
Chapter 9

Deferred Processing

The strategy of deferring work goes back before the dawn


of recorded history. It has occasionally been derided route_list
as procrastination or even as sheer laziness. However,
in the last few decades workers have recognized this
strategy’s value in simplifying and streamlining parallel
algorithms [KL80, Mas92]. Believe it or not, “laziness” in
->addr=42 ->addr=56 ->addr=17
parallel programming often outperforms and out-scales in- ->iface=1 ->iface=3 ->iface=7
dustriousness! These performance and scalability benefits
stem from the fact that deferring work can enable weak-
ening of synchronization primitives, thereby reducing Figure 9.1: Pre-BSD Packet Routing List
synchronization overhead. General approaches of work
deferral include reference counting (Section 9.2), hazard
pointers (Section 9.3), sequence locking (Section 9.4), and ern routing algorithms use more complex data structures,
RCU (Section 9.5). Finally, Section 9.6 describes how however a simple algorithm will help highlight issues
to choose among the work-deferral schemes covered in specific to parallelism in a straightforward setting.
this chapter and Section 9.7 discusses updates. But first, We further simplify the algorithm by reducing the
Section 9.1 will introduce an example algorithm that will search key from a quadruple consisting of source and
be used to compare and contrast these approaches. destination IP addresses and ports all the way down to a
simple integer. The value looked up and returned will also
9.1 Running Example be a simple integer, so that the data structure is as shown
in Figure 9.1, which directs packets with address 42 to
interface 1, address 56 to interface 3, and address 17 to
An ounce of application is worth a ton of abstraction. interface 7. This list will normally be searched frequently
Booker T. Washington and updated rarely. In Chapter 3 we learned that the best
ways to evade inconvenient laws of physics, such as the
This chapter will use a simplified packet-routing algo- finite speed of light and the atomic nature of matter, is to
rithm to demonstrate the value of these approaches and either partition the data or to rely on read-mostly sharing.
to allow them to be compared. Routing algorithms are This chapter applies read-mostly sharing techniques to
used in operating-system kernels to deliver each outgoing Pre-BSD packet routing.
TCP/IP packet to the appropriate network interface. This Listing 9.1 (route_seq.c) shows a simple single-
particular algorithm is a simplified version of the clas- threaded implementation corresponding to Figure 9.1.
sic 1980s packet-train-optimized algorithm used in BSD Lines 1–5 define a route_entry structure and line 6 de-
UNIX [Jac88], consisting of a simple linked list.1 Mod- fines the route_list header. Lines 8–20 define route_
lookup(), which sequentially searches route_list, re-
1 In other words, this is not OpenBSD, NetBSD, or even FreeBSD, turning the corresponding ->iface, or ULONG_MAX if
but none other than Pre-BSD. there is no such route entry. Lines 22–33 define route_

127

v2022.09.25a
128 CHAPTER 9. DEFERRED PROCESSING

add(), which allocates a route_entry structure, initial-


izes it, and adds it to the list, returning -ENOMEM in case
of memory-allocation failure. Finally, lines 35–47 define
route_del(), which removes and frees the specified
route_entry structure if it exists, or returns -ENOENT
otherwise.
This single-threaded implementation serves as a proto-
type for the various concurrent implementations in this
Listing 9.1: Sequential Pre-BSD Routing Table chapter, and also as an estimate of ideal scalability and
1 struct route_entry { performance.
2 struct cds_list_head re_next;
3 unsigned long addr;
4 unsigned long iface;
};
5
6 CDS_LIST_HEAD(route_list); 9.2 Reference Counting
7
8 unsigned long route_lookup(unsigned long addr)
9 {
10 struct route_entry *rep;
I am never letting you go!
11 unsigned long ret;
12 Unknown
13 cds_list_for_each_entry(rep, &route_list, re_next) {
14 if (rep->addr == addr) {
15 ret = rep->iface; Reference counting tracks the number of references to a
16 return ret; given object in order to prevent that object from being
17 }
18 } prematurely freed. As such, it has a long and honorable
19 return ULONG_MAX; history of use dating back to at least an early 1960s Weizen-
20 }
21 baum paper [Wei63]. Weizenbaum discusses reference
22 int route_add(unsigned long addr, unsigned long interface) counting as if it was already well-known, so it likely dates
23 {
24 struct route_entry *rep; back to the 1950s or even to the 1940s. And perhaps
25 even further, given that people repairing large dangerous
26 rep = malloc(sizeof(*rep));
27 if (!rep) machines have long used a mechanical reference-counting
28 return -ENOMEM; technique implemented via padlocks. Before entering
29 rep->addr = addr;
30 rep->iface = interface; the machine, each worker locks a padlock onto the ma-
31 cds_list_add(&rep->re_next, &route_list); chine’s on/off switch, thus preventing the machine from
32 return 0;
33 } being powered on while that worker is inside. Reference
34 counting is thus an excellent time-honored candidate for a
35 int route_del(unsigned long addr)
36 { concurrent implementation of Pre-BSD routing.
37 struct route_entry *rep; To that end, Listing 9.2 shows data structures and
38
39 cds_list_for_each_entry(rep, &route_list, re_next) { the route_lookup() function and Listing 9.3 shows
40 if (rep->addr == addr) { the route_add() and route_del() functions (all at
41 cds_list_del(&rep->re_next);
42 free(rep); route_refcnt.c). Since these algorithms are quite
43 return 0; similar to the sequential algorithm shown in Listing 9.1,
44 }
45 } only the differences will be discussed.
46 return -ENOENT;
47 } Starting with Listing 9.2, line 2 adds the actual reference
counter, line 6 adds a ->re_freed use-after-free check
field, line 9 adds the routelock that will be used to
synchronize concurrent updates, and lines 11–15 add
re_free(), which sets ->re_freed, enabling route_
lookup() to check for use-after-free bugs. In route_
lookup() itself, lines 29–30 release the reference count
of the prior element and free it if the count becomes zero,
and lines 34–42 acquire a reference on the new element,
with lines 35 and 36 performing the use-after-free check.

v2022.09.25a
9.2. REFERENCE COUNTING 129

Listing 9.3: Reference-Counted Pre-BSD Routing Table Add/


Delete (BUGGY!!!)
1 int route_add(unsigned long addr, unsigned long interface)
2 {
3 struct route_entry *rep;
4
5 rep = malloc(sizeof(*rep));
6 if (!rep)
7 return -ENOMEM;
8 atomic_set(&rep->re_refcnt, 1);
Listing 9.2: Reference-Counted Pre-BSD Routing Table Lookup 9 rep->addr = addr;
10 rep->iface = interface;
(BUGGY!!!) 11 spin_lock(&routelock);
1 struct route_entry { 12 rep->re_next = route_list.re_next;
2 atomic_t re_refcnt; 13 rep->re_freed = 0;
3 struct route_entry *re_next; 14 route_list.re_next = rep;
4 unsigned long addr; 15 spin_unlock(&routelock);
5 unsigned long iface; 16 return 0;
6 int re_freed; 17 }
7 }; 18
8 struct route_entry route_list; 19 int route_del(unsigned long addr)
9 DEFINE_SPINLOCK(routelock); 20 {
10 21 struct route_entry *rep;
11 static void re_free(struct route_entry *rep) 22 struct route_entry **repp;
12 { 23
13 WRITE_ONCE(rep->re_freed, 1); 24 spin_lock(&routelock);
14 free(rep); 25 repp = &route_list.re_next;
15 } 26 for (;;) {
16 27 rep = *repp;
17 unsigned long route_lookup(unsigned long addr) 28 if (rep == NULL)
18 { 29 break;
19 int old; 30 if (rep->addr == addr) {
20 int new; 31 *repp = rep->re_next;
21 struct route_entry *rep; 32 spin_unlock(&routelock);
22 struct route_entry **repp; 33 if (atomic_dec_and_test(&rep->re_refcnt))
23 unsigned long ret; 34 re_free(rep);
24 35 return 0;
25 retry: 36 }
26 repp = &route_list.re_next; 37 repp = &rep->re_next;
27 rep = NULL; 38 }
28 do { 39 spin_unlock(&routelock);
29 if (rep && atomic_dec_and_test(&rep->re_refcnt)) 40 return -ENOENT;
30 re_free(rep); 41 }
31 rep = READ_ONCE(*repp);
32 if (rep == NULL)
33 return ULONG_MAX;
34 do { Quick Quiz 9.1: Why bother with a use-after-free check?
35 if (READ_ONCE(rep->re_freed))
36 abort();
37 old = atomic_read(&rep->re_refcnt); In Listing 9.3, lines 11, 15, 24, 32, and 39 introduce
38 if (old <= 0)
39 goto retry;
locking to synchronize concurrent updates. Line 13
40 new = old + 1; initializes the ->re_freed use-after-free-check field, and
41 } while (atomic_cmpxchg(&rep->re_refcnt,
42 old, new) != old);
finally lines 33–34 invoke re_free() if the new value of
43 repp = &rep->re_next; the reference count is zero.
44 } while (rep->addr != addr);
45 ret = rep->iface; Quick Quiz 9.2: Why doesn’t route_del() in Listing 9.3
46 if (atomic_dec_and_test(&rep->re_refcnt)) use reference counts to protect the traversal to the element to
47 re_free(rep);
48 return ret; be freed?
49 }
Figure 9.2 shows the performance and scalability of
reference counting on a read-only workload with a ten-
element list running on an eight-socket 28-core-per-socket
hyperthreaded 2.1 GHz x86 system with a total of 448 hard-
ware threads (hps.2019.12.02a/lscpu.hps). The
“ideal” trace was generated by running the sequential
code shown in Listing 9.1, which works only because
this is a read-only workload. The reference-counting

v2022.09.25a
130 CHAPTER 9. DEFERRED PROCESSING

2.5x107 successfully, and because this entry’s ->re_refcnt


field was equal to the value one, it invokes re_
Lookups per Millisecond 2x107 free() to set the ->re_freed field and to free the
ideal
entry.
7
1.5x10
3. Thread A continues execution of route_lookup().
1x10 7 Its rep pointer is non-NULL, but line 35 sees that its
->re_freed field is non-zero, so line 36 invokes
abort().
5x106

refcnt The problem is that the reference count is located in


0
0 50 100 150 200 250 300 350 400 450 the object to be protected, but that means that there is no
Number of CPUs (Threads) protection during the instant in time when the reference
count itself is being acquired! This is the reference-
Figure 9.2: Pre-BSD Routing Table Protected by Refer- counting counterpart of a locking issue noted by Gamsa
ence Counting et al. [GKAS99]. One could imagine using a global
lock or reference count to protect the per-route-entry
performance is abysmal and its scalability even more so, reference-count acquisition, but this would result in severe
with the “refcnt” trace indistinguishable from the x-axis. contention issues. Although algorithms exist that allow
This should be no surprise in view of Chapter 3: The safe reference-count acquisition in a concurrent environ-
reference-count acquisitions and releases have added fre- ment [Val95], they are not only extremely complex and
quent shared-memory writes to an otherwise read-only error-prone [MS95], but also provide terrible performance
workload, thus incurring severe retribution from the laws and scalability [HMBW07].
of physics. As well it should, given that all the wishful In short, concurrency has most definitely reduced the
thinking in the world is not going to increase the speed usefulness of reference counting! Of course, as with other
of light or decrease the size of the atoms used in modern synchronization primitives, reference counts also have
digital electronics. well-known ease-of-use shortcomings. These can result
in memory leaks on the one hand or premature freeing on
Quick Quiz 9.3: Why the break in the “ideal” line at 224 the other.
CPUs in Figure 9.2? Shouldn’t it be a straight line?
Quick Quiz 9.5: If concurrency has “most definitely reduced
the usefulness of reference counting”, why are there so many
Quick Quiz 9.4: Shouldn’t the refcnt trace in Figure 9.2 be
reference counters in the Linux kernel?
at least a little bit off of the x-axis???
It is sometimes helpful to look at a problem in an
But it gets worse.
entirely different way in order to successfully solve it. To
Running multiple updater threads repeatedly invoking
this end, the next section describes what could be thought
route_add() and route_del() will quickly encounter
of as an inside-out reference count that provides decent
the abort() statement on line 36 of Listing 9.2, which
performance and scalability.
indicates a use-after-free bug. This in turn means that
the reference counts are not only profoundly degrading
scalability and performance, but also failing to provide 9.3 Hazard Pointers
the needed protection.
One sequence of events leading to the use-after-free
bug is as follows, given the list shown in Figure 9.1: If in doubt, turn it inside out.
Zara Carpenter
1. Thread A looks up address 42, reaching line 32 of
route_lookup() in Listing 9.2. In other words,
Thread A has a pointer to the first element, but has One way of avoiding problems with concurrent reference
not yet acquired a reference to it. counting is to implement the reference counters inside out,
that is, rather than incrementing an integer stored in the
2. Thread B invokes route_del() in Listing 9.3 to data element, instead store a pointer to that data element
delete the route entry for address 42. It completes in per-CPU (or per-thread) lists. Each element of these

v2022.09.25a
9.3. HAZARD POINTERS 131

Listing 9.4: Hazard-Pointer Recording and Clearing Quick Quiz 9.6: Given that papers on hazard pointers use
1 static inline void *_h_t_r_impl(void **p, the bottom bits of each pointer to mark deleted elements, what
2 hazard_pointer *hp)
3 {
is up with HAZPTR_POISON?
4 void *tmp;
5
Line 6 reads the pointer to the object to be protected.
6 tmp = READ_ONCE(*p);
7 if (!tmp || tmp == (void *)HAZPTR_POISON) If line 8 finds that this pointer was either NULL or the
8 return tmp; special HAZPTR_POISON deleted-object token, it returns
9 WRITE_ONCE(hp->p, tmp);
10 smp_mb(); the pointer’s value to inform the caller of the failure.
11 if (tmp == READ_ONCE(*p)) Otherwise, line 9 stores the pointer into the specified
12 return tmp;
13 return (void *)HAZPTR_POISON; hazard pointer, and line 10 forces full ordering of that
14 } store with the reload of the original pointer on line 11.
15
16 #define hp_try_record(p, hp) _h_t_r_impl((void **)(p), hp) (See Chapter 15 for more information on memory order-
17
ing.) If the value of the original pointer has not changed,
18 static inline void *hp_record(void **p,
19 hazard_pointer *hp) then the hazard pointer protects the pointed-to object,
20 { and in that case, line 12 returns a pointer to that object,
21 void *tmp;
22 which also indicates success to the caller. Otherwise,
23 do { if the pointer changed between the two READ_ONCE()
24 tmp = hp_try_record(*p, hp);
25 } while (tmp == (void *)HAZPTR_POISON); invocations, line 13 indicates failure.
26 return tmp;
27 } Quick Quiz 9.7: Why does hp_try_record() in Listing 9.4
28
take a double indirection to the data element? Why not void *
29 static inline void hp_clear(hazard_pointer *hp)
30 { instead of void **?
31 smp_mb();
32 WRITE_ONCE(hp->p, NULL); The hp_record() function is quite straightforward:
33 }
It repeatedly invokes hp_try_record() until the return
value is something other than HAZPTR_POISON.
lists is called a hazard pointer [Mic04a].2 The value of a Quick Quiz 9.8: Why bother with hp_try_record()?
given data element’s “virtual reference counter” can then Wouldn’t it be easier to just use the failure-immune hp_
be obtained by counting the number of hazard pointers record() function?
referencing that element. Therefore, if that element has
been rendered inaccessible to readers, and there are no The hp_clear() function is even more straightforward,
longer any hazard pointers referencing it, that element with an smp_mb() to force full ordering between the
may safely be freed. caller’s uses of the object protected by the hazard pointer
Of course, this means that hazard-pointer acquisition and the setting of the hazard pointer to NULL.
must be carried out quite carefully in order to avoid destruc- Once a hazard-pointer-protected object has been re-
tive races with concurrent deletion. One implementation moved from its linked data structure, so that it is now
is shown in Listing 9.4, which shows hp_try_record() inaccessible to future hazard-pointer readers, it is passed to
on lines 1–16, hp_record() on lines 18–27, and hp_ hazptr_free_later(), which is shown on lines 48–56
clear() on lines 29–33 (hazptr.h). of Listing 9.5 (hazptr.c). Lines 50 and 51 enqueue
The hp_try_record() macro on line 16 is simply a the object on a per-thread list rlist and line 52 counts
casting wrapper for the _h_t_r_impl() function, which the object in rcount. If line 53 sees that a sufficiently
attempts to store the pointer referenced by p into the hazard large number of objects are now queued, line 54 invokes
pointer referenced by hp. If successful, it returns the value hazptr_scan() to attempt to free some of them.
of the stored pointer. If it fails due to that pointer being The hazptr_scan() function is shown on lines 6–46
NULL, it returns NULL. Finally, if it fails due to racing with of the listing. This function relies on a fixed maximum
an update, it returns a special HAZPTR_POISON token. number of threads (NR_THREADS) and a fixed maximum
number of hazard pointers per thread (K), which allows a
fixed-size array of hazard pointers to be used. Because
any thread might need to scan the hazard pointers, each
thread maintains its own array, which is referenced by the
2 Also independently invented by others [HLM02]. per-thread variable gplist. If line 14 determines that this

v2022.09.25a
132 CHAPTER 9. DEFERRED PROCESSING

thread has not yet allocated its gplist, lines 15–18 carry out the allocation. The memory barrier on line 20 ensures that all threads see the removal of all objects by this thread before lines 22–28 scan all of the hazard pointers, accumulating non-NULL pointers into the plist array and counting them in psize. The memory barrier on line 29 ensures that the reads of the hazard pointers happen before any objects are freed. Line 30 then sorts this array to enable use of binary search below.

Lines 31 and 32 remove all elements from this thread's list of to-be-freed objects, placing them on the local tmplist and line 33 zeroes the count. Each pass through the loop spanning lines 34–45 processes each of the to-be-freed objects. Lines 35 and 36 remove the first object from tmplist, and if lines 37 and 38 determine that there is a hazard pointer protecting this object, lines 39–41 place it back onto rlist. Otherwise, line 43 frees the object.

Listing 9.5: Hazard-Pointer Scanning and Freeing
 1 int compare(const void *a, const void *b)
 2 {
 3   return ( *(hazptr_head_t **)a - *(hazptr_head_t **)b );
 4 }
 5
 6 void hazptr_scan()
 7 {
 8   hazptr_head_t *cur;
 9   int i;
10   hazptr_head_t *tmplist;
11   hazptr_head_t **plist = gplist;
12   unsigned long psize;
13
14   if (plist == NULL) {
15     psize = sizeof(hazptr_head_t *) * K * NR_THREADS;
16     plist = (hazptr_head_t **)malloc(psize);
17     BUG_ON(!plist);
18     gplist = plist;
19   }
20   smp_mb();
21   psize = 0;
22   for (i = 0; i < H; i++) {
23     uintptr_t hp = (uintptr_t)READ_ONCE(HP[i].p);
24
25     if (!hp)
26       continue;
27     plist[psize++] = (hazptr_head_t *)(hp & ~0x1UL);
28   }
29   smp_mb();
30   qsort(plist, psize, sizeof(hazptr_head_t *), compare);
31   tmplist = rlist;
32   rlist = NULL;
33   rcount = 0;
34   while (tmplist != NULL) {
35     cur = tmplist;
36     tmplist = tmplist->next;
37     if (bsearch(&cur, plist, psize,
38                 sizeof(hazptr_head_t *), compare)) {
39       cur->next = rlist;
40       rlist = cur;
41       rcount++;
42     } else {
43       hazptr_free(cur);
44     }
45   }
46 }
47
48 void hazptr_free_later(hazptr_head_t *n)
49 {
50   n->next = rlist;
51   rlist = n;
52   rcount++;
53   if (rcount >= R) {
54     hazptr_scan();
55   }
56 }

The Pre-BSD routing example can use hazard pointers as shown in Listing 9.6 for data structures and route_lookup(), and in Listing 9.7 for route_add() and route_del() (route_hazptr.c). As with reference counting, the hazard-pointers implementation is quite similar to the sequential algorithm shown in Listing 9.1 on page 128, so only differences will be discussed.

Listing 9.6: Hazard-Pointer Pre-BSD Routing Table Lookup
 1 struct route_entry {
 2   struct hazptr_head hh;
 3   struct route_entry *re_next;
 4   unsigned long addr;
 5   unsigned long iface;
 6   int re_freed;
 7 };
 8 struct route_entry route_list;
 9 DEFINE_SPINLOCK(routelock);
10 hazard_pointer __thread *my_hazptr;
11
12 unsigned long route_lookup(unsigned long addr)
13 {
14   int offset = 0;
15   struct route_entry *rep;
16   struct route_entry **repp;
17
18 retry:
19   repp = &route_list.re_next;
20   do {
21     rep = hp_try_record(repp, &my_hazptr[offset]);
22     if (!rep)
23       return ULONG_MAX;
24     if ((uintptr_t)rep == HAZPTR_POISON)
25       goto retry;
26     repp = &rep->re_next;
27   } while (rep->addr != addr);
28   if (READ_ONCE(rep->re_freed))
29     abort();
30   return rep->iface;
31 }

Starting with Listing 9.6, line 2 shows the ->hh field used to queue objects pending hazard-pointer free, line 6
shows the ->re_freed field used to detect use-after-free bugs, and line 21 invokes hp_try_record() to attempt to acquire a hazard pointer. If the return value is NULL, line 23 returns a not-found indication to the caller. If the call to hp_try_record() raced with deletion, line 25 branches back to line 18's retry to re-traverse the list from the beginning. The do–while loop falls through when the desired element is located, but if this element has already been freed, line 29 terminates the program. Otherwise, the element's ->iface field is returned to the caller.

Note that line 21 invokes hp_try_record() rather than the easier-to-use hp_record(), restarting the full search upon hp_try_record() failure. And such restarting is absolutely required for correctness. To see this, consider a hazard-pointer-protected linked list containing elements A, B, and C that is subjected to the following sequence of events:

1. Thread 0 stores a hazard pointer to element B (having presumably traversed to element B from element A).

2. Thread 1 removes element B from the list, which sets the pointer from element B to element C to the special HAZPTR_POISON value in order to mark the deletion. Because Thread 0 has a hazard pointer to element B, it cannot yet be freed.

3. Thread 1 removes element C from the list. Because there are no hazard pointers referencing element C, it is immediately freed.

4. Thread 0 attempts to acquire a hazard pointer to now-removed element B's successor, but hp_try_record() returns the HAZPTR_POISON value, forcing the caller to restart its traversal from the beginning of the list.

Which is a very good thing, because B's successor is the now-freed element C, which means that Thread 0's subsequent accesses might have resulted in arbitrarily horrible memory corruption, especially if the memory for element C had since been re-allocated for some other purpose. Therefore, hazard-pointer readers must typically restart the full traversal in the face of a concurrent deletion. Often the restart must go back to some global (and thus immortal) pointer, but it is sometimes possible to restart at some intermediate location if that location is guaranteed to still be live, for example, due to the current thread holding a lock, a reference count, etc.

Quick Quiz 9.9: Readers must "typically" restart? What are some exceptions?

Because algorithms using hazard pointers might be restarted at any step of their traversal through the linked data structure, such algorithms must typically take care to avoid making any changes to the data structure until after they have acquired all the hazard pointers that are required for the update in question.

Quick Quiz 9.10: But don't these restrictions on hazard pointers also apply to other forms of reference counting?

These hazard-pointer restrictions result in great benefits to readers, courtesy of the fact that the hazard pointers are stored local to each CPU or thread, which in turn allows traversals to be carried out without any writes to the data structures being traversed. Referring back to Figure 5.8 on page 71, hazard pointers enable the CPU caches to do resource replication, which in turn allows weakening of the parallel-access-control mechanism, thus boosting performance and scalability.

Another advantage of restarting hazard-pointer traversals is a reduction in minimal memory footprint: Any object not currently referenced by some hazard pointer may be immediately freed. In contrast, Section 9.5 will discuss a mechanism that avoids read-side retries (and minimizes read-side overhead), but which can result in a much larger memory footprint.

The route_add() and route_del() functions are shown in Listing 9.7. Line 10 initializes ->re_freed, line 31 poisons the ->re_next field of the newly removed object, and line 33 passes that object to the hazptr_free_later() function, which will free that object once it is safe to do so. The spinlocks work the same as in Listing 9.3.

Figure 9.3 shows the hazard-pointers-protected Pre-BSD routing algorithm's performance on the same read-only workload as for Figure 9.2. Although hazard pointers scale far better than does reference counting, hazard pointers still require readers to do writes to shared memory (albeit with much improved locality of reference), and also require a full memory barrier and retry check for each object traversed. Therefore, hazard-pointers performance is still far short of ideal. On the other hand, unlike naive approaches to concurrent reference-counting, hazard pointers not only operate correctly for workloads involving concurrent updates, but also exhibit excellent scalability. Additional performance comparisons with other mechanisms may be found in Chapter 10 and in other publications [HMBW07, McK13, Mic04a].
Listing 9.7: Hazard-Pointer Pre-BSD Routing Table Add/Delete
 1 int route_add(unsigned long addr, unsigned long interface)
 2 {
 3   struct route_entry *rep;
 4
 5   rep = malloc(sizeof(*rep));
 6   if (!rep)
 7     return -ENOMEM;
 8   rep->addr = addr;
 9   rep->iface = interface;
10   rep->re_freed = 0;
11   spin_lock(&routelock);
12   rep->re_next = route_list.re_next;
13   route_list.re_next = rep;
14   spin_unlock(&routelock);
15   return 0;
16 }
17
18 int route_del(unsigned long addr)
19 {
20   struct route_entry *rep;
21   struct route_entry **repp;
22
23   spin_lock(&routelock);
24   repp = &route_list.re_next;
25   for (;;) {
26     rep = *repp;
27     if (rep == NULL)
28       break;
29     if (rep->addr == addr) {
30       *repp = rep->re_next;
31       rep->re_next = (struct route_entry *)HAZPTR_POISON;
32       spin_unlock(&routelock);
33       hazptr_free_later(&rep->hh);
34       return 0;
35     }
36     repp = &rep->re_next;
37   }
38   spin_unlock(&routelock);
39   return -ENOENT;
40 }

Figure 9.3: Pre-BSD Routing Table Protected by Hazard Pointers. (Plot of lookups per millisecond versus number of CPUs (threads), comparing the hazptr implementation to the ideal curve.)

Quick Quiz 9.11: Figure 9.3 shows no sign of hyperthread-induced flattening at 224 threads. Why is that?

Quick Quiz 9.12: The paper "Structured Deferral: Synchronization via Procrastination" [McK13] shows that hazard pointers have near-ideal performance. Whatever happened in Figure 9.3???

The next section attempts to improve on hazard pointers by using sequence locks, which avoid both read-side writes and per-object memory barriers.

9.4 Sequence Locks

It'll be just like starting over.

John Lennon

Figure 9.4: Reader And Uncooperative Sequence Lock. (Cartoon: the reader exclaims "Ah, I finally got done reading!" only to be told "No, you didn't! Start over!")

The published sequence-lock record [Eas71, Lam77] extends back as far as that of reader-writer locking, but sequence locks nevertheless remain in relative obscurity. Sequence locks are used in the Linux kernel for read-mostly data that must be seen in a consistent state by readers. However, unlike reader-writer locking, readers do not exclude writers. Instead, like hazard pointers, sequence locks force readers to retry an operation if they detect activity from a concurrent writer. As can be seen from Figure 9.4, it is important to design code using sequence locks so that readers very rarely need to retry.

Quick Quiz 9.13: Why isn't this sequence-lock discussion in Chapter 7, you know, the one on locking?
The key component of sequence locking is the sequence number, which has an even value in the absence of updaters and an odd value if there is an update in progress. Readers can then snapshot the value before and after each access. If either snapshot has an odd value, or if the two snapshots differ, there has been a concurrent update, and the reader must discard the results of the access and then retry it. Readers therefore use the read_seqbegin() and read_seqretry() functions shown in Listing 9.8 when accessing data protected by a sequence lock. Writers must increment the value before and after each update, and only one writer is permitted at a given time. Writers therefore use the write_seqlock() and write_sequnlock() functions shown in Listing 9.9 when updating data protected by a sequence lock.

Listing 9.8: Sequence-Locking Reader
1 do {
2   seq = read_seqbegin(&test_seqlock);
3   /* read-side access. */
4 } while (read_seqretry(&test_seqlock, seq));

Listing 9.9: Sequence-Locking Writer
1 write_seqlock(&test_seqlock);
2 /* Update */
3 write_sequnlock(&test_seqlock);

As a result, sequence-lock-protected data can have an arbitrarily large number of concurrent readers, but only one writer at a time. Sequence locking is used in the Linux kernel to protect calibration quantities used for timekeeping. It is also used in pathname traversal to detect concurrent rename operations.

A simple implementation of sequence locks is shown in Listing 9.10 (seqlock.h). The seqlock_t data structure is shown on lines 1–4, and contains the sequence number along with a lock to serialize writers. Lines 6–10 show seqlock_init(), which, as the name indicates, initializes a seqlock_t.

Listing 9.10: Sequence-Locking Implementation
 1 typedef struct {
 2   unsigned long seq;
 3   spinlock_t lock;
 4 } seqlock_t;
 5
 6 static inline void seqlock_init(seqlock_t *slp)
 7 {
 8   slp->seq = 0;
 9   spin_lock_init(&slp->lock);
10 }
11
12 static inline unsigned long read_seqbegin(seqlock_t *slp)
13 {
14   unsigned long s;
15
16   s = READ_ONCE(slp->seq);
17   smp_mb();
18   return s & ~0x1UL;
19 }
20
21 static inline int read_seqretry(seqlock_t *slp,
22                                 unsigned long oldseq)
23 {
24   unsigned long s;
25
26   smp_mb();
27   s = READ_ONCE(slp->seq);
28   return s != oldseq;
29 }
30
31 static inline void write_seqlock(seqlock_t *slp)
32 {
33   spin_lock(&slp->lock);
34   ++slp->seq;
35   smp_mb();
36 }
37
38 static inline void write_sequnlock(seqlock_t *slp)
39 {
40   smp_mb();
41   ++slp->seq;
42   spin_unlock(&slp->lock);
43 }

Lines 12–19 show read_seqbegin(), which begins a sequence-lock read-side critical section. Line 16 takes a snapshot of the sequence counter, and line 17 orders this snapshot operation before the caller's critical section. Finally, line 18 returns the value of the snapshot (with the least-significant bit cleared), which the caller will pass to a later call to read_seqretry().

Quick Quiz 9.14: Why not have read_seqbegin() in Listing 9.10 check for the low-order bit being set, and retry internally, rather than allowing a doomed read to start?
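To make the reader/writer pattern of Listings 9.8 and 9.9 concrete, the following minimal sketch uses the seqlock.h implementation from Listing 9.10 to protect a hypothetical two-field position structure. The xypos structure, the pos and pos_lock variables, and the function names are illustrative assumptions rather than part of the book's code samples; the point is that the reader either obtains a consistent (x, y) pair or detects the race and retries.

struct xypos {
  unsigned long x;
  unsigned long y;
};

struct xypos pos;
seqlock_t pos_lock;   /* Assume seqlock_init(&pos_lock) at startup. */

/* Writer: serialized by the lock acquired within write_seqlock(). */
void pos_update(unsigned long newx, unsigned long newy)
{
  write_seqlock(&pos_lock);
  pos.x = newx;
  pos.y = newy;
  write_sequnlock(&pos_lock);
}

/* Reader: retries whenever it ran concurrently with a writer, so it
 * never returns a torn (x, y) pair. */
void pos_read(unsigned long *xp, unsigned long *yp)
{
  unsigned long seq;

  do {
    seq = read_seqbegin(&pos_lock);
    *xp = READ_ONCE(pos.x);
    *yp = READ_ONCE(pos.y);
  } while (read_seqretry(&pos_lock, seq));
}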


Lines 21–29 show read_seqretry(), which returns true if there was at least one writer since the time of the corresponding call to read_seqbegin(). Line 26 orders the caller's prior critical section before line 27's fetch of the new snapshot of the sequence counter. Line 28 checks whether the sequence counter has changed, in other words, whether there has been at least one writer, and returns true if so.

Quick Quiz 9.15: Why is the smp_mb() on line 26 of Listing 9.10 needed?

Quick Quiz 9.16: Can't weaker memory barriers be used in the code in Listing 9.10?

Quick Quiz 9.17: What prevents sequence-locking updaters from starving readers?

Lines 31–36 show write_seqlock(), which simply acquires the lock, increments the sequence number, and executes a memory barrier to ensure that this increment is ordered before the caller's critical section. Lines 38–43 show write_sequnlock(), which executes a memory barrier to ensure that the caller's critical section is ordered before the increment of the sequence number on line 41, then releases the lock.

Quick Quiz 9.18: What if something else serializes writers, so that the lock is not needed?

Quick Quiz 9.19: Why isn't seq on line 2 of Listing 9.10 unsigned rather than unsigned long? After all, if unsigned is good enough for the Linux kernel, shouldn't it be good enough for everyone?

So what happens when sequence locking is applied to the Pre-BSD routing table? Listing 9.11 shows the data structures and route_lookup(), and Listing 9.12 shows route_add() and route_del() (route_seqlock.c). This implementation is once again similar to its counterparts in earlier sections, so only the differences will be highlighted.

Listing 9.11: Sequence-Locked Pre-BSD Routing Table Lookup (BUGGY!!!)
 1 struct route_entry {
 2   struct route_entry *re_next;
 3   unsigned long addr;
 4   unsigned long iface;
 5   int re_freed;
 6 };
 7 struct route_entry route_list;
 8 DEFINE_SEQ_LOCK(sl);
 9
10 unsigned long route_lookup(unsigned long addr)
11 {
12   struct route_entry *rep;
13   struct route_entry **repp;
14   unsigned long ret;
15   unsigned long s;
16
17 retry:
18   s = read_seqbegin(&sl);
19   repp = &route_list.re_next;
20   do {
21     rep = READ_ONCE(*repp);
22     if (rep == NULL) {
23       if (read_seqretry(&sl, s))
24         goto retry;
25       return ULONG_MAX;
26     }
27     repp = &rep->re_next;
28   } while (rep->addr != addr);
29   if (READ_ONCE(rep->re_freed))
30     abort();
31   ret = rep->iface;
32   if (read_seqretry(&sl, s))
33     goto retry;
34   return ret;
35 }

In Listing 9.11, line 5 adds ->re_freed, which is checked on lines 29 and 30. Line 8 adds a sequence lock, which is used by route_lookup() on lines 18, 23, and 32, with lines 24 and 33 branching back to the retry label on line 17. The effect is to retry any lookup that runs concurrently with an update.

In Listing 9.12, lines 11, 14, 23, 31, and 39 acquire and release the sequence lock, while lines 10 and 33 handle ->re_freed. This implementation is therefore quite straightforward.

Figure 9.5: Pre-BSD Routing Table Protected by Sequence Locking. (Plot of lookups per millisecond versus number of CPUs (threads), comparing the seqlock and hazptr implementations to the ideal curve.)
Listing 9.12: Sequence-Locked Pre-BSD Routing Table Add/Delete (BUGGY!!!)
 1 int route_add(unsigned long addr, unsigned long interface)
 2 {
 3   struct route_entry *rep;
 4
 5   rep = malloc(sizeof(*rep));
 6   if (!rep)
 7     return -ENOMEM;
 8   rep->addr = addr;
 9   rep->iface = interface;
10   rep->re_freed = 0;
11   write_seqlock(&sl);
12   rep->re_next = route_list.re_next;
13   route_list.re_next = rep;
14   write_sequnlock(&sl);
15   return 0;
16 }
17
18 int route_del(unsigned long addr)
19 {
20   struct route_entry *rep;
21   struct route_entry **repp;
22
23   write_seqlock(&sl);
24   repp = &route_list.re_next;
25   for (;;) {
26     rep = *repp;
27     if (rep == NULL)
28       break;
29     if (rep->addr == addr) {
30       *repp = rep->re_next;
31       write_sequnlock(&sl);
32       smp_mb();
33       rep->re_freed = 1;
34       free(rep);
35       return 0;
36     }
37     repp = &rep->re_next;
38   }
39   write_sequnlock(&sl);
40   return -ENOENT;
41 }

It also performs better on the read-only workload, as can be seen in Figure 9.5, though its performance is still far from ideal. Worse yet, it suffers use-after-free failures. The problem is that the reader might encounter a segmentation violation due to accessing an already-freed structure before read_seqretry() has a chance to warn of the concurrent update.

Quick Quiz 9.20: Can this bug be fixed? In other words, can you use sequence locks as the only synchronization mechanism protecting a linked list supporting concurrent addition, deletion, and lookup?

Both the read-side and write-side critical sections of a sequence lock can be thought of as transactions, and sequence locking therefore can be thought of as a limited form of transactional memory, which will be discussed in Section 17.2. The limitations of sequence locking are: (1) Sequence locking restricts updates and (2) Sequence locking does not permit traversal of pointers to objects that might be freed by updaters. These limitations are of course overcome by transactional memory, but can also be overcome by combining other synchronization primitives with sequence locking.

Sequence locks allow writers to defer readers, but not vice versa. This can result in unfairness and even starvation in writer-heavy workloads.3 On the other hand, in the absence of writers, sequence-lock readers are reasonably fast and scale linearly. It is only human to want the best of both worlds: Fast readers without the possibility of read-side failure, let alone starvation. In addition, it would also be nice to overcome sequence locking's limitations with pointers. The following section presents a synchronization mechanism with exactly these properties.

3 Dmitry Vyukov describes one way to reduce (but, sadly, not eliminate) reader starvation: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/improved-lock-free-seqlock.

9.5 Read-Copy Update (RCU)

"Free" is a very good price!

Tom Peterson

All of the mechanisms discussed in the preceding sections used one of a number of approaches to defer specific actions until they may be carried out safely. The reference counters discussed in Section 9.2 use explicit counters to defer actions that could disturb readers, which results in read-side contention and thus poor scalability. The hazard pointers covered by Section 9.3 use implicit counters in the guise of per-thread lists of pointers. This avoids read-side contention, but requires readers to do stores and conditional branches, as well as either full memory barriers in read-side primitives or real-time-unfriendly inter-processor interrupts in update-side primitives.4 The sequence lock presented in Section 9.4 also avoids read-side contention, but does not protect pointer traversals and, like hazard pointers, requires either full memory barriers in read-side primitives, or inter-processor interrupts in update-side primitives. These schemes' shortcomings raise the question of whether it is possible to do better.

4 In some important special cases, this extra work can be avoided by using link counting as exemplified by the UnboundedQueue and ConcurrentHashMap data structures implemented in the Folly open-source library (https://github.com/facebook/folly).

This section introduces read-copy update (RCU), which provides an API that allows readers to be associated with
regions in the source code, rather than with expensive updates to frequently updated shared data. The remainder of this section examines RCU from a number of different perspectives. Section 9.5.1 provides the classic introduction to RCU, Section 9.5.2 covers fundamental RCU concepts, Section 9.5.3 presents the Linux-kernel API, Section 9.5.4 introduces some common RCU use cases, and finally Section 9.5.5 covers recent work related to RCU.

9.5.1 Introduction to RCU

The approaches discussed in the preceding sections have provided good scalability but decidedly non-ideal performance for the Pre-BSD routing table. Therefore, in the spirit of "only those who have gone too far know how far you can go",5 we will go all the way, looking into algorithms in which concurrent readers execute the same sequence of assembly language instructions as would a single-threaded lookup, despite the presence of concurrent updates. Of course, this laudable goal might raise serious implementability questions, but we cannot possibly succeed if we don't even try!

5 With apologies to T. S. Eliot.

9.5.1.1 Minimal Insertion and Deletion

To minimize implementability concerns, we focus on a minimal data structure, which consists of a single global pointer that is either NULL or references a single structure. Minimal though it might be, this data structure is heavily used in production [RH18]. A classic approach for insertion is shown in Figure 9.6, which shows four states with time advancing from top to bottom. The first row shows the initial state, with gptr equal to NULL. In the second row, we have allocated a structure which is uninitialized, as indicated by the question marks. In the third row, we have initialized the structure. Finally, in the fourth and final row, we have updated gptr to reference the newly allocated and initialized element.

Figure 9.6: Insertion With Concurrent Readers. (Four states: (1) gptr initially NULL; (2) kmalloc() of a structure whose ->addr and ->iface fields are uninitialized; (3) initialization to ->addr=42 and ->iface=1; (4) smp_store_release(&gptr, p) publishes the structure.)

We might hope that this assignment to gptr could use a simple C-language assignment statement. Unfortunately, Section 4.3.4.1 dashes these hopes. Therefore, the updater cannot use a simple C-language assignment, but must instead use smp_store_release() as shown in the figure, or, as will be seen, rcu_assign_pointer().

Similarly, one might hope that readers could use a single C-language assignment to fetch the value of gptr, and be guaranteed to either get the old value of NULL or to get the newly installed pointer, but either way see a valid result. Unfortunately, Section 4.3.4.1 dashes these hopes as well. To obtain this guarantee, readers must instead use READ_ONCE(), or, as will be seen, rcu_dereference(). However, on most modern computer systems, each of these read-side primitives can be implemented with a single load instruction, exactly the instruction that would normally be used in single-threaded code.

Reviewing Figure 9.6 from the viewpoint of readers, in the first three states all readers see gptr having the value NULL. Upon entering the fourth state, some readers might see gptr still having the value NULL while others might see it referencing the newly inserted element, but after some time, all readers will see this new element. At all times, all readers will see gptr as containing a valid pointer. Therefore, it really is possible to add new data to linked data structures while allowing concurrent readers to execute the same sequence of machine instructions that is normally used in single-threaded code. This no-cost approach to concurrent reading provides excellent performance and scalability, and also is eminently suitable for real-time use.
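For concreteness, the four states of Figure 9.6 can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the book's own sample code: the struct route with ->addr and ->iface fields is taken from the figure, the userspace malloc() stands in for the figure's kmalloc(), and BUG_ON(), smp_store_release(), and READ_ONCE() are assumed to be available as in the book's other code samples.

struct route {
  unsigned long addr;
  unsigned long iface;
};

struct route *gptr;                   /* State 1: initially NULL. */

void insert_route(unsigned long addr, unsigned long iface)
{
  struct route *p;

  p = malloc(sizeof(*p));             /* State 2: allocated, uninitialized. */
  BUG_ON(!p);
  p->addr = addr;                     /* State 3: initialized, still private. */
  p->iface = iface;
  smp_store_release(&gptr, p);        /* State 4: published to readers. */
}

struct route *get_route(void)
{
  return READ_ONCE(gptr);             /* Single load, as in single-threaded code. */
}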

Figure 9.7: Deletion With Concurrent Readers. (Four states: (1) readers and gptr reference the structure with ->addr=42 and ->iface=1, one version; (2) after smp_store_release(&gptr, NULL), pre-existing readers may still reference the old structure while new readers see NULL, two versions; (3) after waiting for readers, one version; (4) free() of the old structure.)

Insertion is of course quite useful, but sooner or later, it will also be necessary to delete data. As can be seen in Figure 9.7, the first step is easy. Again taking the lessons from Section 4.3.4.1 to heart, smp_store_release() is used to NULL the pointer, thus moving from the first row to the second in the figure. At this point, pre-existing readers see the old structure with ->addr of 42 and ->iface of 1, but new readers will see a NULL pointer, that is, concurrent readers can disagree on the state, as indicated by the "2 Versions" in the figure.

Quick Quiz 9.21: Why does Figure 9.7 use smp_store_release() given that it is storing a NULL pointer? Wouldn't WRITE_ONCE() work just as well in this case, given that there is no structure initialization to order against the store of the NULL pointer?

Quick Quiz 9.22: Readers running concurrently with each other and with the procedure outlined in Figure 9.7 can disagree on the value of gptr. Isn't that just a wee bit problematic???

We get back to a single version simply by waiting for all the pre-existing readers to complete, as shown in row 3. At that point, all the pre-existing readers are done, and no later reader has a path to the old data item, so there can no longer be any readers referencing it. It may therefore be safely freed, as shown on row 4.

Thus, given a way to wait for pre-existing readers to complete, it is possible to both add data to and remove data from a linked data structure, despite the readers executing the same sequence of machine instructions that would be appropriate for single-threaded execution. So perhaps going all the way was not too far after all!

But how can we tell when all of the pre-existing readers have in fact completed? This question is the topic of Section 9.5.1.3. But first, the next section defines RCU's core API.

9.5.1.2 Core RCU API

The full Linux-kernel API is quite extensive, with more than one hundred API members. However, this section will confine itself to six core RCU API members, which suffices for the upcoming sections introducing RCU and covering its fundamentals. The full API is covered in Section 9.5.3.

Three members of the core APIs are used by readers. The rcu_read_lock() and rcu_read_unlock() functions delimit RCU read-side critical sections. These may be nested, so that one rcu_read_lock()–rcu_read_unlock() pair can be enclosed within another. In this case, the nested set of RCU read-side critical sections act as one large critical section covering the full extent of the nested set. The third read-side API member, rcu_dereference(), fetches an RCU-protected pointer. Conceptually, rcu_dereference() simply loads from memory, but we will see in Section 9.5.2.1 that rcu_dereference() must prevent the compiler and (in one case) the CPU from reordering its load with later memory operations that dereference this pointer.

Quick Quiz 9.23: What is an RCU-protected pointer?

The other three members of the core APIs are used by updaters. The synchronize_rcu() function implements the "wait for readers" operation from Figure 9.7. The call_rcu() function is the asynchronous counterpart of synchronize_rcu(), invoking the specified function after all pre-existing RCU readers have completed. Finally, the rcu_assign_pointer() macro is used to update an RCU-protected pointer. Conceptually, this is simply an assignment statement, but we will see in Section 9.5.2.1 that rcu_assign_pointer() must prevent the compiler and the CPU from reordering this assignment to precede any prior assignments used to initialize the pointed-to structure.
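As a hedged illustration of the updater-side API just described, the following sketch shows how call_rcu() might be used to free a removed object asynchronously, in contrast to the synchronous wait of Figure 9.7. The struct foo, its fields, and the function names are hypothetical; only struct rcu_head, call_rcu(), container_of(), and kfree() are taken from the Linux-kernel API.

struct foo {
  struct rcu_head rh;
  int a;
};

static void free_foo_cb(struct rcu_head *rhp)
{
  struct foo *fp = container_of(rhp, struct foo, rh);

  kfree(fp);   /* Invoked only after all pre-existing readers complete. */
}

void retire_foo(struct foo *fp)
{
  /* Caller has already unpublished fp, so no new reader can find it. */
  call_rcu(&fp->rh, free_foo_cb);   /* Returns immediately; does not block. */
}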


Table 9.1: Core RCU API

  Primitive                Purpose

  Readers:
    rcu_read_lock()        Start an RCU read-side critical section.
    rcu_read_unlock()      End an RCU read-side critical section.
    rcu_dereference()      Safely load an RCU-protected pointer.

  Updaters:
    synchronize_rcu()      Wait for all pre-existing RCU read-side critical sections to complete.
    call_rcu()             Invoke the specified function after all pre-existing RCU read-side critical sections complete.
    rcu_assign_pointer()   Safely update an RCU-protected pointer.

Quick Quiz 9.24: What does synchronize_rcu() do if it starts at about the same time as an rcu_read_lock()?

The core RCU API is summarized in Table 9.1 for easy reference. With that, we are ready to continue this introduction to RCU with the key RCU operation, waiting for readers.

9.5.1.3 Waiting for Readers

It is tempting to base the reader-waiting functionality of synchronize_rcu() and call_rcu() on a reference counter updated by rcu_read_lock() and rcu_read_unlock(), but Figure 5.1 in Chapter 5 shows that concurrent reference counting results in extreme overhead. This extreme overhead was confirmed in the specific case of reference counters in Figure 9.2 on page 130. Hazard pointers profoundly reduce this overhead, but, as we saw in Figure 9.3 on page 134, not to zero. Nevertheless, many RCU implementations make very careful cache-local use of counters.

A second approach observes that memory synchronization is expensive, and therefore uses registers instead, namely each CPU's or thread's program counter (PC), thus imposing no overhead on readers, at least in the absence of concurrent updates. The updater polls each relevant PC, and if that PC is not within read-side code, then the corresponding CPU or thread is within a quiescent state, in turn signaling the completion of any reader that might have access to the newly removed data element. Once all CPUs' or threads' PCs have been observed to be outside of any reader, the grace period has completed. Please note that this approach poses some serious challenges, including memory ordering, functions that are sometimes invoked from readers, and ever-exciting code-motion optimizations. Nevertheless, this approach is said to be used in production [Ash15].

A third approach is to simply wait for a fixed period of time that is long enough to comfortably exceed the lifetime of any reasonable reader [Jac93, Joh95]. This can work quite well in hard real-time systems [RLPB18], but in less exotic settings, Murphy says that it is critically important to be prepared even for unreasonably long-lived readers. To see this, consider the consequences of failing to do so: A data item will be freed while the unreasonable reader is still referencing it, and that item might well be immediately reallocated, possibly even as a data item of some other type. The unreasonable reader and the unwitting reallocator would then be attempting to use the same memory for two very different purposes. The ensuing mess will at best be exceedingly difficult to debug.

A fourth approach is to wait forever, secure in the knowledge that doing so will accommodate even the most unreasonable reader. This approach is also called "leaking memory", and has a bad reputation due to the fact that memory leaks often require untimely and inconvenient reboots. Nevertheless, this is a viable strategy when the update rate and the uptime are both sharply bounded. For example, this approach could work well in a high-availability cluster where systems were periodically crashed in order to ensure that cluster really remained highly available.6 Leaking the memory is also a viable strategy in environments having garbage collectors, in which case the garbage collector can be thought of as plugging the leak [KL80]. However, if your environment lacks a garbage collector, read on!

6 The program that forces the periodic crashing is sometimes known as a "chaos monkey": https://netflix.github.io/chaosmonkey/. However, it might also be a mistake to neglect chaos caused by systems running for too long.
A fifth approach avoids the periodic crashes in favor of periodically "stopping the world", as exemplified by the traditional stop-the-world garbage collector. This approach was also heavily used during the decades before ubiquitous connectivity, when it was common practice to power systems off at the end of each working day. However, in today's always-connected always-on world, stopping the world can gravely degrade response times, which has been one motivation for the development of concurrent garbage collectors [BCR03]. Furthermore, although we need all pre-existing readers to complete, we do not need them all to complete at the same time.

This observation leads to the sixth approach, which is stopping one CPU or thread at a time. This approach has the advantage of not degrading reader response times at all, let alone gravely. Furthermore, numerous applications already have states (termed quiescent states) that can be reached only after all pre-existing readers are done. In transaction-processing systems, the time between a pair of successive transactions might be a quiescent state. In reactive systems, the state between a pair of successive events might be a quiescent state. Within non-preemptive operating-systems kernels, a context switch can be a quiescent state [MS98a]. Either way, once all CPUs and/or threads have passed through a quiescent state, the system is said to have completed a grace period, at which point all readers in existence at the start of that grace period are guaranteed to have completed. As a result, it is also guaranteed to be safe to free any removed data items that were removed prior to the start of that grace period.7

7 It is possible to do much more with RCU than simply defer reclamation of memory, but deferred reclamation is RCU's most common use case, and is therefore an excellent place to start.

Within a non-preemptive operating-system kernel, for context switch to be a valid quiescent state, readers must be prohibited from blocking while referencing a given instance of the data structure obtained via the gptr pointer shown in Figures 9.6 and 9.7. This no-blocking constraint is consistent with similar constraints on pure spinlocks, where a CPU is forbidden from blocking while holding a spinlock. Without this constraint, all CPUs might be consumed by threads spinning attempting to acquire a spinlock held by a blocked thread. The spinning threads will not relinquish their CPUs until they acquire the lock, but the thread holding the lock cannot possibly release it until one of the spinning threads relinquishes a CPU. This is a classic deadlock situation, and this deadlock is avoided by forbidding blocking while holding a spinlock.

Again, this same constraint is imposed on reader threads dereferencing gptr: Such threads are not allowed to block until after they are done using the pointed-to data item.

Returning to the second row of Figure 9.7, where the updater has just completed executing the smp_store_release(), imagine that CPU 0 executes a context switch. Because readers are not permitted to block while traversing the linked list, we are guaranteed that all prior readers that might have been running on CPU 0 will have completed. Extending this line of reasoning to the other CPUs, once each CPU has been observed executing a context switch, we are guaranteed that all prior readers have completed, and that there are no longer any reader threads referencing the newly removed data element. The updater can then safely free that data element, resulting in the state shown at the bottom of Figure 9.7.

Figure 9.8: QSBR: Waiting for Pre-Existing Readers. (Schematic: CPU 1 executes WRITE_ONCE(gptr, NULL) followed by synchronize_rcu(); context switches on CPUs 1, 2, and 3 serve as quiescent states; once each CPU has context-switched, the grace period ends and CPU 1 invokes free().)

This approach is termed quiescent-state-based reclamation (QSBR) [HMB06]. A QSBR schematic is shown in Figure 9.8, with time advancing from the top of the figure to the bottom. The cyan-colored boxes depict RCU read-side critical sections, each of which begins with rcu_read_lock() and ends with rcu_read_unlock(). CPU 1 does the WRITE_ONCE() that removes the current data item (presumably having previously read the pointer
value and availed itself of appropriate synchronization), then waits for readers. This wait operation results in an immediate context switch, which is a quiescent state (denoted by the pink circle), which in turn means that all prior reads on CPU 1 have completed. Next, CPU 2 does a context switch, so that all readers on CPUs 1 and 2 are now known to have completed. Finally, CPU 3 does a context switch. At this point, all readers throughout the entire system are known to have completed, so the grace period ends, permitting synchronize_rcu() to return to its caller, in turn permitting CPU 1 to free the old data item.

Quick Quiz 9.25: In Figure 9.8, the last of CPU 3's readers that could possibly have access to the old data item ended before the grace period even started! So why would anyone bother waiting until CPU 3's later context switch???

9.5.1.4 Toy Implementation

Although production-quality QSBR implementations can be quite complex, a toy non-preemptive Linux-kernel implementation is exceedingly simple:

void synchronize_rcu(void)
{
  int cpu;

  for_each_online_cpu(cpu)
    sched_setaffinity(current->pid, cpumask_of(cpu));
}

The for_each_online_cpu() primitive iterates over all CPUs, and the sched_setaffinity() function causes the current thread to execute on the specified CPU, which forces the destination CPU to execute a context switch. Therefore, once the for_each_online_cpu() has completed, each CPU has executed a context switch, which in turn guarantees that all pre-existing reader threads have completed.

Please note that this approach is not production quality. Correct handling of a number of corner cases and the need for a number of powerful optimizations mean that production-quality implementations are quite complex. In addition, RCU implementations for preemptible environments require that readers actually do something, which in non-real-time Linux-kernel environments can be as simple as defining rcu_read_lock() and rcu_read_unlock() as preempt_disable() and preempt_enable(), respectively.8 However, this simple non-preemptible approach is conceptually complete, and demonstrates that it really is possible to provide read-side synchronization at zero cost, even in the face of concurrent updates. In fact, Listing 9.13 shows how reading (access_route()), Figure 9.6's insertion (ins_route()) and Figure 9.7's deletion (del_route()) can be implemented. (A slightly more capable routing table is shown in Section 9.5.4.1.)

8 Some toy RCU implementations that handle preempted read-side critical sections are shown in Appendix B.

Listing 9.13: Insertion and Deletion With Concurrent Readers
 1 struct route *gptr;
 2
 3 int access_route(int (*f)(struct route *rp))
 4 {
 5   int ret = -1;
 6   struct route *rp;
 7
 8   rcu_read_lock();
 9   rp = rcu_dereference(gptr);
10   if (rp)
11     ret = f(rp);
12   rcu_read_unlock();
13   return ret;
14 }
15
16 struct route *ins_route(struct route *rp)
17 {
18   struct route *old_rp;
19
20   spin_lock(&route_lock);
21   old_rp = gptr;
22   rcu_assign_pointer(gptr, rp);
23   spin_unlock(&route_lock);
24   return old_rp;
25 }
26
27 int del_route(void)
28 {
29   struct route *old_rp;
30
31   spin_lock(&route_lock);
32   old_rp = gptr;
33   RCU_INIT_POINTER(gptr, NULL);
34   spin_unlock(&route_lock);
35   synchronize_rcu();
36   free(old_rp);
37   return !!old_rp;
38 }

Quick Quiz 9.26: What is the point of rcu_read_lock() and rcu_read_unlock() in Listing 9.13? Why not just let the quiescent states speak for themselves?

Quick Quiz 9.27: What is the point of rcu_dereference(), rcu_assign_pointer() and RCU_INIT_POINTER() in Listing 9.13? Why not just use READ_ONCE(), smp_store_release(), and WRITE_ONCE(), respectively?

Referring back to Listing 9.13, note that route_lock is used to synchronize between concurrent updaters invoking ins_route() and del_route(). However, this lock is not acquired by readers invoking access_route(): Readers are instead protected by the QSBR techniques described in this section.
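As a small usage illustration of Listing 9.13's callback-style reader, the hypothetical helper below extracts the ->iface field (shown as part of the route structure in Figure 9.6) while access_route() holds the RCU read-side critical section. The get_iface() and current_iface() names, and the assumption that struct route has an ->iface field, are illustrative only.

static int get_iface(struct route *rp)
{
  /* Runs within access_route()'s rcu_read_lock()/rcu_read_unlock()
   * pair, so rp cannot be freed out from under this function. */
  return rp->iface;
}

int current_iface(void)
{
  return access_route(get_iface);   /* Returns -1 if gptr was NULL. */
}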


Note that ins_route() simply returns the old value of gptr, which Figure 9.6 assumed would always be NULL. This means that it is the caller's responsibility to figure out what to do with a non-NULL value, a task complicated by the fact that readers might still be referencing it for an indeterminate period of time. Callers might use one of the following approaches:

1. Use synchronize_rcu() to safely free the pointed-to structure. Although this approach is correct from an RCU perspective, it arguably has software-engineering leaky-API problems.

2. Trip an assertion if the returned pointer is non-NULL.

3. Pass the returned pointer to a later invocation of ins_route() to restore the earlier value.

In contrast, del_route() uses synchronize_rcu() and free() to safely free the newly deleted data item.

Quick Quiz 9.28: But what if the old structure needs to be freed, but the caller of ins_route() cannot block, perhaps due to performance considerations or perhaps because the caller is executing within an RCU read-side critical section?

This example shows one general approach to reading and updating RCU-protected data structures, however, there is quite a variety of use cases, several of which are covered in Section 9.5.4.

In summary, it is in fact possible to create concurrent linked data structures that can be traversed by readers executing the same sequence of machine instructions that would be executed by single-threaded readers. The next section summarizes RCU's high-level properties.

9.5.1.5 RCU Properties

A key RCU property is that reads need not wait for updates. This property enables RCU implementations to provide low-cost or even no-cost readers, resulting in low overhead and excellent scalability. This property also allows RCU readers and updaters to make useful concurrent forward progress. In contrast, conventional synchronization primitives must enforce strict mutual exclusion using expensive instructions, thus increasing overhead and degrading scalability, but also typically prohibiting readers and updaters from making useful concurrent forward progress.

Quick Quiz 9.29: Doesn't Section 9.4's seqlock also permit readers and updaters to make useful concurrent forward progress?

As noted earlier, RCU delimits readers with rcu_read_lock() and rcu_read_unlock(), and ensures that each reader has a coherent view of each object (see Figure 9.7) by maintaining multiple versions of objects and using update-side primitives such as synchronize_rcu() to ensure that objects are not freed until after the completion of all readers that might be using them. RCU uses rcu_assign_pointer() and rcu_dereference() to provide efficient and scalable mechanisms for publishing and reading new versions of an object, respectively. These mechanisms distribute the work among read and update paths in such a way as to make read paths extremely fast, using replication and weakening optimizations in a manner similar to hazard pointers, but without the need for read-side retries. In some cases, including CONFIG_PREEMPT=n Linux kernels, RCU's read-side primitives have zero overhead.

But are these properties actually useful in practice? This question is taken up by the next section.

9.5.1.6 Practical Applicability

Figure 9.9: RCU Usage in the Linux Kernel. (Plot of the number of RCU API uses in the Linux kernel by year, 2002 through 2020, on a scale from 0 to 16,000.)

RCU has been used in the Linux kernel since October 2002 [Tor02]. Use of the RCU API has increased substantially since that time, as can be seen in Figure 9.9. In fact, code very similar to that in Listing 9.13 is used in the
Linux kernel. RCU has enjoyed heavy use both prior to and since its acceptance in the Linux kernel, as discussed in Section 9.5.5.

It is therefore safe to say that RCU enjoys wide practical applicability.

The minimal example discussed in this section is a good introduction to RCU. However, effective use of RCU often requires that you think differently about your problem. It is therefore useful to examine RCU's fundamentals, a task taken up by the following section.

9.5.2 RCU Fundamentals

This section re-examines the ground covered in the previous section, but independent of any particular example or use case. People who prefer to live their lives very close to the actual code may wish to skip the underlying fundamentals presented in this section.

RCU is made up of three fundamental mechanisms, the first being used for insertion, the second being used for deletion, and the third being used to allow readers to tolerate concurrent insertions and deletions. Section 9.5.2.1 describes the publish-subscribe mechanism used for insertion, Section 9.5.2.2 describes how waiting for pre-existing RCU readers enables deletion, and Section 9.5.2.3 discusses how maintaining multiple versions of recently updated objects permits concurrent insertions and deletions. Finally, Section 9.5.2.4 summarizes RCU fundamentals.

9.5.2.1 Publish-Subscribe Mechanism

Because RCU readers are not excluded by RCU updaters, an RCU-protected data structure might change while a reader accesses it. The accessed data item might be moved, removed, or replaced. Because the data structure does not "hold still" for the reader, each reader's access can be thought of as subscribing to the current version of the RCU-protected data item. For their part, updaters can be thought of as publishing new versions.

Unfortunately, as laid out in Section 4.3.4.1 and reiterated in Section 9.5.1.1, it is unwise to use plain accesses for these publication and subscription operations. It is instead necessary to inform both the compiler and the CPU of the need for care, as can be seen from Figure 9.10, which illustrates interactions between concurrent executions of ins_route() (and its caller) and read_gptr() from Listing 9.13.

Figure 9.10: Publication/Subscription Constraints. (The ins_route() column proceeds from allocation through pre-initialization garbage, initialization, and publication of the pointer; the access_route() column subscribes to and dereferences the pointer. Dereferencing pre-initialization garbage is "Not OK"; dereferencing a valid route structure is "OK", or "Surprising, but OK" when the dereference races with publication.)

The ins_route() column from Figure 9.10 shows ins_route()'s caller allocating a new route structure, which then contains pre-initialization garbage. The caller then initializes the newly allocated structure, and then invokes ins_route() to publish a pointer to the new route structure. Publication does not affect the contents of the structure, which therefore remain valid after publication.

The access_route() column from this same figure shows the pointer being subscribed to and dereferenced. This dereference operation absolutely must see a valid route structure rather than pre-initialization garbage because referencing garbage could result in memory corruption, crashes, and hangs. As noted earlier, avoiding such garbage means that the publish and subscribe operations must inform both the compiler and the CPU of the need to maintain the needed ordering.

Publication is carried out by rcu_assign_pointer(), which ensures that ins_route()'s caller's initialization is ordered before the actual publication operation's store of the pointer. In addition, rcu_assign_pointer() must be atomic in the sense that concurrent readers see either the old value of the pointer or the new value of the pointer, but not some mash-up of these two values. These requirements are met by the C11 store-release operation, and in fact in the Linux kernel, rcu_assign_pointer() is defined in terms of smp_store_release(), which is similar to C11 store-release.

Note that if concurrent updates are required, some sort of synchronization mechanism will be required to mediate among multiple concurrent rcu_assign_pointer() calls on the same pointer. In the Linux kernel, locking
is the mechanism of choice, but pretty much any synchronization mechanism may be used. An example of a particularly lightweight synchronization mechanism is Chapter 8's data ownership: If each pointer is owned by a particular thread, then that thread may execute rcu_assign_pointer() on that pointer with no additional synchronization overhead.

Quick Quiz 9.30: Wouldn't use of data ownership for RCU updaters mean that the updates could use exactly the same sequence of instructions as would the corresponding single-threaded code?

Subscription is carried out by rcu_dereference(), which orders the subscription operation's load from the pointer before the dereference. Similar to rcu_assign_pointer(), rcu_dereference() must be atomic in the sense that the value loaded must be that from a single store, for example, the compiler must not tear the load.9 Unfortunately, compiler support for rcu_dereference() is at best a work in progress [MWB+17, MRP+17, BM18]. In the meantime, the Linux kernel relies on volatile loads, the details of the various CPU architectures, coding restrictions [McK14e], and, on DEC Alpha [Cor02], a memory-barrier instruction. However, on other architectures, rcu_dereference() typically emits a single load instruction, just as would the equivalent single-threaded code. The coding restrictions are described in more detail in Section 15.3.2, however, the common case of field selection ("->") works quite well. Software that does not require the ultimate in read-side performance can instead use C11 acquire loads, which provide the needed ordering and more, albeit at a cost. It is hoped that lighter-weight compiler support for rcu_dereference() will appear in due course.

9 That is, the compiler must not break the load into multiple smaller loads, as described under "load tearing" in Section 4.3.4.1.

In short, use of rcu_assign_pointer() for publishing pointers and use of rcu_dereference() for subscribing to them successfully avoids the "Not OK" garbage loads depicted in Figure 9.10. These two primitives can therefore be used to add new data to linked structures without disrupting concurrent readers.

Quick Quiz 9.31: But suppose that updaters are adding and removing multiple data items from a linked list while a reader is iterating over that same list. Specifically, suppose that a list initially contains elements A, B, and C, and that an updater removes element A and then adds a new element D at the end of the list. The reader might well see {A, B, C, D}, when that sequence of elements never actually ever existed! In what alternate universe would that qualify as "not disrupting concurrent readers"???

Adding data to a linked structure without disrupting readers is a good thing, as are the cases where this can be done with no added read-side cost compared to single-threaded readers. However, in most cases it is also necessary to remove data, and this is the subject of the next section.

9.5.2.2 Wait For Pre-Existing RCU Readers

In its most basic form, RCU is a way of waiting for things to finish. Of course, there are a great many other ways of waiting for things to finish, including reference counts, reader-writer locks, events, and so on. The great advantage of RCU is that it can wait for each of (say) 20,000 different things without having to explicitly track each and every one of them, and without having to worry about the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes using explicit tracking.

In RCU's case, each of the things waited on is called an RCU read-side critical section. As noted in Table 9.1, an RCU read-side critical section starts with an rcu_read_lock() primitive, and ends with a corresponding rcu_read_unlock() primitive. RCU read-side critical sections can be nested, and may contain pretty much any code, as long as that code does not contain a quiescent state. For example, within the Linux kernel, it is illegal to sleep within an RCU read-side critical section because a context switch is a quiescent state.10 If you abide by these conventions, you can use RCU to wait for any pre-existing RCU read-side critical section to complete, and synchronize_rcu() uses indirect means to do the actual waiting [DMS+12, McK13].

10 However, a special form of RCU called SRCU [McK06] does permit general sleeping in SRCU read-side critical sections.

The relationship between an RCU read-side critical section and a later RCU grace period is an if-then relationship, as illustrated by Figure 9.11. If any portion of a given critical section precedes the beginning of a given grace period, then RCU guarantees that all of that critical section will precede the end of that grace period. In the figure, P0()'s access to x precedes P1()'s access to this same variable, and thus also precedes the grace period generated by P1()'s call to synchronize_rcu(). It is therefore guaranteed that P0()'s access to y will precede P1()'s access. In this case, if r1's final value is 0, then r2's final value is guaranteed to also be 0.
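The scenario in Figure 9.11 can be written out as a small two-thread litmus-test sketch. The variables x, y, r1, and r2 follow the figure; the thread bodies below are an illustrative rendering of the figure's P0() and P1(), assuming both shared variables are initially zero and that READ_ONCE() and WRITE_ONCE() are used for the racing plain accesses shown in the figure.

int x, y;   /* Both initially zero. */

void P0(void)   /* Reader. */
{
  int r1, r2;

  rcu_read_lock();
  r1 = READ_ONCE(x);
  r2 = READ_ONCE(y);
  rcu_read_unlock();
  /* If r1 == 0, the read of x preceded P1()'s store to x, so the
   * entire critical section precedes the end of the grace period,
   * which in turn guarantees r2 == 0. */
}

void P1(void)   /* Updater. */
{
  WRITE_ONCE(x, 1);
  synchronize_rcu();   /* Waits for P0()'s critical section if it overlaps. */
  WRITE_ONCE(y, 1);
}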


Figure 9.11: RCU Reader and Later Grace Period. (P0() reads x and then y within an RCU read-side critical section; P1() stores x = 1, calls synchronize_rcu(), then stores y = 1. Given that P0()'s read of x precedes P1()'s store to x, RCU guarantees that P0()'s read of y precedes P1()'s store to y.)

Quick Quiz 9.32: What other final values of r1 and r2 are possible in Figure 9.11?

Figure 9.12: RCU Reader and Earlier Grace Period. (Same code as Figure 9.11, but here P0()'s read of y follows P1()'s store to y, so RCU guarantees that P0()'s read of x follows P1()'s store to x.)

The relationship between an RCU read-side critical section and an earlier RCU grace period is also an if-then relationship, as illustrated by Figure 9.12. If any portion of a given critical section follows the end of a given grace period, then RCU guarantees that all of that critical section will follow the beginning of that grace period. In the figure, P0()'s access to y follows P1()'s access to this same variable, and thus follows the grace period generated by P1()'s call to synchronize_rcu(). It is therefore guaranteed that P0()'s access to x will follow P1()'s access. In this case, if r2's final value is 1, then r1's final value is guaranteed to also be 1.

Quick Quiz 9.33: What would happen if the order of P0()'s two accesses was reversed in Figure 9.12?

Figure 9.13: RCU Reader Within Grace Period. (P0()'s critical section is entirely overlapped by P1()'s grace period: the read of x follows the store to x, and the read of y precedes the store to y.)

Finally, as shown in Figure 9.13, an RCU read-side critical section can be completely overlapped by an RCU grace period. In this case, r1's final value is 1 and r2's final value is 0.

However, it cannot be the case that r1's final value is 0 and r2's final value is 1. This would mean that an RCU read-side critical section had completely overlapped a grace period, which is forbidden (or at the very least
RCU's wait-for-readers guarantee therefore has two parts: (1) If any part of a given RCU read-side critical section precedes the beginning of a given grace period, then the entirety of that critical section precedes the end of that grace period. (2) If any part of a given RCU read-side critical section follows the end of a given grace period, then the entirety of that critical section follows the beginning of that grace period. This definition is sufficient for almost all RCU-based algorithms, but for those wanting more, simple executable formal models of RCU are available as part of Linux kernel v4.17 and later, as discussed in Section 12.3.2. In addition, RCU's ordering properties are examined in much greater detail in Section 15.4.3.

Quick Quiz 9.34: What would happen if P0()'s accesses in Figures 9.11–9.13 were stores?

Although RCU's wait-for-readers capability really is sometimes used to order the assignment of values to variables as shown in Figures 9.11–9.13, it is more frequently used to safely free data elements removed from a linked structure, as was done in Section 9.5.1. The general process is illustrated by the following pseudocode:

1. Make a change, for example, remove an element from a linked list.

2. Wait for all pre-existing RCU read-side critical sections to completely finish (for example, by using synchronize_rcu()).

3. Clean up, for example, free the element that was replaced above.
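As a concrete illustration of this three-step process, here is a minimal sketch using the Linux kernel's RCU list primitives. The struct foo type and the foo_list and foo_lock variables are hypothetical stand-ins rather than anything from the Linux kernel, and error handling is omitted.

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {                       /* Hypothetical RCU-protected element. */
	struct list_head list;
	int key;
	int data;
};

static LIST_HEAD(foo_list);        /* List traversed by RCU readers. */
static DEFINE_SPINLOCK(foo_lock);  /* Serializes updaters only. */

static void foo_del(struct foo *p)
{
	spin_lock(&foo_lock);
	list_del_rcu(&p->list);    /* Step 1: Remove the element. */
	spin_unlock(&foo_lock);
	synchronize_rcu();         /* Step 2: Wait for pre-existing readers. */
	kfree(p);                  /* Step 3: Clean up. */
}

Readers would traverse foo_list under rcu_read_lock() using list_for_each_entry_rcu(), and thus never block this sequence. In practice, the synchronize_rcu() call is often replaced by call_rcu() or kfree_rcu() so that the updater need not block while waiting for the grace period.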
This more abstract procedure requires a more abstract diagram than Figures 9.11–9.13, which are specific to a particular litmus test. After all, an RCU implementation must work correctly regardless of the form of the RCU updates and the RCU read-side critical sections. Figure 9.14 fills this need, showing the four possible scenarios, with time advancing from top to bottom within each scenario. Within each scenario, an RCU reader is represented by the left-hand stack of boxes and the RCU updater by the right-hand stack.

Figure 9.14: Summary of RCU Grace-Period Ordering Guarantees

In the first scenario, the reader starts execution before the updater starts the removal, so it is possible that this reader has a reference to the removed data element. Therefore, the updater must not free this element until after the reader completes. In the second scenario, the reader does not start execution until after the removal has completed.


The reader cannot possibly obtain a reference to the already-removed data element, so this element may be freed before the reader completes. The third scenario is like the second, but illustrates that even when the reader cannot possibly obtain a reference to an element, it is still permissible to defer the freeing of that element until after the reader completes. In the fourth and final scenario, the reader starts execution before the updater starts removing the data element, but this element is (incorrectly) freed before the reader completes. A correct RCU implementation will not allow this fourth scenario to occur. This diagram thus illustrates RCU's wait-for-readers functionality: Given a grace period, each reader ends before the end of that grace period, starts after the beginning of that grace period, or both, in which case it is wholly contained within that grace period.

Because RCU readers can make forward progress while updates are in progress, different readers might disagree about the state of the data structure, a topic taken up by the next section.

9.5.2.3 Maintain Multiple Versions of Recently Updated Objects

This section discusses how RCU accommodates synchronization-free readers by maintaining multiple versions of data. Because these synchronization-free readers provide very weak temporal synchronization, RCU users compensate via spatial synchronization. Spatial synchronization was discussed in Chapter 6, and is heavily used in practice to obtain good performance and scalability. In this section, spatial synchronization will be used to attain a weak (but useful) form of correctness as well as excellent performance and scalability.

Figure 9.7 in Section 9.5.1.1 showed a simple variant of spatial synchronization, in which different readers running concurrently with del_route() (see Listing 9.13) might see the old route structure or an empty list, but either way get a valid result. Of course, a closer look at Figure 9.6 shows that calls to ins_route() can also result in concurrent readers seeing different versions: Either the initial empty list or the newly inserted route structure. Note that both reference counting (Section 9.2) and hazard pointers (Section 9.3) can also cause concurrent readers to see different versions, but RCU's lightweight readers make this more likely.

However, maintaining multiple weakly consistent versions can provide some surprises. For example, consider Figure 9.15, in which a reader is traversing a linked list that is concurrently updated.11 In the first row of the figure, the reader is referencing data item A, and in the second row, it advances to B, having thus far seen A followed by B. In the third row, an updater removes element A and in the fourth row an updater adds element E to the end of the list. In the fifth and final row, the reader completes its traversal, having seen elements A through E.

11 RCU linked-list APIs may be found in Section 9.5.3.

Figure 9.15: Multiple RCU Data-Structure Versions

Except that there was no time at which such a list existed. This situation might be even more surprising than that shown in Figure 9.7, in which different concurrent readers see different versions. In contrast, in Figure 9.15 the reader sees a version that never actually existed!

One way to resolve this strange situation is via weaker semantics. A reader traversal must encounter any data item that was present during the full traversal (B, C, and D), and might or might not encounter data items that were present for only part of the traversal (A and E). Therefore, in this particular case, it is perfectly legitimate for the reader traversal to encounter all five elements. If this outcome is problematic, another way to resolve this situation is through use of stronger synchronization mechanisms, such as reader-writer locking, or clever use of timestamps and versioning, as discussed in Section 9.5.4.11.


Of course, stronger mechanisms will be more expensive, but then again the engineering life is all about choices and tradeoffs.

Strange though this situation might seem, it is entirely consistent with the real world. As we saw in Section 3.2, the finite speed of light cannot be ignored within a computer system, and it most certainly cannot be ignored outside of this system. This in turn means that any data within the system representing state in the real world outside of the system is always and forever outdated, and thus inconsistent with the real world. Therefore, it is quite possible that the sequence {A, B, C, D, E} occurred in the real world, but due to speed-of-light delays was never represented in the computer system's memory. In this case, the reader's surprising traversal would correctly reflect reality.

As a result, algorithms operating on real-world data must account for inconsistent data, either by tolerating inconsistencies or by taking steps to exclude or reject them. In many cases, these algorithms are also perfectly capable of dealing with inconsistencies within the system.

The pre-BSD packet routing example laid out in Section 9.1 is a case in point. The contents of a routing list are set by routing protocols, and these protocols feature significant delays (seconds or even minutes) to avoid routing instabilities. Therefore, once a routing update reaches a given system, it might well have been sending packets the wrong way for quite some time. Sending a few more packets the wrong way for the few microseconds during which the update is in flight is clearly not a problem because the same higher-level protocol actions that deal with delayed routing updates will also deal with internal inconsistencies.

Nor is Internet routing the only situation tolerating inconsistencies. To repeat, any algorithm in which data within a system tracks outside-of-system state must tolerate inconsistencies, which includes security policies (often set by committees of humans), storage configuration, and WiFi access points, to say nothing of removable hardware such as microphones, headsets, cameras, mice, printers, and much else besides. Furthermore, the large number of Linux-kernel RCU API uses shown in Figure 9.9, combined with the Linux kernel's heavy use of reference counting and with increasing use of hazard pointers in other projects, demonstrates that tolerance for such inconsistencies is more common than one might imagine.

One root cause of this common-case tolerance of inconsistencies is that single-item lookups are much more common in practice than are full-data-structure traversals. After all, full-data-structure traversals are much more expensive than single-item lookups, so developers are motivated to avoid such traversals. Not only are concurrent updates less likely to affect a single-item lookup than they are a full traversal, but it is also the case that an isolated single-item lookup has no way of detecting such inconsistencies. As a result, in the common case, such inconsistencies are not just tolerable, they are in fact invisible.

In such cases, RCU readers can be considered to be fully ordered with updaters, despite the fact that these readers might be executing the exact same sequence of machine instructions that would be executed by a single-threaded program. For example, referring back to Listing 9.13 on page 142, suppose that each reader thread invokes access_route() exactly once during its lifetime, and that there is no other communication among reader and updater threads. Then each invocation of access_route() can be ordered after the ins_route() invocation that produced the route structure accessed by line 11 of the listing in access_route() and ordered before any subsequent ins_route() or del_route() invocation.

In summary, maintaining multiple versions is exactly what enables the extremely low overheads of RCU readers, and as noted earlier, many algorithms are unfazed by multiple versions. However, there are algorithms that absolutely cannot handle multiple versions. There are techniques for adapting such algorithms to RCU [McK04], for example, the use of sequence locking described in Section 13.4.2.

Exercises These examples assumed that a mutex was held across the entire update operation, which would mean that there could be at most two versions of the list active at a given time.

Quick Quiz 9.35: How would you modify the deletion example to permit more than two versions of the list to be active?

Quick Quiz 9.36: How many RCU versions of a given list can be active at any given time?


9.5.2.4 Summary of RCU Fundamentals

This section has described the three fundamental components of RCU-based algorithms:

1. A publish-subscribe mechanism for adding new data featuring rcu_assign_pointer() for update-side publication and rcu_dereference() for read-side subscription,

2. A way of waiting for pre-existing RCU readers to finish based on readers being delimited by rcu_read_lock() and rcu_read_unlock() on the one hand and updaters waiting via synchronize_rcu() or call_rcu() on the other (see Section 15.4.3 for a formal description), and

3. A discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.

Quick Quiz 9.37: How can RCU updaters possibly delay RCU readers, given that neither rcu_read_lock() nor rcu_read_unlock() spin or block?

These three RCU components allow data to be updated in the face of concurrent readers that might be executing the same sequence of machine instructions that would be used by a reader in a single-threaded implementation. These RCU components can be combined in different ways to implement a surprising variety of different types of RCU-based algorithms, a number of which are presented in Section 9.5.4. However, it is usually better to work at higher levels of abstraction. To this end, the next section describes the Linux-kernel API, which includes simple data structures such as lists.

9.5.3 RCU Linux-Kernel API

This section looks at RCU from the viewpoint of its Linux-kernel API.12 Section 9.5.3.2 presents RCU's wait-to-finish APIs, Section 9.5.3.3 presents RCU's publish-subscribe and version-maintenance APIs, Section 9.5.3.4 presents RCU's list-processing APIs, Section 9.5.3.5 presents RCU's diagnostic APIs, and Section 9.5.3.6 describes in which contexts RCU's various APIs may be used. Finally, Section 9.5.3.7 presents concluding remarks.

Readers who are not excited about kernel internals may wish to skip ahead to Section 9.5.4 on page 160, but preferably after reviewing the next section covering software-engineering considerations.

9.5.3.1 RCU API and Software Engineering

Readers who have looked ahead to Tables 9.2, 9.3, 9.4, and 9.5 might have noted that the full list of Linux-kernel APIs sports more than 100 members. This is in sharp (and perhaps dismaying) contrast to the mere six API members shown in Table 9.1. This situation clearly raises the question "Why so many???"

This question is answered more thoroughly in the following sections, but in the meantime the rest of this section summarizes the motivations.

There is a wise old saying to the effect of "To err is human." This means that the purpose of a significant fraction of the RCU API is to provide diagnostics, most notably in Table 9.5, but elsewhere as well.

Important causes of human error are the limits of the human brain, for example, the limited capacity of short-term memory. The toy examples shown in this book do not stress these limits. This is out of necessity: Many readers push their cognitive limits while learning new material, so the examples need to be kept simple.

These examples therefore keep rcu_dereference() invocations in the same function as the enclosing rcu_read_lock() and rcu_read_unlock() calls. In contrast, real-world software must frequently invoke these API members from different functions, and even from different translation units. The Linux kernel RCU API has therefore expanded to accommodate lockdep, which allows rcu_dereference() and friends to complain if it is not protected by rcu_read_lock(). Linux-kernel RCU also checks for some double-free errors, infinite loops in RCU read-side critical sections, and attempts to invoke quiescent states within RCU read-side critical sections.

Another way that real-world software accommodates the limits of human cognition is through abstraction. The Linux-kernel API therefore includes members that operate on lists in addition to the pointer-oriented core API of Table 9.1. The Linux kernel itself also provides RCU-protected hash tables and search trees.

Operating-systems kernels such as Linux operate near the bottom of the "iron triangle" of the software stack shown in Figure 2.3, where performance is critically important. There are thus specialized variants of a number of RCU APIs for use on fastpaths, for example, as discussed in Section 9.5.3.3, RCU_INIT_POINTER() may be used in place of rcu_assign_pointer() in cases where the RCU-protected pointer is being assigned to NULL or when that pointer is not yet accessible by readers.

12 Userspace RCU's API is documented elsewhere [MDJ13c].


Use of RCU_INIT_POINTER() allows the compiler more leeway in selecting instructions and carrying out optimizations, thus increasing performance.

On the other hand, when used incorrectly RCU_INIT_POINTER() can result in silent memory corruption, so please be careful! Yes, in some cases, the kernel can check for inappropriate use of RCU API members from a given kernel context, but the constraints of RCU_INIT_POINTER() use are not yet checkable.

Finally, within the Linux kernel, the aforementioned limits of human cognition are compounded by the variety and severity of workloads running on Linux. As of v5.16, this has given rise to no fewer than five flavors of RCU, each designed to provide different performance, scalability, response-time, and energy efficiency tradeoffs to RCU readers and writers. These RCU flavors are the subject of the next section.

9.5.3.2 RCU has a Family of Wait-to-Finish APIs

The most straightforward answer to "what is RCU" is that RCU is an API. For example, the RCU implementation used in the Linux kernel is summarized by Table 9.2, which shows the wait-for-readers portions of the RCU, "sleepable" RCU (SRCU), Tasks RCU, and generic APIs, respectively, and by Table 9.3, which shows the publish-subscribe portions of the API [McK19b].13

If you are new to RCU, you might consider focusing on just one of the columns in Table 9.2, each of which summarizes one member of the Linux kernel's RCU API family. For example, if you are primarily interested in understanding how RCU is used in the Linux kernel, "RCU" would be the place to start, as it is used most frequently. On the other hand, if you want to understand RCU for its own sake, "Tasks RCU" has the simplest API. You can always come back for the other columns later. If you are already familiar with RCU, these tables can serve as a useful reference.

Quick Quiz 9.38: Why do some of the cells in Table 9.2 have exclamation marks ("!")?

The "RCU" column corresponds to the consolidation of the three Linux-kernel RCU implementations [McK19c, McK19a], in which RCU read-side critical sections start with rcu_read_lock(), rcu_read_lock_bh(), or rcu_read_lock_sched() and end with rcu_read_unlock(), rcu_read_unlock_bh(), or rcu_read_unlock_sched(), respectively. Any region of code that disables bottom halves, interrupts, or preemption also acts as an RCU read-side critical section. RCU read-side critical sections may be nested. The corresponding synchronous update-side primitives, synchronize_rcu() and synchronize_rcu_expedited(), along with their synonym synchronize_net(), wait for any type of currently executing RCU read-side critical sections to complete. The length of this wait is known as a "grace period", and synchronize_rcu_expedited() is designed to reduce grace-period latency at the expense of increased CPU overhead and IPIs. The asynchronous update-side primitive, call_rcu(), invokes a specified function with a specified argument after a subsequent grace period. For example, call_rcu(p,f); will result in the "RCU callback" f(p) being invoked after a subsequent grace period. There are situations, such as when unloading a Linux-kernel module that uses call_rcu(), when it is necessary to wait for all outstanding RCU callbacks to complete [McK07e]. The rcu_barrier() primitive does this job.

Quick Quiz 9.39: How do you prevent a huge number of RCU read-side critical sections from indefinitely blocking a synchronize_rcu() invocation?

Quick Quiz 9.40: The synchronize_rcu() API waits for all pre-existing interrupt handlers to complete, right?

Finally, RCU may be used to provide type-safe memory [GC96], as described in Section 9.5.4.5. In the context of RCU, type-safe memory guarantees that a given data element will not change type during any RCU read-side critical section that accesses it. To make use of RCU-based type-safe memory, pass SLAB_TYPESAFE_BY_RCU to kmem_cache_create().

The "SRCU" column in Table 9.2 displays a specialized RCU API that permits general sleeping in SRCU read-side critical sections [McK06] delimited by srcu_read_lock() and srcu_read_unlock(). However, unlike RCU, SRCU's srcu_read_lock() returns a value that must be passed into the corresponding srcu_read_unlock(). This difference is due to the fact that the SRCU user allocates an srcu_struct for each distinct SRCU usage, so that there is no convenient place to store a per-task reader-nesting count. (Keep in mind that although the Linux kernel provides dynamically allocated per-CPU storage, there is not yet dynamically allocated per-task storage.)

13 This citation covers v4.20 and later. Documentation for earlier versions of the Linux-kernel RCU API may be found elsewhere [McK08e, McK14f].
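To make the srcu_struct and return-value conventions described above concrete, here is a minimal sketch of an SRCU reader and updater. The my_srcu variable and the reader() and updater() functions are hypothetical, and the data being protected is elided; this is an illustration of the calling convention rather than code from the Linux kernel.

#include <linux/srcu.h>

DEFINE_STATIC_SRCU(my_srcu);  /* One srcu_struct per distinct SRCU usage. */

static void reader(void)
{
	int idx;

	idx = srcu_read_lock(&my_srcu);   /* Return value must be saved ... */
	/* ... access SRCU-protected data, possibly sleeping ... */
	srcu_read_unlock(&my_srcu, idx);  /* ... and passed back in here. */
}

static void updater(void)
{
	/* ... make the element being removed unreachable to new readers ... */
	synchronize_srcu(&my_srcu);  /* Waits only for this srcu_struct's readers. */
	/* ... now safe to free the removed element ... */
}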


Table 9.2: RCU Wait-to-Finish APIs
Columns: RCU (Original); SRCU (Sleeping readers); Tasks RCU (Free tracing trampolines); Tasks RCU Rude (Free idle-task tracing trampolines); Tasks RCU Trace (Protect sleepable BPF programs).

Initialization and Cleanup:
  SRCU: DEFINE_SRCU(), DEFINE_STATIC_SRCU(), init_srcu_struct(), cleanup_srcu_struct()

Read-side critical-section markers:
  RCU: rcu_read_lock() !, rcu_read_unlock() !, rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), rcu_read_unlock_sched() (plus anything disabling bottom halves, preemption, or interrupts)
  SRCU: srcu_read_lock(), srcu_read_unlock()
  Tasks RCU: Voluntary context switch
  Tasks RCU Rude: Voluntary context switch and preempt-enable regions of code
  Tasks RCU Trace: rcu_read_lock_trace(), rcu_read_unlock_trace()

Update-side primitives (synchronous):
  RCU: synchronize_rcu(), synchronize_net(), synchronize_rcu_expedited()
  SRCU: synchronize_srcu(), synchronize_srcu_expedited()
  Tasks RCU: synchronize_rcu_tasks()
  Tasks RCU Rude: synchronize_rcu_tasks_rude()
  Tasks RCU Trace: synchronize_rcu_tasks_trace()

Update-side primitives (asynchronous / callback):
  RCU: call_rcu() !
  SRCU: call_srcu()
  Tasks RCU: call_rcu_tasks()
  Tasks RCU Rude: call_rcu_tasks_rude()
  Tasks RCU Trace: call_rcu_tasks_trace()

Update-side primitives (wait for callbacks):
  RCU: rcu_barrier()
  SRCU: srcu_barrier()
  Tasks RCU: rcu_barrier_tasks()
  Tasks RCU Rude: rcu_barrier_tasks_rude()
  Tasks RCU Trace: rcu_barrier_tasks_trace()

Update-side primitives (initiate / wait):
  RCU: get_state_synchronize_rcu(), cond_synchronize_rcu()

Update-side primitives (free memory):
  RCU: kfree_rcu()

Type-safe memory:
  RCU: SLAB_TYPESAFE_BY_RCU

Read side constraints:
  RCU: No blocking (only preemption)
  SRCU: No synchronize_srcu() with same srcu_struct
  Tasks RCU: No voluntary context switch
  Tasks RCU Rude: Neither blocking nor preemption
  Tasks RCU Trace: No RCU tasks trace grace period

Read side overhead:
  RCU: CPU-local accesses (barrier() on PREEMPT=n)
  SRCU: Simple instructions, memory barriers
  Tasks RCU: Free
  Tasks RCU Rude: CPU-local accesses (free on PREEMPT=n)
  Tasks RCU Trace: CPU-local accesses

Asynchronous update-side overhead:
  All columns: sub-microsecond

Grace-period latency:
  RCU: 10s of milliseconds
  SRCU: Milliseconds
  Tasks RCU: Seconds
  Tasks RCU Rude: Milliseconds
  Tasks RCU Trace: 10s of milliseconds

Expedited grace-period latency:
  RCU: 10s of microseconds
  SRCU: Microseconds
  Tasks RCU, Tasks RCU Rude, Tasks RCU Trace: N/A
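As a usage sketch of the asynchronous update-side primitives summarized in Table 9.2, the following hypothetical example embeds an rcu_head in a data element and frees that element from an RCU callback. The struct foo type and the foo_release() and foo_retire() functions are illustrative only, not part of the Linux kernel.

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {                 /* Hypothetical RCU-protected element. */
	struct rcu_head rh;  /* Storage used by call_rcu(). */
	int data;
};

static void foo_release(struct rcu_head *rhp)
{
	struct foo *p = container_of(rhp, struct foo, rh);

	kfree(p);            /* Runs after a subsequent grace period. */
}

static void foo_retire(struct foo *p)
{
	/* Caller must already have made p unreachable to new readers. */
	call_rcu(&p->rh, foo_release);
}

A module using this pattern would invoke rcu_barrier() before unloading, so that all outstanding foo_release() callbacks complete before the module's code disappears.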

A given srcu_struct structure may be defined as a global variable with DEFINE_SRCU() if the structure must be used in multiple translation units, or with DEFINE_STATIC_SRCU() otherwise. For example, DEFINE_SRCU(my_srcu) would create a global variable named my_srcu that could be used by any file in the program. Alternatively, an srcu_struct structure may be either an on-stack variable or a dynamically allocated region of memory. In both of these non-global-variable cases, the memory must be initialized using init_srcu_struct() prior to its first use and cleaned up using cleanup_srcu_struct() after its last use (but before the underlying storage disappears).

However they are created, these distinct srcu_struct structures prevent SRCU read-side critical sections from blocking unrelated synchronize_srcu() and synchronize_srcu_expedited() invocations. Of course, use of either synchronize_srcu() or synchronize_srcu_expedited() within an SRCU read-side critical section can result in self-deadlock, so should be avoided. As with RCU, SRCU's synchronize_srcu_expedited() decreases grace-period latency compared to synchronize_srcu(), but at the expense of increased CPU overhead.

Quick Quiz 9.41: Under what conditions can synchronize_srcu() be safely used within an SRCU read-side critical section?

Similar to normal RCU, self-deadlock can be avoided using the asynchronous call_srcu() function. However, special care must be taken when using call_srcu() because a single task could register SRCU callbacks very quickly. Given that SRCU allows readers to block for arbitrary periods of time, this could consume an arbitrarily large quantity of memory. In contrast, given the synchronous synchronize_srcu() interface, a given task must finish waiting for a given grace period before it can start waiting for the next one.

Also similar to RCU, there is an srcu_barrier() function that waits for all prior call_srcu() callbacks to be invoked.

In other words, SRCU compensates for its extremely weak forward-progress guarantees by permitting the developer to restrict its scope.

The "Tasks RCU" column in Table 9.2 displays a specialized RCU API that mediates freeing of the trampolines used in Linux-kernel tracing. These trampolines are used to transfer control from a point in the code being traced to the code doing the actual tracing. It is of course necessary to ensure that all code executing within a given trampoline has finished before freeing that trampoline.

Changes to the code being traced are typically limited to a single jump or call instruction, and thus cannot accommodate the sequence of code required to implement rcu_read_lock() and rcu_read_unlock(). Nor can the trampoline contain these calls to rcu_read_lock() and rcu_read_unlock(). To see this, consider a CPU that is just about to start executing a given trampoline. Because it has not yet executed the rcu_read_lock(), that trampoline could be freed at any time, which would come as a fatal surprise to this CPU. Therefore, trampolines cannot be protected by synchronization primitives executed in either the traced code or in the trampoline itself. Which does raise the question of exactly how the trampoline is to be protected.

The key to answering this question is to note that trampoline code never contains code that either directly or indirectly does a voluntary context switch. This code might be preempted, but it will never directly or indirectly invoke schedule(). This suggests a variant of RCU having voluntary context switches and idle execution as its only quiescent states. This variant is Tasks RCU.

Tasks RCU is unusual in having no read-side marking functions, which is good given that its main use case has nowhere to put such markings. Instead, calls to schedule() serve directly as quiescent states. Updates can use synchronize_rcu_tasks() to wait for all pre-existing trampoline execution to complete, or they can use its asynchronous counterpart, call_rcu_tasks(). There is also an rcu_barrier_tasks() that waits for completion of callbacks corresponding to all prior invocations of call_rcu_tasks(). There is no synchronize_rcu_tasks_expedited() because there has not yet been a request for it, though implementing a useful variant of it would not be free of challenges.

Quick Quiz 9.42: In a kernel built with CONFIG_PREEMPT_NONE=y, won't synchronize_rcu() wait for all trampolines, given that preemption is disabled and that trampolines never directly or indirectly invoke schedule()?

The "Tasks RCU Rude" column provides a more effective variant of the toy implementation presented in Section 9.5.1.4. This variant causes each CPU to execute a context switch, so that any voluntary context switch or any preemptible region of code can serve as a quiescent state. The Tasks RCU Rude variant uses the Linux-kernel workqueues facility to force concurrent context switches, in contrast to the serial CPU-by-CPU approach taken by the toy implementation.


The API mirrors that of Tasks RCU, including the lack of explicit read-side markers.

Finally, the "Tasks RCU Trace" column provides an RCU implementation with functionality similar to that of SRCU, except with much faster read-side markers.14 However, this speed is a consequence of the fact that these markers do not execute memory-barrier instructions, which means that Tasks RCU Trace grace periods must often send IPIs to all CPUs and must always scan the entire task list, thus degrading real-time response and consuming considerable CPU time. Nevertheless, in the absence of readers, the resulting grace-period latency is reasonably short, rivaling that of RCU.

9.5.3.3 RCU has Publish-Subscribe and Version-Maintenance APIs

Fortunately, the RCU publish-subscribe and version-maintenance primitives shown in Table 9.3 apply to all of the variants of RCU discussed above. This commonality can allow more code to be shared, and reduces API proliferation. The original purpose of the RCU publish-subscribe APIs was to bury memory barriers into these APIs, so that Linux kernel programmers could use RCU without needing to become expert on the memory-ordering models of each of the 20+ CPU families that Linux supports [Spr01].

These primitives operate directly on pointers, and are useful for creating RCU-protected linked data structures, such as RCU-protected arrays and trees. The special case of linked lists is handled by a separate set of APIs described in Section 9.5.3.4.

The first category publishes pointers to new data items. The rcu_assign_pointer() primitive ensures that any prior initialization remains ordered before the assignment to the pointer on weakly ordered machines. The rcu_replace_pointer() primitive updates the pointer just like rcu_assign_pointer() does, but also returns the previous value, just like rcu_dereference_protected() (see below) would, including the lockdep expression. This replacement is convenient when the updater must both publish a new pointer and free the structure referenced by the old pointer.

Quick Quiz 9.43: Normally, any pointer subject to rcu_dereference() must always be updated using one of the pointer-publish functions in Table 9.3, for example, rcu_assign_pointer(). What is an exception to this rule?

Quick Quiz 9.44: Are there any downsides to the fact that these traversal and update primitives can be used with any of the RCU API family members?

The rcu_pointer_handoff() primitive simply returns its sole argument, but is useful to tooling checking for pointers being leaked from RCU read-side critical sections. Use of rcu_pointer_handoff() indicates to such tooling that protection of the structure in question has been handed off from RCU to some other mechanism, such as locking or reference counting.

The RCU_INIT_POINTER() macro can be used to initialize RCU-protected pointers that have not yet been exposed to readers, or alternatively, to set RCU-protected pointers to NULL. In these restricted cases, the memory-barrier instructions provided by rcu_assign_pointer() are not needed. Similarly, RCU_POINTER_INITIALIZER() provides a GCC-style structure initializer to allow easy initialization of RCU-protected pointers in structures.

The second category subscribes to pointers to data items, or, alternatively, safely traverses RCU-protected pointers. Again, simply loading these pointers using C-language accesses could result in seeing pre-initialization garbage in the pointed-to data. Similarly, loading these pointers by any means outside of an RCU read-side critical section could result in the pointed-to object being freed at any time. However, if the pointer is merely to be tested and not dereferenced, the freeing of the pointed-to object is not necessarily a problem. In this case, rcu_access_pointer() may be used. Normally, however, RCU read-side protection is required, and so the rcu_dereference() primitive uses the Linux kernel's lockdep facility [Cor06a] to verify that this rcu_dereference() invocation is under the protection of rcu_read_lock(), srcu_read_lock(), or some other RCU read-side marker. In contrast, the rcu_access_pointer() primitive does not involve lockdep, and thus will not provoke lockdep complaints when used outside of an RCU read-side critical section.

Another situation where protection is not required is when update-side code accesses the RCU-protected pointer while holding the update-side lock. The rcu_dereference_protected() API member is provided for this situation. Its first parameter is the RCU-protected pointer, and the second parameter takes a lockdep expression describing which locks must be held in order for the access to be safe. Code invoked both from readers and updaters can use rcu_dereference_check(), which also takes a lockdep expression, but which may also be invoked from read-side code not holding the locks.

14 And thus is unusual for the Tasks RCU family for having explicit read-side markers!


Table 9.3: RCU Publish-Subscribe and Version Maintenance APIs

Category Primitives Overhead

Pointer publish rcu_assign_pointer() Memory barrier


rcu_replace_pointer() Memory barrier (two of them on Alpha)
rcu_pointer_handoff() Simple instructions
RCU_INIT_POINTER() Simple instructions
RCU_POINTER_INITIALIZER() Compile-time constant

Pointer subscribe (traversal) rcu_access_pointer() Simple instructions


rcu_dereference() Simple instructions (memory barrier on Alpha)
rcu_dereference_check() Simple instructions (memory barrier on Alpha)
rcu_dereference_protected() Simple instructions
rcu_dereference_raw() Simple instructions (memory barrier on Alpha)
rcu_dereference_raw_notrace() Simple instructions (memory barrier on Alpha)
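The following minimal sketch ties together the publish, subscribe, and update-side-access primitives from Table 9.3. The struct myconf type, the cur_conf pointer, and the conf_lock spinlock are hypothetical, and error paths are kept trivial; this illustrates the calling pattern rather than any actual Linux-kernel code.

#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct myconf {                        /* Hypothetical RCU-protected data. */
	int a;
	int b;
};

static struct myconf __rcu *cur_conf;  /* Published pointer. */
static DEFINE_SPINLOCK(conf_lock);     /* Serializes updaters. */

static int read_a(void)                /* Reader: subscribe to current version. */
{
	struct myconf *p;
	int ret = -1;

	rcu_read_lock();
	p = rcu_dereference(cur_conf); /* Checked by lockdep. */
	if (p)
		ret = p->a;
	rcu_read_unlock();
	return ret;
}

static int update_conf(int a, int b)   /* Updater: publish new, free old. */
{
	struct myconf *newp, *oldp;

	newp = kmalloc(sizeof(*newp), GFP_KERNEL);
	if (!newp)
		return -ENOMEM;
	newp->a = a;
	newp->b = b;
	spin_lock(&conf_lock);
	oldp = rcu_dereference_protected(cur_conf,
					 lockdep_is_held(&conf_lock));
	rcu_assign_pointer(cur_conf, newp);   /* Publish. */
	spin_unlock(&conf_lock);
	synchronize_rcu();                    /* Wait for pre-existing readers. */
	kfree(oldp);
	return 0;
}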

In some cases, the lockdep expressions can be very complex, for example, when using fine-grained locking, any of a very large number of locks might be held, and it might be quite difficult to work out which applies. In these (hopefully rare) cases, rcu_dereference_raw() provides protection but does not check for being invoked within a reader or with any particular lock being held. The rcu_dereference_raw_notrace() API member acts similarly, but cannot be traced, and may therefore be safely used by tracing code.

Although pretty much any linked structure can be accessed by manipulating pointers, higher-level structures can be quite helpful. The next section therefore looks at various sorts of RCU-protected linked lists used by the Linux kernel.

9.5.3.4 RCU has List-Processing APIs

Although rcu_assign_pointer() and rcu_dereference() can in theory be used to construct any conceivable RCU-protected data structure, in practice it is often better to use higher-level constructs. Therefore, the rcu_assign_pointer() and rcu_dereference() primitives have been embedded in special RCU variants of Linux's list-manipulation API. Linux has four variants of doubly linked list, the circular struct list_head and the linear struct hlist_head/struct hlist_node, struct hlist_nulls_head/struct hlist_nulls_node, and struct hlist_bl_head/struct hlist_bl_node pairs. The former is laid out as shown in Figure 9.16, where the green (leftmost) boxes represent the list header and the blue (rightmost three) boxes represent the elements in the list. This notation is cumbersome, and will therefore be abbreviated as shown in Figure 9.17, which shows only the non-header (blue) elements.

Figure 9.16: Linux Circular Linked List (list)
Figure 9.17: Linux Linked List Abbreviated
Figure 9.18: Linux Linear Linked List (hlist)

15 The "h" stands for hashtable, in which it reduces memory use by half compared to Linux's double-pointer circular linked list.

Linux's hlist15 is a linear list, which means that it needs only one pointer for the header rather than the two required for the circular list, as shown in Figure 9.18. Thus, use of hlist can halve the memory consumption


for the hash-bucket arrays of large hash tables. As before, values or the new set of values, but not a mixture of the
this notation is cumbersome, so hlist structures will be two sets. For example, each node of a linked list might
abbreviated in the same way list_head-style lists are, as have integer fields ->a, ->b, and ->c, and it might be
shown in Figure 9.17. necessary to update a given node’s fields from 5, 6, and 7
A variant of Linux’s hlist, named hlist_nulls, to 5, 2, and 3, respectively.
provides multiple distinct NULL pointers, but otherwise The code implementing this atomic update is straight-
uses the same layout as shown in Figure 9.18. In this forward:
variant, a ->next pointer having a zero low-order bit is
considered to be a pointer. However, if the low-order bit is 15 q = kmalloc(sizeof(*p), GFP_KERNEL);
16 *q = *p;
set to one, the upper bits identify the type of NULL pointer. 17 q->b = 2;
This type of list is used to allow lockless readers to detect 18 q->c = 3;
19 list_replace_rcu(&p->list, &q->list);
when a node has been moved from one list to another. For 20 synchronize_rcu();
example, each bucket of a hash table might use its index to 21 kfree(p);
mark its NULL pointer. Should a reader encounter a NULL
pointer not matching the index of the bucket it started from,
The following discussion walks through this code, using
that reader knows that an element it was traversing was
Figure 9.19 to illustrate the state changes. The triples
moved to some other bucket during the traversal, taking
in each element represent the values of fields ->a, ->b,
that reader with it. The reader can use the is_a_nulls()
and ->c, respectively. The red-shaded elements might
function (which returns true if passed an hlist_nulls
be referenced by readers, and because readers do not
NULL pointer) to determine when it reaches the end of a list,
synchronize directly with updaters, readers might run
and the get_nulls_value() function (which returns its
concurrently with this entire replacement process. Please
argument’s NULL-pointer identifier) to fetch the type of
note that backwards pointers and the link from the tail to
NULL pointer. When get_nulls_value() returns an
the head are omitted for clarity.
unexpected value, the reader can take corrective action,
The initial state of the list, including the pointer p, is
for example, restarting its traversal from the beginning.
the same as for the deletion example, as shown on the first
Quick Quiz 9.45: But what if an hlist_nulls reader gets row of the figure.
moved to some other bucket and then back again? The following text describes how to replace the 5,6,7
element with 5,2,3 in such a way that any given reader
More information on hlist_nulls is available in sees one of these two values.
the Linux-kernel source tree, with helpful example code Line 15 allocates a replacement element, resulting in
provided in the rculist_nulls.rst file (rculist_ the state as shown in the second row of Figure 9.19. At
nulls.txt in older kernels). this point, no reader can hold a reference to the newly
Another variant of Linux’s hlist incorporates bit- allocated element (as indicated by its green shading), and
locking, and is named hlist_bl. This variant uses the it is uninitialized (as indicated by the question marks).
same layout as shown in Figure 9.18, but reserves the Line 16 copies the old element to the new one, resulting
low-order bit of the head pointer (“first” in the figure) to in the state as shown in the third row of Figure 9.19.
lock the list. This approach also reduces memory usage, The newly allocated element still cannot be referenced by
as it allows what would otherwise be a separate spinlock readers, but it is now initialized.
to be stored with the pointer itself. Line 17 updates q->b to the value “2”, and line 18
The API members for these linked-list variants are updates q->c to the value “3”, as shown on the fourth row
summarized in Table 9.4. More information is available in of Figure 9.19. Note that the newly allocated structure is
the Documentation/RCU directory of the Linux-kernel still inaccessible to readers.
source tree and at Linux Weekly News [McK19b]. Now, line 19 does the replacement, so that the new
However, the remainder of this section expands on element is finally visible to readers, and hence is shaded
the use of list_replace_rcu(), given that this API red, as shown on the fifth row of Figure 9.19. At this
member gave RCU its name. This API member is used to point, as shown below, we have two versions of the list.
carry out more complex updates in which an element in Pre-existing readers might see the 5,6,7 element (which
the middle of the list having multiple fields is atomically is therefore now shaded yellow), but new readers will
updated, so that a given reader sees either the old set of instead see the 5,2,3 element. But any given reader is

Table 9.4: RCU-Protected List APIs

list
list: Circular doubly linked list hlist
hlist: Linear doubly linked list hlist_nulls
hlist_nulls: Linear doubly linked list hlist_bl
hlist_bl: Linear doubly linked list
with marked NULL pointer, with up to with bit locking
31 bits of marking

Structures
struct list_head struct hlist_head struct hlist_nulls_head struct hlist_bl_head
struct hlist_node struct hlist_nulls_node struct hlist_bl_node
Initialization
INIT_LIST_HEAD_RCU()
9.5. READ-COPY UPDATE (RCU)

Full traversal
list_for_each_entry_rcu() hlist_for_each_entry_rcu() hlist_nulls_for_each_entry_rcu() hlist_bl_for_each_entry_rcu()
list_for_each_entry_lockless() hlist_for_each_entry_rcu_bh() hlist_nulls_for_each_entry_safe()
hlist_for_each_entry_rcu_notrace()
Resume traversal
list_for_each_entry_continue_rcu() hlist_for_each_entry_continue_rcu()
list_for_each_entry_from_rcu() hlist_for_each_entry_continue_rcu_bh()
hlist_for_each_entry_from_rcu()
Stepwise traversal
list_entry_rcu() hlist_first_rcu() hlist_nulls_first_rcu() hlist_bl_first_rcu()
list_entry_lockless() hlist_next_rcu() hlist_nulls_next_rcu()
list_first_or_null_rcu() hlist_pprev_rcu()
list_next_rcu()
list_next_or_null_rcu()
Add
list_add_rcu() hlist_add_before_rcu() hlist_nulls_add_head_rcu() hlist_bl_add_head_rcu()
list_add_tail_rcu() hlist_add_behind_rcu() hlist_bl_set_first_rcu()
hlist_add_head_rcu()
hlist_add_tail_rcu()
Delete
list_del_rcu() hlist_del_rcu() hlist_nulls_del_rcu() hlist_bl_del_rcu()
hlist_del_init_rcu() hlist_nulls_del_init_rcu() hlist_bl_del_init_rcu()
Replace
list_replace_rcu() hlist_replace_rcu()
Splice
list_splice_init_rcu() list_splice_tail_init_rcu()


Table 9.5: RCU Diagnostic APIs

Category Primitives

Mark RCU pointer __rcu


1,2,3 5,6,7 11,4,8
Debug-object support init_rcu_head()
destroy_rcu_head()
Allocate init_rcu_head_on_stack()
destroy_rcu_head_on_stack()
?,?,? Stall-warning control rcu_cpu_stall_reset()

Callback checking rcu_head_init()


1,2,3 5,6,7 11,4,8 rcu_head_after_call_rcu()

lockdep support rcu_read_lock_held()


Copy rcu_read_lock_bh_held()
rcu_read_lock_sched_held()
srcu_read_lock_held()
5,6,7 rcu_is_watching()
RCU_LOCKDEP_WARN()
1,2,3 5,6,7 11,4,8 RCU_NONIDLE()
rcu_sleep_check()

Update

guaranteed to see one set of values or the other, not a


5,2,3 mixture of the two.
After the synchronize_rcu() on line 20 returns, a
1,2,3 5,6,7 11,4,8 grace period will have elapsed, and so all reads that started
before the list_replace_rcu() will have completed.
In particular, any readers that might have been holding
list_replace_rcu()
references to the 5,6,7 element are guaranteed to have
exited their RCU read-side critical sections, and are thus
5,2,3 prohibited from continuing to hold a reference. Therefore,
there can no longer be any readers holding references to
1,2,3 5,6,7 11,4,8
the old element, as indicated its green shading in the sixth
row of Figure 9.19. As far as the readers are concerned,
we are back to having a single version of the list, but with
synchronize_rcu() the new element in place of the old.
After the kfree() on line 21 completes, the list will
5,2,3 appear as shown on the final row of Figure 9.19.
Despite the fact that RCU was named after the replace-
1,2,3 5,6,7 11,4,8
ment case, the vast majority of RCU usage within the
Linux kernel relies on the simple independent insertion and
kfree()
deletion, as was shown in Figure 9.15 in Section 9.5.2.3.
The next section looks at APIs that assist developers in
debugging their code that makes use of RCU.
1,2,3 5,2,3 11,4,8

Figure 9.19: RCU Replacement in Linked List 9.5.3.5 RCU Has Diagnostic APIs
Table 9.5 shows RCU’s diagnostic APIs.
The __rcu tag marks an RCU-protected pointer,
for example, “struct foo __rcu *p;”. Pointers


that might be passed to rcu_dereference() can be


marked, but pointers holding values returned from rcu_
NMI
dereference() should not be. Providing these markings

RCU List Traversal


on variables, structure fields, function parameters, and re-

rcu_read_unlock()
rcu_dereference()
rcu_read_lock()
turn values allows the Linux kernel’s sparse tool to detect

rcu_assign_pointer()
situations where RCU-protected pointers are incorrectly

RCU List Mutation


IRQ
accessed using plain C-language loads and stores.

call_rcu()
Debug-object support is automatic for any rcu_head
structures that are part of a structure obtained from the
Linux kernel’s memory allocators, but those building
Process synchronize_rcu()
their own special-purpose memory allocators can use
init_rcu_head() and destroy_rcu_head() at allo-
cation and free time, respectively. Those using rcu_head
structures allocated on the function-call stack (it happens!) Figure 9.20: RCU API Usage Constraints
may use init_rcu_head_on_stack() before first use
and destroy_rcu_head_on_stack() after last use, but
before returning from the function. Debug-object sup- CPUs, which means that rcu_is_watching() does not
port allows detection of bugs involving passing the same apply to SRCU.
rcu_head structure to call_rcu() and friends in quick RCU_LOCKDEP_WARN() emits a warning if lockdep is
succession, which is the call_rcu() counterpart to the enabled and if its argument evaluates to true. For exam-
infamous double-free class of memory-allocation bugs. ple, RCU_LOCKDEP_WARN(!rcu_read_lock_held())
Stall-warning control is provided by rcu_cpu_stall_ would emit a warning if invoked outside of an RCU
reset(), which allows the caller to suppress RCU CPU read-side critical section.
stall warnings for the remainder of the current grace period. RCU_NONIDLE() may be used to force RCU to watch
RCU CPU stall warnings help pinpoint situations where an when executing the statement that is passed in as the sole
RCU read-side critical section runs for an excessive length argument. For example, RCU_NONIDLE(WARN_ON(!rcu_
of time, and it is useful for things like kernel debuggers to is_watching())) would never emit a warning. How-
be able to suppress them, for example, when encountering ever, changes in the 2020–2021 timeframe extend RCU’s
a breakpoint. reach deeper into the idle loop, which should greatly
Callback checking is provided by rcu_head_init() reduce or even eliminate the need for RCU_NONIDLE().
and rcu_head_after_call_rcu(). The former is in- Finally, rcu_sleep_check() emits a warning if in-
voked on an rcu_head structure before it is passed to voked within an RCU, RCU-bh, or RCU-sched read-side
call_rcu(), and then rcu_head_after_call_rcu() critical section.
will check to see if the callback has been invoked with the
specified function.
9.5.3.6 Where Can RCU’s APIs Be Used?
Support for lockdep [Cor06a] includes rcu_read_
lock_held(), rcu_read_lock_bh_held(), rcu_ Figure 9.20 shows which APIs may be used in which
read_lock_sched_held(), and srcu_read_lock_ in-kernel environments. The RCU read-side primitives
held(), each of which returns true if invoked within the may be used in any environment, including NMI, the RCU
corresponding type of RCU read-side critical section. mutation and asynchronous grace-period primitives may
Quick Quiz 9.46: Why isn’t there a rcu_read_lock_ be used in any environment other than NMI, and, finally,
tasks_held() for Tasks RCU? the RCU synchronous grace-period primitives may be
used only in process context. The RCU list-traversal prim-
Because rcu_read_lock() cannot be used from the itives include list_for_each_entry_rcu(), hlist_
idle loop, and because energy-efficiency concerns have for_each_entry_rcu(), etc. Similarly, the RCU list-
caused the idle loop to become quite ornate, rcu_is_ mutation primitives include list_add_rcu(), hlist_
watching() returns true if invoked in a context where del_rcu(), etc.
use of rcu_read_lock() is legal. Note again that srcu_ Note that primitives from other families of RCU may
read_lock() may be used from idle and even offline be substituted, for example, srcu_read_lock() may be


used in any context in which rcu_read_lock() may be


used.

9.5.3.7 So, What is RCU Really? Table 9.6: RCU Usage


At its core, RCU is nothing more nor less than an API Mechanism RCU Replaces Page
that supports publication and subscription for insertions,
waiting for all RCU readers to complete, and maintenance RCU for pre-BSD routing 160
of multiple versions. That said, it is possible to build Wait for pre-existing things to finish 163
higher-level constructs on top of RCU, including the Phased state change 164
reader-writer-locking, reference-counting, and existence- Add-only list (publish/subscribe) 165
guarantee constructs listed in Section 9.5.4. Furthermore, Type-safe memory 165
I have no doubt that the Linux community will continue Existence Guarantee 166
to find interesting new uses for RCU, just as they do for Light-weight garbage collector 166
any of a number of synchronization primitives throughout Delete-only list 166
the kernel.
Quasi reader-writer lock 167
Of course, a more-complete view of RCU would also
Quasi reference count 173
include all of the things you can do with these APIs.
However, for many people, a complete view of RCU Quasi multi-version concurrency control (MVCC) 175
must include sample RCU implementations. Appendix B
therefore presents a series of “toy” RCU implementations
of increasing complexity and capability, though others
might prefer the classic “User-Level Implementations of
Read-Copy Update” [DMS+ 12]. For everyone else, the
next section gives an overview of some RCU use cases.

9.5.4 RCU Usage Listing 9.14: RCU Pre-BSD Routing Table Lookup
1 struct route_entry {
This section answers the question “What is RCU?” from 2 struct rcu_head rh;
3 struct cds_list_head re_next;
the viewpoint of the uses to which RCU can be put. 4 unsigned long addr;
Because RCU is most frequently used to replace some 5 unsigned long iface;
6 int re_freed;
existing mechanism, we look at it primarily in terms of 7 };
its relationship to such mechanisms, as listed in Table 9.6 8 CDS_LIST_HEAD(route_list);
9 DEFINE_SPINLOCK(routelock);
and as displayed in Figure 9.23. Following the sections 10
listed in this table, Section 9.5.4.12 provides a summary. 11 unsigned long route_lookup(unsigned long addr)
12 {
13 struct route_entry *rep;
14 unsigned long ret;
9.5.4.1 RCU for Pre-BSD Routing 15
16 rcu_read_lock();
In contrast to the later sections, this section focuses on a 17 cds_list_for_each_entry_rcu(rep, &route_list, re_next) {
18 if (rep->addr == addr) {
very specific use case for the purpose of comparison with 19 ret = rep->iface;
other mechanisms. 20 if (READ_ONCE(rep->re_freed))
21 abort();
Listings 9.14 and 9.15 show code for an RCU-protected 22 rcu_read_unlock();
Pre-BSD routing table (route_rcu.c). The former 23 return ret;
24 }
shows data structures and route_lookup(), and the 25 }
latter shows route_add() and route_del(). 26 rcu_read_unlock();
27 return ULONG_MAX;
In Listing 9.14, line 2 adds the ->rh field used by 28 }
RCU reclamation, line 6 adds the ->re_freed use-after-
free-check field, lines 16, 22, and 26 add RCU read-side
protection, and lines 20 and 21 add the use-after-free check.
In Listing 9.15, lines 11, 13, 30, 34, and 39 add update-side


Listing 9.15: RCU Pre-BSD Routing Table Add/Delete 2.5x107


1 int route_add(unsigned long addr, unsigned long interface)
{
2x107
2

Lookups per Millisecond


3 struct route_entry *rep; ideal
4
5 rep = malloc(sizeof(*rep)); 7
if (!rep) 1.5x10
6 RCU
7 return -ENOMEM;
8 rep->addr = addr;
9 rep->iface = interface; 1x107
10 rep->re_freed = 0;
11 spin_lock(&routelock); seqlock
12 cds_list_add_rcu(&rep->re_next, &route_list); 5x106
13 spin_unlock(&routelock); hazptr
14 return 0;
15 } 0
16 0 50 100 150 200 250 300 350 400 450
17 static void route_cb(struct rcu_head *rhp) Number of CPUs (Threads)
18 {
19 struct route_entry *rep; Figure 9.21: Pre-BSD Routing Table Protected by RCU
20
21 rep = container_of(rhp, struct route_entry, rh);
22 WRITE_ONCE(rep->re_freed, 1);
23 free(rep);
24 } 2.5x107
25
26 int route_del(unsigned long addr)
{ 2x107 RCU-QSBR

Lookups per Millisecond


27
28 struct route_entry *rep;
29
30 spin_lock(&routelock); 1.5x107 ideal
RCU
31 cds_list_for_each_entry(rep, &route_list, re_next) {
32 if (rep->addr == addr) {
33 cds_list_del_rcu(&rep->re_next); 1x107
34 spin_unlock(&routelock);
35 call_rcu(&rep->rh, route_cb);
seqlock
36 return 0; 5x106
37 } hazptr
38 }
39 spin_unlock(&routelock); 0
40 return -ENOENT; 0 50 100 150 200 250 300 350 400 450
41 } Number of CPUs (Threads)

Figure 9.22: Pre-BSD Routing Table Protected by RCU


QSBR
locking, lines 12 and 33 add RCU update-side protection,
line 35 causes route_cb() to be invoked after a grace
period elapses, and lines 17–24 define route_cb(). This
is minimal added code for a working concurrent imple- Quick Quiz 9.47: Wait, what??? How can RCU QSBR
mentation. possibly be better than ideal? Just what rubbish definition of
ideal would fail to be the best of all possible results???
Figure 9.21 shows the performance on the read-only
workload. RCU scales quite well, and offers nearly ideal
performance. However, this data was generated using the
RCU_SIGNAL flavor of userspace RCU [Des09b, MDJ13c], Quick Quiz 9.48: Given RCU QSBR’s read-side perfor-
for which rcu_read_lock() and rcu_read_unlock() mance, why bother with any other flavor of userspace RCU?
generate a small amount of code. What happens for the
QSBR flavor of RCU, which generates no code at all
for rcu_read_lock() and rcu_read_unlock()? (See Although Pre-BSD routing is an excellent RCU use
Section 9.5.1, and especially Figure 9.8, for a discussion case, it is worthwhile looking at the relationships betweeen
of RCU QSBR.) the wider spectrum of use cases shown in Figure 9.23.
The answer to this is shown in Figure 9.22, which shows This task is taken up by the following sections.
that RCU QSBR’s performance and scalability actually While reading these sections, please ask yourself which
exceeds that of the ideal synchronization-free workload. of these use cases best describes Pre-BSD routing.


Figure 9.23: Relationships Between RCU Use Cases (use cases shown, from most foundational to most elaborate: Wait for Pre-Existing Things to Finish; Publish/Subscribe for Linked Structure; Add-Only List; Phased State Change; Type-Safe Memory; Existence Guarantee; Light-Weight Garbage Collector for Delete-Only List; Quasi Reader-Writer Lock; Quasi Reference Count; Quasi Multi-Version Concurrency Control)


9.5.4.2 Wait for Pre-Existing Things to Finish

As noted in Section 9.5.2, an important component of RCU is a way of waiting for RCU readers to finish. One of RCU's great strengths is that it allows you to wait for each of thousands of different things to finish without having to explicitly track each and every one of them, and without incurring the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes that use explicit tracking.

In this section, we will show how synchronize_sched()'s read-side counterparts (which include anything that disables preemption, along with hardware operations and primitives that disable interrupts) permit you to interact with non-maskable interrupt (NMI) handlers, which is quite difficult using locking. This approach has been called "Pure RCU" [McK04], and it is used in a few places in the Linux kernel.

The basic form of such "Pure RCU" designs is as follows:

1. Make a change, for example, to the way that the OS reacts to an NMI.

2. Wait for all pre-existing read-side critical sections to completely finish (for example, by using the synchronize_sched() primitive).16 The key observation here is that subsequent RCU read-side critical sections are guaranteed to see whatever change was made.

3. Clean up, for example, return status indicating that the change was successfully made.

The remainder of this section presents example code adapted from the Linux kernel. In this example, the timer_stop() function in the now-defunct oprofile facility uses synchronize_sched() to ensure that all in-flight NMI notifications have completed before freeing the associated resources. A simplified version of this code is shown in Listing 9.16.

Listing 9.16: Using RCU to Wait for NMIs to Finish
 1 struct profile_buffer {
 2     long size;
 3     atomic_t entry[0];
 4 };
 5 static struct profile_buffer *buf = NULL;
 6
 7 void nmi_profile(unsigned long pcvalue)
 8 {
 9     struct profile_buffer *p = rcu_dereference(buf);
10
11     if (p == NULL)
12         return;
13     if (pcvalue >= p->size)
14         return;
15     atomic_inc(&p->entry[pcvalue]);
16 }
17
18 void nmi_stop(void)
19 {
20     struct profile_buffer *p = buf;
21
22     if (p == NULL)
23         return;
24     rcu_assign_pointer(buf, NULL);
25     synchronize_sched();
26     kfree(p);
27 }

Lines 1–4 define a profile_buffer structure, containing a size and an indefinite array of entries. Line 5 defines a pointer to a profile buffer, which is presumably initialized elsewhere to point to a dynamically allocated region of memory.

Lines 7–16 define the nmi_profile() function, which is called from within an NMI handler. As such, it cannot be preempted, nor can it be interrupted by a normal interrupt handler; however, it is still subject to delays due to cache misses, ECC errors, and cycle stealing by other hardware threads within the same core. Line 9 gets a local pointer to the profile buffer using the rcu_dereference() primitive to ensure memory ordering on DEC Alpha, and lines 11 and 12 exit from this function if there is no profile buffer currently allocated, while lines 13 and 14 exit from this function if the pcvalue argument is out of range. Otherwise, line 15 increments the profile-buffer entry indexed by the pcvalue argument. Note that storing the size with the buffer guarantees that the range check matches the buffer, even if a large buffer is suddenly replaced by a smaller one.

Lines 18–27 define the nmi_stop() function, where the caller is responsible for mutual exclusion (for example, holding the correct lock). Line 20 fetches a pointer to the profile buffer, and lines 22 and 23 exit the function if there is no buffer. Otherwise, line 24 NULLs out the profile-buffer pointer (using the rcu_assign_pointer() primitive to maintain memory ordering on weakly ordered machines), and line 25 waits for an RCU Sched grace period to elapse, in particular, waiting for all non-preemptible regions of code, including NMI handlers, to complete. Once execution continues at line 26, we are guaranteed that any instance of nmi_profile() that obtained a pointer to the old buffer has returned. It is therefore safe to free the buffer, in this case using the kfree() primitive.

Footnote 16: In Linux kernel v5.1 and later, synchronize_sched() has been subsumed into synchronize_rcu().


Quick Quiz 9.49: Suppose that the nmi_profile() function was preemptible. What would need to change to make this example work correctly?

In short, RCU makes it easy to dynamically switch among profile buffers (you just try doing this efficiently with atomic operations, or at all with locking!). This is a rare use of RCU in its pure form. RCU is normally used at higher levels of abstraction, as will be shown in the following sections.

9.5.4.3 Phased State Change

Figure 9.24 shows a timeline for an example phased state change to efficiently handle maintenance operations. If there is no maintenance operation in progress, common-case operations must proceed quickly, for example, without acquiring a reader-writer lock. However, if there is a maintenance operation in progress, the common-case operations must be undertaken carefully, taking into account added complexities due to their running concurrently with that maintenance operation. This means that common-case operations will incur higher overhead during maintenance operations, which is one reason that maintenance operations are normally scheduled to take place during times of low load.

In the figure, these apparently conflicting requirements are resolved by having a prepare phase prior to the maintenance operation and a cleanup phase after it, during which the common-case operations can proceed either quickly or carefully.

Figure 9.24: Phased State Change for Maintenance Operation (timeline diagram: common-case operations proceed quickly, then either quickly or carefully during the prepare phase, carefully during the maintenance operation, either during the cleanup phase, and quickly thereafter)

Example pseudo-code for this phased state change is shown in Listing 9.17. The common-case operations are carried out by cco() within an RCU read-side critical section extending from line 5 to line 10. Here, line 6 checks a global be_careful flag, invoking cco_carefully() or cco_quickly(), as indicated.

Listing 9.17: Phased State Change for Maintenance Operations
 1 bool be_careful;
 2
 3 void cco(void)
 4 {
 5     rcu_read_lock();
 6     if (READ_ONCE(be_careful))
 7         cco_carefully();
 8     else
 9         cco_quickly();
10     rcu_read_unlock();
11 }
12
13 void maint(void)
14 {
15     WRITE_ONCE(be_careful, true);
16     synchronize_rcu();
17     do_maint();
18     synchronize_rcu();
19     WRITE_ONCE(be_careful, false);
20 }

This allows the maint() function to set the be_careful flag on line 15 and wait for an RCU grace period on line 16. When control reaches line 17, all cco() functions that saw a false value of be_careful (and thus which might invoke the cco_quickly() function) will have completed their operations, so that all currently executing cco() functions will be invoking cco_carefully(). This means that it is safe for the do_maint() function to be invoked. Line 18 then waits for all cco() functions that might have run concurrently with do_maint() to complete, and finally line 19 sets the be_careful flag back to false.

Quick Quiz 9.50: What is the point of the second call to synchronize_rcu() in function maint() in Listing 9.17? Isn't it OK for any cco() invocations in the clean-up phase to invoke either cco_carefully() or cco_quickly()?

Quick Quiz 9.51: How can you be sure that the code shown in maint() in Listing 9.17 really works?

Phased state change allows frequent operations to use light-weight checks, without the need for expensive lock acquisitions or atomic read-modify-write operations. Phased state change adds only a checked state variable to the wait-to-finish use case (Section 9.5.4.2), thus also residing at a rather low level of abstraction.


9.5.4.4 Add-Only List

Add-only data structures, exemplified by the add-only list, can be used for a surprisingly common set of use cases, perhaps most commonly the logging of changes. Add-only data structures are a pure use of RCU's underlying publish/subscribe mechanism.

An add-only variant of a pre-BSD routing table can be derived from Listings 9.14 and 9.15. Because there is no deletion, the route_del() and route_cb() functions may be dispensed with, along with the ->rh and ->re_freed fields of the route_entry structure, the rcu_read_lock() and rcu_read_unlock() invocations in the route_lookup() function, and all uses of the ->re_freed field in all remaining functions.

Of course, if there are many concurrent invocations of the route_add() function, there will be heavy contention on routelock, and if lockless techniques are used, heavy memory contention on routelist. The usual way to avoid this contention is to use a concurrency-friendly data structure such as a hash table (see Chapter 10). Alternatively, per-CPU data structures might be periodically merged into a single global data structure.

On the other hand, if there is never any deletion, extended time periods featuring many concurrent invocations of route_add() will eventually consume all available memory. Therefore, most RCU-protected data structures also implement deletion.

9.5.4.5 Type-Safe Memory

A number of lockless algorithms do not require that a given data element keep the same identity through a given RCU read-side critical section referencing it—but only if that data element retains the same type. In other words, these lockless algorithms can tolerate a given data element being freed and reallocated as the same type of structure while they are referencing it, but must prohibit a change in type. This guarantee, called "type-safe memory" in academic literature [GC96], is weaker than the existence guarantees discussed in Section 9.5.4.6, and is therefore quite a bit harder to work with. Type-safe memory algorithms in the Linux kernel make use of slab caches, specially marking these caches with SLAB_TYPESAFE_BY_RCU so that RCU is used when returning a freed-up slab to system memory. This use of RCU guarantees that any in-use element of such a slab will remain in that slab, thus retaining its type, for the duration of any pre-existing RCU read-side critical sections.

Quick Quiz 9.52: But what if there is an arbitrarily long series of RCU read-side critical sections in multiple threads, so that at any point in time there is at least one thread in the system executing in an RCU read-side critical section? Wouldn't that prevent any data from a SLAB_TYPESAFE_BY_RCU slab ever being returned to the system, possibly resulting in OOM events?

It is important to note that SLAB_TYPESAFE_BY_RCU will in no way prevent kmem_cache_alloc() from immediately reallocating memory that was just now freed via kmem_cache_free()! In fact, the SLAB_TYPESAFE_BY_RCU-protected data structure just returned by rcu_dereference might be freed and reallocated an arbitrarily large number of times, even when under the protection of rcu_read_lock(). Instead, SLAB_TYPESAFE_BY_RCU operates by preventing kmem_cache_free() from returning a completely freed-up slab of data structures to the system until after an RCU grace period elapses. In short, although a given RCU read-side critical section might see a given SLAB_TYPESAFE_BY_RCU data element being freed and reallocated arbitrarily often, the element's type is guaranteed not to change until that critical section has completed.

These algorithms therefore typically use a validation step that checks to make sure that the newly referenced data structure really is the one that was requested [LS86, Section 2.5]. These validation checks require that portions of the data structure remain untouched by the free-reallocate process. Such validation checks are usually very hard to get right, and can hide subtle and difficult bugs.

Therefore, although type-safety-based lockless algorithms can be extremely helpful in a very few difficult situations, you should instead use existence guarantees where possible. Simpler is after all almost always better! On the other hand, type-safety-based lockless algorithms can provide improved cache locality, and thus improved performance. This improved cache locality is provided by the fact that such algorithms can immediately reallocate a newly freed block of memory. In contrast, algorithms based on existence guarantees must wait for all pre-existing readers before reallocating memory, by which time that memory may have been ejected from CPU caches.

As can be seen in Figure 9.23, RCU's type-safe-memory use case combines both the wait-to-finish and publish-subscribe components, but in the Linux kernel also includes the slab allocator's deferred reclamation specified by the SLAB_TYPESAFE_BY_RCU flag.
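As a rough illustration of the validation-step pattern described above, consider the following sketch of a lookup on a SLAB_TYPESAFE_BY_RCU-protected structure. The conn structure, conn_table[] array, CONN_HASH_SIZE, conn_hash(), and conn_put() are illustrative assumptions rather than Linux-kernel APIs, although rcu_read_lock(), rcu_dereference(), and refcount_inc_not_zero() are real primitives. This is only a sketch of the technique, not a definitive implementation:

    struct conn {
        int key;
        refcount_t ref;    /* drops to zero when the object is logically freed */
        /* other fields */
    };

    /* Hypothetical hash table of conn pointers, filled in elsewhere. */
    static struct conn __rcu *conn_table[CONN_HASH_SIZE];

    struct conn *conn_lookup(int key)
    {
        struct conn *cp;

        rcu_read_lock();
        cp = rcu_dereference(conn_table[conn_hash(key)]);
        if (cp && refcount_inc_not_zero(&cp->ref)) {
            /* Validation step: type safety guarantees only that *cp is
             * still a struct conn, so recheck the key after acquiring
             * the reference in case the element was freed and reused. */
            if (cp->key == key) {
                rcu_read_unlock();
                return cp;    /* caller invokes conn_put() when done */
            }
            conn_put(cp);     /* wrong object: drop reference and fail */
        }
        rcu_read_unlock();
        return NULL;
    }

The validation check after refcount_inc_not_zero() is exactly the hard-to-get-right step referred to above: it must rely only on fields that the free-reallocate process leaves in a well-defined state.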


9.5.4.6 Existence Guarantee

Gamsa et al. [GKAS99] discuss existence guarantees and describe how a mechanism resembling RCU can be used to provide these existence guarantees (see Section 5 on page 7 of the PDF), and Section 7.4 discusses how to guarantee existence via locking, along with the ensuing disadvantages of doing so. The effect is that if any RCU-protected data element is accessed within an RCU read-side critical section, that data element is guaranteed to remain in existence for the duration of that RCU read-side critical section.

Listing 9.18: Existence Guarantees Enable Per-Element Locking
 1 int delete(int key)
 2 {
 3     struct element *p;
 4     int b;
 5
 6     b = hashfunction(key);
 7     rcu_read_lock();
 8     p = rcu_dereference(hashtable[b]);
 9     if (p == NULL || p->key != key) {
10         rcu_read_unlock();
11         return 0;
12     }
13     spin_lock(&p->lock);
14     if (hashtable[b] == p && p->key == key) {
15         rcu_read_unlock();
16         rcu_assign_pointer(hashtable[b], NULL);
17         spin_unlock(&p->lock);
18         synchronize_rcu();
19         kfree(p);
20         return 1;
21     }
22     spin_unlock(&p->lock);
23     rcu_read_unlock();
24     return 0;
25 }

Listing 9.18 demonstrates how RCU-based existence guarantees can enable per-element locking via a function that deletes an element from a hash table. Line 6 computes a hash function, and line 7 enters an RCU read-side critical section. If line 9 finds that the corresponding bucket of the hash table is empty or that the element present is not the one we wish to delete, then line 10 exits the RCU read-side critical section and line 11 indicates failure.

Quick Quiz 9.53: What if the element we need to delete is not the first element of the list on line 9 of Listing 9.18?

Otherwise, line 13 acquires the update-side spinlock, and line 14 then checks that the element is still the one that we want. If so, line 15 leaves the RCU read-side critical section, line 16 removes it from the table, line 17 releases the lock, line 18 waits for all pre-existing RCU read-side critical sections to complete, line 19 frees the newly removed element, and line 20 indicates success. If the element is no longer the one we want, line 22 releases the lock, line 23 leaves the RCU read-side critical section, and line 24 indicates failure to delete the specified key.

Quick Quiz 9.54: Why is it OK to exit the RCU read-side critical section on line 15 of Listing 9.18 before releasing the lock on line 17?

Quick Quiz 9.55: Why not exit the RCU read-side critical section on line 23 of Listing 9.18 before releasing the lock on line 22?

Alert readers will recognize this as only a slight variation on the original wait-to-finish theme (Section 9.5.4.2), adding publish/subscribe, linked structures, a heap allocator (typically), and deferred reclamation, as shown in Figure 9.23. They might also note the deadlock-immunity advantages over the lock-based existence guarantees discussed in Section 7.4.

9.5.4.7 Light-Weight Garbage Collector

A not-uncommon exclamation made by people first learning about RCU is "RCU is sort of like a garbage collector!" This exclamation has a large grain of truth, but it can also be misleading.

Perhaps the best way to think of the relationship between RCU and automatic garbage collectors (GCs) is that RCU resembles a GC in that the timing of collection is automatically determined, but that RCU differs from a GC in that: (1) The programmer must manually indicate when a given data structure is eligible to be collected and (2) The programmer must manually mark the RCU read-side critical sections where references might be held.

Despite these differences, the resemblance does go quite deep. In fact, the first RCU-like mechanism I am aware of used a reference-count-based garbage collector to handle the grace periods [KL80], and the connection between RCU and garbage collection has been noted more recently [SWS16].

The light-weight garbage collector use case is very similar to the existence-guarantee use case, adding only the desired non-blocking algorithm to the mix. This light-weight garbage collector use case can also be used in conjunction with the existence guarantees described in the next section.
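As one possible illustration, RCU can act as a light-weight garbage collector for a Treiber-style lock-free stack, deferring the freeing of popped nodes until pre-existing readers have finished and thereby avoiding both use-after-free and the classic ABA failure in pop(). The node structure and the pop() function below are assumptions for the sake of the sketch, not Linux-kernel code, although rcu_read_lock(), rcu_dereference(), cmpxchg(), call_rcu(), and container_of() are real Linux-kernel primitives:

    struct node {
        struct node *next;
        struct rcu_head rh;
        int value;
    };

    static struct node *top;    /* lock-free stack head */

    static void node_free_cb(struct rcu_head *rhp)
    {
        kfree(container_of(rhp, struct node, rh));
    }

    int pop(int *valuep)
    {
        struct node *p, *next;

        rcu_read_lock();    /* keep *p from being reused under us */
        for (;;) {
            p = rcu_dereference(top);
            if (!p) {
                rcu_read_unlock();
                return 0;    /* stack was empty */
            }
            next = READ_ONCE(p->next);
            if (cmpxchg(&top, p, next) == p)
                break;       /* we now own p */
        }
        rcu_read_unlock();
        *valuep = p->value;
        call_rcu(&p->rh, node_free_cb);    /* deferred free: the "GC" step */
        return 1;
    }

Because each successful pop() defers its kfree() until all pre-existing readers have completed, a concurrent pop() still examining p cannot be fooled by that memory being immediately recycled, which is the usual source of ABA failures in cmpxchg()-based stacks.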


9.5.4.8 Delete-Only List

The delete-only list is the less-popular counterpart to the add-only list covered in Section 9.5.4.4, and can be thought of as the existence-guarantee use case, but without the publish/subscribe component, as shown in Figure 9.23. A delete-only list can be used when the universe of possible members of the list is known at initialization, and where members can be removed. For example, elements of the list might represent hardware elements of the system that are subject to failure, but cannot be repaired or replaced without a reboot.

A delete-only variant of a pre-BSD routing table can be derived from Listings 9.14 and 9.15. Because there is no addition, the route_add() function may be dispensed with, or, alternatively, its use might be restricted to initialization time. In theory, the route_lookup() function can use a non-RCU iterator, though in the Linux kernel this will result in complaints from debug code. In addition, the incremental cost of an RCU iterator is usually negligible.

As a result, delete-only situations typically use algorithms and data structures that are designed for addition as well as deletion.

9.5.4.9 Quasi Reader-Writer Lock

Perhaps the most common use of RCU within the Linux kernel is as a replacement for reader-writer locking in read-intensive situations. Nevertheless, this use of RCU was not immediately apparent to me at the outset. In fact, I chose to implement a lightweight reader-writer lock [HW92]17 before implementing a general-purpose RCU implementation back in the early 1990s. Each and every one of the uses I envisioned for the lightweight reader-writer lock was instead implemented using RCU. In fact, it was more than three years before the lightweight reader-writer lock saw its first use. Boy, did I feel foolish!

Footnote 17: Similar to brlock in the 2.4 Linux kernel and to lglock in more recent Linux kernels.

The key similarity between RCU and reader-writer locking is that both have read-side critical sections that can execute concurrently. In fact, in some cases, it is possible to mechanically substitute RCU API members for the corresponding reader-writer lock API members. But first, why bother?

Advantages of RCU include performance, deadlock immunity, and realtime latency. There are, of course, limitations to RCU, including the fact that readers and updaters run concurrently, that low-priority RCU readers can block high-priority threads waiting for a grace period to elapse, and that grace-period latencies can extend for many milliseconds. These advantages and limitations are discussed in the following sections.

Performance The read-side performance advantages of Linux-kernel RCU over reader-writer locking are shown in Figure 9.25, which was generated on a 448-CPU 2.10 GHz Intel x86 system.

Figure 9.25: Performance Advantage of RCU Over Reader-Writer Locking (log-log plot of nanoseconds per operation versus number of CPUs (threads), with one trace for rwlock and one for RCU)

Quick Quiz 9.56: WTF? How the heck do you expect me to believe that RCU can have less than a 300-picosecond overhead when the clock period at 2.10 GHz is almost 500 picoseconds?

Quick Quiz 9.57: Didn't an earlier edition of this book show RCU read-side overhead way down in the sub-picosecond range? What happened???

Quick Quiz 9.58: Why is there such large variation for the RCU trace in Figure 9.25?

Note that reader-writer locking is more than an order of magnitude slower than RCU on a single CPU, and is more than four orders of magnitude slower on 192 CPUs. In contrast, RCU scales quite well. In both cases, the error bars cover the full range of the measurements from 30 runs, with the line being the median.

A more moderate view may be obtained from a CONFIG_PREEMPT kernel, though RCU still beats reader-writer locking by between a factor of seven on a single CPU and by three orders of magnitude on 192 CPUs, as shown in Figure 9.26, which was generated on the same 448-CPU 2.10 GHz x86 system. Note the high variability of reader-writer locking at larger numbers of CPUs. The error bars span the full range of data.


Figure 9.26: Performance Advantage of Preemptible RCU Over Reader-Writer Locking (log-log plot of nanoseconds per operation versus number of CPUs (threads), with one trace for rwlock and one for RCU)

Quick Quiz 9.59: Given that the system had no fewer than 448 hardware threads, why only 192 CPUs?

Of course, the low performance of reader-writer locking in Figures 9.25 and 9.26 is exaggerated by the unrealistic zero-length critical sections. The performance advantages of RCU decrease as the overhead of the critical sections increase, as shown in Figure 9.27, which was run on the same system as the previous plots. Here, the y-axis represents the sum of the overhead of the read-side primitives and that of the critical section and the x-axis represents the critical-section overhead in nanoseconds. But please note the logscale y axis, which means that the small separations between the traces still represent significant differences. This figure shows non-preemptible RCU, but given that preemptible RCU's read-side overhead is only about three nanoseconds, its plot would be nearly identical to Figure 9.27.

Figure 9.27: Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration, 192 CPUs (log-log plot of nanoseconds per operation versus critical-section duration in nanoseconds, with rwlock traces for 100 CPUs, 10 CPUs, and 1 CPU, plus an RCU trace)

Quick Quiz 9.60: Why the larger error ranges for the submicrosecond durations in Figure 9.27?

There are three traces for reader-writer locking, with the upper trace being for 100 CPUs, the next for 10 CPUs, and the lowest for 1 CPU. The greater the number of CPUs and the shorter the critical sections, the greater is RCU's performance advantage. These performance advantages are underscored by the fact that 100-CPU systems are no longer uncommon and that a number of system calls (and thus any RCU read-side critical sections that they contain) complete within microseconds.

In addition, as is discussed in the next section, RCU read-side primitives are almost entirely deadlock-immune.

Deadlock Immunity Although RCU offers significant performance advantages for read-mostly workloads, one of the primary reasons for creating RCU in the first place was in fact its immunity to read-side deadlocks. This immunity stems from the fact that RCU read-side primitives do not block, spin, or even do backwards branches, so that their execution time is deterministic. It is therefore impossible for them to participate in a deadlock cycle.

Quick Quiz 9.61: Is there an exception to this deadlock immunity, and if so, what sequence of events could lead to deadlock?

An interesting consequence of RCU's read-side deadlock immunity is that it is possible to unconditionally upgrade an RCU reader to an RCU updater. Attempting to do such an upgrade with reader-writer locking results in deadlock. A sample code fragment that does an RCU read-to-update upgrade follows:

 1 rcu_read_lock();
 2 list_for_each_entry_rcu(p, &head, list_field) {
 3     do_something_with(p);
 4     if (need_update(p)) {
 5         spin_lock(&my_lock);
 6         do_update(p);
 7         spin_unlock(&my_lock);
 8     }
 9 }
10 rcu_read_unlock();

Note that do_update() is executed under the protection of the lock and under RCU read-side protection.


Another interesting consequence of RCU's deadlock immunity is its immunity to a large class of priority inversion problems. For example, low-priority RCU readers cannot prevent a high-priority RCU updater from acquiring the update-side lock. Similarly, a low-priority RCU updater cannot prevent high-priority RCU readers from entering an RCU read-side critical section.

Quick Quiz 9.62: Immunity to both deadlock and priority inversion??? Sounds too good to be true. Why should I believe that this is even possible?

Realtime Latency Because RCU read-side primitives neither spin nor block, they offer excellent realtime latencies. In addition, as noted earlier, this means that they are immune to priority inversion involving the RCU read-side primitives and locks.

However, RCU is susceptible to more subtle priority-inversion scenarios, for example, a high-priority process blocked waiting for an RCU grace period to elapse can be blocked by low-priority RCU readers in -rt kernels. This can be solved by using RCU priority boosting [McK07d, GMTW08].

However, use of RCU priority boosting requires that rcu_read_unlock() do deboosting, which entails acquiring scheduler locks. Some care is therefore required within the scheduler and RCU to avoid deadlocks, which as of the v5.15 Linux kernel requires RCU to avoid invoking the scheduler while holding any of RCU's locks.

This in turn means that rcu_read_unlock() is not always lockless when RCU priority boosting is enabled. However, rcu_read_unlock() will still be lockless if its critical section was not priority-boosted. Furthermore, critical sections will not be priority boosted unless they are preempted, or, in -rt kernels, they acquire non-raw spinlocks. This means that rcu_read_unlock() will normally be lockless from the perspective of the highest priority task running on any given CPU.

RCU Readers and Updaters Run Concurrently Because RCU readers never spin nor block, and because updaters are not subject to any sort of rollback or abort semantics, RCU readers and updaters really can run concurrently. This means that RCU readers might access stale data, and might even see inconsistencies, either of which can render conversion from reader-writer locking to RCU non-trivial.

However, in a surprisingly large number of situations, inconsistencies and stale data are not problems. The classic example is the networking routing table. Because routing updates can take considerable time to reach a given system (seconds or even minutes), the system will have been sending packets the wrong way for quite some time when the update arrives. It is usually not a problem to continue sending updates the wrong way for a few additional milliseconds. Furthermore, because RCU updaters can make changes without waiting for RCU readers to finish, the RCU readers might well see the change more quickly than would batch-fair reader-writer-locking readers, as shown in Figure 9.28.

Figure 9.28: Response Time of RCU vs. Reader-Writer Locking (timeline diagram: once the update is received, rwlock readers spin while the rwlock writer runs and only subsequent readers see the new value, whereas RCU readers run concurrently with the RCU updater and later RCU readers see the new value sooner)

Quick Quiz 9.63: But how many other algorithms really tolerate stale and inconsistent data?

Once the update is received, the rwlock writer cannot proceed until the last reader completes, and subsequent readers cannot proceed until the writer completes. However, these subsequent readers are guaranteed to see the new value, as indicated by the green shading of the rightmost boxes. In contrast, RCU readers and updaters do not block each other, which permits the RCU readers to see the updated values sooner. Of course, because their execution overlaps that of the RCU updater, all of the RCU readers might well see updated values, including the three readers that started before the update. Nevertheless only the green-shaded rightmost RCU readers are guaranteed to see the updated values.

Reader-writer locking and RCU simply provide different guarantees. With reader-writer locking, any reader that begins after the writer begins is guaranteed to see new values, and any reader that attempts to begin while the writer is spinning might or might not see new values, depending on the reader/writer preference of the rwlock
implementation in question. In contrast, with RCU, any reader that begins after the updater completes is guaranteed to see new values, and any reader that completes after the updater begins might or might not see new values, depending on timing.

The key point here is that, although reader-writer locking does indeed guarantee consistency within the confines of the computer system, there are situations where this consistency comes at the price of increased inconsistency with the outside world, courtesy of the finite speed of light and the non-zero size of atoms. In other words, reader-writer locking obtains internal consistency at the price of silently stale data with respect to the outside world.

Note that if a value is computed while read-holding a reader-writer lock, and then that value is used after that lock is released, then this reader-writer-locking use case is using stale data. After all, the quantities that this value is based on could change at any time after that lock is released. This sort of reader-writer-locking use case is often easy to convert to RCU, as will be shown in Listings 9.19, 9.20, and 9.21 and the accompanying text.

Low-Priority RCU Readers Can Block High-Priority Reclaimers In Realtime RCU [GMTW08] or SRCU [McK06], a preempted reader will prevent a grace period from completing, even if a high-priority task is blocked waiting for that grace period to complete. Realtime RCU can avoid this problem by substituting call_rcu() for synchronize_rcu() or by using RCU priority boosting [McK07d, GMTW08]. It might someday be necessary to augment SRCU and RCU Tasks Trace with priority boosting, but not before a clear real-world need is demonstrated.

Quick Quiz 9.64: If Tasks RCU Trace might someday be priority boosted, why not also Tasks RCU and Tasks RCU Rude?

RCU Grace Periods Extend for Many Milliseconds With the exception of userspace RCU [Des09b, MDJ13c], expedited grace periods, and several of the "toy" RCU implementations described in Appendix B, RCU grace periods extend milliseconds. Although there are a number of techniques to render such long delays harmless, including use of the asynchronous interfaces (call_rcu() and call_rcu_bh()) or of the polling interfaces (get_state_synchronize_rcu(), start_poll_synchronize_rcu(), and poll_state_synchronize_rcu()), this situation is a major reason for the rule of thumb that RCU be used in read-mostly situations.

As noted in Section 9.5.3, within the Linux kernel, shorter grace periods may be obtained via expedited grace periods, for example, by invoking synchronize_rcu_expedited() instead of synchronize_rcu(). Expedited grace periods can reduce delays to as little as a few tens of microseconds, albeit at the expense of higher CPU utilization and IPIs. The added IPIs can be especially unwelcome in some real-time workloads.

Code: Reader-Writer Locking vs. RCU In the best case, the conversion from reader-writer locking to RCU is quite simple, as shown in Listings 9.19, 9.20, and 9.21, all taken from Wikipedia [MPA+ 06].

However, the transformation is not always this straightforward. This is because neither the spin_lock() nor the synchronize_rcu() in Listing 9.21 exclude the readers in Listing 9.20. First, the spin_lock() does not interact in any way with rcu_read_lock() and rcu_read_unlock(), thus not excluding them. Second, although both write_lock() and synchronize_rcu() wait for pre-existing readers, only write_lock() prevents subsequent readers from commencing.18 Thus, synchronize_rcu() cannot exclude readers. Nevertheless, a great many situations using reader-writer locking can be converted to RCU.

More-elaborate cases of replacing reader-writer locking with RCU may be found elsewhere [Bro15a, Bro15b].

Footnote 18: Kudos to whoever pointed this out to Paul.

Semantics: Reader-Writer Locking vs. RCU Expanding on the previous section, reader-writer locking semantics can be roughly and informally summarized by the following three temporal constraints:

1. Write-side acquisitions wait for any read-holders to release the lock.

2. Writer-side acquisitions wait for any write-holder to release the lock.

3. Read-side acquisitions wait for any write-holder to release the lock.

RCU dispenses entirely with constraint #3 and weakens the other two as follows:

1. Writers wait for any pre-existing read-holders before progressing to the destructive phase of their update (usually the freeing of memory).

2. Writers synchronize with each other as needed.


Listing 9.19: Converting Reader-Writer Locking to RCU: Data (reader-writer-locking version followed by the RCU version)

Reader-writer locking:
 1 struct el {
 2     struct list_head lp;
 3     long key;
 4     spinlock_t mutex;
 5     int data;
 6     /* Other data fields */
 7 };
 8 DEFINE_RWLOCK(listmutex);
 9 LIST_HEAD(head);

RCU:
 1 struct el {
 2     struct list_head lp;
 3     long key;
 4     spinlock_t mutex;
 5     int data;
 6     /* Other data fields */
 7 };
 8 DEFINE_SPINLOCK(listmutex);
 9 LIST_HEAD(head);

Listing 9.20: Converting Reader-Writer Locking to RCU: Search (reader-writer-locking version followed by the RCU version)

Reader-writer locking:
 1 int search(long key, int *result)
 2 {
 3     struct el *p;
 4
 5     read_lock(&listmutex);
 6     list_for_each_entry(p, &head, lp) {
 7         if (p->key == key) {
 8             *result = p->data;
 9             read_unlock(&listmutex);
10             return 1;
11         }
12     }
13     read_unlock(&listmutex);
14     return 0;
15 }

RCU:
 1 int search(long key, int *result)
 2 {
 3     struct el *p;
 4
 5     rcu_read_lock();
 6     list_for_each_entry_rcu(p, &head, lp) {
 7         if (p->key == key) {
 8             *result = p->data;
 9             rcu_read_unlock();
10             return 1;
11         }
12     }
13     rcu_read_unlock();
14     return 0;
15 }

Listing 9.21: Converting Reader-Writer Locking to RCU: Deletion (reader-writer-locking version followed by the RCU version)

Reader-writer locking:
 1 int delete(long key)
 2 {
 3     struct el *p;
 4
 5     write_lock(&listmutex);
 6     list_for_each_entry(p, &head, lp) {
 7         if (p->key == key) {
 8             list_del(&p->lp);
 9             write_unlock(&listmutex);
10             kfree(p);
11             return 1;
12         }
13     }
14     write_unlock(&listmutex);
15     return 0;
16 }

RCU:
 1 int delete(long key)
 2 {
 3     struct el *p;
 4
 5     spin_lock(&listmutex);
 6     list_for_each_entry(p, &head, lp) {
 7         if (p->key == key) {
 8             list_del_rcu(&p->lp);
 9             spin_unlock(&listmutex);
10             synchronize_rcu();
11             kfree(p);
12             return 1;
13         }
14     }
15     spin_unlock(&listmutex);
16     return 0;
17 }
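Listings 9.19–9.21 show the data declarations, search, and deletion. For completeness, a hypothetical insertion function for the RCU version might look as follows. This sketch is not part of the Wikipedia example, although spin_lock(), list_add_rcu(), kmalloc(), and friends are real Linux-kernel APIs:

    int insert(long key, int data)
    {
        struct el *p;

        p = kmalloc(sizeof(*p), GFP_KERNEL);
        if (!p)
            return 0;
        p->key = key;
        p->data = data;
        spin_lock(&listmutex);          /* serialize updaters only */
        list_add_rcu(&p->lp, &head);    /* publish the new element to readers */
        spin_unlock(&listmutex);
        return 1;
    }

Note that the spinlock excludes only other updaters; concurrent RCU readers traverse the list without blocking, relying on list_add_rcu() to publish the new element safely.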


It is of course this weakening that permits RCU implementations to attain excellent performance and scalability. It also allows RCU to implement the aforementioned unconditional read-to-write upgrade that is so attractive and so deadlock-prone in reader-writer locking. Code using RCU can compensate for this weakening in a surprisingly large number of ways, but most commonly by imposing spatial constraints:

1. New data is placed in newly allocated memory.

2. Old data is freed, but only after:

   (a) That data has been unlinked so as to be inaccessible to later readers, and

   (b) A subsequent RCU grace period has elapsed.

Of course, there are some reader-writer-locking use cases for which RCU's weakened semantics are inappropriate, but experience in the Linux kernel indicates that more than 80% of reader-writer locks can in fact be replaced by RCU. For example, a common reader-writer-locking use case computes some value while holding the lock and then uses that value after releasing that lock. This use case results in stale data, and therefore often accommodates RCU's weaker semantics.

This interaction of temporal and spatial constraints is illustrated by the RCU singleton data structure illustrated in Figures 9.6 and 9.7. This structure is defined on lines 1–4 of Listing 9.22, and contains two integer fields, ->a and ->b (singleton.c). The current instance of this structure is referenced by the curconfig pointer defined on line 4.

Listing 9.22: RCU Singleton Get
 1 struct myconfig {
 2     int a;
 3     int b;
 4 } *curconfig;
 5
 6 int get_config(int *cur_a, int *cur_b)
 7 {
 8     struct myconfig *mcp;
 9
10     rcu_read_lock();
11     mcp = rcu_dereference(curconfig);
12     if (!mcp) {
13         rcu_read_unlock();
14         return 0;
15     }
16     *cur_a = mcp->a;
17     *cur_b = mcp->b;
18     rcu_read_unlock();
19     return 1;
20 }

The fields of the current structure are passed back through the cur_a and cur_b parameters to the get_config() function defined on lines 6–20. These two fields can be slightly out of date, but they absolutely must be consistent with each other. The get_config() function provides this consistency within the RCU read-side critical section starting on line 10 and ending on either line 13 or line 18, which provides the needed temporal synchronization. Line 11 fetches the pointer to the current myconfig structure. This structure will be used regardless of any concurrent changes due to calls to the set_config() function, thus providing the needed spatial synchronization. If line 12 determines that the curconfig pointer was NULL, line 14 returns failure. Otherwise, lines 16 and 17 copy out the ->a and ->b fields and line 19 returns success. These ->a and ->b fields are from the same myconfig structure, and the RCU read-side critical section prevents this structure from being freed, thus guaranteeing that these two fields are consistent with each other.

Listing 9.23: RCU Singleton Set
 1 void set_config(int cur_a, int cur_b)
 2 {
 3     struct myconfig *mcp;
 4
 5     mcp = malloc(sizeof(*mcp));
 6     BUG_ON(!mcp);
 7     mcp->a = cur_a;
 8     mcp->b = cur_b;
 9     mcp = xchg(&curconfig, mcp);
10     if (mcp) {
11         synchronize_rcu();
12         free(mcp);
13     }
14 }

The structure is updated by the set_config() function shown in Listing 9.23. Lines 5–8 allocate and initialize a new myconfig structure. Line 9 atomically exchanges a pointer to this new structure with the pointer to the old structure in curconfig, while also providing full memory ordering both before and after the xchg() operation, thus providing the needed updater/reader spatial synchronization on the one hand and the needed updater/updater synchronization on the other. If line 10 determines that the pointer to the old structure was in fact non-NULL, line 11 waits for a grace period (thus providing the needed reader/updater temporal synchronization) and line 12 frees the old structure, safe in the knowledge that there are no longer any readers still referencing it.

Figure 9.29 shows an abbreviated representation of get_config() on the left and right and a similarly abbreviated representation of set_config() in the middle.


Figure 9.29: RCU Spatial/Temporal Synchronization (timeline diagram: time advances downward and address space rightward; one reader calls rcu_read_lock(), loads mcp, and copies out ->a and ->b from the old "5,25" structure, while the updater allocates the new "9,81" structure, publishes it with xchg(&curconfig, mcp), waits for a grace period via synchronize_rcu(), and then invokes kfree(mcp); a second reader then loads mcp and copies out ->a and ->b from the new structure)

Time advances from top to bottom, and the address space of the objects referenced by curconfig advances from left to right. The boxes with comma-separated numbers each represent a myconfig structure, with the constraint that ->b is the square of ->a. Each blue dash-dotted arrow represents an interaction with the old structure (on the left, containing "5,25") and each green dashed arrow represents an interaction with the new structure (on the right, containing "9,81").

The black dotted arrows represent temporal relationships between RCU readers on the left and right and the RCU grace period at center, with each arrow pointing from an older event to a newer event. The call to synchronize_rcu() followed the leftmost rcu_read_lock(), and therefore that synchronize_rcu() invocation must not return until after the corresponding rcu_read_unlock(). In contrast, the call to synchronize_rcu() precedes the rightmost rcu_read_lock(), which allows the return from that same synchronize_rcu() to ignore the corresponding rcu_read_unlock(). These temporal relationships prevent the myconfig structures from being freed while RCU readers are still accessing them.

The two horizontal grey dashed lines represent the period of time during which different readers get different results, however, each reader will see one and only one of the two objects. Before the first horizontal line, all readers see the leftmost myconfig structure, and after the second horizontal line, all readers will see the rightmost structure. Between the two lines, that is, during the grace period, different readers might see different objects, but as long as each reader loads the curconfig pointer only once, each reader will see a consistent view of its myconfig structure.

In short, when operating on a suitable linked data structure, RCU combines temporal and spatial synchronization in order to approximate reader-writer locking, with RCU read-side critical sections acting as the reader-writer-locking reader, as shown in Figures 9.23 and 9.29. RCU's temporal synchronization is provided by the read-side markers, for example, rcu_read_lock() and rcu_read_unlock(), as well as the update-side grace-period primitives, for example, synchronize_rcu() or call_rcu(). The spatial synchronization is provided by the read-side rcu_dereference() family of primitives, each of which subscribes to a version published by rcu_assign_pointer().19 RCU's combining of temporal and spatial synchronization contrasts to the schemes presented in Sections 6.3.2, 6.3.3, and 7.1.4, in which temporal and spatial synchronization are provided separately by locking and by static data-structure layout, respectively.

Footnote 19: Preferably with both rcu_dereference() and rcu_assign_pointer() being embedded in higher-level APIs.

Quick Quiz 9.65: Is RCU the only synchronization mechanism that combines temporal and spatial synchronization in this way?


9.5.4.10 Quasi Reference Count

Because grace periods are not allowed to complete while there is an RCU read-side critical section in progress, the RCU read-side primitives may be used as a restricted reference-counting mechanism. For example, consider the following code fragment:

1 rcu_read_lock();    /* acquire reference. */
2 p = rcu_dereference(head);
3 /* do something with p. */
4 rcu_read_unlock();  /* release reference. */

The combination of the rcu_read_lock() and rcu_dereference() primitives can be thought of as acquiring a reference to p, because a grace period starting after the rcu_dereference() assignment to p cannot possibly end until after we reach the matching rcu_read_unlock(). This reference-counting scheme is restricted in that it is forbidden to wait for RCU grace periods within RCU read-side critical sections, and also forbidden to hand off an RCU read-side critical section's references from one task to another.

Regardless of these restrictions, the following code can safely delete p:

1 spin_lock(&mylock);
2 p = head;
3 rcu_assign_pointer(head, NULL);
4 spin_unlock(&mylock);
5 /* Wait for all references to be released. */
6 synchronize_rcu();
7 kfree(p);

The assignment to head prevents any future references to p from being acquired, and the synchronize_rcu() waits for any previously acquired references to be released.

Quick Quiz 9.66: But wait! This is exactly the same code that might be used when thinking of RCU as a replacement for reader-writer locking! What gives?

Of course, RCU can also be combined with traditional reference counting, as discussed in Section 13.2.

But why bother? Again, part of the answer is performance, as shown in Figures 9.30 and 9.31, again showing data taken on a 448-CPU 2.1 GHz Intel x86 system for non-preemptible and preemptible Linux-kernel RCU, respectively. Non-preemptible RCU's advantage over reference counting ranges from more than an order of magnitude at one CPU up to about four orders of magnitude at 192 CPUs. Preemptible RCU's advantage ranges from about a factor of three at one CPU up to about three orders of magnitude at 192 CPUs.

Figure 9.30: Performance of RCU vs. Reference Counting (log-log plot of nanoseconds per operation versus number of CPUs (threads), with one trace for refcnt and one for RCU)

Figure 9.31: Performance of Preemptible RCU vs. Reference Counting (log-log plot of nanoseconds per operation versus number of CPUs (threads), with one trace for refcnt and one for RCU)

Figure 9.32: Response Time of RCU vs. Reference Counting, 192 CPUs (log-log plot of nanoseconds per operation versus critical-section duration in nanoseconds, with refcnt traces for 100 CPUs, 10 CPUs, and 1 CPU, plus an RCU trace)

However, as with reader-writer locking, the performance advantages of RCU are most pronounced for short-duration critical sections and for large numbers of CPUs, as shown in Figure 9.32 for the same system.


In addition, as with reader-writer locking, many system calls (and thus any RCU read-side critical sections that they contain) complete in a few microseconds.

Although traditional reference counters are usually associated with a specific data structure, or perhaps a specific group of data structures, this approach does have some disadvantages. For example, maintaining a single global reference counter for a large variety of data structures typically results in bouncing the cache line containing the reference count. As we saw in Figures 9.30–9.32, such cache-line bouncing can severely degrade performance.

In contrast, RCU's lightweight rcu_read_lock(), rcu_dereference(), and rcu_read_unlock() read-side primitives permit extremely frequent read-side usage with negligible performance degradation. Except that the calls to rcu_dereference() are not doing anything specific to acquire a reference to the pointed-to object. The heavy lifting is instead done by the rcu_read_lock() and rcu_read_unlock() primitives and their interactions with RCU grace periods.

And ignoring those calls to rcu_dereference() permits RCU to be thought of as a "bulk reference-counting" mechanism, where each call to rcu_read_lock() obtains a reference on each and every RCU-protected object, and with little or no overhead. However, the restrictions that go with RCU can be quite onerous. For example, in many cases, the Linux-kernel prohibition against sleeping while in an RCU read-side critical section would defeat the entire purpose. Such cases might be better served by the hazard pointers mechanism described in Section 9.3. Cases where code rarely sleeps have been handled by using RCU as a reference count in the common non-sleeping case and by bridging to an explicit reference counter when sleeping is necessary.

Alternatively, situations where a reference must be held by a single task across a section of code that sleeps may be accommodated with Sleepable RCU (SRCU) [McK06]. This fails to cover the not-uncommon situation where a reference is "passed" from one task to another, for example, when a reference is acquired when starting an I/O and released in the corresponding completion interrupt handler. Again, such cases might be better handled by explicit reference counters or by hazard pointers.

Of course, SRCU brings restrictions of its own, namely that the return value from srcu_read_lock() be passed into the corresponding srcu_read_unlock(), and that no SRCU primitives be invoked from hardware interrupt handlers or from non-maskable interrupt (NMI) handlers. The jury is still out as to how much of a problem is presented by this restriction, and as to how it can best be handled.

However, in the common case where references are held within the confines of a single CPU or task, RCU can be used as a high-performance and highly scalable reference-counting mechanism.

As shown in Figure 9.23, quasi reference counts add RCU readers as individual or bulk reference counts, possibly also bridging to reference counters in corner cases.

9.5.4.11 Quasi Multi-Version Concurrency Control

RCU can also be thought of as a simplified multi-version concurrency control (MVCC) mechanism with weak consistency criteria. The multi-version aspects were touched upon in Section 9.5.2.3. However, in its native form, RCU provides version consistency only within a given RCU-protected data element.

Nevertheless, there are situations where consistency and fresh data are required across multiple data elements. Fortunately, there are a number of approaches that avoid inconsistency and stale data, including the following:

1. Enclose RCU readers within sequence-locking readers, forcing the RCU readers to be retried should an update occur, as described in Section 13.4.2 and Section 13.4.3 (see the sketch following this list).

2. Place the data that must be consistent into a single element of a linked data structure, and refrain from updating those fields within any element visible to RCU readers. RCU readers gaining a reference to any such element are then guaranteed to see consistent values. See Section 13.5.4 for additional details.

3. Use a per-element lock that guards a "deleted" flag to allow RCU readers to reject stale data [McK04, ACMS03].

4. Provide an existence flag that is referenced by all data elements whose update is to appear atomic to RCU readers [McK14d, McK14a, McK15b, McK16b, McK16a].

5. Use one of a wide range of counter-based methods [McK08a, McK10, MW11, McK14b, MSFM15, KMK+ 19]. In these approaches, updaters maintain a version number and maintain links to old versions of a given piece of data. Readers take a snapshot of the current version number, and, if necessary, traverse the links to find a version consistent with that snapshot.
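The following sketch illustrates the first approach, enclosing an RCU reader within a sequence-lock reader so that the reader sees mutually consistent values from two separate RCU-protected elements. The cfg structure, the cfg_a/cfg_b pointers, and the function names are illustrative assumptions, while seqlock_t, DEFINE_SEQLOCK(), read_seqbegin(), read_seqretry(), write_seqlock(), and the RCU primitives are real Linux-kernel APIs. This is a sketch of the pattern under those assumptions, not a definitive implementation:

    struct cfg {
        int val;
    };

    static struct cfg __rcu *cfg_a;
    static struct cfg __rcu *cfg_b;
    static DEFINE_SEQLOCK(cfg_seq);

    /* Reader: returns a pair of values that came from a single update,
     * retrying if an updater ran concurrently. */
    void cfg_read(int *a, int *b)
    {
        struct cfg *pa, *pb;
        unsigned int seq;

        do {
            seq = read_seqbegin(&cfg_seq);
            rcu_read_lock();
            pa = rcu_dereference(cfg_a);
            pb = rcu_dereference(cfg_b);
            *a = pa ? pa->val : 0;
            *b = pb ? pb->val : 0;
            rcu_read_unlock();
        } while (read_seqretry(&cfg_seq, seq));
    }

    /* Updater: publishes a new pair of elements, then frees the old
     * pair after a grace period. */
    void cfg_update(struct cfg *newa, struct cfg *newb)
    {
        struct cfg *olda, *oldb;

        write_seqlock(&cfg_seq);        /* makes concurrent readers retry */
        olda = rcu_dereference_protected(cfg_a, 1);
        oldb = rcu_dereference_protected(cfg_b, 1);
        rcu_assign_pointer(cfg_a, newa);
        rcu_assign_pointer(cfg_b, newb);
        write_sequnlock(&cfg_seq);
        synchronize_rcu();              /* wait for pre-existing RCU readers */
        kfree(olda);
        kfree(oldb);
    }

Here the RCU read-side critical section guarantees that the dereferenced elements remain in existence, while the enclosing sequence-lock reader guarantees that the two loads either both preceded or both followed any given update.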


In short, when using RCU to approximate multi-version concurrency control, you only pay for the level of consistency that you actually need.

As shown in Figure 9.23, quasi multi-version concurrency control is based on existence guarantees, adding read-side snapshot operations and constraints on readers and writers, the exact form of the constraint being dictated by the consistency requirements, as summarized above.

9.5.4.12 RCU Usage Summary

At its core, RCU is nothing more nor less than an API that provides:

1. A publish-subscribe mechanism for adding new data,

2. A way of waiting for pre-existing RCU readers to finish, and

3. A discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.

That said, it is possible to build higher-level constructs on top of RCU, including the various use cases described in the earlier sections. Furthermore, I have no doubt that new use cases will continue to be found for RCU, as well as for any of a number of other synchronization primitives.

Quick Quiz 9.67: Which of these use cases best describes the Pre-BSD routing example in Section 9.5.4.1?

In the meantime, Figure 9.33 shows some rough rules of thumb on where RCU is most helpful.

Figure 9.33: RCU Areas of Applicability (diagram spanning 100% reads to 100% writes on one axis and "stale and inconsistent data OK" to "need fully fresh and consistent data" on the other, with the Pre-BSD routing table placed in the read-mostly, stale-and-inconsistent-data-OK region; footnotes: 1. RCU provides ABA protection for update-friendly synchronization mechanisms; 2. RCU provides bounded wait-free read-side primitives for real-time use)

As shown in the blue box in the upper-right corner of the figure, RCU works best if you have read-mostly data where stale and inconsistent data is permissible (but see below for more information on stale and inconsistent data). The canonical example of this case in the Linux kernel is routing tables. Because it may have taken many seconds or even minutes for the routing updates to propagate across the Internet, the system has been sending packets the wrong way for quite some time. Having some small probability of continuing to send some of them the wrong way for a few more milliseconds is almost never a problem.

If you have a read-mostly workload where consistent data is required, RCU works well, as shown by the green "read-mostly, need consistent data" box. One example of this case is the Linux kernel's mapping from user-level System-V semaphore IDs to the corresponding in-kernel data structures. Semaphores tend to be used far more frequently than they are created and destroyed, so this mapping is read-mostly. However, it would be erroneous to perform a semaphore operation on a semaphore that has already been deleted. This need for consistency is handled by using the lock in the in-kernel semaphore data structure, along with a "deleted" flag that is set when deleting a semaphore. If a user ID maps to an in-kernel data structure with the "deleted" flag set, the data structure is ignored, so that the user ID is flagged as invalid.

Although this requires that the readers acquire a lock for the data structure representing the semaphore itself, it allows them to dispense with locking for the mapping data structure. The readers therefore locklessly traverse the tree used to map from ID to data structure, which in turn greatly improves performance, scalability, and real-time response.

As indicated by the yellow "read-write" box, RCU can also be useful for read-write workloads where consistent data is required, although usually in conjunction with a number of other synchronization primitives. For example, the directory-entry cache in recent Linux kernels uses RCU in conjunction with sequence locks, per-CPU locks, and per-data-structure locks to allow lockless traversal of pathnames in the common case. Although RCU can be very beneficial in this read-write case, the corresponding code is often more complex than that of the read-mostly cases.

Finally, as indicated by the red box in the lower-left corner of the figure, update-mostly workloads requiring consistent data are rarely good places to use RCU, though there are some exceptions [DMS+ 12]. For example, as noted in Section 9.5.4.5, within the Linux kernel, the SLAB_TYPESAFE_BY_RCU slab-allocator flag provides type-safe memory to RCU readers, which can greatly simplify non-blocking synchronization and other lockless algorithms.


In addition, if the rare readers are on critical code paths on real-time systems, use of RCU for those readers might provide real-time response benefits that more than make up for the increased update-side overhead, as discussed in Section 14.3.6.5.

In short, RCU is an API that includes a publish-subscribe mechanism for adding new data, a way of waiting for pre-existing RCU readers to finish, and a discipline of maintaining multiple versions to allow updates to avoid harming or unduly delaying concurrent RCU readers. This RCU API is best suited for read-mostly situations, especially if stale and inconsistent data can be tolerated by the application.

9.5.5 RCU Related Work

The first known mention of anything resembling RCU took the form of a bug report from Donald Knuth [Knu73, page 413 of Fundamental Algorithms] against Joseph Weizenbaum's SLIP list-processing facility for FORTRAN [Wei63]. Knuth was justified in reporting the bug, as SLIP had no notion of any sort of grace-period guarantee.

The first known non-bug-report mention of anything resembling RCU appeared in Kung's and Lehman's landmark paper [KL80]. There was some additional use of this technique in academia [ML82, ML84, Lis88, Pug90, And91, PAB+ 95, CAK+ 96, RSB+ 97, GKAS99], but much of the work in this area was carried out by practitioners [RTY+ 87, HOS89, Jac93, Joh95, SM95, SM97, SM98, MS98a]. By the year 2000, the initiative had passed to open-source projects, most notably the Linux kernel community [Rus00a, Rus00b, MS01, MAK+ 01, MSA+ 02, ACMS03].20 RCU was accepted into the Linux kernel in late 2002, with many subsequent improvements for scalability, robustness, real-time response, energy efficiency, and specialized use cases. As of 2021, Linux-kernel RCU is still under active development.

However, in the mid 2010s, there was a welcome upsurge in RCU research and development across a number of communities and institutions [Kaa15]. Section 9.5.5.1 describes uses of RCU, Section 9.5.5.2 describes RCU implementations (as well as work that both creates and uses an implementation), and finally, Section 9.5.5.3 describes verification and validation of RCU and its uses.

Footnote 20: A list of citations with well over 200 entries may be found in bib/RCU.bib in the LaTeX source for this book.

9.5.5.1 RCU Uses

Phil Howard and Jon Walpole of Portland State University (PSU) have applied RCU to red-black trees [How12, HW11] combined with updates synchronized using software transactional memory. Josh Triplett and Jon Walpole (again of PSU) applied RCU to resizable hash tables [Tri12, TMW11, Cor14c, Cor14d]. Other RCU-protected resizable hash tables have been created by Herbert Xu [Xu10] and by Mathieu Desnoyers [MDJ13a].

Austin Clements, Frans Kaashoek, and Nickolai Zeldovich of MIT created an RCU-optimized balanced binary tree (Bonsai) [CKZ12], and applied this tree to the Linux kernel's VM subsystem in order to reduce read-side contention on the Linux kernel's mmap_sem. This work resulted in order-of-magnitude speedups and scalability up to at least 80 CPUs for a microbenchmark featuring large numbers of minor page faults. This is similar to a patch developed earlier by Peter Zijlstra [Zij14], and both were limited by the fact that, at the time, filesystem data structures were not safe for RCU readers. Clements et al. avoided this limitation by optimizing the page-fault path for anonymous pages only. More recently, filesystem data structures have been made safe for RCU readers [Cor10a, Cor11], so perhaps this work can be implemented for all page types, not just anonymous pages—Peter Zijlstra has, in fact, recently prototyped exactly this, and Laurent Dufour has continued work along these lines.

Yandong Mao and Robert Morris of MIT and Eddie Kohler of Harvard University created another RCU-protected tree named Masstree [MKM12] that combines ideas from B+ trees and tries. Although this tree is about 2.5x slower than an RCU-protected hash table, it supports operations on key ranges, unlike hash tables. In addition, Masstree supports efficient storage of objects with long shared key prefixes and, furthermore, provides persistence via logging to mass storage.

The paper notes that Masstree's performance rivals that of memcached, even given that Masstree is persistently storing updates and memcached is not. The paper also compares Masstree's performance to the persistent datastores MongoDB, VoltDB, and Redis, reporting significant performance advantages for Masstree, in some cases exceeding two orders of magnitude. Another paper [TZK+ 13], by Stephen Tu, Wenting Zheng, Barbara Liskov, and Samuel Madden of MIT and Kohler, applies Masstree to an in-memory database named Silo, achieving 700K transactions per second (42M transactions per minute) on a well-known transaction-processing benchmark.

v2022.09.25a
178 CHAPTER 9. DEFERRED PROCESSING

without incurring the overhead of grace periods while lock” [LZC14], similar to those created by Gautham
holding locks. Shenoy [She06] and Srivatsa Bhat [Bha14]. The Liu
Maya Arbel and Hagit Attiya of Technion took a more et al. paper is interesting from a number of perspec-
rigorous approach [AA14] to an RCU-protected search tives [McK14g].
tree that, like Masstree, allows concurrent updates. This Mike Ash posted [Ash15] a description of an RCU-like
paper includes a proof of correctness, including proof primitive in Apple’s Objective-C runtime. This approach
that all operations on this tree are linearizable. Unfor- identifies read-side critical sections via designated code
tunately, this implementation achieves linearizability by ranges, thus qualifying as another method of achieving
incurring the full latency of grace-period waits while zero read-side overhead, albeit one that poses some in-
holding locks, which degrades scalability of update-only teresting practical challenges for large read-side critical
workloads. One way around this problem is to abandon sections that span multiple functions.
linearizability [HKLP12, McK14d]), however, Arbel and Pedro Ramalhete and Andreia Correia [RC15] pro-
Attiya instead created an RCU variant that reduces low- duced “Poor Man’s RCU”, which, despite using a pair of
end grace-period latency. Of course, nothing comes for reader-writer locks, manages to provide lock-free forward-
free, and this RCU variant appears to hit a scalability progress guarantees to readers [MP15a].
limit at about 32 CPUs. Although there is much to be Maya Arbel and Adam Morrison [AM15] produced
said for dropping linearizability, thus gaining both perfor- “Predicate RCU”, which works hard to reduce grace-period
mance and scalability, it is very good to see academics duration in order to efficiently support algorithms that
experimenting with alternative RCU implementations. hold update-side locks across grace periods. This results
in reduced batching of updates into grace periods and
9.5.5.2 RCU Implementations reduced scalability, but does succeed in providing short
grace periods.
Keir Fraser created a user-space RCU named EBR for
use in non-blocking synchronization and software trans- Quick Quiz 9.68: Why not just drop the lock before waiting
actional memory [Fra03, Fra04, FH07]. Interestingly for the grace period, or using something like call_rcu()
enough, this work cites Linux-kernel RCU on the one instead of waiting for a grace period?
hand, but also inspired the name QSBR for the original
non-preemptible Linux-kernel RCU implementation. Alexander Matveev (MIT), Nir Shavit (MIT and Tel-
Mathieu Desnoyers created a user-space RCU for use in Aviv University), Pascal Felber (University of Neuchâ-
tracing [Des09b, Des09a, DMS+ 12], which has seen use tel), and Patrick Marlier (also University of Neuchâ-
in a number of projects [BD13]. tel) [MSFM15] produced an RCU-like mechanism that
Researchers at Charles University in Prague have can be thought of as software transactional memory that
also been working on RCU implementations, including explicitly marks read-only transactions. Their use cases
dissertations by Andrej Podzimek [Pod10] and Adam require holding locks across grace periods, which lim-
Hraska [Hra13]. its scalability [MP15a, MP15b]. This appears to be the
Yujie Liu (Lehigh University), Victor Luchangco (Or- first academic RCU-related work to make good use of the
acle Labs), and Michael Spear (also Lehigh) [LLS13] rcutorture test suite, and also the first to have submitted
pressed scalable non-zero indicators (SNZI) [ELLM07] a performance improvement to Linux-kernel RCU, which
into service as a grace-period mechanism. The intended was accepted into v4.4.
use is to implement software transactional memory (see Alexander Matveev’s RLU was followed up by MV-
Section 17.2), which imposes linearizability requirements, RLU from Jaeho Kim et al. [KMK+ 19]. This work im-
which in turn seems to limit scalability. proves scalability over RLU by permitting multiple concur-
RCU-like mechanisms are also finding their way into rent updates, by avoiding holding locks across grace peri-
Java. Sivaramakrishnan et al. [SZJ12] use an RCU-like ods, and by using asynchronous grace periods, for example,
mechanism to eliminate the read barriers that are otherwise call_rcu() instead of synchronize_rcu(). This pa-
required when interacting with Java’s garbage collector, per also made some interesting performance-evaluation
resulting in significant performance improvements. choices that are discussed further in Section 17.2.3.3 on
Ran Liu, Heng Zhang, and Haibo Chen of Shanghai page 379.
Jiao Tong University created a specialized variant of RCU Adam Belay et al. created an RCU implementation that
that they used for an optimized “passive reader-writer guards the data structures used by TCP/IP’s address-

v2022.09.25a
9.6. WHICH TO CHOOSE? 179

resolution protocol (ARP) in their IX operating sys- the rcutorture test suite. The effort found several holes
tem [BPP+ 16]. in this suite’s coverage, one of which was hiding a real
Geoff Romer and Andrew Hunter (both at Google) bug (since fixed) in Tiny RCU.
proposed a cell-based API for RCU protection of singleton With some luck, all of this validation work will eventu-
data structures for inclusion in the C++ standard [RH18]. ally result in more and better tools for validating concurrent
Dimitrios Siakavaras et al. have applied HTM and RCU code.
to search trees [SNGK17, SBN+ 20], Christina Giannoula
et al. have used HTM and RCU to color graphs [GGK18],
and SeongJae Park et al. have used HTM and RCU to 9.6 Which to Choose?
optimize high-contention locking on NUMA systems.
Alex Kogan et al. applied RCU to the construction of
range locking for scalable address spaces [KDI20]. Choose always the way that seems the best, however
rough it may be; custom will soon render it easy and
Production uses of RCU are listed in Section 9.6.3.3.
agreeable.

9.5.5.3 RCU Validation Pythagoras

In early 2017, it is commonly recognized that almost Section 9.6.1 provides a high-level overview and then Sec-
any bug is a potential security exploit, so validation and tion 9.6.2 provides a more detailed view of the differences
verification are first-class concerns. between the deferred-processing techniques presented
Researchers at Stony Brook University have produced an in this chapter. This discussion assumes a linked data
RCU-aware data-race detector [Dug10, Sey12, SRK+ 11]. structure that is large enough that readers do not hold ref-
Alexey Gotsman of IMDEA, Noam Rinetzky of Tel Aviv erences from one traversal to another, and where elements
University, and Hongseok Yang of the University of Oxford might be added to and removed from the structure at any
have published a paper [GRY12] expressing the formal location and at any time. Section 9.6.3 then points out a
semantics of RCU in terms of separation logic, and have few publicly visible production uses of hazard pointers,
continued with other aspects of concurrency. sequence locking, and RCU. This discussion should help
Joseph Tassarotti (Carnegie-Mellon University), Derek you to make an informed choice between these techniques.
Dreyer (Max Planck Institute for Software Systems), and
Viktor Vafeiadis (also MPI-SWS) [TDV15] produced a
manual formal proof of correctness of the quiescent- 9.6.1 Which to Choose? (Overview)
state-based reclamation (QSBR) variant of userspace
RCU [Des09b, DMS+ 12]. Lihao Liang (University of Table 9.7 shows a few high-level properties that distinguish
Oxford), Paul E. McKenney (IBM), Daniel Kroening, the deferred-reclamation techniques from one another.
and Tom Melham (both also Oxford) [LMKM16] used The “Readers” row summarizes the results presented in
the C bounded model checker (CBMC) [CKL04] to pro- Figure 9.22, which shows that all but reference counting
duce a mechanical proof of correctness of a significant enjoy reasonably fast and scalable readers.
portion of Linux-kernel Tree RCU. Lance Roy [Roy17] The “Number of Protected Objects” row evaluates each
used CBMC to produce a similar proof of correctness technique’s need for external storage with which to record
for a significant portion of Linux-kernel sleepable RCU reader protection. RCU relies on quiescent states, and
(SRCU) [McK06]. Finally, Michalis Kokologiannakis and thus needs no storage to represent readers, whether within
Konstantinos Sagonas (National Technical University of or outside of the object. Reference counting can use a
Athens) [KS17a, KS19] used the Nighugg tool [LSLK14] single integer within each object in the structure, and no
to produce a mechanical proof of correctness of a some- additional storage is required. Hazard pointers require
what larger portion of Linux-kernel Tree RCU. external-to-object pointers be provisioned, and that there
None of these efforts located any bugs other than bugs be sufficient pointers for each CPU or thread to track all
injected into RCU specifically to test the verification the objects being referenced at any given time. Given that
tools. In contrast, Alex Groce (Oregon State University), most hazard-pointer-based traversals require only a few
Iftekhar Ahmed, Carlos Jensen (both also OSU), and Paul hazard pointers, this is not normally a problem in practice.
E. McKenney (IBM) [GAJM15] automatically mutated Of course, sequence locks provides no pointer-traversal
Linux-kernel RCU’s source code to test the coverage of protection, which is why it is normally used on static data.

v2022.09.25a
180 CHAPTER 9. DEFERRED PROCESSING

Table 9.7: Which Deferred Technique to Choose? (Overview)

Property Reference Counting Hazard Pointers Sequence Locks RCU

Readers Slow and unscalable Fast and scalable Fast and scalable Fast and scalable
Memory Overhead Counter per object Pointer per No protection None
reader per object
Duration of Protection Can be long Can be long No protection User must bound
duration
Need for Traversal If object deleted If object deleted If any update Never
Retries

Quick Quiz 9.69: Why can’t users dynamically allocate the applications, then it does not matter that RCU has duration-
hazard pointers as they are needed? limit requirements because your code already meets them.
In the same vein, if readers must already write to the
The “Duration of Protection” describes constraints (if objects that they are traversing, the read-side overhead of
any) on how long a period of time a user may protect a reference counters might not be so important. Of course, if
given object. Reference counting and hazard pointers can the data to be protected is in statically allocated variables,
both protect objects for extended time periods with no then sequence locking’s inability to protect pointers is
untoward side effects, but maintaining an RCU reference irrelevant.
to even one object prevents all other RCU from being freed.
Finally, there is some work on dynamically switching
RCU readers must therefore be relatively short in order
between hazard pointers and RCU based on dynamic
to avoid running the system out of memory, with special-
sampling of delays [BGHZ16]. This defers the choice be-
purpose implementations such as SRCU, Tasks RCU, and
tween hazard pointers and RCU to runtime, and delegates
Tasks Trace RCU being exceptions to this rule. Again,
responsibility for the decision to the software.
sequence locks provide no pointer-traversal protection,
Nevertheless, this table should be of great help when
which is why it is normally used on static data.
choosing between these techniques. But those wishing
The “Need for Traversal Retries” row tells whether a
more detail should continue on to the next section.
new reference to a given object may be acquired uncon-
ditionally, as it can with RCU, or whether the reference
acquisition can fail, resulting in a retry operation, which
is the case for reference counting, hazard pointers, and 9.6.2 Which to Choose? (Details)
sequence locks. In the case of reference counting and
Table 9.8 provides more-detailed rules of thumb that
hazard pointers, retries are only required if an attempt to
can help you choose among the four deferred-processing
acquire a reference to a given object while that object is in
techniques presented in this chapter.
the process of being deleted, a topic covered in more detail
in the next section. Sequence locking must of course retry As shown in the “Existence Guarantee” row, if you
its critical section should it run concurrently with any need existence guarantees for linked data elements, you
update. must use reference counting, hazard pointers, or RCU. Se-
quence locks do not provide existence guarantees, instead
Quick Quiz 9.70: But don’t Linux-kernel kref reference providing detection of updates, retrying any read-side
counters allow guaranteed unconditional reference acquisition? critical sections that do encounter an update.
Of course, as shown in the “Updates and Readers
Of course, different rows will have different levels of Progress Concurrently” row, this detection of updates
importance in different situations. For example, if your implies that sequence locking does not permit updaters
current code is having read-side scalability problems with and readers to make forward progress concurrently. After
hazard pointers, then it does not matter that hazard pointers all, preventing such forward progress is the whole point
can require retrying reference acquisition because your of using sequence locking in the first place! This situation
current code already handles this. Similarly, if response- points the way to using sequence locking in conjunction
time considerations already limit the duration of reader with reference counting, hazard pointers, or RCU in order
traversals, as is often the case in kernels and low-level to provide both existence guarantees and update detection.

v2022.09.25a
9.6. WHICH TO CHOOSE? 181

Table 9.8: Which Deferred Technique to Choose? (Details)

Property Reference Counting Hazard Sequence RCU


Pointers Locks

Existence Guarantees Complex Yes No Yes


Updates and Readers Yes Yes No Yes
Progress Concurrently
Contention Among High None None None
Readers
Reader Per-Critical- N/A N/A Two Ranges from none
Section Overhead smp_mb() to two smp_mb()
Reader Per-Object Read-modify-write atomic smp_mb()* None, but None (volatile
Traversal Overhead operations, memory-barrier unsafe accesses)
instructions, and cache
misses
Reader Forward Progress Lock free Lock free Blocking Bounded wait free
Guarantee
Reader Reference Can fail (conditional) Can fail Unsafe Cannot fail
Acquisition (conditional) (unconditional)
Memory Footprint Bounded Bounded Bounded Unbounded
Reclamation Forward Lock free Lock free N/A Blocking
Progress
Automatic Reclamation Yes Use Case N/A Use Case
Lines of Code 94 79 79 73
* This smp_mb() can be downgraded to a compiler barrier() by using the Linux-kernel membarrier()

system call.

In fact, the Linux kernel combines RCU and sequence could eliminate the read-side smp_mb() from hazard pointers?
locking in this manner during pathname lookup.
The “Contention Among Readers”, “Reader Per-
The “Reader Forward Progress Guarantee” row shows
Critical-Section Overhead”, and “Reader Per-Object Tra-
that only RCU has a bounded wait-free forward-progress
versal Overhead” rows give a rough sense of the read-side
guarantee, which means that it can carry out a finite
overhead of these techniques. The overhead of reference
traversal by executing a bounded number of instructions.
counting can be quite large, with contention among read-
ers along with a fully ordered read-modify-write atomic The “Reader Reference Acquisition” row indicates that
operation required for each and every object traversed. only RCU is capable of unconditionally acquiring refer-
Hazard pointers incur the overhead of a memory barrier for ences. The entry for sequence locks is “Unsafe” because,
each data element traversed, and sequence locks incur the again, sequence locks detect updates rather than acquiring
overhead of a pair of memory barriers for each attempt to references. Reference counting and hazard pointers both
execute the critical section. The overhead of RCU imple- require that traversals be restarted from the beginning if a
mentations vary from nothing to that of a pair of memory given acquisition fails. To see this, consider a linked list
barriers for each read-side critical section, thus providing containing objects A, B, C, and D, in that order, and the
RCU with the best performance, particularly for read-side following series of events:
critical sections that traverse many data elements. Of
course, the read-side overhead of all deferred-processing 1. A reader acquires a reference to object B.
variants can be reduced by batching, so that each read-side
operation covers more data.
2. An updater removes object B, but refrains from
freeing it because the reader holds a reference. The
Quick Quiz 9.71: But didn’t the answer to one of the quick
list now contains objects A, C, and D, and object B’s
quizzes in Section 9.3 say that pairwise asymmetric barriers
->next pointer is set to HAZPTR_POISON.

v2022.09.25a
182 CHAPTER 9. DEFERRED PROCESSING

3. The updater removes object C, so that the list now updates. However, there are situations in which the only
contains objects A and D. Because there is no blocking operation is a wait to free memory, which re-
reference to object C, it is immediately freed. sults in a situation that, for many purposes, is as good as
non-blocking [DMS+ 12].
4. The reader tries to advance to the successor of the As shown in the “Automatic Reclamation” row, only
object following the now-removed object B, but the reference counting can automate freeing of memory, and
poisoned ->next pointer prevents this. Which is even then only for non-cyclic data structures. Certain use
a good thing, because object B’s ->next pointer cases for hazard pointers and RCU can provide automatic
would otherwise point to the freelist. reclamation using link counts, which can be thought of
5. The reader must therefore restart its traversal from as reference counts, but applying only to incoming links
the head of the list. from other parts of the data structure [Mic18].
Finally, the “Lines of Code” row shows the size of
Thus, when failing to acquire a reference, a hazard- the Pre-BSD Routing Table implementations, giving a
pointer or reference-counter traversal must restart that rough idea of relative ease of use. That said, it is im-
traversal from the beginning. In the case of nested linked portant to note that the reference-counting and sequence-
data structures, for example, a tree containing linked locking implementations are buggy, and that a correct
lists, the traversal must be restarted from the outermost reference-counting implementation is considerably more
data structure. This situation gives RCU a significant complex [Val95, MS95]. For its part, a correct sequence-
ease-of-use advantage. locking implementation requires the addition of some
However, RCU’s ease-of-use advantage does not come other synchronization mechanism, for example, hazard
for free, as can be seen in the “Memory Footprint” row. pointers or RCU, so that sequence locking detects con-
RCU’s support of unconditional reference acquisition current updates and the other mechanism provides safe
means that it must avoid freeing any object reachable by a reference acquisition.
given RCU reader until that reader completes. RCU there- As more experience is gained using these techniques,
fore has an unbounded memory footprint, at least unless both separately and in combination, the rules of thumb
updates are throttled. In contrast, reference counting and laid out in this section will need to be refined. However,
hazard pointers need to retain only those data elements this section does reflect the current state of the art.
actually referenced by concurrent readers.
This tension between memory footprint and acquisition
failures is sometimes resolved within the Linux kernel by
9.6.3 Which to Choose? (Production Use)
combining use of RCU and reference counters. RCU is This section points out a few publicly visible production
used for short-lived references, which means that RCU uses of hazard pointers, sequence locking, and RCU. Ref-
read-side critical sections can be short. These short erence counting is omitted, not because it is unimportant,
RCU read-side critical sections in turn mean that the but rather because it is not only used pervasively, but heav-
corresponding RCU grace periods can also be short, which ily documented in textbooks going back a half century.
limits the memory footprint. For the few data elements that One of the hoped-for benefits of listing production uses of
need longer-lived references, reference counting is used. these other techniques is to provide examples to study—or
This means that the complexity of reference-acquisition to find bugs in, as the case may be.21
failure only needs to be dealt with for those few data
elements: The bulk of the reference acquisitions are 9.6.3.1 Production Uses of Hazard Pointers
unconditional, courtesy of RCU. See Section 13.2 for
more information on combining reference counting with In 2010, Keith Bostic added hazard pointers to
other synchronization mechanisms. WiredTiger [Bos10]. MongoDB 3.0, released in 2015,
The “Reclamation Forward Progress” row shows included WiredTiger and thus hazard pointers.
that hazard pointers can provide non-blocking up- In 2011, Samy Al Bahra added hazard pointers to the
dates [Mic04a, HLM02]. Reference counting might or Concurrency Kit library [Bah11b].
might not, depending on the implementation. However, 21 Kudos to Mathias Stearn, Matt Wilson, David Goldblatt, Live-
sequence locking cannot provide non-blocking updates, Journal user fanf, Nadav Har’El, Avi Kivity, Dmitry Vyukov, Raul
courtesy of its update-side lock. RCU updaters must Guitterez S., Twitter user @peo3, Paolo Bonzini, and Thomas Monjalon
wait on readers, which also rules out fully non-blocking for locating a great many of these use cases.

v2022.09.25a
9.7. WHAT ABOUT UPDATES? 183

In 2014, Maxim Khizhinsky added hazard pointers to In 2015, Maxim Khizhinsky added RCU to
libcds [Khi14]. libcds [Khi15].
In 2015, David Gwynne introduced shared reference Mindaugas Rasiukevicius implemented libqsbr in 2016,
pointers, a form of hazard pointers, to OpenBSD [Gwy15]. which features QSBR and epoch-based reclamation
In 2017–2018, the Rust-language arc-swap [Van18] (EBR) [Ras16], both of which are types of implemen-
and conc [cut17] crates rolled their own implementations tations of RCU.
of hazard pointers. Sheth et al. [SWS16] demonstrated the value of lever-
In 2018, Maged Michael added hazard pointers to aging Go’s garbage collector to provide RCU-like func-
Facebook’s Folly library [Mic18], where it is used heavily. tionality, and the Go programming language provides a
Value type that can provide this functionality.22
9.6.3.2 Production Uses of Sequence Locking Matt Klein describes an RCU-like mechanism that is
used in the Envoy Proxy [Kle17].
The Linux kernel added sequence locking to v2.5.60 Honnappa Nagarahalli added an RCU library to the
in 2003 [Cor03], having been generalized from an ad- Data Plane Development Kit (DPDK) in 2018 [Nag18].
hoc technique used in x86’s implementation of the Stjepan Glavina merged an epoch-based RCU imple-
gettimeofday() system call. mentation into the crossbeam set of concurrency-support
In 2011, Samy Al Bahra added sequence locking to the “crates” for the Rust language [Gla18].
Concurrency Kit library [Bah11c]. Jason Donenfeld produced an RCU implementations
Paolo Bonzini added a simple sequence-lock to the as part of his port of WireGuard to Windows NT ker-
QEMU emulator in 2013 [Bon13]. nel [Don21].
Alexis Menard abstracted a sequence-lock implementa- Finally, any garbage-collected concurrent language (not
tion in Chromium in 2016 [Men16]. just Go!) gets the update side of an RCU implementation
A simple sequence locking implementation was added at zero incremental cost.
to jemalloc() in 2018 [Gol18a]. The eigen library
also has a special-purpose queue that is managed by a 9.6.3.4 Summary of Production Uses
mechanism resembling sequence locking.
Perhaps the time will come when sequence locking, hazard
pointers, and RCU are all as heavily used and as well
9.6.3.3 Production Uses of RCU
known as are reference counters. Until that time comes,
IBM’s VM/XA is adopted passive serialization, a mecha- the current production uses of these mechanisms should
nism similar to RCU, some time in the 1980s [HOS89]. help guide the choice of mechanism as well as showing
DYNIX/ptx adopted RCU in 1993 [MS98a, SM95]. how best to apply each of them.
The Linux kernel adopted Dipankar Sarma’s implemen- The next section discusses updates, a ticklish issue for
tation of RCU in 2002 [Tor02]. many of the read-mostly mechanisms described in this
The userspace RCU project started in 2009 [Des09b]. chapter.
The Knot DNS project started using the userspace RCU
library in 2010 [Slo10]. That same year, the OSv kernel
added an RCU implementation [Kiv13], later adding an 9.7 What About Updates?
RCU-protected linked list [Kiv14b] and an RCU-protected
hash table [Kiv14a]. The only thing constant in life is change.
In 2011, Samy Al Bahra added epochs (a form
of RCU [Fra04, FH07]) to the Concurrency Kit li- François de la Rochefoucauld
brary [Bah11a].
NetBSD began using the aforementioned passive se- The deferred-processing techniques called out in this chap-
rialization with v6.0 in 2012 [The12a]. Among other ter are most directly applicable to read-mostly situations,
things, passive serialization is used in NetBSD packet which begs the question “But what about updates?” After
filter (NPF) [Ras14]. all, increasing the performance and scalability of readers
Paolo Bonzini added RCU support to the QEMU em-
ulator in 2015 via a friendly fork of the userspace RCU 22 See https://golang.org/pkg/sync/atomic/#Value, par-

library [BD13, Bon15]. ticularly the “Example (ReadMostly)”.

v2022.09.25a
184 CHAPTER 9. DEFERRED PROCESSING

is all well and good, but it is only natural to also want


great performance and scalability for writers.
We have already seen one situation featuring high per-
formance and scalability for writers, namely the counting
algorithms surveyed in Chapter 5. These algorithms fea-
tured partially partitioned data structures so that updates
can operate locally, while the more-expensive reads must
sum across the entire data structure. Silas Boyd-Wickhizer
has generalized this notion to produce OpLog, which he
has applied to Linux-kernel pathname lookup, VM reverse
mappings, and the stat() system call [BW14].
Another approach, called “Disruptor”, is designed for
applications that process high-volume streams of input
data. The approach is to rely on single-producer-single-
consumer FIFO queues, minimizing the need for synchro-
nization [Sut13]. For Java applications, Disruptor also
has the virtue of minimizing use of the garbage collector.
And of course, where feasible, fully partitioned or
“sharded” systems provide excellent performance and scal-
ability, as noted in Chapter 6.
The next chapter will look at updates in the context of
several types of data structures.

v2022.09.25a
Bad programmers worry about the code. Good
programmers worry about data structures and their
relationships.
Chapter 10 Linus Torvalds

Data Structures

Serious discussions of algorithms include time complexity using an in-memory database with each animal in the zoo
of their data structures [CLRS01]. However, for parallel represented by a data item in this database. Each animal
programs, the time complexity includes concurrency ef- has a unique name that is used as a key, with a variety of
fects because these effects can be overwhelmingly large, as data tracked for each animal.
shown in Chapter 3. In other words, a good programmer’s Births, captures, and purchases result in insertions,
data-structure relationships include those aspects related while deaths, releases, and sales result in deletions. Be-
to concurrency. cause Schrödinger’s zoo contains a large quantity of short-
Section 10.1 presents the motivating application for lived animals, including mice and insects, the database
this chapter’s data structures. Chapter 6 showed how par- must handle high update rates. Those interested in Schrö-
titioning improves scalability, so Section 10.2 discusses dinger’s animals can query them, and Schrödinger has
partitionable data structures. Chapter 9 described how noted suspiciously query rates for his cat, so much so that
deferring some actions can greatly improve both perfor- he suspects that his mice might be checking up on their
mance and scalability, a topic taken up by Section 10.3. nemesis. Whatever their source, Schrödinger’s application
Section 10.4 looks at a non-partitionable data structure, must handle high query rates to a single data element.
splitting it into read-mostly and partitionable portions,
As we will see, this simple application can be a challenge
which improves both performance and scalability. Be-
to concurrent data structures.
cause this chapter cannot delve into the details of every
concurrent data structure, Section 10.5 surveys a few of
the important ones. Although the best performance and
scalability results from design rather than after-the-fact
micro-optimization, micro-optimization is nevertheless
10.2 Partitionable Data Structures
necessary for the absolute best possible performance and
scalability, as described in Section 10.6. Finally, Sec- Finding a way to live the simple life today is the most
tion 10.7 presents a summary of this chapter. complicated task.

Henry A. Courtney, updated

10.1 Motivating Application


There are a huge number of data structures in use today, so
much so that there are multiple textbooks covering them.
The art of doing mathematics consists in finding that This section focuses on a single data structure, namely
special case which contains all the germs of
the hash table. This focused approach allows a much
generality.
deeper investigation of how concurrency interacts with
David Hilbert data structures, and also focuses on a data structure that
is heavily used in practice. Section 10.2.1 overviews the
We will use the Schrödinger’s Zoo application to evaluate design, and Section 10.2.2 presents the implementation.
performance [McK13]. Schrödinger has a zoo containing Finally, Section 10.2.3 discusses the resulting performance
a large number of animals, and he would like to track them and scalability.

185

v2022.09.25a
186 CHAPTER 10. DATA STRUCTURES

10.2.1 Hash-Table Design Listing 10.1: Hash-Table Data Structures


1 struct ht_elem {
Chapter 6 emphasized the need to apply partitioning in 2 struct cds_list_head hte_next;
3 unsigned long hte_hash;
order to attain respectable performance and scalability, 4 };
so partitionability must be a first-class criterion when 5
6 struct ht_bucket {
selecting data structures. This criterion is well satisfied by 7 struct cds_list_head htb_head;
that workhorse of parallelism, the hash table. Hash tables 8 spinlock_t htb_lock;
9 };
are conceptually simple, consisting of an array of hash 10

buckets. A hash function maps from a given element’s 11 struct hashtab {


12 unsigned long ht_nbuckets;
key to the hash bucket that this element will be stored 13 int (*ht_cmp)(struct ht_elem *htep, void *key);
in. Each hash bucket therefore heads up a linked list of 14 struct ht_bucket ht_bkt[0];
15 };
elements, called a hash chain. When properly configured,
these hash chains will be quite short, permitting a hash
table to access its elements extremely efficiently. struct hashtab
−>ht_nbuckets = 4
Quick Quiz 10.1: But chained hash tables are but one type
−>ht_cmp
of many. Why the focus on chained hash tables?
−>ht_bkt[0] struct ht_elem struct ht_elem
−>htb_head −>hte_next −>hte_next
In addition, each bucket has its own lock, so that −>htb_lock −>hte_hash −>hte_hash
elements in different buckets of the hash table may be −>ht_bkt[1]
added, deleted, and looked up completely independently. −>htb_head
A large hash table with a large number of buckets (and −>htb_lock
thus locks), with each bucket containing a small number −>ht_bkt[2] struct ht_elem
of elements should therefore provide excellent scalability. −>htb_head −>hte_next
−>htb_lock −>hte_hash
−>ht_bkt[3]
−>htb_head
10.2.2 Hash-Table Implementation −>htb_lock

Listing 10.1 (hash_bkt.c) shows a set of data struc- Figure 10.1: Hash-Table Data-Structure Diagram
tures used in a simple fixed-sized hash table using chain-
ing and per-hash-bucket locking, and Figure 10.1 dia-
grams how they fit together. The hashtab structure
ing two functions acquire and release the ->htb_lock
(lines 11–15 in Listing 10.1) contains four ht_bucket
corresponding to the specified hash value.
structures (lines 6–9 in Listing 10.1), with the ->ht_
nbuckets field controlling the number of buckets and the Listing 10.3 shows hashtab_lookup(), which returns
->ht_cmp field holding the pointer to key-comparison a pointer to the element with the specified hash and key if it
function. Each such bucket contains a list header ->htb_ exists, or NULL otherwise. This function takes both a hash
head and a lock ->htb_lock. The list headers chain value and a pointer to the key because this allows users
ht_elem structures (lines 1–4 in Listing 10.1) through of this function to use arbitrary keys and arbitrary hash
their ->hte_next fields, and each ht_elem structure functions. Line 8 maps from the hash value to a pointer
also caches the corresponding element’s hash value in the to the corresponding hash bucket. Each pass through the
->hte_hash field. The ht_elem structure is included in loop spanning lines 9–14 examines one element of the
a larger structure which might contain a complex key. bucket’s hash chain. Line 10 checks to see if the hash
Figure 10.1 shows bucket 0 containing two elements values match, and if not, line 11 proceeds to the next
and bucket 2 containing one. element. Line 12 checks to see if the actual key matches,
and if so, line 13 returns a pointer to the matching element.
Listing 10.2 shows mapping and locking functions.
If no element matches, line 15 returns NULL.
Lines 1 and 2 show the macro HASH2BKT(), which maps
from a hash value to the corresponding ht_bucket struc- Quick Quiz 10.2: But isn’t the double comparison on
ture. This macro uses a simple modulus: If more aggres- lines 10–13 in Listing 10.3 inefficient in the case where the
sive hashing is required, the caller needs to implement key fits into an unsigned long?
it when mapping from key to hash value. The remain-

v2022.09.25a
10.2. PARTITIONABLE DATA STRUCTURES 187

Listing 10.5: Hash-Table Allocation and Free


1 struct hashtab *
2 hashtab_alloc(unsigned long nbuckets,
3 int (*cmp)(struct ht_elem *htep, void *key))
Listing 10.2: Hash-Table Mapping and Locking 4 {
1 #define HASH2BKT(htp, h) \ 5 struct hashtab *htp;
2 (&(htp)->ht_bkt[h % (htp)->ht_nbuckets]) 6 int i;
3 7
4 static void hashtab_lock(struct hashtab *htp, 8 htp = malloc(sizeof(*htp) +
5 unsigned long hash) 9 nbuckets * sizeof(struct ht_bucket));
6 { 10 if (htp == NULL)
7 spin_lock(&HASH2BKT(htp, hash)->htb_lock); 11 return NULL;
8 } 12 htp->ht_nbuckets = nbuckets;
9 13 htp->ht_cmp = cmp;
10 static void hashtab_unlock(struct hashtab *htp, 14 for (i = 0; i < nbuckets; i++) {
11 unsigned long hash) 15 CDS_INIT_LIST_HEAD(&htp->ht_bkt[i].htb_head);
12 { 16 spin_lock_init(&htp->ht_bkt[i].htb_lock);
13 spin_unlock(&HASH2BKT(htp, hash)->htb_lock); 17 }
14 } 18 return htp;
19 }
20
21 void hashtab_free(struct hashtab *htp)
22 {
23 free(htp);
24 }

Listing 10.4 shows the hashtab_add() and hashtab_


Listing 10.3: Hash-Table Lookup
1 struct ht_elem *
del() functions that add and delete elements from the
2 hashtab_lookup(struct hashtab *htp, unsigned long hash, hash table, respectively.
3 void *key) The hashtab_add() function simply sets the element’s
4 {
5 struct ht_bucket *htb; hash value on line 4, then adds it to the corresponding
6 struct ht_elem *htep; bucket on lines 5 and 6. The hashtab_del() function
7
8 htb = HASH2BKT(htp, hash); simply removes the specified element from whatever hash
9 cds_list_for_each_entry(htep, &htb->htb_head, hte_next) { chain it is on, courtesy of the doubly linked nature of
10 if (htep->hte_hash != hash)
11 continue; the hash-chain lists. Before calling either of these two
12 if (htp->ht_cmp(htep, key)) functions, the caller is required to ensure that no other
13 return htep;
14 } thread is accessing or modifying this same bucket, for
15 return NULL; example, by invoking hashtab_lock() beforehand.
16 }
Listing 10.5 shows hashtab_alloc() and hashtab_
free(), which do hash-table allocation and freeing, re-
spectively. Allocation begins on lines 8–9 with allocation
of the underlying memory. If line 10 detects that memory
has been exhausted, line 11 returns NULL to the caller. Oth-
erwise, lines 12 and 13 initialize the number of buckets
and the pointer to key-comparison function, and the loop
Listing 10.4: Hash-Table Modification
spanning lines 14–17 initializes the buckets themselves,
1 void hashtab_add(struct hashtab *htp, unsigned long hash,
2 struct ht_elem *htep) including the chain list header on line 15 and the lock on
3 { line 16. Finally, line 18 returns a pointer to the newly
4 htep->hte_hash = hash;
5 cds_list_add(&htep->hte_next, allocated hash table. The hashtab_free() function on
6 &HASH2BKT(htp, hash)->htb_head); lines 21–24 is straightforward.
7 }
8
9 void hashtab_del(struct ht_elem *htep)
10 { 10.2.3 Hash-Table Performance
11 cds_list_del_init(&htep->hte_next);
12 } The performance results for a single 28-core socket of a
2.1 GHz Intel Xeon system using a bucket-locked hash
table with 262,144 buckets are shown in Figure 10.2.
The performance does scale nearly linearly, but it falls

v2022.09.25a
188 CHAPTER 10. DATA STRUCTURES

6 250000
1.4x10

Total Lookups per Millisecond


6
Total Lookups per Millisecond 1.2x10 200000

1x106
150000
800000
ideal
600000 100000

400000 50000

200000
bucket 0
0 0 50 100 150 200 250 300 350 400 450
5 10 15 20 25 Number of CPUs (Threads)
Number of CPUs (Threads)
Figure 10.4: Read-Only Hash-Table Performance For
Figure 10.2: Read-Only Hash-Table Performance For Schrödinger’s Zoo, Varying Buckets
Schrödinger’s Zoo

Quick Quiz 10.3: Instead of simply increasing the number of


250000 hash buckets, wouldn’t it be better to cache-align the existing
hash buckets?
Total Lookups per Millisecond

200000
However, as can be seen in Figure 10.4, changing the
number of buckets has almost no effect: Scalability is
150000
still abysmal. In particular, we still see a sharp dropoff at
29 CPUs and beyond. Clearly something else is going on.
100000
The problem is that this is a multi-socket system, with
CPUs 0–27 and 225–251 mapped to the first socket as
50000
shown in Figure 10.5. Test runs confined to the first
28 CPUs therefore perform quite well, but tests that in-
0
0 50 100 150 200 250 300 350 400 450 volve socket 0’s CPUs 0–27 as well as socket 1’s CPU 28
Number of CPUs (Threads) incur the overhead of passing data across socket bound-
aries. This can severely degrade performance, as was
Figure 10.3: Read-Only Hash-Table Performance For discussed in Section 3.2.1. In short, large multi-socket
Schrödinger’s Zoo, 448 CPUs systems require good locality of reference in addition to
full partitioning. The remainder of this chapter will dis-
cuss ways of providing good locality of reference within
a far short of the ideal performance level, even at only the hash table itself, but in the meantime please note that
28 CPUs. Part of this shortfall is due to the fact that the one other way to provide good locality of reference would
lock acquisitions and releases incur no cache misses on a be to place large data elements in the hash table. For
single CPU, but do incur misses on two or more CPUs. example, Schrödinger might attain excellent cache locality
by placing photographs or even videos of his animals in
And things only get worse with more CPUs, as can be each element of the hash table. But for those needing hash
seen in Figure 10.3. We do not need to show ideal perfor- tables containing small data elements, please read on!
mance: The performance for 29 CPUs and beyond is all
Quick Quiz 10.4: Given the negative scalability of the
too clearly worse than abysmal. This clearly underscores
Schrödinger’s Zoo application across sockets, why not just run
the dangers of extrapolating performance from a modest
multiple copies of the application, with each copy having a
number of CPUs. subset of the animals and confined to run on a single socket?
Of course, one possible reason for the collapse in
performance might be that more hash buckets are needed.
One key property of the Schrödinger’s-zoo runs dis-
We can test this by increasing the number of hash buckets.
cussed thus far is that they are all read-only. This makes the

v2022.09.25a
10.3. READ-MOSTLY DATA STRUCTURES 189

Hyperthread Listing 10.6: RCU-Protected Hash-Table Read-Side Concur-


Socket 0 1 rency Control
1 static void hashtab_lock_lookup(struct hashtab *htp,
0 0–27 224–251 2 unsigned long hash)
3 {
1 28–55 252–279 4 rcu_read_lock();
5 }
2 56–83 280–307 6
7 static void hashtab_unlock_lookup(struct hashtab *htp,
3 84–111 308–335 8 unsigned long hash)
9 {
4 112–139 336–363 10 rcu_read_unlock();
5 140–167 364–391 11 }

6 168–195 392–419
7 196–223 420–447 10.3.1 RCU-Protected Hash Table Imple-
Figure 10.5: NUMA Topology of System Under Test mentation
For an RCU-protected hash table with per-bucket lock-
ing, updaters use locking as shown in Section 10.2,
performance degradation due to lock-acquisition-induced but readers use RCU. The data structures remain
cache misses all the more painful. Even though we are as shown in Listing 10.1, and the HASH2BKT(),
not updating the underlying hash table itself, we are still hashtab_lock(), and hashtab_unlock() functions
paying the price for writing to memory. Of course, if remain as shown in Listing 10.2. However, readers
the hash table was never going to be updated, we could use the lighter-weight concurrency-control embodied
dispense entirely with mutual exclusion. This approach by hashtab_lock_lookup() and hashtab_unlock_
is quite straightforward and is left as an exercise for the lookup() shown in Listing 10.6.
reader. But even with the occasional update, avoiding Listing 10.7 shows hashtab_lookup() for the RCU-
writes avoids cache misses, and allows the read-mostly protected per-bucket-locked hash table. This is identical
data to be replicated across all the caches, which in turn to that in Listing 10.3 except that cds_list_for_each_
promotes locality of reference. entry() is replaced by cds_list_for_each_entry_
The next section therefore examines optimizations that rcu(). Both of these primitives traverse the hash chain ref-
can be carried out in read-mostly cases where updates are erenced by htb->htb_head but cds_list_for_each_
rare, but could happen at any time. entry_rcu() also correctly enforces memory ordering
in case of concurrent insertion. This is an important
difference between these two hash-table implementations:
Unlike the pure per-bucket-locked implementation, the
RCU protected implementation allows lookups to run con-
10.3 Read-Mostly Data Structures currently with insertions and deletions, and RCU-aware
primitives like cds_list_for_each_entry_rcu() are
Adapt the remedy to the disease. required to correctly handle this added concurrency. Note
also that hashtab_lookup()’s caller must be within an
Chinese proverb
RCU read-side critical section, for example, the caller
must invoke hashtab_lock_lookup() before invoking
Although partitioned data structures can offer excellent hashtab_lookup() (and of course invoke hashtab_
scalability, NUMA effects can result in severe degradations unlock_lookup() some time afterwards).
of both performance and scalability. In addition, the need
Quick Quiz 10.5: But if elements in a hash table can be
for read-side synchronization can degrade performance removed concurrently with lookups, doesn’t that mean that
in read-mostly situations. However, we can achieve both a lookup could return a reference to a data element that was
performance and scalability by using RCU, which was removed immediately after it was looked up?
introduced in Section 9.5. Similar results can be achieved
using hazard pointers (hazptr.c) [Mic04a], which will Listing 10.8 shows hashtab_add() and hashtab_
be included in the performance results shown in this del(), both of which are quite similar to their counterparts
section [McK13]. in the non-RCU hash table shown in Listing 10.4. The

v2022.09.25a
190 CHAPTER 10. DATA STRUCTURES

Listing 10.7: RCU-Protected Hash-Table Lookup The test suite (“hashtorture.h”) contains a
1 struct ht_elem *hashtab_lookup(struct hashtab *htp, smoketest() function that verifies that a specific se-
2 unsigned long hash,
3 void *key) ries of single-threaded additions, deletions, and lookups
4 { give the expected results.
5 struct ht_bucket *htb;
6 struct ht_elem *htep; Concurrent test runs put each updater thread in control
7 of its portion of the elements, which allows assertions
8 htb = HASH2BKT(htp, hash);
9 cds_list_for_each_entry_rcu(htep, checking for the following issues:
10 &htb->htb_head,
11 hte_next) {
12 if (htep->hte_hash != hash) 1. A just-now-to-be-added element already being in the
13 continue; table according to hastab_lookup().
14 if (htp->ht_cmp(htep, key))
15 return htep;
16 } 2. A just-now-to-be-added element being marked as
17 return NULL; being in the table by its ->in_table flag.
18 }

3. A just-now-to-be-deleted element not being in the


Listing 10.8: RCU-Protected Hash-Table Modification table according to hastab_lookup().
1 void hashtab_add(struct hashtab *htp,
2 unsigned long hash, 4. A just-now-to-be-deleted element being marked as
3 struct ht_elem *htep)
4 { not being in the table by its ->in_table flag.
5 htep->hte_hash = hash;
6 cds_list_add_rcu(&htep->hte_next,
7 &HASH2BKT(htp, hash)->htb_head); In addition, concurrent test runs run lookups concur-
8 } rently with updates in order to catch all manner of data-
9
10 void hashtab_del(struct ht_elem *htep) structure corruption problems. Some runs also continually
11 { resize the hash table concurrently with both lookups and
12 cds_list_del_rcu(&htep->hte_next);
13 } updates to verify correct behavior, and also to verify that
resizes do not unduly delay either readers or updaters.
Finally, the concurrent tests output statistics that can
hashtab_add() function uses cds_list_add_rcu() be used to track down performance and scalabilty issues,
instead of cds_list_add() in order to ensure proper which provides the raw data used by Section 10.3.3.
ordering when an element is added to the hash table at Quick Quiz 10.6: The hashtorture.h file contains more
the same time that it is being looked up. The hashtab_ than 1,000 lines! Is that a comprehensive test or what???
del() function uses cds_list_del_rcu() instead of
cds_list_del_init() to allow for the case where an All code requires significant validation effort, and high-
element is looked up just before it is deleted. Unlike performance concurrent code requires more validation
cds_list_del_init(), cds_list_del_rcu() leaves than most.
the forward pointer intact, so that hashtab_lookup()
can traverse to the newly deleted element’s successor.
10.3.3 RCU-Protected Hash Table Perfor-
Of course, after invoking hashtab_del(), the caller
must wait for an RCU grace period (e.g., by invok-
mance
ing synchronize_rcu()) before freeing or otherwise Figure 10.6 shows the read-only performance of RCU-
reusing the memory for the newly deleted element. protected and hazard-pointer-protected hash tables against
the previous section’s per-bucket-locked implementation.
As you can see, both RCU and hazard pointers perform and
10.3.2 RCU-Protected Hash Table Valida- scale much better than per-bucket locking because read-
tion only replication avoids NUMA effects. The difference
increases with larger numbers of threads. Results from
Although the topic of validation is covered in detail in a globally locked implementation are also shown, and
Chapter 11, the fact is that a hash table with lockless RCU- as expected the results are even worse than those of the
protected lookups needs special attention to validation per-bucket-locked implementation. RCU does slightly
sooner rather than later. better than hazard pointers.

v2022.09.25a
10.3. READ-MOSTLY DATA STRUCTURES 191

2.2x107
2x107

Total Lookups per Millisecond


7
1.8x10
1.6x107
7
1.4x10 ideal
8
1x10 1.2x107
1x107
Total Lookups per Millisecond

1x10 7 8x106
6
6x10
ideal U
RC zptr 4x106 QSBR,RCU
1x106 ha 6
2x10 hazptr
0
100000 bucket 0 50 100 150 200 250 300 350 400 450
Number of CPUs (Threads)

10000 global Figure 10.8: Read-Only RCU-Protected Hash-Table Per-


formance For Schrödinger’s Zoo including QSBR,
1000 Linear Scale
1 10 100
Number of CPUs (Threads)

Figure 10.6: Read-Only RCU-Protected Hash-Table Per- Figure 10.7 shows the same data on a linear scale. This
formance For Schrödinger’s Zoo drops the global-locking trace into the x-axis, but allows
the non-ideal performance of RCU and hazard pointers to
be more readily discerned. Both show a change in slope
at 224 CPUs, and this is due to hardware multithreading.
At 32 and fewer CPUs, each thread has a core to itself. In
this regime, RCU does better than does hazard pointers
because the latter’s read-side memory barriers result in
dead time within the core. In short, RCU is better able to
utilize a core from a single hardware thread than is hazard
pointers.
This situation changes above 224 CPUs. Because RCU
2.2x107
2x107
is using more than half of each core’s resources from a
Total Lookups per Millisecond

1.8x107
single hardware thread, RCU gains relatively little benefit
1.6x107 from the second hardware thread in each core. The slope
1.4x107 ideal of the hazard-pointers trace also decreases at 224 CPUs,
1.2x107 but less dramatically, because the second hardware thread
1x107 is able to fill in the time that the first hardware thread is
8x106 stalled due to memory-barrier latency. As we will see
6x106 in later sections, this second-hardware-thread advantage
6
4x10 RCU depends on the workload.
2x106 hazptr
0
But why is RCU’s performance a factor of five less
0 50 100 150 200 250 300 350 400 450 than ideal? One possibility is that the per-thread coun-
Number of CPUs (Threads) ters manipulated by rcu_read_lock() and rcu_read_
unlock() are slowing things down. Figure 10.8 therefore
Figure 10.7: Read-Only RCU-Protected Hash-Table Per-
adds the results for the QSBR variant of RCU, whose
formance For Schrödinger’s Zoo, Linear Scale
read-side primitives do nothing. And although QSBR
does perform slightly better than does RCU, it is still about
a factor of five short of ideal.
Figure 10.9 adds completely unsynchronized results,
which works because this is a read-only benchmark with

v2022.09.25a
192 CHAPTER 10. DATA STRUCTURES

2.2x107 1x107
7
2x10
Total Lookups per Millisecond 1x106

Cat Lookups per Millisecond


7
1.8x10
7 RCU
1.6x10
1.4x10
7
ideal 100000 hazptr
1.2x107
10000
1x107
8x106 bucket
6 1000
6x10 unsync,QSBR,RCU
4x106
6
100 global
2x10 hazptr
0
0 50 100 150 200 250 300 350 400 450 10
1 10
Number of CPUs (Threads)
Number of CPUs Looking Up The Cat
Figure 10.9: Read-Only RCU-Protected Hash-Table Per- Figure 10.10: Read-Side Cat-Only RCU-Protected Hash-
formance For Schrödinger’s Zoo including QSBR Table Performance For Schrödinger’s Zoo at 64 CPUs
and Unsynchronized, Linear Scale

no effect? Might the problem instead be something like false


nothing to synchronize. Even with no synchronization sharing?
whatsoever, performance still falls far short of ideal.
The problem is that this system has sockets with 28 cores, What if the memory footprint is reduced still further?
which have the modest cache sizes shown in Table 3.2 Figure 9.22 on page 512 shows that RCU attains very nearly
on page 24. Each hash bucket (struct ht_bucket) ideal performance on the much smaller data structure
occupies 56 bytes and each element (struct zoo_he) represented by the pre-BSD routing table.
occupies 72 bytes for the RCU and QSBR runs. The
Quick Quiz 10.8: The memory system is a serious bottleneck
benchmark generating Figure 10.9 used 262,144 buckets on this big system. Why bother putting 448 CPUs on a
and up to 262,144 elements, for a total of 33,554,448 bytes, system without giving them enough memory bandwidth to do
which not only overflows the 1,048,576-byte L2 caches something useful???
by more than a factor of thirty, but is also uncomfortably
close to the L3 cache size of 40,370,176 bytes, especially As noted earlier, Schrödinger is surprised by the popu-
given that this cache has only 11 ways. This means that larity of his cat [Sch35], but recognizes the need to reflect
L2 cache collisions will be the rule and also that L3 cache this popularity in his design. Figure 10.10 shows the
collisions will not be uncommon, so that the resulting results of 64-CPU runs, varying the number of CPUs that
cache misses will degrade performance. In this case, the are doing nothing but looking up the cat. Both RCU and
bottleneck is not in the CPU, but rather in the hardware hazard pointers respond well to this challenge, but bucket
memory system. locking scales negatively, eventually performing as badly
Additional evidence for this memory-system bottleneck as global locking. This should not be a surprise because
may be found by examining the unsynchronized code. This if all CPUs are doing nothing but looking up the cat, the
code does not need locks, so each hash bucket occupies lock corresponding to the cat’s bucket is for all intents and
only 16 bytes compared to the 56 bytes for RCU and purposes a global lock.
QSBR. Similarly, each hash-table element occupies only This cat-only benchmark illustrates one potential prob-
56 bytes compared to the 72 bytes for RCU and QSBR. lem with fully partitioned sharding approaches. Only the
So it is unsurprising that the single-CPU unsynchronized CPUs associated with the cat’s partition is able to access
run performs up to about half again faster than that of the cat, limiting the cat-only throughput. Of course, a
either QSBR or RCU. great many applications have good load-spreading proper-
ties, and for these applications sharding works quite well.
Quick Quiz 10.7: How can we be so sure that the hash- However, sharding does not handle “hot spots” very well,
table size is at fault here, especially given that Figure 10.4
with the hot spot exemplified by Schrödinger’s cat being
on page 188 shows that varying hash-table size has almost
but one case in point.

v2022.09.25a
10.3. READ-MOSTLY DATA STRUCTURES 193

Figure 10.11: Read-Side RCU-Protected Hash-Table Performance For Schrödinger's Zoo in the Presence of Updates (lookups per millisecond versus number of CPUs doing updates; traces for RCU, hazptr, bucket, and global)

Figure 10.12: Update-Side RCU-Protected Hash-Table Performance For Schrödinger's Zoo (updates per millisecond versus number of CPUs doing updates; traces for RCU, hazptr, bucket, and global)

If we were only ever going to read the data, we would not need any concurrency control to begin with. Figure 10.11 therefore shows the effect of updates on readers. At the extreme left-hand side of this graph, all but one of the CPUs are doing lookups, while to the right all 448 CPUs are doing updates. For all four implementations, the number of lookups per millisecond decreases as the number of updating CPUs increases, of course reaching zero lookups per millisecond when all 448 CPUs are updating. Both hazard pointers and RCU do well compared to per-bucket locking because their readers do not increase update-side lock contention. RCU does well relative to hazard pointers as the number of updaters increases due to the latter's read-side memory barriers, which incur greater overhead, especially in the presence of updates, and particularly when execution involves more than one socket. It therefore seems likely that modern hardware heavily optimizes memory-barrier execution, greatly reducing memory-barrier overhead in the read-only case.

Where Figure 10.11 showed the effect of increasing update rates on lookups, Figure 10.12 shows the effect of increasing update rates on the updates themselves. Again, at the left-hand side of the figure all but one of the CPUs are doing lookups and at the right-hand side of the figure all 448 CPUs are doing updates. Hazard pointers and RCU start off with a significant advantage because, unlike bucket locking, readers do not exclude updaters. However, as the number of updating CPUs increases, update-side overhead starts to make its presence known, first for RCU and then for hazard pointers. Of course, all three of these implementations beat global locking.

It is quite possible that the differences in lookup performance observed in Figure 10.11 are affected by the differences in update rates. One way to check this is to artificially throttle the update rates of per-bucket locking and hazard pointers to match that of RCU. Doing so does not significantly improve the lookup performance of per-bucket locking, nor does it close the gap between hazard pointers and RCU. However, removing the read-side memory barriers from hazard pointers (thus resulting in an unsafe implementation) does nearly close the gap between hazard pointers and RCU. Although this unsafe hazard-pointer implementation will usually be reliable enough for benchmarking purposes, it is absolutely not recommended for production use.

Quick Quiz 10.9: The dangers of extrapolating from 28 CPUs to 448 CPUs were made quite clear in Section 10.2.3. Would extrapolating up from 448 CPUs be any safer?

10.3.4 RCU-Protected Hash Table Discussion

One consequence of the RCU and hazard-pointer implementations is that a pair of concurrent readers might disagree on the state of the cat. For example, one of the readers might have fetched the pointer to the cat's data structure just before it was removed, while another reader might have fetched this same pointer just afterwards. The first reader would then believe that the cat was alive, while the second reader would believe that the cat was dead.

This situation is completely fitting for Schrödinger's cat, but it turns out that it is quite reasonable for normal


non-quantum cats as well. After all, it is impossible to determine exactly when an animal is born or dies.

To see this, let's suppose that we detect a cat's death by heartbeat. This raises the question of exactly how long we should wait after the last heartbeat before declaring death. It is clearly ridiculous to wait only one millisecond, because then a healthy living cat would have to be declared dead—and then resurrected—more than once per second. It is equally ridiculous to wait a full month, because by that time the poor cat's death would have made itself very clearly known via olfactory means.

Because an animal's heart can stop for some seconds and then start up again, there is a tradeoff between timely recognition of death and probability of false alarms. It is quite possible that a pair of veterinarians might disagree on the time to wait between the last heartbeat and the declaration of death. For example, one veterinarian might declare death thirty seconds after the last heartbeat, while another might insist on waiting a full minute. In this case, the two veterinarians would disagree on the state of the cat for the second period of thirty seconds following the last heartbeat, as fancifully depicted in Figure 10.13.

Figure 10.13: Even Veterinarians Disagree!

Heisenberg taught us to live with this sort of uncertainty [Hei27], which is a good thing because computing hardware and software acts similarly. For example, how do you know that a piece of computing hardware has failed? Often because it does not respond in a timely fashion. Just like the cat's heartbeat, this results in a window of uncertainty as to whether or not the hardware has really failed, as opposed to just being slow.

Furthermore, most computing systems are intended to interact with the outside world. Consistency with the outside world is therefore of paramount importance. However, as we saw in Figure 9.28 on page 169, increased internal consistency can come at the expense of degraded external consistency. Techniques such as RCU and hazard pointers give up some degree of internal consistency to attain improved external consistency.

In short, internal consistency is not necessarily a natural part of all problem domains, and often incurs great expense in terms of performance, scalability, consistency with the outside world [HKLP12, HHK+ 13, Rin13], or all of the above.

10.4 Non-Partitionable Data Structures

Don't be afraid to take a big step if one is indicated. You can't cross a chasm in two small steps.

David Lloyd George

Fixed-size hash tables are perfectly partitionable, but resizable hash tables pose partitioning challenges when growing or shrinking, as fancifully depicted in Figure 10.14.

Figure 10.14: Partitioning Problems

However, it turns out that it is possible to construct high-performance scalable RCU-protected hash tables, as described in the following sections.

10.4.1 Resizable Hash Table Design

In happy contrast to the situation in the early 2000s, there are now no fewer than three different types of scalable RCU-protected hash tables. The first (and simplest) was


Figure 10.15: Growing a Two-List Hash Table, State (a)

Figure 10.16: Growing a Two-List Hash Table, State (b)

Figure 10.17: Growing a Two-List Hash Table, State (c)

Figure 10.18: Growing a Two-List Hash Table, State (d)

developed for the Linux kernel by Herbert Xu [Xu10], and is described in the following sections. The other two are covered briefly in Section 10.4.4.

The key insight behind the first hash-table implementation is that each data element can have two sets of list pointers, with one set currently being used by RCU readers (as well as by non-RCU updaters) and the other being used to construct a new resized hash table. This approach allows lookups, insertions, and deletions to all run concurrently with a resize operation (as well as with each other).

The resize operation proceeds as shown in Figures 10.15–10.18, with the initial two-bucket state shown in Figure 10.15 and with time advancing from figure to figure. The initial state uses the zero-index links to chain the elements into hash buckets. A four-bucket array is allocated, and the one-index links are used to chain the elements into these four new hash buckets. This results in state (b) shown in Figure 10.16, with readers still using the original two-bucket array.

The new four-bucket array is exposed to readers and then a grace-period operation waits for all readers, resulting in state (c), shown in Figure 10.17. In this state, all readers are using the new four-bucket array, which means that the old two-bucket array may now be freed, resulting in state (d), shown in Figure 10.18.

This design leads to a relatively straightforward implementation, which is the subject of the next section.

10.4.2 Resizable Hash Table Implementation

Resizing is accomplished by the classic approach of inserting a level of indirection, in this case, the ht structure shown on lines 11–20 of Listing 10.9 (hash_resize.c). The hashtab structure shown on lines 27–30 contains only a pointer to the current ht structure along with a spinlock that is used to serialize concurrent attempts to resize the hash table. If we were to use a traditional lock- or atomic-operation-based implementation, this hashtab structure could become a severe bottleneck from both performance and scalability viewpoints. However, because resize operations should be relatively infrequent, we should be able to make good use of RCU.

The ht structure represents a specific size of the hash table, as specified by the ->ht_nbuckets field on line 12. The size is stored in the same structure containing the array of buckets (->ht_bkt[] on line 19) in order to avoid mismatches between the size and the array.


Listing 10.9: Resizable Hash-Table Data Structures
 1 struct ht_elem {
 2   struct rcu_head rh;
 3   struct cds_list_head hte_next[2];
 4 };
 5
 6 struct ht_bucket {
 7   struct cds_list_head htb_head;
 8   spinlock_t htb_lock;
 9 };
10
11 struct ht {
12   long ht_nbuckets;
13   long ht_resize_cur;
14   struct ht *ht_new;
15   int ht_idx;
16   int (*ht_cmp)(struct ht_elem *htep, void *key);
17   unsigned long (*ht_gethash)(void *key);
18   void *(*ht_getkey)(struct ht_elem *htep);
19   struct ht_bucket ht_bkt[0];
20 };
21
22 struct ht_lock_state {
23   struct ht_bucket *hbp[2];
24   int hls_idx[2];
25 };
26
27 struct hashtab {
28   struct ht *ht_cur;
29   spinlock_t ht_lock;
30 };

Listing 10.10: Resizable Hash-Table Bucket Selection
 1 static struct ht_bucket *
 2 ht_get_bucket(struct ht *htp, void *key,
 3               long *b, unsigned long *h)
 4 {
 5   unsigned long hash = htp->ht_gethash(key);
 6
 7   *b = hash % htp->ht_nbuckets;
 8   if (h)
 9     *h = hash;
10   return &htp->ht_bkt[*b];
11 }
12
13 static struct ht_elem *
14 ht_search_bucket(struct ht *htp, void *key)
15 {
16   long b;
17   struct ht_elem *htep;
18   struct ht_bucket *htbp;
19
20   htbp = ht_get_bucket(htp, key, &b, NULL);
21   cds_list_for_each_entry_rcu(htep,
22                               &htbp->htb_head,
23                               hte_next[htp->ht_idx]) {
24     if (htp->ht_cmp(htep, key))
25       return htep;
26   }
27   return NULL;
28 }

The ->ht_resize_cur field on line 13 is equal to −1 unless a resize operation is in progress, in which case it indicates the index of the bucket whose elements are being inserted into the new hash table, which is referenced by the ->ht_new field on line 14. If there is no resize operation in progress, ->ht_new is NULL. Thus, a resize operation proceeds by allocating a new ht structure and referencing it via the ->ht_new pointer, then advancing ->ht_resize_cur through the old table's buckets. When all the elements have been added to the new table, the new table is linked into the hashtab structure's ->ht_cur field. Once all old readers have completed, the old hash table's ht structure may be freed.

The ->ht_idx field on line 15 indicates which of the two sets of list pointers are being used by this instantiation of the hash table, and is used to index the ->hte_next[] array in the ht_elem structure on line 3.

The ->ht_cmp(), ->ht_gethash(), and ->ht_getkey() fields on lines 16–18 collectively define the per-element key and the hash function. The ->ht_cmp() function compares a specified key with that of the specified element, the ->ht_gethash() calculates the specified key's hash, and ->ht_getkey() extracts the key from the enclosing data element.

The ht_lock_state shown on lines 22–25 is used to communicate lock state from a new hashtab_lock_mod() to hashtab_add(), hashtab_del(), and hashtab_unlock_mod(). This state prevents the algorithm from being redirected to the wrong bucket during concurrent resize operations.

The ht_bucket structure is the same as before, and the ht_elem structure differs from that of previous implementations only in providing a two-element array of list pointer sets in place of the prior single set of list pointers.

In a fixed-sized hash table, bucket selection is quite straightforward: Simply transform the hash value to the corresponding bucket index. In contrast, when resizing, it is also necessary to determine which of the old and new sets of buckets to select from. If the bucket that would be selected from the old table has already been distributed into the new table, then the bucket should be selected from the new table as well as from the old table. Conversely, if the bucket that would be selected from the old table has not yet been distributed, then the bucket should be selected from the old table.

Bucket selection is shown in Listing 10.10, which shows ht_get_bucket() on lines 1–11 and ht_search_bucket() on lines 13–28. The ht_get_bucket() function returns a reference to the bucket corresponding to the specified key in the specified hash table, without making any allowances for resizing. It also stores the bucket index corresponding to the key into the location referenced by parameter b on line 7, and the hash value corresponding to the key into the location referenced by


parameter h (if non-NULL) on line 9. Line 10 then returns a reference to the corresponding bucket.

The ht_search_bucket() function searches for the specified key within the specified hash-table version. Line 20 obtains a reference to the bucket corresponding to the specified key. The loop spanning lines 21–26 searches that bucket, so that if line 24 detects a match, line 25 returns a pointer to the enclosing data element. Otherwise, if there is no match, line 27 returns NULL to indicate failure.

Quick Quiz 10.10: How does the code in Listing 10.10 protect against the resizing process progressing past the selected bucket?

This implementation of ht_get_bucket() and ht_search_bucket() permits lookups and modifications to run concurrently with a resize operation.

Read-side concurrency control is provided by RCU as was shown in Listing 10.6, but the update-side concurrency-control functions hashtab_lock_mod() and hashtab_unlock_mod() must now deal with the possibility of a concurrent resize operation as shown in Listing 10.11.

Listing 10.11: Resizable Hash-Table Update-Side Concurrency Control
 1 static void
 2 hashtab_lock_mod(struct hashtab *htp_master, void *key,
 3                  struct ht_lock_state *lsp)
 4 {
 5   long b;
 6   unsigned long h;
 7   struct ht *htp;
 8   struct ht_bucket *htbp;
 9
10   rcu_read_lock();
11   htp = rcu_dereference(htp_master->ht_cur);
12   htbp = ht_get_bucket(htp, key, &b, &h);
13   spin_lock(&htbp->htb_lock);
14   lsp->hbp[0] = htbp;
15   lsp->hls_idx[0] = htp->ht_idx;
16   if (b > READ_ONCE(htp->ht_resize_cur)) {
17     lsp->hbp[1] = NULL;
18     return;
19   }
20   htp = rcu_dereference(htp->ht_new);
21   htbp = ht_get_bucket(htp, key, &b, &h);
22   spin_lock(&htbp->htb_lock);
23   lsp->hbp[1] = htbp;
24   lsp->hls_idx[1] = htp->ht_idx;
25 }
26
27 static void
28 hashtab_unlock_mod(struct ht_lock_state *lsp)
29 {
30   spin_unlock(&lsp->hbp[0]->htb_lock);
31   if (lsp->hbp[1])
32     spin_unlock(&lsp->hbp[1]->htb_lock);
33   rcu_read_unlock();
34 }

The hashtab_lock_mod() spans lines 1–25 in the listing. Line 10 enters an RCU read-side critical section to prevent the data structures from being freed during the traversal, line 11 acquires a reference to the current hash table, and then line 12 obtains a reference to the bucket in this hash table corresponding to the key. Line 13 acquires that bucket's lock, which will prevent any concurrent resizing operation from distributing that bucket, though of course it will have no effect if that bucket has already been distributed. Lines 14–15 store the bucket pointer and pointer-set index into their respective fields in the ht_lock_state structure, which communicates the information to hashtab_add(), hashtab_del(), and hashtab_unlock_mod(). Line 16 then checks to see if a concurrent resize operation has already distributed this bucket across the new hash table, and if not, line 17 indicates that there is no already-resized hash bucket and line 18 returns with the selected hash bucket's lock held (thus preventing a concurrent resize operation from distributing this bucket) and also within an RCU read-side critical section. Deadlock is avoided because the old table's locks are always acquired before those of the new table, and because the use of RCU prevents more than two versions from existing at a given time, thus preventing a deadlock cycle.

Otherwise, a concurrent resize operation has already distributed this bucket, so line 20 proceeds to the new hash table, line 21 selects the bucket corresponding to the key, and line 22 acquires the bucket's lock. Lines 23–24 store the bucket pointer and pointer-set index into their respective fields in the ht_lock_state structure, which again communicates this information to hashtab_add(), hashtab_del(), and hashtab_unlock_mod(). Because this bucket has already been resized and because hashtab_add() and hashtab_del() affect both the old and the new ht_bucket structures, two locks are held, one on each of the two buckets. Additionally, both elements of each array in the ht_lock_state structure are used, with the [0] element pertaining to the old ht_bucket structure and the [1] element pertaining to the new structure. Once again, hashtab_lock_mod() exits within an RCU read-side critical section.

The hashtab_unlock_mod() function releases the lock(s) acquired by hashtab_lock_mod(). Line 30 releases the lock on the old ht_bucket structure. In the unlikely event that line 31 determines that a resize operation is in progress, line 32 releases the lock on the new ht_bucket structure. Either way, line 33 exits the RCU read-side critical section.
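To see how these pieces fit together, consider the following minimal update-side sketch. The my_elem type and my_insert() function are illustrative assumptions rather than part of the CodeSamples implementation, but the three calls follow the pattern described above, with hashtab_add() appearing in Listing 10.12 in the next section:

  struct my_elem {                 /* Hypothetical element type for illustration only. */
      struct ht_elem hte;          /* Embedded ht_elem from Listing 10.9. */
      void *key;                   /* Key in whatever form the table's hash/cmp functions expect. */
  };

  static void my_insert(struct hashtab *htp_master, struct my_elem *ep)
  {
      struct ht_lock_state ls;

      hashtab_lock_mod(htp_master, ep->key, &ls); /* Lock bucket(s), enter RCU read-side critical section. */
      hashtab_add(&ep->hte, &ls);                 /* Add to the old (and, if resizing, the new) bucket. */
      hashtab_unlock_mod(&ls);                    /* Release lock(s), exit the critical section. */
  }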


Quick Quiz 10.11: Suppose that one thread is inserting an element into the hash table during a resize operation. What prevents this insertion from being lost due to a subsequent resize operation completing before the insertion does?

Now that we have bucket selection and concurrency control in place, we are ready to search and update our resizable hash table. The hashtab_lookup(), hashtab_add(), and hashtab_del() functions are shown in Listing 10.12.

Listing 10.12: Resizable Hash-Table Access Functions
 1 struct ht_elem *
 2 hashtab_lookup(struct hashtab *htp_master, void *key)
 3 {
 4   struct ht *htp;
 5   struct ht_elem *htep;
 6
 7   htp = rcu_dereference(htp_master->ht_cur);
 8   htep = ht_search_bucket(htp, key);
 9   return htep;
10 }
11
12 void hashtab_add(struct ht_elem *htep,
13                  struct ht_lock_state *lsp)
14 {
15   struct ht_bucket *htbp = lsp->hbp[0];
16   int i = lsp->hls_idx[0];
17
18   cds_list_add_rcu(&htep->hte_next[i], &htbp->htb_head);
19   if ((htbp = lsp->hbp[1])) {
20     cds_list_add_rcu(&htep->hte_next[!i], &htbp->htb_head);
21   }
22 }
23
24 void hashtab_del(struct ht_elem *htep,
25                  struct ht_lock_state *lsp)
26 {
27   int i = lsp->hls_idx[0];
28
29   cds_list_del_rcu(&htep->hte_next[i]);
30   if (lsp->hbp[1])
31     cds_list_del_rcu(&htep->hte_next[!i]);
32 }

The hashtab_lookup() function on lines 1–10 of the listing does hash lookups. Line 7 fetches the current hash table and line 8 searches the bucket corresponding to the specified key. Line 9 returns a pointer to the searched-for element or NULL when the search fails. The caller must be within an RCU read-side critical section.

Quick Quiz 10.12: The hashtab_lookup() function in Listing 10.12 ignores concurrent resize operations. Doesn't this mean that readers might miss an element that was previously added during a resize operation?

The hashtab_add() function on lines 12–22 of the listing adds new data elements to the hash table. Line 15 picks up the current ht_bucket structure into which the new element is to be added, and line 16 picks up the index of the pointer pair. Line 18 adds the new element to the current hash bucket. If line 19 determines that this bucket has been distributed to a new version of the hash table, then line 20 also adds the new element to the corresponding new bucket. The caller is required to handle concurrency, for example, by invoking hashtab_lock_mod() before the call to hashtab_add() and invoking hashtab_unlock_mod() afterwards.

The hashtab_del() function on lines 24–32 of the listing removes an existing element from the hash table. Line 27 picks up the index of the pointer pair and line 29 removes the specified element from the current table. If line 30 determines that this bucket has been distributed to a new version of the hash table, then line 31 also removes the specified element from the corresponding new bucket. As with hashtab_add(), the caller is responsible for concurrency control and this concurrency control suffices for synchronizing with a concurrent resize operation.

Quick Quiz 10.13: The hashtab_add() and hashtab_del() functions in Listing 10.12 can update two hash buckets while a resize operation is progressing. This might cause poor performance if the frequency of resize operations is not negligible. Isn't it possible to reduce the cost of updates in such cases?

The actual resizing itself is carried out by hashtab_resize(), shown in Listing 10.13 on page 199. Line 16 conditionally acquires the top-level ->ht_lock, and if this acquisition fails, line 17 returns -EBUSY to indicate that a resize is already in progress. Otherwise, line 18 picks up a reference to the current hash table, and lines 19–22 allocate a new hash table of the desired size. If a new set of hash/key functions has been specified, these are used for the new table, otherwise those of the old table are preserved. If line 23 detects memory-allocation failure, line 24 releases ->ht_lock and line 25 returns a failure indication.

Line 27 picks up the current table's index and line 28 stores its inverse to the new hash table, thus ensuring that the two hash tables avoid overwriting each other's linked lists. Line 29 then starts the bucket-distribution process by installing a reference to the new table into the ->ht_new field of the old table. Line 30 ensures that all readers who are not aware of the new table complete before the resize operation continues.

Each pass through the loop spanning lines 31–42 distributes the contents of one of the old hash table's buckets into the new hash table. Line 32 picks up a reference to the old table's current bucket and line 33 acquires that bucket's spinlock.


Listing 10.13: Resizable Hash-Table Resizing
 1 int hashtab_resize(struct hashtab *htp_master,
 2                    unsigned long nbuckets,
 3                    int (*cmp)(struct ht_elem *htep, void *key),
 4                    unsigned long (*gethash)(void *key),
 5                    void *(*getkey)(struct ht_elem *htep))
 6 {
 7   struct ht *htp;
 8   struct ht *htp_new;
 9   int i;
10   int idx;
11   struct ht_elem *htep;
12   struct ht_bucket *htbp;
13   struct ht_bucket *htbp_new;
14   long b;
15
16   if (!spin_trylock(&htp_master->ht_lock))
17     return -EBUSY;
18   htp = htp_master->ht_cur;
19   htp_new = ht_alloc(nbuckets,
20                      cmp ? cmp : htp->ht_cmp,
21                      gethash ? gethash : htp->ht_gethash,
22                      getkey ? getkey : htp->ht_getkey);
23   if (htp_new == NULL) {
24     spin_unlock(&htp_master->ht_lock);
25     return -ENOMEM;
26   }
27   idx = htp->ht_idx;
28   htp_new->ht_idx = !idx;
29   rcu_assign_pointer(htp->ht_new, htp_new);
30   synchronize_rcu();
31   for (i = 0; i < htp->ht_nbuckets; i++) {
32     htbp = &htp->ht_bkt[i];
33     spin_lock(&htbp->htb_lock);
34     cds_list_for_each_entry(htep, &htbp->htb_head, hte_next[idx]) {
35       htbp_new = ht_get_bucket(htp_new, htp_new->ht_getkey(htep), &b, NULL);
36       spin_lock(&htbp_new->htb_lock);
37       cds_list_add_rcu(&htep->hte_next[!idx], &htbp_new->htb_head);
38       spin_unlock(&htbp_new->htb_lock);
39     }
40     WRITE_ONCE(htp->ht_resize_cur, i);
41     spin_unlock(&htbp->htb_lock);
42   }
43   rcu_assign_pointer(htp_master->ht_cur, htp_new);
44   synchronize_rcu();
45   spin_unlock(&htp_master->ht_lock);
46   free(htp);
47   return 0;
48 }


Quick Quiz 10.14: In the hashtab_resize() function in Listing 10.13, what guarantees that the update to ->ht_new on line 29 will be seen as happening before the update to ->ht_resize_cur on line 40 from the perspective of hashtab_add() and hashtab_del()? In other words, what prevents hashtab_add() and hashtab_del() from dereferencing a NULL pointer loaded from ->ht_new?

Each pass through the loop spanning lines 34–39 adds one data element from the current old-table bucket to the corresponding new-table bucket, holding the new-table bucket's lock during the add operation. Line 40 updates ->ht_resize_cur to indicate that this bucket has been distributed. Finally, line 41 releases the old-table bucket lock.

Execution reaches line 43 once all old-table buckets have been distributed across the new table. Line 43 installs the newly created table as the current one, and line 44 waits for all old readers (who might still be referencing the old table) to complete. Then line 45 releases the resize-serialization lock, line 46 frees the old hash table, and finally line 47 returns success.

Quick Quiz 10.15: Why is there a WRITE_ONCE() on line 40 in Listing 10.13?

10.4.3 Resizable Hash Table Discussion

Figure 10.19: Overhead of Resizing Hash Tables Between 262,144 and 524,288 Buckets vs. Total Number of Elements (lookups per millisecond versus number of CPUs (threads))

Figure 10.19 compares resizing hash tables to their fixed-sized counterparts for 262,144 and 2,097,152 elements in the hash table. The figure shows three traces for each element count, one for a fixed-size 262,144-bucket hash table, another for a fixed-size 524,288-bucket hash table, and a third for a resizable hash table that shifts back and forth between 262,144 and 524,288 buckets, with a one-millisecond pause between each resize operation.

The uppermost three traces are for the 262,144-element hash table.1 The dashed trace corresponds to the two fixed-size hash tables, and the solid trace to the resizable hash table. In this case, the short hash chains cause normal lookup overhead to be so low that the overhead of resizing dominates over most of the range. In particular, the entire hash table fits into L3 cache.

1 You see only two traces? The dashed one is composed of two traces that differ only slightly, hence the irregular-looking dash pattern.

The lower three traces are for the 2,097,152-element hash table. The upper dashed trace corresponds to the 262,144-bucket fixed-size hash table, the solid trace in the middle for low CPU counts and at the bottom for high CPU counts to the resizable hash table, and the other trace to the 524,288-bucket fixed-size hash table. The fact that there are now an average of eight elements per bucket can only be expected to produce a sharp decrease in performance, as in fact is shown in the graph. But worse yet, the hash-table elements occupy 128 MB, which overflows each socket's 39 MB L3 cache, with performance consequences analogous to those described in Section 3.2.2. The resulting cache overflow means that the memory system is involved even for a read-only benchmark, and as you can see from the sublinear portions of the lower three traces, the memory system can be a serious bottleneck.

Quick Quiz 10.16: How much of the difference in performance between the large and small hash tables shown in Figure 10.19 was due to long hash chains and how much was due to memory-system bottlenecks?

Referring to the last column of Table 3.1, we recall that the first 28 CPUs are in the first socket, on a one-CPU-per-core basis, which explains the sharp decrease in performance of the resizable hash table beyond 28 CPUs. Sharp though this decrease is, please recall that it is due to constant resizing back and forth. It would clearly be better to resize once to 524,288 buckets, or, even better, do a single eight-fold resize to 2,097,152 elements, thus dropping the average number of elements per bucket down to the level enjoyed by the runs producing the upper three traces.

The key point from this data is that the RCU-protected resizable hash table performs and scales almost as well as


does its fixed-size counterpart. The performance during an actual resize operation of course suffers somewhat due to the cache misses caused by the updates to each element's pointers, and this effect is most pronounced when the memory system becomes a bottleneck. This indicates that hash tables should be resized by substantial amounts, and that hysteresis should be applied to prevent performance degradation due to too-frequent resize operations. In memory-rich environments, hash-table sizes should furthermore be increased much more aggressively than they are decreased.
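One way to apply such hysteresis is sketched below. The thresholds, the growth and shrink factors, and the element count supplied by the caller are all illustrative assumptions rather than part of hash_resize.c, and passing NULL retains the old table's hash and key functions, as noted in the discussion of Listing 10.13:

  /* Hypothetical resize policy: grow eagerly, shrink reluctantly. */
  static void maybe_resize(struct hashtab *htp_master,
                           unsigned long nelems, unsigned long nbuckets)
  {
      if (nelems > 2 * nbuckets)            /* Chains averaging more than two elements: grow 4x. */
          (void)hashtab_resize(htp_master, 4 * nbuckets, NULL, NULL, NULL);
      else if (nelems < nbuckets / 8)       /* Table very sparse: shrink, but only by 2x. */
          (void)hashtab_resize(htp_master, nbuckets / 2, NULL, NULL, NULL);
  }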
Another key point is that although the hashtab structure is non-partitionable, it is also read-mostly, which suggests the use of RCU. Given that the performance and scalability of this resizable hash table is very nearly that of RCU-protected fixed-sized hash tables, we must conclude that this approach was quite successful.

Finally, it is important to note that insertions, deletions, and lookups can proceed concurrently with a resize operation. This concurrency is critically important when resizing large hash tables, especially for applications that must meet severe response-time constraints.

Of course, the ht_elem structure's pair of pointer sets does impose some memory overhead, which is taken up in the next section.

10.4.4 Other Resizable Hash Tables

One shortcoming of the resizable hash table described earlier in this section is memory consumption. Each data element has two pairs of linked-list pointers rather than just one. Is it possible to create an RCU-protected resizable hash table that makes do with just one pair?

It turns out that the answer is "yes". Josh Triplett et al. [TMW11] produced a relativistic hash table that incrementally splits and combines corresponding hash chains so that readers always see valid hash chains at all points during the resizing operation. This incremental splitting and combining relies on the fact that it is harmless for a reader to see a data element that should be in some other hash chain: When this happens, the reader will simply ignore the extraneous data element due to key mismatches.

The process of shrinking a relativistic hash table by a factor of two is shown in Figure 10.20, in this case shrinking a two-bucket hash table into a one-bucket hash table, otherwise known as a linear list. This process works by coalescing pairs of buckets in the old larger hash table into single buckets in the new smaller hash table. For this process to work correctly, we clearly need to constrain the hash functions for the two tables. One such constraint is to use the same underlying hash function for both tables, but to throw out the low-order bit when shrinking from large to small. For example, the old two-bucket hash table would use the two top bits of the value, while the new one-bucket hash table could use the top bit of the value. In this way, a given pair of adjacent even and odd buckets in the old large hash table can be coalesced into a single bucket in the new small hash table, while still having a single hash value cover all of the elements in that single bucket.
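A minimal sketch of this constraint follows. It assumes power-of-two table sizes and bucket indexes drawn from the high-order bits of the hash, which is an illustrative assumption rather than code from the relativistic-hash-table implementation:

  #include <limits.h>

  #define HASH_BITS (sizeof(unsigned long) * CHAR_BIT)

  /* Index into the old table: the top log2_old bits of the hash. */
  static unsigned long old_index(unsigned long hash, int log2_old)
  {
      return hash >> (HASH_BITS - log2_old);
  }

  /* Index into the new, half-sized table: dropping the low-order index
   * bit coalesces old buckets 2k and 2k+1 into new bucket k, so a single
   * underlying hash function serves both tables. */
  static unsigned long new_index(unsigned long hash, int log2_old)
  {
      return old_index(hash, log2_old) >> 1;
  }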
Figure 10.20: Shrinking a Relativistic Hash Table

The initial state is shown at the top of the figure, with time advancing from top to bottom, starting with initial state (a). The shrinking process begins by allocating the new smaller array of buckets, and having each bucket of this new smaller array reference the first element of one of the buckets of the corresponding pair in the old large hash table, resulting in state (b).


Then the two hash chains are linked together, resulting in state (c). In this state, readers looking up an even-numbered element see no change, and readers looking up elements 1 and 3 likewise see no change. However, readers looking up some other odd number will also traverse elements 0 and 2. This is harmless because any odd number will compare not-equal to these two elements. There is some performance loss, but on the other hand, this is exactly the same performance loss that will be experienced once the new small hash table is fully in place.

Next, the new small hash table is made accessible to readers, resulting in state (d). Note that older readers might still be traversing the old large hash table, so in this state both hash tables are in use.

The next step is to wait for all pre-existing readers to complete, resulting in state (e). In this state, all readers are using the new small hash table, so that the old large hash table's buckets may be freed, resulting in the final state (f).

Growing a relativistic hash table reverses the shrinking process, but requires more grace-period steps, as shown in Figure 10.21. The initial state (a) is at the top of this figure, with time advancing from top to bottom.

Figure 10.21: Growing a Relativistic Hash Table

We start by allocating the new large two-bucket hash table, resulting in state (b). Note that each of these new buckets references the first element destined for that bucket. These new buckets are published to readers, resulting in state (c). After a grace-period operation, all readers are using the new large hash table, resulting in state (d). In this state, only those readers traversing the even-values hash bucket traverse element 0, which is therefore now colored white.

At this point, the old small hash buckets may be freed, although many implementations use these old buckets to track progress "unzipping" the list of items into their respective new buckets. The last even-numbered element in the first consecutive run of such elements now has its pointer-to-next updated to reference the following even-numbered element. After a subsequent grace-period operation, the result is state (e). The vertical arrow indicates the next element to be unzipped, and element 1 is now colored black to indicate that only those readers traversing the odd-values hash bucket may reach it.

Next, the last odd-numbered element in the first consecutive run of such elements now has its pointer-to-next updated to reference the following odd-numbered element. After a subsequent grace-period operation, the


result is state (f). A final unzipping operation (including a grace-period operation) results in the final state (g).

In short, the relativistic hash table reduces the number of per-element list pointers at the expense of additional grace periods incurred during resizing. These additional grace periods are usually not a problem because insertions, deletions, and lookups may proceed concurrently with a resize operation.

It turns out that it is possible to reduce the per-element memory overhead from a pair of pointers to a single pointer, while still retaining O(1) deletions. This is accomplished by augmenting split-order list [SS06] with RCU protection [Des09b, MDJ13a]. The data elements in the hash table are arranged into a single sorted linked list, with each hash bucket referencing the first element in that bucket. Elements are deleted by setting low-order bits in their pointer-to-next fields, and these elements are removed from the list by later traversals that encounter them.

This RCU-protected split-order list is complex, but offers lock-free progress guarantees for all insertion, deletion, and lookup operations. Such guarantees can be important in real-time applications. An implementation is available from recent versions of the userspace RCU library [Des09b].

10.5 Other Data Structures

All life is an experiment. The more experiments you make the better.

Ralph Waldo Emerson

The preceding sections have focused on data structures that enhance concurrency due to partitionability (Section 10.2), efficient handling of read-mostly access patterns (Section 10.3), or application of read-mostly techniques to avoid non-partitionability (Section 10.4). This section gives a brief review of other data structures.

One of the hash table's greatest advantages for parallel use is that it is fully partitionable, at least while not being resized. One way of preserving the partitionability and the size independence is to use a radix tree, which is also called a trie. Tries partition the search key, using each successive key partition to traverse the next level of the trie. As such, a trie can be thought of as a set of nested hash tables, thus providing the required partitionability. One disadvantage of tries is that a sparse key space can result in inefficient use of memory. There are a number of compression techniques that may be used to work around this disadvantage, including hashing the key value to a smaller keyspace before the traversal [ON07]. Radix trees are heavily used in practice, including in the Linux kernel [Pig06].

One important special case of both a hash table and a trie is what is perhaps the oldest of data structures, the array and its multi-dimensional counterpart, the matrix. The fully partitionable nature of matrices is exploited heavily in concurrent numerical algorithms.

Self-balancing trees are heavily used in sequential code, with AVL trees and red-black trees being perhaps the most well-known examples [CLRS01]. Early attempts to parallelize AVL trees were complex and not necessarily all that efficient [Ell80], however, more recent work on red-black trees provides better performance and scalability by using RCU for readers and hashed arrays of locks2 to protect reads and updates, respectively [HW11, HW14]. It turns out that red-black trees rebalance aggressively, which works well for sequential programs, but not necessarily so well for parallel use. Recent work has therefore made use of RCU-protected "bonsai trees" that rebalance less aggressively [CKZ12], trading off optimal tree depth to gain more efficient concurrent updates.

Concurrent skip lists lend themselves well to RCU readers, and in fact represent an early academic use of a technique resembling RCU [Pug90].

Concurrent double-ended queues were discussed in Section 6.1.2, and concurrent stacks and queues have a long history [Tre86], though not normally with the most impressive performance or scalability. They are nevertheless a common feature of concurrent libraries [MDJ13b]. Researchers have recently proposed relaxing the ordering constraints of stacks and queues [Sha11], with some work indicating that relaxed-ordered queues actually have better ordering properties than do strict FIFO queues [HKLP12, KLP12, HHK+ 13].

It seems likely that continued work with concurrent data structures will produce novel algorithms with surprising properties.

2 In the guise of swissTM [DFGG11], which is a variant of software transactional memory in which the developer flags non-shared accesses.


10.6 Micro-Optimization

The devil is in the details.

Unknown

The data structures shown in this section were coded straightforwardly, with no adaptation to the underlying system's cache hierarchy. In addition, many of the implementations used pointers to functions for key-to-hash conversions and other frequent operations. Although this approach provides simplicity and portability, in many cases it does give up some performance.

The following sections touch on specialization, memory conservation, and hardware considerations. Please do not mistake these short sections for a definitive treatise on this subject. Whole books have been written on optimizing to a specific CPU, let alone to the set of CPU families in common use today.

10.6.1 Specialization

The resizable hash table presented in Section 10.4 used an opaque type for the key. This allows great flexibility, permitting any sort of key to be used, but it also incurs significant overhead due to the calls via pointers to functions. Now, modern hardware uses sophisticated branch-prediction techniques to minimize this overhead, but on the other hand, real-world software is often larger than can be accommodated even by today's large hardware branch-prediction tables. This is especially the case for calls via pointers, in which case the branch prediction hardware must record a pointer in addition to branch-taken/branch-not-taken information.

This overhead can be eliminated by specializing a hash-table implementation to a given key type and hash function, for example, by using C++ templates. Doing so eliminates the ->ht_cmp(), ->ht_gethash(), and ->ht_getkey() function pointers in the ht structure shown in Listing 10.9 on page 196. It also eliminates the corresponding calls through these pointers, which could allow the compiler to inline the resulting fixed functions, eliminating not only the overhead of the call instruction, but the argument marshalling as well.
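For example, a version specialized for unsigned long keys might hard-code the comparison and hash functions roughly as follows. This sketch is an illustrative assumption rather than code from the CodeSamples directory, but it shows how fixed functions become candidates for inlining:

  /* Hypothetical specialization for unsigned long keys: no function pointers. */
  static inline int ht_cmp_ul(unsigned long elem_key, unsigned long key)
  {
      return elem_key == key;        /* Direct comparison, trivially inlined. */
  }

  static inline unsigned long ht_gethash_ul(unsigned long key, long nbuckets)
  {
      return key % nbuckets;         /* Fixed hash function, also inlinable. */
  }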
In addition, the resizable hash table is designed to fit an API that segregates bucket selection from concurrency control. Although this allows a single torture test to exercise all the hash-table implementations in this chapter, it also means that many operations must compute the hash and interact with possible resize operations twice rather than just once. In a performance-conscious environment, the hashtab_lock_mod() function would also return a reference to the bucket selected, eliminating the subsequent call to ht_get_bucket().

Quick Quiz 10.17: Couldn't the hashtorture.h code be modified to accommodate a version of hashtab_lock_mod() that subsumes the ht_get_bucket() functionality?

Quick Quiz 10.18: How much do these specializations really save? Are they really worth it?

All that aside, one of the great benefits of modern hardware compared to that available when I first started learning to program back in the early 1970s is that much less specialization is required. This allows much greater productivity than was possible back in the days of four-kilobyte address spaces.

10.6.2 Bits and Bytes

The hash tables discussed in this chapter made almost no attempt to conserve memory. For example, the ->ht_idx field in the ht structure in Listing 10.9 on page 196 always has a value of either zero or one, yet takes up a full 32 bits of memory. It could be eliminated, for example, by stealing a bit from the ->ht_resize_key field. This works because the ->ht_resize_key field is large enough to address every byte of memory and the ht_bucket structure is more than one byte long, so that the ->ht_resize_key field must have several bits to spare.
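As a generic illustration of such bit stealing (not code from this chapter's hash tables), a flag can ride in the low-order bit of a pointer to a structure whose alignment guarantees that this bit is otherwise zero:

  #include <stdint.h>

  #define PTR_FLAG 0x1UL  /* Hypothetical stolen bit. */

  static inline void *ptr_pack(void *p, int flag)
  {
      return (void *)((uintptr_t)p | (flag ? PTR_FLAG : 0));
  }

  static inline void *ptr_unpack(void *p)
  {
      return (void *)((uintptr_t)p & ~PTR_FLAG);  /* Recover the real pointer. */
  }

  static inline int ptr_flag(void *p)
  {
      return (int)((uintptr_t)p & PTR_FLAG);      /* Read back the stolen bit. */
  }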
This sort of bit-packing trick is frequently used in data structures that are highly replicated, as is the page structure in the Linux kernel. However, the resizable hash table's ht structure is not all that highly replicated. It is instead the ht_bucket structures we should focus on. There are two major opportunities for shrinking the ht_bucket structure: (1) Placing the ->htb_lock field in a low-order bit of one of the ->htb_head pointers and (2) Reducing the number of pointers required.

The first opportunity might make use of bit-spinlocks in the Linux kernel, which are provided by the include/linux/bit_spinlock.h header file. These are used in space-critical data structures in the Linux kernel, but are not without their disadvantages:

1. They are significantly slower than the traditional spinlock primitives.


2. They cannot participate in the lockdep deadlock detection tooling in the Linux kernel [Cor06a].

3. They do not record lock ownership, further complicating debugging.

4. They do not participate in priority boosting in -rt kernels, which means that preemption must be disabled when holding bit spinlocks, which can degrade real-time latency.

Despite these disadvantages, bit-spinlocks are extremely useful when memory is at a premium.

One aspect of the second opportunity was covered in Section 10.4.4, which presented resizable hash tables that require only one set of bucket-list pointers in place of the pair of sets required by the resizable hash table presented in Section 10.4. Another approach would be to use singly linked bucket lists in place of the doubly linked lists used in this chapter. One downside of this approach is that deletion would then require additional overhead, either by marking the outgoing pointer for later removal or by searching the bucket list for the element being deleted.

In short, there is a tradeoff between minimal memory overhead on the one hand, and performance and simplicity on the other. Fortunately, the relatively large memories available on modern systems have allowed us to prioritize performance and simplicity over memory overhead. However, even though the year 2022's pocket-sized smartphones sport many gigabytes of memory and its mid-range servers sport terabytes, it is sometimes necessary to take extreme measures to reduce memory overhead.

10.6.3 Hardware Considerations

Modern computers typically move data between CPUs and main memory in fixed-sized blocks that range in size from 32 bytes to 256 bytes. These blocks are called cache lines, and are extremely important to high performance and scalability, as was discussed in Section 3.2. One timeworn way to kill both performance and scalability is to place incompatible variables into the same cacheline. For example, suppose that a resizable hash table data element had the ht_elem structure in the same cacheline as a frequently incremented counter. The frequent incrementing would cause the cacheline to be present at the CPU doing the incrementing, but nowhere else. If other CPUs attempted to traverse the hash bucket list containing that element, they would incur expensive cache misses, degrading both performance and scalability.

Listing 10.14: Alignment for 64-Byte Cache Lines
1 struct hash_elem {
2   struct ht_elem e;
3   long __attribute__ ((aligned(64))) counter;
4 };

One way to solve this problem on systems with 64-byte cache lines is shown in Listing 10.14. Here GCC's aligned attribute is used to force the ->counter and the ht_elem structure into separate cache lines. This would allow CPUs to traverse the hash bucket list at full speed despite the frequent incrementing.

Of course, this raises the question "How did we know that cache lines are 64 bytes in size?" On a Linux system, this information may be obtained from the /sys/devices/system/cpu/cpu*/cache/ directories, and it is even possible to make the installation process rebuild the application to accommodate the system's hardware structure. However, this would be more difficult if you wanted your application to also run on non-Linux systems. Furthermore, even if you were content to run only on Linux, such a self-modifying installation poses validation challenges. For example, systems with 32-byte cachelines might work well, but performance might suffer on systems with 64-byte cachelines due to false sharing.
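For example, on many Linux systems the cache-line size can be queried at run time along the following lines. The exact index0 path is an assumption that can vary from system to system, so the sketch falls back to a guess of 64 bytes:

  #include <stdio.h>

  static long cache_line_size(void)
  {
      long size = 64;  /* Fallback guess. */
      FILE *fp = fopen("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");

      if (fp) {
          if (fscanf(fp, "%ld", &size) != 1)
              size = 64;
          fclose(fp);
      }
      return size;
  }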
available on modern systems have allowed us to priori- Fortunately, there are some rules of thumb that work
tize performance and simplicity over memory overhead. reasonably well in practice, which were gathered into a
However, even though the year 2022’s pocket-sized smart- 1995 paper [GKPS95].3 The first group of rules involve
phones sport many gigabytes of memory and its mid-range rearranging structures to accommodate cache geometry:
servers sport terabytes, it is sometimes necessary to take
extreme measures to reduce memory overhead. 1. Place read-mostly data far from frequently updated
data. For example, place read-mostly data at the
beginning of the structure and frequently updated
10.6.3 Hardware Considerations data at the end. Place data that is rarely accessed in
between.
Modern computers typically move data between CPUs
and main memory in fixed-sized blocks that range in size 2. If the structure has groups of fields such that each
from 32 bytes to 256 bytes. These blocks are called cache group is updated by an independent code path, sep-
lines, and are extremely important to high performance arate these groups from each other. Again, it can
and scalability, as was discussed in Section 3.2. One be helpful to place rarely accessed data between the
timeworn way to kill both performance and scalability is groups. In some cases, it might also make sense
to place incompatible variables into the same cacheline. to place each such group into a separate structure
For example, suppose that a resizable hash table data referenced by the original structure.
element had the ht_elem structure in the same cacheline
as a frequently incremented counter. The frequent incre- 3. Where possible, associate update-mostly data with
menting would cause the cacheline to be present at the a CPU, thread, or task. We saw several very effec-
CPU doing the incrementing, but nowhere else. If other tive examples of this rule of thumb in the counter
CPUs attempted to traverse the hash bucket list containing implementations in Chapter 5.
that element, they would incur expensive cache misses, 3 A number of these rules are paraphrased and expanded on here

degrading both performance and scalability. with permission from Orran Krieger.


4. Going one step further, partition your data on a per-CPU, per-thread, or per-task basis, as was discussed in Chapter 8.

There has been some work towards automated trace-based rearrangement of structure fields [GDZE10]. This work might well ease one of the more painstaking tasks required to get excellent performance and scalability from multithreaded software.

An additional set of rules of thumb deals with locks:

1. Given a heavily contended lock protecting data that is frequently modified, take one of the following approaches:

   (a) Place the lock in a different cacheline than the data that it protects (see the sketch following this list).

   (b) Use a lock that is adapted for high contention, such as a queued lock.

   (c) Redesign to reduce lock contention. (This approach is best, but is not always trivial.)

2. Place uncontended locks into the same cache line as the data that they protect. This approach means that the cache miss that brings the lock to the current CPU also brings its data.

3. Protect read-mostly data with hazard pointers, RCU, or, for long-duration critical sections, reader-writer locks.
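The following hypothetical structures sketch rules 1(a) and 2, reusing the aligned attribute from Listing 10.14 and assuming 64-byte cache lines:

  /* Rule 1(a): keep a heavily contended lock away from the data it protects. */
  struct contended_counters {
      spinlock_t lock __attribute__ ((aligned(64)));   /* Lock gets its own cache line. */
      long counters[8] __attribute__ ((aligned(64)));  /* Protected data lives elsewhere. */
  };

  /* Rule 2: an uncontended lock can share its cache line with its data, so
   * the miss that fetches the lock also fetches the data it protects. */
  struct uncontended_counter {
      spinlock_t lock;
      long counter;
  };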

Of course, these are rules of thumb rather than absolute rules. Some experimentation is required to work out which are most applicable to a given situation.

10.7 Summary

There's only one thing more painful than learning from experience, and that is not learning from experience.

Archibald MacLeish

This chapter has focused primarily on hash tables, including resizable hash tables, which are not fully partitionable. Section 10.5 gave a quick overview of a few non-hash-table data structures. Nevertheless, this exposition of hash tables is an excellent introduction to the many issues surrounding high-performance scalable data access, including:

1. Fully partitioned data structures work well on small systems, for example, single-socket systems.

2. Larger systems require locality of reference as well as full partitioning.

3. Read-mostly techniques, such as hazard pointers and RCU, provide good locality of reference for read-mostly workloads, and thus provide excellent performance and scalability even on larger systems.

4. Read-mostly techniques also work well on some types of non-partitionable data structures, such as resizable hash tables.

5. Large data structures can overflow CPU caches, reducing performance and scalability.

6. Additional performance and scalability can be obtained by specializing the data structure to a specific workload, for example, by replacing a general key with a 32-bit integer.

7. Although requirements for portability and for extreme performance often conflict, there are some data-structure-layout techniques that can strike a good balance between these two sets of requirements.

That said, performance and scalability are of little use without reliability, so the next chapter covers validation.

Chapter 11

Validation

If it is not tested, it doesn't work.

Unknown

I have had a few parallel programs work the first time, but that is only because I have written a large number of parallel programs over the past three decades. And I have had far more parallel programs that fooled me into thinking that they were working correctly the first time than actually were working the first time.

I thus need to validate my parallel programs. The basic trick behind validation is to realize that the computer knows what is wrong. It is therefore your job to force it to tell you. This chapter can therefore be thought of as a short course in machine interrogation. But you can leave the good-cop/bad-cop routine at home. This chapter covers much more sophisticated and effective methods, especially given that most computers couldn't tell a good cop from a bad cop, at least as far as we know.

A longer course may be found in many recent books on validation, as well as at least one older but valuable one [Mye79]. Validation is an extremely important topic that cuts across all forms of software, and is worth intensive study in its own right. However, this book is primarily about concurrency, so this chapter will do little more than scratch the surface of this critically important topic.

Section 11.1 introduces the philosophy of debugging. Section 11.2 discusses tracing, Section 11.3 discusses assertions, and Section 11.4 discusses static analysis. Section 11.5 describes some unconventional approaches to code review that can be helpful when the fabled 10,000 eyes happen not to be looking at your code. Section 11.6 overviews the use of probability for validating parallel software. Because performance and scalability are first-class requirements for parallel programming, Section 11.7 covers these topics. Finally, Section 11.8 gives a fanciful summary and a short list of statistical traps to avoid.

But never forget that the three best debugging tools are a thorough understanding of the requirements, a solid design, and a good night's sleep!

11.1 Introduction

Debugging is like being the detective in a crime movie where you are also the murderer.

Filipe Fortes

Section 11.1.1 discusses the sources of bugs, and Section 11.1.2 overviews the mindset required when validating software. Section 11.1.3 discusses when you should start validation, and Section 11.1.4 describes the surprisingly effective open-source regimen of code review and community testing.

11.1.1 Where Do Bugs Come From?

Bugs come from developers. The basic problem is that the human brain did not evolve with computer software in mind. Instead, the human brain evolved in concert with other human brains and with animal brains. Because of this history, the following three characteristics of computers often come as a shock to human intuition:

1. Computers lack common sense, despite huge sacrifices at the altar of artificial intelligence.

2. Computers fail to understand user intent, or more formally, computers generally lack a theory of mind.

3. Computers cannot do anything useful with a fragmentary plan, instead requiring that every detail of all possible scenarios be spelled out in full.

The first two points should be uncontroversial, as they are illustrated by any number of failed products, perhaps most famously Clippy and Microsoft Bob. By attempting to relate to users as people, these two products raised common-sense and theory-of-mind expectations that they


proved incapable of meeting. Perhaps the set of software An important special case is the project that, while
assistants are now available on smartphones will fare valuable, is not valuable enough to justify the time required
better, but as of 2021 reviews are mixed. That said, the to implement it. This special case is quite common, and
developers working on them by all accounts still develop one early symptom is the unwillingness of the decision-
the old way: The assistants might well benefit end users, makers to invest enough to actually implement the project.
but not so much their own developers. A natural reaction is for the developers to produce an
This human love of fragmentary plans deserves more unrealistically optimistic estimate in order to be permitted
explanation, especially given that it is a classic two-edged to start the project. If the organization is strong enough
sword. This love of fragmentary plans is apparently due and its decision-makers ineffective enough, the project
to the assumption that the person carrying out the plan might succeed despite the resulting schedule slips and
will have (1) common sense and (2) a good understanding budget overruns. However, if the organization is not
of the intent and requirements driving the plan. This latter strong enough and if the decision-makers fail to cancel the
assumption is especially likely to hold in the common project as soon as it becomes clear that the estimates are
case where the person doing the planning and the person garbage, then the project might well kill the organization.
carrying out the plan are one and the same: In this This might result in another organization picking up the
case, the plan will be revised almost subconsciously as project and either completing it, canceling it, or being
obstacles arise, especially when that person has a good killed by it. A given project might well succeed only
understanding of the problem at hand. In fact, the love after killing several organizations. One can only hope
of fragmentary plans has served human beings well, in that the organization that eventually makes a success of
part because it is better to take random actions that have a serial-organization-killer project maintains a suitable
some chance of locating food than to starve to death level of humility, lest it be killed by its next such project.
while attempting to plan the unplannable. However, the
Quick Quiz 11.2: Who cares about the organization? After
usefulness of fragmentary plans in the everyday life of
all, it is the project that is important!
which we are all experts is no guarantee of their future
usefulness in stored-program computers. Important though insane levels of optimism might
Furthermore, the need to follow fragmentary plans has be, they are a key source of bugs (and perhaps failure
had important effects on the human psyche, due to the of organizations). The question is therefore “How to
fact that throughout much of human history, life was often maintain the optimism required to start a large project
difficult and dangerous. It should come as no surprise that while at the same time injecting enough reality to keep
executing a fragmentary plan that has a high probability the bugs down to a dull roar?” The next section examines
of a violent encounter with sharp teeth and claws requires this conundrum.
almost insane levels of optimism—a level of optimism that
actually is present in most human beings. These insane
levels of optimism extend to self-assessments of program- 11.1.2 Required Mindset
ming ability, as evidenced by the effectiveness of (and the
controversy over) code-interviewing techniques [Bra07]. When carrying out any validation effort, keep the following
In fact, the clinical term for a human being with less-than- definitions firmly in mind:
insane levels of optimism is “clinically depressed”. Such
people usually have extreme difficulty functioning in their 1. The only bug-free programs are trivial programs.
daily lives, underscoring the perhaps counter-intuitive im-
portance of insane levels of optimism to a normal, healthy 2. A reliable program has no known bugs.
life. Furtheremore, if you are not insanely optimistic, you
are less likely to start a difficult but worthwhile project.1 From these definitions, it logically follows that any
reliable non-trivial program contains at least one bug that
Quick Quiz 11.1: When in computing is it necessary to
you do not know about. Therefore, any validation effort
follow a fragmentary plan?
undertaken on a non-trivial program that fails to find any
bugs is itself a failure. A good validation is therefore an
1 There are some famous exceptions to this rule of thumb. Some
exercise in destruction. This means that if you are the
people take on difficult or risky projects in order to at least temporarily
escape from their depression. Others have nothing to lose: The project type of person who enjoys breaking things, validation is
is literally a matter of life or death. just the job for you.


Quick Quiz 11.3: Suppose that you are writing a script that
processes the output of the time command, which looks as
follows:

real 0m0.132s
user 0m0.040s
sys 0m0.008s

The script is required to check its input for errors, and to give
appropriate diagnostics if fed erroneous time output. What
test inputs should you provide to this program to test it for use
with time output generated by single-threaded programs?

But perhaps you are a super-programmer whose code


is always perfect the first time every time. If so, congratu-
lations! Feel free to skip this chapter, but I do hope that
you will forgive my skepticism. You see, I have met too many Figure 11.1: Validation and the Geneva Convention
people who claimed to be able to write perfect code the
first time, which is not too surprising given the previous
discussion of optimism and over-confidence. And even
if you really are a super-programmer, you just might find
yourself debugging lesser mortals’ work.
One approach for the rest of us is to alternate between
our normal state of insane optimism (Sure, I can program
that!) and severe pessimism (It seems to work, but I just
know that there have to be more bugs hiding in there
somewhere!). It helps if you enjoy breaking things. If
you don’t, or if your joy in breaking things is limited to
breaking other people’s things, find someone who does
love breaking your code and have them help you break it.
Another helpful frame of mind is to hate it when other
people find bugs in your code. This hatred can help
motivate you to torture your code beyond all reason in
order to increase the probability that you will be the one to Figure 11.2: Rationalizing Validation
find the bugs. Just make sure to suspend this hatred long
enough to sincerely thank anyone who does find a bug
in your code! After all, by so doing, they saved you the One way of looking at this is that consistently making
trouble of tracking it down, and possibly at great personal good things happen requires a lot of focus on a lot of bad
expense dredging through your code. things that might happen, with an eye towards preventing
Yet another helpful frame of mind is studied skepticism. or otherwise handling those bad things.2 The prospect of
You see, believing that you understand the code means these bad things might also motivate you to torture your
you can learn absolutely nothing about it. Ah, but you code into revealing the whereabouts of its bugs.
know that you completely understand the code because This wide variety of frames of mind opens the door to
you wrote or reviewed it? Sorry, but the presence of the possibility of multiple people with different frames of
bugs suggests that your understanding is at least partially mind contributing to the project, with varying levels of
fallacious. One cure is to write down what you know to optimism. This can work well, if properly organized.
be true and double-check this knowledge, as discussed in
Sections 11.2–11.5. Objective reality always overrides
whatever you might think you know. 2 For more on this philosophy, see the chapter entitled “The Power
One final frame of mind is to consider the possibility of Negative Thinking” from Chris Hadfield’s excellent book entitled
that someone’s life depends on your code being correct. “An Astronaut’s Guide to Life on Earth.”


Some people might see vigorous validation as a form One such approach takes a Darwinian view, with the
of torture, as depicted in Figure 11.1.3 Such people might validation suite eliminating code that is not fit to solve
do well to remind themselves that, Tux cartoons aside, the problem at hand. From this viewpoint, a vigorous
they are really torturing an inanimate object, as shown in validation suite is essential to the fitness of your software.
Figure 11.2. Rest assured that those who fail to torture However, taking this approach to its logical conclusion is
their code are doomed to be tortured by it! quite humbling, as it requires us developers to admit that
However, this leaves open the question of exactly when our carefully crafted changes to the codebase are, from a
during the project lifetime validation should start, a topic Darwinian standpoint, random mutations. On the other
taken up by the next section. hand, this conclusion is supported by long experience
indicating that seven percent of fixes introduce at least
11.1.3 When Should Validation Start? one bug [BJ12].
How vigorous should your validation suite be? If the
Validation should start exactly when the project starts. bugs it finds aren’t threatening the very foundations of
To see this, consider that tracking down a bug is much your software design, then it is not yet vigorous enough.
harder in a large program than in a small one. Therefore, After all, your design is just as prone to bugs as is your
to minimize the time and effort required to track down code, and the earlier you find and fix the bugs in your
bugs, you should test small units of code. Although you design, the less time you will waste coding those design
won’t find all the bugs this way, you will find a substantial bugs.
fraction, and it will be much easier to find and fix the
ones you do find. Testing at this level can also alert you Quick Quiz 11.5: Are you actually suggesting that it is
to larger flaws in your overall design, minimizing the time possible to test correctness into software??? Everyone knows
that is impossible!!!
you waste writing code that is broken by design.
But why wait until you have code before validating your
design?4 Hopefully reading Chapters 3 and 4 provided you It is worth reiterating that this advice applies to first-
with the information required to avoid some regrettably of-a-kind projects. If you are instead doing a project in a
common design flaws, but discussing your design with a well-explored area, you would be quite foolish to refuse
colleague or even simply writing it down can help flush to learn from previous experience. But you should still
out additional flaws. start validating right at the beginning of the project, but
However, it is all too often the case that waiting to hopefully guided by others’ hard-won knowledge of both
start validation until you have a design is waiting too long. requirements and pitfalls.
Mightn’t your natural level of optimism caused you to start An equally important question is “When should valida-
the design before you fully understood the requirements? tion stop?” The best answer is “Some time after the last
The answer to this question will almost always be “yes”. change.” Every change has the potential to create a bug,
One good way to avoid flawed requirements is to get to and thus every change must be validated. Furthermore,
know your users. To really serve them well, you will have validation development should continue through the full
to live among them. lifetime of the project. After all, the Darwinian perspec-
tive above implies that bugs are adapting to your validation
Quick Quiz 11.4: You are asking me to do all this validation suite. Therefore, unless you continually improve your
BS before I even start coding??? That sounds like a great way
validation suite, your project will naturally accumulate
to never get started!!!
hordes of validation-suite-immune bugs.
First-of-a-kind projects often use different methodolo- But life is a tradeoff, and every bit of time invested in
gies such as rapid prototyping or agile. Here, the main validation suites as a bit of time that cannot be invested
goal of early prototypes are not to create correct imple- in directly improving the project itself. These sorts of
mentations, but rather to learn the project’s requirements. choices are never easy, and it can be just as damaging to
But this does not mean that you omit validation; it instead overinvest in validation as it can be to underinvest. But
means that you approach it differently. this is just one more indication that life is not easy.
3 The cynics among us might question whether these people are
Now that we have established that you should start
afraid that validation will find bugs that they will then be required to fix.
validation when you start the project (if not earlier!), and
4 The old saying “First we must code, then we have incentive to that both validation and validation development should
think” notwithstanding. continue throughout the lifetime of that project, the fol-


lowing sections cover a number of validation techniques likely would have forgotten how the patch was supposed
and methods that have proven their worth. to work, making it much more difficult to fix them.
However, we must not forget the second tenet of the
open-source development, namely intensive testing. For
11.1.4 The Open Source Way example, a great many people test the Linux kernel. Some
The open-source programming methodology has proven test patches as they are submitted, perhaps even yours.
quite effective, and includes a regimen of intense code Others test the -next tree, which is helpful, but there is
review and testing. likely to be several weeks or even months delay between
I can personally attest to the effectiveness of the open- the time that you write the patch and the time that it
source community’s intense code review. One of my appears in the -next tree, by which time the patch will not
first patches to the Linux kernel involved a distributed be quite as fresh in your mind. Still others test maintainer
filesystem where one node might write to a given file trees, which often have a similar time delay.
that another node has mapped into memory. In this case, Quite a few people don’t test code until it is committed
it is necessary to invalidate the affected pages from the to mainline, or the master source tree (Linus’s tree in the
mapping in order to allow the filesystem to maintain case of the Linux kernel). If your maintainer won’t accept
coherence during the write operation. I coded up a first your patch until it has been tested, this presents you with a
attempt at a patch, and, in keeping with the open-source deadlock situation: Your patch won’t be accepted until it
maxim “post early, post often”, I posted the patch. I then is tested, but it won’t be tested until it is accepted. Never-
considered how I was going to test it. theless, people who test mainline code are still relatively
But before I could even decide on an overall test strategy, aggressive, given that many people and organizations do
I got a reply to my posting pointing out a few bugs. I fixed not test code until it has been pulled into a Linux distro.
the bugs and reposted the patch, and returned to thinking And even if someone does test your patch, there is
out my test strategy. However, before I had a chance to no guarantee that they will be running the hardware and
write any test code, I received a reply to my reposted patch, software configuration and workload required to locate
pointing out more bugs. This process repeated itself many your bugs.
times, and I am not sure that I ever got a chance to actually Therefore, even when writing code for an open-source
test the patch. project, you need to be prepared to develop and run your
This experience brought home the truth of the open- own test suite. Test development is an underappreciated
source saying: Given enough eyeballs, all bugs are shal- and very valuable skill, so be sure to take full advantage
low [Ray99]. of any existing test suites available to you. Important as
However, when you post some code or a given patch, it test development is, we must leave further discussion of it
is worth asking a few questions: to books dedicated to that topic. The following sections
therefore discuss locating bugs in your code given that
1. How many of those eyeballs are actually going to you already have a good test suite.
look at your code?

2. How many will be experienced and clever enough to 11.2 Tracing


actually find your bugs?
The machine knows what is wrong. Make it tell you.
3. Exactly when are they going to look?
Unknown
I was lucky: There was someone out there who wanted
the functionality provided by my patch, who had long When all else fails, add a printk()! Or a printf(), if
experience with distributed filesystems, and who looked you are working with user-mode C-language applications.
at my patch almost immediately. If no one had looked at The rationale is simple: If you cannot figure out how
my patch, there would have been no review, and therefore execution reached a given point in the code, sprinkle print
none of those bugs would have been located. If the people statements earlier in the code to work out what happened.
looking at my patch had lacked experience with distributed You can get a similar effect, and with more convenience
filesystems, it is unlikely that they would have found all and flexibility, by using a debugger such as gdb (for
the bugs. Had they waited months or even years to look, I user applications) or kgdb (for debugging Linux kernels).


Much more sophisticated tools exist, with some of the 11.3 Assertions
more recent offering the ability to rewind backwards in
time from the point of failure.
No man really becomes a fool until he stops asking
These brute-force testing tools are all valuable, espe-
questions.
cially now that typical systems have more than 64K of
memory and CPUs running faster than 4 MHz. Much has Charles P. Steinmetz
been written about these tools, so this chapter will add
only a little more. Assertions are usually implemented in the following man-
However, these tools all have a serious shortcoming ner:
when you need a fastpath to tell you what is going wrong, 1 if (something_bad_is_happening())
namely, these tools often have excessive overheads. There 2 complain();
are special tracing technologies for this purpose, which
typically leverage data ownership techniques (see Chap- This pattern is often encapsulated into C-preprocessor
ter 8) to minimize the overhead of runtime data collec- macros or language intrinsics, for example, in the
tion. One example within the Linux kernel is “trace Linux kernel, this might be represented as WARN_
events” [Ros10b, Ros10c, Ros10d, Ros10a], which uses ON(something_bad_is_happening()). Of course, if
per-CPU buffers to allow data to be collected with ex- something_bad_is_happening() quite frequently, the
tremely low overhead. Even so, enabling tracing can resulting output might obscure reports of other prob-
sometimes change timing enough to hide bugs, resulting lems, in which case WARN_ON_ONCE(something_bad_
in heisenbugs, which are discussed in Section 11.6 and is_happening()) might be more appropriate.
especially Section 11.6.4. In the kernel, BPF can do Quick Quiz 11.6: How can you implement WARN_ON_
data reduction in the kernel, reducing the overhead of ONCE()?
transmitting the needed information from the kernel to
userspace [Gre19]. In userspace code, there is a huge In parallel code, one bad something that might hap-
number of tools that can help you. One good starting pen is that a function expecting to be called under a
point is Brendan Gregg’s blog.5 particular lock might be called without that lock being
Even if you avoid heisenbugs, other pitfalls await you. held. Such functions sometimes have header comments
For example, although the machine really does know all, stating something like “The caller must hold foo_lock
what it knows is almost always way more than your head when calling this function”, but such a comment does no
can hold. For this reason, high-quality test suites normally good unless someone actually reads it. An executable
come with sophisticated scripts to analyze the voluminous statement carries far more weight. The Linux kernel’s
output. But beware—scripts will only notice what you tell lockdep facility [Cor06a, Ros11] therefore provides a
them to. My rcutorture scripts are a case in point: Early lockdep_assert_held() function that checks whether
versions of those scripts were quite satisfied with a test the specified lock is held. Of course, lockdep incurs
run in which RCU grace periods stalled indefinitely. This significant overhead, and thus might not be helpful in
of course resulted in the scripts being modified to detect production.
RCU grace-period stalls, but this does not change the fact An especially bad parallel-code something is unex-
that the scripts will only detect problems that I make them pected concurrent access to data. The Kernel Concurrency
detect. But note well that unless you have a solid design, Sanitizer (KCSAN) [Cor16a] uses existing markings such
you won’t know what your script should check for! as READ_ONCE() and WRITE_ONCE() to determine which
Another problem with tracing and especially with concurrent accesses deserve warning messages. KCSAN
printk() calls is that their overhead can rule out produc- has a significant false-positive rate, especially from the
tion use. In such cases, assertions can be helpful. viewpoint of developers thinking in terms of C as assembly
language with additional syntax. KCSAN therefore pro-
vides a data_race() construct to forgive known-benign
data races, and also the ASSERT_EXCLUSIVE_ACCESS()
and ASSERT_EXCLUSIVE_WRITER() assertions to explic-
itly check for data races [EMV+ 20a, EMV+ 20b].
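As a concrete illustration of the above pattern, the following user-mode sketch shows one way to build a WARN_ON_ONCE()-style assertion outside of the kernel. The WARN_ON_ONCE_USER() macro, the struct myqueue type, and its ->len invariant are invented here purely for illustration; the Linux kernel's real WARN_ON_ONCE() is implemented quite differently.

    #include <stdio.h>

    /* User-mode cousin of WARN_ON_ONCE(): complain at most once per call site. */
    #define WARN_ON_ONCE_USER(cond)                                     \
    ({                                                                  \
        static int warned__;                                            \
        int cond__ = !!(cond);                                          \
        if (cond__ && !warned__) {                                      \
            warned__ = 1; /* Racy, but worst case is extra output. */   \
            fprintf(stderr, "%s:%d: WARN_ON_ONCE(%s)\n",                \
                    __FILE__, __LINE__, #cond);                         \
        }                                                               \
        cond__;                                                         \
    })

    /* Hypothetical queue whose ->len field must never go negative. */
    struct myqueue {
        int len;
    };

    static void myqueue_pop(struct myqueue *q)
    {
        if (WARN_ON_ONCE_USER(q->len <= 0))
            return; /* Refuse to make a bad situation worse. */
        q->len--;
        /* Actual dequeue operation omitted. */
    }

Note that the statement-expression syntax is a GCC extension and that the check of warned__ is unsynchronized, so a production-quality version might instead use C11 atomics, at the cost of a heavier-weight fastpath.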
So what can be done in cases where checking is neces-
5 http://www.brendangregg.com/blog/ sary, but where the overhead of runtime checking cannot


be tolerated? One approach is static analysis, which is 11.5 Code Review


discussed in the next section.
If a man speaks of my virtues, he steals from me; if
he speaks of my vices, then he is my teacher.
11.4 Static Analysis Chinese proverb

A lot of automation isn’t a replacement of humans


Code review is a special case of static analysis with human
but of mind-numbing behavior. beings doing the analysis. This section covers inspection,
walkthroughs, and self-inspection.
Summarized from Stewart Butterfield

11.5.1 Inspection
Static analysis is a validation technique where one program
takes a second program as input, reporting errors and vul- Traditionally, formal code inspections take place in face-
nerabilities located in this second program. Interestingly to-face meetings with formally defined roles: Moderator,
enough, almost all programs are statically analyzed by developer, and one or two other participants. The devel-
their compilers or interpreters. These tools are far from oper reads through the code, explaining what it is doing
perfect, but their ability to locate errors has improved and why it works. The one or two other participants ask
immensely over the past few decades, in part because they questions and raise issues, hopefully exposing the author’s
now have much more than 64K bytes of memory in which invalid assumptions, while the moderator’s job is to re-
to carry out their analyses. solve any resulting conflicts and take notes. This process
The original UNIX lint tool [Joh77] was quite useful, can be extremely effective at locating bugs, particularly if
though much of its functionality has since been incorpo- all of the participants are familiar with the code at hand.
rated into C compilers. There are nevertheless lint-like However, this face-to-face formal procedure does not
tools in use to this day. The sparse static analyzer [Cor04b] necessarily work well in the global Linux kernel com-
finds higher-level issues in the Linux kernel, including: munity. Instead, individuals review code separately and
provide comments via email or IRC. The note-taking
is provided by email archives or IRC logs, and modera-
1. Misuse of pointers to user-space structures. tors volunteer their services as required by the occasional
flamewar. This process also works reasonably well, par-
ticularly if all of the participants are familiar with the
2. Assignments from too-long constants.
code at hand. In fact, one advantage of the Linux kernel
community approach over traditional formal inspections
3. Empty switch statements. is the greater probability of contributions from people not
familiar with the code, who might not be blinded by the
4. Mismatched lock acquisition and release primitives. author’s invalid assumptions, and who might also test the
code.
Quick Quiz 11.7: Just what invalid assumptions are you
5. Misuse of per-CPU primitives. accusing Linux kernel hackers of harboring???

It is quite likely that the Linux kernel community’s


6. Use of RCU primitives on non-RCU pointers and
review process is ripe for improvement:
vice versa.
1. There is sometimes a shortage of people with the
time and expertise required to carry out an effective
Although it is likely that compilers will continue to
review.
increase their static-analysis capabilities, the sparse static
analyzer demonstrates the benefits of static analysis out- 2. Even though all review discussions are archived, they
side of the compiler, particularly for finding application- are often “lost” in the sense that insights are forgotten
specific bugs. Sections 12.4–12.5 describe more sophisti- and people fail to look up the discussions. This can
cated forms of static analysis. result in re-insertion of the same old bugs.


3. It is sometimes difficult to resolve flamewars when where there is no reasonable alternative. For example, the
they do break out, especially when the combatants developer might be the only person authorized to look
have disjoint goals, experience, and vocabulary. at the code, other qualified developers might all be too
busy, or the code in question might be sufficiently bizarre
Perhaps some of the needed improvements will be that the developer is unable to convince anyone else to
provided by continuous-integration-style testing, but there take it seriously until after demonstrating a prototype. In
are many bugs more easily found by review than by testing. these cases, the following procedure can be quite helpful,
When reviewing, therefore, it is worthwhile to look at especially for complex parallel code:
relevant documentation in commit logs, bug reports, and
LWN articles. This documentation can help you quickly 1. Write design document with requirements, diagrams
build up the required expertise. for data structures, and rationale for design choices.

2. Consult with experts, updating the design document


11.5.2 Walkthroughs as needed.
A traditional code walkthrough is similar to a formal
3. Write the code in pen on paper, correcting errors as
inspection, except that the group “plays computer” with the
you go. Resist the temptation to refer to pre-existing
code, driven by specific test cases. A typical walkthrough
nearly identical code sequences, instead, copy them.
team has a moderator, a secretary (who records bugs
found), a testing expert (who generates the test cases) 4. At each step, articulate and question your assump-
and perhaps one to two others. These can be extremely tions, inserting assertions or constructing tests to
effective, albeit also extremely time-consuming. check them.
It has been some decades since I have participated in
a formal walkthrough, and I suspect that a present-day 5. If there were errors, copy the code in pen on fresh
walkthrough would use single-stepping debuggers. One paper, correcting errors as you go. Repeat until the
could imagine a particularly sadistic procedure as follows: last two copies are identical.

1. The tester presents the test case. 6. Produce proofs of correctness for any non-obvious
code.
2. The moderator starts the code under a debugger,
using the specified test case as input. 7. Use a source-code control system. Commit early;
commit often.
3. Before each statement is executed, the developer is
required to predict the outcome of the statement and 8. Test the code fragments from the bottom up.
explain why this outcome is correct.
9. When all the code is integrated (but preferably be-
4. If the outcome differs from that predicted by the fore), do full-up functional and stress testing.
developer, this is taken as a potential bug.
10. Once the code passes all tests, write code-level doc-
5. In parallel code, a “concurrency shark” asks what umentation, perhaps as an extension to the design
code might execute concurrently with this code, and document discussed above. Fix both the code and
why such concurrency is harmless. the test code as needed.

Sadistic, certainly. Effective? Maybe. If the partic- When I follow this procedure for new RCU code, there
ipants have a good understanding of the requirements, are normally only a few bugs left at the end. With a few
software tools, data structures, and algorithms, then walk- prominent (and embarrassing) exceptions [McK11a], I
throughs can be extremely effective. If not, walkthroughs usually manage to locate these bugs before others do. That
are often a waste of time. said, this is getting more difficult over time as the number
and variety of Linux-kernel users increases.
11.5.3 Self-Inspection Quick Quiz 11.8: Why would anyone bother copying exist-
ing code in pen on paper??? Doesn’t that just increase the
Although developers are usually not all that effective at probability of transcription errors?
inspecting their own code, there are a number of situations


Quick Quiz 11.9: This procedure is ridiculously over- 5. Make extremely disciplined use of parallel-
engineered! How can you expect to get a reasonable amount programming primitives, so that the resulting code
of software written doing it this way??? is easily seen to be correct. But beware: It is always
tempting to break the rules “just a little bit” to gain
Quick Quiz 11.10: What do you do if, after all the pen-on- better performance or scalability. Breaking the rules
paper copying, you find a bug while typing in the resulting often results in general breakage. That is, unless you
code? carefully do the paperwork described in this section.
The above procedure works well for new code, but
what if you need to inspect code that you have already But the sad fact is that even if you do the paperwork
written? You can of course apply the above procedure or use one of the above ways to more-or-less safely avoid
for old code in the special case where you wrote one to paperwork, there will be bugs. If nothing else, more users
throw away [FPB79], but the following approach can also and a greater variety of users will expose more bugs more
be helpful in less desperate circumstances: quickly, especially if those users are doing things that the
original developers did not consider. The next section
1. Using your favorite documentation tool (LATEX, describes how to handle the probabilistic bugs that occur
HTML, OpenOffice, or straight ASCII), describe all too commonly when validating parallel software.
the high-level design of the code in question. Use
Quick Quiz 11.11: Wait! Why on earth would an abstract
lots of diagrams to illustrate the data structures and piece of software fail only sometimes???
how these structures are updated.
2. Make a copy of the code, stripping away all com-
ments.
11.6 Probability and Heisenbugs
3. Document what the code does statement by statement.
4. Fix bugs as you find them. With both heisenbugs and impressionist art, the
closer you get, the less you see.
This works because describing the code in detail is
Unknown
an excellent way to spot bugs [Mye79]. This second
procedure is also a good way to get your head around
So your parallel program fails sometimes. But you used
someone else’s code, although the first step often suffices.
techniques from the earlier sections to locate the problem
Although review and inspection by others is probably
and now have a fix in place! Congratulations!!!
more efficient and effective, the above procedures can be
Now the question is just how much testing is required
quite helpful in cases where for whatever reason it is not
in order to be certain that you actually fixed the bug, as
feasible to involve others.
opposed to just reducing the probability of it occurring on
At this point, you might be wondering how to write par-
the one hand, having fixed only one of several related bugs
allel code without having to do all this boring paperwork.
on the other hand, or made some ineffectual unrelated
Here are some time-tested ways of accomplishing this:
change on yet a third hand. In short, what is the answer to
1. Write a sequential program that scales through use the eternal question posed by Figure 11.3?
of available parallel library functions. Unfortunately, the honest answer is that an infinite
amount of testing is required to attain absolute certainty.
2. Write sequential plug-ins for a parallel framework,
such as map-reduce, BOINC, or a web-application Quick Quiz 11.12: Suppose that you had a very large number
server. of systems at your disposal. For example, at current cloud
prices, you can purchase a huge amount of CPU time at low
3. Fully partition your problems, then implement se- cost. Why not use this approach to get close enough to certainty
quential program(s) that run in parallel without com- for all practical purposes?
munication.
But suppose that we are willing to give up absolute
4. Stick to one of the application areas (such as linear certainty in favor of high probability. Then we can bring
algebra) where tools can automatically decompose powerful statistical tools to bear on this problem. However,
and parallelize the problem. this section will focus on simple statistical tools. These


11.6.1 Statistics for Discrete Testing


Suppose a bug has a 10 % chance of occurring in a given
run and that we do five runs. How do we compute the
probability of at least one run failing? Here is one way:
1. Compute the probability of a given run succeeding,
which is 90 %.
2. Compute the probability of all five runs succeeding,
which is 0.9 raised to the fifth power, or about 59 %.
3. Because either all five runs succeed, or at least one
fails, subtract the 59 % expected success rate from
100 %, yielding a 41 % expected failure rate.
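Readers who would rather let the computer do this arithmetic might use a quick sketch along the following lines, which simply re-derives the numbers in the list above for an assumed 10 % per-run failure probability and five runs.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double f = 0.1;  /* Assumed per-run failure probability. */
        int n = 5;       /* Number of test runs. */
        double s = pow(1.0 - f, n);  /* Probability that all n runs succeed. */

        printf("P(all %d runs succeed)   = %4.1f%%\n", n, 100.0 * s);
        printf("P(at least one failure) = %4.1f%%\n", 100.0 * (1.0 - s));
        return 0;
    }

Linking against the math library (for example, with gcc's -lm flag) and running this should print roughly 59.0 % and 41.0 %.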
For those preferring formulas, call the probability of
a single failure 𝑓 . The probability of a single success
is then 1 − 𝑓 and the probability that all of 𝑛 tests will
Figure 11.3: Passed on Merits? Or Dumb Luck? succeed is 𝑆 𝑛 :

𝑆𝑛 = (1 − 𝑓 )^𝑛    (11.1)
tools are extremely helpful, but please note that reading
The probability of failure is 1 − 𝑆 𝑛 , or:
this section is not a substitute for statistics classes.6
For our start with simple statistical tools, we need to
𝐹𝑛 = 1 − (1 − 𝑓 )^𝑛    (11.2)
decide whether we are doing discrete or continuous testing.
Discrete testing features well-defined individual test runs. Quick Quiz 11.13: Say what??? When I plug the earlier five-
For example, a boot-up test of a Linux kernel patch is an test 10 %-failure-rate example into the formula, I get 59,050 %
example of a discrete test: The kernel either comes up or it and that just doesn’t make sense!!!
does not. Although you might spend an hour boot-testing
your kernel, the number of times you attempted to boot So suppose that a given test has been failing 10 % of
the kernel and the number of times the boot-up succeeded the time. How many times do you have to run the test to
would often be of more interest than the length of time be 99 % sure that your supposed fix actually helped?
you spent testing. Functional tests tend to be discrete. Another way to ask this question is “How many times
On the other hand, if my patch involved RCU, I would would we need to run the test to cause the probability of
probably run rcutorture, which is a kernel module that, failure to rise above 99 %?” After all, if we were to run
strangely enough, tests RCU. Unlike booting the ker- the test enough times that the probability of seeing at least
nel, where the appearance of a login prompt signals the one failure becomes 99 %, if there are no failures, there is
successful end of a discrete test, rcutorture will happily only 1 % probability of this “success” being due to dumb
continue torturing RCU until either the kernel crashes or luck. And if we plug 𝑓 = 0.1 into Eq. 11.2 and vary 𝑛,
until you tell it to stop. The duration of the rcutorture test we find that 43 runs gives us a 98.92 % chance of at least
is usually of more interest than the number of times you one test failing given the original 10 % per-test failure
started and stopped it. Therefore, rcutorture is an example rate, while 44 runs gives us a 99.03 % chance of at least
of a continuous test, a category that includes many stress one test failing. So if we run the test on our fix 44 times
tests. and see no failures, there is a 99 % probability that our fix
Statistics for discrete tests are simpler and more famil- really did help.
iar than those for continuous tests, and furthermore the But repeatedly plugging numbers into Eq. 11.2 can get
statistics for discrete tests can often be pressed into service tedious, so let’s solve for 𝑛:
for continuous tests, though with some loss of accuracy.
We therefore start with discrete tests. 𝐹𝑛 = 1 − (1 − 𝑓 ) 𝑛 (11.3)
6 Which 1 − 𝐹𝑛 = (1 − 𝑓 ) 𝑛 (11.4)
I most highly recommend. The few statistics courses I have
taken have provided value far beyond that of the time I spent on them. log (1 − 𝐹𝑛 ) = 𝑛 log (1 − 𝑓 ) (11.5)


[Figure 11.4 appears here: a log-scale plot of the number of runs required for 99 % confidence versus the per-run failure probability, as given by Eq. 11.6.]

An order of magnitude improvement from a 30 % failure rate would be a 3 % failure rate. Plugging these numbers into Eq. 11.6 yields:

𝑛 = log (1 − 0.99) / log (1 − 0.03) = 151.2    (11.7)

So our order of magnitude improvement requires roughly an order of magnitude more testing. Certainty is impossible, and high probabilities are quite expensive. This is why making tests run more quickly and making failures more probable are essential skills in the development of highly reliable software. These skills will be covered in Section 11.6.4.

Figure 11.4: Number of Tests Required for 99 Percent


11.6.2 Statistics Abuse for Discrete Testing
Confidence Given Failure Rate
But suppose that you have a continuous test that fails about
three times every ten hours, and that you fix the bug that
Finally the number of tests required is given by: you believe was causing the failure. How long do you
have to run this test without failure to be 99 % certain that
𝑛 = log (1 − 𝐹𝑛 ) / log (1 − 𝑓 )    (11.6)
you reduced the probability of failure?
Without doing excessive violence to statistics, we could
Plugging 𝑓 = 0.1 and 𝐹𝑛 = 0.99 into Eq. 11.6 gives simply redefine a one-hour run to be a discrete test that
43.7, meaning that we need 44 consecutive successful test has a 30 % probability of failure. Then the results of in
runs to be 99 % certain that our fix was a real improvement. the previous section tell us that if the test runs for 13 hours
This matches the number obtained by the previous method, without failure, there is a 99 % probability that our fix
which is reassuring. actually improved the program’s reliability.
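Alternatively, a few lines of C can evaluate Eq. 11.6 directly. The runs_needed() helper and the example failure rates below are purely illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Number of consecutive successful runs needed for confidence Fn,
     * given per-run failure probability f, as in Eq. 11.6. */
    static double runs_needed(double f, double Fn)
    {
        return log(1.0 - Fn) / log(1.0 - f);
    }

    int main(void)
    {
        printf("10%% failure rate: %.1f runs\n", runs_needed(0.10, 0.99));
        printf(" 1%% failure rate: %.1f runs\n", runs_needed(0.01, 0.99));
        return 0;
    }

Rounding the first result up gives the 44 consecutive successful runs called out above.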
Quick Quiz 11.14: In Eq. 11.6, are the logarithms base-10, A dogmatic statistician might not approve of this ap-
base-2, or base-e? proach, but the sad fact is that the errors introduced by this
sort of statistical abuse are usually quite small compared
Figure 11.4 shows a plot of this function. Not surpris- to the errors in your failure-rate estimates. Nevertheless,
ingly, the less frequently each test run fails, the more test the next section takes a more rigorous approach.
runs are required to be 99 % confident that the bug has
been fixed. If the bug caused the test to fail only 1 % of
the time, then a mind-boggling 458 test runs are required.
11.6.3 Statistics for Continuous Testing
As the failure probability decreases, the number of test The fundamental formula for failure probabilities is the
runs required increases, going to infinity as the failure Poisson distribution:
probability goes to zero.
The moral of this story is that when you have found a
𝐹𝑚 = (𝜆^𝑚 / 𝑚!) e^(−𝜆)    (11.8)
rarely occurring bug, your testing job will be much easier
if you can come up with a carefully targeted test with a Here 𝐹𝑚 is the probability of 𝑚 failures in the test and
much higher failure rate. For example, if your targeted test 𝜆 is the expected failure rate per unit time. A rigorous
raised the failure rate from 1 % to 30 %, then the number derivation may be found in any advanced probability
of runs required for 99 % confidence would drop from textbook, for example, Feller’s classic “An Introduction to
458 to a more tractable 13. Probability Theory and Its Applications” [Fel50], while a
But these thirteen test runs would only give you 99 % more intuitive derivation may be found in the first edition
confidence that your fix had produced “some improve- of this book [McK14c, Equations 11.8–11.26].
ment”. Suppose you instead want to have 99 % confidence Let’s try reworking the example from Section 11.6.2
that your fix reduced the failure rate by an order of magni- using the Poisson distribution. Recall that this example
tude. How many failure-free test runs are required? involved a test with a 30 % failure rate per hour, and that


the question was how long the test would need to run Here 𝑚 is the actual number of errors in the long test
error-free on an alleged fix to be 99 % certain that the fix run (in this case, two) and 𝜆 is the expected number of errors
actually reduced the failure rate. In this case, 𝑚 is zero, in the long test run (in this case, 24). Plugging 𝑚 = 2 and
so that Eq. 11.8 reduces to: 𝜆 = 24 into this expression gives the probability of two
or fewer failures as about 1.2 × 10^(−8), in other words, we
𝐹0 = e^(−𝜆)    (11.9) have a high level of confidence that the fix actually had
some relationship to the bug.7
Solving this requires setting 𝐹0 to 0.01 and solving for
𝜆, resulting in: Quick Quiz 11.16: Doing the summation of all the factorials
and exponentials is a real pain. Isn’t there an easier way?
𝜆 = − ln 0.01 = 4.6 (11.10)
Quick Quiz 11.17: But wait!!! Given that there has to be
Because we get 0.3 failures per hour, the number of some number of failures (including the possibility of zero
hours required is 4.6/0.3 = 15.3, which is within 20 % of failures), shouldn't Eq. 11.13 approach the value 1 as 𝑚 goes
the 13 hours calculated using the method in Section 11.6.2. to infinity?
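Again, a short sketch can do this arithmetic for you. The errorfree_time() helper below simply encodes T = −(1/n) ln((100 − P)/100), which is given below as Eq. 11.11, and the example numbers match the 0.3-failures-per-hour scenario discussed above.

    #include <math.h>
    #include <stdio.h>

    /* Error-free test time needed to be P percent confident that a fix
     * reduced a failure rate of n failures per unit time (Eq. 11.11). */
    static double errorfree_time(double n, double P)
    {
        return -log((100.0 - P) / 100.0) / n;
    }

    int main(void)
    {
        /* 0.3 failures per hour and 99% confidence. */
        printf("%.1f hours\n", errorfree_time(0.3, 99.0));
        return 0;
    }

With these inputs it prints roughly 15 hours, in reasonable agreement with the 13 hours obtained from the simpler method.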
Given that you normally won’t know your failure rate to
The Poisson distribution is a powerful tool for analyzing
anywhere near 10 %, the simpler method described in
test results, but the fact is that in this last example there
Section 11.6.2 is almost always good and sufficient.
were still two remaining test failures in a 24-hour test run.
More generally, if we have 𝑛 failures per unit time, and
Such a low failure rate results in very long test runs. The
we want to be 𝑃 % certain that a fix reduced the failure
next section discusses counter-intuitive ways of improving
rate, we can use the following formula:
this situation.
1 100 − 𝑃
𝑇 = − ln (11.11)
𝑛 100 11.6.4 Hunting Heisenbugs
Quick Quiz 11.15: Suppose that a bug causes a test failure This line of thought also helps explain heisenbugs: Adding
three times per hour on average. How long must the test run tracing and assertions can easily reduce the probability of a
error-free to provide 99.9 % confidence that the fix significantly bug appearing, which is why extremely lightweight tracing
reduced the probability of failure? and assertion mechanisms are so critically important.
The term “heisenbug” was inspired by the Heisenberg
As before, the less frequently the bug occurs and the Uncertainty Principle from quantum physics, which states
greater the required level of confidence, the longer the that it is impossible to exactly quantify a given particle’s
required error-free test run. position and velocity at any given point in time [Hei27].
Suppose that a given test fails about once every hour, Any attempt to more accurately measure that particle’s
but after a bug fix, a 24-hour test run fails only twice. position will result in increased uncertainty of its velocity
Assuming that the failure leading to the bug is a random and vice versa. Similarly, attempts to track down the
occurrence, what is the probability that the small number heisenbug causes its symptoms to radically change or
of failures in the second run was due to random chance? even disappear completely.8 Of course, adding debug-
In other words, how confident should we be that the fix ging overhead can and sometimes does make bugs more
actually had some effect on the bug? This probability may probable. But developers are more likely to remember
be calculated by summing Eq. 11.8 as follows: the frustration of a disappearing heisenbug than the joy
inspired by the bug becoming more easily reproduced!
𝑚 If the field of physics inspired the name of this problem,
∑︁ 𝜆𝑖 it is only fair that the field of physics should inspire
𝐹0 + 𝐹1 + · · · + 𝐹𝑚−1 + 𝐹𝑚 = e−𝜆 (11.12)
𝑖=0
𝑖! the solution. Fortunately, particle physics is up to the
task: Why not create an anti-heisenbug to annihilate the
This is the Poisson cumulative distribution function,
which can be written more compactly as: 7 Of course, this result in no way excuses you from finding and fixing

the bug(s) resulting in the remaining two failures!


8 The term “heisenbug” is a misnomer, as most heisenbugs are fully
𝐹𝑖≤𝑚 = Σ_{𝑖=0}^{𝑚} (𝜆^𝑖 / 𝑖!) e^(−𝜆)    (11.13) explained by the observer effect from classical physics. Nevertheless,
the name has stuck.


heisenbug? Or, perhaps more accurately, to annihilate the some types of race conditions more probable. One way
heisen-ness of the heisenbug? Although producing an of getting a similar effect today is to test on multi-socket
anti-heisenbug for a given heisenbug is more an art than a systems, thus incurring the large delays described in
science, the following sections describe a number of ways Section 3.2.
to do just that: However you choose to add delays, you can then look
more intensively at the code implicated by those delays
1. Add delay to race-prone regions (Section 11.6.4.1). that make the greatest difference in failure rate. It might
2. Increase workload intensity (Section 11.6.4.2). be helpful to test that code in isolation, for example.
One important aspect of software configuration is the
3. Isolate suspicious subsystems (Section 11.6.4.3). history of changes, which is why git bisect is so useful.
Bisection of the change history can provide very valuable
4. Simulate unusual events (Section 11.6.4.4). clues as to the nature of the heisenbug, in this case
5. Count near misses (Section 11.6.4.5). presumably by locating a commit that shows a change in
the software’s response to the addition or removal of a
These are followed by discussion in Section 11.6.4.6. given delay.
Quick Quiz 11.19: But I did the bisection, and ended up
11.6.4.1 Add Delay with a huge commit. What do I do now?
Consider the count-lossy code in Section 5.1. Adding Once you locate the suspicious section of code, you can
printf() statements will likely greatly reduce or even then introduce delays to attempt to increase the probability
eliminate the lost counts. However, converting the load- of failure. As we have seen, increasing the probability of
add-store sequence to a load-add-delay-store sequence failure makes it much easier to gain high confidence in
will greatly increase the incidence of lost counts (try it!). the corresponding fix.
Once you spot a bug involving a race condition, it is However, it is sometimes quite difficult to track down
frequently possible to create an anti-heisenbug by adding the problem using normal debugging techniques. The
delay in this manner. following sections present some other alternatives.
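The following stand-alone toy program illustrates the load-add-delay-store trick. It is not the actual count-lossy code from Section 5.1, but rather a hypothetical two-thread sketch whose poll() call widens the race window so much that lost counts become almost certain.

    #include <poll.h>
    #include <pthread.h>
    #include <stdio.h>

    static unsigned long counter;  /* Intentionally unsynchronized. */

    static void *inc_loop(void *arg)
    {
        for (int i = 0; i < 1000; i++) {
            unsigned long tmp = counter;  /* Load. */

            poll(NULL, 0, 1);   /* Injected delay: the anti-heisenbug. */
            counter = tmp + 1;  /* Store, possibly overwriting another thread's update. */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&tid[i], NULL, inc_loop, NULL);
        for (int i = 0; i < 2; i++)
            pthread_join(tid[i], NULL);
        printf("Expected 2000, got %lu\n", counter);  /* Almost always far less. */
        return 0;
    }

Removing the poll() call typically makes the losses far less frequent, which is exactly the heisenbug behavior that the injected delay is intended to counteract.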
Of course, this begs the question of how to find the
race condition in the first place. Although very lucky
developers might accidentally create delay-based anti- 11.6.4.2 Increase Workload Intensity
heisenbugs when adding debug code, this is in general a
It is often the case that a given test suite places relatively
dark art. Nevertheless, there are a number of things you
low stress on a given subsystem, so that a small change
can do to find your race conditions.
in timing can cause a heisenbug to disappear. One way
One approach is to recognize that race conditions of-
to create an anti-heisenbug for this case is to increase the
ten end up corrupting some of the data involved in the
workload intensity, which has a good chance of increasing
race. It is therefore good practice to double-check the
the bug’s probability. If the probability is increased suffi-
synchronization of any corrupted data. Even if you cannot
ciently, it may be possible to add lightweight diagnostics
immediately recognize the race condition, adding delay be-
such as tracing without causing the bug to vanish.
fore and after accesses to the corrupted data might change
How can you increase the workload intensity? This
the failure rate. By adding and removing the delays in an
depends on the program, but here are some things to try:
organized fashion (e.g., binary search), you might learn
more about the workings of the race condition. 1. Add more CPUs.
Quick Quiz 11.18: How is this approach supposed to help
if the corruption affected some unrelated pointer, which then 2. If the program uses networking, add more network
caused the corruption??? adapters and more or faster remote systems.

Another important approach is to vary the software and 3. If the program is doing heavy I/O when the problem
hardware configuration and look for statistically significant occurs, either (1) add more storage devices, (2) use
differences in failure rate. For example, back in the 1990s, faster storage devices, for example, substitute SSDs
it was common practice to test on systems having CPUs for disks, or (3) use a RAM-based filesystem to
running at different clock rates, which tended to make substitute main memory for mass storage.


4. Change the size of the problem, for example, if


doing a parallel matrix multiply, change the size of call_rcu()

the matrix. Larger problems may introduce more


complexity, but smaller problems often increase the
Grace-Period Start

Near Miss
level of contention. If you aren’t sure whether you

Reader

Reader Error
should go large or go small, just try both.

Time
Grace-Period End

However, it is often the case that the bug is in a specific


subsystem, and the structure of the program limits the
amount of stress that can be applied to that subsystem.
The next section addresses this situation. Callback Invocation

11.6.4.3 Isolate Suspicious Subsystems


Figure 11.5: RCU Errors and Near Misses
If the program is structured such that it is difficult or
impossible to apply much stress to a subsystem that is
under suspicion, a useful anti-heisenbug is a stress test 11.6.4.5 Count Near Misses
that tests that subsystem in isolation. The Linux kernel’s Bugs are often all-or-nothing things, so that a bug either
rcutorture module takes exactly this approach with RCU: happens or not, with nothing in between. However, it is
Applying more stress to RCU than is feasible in a produc- sometimes possible to define a near miss where the bug
tion environment increases the probability that RCU bugs does not result in a failure, but has likely manifested. For
will be found during testing rather than in production.9 example, suppose your code is making a robot walk. The
In fact, when creating a parallel program, it is wise robot’s falling down constitutes a bug in your program,
to stress-test the components separately. Creating such but stumbling and recovering might constitute a near miss.
component-level stress tests can seem like a waste of time, If the robot falls over only once per hour, but stumbles
but a little bit of component-level testing can save a huge every few minutes, you might be able to speed up your
amount of system-level debugging. debugging progress by counting the number of stumbles
in addition to the number of falls.
11.6.4.4 Simulate Unusual Events In concurrent programs, timestamping can sometimes
Heisenbugs are sometimes due to unusual events, such as be used to detect near misses. For example, locking
memory-allocation failure, conditional-lock-acquisition primitives incur significant delays, so if there is a too-
failure, CPU-hotplug operations, timeouts, packet losses, short delay between a pair of operations that are supposed
and so on. One way to construct an anti-heisenbug for to be protected by different acquisitions of the same lock,
this class of heisenbug is to introduce spurious failures. this too-short delay might be counted as a near miss.10
For example, instead of invoking malloc() directly, For example, a low-probability bug in RCU priority
invoke a wrapper function that uses a random number boosting occurred roughly once every hundred hours of
to decide whether to return NULL unconditionally on the focused rcutorture testing. Because it would take almost
one hand, or to actually invoke malloc() and return the 500 hours of failure-free testing to be 99 % certain that the
resulting pointer on the other. Inducing spurious failures bug’s probability had been significantly reduced, the git
is an excellent way to bake robustness into sequential bisect process to find the failure would be painfully
programs as well as parallel programs. slow—or would require an extremely large test farm.
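For example, a user-mode fault-injection wrapper might look something like the following. The malloc_maybe_fail() name and the one-in-eight failure probability are arbitrary choices made purely for illustration.

    #include <stdlib.h>

    /* Return nonzero roughly one time in eight. */
    static int should_fail(void)
    {
        return (random() & 0x7) == 0;
    }

    /* Drop-in replacement for direct malloc() calls during testing. */
    static void *malloc_maybe_fail(size_t size)
    {
        if (should_fail())
            return NULL;  /* Spurious failure: does the caller cope? */
        return malloc(size);
    }

Compiling such a wrapper in only for test builds, for example under an #ifdef, keeps the spurious failures out of production.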
Fortunately, the RCU operation being tested included
Quick Quiz 11.20: Why don’t conditional-locking primitives not only a wait for an RCU grace period, but also a
provide this spurious-failure functionality? previous wait for the grace period to start and a subsequent
wait for an RCU callback to be invoked after completion
of the RCU grace period. This distinction between an

10 Of course, in this case, you might be better off using whatever

lock_held() primitive is available in your environment. If there isn’t


9 Though sadly not increased to probability one. a lock_held() primitive, create one!


rcutorture error and near miss is shown in Figure 11.5.
To qualify as a full-fledged error, an RCU read-side critical
section must extend from the call_rcu() that initiated a
grace period, through the remainder of the previous grace
period, through the entirety of the grace period initiated
by the call_rcu() (denoted by the region between the
jagged lines), and through the delay from the end of that
grace period to the callback invocation, as indicated by
the "Error" arrow. However, the formal definition of RCU
prohibits RCU read-side critical sections from extending
across a single grace period, as indicated by the "Near
Miss" arrow. This suggests using near misses as the
error condition, however, this can be problematic because
different CPUs can have different opinions as to exactly
where a given grace period starts and ends, as indicated
by the jagged lines.11 Using the near misses as the error
condition could therefore result in false positives, which
need to be avoided in the automated rcutorture testing.

By sheer dumb luck, rcutorture happens to include
some statistics that are sensitive to the near-miss version
of the grace period. As noted above, these statistics are
subject to false positives due to their unsynchronized
access to RCU's state variables, but these false positives
turn out to be extremely rare on strongly ordered systems
such as the IBM mainframe and x86, occurring less than
once per thousand hours of testing.

These near misses occurred roughly once per hour,
about two orders of magnitude more frequently than the
actual errors. Use of these near misses allowed the bug's
root cause to be identified in less than a week and a high
degree of confidence in the fix to be built in less than a
day. In contrast, excluding the near misses in favor of
the real errors would have required months of debug and
validation time.

To sum up near-miss counting, the general approach is
to replace counting of infrequent failures with more-frequent
near misses that are believed to be correlated with
those failures. These near-misses can be considered an
anti-heisenbug to the real failure's heisenbug because the
near-misses, being more frequent, are likely to be more
robust in the face of changes to your code, for example,
the changes you make to add debugging code.

11.6.4.6 Heisenbug Discussion

The alert reader might have noticed that this section was
fuzzy and qualitative, in stark contrast to the precise
mathematics of Sections 11.6.1, 11.6.2, and 11.6.3. If you
love precision and mathematics, you may be disappointed
to learn that the situations to which this section applies
are far more common than those to which the preceding
sections apply.

In fact, the common case is that although you might
have reason to believe that your code has bugs, you have
no idea what those bugs are, what causes them, how
likely they are to appear, or what conditions affect their
probability of appearance. In this all-too-common case,
statistics cannot help you.12 That is to say, statistics cannot
help you directly. But statistics can be of great indirect
help—if you have the humility required to admit that you
make mistakes, that you can reduce the probability of
these mistakes (for example, by getting enough sleep), and
that the number and type of mistakes you made in the past
is indicative of the number and type of mistakes that you
are likely to make in the future. For example, I have a
deplorable tendency to forget to write a small but critical
portion of the initialization code, and frequently get most
or even all of a parallel program correct—except for a
stupid omission in initialization. Once I was willing to
admit to myself that I am prone to this type of mistake, it
was easier (but not easy!) to force myself to double-check
my initialization code. Doing this allowed me to find
numerous bugs ahead of time.

When your quick bug hunt morphs into a long-term
quest, it is important to log everything you have tried and
what happened. In the common case where the software
is changing during the course of your quest, make sure
to record the exact version of the software to which each
log entry applies. From time to time, reread the entire log
in order to make connections between clues encountered
at different times. Such rereading is especially important
upon encountering a surprising test result, for example, I
reread my log upon realizing that what I thought was a
failure of the hypervisor to schedule a vCPU was instead
an interrupt storm preventing that vCPU from making
forward progress on the interrupted code. If the code you
are debugging is new to you, this log is also an excellent
place to document the relationships between code and data
structures. Keeping a log when you are furiously chasing
a difficult bug might seem like needless paperwork, but it
has on many occasions saved me from debugging around
and around in circles, which can waste far more time than
keeping a log ever could.

11 In real life, these lines can be much more jagged because idle
CPUs can be completely unaware of a great many recent grace periods.
12 Although if you know what your program is supposed to do and
if your program is small enough (both less likely than you might think),
then the formal-verification tools described in Chapter 12 can be helpful.

Using Taleb’s nomenclature [Tal07], a white swan Quick Quiz 11.22: But if you are going to put in all the hard
is a bug that we can reproduce. We can run a large work of parallelizing an application, why not do it right? Why
number of tests, use ordinary statistics to estimate the settle for anything less than optimal performance and linear
bug’s probability, and use ordinary statistics again to scalability?
estimate our confidence in a proposed fix. An unsuspected
Validating a parallel program must therfore include
bug is a black swan. We know nothing about it, we have
validating its performance. But validating performance
no tests that have yet caused it to happen, and statistics
means having a workload to run and performance criteria
is of no help. Studying our own behavior, especially the
with which to evaluate the program at hand. These needs
number and types of mistakes we make, can turn black
are often met by performance benchmarks, which are
swans into grey swans. We might not know exactly what
discussed in the next section.
the bugs are, but we have some idea of their number and
maybe also of their type. Ordinary statistics is still of no
help (at least not until we are able to reproduce one of 11.7.1 Benchmarking
the bugs), but robust13 testing methods can be of great
Frequent abuse aside, benchmarks are both useful and
help. The goal, therefore, is to use experience and good
heavily used, so it is not helpful to be too dismissive of
validation practices to turn the black swans grey, focused
them. Benchmarks span the range from ad hoc test jigs
testing and analysis to turn the grey swans white, and
to international standards, but regardless of their level of
ordinary methods to fix the white swans.
formality, benchmarks serve four major purposes:
That said, thus far, we have focused solely on bugs in the
parallel program’s functionality. However, performance is 1. Providing a fair framework for comparing competing
a first-class requirement for a parallel program. Otherwise, implementations.
why not write a sequential program? To repurpose Kipling,
our goal when writing parallel code is to fill the unforgiving 2. Focusing competitive energy on improving imple-
second with sixty minutes worth of distance run. The next mentations in ways that matter to users.
section therefore discusses a number of performance bugs
3. Serving as example uses of the implementations
that would be happy to thwart this Kiplingesque goal.
being benchmarked.
4. Serving as a marketing tool to highlight your software
11.7 Performance Estimation against your competitors’ offerings.

Of course, the only completely fair framework is the in-


There are lies, damn lies, statistics, and benchmarks.
tended application itself. So why would anyone who cared
Unknown about fairness in benchmarking bother creating imperfect
benchmarks rather than simply using the application itself
Parallel programs usually have performance and scalability as the benchmark?
requirements, after all, if performance is not an issue, why Running the actual application is in fact the best ap-
not use a sequential program? Ultimate performance proach where it is practical. Unfortunately, it is often
and linear scalability might not be necessary, but there is impractical for the following reasons:
little use for a parallel program that runs slower than its
optimal sequential counterpart. And there really are cases 1. The application might be proprietary, and you might
where every microsecond matters and every nanosecond not have the right to run the intended application.
is needed. Therefore, for parallel programs, insufficient
2. The application might require more hardware than
performance is just as much a bug as is incorrectness.
you have access to.
Quick Quiz 11.21: That is ridiculous!!! After all, isn’t
getting the correct answer later than one would like better than 3. The application might use data that you cannot access,
getting an incorrect answer??? for example, due to privacy regulations.
4. The application might take longer than is convenient
to reproduce a performance or scalability problem.14
13 That is to say brutal. 14 Microbenchmarks can help, but please see Section 11.7.4.


Creating a benchmark that approximates the application
can help overcome these obstacles. A carefully constructed
benchmark can help promote performance, scalability,
energy efficiency, and much else besides. However, be
careful to avoid investing too much into the benchmarking
effort. It is after all important to invest at least a little into
the application itself [Gra91].

11.7.2 Profiling

In many cases, a fairly small portion of your software
is responsible for the majority of the performance and
scalability shortfall. However, developers are notoriously
unable to identify the actual bottlenecks by inspection.
For example, in the case of a kernel buffer allocator,
all attention focused on a search of a dense array which
turned out to represent only a few percent of the allocator's
execution time. An execution profile collected via a
logic analyzer focused attention on the cache misses
that were actually responsible for the majority of the
problem [MS93].

An old-school but quite effective method of tracking
down performance and scalability bugs is to run your
program under a debugger, then periodically interrupt it,
recording the stacks of all threads at each interruption.
The theory here is that if something is slowing down your
program, it has to be visible in your threads' executions.

That said, there are a number of tools that will usually
do a much better job of helping you to focus your attention
where it will do the most good. Two popular choices
are gprof and perf. To use perf on a single-process
program, prefix your command with perf record, then
after the command completes, type perf report. There
is a lot of work on tools for performance debugging
of multi-threaded programs, which should make this
important job easier. Again, one good starting point
is Brendan Gregg's blog.15

15 http://www.brendangregg.com/blog/

11.7.3 Differential Profiling

Scalability problems will not necessarily be apparent
unless you are running on very large systems. However,
it is sometimes possible to detect impending scalability
problems even when running on much smaller systems.
One technique for doing this is called differential profiling.

The idea is to run your workload under two different
sets of conditions. For example, you might run it on two
CPUs, then run it again on four CPUs. You might instead
vary the load placed on the system, the number of network
adapters, the number of mass-storage devices, and so on.
You then collect profiles of the two runs, and mathematically
combine corresponding profile measurements. For
example, if your main concern is scalability, you might
take the ratio of corresponding measurements, and then
sort the ratios into descending numerical order. The prime
scalability suspects will then be sorted to the top of the
list [McK95, McK99].

Some tools such as perf have built-in differential-profiling
support.
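As a minimal sketch of the ratio-and-sort step (this example is
not from the original text, and the file names and one-pair-per-line
format are hypothetical), suppose profile-2cpu.txt and
profile-4cpu.txt each contain one "samples function-name" pair
per line. Then the combination might be done as follows:

    awk 'NR == FNR { base[$2] = $1; next }
         ($2 in base) && base[$2] > 0 {
             printf "%g %s\n", $1 / base[$2], $2
         }' profile-2cpu.txt profile-4cpu.txt | sort -rn | head

The sort -rn places the largest ratios first, so the functions at
the top of this list are the first places to look for scalability
trouble.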


11.7.4 Microbenchmarking

Microbenchmarking can be useful when deciding which
algorithms or data structures are worth incorporating into
a larger body of software for deeper evaluation.

One common approach to microbenchmarking is to
measure the time, run some number of iterations of the
code under test, then measure the time again. The difference
between the two times divided by the number of
iterations gives the measured time required to execute the
code under test.

Unfortunately, this approach to measurement allows
any number of errors to creep in, including:

1. The measurement will include some of the overhead
of the time measurement. This source of error can
be reduced to an arbitrarily small value by increasing
the number of iterations.

2. The first few iterations of the test might incur cache
misses or (worse yet) page faults that might inflate
the measured value. This source of error can also be
reduced by increasing the number of iterations, or
it can often be eliminated entirely by running a few
warm-up iterations before starting the measurement
period. Most systems have ways of detecting whether
a given process incurred a page fault, and you should
make use of this to reject runs whose performance
has been thus impeded.

3. Some types of interference, for example, random
memory errors, are so rare that they can be dealt
with by running a number of sets of iterations of the
test. If the level of interference was statistically
significant, any performance outliers could be rejected
statistically.

4. Any iteration of the test might be interfered with
by other activity on the system. Sources of interference
include other applications, system utilities
and daemons, device interrupts, firmware interrupts
(including system management interrupts, or SMIs),
virtualization, memory errors, and much else besides.
Assuming that these sources of interference occur
randomly, their effect can be minimized by reducing
the number of iterations.

5. Thermal throttling can understate scalability because
increasing CPU activity increases heat generation,
and on systems without adequate cooling (most of
them!), this can result in the CPU frequency decreasing
as the number of CPUs increases.16 Of course, if
you are testing an application to evaluate its expected
behavior when run in production, such thermal throttling
is simply a fact of life. Otherwise, if you are
interested in theoretical scalability, use a system with
adequate cooling or reduce the CPU clock rate to a
level that the cooling system can handle.

16 Systems with adequate cooling tend to look like gaming systems.

The first and fourth sources of interference provide
conflicting advice, which is one sign that we are living
in the real world. The remainder of this section looks at
ways of resolving this conflict.

Quick Quiz 11.23: But what about other sources of error,
for example, due to interactions between caches and memory
layout?

The following sections discuss ways of dealing with
these measurement errors, with Section 11.7.5 covering
isolation techniques that may be used to prevent some
forms of interference, and with Section 11.7.6 covering
methods for detecting interference so as to reject
measurement data that might have been corrupted by that
interference.

11.7.5 Isolation

The Linux kernel provides a number of ways to isolate a
group of CPUs from outside interference.

First, let's look at interference by other processes,
threads, and tasks. The POSIX sched_setaffinity()
system call may be used to move most tasks off of a
given set of CPUs and to confine your tests to that same
group. The Linux-specific user-level taskset command
may be used for the same purpose, though both
sched_setaffinity() and taskset require elevated
permissions. Linux-specific control groups (cgroups) may be
used for this same purpose. This approach can be quite
effective at reducing interference, and is sufficient in many
cases. However, it does have limitations, for example, it
cannot do anything about the per-CPU kernel threads that
are often used for housekeeping tasks.

One way to avoid interference from per-CPU kernel
threads is to run your test at a high real-time priority, for
example, by using the POSIX sched_setscheduler()
system call. However, note that if you do this, you are
implicitly taking on responsibility for avoiding infinite loops,
because otherwise your test can prevent part of the kernel
from functioning. This is an example of the Spiderman
Principle: "With great power comes great responsibility."
And although the default real-time throttling settings often
address such problems, they might do so by causing your
real-time threads to miss their deadlines.

These approaches can greatly reduce, and perhaps even
eliminate, interference from processes, threads, and tasks.
However, they do nothing to prevent interference from
device interrupts, at least in the absence of threaded
interrupts. Linux allows some control of threaded interrupts
via the /proc/irq directory, which contains
numerical directories, one per interrupt vector. Each
numerical directory contains smp_affinity and
smp_affinity_list. Given sufficient permissions, you can
write a value to these files to restrict interrupts to the
specified set of CPUs. For example, either "echo 3
> /proc/irq/23/smp_affinity" or "echo 0-1 >
/proc/irq/23/smp_affinity_list" would confine
interrupts on vector 23 to CPUs 0 and 1, at least given
sufficient privileges. You can use "cat /proc/interrupts"
to obtain a list of the interrupt vectors on your system,
how many are handled by each CPU, and what devices
use each interrupt vector.

Running a similar command for all interrupt vectors on
your system would confine interrupts to CPUs 0 and 1,
leaving the remaining CPUs free of interference. Or
mostly free of interference, anyway. It turns out that
the scheduling-clock interrupt fires on each CPU that is
running in user mode.17 In addition you must take care to
ensure that the set of CPUs that you confine the interrupts
to is capable of handling the load.

17 Frederic Weisbecker leads up a NO_HZ_FULL adaptive-ticks project
that allows scheduling-clock interrupts to be disabled on CPUs that have
only one runnable task. As of 2021, this is largely complete.
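Pulling the process-related pieces of this together, the following
is a minimal C sketch (not from the original text) that confines the
current process to CPU 2, which is assumed to have been reserved for
the test, and then raises it to a real-time priority. It must be run
with sufficient privileges, and the test itself must be guaranteed to
terminate:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t mask;
	struct sched_param sp = { .sched_priority = 1 };

	/* Confine this process to CPU 2 (assumed reserved for the test). */
	CPU_ZERO(&mask);
	CPU_SET(2, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/*
	 * Run at real-time priority to avoid preemption by housekeeping
	 * threads.  The test loop below must be guaranteed to terminate!
	 */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}

	/* ... run the test here ... */

	return 0;
}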


But this only handles processes and interrupts running
in the same operating-system instance as the test. Suppose
that you are running the test in a guest OS that is itself
running on a hypervisor, for example, Linux running
KVM? Although you can in theory apply the same
techniques at the hypervisor level that you can at the
guest-OS level, it is quite common for hypervisor-level
operations to be restricted to authorized personnel. In
addition, none of these techniques work against
firmware-level interference.

Quick Quiz 11.24: Wouldn't the techniques suggested to
isolate the code under test also affect that code's performance,
particularly if it is running within a larger application?

Of course, if it is in fact the interference that is producing
the behavior of interest, you will instead need to promote
interference, in which case being unable to prevent it is
not a problem. But if you really do need interference-free
measurements, then instead of preventing the interference,
you might need to detect the interference as described in
the next section.

11.7.6 Detecting Interference

If you cannot prevent interference, perhaps you can detect
it and reject results from any affected test runs.
Section 11.7.6.1 describes methods of rejection involving
additional measurements, while Section 11.7.6.2 describes
statistics-based rejection.

11.7.6.1 Detecting Interference Via Measurement

Many systems, including Linux, provide means for
determining after the fact whether some forms of interference
have occurred. For example, process-based interference
results in context switches, which, on Linux-based systems,
are visible in /proc/<PID>/sched via the nr_switches
field. Similarly, interrupt-based interference
can be detected via the /proc/interrupts file.

Opening and reading files is not the way to low overhead,
and it is possible to get the count of context switches for a
given thread by using the getrusage() system call, as
shown in Listing 11.1. This same system call can be used
to detect minor page faults (ru_minflt) and major page
faults (ru_majflt).

Listing 11.1: Using getrusage() to Detect Context Switches
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 0 if test results should be rejected. */
int runtest(void)
{
	struct rusage ru1;
	struct rusage ru2;

	if (getrusage(RUSAGE_SELF, &ru1) != 0) {
		perror("getrusage");
		abort();
	}
	/* run test here. */
	if (getrusage(RUSAGE_SELF, &ru2) != 0) {
		perror("getrusage");
		abort();
	}
	/* Accept the run only if no voluntary or involuntary
	   context switches occurred during the test. */
	return (ru1.ru_nvcsw == ru2.ru_nvcsw &&
	        ru1.ru_nivcsw == ru2.ru_nivcsw);
}

Unfortunately, detecting memory errors and firmware
interference is quite system-specific, as is the detection of
interference due to virtualization. Although avoidance is
better than detection, and detection is better than statistics,
there are times when one must avail oneself of statistics, a
topic addressed in the next section.

11.7.6.2 Detecting Interference Via Statistics

Any statistical analysis will be based on assumptions about
the data, and performance microbenchmarks often support
the following assumptions:

1. Smaller measurements are more likely to be accurate
than larger measurements.

2. The measurement uncertainty of good data is known.

3. A reasonable fraction of the test runs will result in
good data.

The fact that smaller measurements are more likely
to be accurate than larger measurements suggests that
sorting the measurements in increasing order is likely to be
productive.18 The fact that the measurement uncertainty
is known allows us to accept measurements within this
uncertainty of each other: If the effects of interference are
large compared to this uncertainty, this will ease rejection
of bad data. Finally, the fact that some fraction (for
example, one third) can be assumed to be good allows
us to blindly accept the first portion of the sorted list,
and this data can then be used to gain an estimate of the
natural variation of the measured data, over and above the
assumed measurement error.

18 To paraphrase the old saying, "Sort first and ask questions later."

The approach is to take the specified number of leading
elements from the beginning of the sorted list, and use
these to estimate a typical inter-element delta, which in
turn may be multiplied by the number of elements in the
list to obtain an upper bound on permissible values. The
algorithm then repeatedly considers the next element of
the list. If it falls below the upper bound, and if the distance
between the next element and the previous element is not
too much greater than the average inter-element distance
for the portion of the list accepted thus far, then the next
element is accepted and the process repeats. Otherwise,
the remainder of the list is rejected.

Listing 11.2 shows a simple sh/awk script implementing
this notion. Input consists of an x-value followed by an
arbitrarily long list of y-values, and output consists of one
line for each input line, with fields as follows:

1. The x-value.

2. The average of the selected data.

3. The minimum of the selected data.

4. The maximum of the selected data.

5. The number of selected data items.

6. The number of input data items.

This script takes three optional arguments as follows:

--divisor: Number of segments to divide the list into,
for example, a divisor of four means that the first
quarter of the data elements will be assumed to be
good. This defaults to three.

--relerr: Relative measurement error. The script assumes
that values that differ by less than this error
are for all intents and purposes equal. This defaults
to 0.01, which is equivalent to 1 %.

--trendbreak: Ratio of inter-element spacing constituting
a break in the trend of the data. For example,
if the average spacing in the data accepted so far is
1.5, then if the trend-break ratio is 2.0, then if the
next data value differs from the last one by more than
3.0, this constitutes a break in the trend. (Unless of
course, the relative error is greater than 3.0, in which
case the "break" will be ignored.)

Listing 11.2: Statistical Elimination of Interference
 1 div=3
 2 rel=0.01
 3 tre=10
 4 while test $# -gt 0
 5 do
 6   case "$1" in
 7   --divisor)
 8     shift
 9     div=$1
10     ;;
11   --relerr)
12     shift
13     rel=$1
14     ;;
15   --trendbreak)
16     shift
17     tre=$1
18     ;;
19   esac
20   shift
21 done
22
23 awk -v divisor=$div -v relerr=$rel -v trendbreak=$tre '{
24   for (i = 2; i <= NF; i++)
25     d[i - 1] = $i;
26   asort(d);
27   i = int((NF + divisor - 1) / divisor);
28   delta = d[i] - d[1];
29   maxdelta = delta * divisor;
30   maxdelta1 = delta + d[i] * relerr;
31   if (maxdelta1 > maxdelta)
32     maxdelta = maxdelta1;
33   for (j = i + 1; j < NF; j++) {
34     if (j <= 2)
35       maxdiff = d[NF - 1] - d[1];
36     else
37       maxdiff = trendbreak * (d[j - 1] - d[1]) / (j - 2);
38     if (d[j] - d[1] > maxdelta && d[j] - d[j - 1] > maxdiff)
39       break;
40   }
41   n = sum = 0;
42   for (k = 1; k < j; k++) {
43     sum += d[k];
44     n++;
45   }
46   min = d[1];
47   max = d[j - 1];
48   avg = sum / n;
49   print $1, avg, min, max, n, NF - 1;
50 }'

Lines 1–3 of Listing 11.2 set the default values for
the parameters, and lines 4–21 parse any command-line
overriding of these parameters. The awk invocation on
line 23 sets the values of the divisor, relerr, and
trendbreak variables to their sh counterparts. In the
usual awk manner, lines 24–50 are executed on each input
line. The loop spanning lines 24 and 25 copies the input
y-values to the d array, which line 26 sorts into increasing
order. Line 27 computes the number of trustworthy
y-values by applying divisor and rounding up.

Lines 28–32 compute the maxdelta lower bound on
the upper bound of y-values. To this end, line 29 multiplies
the difference in values over the trusted region of data
by the divisor, which projects the difference in values
across the trusted region across the entire set of y-values.
However, this value might well be much smaller than
the relative error, so line 30 computes the absolute error
(d[i] * relerr) and adds that to the difference delta
across the trusted portion of the data. Lines 31 and 32
then compute the maximum of these two values.

Each pass through the loop spanning lines 33–40 attempts
to add another data value to the set of good data.
Lines 34–39 compute the trend-break delta, with line 34
disabling this limit if we don't yet have enough values
to compute a trend, and with line 37 multiplying
trendbreak by the average difference between pairs of
data values in the good set. If line 38 determines that the
candidate data value would exceed the lower bound on the
upper bound (maxdelta) and that the difference between
the candidate data value and its predecessor exceeds the
trend-break difference (maxdiff), then line 39 exits the
loop: We have the full good set of data.

Lines 41–49 then compute and print statistics.

Quick Quiz 11.25: This approach is just plain weird! Why
not use means and standard deviations, like we were taught in
our statistics classes?

Quick Quiz 11.26: But what if all the y-values in the trusted
group of data are exactly zero? Won't that cause the script to
reject any non-zero value?

Although statistical interference detection can be quite
useful, it should be used only as a last resort. It is far better
to avoid interference in the first place (Section 11.7.5),
or, failing that, detecting interference via measurement
(Section 11.7.6.1).

11.8 Summary

To err is human! Stop being human!!!

Ed Nofziger

Although validation never will be an exact science, much
can be gained by taking an organized approach to it, as
an organized approach will help you choose the right
validation tools for your job, avoiding situations like the
one fancifully depicted in Figure 11.6.

Figure 11.6: Choose Validation Methods Wisely

A key choice is that of statistics. Although the methods
described in this chapter work very well most of the time,
they do have their limitations, courtesy of the Halting
Problem [Tur37, Pul00]. Fortunately for us, there is a
huge number of special cases in which we can not only
work out whether a program will halt, but also estimate how
long it will run before halting, as discussed in Section 11.7.
Furthermore, in cases where a given program might or
might not work correctly, we can often establish estimates
for what fraction of the time it will work correctly, as
discussed in Section 11.6.

Nevertheless, unthinking reliance on these estimates
is brave to the point of foolhardiness. After all, we are
summarizing a huge mass of complexity in code and data
structures down to a single solitary number. Even though
we can get away with such bravery a surprisingly large
fraction of the time, abstracting all that code and data
away will occasionally cause severe problems.

One possible problem is variability, where repeated
runs give wildly different results. This problem is often
addressed using standard deviation, however, using two
numbers to summarize the behavior of a large and complex
program is about as brave as using only one number. In
computer programming, the surprising thing is that use
of the mean or the mean and standard deviation are often
sufficient. Nevertheless, there are no guarantees.

One cause of variation is confounding factors. For
example, the CPU time consumed by a linked-list search
will depend on the length of the list. Averaging together
runs with wildly different list lengths will probably not be
useful, and adding a standard deviation to the mean will
not be much better. The right thing to do would be to control
for list length, either by holding the length constant or by
measuring CPU time as a function of list length.

Of course, this advice assumes that you are aware
of the confounding factors, and Murphy says that you
will not be. I have been involved in projects that had
confounding factors as diverse as air conditioners (which
drew considerable power at startup, thus causing the
voltage supplied to the computer to momentarily drop too
low, sometimes resulting in failure), cache state (resulting
in odd variations in performance), I/O errors (including
disk errors, packet loss, and duplicate Ethernet MAC
addresses), and even porpoises (which could not resist
playing with an array of transponders, which could be
otherwise used for high-precision acoustic positioning
and navigation). And this is but one reason why a good
night's sleep is such an effective debugging tool.

In short, validation always will require some measure
of the behavior of the system. To be at all useful, this
measure must be a severe summarization of the system,
which in turn means that it can be misleading. So as the
saying goes, "Be careful. It is a real world out there."


But what if you are working on the Linux kernel, which
as of 2017 was estimated to have more than 20 billion
instances running throughout the world? In that case,
a bug that occurs once every million years on a single
system will be encountered more than 50 times per day
across the installed base. A test with a 50 % chance of
encountering this bug in a one-hour run would need to
increase that bug’s probability of occurrence by more than
ten orders of magnitude, which poses a severe challenge
to today’s testing methodologies. One important tool
that can sometimes be applied with good effect to such
situations is formal verification, the subject of the next
chapter, and, more speculatively, Section 17.4.
The topic of choosing a validation plan, be it testing,
formal verification, or both, is taken up by Section 12.7.

Chapter 12

Formal Verification

Beware of bugs in the above code; I have only proved
it correct, not tried it.

Donald Knuth

Parallel algorithms can be hard to write, and even harder
to debug. Testing, though essential, is insufficient, as fatal
race conditions can have extremely low probabilities of
occurrence. Proofs of correctness can be valuable, but in
the end are just as prone to human error as is the original
algorithm. In addition, a proof of correctness cannot be
expected to find errors in your assumptions, shortcomings
in the requirements, misunderstandings of the underlying
software or hardware primitives, or errors that you did
not think to construct a proof for. This means that formal
methods can never replace testing. Nevertheless, formal
methods can be a valuable addition to your validation
toolbox.

It would be very helpful to have a tool that could somehow
locate all race conditions. A number of such tools
exist, for example, Section 12.1 provides an introduction
to the general-purpose state-space search tools Promela
and Spin, Section 12.2 similarly introduces the special-purpose
ppcmem and cppmem tools, Section 12.3 looks
at an example axiomatic approach, Section 12.4 briefly
overviews SAT solvers, Section 12.5 briefly overviews
stateless model checkers, Section 12.6 sums up use of
formal-verification tools for verifying parallel algorithms,
and finally Section 12.7 discusses how to decide how
much and what type of validation to apply to a given
software project.

12.1 State-Space Search

Follow every byway / Every path you know.

"Climb Every Mountain", Rodgers & Hammerstein

This section features the general-purpose Promela and
Spin tools, which may be used to carry out a full state-space
search of many types of multi-threaded code. They
are used for verifying data communication protocols.
Section 12.1.1 introduces Promela and Spin, including a
couple of warm-up exercises verifying both non-atomic
and atomic increment. Section 12.1.2 describes use of
Promela, including example command lines and a comparison
of Promela syntax to that of C. Section 12.1.3
shows how Promela may be used to verify locking,
Section 12.1.4 uses Promela to verify an unusual implementation
of RCU named "QRCU", and finally Section 12.1.5
applies Promela to early versions of RCU's dyntick-idle
implementation.

12.1.1 Promela and Spin

Promela is a language designed to help verify protocols,
but which can also be used to verify small parallel algorithms.
You recode your algorithm and correctness
constraints in the C-like language Promela, and then use
Spin to translate it into a C program that you can compile
and run. The resulting program carries out a full state-space
search of your algorithm, either verifying or finding
counter-examples for assertions that you can associate
with your Promela program.

This full-state search can be extremely powerful, but
can also be a two-edged sword. If your algorithm is too
complex or your Promela implementation is careless, there
might be more states than fit in memory. Furthermore,
even given sufficient memory, the state-space search might
well run for longer than the expected lifetime of the
universe. Therefore, use this tool for compact but complex
parallel algorithms. Attempts to naively apply it to even
moderate-scale algorithms (let alone the full Linux kernel)
will end badly.

Promela and Spin may be downloaded from
https://spinroot.com/spin/whatispin.html.

The above site also gives links to Gerard Holzmann's
excellent book [Hol03] on Promela and Spin, as well as
searchable online references starting at:
https://www.spinroot.com/spin/Man/index.html.

The remainder of this section describes how to use
Promela to debug parallel algorithms, starting with simple
examples and progressing to more complex uses.

12.1.1.1 Warm-Up: Non-Atomic Increment

Listing 12.1 demonstrates the textbook race condition
resulting from non-atomic increment. Line 1 defines
the number of processes to run (we will vary this to see
the effect on state space), line 3 defines the counter, and
line 4 is used to implement the assertion that appears on
lines 29–39.

Listing 12.1: Promela Code for Non-Atomic Increment
 1 #define NUMPROCS 2
 2
 3 byte counter = 0;
 4 byte progress[NUMPROCS];
 5
 6 proctype incrementer(byte me)
 7 {
 8   int temp;
 9
10   temp = counter;
11   counter = temp + 1;
12   progress[me] = 1;
13 }
14
15 init {
16   int i = 0;
17   int sum = 0;
18
19   atomic {
20     i = 0;
21     do
22     :: i < NUMPROCS ->
23       progress[i] = 0;
24       run incrementer(i);
25       i++;
26     :: i >= NUMPROCS -> break;
27     od;
28   }
29   atomic {
30     i = 0;
31     sum = 0;
32     do
33     :: i < NUMPROCS ->
34       sum = sum + progress[i];
35       i++
36     :: i >= NUMPROCS -> break;
37     od;
38     assert(sum < NUMPROCS || counter == NUMPROCS);
39   }
40 }

Lines 6–13 define a process that increments the counter
non-atomically. The argument me is the process number,
set by the initialization block later in the code. Because
simple Promela statements are each assumed atomic,
we must break the increment into the two statements
on lines 10–11. The assignment on line 12 marks the
process's completion. Because the Spin system will fully
search the state space, including all possible sequences of
states, there is no need for the loop that would be used for
conventional stress testing.

Lines 15–40 are the initialization block, which is executed
first. Lines 19–28 actually do the initialization,
while lines 29–39 perform the assertion. Both are atomic
blocks in order to avoid unnecessarily increasing the state
space: Because they are not part of the algorithm proper,
we lose no verification coverage by making them atomic.

The do-od construct on lines 21–27 implements a
Promela loop, which can be thought of as a C for
(;;) loop containing a switch statement that allows
expressions in case labels. The condition blocks (prefixed
by ::) are scanned non-deterministically, though in this
case only one of the conditions can possibly hold at a
given time. The first block of the do-od from lines 22–25
initializes the i-th incrementer's progress cell, runs the i-th
incrementer's process, and then increments the variable i.
The second block of the do-od on line 26 exits the loop
once these processes have been started.

The atomic block on lines 29–39 also contains a similar
do-od loop that sums up the progress counters. The
assert() statement on line 38 verifies that if all processes
have been completed, then all counts have been correctly
recorded.

You can build and run this program as follows:

spin -a increment.spin  # Translate the model to C
cc -DSAFETY -o pan pan.c  # Compile the model
./pan  # Run the model

This will produce output as shown in Listing 12.2.

Listing 12.2: Non-Atomic Increment Spin Output
pan:1: assertion violated
  ((sum<2)||(counter==2)) (at depth 22)
pan: wrote increment.spin.trail

(Spin Version 6.4.8 -- 2 March 2018)
Warning: Search not completed
        + Partial Order Reduction

Full statespace search for:
        never claim             - (none specified)
        assertion violations    +
        cycle checks            - (disabled by -DSAFETY)
        invalid end states      +

State-vector 48 byte, depth reached 24, errors: 1
       45 states, stored
       13 states, matched
       58 transitions (= stored+matched)
       53 atomic steps
hash conflicts:         0 (resolved)

Stats on memory usage (in Megabytes):
    0.003       equivalent memory usage for states
                (stored*(State-vector + overhead))
    0.290       actual memory usage for states
  128.000       memory used for hash table (-w24)
    0.534       memory used for DFS stack (-m10000)
  128.730       total actual memory usage

The first line tells us that our assertion was violated (as
expected given the non-atomic increment!). The second
line says that a trail file was written describing how the
assertion was violated. The "Warning" line reiterates that
all was not well with our model. The second paragraph
describes the type of state-search being carried out, in
this case for assertion violations and invalid end states.
The third paragraph gives state-size statistics: This small
model had only 45 states. The final line shows memory
usage.

The trail file may be rendered human-readable as
follows:

spin -t -p increment.spin

This gives the output shown in Listing 12.3. As can
be seen, the first portion of the init block created both
incrementer processes, both of which first fetched the
counter, then both incremented and stored it, losing a
count. The assertion then triggered, after which the global
state is displayed.

12.1.1.2 Warm-Up: Atomic Increment

It is easy to fix this example by placing the body of the
incrementer processes in an atomic block as shown in
Listing 12.4. One could also have simply replaced the pair
of statements with counter = counter + 1, because
Promela statements are atomic. Either way, running this
modified model gives us an error-free traversal of the state
space, as shown in Listing 12.5.

Table 12.1 shows the number of states and memory
consumed as a function of number of incrementers modeled
(by redefining NUMPROCS):

Table 12.1: Memory Usage of Increment Model

    # incrementers      # states   total memory usage (MB)
                 1            11   128.7
                 2            52   128.7
                 3           372   128.7
                 4         3,496   128.9
                 5        40,221   131.7
                 6       545,720   174.0
                 7     8,521,446   881.9

Running unnecessarily large models is thus subtly
discouraged, although 882 MB is well within the limits of
modern desktop and laptop machines.

With this example under our belt, let's take a closer
look at the commands used to analyze Promela models
and then look at more elaborate examples.

12.1.2 How to Use Promela

Given a source file qrcu.spin, one can use the following
commands:

spin -a qrcu.spin
    Create a file pan.c that fully searches the state
    machine.

cc -DSAFETY [-DCOLLAPSE] [-DMA=N] -o pan pan.c
    Compile the generated state-machine search.
    The -DSAFETY generates optimizations that are
    appropriate if you have only assertions (and perhaps
    never statements). If you have liveness, fairness, or
    forward-progress checks, you may need to compile
    without -DSAFETY. If you leave off -DSAFETY when
    you could have used it, the program will let you
    know.

    The optimizations produced by -DSAFETY greatly
    speed things up, so you should use it when you
    can. An example situation where you cannot use
    -DSAFETY is when checking for livelocks (AKA
    "non-progress cycles") via -DNP.

    The optional -DCOLLAPSE generates code for a state
    vector compression mode.

    Another optional flag -DMA=N generates code for a
    slow but aggressive state-space memory compression
    mode.

./pan [-mN] [-wN]
    This actually searches the state space. The number
    of states can reach into the tens of millions with very
    small state machines, so you will need a machine
    with large memory. For example, qrcu.spin with
    3 updaters and 2 readers required 10.5 GB of memory
    even with the -DCOLLAPSE flag.

    If you see a message from ./pan saying: "error:
    max search depth too small", you need to increase
    the maximum depth by a -mN option for a
    complete search. The default is -m10000.

Listing 12.3: Non-Atomic Increment Error Trail


using statement merging
1: proc 0 (:init::1) increment.spin:21 (state 1) [i = 0]
2: proc 0 (:init::1) increment.spin:23 (state 2) [((i<2))]
2: proc 0 (:init::1) increment.spin:24 (state 3) [progress[i] = 0]
Starting incrementer with pid 1
3: proc 0 (:init::1) increment.spin:25 (state 4) [(run incrementer(i))]
4: proc 0 (:init::1) increment.spin:26 (state 5) [i = (i+1)]
5: proc 0 (:init::1) increment.spin:23 (state 2) [((i<2))]
5: proc 0 (:init::1) increment.spin:24 (state 3) [progress[i] = 0]
Starting incrementer with pid 2
6: proc 0 (:init::1) increment.spin:25 (state 4) [(run incrementer(i))]
7: proc 0 (:init::1) increment.spin:26 (state 5) [i = (i+1)]
8: proc 0 (:init::1) increment.spin:27 (state 6) [((i>=2))]
9: proc 0 (:init::1) increment.spin:22 (state 10) [break]
10: proc 2 (incrementer:1) increment.spin:11 (state 1) [temp = counter]
11: proc 1 (incrementer:1) increment.spin:11 (state 1) [temp = counter]
12: proc 2 (incrementer:1) increment.spin:12 (state 2) [counter = (temp+1)]
13: proc 2 (incrementer:1) increment.spin:13 (state 3) [progress[me] = 1]
14: proc 2 terminates
15: proc 1 (incrementer:1) increment.spin:12 (state 2) [counter = (temp+1)]
16: proc 1 (incrementer:1) increment.spin:13 (state 3) [progress[me] = 1]
17: proc 1 terminates
18: proc 0 (:init::1) increment.spin:31 (state 12) [i = 0]
18: proc 0 (:init::1) increment.spin:32 (state 13) [sum = 0]
19: proc 0 (:init::1) increment.spin:34 (state 14) [((i<2))]
19: proc 0 (:init::1) increment.spin:35 (state 15) [sum = (sum+progress[i])]
19: proc 0 (:init::1) increment.spin:36 (state 16) [i = (i+1)]
20: proc 0 (:init::1) increment.spin:34 (state 14) [((i<2))]
20: proc 0 (:init::1) increment.spin:35 (state 15) [sum = (sum+progress[i])]
20: proc 0 (:init::1) increment.spin:36 (state 16) [i = (i+1)]
21: proc 0 (:init::1) increment.spin:37 (state 17) [((i>=2))]
22: proc 0 (:init::1) increment.spin:33 (state 21) [break]
spin: increment.spin:39, Error: assertion violated
spin: text of failed assertion: assert(((sum<2)||(counter==2)))
23: proc 0 (:init::1) increment.spin:39 (state 22) [assert(((sum<2)||(counter==2)))]
spin: trail ends after 23 steps
#processes: 1
counter = 1
progress[0] = 1
progress[1] = 1
23: proc 0 (:init::1) increment.spin:41 (state 24) <valid end state>
3 processes created


Listing 12.4: Promela Code for Atomic Increment Don’t forget to capture the output, especially if you
1 proctype incrementer(byte me) are working on a remote machine.
2 {
3 int temp; If your model includes forward-progress checks, you
4
5 atomic { will likely need to enable “weak fairness” via the -f
6 temp = counter; command-line argument to ./pan. If your forward-
7 counter = temp + 1;
8 } progress checks involve accept labels, you will also
9 progress[me] = 1; need the -a argument.
10 }

spin -t -p qrcu.spin
Listing 12.5: Atomic Increment Spin Output Given trail file output by a run that encountered
(Spin Version 6.4.8 -- 2 March 2018) an error, output the sequence of steps leading to that
+ Partial Order Reduction
error. The -g flag will also include the values of
Full statespace search for: changed global variables, and the -l flag will also
never claim - (none specified)
assertion violations + include the values of changed local variables.
cycle checks - (disabled by -DSAFETY)
invalid end states +
12.1.2.1 Promela Peculiarities
State-vector 48 byte, depth reached 22, errors: 0
52 states, stored
21 states, matched Although all computer languages have underlying similar-
73 transitions (= stored+matched) ities, Promela will provide some surprises to people used
68 atomic steps
hash conflicts: 0 (resolved) to coding in C, C++, or Java.
Stats on memory usage (in Megabytes):
0.004 equivalent memory usage for states 1. In C, “;” terminates statements. In Promela it sep-
(stored*(State-vector + overhead)) arates them. Fortunately, more recent versions of
0.290 actual memory usage for states
128.000 memory used for hash table (-w24) Spin have become much more forgiving of “extra”
0.534 memory used for DFS stack (-m10000) semicolons.
128.730 total actual memory usage

unreached in proctype incrementer 2. Promela’s looping construct, the do statement, takes


(0 of 5 states) conditions. This do statement closely resembles a
unreached in init
(0 of 24 states) looping if-then-else statement.

3. In C’s switch statement, if there is no matching


The -wN option specifies the hashtable size. The case, the whole statement is skipped. In Promela’s
default for full state-space search is -w24.1 equivalent, confusingly called if, if there is no
matching guard expression, you get an error without
If you aren’t sure whether your machine has enough a recognizable corresponding error message. So, if
memory, run top in one window and ./pan in the error output indicates an innocent line of code,
another. Keep the focus on the ./pan window so check to see if you left out a condition from an if or
that you can quickly kill execution if need be. As do statement.
soon as CPU time drops much below 100 %, kill
./pan. If you have removed focus from the window 4. When creating stress tests in C, one usually races
running ./pan, you may wait a long time for the suspect operations against each other repeatedly. In
windowing system to grab enough memory to do Promela, one instead sets up a single race, because
anything for you. Promela will search out all the possible outcomes
Another option to avoid memory exhaustion is the from that single race. Sometimes you do need to
-DMEMLIM=N compiler flag. -DMEMLIM=2000 would loop in Promela, for example, if multiple operations
set the maximum of 2 GB. overlap, but doing so greatly increases the size of
your state space.

1 As
of Spin Version 6.4.6 and 6.4.8. In the online manual of Spin
5. In C, the easiest thing to do is to maintain a loop
dated 10 July 2011, the default for exhaustive search mode is said to be counter to track progress and terminate the loop.
-w19, which does not meet the actual behavior. In Promela, loop counters must be avoided like the


plague because they cause the state space to explode. Listing 12.6: Complex Promela Assertion
On the other hand, there is no penalty for infinite 1 i = 0;
2 sum = 0;
loops in Promela as long as none of the variables 3 do
monotonically increase or decrease—Promela will 4 :: i < N_QRCU_READERS ->
5 sum = sum + (readerstart[i] == 1 &&
figure out how many passes through the loop really 6 readerprogress[i] == 1);
matter, and automatically prune execution beyond 7 i++
8 :: i >= N_QRCU_READERS ->
that point. 9 assert(sum == 0);
10 break
11 od
6. In C torture-test code, it is often wise to keep per-
task control variables. They are cheap to read, and
greatly aid in debugging the test code. In Promela,
per-task control variables should be used only when 1 if
2 :: 1 -> r1 = x;
there is no other alternative. To see this, consider 3 r2 = y
a 5-task verification with one bit each to indicate 4 :: 1 -> r2 = y;
5 r1 = x
completion. This gives 32 states. In contrast, a 6 fi
simple counter would have only six states, more
than a five-fold reduction. That factor of five might
not seem like a problem, at least not until you are The two branches of the if statement will be selected
struggling with a verification program possessing nondeterministically, since they both are available.
more than 150 million states consuming more than Because the full state space is searched, both choices
10 GB of memory! will eventually be made in all cases.
Of course, this trick will cause your state space to
7. One of the most challenging things both in C torture- explode if used too heavily. In addition, it requires
test code and in Promela is formulating good asser- you to anticipate possible reorderings.
tions. Promela also allows never claims that act like
an assertion replicated between every line of code. 2. State reduction. If you have complex assertions,
evaluate them under atomic. After all, they are not
8. Dividing and conquering is extremely helpful in part of the algorithm. One example of a complex
Promela in keeping the state space under control. assertion (to be discussed in more detail later) is as
Splitting a large model into two roughly equal halves shown in Listing 12.6.
will result in the state space of each half being roughly There is no reason to evaluate this assertion non-
the square root of the whole. For example, a million- atomically, since it is not actually part of the algo-
state combined model might reduce to a pair of rithm. Because each statement contributes to state,
thousand-state models. Not only will Promela handle we can reduce the number of useless states by enclos-
the two smaller models much more quickly with ing it in an atomic block as shown in Listing 12.7.
much less memory, but the two smaller algorithms
are easier for people to understand. 3. Promela does not provide functions. You must in-
stead use C preprocessor macros. However, you must
use them carefully in order to avoid combinatorial
12.1.2.2 Promela Coding Tricks explosion.

Promela was designed to analyze protocols, so using it on Now we are ready for further examples.
parallel programs is a bit abusive. The following tricks
can help you to abuse Promela safely:
12.1.3 Promela Example: Locking
1. Memory reordering. Suppose you have a pair of Since locks are generally useful, spin_lock() and spin_
statements copying globals x and y to locals r1 and unlock() macros are provided in lock.h, which may
r2, where ordering matters (e.g., unprotected by be included from multiple Promela models, as shown
locks), but where you have no memory barriers. This in Listing 12.8. The spin_lock() macro contains an
can be modeled in Promela as follows: infinite do-od loop spanning lines 2–11, courtesy of the


Listing 12.7: Atomic Block for Complex Promela Assertion


1 atomic {
2 i = 0;
3 sum = 0;
4 do
5 :: i < N_QRCU_READERS ->
6 sum = sum + (readerstart[i] == 1 &&
7 readerprogress[i] == 1);
8 i++
9 :: i >= N_QRCU_READERS ->
10 assert(sum == 0);
11 break
12 od
13 }
Listing 12.9: Promela Code to Test Spinlocks
1 #include "lock.h"
Listing 12.8: Promela Code for Spinlock 2
3 #define N_LOCKERS 3
1 #define spin_lock(mutex) \
4
2 do \
5 bit mutex = 0;
3 :: 1 -> atomic { \
6 bit havelock[N_LOCKERS];
4 if \
7 int sum;
5 :: mutex == 0 -> \
8
6 mutex = 1; \
9 proctype locker(byte me)
7 break \
10 {
8 :: else -> skip \
11 do
9 fi \
12 :: 1 ->
10 } \
13 spin_lock(mutex);
11 od
14 havelock[me] = 1;
12
15 havelock[me] = 0;
13 #define spin_unlock(mutex) \
16 spin_unlock(mutex)
14 mutex = 0
17 od
18 }
19
20 init {
single guard expression of “1” on line 3. The body of 21 int i = 0;
this loop is a single atomic block that contains an if-fi 22 int j;
23
statement. The if-fi construct is similar to the do-od 24 end: do
construct, except that it takes a single pass rather than 25 :: i < N_LOCKERS ->
26 havelock[i] = 0;
looping. If the lock is not held on line 5, then line 6 27 run locker(i);
acquires it and line 7 breaks out of the enclosing do-od 28 i++
29 :: i >= N_LOCKERS ->
loop (and also exits the atomic block). On the other hand, 30 sum = 0;
if the lock is already held on line 8, we do nothing (skip), 31 j = 0;
32 atomic {
and fall out of the if-fi and the atomic block so as to 33 do
take another pass through the outer loop, repeating until 34 :: j < N_LOCKERS ->
35 sum = sum + havelock[j];
the lock is available. 36 j = j + 1
37 :: j >= N_LOCKERS ->
The spin_unlock() macro simply marks the lock as 38 break
no longer held. 39 od
40 }
Note that memory barriers are not needed because 41 assert(sum <= 1);
Promela assumes full ordering. In any given Promela 42 break
43 od
state, all processes agree on both the current state and the 44 }
order of state changes that caused us to arrive at the current
state. This is analogous to the “sequentially consistent”
memory model used by a few computer systems (such as
1990s MIPS and PA-RISC). As noted earlier, and as will
be seen in a later example, weak memory ordering must
be explicitly coded.
These macros are tested by the Promela code shown in
Listing 12.9. This code is similar to that used to test the
increments, with the number of locking processes defined
by the N_LOCKERS macro definition on line 3. The mutex


Listing 12.10: Output for Spinlock Test Quick Quiz 12.2: What are some Promela code-style issues
(Spin Version 6.4.8 -- 2 March 2018) with this example?
+ Partial Order Reduction

Full statespace search for:


never claim - (none specified)
assertion violations + 12.1.4 Promela Example: QRCU
cycle checks - (disabled by -DSAFETY)
invalid end states +
This final example demonstrates a real-world use of
State-vector 52 byte, depth reached 360, errors: 0 Promela on Oleg Nesterov’s QRCU [Nes06a, Nes06b], but
576 states, stored
929 states, matched
modified to speed up the synchronize_qrcu() fastpath.
1505 transitions (= stored+matched) But first, what is QRCU?
368 atomic steps
hash conflicts: 0 (resolved) QRCU is a variant of SRCU [McK06] that trades some-
what higher read overhead (atomic increment and decre-
Stats on memory usage (in Megabytes):
0.044 equivalent memory usage for states ment on a global variable) for extremely low grace-period
(stored*(State-vector + overhead)) latencies. If there are no readers, the grace period will
0.288 actual memory usage for states
128.000 memory used for hash table (-w24) be detected in less than a microsecond, compared to the
0.534 memory used for DFS stack (-m10000) multi-millisecond grace-period latencies of most other
128.730 total actual memory usage
RCU implementations.
unreached in proctype locker
lock.spin:19, state 20, "-end-"
(1 of 20 states)
1. There is a qrcu_struct that defines a QRCU do-
unreached in init main. Like SRCU (and unlike other variants of RCU)
(0 of 22 states)
QRCU’s action is not global, but instead focused on
the specified qrcu_struct.
itself is defined on line 5, an array to track the lock owner 2. There are qrcu_read_lock() and qrcu_read_
on line 6, and line 7 is used by assertion code to verify unlock() primitives that delimit QRCU read-side
that only one process holds the lock. critical sections. The corresponding qrcu_struct
The locker process is on lines 9–18, and simply loops must be passed into these primitives, and the return
forever acquiring the lock on line 13, claiming it on line 14, value from qrcu_read_lock() must be passed to
unclaiming it on line 15, and releasing it on line 16. qrcu_read_unlock().
The init block on lines 20–44 initializes the current
locker’s havelock array entry on line 26, starts the current For example:
locker on line 27, and advances to the next locker on
idx = qrcu_read_lock(&my_qrcu_struct);
line 28. Once all locker processes are spawned, the /* read-side critical section. */
do-od loop moves to line 29, which checks the assertion. qrcu_read_unlock(&my_qrcu_struct, idx);
Lines 30 and 31 initialize the control variables, lines 32–40
atomically sum the havelock array entries, line 41 is the 3. There is a synchronize_qrcu() primitive that
assertion, and line 42 exits the loop. blocks until all pre-existing QRCU read-side critical
We can run this model by placing the two code fragments sections complete, but, like SRCU’s synchronize_
of Listings 12.8 and 12.9 into files named lock.h and srcu(), QRCU’s synchronize_qrcu() need wait
lock.spin, respectively, and then running the following only for those read-side critical sections that are using
commands: the same qrcu_struct.
spin -a lock.spin
cc -DSAFETY -o pan pan.c
For example, synchronize_qrcu(&your_qrcu_
./pan struct) would not need to wait on the earlier
QRCU read-side critical section. In contrast,
The output will look something like that shown in synchronize_qrcu(&my_qrcu_struct) would
Listing 12.10. As expected, this run has no assertion need to wait, since it shares the same qrcu_struct.
failures (“errors: 0”).
A Linux-kernel patch for QRCU has been pro-
Quick Quiz 12.1: Why is there an unreached statement in
duced [McK07c], but is unlikely to ever be included
locker? After all, isn’t this a full state-space search?
in the Linux kernel.

v2022.09.25a
12.1. STATE-SPACE SEARCH 237

Listing 12.11: QRCU Global Variables Listing 12.13: QRCU Unordered Summation
1 #include "lock.h" 1 #define sum_unordered \
2 2 atomic { \
3 #define N_QRCU_READERS 2 3 do \
4 #define N_QRCU_UPDATERS 2 4 :: 1 -> \
5 5 sum = ctr[0]; \
6 bit idx = 0; 6 i = 1; \
7 byte ctr[2]; 7 break \
8 byte readerprogress[N_QRCU_READERS]; 8 :: 1 -> \
9 bit mutex = 0; 9 sum = ctr[1]; \
10 i = 0; \
11 break \
Listing 12.12: QRCU Reader Process 12 od; \
13 } \
1 proctype qrcu_reader(byte me)
14 sum = sum + ctr[i]
2 {
3 int myidx;
4
5 do
6 :: 1 -> the global index, and lines 8–15 atomically increment it
7 myidx = idx;
8 atomic {
(and break from the infinite loop) if its value was non-zero
9 if (atomic_inc_not_zero()). Line 17 marks entry into
10 :: ctr[myidx] > 0 ->
11 ctr[myidx]++;
the RCU read-side critical section, and line 18 marks
12 break exit from this critical section, both lines for the benefit
13 :: else -> skip
14 fi
of the assert() statement that we shall encounter later.
15 } Line 19 atomically decrements the same counter that we
16 od;
17 readerprogress[me] = 1;
incremented, thereby exiting the RCU read-side critical
18 readerprogress[me] = 2; section.
19 atomic { ctr[myidx]-- }
20 } The C-preprocessor macro shown in Listing 12.13
sums the pair of counters so as to emulate weak memory
ordering. Lines 2–13 fetch one of the counters, and
Returning to the Promela code for QRCU, the global line 14 fetches the other of the pair and sums them. The
variables are as shown in Listing 12.11. This example atomic block consists of a single do-od statement. This
uses locking and includes lock.h. Both the number of do-od statement (spanning lines 3–12) is unusual in that it
readers and writers can be varied using the two #define contains two unconditional branches with guards on lines 4
statements, giving us not one but two ways to create and 8, which causes Promela to non-deterministically
combinatorial explosion. The idx variable controls which choose one of the two (but again, the full state-space
of the two elements of the ctr array will be used by search causes Promela to eventually make all possible
readers, and the readerprogress variable allows an choices in each applicable situation). The first branch
assertion to determine when all the readers are finished fetches the zero-th counter and sets i to 1 (so that line 14
(since a QRCU update cannot be permitted to complete will fetch the first counter), while the second branch does
until all pre-existing readers have completed their QRCU the opposite, fetching the first counter and setting i to 0
read-side critical sections). The readerprogress array (so that line 14 will fetch the second counter).
elements have values as follows, indicating the state of the
Quick Quiz 12.3: Is there a more straightforward way to
corresponding reader:
code the do-od statement?
0: Not yet started.
With the sum_unordered macro in place, we can now
1: Within QRCU read-side critical section. proceed to the update-side process shown in Listing 12.14.
2: Finished with QRCU read-side critical section. The update-side process repeats indefinitely, with the
corresponding do-od loop ranging over lines 7–57.
Finally, the mutex variable is used to serialize updaters’ Each pass through the loop first snapshots the global
slowpaths. readerprogress array into the local readerstart ar-
QRCU readers are modeled by the qrcu_reader() ray on lines 12–21. This snapshot will be used for the
process shown in Listing 12.12. A do-od loop spans assertion on line 53. Line 23 invokes sum_unordered,
lines 5–16, with a single guard of “1” on line 6 that makes and then lines 24–27 re-invoke sum_unordered if the
it an infinite loop. Line 7 captures the current value of fastpath is potentially usable.

v2022.09.25a
238 CHAPTER 12. FORMAL VERIFICATION

Listing 12.15: QRCU Initialization Process


1 init {
2 int i;
3
4 atomic {
5 ctr[idx] = 1;
6 ctr[!idx] = 0;
Listing 12.14: QRCU Updater Process 7 i = 0;
1 proctype qrcu_updater(byte me) 8 do
2 { 9 :: i < N_QRCU_READERS ->
3 int i; 10 readerprogress[i] = 0;
4 byte readerstart[N_QRCU_READERS]; 11 run qrcu_reader(i);
5 int sum; 12 i++
6 13 :: i >= N_QRCU_READERS -> break
7 do 14 od;
8 :: 1 -> 15 i = 0;
9 16 do
10 /* Snapshot reader state. */ 17 :: i < N_QRCU_UPDATERS ->
11 18 run qrcu_updater(i);
12 atomic { 19 i++
13 i = 0; 20 :: i >= N_QRCU_UPDATERS -> break
14 do 21 od
15 :: i < N_QRCU_READERS -> 22 }
16 readerstart[i] = readerprogress[i]; 23 }
17 i++
18 :: i >= N_QRCU_READERS ->
19 break
20 od Lines 28–40 execute the slowpath code if need be, with
21 }
22 lines 30 and 38 acquiring and releasing the update-side
23 sum_unordered; lock, lines 31–33 flipping the index, and lines 34–37
24 if
25 :: sum <= 1 -> sum_unordered waiting for all pre-existing readers to complete.
26 :: else -> skip
27 fi;
Lines 44–56 then compare the current values in
28 if the readerprogress array to those collected in the
29 :: sum > 1 ->
30 spin_lock(mutex);
readerstart array, forcing an assertion failure should
31 atomic { ctr[!idx]++ } any readers that started before this update still be in
32 idx = !idx;
33 atomic { ctr[!idx]-- }
progress.
34 do
35 :: ctr[!idx] > 0 -> skip Quick Quiz 12.4: Why are there atomic blocks at lines 12–21
36 :: ctr[!idx] == 0 -> break and lines 44–56, when the operations within those atomic
37 od;
38 spin_unlock(mutex); blocks have no atomic implementation on any current produc-
39 :: else -> skip tion microprocessor?
40 fi;
41
42 /* Verify reader progress. */
43
Quick Quiz 12.5: Is the re-summing of the counters on
44 atomic { lines 24–27 really necessary?
45 i = 0;
46 sum = 0;
47 do All that remains is the initialization block shown in List-
48 :: i < N_QRCU_READERS -> ing 12.15. This block simply initializes the counter pair
49 sum = sum + (readerstart[i] == 1 &&
50 readerprogress[i] == 1); on lines 5–6, spawns the reader processes on lines 7–14,
51 i++ and spawns the updater processes on lines 15–21. This is
52 :: i >= N_QRCU_READERS ->
53 assert(sum == 0); all done within an atomic block to reduce state space.
54 break
55 od
56 }
57 od 12.1.4.1 Running the QRCU Example
58 }

To run the QRCU example, combine the code fragments


in the previous section into a single file named qrcu.
spin, and place the definitions for spin_lock() and
spin_unlock() into a file named lock.h. Then use the
following commands to build and run the QRCU model:

v2022.09.25a
12.1. STATE-SPACE SEARCH 239

Table 12.2: Memory Usage of QRCU Model Listing 12.16: 3 Readers 3 Updaters QRCU Spin Output with
-DMA=96
updaters readers # states depth memory (MB)a (Spin Version 6.4.6 -- 2 December 2016)
+ Partial Order Reduction
1 1 376 95 128.7 + Graph Encoding (-DMA=96)
1 2 6,177 218 128.9 Full statespace search for:
1 3 99,728 385 132.6 never claim - (none specified)
2 1 29,399 859 129.8 assertion violations +
cycle checks - (disabled by -DSAFETY)
2 2 1,071,181 2,352 169.6 invalid end states +
2 3 33,866,736 12,857 1,540.8
State-vector 96 byte, depth reached 2055621, errors: 0
3 1 2,749,453 53,809 236.6 MA stats: -DMA=84 is sufficient
3 2 186,202,860 328,014 10,483.7 Minimized Automaton: 56420520 nodes and 1.75128e+08 edges
9.6647071e+09 states, stored
a Obtained with the compiler flag -DCOLLAPSE specified. 9.7503813e+09 states, matched
1.9415088e+10 transitions (= stored+matched)
7.2047951e+09 atomic steps

Stats on memory usage (in Megabytes):


spin -a qrcu.spin 1142905.887 equivalent memory usage for states
cc -DSAFETY [-DCOLLAPSE] -o pan pan.c (stored*(State-vector + overhead))
./pan [-mN] 5448.879 actual memory usage for states
(compression: 0.48%)
1068.115 memory used for DFS stack (-m20000000)
The output shows that this model passes all of the 1.619 memory lost to fragmentation
6515.375 total actual memory usage
cases shown in Table 12.2. It would be nice to run three
readers and three updaters, however, simple extrapolation unreached in proctype qrcu_reader
(0 of 18 states)
indicates that this will require about half a terabyte of unreached in proctype qrcu_updater
memory. What to do? qrcu.spin:102, state 82, "-end-"
(1 of 82 states)
It turns out that ./pan gives advice when it runs out unreached in init
of memory, for example, when attempting to run three (0 of 23 states)

readers and three updaters: pan: elapsed time 2.72e+05 seconds


pan: rate 35500.523 states/second
hint: to reduce memory, recompile with
-DCOLLAPSE # good, fast compression, or
-DMA=96 # better/slower compression, or
-DHC # hash-compaction, approximation Quick Quiz 12.6: A compression rate of 0.48 % corresponds
-DBITSTATE # supertrace, approximation to a 200-to-1 decrease in memory occupied by the states! Is
the state-space search really exhaustive???
Let’s try the suggested compiler flag -DMA=N, which
generates code for aggressive compression of the state For reference, Table 12.3 summarizes the Spin results
space at the cost of greatly increased search overhead. The with -DCOLLAPSE and -DMA=N compiler flags. The mem-
required commands are as follows: ory usage is obtained with minimal sufficient search depths
and -DMA=N parameters shown in the table. Hashtable
spin -a qrcu.spin sizes for -DCOLLAPSE runs are tweaked by the -wN option
cc -DSAFETY -DMA=96 -O2 -o pan pan.c of ./pan to avoid using too much memory hashing small
./pan -m20000000
state spaces. Hence the memory usage is smaller than
what is shown in Table 12.2, where the hashtable size
Here, the depth limit of 20,000,000 is an order of mag- starts from the default of -w24. The runtime is from a
nitude larger than the expected depth deduced from simple POWER9 server, which shows that -DMA=N suffers up to
extrapolation. Although this increases up-front memory about an order of magnitude higher CPU overhead than
usage, it avoids wasting a long run due to incomplete does -DCOLLAPSE, but on the other hand reduces memory
search resulting from a too-tight depth limit. This run overhead by well over an order of magnitude.
took a little more than 3 days on a POWER9 server. The So far so good. But adding a few more updaters or
result is shown in Listing 12.16. This Spin run completed readers would exhaust memory, even with -DMA=N.2 So
successfully with a total memory usage of only 6.5 GB, what to do? Here are some possible approaches:
which is almost two orders of magnitude lower than the
-DCOLLAPSE usage of about half a terabyte. 2 Alternatively, the CPU consumption would become excessive.

v2022.09.25a
240 CHAPTER 12. FORMAL VERIFICATION

Table 12.3: QRCU Spin Result Summary

-DCOLLAPSE -DMA=N

updaters readers # states depth reached -wN memory (MB) runtime (s) N memory (MB) runtime (s)

1 1 376 95 12 0.10 0.00 40 0.29 0.00


1 2 6,177 218 12 0.39 0.01 47 0.59 0.02
1 3 99,728 385 16 4.60 0.14 54 3.04 0.45
2 1 29,399 859 16 2.30 0.03 55 0.70 0.13
2 2 1,071,181 2,352 20 49.24 1.45 62 7.77 5.76
2 3 33,866,736 12,857 24 1,540.70 62.5 69 111.66 326
3 1 2,749,453 53,809 21 125.25 4.01 70 11.41 19.5
3 2 186,202,860 328,014 28 10,482.51 390 77 222.26 2,560
3 3 9,664,707,100 2,055,621 84 5,557.02 266,000

1. See whether a smaller number of readers and updaters 1. For synchronize_qrcu() to exit too early, then by
suffice to prove the general case. definition there must have been at least one reader
present during synchronize_qrcu()’s full execu-
2. Manually construct a proof of correctness. tion.
3. Use a more capable tool. 2. The counter corresponding to this reader will have
been at least 1 during this time interval.
4. Divide and conquer.
3. The synchronize_qrcu() code forces at least one
The following sections discuss each of these approaches. of the counters to be at least 1 at all times.
4. Therefore, at any given point in time, either one of
12.1.4.2 How Many Readers and Updaters Are Re-
the counters will be at least 2, or both of the counters
ally Needed?
will be at least one.
One approach is to look carefully at the Promela code for
5. However, the synchronize_qrcu() fastpath code
qrcu_updater() and notice that the only global state
can read only one of the counters at a given time. It
change is happening under the lock. Therefore, only one
is therefore possible for the fastpath code to fetch the
updater at a time can possibly be modifying state visible
first counter while zero, but to race with a counter
to either readers or other updaters. This means that any
flip so that the second counter is seen as one.
sequences of state changes can be carried out serially by
a single updater due to the fact that Promela does a full 6. There can be at most one reader persisting through
state-space search. Therefore, at most two updaters are such a race condition, as otherwise the sum would
required: One to change state and a second to become be two or greater, which would cause the updater to
confused. take the slowpath.
The situation with the readers is less clear-cut, as each
reader does only a single read-side critical section then 7. But if the race occurs on the fastpath’s first read of
terminates. It is possible to argue that the useful number the counters, and then again on its second read, there
of readers is limited, due to the fact that the fastpath must have to have been two counter flips.
see at most a zero and a one in the counters. This is a 8. Because a given updater flips the counter only once,
fruitful avenue of investigation, in fact, it leads to the full and because the update-side lock prevents a pair of
proof of correctness described in the next section. updaters from concurrently flipping the counters, the
only way that the fastpath code can race with a flip
12.1.4.3 Alternative Approach: Proof of Correct- twice is if the first updater completes.
ness
9. But the first updater will not complete until after all
An informal proof [McK07c] follows: pre-existing readers have completed.

v2022.09.25a
12.1. STATE-SPACE SEARCH 241

10. Therefore, if the fastpath races with a counter flip QRCU correct. And this is why formal-verification tools
twice in succession, all pre-existing readers must themselves should be tested using bug-injected versions
have completed, so that it is safe to take the fastpath. of the code being verified. If a given tool cannot find the
injected bugs, then that tool is clearly untrustworthy.
Of course, not all parallel algorithms have such simple
Quick Quiz 12.7: But different formal-verification tools
proofs. In such cases, it may be necessary to enlist more are often designed to locate particular classes of bugs. For
capable tools. example, very few formal-verification tools will find an error
in the specification. So isn’t this “clearly untrustworthy”
12.1.4.4 Alternative Approach: More Capable Tools judgment a bit harsh?

Although Promela and Spin are quite useful, much more Therefore, if you do intend to use QRCU, please take
capable tools are available, particularly for verifying hard- care. Its proofs of correctness might or might not them-
ware. This means that if it is possible to translate your selves be correct. Which is one reason why formal verifi-
algorithm to the hardware-design VHDL language, as it cation is unlikely to completely replace testing, as Donald
often will be for low-level parallel algorithms, then it is Knuth pointed out so long ago.
possible to apply these tools to your code (for example, this Quick Quiz 12.8: Given that we have two independent proofs
was done for the first realtime RCU algorithm). However, of correctness for the QRCU algorithm described herein, and
such tools can be quite expensive. given that the proof of incorrectness covers what is known to
Although the advent of commodity multiprocessing be a different algorithm, why is there any room for doubt?
might eventually result in powerful free-software model-
checkers featuring fancy state-space-reduction capabilities,
this does not help much in the here and now. 12.1.5 Promela Parable: dynticks and Pre-
As an aside, there are Spin features that support ap- emptible RCU
proximate searches that require fixed amounts of memory,
however, I have never been able to bring myself to trust In early 2008, a preemptible variant of RCU was accepted
approximations when verifying parallel algorithms. into mainline Linux in support of real-time workloads,
Another approach might be to divide and conquer. a variant similar to the RCU implementations in the -rt
patchset [Mol05] since August 2005. Preemptible RCU
is needed for real-time workloads because older RCU
12.1.4.5 Alternative Approach: Divide and Conquer
implementations disable preemption across RCU read-
It is often possible to break down a larger parallel algorithm side critical sections, resulting in excessive real-time
into smaller pieces, which can then be proven separately. latencies.
For example, a 10-billion-state model might be broken However, one disadvantage of the older -rt implemen-
into a pair of 100,000-state models. Taking this approach tation was that each grace period requires work to be
not only makes it easier for tools such as Promela to verify done on each CPU, even if that CPU is in a low-power
your algorithms, it can also make your algorithms easier “dynticks-idle” state, and thus incapable of executing RCU
to understand. read-side critical sections. The idea behind the dynticks-
idle state is that idle CPUs should be physically powered
12.1.4.6 Is QRCU Really Correct? down in order to conserve energy. In short, preemptible
RCU can disable a valuable energy-conservation feature
Is QRCU really correct? We have a Promela-based me- of recent Linux kernels. Although Josh Triplett and Paul
chanical proof and a by-hand proof that both say that it is. McKenney had discussed some approaches for allowing
However, a recent paper by Alglave et al. [AKT13] says CPUs to remain in low-power state throughout an RCU
otherwise (see Section 5.1 of the paper at the bottom of grace period (thus preserving the Linux kernel’s ability
page 12). Which is it? to conserve energy), matters did not come to a head until
It turns out that both are correct! When QRCU was Steve Rostedt integrated a new dyntick implementation
added to a suite of formal-verification benchmarks, its with preemptible RCU in the -rt patchset.
memory barriers were omitted, thus resulting in a buggy This combination caused one of Steve’s systems to
version of QRCU. So the real news here is that a number hang on boot, so in October, Paul coded up a dynticks-
of formal-verification tools incorrectly proved this buggy friendly modification to preemptible RCU’s grace-period

v2022.09.25a
242 CHAPTER 12. FORMAL VERIFICATION

processing. Steve coded up rcu_irq_enter() and rcu_ Preemptible RCU’s grace-period machinery samples
irq_exit() interfaces called from the irq_enter() the value of the dynticks_progress_counter variable
and irq_exit() interrupt entry/exit functions. These in order to determine when a dynticks-idle CPU may safely
rcu_irq_enter() and rcu_irq_exit() functions are be ignored.
needed to allow RCU to reliably handle situations where The following three sections give an overview of the
a dynticks-idle CPU is momentarily powered up for an task interface, the interrupt/NMI interface, and the use
interrupt handler containing RCU read-side critical sec- of the dynticks_progress_counter variable by the
tions. With these changes in place, Steve’s system booted grace-period machinery as of Linux kernel v2.6.25-rc4.
reliably, but Paul continued inspecting the code periodi-
cally on the assumption that we could not possibly have 12.1.5.2 Task Interface
gotten the code right on the first try.
Paul reviewed the code repeatedly from October 2007 When a given CPU enters dynticks-idle mode because it
to February 2008, and almost always found at least one has no more tasks to run, it invokes rcu_enter_nohz():
bug. In one case, Paul even coded and tested a fix before 1 static inline void rcu_enter_nohz(void)
2 {
realizing that the bug was illusory, and in fact in all cases, 3 mb();
the “bug” turned out to be illusory. 4 __get_cpu_var(dynticks_progress_counter)++;
5 WARN_ON(__get_cpu_var(dynticks_progress_counter) &
Near the end of February, Paul grew tired of this game. 6 0x1);
He therefore decided to enlist the aid of Promela and Spin. 7 }
The following presents a series of seven increasingly real-
This function simply increments dynticks_
istic Promela models, the last of which passes, consuming
progress_counter and checks that the result is even, but
about 40 GB of main memory for the state space.
first executing a memory barrier to ensure that any other
More important, Promela and Spin did find a very subtle
CPU that sees the new value of dynticks_progress_
bug for me!
counter will also see the completion of any prior RCU
Quick Quiz 12.9: Yeah, that’s just great! Now, just what read-side critical sections.
am I supposed to do if I don’t happen to have a machine with Similarly, when a CPU that is in dynticks-idle mode
40 GB of main memory??? prepares to start executing a newly runnable task, it invokes
rcu_exit_nohz():
Still better would be to come up with a simpler and
1 static inline void rcu_exit_nohz(void)
faster algorithm that has a smaller state space. Even better 2 {
would be an algorithm so simple that its correctness was 3 __get_cpu_var(dynticks_progress_counter)++;
4 mb();
obvious to the casual observer! 5 WARN_ON(!(__get_cpu_var(dynticks_progress_counter) &
Sections 12.1.5.1–12.1.5.4 give an overview of pre- 6 0x1));
7 }
emptible RCU’s dynticks interface, followed by Sec-
tion 12.1.6’s discussion of the validation of the interface. This function again increments dynticks_progress_
counter, but follows it with a memory barrier to ensure
12.1.5.1 Introduction to Preemptible RCU and that if any other CPU sees the result of any subsequent
dynticks RCU read-side critical section, then that other CPU will
also see the incremented value of dynticks_progress_
The per-CPU dynticks_progress_counter variable is counter. Finally, rcu_exit_nohz() checks that the
central to the interface between dynticks and preemptible result of the increment is an odd value.
RCU. This variable has an even value whenever the The rcu_enter_nohz() and rcu_exit_nohz()
corresponding CPU is in dynticks-idle mode, and an odd functions handle the case where a CPU enters and exits
value otherwise. A CPU exits dynticks-idle mode for the dynticks-idle mode due to task execution, but does not
following three reasons: handle interrupts, which are covered in the following
section.
1. To start running a task,
2. When entering the outermost of a possibly nested set 12.1.5.3 Interrupt Interface
of interrupt handlers, and
The rcu_irq_enter() and rcu_irq_exit() functions
3. When entering an NMI handler. handle interrupt/NMI entry and exit, respectively. Of

v2022.09.25a
12.1. STATE-SPACE SEARCH 243

course, nested interrupts must also be properly accounted 9 smp_mb();


10 per_cpu(dynticks_progress_counter, cpu)++;
for. The possibility of nested interrupts is handled by a 11 WARN_ON(per_cpu(dynticks_progress_counter,
second per-CPU variable, rcu_update_flag, which is 12 cpu) & 0x1);
13 }
incremented upon entry to an interrupt or NMI handler 14 }
(in rcu_irq_enter()) and is decremented upon exit
(in rcu_irq_exit()). In addition, the pre-existing in_
interrupt() primitive is used to distinguish between an Line 3 fetches the current CPU’s number, as before.
outermost or a nested interrupt/NMI. Line 5 checks to see if the rcu_update_flag is non-
Interrupt entry is handled by the rcu_irq_enter() zero, returning immediately (via falling off the end of the
shown below: function) if not. Otherwise, lines 6 through 12 come into
play. Line 6 decrements rcu_update_flag, returning if
1 void rcu_irq_enter(void) the result is not zero. Line 8 verifies that we are indeed
2 {
3 int cpu = smp_processor_id(); leaving the outermost level of nested interrupts, line 9
4 executes a memory barrier, line 10 increments dynticks_
5 if (per_cpu(rcu_update_flag, cpu))
6 per_cpu(rcu_update_flag, cpu)++; progress_counter, and lines 11 and 12 verify that this
7 if (!in_interrupt() && variable is now even. As with rcu_enter_nohz(), the
8 (per_cpu(dynticks_progress_counter,
9 cpu) & 0x1) == 0) { memory barrier ensures that any other CPU that sees the
10 per_cpu(dynticks_progress_counter, cpu)++; increment of dynticks_progress_counter will also
11 smp_mb();
12 per_cpu(rcu_update_flag, cpu)++; see the effects of an RCU read-side critical section in
13 } the interrupt handler (preceding the rcu_irq_exit()
14 }
invocation).
Line 3 fetches the current CPU’s number, while lines 5 These two sections have described how the dynticks_
and 6 increment the rcu_update_flag nesting counter progress_counter variable is maintained during entry
if it is already non-zero. Lines 7–9 check to see whether to and exit from dynticks-idle mode, both by tasks and by
we are the outermost level of interrupt, and, if so, whether interrupts and NMIs. The following section describes how
dynticks_progress_counter needs to be incremented. this variable is used by preemptible RCU’s grace-period
If so, line 10 increments dynticks_progress_counter, machinery.
line 11 executes a memory barrier, and line 12 increments
rcu_update_flag. As with rcu_exit_nohz(), the
memory barrier ensures that any other CPU that sees the 12.1.5.4 Grace-Period Interface
effects of an RCU read-side critical section in the interrupt
Of the four preemptible RCU grace-period states shown in
handler (following the rcu_irq_enter() invocation)
Figure 12.1, only the rcu_try_flip_waitack_state
will also see the increment of dynticks_progress_
and rcu_try_flip_waitmb_state states need to wait
counter.
for other CPUs to respond.
Quick Quiz 12.10: Why not simply increment rcu_update_
Of course, if a given CPU is in dynticks-idle state, we
flag, and then only increment dynticks_progress_
shouldn’t wait for it. Therefore, just before entering one
counter if the old value of rcu_update_flag was zero???
of these two states, the preceding state takes a snapshot
of each CPU’s dynticks_progress_counter variable,
Quick Quiz 12.11: But if line 7 finds that we are the placing the snapshot in another per-CPU variable, rcu_
outermost interrupt, wouldn’t we always need to increment dyntick_snapshot. This is accomplished by invoking
dynticks_progress_counter? dyntick_save_progress_counter(), shown below:

Interrupt exit is handled similarly by rcu_irq_exit(): 1 static void dyntick_save_progress_counter(int cpu)


2 {
1 void rcu_irq_exit(void) 3 per_cpu(rcu_dyntick_snapshot, cpu) =
2 { 4 per_cpu(dynticks_progress_counter, cpu);
3 int cpu = smp_processor_id(); 5 }
4
5 if (per_cpu(rcu_update_flag, cpu)) {
6 if (--per_cpu(rcu_update_flag, cpu))
7 return;
The rcu_try_flip_waitack_state state invokes
8 WARN_ON(in_interrupt()); rcu_try_flip_waitack_needed(), shown below:

v2022.09.25a
244 CHAPTER 12. FORMAL VERIFICATION

rcu_try_flip_idle_state For its part, the rcu_try_flip_waitmb_state state


Still no activity
(No RCU activity) invokes rcu_try_flip_waitmb_needed(), shown be-
low:
Increment grace−period counter
Request counter−flip acknowledgement 1 static inline int
2 rcu_try_flip_waitmb_needed(int cpu)
3 {
rcu_try_flip_waitack_state
4 long curr;
(Wait for acknowledgements) 5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
Memory barrier 8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb();
10 if ((curr == snap) && ((curr & 0x1) == 0))
rcu_try_flip_waitzero_state 11 return 0;
(Wait for RCU read−side 12 if (curr != snap)
critical sections to complete) 13 return 0;
14 return 1;
Request memory barriers 15 }

This is quite similar to rcu_try_flip_waitack_


rcu_try_flip_waitmb_state
needed(), the difference being in lines 12 and 13, be-
(Wait for memory barriers)
cause any transition either to or from dynticks-idle state
executes the memory barrier needed by the rcu_try_
flip_waitmb_state state.
Figure 12.1: Preemptible RCU State Machine We now have seen all the code involved in the interface
between RCU and the dynticks-idle state. The next section
builds up the Promela model used to verify this code.
1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
Quick Quiz 12.12: Can you spot any bugs in any of the code
3 { in this section?
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu); 12.1.6 Validating Preemptible RCU and
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb(); dynticks
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if ((curr - snap) > 2 || (snap & 0x1) == 0)
This section develops a Promela model for the interface
13 return 0; between dynticks and RCU step by step, with each of
14 return 1;
15 }
Sections 12.1.6.1–12.1.6.7 illustrating one step, starting
with the process-level code, adding assertions, interrupts,
and finally NMIs.
Lines 7 and 8 pick up current and snapshot versions Section 12.1.6.8 lists lessons (re)learned during this
of dynticks_progress_counter, respectively. The effort, and Sections 12.1.6.9–12.1.6.15 present a simpler
memory barrier on line 9 ensures that the counter checks solution to RCU’s dynticks problem.
in the later rcu_try_flip_waitzero_state follow the
fetches of these counters. Lines 10 and 11 return zero 12.1.6.1 Basic Model
(meaning no communication with the specified CPU is
This section translates the process-level dynticks en-
required) if that CPU has remained in dynticks-idle state
try/exit code and the grace-period processing into
since the time that the snapshot was taken. Similarly,
Promela [Hol03]. We start with rcu_exit_nohz() and
lines 12 and 13 return zero if that CPU was initially in
rcu_enter_nohz() from the 2.6.25-rc4 kernel, placing
dynticks-idle state or if it has completely passed through
these in a single Promela process that models exiting and
a dynticks-idle state. In both these cases, there is no
entering dynticks-idle mode in a loop as follows:
way that the CPU could have retained the old value of
the grace-period counter. If neither of these conditions 1 proctype dyntick_nohz()
2 {
hold, line 14 returns one, meaning that the CPU needs to 3 byte tmp;
explicitly respond. 4 byte i = 0;

v2022.09.25a
12.1. STATE-SPACE SEARCH 245

5 3 byte curr;
6 do 4 byte snap;
7 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break; 5
8 :: i < MAX_DYNTICK_LOOP_NOHZ -> 6 atomic {
9 tmp = dynticks_progress_counter; 7 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 atomic { 8 snap = dynticks_progress_counter;
11 dynticks_progress_counter = tmp + 1; 9 }
12 assert((dynticks_progress_counter & 1) == 1); 10 do
13 } 11 :: 1 ->
14 tmp = dynticks_progress_counter; 12 atomic {
15 atomic { 13 curr = dynticks_progress_counter;
16 dynticks_progress_counter = tmp + 1; 14 if
17 assert((dynticks_progress_counter & 1) == 0); 15 :: (curr == snap) && ((curr & 1) == 0) ->
18 } 16 break;
19 i++; 17 :: (curr - snap) > 2 || (snap & 1) == 0 ->
20 od; 18 break;
21 } 19 :: 1 -> skip;
20 fi;
21 }
Lines 6 and 20 define a loop. Line 7 exits the loop 22 od;
once the loop counter i has exceeded the limit MAX_ 23 snap = dynticks_progress_counter;
24 do
DYNTICK_LOOP_NOHZ. Line 8 tells the loop construct to 25 :: 1 ->
execute lines 9–19 for each pass through the loop. Be- 26 atomic {
27 curr = dynticks_progress_counter;
cause the conditionals on lines 7 and 8 are exclusive of 28 if
each other, the normal Promela random selection of true 29 :: (curr == snap) && ((curr & 1) == 0) ->
30 break;
conditions is disabled. Lines 9 and 11 model rcu_ 31 :: (curr != snap) ->
exit_nohz()’s non-atomic increment of dynticks_ 32 break;
33 :: 1 -> skip;
progress_counter, while line 12 models the WARN_ 34 fi;
ON(). The atomic construct simply reduces the Promela 35 }
36 od;
state space, given that the WARN_ON() is not strictly speak- 37 }
ing part of the algorithm. Lines 14–18 similarly model
the increment and WARN_ON() for rcu_enter_nohz().
Finally, line 19 increments the loop counter. Lines 6–9 print out the loop limit (but only into the
Each pass through the loop therefore models a CPU ex- .trail file in case of error) and models a line of code from
iting dynticks-idle mode (for example, starting to execute rcu_try_flip_idle() and its call to dyntick_save_
a task), then re-entering dynticks-idle mode (for example, progress_counter(), which takes a snapshot of the
that same task blocking). current CPU’s dynticks_progress_counter variable.
Quick Quiz 12.13: Why isn’t the memory barrier in rcu_ These two lines are executed atomically to reduce state
exit_nohz() and rcu_enter_nohz() modeled in Promela? space.
Lines 10–22 model the relevant code in rcu_
Quick Quiz 12.14: Isn’t it a bit strange to model rcu_exit_ try_flip_waitack() and its call to rcu_try_flip_
nohz() followed by rcu_enter_nohz()? Wouldn’t it be waitack_needed(). This loop is modeling the grace-
more natural to instead model entry before exit? period state machine waiting for a counter-flip acknowl-
edgement from each CPU, but only that part that interacts
The next step is to model the interface to with dynticks-idle CPUs.
RCU’s grace-period processing. For this, we Line 23 models a line from rcu_try_flip_
need to model dyntick_save_progress_counter(), waitzero() and its call to dyntick_save_progress_
rcu_try_flip_waitack_needed(), rcu_try_flip_ counter(), again taking a snapshot of the CPU’s
waitmb_needed(), as well as portions of rcu_try_ dynticks_progress_counter variable.
flip_waitack() and rcu_try_flip_waitmb(), all
from the 2.6.25-rc4 kernel. The following grace_ Finally, lines 24–36 model the relevant code in rcu_
period() Promela process models these functions as try_flip_waitack() and its call to rcu_try_flip_
they would be invoked during a single pass through pre- waitack_needed(). This loop is modeling the grace-
emptible RCU’s grace-period processing. period state-machine waiting for each CPU to execute a
memory barrier, but again only that part that interacts
1 proctype grace_period()
2 {
with dynticks-idle CPUs.

v2022.09.25a
246 CHAPTER 12. FORMAL VERIFICATION

Quick Quiz 12.15: Wait a minute! In the Linux kernel, 36 :: (curr == snap) && ((curr & 1) == 0) ->
37 break;
both dynticks_progress_counter and rcu_dyntick_ 38 :: (curr != snap) ->
snapshot are per-CPU variables. So why are they instead 39 break;
being modeled as single global variables? 40 :: 1 -> skip;
41 fi;
42 }
The resulting model (dyntickRCU-base.spin), 43 od;
when run with the runspin.sh script, generates 691 44 grace_period_state = GP_DONE;
45 }
states and passes without errors, which is not at all sur-
prising given that it completely lacks the assertions that Lines 6, 10, 25, 26, 29, and 44 update this variable (com-
could find failures. The next section therefore adds safety bining atomically with algorithmic operations where fea-
assertions. sible) to allow the dyntick_nohz() process to verify the
basic RCU safety property. The form of this verification
12.1.6.2 Validating Safety is to assert that the value of the grace_period_state
variable cannot jump from GP_IDLE to GP_DONE during
A safe RCU implementation must never permit a grace a time period over which RCU readers could plausibly
period to complete before the completion of any RCU persist.
readers that started before the start of the grace period.
This is modeled by a grace_period_state variable that Quick Quiz 12.16: Given there are a pair of back-to-back
changes to grace_period_state on lines 25 and 26, how
can take on three states as follows:
can we be sure that line 25’s changes won’t be lost?
1 #define GP_IDLE 0
2 #define GP_WAITING 1 The dyntick_nohz() Promela process implements
3 #define GP_DONE 2
4 byte grace_period_state = GP_DONE; this verification as shown below:
1 proctype dyntick_nohz()
The grace_period() process sets this variable as it 2 {
3 byte tmp;
progresses through the grace-period phases, as shown 4 byte i = 0;
below: 5 bit old_gp_idle;
6
7 do
1 proctype grace_period()
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
2 {
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
3 byte curr;
10 tmp = dynticks_progress_counter;
4 byte snap;
11 atomic {
5
12 dynticks_progress_counter = tmp + 1;
6 grace_period_state = GP_IDLE;
13 old_gp_idle = (grace_period_state == GP_IDLE);
7 atomic {
14 assert((dynticks_progress_counter & 1) == 1);
8 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
15 }
9 snap = dynticks_progress_counter;
16 atomic {
10 grace_period_state = GP_WAITING;
17 tmp = dynticks_progress_counter;
11 }
18 assert(!old_gp_idle ||
12 do
19 grace_period_state != GP_DONE);
13 :: 1 ->
20 }
14 atomic {
21 atomic {
15 curr = dynticks_progress_counter;
22 dynticks_progress_counter = tmp + 1;
16 if
23 assert((dynticks_progress_counter & 1) == 0);
17 :: (curr == snap) && ((curr & 1) == 0) ->
24 }
18 break;
25 i++;
19 :: (curr - snap) > 2 || (snap & 1) == 0 ->
26 od;
20 break;
27 }
21 :: 1 -> skip;
22 fi;
23 } Line 13 sets a new old_gp_idle flag if the value of
24 od;
25 grace_period_state = GP_DONE;
the grace_period_state variable is GP_IDLE at the
26 grace_period_state = GP_IDLE; beginning of task execution, and the assertion at lines 18
27 atomic {
28 snap = dynticks_progress_counter;
and 19 fire if the grace_period_state variable has
29 grace_period_state = GP_WAITING; advanced to GP_DONE during task execution, which would
30 }
31 do
be illegal given that a single RCU read-side critical section
32 :: 1 -> could span the entire intervening time period.
33 atomic {
34 curr = dynticks_progress_counter;
The resulting model (dyntickRCU-base-s.spin),
35 if when run with the runspin.sh script, generates 964

v2022.09.25a
12.1. STATE-SPACE SEARCH 247

states and passes without errors, which is reassuring. That 23 :: (curr - snap) > 2 || (snap & 1) == 0 ->
24 break;
said, although safety is critically important, it is also quite 25 :: else -> skip;
important to avoid indefinitely stalling grace periods. The 26 fi;
27 }
next section therefore covers verifying liveness. 28 od;
29 grace_period_state = GP_DONE;
30 grace_period_state = GP_IDLE;
12.1.6.3 Validating Liveness 31 atomic {
32 shouldexit = 0;
Although liveness can be difficult to prove, there is a 33 snap = dynticks_progress_counter;
34 grace_period_state = GP_WAITING;
simple trick that applies here. The first step is to make 35 }
dyntick_nohz() indicate that it is done via a dyntick_ 36 do
37 :: 1 ->
nohz_done variable, as shown on line 27 of the following: 38 atomic {
39 assert(!shouldexit);
1 proctype dyntick_nohz() 40 shouldexit = dyntick_nohz_done;
2 { 41 curr = dynticks_progress_counter;
3 byte tmp; 42 if
4 byte i = 0; 43 :: (curr == snap) && ((curr & 1) == 0) ->
5 bit old_gp_idle; 44 break;
6 45 :: (curr != snap) ->
7 do 46 break;
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break; 47 :: else -> skip;
9 :: i < MAX_DYNTICK_LOOP_NOHZ -> 48 fi;
10 tmp = dynticks_progress_counter; 49 }
11 atomic { 50 od;
12 dynticks_progress_counter = tmp + 1; 51 grace_period_state = GP_DONE;
13 old_gp_idle = (grace_period_state == GP_IDLE); 52 }
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic { We have added the shouldexit variable on line 5,
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle || which we initialize to zero on line 10. Line 17 as-
19 grace_period_state != GP_DONE); serts that shouldexit is not set, while line 18 sets
20 }
21 atomic { shouldexit to the dyntick_nohz_done variable main-
22 dynticks_progress_counter = tmp + 1; tained by dyntick_nohz(). This assertion will there-
23 assert((dynticks_progress_counter & 1) == 0);
24 } fore trigger if we attempt to take more than one pass
25 i++; through the wait-for-counter-flip-acknowledgement loop
26 od;
27 dyntick_nohz_done = 1; after dyntick_nohz() has completed execution. After
28 } all, if dyntick_nohz() is done, then there cannot be any
more state changes to force us out of the loop, so going
With this variable in place, we can add assertions to through twice in this state means an infinite loop, which
grace_period() to check for unnecessary blockage as in turn means no end to the grace period.
follows:
Lines 32, 39, and 40 operate in a similar manner for the
1 proctype grace_period() second (memory-barrier) loop.
2 {
3 byte curr;
However, running this model (dyntickRCU-base-
4 byte snap; sl-busted.spin) results in failure, as line 23 is check-
5 bit shouldexit;
6
ing that the wrong variable is even. Upon failure,
7 grace_period_state = GP_IDLE; spin writes out a “trail” file (dyntickRCU-base-sl-
8 atomic {
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
busted.spin.trail), which records the sequence of
10 shouldexit = 0; states that lead to the failure. Use the “spin -t -p -g
11 snap = dynticks_progress_counter;
12 grace_period_state = GP_WAITING;
-l dyntickRCU-base-sl-busted.spin” command
13 } to cause spin to retrace this sequence of states, print-
14 do
15 :: 1 ->
ing the statements executed and the values of vari-
16 atomic { ables (dyntickRCU-base-sl-busted.spin.trail.
17 assert(!shouldexit);
18 shouldexit = dyntick_nohz_done;
txt). Note that the line numbers do not match the listing
19 curr = dynticks_progress_counter; above due to the fact that spin takes both functions in a
20 if
21 :: (curr == snap) && ((curr & 1) == 0) ->
single file. However, the line numbers do match the full
22 break; model (dyntickRCU-base-sl-busted.spin).

v2022.09.25a
248 CHAPTER 12. FORMAL VERIFICATION

We see that the dyntick_nohz() process completed at Lines 10–13 can now be combined and simplified,
step 34 (search for “34:”), but that the grace_period() resulting in the following. A similar simplification can be
process nonetheless failed to exit the loop. The value of applied to rcu_try_flip_waitmb_needed().
curr is 6 (see step 35) and that the value of snap is 5 (see
1 static inline int
step 17). Therefore the first condition on line 21 above 2 rcu_try_flip_waitack_needed(int cpu)
does not hold because “curr != snap”, and the second 3 {
4 long curr;
condition on line 23 does not hold either because snap is 5 long snap;
odd and because curr is only one greater than snap. 6
7 curr = per_cpu(dynticks_progress_counter, cpu);
So one of these two conditions has to be incorrect. Refer- 8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
ring to the comment block in rcu_try_flip_waitack_ 9 smp_mb();
10 if ((curr - snap) >= 2 || (curr & 0x1) == 0)
needed() for the first condition: 11 return 0;
12 return 1;
If the CPU remained in dynticks mode for the 13 }
entire time and didn’t take any interrupts, NMIs,
SMIs, or whatever, then it cannot be in the Making the corresponding correction in the model
middle of an rcu_read_lock(), so the next (dyntickRCU-base-sl.spin) results in a correct verifi-
rcu_read_lock() it executes must use the cation with 661 states that passes without errors. However,
new value of the counter. So we can safely it is worth noting that the first version of the liveness verifi-
pretend that this CPU already acknowledged the cation failed to catch this bug, due to a bug in the liveness
counter. verification itself. This liveness-verification bug was lo-
cated by inserting an infinite loop in the grace_period()
The first condition does match this, because if process, and noting that the liveness-verification code
“curr == snap” and if curr is even, then the corre- failed to detect this problem!
sponding CPU has been in dynticks-idle mode the entire
We have now successfully verified both safety and
time, as required. So let’s look at the comment block for
liveness conditions, but only for processes running and
the second condition:
blocking. We also need to handle interrupts, a task taken
If the CPU passed through or entered a dynticks up in the next section.
idle phase with no active irq handlers, then,
as above, we can safely pretend that this CPU 12.1.6.4 Interrupts
already acknowledged the counter.
There are a couple of ways to model interrupts in Promela:
The first part of the condition is correct, because if
curr and snap differ by two, there will be at least one 1. Using C-preprocessor tricks to insert the interrupt
even number in between, corresponding to having passed handler between each and every statement of the
completely through a dynticks-idle phase. However, the dynticks_nohz() process, or
second part of the condition corresponds to having started
in dynticks-idle mode, not having finished in this mode. 2. Modeling the interrupt handler with a separate
We therefore need to be testing curr rather than snap for process.
being an even number.
A bit of thought indicated that the second approach
The corrected C code is as follows:
would have a smaller state space, though it requires that
1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
the interrupt handler somehow run atomically with respect
3 { to the dynticks_nohz() process, but not with respect
4 long curr;
5 long snap;
to the grace_period() process.
6 Fortunately, it turns out that Promela permits you
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
to branch out of atomic statements. This trick allows
9 smp_mb(); us to have the interrupt handler set a flag, and recode
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
dynticks_nohz() to atomically check this flag and ex-
12 if ((curr - snap) > 2 || (curr & 0x1) == 0) ecute only when the flag is not set. This can be accom-
13 return 0;
14 return 1;
plished with a C-preprocessor macro that takes a label
15 } and a Promela statement as follows:

v2022.09.25a
12.1. STATE-SPACE SEARCH 249

1 #define EXECUTE_MAINLINE(label, stmt) \ Quick Quiz 12.18: But what if the dynticks_nohz()
2 label: skip; \
3 atomic { \
process had “if” or “do” statements with conditions, where
4 if \ the statement bodies of these constructs needed to execute
5 :: in_dyntick_irq -> goto label; \ non-atomically?
6 :: else -> stmt; \
7 fi; \
8 } The next step is to write a dyntick_irq() process to
model an interrupt handler:
One might use this macro as follows: 1 proctype dyntick_irq()
2 {
EXECUTE_MAINLINE(stmt1, 3 byte tmp;
tmp = dynticks_progress_counter) 4 byte i = 0;
5 bit old_gp_idle;
6
7 do
Line 2 of the macro creates the specified statement label. 8 :: i >= MAX_DYNTICK_LOOP_IRQ -> break;
9 :: i < MAX_DYNTICK_LOOP_IRQ ->
Lines 3–8 are an atomic block that tests the in_dyntick_ 10 in_dyntick_irq = 1;
irq variable, and if this variable is set (indicating that the 11 if
12 :: rcu_update_flag > 0 ->
interrupt handler is active), branches out of the atomic 13 tmp = rcu_update_flag;
block back to the label. Otherwise, line 6 executes the 14 rcu_update_flag = tmp + 1;
15 :: else -> skip;
specified statement. The overall effect is that mainline 16 fi;
execution stalls any time an interrupt is active, as required. 17 if
18 :: !in_interrupt &&
19 (dynticks_progress_counter & 1) == 0 ->
20 tmp = dynticks_progress_counter;
12.1.6.5 Validating Interrupt Handlers 21 dynticks_progress_counter = tmp + 1;
22 tmp = rcu_update_flag;
The first step is to convert dyntick_nohz() to EXECUTE_ 23 rcu_update_flag = tmp + 1;
24 :: else -> skip;
MAINLINE() form, as follows: 25 fi;
26 tmp = in_interrupt;
1 proctype dyntick_nohz() 27 in_interrupt = tmp + 1;
2 { 28 old_gp_idle = (grace_period_state == GP_IDLE);
3 byte tmp; 29 assert(!old_gp_idle ||
4 byte i = 0; 30 grace_period_state != GP_DONE);
5 bit old_gp_idle; 31 tmp = in_interrupt;
6 32 in_interrupt = tmp - 1;
7 do 33 if
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break; 34 :: rcu_update_flag != 0 ->
9 :: i < MAX_DYNTICK_LOOP_NOHZ -> 35 tmp = rcu_update_flag;
10 EXECUTE_MAINLINE(stmt1, 36 rcu_update_flag = tmp - 1;
11 tmp = dynticks_progress_counter) 37 if
12 EXECUTE_MAINLINE(stmt2, 38 :: rcu_update_flag == 0 ->
13 dynticks_progress_counter = tmp + 1; 39 tmp = dynticks_progress_counter;
14 old_gp_idle = (grace_period_state == GP_IDLE); 40 dynticks_progress_counter = tmp + 1;
15 assert((dynticks_progress_counter & 1) == 1)) 41 :: else -> skip;
16 EXECUTE_MAINLINE(stmt3, 42 fi;
17 tmp = dynticks_progress_counter; 43 :: else -> skip;
18 assert(!old_gp_idle || 44 fi;
19 grace_period_state != GP_DONE)) 45 atomic {
20 EXECUTE_MAINLINE(stmt4, 46 in_dyntick_irq = 0;
21 dynticks_progress_counter = tmp + 1; 47 i++;
22 assert((dynticks_progress_counter & 1) == 0)) 48 }
23 i++; 49 od;
24 od; 50 dyntick_irq_done = 1;
25 dyntick_nohz_done = 1; 51 }
26 }
The loop from lines 7–49 models up to MAX_DYNTICK_
It is important to note that when a group of statements LOOP_IRQ interrupts, with lines 8 and 9 forming the loop
is passed to EXECUTE_MAINLINE(), as in lines 12–15, all condition and line 47 incrementing the control variable.
statements in that group execute atomically. Line 10 tells dyntick_nohz() that an interrupt handler
Quick Quiz 12.17: But what would you do if you needed is running, and line 46 tells dyntick_nohz() that this
the statements in a single EXECUTE_MAINLINE() group to handler has completed. Line 50 is used for liveness
execute non-atomically? verification, just like the corresponding line of dyntick_
nohz().

v2022.09.25a
250 CHAPTER 12. FORMAL VERIFICATION

Quick Quiz 12.19: Why are lines 46 and 47 (the The implementation of grace_period() is very simi-
“in_dyntick_irq = 0;” and the “i++;”) executed atom- lar to the earlier one. The only changes are the addition of
ically? line 10 to add the new interrupt-count parameter, changes
to lines 19 and 39 to add the new dyntick_irq_done
Lines 11–25 model rcu_irq_enter(), and lines 26 variable to the liveness checks, and of course the optimiza-
and 27 model the relevant snippet of __irq_enter(). tions on lines 22 and 42.
Lines 28–30 verify safety in much the same manner as do This model (dyntickRCU-irqnn-ssl.spin) results
the corresponding lines of dynticks_nohz(). Lines 31 in a correct verification with roughly half a million states,
and 32 model the relevant snippet of __irq_exit(), and passing without errors. However, this version of the model
finally lines 33–44 model rcu_irq_exit(). does not handle nested interrupts. This topic is taken up
in the next section.
Quick Quiz 12.20: What property of interrupts is this
dynticks_irq() process unable to model?
12.1.6.6 Validating Nested Interrupt Handlers
The grace_period() process then becomes as fol- Nested interrupt handlers may be modeled by splitting the
lows: body of the loop in dyntick_irq() as follows:
1 proctype dyntick_irq()
1 proctype grace_period() 2 {
2 { 3 byte tmp;
3 byte curr; 4 byte i = 0;
4 byte snap; 5 byte j = 0;
5 bit shouldexit; 6 bit old_gp_idle;
6 7 bit outermost;
7 grace_period_state = GP_IDLE; 8
8 atomic { 9 do
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ); 10 :: i >= MAX_DYNTICK_LOOP_IRQ &&
10 printf("MDLI = %d\n", MAX_DYNTICK_LOOP_IRQ); 11 j >= MAX_DYNTICK_LOOP_IRQ -> break;
11 shouldexit = 0; 12 :: i < MAX_DYNTICK_LOOP_IRQ ->
12 snap = dynticks_progress_counter; 13 atomic {
13 grace_period_state = GP_WAITING; 14 outermost = (in_dyntick_irq == 0);
14 } 15 in_dyntick_irq = 1;
15 do 16 }
16 :: 1 -> 17 if
17 atomic { 18 :: rcu_update_flag > 0 ->
18 assert(!shouldexit); 19 tmp = rcu_update_flag;
19 shouldexit = dyntick_nohz_done && dyntick_irq_done; 20 rcu_update_flag = tmp + 1;
20 curr = dynticks_progress_counter; 21 :: else -> skip;
21 if 22 fi;
22 :: (curr - snap) >= 2 || (curr & 1) == 0 -> 23 if
23 break; 24 :: !in_interrupt &&
24 :: else -> skip; 25 (dynticks_progress_counter & 1) == 0 ->
25 fi; 26 tmp = dynticks_progress_counter;
26 } 27 dynticks_progress_counter = tmp + 1;
27 od; 28 tmp = rcu_update_flag;
28 grace_period_state = GP_DONE; 29 rcu_update_flag = tmp + 1;
29 grace_period_state = GP_IDLE; 30 :: else -> skip;
30 atomic { 31 fi;
31 shouldexit = 0; 32 tmp = in_interrupt;
32 snap = dynticks_progress_counter; 33 in_interrupt = tmp + 1;
33 grace_period_state = GP_WAITING; 34 atomic {
34 } 35 if
35 do 36 :: outermost ->
36 :: 1 -> 37 old_gp_idle = (grace_period_state == GP_IDLE);
37 atomic { 38 :: else -> skip;
38 assert(!shouldexit); 39 fi;
39 shouldexit = dyntick_nohz_done && dyntick_irq_done; 40 }
40 curr = dynticks_progress_counter; 41 i++;
41 if 42 :: j < i ->
42 :: (curr != snap) || ((curr & 1) == 0) -> 43 atomic {
43 break; 44 if
44 :: else -> skip; 45 :: j + 1 == i ->
45 fi; 46 assert(!old_gp_idle ||
46 } 47 grace_period_state != GP_DONE);
47 od; 48 :: else -> skip;
48 grace_period_state = GP_DONE; 49 fi;
49 } 50 }

v2022.09.25a
12.1. STATE-SPACE SEARCH 251

51 tmp = in_interrupt; 3 byte tmp;


52 in_interrupt = tmp - 1; 4 byte i = 0;
53 if 5 bit old_gp_idle;
54 :: rcu_update_flag != 0 -> 6
55 tmp = rcu_update_flag; 7 do
56 rcu_update_flag = tmp - 1; 8 :: i >= MAX_DYNTICK_LOOP_NMI -> break;
57 if 9 :: i < MAX_DYNTICK_LOOP_NMI ->
58 :: rcu_update_flag == 0 -> 10 in_dyntick_nmi = 1;
59 tmp = dynticks_progress_counter; 11 if
60 dynticks_progress_counter = tmp + 1; 12 :: rcu_update_flag > 0 ->
61 :: else -> skip; 13 tmp = rcu_update_flag;
62 fi; 14 rcu_update_flag = tmp + 1;
63 :: else -> skip; 15 :: else -> skip;
64 fi; 16 fi;
65 atomic { 17 if
66 j++; 18 :: !in_interrupt &&
67 in_dyntick_irq = (i != j); 19 (dynticks_progress_counter & 1) == 0 ->
68 } 20 tmp = dynticks_progress_counter;
69 od; 21 dynticks_progress_counter = tmp + 1;
70 dyntick_irq_done = 1; 22 tmp = rcu_update_flag;
71 } 23 rcu_update_flag = tmp + 1;
24 :: else -> skip;
25 fi;
This is similar to the earlier dynticks_irq() process. 26 tmp = in_interrupt;
It adds a second counter variable j on line 5, so that i 27 in_interrupt = tmp + 1;
28 old_gp_idle = (grace_period_state == GP_IDLE);
counts entries to interrupt handlers and j counts exits. The 29 assert(!old_gp_idle ||
outermost variable on line 7 helps determine when the 30 grace_period_state != GP_DONE);
31 tmp = in_interrupt;
grace_period_state variable needs to be sampled for 32 in_interrupt = tmp - 1;
the safety checks. The loop-exit check on lines 10 and 11 33 if
34 :: rcu_update_flag != 0 ->
is updated to require that the specified number of interrupt 35 tmp = rcu_update_flag;
handlers are exited as well as entered, and the increment 36 rcu_update_flag = tmp - 1;
37 if
of i is moved to line 41, which is the end of the interrupt- 38 :: rcu_update_flag == 0 ->
entry model. Lines 13–16 set the outermost variable to 39 tmp = dynticks_progress_counter;
40 dynticks_progress_counter = tmp + 1;
indicate whether this is the outermost of a set of nested 41 :: else -> skip;
interrupts and to set the in_dyntick_irq variable that 42 fi;
43 :: else -> skip;
is used by the dyntick_nohz() process. Lines 34–40 44 fi;
capture the state of the grace_period_state variable, 45 atomic {
46 i++;
but only when in the outermost interrupt handler. 47 in_dyntick_nmi = 0;
Line 42 has the do-loop conditional for interrupt-exit 48 }
49 od;
modeling: As long as we have exited fewer interrupts 50 dyntick_nmi_done = 1;
than we have entered, it is legal to exit another interrupt. 51 }
Lines 43–50 check the safety criterion, but only if we
Of course, the fact that we have NMIs requires ad-
are exiting from the outermost interrupt level. Finally,
justments in the other components. For example, the
lines 65–68 increment the interrupt-exit count j and, if this
EXECUTE_MAINLINE() macro now needs to pay atten-
is the outermost interrupt level, clears in_dyntick_irq.
tion to the NMI handler (in_dyntick_nmi) as well as
This model (dyntickRCU-irq-ssl.spin) results in
the interrupt handler (in_dyntick_irq) by checking the
a correct verification with a bit more than half a million
dyntick_nmi_done variable as follows:
states, passing without errors. However, this version of
the model does not handle NMIs, which are taken up in 1 #define EXECUTE_MAINLINE(label, stmt) \
2 label: skip; \
the next section. 3 atomic { \
4 if \
5 :: in_dyntick_irq || \
12.1.6.7 Validating NMI Handlers 6 in_dyntick_nmi -> goto label; \
7 :: else -> stmt; \
We take the same general approach for NMIs as we do for 8 fi; \
9 }
interrupts, keeping in mind that NMIs do not nest. This
results in a dyntick_nmi() process as follows: We will also need to introduce an EXECUTE_IRQ()
1 proctype dyntick_nmi()
macro that checks in_dyntick_nmi in order to allow
2 { dyntick_irq() to exclude dyntick_nmi():

v2022.09.25a
252 CHAPTER 12. FORMAL VERIFICATION

1 #define EXECUTE_IRQ(label, stmt) \ 62 atomic {


2 label: skip; \ 63 if
3 atomic { \ 64 :: in_dyntick_nmi -> goto stmt6;
4 if \ 65 :: !in_dyntick_nmi && j + 1 == i ->
5 :: in_dyntick_nmi -> goto label; \ 66 assert(!old_gp_idle ||
6 :: else -> stmt; \ 67 grace_period_state != GP_DONE);
7 fi; \ 68 :: else -> skip;
8 } 69 fi;
70 }
71 EXECUTE_IRQ(stmt7, tmp = in_interrupt);
It is further necessary to convert dyntick_irq() to 72 EXECUTE_IRQ(stmt8, in_interrupt = tmp - 1);
EXECUTE_IRQ() as follows: 73 stmt9: skip;
74 atomic {
1 proctype dyntick_irq() 75 if
2 { 76 :: in_dyntick_nmi -> goto stmt9;
3 byte tmp; 77 :: !in_dyntick_nmi && rcu_update_flag != 0 ->
4 byte i = 0; 78 goto stmt9_then;
5 byte j = 0; 79 :: else -> goto stmt9_else;
6 bit old_gp_idle; 80 fi;
7 bit outermost; 81 }
8 82 stmt9_then: skip;
9 do 83 EXECUTE_IRQ(stmt9_1, tmp = rcu_update_flag)
10 :: i >= MAX_DYNTICK_LOOP_IRQ && 84 EXECUTE_IRQ(stmt9_2, rcu_update_flag = tmp - 1)
11 j >= MAX_DYNTICK_LOOP_IRQ -> break; 85 stmt9_3: skip;
12 :: i < MAX_DYNTICK_LOOP_IRQ -> 86 atomic {
13 atomic { 87 if
14 outermost = (in_dyntick_irq == 0); 88 :: in_dyntick_nmi -> goto stmt9_3;
15 in_dyntick_irq = 1; 89 :: !in_dyntick_nmi && rcu_update_flag == 0 ->
16 } 90 goto stmt9_3_then;
17 stmt1: skip; 91 :: else -> goto stmt9_3_else;
18 atomic { 92 fi;
19 if 93 }
20 :: in_dyntick_nmi -> goto stmt1; 94 stmt9_3_then: skip;
21 :: !in_dyntick_nmi && rcu_update_flag -> 95 EXECUTE_IRQ(stmt9_3_1,
22 goto stmt1_then; 96 tmp = dynticks_progress_counter)
23 :: else -> goto stmt1_else; 97 EXECUTE_IRQ(stmt9_3_2,
24 fi; 98 dynticks_progress_counter = tmp + 1)
25 } 99 stmt9_3_else:
26 stmt1_then: skip; 100 stmt9_else: skip;
27 EXECUTE_IRQ(stmt1_1, tmp = rcu_update_flag) 101 atomic {
28 EXECUTE_IRQ(stmt1_2, rcu_update_flag = tmp + 1) 102 j++;
29 stmt1_else: skip; 103 in_dyntick_irq = (i != j);
30 stmt2: skip; atomic { 104 }
31 if 105 od;
32 :: in_dyntick_nmi -> goto stmt2; 106 dyntick_irq_done = 1;
33 :: !in_dyntick_nmi && 107 }
34 !in_interrupt &&
35 (dynticks_progress_counter & 1) == 0 ->
36 goto stmt2_then; Note that we have open-coded the “if” statements (for
37 :: else -> goto stmt2_else; example, lines 17–29). In addition, statements that process
38 fi;
39 } strictly local state (such as line 59) need not exclude
40 stmt2_then: skip; dyntick_nmi().
41 EXECUTE_IRQ(stmt2_1,
42 tmp = dynticks_progress_counter) Finally, grace_period() requires only a few changes:
43 EXECUTE_IRQ(stmt2_2,
44 dynticks_progress_counter = tmp + 1) 1 proctype grace_period()
45 EXECUTE_IRQ(stmt2_3, tmp = rcu_update_flag) 2 {
46 EXECUTE_IRQ(stmt2_4, rcu_update_flag = tmp + 1) 3 byte curr;
47 stmt2_else: skip; 4 byte snap;
48 EXECUTE_IRQ(stmt3, tmp = in_interrupt) 5 bit shouldexit;
49 EXECUTE_IRQ(stmt4, in_interrupt = tmp + 1) 6
50 stmt5: skip; 7 grace_period_state = GP_IDLE;
51 atomic { 8 atomic {
52 if 9 printf("MDL_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
53 :: in_dyntick_nmi -> goto stmt4; 10 printf("MDL_IRQ = %d\n", MAX_DYNTICK_LOOP_IRQ);
54 :: !in_dyntick_nmi && outermost -> 11 printf("MDL_NMI = %d\n", MAX_DYNTICK_LOOP_NMI);
55 old_gp_idle = (grace_period_state == GP_IDLE); 12 shouldexit = 0;
56 :: else -> skip; 13 snap = dynticks_progress_counter;
57 fi; 14 grace_period_state = GP_WAITING;
58 } 15 }
59 i++; 16 do
60 :: j < i -> 17 :: 1 ->
61 stmt6: skip; 18 atomic {

19 assert(!shouldexit);
20 shouldexit = dyntick_nohz_done && static inline void rcu_enter_nohz(void)
21 dyntick_irq_done && {
22 dyntick_nmi_done; + mb();
23 curr = dynticks_progress_counter; __get_cpu_var(dynticks_progress_counter)++;
24 if - mb();
25 :: (curr - snap) >= 2 || (curr & 1) == 0 -> }
26 break;
27 :: else -> skip; static inline void rcu_exit_nohz(void)
28 fi; {
29 } - mb();
30 od; __get_cpu_var(dynticks_progress_counter)++;
31 grace_period_state = GP_DONE; + mb();
32 grace_period_state = GP_IDLE; }
33 atomic {
34 shouldexit = 0;
35 snap = dynticks_progress_counter;
36 grace_period_state = GP_WAITING; 3. Validate your code early, often, and up to the
37 } point of destruction. This effort located one sub-
38 do
39 :: 1 -> tle bug in rcu_try_flip_waitack_needed() that
40 atomic { would have been quite difficult to test or debug, as
41 assert(!shouldexit);
42 shouldexit = dyntick_nohz_done && shown by the following patch [McK08d].
43 dyntick_irq_done &&
44 dyntick_nmi_done; - if ((curr - snap) > 2 || (snap & 0x1) == 0)
45 curr = dynticks_progress_counter; + if ((curr - snap) > 2 || (curr & 0x1) == 0)
46 if
47 :: (curr != snap) || ((curr & 1) == 0) ->
48 break;
49 :: else -> skip; 4. Always verify your verification code. The usual
50 fi; way to do this is to insert a deliberate bug and verify
51 }
52 od; that the verification code catches it. Of course, if
53 grace_period_state = GP_DONE; the verification code fails to catch this bug, you
54 }
may also need to verify the bug itself, and so on,
recursing infinitely. However, if you find yourself
We have added the printf() for the new MAX_ in this position, getting a good night’s sleep can be
DYNTICK_LOOP_NMI parameter on line 11 and added an extremely effective debugging technique. You
dyntick_nmi_done to the shouldexit assignments on will then see that the obvious verify-the-verification
lines 22 and 44. technique is to deliberately insert bugs in the code
The model (dyntickRCU-irq-nmi-ssl.spin) re- being verified. If the verification fails to find them,
sults in a correct verification with several hundred million the verification clearly is buggy.
states, passing without errors.
5. Use of atomic instructions can simplify verifica-
Quick Quiz 12.21: Does Paul always write his code in this tion. Unfortunately, use of the cmpxchg atomic
painfully incremental manner? instruction would also slow down the critical IRQ
fastpath, so they are not appropriate in this case.

6. The need for complex formal verification often


12.1.6.8 Lessons (Re)Learned indicates a need to re-think your design.
This effort provided some lessons (re)learned:
To this last point, it turns out that there is a much simpler
solution to the dynticks problem, which is presented in
1. Promela and Spin can verify interrupt/NMI-han- the next section.
dler interactions.
12.1.6.9 Simplicity Avoids Formal Verification
2. Documenting code can help locate bugs. In
this case, the documentation effort located a mis- The complexity of the dynticks interface for preemptible
placed memory barrier in rcu_enter_nohz() and RCU is primarily due to the fact that both IRQs and NMIs
rcu_exit_nohz(), as shown by the following use the same code path and the same state variables. This
patch [McK08c]. leads to the notion of providing separate code paths and

Listing 12.17: Variables for Simple Dynticks Interface Listing 12.18: Entering and Exiting Dynticks-Idle Mode
1 struct rcu_dynticks { 1 void rcu_enter_nohz(void)
2 int dynticks_nesting; 2 {
3 int dynticks; 3 unsigned long flags;
4 int dynticks_nmi; 4 struct rcu_dynticks *rdtp;
5 }; 5
6 6 smp_mb();
7 struct rcu_data { 7 local_irq_save(flags);
8 ... 8 rdtp = &__get_cpu_var(rcu_dynticks);
9 int dynticks_snap; 9 rdtp->dynticks++;
10 int dynticks_nmi_snap; 10 rdtp->dynticks_nesting--;
11 ... 11 WARN_ON(rdtp->dynticks & 0x1);
12 }; 12 local_irq_restore(flags);
13 }
14
15 void rcu_exit_nohz(void)
variables for IRQs and NMIs, as has been done for hierar- 16 {
17 unsigned long flags;
chical RCU [McK08b] as indirectly suggested by Manfred 18 struct rcu_dynticks *rdtp;
Spraul [Spr08]. This work was pulled into mainline kernel 19
20 local_irq_save(flags);
during the v2.6.29 development cycle [McK08f]. 21 rdtp = &__get_cpu_var(rcu_dynticks);
22 rdtp->dynticks++;
23 rdtp->dynticks_nesting++;
12.1.6.10 State Variables for Simplified Dynticks In- 24 WARN_ON(!(rdtp->dynticks & 0x1));
25 local_irq_restore(flags);
terface 26 smp_mb();
27 }
Listing 12.17 shows the new per-CPU state variables.
These variables are grouped into structs to allow multiple
independent RCU implementations (e.g., rcu and rcu_ handlers running. Otherwise, the counter’s value
bh) to conveniently and efficiently share dynticks state. will be even.
In what follows, they can be thought of as independent
per-CPU variables. dynticks_snap
The dynticks_nesting, dynticks, and dynticks_ This will be a snapshot of the dynticks counter, but
snap variables are for the IRQ code paths, and the only if the current RCU grace period has extended
dynticks_nmi and dynticks_nmi_snap variables are for too long a duration.
for the NMI code paths, although the NMI code path will
also reference (but not modify) the dynticks_nesting dynticks_nmi_snap
variable. These variables are used as follows: This will be a snapshot of the dynticks_nmi counter,
but again only if the current RCU grace period has
dynticks_nesting extended for too long a duration.
This counts the number of reasons that the corre-
sponding CPU should be monitored for RCU read- If both dynticks and dynticks_nmi have taken on
side critical sections. If the CPU is in dynticks-idle an even value during a given time interval, then the
mode, then this counts the IRQ nesting level, other- corresponding CPU has passed through a quiescent state
wise it is one greater than the IRQ nesting level. during that interval.
dynticks Quick Quiz 12.22: But what happens if an NMI handler
This counter’s value is even if the corresponding starts running before an IRQ handler completes, and if that
CPU is in dynticks-idle mode and there are no IRQ NMI handler continues running until a second IRQ handler
handlers currently running on that CPU, otherwise starts?
the counter’s value is odd. In other words, if this
counter’s value is odd, then the corresponding CPU
might be in an RCU read-side critical section. 12.1.6.11 Entering and Leaving Dynticks-Idle Mode
dynticks_nmi Listing 12.18 shows the rcu_enter_nohz() and rcu_
This counter’s value is odd if the corresponding CPU exit_nohz(), which enter and exit dynticks-idle mode,
is in an NMI handler, but only if the NMI arrived also known as “nohz” mode. These two functions are
while this CPU was in dyntick-idle mode with no IRQ invoked from process context.

Listing 12.19: NMIs From Dynticks-Idle Mode Listing 12.20: Interrupts From Dynticks-Idle Mode
1 void rcu_nmi_enter(void) 1 void rcu_irq_enter(void)
2 { 2 {
3 struct rcu_dynticks *rdtp; 3 struct rcu_dynticks *rdtp;
4 4
5 rdtp = &__get_cpu_var(rcu_dynticks); 5 rdtp = &__get_cpu_var(rcu_dynticks);
6 if (rdtp->dynticks & 0x1) 6 if (rdtp->dynticks_nesting++)
7 return; 7 return;
8 rdtp->dynticks_nmi++; 8 rdtp->dynticks++;
9 WARN_ON(!(rdtp->dynticks_nmi & 0x1)); 9 WARN_ON(!(rdtp->dynticks & 0x1));
10 smp_mb(); 10 smp_mb();
11 } 11 }
12 12
13 void rcu_nmi_exit(void) 13 void rcu_irq_exit(void)
14 { 14 {
15 struct rcu_dynticks *rdtp; 15 struct rcu_dynticks *rdtp;
16 16
17 rdtp = &__get_cpu_var(rcu_dynticks); 17 rdtp = &__get_cpu_var(rcu_dynticks);
18 if (rdtp->dynticks & 0x1) 18 if (--rdtp->dynticks_nesting)
19 return; 19 return;
20 smp_mb(); 20 smp_mb();
21 rdtp->dynticks_nmi++; 21 rdtp->dynticks++;
22 WARN_ON(rdtp->dynticks_nmi & 0x1); 22 WARN_ON(rdtp->dynticks & 0x1);
23 } 23 if (__get_cpu_var(rcu_data).nxtlist ||
24 __get_cpu_var(rcu_bh_data).nxtlist)
25 set_need_resched();
26 }

Line 6 ensures that any prior memory accesses (which


might include accesses from RCU read-side critical sec-
12.1.6.13 Interrupts From Dynticks-Idle Mode
tions) are seen by other CPUs before those marking entry
to dynticks-idle mode. Lines 7 and 12 disable and reen- Listing 12.20 shows rcu_irq_enter() and rcu_irq_
able IRQs. Line 8 acquires a pointer to the current CPU’s exit(), which inform RCU of entry to and exit from,
rcu_dynticks structure, and line 9 increments the cur- respectively, IRQ context. Line 6 of rcu_irq_enter()
rent CPU’s dynticks counter, which should now be even, increments dynticks_nesting, and if this variable was
given that we are entering dynticks-idle mode in process already non-zero, line 7 silently returns. Otherwise, line 8
context. Finally, line 10 decrements dynticks_nesting, increments dynticks, which will then have an odd value,
which should now be zero. consistent with the fact that this CPU can now execute RCU
The rcu_exit_nohz() function is quite similar, but in- read-side critical sections. Line 10 therefore executes a
crements dynticks_nesting rather than decrementing memory barrier to ensure that the increment of dynticks
it and checks for the opposite dynticks polarity. is seen before any RCU read-side critical sections that the
subsequent IRQ handler might execute.
Line 18 of rcu_irq_exit() decrements dynticks_
nesting, and if the result is non-zero, line 19 silently
12.1.6.12 NMIs From Dynticks-Idle Mode returns. Otherwise, line 20 executes a memory barrier to
ensure that the increment of dynticks on line 21 is seen
Listing 12.19 shows the rcu_nmi_enter() and rcu_ after any RCU read-side critical sections that the prior
nmi_exit() functions, which inform RCU of NMI entry IRQ handler might have executed. Line 22 verifies that
and exit, respectively, from dynticks-idle mode. However, dynticks is now even, consistent with the fact that no
if the NMI arrives during an IRQ handler, then RCU will al- RCU read-side critical sections may appear in dynticks-
ready be on the lookout for RCU read-side critical sections idle mode. Lines 23–25 check to see if the prior IRQ
from this CPU, so lines 6 and 7 of rcu_nmi_enter() handlers enqueued any RCU callbacks, forcing this CPU
and lines 18 and 19 of rcu_nmi_exit() silently return if out of dynticks-idle mode via a reschedule API if so.
dynticks is odd. Otherwise, the two functions increment
dynticks_nmi, with rcu_nmi_enter() leaving it with 12.1.6.14 Checking For Dynticks Quiescent States
an odd value and rcu_nmi_exit() leaving it with an
even value. Both functions execute memory barriers be- Listing 12.21 shows dyntick_save_progress_
tween this increment and possible RCU read-side critical counter(), which takes a snapshot of the specified
sections on lines 10 and 20, respectively. CPU’s dynticks and dynticks_nmi counters. Lines 8

Listing 12.21: Saving Dyntick Progress Counters Listing 12.22: Checking Dyntick Progress Counters
1 static int 1 static int
2 dyntick_save_progress_counter(struct rcu_data *rdp) 2 rcu_implicit_dynticks_qs(struct rcu_data *rdp)
3 { 3 {
4 int ret; 4 long curr;
5 int snap; 5 long curr_nmi;
6 int snap_nmi; 6 long snap;
7 7 long snap_nmi;
8 snap = rdp->dynticks->dynticks; 8
9 snap_nmi = rdp->dynticks->dynticks_nmi; 9 curr = rdp->dynticks->dynticks;
10 smp_mb(); 10 snap = rdp->dynticks_snap;
11 rdp->dynticks_snap = snap; 11 curr_nmi = rdp->dynticks->dynticks_nmi;
12 rdp->dynticks_nmi_snap = snap_nmi; 12 snap_nmi = rdp->dynticks_nmi_snap;
13 ret = ((snap & 0x1) == 0) && ((snap_nmi & 0x1) == 0); 13 smp_mb();
14 if (ret) 14 if ((curr != snap || (curr & 0x1) == 0) &&
15 rdp->dynticks_fqs++; 15 (curr_nmi != snap_nmi || (curr_nmi & 0x1) == 0)) {
16 return ret; 16 rdp->dynticks_fqs++;
17 } 17 return 1;
18 }
19 return rcu_implicit_offline_qs(rdp);
20 }

and 9 snapshot these two variables to locals, line 10


executes a memory barrier to pair with the memory
barriers in the functions in Listings 12.18, 12.19, Linux-kernel RCU’s dyntick-idle code has since been
and 12.20. Lines 11 and 12 record the snapshots for later rewritten yet again based on a suggestion from Andy
calls to rcu_implicit_dynticks_qs(), and line 13 Lutomirski [McK15c], but it is time to sum up and move
checks to see if the CPU is in dynticks-idle mode with on to other topics.
neither IRQs nor NMIs in progress (in other words,
both snapshots have even values), hence in an extended 12.1.6.15 Discussion
quiescent state. If so, lines 14 and 15 count this event, and
line 16 returns true if the CPU was in a quiescent state. A slight shift in viewpoint resulted in a substantial sim-
plification of the dynticks interface for RCU. The key
Listing 12.22 shows rcu_implicit_dynticks_qs(),
change leading to this simplification was minimizing of
which is called to check whether a CPU has entered
sharing between IRQ and NMI contexts. The only sharing
dyntick-idle mode subsequent to a call to dynticks_
in this simplified interface is references from NMI context
save_progress_counter(). Lines 9 and 11 take new
to IRQ variables (the dynticks variable). This type of
snapshots of the corresponding CPU’s dynticks and
sharing is benign, because the NMI functions never update
dynticks_nmi variables, while lines 10 and 12 re-
this variable, so that its value remains constant through
trieve the snapshots saved earlier by dynticks_save_
the lifetime of the NMI handler. This limitation of sharing
progress_counter(). Line 13 then executes a memory
allows the individual functions to be understood one at
barrier to pair with the memory barriers in the functions
a time, in happy contrast to the situation described in
in Listings 12.18, 12.19, and 12.20. Lines 14–15 then
Section 12.1.5, where an NMI might change shared state
check to see if the CPU is either currently in a quies-
at any point during execution of the IRQ functions.
cent state (curr and curr_nmi having even values) or
Verification can be a good thing, but simplicity is even
has passed through a quiescent state since the last call
better.
to dynticks_save_progress_counter() (the values
of dynticks and dynticks_nmi having changed). If
these checks confirm that the CPU has passed through a
dyntick-idle quiescent state, then line 16 counts that fact
12.2 Special-Purpose State-Space
and line 17 returns an indication of this fact. Either way, Search
line 19 checks for race conditions that can result in RCU
waiting for a CPU that is offline.
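The two functions shown in Listings 12.21 and 12.22 are designed to be used in a snapshot-then-recheck pattern by the grace-period machinery. The following minimal sketch illustrates one plausible shape of that pattern; please note that force_quiescent_state_sketch() and the for_each_rcu_data() iterator are hypothetical stand-ins invented for this sketch, not actual Linux-kernel APIs.

/*
 * Hedged sketch: how a force-quiescent-state scan might use
 * dyntick_save_progress_counter() and rcu_implicit_dynticks_qs().
 * for_each_rcu_data() is a hypothetical per-CPU iterator.
 */
static void force_quiescent_state_sketch(int initial_pass)
{
	struct rcu_data *rdp;

	for_each_rcu_data(rdp) {
		if (initial_pass) {
			/* First pass: snapshot each CPU's counters, noting
			 * any CPU already in an extended quiescent state. */
			if (dyntick_save_progress_counter(rdp))
				continue;
		} else {
			/* Later passes: has this CPU passed through a
			 * dynticks-idle quiescent state since the snapshot? */
			if (rcu_implicit_dynticks_qs(rdp))
				continue;
		}
		/* Otherwise, this CPU still owes the current grace
		 * period a quiescent state. */
	}
}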
Jack of all trades, master of none.
Quick Quiz 12.23: This is still pretty complicated. Why Unknown
not just have a cpumask_t with per-CPU bits, clearing the bit
when entering an IRQ or NMI handler, and setting it upon Although Promela and spin allow you to verify pretty
exit? much any (smallish) algorithm, their very generality can

sometimes be a curse. For example, Promela does not Listing 12.23: PPCMEM Litmus Test
understand memory models or any sort of reordering 1 PPC SB+lwsync-RMW-lwsync+isync-simple
2 ""
semantics. This section therefore describes some state- 3 {
space search tools that understand memory models used 4 0:r2=x; 0:r3=2; 0:r4=y; 0:r10=0; 0:r11=0; 0:r12=z;
5 1:r2=y; 1:r4=x;
by production systems, greatly simplifying the verification 6 }
of weakly ordered code. 7 P0 | P1 ;
8 li r1,1 | li r1,1 ;
For example, Section 12.1.4 showed how to convince 9 stw r1,0(r2) | stw r1,0(r2) ;
Promela to account for weak memory ordering. Although 10 lwsync | sync ;
11 | lwz r3,0(r4) ;
this approach can work well, it requires that the developer 12 lwarx r11,r10,r12 | ;
fully understand the system’s memory model. Unfor- 13 stwcx. r11,r10,r12 | ;
14 bne Fail1 | ;
tunately, few (if any) developers fully understand the 15 isync | ;
complex memory models of modern CPUs. 16 lwz r3,0(r4) | ;
17 Fail1: | ;
Therefore, another approach is to use a tool that already 18

understands this memory ordering, such as the PPCMEM 19 exists


20 (0:r3=0 /\ 1:r3=0)
tool produced by Peter Sewell and Susmit Sarkar at the
University of Cambridge, Luc Maranget, Francesco Zappa
Nardelli, and Pankaj Pawan at INRIA, and Jade Alglave example, x=1 initializes the value of x to 1. Uninitialized
at Oxford University, in cooperation with Derek Williams variables default to the value zero, so that in the example,
of IBM [AMP+ 11]. This group formalized the memory x, y, and z are all initially zero.
models of Power, Arm, x86, as well as that of the C/C++11 Line 7 provides identifiers for the two processes, so
standard [Smi19], and produced the PPCMEM tool based that the 0:r3=2 on line 4 could instead have been written
on the Power and Arm formalizations. P0:r3=2. Line 7 is required, and the identifiers must be
Quick Quiz 12.24: But x86 has strong memory ordering, so of the form Pn, where n is the column number, starting
why formalize its memory model? from zero for the left-most column. This may seem unnec-
essarily strict, but it does prevent considerable confusion
The PPCMEM tool takes litmus tests as input. A sample in actual use.
litmus test is presented in Section 12.2.1. Section 12.2.2 re-
Quick Quiz 12.25: Why does line 8 of Listing 12.23 initialize
lates this litmus test to the equivalent C-language program,
the registers? Why not instead initialize them on lines 4 and 5?
Section 12.2.3 describes how to apply PPCMEM to this
litmus test, and Section 12.2.4 discusses the implications.
Lines 8–17 are the lines of code for each process. A
given process can have empty lines, as is the case for P0’s
12.2.1 Anatomy of a Litmus Test
line 11 and P1’s lines 12–17. Labels and branches are
An example PowerPC litmus test for PPCMEM is shown permitted, as demonstrated by the branch on line 14 to
in Listing 12.23. The ARM interface works the same the label on line 17. That said, too-free use of branches
way, but with Arm instructions substituted for the Power will expand the state space. Use of loops is a particularly
instructions and with the initial “PPC” replaced by “ARM”. good way to explode your state space.
In the example, line 1 identifies the type of system Lines 19–20 show the assertion, which in this case
(“ARM” or “PPC”) and contains the title for the model. indicates that we are interested in whether P0’s and P1’s r3
Line 2 provides a place for an alternative name for the test, registers can both contain zero after both threads complete
which you will usually want to leave blank as shown in execution. This assertion is important because there are a
the above example. Comments can be inserted between number of use cases that would fail miserably if both P0
lines 2 and 3 using the OCaml (or Pascal) syntax of (* *). and P1 saw zero in their respective r3 registers.
Lines 3–6 give initial values for all registers; each is This should give you enough information to construct
of the form P:R=V, where P is the process identifier, R is simple litmus tests. Some additional documentation is
the register identifier, and V is the value. For example, available, though much of this additional documentation
process 0’s register r3 initially contains the value 2. If is intended for a different research tool that runs tests
the value is a variable (x, y, or z in the example) then on actual hardware. Perhaps more importantly, a large
the register is initialized to the address of the variable. It number of pre-existing litmus tests are available with the
is also possible to initialize the contents of variables, for online tool (available via the “Select ARM Test” and

Listing 12.24: Meaning of PPCMEM Litmus Test Listing 12.25: PPCMEM Detects an Error
1 void P0(void) ./ppcmem -model lwsync_read_block \
2 { -model coherence_points filename.litmus
3 int r3; ...
4 States 6
5 x = 1; /* Lines 8 and 9 */ 0:r3=0; 1:r3=0;
6 atomic_add_return(&z, 0); /* Lines 10-15 */ 0:r3=0; 1:r3=1;
7 r3 = y; /* Line 16 */ 0:r3=1; 1:r3=0;
8 } 0:r3=1; 1:r3=1;
9 0:r3=2; 1:r3=0;
10 void P1(void) 0:r3=2; 1:r3=1;
11 { Ok
12 int r3; Condition exists (0:r3=0 /\ 1:r3=0)
13 Hash=e2240ce2072a2610c034ccd4fc964e77
14 y = 1; /* Lines 8-9 */ Observation SB+lwsync-RMW-lwsync+isync Sometimes 1
15 smp_mb(); /* Line 10 */
16 r3 = x; /* Line 11 */
17 }
Listing 12.26: PPCMEM on Repaired Litmus Test
./ppcmem -model lwsync_read_block \
-model coherence_points filename.litmus
“Select POWER Test” buttons at https://www.cl.cam. ...
States 5
ac.uk/~pes20/ppcmem/). It is quite likely that one of 0:r3=0; 1:r3=1;
these pre-existing litmus tests will answer your Power or 0:r3=1; 1:r3=0;
0:r3=1; 1:r3=1;
Arm memory-ordering question. 0:r3=2; 1:r3=0;
0:r3=2; 1:r3=1;
No (allowed not found)
Condition exists (0:r3=0 /\ 1:r3=0)
12.2.2 What Does This Litmus Test Mean? Hash=77dd723cda9981248ea4459fcdf6097d
Observation SB+lwsync-RMW-lwsync+sync Never 0 5

P0’s lines 8 and 9 are equivalent to the C statement x=1


because line 4 defines P0’s register r2 to be the address
of x. P0’s lines 12 and 13 are the mnemonics for load- 12.2.3 Running a Litmus Test
linked (“load register exclusive” in Arm parlance and
“load reserve” in Power parlance) and store-conditional As noted earlier, litmus tests may be run interactively
(“store register exclusive” in Arm parlance), respectively. via https://www.cl.cam.ac.uk/~pes20/ppcmem/,
When these are used together, they form an atomic in- which can help build an understanding of the memory
struction sequence, roughly similar to the compare-and- model. However, this approach requires that the user
swap sequences exemplified by the x86 lock;cmpxchg manually carry out the full state-space search. Because
instruction. Moving to a higher level of abstraction, the it is very difficult to be sure that you have checked every
sequence from lines 10–15 is equivalent to the Linux possible sequence of events, a separate tool is provided
kernel’s atomic_add_return(&z, 0). Finally, line 16 for this purpose [McK11d].
is roughly equivalent to the C statement r3=y. Because the litmus test shown in Listing 12.23 con-
P1’s lines 8 and 9 are equivalent to the C statement tains read-modify-write instructions, we must add -model
y=1, line 10 is a memory barrier, equivalent to the Linux arguments to the command line. If the litmus test is
kernel statement smp_mb(), and line 11 is equivalent to stored in filename.litmus, this will result in the out-
the C statement r3=x. put shown in Listing 12.25, where the ... stands for
Quick Quiz 12.26: But whatever happened to line 17 of voluminous making-progress output. The list of states in-
Listing 12.23, the one that is the Fail1: label? cludes 0:r3=0; 1:r3=0;, indicating once again that the
old PowerPC implementation of atomic_add_return()
Putting all this together, the C-language equivalent to does not act as a full barrier. The “Sometimes” on the
the entire litmus test is as shown in Listing 12.24. The last line confirms this: The assertion triggers for some
key point is that if atomic_add_return() acts as a full executions, but not all of the time.
memory barrier (as the Linux kernel requires it to), then it The fix to this Linux-kernel bug is to replace P0’s
should be impossible for P0()’s and P1()’s r3 variables isync with sync, which results in the output shown in
to both be zero after execution completes. Listing 12.26. As you can see, 0:r3=0; 1:r3=0; does
The next section describes how to run this litmus test. not appear in the list of states, and the last line calls out

“Never”. Therefore, the model predicts that the offending 6. These tools are not much good for complex data
execution sequence cannot happen. structures, although it is possible to create and tra-
Quick Quiz 12.27: Does the Arm Linux kernel have a similar
verse extremely simple linked lists using initialization
bug? statements of the form “x=y; y=z; z=42;”.
7. These tools do not handle memory mapped I/O or
Quick Quiz 12.28: Does the lwsync on line 10 in List- device registers. Of course, handling such things
ing 12.23 provide sufficient ordering?
would require that they be formalized, which does
not appear to be in the offing.

12.2.4 PPCMEM Discussion 8. The tools will detect only those problems for which
you code an assertion. This weakness is common to
These tools promise to be of great help to people working all formal methods, and is yet another reason why
on low-level parallel primitives that run on Arm and on testing remains important. In the immortal words of
Power. These tools do have some intrinsic limitations: Donald Knuth quoted at the beginning of this chapter,
“Beware of bugs in the above code; I have only proved
1. These tools are research prototypes, and as such are
it correct, not tried it.”
unsupported.
That said, one strength of these tools is that they are
2. These tools do not constitute official statements by
designed to model the full range of behaviors allowed by
IBM or Arm on their respective CPU architectures.
the architectures, including behaviors that are legal, but
For example, both corporations reserve the right to
which current hardware implementations do not yet inflict
report a bug at any time against any version of any of
on unwary software developers. Therefore, an algorithm
these tools. These tools are therefore not a substitute
that is vetted by these tools likely has some additional
for careful stress testing on real hardware. Moreover,
safety margin when running on real hardware. Further-
both the tools and the model that they are based on are
more, testing on real hardware can only find bugs; such
under active development and might change at any
testing is inherently incapable of proving a given usage
time. On the other hand, this model was developed
correct. To appreciate this, consider that the researchers
in consultation with the relevant hardware experts,
routinely ran in excess of 100 billion test runs on real hard-
so there is good reason to be confident that it is a
ware to validate their model. In one case, behavior that
robust representation of the architectures.
is allowed by the architecture did not occur, despite 176
3. These tools currently handle a subset of the instruc- billion runs [AMP+ 11]. In contrast, the full-state-space
tion set. This subset has been sufficient for my search allows the tool to prove code fragments correct.
purposes, but your mileage may vary. In particular, It is worth repeating that formal methods and tools are
the tool handles only word-sized accesses (32 bits), no substitute for testing. The fact is that producing large
and the words accessed must be properly aligned.3 In reliable concurrent software artifacts, the Linux kernel
addition, the tool does not handle some of the weaker for example, is quite difficult. Developers must therefore
variants of the Arm memory-barrier instructions, nor be prepared to apply every tool at their disposal towards
does it handle arithmetic. this goal. The tools presented in this chapter are able to
locate bugs that are quite difficult to produce (let alone
4. The tools are restricted to small loop-free code frag- track down) via testing. On the other hand, testing can
ments running on small numbers of threads. Larger be applied to far larger bodies of software than the tools
examples result in state-space explosion, just as with presented in this chapter are ever likely to handle. As
similar tools such as Promela and spin. always, use the right tools for the job!
Of course, it is always best to avoid the need to work
5. The full state-space search does not give any indica-
at this level by designing your parallel code to be easily
tion of how each offending state was reached. That
partitioned and then using higher-level primitives (such
said, once you realize that the state is in fact reach-
as locks, sequence counters, atomic operations, and RCU)
able, it is usually not too hard to find that state using
to get your job done more straightforwardly. And even if
the interactive tool.
you absolutely must use low-level memory barriers and
3 But recent work focuses on mixed-size accesses [FSP+ 17]. read-modify-write instructions to get your job done, the

Listing 12.27: IRIW Litmus Test Listing 12.28: Expanded IRIW Litmus Test
1 PPC IRIW.litmus 1 PPC IRIW5.litmus
2 "" 2 ""
3 (* Traditional IRIW. *) 3 (* Traditional IRIW, but with five stores instead of *)
4 { 4 (* just one. *)
5 0:r1=1; 0:r2=x; 5 {
6 1:r1=1; 1:r4=y; 6 0:r1=1; 0:r2=x;
7 2:r2=x; 2:r4=y; 7 1:r1=1; 1:r4=y;
8 3:r2=x; 3:r4=y; 8 2:r2=x; 2:r4=y;
9 } 9 3:r2=x; 3:r4=y;
10 P0 | P1 | P2 | P3 ; 10 }
11 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ; 11 P0 | P1 | P2 | P3 ;
12 | | sync | sync ; 12 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
13 | | lwz r5,0(r4) | lwz r5,0(r2) ; 13 addi r1,r1,1 | addi r1,r1,1 | sync | sync ;
14 14 stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4) | lwz r5,0(r2) ;
15 exists 15 addi r1,r1,1 | addi r1,r1,1 | | ;
16 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0) 16 stw r1,0(r2) | stw r1,0(r4) | | ;
17 addi r1,r1,1 | addi r1,r1,1 | | ;
18 stw r1,0(r2) | stw r1,0(r4) | | ;
19 addi r1,r1,1 | addi r1,r1,1 | | ;
more conservative your use of these sharp instruments, 20 stw r1,0(r2) | stw r1,0(r4) | | ;
21
the easier your life is likely to be. 22 exists
23 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)

12.3 Axiomatic Approaches


Of course, many of the traces are quite similar to one
another, which suggests that an approach that treated
Theory helps us to bear our ignorance of facts. similar traces as one might improve performance. One
George Santayana such approach is the axiomatic approach of Alglave et
al. [AMT14], which creates a set of axioms to represent the
Although the PPCMEM tool can solve the famous “in- memory model and then converts litmus tests to theorems
dependent reads of independent writes” (IRIW) litmus that might be proven or disproven over this set of axioms.
test shown in Listing 12.27, doing so requires no less The resulting tool, called “herd”, conveniently takes as
than fourteen CPU hours and generates no less than ten input the same litmus tests as PPCMEM, including the
gigabytes of state space. That said, this situation is a great IRIW litmus test shown in Listing 12.27.
improvement over that before the advent of PPCMEM, However, where PPCMEM requires 14 CPU hours
where solving this problem required perusing volumes to solve IRIW, herd does so in 17 milliseconds, which
of reference manuals, attempting proofs, discussing with represents a speedup of more than six orders of magnitude.
experts, and being unsure of the final answer. Although That said, the problem is exponential in nature, so we
fourteen hours can seem like a long time, it is much shorter should expect herd to exhibit exponential slowdowns for
than weeks or even months. larger problems. And this is exactly what happens, for
However, the time required is a bit surprising given the example, if we add four more writes per writing CPU
simplicity of the litmus test, which has two threads storing as shown in Listing 12.28, herd slows down by a factor
to two separate variables and two other threads loading of more than 50,000, requiring more than 15 minutes of
from these two variables in opposite orders. The assertion CPU time. Adding threads also results in exponential
triggers if the two loading threads disagree on the order slowdowns [MS14].
of the two stores. Even by the standards of memory-order Despite their exponential nature, both PPCMEM and
litmus tests, this is quite simple. herd have proven quite useful for checking key parallel
One reason for the amount of time and space consumed algorithms, including the queued-lock handoff on x86 sys-
is that PPCMEM does a trace-based full-state-space search, tems. The weaknesses of the herd tool are similar to those
which means that it must generate and evaluate all possible of PPCMEM, which were described in Section 12.2.4.
orders and combinations of events at the architectural level. There are some obscure (but very real) cases for which the
At this level, both loads and stores correspond to ornate PPCMEM and herd tools disagree, and as of 2021 many
sequences of events and actions, resulting in a very large but not all of these disagreements was resolved.
state space that must be completely searched, in turn It would be helpful if the litmus tests could be written
resulting in large memory and CPU consumption. in C (as in Listing 12.24) rather than assembly (as in

Listing 12.29: Locking Example Listing 12.30: Broken Locking Example


1 C Lock1 1 C Lock2
2 2
3 {} 3 {}
4 4
5 P0(int *x, spinlock_t *sp) 5 P0(int *x, spinlock_t *sp1)
6 { 6 {
7 spin_lock(sp); 7 spin_lock(sp1);
8 WRITE_ONCE(*x, 1); 8 WRITE_ONCE(*x, 1);
9 WRITE_ONCE(*x, 0); 9 WRITE_ONCE(*x, 0);
10 spin_unlock(sp); 10 spin_unlock(sp1);
11 } 11 }
12 12
13 P1(int *x, spinlock_t *sp) 13 P1(int *x, spinlock_t *sp2) // Buggy!
14 { 14 {
15 int r1; 15 int r1;
16 16
17 spin_lock(sp); 17 spin_lock(sp2);
18 r1 = READ_ONCE(*x); 18 r1 = READ_ONCE(*x);
19 spin_unlock(sp); 19 spin_unlock(sp2);
20 } 20 }
21 21
22 exists (1:r1=1) 22 exists (1:r1=1)

Listing 12.23). This is now possible, as will be described


12.3.2 Axiomatic Approaches and RCU
in the following sections.
Axiomatic approaches can also analyze litmus tests in-
12.3.1 Axiomatic Approaches and Locking volving RCU [AMM+ 18]. To that end, Listing 12.31
(C-RCU-remove.litmus) shows a litmus test corre-
Axiomatic approaches may also be applied to higher-
sponding to the canonical RCU-mediated removal from
level languages and also to higher-level synchronization
a linked list. As with the locking litmus test, this RCU
primitives, as exemplified by the lock-based litmus test
litmus test can be modeled by LKMM, with similar perfor-
shown in Listing 12.29 (C-Lock1.litmus). This litmus
mance advantages compared to modeling emulations of
test can be modeled by the Linux kernel memory model
RCU. Line 6 shows x as the list head, initially referencing
(LKMM) [AMM+ 18, MS18]. As expected, the herd
y, which in turn is initialized to the value 2 on line 5.
tool’s output features the string Never, correctly indicating
that P1() cannot see x having a value of one.4 P0() on lines 9–14 removes element y from the list by
replacing it with element z (line 11), waits for a grace
Quick Quiz 12.29: What do you have to do to run herd on
period (line 12), and finally zeroes y to emulate free()
litmus tests like that shown in Listing 12.29?
(line 13). P1() on lines 16–25 executes within an RCU
Of course, if P0() and P1() use different locks, as read-side critical section (lines 21–24), picking up the list
shown in Listing 12.30 (C-Lock2.litmus), then all bets head (line 22) and then loading the next element (line 23).
are off. And in this case, the herd tool’s output features The next element should be non-zero, that is, not yet freed
the string Sometimes, correctly indicating that use of (line 28). Several other variables are output for debugging
different locks allows P1() to see x having a value of one. purposes (line 27).
The output of the herd tool when running this litmus
Quick Quiz 12.30: Why bother modeling locking directly?
Why not simply emulate locking with atomic operations? test features Never, indicating that P0() never accesses a
freed element, as expected. Also as expected, removing
But locking is not the only synchronization primitive line 12 results in P0() accessing a freed element, as
that can be modeled directly: The next section looks at indicated by the Sometimes in the herd output.
RCU. A litmus test for a more complex example proposed
by Roman Penyaev [Pen18] is shown in Listing 12.32
(C-RomanPenyaev-list-rcu-rr.litmus). In this ex-
4 The output of the herd tool is compatible with that of PPCMEM, ample, readers (modeled by P0() on lines 12–35) access a
so feel free to look at Listings 12.25 and 12.26 for examples showing linked list in a round-robin fashion by “leaking” a pointer
the output format. to the last list element accessed into variable c. Updaters

v2022.09.25a
262 CHAPTER 12. FORMAL VERIFICATION

Listing 12.31: Canonical RCU Removal Litmus Test


1 C C-RCU-remove
2
3 {
4 int z=1;
5 int y=2;
6 int *x=y;
7 }
8
9 P0(int **x, int *y, int *z)
10 { Listing 12.32: Complex RCU Litmus Test
11 rcu_assign_pointer(*x, z); 1 C C-RomanPenyaev-list-rcu-rr
12 synchronize_rcu(); 2
13 WRITE_ONCE(*y, 0); 3 {
14 } 4 int *z=1;
15 5 int *y=z;
16 P1(int **x, int *y, int *z) 6 int *x=y;
17 { 7 int *w=x;
18 int *r1; 8 int *v=w;
19 int r2; 9 int *c=w;
20 10 }
21 rcu_read_lock(); 11
22 r1 = rcu_dereference(*x); 12 P0(int **c, int **v)
23 r2 = READ_ONCE(*r1); 13 {
24 rcu_read_unlock(); 14 int *r1;
25 } 15 int *r2;
26 16 int *r3;
27 locations [1:r1; x; y; z] 17 int *r4;
28 exists (1:r2=0) 18
19 rcu_read_lock();
20 r1 = READ_ONCE(*c);
21 if (r1 == 0) {
(modeled by P1() on lines 37–49) remove an element, 22 r1 = READ_ONCE(*v);
23 }
taking care to avoid disrupting current or future readers. 24 r2 = rcu_dereference(*(int **)r1);
25 smp_store_release(c, r2);
Quick Quiz 12.31: Wait!!! Isn’t leaking pointers out of an 26 rcu_read_unlock();
RCU read-side critical section a critical bug??? 27 rcu_read_lock();
28 r3 = READ_ONCE(*c);
29 if (r3 == 0) {
Lines 4–8 define the initial linked list, tail first. In the 30 r3 = READ_ONCE(*v);
Linux kernel, this would be a doubly linked circular list, 31 }
32 r4 = rcu_dereference(*(int **)r3);
but herd is currently incapable of modeling such a beast. 33 smp_store_release(c, r4);
The strategy is instead to use a singly linked linear list 34 rcu_read_unlock();
35 }
that is long enough that the end is never reached. Line 9 36
defines variable c, which is used to cache the list pointer 37 P1(int **c, int **v, int **w, int **x, int **y)
38 {
between successive RCU read-side critical sections. 39 int *r1;
Again, P0() on lines 12–35 models readers. This 40
41 rcu_assign_pointer(*w, y);
process models a pair of successive readers traversing 42 synchronize_rcu();
round-robin through the list, with the first reader on 43 r1 = READ_ONCE(*c);
44 if ((int **)r1 == x) {
lines 19–26 and the second reader on lines 27–34. Line 20 45 WRITE_ONCE(*c, 0);
fetches the pointer cached in c, and if line 21 sees that 46 synchronize_rcu();
47 }
the pointer was NULL, line 22 restarts at the beginning 48 smp_store_release(x, 0);
of the list. In either case, line 24 advances to the next 49 }
50
list element, and line 25 stores a pointer to this element 51 locations [1:r1; c; v; w; x; y]
back into variable c. Lines 27–34 repeat this process, 52 exists (0:r1=0 \/ 0:r2=0 \/ 0:r3=0 \/ 0:r4=0)

but using registers r3 and r4 instead of r1 and r2. As


with Listing 12.31, this litmus test stores zero to emulate
free(), so line 52 checks for any of these four registers
being NULL, also known as zero.
Because P0() leaks an RCU-protected pointer from its
first RCU read-side critical section to its second, P1()
must carry out its update (removing x) very carefully.
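The next few paragraphs walk through P1()'s update. The following hedged C-language paraphrase of lines 41-48, using Linux-kernel primitives, may help while reading them; the struct and variable names are illustrative only, and kfree() stands in for the litmus test's release-store of zero.

/* Sketch of P1(): remove x by making its predecessor point to y,
 * then deal with the pointer that readers leak into the cache. */
struct elem { struct elem *next; };

struct elem *pred;	/* Plays the role of w, the element before x. */
struct elem *cache;	/* Plays the role of c in Listing 12.32.       */

static void remove_x(struct elem *x, struct elem *y)
{
	rcu_assign_pointer(pred, y);		/* Line 41: unlink x.           */
	synchronize_rcu();			/* Line 42: wait for readers.   */
	if (READ_ONCE(cache) == x) {		/* Lines 43-44: leaked pointer? */
		WRITE_ONCE(cache, NULL);	/* Line 45: invalidate it.      */
		synchronize_rcu();		/* Line 46: wait once more.     */
	}
	kfree(x);				/* Line 48: x is now unreachable. */
}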

Line 41 removes x by linking w to y. Line 42 waits for


readers, after which no subsequent reader has a path to C Code
x via the linked list. Line 43 fetches c, and if line 44
determines that c references the newly removed x, line 45
CBMC
sets c to NULL and line 46 again waits for readers, after
which no subsequent reader can fetch x from c. In either Logic Expression

case, line 48 emulates free() by storing zero to x.


Quick Quiz 12.32: In Listing 12.32, why couldn’t a reader
fetch c just before P1() zeroed it on line 45, and then later SAT Solver
store this same value back into c just after it was zeroed, thus
defeating the zeroing operation?

The output of the herd tool when running this litmus Trace Generation
(If Counterexample
test features Never, indicating that P0() never accesses a Located)
freed element, as expected. Also as expected, removing
either synchronize_rcu() results in P1() accessing a
freed element, as indicated by Sometimes in the herd
Verification Result
output.
Quick Quiz 12.33: In Listing 12.32, why not have just one
call to synchronize_rcu() immediately before line 48? Figure 12.2: CBMC Processing Flow

Quick Quiz 12.34: Also in Listing 12.32, can’t line 48 be


WRITE_ONCE() instead of smp_store_release()? also known as the satisfiability problem. SAT solvers
are heavily used in verification of hardware, which has
These sections have shown how axiomatic approaches motivated great advances. A world-class early 1990s SAT
can successfully model synchronization primitives such solver might be able to handle a logic expression with 100
as locking and RCU in C-language litmus tests. Longer distinct boolean variables, but by the early 2010s million-
term, the hope is that the axiomatic approaches will model variable SAT solvers were readily available [KS08].
even higher-level software artifacts, producing exponen-
In addition, front-end programs for SAT solvers can
tial verification speedups. This could potentially allow
automatically translate C code into logic expressions,
axiomatic verification of much larger software systems,
taking assertions into account and generating assertions
perhaps incorporating spatial-synchronization techniques
for error conditions such as array-bounds errors. One
from separation logic [GRY13, ORY01]. Another alter-
example is the C bounded model checker, or cbmc, which
native is to press the axioms of boolean logic into service,
is available as part of many Linux distributions. This
as described in the next section.
tool is quite easy to use, with cbmc test.c sufficing to
validate test.c, resulting in the processing flow shown
12.4 SAT Solvers in Figure 12.2. This ease of use is exceedingly important
because it opens the door to formal verification being incor-
porated into regression-testing frameworks. In contrast,
Live by the heuristic, die by the heuristic. the traditional tools that require non-trivial translation to
Unknown a special-purpose language are confined to design-time
verification.
Any finite program with bounded loops and recursion can More recently, SAT solvers have appeared that han-
be converted into a logic expression, which might express dle parallel code. These solvers operate by convert-
that program’s assertions in terms of its inputs. Given such ing the input code into single static assignment (SSA)
a logic expression, it would be quite interesting to know form, then generating all permitted access orders. This
whether any possible combinations of inputs could result in approach seems promising, but it remains to be seen
one of the assertions triggering. If the inputs are expressed how well it works in practice. One encouraging
as combinations of boolean variables, this is simply SAT, sign is work in 2016 applying cbmc to Linux-kernel
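As a concrete, if hedged, illustration of the cbmc usage described above, consider the following self-contained program; the file name test.c and the two-thread structure are purely illustrative. Because both increments hold the same pthread mutex, running cbmc test.c should report that the assertion holds, and removing the lock and unlock calls creates an interleaving in which the assertion fails, which is exactly the sort of bug that cbmc is designed to locate.

#include <assert.h>
#include <pthread.h>

static int counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *incr(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	counter++;	/* Protected: the two increments cannot be lost. */
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, incr, NULL);
	pthread_create(&t2, NULL, incr, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	assert(counter == 2);	/* The property checked by the SAT solver. */
	return 0;
}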

RCU [LMKM16, LMKM18, Roy17]. This work used


minimal configurations of RCU, and verified scenarios C Code
using small numbers of threads, but nevertheless suc-
cessfully ingested Linux-kernel C code and produced a
Nidhugg
useful result. The logic expressions generated from the C LLVM Internal
code had up to 90 million variables, 450 million clauses, Representation
occupied tens of gigabytes of memory, and required up to
80 hours of CPU time for the SAT solver to produce the
correct result. Dynamic Partial
Nevertheless, a Linux-kernel hacker might be justified Order Reduction
(DPOR) Algorithm
in feeling skeptical of a claim that his or her code had
been automatically verified, and such hackers would find
many fellow skeptics going back decades [DMLP79]. Trace Generation
One way to productively express such skepticism is to (If Counterexample
Located)
provide bug-injected versions of the allegedly verified
code. If the formal-verification tool finds all the injected
bugs, our hacker might gain more confidence in the tool’s
Verification Result
capabilities. Of course, tools that find valid bugs of which
the hacker was not yet aware will likely engender even
more confidence. And this is exactly why there is a git Figure 12.3: Nidhugg Processing Flow
archive with a 20-branch set of mutations, with each
branch potentially containing a bug injected into Linux-
kernel RCU [McK17]. Anyone with a formal-verification Nidhugg was more than an order of magnitude faster
tool is cordially invited to try that tool out on this set of than was cbmc for some Linux-kernel RCU verification
verification challenges. scenarios. Of course, Nidhugg’s speed and scalability
Currently, cbmc is able to find a number of injected advantages are tied to the fact that it does not handle
bugs, however, it has not yet been able to locate a bug that data non-determinism, but this was not a factor in these
RCU’s maintainer was not already aware of. Nevertheless, particular verification scenarios.
there is some reason to hope that SAT solvers will someday Nevertheless, as with cbmc, Nidhugg has not yet been
be useful for finding concurrency bugs in parallel code. able to locate a bug that Linux-kernel RCU’s maintainer
was not already aware of. However, it was able to demon-
strate that one historical bug in Linux-kernel RCU was
12.5 Stateless Model Checkers fixed by a different commit than the maintainer thought,
which gives some additional hope that stateless model
He’s making a list, he’s permuting it twice. . . checkers like Nidhugg might someday be useful for finding
with apologies to Haven Gillespie and J. Fred Coots concurrency bugs in parallel code.

The SAT-solver approaches described in the previous


section are quite convenient and powerful, but the full 12.6 Summary
tracking of all possible executions, including state, can
incur substantial overhead. In fact, the memory and CPU-
Western thought has focused on True-False; it is high
time overheads can sharply limit the size of programs time to shift to Robust-Fragile.
that can be feasibly verified, which raises the question of
whether less-exact approaches might find bugs in larger Nassim Nicholas Taleb, summarized
programs.
Although the jury is still out on this question, stateless The formal-verification techniques described in this chap-
model checkers such as Nidhugg [LSLK14] have in some ter are very powerful tools for validating small parallel
cases handled larger programs [KS17b], and with similar algorithms, but they should not be the only tools in your
ease of use, as illustrated by Figure 12.3. In addition, toolbox. Despite decades of focus on formal verification,

testing remains the validation workhorse for large parallel said, very simple test harnesses can find significant bugs
software systems [Cor06a, Jon11, McK15d]. in arbitrarily large software systems. In contrast, the effort
It is nevertheless quite possible that this will not always required to apply formal verification seems to increase
be the case. To see this, consider that there is estimated to dramatically as the system size increases.
be more than twenty billion instances of the Linux kernel I have nevertheless made occasional use of formal
as of 2017. Suppose that the Linux kernel has a bug that verification for almost 30 years by playing to formal
manifests on average every million years of runtime. As verification’s strengths, namely design-time verification
noted at the end of the preceding chapter, this bug will be of small complex portions of the overarching software
appearing more than 50 times per day across the installed construct. The larger overarching software construct is of
base. But the fact remains that most formal validation course validated by testing.
techniques can be used only on very small codebases. So
what is a concurrency coder to do? Quick Quiz 12.36: In light of the full verification of the L4
Think in terms of finding the first bug, the first relevant microkernel, isn’t this limited view of formal verification just
bug, the last relevant bug, and the last bug. a little bit obsolete?
The first bug is normally found via inspection or com-
piler diagnostics. Although the increasingly sophisticated One final approach is to consider the following two
compiler diagnostics comprise a lightweight sort of formal definitions from Section 11.1.2 and the consequence that
verification, it is not common to think of them in those they imply:
terms. This is in part due to an odd practitioner prejudice
which says “If I am using it, it cannot be formal verifica-
tion” on the one hand, and a large gap between compiler Definition: Bug-free programs are trivial programs.
diagnostics and verification research on the other.
Although the first relevant bug might be located via Definition: Reliable programs have no known bugs.
inspection or compiler diagnostics, it is not unusual for
Consequence: Any non-trivial reliable program con-
these two steps to find only typos and false positives.
tains at least one as-yet-unknown bug.
Either way, the bulk of the relevant bugs, that is, those
bugs that might actually be encountered in production,
will often be found via testing. From this viewpoint, any advances in validation and
When testing is driven by anticipated or real use cases, verification can have but two effects: (1) An increase in
it is not uncommon for the last relevant bug to be located the number of trivial programs or (2) A decrease in the
by testing. This situation might motivate a complete number of reliable programs. Of course, the human race’s
rejection of formal verification, however, irrelevant bugs increasing reliance on multicore systems and software
have an annoying habit of suddenly becoming relevant at provides extreme motivation for a very sharp increase in
the least convenient moment possible, courtesy of black- the number of trivial programs.
hat attacks. For security-critical software, which appears
to be a continually increasing fraction of the total, there However, if your code is so complex that you find your-
can thus be strong motivation to find and fix the last bug. self relying too heavily on formal-verification tools, you
Testing is demonstrably unable to find the last bug, so should carefully rethink your design, especially if your
there is a possible role for formal verification, assuming, formal-verification tools require your code to be hand-
that is, that formal verification proves capable of growing translated to a special-purpose language. For example, a
into that role. As this chapter has shown, current formal complex implementation of the dynticks interface for pre-
verification systems are extremely limited. emptible RCU that was presented in Section 12.1.5 turned
out to have a much simpler alternative implementation,
Quick Quiz 12.35: But shouldn’t sufficiently low-level as discussed in Section 12.1.6.9. All else being equal, a
software be for all intents and purposes immune to being simpler implementation is much better than a proof of
exploited by black hats?
correctness for a complex implementation.
Please note that formal verification is often much harder And the open challenge to those working on formal ver-
to use than is testing. This is in part a cultural statement, ification techniques and systems is to prove this summary
and there is reason to hope that formal verification will wrong! To assist in this task, Verification Challenge 6 is
be perceived to be easier with increased familiarity. That now available [McK17]. Have at it!!!

12.7 Choosing a Validation Plan One approach is to devote a given fraction of the soft-
ware budget to validation, with that fraction depending on
the criticality of the software, so that safety-critical avion-
Science is a first-rate piece of furniture for one’s ics software might grant a larger fraction of its budget to
upper chamber, but only given common sense on the
validation than would a homework assignment. Where
ground floor.
available, experience from prior similar projects should
Oliver Wendell Holmes, updated be brought to bear. However, it is necessary to structure
the project so that the validation investment starts when
What sort of validation should you use for your project? the project does, otherwise the inevitable overruns in
As is often the case in software in particular and in spending on coding will crowd out the validation effort.
engineering in general, the answer is “it depends”. Staffing start-up projects with experienced people can
Note that neither running a test nor undertaking formal result in overinvestment in validation efforts. Just as it
verification will change your project. At best, such ef- is possible to go broke buying too much insurance, it is
forts have an indirect effect by locating a bug that is later the project does, otherwise the inevitable overruns in
fixed. Nevertheless, fixing a bug might prevent inconve- This is especially the case for first-of-a-kind projects where
nience, monetary loss, property damage, or even loss of it is not yet clear which use cases will be important, in
life. Clearly, this sort of indirect effect can be extremely which case testing for all possible use cases will be a
valuable. possibly fatal waste of time, energy, and funding.
Unfortunately, as we have seen, it is difficult to predict However, as the tasks supported by a start-up project
whether or not a given validation effort will find important become more routine, users often become less forgiving of
bugs. It is therefore all too easy to invest too little— failures, thus increasing the need for validation. Managing
or even to fail to invest at all, especially if development this shift in investment can be extremely challenging,
estimates proved overly optimistic or budgets unexpectedly especially in the all-too-common case where the users
tight, conditions which almost always come into play in are unwilling or unable to disclose the exact nature of
real-world software projects. their use case. It then becomes critically important to
The decision to nevertheless invest in validation is often reverse-engineer the use cases from bug reports and from
forced by experienced people with forceful personalities. discussions with the users. As these use cases are better
But this is no guarantee, given that other stakeholders understood, use of continuous integration can help reduce
might also have forceful personalities. Worse yet, these the cost of finding and fixing any bugs located.
other stakeholders might bring stories of expensive val- One example evolution of a software project’s use of
idation efforts that nevertheless allowed embarrassing validation is shown in Figure 12.4. As can be seen in
bugs to escape to the end users. So although a scarred, the figure, Linux-kernel RCU didn’t have any validation
grey-haired, and grouchy veteran might carry the day, a code whatsoever until Linux kernel v2.6.15, which was
more organized approach would perhaps be more useful. released more than two years after RCU was accepted into
Fortunately, there is a strictly financial analog to invest- the kernel. The test suite achieved its peak fraction of
ments in validation, and that is the insurance policy. the total lines of code in Linux kernel v2.6.19–v2.6.21.
Both insurance policies and validation efforts require This fraction decreased sharply with the acceptance of
consistent up-front investments, and both defend against preemptible RCU for real-time applications in v2.6.25.
disasters that might or might not ever happen. Further- This decrease was due to the fact that the RCU API was
more, both have exclusions of various types. For example, identical in the preemptible and non-preemptible variants
insurance policies for coastal areas might exclude damages of RCU. This in turn meant that the existing test suite
due to tidal waves, while on the other hand we have seen applied to both variants, so that even though the Linux-
that there is not yet any validation methodology that can kernel RCU code expanded significantly, there was no
find each and every bug. need to expand the tests.
In addition, it is possible to over-invest in both insurance Subsequent bars in Figure 12.4 show that the RCU code
and in validation. For but one example, a validation plan base expanded significantly, but that the corresponding
that consumed the entire development budget would be validation code expanded even more dramatically. Linux
just as pointless as would an insurance policy that covered kernel v3.5 added tests for the rcu_barrier() API, clos-
the Sun going nova. ing a long-standing hole in test coverage. Linux kernel


[Figure 12.4: Linux-Kernel RCU Test Code. The plot shows lines of code (LoC) for RCU and for RCU test code, together with test code as a percentage of the total ("% Test"), across Linux releases from v2.6.12 through v5.19.]

v3.14 added automated testing and analysis of test results, moving RCU towards continuous integration. Linux kernel v4.7 added a performance validation suite for RCU's update-side primitives. Linux kernel v4.12 added Tree SRCU, featuring improved update-side scalability, and v4.13 removed the old less-scalable SRCU implementation. Linux kernel v5.0 briefly hosted the nolibc library within the rcutorture scripting directory before it moved to its long-term home in tools/include/nolibc. Linux kernel v5.8 added the Tasks Trace and Rude flavors of RCU. Linux kernel v5.9 added the refscale.c suite of read-side performance tests. Linux kernels v5.12 and v5.13 started adding the ability to change a given CPU's callback-offloading status at runtime and also added the torture.sh acceptance-test script. Linux kernel v5.14 added distributed rcutorture. Linux kernel v5.15 added demonic vCPU placement in rcutorture testing, which was successful in locating a number of race conditions.5 Linux kernel v5.17 removed the RCU_FAST_NO_HZ Kconfig option. Numerous other changes may be found in the Linux kernel's git archives.

We have established that the validation budget varies from one project to the next, and also over the lifetime of any given project. But how should the validation investment be split between testing and formal verification?

This question is being answered naturally as compilers adopt increasingly aggressive formal-verification techniques into their diagnostics and as formal-verification tools continue to mature. In addition, the Linux-kernel lockdep and KCSAN tools illustrate the advantages of combining formal verification techniques with run-time analysis, as discussed in Section 11.3. Other combined techniques analyze traces gathered from executions [dOCdO19]. For the time being, the best practice is to focus first on testing and to reserve explicit work on formal verification for those portions of the project that are not well-served by testing, and that have exceptional needs for robustness. For example, Linux-kernel RCU relies primarily on testing, but has made occasional use of formal verification as discussed in this chapter.

In short, choosing a validation plan for concurrent software remains more an art than a science, let alone a field of engineering. However, there is every reason to expect that increasingly rigorous approaches will continue to become more prevalent.

5 The trick is to place one pair of vCPUs within the same core on one socket, while placing another pair within the same core on some other socket. As you might expect from Chapter 3, this produces different memory latencies between different pairs of vCPUs (https://paulmck.livejournal.com/62071.html).

Chapter 13

Putting It All Together

You don't learn how to shoot and then learn how to launch and then learn to do a controlled spin—you learn to launch-shoot-spin.
"Ender's Shadow", Orson Scott Card

This chapter gives some hints on concurrent-programming puzzles. Section 13.1 considers counter conundrums, Section 13.2 refurbishes reference counting, Section 13.3 helps with hazard pointers, Section 13.4 surmises on sequence-locking specials, and finally Section 13.5 reflects on RCU rescues.

13.1 Counter Conundrums

Ford carried on counting quietly. This is about the most aggressive thing you can do to a computer, the equivalent of going up to a human being and saying "Blood . . . blood . . . blood . . . blood . . ."
Douglas Adams

This section outlines solutions to counter conundrums.

13.1.1 Counting Updates

Suppose that Schrödinger (see Section 10.1) wants to count the number of updates for each animal, and that these updates are synchronized using a per-data-element lock. How can this counting best be done?

Of course, any number of counting algorithms from Chapter 5 might qualify, but the optimal approach is quite simple. Just place a counter in each data element, and increment it under the protection of that element's lock! If readers access the count locklessly, then updaters should use WRITE_ONCE() to update the counter and lockless readers should use READ_ONCE() to load it.

13.1.2 Counting Lookups

Suppose that Schrödinger also wants to count the number of lookups for each animal, where lookups are protected by RCU. How can this counting best be done?

One approach would be to protect a lookup counter with the per-element lock, as discussed in Section 13.1.1. Unfortunately, this would require all lookups to acquire this lock, which would be a severe bottleneck on large systems.

Another approach is to "just say no" to counting, following the example of the noatime mount option. If this approach is feasible, it is clearly the best: After all, nothing is faster than doing nothing. If the lookup count cannot be dispensed with, read on!

Any of the counters from Chapter 5 could be pressed into service, with the statistical counters described in Section 5.2 being perhaps the most common choice. However, this results in a large memory footprint: The number of counters required is the number of data elements multiplied by the number of threads.

If this memory overhead is excessive, then one approach is to keep per-core or even per-socket counters rather than per-CPU counters, with an eye to the hash-table performance results depicted in Figure 10.3. This will require that the counter increments be atomic operations, especially for user-mode execution where a given thread could migrate to another CPU at any time.

If some elements are looked up very frequently, there are a number of approaches that batch updates by maintaining a per-thread log, where multiple log entries for a given element can be merged. After a given log entry has a sufficiently large increment or after sufficient time has passed, the log entries may be applied to the corresponding data elements. Silas Boyd-Wickizer has done some work formalizing this notion [BW14].
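As a concrete illustration of the two preceding sections, the following minimal sketch combines a Section-13.1.1-style per-element update counter, incremented under that element's lock and read locklessly via READ_ONCE(), with Section-13.1.2-style per-socket lookup counters that are incremented atomically from within the RCU read-side critical section protecting the lookup. This is not code from the book's CodeSamples: the struct animal fields, MAX_SOCKETS, this_socket(), and the function names are all hypothetical, and the spinlock_t, WRITE_ONCE(), READ_ONCE(), and atomic_t primitives are assumed to behave as their Linux-kernel counterparts do.

    struct animal {
        spinlock_t lock;                     /* Serializes updates to this element. */
        unsigned long update_count;          /* Section 13.1.1: per-element update count. */
        atomic_t lookup_count[MAX_SOCKETS];  /* Section 13.1.2: per-socket lookup counts. */
        /* ... other fields ... */
    };

    static void animal_update(struct animal *ap)
    {
        spin_lock(&ap->lock);
        /* ... carry out the update itself ... */
        WRITE_ONCE(ap->update_count, ap->update_count + 1);
        spin_unlock(&ap->lock);
    }

    static unsigned long animal_read_update_count(struct animal *ap)
    {
        return READ_ONCE(ap->update_count);  /* Lockless read of the update count. */
    }

    static void animal_count_lookup(struct animal *ap)
    {
        /* Caller holds rcu_read_lock() for the enclosing lookup. */
        atomic_inc(&ap->lookup_count[this_socket()]);
    }

    static unsigned long animal_read_lookup_count(struct animal *ap)
    {
        unsigned long sum = 0;
        int s;

        for (s = 0; s < MAX_SOCKETS; s++)
            sum += (unsigned long)atomic_read(&ap->lookup_count[s]);
        return sum;
    }

The atomic increment on the lookup path is needed because, as noted above, a thread can migrate to another CPU (and thus another socket) at any time; the cost is amortized by having only one counter per socket rather than one per thread.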


13.2 Refurbish Reference Counting

Counting is the religion of this generation. It is its hope and its salvation.
Gertrude Stein

Although reference counting is a conceptually simple technique, many devils hide in the details when it is applied to concurrent software. After all, if the object was not subject to premature disposal, there would be no need for the reference counter in the first place. But if the object can be disposed of, what prevents disposal during the reference-acquisition process itself?

There are a number of ways to refurbish reference counters for use in concurrent software, including:

1. A lock residing outside of the object must be held while manipulating the reference count.

2. The object is created with a non-zero reference count, and new references may be acquired only when the current value of the reference counter is non-zero. If a thread does not have a reference to a given object, it might seek help from another thread that already has a reference.

3. In some cases, hazard pointers may be used as a drop-in replacement for reference counters.

4. An existence guarantee is provided for the object, thus preventing it from being freed while some other entity might be attempting to acquire a reference. Existence guarantees are often provided by automatic garbage collectors, and, as is seen in Sections 9.3 and 9.5, by hazard pointers and RCU, respectively.

5. A type-safety guarantee is provided for the object. An additional identity check must be performed once the reference is acquired. Type-safety guarantees can be provided by special-purpose memory allocators, for example, by the SLAB_TYPESAFE_BY_RCU feature within the Linux kernel, as is seen in Section 9.5.

Of course, any mechanism that provides existence guarantees by definition also provides type-safety guarantees. This results in four general categories of reference-acquisition protection: Reference counting, hazard pointers, sequence locking, and RCU.

Quick Quiz 13.1: Why not implement reference-acquisition using a simple compare-and-swap operation that only acquires a reference if the reference counter is non-zero?

Given that the key reference-counting issue is synchronization between acquisition of a reference and freeing of the object, we have nine possible combinations of mechanisms, as shown in Table 13.1.

Table 13.1: Synchronizing Reference Counting

                                  Release
  Acquisition         Locks   Reference Counts   Hazard Pointers   RCU
  Locks                 -           CAM                 M           CA
  Reference Counts      A           AM                  M           A
  Hazard Pointers       M           M                   M           M
  RCU                   CA          MCA                 M           CA

This table divides reference-counting mechanisms into the following broad categories:

1. Simple counting with neither atomic operations, memory barriers, nor alignment constraints ("-").

2. Atomic counting without memory barriers ("A").

3. Atomic counting, with memory barriers required only on release ("AM").

4. Atomic counting with a check combined with the atomic acquisition operation, and with memory barriers required only on release ("CAM").

5. Atomic counting with a check combined with the atomic acquisition operation ("CA").

6. Simple counting with a check combined with full memory barriers ("M").

7. Atomic counting with a check combined with the atomic acquisition operation, and with memory barriers also required on acquisition ("MCA").

However, because all Linux-kernel atomic operations that return a value are defined to contain memory barriers,1 all release operations contain memory barriers, and all checked acquisition operations also contain memory barriers. Therefore, cases "CA" and "MCA" are equivalent to

1 With atomic_read() and ATOMIC_INIT() being the exceptions that prove the rule.


"CAM", so that there are sections below for only the first four cases and the sixth case: "-", "A", "AM", "CAM", and "M". Later sections describe optimizations that can improve performance if reference acquisition and release is very frequent, and the reference count need be checked for zero only very rarely.

13.2.1 Implementation of Reference-Counting Categories

Simple counting protected by locking ("-") is described in Section 13.2.1.1, atomic counting with no memory barriers ("A") is described in Section 13.2.1.2, atomic counting with acquisition memory barrier ("AM") is described in Section 13.2.1.3, and atomic counting with check and release memory barrier ("CAM") is described in Section 13.2.1.4. Use of hazard pointers is described in Section 9.3 on page 130 and in Section 13.3.

13.2.1.1 Simple Counting

Simple counting, with neither atomic operations nor memory barriers, can be used when the reference-counter acquisition and release are both protected by the same lock. In this case, it should be clear that the reference count itself may be manipulated non-atomically, because the lock provides any necessary exclusion, memory barriers, atomic instructions, and disabling of compiler optimizations. This is the method of choice when the lock is required to protect other operations in addition to the reference count, but where a reference to the object must be held after the lock is released. Listing 13.1 shows a simple API that might be used to implement simple non-atomic reference counting—although simple reference counting is almost always open-coded instead.

Listing 13.1: Simple Reference-Count API
1 struct sref {
2     int refcount;
3 };
4
5 void sref_init(struct sref *sref)
6 {
7     sref->refcount = 1;
8 }
9
10 void sref_get(struct sref *sref)
11 {
12     sref->refcount++;
13 }
14
15 int sref_put(struct sref *sref,
16              void (*release)(struct sref *sref))
17 {
18     WARN_ON(release == NULL);
19     WARN_ON(release == (void (*)(struct sref *))kfree);
20
21     if (--sref->refcount == 0) {
22         release(sref);
23         return 1;
24     }
25     return 0;
26 }

13.2.1.2 Atomic Counting

Simple atomic counting may be used in cases where any CPU acquiring a reference must already hold a reference. This style is used when a single CPU creates an object for its own private use, but must allow for accesses from other CPUs, tasks, timer handlers, and so on. Any CPU that hands the object off must first acquire a new reference on behalf of the recipient on the one hand, or refrain from further accesses after the handoff on the other. In the Linux kernel, the kref primitives are used to implement this style of reference counting, as shown in Listing 13.2.2

2 As of Linux v4.10. Linux v4.11 introduced a refcount_t API that improves efficiency on weakly ordered platforms, but which is functionally equivalent to the atomic_t that it replaced.

Atomic counting is required in this case because locking does not protect all reference-count operations, which means that two different CPUs might concurrently manipulate the reference count. If normal increment and decrement were used, a pair of CPUs might both fetch the reference count concurrently, perhaps both obtaining the value "3". If both of them increment their value, they will both obtain "4", and both will store this value back into the counter. Since the new value of the counter should instead be "5", one of the increments has been lost. Therefore, atomic operations must be used both for counter increments and for counter decrements.

If releases are guarded by locking, hazard pointers, or RCU, memory barriers are not required, but for different reasons. In the case of locking, the locks provide any needed memory barriers (and disabling of compiler optimizations), and the locks also prevent a pair of releases from running concurrently. In the case of hazard pointers and RCU, cleanup will be deferred, and any needed memory barriers or disabling of compiler optimizations will be provided by the hazard-pointers or RCU infrastructure. Therefore, if two CPUs release the final two references concurrently, the actual cleanup will be deferred until both


CPUs have released their hazard pointers or exited their RCU read-side critical sections, respectively.

Quick Quiz 13.2: Why isn't it necessary to guard against cases where one CPU acquires a reference just after another CPU releases the last reference?

Listing 13.2: Linux Kernel kref API
1 struct kref {
2     atomic_t refcount;
3 };
4
5 void kref_init(struct kref *kref)
6 {
7     atomic_set(&kref->refcount, 1);
8 }
9
10 void kref_get(struct kref *kref)
11 {
12     WARN_ON(!atomic_read(&kref->refcount));
13     atomic_inc(&kref->refcount);
14 }
15
16 static inline int
17 kref_sub(struct kref *kref, unsigned int count,
18          void (*release)(struct kref *kref))
19 {
20     WARN_ON(release == NULL);
21
22     if (atomic_sub_and_test((int) count,
23                             &kref->refcount)) {
24         release(kref);
25         return 1;
26     }
27     return 0;
28 }

The kref structure itself, consisting of a single atomic data item, is shown in lines 1–3 of Listing 13.2. The kref_init() function on lines 5–8 initializes the counter to the value "1". Note that the atomic_set() primitive is a simple assignment; the name stems from the data type of atomic_t rather than from the operation. The kref_init() function must be invoked during object creation, before the object has been made available to any other CPU.

The kref_get() function on lines 10–14 unconditionally atomically increments the counter. The atomic_inc() primitive does not necessarily explicitly disable compiler optimizations on all platforms, but the fact that the kref primitives are in a separate module and that the Linux kernel build process does no cross-module optimizations has the same effect.

The kref_sub() function on lines 16–28 atomically decrements the counter, and if the result is zero, line 24 invokes the specified release() function and line 25 returns, informing the caller that release() was invoked. Otherwise, kref_sub() returns zero, informing the caller that release() was not called.

Quick Quiz 13.3: Suppose that just after the atomic_sub_and_test() on line 22 of Listing 13.2 is invoked, that some other CPU invokes kref_get(). Doesn't this result in that other CPU now having an illegal reference to a released object?

Quick Quiz 13.4: Suppose that kref_sub() returns zero, indicating that the release() function was not invoked. Under what conditions can the caller rely on the continued existence of the enclosing object?

Quick Quiz 13.5: Why not just pass kfree() as the release function?

13.2.1.3 Atomic Counting With Release Memory Barrier

Atomic reference counting with release memory barriers is used by the Linux kernel's networking layer to track the destination caches that are used in packet routing. The actual implementation is quite a bit more involved; this section focuses on the aspects of struct dst_entry reference-count handling that match this use case, shown in Listing 13.3.3

3 As of Linux v4.13. Linux v4.14 added a level of indirection to permit more comprehensive debugging checks, but the overall effect in the absence of bugs is identical.

Listing 13.3: Linux Kernel dst_clone API
1 static inline
2 struct dst_entry * dst_clone(struct dst_entry * dst)
3 {
4     if (dst)
5         atomic_inc(&dst->__refcnt);
6     return dst;
7 }
8
9 static inline
10 void dst_release(struct dst_entry * dst)
11 {
12     if (dst) {
13         WARN_ON(atomic_read(&dst->__refcnt) < 1);
14         smp_mb__before_atomic_dec();
15         atomic_dec(&dst->__refcnt);
16     }
17 }

The dst_clone() primitive may be used if the caller already has a reference to the specified dst_entry, in which case it obtains another reference that may be handed off to some other entity within the kernel. Because a reference is already held by the caller, dst_clone() need not execute any memory barriers. The act of handing the dst_entry to some other entity might or might not require a memory barrier, but if such a memory barrier is


required, it will be embedded in the mechanism used to hand the dst_entry off.

The dst_release() primitive may be invoked from any environment, and the caller might well reference elements of the dst_entry structure immediately prior to the call to dst_release(). The dst_release() primitive therefore contains a memory barrier on line 14 preventing both the compiler and the CPU from misordering accesses.

Please note that the programmer making use of dst_clone() and dst_release() need not be aware of the memory barriers, only of the rules for using these two primitives.

13.2.1.4 Atomic Counting With Check and Release Memory Barrier

Consider a situation where the caller must be able to acquire a new reference to an object to which it does not already hold a reference, but where that object's existence is guaranteed. The fact that initial reference-count acquisition can now run concurrently with reference-count release adds further complications. Suppose that a reference-count release finds that the new value of the reference count is zero, signaling that it is now safe to clean up the reference-counted object. We clearly cannot allow a reference-count acquisition to start after such clean-up has commenced, so the acquisition must include a check for a zero reference count. This check must be part of the atomic increment operation, as shown below.

Quick Quiz 13.6: Why can't the check for a zero reference count be made in a simple "if" statement with an atomic increment in its "then" clause?

The Linux kernel's fget() and fput() primitives use this style of reference counting. Simplified versions of these functions are shown in Listing 13.4.4

4 As of Linux v2.6.38. Additional O_PATH functionality was added in v2.6.39, refactoring was applied in v3.14, and mmap_sem contention was reduced in v4.1.

Listing 13.4: Linux Kernel fget/fput API
1 struct file *fget(unsigned int fd)
2 {
3     struct file *file;
4     struct files_struct *files = current->files;
5
6     rcu_read_lock();
7     file = fcheck_files(files, fd);
8     if (file) {
9         if (!atomic_inc_not_zero(&file->f_count)) {
10             rcu_read_unlock();
11             return NULL;
12         }
13     }
14     rcu_read_unlock();
15     return file;
16 }
17
18 struct file *
19 fcheck_files(struct files_struct *files, unsigned int fd)
20 {
21     struct file * file = NULL;
22     struct fdtable *fdt = rcu_dereference((files)->fdt);
23
24     if (fd < fdt->max_fds)
25         file = rcu_dereference(fdt->fd[fd]);
26     return file;
27 }
28
29 void fput(struct file *file)
30 {
31     if (atomic_dec_and_test(&file->f_count))
32         call_rcu(&file->f_u.fu_rcuhead, file_free_rcu);
33 }
34
35 static void file_free_rcu(struct rcu_head *head)
36 {
37     struct file *f;
38
39     f = container_of(head, struct file, f_u.fu_rcuhead);
40     kmem_cache_free(filp_cachep, f);
41 }

Line 4 of fget() fetches the pointer to the current process's file-descriptor table, which might well be shared with other processes. Line 6 invokes rcu_read_lock(), which enters an RCU read-side critical section. The callback function from any subsequent call_rcu() primitive will be deferred until a matching rcu_read_unlock() is reached (line 10 or 14 in this example). Line 7 looks up the file structure corresponding to the file descriptor specified by the fd argument, as will be described later. If there is an open file corresponding to the specified file descriptor, then line 9 attempts to atomically acquire a reference count. If it fails to do so, lines 10–11 exit the RCU read-side critical section and report failure. Otherwise, if the attempt is successful, lines 14–15 exit the read-side critical section and return a pointer to the file structure.

The fcheck_files() primitive is a helper function for fget(). Line 22 uses rcu_dereference() to safely fetch an RCU-protected pointer to this task's current file-descriptor table, and line 24 checks to see if the specified file descriptor is in range. If so, line 25 fetches the pointer to the file structure, again using the rcu_dereference() primitive. Line 26 then returns a pointer to the file structure or NULL in case of failure.

The fput() primitive releases a reference to a file structure. Line 31 atomically decrements the reference count, and, if the result was zero, line 32 invokes the call_rcu() primitives in order to free up the file structure (via the file_free_rcu() function specified in call_rcu()'s second argument), but only after all currently-


executing RCU read-side critical sections complete, that is, after an RCU grace period has elapsed.

Once the grace period completes, the file_free_rcu() function obtains a pointer to the file structure on line 39, and frees it on line 40.

This code fragment thus demonstrates how RCU can be used to guarantee existence while an in-object reference count is being incremented.

13.2.2 Counter Optimizations

In some cases where increments and decrements are common, but checks for zero are rare, it makes sense to maintain per-CPU or per-task counters, as was discussed in Chapter 5. For example, see the paper on sleepable read-copy update (SRCU), which applies this technique to RCU [McK06]. This approach eliminates the need for atomic instructions or memory barriers on the increment and decrement primitives, but still requires that code-motion compiler optimizations be disabled. In addition, the primitives such as synchronize_srcu() that check for the aggregate reference count reaching zero can be quite slow. This underscores the fact that these techniques are designed for situations where the references are frequently acquired and released, but where it is rarely necessary to check for a zero reference count.

However, it is usually the case that use of reference counts requires writing (often atomically) to a data structure that is otherwise read only. In this case, reference counts are imposing expensive cache misses on readers. It is therefore worthwhile to look into synchronization mechanisms that do not require readers to write to the data structure being traversed. One possibility is the hazard pointers covered in Section 9.3 and another is RCU, which is covered in Section 9.5.
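As a rough illustration of the per-CPU/per-task approach, and assuming the book's userspace primitives (NR_THREADS, smp_thread_id(), WRITE_ONCE(), and READ_ONCE()), a hypothetical split reference counter might look as follows. This is only a sketch: it shows why increments and decrements are cheap while the zero check is expensive, and, as with synchronize_srcu(), a real implementation needs additional synchronization (not shown) to keep new acquisitions from racing with that check.

    struct split_ref {
        long cnt[NR_THREADS];  /* One counter per thread; padding and alignment ignored. */
    };

    static inline void split_ref_acquire(struct split_ref *p)
    {
        int t = smp_thread_id();

        /* No atomics or memory barriers, just a plain per-thread increment. */
        WRITE_ONCE(p->cnt[t], p->cnt[t] + 1);
    }

    static inline void split_ref_release(struct split_ref *p)
    {
        int t = smp_thread_id();

        /* A slot can go negative if acquire and release run on different threads. */
        WRITE_ONCE(p->cnt[t], p->cnt[t] - 1);
    }

    /* Slow path: sum all slots.  Only the total is meaningful. */
    static long split_ref_sum(struct split_ref *p)
    {
        long sum = 0;
        int t;

        for (t = 0; t < NR_THREADS; t++)
            sum += READ_ONCE(p->cnt[t]);
        return sum;
    }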
13.3 Hazard-Pointer Helpers

It's the little things that count, hundreds of them.
Cliff Shaw

This section looks at some issues that can be addressed with the help of hazard pointers. In addition, hazard pointers can sometimes be used to address the issues called out in Section 13.5, and vice versa.

13.3.1 Scalable Reference Count

Suppose a reference count is becoming a performance or scalability bottleneck. What can you do?

One approach is to instead use hazard pointers.

There are some differences, perhaps most notably that with hazard pointers it is extremely expensive to determine when the corresponding reference count has reached zero.

One way to work around this problem is to split the load between reference counters and hazard pointers. Each data element has a reference counter that tracks the number of other data elements referencing this element on the one hand, and readers use hazard pointers on the other.

Making this arrangement work both efficiently and correctly can be quite challenging, and so interested readers are invited to examine the UnboundedQueue and ConcurrentHashMap data structures implemented in the Folly open-source library.5

5 https://github.com/facebook/folly

13.3.2 Long-Duration Accesses

Suppose a reader-writer-locking reader is holding the lock for so long that updates are excessively delayed. If that reader can reasonably be converted to use reference counting instead of reader-writer locking, but if performance and scalability considerations prevent use of actual reference counters, then hazard pointers provide a scalable variant of reference counting.

The key point is that where reader-writer locking readers block all updates for that lock, hazard pointers instead simply hang onto the data that is actually needed, while still allowing updates to proceed.

If the reader cannot reasonably be converted to use reference counting, the tricks in Section 13.5.8 might be helpful.

13.4 Sequence-Locking Specials

The girl who can't dance says the band can't play.
Yiddish proverb

This section looks at some special uses of sequence counters.


13.4.1 Dueling Sequence Locks

The classic sequence-locking use case enables a reader to see a consistent snapshot of a small collection of variables, for example, calibration constants for timekeeping. This works quite well in practice because calibration constants are rarely updated and, when updated, are updated quickly. Readers therefore almost never need to retry.

However, if the updater is delayed during the update, readers will also be delayed. Such delays might be due to interrupts, NMIs, or even virtual-CPU preemption.

One way to prevent updater delays from causing reader delays is to maintain two sets of calibration constants. Each set is updated in turn, but frequently enough that readers can make good use of either set. Each set has its own sequence lock (seqlock_t structure).

The updater alternates between the two sets, so that a delayed updater delays readers of at most one of the sets.

Each reader attempts to access the first set, but upon retry attempts to access the second set. If the second set also forces a retry, the reader repeats starting again from the first set. If the updater is stuck, only one of the two sets will force readers to retry, and therefore readers will succeed as soon as they attempt to access the other set.

Quick Quiz 13.7: Why don't all sequence-locking use cases replicate the data in this fashion?
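A minimal kernel-flavored sketch of this dueling-sets scheme might look as follows, using the Linux kernel's seqlock_t along with read_seqbegin(), read_seqretry(), write_seqlock(), and write_sequnlock(). The struct cal_set layout, the mult and shift fields, and the function names are hypothetical, and initialization of the two locks via seqlock_init() is omitted.

    struct cal_set {
        seqlock_t lock;
        u64 mult;   /* Hypothetical calibration constants. */
        u64 shift;
    };

    static struct cal_set cal[2];  /* Two copies, updated alternately. */

    /* Reader: try one set, and upon retry switch to the other set. */
    static void get_calibration(u64 *mult, u64 *shift)
    {
        unsigned int seq;
        int i = 0;

        for (;;) {
            seq = read_seqbegin(&cal[i].lock);
            *mult = cal[i].mult;
            *shift = cal[i].shift;
            if (!read_seqretry(&cal[i].lock, seq))
                return;
            i = !i;  /* A stuck updater can block readers of at most one set. */
        }
    }

    /* Updater: write the indicated set, alternating between the two. */
    static void set_calibration(int i, u64 mult, u64 shift)
    {
        write_seqlock(&cal[i].lock);
        cal[i].mult = mult;
        cal[i].shift = shift;
        write_sequnlock(&cal[i].lock);
    }

Because the updater never write-holds both locks at once, a delay while updating one set leaves the other set available to readers.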
13.4.2 Correlated Data Elements

Suppose we have a hash table where we need correlated views of two or more of the elements. These elements are updated together, and we do not want to see an old version of the first element along with new versions of the other elements. For example, Schrödinger decided to add his extended family to his in-memory database along with all his animals. Although Schrödinger understands that marriages and divorces do not happen instantaneously, he is also a traditionalist. As such, he absolutely does not want his database ever to show that the bride is now married, but the groom is not, and vice versa. Plus, if you think Schrödinger is a traditionalist, you just try conversing with some of his family members! In other words, Schrödinger wants to be able to carry out a wedlock-consistent traversal of his database.

One approach is to use sequence locks (see Section 9.4), so that wedlock-related updates are carried out under the protection of write_seqlock(), while reads requiring wedlock consistency are carried out within a read_seqbegin() / read_seqretry() loop. Note that sequence locks are not a replacement for RCU protection: Sequence locks protect against concurrent modifications, but RCU is still needed to protect against concurrent deletions.

This approach works quite well when the number of correlated elements is small, the time to read these elements is short, and the update rate is low. Otherwise, updates might happen so quickly that readers might never complete. Although Schrödinger does not expect that even his least-sane relatives will marry and divorce quickly enough for this to be a problem, he does realize that this problem could well arise in other situations. One way to avoid this reader-starvation problem is to have the readers use the update-side primitives if there have been too many retries, but this can degrade both performance and scalability. Another way to avoid starvation is to have multiple sequence locks, in Schrödinger's case, perhaps one per species.

In addition, if the update-side primitives are used too frequently, poor performance and scalability will result due to lock contention. One way to avoid this is to maintain a per-element sequence lock, and to hold both spouses' locks when updating their marital status. Readers can do their retry looping on either of the spouses' locks to gain a stable view of any change in marital status involving both members of the pair. This avoids contention due to high marriage and divorce rates, but complicates gaining a stable view of all marital statuses during a single scan of the database.

If the element groupings are well-defined and persistent, which marital status is hoped to be, then one approach is to add pointers to the data elements to link together the members of a given group. Readers can then traverse these pointers to access all the data elements in the same group as the first one located.

This technique is used heavily in the Linux kernel, perhaps most notably in the dcache subsystem [Bro15b]. Note that it is likely that similar schemes also work with hazard pointers.

This approach provides sequential consistency to successful readers, each of which will either see the effects of a given update or not, with any partial updates resulting in a read-side retry. Sequential consistency is an extremely strong guarantee, incurring equally strong restrictions and equally high overheads. In this case, we saw that readers might be starved on the one hand, or might need to acquire the update-side lock on the other. Although this works very well in cases where updates are infrequent, it unnecessarily forces read-side retries even when the


update does not affect any of the data that a retried reader accesses. Section 13.5.4 therefore covers a much weaker form of consistency that not only avoids reader starvation, but also avoids any form of read-side retry. The next section instead presents a weaker form of consistency that can be provided with much lower probabilities of reader starvation.

13.4.3 Atomic Move

Suppose that individual data elements are moved from one data structure to another, and that readers look up only single data structures. However, when a data element moves, readers must never see it as being in both structures at the same time and must also never see it as missing from both structures at the same time. At the same time, any reader seeing the element in its new location must never subsequently see it in its old location. In addition, the move may be implemented by inserting a new copy of the old data element into the destination location.

For example, consider a hash table that supports an atomic-to-readers rename operation. Expanding on Schrödinger's zoo, suppose that an animal's name changes, for example, each of the brides in Schrödinger's traditionalist family might change their last name to match that of their groom.

But changing their name might change the hash value, and might also require that the bride's element move from one hash chain to another. The consistency set forth above requires that if a reader successfully looks up the new name, then any subsequent lookup of the old name by that reader must result in failure. Similarly, if a reader's lookup of the old name results in lookup failure, then any subsequent lookup of the new name by that reader must succeed. In short, a given reader should not see a bride momentarily blinking out of existence, nor should that reader look up a bride under her new name and then later look up that bride under her old name.

This consistency guarantee could be enforced with a single global sequence lock as described in Section 13.4.2, but this can result in reader starvation even for readers that are not looking up a bride who is currently undergoing a name change. This guarantee could also be enforced by requiring that readers acquire a per-hash-chain lock, but reviewing Figure 10.2 shows that this results in poor performance and scalability, even for single-socket systems.

Another more reader-friendly way to implement this is to use RCU and to place a sequence lock on each element. Readers looking up a given element act as sequence-lock readers across their full set of accesses to that element. Note that these sequence-lock operations will order each reader's lookups.

Renaming an element can then proceed roughly as follows:

1. Acquire a global lock protecting rename operations.

2. Allocate and initialize a copy of the element with the new name.

3. Write-acquire the sequence lock on the element with the old name, which has the side effect of ordering this acquisition with the following insertion. Concurrent lookups of the old name will now repeatedly retry.

4. Insert the copy of the element with the new name. Lookups of the new name will now succeed.

5. Execute smp_wmb() to order the prior insertion with the subsequent removal.

6. Remove the element with the old name. Concurrent lookups of the old name will now fail.

7. Write-release the sequence lock if necessary, for example, if required by lock dependency checkers.

8. Release the global lock.

Thus, readers looking up the old name will retry until the new name is available, at which point their final retry will fail. Any subsequent lookups of the new name will succeed. Any reader succeeding in looking up the new name is guaranteed that any subsequent lookup of the old name will fail, perhaps after a series of retries.

Quick Quiz 13.8: Is it possible to write-acquire the sequence lock on the new element before it is inserted instead of acquiring that of the old element before it is removed?

Quick Quiz 13.9: Is it possible to avoid the global lock?

It is of course possible to instead implement this procedure somewhat more efficiently using simple flags. However, this can be thought of as a simplified variant of sequence locking that relies on the fact that a given element's sequence lock is never write-acquired more than once.
v2022.09.25a
13.5. RCU RESCUES 277

13.4.4 Upgrade to Writer sum. In particular, when a given thread exits, we absolutely
cannot lose the exiting thread’s count, nor can we double-
As discussed in Section 9.5.4.9, RCU permits readers to count it. Such an error could result in inaccuracies equal to
upgrade to writers. This capability can be quite useful the full precision of the result, in other words, such an error
when a reader scanning an RCU-protected data structure would make the result completely useless. And in fact, one
notices that the current element needs to be updated. What of the purposes of final_mutex is to ensure that threads
happens when you try this trick with sequence locking? do not come and go in the middle of read_count()
It turns out that this sequence-locking trick is actually execution.
used in the Linux kernel, for example, by the sdma_
Therefore, if we are to dispense with final_mutex, we
flush() function in drivers/infiniband/hw/hfi1/
will need to come up with some other method for ensuring
sdma.c. The effect is to doom the enclosing reader to
consistency. One approach is to place the total count for
retry. This trick is therefore used when the reader detects
all previously exited threads and the array of pointers to
some condition that requires a retry.
the per-thread counters into a single structure. Such a
structure, once made available to read_count(), is held
constant, ensuring that read_count() sees consistent
13.5 RCU Rescues data.

With great doubts comes great understanding, with


little doubts comes little understanding. 13.5.1.2 Implementation
Chinese proverb Lines 1–4 of Listing 13.5 show the countarray struc-
ture, which contains a ->total field for the count from
This section shows how to apply RCU to some examples previously exited threads, and a counterp[] array of
discussed earlier in this book. In some cases, RCU pointers to the per-thread counter for each currently
provides simpler code, in other cases better performance running thread. This structure allows a given execution of
and scalability, and in still other cases, both. read_count() to see a total that is consistent with the
indicated set of running threads.
13.5.1 RCU and Per-Thread-Variable- Lines 6–8 contain the definition of the per-thread
counter variable, the global pointer countarrayp refer-
Based Statistical Counters encing the current countarray structure, and the final_
Section 5.2.3 described an implementation of statistical mutex spinlock.
counters that provided excellent performance, roughly that Lines 10–13 show inc_count(), which is unchanged
of simple increment (as in the C ++ operator), and linear from Listing 5.4.
scalability—but only for incrementing via inc_count(). Lines 15–31 show read_count(), which has changed
Unfortunately, threads needing to read out the value via significantly. Lines 22 and 29 substitute rcu_
read_count() were required to acquire a global lock, and read_lock() and rcu_read_unlock() for acquisi-
thus incurred high overhead and suffered poor scalability. tion and release of final_mutex. Line 23 uses rcu_
The code for the lock-based implementation is shown in dereference() to snapshot the current countarray
Listing 5.4 on page 53. structure into local variable cap. Proper use of RCU will
guarantee that this countarray structure will remain with
Quick Quiz 13.10: Why on earth did we need that global
lock in the first place? us through at least the end of the current RCU read-side
critical section at line 29. Line 24 initializes sum to cap->
total, which is the sum of the counts of threads that
have previously exited. Lines 25–27 add up the per-thread
13.5.1.1 Design
counters corresponding to currently running threads, and,
The hope is to use RCU rather than final_mutex to finally, line 30 returns the sum.
protect the thread traversal in read_count() in order to The initial value for countarrayp is provided by
obtain excellent performance and scalability from read_ count_init() on lines 33–41. This function runs before
count(), rather than just from inc_count(). However, the first thread is created, and its job is to allocate and zero
we do not want to give up any accuracy in the computed the initial structure, and then assign it to countarrayp.

v2022.09.25a
278 CHAPTER 13. PUTTING IT ALL TOGETHER

Listing 13.5: RCU and Per-Thread Statistical Counters Lines 43–50 show the count_register_thread()
1 struct countarray { function, which is invoked by each newly created thread.
2 unsigned long total;
3 unsigned long *counterp[NR_THREADS];
Line 45 picks up the current thread’s index, line 47 acquires
4 }; final_mutex, line 48 installs a pointer to this thread’s
5
6 unsigned long __thread counter = 0;
counter, and line 49 releases final_mutex.
7 struct countarray *countarrayp = NULL;
8 DEFINE_SPINLOCK(final_mutex);
Quick Quiz 13.11: Hey!!! Line 48 of Listing 13.5 modifies
9 a value in a pre-existing countarray structure! Didn’t you
10 __inline__ void inc_count(void) say that this structure, once made available to read_count(),
11 {
12 WRITE_ONCE(counter, counter + 1); remained constant???
13 }
14 Lines 52–72 show count_unregister_thread(),
15 unsigned long read_count(void) which is invoked by each thread just before it exits.
16 {
17 struct countarray *cap; Lines 58–62 allocate a new countarray structure, line 63
18 unsigned long *ctrp; acquires final_mutex and line 69 releases it. Line 64
19 unsigned long sum;
20 int t; copies the contents of the current countarray into the
21
22 rcu_read_lock();
newly allocated version, line 65 adds the exiting thread’s
23 cap = rcu_dereference(countarrayp); counter to new structure’s ->total, and line 66 NULLs
24 sum = READ_ONCE(cap->total); the exiting thread’s counterp[] array element. Line 67
25 for_each_thread(t) {
26 ctrp = READ_ONCE(cap->counterp[t]); then retains a pointer to the current (soon to be old)
27 if (ctrp != NULL) sum += *ctrp; countarray structure, and line 68 uses rcu_assign_
28 }
29 rcu_read_unlock(); pointer() to install the new version of the countarray
30 return sum; structure. Line 70 waits for a grace period to elapse, so
31 }
32 that any threads that might be concurrently executing in
33 void count_init(void) read_count(), and thus might have references to the old
34 {
35 countarrayp = malloc(sizeof(*countarrayp)); countarray structure, will be allowed to exit their RCU
36 if (countarrayp == NULL) { read-side critical sections, thus dropping any such refer-
37 fprintf(stderr, "Out of memory\n");
38 exit(EXIT_FAILURE); ences. Line 71 can then safely free the old countarray
39 } structure.
40 memset(countarrayp, '\0', sizeof(*countarrayp));
41 } Quick Quiz 13.12: Given the fixed-size counterp array,
42
43 void count_register_thread(unsigned long *p)
exactly how does this code avoid a fixed upper bound on the
44 { number of threads???
45 int idx = smp_thread_id();
46
47 spin_lock(&final_mutex);
48 countarrayp->counterp[idx] = &counter; 13.5.1.3 Discussion
49 spin_unlock(&final_mutex);
50 }
51 Quick Quiz 13.13: Wow! Listing 13.5 contains 70 lines
52 void count_unregister_thread(int nthreadsexpected) of code, compared to only 42 in Listing 5.4. Is this extra
53 {
54 struct countarray *cap; complexity really worth it?
55 struct countarray *capold;
56 int idx = smp_thread_id(); Use of RCU enables exiting threads to wait until other
57
58 cap = malloc(sizeof(*countarrayp)); threads are guaranteed to be done using the exiting threads’
59 if (cap == NULL) { __thread variables. This allows the read_count()
60 fprintf(stderr, "Out of memory\n");
61 exit(EXIT_FAILURE); function to dispense with locking, thereby providing ex-
62 } cellent performance and scalability for both the inc_
63 spin_lock(&final_mutex);
64 *cap = *countarrayp; count() and read_count() functions. However, this
65 WRITE_ONCE(cap->total, cap->total + counter); performance and scalability come at the cost of some
66 cap->counterp[idx] = NULL;
67 capold = countarrayp; increase in code complexity. It is hoped that compiler and
68 rcu_assign_pointer(countarrayp, cap); library writers employ user-level RCU [Des09b] to provide
69 spin_unlock(&final_mutex);
70 synchronize_rcu(); safe cross-thread access to __thread variables, greatly
71 free(capold); reducing the complexity seen by users of __thread vari-
72 }
ables.

v2022.09.25a
13.5. RCU RESCUES 279

13.5.2 RCU and Counters for Removable Listing 13.6: RCU-Protected Variable-Length Array
struct foo {
I/O Devices 1
2 int length;
3 char *a;
Section 5.4.6 showed a fanciful pair of code fragments for 4 };
dealing with counting I/O accesses to removable devices.
These code fragments suffered from high overhead on Listing 13.7: Improved RCU-Protected Variable-Length Array
the fastpath (starting an I/O) due to the need to acquire a 1 struct foo_a {
reader-writer lock. 2 int length;
3 char a[0];
This section shows how RCU may be used to avoid this 4 };
overhead. 5
6 struct foo {
The code for performing an I/O is quite similar to the 7 struct foo_a *fa;
original, with an RCU read-side critical section being 8 };

substituted for the reader-writer lock read-side critical


section in the original:
is given by the field ->length. Of course, this introduces
1 rcu_read_lock(); the following race condition:
2 if (removing) {
3 rcu_read_unlock(); 1. The array is initially 16 characters long, and thus
4 cancel_io();
5 } else { ->length is equal to 16.
6 add_count(1);
7 rcu_read_unlock(); 2. CPU 0 loads the value of ->length, obtaining the
8 do_io();
9 sub_count(1); value 16.
10 }
3. CPU 1 shrinks the array to be of length 8, and assigns
a pointer to a new 8-character block of memory into
The RCU read-side primitives have minimal overhead, ->a[].
thus speeding up the fastpath, as desired.
The updated code fragment removing a device is as 4. CPU 0 picks up the new pointer from ->a[], and
follows: stores a new value into element 12. Because the
array has only 8 characters, this results in a SEGV or
1 spin_lock(&mylock); (worse yet) memory corruption.
2 removing = 1;
3 sub_count(mybias);
4 spin_unlock(&mylock); How can we prevent this?
5 synchronize_rcu(); One approach is to make careful use of memory barriers,
6 while (read_count() != 0) {
7 poll(NULL, 0, 1); which are covered in Chapter 15. This works, but incurs
8 } read-side overhead and, perhaps worse, requires use of
9 remove_device();
explicit memory barriers.
A better approach is to put the value and the array into
Here we replace the reader-writer lock with an exclusive the same structure, as shown in Listing 13.7 [ACMS03].
spinlock and add a synchronize_rcu() to wait for all of Allocating a new array (foo_a structure) then automat-
the RCU read-side critical sections to complete. Because ically provides a new place for the array length. This
of the synchronize_rcu(), once we reach line 6, we means that if any CPU picks up a reference to ->fa, it is
know that all remaining I/Os have been accounted for. guaranteed that the ->length will match the ->a[].
Of course, the overhead of synchronize_rcu() can
be large, but given that device removal is quite rare, this 1. The array is initially 16 characters long, and thus
is usually a good tradeoff. ->length is equal to 16.
2. CPU 0 loads the value of ->fa, obtaining a pointer to
13.5.3 Array and Length the structure containing the value 16 and the 16-byte
array.
Suppose we have an RCU-protected variable-length array,
as shown in Listing 13.6. The length of the array ->a[] 3. CPU 0 loads the value of ->fa->length, obtaining
can change dynamically, and at any given time, its length the value 16.

v2022.09.25a
280 CHAPTER 13. PUTTING IT ALL TOGETHER

Listing 13.8: Uncorrelated Measurement Fields Listing 13.9: Correlated Measurement Fields
1 struct animal { 1 struct measurement {
2 char name[40]; 2 double meas_1;
3 double age; 3 double meas_2;
4 double meas_1; 4 double meas_3;
5 double meas_2; 5 };
6 double meas_3; 6
7 char photo[0]; /* large bitmap. */ 7 struct animal {
8 }; 8 char name[40];
9 double age;
10 struct measurement *mp;
11 char photo[0]; /* large bitmap. */
4. CPU 1 shrinks the array to be of length 8, and assigns 12 };

a pointer to a new foo_a structure containing an 8-


character block of memory into ->fa.
this new measurement structure using rcu_assign_
5. CPU 0 picks up the new pointer from ->a[], and pointer(). After a grace period elapses, the old
stores a new value into element 12. But because measurement structure can be freed.
CPU 0 is still referencing the old foo_a structure Quick Quiz 13.14: But cant’t the approach shown in List-
that contains the 16-byte array, all is well. ing 13.9 result in extra cache misses, in turn resulting in
additional read-side overhead?
Of course, in both cases, CPU 1 must wait for a grace
period before freeing the old array. This approach enables readers to see correlated values
A more general version of this approach is presented in for selected fields, but while incurring minimal read-side
the next section. overhead. This per-data-element consistency suffices in
the common case where a reader looks only at a single
data element.
13.5.4 Correlated Fields
Suppose that each of Schödinger’s animals is represented 13.5.5 Update-Friendly Traversal
by the data element shown in Listing 13.8. The meas_
1, meas_2, and meas_3 fields are a set of correlated Suppose that a statistical scan of all elements in a hash
measurements that are updated periodically. It is critically table is required. For example, Schrödinger might wish
important that readers see these three values from a single to compute the average length-to-weight ratio over all of
measurement update: If a reader sees an old value of his animals.7 Suppose further that Schrödinger is willing
meas_1 but new values of meas_2 and meas_3, that reader to ignore slight errors due to animals being added to and
will become fatally confused. How can we guarantee that removed from the hash table while this statistical scan is
readers will see coordinated sets of these three values?6 being carried out. What should Schrödinger do to control
One approach would be to allocate a new animal concurrency?
structure, copy the old structure into the new structure, One approach is to enclose the statistical scan in an
update the new structure’s meas_1, meas_2, and meas_3 RCU read-side critical section. This permits updates to
fields, and then replace the old structure with a new one by proceed concurrently without unduly impeding the scan.
updating the pointer. This does guarantee that all readers In particular, the scan does not block the updates and
see coordinated sets of measurement values, but it requires vice versa, which allows scan of hash tables containing
copying a large structure due to the ->photo[] field. This very large numbers of elements to be supported gracefully,
copying might incur unacceptably large overhead. even in the face of very high update rates.
Another approach is to impose a level of indirection, Quick Quiz 13.15: But how does this scan work while a
as shown in Listing 13.9 [McK04, Section 5.3.4]. When resizable hash table is being resized? In that case, neither the
a new measurement is taken, a new measurement struc- old nor the new hash table is guaranteed to contain all the
ture is allocated, filled in with the measurements, and elements in the hash table!
the animal structure’s ->mp field is updated to point to
6 This situation is similar to that described in Section 13.4.2, except

that here readers need only see a consistent view of a given single data
element, not the consistent view of a group of data elements that was 7 Why would such a quantity be useful? Beats me! But group

required in that earlier section. statistics are often useful.

v2022.09.25a
13.5. RCU RESCUES 281

13.5.6 Scalable Reference Count Two

Suppose a reference count is becoming a performance or scalability bottleneck. What can you do?

One approach is to use per-CPU counters for each reference count, somewhat similar to the algorithms in Chapter 5, in particular, the exact limit counters described in Section 5.4. The need to switch between per-CPU and global modes for these counters results either in expensive increments and decrements on the one hand (Section 5.4.1) or in the use of POSIX signals on the other (Section 5.4.3).

Another approach is to use RCU to mediate the switch between per-CPU and global counting modes. Each update is carried out within an RCU read-side critical section, and each update checks a flag to determine whether to update the per-CPU counters on the one hand or the global on the other. To switch modes, update the flag, wait for a grace period, and then move any remaining counts from the per-CPU counters to the global counter or vice versa.

The Linux kernel uses this RCU-mediated approach in its percpu_ref style of reference counter. Code using this reference counter must initialize the percpu_ref structure using percpu_ref_init(), which takes as arguments a pointer to the structure, a pointer to a function to invoke when the reference count reaches zero, a set of mode flags, and a set of kmalloc() GFP_ flags. After normal initialization, the structure has one reference and is in per-CPU mode.

The mode flags are usually zero, but can include the PERCPU_REF_INIT_ATOMIC bit if the counter is to start in slow non-per-CPU (that is, atomic) mode. There is also a PERCPU_REF_ALLOW_REINIT bit that allows a given percpu_ref counter to be reused via a call to percpu_ref_reinit() without needing to be freed and reallocated. Regardless of how the percpu_ref structure is initialized, percpu_ref_get() may be used to acquire a reference and percpu_ref_put() may be used to release a reference.

When in per-CPU mode, the percpu_ref structure cannot determine whether or not its value has reached zero. When such a determination is necessary, percpu_ref_kill() may be invoked. This function switches the structure into atomic mode and removes the initial reference installed by the call to percpu_ref_init(). Of course, when in atomic mode, calls to percpu_ref_get() and percpu_ref_put() are quite expensive, but percpu_ref_put() can tell when the value reaches zero.

Readers desiring more percpu_ref information are referred to the Linux-kernel documentation and source code.
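The following minimal sketch shows one way that the percpu_ref API described above might be used. The struct myobj, its release function, and the myobj_access() caller are hypothetical names invented for illustration; they are not taken from the Linux kernel or from this book's CodeSamples.

#include <linux/percpu-refcount.h>
#include <linux/slab.h>

struct myobj {				/* hypothetical reference-counted object */
	struct percpu_ref ref;
	/* ... payload ... */
};

static void myobj_release(struct percpu_ref *ref)
{
	struct myobj *p = container_of(ref, struct myobj, ref);

	kfree(p);			/* invoked once the count reaches zero */
}

static struct myobj *myobj_create(void)
{
	struct myobj *p = kzalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return NULL;
	if (percpu_ref_init(&p->ref, myobj_release, 0, GFP_KERNEL)) {
		kfree(p);
		return NULL;
	}
	return p;			/* one reference held, per-CPU mode */
}

static void myobj_access(struct myobj *p)
{
	percpu_ref_get(&p->ref);	/* cheap per-CPU increment */
	/* ... use p ... */
	percpu_ref_put(&p->ref);	/* cheap per-CPU decrement */
}

static void myobj_teardown(struct myobj *p)
{
	/* Switch to atomic mode and drop the initial reference;
	 * myobj_release() runs once all other references are released. */
	percpu_ref_kill(&p->ref);
}

While the counter remains in per-CPU mode, the get and put operations are inexpensive; after percpu_ref_kill(), they become atomic operations and the release function fires once the count reaches zero.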

Figure 13.1: Retrigger-Grace-Period State Machine (diagram showing the states CLOSED, OPEN, CLOSING, REOPENING, and RECLOSING, with open(), close(), and callback ("CB") transitions)

13.5.7 Retriggered Grace Periods

There is no call_rcu_cancel(), so once an rcu_head structure is passed to call_rcu(), there is no calling it back. It must be left alone until the callback is invoked. In the common case, this is as it should be because the rcu_head structure is on a one-way journey to deallocation.

However, there are use cases that combine RCU and explicit open() and close() calls. After a close() call, readers are not supposed to begin new accesses to the data structure, but there might well be readers completing their traversal. This situation can be handled in the usual manner: Wait for a grace period following the close() call before freeing the data structures.

But what if open() is called before the grace period ends?

Again, there is no call_rcu_cancel(), so another approach is to set a flag that is checked by the callback function, which can opt out of actually freeing anything. Problem solved!

But what if open() and then another close() are both called before the grace period ends?

One approach is to have a second value for the flag that causes the callback to requeue itself.

But what if there is not only an open() and then another close(), but also another open() before the grace period ends?

In this case, the callback needs to set state to reflect the fact that the last open() is still in effect.

Continuing this line of thought leads us to the state machine shown in Figure 13.1. The initial state is CLOSED and the operational state is OPEN. The diamond-shaped arrowheads denote call_rcu() invocation, while the arrows labeled "CB" denote callback invocation.
The normal path through this state machine traverses the states CLOSED, OPEN, CLOSING (with an invocation of call_rcu()), and back to CLOSED once the callback has been invoked. If open() is invoked before the grace period completes, the state machine traverses the cycle OPEN, CLOSING (with an invocation of call_rcu()), REOPENING, and back to OPEN once the callback has been invoked. If open() and then close() are invoked before the grace period completes, the state machine instead traverses OPEN, CLOSING (with an invocation of call_rcu()), REOPENING, and RECLOSING, with the callback then requeuing itself and returning the state machine to CLOSING, and eventually to CLOSED.

Given an indefinite alternating sequence of close() and open() invocations, the state machine would traverse OPEN, and CLOSING (with an invocation of call_rcu()), followed by alternating sojourns in the REOPENING and RECLOSING states. Once the grace period ends, the state machine would transition to either of the CLOSING or the OPEN state, depending on which of the RECLOSING or REOPENING states the callback was invoked in.

Listing 13.10: Retriggering a Grace Period (Pseudocode)
 1 #define RTRG_CLOSED 0
 2 #define RTRG_OPEN 1
 3 #define RTRG_CLOSING 2
 4 #define RTRG_REOPENING 3
 5 #define RTRG_RECLOSING 4
 6
 7 int rtrg_status;
 8 DEFINE_SPINLOCK(rtrg_lock);
 9 struct rcu_head rtrg_rh;
10
11 void close_cb(struct rcu_head *rhp)
12 {
13   spin_lock(rtrg_lock);
14   if (rtrg_status == RTRG_CLOSING) {
15     close_cleanup();
16     rtrg_status = RTRG_CLOSED;
17   } else if (rtrg_status == RTRG_REOPENING) {
18     rtrg_status = RTRG_OPEN;
19   } else if (rtrg_status == RTRG_RECLOSING) {
20     rtrg_status = RTRG_CLOSING;
21     call_rcu(&rtrg_rh, close_cb);
22   } else {
23     WARN_ON_ONCE(1);
24   }
25   spin_unlock(rtrg_lock);
26 }
27
28 int open(void)
29 {
30   spin_lock(rtrg_lock);
31   if (rtrg_status == RTRG_CLOSED) {
32     rtrg_status = RTRG_OPEN;
33   } else if (rtrg_status == RTRG_CLOSING ||
34              rtrg_status == RTRG_RECLOSING) {
35     rtrg_status = RTRG_REOPENING;
36   } else {
37     spin_unlock(rtrg_lock);
38     return -EBUSY;
39   }
40   do_open();
41   spin_unlock(rtrg_lock);
42 }
43
44 int close(void)
45 {
46   spin_lock(rtrg_lock);
47   if (rtrg_status == RTRG_OPEN) {
48     rtrg_status = RTRG_CLOSING;
49     call_rcu(&rtrg_rh, close_cb);
50   } else if (rtrg_status == RTRG_REOPENING) {
51     rtrg_status = RTRG_RECLOSING;
52   } else {
53     spin_unlock(rtrg_lock);
54     return -ENOENT;
55   }
56   do_close();
57   spin_unlock(rtrg_lock);
58 }

Rough pseudocode of this state machine is shown in Listing 13.10. The five states are shown on lines 1–5, and the current state is held in rtrg_status on line 7, which is protected by the lock defined on line 8.

The three CB transitions (emanating from states CLOSING, REOPENING, and RECLOSING) are implemented by the close_cb() function shown on lines 11–26. Line 15 invokes a user-supplied close_cleanup() to take any final cleanup actions such as freeing memory when transitioning to the CLOSED state. Line 21 contains the call_rcu() invocation that causes a later transition to the CLOSED state.

The open() function on lines 28–42 implements the transitions to the OPEN and REOPENING states, with line 40 invoking a do_open() function to implement any allocation and initialization of any needed data structures.

The close() function on lines 44–58 implements the transitions to the CLOSING and RECLOSING states, with line 56 invoking a do_close() function to take any actions that might be required to finalize this transition, for example, causing later read-only traversals to return errors. Line 49 contains the call_rcu() invocation that causes a later transition to the CLOSED state.
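For completeness, here is a minimal sketch of how a reader might interact with this pseudocode. The gptr pointer, the mydata type, and the do_read() helper are hypothetical: they are assumed to be published by do_open() and cleared by do_close(), and do not appear in the pseudocode above.

int reader(void)
{
	struct mydata *p;
	int ret = -ENOENT;

	rcu_read_lock();
	p = rcu_dereference(gptr);	/* NULL once do_close() has run */
	if (p)
		ret = do_read(p);	/* safe: close_cleanup() frees only after a grace period */
	rcu_read_unlock();
	return ret;
}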


This state machine and pseudocode show how to get the effect of a call_rcu_cancel() in those rare situations needing such semantics.

13.5.8 Long-Duration Accesses Two

Suppose a reader-writer-locking reader is holding the lock for so long that updates are excessively delayed. Suppose further that this reader cannot reasonably be converted to use reference counting (otherwise, see Section 13.3.2).

If that reader can be reasonably converted to use RCU, that might solve the problem. The reason is that RCU readers do not completely block updates, but rather block only the cleanup portions of those updates (including memory reclamation). Therefore, if the system has ample memory, converting the reader-writer lock to RCU may suffice.

However, converting to RCU does not always suffice. For example, the code might traverse an extremely large linked data structure within a single RCU read-side critical section, which might so greatly extend the RCU grace period that the system runs out of memory. These situations can be handled in a couple of different ways: (1) Use SRCU instead of RCU and (2) Acquire a reference to exit the RCU reader.

13.5.8.1 Use SRCU

In the Linux kernel, RCU is global. In other words, any long-running RCU reader anywhere in the kernel will delay the current RCU grace period. If the long-running RCU reader is traversing a small data structure, that small amount of data is delaying freeing of all other data structures, which can result in memory exhaustion.

One way to avoid this problem is to use SRCU for that long-running RCU reader's data structure, with its own srcu_struct structure. The resulting long-running SRCU readers will then delay only that srcu_struct structure's grace periods, and not those of RCU, thus avoiding memory exhaustion. For more details, see the SRCU API in Section 9.5.3.
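As a minimal sketch of this approach, the following hypothetical slow_table reader and updater use a dedicated srcu_struct, so that even very long traversals delay only slow_table's own grace periods. The traverse_slow_table() and remove_element() functions and struct element are illustrative placeholders, not code from the Linux kernel or from this book.

#include <linux/srcu.h>
#include <linux/slab.h>

DEFINE_SRCU(slow_table_srcu);			/* private to this data structure */

void slow_reader(void)
{
	int idx;

	idx = srcu_read_lock(&slow_table_srcu);
	traverse_slow_table();			/* may run for a very long time */
	srcu_read_unlock(&slow_table_srcu, idx);
}

void slow_table_remove(struct element *old)
{
	remove_element(old);			/* unlink from slow_table */
	synchronize_srcu(&slow_table_srcu);	/* waits only for slow_table readers */
	kfree(old);
}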
Unfortunately, this approach does have some drawbacks. For one thing, SRCU readers are not subject to priority boosting, which can result in additional delays to low-priority SRCU readers on busy systems. Worse yet, defining a separate srcu_struct structure reduces the number of RCU updaters, which in turn increases the grace-period overhead per updater. This means that giving each current Linux-kernel RCU use case its own srcu_struct structure could multiply system-wide grace-period overhead by the number of such structures.

Therefore, it is often better to acquire some sort of non-RCU reference on the needed data to permit a momentary exit from the RCU read-side critical section, as described in the next section.

13.5.8.2 Acquire a Reference

If the RCU read-side critical section is too long, shorten it!

In some cases, this can be done trivially. For example, code that scans all of the hash chains of a statically allocated array of hash buckets can just as easily scan each hash chain within its own critical section.

This works because hash chains are normally quite short, and by design. When traversing long linked structures, it is necessary to have some way of stopping in the middle and resuming later.

For example, in Linux kernel v5.16, the khugepaged_scan_file() function checks to see if some other task needs the current CPU using need_resched(), and if so invokes xas_pause() to adjust the traversal's iterator appropriately, and then invokes cond_resched_rcu() to yield the CPU. In turn, the cond_resched_rcu() function invokes rcu_read_unlock(), cond_resched(), and finally rcu_read_lock() to drop out of the RCU read-side critical section in order to yield the CPU.

Of course, where feasible, another approach would be to switch to a data structure such as a hash table that is more friendly to momentarily dropping out of an RCU read-side critical section.

Quick Quiz 13.16: But how would this work with a resizable hash table, such as the one described in Section 10.4?
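A minimal sketch of the per-hash-chain approach follows. The statically sized bucket array, struct myelem, and the visit() callback are hypothetical; the point is simply that each chain gets its own short RCU read-side critical section, so grace periods can advance between chains.

#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/sched.h>

#define NBUCKETS 1024

struct myelem {
	struct hlist_node node;
	int key;
};

static struct hlist_head buckets[NBUCKETS];

void scan_all(void (*visit)(struct myelem *))
{
	struct myelem *p;
	int i;

	for (i = 0; i < NBUCKETS; i++) {
		rcu_read_lock();	/* one short critical section per chain */
		hlist_for_each_entry_rcu(p, &buckets[i], node)
			visit(p);
		rcu_read_unlock();
		cond_resched();		/* grace periods can now advance */
	}
}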
Chapter 14

Advanced Synchronization

If a little knowledge is a dangerous thing, just think what you could do with a lot of knowledge!

Unknown

This chapter covers synchronization techniques used for lockless algorithms and parallel real-time systems.

Although lockless algorithms can be quite helpful when faced with extreme requirements, they are no panacea. For example, as noted at the end of Chapter 5, you should thoroughly apply partitioning, batching, and well-tested packaged weak APIs (see Chapters 8 and 9) before even thinking about lockless algorithms.

But after doing all that, you still might find yourself needing the advanced techniques described in this chapter. To that end, Section 14.1 summarizes techniques used thus far for avoiding locks and Section 14.2 gives a brief overview of non-blocking synchronization. Memory ordering is also quite important, but it warrants its own chapter, namely Chapter 15.

The second form of advanced synchronization provides the stronger forward-progress guarantees needed for parallel real-time computing, which is the topic of Section 14.3.

14.1 Avoiding Locks

We are confronted with insurmountable opportunities.

Walt Kelly

Although locking is the workhorse of parallelism in production, in many situations performance, scalability, and real-time response can all be greatly improved through use of lockless techniques. A particularly impressive example of such a lockless technique is the statistical counters described in Section 5.2, which avoids not only locks, but also read-modify-write atomic operations, memory barriers, and even cache misses for counter increments. Other examples we have covered include:

1. The fastpaths through a number of other counting algorithms in Chapter 5.

2. The fastpath through resource allocator caches in Section 6.4.3.

3. The maze solver in Section 6.5.

4. The data-ownership techniques in Chapter 8.

5. The reference-counting, hazard-pointer, and RCU techniques in Chapter 9.

6. The lookup code paths in Chapter 10.

7. Many of the techniques in Chapter 13.

In short, lockless techniques are quite useful and are heavily used. However, it is best if lockless techniques are hidden behind a well-defined API, such as the inc_count(), memblock_alloc(), rcu_read_lock(), and so on. The reason for this is that undisciplined use of lockless techniques is a good way to create difficult bugs. If you believe that finding and fixing such bugs is easier than avoiding them, please re-read Chapters 11 and 12.

14.2 Non-Blocking Synchronization

Never worry about theory as long as the machinery does what it's supposed to do.

Robert A. Heinlein

The term non-blocking synchronization (NBS) [Her90] describes eight classes of linearizable algorithms with differing forward-progress guarantees [ACHS13], which are as follows:

1. Bounded population-oblivious wait-free synchronization: Every thread will make progress within a specific finite period of time, where this period of time is independent of the number of threads [HS08]. This level is widely considered to be even less achievable than bounded wait-free synchronization.

2. Bounded wait-free synchronization: Every thread will make progress within a specific finite period of time [Her91]. This level is widely considered to be unachievable, which might be why Alistarh et al. omitted it [ACHS13].

3. Wait-free synchronization: Every thread will make progress in finite time [Her93].

4. Lock-free synchronization: At least one thread will make progress in finite time [Her93].

5. Obstruction-free synchronization: Every thread will make progress in finite time in the absence of contention [HLM03].

6. Clash-free synchronization: At least one thread will make progress in finite time in the absence of contention [ACHS13].

7. Starvation-free synchronization: Every thread will make progress in finite time in the absence of failures [ACHS13].

8. Deadlock-free synchronization: At least one thread will make progress in finite time in the absence of failures [ACHS13].

NBS class 1 was formulated some time before 2015, classes 2, 3, and 4 were first formulated in the early 1990s, class 5 was first formulated in the early 2000s, and class 6 was first formulated in 2013. The final two classes have seen informal use for a great many decades, but were reformulated in 2013.

Quick Quiz 14.1: Given that there will always be a sharply limited number of CPUs available, is population obliviousness really useful?

In theory, any parallel algorithm can be cast into wait-free form, but there is a relatively small subset of NBS algorithms that are in common use. A few of these are listed in the following section.

14.2.1 Simple NBS

Perhaps the simplest NBS algorithm is atomic update of an integer counter using fetch-and-add (atomic_add_return()) primitives. This section lists a few additional commonly used NBS algorithms in roughly increasing order of complexity.

14.2.1.1 NBS Sets

One simple NBS algorithm implements a set of integers in an array. Here the array index indicates a value that might be a member of the set and the array element indicates whether or not that value actually is a set member. The linearizability criterion for NBS algorithms requires that reads from and updates to the array either use atomic instructions or be accompanied by memory barriers, but in the not-uncommon case where linearizability is not important, simple volatile loads and stores suffice, for example, using READ_ONCE() and WRITE_ONCE().

An NBS set may also be implemented using a bitmap, where each value that might be a member of the set corresponds to one bit. Reads and updates must normally be carried out via atomic bit-manipulation instructions, although compare-and-swap (cmpxchg() or CAS) instructions can also be used.

14.2.1.2 NBS Counters

The statistical counters algorithm discussed in Section 5.2 can be considered to be bounded-wait-free, but only by using a cute definitional trick in which the sum is considered to be approximate rather than exact.1 Given sufficiently wide error bounds that are a function of the length of time that the read_count() function takes to sum the counters, it is not possible to prove that any non-linearizable behavior occurred. This definitely (if a bit artificially) classifies the statistical-counters algorithm as bounded wait-free. This algorithm is probably the most heavily used NBS algorithm in the Linux kernel.

1 Citation needed. I heard of this trick verbally from Mark Moir.
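Returning to the sets of Section 14.2.1.1, here is a minimal sketch (not taken from this book's CodeSamples) of the array-based variant using READ_ONCE() and WRITE_ONCE(), followed by a bitmap variant using the Linux kernel's atomic bit operations. The names and the fixed SET_MAX bound are assumptions made purely for illustration.

#include <linux/bitops.h>
#include <linux/compiler.h>

#define SET_MAX 1024

/* Array-based set: one byte per possible member (non-linearizable variant). */
static unsigned char set_member[SET_MAX];

static inline void set_add(int v)      { WRITE_ONCE(set_member[v], 1); }
static inline void set_del(int v)      { WRITE_ONCE(set_member[v], 0); }
static inline int set_contains(int v)  { return READ_ONCE(set_member[v]); }

/* Bitmap-based set: one bit per possible member, atomic bit operations. */
static unsigned long set_bits[BITS_TO_LONGS(SET_MAX)];

static inline void bitset_add(int v)      { set_bit(v, set_bits); }
static inline void bitset_del(int v)      { clear_bit(v, set_bits); }
static inline int bitset_contains(int v)  { return test_bit(v, set_bits); }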

14.2.1.3 Half-NBS Queue

Listing 14.1: NBS Enqueue Algorithm
 1 static inline bool
 2 ___cds_wfcq_append(struct cds_wfcq_head *head,
 3                    struct cds_wfcq_tail *tail,
 4                    struct cds_wfcq_node *new_head,
 5                    struct cds_wfcq_node *new_tail)
 6 {
 7   struct cds_wfcq_node *old_tail;
 8
 9   old_tail = uatomic_xchg(&tail->p, new_tail);
10   CMM_STORE_SHARED(old_tail->next, new_head);
11   return old_tail != &head->node;
12 }
13
14 static inline bool
15 _cds_wfcq_enqueue(struct cds_wfcq_head *head,
16                   struct cds_wfcq_tail *tail,
17                   struct cds_wfcq_node *new_tail)
18 {
19   return ___cds_wfcq_append(head, tail,
20                             new_tail, new_tail);
21 }

Another common NBS algorithm is the atomic queue where elements are enqueued using an atomic exchange instruction [MS98b], followed by a store into the ->next pointer of the new element's predecessor, as shown in Listing 14.1, which shows the userspace-RCU library implementation [Des09b]. Line 9 updates the tail pointer to reference the new element while returning a reference to its predecessor, which is stored in local variable old_tail. Line 10 then updates the predecessor's ->next pointer to reference the newly added element, and finally line 11 returns an indication as to whether or not the queue was initially empty.

Although mutual exclusion is required to dequeue a single element (so that dequeue is blocking), it is possible to carry out a non-blocking removal of the entire contents of the queue. What is not possible is to dequeue any given element in a non-blocking manner: The enqueuer might have failed between lines 9 and 10 of the listing, so that the element in question is only partially enqueued. This results in a half-NBS algorithm where enqueues are NBS but dequeues are blocking. This algorithm is nevertheless heavily used in practice, in part because most production software is not required to tolerate arbitrary fail-stop errors.

Quick Quiz 14.2: Wait! In order to dequeue all elements, both the ->head and ->tail pointers must be changed, which cannot be done atomically on typical computer systems. So how is this supposed to work???

14.2.1.4 NBS Stack

Listing 14.2: NBS Stack Algorithm
 1 struct node_t {
 2   value_t val;
 3   struct node_t *next;
 4 };
 5
 6 // LIFO list structure
 7 struct node_t* top;
 8
 9 void list_push(value_t v)
10 {
11   struct node_t *newnode = malloc(sizeof(*newnode));
12   struct node_t *oldtop;
13
14   newnode->val = v;
15   oldtop = READ_ONCE(top);
16   do {
17     newnode->next = oldtop;
18     oldtop = cmpxchg(&top, newnode->next, newnode);
19   } while (newnode->next != oldtop);
20 }
21
22
23 void list_pop_all(void (foo)(struct node_t *p))
24 {
25   struct node_t *p = xchg(&top, NULL);
26
27   while (p) {
28     struct node_t *next = p->next;
29
30     foo(p);
31     free(p);
32     p = next;
33   }
34 }

Listing 14.2 shows the LIFO push algorithm, which boasts lock-free push and bounded wait-free pop (lifo-push.c), forming an NBS stack. The origins of this algorithm are unknown, but it was referred to in a patent granted in 1975 [BS75]. This patent was filed in 1973, a few months before your editor saw his first computer, which had but one CPU.

Lines 1–4 show the node_t structure, which contains an arbitrary value and a pointer to the next structure on the stack, and line 7 shows the top-of-stack pointer.

The list_push() function spans lines 9–20. Line 11 allocates a new node and line 14 initializes it. Line 17 initializes the newly allocated node's ->next pointer, and line 18 attempts to push it on the stack. If line 19 detects cmpxchg() failure, another pass through the loop retries. Otherwise, the new node has been successfully pushed, and this function returns to its caller. Note that line 19 resolves races in which two concurrent instances of list_


push() attempt to push onto the stack. The cmpxchg() will succeed for one and fail for the other, causing the other to retry, thereby selecting an arbitrary order for the two nodes on the stack.

The list_pop_all() function spans lines 23–34. The xchg() statement on line 25 atomically removes all nodes on the stack, placing the head of the resulting list in local variable p and setting top to NULL. This atomic operation serializes concurrent calls to list_pop_all(): One of them will get the list, and the other a NULL pointer, at least assuming that there were no concurrent calls to list_push().

An instance of list_pop_all() that obtains a non-empty list in p processes this list in the loop spanning lines 27–33. Line 28 prefetches the ->next pointer, line 30 invokes the function referenced by foo() on the current node, line 31 frees the current node, and line 32 sets up p for the next pass through the loop.

But suppose that a pair of list_push() instances run concurrently with a list_pop_all() with a list initially containing a single Node 𝐴. Here is one way that this scenario might play out:

1. The first list_push() instance pushes a new Node 𝐵, executing through line 17, having just stored a pointer to Node 𝐴 into Node 𝐵's ->next pointer.

2. The list_pop_all() instance runs to completion, setting top to NULL and freeing Node 𝐴.

3. The second list_push() instance runs to completion, pushing a new Node 𝐶, but happens to allocate the memory that used to belong to Node 𝐴.

4. The first list_push() instance executes the cmpxchg() on line 18. Because new Node 𝐶 has the same address as the newly freed Node 𝐴, this cmpxchg() succeeds and this list_push() instance runs to completion.

Note that both pushes and the popall all ran successfully despite the reuse of Node 𝐴's memory. This is an unusual property: Most data structures require protection against what is often called the ABA problem.

But this property holds only for algorithms written in assembly language. The sad fact is that most languages (including C and C++) do not support pointers to lifetime-ended objects, such as the pointer to the old Node 𝐴 contained in Node 𝐵's ->next pointer. In fact, compilers are within their rights to assume that if two pointers (call them p and q) were returned from two different calls to malloc(), then those pointers must not be equal. Real compilers really will generate the constant false in response to a p==q comparison. A pointer to an object that has been freed, but whose memory has been reallocated for a compatibly typed object is termed a zombie pointer.

Many concurrent applications avoid this problem by carefully hiding the memory allocator from the compiler, thus preventing the compiler from making inappropriate assumptions. This obfuscatory approach currently works in practice, but might well one day fall victim to increasingly aggressive optimizers. There is work underway in both the C and C++ standards committees to address this problem [MMS19, MMM+ 20]. In the meantime, please exercise great care when coding ABA-tolerant algorithms.

Quick Quiz 14.3: So why not ditch antique languages like C and C++ for something more modern?

14.2.2 Applicability of NBS Benefits

The most heavily cited NBS benefits stem from its forward-progress guarantees, its tolerance of fail-stop bugs, and from its linearizability. Each of these is discussed in one of the following sections.

14.2.2.1 NBS Forward Progress Guarantees

NBS's forward-progress guarantees have caused many to suggest its use in real-time systems, and NBS algorithms are in fact used in a great many such systems. However, it is important to note that forward-progress guarantees are largely orthogonal to those that form the basis of real-time programming:

1. Real-time forward-progress guarantees usually have some definite time associated with them, for example, "scheduling latency must be less than 100 microseconds." In contrast, the most popular forms of NBS only guarantee that progress will be made in finite time, with no definite bound.

2. Real-time forward-progress guarantees are often probabilistic, as in the soft-real-time guarantee that "at least 99.9 % of the time, scheduling latency must be less than 100 microseconds." In contrast, many of NBS's forward-progress guarantees are unconditional.

3. Real-time forward-progress guarantees are often conditioned on environmental constraints, for example, only being honored: (1) For the highest-priority


tasks, (2) When each CPU spends at least a certain fraction of its time idle, and (3) When I/O rates are below some specified maximum. In contrast, NBS's forward-progress guarantees are often unconditional, although recent NBS work accommodates conditional guarantees [ACHS13].

4. An important component of a real-time program's environment is the scheduler. NBS algorithms assume a worst-case demonic scheduler, though for whatever reason, not a scheduler so demonic that it simply refuses to ever run the application housing the NBS algorithm. In contrast, real-time systems assume that the scheduler is doing its level best to satisfy any scheduling constraints it knows about, and, in the absence of such constraints, its level best to honor process priorities and to provide fair scheduling to processes of the same priority. Non-demonic schedulers allow real-time programs to use simpler algorithms than those required for NBS [ACHS13, Bra11].

5. NBS forward-progress guarantee classes assume that a number of underlying operations are lock-free or even wait-free, when in fact these operations are blocking on common-case computer systems.

6. NBS forward-progress guarantees are often achieved by subdividing operations. For example, in order to avoid a blocking dequeue operation, an NBS algorithm might substitute a non-blocking polling operation. This is fine in theory, but not helpful in practice to real-world programs that require an element to propagate through the queue in a timely fashion.

7. Real-time forward-progress guarantees usually apply only in the absence of software bugs. In contrast, many classes of NBS guarantees apply even in the face of fail-stop bugs.

8. NBS forward-progress guarantee classes imply linearizability. In contrast, real-time forward progress guarantees are often independent of ordering constraints such as linearizability.

Quick Quiz 14.4: Why does anyone care about demonic schedulers?

To reiterate, despite these differences, a number of NBS algorithms are extremely useful in real-time programs.

14.2.2.2 NBS Underlying Operations

An NBS algorithm can be truly non-blocking only if the underlying operations that it uses are also non-blocking. In a surprising number of cases, this is not the case in practice.

For example, non-blocking algorithms often allocate memory. In theory, this is fine, given the existence of lock-free memory allocators [Mic04b]. But in practice, most environments must eventually obtain memory from operating-system kernels, which commonly use locking. Therefore, unless all the memory that will ever be needed is somehow preallocated, a "non-blocking" algorithm that allocates memory will not be non-blocking when running on common-case real-world computer systems.

This same point clearly also applies to algorithms performing I/O operations or otherwise interacting with their environment.

Perhaps surprisingly, this point also applies to ostensibly non-blocking algorithms that do only plain loads and stores, such as the counters discussed in Section 14.2.1.2. And at first glance, those loads and stores that can be compiled into single load and store instructions, respectively, would seem to be not just non-blocking, but bounded population-oblivious wait free.

Except that load and store instructions are not necessarily either fast or deterministic. For example, as noted in Chapter 3, cache misses can consume thousands of CPU cycles. Worse yet, the measured cache-miss latencies can be a function of the number of CPUs, as illustrated in Figure 5.1. It is only reasonable to assume that these latencies also depend on the details of the system's interconnect. In addition, given that hardware vendors generally do not publish upper bounds for cache-miss latencies, it seems brave to assume that memory-reference instructions are in fact wait-free in modern computer systems. And the antique systems for which such bounds are available suffer from profound overall slowness.

Furthermore, hardware is not the only source of slowness for memory-reference instructions. For example, when running on typical computer systems, both loads and stores can result in page faults. Which cause in-kernel page-fault handlers to be invoked. Which might acquire locks, or even do I/O, potentially even using something like network file system (NFS). All of which are most emphatically blocking operations.

Nor are page faults the only kernel-induced hazard. A given CPU might be interrupted at any time, and the interrupt handler might run for some time. During this time, the user-mode ostensibly non-blocking algorithm


will not be running at all. This situation raises interesting questions about the forward-progress guarantees provided by system calls relying on interrupts, for example, the membarrier() system call.

Things do look bleak, but the non-blocking nature of such algorithms can be at least partially redeemed using a number of approaches:

1. Run on bare metal, with paging disabled. If you are both brave and confident that you can write code that is free of wild-pointer bugs, this approach might be for you.

2. Run on a non-blocking operating-system kernel [GC96]. Such kernels are quite rare, in part because they have traditionally completely failed to provide the hoped-for performance and scalability advantages over lock-based kernels. But perhaps you should write one.

3. Use facilities such as mlockall() to avoid page faults, while also ensuring that your program preallocates all the memory it will ever need at boot time. This can work well, but at the expense of severe common-case underutilization of memory. In environments that are cost-constrained or power-limited, this approach is not likely to be feasible.

4. Use facilities such as the Linux kernel's NO_HZ_FULL tickless mode [Cor13]. In recent versions of the Linux kernel, this mode directs interrupts away from a designated set of CPUs. However, this can sharply limit throughput for applications that are I/O bound during even part of their operation.

Given these considerations, it is no surprise that non-blocking synchronization is far more important in theory than it is in practice.

14.2.2.3 NBS Subdivided Operations

One common trick that provides a given algorithm a loftier place on the NBS ranking is to replace blocking operations with a polling API. For example, instead of having a reliable dequeue operation that might be merely lock-free or even blocking, instead provide a dequeue operation that will spuriously fail in a wait-free manner rather than exhibiting dreaded lock-free or blocking behaviors.

This can work well in theory, but a common effect in practice is to merely move the lock-free or blocking behavior out of that specific algorithm and into the hapless code making use of that algorithm. In such cases, not only has nothing been gained by this trick, but this trick has increased the complexity of all users of this algorithm.

With concurrent algorithms as elsewhere, maximizing a specific metric is no substitute for thinking carefully about the needs of one's users.

14.2.2.4 NBS Fail-Stop Tolerance

Of the classes of NBS algorithms, wait-free synchronization (bounded or otherwise), lock-free synchronization, obstruction-free synchronization, and clash-free synchronization guarantee forward progress even in the presence of fail-stop bugs. An example fail-stop bug might cause some thread to be preempted indefinitely. As we will see, this fail-stop-tolerant property can be useful, but the fact is that composing a set of fail-stop-tolerant mechanisms does not necessarily result in a fail-stop-tolerant system.

To see this, consider a system made up of a series of wait-free queues, where an element is removed from one queue in the series, processed, and then added to the next queue.

If a thread is preempted in the midst of a queuing operation, in theory all is well because the wait-free nature of the queue will guarantee forward progress. But in practice, the element being processed is lost because the fail-stop-tolerant nature of the wait-free queues does not extend to the code using those queues.

Nevertheless, there are a few applications where NBS's rather limited fail-stop-tolerance is useful. For example, in some network-based or web applications, a fail-stop event will eventually result in a retransmission, which will restart any work that was lost due to the fail-stop event. Systems running such applications can therefore be heavily loaded, even to the point where the scheduler can no longer provide any reasonable fairness guarantee. In contrast, if a thread fail-stops while holding a lock, the application might need to be restarted. Nevertheless, NBS is not a panacea even within this restricted area, due to the possibility of spurious retransmissions due to pure scheduling delays. In some cases, it may be more efficient to reduce the load to avoid queueing delays, which will also improve the scheduler's ability to provide fair access, reducing or even eliminating the fail-stop events, thus reducing the number of retry operations, in turn further reducing the load.


14.2.2.5 NBS Linearizability

It is important to note that linearizability can be quite useful, especially when analyzing concurrent code made up of strict locking and fully ordered atomic operations.2 Furthermore, this handling of fully ordered atomic operations automatically covers simple NBS algorithms.

However, the linearization points of a complex NBS algorithm are often buried deep within that algorithm, and thus not visible to users of a library function implementing a part of such an algorithm. Therefore, any claims that users benefit from the linearizability properties of complex NBS algorithms should be regarded with deep suspicion [HKLP12].

It is sometimes asserted that linearizability is necessary for developers to produce proofs of correctness for their concurrent code. However, such proofs are the exception rather than the rule, and modern developers who do produce proofs often use modern proof techniques that do not depend on linearizability. Furthermore, developers frequently use modern proof techniques that do not require a full specification, given that developers often learn their specification after the fact, one bug at a time. A few such proof techniques were discussed in Chapter 12.3

It is often asserted that linearizability maps well to sequential specifications, which are said to be more natural than are concurrent specifications [RR20]. But this assertion fails to account for our highly concurrent objective universe. This universe can only be expected to select for ability to cope with concurrency, especially for those participating in team sports or overseeing small children. In addition, given that the teaching of sequential computing is still believed to be somewhat of a black art [PBCE20], it is reasonable to expect that teaching of concurrent computing is in a similar state of disarray. Therefore, focusing on only one proof technique is unlikely to be a good way forward.

Again, please understand that linearizability is quite useful in many situations. Then again, so is that venerable tool, the hammer. But there comes a point in the field of computing where one should put down the hammer and pick up a keyboard. Similarly, it appears that there are times when linearizability is not the best tool for the job.

To their credit, there are some linearizability advocates who are aware of some of its shortcomings [RR20]. There are also proposals to extend linearizability, for example, interval-linearizability, which is intended to handle the common case of operations that require non-zero time to complete [CnRR18]. It remains to be seen whether these proposals will result in theories able to handle modern concurrent software artifacts, especially given that several of the proof techniques discussed in Chapter 12 already handle many modern concurrent software artifacts.

2 For example, the Linux kernel's value-returning atomic operations.

3 A memorable verbal discussion with an advocate of linearizability resulted in the question: "So the reason linearizability is important is to rescue 1980s proof techniques?" The advocate immediately replied in the affirmative, then spent some time disparaging a particular modern proof technique. Oddly enough, that technique was one of those successfully applied to Linux-kernel RCU.

14.2.3 NBS Discussion

It is possible to create fully non-blocking queues [MS96]; however, such queues are much more complex than the half-NBS algorithm outlined above. The lesson here is to carefully consider your actual requirements. Relaxing irrelevant requirements can often result in great improvements in simplicity, performance, and scalability.

Recent research points to another important way to relax requirements. It turns out that systems providing fair scheduling can enjoy most of the benefits of wait-free synchronization even when running algorithms that provide only non-blocking synchronization, both in theory [ACHS13] and in practice [AB13]. Because most schedulers used in production do in fact provide fairness, the more-complex algorithms providing wait-free synchronization usually provide no practical advantages over simpler and faster non-wait-free algorithms.

Interestingly enough, fair scheduling is but one beneficial constraint that is often respected in practice. Other sets of constraints can permit blocking algorithms to achieve deterministic real-time response. For example, given: (1) Fair locks granted in FIFO order within a given priority level, (2) Priority inversion avoidance (for example, priority inheritance [TS95, WTS96] or priority ceiling), (3) A bounded number of threads, (4) Bounded critical section durations, (5) Bounded load, and (6) Absence of fail-stop bugs, lock-based applications can provide deterministic response times [Bra11, SM04a]. This approach of course blurs the distinction between blocking and wait-free synchronization, which is all to the good. Hopefully theoretical frameworks will continue to improve their ability to describe software actually used in practice.

Those who feel that theory should lead the way are referred to the inimitable Peter Denning, who said of operating systems: "Theory follows practice" [Den15], or to the eminent Tony Hoare, who said of the whole of engineering: "In all branches of engineering science, the


engineering starts before the science; indeed, without the early products of engineering, there would be nothing for the scientist to study!" [Mor07]. Of course, once an appropriate body of theory becomes available, it is wise to make use of it. However, note well that the first appropriate body of theory is often one thing and the first proposed body of theory quite another.

Quick Quiz 14.5: It seems like the various members of the NBS hierarchy are rather useless. So why bother with them at all???

Proponents of NBS algorithms sometimes call out real-time computing as an important NBS beneficiary. The next section looks more deeply at the forward-progress needs of real-time systems.

14.3 Parallel Real-Time Computing

One always has time enough if one applies it well.

Johann Wolfgang von Göthe

An important emerging area in computing is that of parallel real-time computing. Section 14.3.1 looks at a number of definitions of "real-time computing", moving beyond the usual sound bites to more meaningful criteria. Section 14.3.2 surveys the sorts of applications that need real-time response. Section 14.3.3 notes that parallel real-time computing is upon us, and discusses when and why parallel real-time computing can be useful. Section 14.3.4 gives a brief overview of how parallel real-time systems may be implemented, with Sections 14.3.5 and 14.3.6 focusing on operating systems and applications, respectively. Finally, Section 14.3.7 outlines how to decide whether or not your application needs real-time facilities.

14.3.1 What is Real-Time Computing?

One traditional way of classifying real-time computing is into the categories of hard real time and soft real time, where the macho hard real-time applications never miss their deadlines, but the wimpy soft real-time applications miss their deadlines quite often.

14.3.1.1 Soft Real Time

It should be easy to see problems with this definition of soft real time. For one thing, by this definition, any piece of software could be said to be a soft real-time application: "My application computes million-point Fourier transforms in half a picosecond." "No way!!! The clock cycle on this system is more than three hundred picoseconds!" "Ah, but it is a soft real-time application!" If the term "soft real time" is to be of any use whatsoever, some limits are clearly required.

We might therefore say that a given soft real-time application must meet its response-time requirements at least some fraction of the time, for example, we might say that it must execute in less than 20 microseconds 99.9 % of the time.

This of course raises the question of what is to be done when the application fails to meet its response-time requirements. The answer varies with the application, but one possibility is that the system being controlled has sufficient stability and inertia to render harmless the occasional late control action. Another possibility is that the application has two ways of computing the result, a fast and deterministic but inaccurate method on the one hand and a very accurate method with unpredictable compute time on the other. One reasonable approach would be to start both methods in parallel, and if the accurate method fails to finish in time, kill it and use the answer from the fast but inaccurate method. One candidate for the fast but inaccurate method is to take no control action during the current time period, and another candidate is to take the same control action as was taken during the preceding time period.

In short, it does not make sense to talk about soft real time without some measure of exactly how soft it is.

14.3.1.2 Hard Real Time

In contrast, the definition of hard real time is quite definite. After all, a given system either always meets its deadlines or it doesn't.

Unfortunately, a strict application of this definition would mean that there can never be any hard real-time systems. The reason for this is fancifully depicted in Figure 14.1. Yes, you could construct a more robust system, perhaps with redundancy. But your adversary can always get a bigger hammer.

Then again, perhaps it is unfair to blame the software for what is clearly not just a hardware problem, but a bona fide big-iron hardware problem at that.4 This suggests that we define hard real-time software as software that will always meet its deadlines, but only in the absence of a hardware failure. Unfortunately, failure is not always an

4 Or, given modern hammers, a big-steel problem.


option, as fancifully depicted in Figure 14.2. We simply


cannot expect the poor gentleman depicted in that figure
to be reassured by our saying “Rest assured that if a missed
deadline results in your tragic death, it most certainly will
not have been due to a software problem!” Hard real-time
response is a property of the entire system, not just of the
software.
But if we cannot demand perfection, perhaps we can
make do with notification, similar to the soft real-time
approach noted earlier. Then if the Life-a-Tron in Fig-
ure 14.2 is about to miss its deadline, it can alert the
hospital staff.
Unfortunately, this approach has the trivial solution
fancifully depicted in Figure 14.3. A system that always
immediately issues a notification that it won’t be able
to meet its deadline complies with the letter of the law,
but is completely useless. There clearly must also be
a requirement that the system meets its deadline some
fraction of the time, or perhaps that it be prohibited from
missing its deadlines on more than a certain number of consecutive operations.

Figure 14.1: Real-Time Response, Meet Hammer
We clearly cannot take a sound-bite approach to either
hard or soft real time. The next section therefore takes a
more real-world approach.

14.3.1.3 Real-World Real Time


Although sentences like “Hard real-time systems always
meet their deadlines!” are catchy and easy to memorize,
something else is needed for real-world real-time systems.
Although the resulting specifications are harder to memo-
rize, they can simplify construction of a real-time system
by imposing constraints on the environment, the workload,
and the real-time application itself.

Environmental Constraints Constraints on the envi-


ronment address the objection to open-ended promises of
response times implied by “hard real time”. These con-
straints might specify permissible operating temperatures,
air quality, levels and types of electromagnetic radiation,
and, to Figure 14.1’s point, levels of shock and vibration.
Of course, some constraints are easier to meet than
others. Any number of people have learned the hard way
that commodity computer components often refuse to
operate at sub-freezing temperatures, which suggests a set of climate-control requirements.

Figure 14.2: Real-Time Response: Hardware Matters
An old college friend once had the challenge of op-
erating a real-time system in an atmosphere featuring
some rather aggressive chlorine compounds, a challenge


Figure 14.3: Real-Time Response: Notification Insufficient

that he wisely handed off to his colleagues designing the hardware. In effect, my colleague imposed an atmospheric-composition constraint on the environment immediately surrounding the computer, a constraint that the hardware designers met through use of physical seals.

Another old college friend worked on a computer-controlled system that sputtered ingots of titanium using an industrial-strength arc in a vacuum. From time to time, the arc would decide that it was bored with its path through the ingot of titanium and choose a far shorter and more entertaining path to ground. As we all learned in our physics classes, a sudden shift in the flow of electrons creates an electromagnetic wave, with larger shifts in larger flows creating higher-power electromagnetic waves. And in this case, the resulting electromagnetic pulses were sufficient to induce a quarter of a volt potential difference in the leads of a small "rubber ducky" antenna located more than 400 meters away. This meant that nearby conductors experienced higher voltages, courtesy of the inverse-square law. This included those conductors making up the computer controlling the sputtering process. In particular, the voltage induced on that computer's reset line was sufficient to actually reset the computer, mystifying everyone involved. This situation was addressed using hardware, including some elaborate shielding and a fiber-optic network with the lowest bitrate I have ever heard of, namely 9600 baud. Less spectacular electromagnetic environments can often be handled by software through use of error detection and correction codes. That said, it is important to remember that although error detection and correction codes can reduce failure rates, they normally cannot reduce them all the way down to zero, which can present yet another obstacle to achieving hard real-time response.

There are also situations where a minimum level of energy is required, for example, through the power leads of the system and through the devices through which the system is to communicate with that portion of the outside world that is to be monitored or controlled.

Quick Quiz 14.6: But what about battery-powered systems? They don't require energy flowing into the system as a whole.

A number of systems are intended to operate in environments with impressive levels of shock and vibration, for example, engine control systems. More strenuous requirements may be found when we move away from continuous vibrations to intermittent shocks. For example, during my undergraduate studies, I encountered an old Athena ballistics computer, which was designed to continue operating normally even if a hand grenade went off nearby.5 And finally, the "black boxes" used in airliners must continue operating before, during, and after a crash.

5 Decades later, the acceptance tests for some types of computer systems involve large detonations, and some types of communications networks must deal with what is delicately termed "ballistic jamming."

Of course, it is possible to make hardware more robust against environmental shocks and insults. Any number of ingenious mechanical shock-absorbing devices can reduce the effects of shock and vibration, multiple layers of shielding can reduce the effects of low-energy electromagnetic radiation, error-correction coding can reduce the effects of high-energy radiation, various potting and sealing techniques can reduce the effect of air quality, and any number of heating and cooling systems can counter the effects of temperature. In extreme cases, triple modular redundancy can reduce the probability that a fault in one part of the system will result in incorrect behavior from the overall


system. However, all of these methods have one thing in common: Although they can reduce the probability of failure, they cannot reduce it to zero.

These environmental challenges are often met via robust hardware; however, the workload and application constraints in the next two sections are often handled in software.

Workload Constraints Just as with people, it is often possible to prevent a real-time system from meeting its deadlines by overloading it. For example, if the system is being interrupted too frequently, it might not have sufficient CPU bandwidth to handle its real-time application. A hardware solution to this problem might limit the rate at which interrupts were delivered to the system. Possible software solutions include disabling interrupts for some time if they are being received too frequently, resetting the device generating too-frequent interrupts, or even avoiding interrupts altogether in favor of polling.

Overloading can also degrade response times due to queueing effects, so it is not unusual for real-time systems to overprovision CPU bandwidth, so that a running system has (say) 80 % idle time. This approach also applies to storage and networking devices. In some cases, separate storage and networking hardware might be reserved for the sole use of high-priority portions of the real-time application. In short, it is not unusual for this hardware to be mostly idle, given that response time is more important than throughput in real-time systems.

Quick Quiz 14.7: But given the results from queueing theory, won't low utilization merely improve the average response time rather than improving the worst-case response time? And isn't worst-case response time all that most real-time systems really care about?

Of course, maintaining sufficiently low utilization requires great discipline throughout the design and implementation. There is nothing quite like a little feature creep to destroy deadlines.

Application Constraints It is easier to provide bounded response time for some operations than for others. For example, it is quite common to see response-time specifications for interrupts and for wake-up operations, but quite rare for (say) filesystem unmount operations. One reason for this is that it is quite difficult to bound the amount of work that a filesystem-unmount operation might need to do, given that the unmount is required to flush all of that filesystem's in-memory data to mass storage.

This means that real-time applications must be confined to operations for which bounded latencies can reasonably be provided. Other operations must either be pushed out into the non-real-time portions of the application or forgone entirely.

There might also be constraints on the non-real-time portions of the application. For example, is the non-real-time application permitted to use the CPUs intended for the real-time portion? Are there time periods during which the real-time portion of the application is expected to be unusually busy, and if so, is the non-real-time portion of the application permitted to run at all during those times? Finally, by what amount is the real-time portion of the application permitted to degrade the throughput of the non-real-time portion?

Real-World Real-Time Specifications As can be seen from the preceding sections, a real-world real-time specification needs to include constraints on the environment, on the workload, and on the application itself. In addition, for the operations that the real-time portion of the application is permitted to make use of, there must be constraints on the hardware and software implementing those operations.

For each such operation, these constraints might include a maximum response time (and possibly also a minimum response time) and a probability of meeting that response time. A probability of 100 % indicates that the corresponding operation must provide hard real-time service.

In some cases, both the response times and the required probabilities of meeting them might vary depending on the parameters to the operation in question. For example, a network operation over a local LAN would be much more likely to complete in (say) 100 microseconds than would that same network operation over a transcontinental WAN. Furthermore, a network operation over a copper or fiber LAN might have an extremely high probability of completing without time-consuming retransmissions, while that same networking operation over a lossy WiFi network might have a much higher probability of missing tight deadlines. Similarly, a read from a tightly coupled solid-state disk (SSD) could be expected to complete much more quickly than that same read to an old-style USB-connected rotating-rust disk drive.6

6 Important safety tip: Worst-case response times from USB devices can be extremely long. Real-time systems should therefore take care to place any USB devices well away from critical paths.

Some real-time applications pass through different phases of operation. For example, a real-time system


controlling a plywood lathe that peels a thin sheet of wood (called "veneer") from a spinning log must: (1) Load the log into the lathe, (2) Position the log on the lathe's chucks so as to expose the largest cylinder contained within that log to the blade, (3) Start spinning the log, (4) Continuously vary the knife's position so as to peel the log into veneer, (5) Remove the remaining core of the log that is too small to peel, and (6) Wait for the next log. Each of these six phases of operation might well have its own set of deadlines and environmental constraints, for example, one would expect phase 4's deadlines to be much more severe than those of phase 6, as in milliseconds rather than seconds. One might therefore expect that low-priority work would be performed in phase 6 rather than in phase 4. In any case, careful choices of hardware, drivers, and software configuration would be required to support phase 4's more severe requirements.

A key advantage of this phase-by-phase approach is that the latency budgets can be broken down, so that the application's various components can be developed independently, each with its own latency budget. Of course, as with any other kind of budget, there will likely be the occasional conflict as to which component gets which fraction of the overall budget, and as with any other kind of budget, strong leadership and a sense of shared goals can help to resolve these conflicts in a timely fashion. And, again as with other kinds of technical budget, a strong validation effort is required in order to ensure proper focus on latencies and to give early warning of latency problems. A successful validation effort will almost always include a good test suite, which might be unsatisfying to the theorists, but has the virtue of helping to get the job done. As a point of fact, as of early 2021, most real-world real-time systems use an acceptance test rather than formal proofs.

However, the widespread use of test suites to validate real-time systems does have a very real disadvantage, namely that real-time software is validated only on specific configurations of hardware and software. Adding additional configurations requires additional costly and time-consuming testing. Perhaps the field of formal verification will advance sufficiently to change this situation, but as of early 2021, rather large advances are required.

Quick Quiz 14.8: Formal verification is already quite capable, benefiting from decades of intensive study. Are additional advances really required, or is this just a practitioner's excuse to continue to lazily ignore the awesome power of formal verification?

In addition to latency requirements for the real-time portions of the application, there will likely be performance and scalability requirements for the non-real-time portions of the application. These additional requirements reflect the fact that ultimate real-time latencies are often attained by degrading scalability and average performance.

Software-engineering requirements can also be important, especially for large applications that must be developed and maintained by large teams. These requirements often favor increased modularity and fault isolation.

This is a mere outline of the work that would be required to specify deadlines and environmental constraints for a production real-time system. It is hoped that this outline clearly demonstrates the inadequacy of the sound-bite-based approach to real-time computing.

14.3.2 Who Needs Real-Time?

It is possible to argue that all computing is in fact real-time computing. For one example, when you purchase a birthday gift online, you expect the gift to arrive before the recipient's birthday. And in fact even turn-of-the-millennium web services observed sub-second response constraints [Boh01], and requirements have not eased with the passage of time [DHJ+ 07]. It is nevertheless useful to focus on those real-time applications whose response-time requirements cannot be achieved straightforwardly by non-real-time systems and applications. Of course, as hardware costs decrease and bandwidths and memory sizes increase, the line between real-time and non-real-time will continue to shift, but such progress is by no means a bad thing.

Quick Quiz 14.9: Differentiating real-time from non-real-time based on what can "be achieved straightforwardly by non-real-time systems and applications" is a travesty! There is absolutely no theoretical basis for such a distinction!!! Can't we do better than that???

Real-time computing is used in industrial-control applications, ranging from manufacturing to avionics; scientific applications, perhaps most spectacularly in the adaptive optics used by large Earth-bound telescopes to de-twinkle starlight; military applications, including the afore-mentioned avionics; and financial-services applications, where the first computer to recognize an opportunity is likely to reap most of the profit. These four areas could be characterized as "in search of production", "in search of life", "in search of death", and "in search of money".

Financial-services applications differ subtly from applications in the other three categories in that money is


non-material, meaning that non-computational latencies are quite small. In contrast, mechanical delays inherent in the other three categories provide a very real point of diminishing returns beyond which further reductions in the application's real-time response provide little or no benefit. This means that financial-services applications, along with other real-time information-processing applications, face an arms race, where the application with the lowest latencies normally wins. Although the resulting latency requirements can still be specified as described in Paragraph "Real-World Real-Time Specifications" on Page 295, the unusual nature of these requirements has led some to refer to financial and information-processing applications as "low latency" rather than "real time".

Regardless of exactly what we choose to call it, there is substantial need for real-time computing [Pet06, Inm07].

14.3.3 Who Needs Parallel Real-Time?

It is less clear who really needs parallel real-time computing, but the advent of low-cost multicore systems has brought it to the fore regardless. Unfortunately, the traditional mathematical basis for real-time computing assumes single-CPU systems, with a few exceptions that prove the rule [Bra11]. Fortunately, there are a couple of ways of squaring modern computing hardware to fit the real-time mathematical circle, and a few Linux-kernel hackers have been encouraging academics to make this transition [dOCdO19, Gle10].

One approach is to recognize the fact that many real-time systems resemble biological nervous systems, with responses ranging from real-time reflexes to non-real-time strategizing and planning, as depicted in Figure 14.4. The hard real-time reflexes, which read from sensors and control actuators, run real-time on a single CPU or on special-purpose hardware such as an FPGA. The non-real-time strategy and planning portion of the application runs on the remaining CPUs. Strategy and planning activities might include statistical analysis, periodic calibration, user interface, supply-chain activities, and preparation. For an example of high-compute-load preparation activities, think back to the veneer-peeling application discussed in Paragraph "Real-World Real-Time Specifications" on Page 295. While one CPU is attending to the high-speed real-time computations required to peel one log, the other CPUs might be analyzing the size and shape of the next log in order to determine how to position the next log so as to obtain the largest cylinder of high-quality wood. It turns out that many applications have non-real-time and real-time components [BMP08], so this approach can often be used to allow traditional real-time analysis to be combined with modern multicore hardware.

Figure 14.4: Real-Time Reflexes (a stimulus feeds hard real-time "reflexes", which in turn feed non-real-time strategy and planning)

Another trivial approach is to shut off all but one hardware thread so as to return to the settled mathematics of uniprocessor real-time computing. However, this approach gives up potential cost and energy-efficiency advantages. That said, obtaining these advantages requires overcoming the parallel performance obstacles covered in Chapter 3, and not merely on average, but instead in the worst case.

Implementing parallel real-time systems can therefore be quite a challenge. Ways of meeting this challenge are outlined in the following section.

14.3.4 Implementing Parallel Real-Time Systems

We will look at two major styles of real-time systems, event-driven and polling. An event-driven real-time system remains idle much of the time, responding in real time to events passed up through the operating system to the application. Alternatively, the system could instead be running a background non-real-time workload. A polling real-time system features a real-time thread that is CPU bound, running in a tight loop that polls inputs and updates outputs on each pass. This tight polling loop often executes entirely in user mode, reading from and writing to hardware registers that have been mapped into the user-mode application's address space. Alternatively, some applications place the polling loop into the kernel, for example, using loadable kernel modules.

Regardless of the style chosen, the approach used to implement a real-time system will depend on the deadlines, for example, as shown in Figure 14.5. Starting from the top of this figure, if you can live with response times in excess of one second, you might well be able to use scripting languages to implement your real-time application—and scripting languages are in fact used surprisingly often, not that I necessarily recommend this practice. If the required latencies exceed several tens of milliseconds, old 2.4 versions of the Linux kernel can be used, not that I


necessarily recommend this practice, either. Special real-time Java implementations can provide real-time response latencies of a few milliseconds, even when the garbage collector is used. The Linux 2.6.x and 3.x kernels can provide real-time latencies of a few hundred microseconds if carefully configured, tuned, and run on real-time friendly hardware. Special real-time Java implementations can provide real-time latencies below 100 microseconds if use of the garbage collector is carefully avoided. (But note that avoiding the garbage collector means also avoiding Java's large standard libraries, thus also avoiding Java's productivity advantages.) The Linux 4.x and 5.x kernels can provide deep sub-hundred-microsecond latencies, but with all the same caveats as for the 2.6.x and 3.x kernels. A Linux kernel incorporating the -rt patchset can provide latencies well below 20 microseconds, and specialty real-time operating systems (RTOSes) running without MMUs can provide sub-ten-microsecond latencies. Achieving sub-microsecond latencies typically requires hand-coded assembly or even special-purpose hardware.

Figure 14.5: Real-Time Response Regimes (a scale of response times from 1 s down to 100 ps, pairing each regime with suitable technology: scripting languages, the Linux 2.4 kernel, real-time Java with and without GC, the Linux 2.6.x/3.x and 4.x/5.x kernels, the Linux -rt patchset, specialty RTOSes without MMUs, hand-coded assembly, and custom digital or analog hardware)

Of course, careful configuration and tuning are required all the way down the stack. In particular, if the hardware or firmware fails to provide real-time latencies, there is nothing that the software can do to make up for the lost time. Worse yet, high-performance hardware sometimes sacrifices worst-case behavior to obtain greater throughput. In fact, timings from tight loops run with interrupts disabled can provide the basis for a high-quality random-number generator [MOZ09]. Furthermore, some firmware does cycle-stealing to carry out various housekeeping tasks, in some cases attempting to cover its tracks by reprogramming the victim CPU's hardware clocks. Of course, cycle stealing is expected behavior in virtualized environments, but people are nevertheless working towards real-time response in virtualized environments [Gle12, Kis14]. It is therefore critically important to evaluate your hardware's and firmware's real-time capabilities.

But given competent real-time hardware and firmware, the next layer up the stack is the operating system, which is covered in the next section.

14.3.5 Implementing Parallel Real-Time Operating Systems

There are a number of strategies that may be used to implement a real-time system. One approach is to port a general-purpose non-real-time OS on top of a special-purpose real-time operating system (RTOS), as shown in Figure 14.6. The green "Linux Process" boxes represent non-real-time processes running on the Linux kernel, while the yellow "RTOS Process" boxes represent real-time processes running on the RTOS.

Figure 14.6: Linux Ported to RTOS (non-real-time "Linux Process" boxes running on the Linux kernel, which together with real-time "RTOS Process" boxes runs atop the RTOS)

This was a very popular approach before the Linux kernel gained real-time capabilities, and is still in use [xen14, Yod04b]. However, this approach requires that the application be split into one portion that runs on the RTOS and another that runs on Linux. Although it is possible to make the two environments look similar, for example, by forwarding POSIX system calls from the RTOS to a utility thread running on Linux, there are invariably rough edges.

In addition, the RTOS must interface to both the hardware and to the Linux kernel, thus requiring significant


maintenance with changes in both hardware and kernel. Furthermore, each such RTOS often has its own system-call interface and set of system libraries, which can balkanize both ecosystems and developers. In fact, these problems seem to be what drove the combination of RTOSes with Linux, as this approach allowed access to the full real-time capabilities of the RTOS, while allowing the application's non-real-time code full access to Linux's open-source ecosystem.

Although pairing RTOSes with the Linux kernel was a clever and useful short-term response during the time that the Linux kernel had minimal real-time capabilities, it also motivated adding real-time capabilities to the Linux kernel. Progress towards this goal is shown in Figure 14.7. The upper row shows a diagram of the Linux kernel with preemption disabled, thus having essentially no real-time capabilities. The middle row shows a set of diagrams showing the increasing real-time capabilities of the mainline Linux kernel with preemption enabled. Finally, the bottom row shows a diagram of the Linux kernel with the -rt patchset applied, maximizing real-time capabilities. Functionality from the -rt patchset is added to mainline, hence the increasing capabilities of the mainline Linux kernel over time. Nevertheless, the most demanding real-time applications continue to use the -rt patchset.

The non-preemptible kernel shown at the top of Figure 14.7 is built with CONFIG_PREEMPT=n, so that execution within the Linux kernel cannot be preempted. This means that the kernel's real-time response latency is bounded below by the longest code path in the Linux kernel, which is indeed long. However, user-mode execution is preemptible, so that one of the real-time Linux processes shown in the upper right may preempt any of the non-real-time Linux processes shown in the upper left anytime the non-real-time process is executing in user mode.

The middle row of Figure 14.7 shows three stages (from left to right) in the development of Linux's preemptible kernels. In all three stages, most process-level code within the Linux kernel can be preempted. This of course greatly improves real-time response latency, but preemption is still disabled within RCU read-side critical sections, spinlock critical sections, interrupt handlers, interrupt-disabled code regions, and preempt-disabled code regions, as indicated by the red boxes in the left-most diagram in the middle row of the figure. The advent of preemptible RCU allowed RCU read-side critical sections to be preempted, as shown in the central diagram, and the advent of threaded interrupt handlers allowed device-interrupt handlers to be preempted, as shown in the right-most diagram. Of course, a great deal of other real-time functionality was added during this time, however, it cannot be as easily represented on this diagram. It will instead be discussed in Section 14.3.5.1.

The bottom row of Figure 14.7 shows the -rt patchset, which features threaded (and thus preemptible) interrupt handlers for many devices, which also allows the corresponding "interrupt-disabled" regions of these drivers to be preempted. These drivers instead use locking to coordinate the process-level portions of each driver with its threaded interrupt handlers. Finally, in some cases, disabling of preemption is replaced by disabling of migration. These measures result in excellent response times in many systems running the -rt patchset [RMF19, dOCdO19].

A final approach is simply to get everything out of the way of the real-time process, clearing all other processing off of any CPUs that this process needs, as shown in Figure 14.8. This was implemented in the 3.10 Linux kernel via the CONFIG_NO_HZ_FULL Kconfig parameter [Cor13, Wei12]. It is important to note that this approach requires at least one housekeeping CPU to do background processing, for example running kernel daemons. However, when there is only one runnable task on a given non-housekeeping CPU, scheduling-clock interrupts are shut off on that CPU, removing an important source of interference and OS jitter. With a few exceptions, the kernel does not force other processing off of the non-housekeeping CPUs, but instead simply provides better performance when only one runnable task is present on a given CPU. Any number of userspace tools may be used to force a given CPU to have no more than one runnable task. If configured properly, a non-trivial undertaking, CONFIG_NO_HZ_FULL offers real-time threads levels of performance that come close to those of bare-metal systems [ACA+18].

There has of course been much debate over which of these approaches is best for real-time systems, and this debate has been going on for quite some time [Cor04a, Cor04c]. As usual, the answer seems to be "It depends," as discussed in the following sections. Section 14.3.5.1 considers event-driven real-time systems, and Section 14.3.5.2 considers real-time systems that use a CPU-bound polling loop.

14.3.5.1 Event-Driven Real-Time Support

The operating-system support required for event-driven real-time applications is quite extensive, however, this section will focus on only a few items, namely timers,


Figure 14.7: Linux-Kernel Real-Time Implementations (top row: a non-preemptible CONFIG_PREEMPT=n kernel; middle row: three CONFIG_PREEMPT=y stages, namely pre-2008, with preemptible RCU, and with threaded interrupts; bottom row: the -rt patchset. Each diagram shows RT and non-RT Linux processes above the kernel's RCU read-side critical sections, spinlock critical sections, interrupt handlers, scheduling, clock interrupt, interrupt-disable, and preempt-disable regions.)


threaded interrupts, priority inheritance, preemptible RCU, and preemptible spinlocks.

Figure 14.8: CPU Isolation (RT Linux processes and NO_HZ_FULL Linux processes alongside ordinary Linux processes atop the Linux kernel)

Timers are clearly critically important for real-time operations. After all, if you cannot specify that something be done at a specific time, how are you going to respond by that time? Even in non-real-time systems, large numbers of timers are generated, so they must be handled extremely efficiently. Example uses include retransmit timers for TCP connections (which are almost always canceled before they have a chance to fire),7 timed delays (as in sleep(1), which are rarely canceled), and timeouts for the poll() system call (which are often canceled before they have a chance to fire). A good data structure for such timers would therefore be a priority queue whose addition and deletion primitives were fast and O(1) in the number of timers posted.

7 At least assuming reasonably low packet-loss rates!

The classic data structure for this purpose is the calendar queue, which in the Linux kernel is called the timer wheel. This age-old data structure is also heavily used in discrete-event simulation. The idea is that time is quantized, for example, in the Linux kernel, the duration of the time quantum is the period of the scheduling-clock interrupt. A given time can be represented by an integer, and any attempt to post a timer at some non-integral time will be rounded to a convenient nearby integral time quantum.

Figure 14.9: Timer Wheel (two 16-element arrays indexed by successive four-bit fields of an eight-bit time, currently 0x1f)

One straightforward implementation would be to allocate a single array, indexed by the low-order bits of the time. This works in theory, but in practice systems create large numbers of long-duration timeouts (for example, the two-hour keepalive timeouts for TCP sessions) that are almost always canceled. These long-duration timeouts cause problems for small arrays because much time is wasted skipping timeouts that have not yet expired. On the other hand, an array that is large enough to gracefully accommodate a large number of long-duration timeouts would consume too much memory, especially given that performance and scalability concerns require one such array for each and every CPU.

A common approach for resolving this conflict is to provide multiple arrays in a hierarchy. At the lowest level of this hierarchy, each array element represents one unit of time. At the second level, each array element represents N units of time, where N is the number of elements in each array. At the third level, each array element represents N² units of time, and so on up the hierarchy. This approach allows the individual arrays to be indexed by different bits, as illustrated by Figure 14.9 for an unrealistically small eight-bit clock. Here, each array has 16 elements, so the low-order four bits of the time (currently 0xf) index the low-order (rightmost) array, and the next four bits (currently 0x1) index the next level up. Thus, we have two arrays each with 16 elements, for a total of 32 elements, which, taken together, is much smaller than


the 256-element array that would be required for a single array.

This approach works extremely well for throughput-based systems. Each timer operation is O(1) with small constant, and each timer element is touched at most m + 1 times, where m is the number of levels.

Unfortunately, timer wheels do not work well for real-time systems, and for two reasons. The first reason is that there is a harsh tradeoff between timer accuracy and timer overhead, which is fancifully illustrated by Figures 14.10 and 14.11. In Figure 14.10, timer processing happens only once per millisecond, which keeps overhead acceptably low for many (but not all!) workloads, but which also means that timeouts cannot be set for finer than one-millisecond granularities. On the other hand, Figure 14.11 shows timer processing taking place every ten microseconds, which provides acceptably fine timer granularity for most (but not all!) workloads, but which processes timers so frequently that the system might well not have time to do anything else.

Figure 14.10: Timer Wheel at 1 kHz

Figure 14.11: Timer Wheel at 100 kHz

The second reason is the need to cascade timers from higher levels to lower levels. Referring back to Figure 14.9, we can see that any timers enqueued on element 1x in the upper (leftmost) array must be cascaded down to the lower (rightmost) array so that they may be invoked when their time arrives. Unfortunately, there could be a large number of timeouts waiting to be cascaded, especially for timer wheels with larger numbers of levels. The power of statistics causes this cascading to be a non-problem for throughput-oriented systems, but cascading can result in problematic degradations of latency in real-time systems.

Of course, real-time systems could simply choose a different data structure, for example, some form of heap or tree, giving up O(1) bounds on insertion and deletion operations to gain O(log n) limits on data-structure-maintenance operations. This can be a good choice for special-purpose RTOSes, but is inefficient for general-purpose systems such as Linux, which routinely support extremely large numbers of timers.

The solution chosen for the Linux kernel's -rt patchset is to differentiate between timers that schedule later activity and timeouts that schedule error handling for low-probability errors such as TCP packet losses. One key observation is that error handling is normally not particularly time-critical, so that a timer wheel's millisecond-level granularity is good and sufficient. Another key observation is that error-handling timeouts are normally canceled very early, often before they can be cascaded. In addition, systems commonly have many more error-handling timeouts than they do timer events, so that an O(log n) data structure should provide acceptable performance for timer events.

However, it is possible to do better, namely by simply refusing to cascade timers. Instead of cascading, the timers that would otherwise have been cascaded all the way down the calendar queue are handled in place. This does result in up to a few percent error for the time duration, but the few situations where this is a problem can instead use tree-based high-resolution timers (hrtimers).

In short, the Linux kernel's -rt patchset uses timer wheels for error-handling timeouts and a tree for timer events, providing each category the required quality of service.
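To make this hierarchy concrete, the following minimal C sketch enqueues a timer on a two-level wheel sized for the eight-bit clock of Figure 14.9, using successive four-bit fields of the expiration time as array indexes so that insertion is O(1). The structure and function names are invented for this illustration and are not the Linux kernel's actual timer-wheel API, and the sketch deliberately omits expiry processing and the cascading step discussed above.

#include <stdint.h>

#define WHEEL_BITS   4                   /* four bits of the time per level */
#define WHEEL_SIZE   (1 << WHEEL_BITS)   /* 16 buckets per level */
#define WHEEL_MASK   (WHEEL_SIZE - 1)
#define WHEEL_LEVELS 2                   /* eight-bit clock, as in Figure 14.9 */

struct sketch_timer {
        uint8_t expires;                 /* quantized expiration time */
        void (*func)(struct sketch_timer *);
        struct sketch_timer *next;       /* singly linked bucket list */
};

struct sketch_wheel {
        uint8_t now;                     /* current quantized time */
        struct sketch_timer *bucket[WHEEL_LEVELS][WHEEL_SIZE];
};

/* O(1) insertion: near-term timers go into the low-order (rightmost)
 * array, longer-term timers into the next level up, indexed by the
 * corresponding four bits of the expiration time. */
static void sketch_timer_add(struct sketch_wheel *w, struct sketch_timer *t)
{
        uint8_t delay = (uint8_t)(t->expires - w->now);
        int level = (delay >> WHEEL_BITS) ? 1 : 0;
        int idx = (t->expires >> (level * WHEEL_BITS)) & WHEEL_MASK;

        t->next = w->bucket[level][idx];
        w->bucket[level][idx] = t;
}

On each tick, expiry processing would walk the current bucket of the low-order array; a production implementation must then either cascade upper-level buckets downward as time advances or, as in the -rt approach just described, handle those timers in place.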


Figure 14.12: Non-Threaded Interrupt Handler (the interrupt handler runs to completion between mainline code segments, and its long latency degrades response time)

Figure 14.13: Threaded Interrupt Handler (the interrupt handler merely wakes a preemptible IRQ thread, so mainline code resumes after only a short latency, improving response time)

Threaded interrupts are used to address a significant source of degraded real-time latencies, namely long-running interrupt handlers, as shown in Figure 14.12. These latencies can be especially problematic for devices that can deliver a large number of events with a single interrupt, which means that the interrupt handler will run for an extended period of time processing all of these events. Worse yet are devices that can deliver new events to a still-running interrupt handler, as such an interrupt handler might well run indefinitely, thus indefinitely degrading real-time latencies.

One way of addressing this problem is the use of threaded interrupts shown in Figure 14.13. Interrupt handlers run in the context of a preemptible IRQ thread, which runs at a configurable priority. The device interrupt handler then runs for only a short time, just long enough to make the IRQ thread aware of the new event. As shown in the figure, threaded interrupts can greatly improve real-time latencies, in part because interrupt handlers running in the context of the IRQ thread may be preempted by high-priority real-time threads.

However, there is no such thing as a free lunch, and there are downsides to threaded interrupts. One downside is increased interrupt latency. Instead of immediately running the interrupt handler, the handler's execution is deferred until the IRQ thread gets around to running it. Of course, this is not a problem unless the device generating the interrupt is on the real-time application's critical path.

Another downside is that poorly written high-priority real-time code might starve the interrupt handler, for example, preventing networking code from running, in turn making it very difficult to debug the problem. Developers must therefore take great care when writing high-priority real-time code. This has been dubbed the Spiderman principle: With great power comes great responsibility.

Priority inheritance is used to handle priority inversion, which can be caused by, among other things, locks acquired by preemptible interrupt handlers [SRL90]. Suppose that a low-priority thread holds a lock, but is preempted by a group of medium-priority threads, at least one such thread per CPU. If an interrupt occurs, a high-priority IRQ thread will preempt one of the medium-priority threads, but only until it decides to acquire the lock held by the low-priority thread. Unfortunately, the low-priority thread cannot release the lock until it starts running, which


the medium-priority threads prevent it from doing. So the high-priority IRQ thread cannot acquire the lock until after one of the medium-priority threads releases its CPU. In short, the medium-priority threads are indirectly blocking the high-priority IRQ threads, a classic case of priority inversion.

Note that this priority inversion could not happen with non-threaded interrupts because the low-priority thread would have to disable interrupts while holding the lock, which would prevent the medium-priority threads from preempting it.

In the priority-inheritance solution, the high-priority thread attempting to acquire the lock donates its priority to the low-priority thread holding the lock until such time as the lock is released, thus preventing long-term priority inversion.

Of course, priority inheritance does have its limitations. For example, if you can design your application to avoid priority inversion entirely, you will likely obtain somewhat better latencies [Yod04b]. This should be no surprise, given that priority inheritance adds a pair of context switches to the worst-case latency. That said, priority inheritance can convert indefinite postponement into a limited increase in latency, and the software-engineering benefits of priority inheritance may outweigh its latency costs in many applications.

Another limitation is that it addresses only lock-based priority inversions within the context of a given operating system. One priority-inversion scenario that it cannot address is a high-priority thread waiting on a network socket for a message that is to be written by a low-priority process that is preempted by a set of CPU-bound medium-priority processes. In addition, a potential disadvantage of applying priority inheritance to user input is fancifully depicted in Figure 14.14.

Figure 14.14: Priority Inversion and User Input

A final limitation involves reader-writer locking. Suppose that we have a very large number of low-priority threads, perhaps even thousands of them, each of which read-holds a particular reader-writer lock. Suppose that all of these threads are preempted by a set of medium-priority threads, with at least one medium-priority thread per CPU. Finally, suppose that a high-priority thread awakens and attempts to write-acquire this same reader-writer lock. No matter how vigorously we boost the priority of the threads read-holding this lock, it could well be a good long time before the high-priority thread can complete its write-acquisition.

There are a number of possible solutions to this reader-writer lock priority-inversion conundrum:

1. Only allow one read-acquisition of a given reader-writer lock at a time. (This is the approach traditionally taken by the Linux kernel's -rt patchset.)

2. Only allow N read-acquisitions of a given reader-writer lock at a time, where N is the number of CPUs.

3. Only allow N read-acquisitions of a given reader-writer lock at a time, where N is a number specified somehow by the developer.

4. Prohibit high-priority threads from write-acquiring reader-writer locks that are ever read-acquired by threads running at lower priorities. (This is a variant of the priority ceiling protocol [SRL90].)

Quick Quiz 14.10: But if you only allow one reader at a time to read-acquire a reader-writer lock, isn't that the same as an exclusive lock???

The no-concurrent-readers restriction eventually became intolerable, so the -rt developers looked more carefully at how the Linux kernel uses reader-writer spinlocks. They learned that time-critical code rarely uses those parts of the kernel that write-acquire reader-writer locks, so that the prospect of writer starvation was not a show-stopper. They therefore constructed a real-time reader-writer lock in which write-side acquisitions use priority inheritance among each other, but where read-side acquisitions take absolute priority over write-side acquisitions. This approach appears to be working well in practice, and is another lesson in the importance of clearly understanding what your users really need.

One interesting detail of this implementation is that both the rt_read_lock() and the rt_write_lock()


functions enter an RCU read-side critical section and both the rt_read_unlock() and the rt_write_unlock() functions exit that critical section. This is necessary because non-realtime kernels' reader-writer locking functions disable preemption across their critical sections, and there really are reader-writer locking use cases that rely on the fact that synchronize_rcu() will therefore wait for all pre-existing reader-writer-lock critical sections to complete. Let this be a lesson to you: Understanding what your users really need is critically important to correct operation, not just to performance. Not only that, but what your users really need changes over time.

This has the side-effect that all of a -rt kernel's reader-writer locking critical sections are subject to RCU priority boosting. This provides at least a partial solution to the problem of reader-writer lock readers being preempted for extended periods of time.

It is also possible to avoid reader-writer lock priority inversion by converting the reader-writer lock to RCU, as briefly discussed in the next section.

Preemptible RCU can sometimes be used as a replacement for reader-writer locking [MW07, MBWW12, McK14f], as was discussed in Section 9.5. Where it can be used, it permits readers and updaters to run concurrently, which prevents low-priority readers from inflicting any sort of priority-inversion scenario on high-priority updaters. However, for this to be useful, it is necessary to be able to preempt long-running RCU read-side critical sections [GMTW08]. Otherwise, long RCU read-side critical sections would result in excessive real-time latencies.

A preemptible RCU implementation was therefore added to the Linux kernel. This implementation avoids the need to individually track the state of each and every task in the kernel by keeping lists of tasks that have been preempted within their current RCU read-side critical sections. A grace period is permitted to end: (1) Once all CPUs have completed any RCU read-side critical sections that were in effect before the start of the current grace period and (2) Once all tasks that were preempted while in one of those pre-existing critical sections have removed themselves from their lists. A simplified version of this implementation is shown in Listing 14.3. The __rcu_read_lock() function spans lines 1–5 and the __rcu_read_unlock() function spans lines 7–15.

Listing 14.3: Preemptible Linux-Kernel RCU
 1 void __rcu_read_lock(void)
 2 {
 3     current->rcu_read_lock_nesting++;
 4     barrier();
 5 }
 6
 7 void __rcu_read_unlock(void)
 8 {
 9     barrier();
10     if (!--current->rcu_read_lock_nesting)
11         barrier();
12     if (READ_ONCE(current->rcu_read_unlock_special.s)) {
13         rcu_read_unlock_special(t);
14     }
15 }

Line 3 of __rcu_read_lock() increments a per-task count of the number of nested rcu_read_lock() calls, and line 4 prevents the compiler from reordering the subsequent code in the RCU read-side critical section to precede the rcu_read_lock().

Line 9 of __rcu_read_unlock() prevents the compiler from reordering the code in the critical section with the remainder of this function. Line 10 decrements the nesting count and checks to see if it has become zero, in other words, if this corresponds to the outermost rcu_read_unlock() of a nested set. If so, line 11 prevents the compiler from reordering this nesting update with line 12's check for special handling. If special handling is required, then the call to rcu_read_unlock_special() on line 13 carries it out.

There are several types of special handling that can be required, but we will focus on that required when the RCU read-side critical section has been preempted. In this case, the task must remove itself from the list that it was added to when it was first preempted within its RCU read-side critical section. However, it is important to note that these lists are protected by locks, which means that rcu_read_unlock() is no longer lockless. However, the highest-priority threads will not be preempted, and therefore, for those highest-priority threads, rcu_read_unlock() will never attempt to acquire any locks. In addition, if implemented carefully, locking can be used to synchronize real-time software [Bra11, SM04a].

Quick Quiz 14.11: Suppose that preemption occurs just after the load from t->rcu_read_unlock_special.s on line 12 of Listing 14.3. Mightn't that result in the task failing to invoke rcu_read_unlock_special(), thus failing to remove itself from the list of tasks blocking the current grace period, in turn causing that grace period to extend indefinitely?

Another important real-time feature of RCU, whether preemptible or not, is the ability to offload RCU callback execution to a kernel thread. To use this, your kernel must be built with CONFIG_RCU_NOCB_CPU=y and booted with the rcu_nocbs= kernel boot parameter specifying which CPUs are to be offloaded. Alternatively, any CPU speci-


fied by the nohz_full= kernel boot parameter described in Section 14.3.5.2 will also have its RCU callbacks offloaded.

In short, this preemptible RCU implementation enables real-time response for read-mostly data structures without the delays inherent to priority boosting of large numbers of readers, and also without delays due to callback invocation.

Preemptible spinlocks are an important part of the -rt patchset due to the long-duration spinlock-based critical sections in the Linux kernel. This functionality has not yet reached mainline: Although they are a conceptually simple substitution of sleeplocks for spinlocks, they have proven relatively controversial. In addition, the real-time functionality that is already in the mainline Linux kernel suffices for a great many use cases, which slowed the -rt patchset's development rate in the early 2010s [Edg13, Edg14]. However, preemptible spinlocks are absolutely necessary to the task of achieving real-time latencies down in the tens of microseconds. Fortunately, the Linux Foundation organized an effort to fund moving the remaining code from the -rt patchset to mainline.

Per-CPU variables are used heavily in the Linux kernel for performance reasons. Unfortunately for real-time applications, many use cases for per-CPU variables require coordinated update of multiple such variables, which is normally provided by disabling preemption, which in turn degrades real-time latencies. Real-time applications clearly need some other way of coordinating per-CPU variable updates.

One alternative is to supply per-CPU spinlocks, which as noted above are actually sleeplocks, so that their critical sections can be preempted and so that priority inheritance is provided. In this approach, code updating groups of per-CPU variables must acquire the current CPU's spinlock, carry out the update, then release whichever lock is acquired, keeping in mind that a preemption might have resulted in a migration to some other CPU. However, this approach introduces both overhead and deadlocks.

Another alternative, which is used in the -rt patchset as of early 2021, is to convert preemption disabling to migration disabling. This ensures that a given kernel thread remains on its CPU through the duration of the per-CPU-variable update, but could also allow some other kernel thread to intersperse its own update of those same variables, courtesy of preemption. There are cases such as statistics gathering where this is not a problem. In the surprisingly rare case where such mid-update preemption is a problem, the use case at hand must properly synchronize the updates, perhaps through a set of per-CPU locks specific to that use case. Although introducing locks again introduces the possibility of deadlock, the per-use-case nature of these locks makes any such deadlocks easier to manage and avoid.

Closing event-driven remarks. There are of course any number of other Linux-kernel components that are critically important to achieving world-class real-time latencies, for example, deadline scheduling [dO18b, dO18a], however, those listed in this section give a good feeling for the workings of the Linux kernel augmented by the -rt patchset.

14.3.5.2 Polling-Loop Real-Time Support

At first glance, use of a polling loop might seem to avoid all possible operating-system interference problems. After all, if a given CPU never enters the kernel, the kernel is completely out of the picture. And the traditional approach to keeping the kernel out of the way is simply not to have a kernel, and many real-time applications do indeed run on bare metal, particularly those running on eight-bit microcontrollers.

One might hope to get bare-metal performance on a modern operating-system kernel simply by running a single CPU-bound user-mode thread on a given CPU, avoiding all causes of interference. Although the reality is of course more complex, it is becoming possible to do just that, courtesy of the NO_HZ_FULL implementation led by Frederic Weisbecker [Cor13, Wei12] that was accepted into version 3.10 of the Linux kernel. Nevertheless, considerable care is required to properly set up such an environment, as it is necessary to control a number of possible sources of OS jitter. The discussion below covers the control of several sources of OS jitter, including device interrupts, kernel threads and daemons, scheduler real-time throttling (this is a feature, not a bug!), timers, non-real-time device drivers, in-kernel global synchronization, scheduling-clock interrupts, page faults, and finally, non-real-time hardware and firmware.

Interrupts are an excellent source of large amounts of OS jitter. Unfortunately, in most cases interrupts are absolutely required in order for the system to communicate with the outside world. One way of resolving this conflict between OS jitter and maintaining contact with the outside world is to reserve a small number of housekeeping CPUs, and to force all interrupts to these CPUs. The


Documentation/IRQ-affinity.txt file in the Linux source tree describes how to direct device interrupts to specified CPUs, which as of early 2021 involves something like the following:

$ echo 0f > /proc/irq/44/smp_affinity

This command would confine interrupt #44 to CPUs 0–3. Note that scheduling-clock interrupts require special handling, and are discussed later in this section.

A second source of OS jitter is due to kernel threads and daemons. Individual kernel threads, such as RCU's grace-period kthreads (rcu_bh, rcu_preempt, and rcu_sched), may be forced onto any desired CPUs using the taskset command, the sched_setaffinity() system call, or cgroups.

Per-CPU kthreads are often more challenging, sometimes constraining hardware configuration and workload layout. Preventing OS jitter from these kthreads requires either that certain types of hardware not be attached to real-time systems, that all interrupts and I/O initiation take place on housekeeping CPUs, that special kernel Kconfig or boot parameters be selected in order to direct work away from the worker CPUs, or that worker CPUs never enter the kernel. Specific per-kthread advice may be found in the Linux kernel source Documentation directory at kernel-per-CPU-kthreads.txt.

A third source of OS jitter in the Linux kernel for CPU-bound threads running at real-time priority is the scheduler itself. This is an intentional debugging feature, designed to ensure that important non-realtime work is allotted at least 50 milliseconds out of each second, even if there is an infinite-loop bug in your real-time application. However, when you are running a polling-loop-style real-time application, you will need to disable this debugging feature. This can be done as follows:

$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us

You will of course need to be running as root to execute this command, and you will also need to carefully consider the aforementioned Spiderman principle. One way to minimize the risks is to offload interrupts and kernel threads/daemons from all CPUs running CPU-bound real-time threads, as described in the paragraphs above. In addition, you should carefully read the material in the Documentation/scheduler directory. The material in the sched-rt-group.rst file is particularly important, especially if you are using the cgroups real-time features enabled by the CONFIG_RT_GROUP_SCHED Kconfig parameter.

A fourth source of OS jitter comes from timers. In most cases, keeping a given CPU out of the kernel will prevent timers from being scheduled on that CPU. One important exception is recurring timers, where a given timer handler posts a later occurrence of that same timer. If such a timer gets started on a given CPU for any reason, that timer will continue to run periodically on that CPU, inflicting OS jitter indefinitely. One crude but effective way to offload recurring timers is to use CPU hotplug to offline all worker CPUs that are to run CPU-bound real-time application threads, online these same CPUs, then start your real-time application.

A fifth source of OS jitter is provided by device drivers that were not intended for real-time use. For an old canonical example, in 2005, the VGA driver would blank the screen by zeroing the frame buffer with interrupts disabled, which resulted in tens of milliseconds of OS jitter. One way of avoiding device-driver-induced OS jitter is to carefully select devices that have been used heavily in real-time systems, and which have therefore had their real-time bugs fixed. Another way is to confine the device's interrupts and all code using that device to designated housekeeping CPUs. A third way is to test the device's ability to support real-time workloads and fix any real-time bugs.8

8 If you take this approach, please submit your fixes upstream so that others can benefit. After all, when you need to port your application to a later version of the Linux kernel, you will be one of those "others".

A sixth source of OS jitter is provided by some in-kernel full-system synchronization algorithms, perhaps most notably the global TLB-flush algorithm. This can be avoided by avoiding memory-unmapping operations, and especially avoiding unmapping operations within the kernel. As of early 2021, the way to avoid in-kernel unmapping operations is to avoid unloading kernel modules.

A seventh source of OS jitter is provided by scheduling-clock interrupts and RCU callback invocation. These may be avoided by building your kernel with the NO_HZ_FULL Kconfig parameter enabled, and then booting with the nohz_full= parameter specifying the list of worker CPUs that are to run real-time threads. For example, nohz_full=2-7 would designate CPUs 2, 3, 4, 5, 6, and 7 as worker CPUs, thus leaving CPUs 0 and 1 as housekeeping CPUs. The worker CPUs would not incur scheduling-clock interrupts as long as there is no more than one runnable task on each worker CPU, and each worker CPU's RCU callbacks would be invoked on one of the housekeeping CPUs. A CPU that has suppressed scheduling-clock interrupts due to there only being one


runnable task on that CPU is said to be in adaptive ticks mode or in nohz_full mode. It is important to ensure that you have designated enough housekeeping CPUs to handle the housekeeping load imposed by the rest of the system, which requires careful benchmarking and tuning.

An eighth source of OS jitter is page faults. Because most Linux implementations use an MMU for memory protection, real-time applications running on these systems can be subject to page faults. Use the mlock() and mlockall() system calls to pin your application's pages into memory, thus avoiding major page faults. Of course, the Spiderman principle applies, because locking down too much memory may prevent the system from getting other work done.

A ninth source of OS jitter is unfortunately the hardware and firmware. It is therefore important to use systems that have been designed for real-time use.

Unfortunately, this list of OS-jitter sources can never be complete, as it will change with each new version of the kernel. This makes it necessary to be able to track down additional sources of OS jitter. Given a CPU N running a CPU-bound usermode thread, the commands shown in Listing 14.4 will produce a list of all the times that this CPU entered the kernel. Of course, the N on line 5 must be replaced with the number of the CPU in question, and the 1 on line 2 may be increased to show additional levels of function call within the kernel. The resulting trace can help track down the source of the OS jitter.

Listing 14.4: Locating Sources of OS Jitter
1 cd /sys/kernel/debug/tracing
2 echo 1 > max_graph_depth
3 echo function_graph > current_tracer
4 # run workload
5 cat per_cpu/cpuN/trace

As always, there is no free lunch, and NO_HZ_FULL is no exception. As noted earlier, NO_HZ_FULL makes kernel/user transitions more expensive due to the need for delta process accounting and the need to inform kernel subsystems (such as RCU) of the transitions. As a rough rule of thumb, NO_HZ_FULL helps with many types of real-time and heavy-compute workloads, but hurts other workloads that feature high rates of system calls and I/O [ACA+18]. Additional limitations, tradeoffs, and configuration advice may be found in Documentation/timers/no_hz.rst.

As you can see, obtaining bare-metal performance when running CPU-bound real-time threads on a general-purpose OS such as Linux requires painstaking attention to detail. Automation would of course help, and some automation has been applied, but given the relatively small number of users, automation can be expected to appear relatively slowly. Nevertheless, the ability to gain near-bare-metal performance while running a general-purpose operating system promises to ease construction of some types of real-time systems.

14.3.6 Implementing Parallel Real-Time Applications

Developing real-time applications is a wide-ranging topic, and this section can only touch on a few aspects. To this end, Section 14.3.6.1 looks at a few software components commonly used in real-time applications, Section 14.3.6.2 provides a brief overview of how polling-loop-based applications may be implemented, Section 14.3.6.3 gives a similar overview of streaming applications, and Section 14.3.6.4 briefly covers event-based applications.

14.3.6.1 Real-Time Components

As in all areas of engineering, a robust set of components is essential to productivity and reliability. This section is not a full catalog of real-time software components—such a catalog would fill multiple books—but rather a brief overview of the types of components available.

A natural place to look for real-time software components would be algorithms offering wait-free synchronization [Her91], and in fact lockless algorithms are very important to real-time computing. However, wait-free synchronization only guarantees forward progress in finite time. Although a century is finite, this is unhelpful when your deadlines are measured in microseconds, let alone milliseconds.

Nevertheless, there are some important wait-free algorithms that do provide bounded response time, including atomic test and set, atomic exchange, atomic fetch-and-add, single-producer/single-consumer FIFO queues based on circular arrays, and numerous per-thread partitioned algorithms. In addition, recent research has confirmed the observation that algorithms with lock-free guarantees9 also provide the same latencies in practice (in the wait-free sense), assuming a stochastically fair scheduler and absence of fail-stop bugs [ACHS13]. This means that many non-wait-free stacks and queues are nevertheless appropriate for real-time use.

9 Wait-free algorithms guarantee that all threads make progress in finite time, while lock-free algorithms only guarantee that at least one thread will make progress in finite time. See Section 14.2 for more details.
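As one concrete illustration, the atomic primitives called out above are available directly from C11's <stdatomic.h>, and each typically compiles to a small, fixed number of instructions. The fragment below is only a sketch under that assumption, with invented variable and function names: a producer counts events with atomic fetch-and-add, and a real-time consumer claims any pending event with atomic exchange, using no locks and no retry loops.

#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong nevents;  /* wait-free event counter */
static atomic_int pending;    /* 1 if an event awaits processing */

/* Producer side, perhaps a sensor or interrupt thread. */
static void signal_event(void)
{
        atomic_fetch_add_explicit(&nevents, 1, memory_order_relaxed);
        atomic_store_explicit(&pending, 1, memory_order_release);
}

/* Real-time consumer: claims any pending event in a fixed number of
 * steps, with no locks and no retry loops. */
static int consume_event(void)
{
        return atomic_exchange_explicit(&pending, 0, memory_order_acquire);
}

int main(void)
{
        signal_event();
        if (consume_event())
                printf("handled one of %lu events\n",
                       atomic_load_explicit(&nevents, memory_order_relaxed));
        return 0;
}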


Quick Quiz 14.12: But isn't correct operation despite fail-stop bugs a valuable fault-tolerance property?

In practice, locking is often used in real-time programs, theoretical concerns notwithstanding. However, under more severe constraints, lock-based algorithms can also provide bounded latencies [Bra11]. These constraints include:

1. Fair scheduler. In the common case of a fixed-priority scheduler, the bounded latencies are provided only to the highest-priority threads.

2. Sufficient bandwidth to support the workload. An implementation rule supporting this constraint might be "There will be at least 50 % idle time on all CPUs during normal operation," or, more formally, "The offered load will be sufficiently low to allow the workload to be schedulable at all times."

3. No fail-stop bugs.

4. FIFO locking primitives with bounded acquisition, handoff, and release latencies. Again, in the common case of a locking primitive that is FIFO within priorities, the bounded latencies are provided only to the highest-priority threads.

5. Some way of preventing unbounded priority inversion. The priority-ceiling and priority-inheritance disciplines mentioned earlier in this chapter suffice.

6. Bounded nesting of lock acquisitions. We can have an unbounded number of locks, but only as long as a given thread never acquires more than a few of them (ideally only one of them) at a time.

7. Bounded number of threads. In combination with the earlier constraints, this constraint means that there will be a bounded number of threads waiting on any given lock.

8. Bounded time spent in any given critical section. Given a bounded number of threads waiting on any given lock and a bounded critical-section duration, the wait time will be bounded.

Quick Quiz 14.13: I couldn't help but spot the word "includes" before this list. Are there other constraints?

This result opens a vast cornucopia of algorithms and data structures for use in real-time software—and validates long-standing real-time practice.

Of course, a careful and simple application design is also extremely important. The best real-time components in the world cannot make up for a poorly thought-out design. For parallel real-time applications, synchronization overheads clearly must be a key component of the design.

14.3.6.2 Polling-Loop Applications

Many real-time applications consist of a single CPU-bound loop that reads sensor data, computes a control law, and writes control output. If the hardware registers providing sensor data and taking control output are mapped into the application's address space, this loop might be completely free of system calls. But beware of the Spiderman principle: With great power comes great responsibility, in this case the responsibility to avoid bricking the hardware by making inappropriate references to the hardware registers.

This arrangement is often run on bare metal, without the benefits of (or the interference from) an operating system. However, increasing hardware capability and increasing levels of automation motivate increasing software functionality, for example, user interfaces, logging, and reporting, all of which can benefit from an operating system.

One way of gaining much of the benefit of running on bare metal while still having access to the full features and functions of a general-purpose operating system is to use the Linux kernel's NO_HZ_FULL capability, described in Section 14.3.5.2.

14.3.6.3 Streaming Applications

One type of big-data real-time application takes input from numerous sources, processes it internally, and outputs alerts and summaries. These streaming applications are often highly parallel, processing different information sources concurrently.

One approach for implementing streaming applications is to use dense-array circular FIFOs to connect different processing steps [Sut13]. Each such FIFO has only a single thread producing into it and a (presumably different) single thread consuming from it. Fan-in and fan-out points use threads rather than data structures, so if the output of several FIFOs needed to be merged, a separate thread would input from them and output to another FIFO for which this separate thread was the sole producer. Similarly, if the output of a given FIFO needed to be split, a separate thread would input from this FIFO and output to several FIFOs as needed.


Listing 14.5: Timed-Wait Test Program
 1 if (clock_gettime(CLOCK_REALTIME, &timestart) != 0) {
 2   perror("clock_gettime 1");
 3   exit(-1);
 4 }
 5 if (nanosleep(&timewait, NULL) != 0) {
 6   perror("nanosleep");
 7   exit(-1);
 8 }
 9 if (clock_gettime(CLOCK_REALTIME, &timeend) != 0) {
10   perror("clock_gettime 2");
11   exit(-1);
12 }

This discipline might seem restrictive, but it allows communication among threads with minimal synchronization overhead, and minimal synchronization overhead is important when attempting to meet tight latency constraints. This is especially true when the amount of processing for each step is small, so that the synchronization overhead is significant compared to the processing overhead.

The individual threads might be CPU-bound, in which case the advice in Section 14.3.6.2 applies. On the other hand, if the individual threads block waiting for data from their input FIFOs, the advice of the next section applies.

14.3.6.4 Event-Driven Applications

We will use fuel injection into a mid-sized industrial engine as a fanciful example for event-driven applications. Under normal operating conditions, this engine requires that the fuel be injected within a one-degree interval surrounding top dead center. If we assume a 1,500-RPM rotation rate, we have 25 rotations per second, or about 9,000 degrees of rotation per second, which translates to 111 microseconds per degree. We therefore need to schedule the fuel injection to within a time interval of about 100 microseconds.

Suppose that a timed wait was to be used to initiate fuel injection, although if you are building an engine, I hope you supply a rotation sensor. We need to test the timed-wait functionality, perhaps using the test program shown in Listing 14.5. Unfortunately, if we run this program, we can get unacceptable timer jitter, even in a -rt kernel.

One problem is that POSIX CLOCK_REALTIME is, oddly enough, not intended for real-time use. Instead, it means "realtime" as opposed to the amount of CPU time consumed by a process or thread. For real-time use, you should instead use CLOCK_MONOTONIC. However, even with this change, results are still unacceptable.

Another problem is that the thread must be raised to a real-time priority by using the sched_setscheduler() system call. But even this change is insufficient, because we can still see page faults. We also need to use the mlockall() system call to pin the application's memory, preventing page faults. With all of these changes, results might finally be acceptable.
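To make those adjustments concrete, the following is a rough sketch (again, not one of the book's CodeSamples) of a timed-wait loop that uses CLOCK_MONOTONIC with an absolute deadline, raises the thread to a real-time priority via sched_setscheduler(), and pins memory with mlockall(). The SCHED_FIFO priority of 80 and the 100-microsecond period are arbitrary illustrative choices.

    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>

    int main(void)
    {
    	struct sched_param sp = { .sched_priority = 80 }; /* arbitrary RT priority */
    	struct timespec deadline;

    	/* Raise to a real-time scheduling class. */
    	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
    		perror("sched_setscheduler");
    		exit(-1);
    	}
    	/* Pin current and future pages to avoid page-fault latency. */
    	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    		perror("mlockall");
    		exit(-1);
    	}
    	if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0) {
    		perror("clock_gettime");
    		exit(-1);
    	}
    	for (;;) {
    		int ret;

    		/* Advance the absolute deadline by 100 microseconds. */
    		deadline.tv_nsec += 100 * 1000;
    		if (deadline.tv_nsec >= 1000 * 1000 * 1000) {
    			deadline.tv_nsec -= 1000 * 1000 * 1000;
    			deadline.tv_sec++;
    		}
    		/* Absolute sleep avoids accumulating drift from oversleeps. */
    		ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
    				      &deadline, NULL);
    		if (ret != 0) {
    			fprintf(stderr, "clock_nanosleep: error %d\n", ret);
    			exit(-1);
    		}
    		/* ... time-critical work (e.g., fuel injection) goes here ... */
    	}
    }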
In other situations, further adjustments might be needed. It might be necessary to affinity time-critical threads onto their own CPUs, and it might also be necessary to affinity interrupts away from those CPUs. It might be necessary to carefully select hardware and drivers, and it will very likely be necessary to carefully select kernel configuration. As can be seen from this example, real-time computing can be quite unforgiving.

14.3.6.5 The Role of RCU

Suppose that you are writing a parallel real-time application that needs to access data that is subject to gradual change, perhaps due to changes in temperature, humidity, and barometric pressure. The real-time response constraints on this program are so severe that it is not permissible to spin or block, thus ruling out locking, nor is it permissible to use a retry loop, thus ruling out sequence locks and hazard pointers. Fortunately, the temperature and pressure are normally controlled, so that a default hard-coded set of data is usually sufficient.

However, the temperature, humidity, and pressure occasionally deviate too far from the defaults, and in such situations it is necessary to provide data that replaces the defaults. Because the temperature, humidity, and pressure change gradually, providing the updated values is not a matter of urgency, though it must happen within a few minutes. The program is to use a global pointer imaginatively named cur_cal that normally references default_cal, which is a statically allocated and initialized structure that contains the default calibration values in fields imaginatively named a, b, and c. Otherwise, cur_cal points to a dynamically allocated structure providing the current calibration values.

Listing 14.6 shows how RCU can be used to solve this problem. Lookups are deterministic, as shown in calc_control() on lines 9–15, consistent with real-time requirements. Updates are more complex, as shown by update_cal() on lines 17–35.

Quick Quiz 14.14: Given that real-time systems are often used for safety-critical applications, and given that runtime memory allocation is forbidden in many safety-critical situations, what is with the call to malloc()???


Listing 14.6: Real-Time Calibration Using RCU
 1 struct calibration {
 2   short a;
 3   short b;
 4   short c;
 5 };
 6 struct calibration default_cal = { 62, 33, 88 };
 7 struct calibration *cur_cal = &default_cal;
 8
 9 short calc_control(short t, short h, short press)
10 {
11   struct calibration *p;
12
13   p = rcu_dereference(cur_cal);
14   return do_control(t, h, press, p->a, p->b, p->c);
15 }
16
17 bool update_cal(short a, short b, short c)
18 {
19   struct calibration *p;
20   struct calibration *old_p;
21
22   old_p = rcu_dereference(cur_cal);
23   p = malloc(sizeof(*p));
24   if (!p)
25     return false;
26   p->a = a;
27   p->b = b;
28   p->c = c;
29   rcu_assign_pointer(cur_cal, p);
30   if (old_p == &default_cal)
31     return true;
32   synchronize_rcu();
33   free(old_p);
34   return true;
35 }

Figure 14.15: The Dark Side of Real-Time Computing

Quick Quiz 14.15: Don't you need some kind of synchronization to protect update_cal()?

This example shows how RCU can provide deterministic read-side data-structure access to real-time programs.
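If more than one thread can invoke update_cal(), one straightforward approach (a sketch, not the book's prescribed answer to the Quick Quiz) is to serialize updaters with a mutex, which is acceptable because updates are not subject to the tight latency constraints that the read path must meet. The wrapper name and the calibration_mutex variable are illustrative assumptions.

    /* Hypothetical wrapper serializing updaters; readers still call
     * calc_control() from Listing 14.6 and are never blocked by this lock. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t calibration_mutex = PTHREAD_MUTEX_INITIALIZER;

    bool update_cal_serialized(short a, short b, short c)
    {
    	bool ret;

    	pthread_mutex_lock(&calibration_mutex);
    	ret = update_cal(a, b, c);	/* from Listing 14.6 */
    	pthread_mutex_unlock(&calibration_mutex);
    	return ret;
    }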

14.3.7 Real Time vs. Real Fast: How to Choose?

The choice between real-time and real-fast computing can be a difficult one. Because real-time systems often inflict a throughput penalty on non-real-time computing, using real-time when it is not required is unwise, as fancifully depicted by Figure 14.15.

On the other hand, failing to use real-time when it is required can also cause problems, as fancifully depicted by Figure 14.16. It is almost enough to make you feel sorry for the boss!

Figure 14.16: The Dark Side of Real-Fast Computing

One rule of thumb uses the following four questions to help you choose:

1. Is average long-term throughput the only goal?


2. Is it permissible for heavy loads to degrade response times?

3. Is there high memory pressure, ruling out use of the mlockall() system call?

4. Does the basic work item of your application take more than 100 milliseconds to complete?

If the answer to any of these questions is "yes", you should choose real-fast over real-time; otherwise, real-time might be for you.

Choose wisely, and if you do choose real-time, make sure that your hardware, firmware, and operating system are up to the job!

Chapter 15

Advanced Synchronization: Memory Ordering

The art of progress is to preserve order amid change and to preserve change amid order.

Alfred North Whitehead

Causality and sequencing are deeply intuitive, and hackers often have a strong grasp of these concepts. These intuitions can be quite helpful when writing, analyzing, and debugging not only sequential code, but also parallel code that makes use of standard mutual-exclusion mechanisms such as locking. Unfortunately, these intuitions break down completely in the face of code that fails to use such mechanisms. One example of such code implements the standard mutual-exclusion mechanisms themselves, while another example implements fast paths that use weaker synchronization. Insults to intuition notwithstanding, some argue that weakness is a virtue [Alg13]. Virtue or vice, this chapter will help you gain an understanding of memory ordering that, with practice, will be sufficient to implement synchronization primitives and performance-critical fast paths.

Section 15.1 will demonstrate that real computer systems can reorder memory references, give some reasons why they do so, and provide some information on how to prevent undesired reordering. Sections 15.2 and 15.3 will cover the types of pain that hardware and compilers, respectively, can inflict on unwary parallel programmers. Section 15.4 gives an overview of the benefits of modeling memory ordering at higher levels of abstraction. Section 15.5 follows up with more detail on a few representative hardware platforms. Finally, Section 15.6 provides some useful rules of thumb.

Quick Quiz 15.1: This chapter has been rewritten since the first edition. Did memory ordering change all that since 2014?

15.1 Ordering: Why and How?

Nothing is orderly till people take hold of it. Everything in creation lies around loose.

Henry Ward Beecher, updated

One motivation for memory ordering can be seen in the trivial-seeming litmus test in Listing 15.1 (C-SB+o-o+o-o.litmus), which at first glance might appear to guarantee that the exists clause never triggers.1 After all, if 0:r2=0 as shown in the exists clause,2 we might hope that Thread P0()'s load from x1 into r2 must have happened before Thread P1()'s store to x1, which might raise further hopes that Thread P1()'s load from x0 into r2 must happen after Thread P0()'s store to x0, so that 1:r2=2, thus not triggering the exists clause. The example is symmetric, so similar reasoning might lead us to hope that 1:r2=0 guarantees that 0:r2=2. Unfortunately, the lack of memory barriers dashes these hopes. The CPU is within its rights to reorder the statements within both Thread P0() and Thread P1(), even on relatively strongly ordered systems such as x86.

Quick Quiz 15.2: The compiler can also reorder Thread P0()'s and Thread P1()'s memory accesses in Listing 15.1, right?

This willingness to reorder can be confirmed using tools such as litmus7 [AMT14], which found that the counter-intuitive ordering happened 314 times out of 100,000,000 trials on my x86 laptop. Oddly enough, the perfectly legal outcome where both loads return the value 2 occurred less frequently, in this case, only 167 times.3 The lesson here is clear: Increased counter-intuitivity does not necessarily imply decreased probability!

1 Purists would instead insist that the exists clause is never satisfied, but we use "trigger" here by analogy with assertions.
2 That is, Thread P0()'s instance of local variable r2 equals zero. See Section 12.2.1 for documentation of litmus-test nomenclature.
3 Please note that results are sensitive to the exact hardware configuration, how heavily the system is loaded, and much else besides.


Listing 15.1: Memory Misordering: Store-Buffering Litmus Test
 1 C C-SB+o-o+o-o
 2
 3 {}
 4
 5 P0(int *x0, int *x1)
 6 {
 7   int r2;
 8
 9   WRITE_ONCE(*x0, 2);
10   r2 = READ_ONCE(*x1);
11 }
12
13 P1(int *x0, int *x1)
14 {
15   int r2;
16
17   WRITE_ONCE(*x1, 2);
18   r2 = READ_ONCE(*x0);
19 }
20
21 exists (1:r2=0 /\ 0:r2=0)

Figure 15.1: System Architecture With Store Buffers (each CPU has its own store buffer and cache between it and memory)

The following sections show exactly where this intuition breaks down, and then put forward a mental model of memory ordering that can help you avoid these pitfalls. Section 15.1.1 gives a brief overview of why hardware misorders memory accesses, and then Section 15.1.2 gives an equally brief overview of how you can thwart such misordering. Finally, Section 15.1.3 lists some basic rules of thumb, which will be further refined in later sections. These sections focus on hardware reordering, but rest assured that compilers reorder much more aggressively than hardware ever dreamed of doing. But that topic will be taken up later in Section 15.3.

15.1.1 Why Hardware Misordering?

But why does memory misordering happen in the first place? Can't CPUs keep track of ordering on their own? Isn't that why we have computers in the first place, to keep track of things?

Many people do indeed expect their computers to keep track of things, but many also insist that they keep track of things quickly. However, as seen in Chapter 3, main memory cannot keep up with modern CPUs, which can execute hundreds of instructions in the time required to fetch a single variable from memory. CPUs therefore sport increasingly large caches, as seen back in Figure 3.10, which means that although the first load by a given CPU from a given variable will result in an expensive cache miss as was discussed in Section 3.1.5, subsequent repeated loads from that variable by that CPU might execute very quickly because the initial cache miss will have loaded that variable into that CPU's cache.

However, it is also necessary to accommodate frequent concurrent stores from multiple CPUs to a set of shared variables. In cache-coherent systems, if the caches hold multiple copies of a given variable, all the copies of that variable must have the same value. This works extremely well for concurrent loads, but not so well for concurrent stores: Each store must do something about all copies of the old value (another cache miss!), which, given the finite speed of light and the atomic nature of matter, will be slower than impatient software hackers would like.

CPUs therefore come equipped with store buffers, as shown in Figure 15.1. When a given CPU stores to a variable not present in that CPU's cache, then the new value is instead placed in that CPU's store buffer. The CPU can then proceed immediately, without having to wait for the store to do something about all the old values of that variable residing in other CPUs' caches.

Although store buffers can greatly increase performance, they can cause instructions and memory references to execute out of order, which can in turn cause serious confusion, as fancifully illustrated in Figure 15.2.

Figure 15.2: CPUs Can Do Things Out of Order
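The store-buffering misordering can also be provoked from ordinary userspace code. The following rough sketch (not one of the book's CodeSamples) runs the pattern of Listing 15.1 using C11 relaxed atomics and two POSIX threads; the variable names and the 100,000-trial loop are arbitrary choices. Because thread creation dominates each trial, the racy window is narrow, so a tool such as litmus7 is far more effective at exposing the reordering, but the counter-intuitive outcome may nonetheless be observed occasionally.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int x0, x1;
    static int r2_0, r2_1;

    static void *thread0(void *arg)
    {
    	atomic_store_explicit(&x0, 2, memory_order_relaxed);
    	r2_0 = atomic_load_explicit(&x1, memory_order_relaxed);
    	return NULL;
    }

    static void *thread1(void *arg)
    {
    	atomic_store_explicit(&x1, 2, memory_order_relaxed);
    	r2_1 = atomic_load_explicit(&x0, memory_order_relaxed);
    	return NULL;
    }

    int main(void)
    {
    	int hits = 0;

    	for (int i = 0; i < 100000; i++) {
    		pthread_t t0, t1;

    		atomic_store(&x0, 0);
    		atomic_store(&x1, 0);
    		pthread_create(&t0, NULL, thread0, NULL);
    		pthread_create(&t1, NULL, thread1, NULL);
    		pthread_join(t0, NULL);
    		pthread_join(t1, NULL);
    		if (r2_0 == 0 && r2_1 == 0)
    			hits++;		/* the counter-intuitive outcome */
    	}
    	printf("0:r2=0 && 1:r2=0 happened %d times\n", hits);
    	return 0;
    }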


In particular, store buffers cause the memory misordering illustrated by Listing 15.1. Table 15.1 shows the steps leading to this misordering. Row 1 shows the initial state, where CPU 0 has x1 in its cache and CPU 1 has x0 in its cache, both variables having a value of zero. Row 2 shows the state change due to each CPU's store (lines 9 and 17 of Listing 15.1). Because neither CPU has the stored-to variable in its cache, both CPUs record their stores in their respective store buffers.

Table 15.1: Memory Misordering: Store-Buffering Sequence of Events

       CPU 0                                       CPU 1
       Instruction         Store Buffer  Cache     Instruction         Store Buffer  Cache
  1    (Initial state)                   x1==0     (Initial state)                   x0==0
  2    x0 = 2;             x0==2         x1==0     x1 = 2;             x1==2         x0==0
  3    r2 = x1; (0)        x0==2         x1==0     r2 = x0; (0)        x1==2         x0==0
  4    (Read-invalidate)   x0==2         x0==0     (Read-invalidate)   x1==2         x1==0
  5    (Finish store)                    x0==2     (Finish store)                    x1==2

Quick Quiz 15.3: But wait!!! On row 2 of Table 15.1 both x0 and x1 each have two values at the same time, namely zero and two. How can that possibly work???

Row 3 shows the two loads (lines 10 and 18 of Listing 15.1). Because the variable being loaded by each CPU is in that CPU's cache, each load immediately returns the cached value, which in both cases is zero.

But the CPUs are not done yet: Sooner or later, they must empty their store buffers. Because caches move data around in relatively large blocks called cachelines, and because each cacheline can hold several variables, each CPU must get the cacheline into its own cache so that it can update the portion of that cacheline corresponding to the variable in its store buffer, but without disturbing any other part of the cacheline. Each CPU must also ensure that the cacheline is not present in any other CPU's cache, for which a read-invalidate operation is used. As shown on row 4, after both read-invalidate operations complete, the two CPUs have traded cachelines, so that CPU 0's cache now contains x0 and CPU 1's cache now contains x1. Once these two variables are in their new homes, each CPU can flush its store buffer into the corresponding cache line, leaving each variable with its final value as shown on row 5.

Quick Quiz 15.4: But don't the values also need to be flushed from the cache to main memory?

In summary, store buffers are needed to allow CPUs to handle store instructions efficiently, but they can result in counter-intuitive memory misordering.

But what do you do if your algorithm really needs its memory references to be ordered? For example, suppose that you are communicating with a driver using a pair of flags, one that says whether or not the driver is running and the other that says whether there is a request pending for that driver. The requester needs to set the request-pending flag, then check the driver-running flag, and if false, wake the driver. Once the driver has serviced all the pending requests that it knows about, it needs to clear its driver-running flag, then check the request-pending flag to see if it needs to restart. This very reasonable approach cannot work unless there is some way to make sure that the hardware processes the stores and loads in order. This is the subject of the next section.
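A kernel-style sketch of this flag-based handshake is shown below, with the full barriers that the next section introduces already marked in. The flag names and the wake_driver() and restart_driver() helpers are hypothetical; this is an illustration of the pattern, not code from any particular driver.

    /* Hypothetical flags shared between requester and driver. */
    int request_pending;
    int driver_running;

    void requester(void)
    {
    	WRITE_ONCE(request_pending, 1);
    	smp_mb();	/* Order flag store before flag check (Section 15.1.2). */
    	if (!READ_ONCE(driver_running))
    		wake_driver();		/* hypothetical helper */
    }

    void driver_idle_path(void)
    {
    	WRITE_ONCE(driver_running, 0);
    	smp_mb();	/* Order flag store before flag check. */
    	if (READ_ONCE(request_pending))
    		restart_driver();	/* hypothetical helper */
    }

Without the two smp_mb() calls, each side's store could be reordered after its load, allowing both sides to miss each other's flag, which is exactly the store-buffering outcome of Table 15.1.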
15.1.2 How to Force Ordering?

It turns out that there are compiler directives and synchronization primitives (such as locking and RCU) that are responsible for maintaining the illusion of ordering through use of memory barriers (for example, smp_mb() in the Linux kernel). These memory barriers can be explicit instructions, as they are on Arm, POWER, Itanium, and Alpha, or they can be implied by other instructions, as they often are on x86.


Since these standard synchronization primitives preserve the illusion of ordering, your path of least resistance is to simply use these primitives, thus allowing you to stop reading this section.

However, if you need to implement the synchronization primitives themselves, or if you are simply interested in understanding how memory ordering works, read on! The first stop on the journey is Listing 15.2 (C-SB+o-mb-o+o-mb-o.litmus), which places an smp_mb() Linux-kernel full memory barrier between the store and load in both P0() and P1(), but is otherwise identical to Listing 15.1. These barriers prevent the counter-intuitive outcome from happening on 100,000,000 trials on my x86 laptop. Interestingly enough, the added overhead due to these barriers causes the legal outcome where both loads return the value two to happen more than 800,000 times, as opposed to only 167 times for the barrier-free code in Listing 15.1.

Listing 15.2: Memory Ordering: Store-Buffering Litmus Test
 1 C C-SB+o-mb-o+o-mb-o
 2
 3 {}
 4
 5 P0(int *x0, int *x1)
 6 {
 7   int r2;
 8
 9   WRITE_ONCE(*x0, 2);
10   smp_mb();
11   r2 = READ_ONCE(*x1);
12 }
13
14 P1(int *x0, int *x1)
15 {
16   int r2;
17
18   WRITE_ONCE(*x1, 2);
19   smp_mb();
20   r2 = READ_ONCE(*x0);
21 }
22
23 exists (1:r2=0 /\ 0:r2=0)

These barriers have a profound effect on ordering, as can be seen in Table 15.2. Although the first two rows are the same as in Table 15.1 and although the smp_mb() instructions on row 3 do not change state in and of themselves, they do cause the stores to complete (rows 4 and 5) before the loads (row 6), which rules out the counter-intuitive outcome shown in Table 15.1. Note that variables x0 and x1 each still have more than one value on row 2; however, as promised earlier, the smp_mb() invocations straighten things out in the end.

Although full barriers such as smp_mb() have extremely strong ordering guarantees, their strength comes at a high price in terms of foregone hardware and compiler optimizations. A great many situations can be handled with much weaker ordering guarantees that use much cheaper memory-ordering instructions, or, in some cases, no memory-ordering instructions at all.

Table 15.3 provides a cheatsheet of the Linux kernel's ordering primitives and their guarantees. Each row corresponds to a primitive or category of primitives that might or might not provide ordering, with the columns labeled "Prior Ordered Operation" and "Subsequent Ordered Operation" being the operations that might (or might not) be ordered against. Cells containing "Y" indicate that ordering is supplied unconditionally, while other characters indicate that ordering is supplied only partially or conditionally. Blank cells indicate that no ordering is supplied.

The "Store" row also covers the store portion of an atomic RMW operation. In addition, the "Load" row covers the load component of a successful value-returning _relaxed() RMW atomic operation, although the combined "_relaxed() RMW operation" line provides a convenient combined reference in the value-returning case. A CPU executing unsuccessful value-returning atomic RMW operations must invalidate the corresponding variable from all other CPUs' caches. Therefore, unsuccessful value-returning atomic RMW operations have many of the properties of a store, which means that the "_relaxed() RMW operation" line also applies to unsuccessful value-returning atomic RMW operations.

The *_acquire row covers smp_load_acquire(), cmpxchg_acquire(), xchg_acquire(), and so on; the *_release row covers smp_store_release(), rcu_assign_pointer(), cmpxchg_release(), xchg_release(), and so on; and the "Successful full-strength non-void RMW" row covers atomic_add_return(), atomic_add_unless(), atomic_dec_and_test(), cmpxchg(), xchg(), and so on. The "Successful" qualifiers apply to primitives such as atomic_add_unless(), cmpxchg_acquire(), and cmpxchg_release(), which have no effect on either memory or on ordering when they indicate failure, as indicated by the earlier "_relaxed() RMW operation" row.

Column "C" indicates cumulativity and propagation, as explained in Sections 15.2.7.1 and 15.2.7.2. In the meantime, this column can usually be ignored when there are at most two threads involved.

Quick Quiz 15.5: The rows in Table 15.3 seem quite random and confused. Whatever is the conceptual basis of this table???


Table 15.2: Memory Ordering: Store-Buffering Sequence of Events

CPU 0 CPU 1
Instruction Store Buffer Cache Instruction Store Buffer Cache
1 (Initial state) x1==0 (Initial state) x0==0
2 x0 = 2; x0==2 x1==0 x1 = 2; x1==2 x0==0
3 smp_mb(); x0==2 x1==0 smp_mb(); x1==2 x0==0
4 (Read-invalidate) x0==2 x0==0 (Read-invalidate) x1==2 x1==0
5 (Finish store) x0==2 (Finish store) x1==2
6 r2 = x1; (2) x1==2 r2 = x0; (2) x0==2

Table 15.3: Linux-Kernel Memory-Ordering Cheat Sheet

Prior Ordered Operation Subsequent Ordered Operation


Operation Providing Ordering C Self R W RMW Self R W DR DW RMW SV
Store, for example, WRITE_ONCE() Y Y
Load, for example, READ_ONCE() Y Y Y Y
_relaxed() RMW operation Y Y Y Y
*_dereference() Y Y Y Y
Successful *_acquire() R Y Y Y Y Y Y
Successful *_release() C Y Y Y W Y
smp_rmb() Y R Y Y R
smp_wmb() Y W Y Y W
smp_mb() and synchronize_rcu() CP Y Y Y Y Y Y Y Y
Successful full-strength non-void RMW CP Y Y Y Y Y Y Y Y Y Y Y
smp_mb__before_atomic() CP Y Y Y a a a a Y
smp_mb__after_atomic() CP a a Y Y Y Y Y Y

Key: C: Ordering is cumulative


P: Ordering propagates
R: Read, for example, READ_ONCE(), or read portion of RMW
W: Write, for example, WRITE_ONCE(), or write portion of RMW
Y: Provides the specified ordering
a: Provides specified ordering given intervening RMW atomic operation
DR: Dependent read (address dependency, Section 15.2.3)
DW: Dependent write (address, data, or control dependency, Sections 15.2.3–15.2.5)
RMW: Atomic read-modify-write operation
Self: Orders self, as opposed to accesses both before and after
SV: Orders later accesses to the same variable
Applies to Linux kernel v4.15 and later.
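As one illustration of the "a" cells in the table, the kernel-style sketch below shows the documented idiom in which smp_mb__before_atomic() provides full ordering only because an RMW atomic operation immediately follows it. The example() function and its parameters are illustrative assumptions, not code from the kernel.

    /* Kernel-style sketch: full ordering is provided only because the
     * atomic_inc() RMW operation immediately follows the barrier. */
    void example(atomic_t *counter, int *flag)
    {
    	WRITE_ONCE(*flag, 1);
    	smp_mb__before_atomic();	/* Order the flag store before... */
    	atomic_inc(counter);		/* ...this RMW atomic operation. */
    }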


Quick Quiz 15.6: Why is Table 15.3 missing smp_mb__after_unlock_lock() and smp_mb__after_spinlock()?

It is important to note that this table is just a cheat sheet, and is therefore in no way a replacement for a good understanding of memory ordering. To begin building such an understanding, the next section will present some basic rules of thumb.

15.1.3 Basic Rules of Thumb

This section presents some basic rules of thumb that are "good and sufficient" for a great many situations. In fact, you could write a great deal of concurrent code having excellent performance and scalability without needing anything more than these rules of thumb. More sophisticated rules of thumb will be presented in Section 15.6.

Quick Quiz 15.7: But how can I know that a given project can be designed and coded within the confines of these rules of thumb?

A given thread sees its own accesses in order. This rule assumes that loads and stores from/to shared variables use READ_ONCE() and WRITE_ONCE(), respectively. Otherwise, the compiler can profoundly scramble4 your code, and sometimes the CPU can do a bit of scrambling as well, as discussed in Section 15.5.4.

4 Many compiler writers prefer the word "optimize".

Ordering has conditional if-then semantics. Figure 15.3 illustrates this for memory barriers. Assuming that both memory barriers are strong enough, if CPU 1's access Y1 happens after CPU 0's access Y0, then CPU 1's access X1 is guaranteed to happen after CPU 0's access X0. When in doubt as to which memory barriers are strong enough, smp_mb() will always do the job, albeit at a price.

Figure 15.3: Memory Barriers Provide Conditional If-Then Ordering (CPU 0 executes memory reference X0, a memory barrier, then memory reference Y0; CPU 1 executes memory reference Y1, a memory barrier, then memory reference X1; given Y0 before Y1, the memory barriers guarantee X0 before X1)

Quick Quiz 15.8: How can you tell which memory barriers are strong enough for a given use case?

Listing 15.2 is a case in point. The smp_mb() on lines 10 and 19 serve as the barriers, the store to x0 on line 9 as X0, the load from x1 on line 11 as Y0, the store to x1 on line 18 as Y1, and the load from x0 on line 20 as X1. Applying the if-then rule step by step, we know that the store to x1 on line 18 happens after the load from x1 on line 11 if P0()'s local variable r2 is set to the value zero. The if-then rule would then state that the load from x0 on line 20 happens after the store to x0 on line 9. In other words, P1()'s local variable r2 is guaranteed to end up with the value two only if P0()'s local variable r2 ends up with the value zero. This underscores the point that memory ordering guarantees are conditional, not absolute.

Although Figure 15.3 specifically mentions memory barriers, this same if-then rule applies to the rest of the Linux kernel's ordering operations.

Ordering operations must be paired. If you carefully order the operations in one thread, but then fail to do so in another thread, then there is no ordering. Both threads must provide ordering for the if-then rule to apply.5

5 In Section 15.2.7.2, pairing will be generalized to cycles.
Ordering operations almost never speed things up. If you find yourself tempted to add a memory barrier in an attempt to force a prior store to be flushed to memory faster, resist! Adding ordering usually slows things down. Of course, there are situations where adding instructions speeds things up, as was shown by Figure 9.22 on page 161, but careful benchmarking is required in such cases. And even then, it is quite possible that although you sped things up a little bit on your system, you might well have slowed things down significantly on your users' systems. Or on your future system.


Ordering operations are not magic. When your program is failing due to some race condition, it is often tempting to toss in a few memory-ordering operations in an attempt to barrier your bugs out of existence. A far better reaction is to use higher-level primitives in a carefully designed manner. With concurrent programming, it is almost always better to design your bugs out of existence than to hack them down to lower probabilities.

These are only rough rules of thumb. Although these rules of thumb cover the vast majority of situations seen in actual practice, as with any set of rules of thumb, they do have their limits. The next section will demonstrate some of these limits by introducing trick-and-trap litmus tests that are intended to insult your intuition while increasing your understanding. These litmus tests will also illuminate many of the concepts represented by the Linux-kernel memory-ordering cheat sheet shown in Table 15.3, and can be automatically analyzed given proper tooling [AMM+ 18]. Section 15.6 will circle back to this cheat sheet, presenting a more sophisticated set of rules of thumb in light of learnings from all the intervening tricks and traps.

Quick Quiz 15.9: Wait!!! Where do I find this tooling that automatically analyzes litmus tests???

15.2 Tricks and Traps

Knowing where the trap is—that's the first step in evading it.

Duke Leto Atreides, "Dune", Frank Herbert

Now that you know that hardware can reorder memory accesses and that you can prevent it from doing so, the next step is to get you to admit that your intuition has a problem. This painful task is taken up by Section 15.2.1, which presents some code demonstrating that scalar variables can take on multiple values simultaneously, and by Sections 15.2.2 through 15.2.7, which show a series of intuitively correct code fragments that fail miserably on real hardware. Once your intuition has made it through the grieving process, later sections will summarize the basic rules that memory ordering follows.

But first, let's take a quick look at just how many values a single variable might have at a single point in time.

15.2.1 Variables With Multiple Values

It is natural to think of a variable as taking on a well-defined sequence of values in a well-defined, global order. Unfortunately, the next stop on the journey says "goodbye" to this comforting fiction. Hopefully, you already started to say "goodbye" in response to row 2 of Tables 15.1 and 15.2, and if so, the purpose of this section is to drive this point home.

To this end, consider the program fragment shown in Listing 15.3. This code fragment is executed in parallel by several CPUs. Line 1 sets a shared variable to the current CPU's ID, line 2 initializes several variables from a gettb() function that delivers the value of a fine-grained hardware "timebase" counter that is synchronized among all CPUs (not available from all CPU architectures, unfortunately!), and the loop from lines 3–8 records the length of time that the variable retains the value that this CPU assigned to it. Of course, one of the CPUs will "win", and would thus never exit the loop if not for the check on lines 6–7.

Listing 15.3: Software Logic Analyzer
1 state.variable = mycpu;
2 lasttb = oldtb = firsttb = gettb();
3 while (state.variable == mycpu) {
4   lasttb = oldtb;
5   oldtb = gettb();
6   if (lasttb - firsttb > 1000)
7     break;
8 }

Quick Quiz 15.10: What assumption is the code fragment in Listing 15.3 making that might not be valid on real hardware?

Upon exit from the loop, firsttb will hold a timestamp taken shortly after the assignment and lasttb will hold a timestamp taken before the last sampling of the shared variable that still retained the assigned value, or a value equal to firsttb if the shared variable had changed before entry into the loop. This allows us to plot each CPU's view of the value of state.variable over a 532-nanosecond time period, as shown in Figure 15.4. This data was collected in 2006 on a 1.5 GHz POWER5 system with 8 cores, each containing a pair of hardware threads. CPUs 1, 2, 3, and 4 recorded the values, while CPU 0 controlled the test. The timebase counter period was about 5.32 ns, sufficiently fine-grained to allow observations of intermediate cache states.
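For readers who want to experiment, the following is a rough self-contained userspace approximation of Listing 15.3 using POSIX threads, with clock_gettime() standing in for the synchronized gettb() timebase (an imperfect substitute, as the Quick Quiz hints). The thread count, the 1 ms cutoff, and the printed summary are arbitrary illustrative choices.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 4

    static volatile int state_variable;

    static long long gettb(void)	/* stand-in for a synchronized timebase */
    {
    	struct timespec ts;

    	clock_gettime(CLOCK_MONOTONIC, &ts);
    	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    static void *observer(void *arg)
    {
    	int mycpu = (int)(long)arg;
    	long long firsttb, oldtb, lasttb;

    	state_variable = mycpu;
    	lasttb = oldtb = firsttb = gettb();
    	while (state_variable == mycpu) {
    		lasttb = oldtb;
    		oldtb = gettb();
    		if (lasttb - firsttb > 1000000)	/* 1 ms cutoff */
    			break;
    	}
    	printf("Thread %d saw its own value for %lld ns\n",
    	       mycpu, lasttb - firsttb);
    	return NULL;
    }

    int main(void)
    {
    	pthread_t tid[NTHREADS];

    	for (int i = 0; i < NTHREADS; i++)
    		pthread_create(&tid[i], NULL, observer, (void *)(long)(i + 1));
    	for (int i = 0; i < NTHREADS; i++)
    		pthread_join(tid[i], NULL);
    	return 0;
    }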


CPU 1 1 2 Listing 15.4: Message-Passing Litmus Test (No Ordering)


CPU 2 2 1 C C-MP+o-wmb-o+o-o
CPU 3 3 2 2
3 {}
CPU 4 4 2
4
5 P0(int* x0, int* x1) {
6 WRITE_ONCE(*x0, 2);
0 100 200 300 400 500 (ns)
7 smp_wmb();
8 WRITE_ONCE(*x1, 2);
Figure 15.4: A Variable With Multiple Simultaneous 9 }
Values 10
11 P1(int* x0, int* x1) {
12 int r2;
13 int r3;
14
Each horizontal bar represents the observations of a 15 r2 = READ_ONCE(*x1);
given CPU over time, with the gray regions to the left 16 r3 = READ_ONCE(*x0);
17 }
indicating the time before the corresponding CPU’s first 18
measurement. During the first 5 ns, only CPU 3 has an 19 exists (1:r2=2 /\ 1:r3=0)
opinion about the value of the variable. During the next
10 ns, CPUs 2 and 3 disagree on the value of the variable,
but thereafter agree that the value is “2”, which is in fact the system! We have therefore entered a regime where we
the final agreed-upon value. However, CPU 1 believes must bid a fond farewell to comfortable intuitions about
that the value is “1” for almost 300 ns, and CPU 4 believes values of variables and the passage of time. This is the
that the value is “4” for almost 500 ns. regime where memory-ordering operations are needed.
But remember well the lessons from Chapters 3 and 6.
Quick Quiz 15.11: How could CPUs possibly have different
views of the value of a single variable at the same time? Having all CPUs store concurrently to the same variable
is no way to design a parallel program, at least not if
performance and scalability are at all important to you.
Quick Quiz 15.12: Why do CPUs 2 and 3 come to agreement
so quickly, when it takes so long for CPUs 1 and 4 to come to Unfortunately, memory ordering has many other ways
the party? of insulting your intuition, and not all of these ways
conflict with performance and scalability. The next section
And if you think that the situation with four CPUs overviews reordering of unrelated memory reference.
was intriguing, consider Figure 15.5, which shows the
same situation, but with 15 CPUs each assigning their
15.2.2 Memory-Reference Reordering
number to a single shared variable at time 𝑡 = 0. Both
diagrams in the figure are drawn in the same way as Section 15.1.1 showed that even relatively strongly ordered
Figure 15.4. The only difference is that the unit of systems like x86 can reorder prior stores with later loads,
horizontal axis is timebase ticks, with each tick lasting at least when the store and load are to different variables.
about 5.3 nanoseconds. The entire sequence therefore This section builds on that result, looking at the other
lasts a bit longer than the events recorded in Figure 15.4, combinations of loads and stores.
consistent with the increase in number of CPUs. The
upper diagram shows the overall picture, while the lower
15.2.2.1 Load Followed By Load
one zooms in on the first 50 timebase ticks. Again, CPU 0
coordinates the test, so does not record any values. Listing 15.4 (C-MP+o-wmb-o+o-o.litmus) shows the
All CPUs eventually agree on the final value of 9, but classic message-passing litmus test, where x0 is the mes-
not before the values 15 and 12 take early leads. Note sage and x1 is a flag indicating whether or not a message is
that there are fourteen different opinions on the variable’s available. In this test, the smp_wmb() forces P0() stores
value at time 21 indicated by the vertical line in the lower to be ordered, but no ordering is specified for the loads.
diagram. Note also that all CPUs see sequences whose Relatively strongly ordered architectures, such as x86, do
orderings are consistent with the directed graph shown in enforce ordering. However, weakly ordered architectures
Figure 15.6. Nevertheless, these figures underscore the often do not [AMP+ 11]. Therefore, the exists clause on
importance of proper use of memory-ordering operations. line 19 of the listing can trigger.
How many values can a single variable take on at a One rationale for reordering loads from different loca-
single point in time? As many as one per store buffer in tions is that doing so allows execution to proceed when


Figure 15.5: A Variable With More Simultaneous Values (upper panel: the full interval of about 500 timebase ticks; lower panel: the first 50 ticks)


1 Listing 15.6: Load-Buffering Litmus Test (No Ordering)


1 C C-LB+o-o+o-o
2
3 {}
7 6 4
5 P0(int *x0, int *x1)
6 {
7 int r2;
8
2 4 5 11
9 r2 = READ_ONCE(*x1);
10 WRITE_ONCE(*x0, 2);
11 }
12
10 14
13 P1(int *x0, int *x1)
14 {
15 int r2;
16
15 13 17 r2 = READ_ONCE(*x0);
18 WRITE_ONCE(*x1, 2);
19 }
20
3 12 8 21 exists (1:r2=2 /\ 0:r2=2)

Listing 15.7: Enforcing Ordering of Load-Buffering Litmus


9 Test
1 C C-LB+o-r+a-o
2
Figure 15.6: Possible Global Orders With More Simulta- 3 {}
neous Values 4
5 P0(int *x0, int *x1)
6 {
7 int r2;
an earlier load misses the cache, but the values for later 8
9 r2 = READ_ONCE(*x1);
loads are already present. 10 smp_store_release(x0, 2);
11 }
Quick Quiz 15.13: But why make load-load reordering 12
visible to the user? Why not just use speculative execution to 13 P1(int *x0, int *x1)
14 {
allow execution to proceed in the common case where there 15 int r2;
are no intervening stores, in which case the reordering cannot 16
be visible anyway? 17 r2 = smp_load_acquire(x0);
18 WRITE_ONCE(*x1, 2);
19 }
Thus, portable code relying on ordered loads must add 20

explicit ordering, for example, the smp_rmb() shown 21 exists (1:r2=2 /\ 0:r2=2)

on line 16 of Listing 15.5 (C-MP+o-wmb-o+o-rmb-


o.litmus), which prevents the exists clause from trig-
Listing 15.5: Enforcing Order of Message-Passing Litmus Test gering.
1 C C-MP+o-wmb-o+o-rmb-o
2
3 {} 15.2.2.2 Load Followed By Store
4
5 P0(int* x0, int* x1) {
6 WRITE_ONCE(*x0, 2);
Listing 15.6 (C-LB+o-o+o-o.litmus) shows the classic
7 smp_wmb(); load-buffering litmus test. Although relatively strongly
8 WRITE_ONCE(*x1, 2);
9 }
ordered systems such as x86 or the IBM Mainframe do
10 not reorder prior loads with subsequent stores, many
11 P1(int* x0, int* x1) {
12 int r2;
weakly ordered architectures really do allow such reorder-
13 int r3; ing [AMP+ 11]. Therefore, the exists clause on line 21
14
15 r2 = READ_ONCE(*x1);
really can trigger.
16 smp_rmb(); Although it is rare for actual hardware to exhibit this
17 r3 = READ_ONCE(*x0);
18 }
reordering [Mar17], one situation where it might be desir-
19 able to do so is when a load misses the cache, the store
20 exists (1:r2=2 /\ 1:r3=0)
buffer is nearly full, and the cacheline for a subsequent

v2022.09.25a
15.2. TRICKS AND TRAPS 323

Listing 15.8: Message-Passing Litmus Test, No Writer Ordering Listing 15.9: Message-Passing Address-Dependency Litmus
(No Ordering) Test (No Ordering Before v4.15)
1 C C-MP+o-o+o-rmb-o 1 C C-MP+o-wmb-o+o-ad-o
2 2
3 {} 3 {
4 4 y=1;
5 P0(int* x0, int* x1) { 5 x1=y;
6 WRITE_ONCE(*x0, 2); 6 }
7 WRITE_ONCE(*x1, 2); 7
8 } 8 P0(int* x0, int** x1) {
9 9 WRITE_ONCE(*x0, 2);
10 P1(int* x0, int* x1) { 10 smp_wmb();
11 int r2; 11 WRITE_ONCE(*x1, x0);
12 int r3; 12 }
13 13
14 r2 = READ_ONCE(*x1); 14 P1(int** x1) {
15 smp_rmb(); 15 int *r2;
16 r3 = READ_ONCE(*x0); 16 int r3;
17 } 17
18 18 r2 = READ_ONCE(*x1);
19 exists (1:r2=2 /\ 1:r3=0) 19 r3 = READ_ONCE(*r2);
20 }
21
22 exists (1:r2=x0 /\ 1:r3=1)
store is ready at hand. Therefore, portable code must
enforce any required ordering, for example, as shown
in Listing 15.7 (C-LB+o-r+a-o.litmus). The smp_ by a later memory-reference instruction. This means that
store_release() and smp_load_acquire() guaran- the exact same sequence of instructions used to traverse
tee that the exists clause on line 21 never triggers. a linked data structure in single-threaded code provides
weak but extremely useful ordering in concurrent code.
Listing 15.9 (C-MP+o-wmb-o+o-addr-o.litmus)
15.2.2.3 Store Followed By Store
shows a linked variant of the message-passing pattern.
Listing 15.8 (C-MP+o-o+o-rmb-o.litmus) once again The head pointer is x1, which initially references the int
shows the classic message-passing litmus test, with the variable y (line 5), which is in turn initialized to the value
smp_rmb() providing ordering for P1()’s loads, but with- 1 (line 4). P0() updates head pointer x1 to reference x0
out any ordering for P0()’s stores. Again, the rela- (line 11), but only after initializing it to 2 (line 9) and
tively strongly ordered architectures do enforce ordering, forcing ordering (line 10). P1() picks up the head pointer
but weakly ordered architectures do not necessarily do x1 (line 18), and then loads the referenced value (line 19).
so [AMP+ 11], which means that the exists clause can There is thus an address dependency from the load on
trigger. One situation in which such reordering could be line 18 to the load on line 19. In this case, the value
beneficial is when the store buffer is full, another store is returned by line 18 is exactly the address used by line 19,
ready to execute, but the cacheline needed by the oldest but many variations are possible, including field access
store is not yet available. In this situation, allowing stores using the C-language -> operator, addition, subtraction,
to complete out of order would allow execution to proceed. and array indexing.6
Therefore, portable code must explicitly order the stores, One might hope that line 18’s load from the head pointer
for example, as shown in Listing 15.5, thus preventing the would be ordered before line 19’s dereference, which is in
exists clause from triggering. fact the case on Linux v4.15 and later. However, prior to
v4.15, this was not the case on DEC Alpha, which could
Quick Quiz 15.14: Why should strongly ordered systems
in effect use a speculated value for the dependent load, as
pay the performance price of unnecessary smp_rmb() and
smp_wmb() invocations? Shouldn’t weakly ordered systems
described in more detail in Section 15.5.1. Therefore, on
shoulder the full cost of their misordering choices??? older versions of Linux, Listing 15.9’s exists clause can
trigger.
Listing 15.10 shows how to make this work reliably
on pre-v4.15 Linux kernels running on DEC Alpha,
15.2.3 Address Dependencies by replacing line 19’s READ_ONCE() with lockless_
An address dependency occurs when the value returned 6 But note that in the Linux kernel, the address dependency must be

by a load instruction is used to compute the address used carried through the pointer to the array, not through the array index.


Listing 15.10: Enforced Ordering of Message-Passing Address- Listing 15.12: Load-Buffering Data-Dependency Litmus Test
Dependency Litmus Test (Before v4.15) 1 C C-LB+o-r+o-data-o
1 C C-MP+o-wmb-o+ld-addr-o 2

2
3 {}
3 { 4

4 y=1; 5 P0(int *x0, int *x1)


5 x1=y; 6 {
6 } 7 int r2;
8
7
8 P0(int* x0, int** x1) { 9 r2 = READ_ONCE(*x1);
9 WRITE_ONCE(*x0, 2); 10 smp_store_release(x0, 2);
10 smp_wmb(); 11 }
11 WRITE_ONCE(*x1, x0); 12

12 } 13 P1(int *x0, int *x1)


13
14 {
14 P1(int** x1) { 15 int r2;
15 int *r2; 16

16 int r3; 17 r2 = READ_ONCE(*x0);


17
18 WRITE_ONCE(*x1, r2);
18 r2 = lockless_dereference(*x1); // Obsolete 19 }
19 r3 = READ_ONCE(*r2); 20

20 } 21 exists (1:r2=2 /\ 0:r2=2)


21
22 exists (1:r2=x0 /\ 1:r3=1)

trigger, even on DEC Alpha, even in pre-v4.15 Linux


Listing 15.11: S Address-Dependency Litmus Test kernels.
1 C C-S+o-wmb-o+o-addr-o
2 Quick Quiz 15.15: But how do we know that all platforms
3 { really avoid triggering the exists clauses in Listings 15.10
4 y=1;
5 x1=y; and 15.11?
6 }
7
8 P0(int* x0, int** x1) { Quick Quiz 15.16: SP, MP, LB, and now S. Where do all
9 WRITE_ONCE(*x0, 2); these litmus-test abbreviations come from and how can anyone
10 smp_wmb(); keep track of them?
11 WRITE_ONCE(*x1, x0);
12 }
13 However, it is important to note that address depen-
14 P1(int** x1) {
15 int *r2;
dencies can be fragile and easily broken by compiler
16 optimizations, as discussed in Section 15.3.2.
17 r2 = READ_ONCE(*x1);
18 WRITE_ONCE(*r2, 3);
19 }
20 15.2.4 Data Dependencies
21 exists (1:r2=x0 /\ x0=2)
A data dependency occurs when the value returned by
a load instruction is used to compute the data stored by
dereference(),7 which acts like READ_ONCE() on all a later store instruction. Note well the “data” above: If
platforms other than DEC Alpha, where it acts like a the value returned by a load was instead used to compute
READ_ONCE() followed by an smp_mb(), thereby forcing the address used by a later store instruction, that would
the required ordering on all platforms, in turn preventing instead be an address dependency. However, the existence
the exists clause from triggering. of data dependencies means that the exact same sequence
But what happens if the dependent operation is a of instructions used to update a linked data structure in
store rather than a load, for example, in the S litmus single-threaded code provides weak but extremely useful
test [AMP+ 11] shown in Listing 15.11 (C-S+o-wmb- ordering in concurrent code.
o+o-addr-o.litmus)? Because no production-quality Listing 15.12 (C-LB+o-r+o-data-o.litmus) is sim-
platform speculates stores, it is not possible for the WRITE_ ilar to Listing 15.7, except that P1()’s ordering between
ONCE() on line 9 to overwrite the WRITE_ONCE() on lines 17 and 18 is enforced not by an acquire load, but
line 18, meaning that the exists clause on line 21 cannot instead by a data dependency: The value loaded by line 17
7 Note that lockless_dereference() is not needed on v4.15 and is what line 18 stores. The ordering provided by this data
later, and therefore is not available in these later Linux kernels. Nor is it dependency is sufficient to prevent the exists clause
needed in versions of this book containing this sentence. from triggering.


Just as with address dependencies, data dependencies


are fragile and can be easily broken by compiler opti-
mizations, as discussed in Section 15.3.2. In fact, data
dependencies can be even more fragile than are address
dependencies. The reason for this is that address depen- Listing 15.13: Load-Buffering Control-Dependency Litmus
dencies normally involve pointer values. In contrast, as Test
shown in Listing 15.12, it is tempting to carry data depen- 1 C C-LB+o-r+o-ctrl-o
2
dencies through integral values, which the compiler has 3 {}
much more freedom to optimize into nonexistence. For 4
5 P0(int *x0, int *x1)
but one example, if the integer loaded was multiplied by 6 {
the constant zero, the compiler would know that the result 7 int r2;
8
was zero, and could therefore substitute the constant zero 9 r2 = READ_ONCE(*x1);
for the value loaded, thus breaking the dependency. 10 smp_store_release(x0, 2);
11 }
12
Quick Quiz 15.17: But wait!!! Line 17 of Listing 15.12 uses 13 P1(int *x0, int *x1)
READ_ONCE(), which marks the load as volatile, which means 14 {
that the compiler absolutely must emit the load instruction 15 int r2;
16
even if the value is later multiplied by zero. So how can the 17 r2 = READ_ONCE(*x0);
compiler possibly break this data dependency? 18 if (r2 >= 0)
19 WRITE_ONCE(*x1, 2);
20 }
In short, you can rely on data dependencies only if you 21

prevent the compiler from breaking them. 22 exists (1:r2=2 /\ 0:r2=2)

15.2.5 Control Dependencies


A control dependency occurs when the value returned by
a load instruction is tested to determine whether or not
a later store instruction is executed. In other words, a
simple conditional branch or conditional-move instruction
can act as a weak but low-overhead memory-barrier in-
struction. However, note well the “later store instruction”:
Although all platforms respect load-to-store dependen- Listing 15.14: Message-Passing Control-Dependency Litmus
cies, many platforms do not respect load-to-load control Test (No Ordering)
1 C C-MP+o-r+o-ctrl-o
dependencies. 2

Listing 15.13 (C-LB+o-r+o-ctrl-o.litmus) shows 3 {}


4
another load-buffering example, this time using a control 5 P0(int* x0, int* x1) {
dependency (line 18) to order the load on line 17 and the 6 WRITE_ONCE(*x0, 2);
7 smp_store_release(x1, 2);
store on line 19. The ordering is sufficient to prevent the 8 }
exists from triggering. 9
10 P1(int* x0, int* x1) {
However, control dependencies are even more suscep- 11 int r2;
tible to being optimized out of existence than are data 12 int r3 = 0;
13
dependencies, and Section 15.3.3 describes some of the 14 r2 = READ_ONCE(*x1);
rules that must be followed in order to prevent your com- 15 if (r2 >= 0)
16 r3 = READ_ONCE(*x0);
piler from breaking your control dependencies. 17 }
18
It is worth reiterating that control dependencies pro- 19 exists (1:r2=2 /\ 1:r3=0)
vide ordering only from loads to stores. Therefore, the
load-to-load control dependency shown on lines 14–16
of Listing 15.14 (C-MP+o-r+o-ctrl-o.litmus) does
not provide ordering, and therefore does not prevent the
exists clause from triggering.


Listing 15.15: Cache-Coherent IRIW Litmus Test CPU 0 CPU 1 CPU 2 CPU 3
1 C C-CCIRIW+o+o+o-o+o-o
2
3 {}
4
5 P0(int *x)
6 { Memory Memory
7 WRITE_ONCE(*x, 1);
8 }
9
10 P1(int *x) Figure 15.7: Global System Bus And Multi-Copy Atom-
11 {
12 WRITE_ONCE(*x, 2);
icity
13 }
14
15 P2(int *x)
16 { Listing 15.15 (C-CCIRIW+o+o+o-o+o-o.litmus)
17 int r1; shows a litmus test that tests for cache coherence, where
18 int r2;
19
“IRIW” stands for “independent reads of independent
20 r1 = READ_ONCE(*x); writes”. Because this litmus test uses only one vari-
21 r2 = READ_ONCE(*x);
22 } able, P2() and P3() must agree on the order of P0()’s
23 and P1()’s stores. In other words, if P2() believes that
24 P3(int *x)
25 { P0()’s store came first, then P3() had better not believe
26 int r3; that P1()’s store came first. And in fact the exists
27 int r4;
28
clause on line 33 will trigger if this situation arises.
29 r3 = READ_ONCE(*x);
30 r4 = READ_ONCE(*x); Quick Quiz 15.19: But in Listing 15.15, wouldn’t be just
31 } as bad if P2()’s r1 and r2 obtained the values 2 and 1,
32
33 exists(2:r1=1 /\ 2:r2=2 /\ 3:r3=2 /\ 3:r4=1)
respectively, while P3()’s r3 and r4 obtained the values 1
and 2, respectively?

It is tempting to speculate that different-sized overlap-


In summary, control dependencies can be useful, but
ping loads and stores to a single region of memory (as
they are high-maintenance items. You should therefore
might be set up using the C-language union keyword)
use them only when performance considerations permit
would provide similar ordering guarantees. However,
no other solution.
Flur et al. [FSP+ 17] discovered some surprisingly simple
Quick Quiz 15.18: Wouldn’t control dependencies be more litmus tests that demonstrate that such guarantees can
robust if they were mandated by language standards???
be violated on real hardware. It is therefore necessary
to restrict code to non-overlapping same-sized aligned
accesses to a given variable, at least if portability is a
15.2.6 Cache Coherence consideration.9
On cache-coherent platforms, all CPUs agree on the order Adding more variables and threads increases the scope
of loads and stores to a given variable. Fortunately, when for reordering and other counter-intuitive behavior, as
READ_ONCE() and WRITE_ONCE() are used, almost all discussed in the next section.
platforms are cache-coherent, as indicated by the “SV”
column of the cheat sheet shown in Table 15.3. Unfortu- 15.2.7 Multicopy Atomicity
nately, this property is so popular that it has been named
multiple times, with “single-variable SC”,8 “single-copy Threads running on a fully multicopy atomic [SF95] plat-
atomic” [SF95], and just plain “coherence” [AMP+ 11] form are guaranteed to agree on the order of stores, even
having seen use. Rather than further compound the con- to different variables. A useful mental model of such a
fusion by inventing yet another term for this concept, system is the single-bus architecture shown in Figure 15.7.
this book uses “cache coherence” and “coherence” inter- If each store resulted in a message on the bus, and if the
changeably. bus could accommodate only one store at a time, then any
9 There is reason to believe that using atomic RMW operations (for

example, xchg()) for all the stores will provide sequentially consistent
8 Recall that SC stands for sequentially consistent. ordering, but this has not yet been proven either way.


pair of CPUs would agree on the order of all stores that


they observed. Unfortunately, building a computer system
as shown in the figure, without store buffers or even caches,
would result in glacially slow computation. Most CPU
vendors interested in providing multicopy atomicity there- Listing 15.16: WRC Litmus Test With Dependencies (No
fore instead provide the slightly weaker other-multicopy Ordering)
atomicity [ARM17, Section B2.3], which excludes the 1 C C-WRC+o+o-data-o+o-rmb-o
CPU doing a given store from the requirement that all 2
3 {}
CPUs agree on the order of all stores.10 This means that 4

if only a subset of CPUs are doing stores, the other CPUs 5 P0(int *x)
6 {
will agree on the order of stores, hence the “other” in 7 WRITE_ONCE(*x, 1);
“other-multicopy atomicity”. Unlike multicopy-atomic 8 }
9
platforms, within other-multicopy-atomic platforms, the 10 P1(int *x, int* y)
CPU doing the store is permitted to observe its store 11 {
12 int r1;
early, which allows its later loads to obtain the newly 13

stored value directly from the store buffer, which improves 14 r1 = READ_ONCE(*x);
15 WRITE_ONCE(*y, r1);
performance. 16 }
17
Quick Quiz 15.20: Can you give a specific example showing 18 P2(int *x, int* y)
different behavior for multicopy atomic on the one hand and 19 {
20 int r2;
other-multicopy atomic on the other? 21 int r3;
22

Perhaps there will come a day when all platforms 23 r2 = READ_ONCE(*y);


24 smp_rmb();
provide some flavor of multi-copy atomicity, but in the 25 r3 = READ_ONCE(*x);
meantime, non-multicopy-atomic platforms do exist, and 26 }
27
so software must deal with them. 28 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
Listing 15.16 (C-WRC+o+o-data-o+o-rmb-
o.litmus) demonstrates multicopy atomicity, that is,
on a multicopy-atomic platform, the exists clause on
line 28 cannot trigger. In contrast, on a non-multicopy-
atomic platform this exists clause can trigger, despite
P1()’s accesses being ordered by a data dependency and
P2()’s accesses being ordered by an smp_rmb(). Recall
that the definition of multicopy atomicity requires that
all threads agree on the order of stores, which can be
thought of as all stores reaching all threads at the same
time. Therefore, a non-multicopy-atomic platform can Store CPU 1
CPU 0 Buffer Store CPU 3
CPU 2 Buffer
have a store reach different threads at different times. In
Cache Cache
particular, P0()’s store might reach P1() long before it
reaches P2(), which raises the possibility that P1()’s
store might reach P2() before P0()’s store does.
This leads to the question of why a real system con-
strained by the usual laws of physics would ever trigger the Memory Memory
exists clause of Listing 15.16. The cartoonish diagram
of a such a real system is shown in Figure 15.8. CPU 0 Figure 15.8: Shared Store Buffers And Multi-Copy Atom-
and CPU 1 share a store buffer, as do CPUs 2 and 3. icity
This means that CPU 1 can load a value out of the store

10 As of early 2021, Armv8 and x86 provide other-multicopy atomic-

ity, IBM mainframe provides full multicopy atomicity, and PPC provides
no multicopy atomicity at all. More detail is shown in Table 15.5.

v2022.09.25a
328 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

buffer, thus potentially immediately seeing a value stored Listing 15.17: WRC Litmus Test With Release
by CPU 0. In contrast, CPUs 2 and 3 will have to wait for 1 C C-WRC+o+o-r+a-o
2
the corresponding cache line to carry this new value to 3 {}
them. 4
5 P0(int *x)
Quick Quiz 15.21: Then who would even think of designing 6 {
7 WRITE_ONCE(*x, 1);
a system with shared store buffers??? 8 }
9
Table 15.4 shows one sequence of events that can result 10 P1(int *x, int* y)
11 {
in the exists clause in Listing 15.16 triggering. This 12 int r1;
sequence of events will depend critically on P0() and 13
14 r1 = READ_ONCE(*x);
P1() sharing both cache and a store buffer in the manner 15 smp_store_release(y, r1);
shown in Figure 15.8. 16 }
17
Quick Quiz 15.22: But just how is it fair that P0() and P1() 18 P2(int *x, int* y)
19 {
must share a store buffer and a cache, but P2() gets one each 20 int r2;
of its very own??? 21 int r3;
22
23 r2 = smp_load_acquire(y);
Row 1 shows the initial state, with the initial value of y 24 r3 = READ_ONCE(*x);
in P0()’s and P1()’s shared cache, and the initial value 25 }
26
of x in P2()’s cache. 27 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
Row 2 shows the immediate effect of P0() executing
its store on line 7. Because the cacheline containing x is
not in P0()’s and P1()’s shared cache, the new value (1) Note well that the exists clause on line 28 has trig-
is stored in the shared store buffer. gered. The values of r1 and r2 are both the value one, and
Row 3 shows two transitions. First, P0() issues a read- the final value of r3 the value zero. This strange result oc-
invalidate operation to fetch the cacheline containing x so curred because P0()’s new value of x was communicated
that it can flush the new value for x out of the shared store to P1() long before it was communicated to P2().
buffer. Second, P1() loads from x (line 14), an operation
Quick Quiz 15.23: Referring to Table 15.4, why on earth
that completes immediately because the new value of x is
would P0()’s store take so long to complete when P1()’s store
immediately available from the shared store buffer. complete so quickly? In other words, does the exists clause
Row 4 also shows two transitions. First, it shows the on line 28 of Listing 15.16 really trigger on real systems?
immediate effect of P1() executing its store to y (line 15),
placing the new value into the shared store buffer. Second, This counter-intuitive result happens because although
it shows the start of P2()’s load from y (line 23). dependencies do provide ordering, they provide it only
Row 5 continues the tradition of showing two transitions. within the confines of their own thread. This three-thread
First, it shows P1() complete its store to y, flushing from example requires stronger ordering, which is the subject
the shared store buffer to the cache. Second, it shows of Sections 15.2.7.1 through 15.2.7.4.
P2() request the cacheline containing y.
Row 6 shows P2() receive the cacheline containing y, 15.2.7.1 Cumulativity
allowing it to finish its load into r2, which takes on the
value 1. The three-thread example shown in Listing 15.16 re-
Row 7 shows P2() execute its smp_rmb() (line 24), quires cumulative ordering, or cumulativity. A cumulative
thus keeping its two loads ordered. memory-ordering operation orders not just any given ac-
Row 8 shows P2() execute its load from x, which cess preceding it, but also earlier accesses by any thread
immediately returns with the value zero from P2()’s to that same variable.
cache. Dependencies do not provide cumulativity, which is
Row 9 shows P2() finally responding to P0()’s request why the “C” column is blank for the READ_ONCE()
for the cacheline containing x, which was made way back row of Table 15.3 on page 317. However, as indi-
up on row 3. cated by the “C” in their “C” column, release opera-
Finally, row 10 shows P0() finish its store, flushing its tions do provide cumulativity. Therefore, Listing 15.17
value of x from the shared store buffer to the shared cache. (C-WRC+o+o-r+a-o.litmus) substitutes a release oper-

v2022.09.25a
15.2. TRICKS AND TRAPS 329

Table 15.4: Memory Ordering: WRC Sequence of Events

P0() P0() & P1() P1() P2()


Instruction Store Buffer Cache Instruction Instruction Store Buffer Cache
1 (Initial state) y==0 (Initial state) (Initial state) x==0
2 x = 1; x==1 y==0 x==0
3 (Read-Invalidate x) x==1 y==0 r1 = x (1) x==0
4 x==1 y==1 y==0 y = r1 r2 = y x==0
5 x==1 y==1 (Finish store) (Read y) x==0
6 (Respond y) x==1 y==1 (r2==1) x==0 y==1
7 x==1 y==1 smp_rmb() x==0 y==1
8 x==1 y==1 r3 = x (0) x==0 y==1
9 x==1 x==0 y==1 (Respond x) y==1
10 (Finish store) x==1 y==1 y==1

ation for Listing 15.16’s data dependency. Because the


release operation is cumulative, its ordering applies not
only to Listing 15.17’s load from x by P1() on line 14,
but also to the store to x by P0() on line 7—but only
if that load returns the value stored, which matches the
1:r1=1 in the exists clause on line 27. This means that
P2()’s load-acquire suffices to force the load from x on Listing 15.18: W+RWC Litmus Test With Release (No Order-
ing)
line 24 to happen after the store on line 7, so the value
1 C C-W+RWC+o-r+a-o+o-mb-o
returned is one, which does not match 2:r3=0, which in 2

turn prevents the exists clause from triggering. 3 {}


4
These ordering constraints are depicted graphically in 5 P0(int *x, int *y)
Figure 15.9. Note also that cumulativity is not limited to 6 {
7 WRITE_ONCE(*x, 1);
a single step back in time. If there was another load from 8 smp_store_release(y, 1);
x or store to x from any thread that came before the store 9 }
10
on line 7, that prior load or store would also be ordered 11 P1(int *y, int *z)
before the load on line 24, though only if both r1 and r2 12 {
13 int r1;
both end up containing the value 1. 14 int r2;
In short, use of cumulative ordering operations can sup- 15
16 r1 = smp_load_acquire(y);
press non-multicopy-atomic behaviors in some situations. 17 r2 = READ_ONCE(*z);
Cumulativity nevertheless has limits, which are examined 18 }
19
in the next section. 20 P2(int *z, int *x)
21 {
22 int r3;
23
15.2.7.2 Propagation 24 WRITE_ONCE(*z, 1);
25 smp_mb();
Listing 15.18 (C-W+RWC+o-r+a-o+o-mb-o.litmus) 26 r3 = READ_ONCE(*x);
27 }
shows the limitations of cumulativity and store-release, 28
even with a full memory barrier. The problem is that 29 exists(1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
although the smp_store_release() on line 8 has cumu-
lativity, and although that cumulativity does order P2()’s
load on line 26, the smp_store_release()’s ordering
cannot propagate through the combination of P1()’s load
(line 17) and P2()’s store (line 24). This means that the
exists clause on line 29 really can trigger.

v2022.09.25a
330 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

CPU 0

Store x=1 ... cumulativity guarantees CPU 0's store before CPU 1's store

CPU 1

... and given this link ... Load r1=x .... memory barriers guarantee this order ...

Release store
y=r1
CPU 2
Acquire load
Given this link ...
r2=y

Memory
Barrier

Load r3=x

Figure 15.9: Cumulativity

CPU 0 WRITE_ONCE(z, 1); P1() (lines 16 and 17), P2() (lines 24, 25, and 26), and
1 back to P0() (line 7). The exists clause delineates
CPU 1
0
fr z=
CPU 2 z=
this cycle: The 1:r1=1 indicates that the smp_load_
CPU 3 r1 = READ_ONCE(z) == 0;
acquire() on line 16 returned the value stored by the
smp_store_release() on line 8, the 1:r2=0 indicates
Time that the WRITE_ONCE() on line 24 came too late to affect
the value returned by the READ_ONCE() on line 17, and
Figure 15.10: Load-to-Store is Counter-Temporal
finally the 2:r3=0 indicates that the WRITE_ONCE() on
line 7 came too late to affect the value returned by the
READ_ONCE() on line 26. In this case, the fact that the
Quick Quiz 15.24: But it is not necessary to worry about
propagation unless there are at least three threads in the litmus
exists clause can trigger means that the cycle is said to
test, right? be allowed. In contrast, in cases where the exists clause
cannot trigger, the cycle is said to be prohibited.
This situation might seem completely counter-intuitive, But what if we need to keep the exists clause on line 29
but keep in mind that the speed of light is finite and of Listing 15.18? One solution is to replace P0()’s smp_
computers are of non-zero size. It therefore takes time for store_release() with an smp_mb(), which Table 15.3
the effect of the P2()’s store to z to propagate to P1(), shows to have not only cumulativity, but also propagation.
which in turn means that it is possible that P1()’s read The result is shown in Listing 15.19 (C-W+RWC+o-mb-
from z happens much later in time, but nevertheless still o+a-o+o-mb-o.litmus).
sees the old value of zero. This situation is depicted in Quick Quiz 15.25: But given that smp_mb() has the prop-
Figure 15.10: Just because a load sees the old value does agation property, why doesn’t the smp_mb() on line 25 of
not mean that this load executed at an earlier time than Listing 15.18 prevent the exists clause from triggering?
did the store of the new value.
Note that Listing 15.18 also shows the limitations of For completeness, Figure 15.11 shows that the “winning”
memory-barrier pairing, given that there are not two but store among a group of stores to the same variable is not
three processes. These more complex litmus tests can necessarily the store that started last. This should not
instead be said to have cycles, where memory-barrier come as a surprise to anyone who carefully examined
pairing is the special case of a two-thread cycle. The Figure 15.5 on page 321. One way to rationalize the
cycle in Listing 15.18 goes through P0() (lines 7 and 8), counter-temporal properties of both load-to-store and

v2022.09.25a
15.2. TRICKS AND TRAPS 331

Listing 15.19: W+WRC Litmus Test With More Barriers Listing 15.20: 2+2W Litmus Test With Write Barriers
1 C C-W+RWC+o-mb-o+a-o+o-mb-o 1 C C-2+2W+o-wmb-o+o-wmb-o
2 2
3 {} 3 {}
4 4
5 P0(int *x, int *y) 5 P0(int *x0, int *x1)
6 { 6 {
7 WRITE_ONCE(*x, 1); 7 WRITE_ONCE(*x0, 1);
8 smp_mb(); 8 smp_wmb();
9 WRITE_ONCE(*y, 1); 9 WRITE_ONCE(*x1, 2);
10 } 10 }
11 11
12 P1(int *y, int *z) 12 P1(int *x0, int *x1)
13 { 13 {
14 int r1; 14 WRITE_ONCE(*x1, 1);
15 int r2; 15 smp_wmb();
16 16 WRITE_ONCE(*x0, 2);
17 r1 = smp_load_acquire(y); 17 }
18 r2 = READ_ONCE(*z); 18
19 } 19 exists (x0=1 /\ x1=1)
20
21 P2(int *z, int *x)
22 {
23 int r3; CPU 0 WRITE_ONCE(x, 1);
1
24 CPU 1 =
25 WRITE_ONCE(*z, 1); X
26 smp_mb(); CPU 2 0 rf

r3 = READ_ONCE(*x);
=
27
CPU 3
X r1 = READ_ONCE(x);
28 }
29
30 exists(1:r1=1 /\ 1:r2=0 /\ 2:r3=0) Time

Figure 15.12: Store-to-Load is Temporal


CPU 0 WRITE_ONCE(x, 1); 1
=
CPU 1 X
0 2
= co =
CPU 2 X X Given that, are store-to-store really always counter-temporal???
CPU 3 WRITE_ONCE(x, 2);

Time But sometimes time really is on our side. Read on!


Figure 15.11: Store-to-Store is Counter-Temporal
15.2.7.3 Happens-Before

store-to-store ordering is to clearly distinguish between As shown in Figure 15.12, on platforms without user-
the temporal order in which the store instructions executed visible speculation, if a load returns the value from a
on the one hand, and the order in which the corresponding particular store, then, courtesy of the finite speed of light
cacheline visited the CPUs that executed those instructions and the non-zero size of modern computing systems, the
on the other. It is the cacheline-visitation order that defines store absolutely has to have executed at an earlier time
the externally visible ordering of the actual stores. This than did the load. This means that carefully constructed
cacheline-visitation order is not directly visible to the programs can rely on the passage of time itself as a
code executing the store instructions, which results in the memory-ordering operation.
counter-intuitive counter-temporal nature of load-to-store Of course, just the passage of time by itself is not
and store-to-store ordering.11 enough, as was seen in Listing 15.6 on page 322, which
Quick Quiz 15.26: But for litmus tests having only ordered has nothing but store-to-load links and, because it provides
stores, as shown in Listing 15.20 (C-2+2W+o-wmb-o+o-wmb- absolutely no ordering, still can trigger its exists clause.
o.litmus), research shows that the cycle is prohibited, even However, as long as each thread provides even the weakest
in weakly ordered systems such as Arm and Power [SSA+ 11]. possible ordering, exists clause would not be able to
trigger. For example, Listing 15.21 (C-LB+a-o+o-data-
11 In some hardware-multithreaded systems, the store would become o+o-data-o.litmus) shows P0() ordered with an smp_
visible to other CPUs in that same core as soon as the store reached the load_acquire() and both P1() and P2() ordered with
shared store buffer. As a result, such systems are non-multicopy atomic. data dependencies. These orderings, which are close to

v2022.09.25a
332 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.21: LB Litmus Test With One Acquire Listing 15.22: Long LB Release-Acquire Chain
1 C C-LB+a-o+o-data-o+o-data-o 1 C C-LB+a-r+a-r+a-r+a-r
2 2
3 {} 3 {}
4 4
5 P0(int *x0, int *x1) 5 P0(int *x0, int *x1)
6 { 6 {
7 int r2; 7 int r2;
8 8
9 r2 = smp_load_acquire(x0); 9 r2 = smp_load_acquire(x0);
10 WRITE_ONCE(*x1, 2); 10 smp_store_release(x1, 2);
11 } 11 }
12 12
13 P1(int *x1, int *x2) 13 P1(int *x1, int *x2)
14 { 14 {
15 int r2; 15 int r2;
16 16
17 r2 = READ_ONCE(*x1); 17 r2 = smp_load_acquire(x1);
18 WRITE_ONCE(*x2, r2); 18 smp_store_release(x2, 2);
19 } 19 }
20 20
21 P2(int *x2, int *x0) 21 P2(int *x2, int *x3)
22 { 22 {
23 int r2; 23 int r2;
24 24
25 r2 = READ_ONCE(*x2); 25 r2 = smp_load_acquire(x2);
26 WRITE_ONCE(*x0, r2); 26 smp_store_release(x3, 2);
27 } 27 }
28 28
29 exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2) 29 P3(int *x3, int *x0)
30 {
31 int r2;
32

the top of Table 15.3, suffice to prevent the exists clause 33 r2 = smp_load_acquire(x3);
34 smp_store_release(x0, 2);
from triggering. 35 }
36
Quick Quiz 15.27: Can you construct a litmus test like that 37 exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2 /\ 3:r2=2)
in Listing 15.21 that uses only dependencies?

An important use of time for ordering memory accesses if P3()’s READ_ONCE() returns zero, this cumulativity
is covered in the next section. will force the READ_ONCE() to be ordered before P0()’s
smp_store_release(). In addition, the release-acquire
15.2.7.4 Release-Acquire Chains chain (lines 8, 15, 16, 23, 24, and 32) forces P3()’s
READ_ONCE() to be ordered after P0()’s smp_store_
A minimal release-acquire chain was shown in Listing 15.7 release(). Because P3()’s READ_ONCE() cannot be
on page 322, but these chains can be much longer, as shown both before and after P0()’s smp_store_release(),
in Listing 15.22 (C-LB+a-r+a-r+a-r+a-r.litmus). either or both of two things must be true:
The longer the release-acquire chain, the more order-
ing is gained from the passage of time, so that no matter 1. P3()’s READ_ONCE() came after P0()’s WRITE_
how many threads are involved, the corresponding exists ONCE(), so that the READ_ONCE() returned the value
clause cannot trigger. two, so that the exists clause’s 3:r2=0 is false.
Although release-acquire chains are inherently store-to-
load creatures, it turns out that they can tolerate one load- 2. The release-acquire chain did not form, that is, one
to-store step, despite such steps being counter-temporal, or more of the exists clause’s 1:r2=2, 2:r2=2, or
as shown in Figure 15.10 on page 330. For example, List- 3:r1=2 is false.
ing 15.23 (C-ISA2+o-r+a-r+a-r+a-o.litmus) shows
a three-step release-acquire chain, but where P3()’s final Either way, the exists clause cannot trigger, despite
access is a READ_ONCE() from x0, which is accessed via this litmus test containing a notorious load-to-store link
WRITE_ONCE() by P0(), forming a non-temporal load-to- between P3() and P0(). But never forget that release-
store link between these two processes. However, because acquire chains can tolerate only one load-to-store link, as
P0()’s smp_store_release() (line 8) is cumulative, was seen in Listing 15.18.

v2022.09.25a
15.2. TRICKS AND TRAPS 333

Listing 15.23: Long ISA2 Release-Acquire Chain


1 C C-ISA2+o-r+a-r+a-r+a-o Listing 15.24: Long Z6.2 Release-Acquire Chain
2
3 {} 1 C C-Z6.2+o-r+a-r+a-r+a-o
2
4
5 P0(int *x0, int *x1) 3 {}
6 { 4

7 WRITE_ONCE(*x0, 2); 5 P0(int *x0, int *x1)


8 smp_store_release(x1, 2); 6 {
9 } 7 WRITE_ONCE(*x0, 2);
10
8 smp_store_release(x1, 2);
11 P1(int *x1, int *x2) 9 }
12 { 10

13 int r2; 11 P1(int *x1, int *x2)


14
12 {
15 r2 = smp_load_acquire(x1); 13 int r2;
16 smp_store_release(x2, 2); 14

17 } 15 r2 = smp_load_acquire(x1);
18
16 smp_store_release(x2, 2);
19 P2(int *x2, int *x3) 17 }
20 { 18

21 int r2; 19 P2(int *x2, int *x3)


22
20 {
23 r2 = smp_load_acquire(x2); 21 int r2;
24 smp_store_release(x3, 2); 22

25 } 23 r2 = smp_load_acquire(x2);
26
24 smp_store_release(x3, 2);
27 P3(int *x3, int *x0) 25 }
28 { 26

29 int r1; 27 P3(int *x3, int *x0)


30 int r2; 28 {
31
29 int r2;
32 r1 = smp_load_acquire(x3); 30

33 r2 = READ_ONCE(*x0); 31 r2 = smp_load_acquire(x3);
34 } 32 WRITE_ONCE(*x0, 3);
35
33 }
36 exists (1:r2=2 /\ 2:r2=2 /\ 3:r1=2 /\ 3:r2=0) 34
35 exists (1:r2=2 /\ 2:r2=2 /\ 3:r2=2 /\ x0=2)

Release-acquire chains can also tolerate a single store-


to-store step, as shown in Listing 15.24 (C-Z6.2+o-r+a-
Listing 15.25: Z6.0 Release-Acquire Chain (Ordering?)
r+a-r+a-o.litmus). As with the previous example,
1 C C-Z6.2+o-r+a-o+o-mb-o
smp_store_release()’s cumulativity combined with 2

the temporal nature of the release-acquire chain prevents 3 {}


4
the exists clause on line 35 from triggering. 5 P0(int *x, int *y)
6 {
Quick Quiz 15.28: Suppose we have a short release-acquire 7 WRITE_ONCE(*x, 1);
chain along with one load-to-store link and one store-to-store 8 smp_store_release(y, 1);
9 }
link, like that shown in Listing 15.25. Given that there is only 10
one of each type of non-store-to-load link, the exists cannot 11 P1(int *y, int *z)
12 {
trigger, right? 13 int r1;
14
But beware: Adding a second store-to-store link allows 15 r1 = smp_load_acquire(y);
16 WRITE_ONCE(*z, 1);
the correspondingly updated exists clause to trigger. To 17 }
see this, review Listings 15.26 and 15.27, which have 18
19 P2(int *z, int *x)
identical P0() and P1() processes. The only code dif- 20 {
ference is that Listing 15.27 has an additional P2() that 21 int r2;
22
does an smp_store_release() to the x2 variable that 23 WRITE_ONCE(*z, 2);
P0() releases and P1() acquires. The exists clause 24 smp_mb();
25 r2 = READ_ONCE(*x);
is also adjusted to exclude executions in which P2()’s 26 }
smp_store_release() precedes that of P1(). 27
28 exists(1:r1=1 /\ 2:r2=0 /\ z=2)
Running the litmus test in Listing 15.27 shows that the
addition of P2() can totally destroy the ordering from

v2022.09.25a
334 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

the release-acquire chain. Therefore, when constructing


release-acquire chains, please take care to construct them
properly.
Listing 15.26: A Release-Acquire Chain Ordering Multiple Quick Quiz 15.29: There are store-to-load links, load-to-
Accesses store links, and store-to-store links. But what about load-to-
1 C C-MP+o-r+a-o
2 load links?
3 {}
4 In short, properly constructed release-acquire chains
5 P0(int* x0, int* x1, int* x2) {
6 int r1; form a peaceful island of intuitive bliss surrounded by a
7 strongly counter-intuitive sea of more complex memory-
8 WRITE_ONCE(*x0, 2);
9 r1 = READ_ONCE(*x1); ordering constraints.
10 smp_store_release(x2, 2);
11 }
12
13 P1(int* x0, int* x1, int* x2) { 15.3 Compile-Time Consternation
14 int r2;
15 int r3;
16
17 r2 = smp_load_acquire(x2);
Science increases our power in proportion as it
18 WRITE_ONCE(*x1, 2); lowers our pride.
19 r3 = READ_ONCE(*x0);
20 } Claude Bernard
21
22 exists (1:r2=2 /\ (1:r3=0 \/ 0:r1=2))
Most languages, including C, were developed on unipro-
cessor systems by people with little or no parallel-
programming experience. As a results, unless explicitly
told otherwise, these languages assume that the current
CPU is the only thing that is reading or writing mem-
ory. This in turn means that these languages’ compilers’
optimizers are ready, willing, and oh so able to make
dramatic changes to the order, number, and sizes of mem-
Listing 15.27: A Release-Acquire Chain With Added Store
(Ordering?)
ory references that your program executes. In fact, the
1 C C-MPO+o-r+a-o+o reordering carried out by hardware can seem quite tame
2 by comparison.
3 {}
4
This section will help you tame your compiler, thus
5 P0(int* x0, int* x1, int* x2) { avoiding a great deal of compile-time consternation. Sec-
6 int r1;
7
tion 15.3.1 describes how to keep the compiler from
8 WRITE_ONCE(*x0, 2); destructively optimizing your code’s memory references,
9 r1 = READ_ONCE(*x1);
10 smp_store_release(x2, 2); Section 15.3.2 describes how to protect address and data
11 } dependencies, and finally, Section 15.3.3 describes how
12
13 P1(int* x0, int* x1, int* x2) { to protect those delicate control dependencies.
14 int r2;
15 int r3;
16 15.3.1 Memory-Reference Restrictions
17 r2 = smp_load_acquire(x2);
18 WRITE_ONCE(*x1, 2);
19 r3 = READ_ONCE(*x0);
As noted in Section 4.3.4, unless told otherwise, compilers
20 } assume that nothing else is affecting the variables that
21
22 P2(int* x2) {
the code is accessing. Furthermore, this assumption is
23 smp_store_release(x2, 3); not simply some design error, but is instead enshrined in
}
24
25
various standards.12 It is worth summarizing this material
26 exists (1:r2=3 /\ x2=3 /\ (1:r3=0 \/ 0:r1=2)) in preparation for the following sections.

12 Or perhaps it is a standardized design error.

v2022.09.25a
15.3. COMPILE-TIME CONSTERNATION 335

Plain accesses, as in plain-access C-language assign- accesses, as described in Section 4.3.4.4. After all, if
ment statements such as “r1 = a” or “b = 1” are sub- there are no data races, then each and every one of the
ject to the shared-variable shenanigans described in Sec- compiler optimizations mentioned above is perfectly safe.
tion 4.3.4.1. Ways of avoiding these shenanigans are But for code containing data races, this list is subject to
described in Sections 4.3.4.2–4.3.4.4 starting on page 43: change without notice as compiler optimizations continue
becoming increasingly aggressive.
1. Plain accesses can tear, for example, the compiler In short, use of READ_ONCE(), WRITE_ONCE(),
could choose to access an eight-byte pointer one barrier(), volatile, and other primitives called out
byte at a time. Tearing of aligned machine-sized in Table 15.3 on page 317 are valuable tools in preventing
accesses can be prevented by using READ_ONCE() the compiler from optimizing your parallel algorithm out
and WRITE_ONCE(). of existence. Compilers are starting to provide other mech-
anisms for avoiding load and store tearing, for example,
2. Plain loads can fuse, for example, if the results of
memory_order_relaxed atomic loads and stores, how-
an earlier load from that same object are still in a
ever, work is still needed [Cor16b]. In addition, compiler
machine register, the compiler might opt to reuse
issues aside, volatile is still needed to avoid fusing and
the value in that register instead of reloading from
invention of accesses, including C11 atomic accesses.
memory. Load fusing can be prevented by using
Please note that, it is possible to overdo use of READ_
READ_ONCE() or by enforcing ordering between the
ONCE() and WRITE_ONCE(). For example, if you have
two loads using barrier(), smp_rmb(), and other
prevented a given variable from changing (perhaps by
means shown in Table 15.3.
holding the lock guarding all updates to that variable),
3. Plain stores can fuse, so that a store can be omit- there is no point in using READ_ONCE(). Similarly, if you
ted entirely if there is a later store to that same have prevented any other CPUs or threads from reading a
variable. Store fusing can be prevented by using given variable (perhaps because you are initializing that
WRITE_ONCE() or by enforcing ordering between variable before any other CPU or thread has access to it),
the two stores using barrier(), smp_wmb(), and there is no point in using WRITE_ONCE(). However, in
other means shown in Table 15.3. my experience, developers need to use things like READ_
ONCE() and WRITE_ONCE() more often than they think
4. Plain accesses can be reordered in surprising ways that they do, and the overhead of unnecessary uses is quite
by modern optimizing compilers. This reordering low. In contrast, the penalty for failing to use them when
can be prevented by enforcing ordering as called out needed can be quite high.
above.
5. Plain loads can be invented, for example, register 15.3.2 Address- and Data-Dependency Dif-
pressure might cause the compiler to discard a previ- ficulties
ously loaded value from its register, and then reload
it later on. Invented loads can be prevented by using The low overheads of the address and data dependen-
READ_ONCE() or by enforcing ordering as called out cies discussed in Sections 15.2.3 and 15.2.4, respectively,
above between the load and a later use of its value makes their use extremely attractive. Unfortunately, com-
using barrier(). pilers do not understand either address or data dependen-
cies, although there are efforts underway to teach them,
6. Stores can be invented before a plain store, for ex- or at the very least, standardize the process of teaching
ample, by using the stored-to location as temporary them [MWB+ 17, MRP+ 17]. In the meantime, it is neces-
storage. This can be prevented by use of WRITE_ sary to be very careful in order to prevent your compiler
ONCE(). from breaking your dependencies.

Quick Quiz 15.30: Why not place a barrier() call im- 15.3.2.1 Give your dependency chain a good start
mediately before a plain store to prevent the compiler from
inventing stores? The load that heads your dependency chain must use
proper ordering, for example rcu_dereference() or
Please note that all of these shared-memory shenanigans READ_ONCE(). Failure to follow this rule can have serious
can instead be avoided by avoiding data races on plain side effects:

v2022.09.25a
336 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.28: Breakable Dependencies With Comparisons 1. Although it is permissible to compute offsets from
1 int reserve_int; a pointer, these offsets must not result in total can-
2 int *gp;
3 int *p; cellation. For example, given a char pointer cp,
4 cp-(uintptr_t)cp will cancel and can allow the
5 p = rcu_dereference(gp);
6 if (p == &reserve_int) compiler to break your dependency chain. On the
7 handle_reserve(p); other hand, canceling offset values with each other
8 do_something_with(*p); /* buggy! */
is perfectly safe and legal. For example, if a and b
are equal, cp+a-b is an identity function, including
Listing 15.29: Broken Dependencies With Comparisons preserving the dependency.
1 int reserve_int;
2 int *gp; 2. Comparisons can break dependencies. Listing 15.28
3 int *p;
4 shows how this can happen. Here global pointer gp
5 p = rcu_dereference(gp); points to a dynamically allocated integer, but if mem-
6 if (p == &reserve_int) {
7 handle_reserve(&reserve_int); ory is low, it might instead point to the reserve_int
8 do_something_with(reserve_int); /* buggy! */ variable. This reserve_int case might need spe-
9 } else {
10 do_something_with(*p); /* OK! */ cial handling, as shown on lines 6 and 7 of the listing.
11 } But the compiler could reasonably transform this
code into the form shown in Listing 15.29, espe-
cially on systems where instructions with absolute
1. On DEC Alpha, a dependent load might not be addresses run faster than instructions using addresses
ordered with the load heading the dependency chain, supplied in registers. However, there is clearly no
as described in Section 15.5.1. ordering between the pointer load on line 5 and the
dereference on line 8. Please note that this is simply
2. If the load heading the dependency chain is a C11 non-
an example: There are a great many other ways to
volatile memory_order_relaxed load, the com-
break dependency chains with comparisons.
piler could omit the load, for example, by using
a value that it loaded in the past. Quick Quiz 15.31: Why can’t you simply dereference the
pointer before comparing it to &reserve_int on line 6 of
3. If the load heading the dependency chain is a plain Listing 15.28?
load, the compiler can omit the load, again by using
a value that it loaded in the past. Worse yet, it could Quick Quiz 15.32: But it should be safe to compare two
load twice instead of once, so that different parts of pointer variables, right? After all, the compiler doesn’t know
your code use different values—and compilers really the value of either, so how can it possibly learn anything from
do this, especially when under register pressure. the comparison?

4. The value loaded by the head of the dependency Note that a series of inequality comparisons might,
chain must be a pointer. In theory, yes, you could when taken together, give the compiler enough information
load an integer, perhaps to use it as an array index. In to determine the exact value of the pointer, at which point
practice, the compiler knows too much about integers, the dependency is broken. Furthermore, the compiler
and thus has way too many opportunities to break might be able to combine information from even a single
your dependency chain [MWB+ 17]. inequality comparison with other information to learn the
exact value, again breaking the dependency. Pointers to
elements in arrays are especially susceptible to this latter
15.3.2.2 Avoid arithmetic dependency breakage form of dependency breakage.
Although it is just fine to do some arithmetic operations on
a pointer in your dependency chain, you need to be careful 15.3.2.3 Safe comparison of dependent pointers
to avoid giving the compiler too much information. After
It turns out that there are several safe ways to compare
all, if the compiler learns enough to determine the exact
dependent pointers:
value of the pointer, it can use that exact value instead of
the pointer itself. As soon as the compiler does that, the 1. Comparisons against the NULL pointer. In this case,
dependency is broken and all ordering is lost. all the compiler can learn is that the pointer is NULL,

v2022.09.25a
15.3. COMPILE-TIME CONSTERNATION 337

in which case you are not allowed to dereference it Listing 15.30: Broken Dependencies With Pointer Comparisons
anyway. 1 struct foo {
2 int a;
3 int b;
2. The dependent pointer is never dereferenced, whether 4 int c;
before or after the comparison. 5 };
6 struct foo *gp1;
7 struct foo *gp2;
3. The dependent pointer is compared to a pointer that 8

references objects that were last modified a very long 9 void updater(void)
10 {
time ago, where the only unconditionally safe value 11 struct foo *p;
of “a very long time ago” is “at compile time”. The 12
13 p = malloc(sizeof(*p));
key point is that something other than the address or 14 BUG_ON(!p);
data dependency guarantees ordering. 15 p->a = 42;
16 p->b = 43;
17 p->c = 44;
4. Comparisons between two pointers, each of which 18 rcu_assign_pointer(gp1, p);
19 WRITE_ONCE(p->b, 143);
carries an appropriate dependency. For example, you 20 WRITE_ONCE(p->c, 144);
have a pair of pointers, each carrying a dependency, 21 rcu_assign_pointer(gp2, p);
22 }
to data structures each containing a lock, and you 23
want to avoid deadlock by acquiring the locks in 24 void reader(void)
25 {
address order. 26 struct foo *p;
27 struct foo *q;
5. The comparison is not-equal, and the compiler does 28 int r1, r2 = 0;
29
not have enough other information to deduce the 30 p = rcu_dereference(gp2);
value of the pointer carrying the dependency. 31 if (p == NULL)
32 return;
33 r1 = READ_ONCE(p->b);
Pointer comparisons can be quite tricky, and so it 34 q = rcu_dereference(gp1);
is well worth working through the example shown in 35 if (p == q) {
36 r2 = READ_ONCE(p->c);
Listing 15.30. This example uses a simple struct foo 37 }
shown on lines 1–5 and two global pointers, gp1 and 38 do_something_with(r1, r2);
39 }
gp2, shown on lines 6 and 7, respectively. This example
uses two threads, namely updater() on lines 9–22 and
reader() on lines 24–39. they carry different dependencies. This means that the
The updater() thread allocates memory on line 13, compiler might well transform line 36 to instead be r2
and complains bitterly on line 14 if none is available. = q->c, which might well cause the value 44 to be loaded
Lines 15–17 initialize the newly allocated structure, and instead of the expected value 144.
then line 18 assigns the pointer to gp1. Lines 19 and 20
then update two of the structure’s fields, and does so after Quick Quiz 15.33: But doesn’t the condition in line 35
line 18 has made those fields visible to readers. Please supply a control dependency that would keep line 36 ordered
note that unsynchronized update of reader-visible fields after line 34?
often constitutes a bug. Although there are legitimate use
In short, great care is required to ensure that dependency
cases doing just this, such use cases require more care
chains in your source code are still dependency chains in
than is exercised in this example.
the compiler-generated assembly code.
Finally, line 21 assigns the pointer to gp2.
The reader() thread first fetches gp2 on line 30, with
lines 31 and 32 checking for NULL and returning if so. 15.3.3 Control-Dependency Calamities
Line 33 fetches field ->b and line 34 fetches gp1. If
line 35 sees that the pointers fetched on lines 30 and 34 are The control dependencies described in Section 15.2.5 are
equal, line 36 fetches p->c. Note that line 36 uses pointer attractive due to their low overhead, but are also especially
p fetched on line 30, not pointer q fetched on line 34. tricky because current compilers do not understand them
But this difference might not matter. An equals com- and can easily break them. The rules and examples in this
parison on line 35 might lead the compiler to (incorrectly) section are intended to help you prevent your compiler’s
conclude that both pointers are equivalent, when in fact ignorance from breaking your code.

v2022.09.25a
338 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

A load-load control dependency requires a full read 6 } else {


7 barrier();
memory barrier, not simply a data dependency barrier. 8 WRITE_ONCE(y, 1);
Consider the following bit of code: 9 do_something_else();
10 }
1 q = READ_ONCE(x);
2 if (q) {
3 <data dependency barrier> Unfortunately, current compilers will transform this as
4 q = READ_ONCE(y); follows at high optimization levels:
5 }

1 q = READ_ONCE(x);
2 barrier();
This will not have the desired effect because there is no 3 WRITE_ONCE(y, 1); /* BUG: No ordering!!! */
actual data dependency, but rather a control dependency 4 if (q) {
5 do_something();
that the CPU may short-circuit by attempting to predict 6 } else {
the outcome in advance, so that other CPUs see the load 7 do_something_else();
8 }
from y as having happened before the load from x. In
such a case what’s actually required is:
Now there is no conditional between the load from x and
1 q = READ_ONCE(x); the store to y, which means that the CPU is within its rights
2 if (q) {
3 <read barrier> to reorder them: The conditional is absolutely required,
4 q = READ_ONCE(y); and must be present in the assembly code even after all
5 }
compiler optimizations have been applied. Therefore,
if you need ordering in this example, you need explicit
However, stores are not speculated. This means that memory-ordering operations, for example, a release store:
ordering is provided for load-store control dependencies,
as in the following example: 1 q = READ_ONCE(x);
2 if (q) {
3 smp_store_release(&y, 1);
1 q = READ_ONCE(x);
4 do_something();
2 if (q)
5 } else {
3 WRITE_ONCE(y, 1);
6 smp_store_release(&y, 1);
7 do_something_else();
8 }
Control dependencies pair normally with other types
of ordering operations. That said, please note that neither
READ_ONCE() nor WRITE_ONCE() are optional! Without The initial READ_ONCE() is still required to prevent the
the READ_ONCE(), the compiler might fuse the load from x compiler from guessing the value of x. In addition, you
with other loads from x. Without the WRITE_ONCE(), the need to be careful what you do with the local variable q,
compiler might fuse the store to y with other stores to y. otherwise the compiler might be able to guess its value
Either can result in highly counter-intuitive effects on and again remove the needed conditional. For example:
ordering. 1 q = READ_ONCE(x);
Worse yet, if the compiler is able to prove (say) that 2 if (q % MAX) {
the value of variable x is always non-zero, it would be 3 WRITE_ONCE(y, 1);
4 do_something();
well within its rights to optimize the original example by 5 } else {
eliminating the “if” statement as follows: 6 WRITE_ONCE(y, 2);
7 do_something_else();
8 }
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
If MAX is defined to be 1, then the compiler knows that
It is tempting to try to enforce ordering on identical (q%MAX) is equal to zero, in which case the compiler
stores on both branches of the “if” statement as follows: is within its rights to transform the above code into the
following:
1 q = READ_ONCE(x);
2 if (q) { 1 q = READ_ONCE(x);
3 barrier(); 2 WRITE_ONCE(y, 2);
4 WRITE_ONCE(y, 1); 3 do_something_else();
5 do_something();

v2022.09.25a
15.3. COMPILE-TIME CONSTERNATION 339

Given this transformation, the CPU is not required to Listing 15.31: LB Litmus Test With Control Dependency
respect the ordering between the load from variable x and 1 C C-LB+o-cgt-o+o-cgt-o
2
the store to variable y. It is tempting to add a barrier() 3 {}
to constrain the compiler, but this does not help. The 4
5 P0(int *x, int *y)
conditional is gone, and the barrier() won’t bring it 6 {
back. Therefore, if you are relying on this ordering, you 7 int r1;
8
should make sure that MAX is greater than one, perhaps as 9 r1 = READ_ONCE(*x);
follows: 10 if (r1 > 0)
11 WRITE_ONCE(*y, 1);
12 }
1 q = READ_ONCE(x); 13
2 BUILD_BUG_ON(MAX <= 1); 14 P1(int *x, int *y)
3 if (q % MAX) { 15 {
4 WRITE_ONCE(y, 1); 16 int r2;
5 do_something(); 17
6 } else { 18 r2 = READ_ONCE(*y);
7 WRITE_ONCE(y, 2); 19 if (r2 > 0)
8 do_something_else(); 20 WRITE_ONCE(*x, 1);
9 } 21 }
22
23 exists (0:r1=1 /\ 1:r2=1)
Please note once again that the stores to y differ. If they
were identical, as noted earlier, the compiler could pull
this store outside of the “if” statement. also cannot reorder the writes to y with the condition.
You must also avoid excessive reliance on boolean Unfortunately for this line of reasoning, the compiler
short-circuit evaluation. Consider this example: might compile the two writes to y as conditional-move
instructions, as in this fanciful pseudo-assembly language:
1 q = READ_ONCE(x);
2 if (q || 1 > 0)
3 WRITE_ONCE(y, 1); 1 ld r1,x
2 cmp r1,$0
3 cmov,ne r4,$1
4 cmov,eq r4,$2
Because the first condition cannot fault and the second 5 st r4,y
condition is always true, the compiler can transform this 6 st $1,z
example as following, defeating the control dependency:
A weakly ordered CPU would have no dependency of
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1); any sort between the load from x and the store to z. The
control dependencies would extend only to the pair of cmov
This example underscores the need to ensure that the instructions and the store depending on them. In short,
compiler cannot out-guess your code. More generally, control dependencies apply only to the stores in the “then”
although READ_ONCE() does force the compiler to actually and “else” of the “if” in question (including functions
emit code for a given load, it does not force the compiler invoked by those two clauses), and not necessarily to code
to use the value loaded. following that “if”.
In addition, control dependencies apply only to the then- Finally, control dependencies do not provide cumula-
clause and else-clause of the if-statement in question. In tivity.13 This is demonstrated by two related litmus tests,
particular, it does not necessarily apply to code following namely Listings 15.31 and 15.32 with the initial values
the if-statement: of x and y both being zero.
The exists clause in the two-thread example of
1 q = READ_ONCE(x); Listing 15.31 (C-LB+o-cgt-o+o-cgt-o.litmus) will
2 if (q) {
3 WRITE_ONCE(y, 1); never trigger. If control dependencies guaranteed cumu-
4 } else { lativity (which they do not), then adding a thread to the
5 WRITE_ONCE(y, 2);
6 } example as in Listing 15.32 (C-WWC+o-cgt-o+o-cgt-
7 WRITE_ONCE(z, 1); /* BUG: No ordering. */ o+o.litmus) would guarantee the related exists clause
never to trigger.
It is tempting to argue that there in fact is ordering
because the compiler cannot reorder volatile accesses and 13 Refer to Section 15.2.7.1 for the meaning of cumulativity.

v2022.09.25a
340 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.32: WWC Litmus Test With Control Dependency is needed, precede both of them with smp_mb() or
(Cumulativity?) use smp_store_release(). Please note that it is
1 C C-WWC+o-cgt-o+o-cgt-o+o not sufficient to use barrier() at beginning of each
2
3 {} leg of the “if” statement because, as shown by the
4
5 P0(int *x, int *y)
example above, optimizing compilers can destroy the
6 { control dependency while respecting the letter of the
7 int r1; barrier() law.
8
9 r1 = READ_ONCE(*x);
10 if (r1 > 0) 4. Control dependencies require at least one run-time
11 WRITE_ONCE(*y, 1); conditional between the prior load and the subsequent
12 }
13 store, and this conditional must involve the prior load.
14 P1(int *x, int *y) If the compiler is able to optimize the conditional
15 {
16 int r2; away, it will have also optimized away the ordering.
17
18 r2 = READ_ONCE(*y);
Careful use of READ_ONCE() and WRITE_ONCE()
19 if (r2 > 0) can help to preserve the needed conditional.
20 WRITE_ONCE(*x, 1);
21 } 5. Control dependencies require that the compiler
22
23 P2(int *x) avoid reordering the dependency into nonexistence.
24 { Careful use of READ_ONCE(), atomic_read(), or
25 WRITE_ONCE(*x, 2);
26 } atomic64_read() can help to preserve your control
27
28 exists (0:r1=2 /\ 1:r2=1 /\ x=2)
dependency.
6. Control dependencies apply only to the “then” and
“else” of the “if” containing the control dependency,
But because control dependencies do not provide cu- including any functions that these two clauses call.
mulativity, the exists clause in the three-thread litmus Control dependencies do not apply to code following
test can trigger. If you need the three-thread example to the end of the “if” statement containing the control
provide ordering, you will need smp_mb() between the dependency.
load and store in P0(), that is, just before or just after
the “if” statements. Furthermore, the original two-thread 7. Control dependencies pair normally with other types
example is very fragile and should be avoided. of memory-ordering operations.
Quick Quiz 15.34: Can’t you instead add an smp_mb() to 8. Control dependencies do not provide cumulativity. If
P1() in Listing 15.32?
you need cumulativity, use something that provides
The following list of rules summarizes the lessons of it, such as smp_store_release() or smp_mb().
this section:
Again, many popular languages were designed with
1. Compilers do not understand control dependencies, single-threaded use in mind. Successful multithreaded use
so it is your job to make sure that the compiler cannot of these languages requires you to pay special attention to
break your code. your memory references and dependencies.

2. Control dependencies can order prior loads against


later stores. However, they do not guarantee any 15.4 Higher-Level Primitives
other sort of ordering: Not prior loads against later
loads, nor prior stores against later anything. If you Method will teach you to win time.
need these other forms of ordering, use smp_rmb(),
smp_wmb(), or, in the case of prior stores and later Johann Wolfgang von Goethe
loads, smp_mb().
The answer to one of the quick quizzes in Section 12.3.1
3. If both legs of the “if” statement begin with iden- demonstrated exponential speedups due to verifying pro-
tical stores to the same variable, then the control grams modeled at higher levels of abstraction. This section
dependency will not order those stores, If ordering will look into how higher levels of abstraction can also

v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 341

provide a deeper understanding of the synchronization write before the unlock to be reordered with a read after the
primitives themselves. Section 15.4.1 takes a look at mem- lock?
ory allocation, Section 15.4.2 examines the surprisingly
Therefore, the ordering required by conventional uses of
varied semantics of locking, and Section 15.4.3 digs more
memory allocation can be provided solely by non-fastpath
deeply into RCU.
locking, allowing the fastpath to remain synchronization-
free.
15.4.1 Memory Allocation
15.4.2 Locking
Section 6.4.3.2 touched upon memory allocation, and
this section expands upon the relevant memory-ordering Locking is a well-known synchronization primitive with
issues. which the parallel-programming community has had
The key requirement is that any access executed on a decades of experience. As such, locking’s semantics
given block of memory before freeing that block must be are quite simple.
ordered before any access executed after that same block That is, they are quite simple until you start trying to
is reallocated. It would after all be a cruel and unusual mathematically model them.
memory-allocator bug if a store preceding the free were to The simple part is that any CPU or thread holding a
be reordered after another store following the reallocation! given lock is guaranteed to see any accesses executed by
However, it would also be cruel and unusual to require CPUs or threads while they were previously holding that
developers to use READ_ONCE() and WRITE_ONCE() to same lock. Similarly, any CPU or thread holding a given
access dynamically allocated memory. Full ordering must lock is guaranteed not to see accesses that will be executed
therefore be provided for plain accesses, in spite of all the by other CPUs or threads while subsequently holding that
shared-variable shenanigans called out in Section 4.3.4.1. same lock. And what else is there?
As it turns out, quite a bit:
Of course, each CPU sees its own accesses in order and
the compiler always has fully accounted for intra-CPU 1. Are CPUs, threads, or compilers allowed to pull
shenanigans. These facts are what enables the lockless fast- memory accesses into a given lock-based critical
paths in memblock_alloc() and memblock_free(), section?
which are shown in Listings 6.10 and 6.11, respectively.
However, this is also why the developer is responsible 2. Will a CPU or thread holding a given lock also see
for providing appropriate ordering (for example, by using accesses executed by CPUs and threads before they
smp_store_release()) when publishing a pointer to last acquired that same lock, and vice versa?
a newly allocated block of memory. After all, in the
3. Suppose that a given CPU or thread executes one
CPU-local case, the allocator has not necessarily provided
access (call it “A”), releases a lock, reacquires that
any ordering.
same lock, then executes another access (call it “B”).
However, the allocator must provide ordering when Is some other CPU or thread not holding that lock
rebalancing its per-thread pools. This ordering is guaranteed to see A and B in order?
provided by the calls to spin_lock() and spin_
unlock() from memblock_alloc() and memblock_ 4. As above, but with the lock reacquisition carried out
free(). For any block that has migrated from one by some other CPU or thread?
thread to another, the old thread will have executed spin_
5. As above, but with the lock reacquisition being some
unlock(&globalmem.mutex) after placing the block in
other lock?
the globalmem pool, and the new thread will have exe-
cuted spin_lock(&globalmem.mutex) before moving 6. What ordering guarantees are provided by spin_
that block to its per-thread pool. This spin_unlock() is_locked()?
and spin_lock() ensures that both the old and new
threads see the old thread’s accesses as having happened The reaction to some or even all of these questions
before those of the new thread. might well be “Why would anyone do that?” However,
any complete mathematical definition of locking must
Quick Quiz 15.35: But doesn’t PowerPC have weak unlock-
have answers to all of these questions. Therefore, the
lock ordering properties within the Linux kernel, allowing a
following sections address these questions.

v2022.09.25a
342 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.33: Prior Accesses Into Critical Section means that neither spin_lock() nor spin_unlock()
1 C Lock-before-into are required to act as a full memory barrier.
2
3 {} However, other environments might make other choices.
4
For example, locking implementations that run only on
5 P0(int *x, int *y, spinlock_t *sp)
6 { the x86 CPU family will have lock-acquisition primitives
7 int r1; that fully order the lock acquisition with any prior and
8
9 WRITE_ONCE(*x, 1); any subsequent accesses. Therefore, on such systems the
10 spin_lock(sp); ordering shown in Listing 15.33 comes for free. There
11 r1 = READ_ONCE(*y);
12 spin_unlock(sp); are x86 lock-release implementations that are weakly
13 } ordered, thus failing to provide the ordering shown in
14
15 P1(int *x, int *y) Listing 15.34, but an implementation could nevertheless
16 { choose to guarantee this ordering.
17 int r1;
18 For their part, weakly ordered systems might well
19 WRITE_ONCE(*y, 1); choose to execute the memory-barrier instructions re-
20 smp_mb();
21 r1 = READ_ONCE(*x); quired to guarantee both orderings, possibly simpli-
22 } fying code making advanced use of combinations of
23
24 exists (0:r1=0 /\ 1:r1=0) locked and lockless accesses. However, as noted earlier,
LKMM chooses not to provide these additional order-
ings, in part to avoid imposing performance penalties on
Listing 15.34: Subsequent Accesses Into Critical Section
1 C Lock-after-into
the simpler and more prevalent locking use cases. In-
2 stead, the smp_mb__after_spinlock() and smp_mb__
3 {}
4
after_unlock_lock() primitives are provided for those
5 P0(int *x, int *y, spinlock_t *sp) more complex use cases, as discussed in Section 15.5.
6 {
7 int r1;
Thus far, this section has discussed only hardware
8 reordering. Can the compiler also reorder memory refer-
9 spin_lock(sp);
10 WRITE_ONCE(*x, 1);
ences into lock-based critical sections?
11 spin_unlock(sp); The answer to this question in the context of the Linux
12 r1 = READ_ONCE(*y);
13 }
kernel is a resounding “No!” One reason for this other-
14 wise inexplicable favoring of hardware reordering over
15 P1(int *x, int *y)
16 {
compiler optimizations is that the hardware will avoid
17 int r1; reordering a page-faulting access into a lock-based crit-
18
19 WRITE_ONCE(*y, 1);
ical section. In contrast, compilers have no clue about
20 smp_mb(); page faults, and would therefore happily reorder a page
21 r1 = READ_ONCE(*x);
22 }
fault into a critical section, which could crash the kernel.
23 The compiler is also unable to reliably determine which
24 exists (0:r1=0 /\ 1:r1=0)
accesses will result in cache misses, so that compiler re-
ordering into critical sections could also result in excessive
lock contention. Therefore, the Linux kernel prohibits the
15.4.2.1 Accesses Into Critical Sections? compiler (but not the CPU) from moving accesses into
lock-based critical sections.
Can memory accesses be reordered into lock-based critical
sections?
15.4.2.2 Accesses Outside of Critical Section?
Within the context of the Linux-kernel memory model,
the simple answer is “yes”. This may be verified by If a given CPU or thread holds a given lock, it is guaranteed
running the litmus tests shown in Listings 15.33 and 15.34 to see accesses executed during all prior critical sections
(C-Lock-before-into.litmus and C-Lock-after- for that same lock. Similarly, such a CPU or thread is
into.litmus, respectively), both of which will yield the guaranteed not to see accesses that will be executed during
Sometimes result. This result indicates that the exists all subsequent critical sections for that same lock.
clause can be satisfied, that is, that the final value of But what about accesses preceding prior critical sections
both P0()’s and P1()’s r1 variable can be zero. This and following subsequent critical sections?

v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 343

Listing 15.35: Accesses Outside of Critical Sections Listing 15.36: Accesses Between Same-CPU Critical Sections
1 C Lock-outside-across 1 C Lock-across-unlock-lock-1
2 2
3 {} 3 {}
4 4
5 P0(int *x, int *y, spinlock_t *sp) 5 P0(int *x, int *y, spinlock_t *sp)
6 { 6 {
7 int r1; 7 int r1;
8 8
9 WRITE_ONCE(*x, 1); 9 spin_lock(sp);
10 spin_lock(sp); 10 WRITE_ONCE(*x, 1);
11 r1 = READ_ONCE(*y); 11 spin_unlock(sp);
12 spin_unlock(sp); 12 spin_lock(sp);
13 } 13 r1 = READ_ONCE(*y);
14 14 spin_unlock(sp);
15 P1(int *x, int *y, spinlock_t *sp) 15 }
16 { 16
17 int r1; 17 P1(int *x, int *y, spinlock_t *sp)
18 18 {
19 spin_lock(sp); 19 int r1;
20 WRITE_ONCE(*y, 1); 20
21 spin_unlock(sp); 21 WRITE_ONCE(*y, 1);
22 r1 = READ_ONCE(*x); 22 smp_mb();
23 } 23 r1 = READ_ONCE(*x);
24 24 }
25 exists (0:r1=0 /\ 1:r1=0) 25
26 exists (0:r1=0 /\ 1:r1=0)

This question can be answered for the Linux kernel by


is “no”, and that CPUs can reorder accesses across con-
referring to Listing 15.35 (C-Lock-outside-across.
secutive critical sections. In other words, not only are
litmus). Running this litmus test yields the Sometimes
spin_lock() and spin_unlock() weaker than a full
result, which means that accesses in code leading up to
barrier when considered separately, they are also weaker
a prior critical section is also visible to the current CPU
than a full barrier when taken together.
or thread holding that same lock. Similarly, code that is
If the ordering of a given lock’s critical sections are to
placed after a subsequent critical section is never visible
be observed, then either the observer must hold that lock
to the current CPU or thread holding that same lock.
on the one hand or either smp_mb__after_spinlock()
As a result, the Linux kernel cannot allow accesses to
or smp_mb__after_unlock_lock() must be executed
be moved across the entirety of a given critical section.
just after the second lock acquisition on the other.
Other environments might well wish to allow such code
But what if the two critical sections run on different
motion, but please be advised that doing so is likely to
CPUs or threads?
yield profoundly counter-intuitive results.
This question is answered for the Linux kernel by
In short, the ordering provided by spin_lock() ex-
referring to Listing 15.37 (C-Lock-across-unlock-
tends not only throughout the critical section, but also
lock-2.litmus), in which the first lock acquisition is
indefinitely beyond the end of that critical section. Simi-
executed by P0() and the second lock acquisition is
larly, the ordering provided by spin_unlock() extends
executed by P1(). Note that P1() must read x to reject
not only throughout the critical section, but also indefi-
executions in which P1() executes before P0() does.
nitely beyond the beginning of that critical section.
Running this litmus test shows that the exists can be
satisfied, which means that the answer is “no”, and that
15.4.2.3 Ordering for Non-Lock Holders? CPUs can reorder accesses across consecutive critical
sections, even if each of those critical sections runs on a
Does a CPU or thread that is not holding a given lock see
different CPU or thread.
that lock’s critical sections as being ordered?
This question can be answered for the Linux kernel by Quick Quiz 15.36: But if there are three critical sections,
isn’t it true that CPUs not holding the lock will observe the
referring to Listing 15.36 (C-Lock-across-unlock-
accesses from the first and the third critical section as being
lock-1.litmus), which shows an example where P(0) ordered?
places its write and read in two different critical sections
for the same lock. Running this litmus test shows that As before, if the ordering of a given lock’s critical
the exists can be satisfied, which means that the answer sections are to be observed, then either the observer must

v2022.09.25a
344 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.37: Accesses Between Different-CPU Critical Sec- concluded that spin_is_locked() should be used only
tions for debugging. Part of the reason for this is that even a fully
1 C Lock-across-unlock-lock-2
2
ordered spin_is_locked() might return true because
3 {} some other CPU or thread was just about to release the
4
5 P0(int *x, spinlock_t *sp) lock in question. In this case, there is little that can be
6 { learned from that return value of true, which means
7 spin_lock(sp);
8 WRITE_ONCE(*x, 1); that reliable use of spin_is_locked() is surprisingly
9 spin_unlock(sp); complex. Other approaches almost always work better,
10 }
11
for example, use of explicit shared variables or the spin_
12 P1(int *x, int *y, spinlock_t *sp) trylock() primitive.
13 {
14 int r1; This situation resulted in the current state, namely that
15 int r2; spin_is_locked() provides no ordering guarantees,
16
17 spin_lock(sp); except that if it returns false, the current CPU or thread
18 r1 = READ_ONCE(*x); cannot be holding the corresponding lock.
19 r2 = READ_ONCE(*y);
20 spin_unlock(sp);
21 }
Quick Quiz 15.37: But if spin_is_locked() returns
22 false, don’t we also know that no other CPU or thread is
23 P2(int *x, int *y, spinlock_t *sp) holding the corresponding lock?
24 {
25 int r1;
26
27 WRITE_ONCE(*y, 1);
28 smp_mb(); 15.4.2.5 Why Mathematically Model Locking?
29 r1 = READ_ONCE(*x);
30 }
31 Given all these possible choices, why model locking in
32 exists (1:r1=1 /\ 1:r2=0 /\ 2:r1=0) general? Why not simply model a simple implementation?
One reason is modeling performance, as shown in
Table E.4 on page 534. Directly modeling locking in
hold that lock or either smp_mb__after_spinlock() general is orders of magnitude faster than emulating even
or smp_mb__after_unlock_lock() must be executed a trivial implementation. This should be no surprise, given
just after P1()’s lock acquisition. the combinatorial explosion experienced by present-day
Given that ordering is not guaranteed when both crit- formal-verification tools with increases in the number of
ical sections are protected by the same lock, there is no memory accesses executed by the code being modeled.
hope of any ordering guarantee when different locks are
Another reason is that a trivial implementation might
used. However, readers are encouraged to construct the
needlessly constrain either real implementations or real
corresponding litmus test and see this for themselves.
use cases. In contrast, modeling a platonic lock allows
This situation can seem counter-intuitive, but it is rare
the widest variety of implementations while providing
for code to care. This approach also allows certain weakly
specific guidance to locks’ users.
ordered systems to implement more efficient locks.

15.4.2.4 Ordering for spin_is_locked()? 15.4.3 RCU


The Linux kernel’s spin_is_locked() primitive returns As described in Section 9.5.2, the fundamental property
true if the specified lock is held and false otherwise. of RCU grace periods is this straightforward two-part
Note that spin_is_locked() returns true when some guarantee: (1) If any part of a given RCU read-side
other CPU or thread holds the lock, not just when the cur- critical section precedes the beginning of a given grace
rent CPU or thread holds that lock. This raises the question period, then the entirety of that critical section precedes
of what ordering guarantees spin_is_locked() might the end of that grace period. (2) If any part of a given RCU
provide. read-side critical section follows the end of a given grace
In the Linux kernel, the answer has varied over time. period, then the entirety of that critical section follows
Initially, spin_is_locked() was unordered, but a few the beginning of that grace period. These guarantees are
interesting use cases motivated strong ordering. Later summarized in Figure 15.13, where the grace period is
discussions surrounding the Linux-kernel memory model denoted by the dashed arrow between the call_rcu()

v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 345

If happens before ... Listing 15.39: RCU Fundamental Property and Reordering
1 C C-SB+o-rcusync-o+i-rl-o-o-rul
2
rcu_read_lock() call_rcu()
3 {}
4
rcu_read_unlock() 5 P0(uintptr_t *x0, uintptr_t *x1)
6 {

... then happens before


7 WRITE_ONCE(*x0, 2);
8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1);
... then happens before

10 }
11
12 P1(uintptr_t *x0, uintptr_t *x1)
13 {
14 rcu_read_lock();
15 uintptr_t r2 = READ_ONCE(*x0);
16 WRITE_ONCE(*x1, 2);
17 rcu_read_unlock();
18 }
19
rcu_read_lock() 20 exists (1:r2=0 /\ 0:r2=0)

callback invocation rcu_read_unlock()


running herd on this litmus test. Note that this guarantee
If happens before ... is insensitive to the ordering of the accesses within P1()’s
critical section, so the litmus test shown in Listing 15.3915
Figure 15.13: RCU Grace-Period Ordering Guarantees also forbids this same cycle.
However, this definition is incomplete, as can be seen
Listing 15.38: RCU Fundamental Property from the following list of questions:16
1 C C-SB+o-rcusync-o+rl-o-o-rul
2 1. What ordering is provided by rcu_read_lock()
3 {}
4
and rcu_read_unlock(), independent of RCU
5 P0(uintptr_t *x0, uintptr_t *x1) grace periods?
6 {
7 WRITE_ONCE(*x0, 2); 2. What ordering is provided by synchronize_rcu()
8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1); and synchronize_rcu_expedited(), indepen-
10 } dent of RCU read-side critical sections?
11
12 P1(uintptr_t *x0, uintptr_t *x1)
13 { 3. If the entirety of a given RCU read-side critical
14 rcu_read_lock(); section precedes the end of a given RCU grace period,
15 WRITE_ONCE(*x1, 2);
16 uintptr_t r2 = READ_ONCE(*x0); what about accesses preceding that critical section?
17 rcu_read_unlock();
18 } 4. If the entirety of a given RCU read-side critical
19
20 exists (1:r2=0 /\ 0:r2=0) section follows the beginning of a given RCU grace
period, what about accesses following that critical
section?
invocation in the upper right and the corresponding RCU
5. What happens in situations involving more than one
callback invocation in the lower left.14
RCU read-side critical section and/or more than one
In short, an RCU read-side critical section is guaran-
RCU grace period?
teed never to completely overlap an RCU grace period,
as demonstrated by Listing 15.38 (C-SB+o-rcusync- 6. What happens when RCU is mixed with other
o+rl-o-o-rul.litmus). Either or neither of the r2 memory-ordering mechanisms?
registers can have the final value of zero, but at least one
of them must be non-zero (that is, the cycle identified These questions are addressed in the following sections.
by the exists clause is prohibited), courtesy of RCU’s 15 Dependencies can of course limit the ability to reorder accesses

fundamental grace-period guarantee, as can be seen by within RCU read-side critical sections.
16 Several of which were introduced to Paul by Jade Alglave during

early work on LKMM, and a few more of which came from other LKMM
14 For more detail, please see Figures 9.11–9.13 starting on page 146. participants [AMM+ 18].

v2022.09.25a
346 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.40: RCU Readers Provide No Lock-Like Ordering Listing 15.42: RCU Updaters Provide Full Ordering
1 C C-LB+rl-o-o-rul+rl-o-o-rul 1 C C-SB+o-rcusync-o+o-rcusync-o
2 2
3 {} 3 {}
4 4
5 P0(uintptr_t *x0, uintptr_t *x1) 5 P0(uintptr_t *x0, uintptr_t *x1)
6 { 6 {
7 rcu_read_lock(); 7 WRITE_ONCE(*x0, 2);
8 uintptr_t r1 = READ_ONCE(*x0); 8 synchronize_rcu();
9 WRITE_ONCE(*x1, 1); 9 uintptr_t r2 = READ_ONCE(*x1);
10 rcu_read_unlock(); 10 }
11 } 11
12 12 P1(uintptr_t *x0, uintptr_t *x1)
13 P1(uintptr_t *x0, uintptr_t *x1) 13 {
14 { 14 WRITE_ONCE(*x1, 2);
15 rcu_read_lock(); 15 synchronize_rcu();
16 uintptr_t r1 = READ_ONCE(*x1); 16 uintptr_t r2 = READ_ONCE(*x0);
17 WRITE_ONCE(*x0, 1); 17 }
18 rcu_read_unlock(); 18
19 } 19 exists (1:r2=0 /\ 0:r2=0)
20
21 exists (0:r1=1 /\ 1:r1=1)

read_lock() and rcu_read_unlock() are no-ops in


Listing 15.41: RCU Readers Provide No Barrier-Like Ordering the QSBR implementation of RCU.
1 C C-LB+o-rl-rul-o+o-rl-rul-o
2
3 {}
4
15.4.3.2 RCU Update-Side Ordering
5 P0(uintptr_t *x0, uintptr_t *x1)
6 { In contrast with RCU readers, the RCU update-side func-
7 uintptr_t r1 = READ_ONCE(*x0);
8 rcu_read_lock();
tions synchronize_rcu() and synchronize_rcu_
9 rcu_read_unlock(); expedited() provide memory ordering at least as strong
10 WRITE_ONCE(*x1, 1);
11 }
as smp_mb(), as can be seen by running herd on the
12 litmus test shown in Listing 15.42. This test’s cycle is pro-
13 P1(uintptr_t *x0, uintptr_t *x1)
14 {
hibited, just as it would with smp_mb(). This should be
15 uintptr_t r1 = READ_ONCE(*x1); no surprise given the information presented in Table 15.3.
16 rcu_read_lock();
17 rcu_read_unlock();
18 WRITE_ONCE(*x0, 1);
19 } 15.4.3.3 RCU Readers: Before and After
20
21 exists (0:r1=1 /\ 1:r1=1) Before reading this section, it would be well to reflect
on the distinction between guarantees that are available
and guarantees that maintainable software should rely on.
15.4.3.1 RCU Read-Side Ordering Keeping that firmly in mind, this section presents a few of
the more exotic RCU guarantees.
On their own, RCU’s read-side primitives rcu_read_ Listing 15.43 (C-SB+o-rcusync-o+o-rl-o-rul.
lock() and rcu_read_unlock() provide no ordering litmus) shows a litmus test similar to that in Listing 15.38,
whatsoever. In particular, despite their names, they do but with the RCU reader’s first access preceding the RCU
not act like locks, as can be seen in Listing 15.40 (C- read-side critical section, rather than the more conven-
LB+rl-o-o-rul+rl-o-o-rul.litmus). This litmus tional (and maintainable!) approach of being contained
test’s cycle is allowed: Both instances of the r1 register within it. Perhaps surprisingly, running herd on this lit-
can have final values of 1. mus test gives the same result as for that in Listing 15.38:
Nor do these primitives have barrier-like ordering prop- The cycle is forbidden.
erties, at least not unless there is a grace period in the Why would this be the case?
mix, as can be seen in Listing 15.41 (C-LB+o-rl-rul- Because both of P1()’s accesses are volatile, as dis-
o+o-rl-rul-o.litmus). This litmus test’s cycle is also cussed in Section 4.3.4.2, the compiler is not permit-
allowed. (Try it!) ted to reorder them. This means that the code emitted
Of course, lack of ordering in both these litmus tests for P1()’s WRITE_ONCE() will precede that of P1()’s
should be absolutely no surprise, given that both rcu_ READ_ONCE(). Therefore, RCU implementations that

v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 347

Listing 15.43: What Happens Before RCU Readers? Listing 15.45: What Happens With Empty RCU Readers?
1 C C-SB+o-rcusync-o+o-rl-o-rul 1 C C-SB+o-rcusync-o+o-rl-rul-o
2 2
3 {} 3 {}
4 4
5 P0(uintptr_t *x0, uintptr_t *x1) 5 P0(uintptr_t *x0, uintptr_t *x1)
6 { 6 {
7 WRITE_ONCE(*x0, 2); 7 WRITE_ONCE(*x0, 2);
8 synchronize_rcu(); 8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1); 9 uintptr_t r2 = READ_ONCE(*x1);
10 } 10 }
11 11
12 P1(uintptr_t *x0, uintptr_t *x1) 12 P1(uintptr_t *x0, uintptr_t *x1)
13 { 13 {
14 WRITE_ONCE(*x1, 2); 14 WRITE_ONCE(*x1, 2);
15 rcu_read_lock(); 15 rcu_read_lock();
16 uintptr_t r2 = READ_ONCE(*x0); 16 rcu_read_unlock();
17 rcu_read_unlock(); 17 uintptr_t r2 = READ_ONCE(*x0);
18 } 18 }
19 19
20 exists (1:r2=0 /\ 0:r2=0) 20 exists (1:r2=0 /\ 0:r2=0)

Listing 15.44: What Happens After RCU Readers? Listing 15.44 (C-SB+o-rcusync-o+rl-o-rul-o.
1 C C-SB+o-rcusync-o+rl-o-rul-o
2
litmus) is similar, but instead looks at accesses after
3 {} the RCU read-side critical section. This test’s cycle is
4
5 P0(uintptr_t *x0, uintptr_t *x1)
also forbidden, as can be checked with the herd tool. The
6 { reasoning is similar to that for Listing 15.43, and is left as
7 WRITE_ONCE(*x0, 2);
8 synchronize_rcu();
an exercise for the reader.
9 uintptr_t r2 = READ_ONCE(*x1); Listing 15.45 (C-SB+o-rcusync-o+o-rl-rul-o.
10 }
11
litmus) takes things one step farther, moving P1()’s
12 P1(uintptr_t *x0, uintptr_t *x1) WRITE_ONCE() to precede the RCU read-side critical
13 {
14 rcu_read_lock(); section and moving P1()’s READ_ONCE() to follow it,
15 WRITE_ONCE(*x1, 2); resulting in an empty RCU read-side critical section.
16 rcu_read_unlock();
17 uintptr_t r2 = READ_ONCE(*x0); Perhaps surprisingly, despite the empty critical section,
18 } RCU nevertheless still manages to forbid the cycle. This
19
20 exists (1:r2=0 /\ 0:r2=0) can again be checked using the herd tool. Furthermore,
the reasoning is once again similar to that for Listing 15.43,
Recapping, if P1()’s WRITE_ONCE() follows the end of
a given grace period, then P1()’s RCU read-side critical
place memory-barrier instructions in rcu_read_lock() section—and everything following it—must follow the
and rcu_read_unlock() will preserve the ordering of beginning of that same grace period. Similarly, if P1()’s
P1()’s two accesses all the way down to the hardware READ_ONCE() precedes the beginning of a given grace
level. On the other hand, RCU implementations that rely period, then P1()’s RCU read-side critical section—and
on interrupt-based state machines will also fully preserve everything preceding it—must precede the end of that
this ordering relative to the grace period due to the fact that same grace period. In both cases, the critical section’s
interrupts take place at a precise location in the execution emptiness is irrelevant.
of the interrupted code.
Quick Quiz 15.38: Wait a minute! In QSBR implementations
This in turn means that if the WRITE_ONCE() follows of RCU, no code is emitted for rcu_read_lock() and rcu_
the end of a given RCU grace period, then the accesses read_unlock(). This means that the RCU read-side critical
within and following that RCU read-side critical section section in Listing 15.45 isn’t just empty, it is completely
must follow the beginning of that same grace period. nonexistent!!! So how can something that doesn’t exist at all
Similarly, if the READ_ONCE() precedes the beginning of possibly have any effect whatsoever on ordering???
the grace period, everything within and preceding that
critical section must precede the end of that same grace This situation leads to the question of what hap-
period. pens if rcu_read_lock() and rcu_read_unlock() are

v2022.09.25a
348 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.46: What Happens With No RCU Readers? Listing 15.47: One RCU Grace Period and Two Readers
1 C C-SB+o-rcusync-o+o-o 1 C C-SB+o-rcusync-o+rl-o-o-rul+rl-o-o-rul
2 2
3 {} 3 {}
4 4
5 P0(uintptr_t *x0, uintptr_t *x1) 5 P0(uintptr_t *x0, uintptr_t *x1)
6 { 6 {
7 WRITE_ONCE(*x0, 2); 7 WRITE_ONCE(*x0, 2);
8 synchronize_rcu(); 8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1); 9 uintptr_t r2 = READ_ONCE(*x1);
10 } 10 }
11 11
12 P1(uintptr_t *x0, uintptr_t *x1) 12 P1(uintptr_t *x1, uintptr_t *x2)
13 { 13 {
14 WRITE_ONCE(*x1, 2); 14 rcu_read_lock();
15 uintptr_t r2 = READ_ONCE(*x0); 15 WRITE_ONCE(*x1, 2);
16 } 16 uintptr_t r2 = READ_ONCE(*x2);
17 17 rcu_read_unlock();
18 exists (1:r2=0 /\ 0:r2=0) 18 }
19
20 P2(uintptr_t *x2, uintptr_t *x0)
21 {
omitted entirely, as shown in Listing 15.46 (C-SB+o- 22 rcu_read_lock();
23 WRITE_ONCE(*x2, 2);
rcusync-o+o-o.litmus). As can be checked with 24 uintptr_t r2 = READ_ONCE(*x0);
herd, this litmus test’s cycle is allowed, that is, both 25 rcu_read_unlock();
26 }
instances of r2 can have final values of zero. 27

This might seem strange in light of the fact that empty 28 exists (2:r2=0 /\ 0:r2=0 /\ 1:r2=0)

RCU read-side critical sections can provide ordering. And


it is true that QSBR implementations of RCU would in
fact forbid this outcome, due to the fact that preemption Is it possible to say anything general about which RCU-
would be disabled across the entirety of P1()’s function protected litmus tests will be prohibited and which will
body, so that P1() would run within an implicit RCU be allowed? This section takes up that question.
read-side critical section. However, RCU also has non- More specifically, what if the litmus test has one RCU
QSBR implementations, and the kernels running these grace period and two RCU readers, as shown in List-
implementations are preemptible, which means there is ing 15.47? The herd tool says that this cycle is allowed,
no implied RCU read-side critical section, and in turn no but it would be good to know why.17
way for RCU to enforce ordering. Therefore, this litmus The key point is that the CPU is free to reorder P1()’s
test’s cycle is allowed. and P2()’s WRITE_ONCE() and READ_ONCE(). With that
reordering, Figure 15.14 shows how the cycle forms:
Quick Quiz 15.39: Can P1()’s accesses be reordered in the
litmus tests shown in Listings 15.43, 15.44, and 15.45 in the 1. P0()’s read from x1 precedes P1()’s write, as de-
same way that they were reordered going from Listing 15.38 picted by the dashed arrow near the bottom of the
to Listing 15.39? diagram.
2. Because P1()’s write follows the end of P0()’s
15.4.3.4 Multiple RCU Readers and Updaters grace period, P1()’s read from x2 cannot precede
the beginning of P0()’s grace period.
Because synchronize_rcu() has stronger ordering se-
mantics than does smp_mb(), no matter how many pro- 3. P1()’s read from x2 precedes P2()’s write.
cesses there are in an SB litmus test (such as Listing 15.42),
placing synchronize_rcu() between each process’s ac- 4. Because P2()’s write to x2 precedes the end of
cesses prohibits the cycle. In addition, the cycle is prohib- P0()’s grace period, it is completely legal for P2()’s
ited in an SB test where one process uses synchronize_ read from x0 to precede the beginning of P0()’s
rcu() and the other uses rcu_read_lock() and rcu_ grace period.
read_unlock(), as shown by Listing 15.38. However, if
both processes use rcu_read_lock() and rcu_read_ 17 Especially given that Paul changed his mind several times about
unlock(), the cycle will be allowed, as shown by List- this particular litmus test when working with Jade Alglave to generalize
ing 15.40. RCU ordering semantics.

v2022.09.25a
15.4. HIGHER-LEVEL PRIMITIVES 349

P0() P1() P2()

rcu_read_lock();

r2 = READ_ONCE(x0);
WRITE_ONCE(x0, 2);

synchronize_rcu();
rcu_read_lock();

r2 = READ_ONCE(x2);
WRITE_ONCE(x2, 2);

rcu_read_unlock();

r2 = READ_ONCE(x1);
WRITE_ONCE(x1, 2);

rcu_read_unlock();

Figure 15.14: Cycle for One RCU Grace Period and Two RCU Readers

P0() P1() P2() P3()

WRITE_ONCE(x0, 2);

synchronize_rcu(); rcu_read_lock();

r2 = READ_ONCE(x0);

r2 = READ_ONCE(x1);
WRITE_ONCE(x1, 2);

synchronize_rcu(); rcu_read_lock();

r2 = READ_ONCE(x3);
WRITE_ONCE(x3, 2);

rcu_read_unlock();

r2 = READ_ONCE(x2);
WRITE_ONCE(x2, 2);

rcu_read_unlock();

Figure 15.15: No Cycle for Two RCU Grace Periods and Two RCU Readers

v2022.09.25a
350 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Listing 15.48: Two RCU Grace Periods and Two Readers there are at least as many RCU grace periods as there are
1 C C-SB+o-rcusync-o+o-rcusync-o+rl-o-o-rul+rl-o-o-rul RCU read-side critical sections, the cycle is forbidden.18
2
3 {}
4
5 P0(uintptr_t *x0, uintptr_t *x1) 15.4.3.5 RCU and Other Ordering Mechanisms
6 {
7 WRITE_ONCE(*x0, 2); But what about litmus tests that combine RCU with other
8 synchronize_rcu();
9 uintptr_t r2 = READ_ONCE(*x1);
ordering mechanisms?
10 } The general rule is that it takes only one mechanism to
11
12 P1(uintptr_t *x1, uintptr_t *x2)
forbid a cycle.
13 { For example, refer back to Listing 15.40. Applying
14 WRITE_ONCE(*x1, 2);
15 synchronize_rcu();
the general rule from the previous section, because this
16 uintptr_t r2 = READ_ONCE(*x2); litmus test has two RCU read-side critical sections and
17 }
18
no RCU grace periods, the cycle is allowed. But what
19 P2(uintptr_t *x2, uintptr_t *x3) if P0()’s WRITE_ONCE() is replaced by an smp_store_
20 {
21 rcu_read_lock();
release() and P1()’s READ_ONCE() is replaced by an
22 WRITE_ONCE(*x2, 2); smp_load_acquire()?
23 uintptr_t r2 = READ_ONCE(*x3);
24 rcu_read_unlock();
RCU would still allow the cycle, but the release-acquire
25 } pair would forbid it. Because it only takes one mechanism
26
27 P3(uintptr_t *x0, uintptr_t *x3)
to forbid a cycle, the release-acquire pair would prevail so
28 { that the cycle would be forbidden.
29 rcu_read_lock();
30 WRITE_ONCE(*x3, 2);
For another example, refer back to Listing 15.47. Be-
31 uintptr_t r2 = READ_ONCE(*x0); cause this litmus test has two RCU readers but only one
32 rcu_read_unlock();
33 }
grace period, its cycle is allowed. But suppose that an
34 smp_mb() was placed between P1()’s pair of accesses.
35 exists (3:r2=0 /\ 0:r2=0 /\ 1:r2=0 /\ 2:r2=0)
In this new litmus test, because of the addition of the smp_
mb(), P2()’s as well as P1()’s critical sections would
5. Therefore, P2()’s read from x0 can precede P0()’s extend beyond the end of P0()’s grace period, which in
write, thus allowing the cycle to form. turn would prevent P2()’s read from x0 from preceding
P0()’s write, as depicted by the red dashed arrow in Fig-
But what happens when another grace period is added? ure 15.16. In this case, RCU and the full memory barrier
This situation is shown in Listing 15.48, an SB litmus work together to forbid the cycle, with RCU preserving
test in which P0() and P1() have RCU grace periods ordering between P0() and both P1() and P2(), and
and P2() and P3() have RCU readers. Again, the CPUs with the smp_mb() preserving ordering between P1()
can reorder the accesses within RCU read-side critical and P2().
sections, as shown in Figure 15.15. For this cycle to form, Quick Quiz 15.40: What would happen if the smp_mb()
P2()’s critical section must end after P1()’s grace period was instead added between P2()’s accesses in Listing 15.47?
and P3()’s must end after the beginning of that same
grace period, which happens to also be after the end of
P0()’s grace period. Therefore, P3()’s critical section In short, where RCU’s semantics were once purely prag-
must start after the beginning of P0()’s grace period, matic, they are now fully formalized [MW05, DMS+ 12,
which in turn means that P3()’s read from x0 cannot GRY13, AMM+ 18].
possibly precede P0()’s write. Therefore, the cycle is It is hoped that detailed semantics for higher-level
forbidden because RCU read-side critical sections cannot primitives will enable more capable static analysis and
span full RCU grace periods. model checking.
However, a closer look at Figure 15.15 makes it clear
that adding a third reader would allow the cycle. This
is because this third reader could end before the end of 18 Interestingly enough, Alan Stern proved that within the context
P0()’s grace period, and thus start before the beginning of of LKMM, the two-part fundamental property of RCU expressed in
that same grace period. This in turn suggests the general Section 9.5.2 actually implies this seemingly more general result, which
rule, which is: In these sorts of RCU-only litmus tests, if is called the RCU axiom [AMM+ 18].

v2022.09.25a
15.5. HARDWARE SPECIFICS 351

P0() P1() P2()

WRITE_ONCE(x0, 2);

synchronize_rcu(); rcu_read_lock();

r2 = READ_ONCE(x0);

rcu_read_lock();
r2 = READ_ONCE(x1);
WRITE_ONCE(x1, 2);

smp_mb();

r2 = READ_ONCE(x2);

WRITE_ONCE(x2, 2);
rcu_read_unlock();
rcu_read_unlock();

Figure 15.16: Cycle for One RCU Grace Period, Two RCU Readers, and Memory Barrier

15.5 Hardware Specifics and which is explained in more detail in Section 15.5.1.
The short version is that Alpha requires memory barriers
for readers as well as updaters of linked data structures,
Rock beats paper! however, these memory barriers are provided by the Alpha
Derek Williams architecture-specific code in v4.15 and later Linux kernels.
The next row, “Non-Sequentially Consistent”, indicates
Each CPU family has its own peculiar approach to memory whether the CPU’s normal load and store instructions are
ordering, which can make portability a challenge, as constrained by sequential consistency. Almost all are not
indicated by Table 15.5. constrained in this way for performance reasons.
In fact, some software environments simply prohibit The next two rows cover multicopy atomicity, which
direct use of memory-ordering operations, restricting the was defined in Section 15.2.7. The first is full-up (and
programmer to mutual-exclusion primitives that incor- rare) multicopy atomicity, and the second is the weaker
porate them to the extent that they are required. Please other-multicopy atomicity.
note that this section is not intended to be a reference The next row, “Non-Cache Coherent”, covers accesses
manual covering all (or even most) aspects of each CPU from multiple threads to a single variable, which was
family, but rather a high-level overview providing a rough discussed in Section 15.2.6.
comparison. For full details, see the reference manual for The final three rows cover instruction-level choices and
the CPU of interest. issues. The first row indicates how each CPU implements
Getting back to Table 15.5, the first group of rows load-acquire and store-release, the second row classifies
look at memory-ordering properties and the second group CPUs by atomic-instruction type, and the third and final
looks at instruction properties. row indicates whether a given CPU has an incoherent
The first three rows indicate whether a given CPU al- instruction cache and pipeline. Such CPUs require special
lows the four possible combinations of loads and stores instructions be executed for self-modifying code.
to be reordered, as discussed in Section 15.1 and Sec- The common “just say no” approach to memory-
tions 15.2.2.1–15.2.2.3. The next row (“Atomic Instruc- ordering operations can be eminently reasonable where
tions Reordered With Loads or Stores?”) indicates whether it applies, but there are environments, such as the Linux
a given CPU allows loads and stores to be reordered with kernel, where direct use of memory-ordering operations
atomic instructions. is required. Therefore, Linux provides a carefully cho-
The fifth and sixth rows cover reordering and depen- sen least-common-denominator set of memory-ordering
dencies, which was covered in Sections 15.2.3–15.2.5 primitives, which are as follows:

v2022.09.25a
352 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

Table 15.5: Summary of Memory Ordering

CPU Family

SPARC TSO
Armv7-A/R

z Systems
POWER
Itanium
Armv8
Alpha

MIPS

x86
Property

Memory Ordering Loads Reordered After Loads or Stores? Y Y Y Y Y Y


Stores Reordered After Stores? Y Y Y Y Y Y
Stores Reordered After Loads? Y Y Y Y Y Y Y Y Y
Atomic Instructions Reordered With
Y Y Y Y Y
Loads or Stores?
Dependent Loads Reordered? Y
Dependent Stores Reordered?
Non-Sequentially Consistent? Y Y Y Y Y Y Y Y Y
Non-Multicopy Atomic? Y Y Y Y Y Y Y Y
Non-Other-Multicopy Atomic? Y Y Y Y Y
Non-Cache Coherent? Y

Instructions Load-Acquire/Store-Release? F F i I F b
Atomic RMW Instruction Type? L L L C L L C C C
Incoherent Instruction Cache/Pipeline? Y Y Y Y Y Y Y Y Y

Key: Load-Acquire/Store-Release?
b: Lightweight memory barrier
F: Full memory barrier
i: Instruction with lightweight ordering
I: Instruction with heavyweight ordering
Atomic RMW Instruction Type?
C: Compare-and-exchange instruction
L: Load-linked/store-conditional instruction

smp_mb() (full memory barrier) that orders both loads smp_mb__after_atomic() that forces ordering of ac-
and stores. This means that loads and stores preced- cesses preceding an earlier RMW atomic operation
ing the memory barrier will be committed to memory against accesses following the smp_mb__after_
before any loads and stores following the memory atomic(). This is also a noop on systems that fully
barrier. order atomic RMW operations.

smp_rmb() (read memory barrier) that orders only loads. smp_mb__after_spinlock() that forces ordering of
accesses preceding a lock acquisition against ac-
smp_wmb() (write memory barrier) that orders only cesses following the smp_mb__after_spinlock().
stores. This is also a noop on systems that fully order lock
acquisitions.
smp_mb__before_atomic() that forces ordering of ac-
cesses preceding the smp_mb__before_atomic() mmiowb() that forces ordering on MMIO writes that
against accesses following a later RMW atomic op- are guarded by global spinlocks, and is more
eration. This is a noop on systems that fully order thoroughly described in a 2016 LWN article on
atomic RMW operations. MMIO [MDR16].

v2022.09.25a
15.5. HARDWARE SPECIFICS 353

The smp_mb(), smp_rmb(), and smp_wmb() primitives Listing 15.49: Insert and Lock-Free Search (No Ordering)
also force the compiler to eschew any optimizations that 1 struct el *insert(long key, long data)
2 {
would have the effect of reordering memory optimizations 3 struct el *p;
across the barriers. 4 p = kmalloc(sizeof(*p), GFP_ATOMIC);
5 spin_lock(&mutex);
Quick Quiz 15.41: What happens to code between an atomic 6 p->next = head.next;
operation and an smp_mb__after_atomic()? 7 p->key = key;
8 p->data = data;
9 smp_store_release(&head.next, p);
These primitives generate code only in SMP kernels, 10 spin_unlock(&mutex);
11 }
however, several have UP versions (mb(), rmb(), and 12
wmb(), respectively) that generate a memory barrier even 13 struct el *search(long searchkey)
14 {
in UP kernels. The smp_ versions should be used in most 15 struct el *p;
cases. However, these latter primitives are useful when 16 p = READ_ONCE_OLD(head.next);
17 while (p != &head) {
writing drivers, because MMIO accesses must remain 18 /* Prior to v4.15, BUG ON ALPHA!!! */
ordered even in UP kernels. In absence of memory- 19 if (p->key == searchkey) {
20 return (p);
ordering operations, both CPUs and compilers would 21 }
happily rearrange these accesses, which at best would 22 p = READ_ONCE_OLD(p->next);
23 };
make the device act strangely, and could crash your kernel 24 return (NULL);
or even damage your hardware. 25 }
So most kernel programmers need not worry about the
memory-ordering peculiarities of each and every CPU, as
long as they stick to these interfaces. If you are working traces of this accommodation were removed in v5.9 with
deep in a given CPU’s architecture-specific code, of course, the removal of the smp_read_barrier_depends() and
all bets are off. read_barrier_depends() APIs. This section is nev-
Furthermore, all of Linux’s locking primitives (spin- ertheless retained in the Second Edition because here in
locks, reader-writer locks, semaphores, RCU, . . .) include early 2021 there are quite a few Linux kernel hackers still
any needed ordering primitives. So if you are working working on pre-v4.15 versions of the Linux kernel. In ad-
with code that uses these primitives properly, you need dition, the modifications to READ_ONCE() that permitted
not worry about Linux’s memory-ordering primitives. these APIs to be removed have not necessarily propagated
That said, deep knowledge of each CPU’s memory- to all userspace projects that might still support Alpha.
consistency model can be very helpful when debugging, The dependent-load difference between Alpha and the
to say nothing of when writing architecture-specific code other CPUs is illustrated by the code shown in List-
or synchronization primitives. ing 15.49. This smp_store_release() guarantees that
Besides, they say that a little knowledge is a very the element initialization in lines 6–8 is executed before
dangerous thing. Just imagine the damage you could do the element is added to the list on line 9, so that the
with a lot of knowledge! For those who wish to understand lock-free search will work correctly. That is, it makes this
more about individual CPUs’ memory consistency models, guarantee on all CPUs except Alpha.
the next sections describe those of a few popular and Given the pre-v4.15 implementation of READ_ONCE(),
prominent CPUs. Although there is no substitute for indicated by READ_ONCE_OLD() in the listing, Alpha
actually reading a given CPU’s documentation, these actually allows the code on line 19 of Listing 15.49 to
sections do give a good overview. see the old garbage values that were present before the
initialization on lines 6–8.
Figure 15.17 shows how this can happen on an aggres-
15.5.1 Alpha
sively parallel machine with partitioned caches, so that
It may seem strange to say much of anything about a CPU alternating cache lines are processed by the different parti-
whose end of life has long since passed, but Alpha is inter- tions of the caches. For example, the load of head.next
esting because it is the only mainstream CPU that reorders on line 16 of Listing 15.49 might access cache bank 0, and
dependent loads, and has thus had outsized influence on the load of p->key on line 19 and of p->next on line 22
concurrency APIs, including within the Linux kernel. The might access cache bank 1. On Alpha, the smp_store_
need for core Linux-kernel code to accommodate Alpha release() will guarantee that the cache invalidations
ended with version v4.15 of the Linux kernel, and all performed by lines 6–8 of Listing 15.49 (for p->next,

v2022.09.25a
354 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

p->data = key; Listing 15.50: Safe Insert and Lock-Free Search


smp_wmb(); p = READ_ONCE_OLD(head.next);
1 struct el *insert(long key, long data)
head.next = p; BUG_ON(p && p->key != key);
2 {
3 struct el *p;
Writing CPU Core Reading CPU Core 4 p = kmalloc(sizeof(*p), GFP_ATOMIC);
Cache 5 spin_lock(&mutex);
Cache Cache Cache
Bank 0 Bank 1 6 p->next = head.next;
Bank 0 Bank 1 (Idle) (Busy) 7 p->key = key;
8 p->data = data;
9 smp_store_release(&head.next, p);
head.next p->key
10 spin_unlock(&mutex);
p->data
11 }
p->next
12
13 struct el *search(long searchkey)
14 {
15 struct el *p;
16 p = rcu_dereference(head.next);
17 while (p != &head) {
Figure 15.17: Why smp_read_barrier_depends() is 18 if (p->key == searchkey) {
Required in Pre-v4.15 Linux Kernels 19 return (p);
20 }
21 p = rcu_dereference(p->next);
22 };
23 return (NULL);
p->key, and p->data) will reach the interconnect before 24 }
that of line 9 (for head.next), but makes absolutely no
guarantee about the order of propagation through the read-
ing CPU’s cache banks. For example, it is possible that the
Listing 15.50, which works safely and efficiently for all
reading CPU’s cache bank 1 is very busy, but cache bank 0
recent kernel versions.
is idle. This could result in the cache invalidations for
the new element (p->next, p->key, and p->data) being It is also possible to implement a software mechanism
delayed, so that the reading CPU loads the new value for that could be used in place of smp_store_release() to
head.next, but loads the old cached values for p->key force all reading CPUs to see the writing CPU’s writes
and p->next. Yes, this does mean that Alpha can in effect in order. This software barrier could be implemented
fetch the data pointed to before it fetches the pointer itself, by sending inter-processor interrupts (IPIs) to all other
strange but true. See the documentation [Com01, Pug00] CPUs. Upon receipt of such an IPI, a CPU would execute a
called out earlier for more information, or if you think memory-barrier instruction, implementing a system-wide
that I am just making all this up.19 The benefit of this memory barrier similar to that provided by the Linux
unusual approach to ordering is that Alpha can use sim- kernel’s sys_membarrier() system call. Additional
pler cache hardware, which in turn permitted higher clock logic is required to avoid deadlocks. Of course, CPUs
frequencies in Alpha’s heyday. that respect data dependencies would define such a barrier
One could place an smp_rmb() primitive between the to simply be smp_store_release(). However, this
pointer fetch and dereference in order to force Alpha approach was deemed by the Linux community to impose
to order the pointer fetch with the later dependent load. excessive overhead [McK01], and to their point would be
However, this imposes unneeded overhead on systems completely inappropriate for systems having aggressive
(such as Arm, Itanium, PPC, and SPARC) that respect data real-time response requirements.
dependencies on the read side. A smp_read_barrier_ The Linux memory-barrier primitives took their names
depends() primitive was therefore added to the Linux from the Alpha instructions, so smp_mb() is mb, smp_
kernel to eliminate overhead on these systems, but was rmb() is rmb, and smp_wmb() is wmb. Alpha is the only
removed in v5.9 of the Linux kernel in favor of augmenting CPU whose READ_ONCE() includes an smp_mb().
Alpha’s definition of READ_ONCE(). Thus, as of v5.9,
core kernel code no longer needs to concern itself with Quick Quiz 15.42: Why does Alpha’s READ_ONCE() include
an mb() rather than rmb()?
this aspect of DEC Alpha. However, it is better to use
rcu_dereference() as shown on lines 16 and 21 of
Quick Quiz 15.43: Isn’t DEC Alpha significant as having
19 Of course, the astute reader will have already recognized that
the weakest possible memory ordering?
Alpha is nowhere near as mean and nasty as it could be, the (thankfully)
mythical architecture in Appendix C.6.1 being a case in point. For more on Alpha, see its reference manual [Cor02].

v2022.09.25a
15.5. HARDWARE SPECIFICS 355

15.5.2 Armv7-A/R
The Arm family of CPUs is extremely popular in em-
bedded applications, particularly for power-constrained
applications such as cellphones. Its memory model is sim-
ilar to that of POWER (see Section 15.5.6), but Arm uses
a different set of memory-barrier instructions [ARM10]:
LDLAR
DMB (data memory barrier) causes the specified type
of operations to appear to have completed before
any subsequent operations of the same type. The
“type” of operations can be all operations or can be
restricted to only writes (similar to the Alpha wmb
and the POWER eieio instructions). In addition,
Arm allows cache coherence to have one of three
scopes: Single processor, a subset of the processors
(“inner”) and global (“outer”). Figure 15.18: Half Memory Barrier
DSB (data synchronization barrier) causes the specified
type of operations to actually complete before any 6 ISB();
subsequent operations (of any type) are executed. 7 r3 = z;

The “type” of operations is the same as that of DMB.


The DSB instruction was called DWB (drain write In this example, load-store control dependency ordering
buffer or data write barrier, your choice) in early causes the load from x on line 1 to be ordered before the
versions of the Arm architecture. store to y on line 4. However, Arm does not respect
load-load control dependencies, so that the load on line 1
ISB (instruction synchronization barrier) flushes the CPU might well happen after the load on line 5. On the other
pipeline, so that all instructions following the ISB are hand, the combination of the conditional branch on line 2
fetched only after the ISB completes. For example, if and the ISB instruction on line 6 ensures that the load on
you are writing a self-modifying program (such as a line 7 happens after the load on line 1. Note that inserting
JIT), you should execute an ISB between generating an additional ISB instruction somewhere between lines 2
the code and executing it. and 5 would enforce ordering between lines 1 and 5.
None of these instructions exactly match the semantics
of Linux’s rmb() primitive, which must therefore be 15.5.3 Armv8
implemented as a full DMB. The DMB and DSB instructions
Arm’s Armv8 CPU family [ARM17] includes 64-bit capa-
have a recursive definition of accesses ordered before and
bilities, in contrast to their 32-bit-only CPU described in
after the barrier, which has an effect similar to that of
Section 15.5.2. Armv8’s memory model closely resembles
POWER’s cumulativity, both of which are stronger than
its Armv7 counterpart, but adds load-acquire (LDLARB,
LKMM’s cumulativity described in Section 15.2.7.1.
LDLARH, and LDLAR) and store-release (STLLRB, STLLRH,
Arm also implements control dependencies, so that if
and STLLR) instructions. These instructions act as “half
a conditional branch depends on a load, then any store
memory barriers”, so that Armv8 CPUs can reorder pre-
executed after that conditional branch will be ordered after
vious accesses with a later LDLAR instruction, but are
the load. However, loads following the conditional branch
prohibited from reordering an earlier LDLAR instruction
will not be guaranteed to be ordered unless there is an ISB
with later accesses, as fancifully depicted in Figure 15.18.
instruction between the branch and the load. Consider the
Similarly, Armv8 CPUs can reorder an earlier STLLR
following example:
instruction with a subsequent access, but are prohibited
1 r1 = x; from reordering previous accesses with a later STLLR
2 if (r1 == 0) instruction. As one might expect, this means that these in-
3 nop();
4 y = 1;
structions directly support the C11 notion of load-acquire
5 r2 = z; and store-release.

v2022.09.25a
356 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

However, Armv8 goes well beyond the C11 memory Quick Quiz 15.44: Given that hardware can have a half mem-
model by mandating that the combination of a store-release ory barrier, why don’t locking primitives allow the compiler to
and load-acquire act as a full barrier under certain circum- move memory-reference instructions into lock-based critical
stances. For example, in Armv8, given a store followed sections?
by a store-release followed a load-acquire followed by a
The Itanium mf instruction is used for the smp_rmb(),
load, all to different variables and all from a single CPU,
smp_mb(), and smp_wmb() primitives in the Linux ker-
all CPUs would agree that the initial store preceded the
nel. Despite persistent rumors to the contrary, the “mf”
final load. Interestingly enough, most TSO architectures
mnemonic stands for “memory fence”.
(including x86 and the mainframe) do not make this guar-
Itanium also offers a global total order for release op-
antee, as the two loads could be reordered before the two
erations, including the mf instruction. This provides the
stores.
notion of transitivity, where if a given code fragment sees
Armv8 is one of only two architectures that needs
a given access as having happened, any later code frag-
the smp_mb__after_spinlock() primitive to be a full
ment will also see that earlier access as having happened.
barrier, due to its relatively weak lock-acquisition imple-
Assuming, that is, that all the code fragments involved
mentation in the Linux kernel.
correctly use memory barriers.
Armv8 also has the distinction of being the first CPU
Finally, Itanium is the only architecture supporting the
whose vendor publicly defined its memory ordering with
Linux kernel that can reorder normal loads to the same
an executable formal model [ARM17].
variable. The Linux kernel avoids this issue because
READ_ONCE() emits a volatile load, which is compiled
15.5.4 Itanium as a ld,acq instruction, which forces ordering of all
READ_ONCE() invocations by a given CPU, including
Itanium offers a weak consistency model, so that in absence
those to the same variable.
of explicit memory-barrier instructions or dependencies,
Itanium is within its rights to arbitrarily reorder mem-
ory references [Int02a]. Itanium has a memory-fence 15.5.5 MIPS
instruction named mf, but also has “half-memory fence”
The MIPS memory model [Wav16, page 479] appears
modifiers to loads, stores, and to some of its atomic
to resemble that of Arm, Itanium, and POWER, being
instructions [Int02b]. The acq modifier prevents subse-
weakly ordered by default, but respecting dependencies.
quent memory-reference instructions from being reordered
MIPS has a wide variety of memory-barrier instructions,
before the acq, but permits prior memory-reference in-
but ties them not to hardware considerations, but rather
structions to be reordered after the acq, similar to the
to the use cases provided by the Linux kernel and the
Armv8 load-acquire instructions. Similarly, the rel mod-
C++11 standard [Smi19] in a manner similar to the Armv8
ifier prevents prior memory-reference instructions from
additions:
being reordered after the rel, but allows subsequent
memory-reference instructions to be reordered before the
SYNC
rel.
Full barrier for a number of hardware operations
These half-memory fences are useful for critical sec-
in addition to memory references, which is used to
tions, since it is safe to push operations into a critical
implement the v4.13 Linux kernel’s smp_mb() for
section, but can be fatal to allow them to bleed out. How-
OCTEON systems.
ever, as one of the few CPUs with this property, Itanium at
one time defined Linux’s semantics of memory ordering SYNC_WMB
associated with lock acquisition and release.20 Oddly Write memory barrier, which can be used on
enough, actual Itanium hardware is rumored to implement OCTEON systems to implement the smp_wmb()
both load-acquire and store-release instructions as full bar- primitive in the v4.13 Linux kernel via the syncw
riers. Nevertheless, Itanium was the first mainstream CPU mnemonic. Other systems use plain sync.
to introduce the concept (if not the reality) of load-acquire
and store-release into its instruction set. SYNC_MB
Full memory barrier, but only for memory operations.
This may be used to implement the C++ atomic_
20 PowerPC is now the architecture with this dubious privilege. thread_fence(memory_order_seq_cst).

v2022.09.25a
15.5. HARDWARE SPECIFICS 357

SYNC_ACQUIRE to implement load-acquire and store-release opera-


Acquire memory barrier, which could be used to im- tions. Interestingly enough, the lwsync instruction
plement C++’s atomic_thread_fence(memory_ enforces the same within-CPU ordering as does x86,
order_acquire). In theory, it could also be used z Systems, and coincidentally, SPARC TSO. How-
to implement the v4.13 Linux-kernel smp_load_ ever, placing the lwsync instruction between each
acquire() primitive, but in practice sync is used pair of memory-reference instructions will not result
instead. in x86, z Systems, or SPARC TSO memory ordering.
On these other systems, if a pair of CPUs indepen-
SYNC_RELEASE dently execute stores to different variables, all other
Release memory barrier, which may be used to im- CPUs will agree on the order of these stores. Not
plement C++’s atomic_thread_fence(memory_ so on PowerPC, even with an lwsync instruction
order_release). In theory, it could also be used between each pair of memory-reference instructions,
to implement the v4.13 Linux-kernel smp_store_ because PowerPC is non-multicopy atomic.
release() primitive, but in practice sync is used
instead. eieio (enforce in-order execution of I/O, in case you were
SYNC_RMB wondering) causes all preceding cacheable stores to
Read memory barrier, which could in theory be used appear to have completed before all subsequent stores.
to implement the smp_rmb() primitive in the Linux However, stores to cacheable memory are ordered
kernel, except that current MIPS implementations separately from stores to non-cacheable memory,
supported by the v4.13 Linux kernel do not need which means that eieio will not force an MMIO
an explicit instruction to force ordering. Therefore, store to precede a spinlock release.
smp_rmb() instead simply constrains the compiler.
isync forces all preceding instructions to appear to have
SYNCI completed before any subsequent instructions start
Instruction-cache synchronization, which is used execution. This means that the preceding instructions
in conjunction with other instructions to allow self- must have progressed far enough that any traps they
modifying code, such as that produced by just-in-time might generate have either happened or are guaran-
(JIT) compilers. teed not to happen, and that any side-effects of these
instructions (for example, page-table changes) are
Informal discussions with MIPS architects indicates seen by the subsequent instructions. However, it does
that MIPS has a definition of transitivity or cumulativity not force all memory references to be ordered, only
similar to that of Arm and POWER. However, it appears the actual execution of the instruction itself. Thus,
that different MIPS implementations can have different the loads might return old still-cached values and the
memory-ordering properties, so it is important to consult isync instruction does not force values previously
the documentation for the specific MIPS implementation stored to be flushed from the store buffers.
you are using.
Unfortunately, none of these instructions line up exactly
15.5.6 POWER / PowerPC with Linux’s wmb() primitive, which requires all stores to
be ordered, but does not require the other high-overhead
The POWER and PowerPC CPU families have a wide actions of the sync instruction. But there is no choice:
variety of memory-barrier instructions [IBM94, LHF05]: ppc64 versions of wmb() and mb() are defined to be
sync causes all preceding operations to appear to have the heavyweight sync instruction. However, Linux’s
completed before any subsequent operations are smp_wmb() instruction is never used for MMIO (since a
started. This instruction is therefore quite expen- driver must carefully order MMIOs in UP as well as SMP
sive. kernels, after all), so it is defined to be the lighter weight
eieio or lwsync instruction [MDR16]. This instruction
lwsync (lightweight sync) orders loads with respect to may well be unique in having a five-vowel mnemonic.
subsequent loads and stores, and also orders stores. The smp_mb() instruction is also defined to be the sync
However, it does not order stores with respect to sub- instruction, but both smp_rmb() and rmb() are defined
sequent loads. The lwsync instruction may be used to be the lighter-weight lwsync instruction.

v2022.09.25a
358 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING

POWER features “cumulativity”, which can be used StoreLoad orders preceding stores before subsequent
to obtain transitivity. When used properly, any code loads.
seeing the results of an earlier code fragment will also
see the accesses that this earlier code fragment itself LoadLoad orders preceding loads before subsequent
saw. Much more detail is available from McKenney and loads. (This option is used by the Linux smp_rmb()
Silvera [MS09]. primitive.)
POWER respects control dependencies in much the
Sync fully completes all preceding operations before
same way that Arm does, with the exception that the
starting any subsequent operations.
POWER isync instruction is substituted for the Arm ISB
instruction. MemIssue completes preceding memory operations be-
Like Armv8, POWER requires smp_mb__after_ fore subsequent memory operations, important for
spinlock() to be a full memory barrier. In addition, some instances of memory-mapped I/O.
POWER is the only architecture requiring smp_mb__
after_unlock_lock() to be a full memory barrier. In Lookaside does the same as MemIssue, but only applies
both cases, this is because of the weak ordering prop- to preceding stores and subsequent loads, and even
erties of POWER’s locking primitives, due to the use then only for stores and loads that access the same
of the lwsync instruction to provide ordering for both memory location.
acquisition and release.
Many members of the POWER architecture have in- So, why is “membar #MemIssue” needed? Because
coherent instruction caches, so that a store to memory a “membar #StoreLoad” could permit a subsequent
will not necessarily be reflected in the instruction cache. load to get its value from a store buffer, which would
Thankfully, few people write self-modifying code these be disastrous if the write was to an MMIO register
days, but JITs and compilers do it all the time. Further- that induced side effects on the value to be read. In
more, recompiling a recently run program looks just like contrast, “membar #MemIssue” would wait until the
self-modifying code from the CPU’s viewpoint. The icbi store buffers were flushed before permitting the loads
instruction (instruction cache block invalidate) invalidates to execute, thereby ensuring that the load actually gets
a specified cache line from the instruction cache, and may its value from the MMIO register. Drivers could
be used in these situations. instead use “membar #Sync”, but the lighter-weight
“membar #MemIssue” is preferred in cases where the ad-
ditional function of the more-expensive “membar #Sync”
15.5.7 SPARC TSO are not required.
The “membar #Lookaside” is a lighter-weight version
Although SPARC’s TSO (total-store order) is used by
of “membar #MemIssue”, which is useful when writing
both Linux and Solaris, the architecture also defines PSO
to a given MMIO register affects the value that will next
(partial store order) and RMO (relaxed-memory order).
be read from that register. However, the heavier-weight
Any program that runs in RMO will also run in either PSO
“membar #MemIssue” must be used when a write to a
or TSO, and similarly, a program that runs in PSO will also
given MMIO register affects the value that will next be
run in TSO. Moving a shared-memory parallel program
read from some other MMIO register.
in the other direction may require careful insertion of
SPARC requires a flush instruction be used between
memory barriers.
the time that the instruction stream is modified and the
Although SPARC’s PSO and RMO modes are not used
time that any of these instructions are executed [SPA94].
much these days, they did give rise to a very flexible
This is needed to flush any prior value for that location
memory-barrier instruction [SPA94] that permits fine-
from the SPARC’s instruction cache. Note that flush
grained control of ordering:
takes an address, and will flush only that address from the
StoreStore orders preceding stores before subsequent instruction cache. On SMP systems, all CPUs’ caches are
stores. (This option is used by the Linux smp_wmb() flushed, but there is no convenient way to determine when
primitive.) the off-CPU flushes complete, though there is a reference
to an implementation note.
LoadStore orders preceding loads before subsequent But again, the Linux kernel runs SPARC in TSO mode,
stores. so all of the above membar variants are strictly of historical

v2022.09.25a
15.6. WHERE IS MEMORY ORDERING NEEDED? 359

In particular, the smp_mb() primitive only needs to use #StoreLoad because the other three reorderings are prohibited by TSO.

15.5.8 x86

Historically, the x86 CPUs provided “process ordering” so that all CPUs agreed on the order of a given CPU’s writes to memory. This allowed the smp_wmb() primitive to be a no-op for the CPU [Int04b]. Of course, a compiler directive was also required to prevent optimizations that would reorder across the smp_wmb() primitive. In ancient times, certain x86 CPUs gave no ordering guarantees for loads, so the smp_mb() and smp_rmb() primitives expanded to lock;addl. This atomic instruction acts as a barrier to both loads and stores.

But those were ancient times. More recently, Intel has published a memory model for x86 [Int07]. It turns out that Intel’s modern CPUs enforce tighter ordering than was claimed in the previous specifications, so this model simply mandates this modern behavior. Even more recently, Intel published an updated memory model for x86 [Int11, Section 8.2], which mandates a total global order for stores, although individual CPUs are still permitted to see their own stores as having happened earlier than this total global order would indicate. This exception to the total ordering is needed to allow important hardware optimizations involving store buffers. In addition, x86 provides other-multicopy atomicity, for example, so that if CPU 0 sees a store by CPU 1, then CPU 0 is guaranteed to see all stores that CPU 1 saw prior to its store. Software may use atomic operations to override these hardware optimizations, which is one reason that atomic operations tend to be more expensive than their non-atomic counterparts. It is also important to note that atomic instructions operating on a given memory location should all be of the same size [Int16, Section 8.1.2.2]. For example, if you write a program where one CPU atomically increments a byte while another CPU executes a 4-byte atomic increment on that same location, you are on your own.

Some SSE instructions are weakly ordered (clflush and non-temporal move instructions [Int04a]). Code that uses these non-temporal move instructions can also use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb(). A few older variants of the x86 CPU have a mode bit that enables out-of-order stores, and for these CPUs, smp_wmb() must also be defined to be lock;addl.

Although newer x86 implementations accommodate self-modifying code without any special instructions, to be fully compatible with past and potential future x86 implementations, a given CPU must execute a jump instruction or a serializing instruction (e.g., cpuid) between modifying the code and executing it [Int11, Section 8.1.3].

15.5.9 z Systems

The z Systems machines make up the IBM mainframe family, previously known as the 360, 370, 390 and zSeries [Int04c]. Parallelism came late to z Systems, but given that these mainframes first shipped in the mid 1960s, this is not saying much. The “bcr 15,0” instruction is used for the Linux smp_mb() primitive, but compiler constraints suffice for both the smp_rmb() and smp_wmb() primitives. It also has strong memory-ordering semantics, as shown in Table 15.5. In particular, all CPUs will agree on the order of unrelated stores from different CPUs, that is, the z Systems CPU family is fully multicopy atomic, and is the only commercially available system with this property.

As with most CPUs, the z Systems architecture does not guarantee a cache-coherent instruction stream, hence self-modifying code must execute a serializing instruction between updating the instructions and executing them. That said, many actual z Systems machines do in fact accommodate self-modifying code without serializing instructions. The z Systems instruction set provides a large set of serializing instructions, including compare-and-swap, some types of branches (for example, the aforementioned “bcr 15,0” instruction), and test-and-set.

15.6 Where is Memory Ordering Needed?

Almost all people are intelligent. It is method that they lack.

F. W. Nichol

This section revisits Table 15.3 and Section 15.1.3, summarizing the intervening discussion with a more sophisticated set of rules of thumb.

But first, it is necessary to review the temporal and non-temporal nature of communication from one thread to another when using memory as the communications medium, as was discussed in detail in Section 15.2.7.


The key point is that although loads and stores are conceptually simple, on real multicore hardware significant periods of time are required for their effects to become visible to all other threads.

The simple and intuitive case occurs when one thread loads a value that some other thread stored. This straightforward cause-and-effect case exhibits temporal behavior, so that the software can safely assume that the store instruction completed before the load instruction started. In real life, the load instruction might well have started quite some time before the store instruction did, but all modern hardware must carefully hide such cases from the software. Again, software will see the expected temporal cause-and-effect behavior when one thread loads a value that some other thread stores, as discussed in Section 15.2.7.3.

However, hardware is under no obligation to provide temporal cause-and-effect illusions when one thread’s store overwrites a value either loaded or stored by some other thread. It is quite possible that, from the software’s viewpoint, an earlier store will overwrite a later store’s value, but only if those two stores were executed by different threads. Similarly, a later load might well read a value overwritten by an earlier store, but again only if that load and store were executed by different threads. This counter-intuitive behavior occurs due to the need to buffer stores in order to achieve adequate performance, as discussed in Section 15.2.7.2.

As a result, situations where one thread reads a value written by some other thread can make do with far weaker ordering than can situations where one thread overwrites a value loaded or stored by some other thread. These differences are captured by the following rules of thumb.

The first rule of thumb is that memory-ordering operations are only required where there is a possibility of interaction between at least two variables shared among at least two threads. In light of the intervening material, this single sentence encapsulates much of Section 15.1.3’s basic rules of thumb, for example, keeping in mind that “memory-barrier pairing” is a two-thread special case of “cycle”. And, as always, if a single-threaded program will provide sufficient performance, why bother with parallelism?21 After all, avoiding parallelism also avoids the added cost of memory-ordering operations.

21 Hobbyists and researchers should of course feel free to ignore this and many other cautions.

The second rule of thumb involves load-buffering situations: If all thread-to-thread communication in a given cycle uses store-to-load links (that is, the next thread’s load returns the value stored by the previous thread), minimal ordering suffices. Minimal ordering includes dependencies and acquires as well as all stronger ordering operations.

The third rule of thumb involves release-acquire chains: If all but one of the links in a given cycle is a store-to-load link, it is sufficient to use release-acquire pairs for each of those store-to-load links, as illustrated by Listings 15.23 and 15.24. You can replace a given acquire with a dependency in environments permitting this, keeping in mind that the C11 standard’s memory model does not fully respect dependencies. Therefore, a dependency leading to a load must be headed by a READ_ONCE() or an rcu_dereference(): A plain C-language load is not sufficient. In addition, carefully review Sections 15.3.2 and 15.3.3, because a dependency broken by your compiler will not order anything. The two threads sharing the sole non-store-to-load link can usually substitute WRITE_ONCE() plus smp_wmb() for smp_store_release() on the one hand, and READ_ONCE() plus smp_rmb() for smp_load_acquire() on the other. However, the wise developer will check such substitutions carefully, for example, using the herd tool as described in Section 12.3.

Quick Quiz 15.45: Why is it necessary to use heavier-weight ordering for load-to-store and store-to-store links, but not for store-to-load links? What on earth makes store-to-load links so special???

The fourth and final rule of thumb identifies where full memory barriers (or stronger) are required: If a given cycle contains two or more non-store-to-load links (that is, a total of two or more links that are either load-to-store or store-to-store links), you will need at least one full barrier between each pair of non-store-to-load links in that cycle, as illustrated by Listing 15.19 as well as in the answer to Quick Quiz 15.24. Full barriers include smp_mb(), successful full-strength non-void atomic RMW operations, and other atomic RMW operations in conjunction with either smp_mb__before_atomic() or smp_mb__after_atomic(). Any of RCU’s grace-period-wait primitives (synchronize_rcu() and friends) also act as full barriers, but at far greater expense than smp_mb(). With strength comes expense, though full barriers usually hurt performance more than they hurt scalability.

Recapping the rules:

1. Memory-ordering operations are required only if at least two variables are shared by at least two threads.

2. If all links in a cycle are store-to-load links, then minimal ordering suffices.


3. If all but one of the links in a cycle are store-to-load links, then each store-to-load link may use a release-acquire pair.

4. Otherwise, at least one full barrier is required between each pair of non-store-to-load links, as sketched below.
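To make the third rule concrete, the following message-passing fragment uses the Linux-kernel primitives discussed earlier in this chapter. It is an illustrative sketch rather than a complete program: Both variables are assumed to be initially zero, and the thread-creation scaffolding is omitted.

int x, y;

void thread0(void)
{
        WRITE_ONCE(x, 1);
        smp_store_release(&y, 1);       /* Store side of the store-to-load link. */
}

void thread1(void)
{
        int r1 = smp_load_acquire(&y);  /* Acquire pairs with the release. */
        int r2 = READ_ONCE(x);

        /* If r1 == 1, then r2 is guaranteed to also be 1. */
}

The cycle of interest has one store-to-load link (thread0()’s store to y read by thread1()’s load) and one load-to-store link (thread1()’s load from x overwritten by thread0()’s store), so the single release-acquire pair on y supplies all the ordering that the third rule requires.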

Note that these four rules of thumb encapsulate minimum guarantees. A given architecture may give more substantial guarantees, as discussed in Section 15.5, but these guarantees may only be relied upon in code that runs only for that architecture. In addition, more involved memory models may give stronger guarantees [AMM+ 18], at the expense of somewhat greater complexity. In these more formal memory-ordering papers, a store-to-load link is an example of a reads-from (rf) link, a load-to-store link is an example of a from-reads (fr) link, and a store-to-store link is an example of a coherence (co) link.
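For example, the store-buffering fragment below has a cycle consisting of two from-reads (load-to-store) links, one per variable, so the fourth rule calls for a full barrier in each thread. As before, this is only a sketch that assumes both variables are initially zero.

int x, y;

void thread0(void)
{
        int r0;

        WRITE_ONCE(x, 1);
        smp_mb();               /* Order the store to x before the load from y. */
        r0 = READ_ONCE(y);
}

void thread1(void)
{
        int r1;

        WRITE_ONCE(y, 1);
        smp_mb();               /* Order the store to y before the load from x. */
        r1 = READ_ONCE(x);
}

With both smp_mb() calls in place, the outcome r0 == 0 && r1 == 0 is forbidden; remove either one and store buffering can produce it.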
One final word of advice: Use of raw memory-ordering
primitives is a last resort. It is almost always better to use
existing primitives, such as locking or RCU, thus letting
those primitives do the memory ordering for you.
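For example, if both threads in the earlier message-passing sketch simply use the same lock, no explicit memory-ordering primitives are needed at all, as shown in the following fragment. This is again only a sketch, written in terms of Linux-kernel spinlock primitives, with the variables assumed to be initially zero.

int x, y;
DEFINE_SPINLOCK(xylock);

void thread0(void)
{
        spin_lock(&xylock);
        x = 1;
        y = 1;
        spin_unlock(&xylock);
}

void thread1(void)
{
        int r1, r2;

        spin_lock(&xylock);
        r1 = y;
        r2 = x;
        spin_unlock(&xylock);

        /* If r1 == 1, then r2 == 1: The lock supplies all needed ordering. */
}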

Chapter 16

Ease of Use

Creating a perfect API is like committing the perfect crime. There are at least fifty things that can go wrong, and if you are a genius, you might be able to anticipate twenty-five of them.

With apologies to any Kathleen Turner fans who might still be alive.

16.1 What is Easy?

When someone says “I want a programming language in which I need only say what I wish done,” give them a lollipop.

Alan J. Perlis, updated

If you are tempted to look down on ease-of-use requirements, please consider that an ease-of-use bug in Linux-kernel RCU resulted in an exploitable Linux-kernel security bug in a use of RCU [McK19a]. It is therefore clearly important that even in-kernel APIs be easy to use.

Unfortunately, “easy” is a relative term. For example, many people would consider a 15-hour airplane flight to be a bit of an ordeal—unless they stopped to consider alternative modes of transportation, especially swimming. This means that creating an easy-to-use API requires that you understand your intended users well enough to know what is easy for them. Which might or might not have anything to do with what is easy for you.

The following question illustrates this point: “Given a randomly chosen person among everyone alive today, what one change would improve that person’s life?”

There is no single change that would be guaranteed to help everyone’s life. After all, there is an extremely wide range of people, with a correspondingly wide range of needs, wants, desires, and aspirations. A starving person might need food, but additional food might well hasten the death of a morbidly obese person. The high level of excitement so fervently desired by many young people might well be fatal to someone recovering from a heart attack. Information critical to the success of one person might contribute to the failure of someone suffering from information overload. In short, if you are working on a software project that is intended to help people you know nothing about, you should not be surprised when those people find fault with your project.

If you really want to help a given group of people, there is simply no substitute for working closely with them over an extended period of time, as in years. Nevertheless, there are some simple things that you can do to increase the odds of your users being happy with your software, and some of these things are covered in the next section.

16.2 Rusty Scale for API Design

Finding the appropriate measurement is thus not a mathematical exercise. It is a risk-taking judgment.

Peter Drucker

This section is adapted from portions of Rusty Russell’s 2003 Ottawa Linux Symposium keynote address [Rus03, Slides 39–57]. Rusty’s key point is that the goal should not be merely to make an API easy to use, but rather to make the API hard to misuse. To that end, Rusty proposed his “Rusty Scale” in decreasing order of this important hard-to-misuse property.

The following list attempts to generalize the Rusty Scale beyond the Linux kernel:

1. It is impossible to get wrong. Although this is the standard to which all API designers should strive, only the mythical dwim()1 command manages to come close.

1 The dwim() function is an acronym that expands to “do what I mean”.

2. The compiler or linker won’t let you get it wrong.

3. The compiler or linker will warn you if you get it wrong. BUILD_BUG_ON() is your users’ friend.


4. The simplest use is the correct one.

5. The name tells you how to use it. But names can be two-edged swords. Although rcu_read_lock() is plain enough for someone converting code from reader-writer locking, it might cause some consternation for someone converting code from reference counting.

6. Do it right or it will always break at runtime. WARN_ON_ONCE() is your users’ friend.

7. Follow common convention and you will get it right. The malloc() library function is a good example. Although it is easy to get memory allocation wrong, a great many projects do manage to get it right, at least most of the time. Using malloc() in conjunction with Valgrind [The11] moves malloc() almost up to the “do it right or it will always break at runtime” point on the scale.

8. Read the documentation and you will get it right.

9. Read the implementation and you will get it right.

10. Read the right mailing-list archive and you will get it right.

11. Read the right mailing-list archive and you will get it wrong.

12. Read the implementation and you will get it wrong. The original non-CONFIG_PREEMPT implementation of rcu_read_lock() [McK07a] is an infamous example of this point on the scale.

13. Read the documentation and you will get it wrong. For example, the DEC Alpha wmb instruction’s documentation [Cor02] fooled a number of developers into thinking that this instruction had much stronger memory-order semantics than it actually does. Later documentation clarified this point [Com01, Pug00], moving the wmb instruction up to the “read the documentation and you will get it right” point on the scale.

14. Follow common convention and you will get it wrong. The printf() statement is an example of this point on the scale because developers almost always fail to check printf()’s error return.

15. Do it right and it will break at runtime.

16. The name tells you how not to use it.

17. The obvious use is wrong. The Linux kernel smp_mb() function is an example of this point on the scale. Many developers assume that this function has much stronger ordering semantics than it actually possesses. Chapter 15 contains the information needed to avoid this mistake, as does the Linux-kernel source tree’s Documentation and tools/memory-model directories.

18. The compiler or linker will warn you if you get it right.

19. The compiler or linker won’t let you get it right.

20. It is impossible to get right. The gets() function is a famous example of this point on the scale. In fact, gets() can perhaps best be described as an unconditional buffer-overflow security hole.

16.3 Shaving the Mandelbrot Set

Simplicity does not precede complexity, but follows it.

Alan J. Perlis

The set of useful programs resembles the Mandelbrot set (shown in Figure 16.1) in that it does not have a clear-cut smooth boundary—if it did, the halting problem would be solvable. But we need APIs that real people can use, not ones that require a Ph.D. dissertation be completed for each and every potential use. So, we “shave the Mandelbrot set”,2 restricting the use of the API to an easily described subset of the full set of potential uses.

2 Due to Josh Triplett.

Such shaving may seem counterproductive. After all, if an algorithm works, why shouldn’t it be used?

To see why at least some shaving is absolutely necessary, consider a locking design that avoids deadlock, but in perhaps the worst possible way. This design uses a circular doubly linked list, which contains one element for each thread in the system along with a header element. When a new thread is spawned, the parent thread must insert a new element into this list, which requires some sort of synchronization.

One way to protect the list is to use a global lock. However, this might be a bottleneck if threads were being created and deleted frequently.3

3 Those of you with strong operating-system backgrounds, please suspend disbelief. Those unable to suspend disbelief are encouraged to provide better examples.


Figure 16.1: Mandelbrot Set (Courtesy of Wikipedia)

Another approach would be to use a hash table and to lock the individual hash buckets, but this can perform poorly when scanning the list in order.

A third approach is to lock the individual list elements, and to require the locks for both the predecessor and successor to be held during the insertion. Since both locks must be acquired, we need to decide which order to acquire them in. Two conventional approaches would be to acquire the locks in address order, or to acquire them in the order that they appear in the list, so that the header is always acquired first when it is one of the two elements being locked. However, both of these methods require special checks and branches.

The to-be-shaven solution is to unconditionally acquire the locks in list order. But what about deadlock?

Deadlock cannot occur.

To see this, number the elements in the list starting with zero for the header up to 𝑁 for the last element in the list (the one preceding the header, given that the list is circular). Similarly, number the threads from zero to 𝑁 − 1. If each thread attempts to lock some consecutive pair of elements, at least one of the threads is guaranteed to be able to acquire both locks.

Why?

Because there are not enough threads to reach all the way around the list. Suppose thread 0 acquires element 0’s lock. To be blocked, some other thread must have already acquired element 1’s lock, so let us assume that thread 1 has done so. Similarly, for thread 1 to be blocked, some other thread must have acquired element 2’s lock, and so on, up through thread 𝑁 − 1, who acquires element 𝑁 − 1’s lock. For thread 𝑁 − 1 to be blocked, some other thread must have acquired element 𝑁’s lock. But there are no more threads, and so thread 𝑁 − 1 cannot be blocked. Therefore, deadlock cannot occur.

Figure 16.2: Shaving the Mandelbrot Set

So why should we prohibit use of this delightful little algorithm?

The fact is that if you really want to use it, we cannot stop you. We can, however, recommend against such code being included in any project that we care about.

But, before you use this algorithm, please think through the following Quick Quiz.

Quick Quiz 16.1: Can a similar algorithm be used when deleting elements?

The fact is that this algorithm is extremely specialized (it only works on certain sized lists), and also quite fragile. Any bug that accidentally failed to add a node to the list could result in deadlock. In fact, simply adding the node a bit too late could result in deadlock, as could increasing the number of threads.

In addition, the other algorithms described above are “good and sufficient”. For example, simply acquiring the locks in address order is fairly simple and quick, while allowing the use of lists of any size. Just be careful of the special cases presented by empty lists and lists containing only one element!

Quick Quiz 16.2: Yetch! What ever possessed someone to come up with an algorithm that deserves to be shaved as much as this one does???

In summary, we do not use algorithms simply because they happen to work. We instead restrict ourselves to algorithms that are useful enough to make it worthwhile learning about them. The more difficult and complex the algorithm, the more generally useful it must be in order for the pain of learning it and fixing its bugs to be worthwhile.

Quick Quiz 16.3: Give an exception to this rule.

Exceptions aside, we must continue to shave the software “Mandelbrot set” so that our programs remain maintainable, as shown in Figure 16.2.
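To make the “acquire the locks in address order” alternative concrete, the following fragment sketches the insertion step using Linux-kernel spinlock primitives. The structure and helper names are illustrative rather than taken from any existing code, and revalidation that the two elements are still adjacent after their locks are acquired is omitted, as is the single-element special case cautioned about above.

struct element {
        spinlock_t lock;
        struct element *next;
        struct element *prev;
};

/* Acquire the locks of two distinct elements in address order. */
static void lock_pair(struct element *a, struct element *b)
{
        if (a < b) {
                spin_lock(&a->lock);
                spin_lock(&b->lock);
        } else {
                spin_lock(&b->lock);
                spin_lock(&a->lock);
        }
}

/* Insert new between pred and succ, where succ == pred->next.
 * Note that a one-element circular list has pred == succ, which
 * requires separate handling, as noted above. */
static void insert_between(struct element *pred, struct element *succ,
                           struct element *new)
{
        lock_pair(pred, succ);
        new->prev = pred;
        new->next = succ;
        pred->next = new;
        succ->prev = new;
        spin_unlock(&pred->lock);
        spin_unlock(&succ->lock);
}

Because every thread orders its two acquisitions the same way, no cycle of lock waits can form, regardless of the length of the list.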

Chapter 17

Conflicting Visions of the Future

Prediction is very difficult, especially about the future.

Niels Bohr

This chapter presents some conflicting visions of the future of parallel programming. It is not clear which of these will come to pass; in fact, it is not clear that any of them will. They are nevertheless important because each vision has its devoted adherents, and if enough people believe in something fervently enough, you will need to deal with that thing’s existence in the form of its influence on the thoughts, words, and deeds of its adherents. Besides which, one or more of these visions will actually come to pass. But most are bogus. Tell which is which and you’ll be rich [Spi77]!

Therefore, the following sections give an overview of transactional memory, hardware transactional memory, formal verification in regression testing, and parallel functional programming. But first, a cautionary tale on prognostication taken from the early 2000s.

17.1 The Future of CPU Technology Ain’t What it Used to Be

A great future behind him.

David Maraniss

Years past always seem so simple and innocent when viewed through the lens of many years of experience. And the early 2000s were for the most part innocent of the impending failure of Moore’s Law to continue delivering the then-traditional increases in CPU clock frequency. Oh, there were the occasional warnings about the limits of technology, but such warnings had been sounded for decades. With that in mind, consider the following scenarios:

1. Uniprocessor Über Alles (Figure 17.1),

2. Multithreaded Mania (Figure 17.2),

3. More of the Same (Figure 17.3),

4. Crash Dummies Slamming into the Memory Wall (Figure 17.4), and

5. Astounding Accelerators (Figure 17.5).

Figure 17.1: Uniprocessor Über Alles

Each of these scenarios is covered in the following sections.


Figure 17.2: Multithreaded Mania

Figure 17.3: More of the Same

Figure 17.4: Crash Dummies Slamming into the Memory Wall

Figure 17.5: Astounding Accelerators


17.1.1 Uniprocessor Über Alles

As was said in 2004 [McK04]:

In this scenario, the combination of Moore’s-Law increases in CPU clock rate and continued progress in horizontally scaled computing render SMP systems irrelevant. This scenario is therefore dubbed “Uniprocessor Über Alles”, literally, uniprocessors above all else.

These uniprocessor systems would be subject only to instruction overhead, since memory barriers, cache thrashing, and contention do not affect single-CPU systems. In this scenario, RCU is useful only for niche applications, such as interacting with NMIs. It is not clear that an operating system lacking RCU would see the need to adopt it, although operating systems that already implement RCU might continue to do so.

However, recent progress with multithreaded CPUs seems to indicate that this scenario is quite unlikely.

Unlikely indeed! But the larger software community was reluctant to accept the fact that they would need to embrace parallelism, and so it was some time before this community concluded that the “free lunch” of Moore’s-Law-induced CPU core-clock frequency increases was well and truly finished. Never forget: Belief is an emotion, not necessarily the result of a rational technical thought process!

17.1.2 Multithreaded Mania

Also from 2004 [McK04]:

A less-extreme variant of Uniprocessor Über Alles features uniprocessors with hardware multithreading, and in fact multithreaded CPUs are now standard for many desktop and laptop computer systems. The most aggressively multithreaded CPUs share all levels of cache hierarchy, thereby eliminating CPU-to-CPU memory latency, in turn greatly reducing the performance penalty for traditional synchronization mechanisms. However, a multithreaded CPU would still incur overhead due to contention and to pipeline stalls caused by memory barriers. Furthermore, because all hardware threads share all levels of cache, the cache available to a given hardware thread is a fraction of what it would be on an equivalent single-threaded CPU, which can degrade performance for applications with large cache footprints. There is also some possibility that the restricted amount of cache available will cause RCU-based algorithms to incur performance penalties due to their grace-period-induced additional memory consumption. Investigating this possibility is future work.

However, in order to avoid such performance degradation, a number of multithreaded CPUs and multi-CPU chips partition at least some of the levels of cache on a per-hardware-thread basis. This increases the amount of cache available to each hardware thread, but re-introduces memory latency for cachelines that are passed from one hardware thread to another.

And we all know how this story has played out, with multiple multi-threaded cores on a single die plugged into a single socket, with varying degrees of optimization for lower numbers of active threads per core. The question then becomes whether or not future shared-memory systems will always fit into a single socket.

17.1.3 More of the Same

Again from 2004 [McK04]:

The More-of-the-Same scenario assumes that the memory-latency ratios will remain roughly where they are today.

This scenario actually represents a change, since to have more of the same, interconnect performance must begin keeping up with the Moore’s-Law increases in core CPU performance. In this scenario, overhead due to pipeline stalls, memory latency, and contention remains significant, and RCU retains the high level of applicability that it enjoys today.

And the change has been the ever-increasing levels of integration that Moore’s Law is still providing. But longer term, which will it be? More CPUs per die? Or more I/O, cache, and memory?

Servers seem to be choosing the former, while embedded systems on a chip (SoCs) continue choosing the latter.


Figure 17.6: Instructions per Local Memory Reference for Sequent Computers

Figure 17.7: Breakevens vs. 𝑟, 𝜆 Large, Four CPUs

Figure 17.8: Breakevens vs. 𝑟, 𝜆 Small, Four CPUs

17.1.4 Crash Dummies Slamming into the Memory Wall

And one more quote from 2004 [McK04]:

If the memory-latency trends shown in Figure 17.6 continue, then memory latency will continue to grow relative to instruction-execution overhead. Systems such as Linux that have significant use of RCU will find additional use of RCU to be profitable, as shown in Figure 17.7. As can be seen in this figure, if RCU is heavily used, increasing memory-latency ratios give RCU an increasing advantage over other synchronization mechanisms. In contrast, systems with minor use of RCU will require increasingly high degrees of read intensity for use of RCU to pay off, as shown in Figure 17.8. As can be seen in this figure, if RCU is lightly used, increasing memory-latency ratios put RCU at an increasing disadvantage compared to other synchronization mechanisms. Since Linux has been observed with over 1,600 callbacks per grace period under heavy load [SM04b], it seems safe to say that Linux falls into the former category.

On the one hand, this passage failed to anticipate the cache-warmth issues that RCU can suffer from in workloads with significant update intensity, in part because it seemed unlikely that RCU would really be used for such workloads.


In the event, SLAB_TYPESAFE_BY_RCU has been pressed into service in a number of instances where these cache-warmth issues would otherwise be problematic, as has sequence locking. On the other hand, this passage also failed to anticipate that RCU would be used to reduce scheduling latency or for security.

Much of the data generated for this book was collected on an eight-socket system with 28 cores per socket and two hardware threads per core, for a total of 448 hardware threads. The idle-system memory latencies are less than one microsecond, which are no worse than those of similar-sized systems of the year 2004. Some claim that these latencies approach a microsecond only because of the x86 CPU family’s relatively strong memory ordering, but it may be some time before that particular argument is settled.

17.1.5 Astounding Accelerators

The potential of hardware accelerators was not quite as clear in 2004 as it is in 2021, so this section has no quote. However, the November 2020 Top 500 list [MDSS20] features a great many accelerators, so one could argue that this section is a view of the present rather than of the future. The same could be said of most of the preceding sections.

Hardware accelerators are being put to many other uses, including encryption, compression, and machine learning.

In short, beware of prognostications, including those in the remainder of this chapter.

17.2 Transactional Memory

Everything should be as simple as it can be, but not simpler.

Albert Einstein, by way of Louis Zukofsky

The idea of using transactions outside of databases goes back many decades [Lom77, Kni86, HM93], with the key difference between database and non-database transactions being that non-database transactions drop the “D” in the “ACID”1 properties defining database transactions. The idea of supporting memory-based transactions, or “transactional memory” (TM), in hardware is more recent [HM93], but unfortunately, support for such transactions in commodity hardware was not immediately forthcoming, despite other somewhat similar proposals being put forward [SSHT93]. Not long after, Shavit and Touitou proposed a software-only implementation of transactional memory (STM) that was capable of running on commodity hardware, give or take memory-ordering issues [ST95]. This proposal languished for many years, perhaps due to the fact that the research community’s attention was absorbed by non-blocking synchronization (see Section 14.2).

1 Atomicity, consistency, isolation, and durability.

But by the turn of the century, TM started receiving more attention [MT01, RG01], and by the middle of the decade, the level of interest can only be termed “incandescent” [Her05, Gro07], with only a few voices of caution [BLM05, MMW07].

The basic idea behind TM is to execute a section of code atomically, so that other threads see no intermediate state. As such, the semantics of TM could be implemented by simply replacing each transaction with a recursively acquirable global lock acquisition and release, albeit with abysmal performance and scalability. Much of the complexity inherent in TM implementations, whether hardware or software, is efficiently detecting when concurrent transactions can safely run in parallel. Because this detection is done dynamically, conflicting transactions can be aborted or “rolled back”, and in some implementations, this failure mode is visible to the programmer.

Because transaction roll-back is increasingly unlikely as transaction size decreases, TM might become quite attractive for small memory-based operations, such as linked-list manipulations used for stacks, queues, hash tables, and search trees. However, it is currently much more difficult to make the case for large transactions, particularly those containing non-memory operations such as I/O and process creation. The following sections look at current challenges to the grand vision of “Transactional Memory Everywhere” [McK09b]. Section 17.2.1 examines the challenges faced interacting with the outside world, Section 17.2.2 looks at interactions with process modification primitives, Section 17.2.3 explores interactions with other synchronization primitives, and finally Section 17.2.4 closes with some discussion.

17.2.1 Outside World

In the wise words of Donald Knuth:


Many computer users feel that input and output are not actually part of “real programming,” they are merely things that (unfortunately) must be done in order to get information in and out of the machine.

Whether or not we believe that input and output are “real programming”, the fact is that software absolutely must deal with the outside world. This section therefore critiques transactional memory’s outside-world capabilities, focusing on I/O operations, time delays, and persistent storage.

17.2.1.1 I/O Operations

One can execute I/O operations within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. What happens when you attempt to execute an I/O operation from within a transaction?

The underlying problem is that transactions may be rolled back, for example, due to conflicts. Roughly speaking, this requires that all operations within any given transaction be revocable, so that executing the operation twice has the same effect as executing it once. Unfortunately, I/O is in general the prototypical irrevocable operation, making it difficult to include general I/O operations in transactions. In fact, general I/O is irrevocable: Once you have pushed the proverbial button launching the nuclear warheads, there is no turning back.

Here are some options for handling of I/O within transactions:

1. Restrict I/O within transactions to buffered I/O with in-memory buffers. These buffers may then be included in the transaction in the same way that any other memory location might be included. This seems to be the mechanism of choice, and it does work well in many common cases of situations such as stream I/O and mass-storage I/O. However, special handling is required in cases where multiple record-oriented output streams are merged onto a single file from multiple processes, as might be done using the “a+” option to fopen() or the O_APPEND flag to open(). In addition, as will be seen in the next section, common networking operations cannot be handled via buffering.

2. Prohibit I/O within transactions, so that any attempt to execute an I/O operation aborts the enclosing transaction (and perhaps multiple nested transactions). This approach seems to be the conventional TM approach for unbuffered I/O, but requires that TM interoperate with other synchronization primitives tolerating I/O.

3. Prohibit I/O within transactions, but enlist the compiler’s aid in enforcing this prohibition.

4. Permit only one special irrevocable transaction [SMS08] to proceed at any given time, thus allowing irrevocable transactions to contain I/O operations.2 This works in general, but severely limits the scalability and performance of I/O operations. Given that scalability and performance is a first-class goal of parallelism, this approach’s generality seems a bit self-limiting. Worse yet, use of irrevocability to tolerate I/O operations seems to greatly restrict use of manual transaction-abort operations.3 Finally, if there is an irrevocable transaction manipulating a given data item, any other transaction manipulating that same data item cannot have non-blocking semantics.

5. Create new hardware and protocols such that I/O operations can be pulled into the transactional substrate. In the case of input operations, the hardware would need to correctly predict the result of the operation, and to abort the transaction if the prediction failed.

2 In earlier literature, irrevocable transactions are termed inevitable transactions.

3 This difficulty was pointed out by Michael Factor. To see the problem, think through what TM should do in response to an attempt to abort a transaction after it has executed an irrevocable operation.

I/O operations are a well-known weakness of TM, and it is not clear that the problem of supporting I/O in transactions has a reasonable general solution, at least if “reasonable” is to include usable performance and scalability. Nevertheless, continued time and attention to this problem will likely produce additional progress.

17.2.1.2 RPC Operations

One can execute RPCs within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. What happens when you attempt to execute an RPC from within a transaction?

If both the RPC request and its response are to be contained within the transaction, and if some part of the transaction depends on the result returned by the response, then it is not possible to use the memory-buffer tricks that can be used in the case of buffered I/O.


Any attempt to take this buffering approach would deadlock the transaction, as the request could not be transmitted until the transaction was guaranteed to succeed, but the transaction’s success might not be knowable until after the response is received, as is the case in the following example:

begin_trans();
rpc_request();
i = rpc_response();
a[i]++;
end_trans();

The transaction’s memory footprint cannot be determined until after the RPC response is received, and until the transaction’s memory footprint can be determined, it is impossible to determine whether the transaction can be allowed to commit. The only action consistent with transactional semantics is therefore to unconditionally abort the transaction, which is, to say the least, unhelpful.

Here are some options available to TM:

1. Prohibit RPC within transactions, so that any attempt to execute an RPC operation aborts the enclosing transaction (and perhaps multiple nested transactions). Alternatively, enlist the compiler to enforce RPC-free transactions. This approach does work, but will require TM to interact with other synchronization primitives.

2. Permit only one special irrevocable transaction [SMS08] to proceed at any given time, thus allowing irrevocable transactions to contain RPC operations. This works in general, but severely limits the scalability and performance of RPC operations. Given that scalability and performance is a first-class goal of parallelism, this approach’s generality seems a bit self-limiting. Furthermore, use of irrevocable transactions to permit RPC operations restricts manual transaction-abort operations once the RPC operation has started. Finally, if there is an irrevocable transaction manipulating a given data item, any other transaction manipulating that same data item must have blocking semantics.

3. Identify special cases where the success of the transaction may be determined before the RPC response is received, and automatically convert these to irrevocable transactions immediately before sending the RPC request. Of course, if several concurrent transactions attempt RPC calls in this manner, it might be necessary to roll all but one of them back, with consequent degradation of performance and scalability. This approach nevertheless might be valuable given long-running transactions ending with an RPC. This approach must still restrict manual transaction-abort operations.

4. Identify special cases where the RPC response may be moved out of the transaction, and then proceed using techniques similar to those used for buffered I/O.

5. Extend the transactional substrate to include the RPC server as well as its client. This is in theory possible, as has been demonstrated by distributed databases. However, it is unclear whether the requisite performance and scalability requirements can be met by distributed-database techniques, given that memory-based TM has no slow disk drives behind which to hide such latencies. Of course, given the advent of solid-state disks, it is also quite possible that databases will need to redesign their approach to latency hiding.

As noted in the prior section, I/O is a known weakness of TM, and RPC is simply an especially problematic case of I/O.

17.2.1.3 Time Delays

An important special case of interaction with extra-transactional accesses involves explicit time delays within a transaction. Of course, the idea of a time delay within a transaction flies in the face of TM’s atomicity property, but this sort of thing is arguably what weak atomicity is all about. Furthermore, correct interaction with memory-mapped I/O sometimes requires carefully controlled timing, and applications often use time delays for varied purposes. Finally, one can execute time delays within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. Doing so might not be wise from a contention or scalability viewpoint, but then again, doing so does not raise any fundamental conceptual issues.

So, what can TM do about time delays within transactions?


1. Ignore time delays within transactions. This has an appearance of elegance, but like too many other “elegant” solutions, fails to survive first contact with legacy code. Such code, which might well have important time delays in critical sections, would fail upon being transactionalized.

2. Abort transactions upon encountering a time-delay operation. This is attractive, but it is unfortunately not always possible to automatically detect a time-delay operation. Is that tight loop carrying out a critical computation, or is it simply waiting for time to elapse?

3. Enlist the compiler to prohibit time delays within transactions.

4. Let the time delays execute normally. Unfortunately, some TM implementations publish modifications only at commit time, which could defeat the purpose of the time delay.

It is not clear that there is a single correct answer. TM implementations featuring weak atomicity that publish changes immediately within the transaction (rolling these changes back upon abort) might be reasonably well served by the last alternative. Even in this case, the code (or possibly even hardware) at the other end of the transaction may require a substantial redesign to tolerate aborted transactions. This need for redesign would make it more difficult to apply transactional memory to legacy code.

17.2.1.4 Persistence

There are many different types of locking primitives. One interesting distinction is persistence, in other words, whether the lock can exist independently of the address space of the process using the lock.

Non-persistent locks include pthread_mutex_lock(), pthread_rwlock_rdlock(), and most kernel-level locking primitives. If the memory locations instantiating a non-persistent lock’s data structures disappear, so does the lock. For typical use of pthread_mutex_lock(), this means that when the process exits, all of its locks vanish. This property can be exploited in order to trivialize lock cleanup at program shutdown time, but makes it more difficult for unrelated applications to share locks, as such sharing requires the applications to share memory.

Quick Quiz 17.1: But suppose that an application exits while holding a pthread_mutex_lock() that happens to be located in a file-mapped region of memory?

Persistent locks help avoid the need to share memory among unrelated applications. Persistent locking APIs include the flock family, lockf(), System V semaphores, or the O_CREAT flag to open(). These persistent APIs can be used to protect large-scale operations spanning runs of multiple applications, and, in the case of O_CREAT even surviving operating-system reboot. If need be, locks can even span multiple computer systems via distributed lock managers and distributed filesystems—and persist across reboots of any or all of those computer systems.

Persistent locks can be used by any application, including applications written using multiple languages and software environments. In fact, a persistent lock might well be acquired by an application written in C and released by an application written in Python.

How could a similar persistent functionality be provided for TM?

1. Restrict persistent transactions to special-purpose environments designed to support them, for example, SQL. This clearly works, given the decades-long history of database systems, but does not provide the same degree of flexibility provided by persistent locks.

2. Use snapshot facilities provided by some storage devices and/or filesystems. Unfortunately, this does not handle network communication, nor does it handle I/O to devices that do not provide snapshot capabilities, for example, memory sticks.

3. Build a time machine.

4. Avoid the problem entirely by using existing persistent facilities, presumably avoiding such use within transactions.

Of course, the fact that it is called transactional memory should give us pause, as the name itself conflicts with the concept of a persistent transaction. It is nevertheless worthwhile to consider this possibility as an important test case probing the inherent limitations of transactional memory.

17.2.2 Process Modification

Processes are not eternal: They are created and destroyed, their memory mappings are modified, they are linked to dynamic libraries, and they are debugged. These sections look at how transactional memory can handle an ever-changing execution environment.


17.2.2.1 Multithreaded Transactions

It is perfectly legal to create processes and threads while holding a lock or, for that matter, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. Not only is it legal, but it is quite simple, as can be seen from the following code fragment:

pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
        pthread_create(&tid[i], ...);
for (i = 0; i < ncpus; i++)
        pthread_join(tid[i], ...);
pthread_mutex_unlock(...);

This pseudo-code fragment uses pthread_create() to spawn one thread per CPU, then uses pthread_join() to wait for each to complete, all under the protection of pthread_mutex_lock(). The effect is to execute a lock-based critical section in parallel, and one could obtain a similar effect using fork() and wait(). Of course, the critical section would need to be quite large to justify the thread-spawning overhead, but there are many examples of large critical sections in production software.

What might TM do about thread spawning within a transaction?

1. Declare pthread_create() to be illegal within transactions, preferably by aborting the transaction. Alternatively, enlist the compiler to enforce pthread_create()-free transactions.

2. Permit pthread_create() to be executed within a transaction, but only the parent thread will be considered to be part of the transaction. This approach seems to be reasonably compatible with existing and posited TM implementations, but seems to be a trap for the unwary. This approach raises further questions, such as how to handle conflicting child-thread accesses.

3. Convert the pthread_create()s to function calls. This approach is also an attractive nuisance, as it does not handle the not-uncommon cases where the child threads communicate with one another. In addition, it does not permit concurrent execution of the body of the transaction.

4. Extend the transaction to cover the parent and all child threads. This approach raises interesting questions about the nature of conflicting accesses, given that the parent and children are presumably permitted to conflict with each other, but not with other threads. It also raises interesting questions as to what should happen if the parent thread does not wait for its children before committing the transaction. Even more interesting, what happens if the parent conditionally executes pthread_join() based on the values of variables participating in the transaction? The answers to these questions are reasonably straightforward in the case of locking. The answers for TM are left as an exercise for the reader.

Given that parallel execution of transactions is commonplace in the database world, it is perhaps surprising that current TM proposals do not provide for it. On the other hand, the example above is a fairly sophisticated use of locking that is not normally found in simple textbook examples, so perhaps its omission is to be expected. That said, some researchers are using transactions to autoparallelize code [RKM+ 10], and there are rumors that other TM researchers are investigating fork/join parallelism within transactions, so perhaps this topic will soon be addressed more thoroughly.

17.2.2.2 The exec() System Call

One can execute an exec() system call within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. The exact semantics depends on the type of primitive.

In the case of non-persistent primitives (including pthread_mutex_lock(), pthread_rwlock_rdlock(), and userspace RCU), if the exec() succeeds, the whole address space vanishes, along with any locks being held. Of course, if the exec() fails, the address space still lives, so any associated locks would also still live. A bit strange perhaps, but well defined.

On the other hand, persistent primitives (including the flock family, lockf(), System V semaphores, and the O_CREAT flag to open()) would survive regardless of whether the exec() succeeded or failed, so that the exec()ed program might well release them.

Quick Quiz 17.2: What about non-persistent primitives represented by data structures in mmap() regions of memory? What happens when there is an exec() within a critical section of such a primitive?
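For concreteness, the following is a minimal sketch of a persistent lock of the sort just described, built from open() with O_CREAT plus the flock family. The file path is arbitrary and error handling is omitted; the point is simply that the lock lives in the filesystem rather than in any one process’s address space, so it can be contended for, and even released, by unrelated programs.

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Acquire a persistent lock backed by a lock file. */
int acquire_persistent_lock(const char *path)
{
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        flock(fd, LOCK_EX);     /* Blocks until the lock is available. */
        return fd;
}

/* Release the lock and close the underlying file descriptor. */
void release_persistent_lock(int fd)
{
        flock(fd, LOCK_UN);
        close(fd);
}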


What happens when you attempt to execute an exec() system call from within a transaction?

1. Disallow exec() within transactions, so that the enclosing transactions abort upon encountering the exec(). This is well defined, but clearly requires non-TM synchronization primitives for use in conjunction with exec().

2. Disallow exec() within transactions, with the compiler enforcing this prohibition. There is a draft specification for TM in C++ that takes this approach, allowing functions to be decorated with the transaction_safe and transaction_unsafe attributes.4 This approach has some advantages over aborting the transaction at runtime, but again requires non-TM synchronization primitives for use in conjunction with exec(). One disadvantage is the need to decorate a great many library functions with transaction_safe and transaction_unsafe attributes.

3. Treat the transaction in a manner similar to non-persistent locking primitives, so that the transaction survives if exec() fails, and silently commits if the exec() succeeds. The case where only some of the variables affected by the transaction reside in mmap()ed memory (and thus could survive a successful exec() system call) is left as an exercise for the reader.

4. Abort the transaction (and the exec() system call) if the exec() system call would have succeeded, but allow the transaction to continue if the exec() system call would fail. This is in some sense the “correct” approach, but it would require considerable work for a rather unsatisfying result.

4 Thanks to Mark Moir for pointing me at this spec, and to Michael Wong for having pointed me at an earlier revision some time back.

The exec() system call is perhaps the strangest example of an obstacle to universal TM applicability, as it is not completely clear what approach makes sense, and some might argue that this is merely a reflection of the perils of real-life interaction with exec(). That said, the two options prohibiting exec() within transactions are perhaps the most logical of the group.

Similar issues surround the exit() and kill() system calls, as well as a longjmp() or an exception that would exit the transaction. (Where did the longjmp() or exception come from?)

17.2.2.3 Dynamic Linking and Loading

Lock-based critical sections, code holding a hazard pointer, sequence-locking read-side critical sections, and userspace-RCU read-side critical sections can (separately or in combination) legitimately contain code that invokes dynamically linked and loaded functions, including C/C++ shared libraries and Java class libraries. Of course, the code contained in these libraries is by definition unknowable at compile time. So, what happens if a dynamically loaded function is invoked within a transaction?

This question has two parts: (a) How do you dynamically link and load a function within a transaction and (b) What do you do about the unknowable nature of the code within this function? To be fair, item (b) poses some challenges for locking and userspace-RCU as well, at least in theory. For example, the dynamically linked function might introduce a deadlock for locking or might (erroneously) introduce a quiescent state into a userspace-RCU read-side critical section. The difference is that while the class of operations permitted in locking and userspace-RCU critical sections is well-understood, there appears to still be considerable uncertainty in the case of TM. In fact, different implementations of TM seem to have different restrictions.

So what can TM do about dynamically linked and loaded library functions? Options for part (a), the actual loading of the code, include the following:

1. Treat the dynamic linking and loading in a manner similar to a page fault, so that the function is loaded and linked, possibly aborting the transaction in the process. If the transaction is aborted, the retry will find the function already present, and the transaction can thus be expected to proceed normally.

2. Disallow dynamic linking and loading of functions from within transactions.

Options for part (b), the inability to detect TM-unfriendly operations in a not-yet-loaded function, include the following:

1. Just execute the code: If there are any TM-unfriendly operations in the function, simply abort the transaction. Unfortunately, this approach makes it impossible for the compiler to determine whether a given group of transactions may be safely composed. One way to permit composability regardless is irrevocable transactions; however, current implementations permit only a single irrevocable transaction to proceed at any given time, which can severely limit performance and scalability.


Irrevocable transactions also restrict use of manual transaction-abort operations. Finally, if there is an irrevocable transaction manipulating a given data item, any other transaction manipulating that same data item cannot have non-blocking semantics.

2. Decorate the function declarations indicating which functions are TM-friendly. These decorations can then be enforced by the compiler’s type system. Of course, for many languages, this requires language extensions to be proposed, standardized, and implemented, with the corresponding time delays, and also with the corresponding decoration of a great many otherwise uninvolved library functions. That said, the standardization effort is already in progress [ATS09].

3. As above, disallow dynamic linking and loading of functions from within transactions.

I/O operations are of course a known weakness of TM, and dynamic linking and loading can be thought of as yet another special case of I/O. Nevertheless, the proponents of TM must either solve this problem, or resign themselves to a world where TM is but one tool of several in the parallel programmer’s toolbox. (To be fair, a number of TM proponents have long since resigned themselves to a world containing more than just TM.)

17.2.2.4 Memory-Mapping Operations

It is perfectly legal to execute memory-mapping operations (including mmap(), shmat(), and munmap() [Gro01]) within a lock-based critical section, while holding a hazard pointer, within a sequence-locking read-side critical section, and from within a userspace-RCU read-side critical section, and even all at the same time, if need be. What happens when you attempt to execute such an operation from within a transaction? More to the point, what happens if the memory region being remapped contains some variables participating in the current thread’s transaction? And what if this memory region contains variables participating in some other thread’s transaction?

It should not be necessary to consider cases where the TM system’s metadata is remapped, given that most locking primitives do not define the outcome of remapping their lock variables.

Here are some TM memory-mapping options:

1. Memory remapping is illegal within a transaction, and will result in all enclosing transactions being aborted. This does simplify things somewhat, but also requires that TM interoperate with synchronization primitives that do tolerate remapping from within their critical sections.

2. Memory remapping is illegal within a transaction, and the compiler is enlisted to enforce this prohibition.

3. Memory mapping is legal within a transaction, but aborts all other transactions having variables in the region mapped over.

4. Memory mapping is legal within a transaction, but the mapping operation will fail if the region being mapped overlaps with the current transaction’s footprint.

5. All memory-mapping operations, whether within or outside a transaction, check the region being mapped against the memory footprint of all transactions in the system. If there is overlap, then the memory-mapping operation fails.

6. The effect of memory-mapping operations that overlap the memory footprint of any transaction in the system is determined by the TM conflict manager, which might dynamically determine whether to fail the memory-mapping operation or abort any conflicting transactions.

It is interesting to note that munmap() leaves the relevant region of memory unmapped, which could have additional interesting implications.5

5 This difference between mapping and unmapping was noted by Josh Triplett.

17.2.2.5 Debugging

The usual debugging operations such as breakpoints work normally within lock-based critical sections and from userspace-RCU read-side critical sections. However, in initial transactional-memory hardware implementations [DLMN09] an exception within a transaction will abort that transaction, which in turn means that breakpoints abort all enclosing transactions.

So how can transactions be debugged?

1. Use software emulation techniques within transactions containing breakpoints. Of course, it might be necessary to emulate all transactions any time a breakpoint is set within the scope of any transaction.


If the runtime system is unable to determine whether usual well-known software-engineering techniques are
or not a given breakpoint is within the scope of a employed to avoid deadlock. It is not unusual to acquire
transaction, then it might be necessary to emulate all locks from within RCU read-side critical sections, which
transactions just to be on the safe side. However, this eases deadlock concerns because RCU read-side prim-
approach might impose significant overhead, which itives cannot participate in lock-based deadlock cycles.
might in turn obscure the bug being pursued. It is also possible to acquire locks while holding hazard
pointers and within sequence-lock read-side critical sec-
2. Use only hardware TM implementations that are
tions. But what happens when you attempt to acquire a
capable of handling breakpoint exceptions. Unfor-
lock from within a transaction?
tunately, as of this writing (March 2021), all such
In theory, the answer is trivial: Simply manipulate the
implementations are research prototypes.
data structure representing the lock as part of the trans-
3. Use only software TM implementations, which are action, and everything works out perfectly. In practice, a
(very roughly speaking) more tolerant of exceptions number of non-obvious complications [VGS08] can arise,
than are the simpler of the hardware TM implemen- depending on implementation details of the TM system.
tations. Of course, software TM tends to have higher These complications can be resolved, but at the cost of a
overhead than hardware TM, so this approach may 45 % increase in overhead for locks acquired outside of
not be acceptable in all situations. transactions and a 300 % increase in overhead for locks
acquired within transactions. Although these overheads
4. Program more carefully, so as to avoid having bugs might be acceptable for transactional programs contain-
in the transactions in the first place. As soon as you ing small amounts of locking, they are often completely
figure out how to do this, please do let everyone know unacceptable for production-quality lock-based programs
the secret! wishing to use the occasional transaction.
There is some reason to believe that transactional mem-
ory will deliver productivity improvements compared to 1. Use only locking-friendly TM implementations. Un-
other synchronization mechanisms, but it does seem quite fortunately, the locking-unfriendly implementations
possible that these improvements could easily be lost if have some attractive properties, including low over-
traditional debugging techniques cannot be applied to head for successful transactions and the ability to
transactions. This seems especially true if transactional accommodate extremely large transactions.
memory is to be used by novices on large transactions. In
2. Use TM only “in the small” when introducing TM
contrast, macho “top-gun” programmers might be able to
to lock-based programs, thereby accommodating the
dispense with such debugging aids, especially for small
limitations of locking-friendly TM implementations.
transactions.
Therefore, if transactional memory is to deliver on 3. Set aside locking-based legacy systems entirely, re-
its productivity promises to novice programmers, the implementing everything in terms of transactions.
debugging problem does need to be solved. This approach has no shortage of advocates, but this
requires that all the issues described in this series be
17.2.3 Synchronization resolved. During the time it takes to resolve these
issues, competing synchronization mechanisms will
If transactional memory someday proves that it can be of course also have the opportunity to improve.
everything to everyone, it will not need to interact with
any other synchronization mechanism. Until then, it 4. Use TM strictly as an optimization in lock-based
will need to work with synchronization mechanisms that systems, as was done by the TxLinux [RHP+ 07]
can do what it cannot, or that work more naturally in a group and by a great many transactional lock elision
given situation. The following sections outline the current projects [PD11, Kle14, FIMR16, PMDY20]. This
challenges in this area. approach seems sound, but leaves the locking design
constraints (such as the need to avoid deadlock) firmly
17.2.3.1 Locking in place.

It is commonplace to acquire locks while holding other 5. Strive to reduce the overhead imposed on locking
locks, which works quite well, at least as long as the primitives.
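To make the irrevocability issue concrete, the following sketch assumes GCC's -fgnu-tm implementation, in which a relaxed transaction may contain arbitrary code, including lock acquisition, at the price of being run in serial-irrevocable mode. The names legacy_lock, tm_counter, and legacy_counter are invented for illustration.

	/* Sketch only: acquiring a lock from within a transaction (gcc -fgnu-tm). */
	#include <pthread.h>

	pthread_mutex_t legacy_lock = PTHREAD_MUTEX_INITIALIZER;
	long tm_counter;
	long legacy_counter;

	void mixed_update(void)
	{
		/*
		 * A relaxed transaction may execute TM-unfriendly operations
		 * such as pthread_mutex_lock(), but the runtime is then
		 * expected to make the transaction irrevocable, so that only
		 * one such transaction can be in flight at any given time.
		 */
		__transaction_relaxed {
			tm_counter++;
			pthread_mutex_lock(&legacy_lock);
			legacy_counter++;
			pthread_mutex_unlock(&legacy_lock);
		}
	}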


The fact that there could possibly be a problem inter- acquiring reader-writer locks from within transac-
facing TM and locking came as a surprise to many, which tions.
underscores the need to try out new mechanisms and prim-
itives in real-world production software. Fortunately, the 3. Set aside locking-based legacy systems entirely, re-
advent of open source means that a huge quantity of such implementing everything in terms of transactions.
software is now freely available to everyone, including This approach has no shortage of advocates, but this
researchers. requires that all the issues described in this series be
resolved. During the time it takes to resolve these
issues, competing synchronization mechanisms will
17.2.3.2 Reader-Writer Locking
of course also have the opportunity to improve.
It is commonplace to read-acquire reader-writer locks
4. Use TM strictly as an optimization in lock-based sys-
while holding other locks, which just works, at least as long
tems, as was done by the TxLinux [RHP+ 07] group,
as the usual well-known software-engineering techniques
and as has been done by more recent work using TM
are employed to avoid deadlock. Read-acquiring reader-
to elide reader writer locks [FIMR16]. This approach
writer locks from within RCU read-side critical sections
seems sound, at least on POWER8 CPUs [LGW+ 15],
also works, and doing so eases deadlock concerns because
but leaves the locking design constraints (such as the
RCU read-side primitives cannot participate in lock-based
need to avoid deadlock) firmly in place.
deadlock cycles. It is also possible to acquire locks
while holding hazard pointers and within sequence-lock Of course, there might well be other non-obvious issues
read-side critical sections. But what happens when you surrounding combining TM with reader-writer locking,
attempt to read-acquire a reader-writer lock from within a as there in fact were with exclusive locking.
transaction?
Unfortunately, the straightforward approach to read-
17.2.3.3 Deferred Reclamation
acquiring the traditional counter-based reader-writer lock
within a transaction defeats the purpose of the reader- This section focuses mainly on RCU. Similar issues
writer lock. To see this, consider a pair of transactions and possible resolutions arise when combining TM with
concurrently attempting to read-acquire the same reader- other deferred-reclamation mechanisms such as reference
writer lock. Because read-acquisition involves modifying counters and hazard pointers. In the text below, known
the reader-writer lock’s data structures, a conflict will differences are specifically called out.
result, which will roll back one of the two transactions. Reference counting, hazard pointers, and RCU are all
This behavior is completely inconsistent with the reader- heavily used, as noted in Sections 9.5.5 and 9.6.3. This
writer lock’s goal of allowing concurrent readers. means that any TM implementation that chooses not to
Here are some options available to TM: surmount each and every challenge called out in this
section needs to interoperate cleanly and efficiently with
1. Use per-CPU or per-thread reader-writer lock-
all of these synchronization mechanisms.
ing [HW92], which allows a given CPU (or thread,
The TxLinux group from the University of Texas at
respectively) to manipulate only local data when
Austin appears to be the group to take on the challenge
read-acquiring the lock. This would avoid the con-
of RCU/TM interoperability [RHP+ 07]. Because they
flict between the two transactions concurrently read-
applied TM to the Linux 2.6 kernel, which uses RCU,
acquiring the lock, permitting both to proceed, as
they had no choice but to integrate TM and RCU, with
intended. Unfortunately, (1) the write-acquisition
TM taking the place of locking for RCU updates. Un-
overhead of per-CPU/thread locking can be extremely
fortunately, although the paper does state that the RCU
high, (2) the memory overhead of per-CPU/thread
implementation’s locks (e.g., rcu_ctrlblk.lock) were
locking can be prohibitive, and (3) this transforma-
converted to transactions, it is silent about what was done
tion is available only when you have access to the
with those locks used by RCU-based updates (for example,
source code in question. Other more-recent scalable
dcache_lock).
reader-writer locks [LLO09] might avoid some or all
More recently, Dimitrios Siakavaras et al. have ap-
of these problems.
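A rough sketch of the per-thread approach described in option 1 above appears below; the names and the fixed thread count are purely illustrative. Read acquisition touches only the acquiring thread's own lock, so concurrent read-acquisitions need not conflict, while write acquisition must visit every thread's lock.

	/* Sketch only: per-thread ("big-reader") reader-writer locking. */
	#include <pthread.h>

	#define MAX_THREADS 128

	/* Each element must be initialized via pthread_mutex_init() at startup. */
	static pthread_mutex_t rd_lock[MAX_THREADS];

	void per_thread_read_lock(int tid)
	{
		pthread_mutex_lock(&rd_lock[tid]);	/* Touches only local data. */
	}

	void per_thread_read_unlock(int tid)
	{
		pthread_mutex_unlock(&rd_lock[tid]);
	}

	void per_thread_write_lock(void)	/* Expensive, as noted above. */
	{
		int i;

		for (i = 0; i < MAX_THREADS; i++)
			pthread_mutex_lock(&rd_lock[i]);
	}

	void per_thread_write_unlock(void)
	{
		int i;

		for (i = MAX_THREADS - 1; i >= 0; i--)
			pthread_mutex_unlock(&rd_lock[i]);
	}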
plied HTM and RCU to search trees [SNGK17, SBN+ 20],
2. Use TM only “in the small” when introducing Christina Giannoula et al. have used HTM and RCU to
TM to lock-based programs, thereby avoiding read- color graphs [GGK18], and SeongJae Park et al. have


used HTM and RCU to optimize high-contention locking transactions. In addition, not all TM implementa-
on NUMA systems [PMDY20]. tions are capable of delaying conflicting accesses.
It is important to note that RCU permits readers and Nevertheless, this approach seems eminently reason-
updaters to run concurrently, further permitting RCU able for hardware TM implementations that support
readers to access data that is in the act of being updated. only small transactions.
Of course, this property of RCU, whatever its performance,
4. RCU readers are converted to transactions. This ap-
scalability, and real-time-response benefits might be, flies
proach pretty much guarantees that RCU is compati-
in the face of the underlying atomicity properties of
ble with any TM implementation, but it also imposes
TM, although the POWER8 CPU family’s suspended-
TM’s rollbacks on RCU read-side critical sections,
transaction facility [LGW+ 15] makes it an exception to
destroying RCU’s real-time response guarantees, and
this rule.
also degrading RCU’s read-side performance. Fur-
So how should TM-based updates interact with concur-
thermore, this approach is infeasible in cases where
rent RCU readers? Some possibilities are as follows:
any of the RCU read-side critical sections contains
1. RCU readers abort concurrent conflicting TM up- operations that the TM implementation in question
dates. This is in fact the approach taken by the is incapable of handling. This approach is more
TxLinux project. This approach does preserve RCU difficult to apply to hazard pointers and reference
semantics, and also preserves RCU’s read-side perfor- counters, which do not have a sharply defined notion
mance, scalability, and real-time-response properties, of a reader as a section of code.
but it does have the unfortunate side-effect of unnec-
5. Many update-side uses of RCU modify a single
essarily aborting conflicting updates. In the worst
pointer to publish a new data structure. In some
case, a long sequence of RCU readers could poten-
of these cases, RCU can safely be permitted to see
tially starve all updaters, which could in theory result
a transactional pointer update that is subsequently
in system hangs. In addition, not all TM implementa-
rolled back, as long as the transaction respects mem-
tions offer the strong atomicity required to implement
ory ordering and as long as the roll-back process uses
this approach, and for good reasons.
call_rcu() to free up the corresponding structure.
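A hedged sketch of this pointer-publication case is shown below, using liburcu-style primitives. The rollback_publication() hook and its arguments are hypothetical, standing in for whatever undo machinery a given TM implementation might provide.

	/* Sketch only: undoing a transactional pointer publication RCU-safely. */
	#include <urcu.h>
	#include <stdlib.h>

	struct foo {
		struct rcu_head rh;	/* First member, permitting the cast below. */
		int data;
	};

	struct foo *gp;			/* RCU-protected global pointer. */

	static void free_foo_cb(struct rcu_head *rhp)
	{
		free((struct foo *)rhp);
	}

	/* Hypothetical hook invoked when the publishing transaction rolls back. */
	void rollback_publication(struct foo *oldp, struct foo *newp)
	{
		rcu_assign_pointer(gp, oldp);		/* Restore the old pointer. */
		call_rcu(&newp->rh, free_foo_cb);	/* Free only after readers finish. */
	}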
2. RCU readers that run concurrently with conflicting Unfortunately, not all TM implementations respect
TM updates get old (pre-transaction) values from any memory barriers within a transaction. Apparently,
conflicting RCU loads. This preserves RCU seman- the thought is that because transactions are supposed
tics and performance, and also prevents RCU-update to be atomic, the ordering of the accesses within the
starvation. However, not all TM implementations transaction is not supposed to matter.
can provide timely access to old values of variables 6. Prohibit use of TM in RCU updates. This is guaran-
that have been tentatively updated by an in-flight teed to work, but restricts use of TM.
transaction. In particular, log-based TM implemen-
tations that maintain old values in the log (thus It seems likely that additional approaches will be un-
providing excellent TM commit performance) are covered, especially given the advent of user-level RCU
not likely to be happy with this approach. Perhaps the and hazard-pointer implementations.6 It is interesting
rcu_dereference() primitive can be leveraged to to note that many of the better performing and scaling
permit RCU to access the old values within a greater STM implementations make use of RCU-like techniques
range of TM implementations, though performance internally [Fra04, FH07, GYW+ 19, KMK+ 19].
might still be an issue. Nevertheless, there are pop- Quick Quiz 17.3: MV-RLU looks pretty good! Doesn’t it
ular TM implementations that have been integrated beat RCU hands down?
with RCU in this manner [PW07, HW11, HW14].
3. If an RCU reader executes an access that conflicts 17.2.3.4 Extra-Transactional Accesses
with an in-flight transaction, then that RCU access
is delayed until the conflicting transaction either Within a lock-based critical section, it is perfectly legal
commits or aborts. This approach preserves RCU to manipulate variables that are concurrently accessed or
semantics, but not RCU’s performance or real-time 6 Kudos to the TxLinux group, Maged Michael, and Josh Triplett

response, particularly in the presence of long-running for coming up with a number of the above alternatives.


even modified outside that lock’s critical section, with one 17.2.4 Discussion
common example being statistical counters. The same
thing is possible within RCU read-side critical sections, The obstacles to universal TM adoption lead to the follow-
and is in fact the common case. ing conclusions:
Given mechanisms such as the so-called “dirty reads”
1. One interesting property of TM is the fact that transac-
that are prevalent in production database systems, it is not
tions are subject to rollback and retry. This property
surprising that extra-transactional accesses have received
underlies TM’s difficulties with irreversible oper-
serious attention from the proponents of TM, with the
ations, including unbuffered I/O, RPCs, memory-
concept of weak atomicity [BLM06] being but one case
mapping operations, time delays, and the exec()
in point.
system call. This property also has the unfortu-
Here are some extra-transactional options:
nate consequence of introducing all the complexi-
ties inherent in the possibility of failure, often in a
1. Conflicts due to extra-transactional accesses always
developer-visible manner.
abort transactions. This is strong atomicity.
2. Another interesting property of TM, noted by Sh-
2. Conflicts due to extra-transactional accesses are ig-
peisman et al. [SATG+ 09], is that TM intertwines
nored, so only conflicts among transactions can abort
the synchronization with the data it protects. This
transactions. This is weak atomicity.
property underlies TM’s issues with I/O, memory-
3. Transactions are permitted to carry out non- mapping operations, extra-transactional accesses,
transactional operations in special cases, such as and debugging breakpoints. In contrast, conven-
when allocating memory or interacting with lock- tional synchronization primitives, including locking
based critical sections. and RCU, maintain a clear separation between the
synchronization primitives and the data that they
4. Produce hardware extensions that permit some op- protect.
erations (for example, addition) to be carried out
concurrently on a single variable by multiple trans- 3. One of the stated goals of many workers in the TM
actions. area is to ease parallelization of large sequential
programs. As such, individual transactions are com-
5. Introduce weak semantics to transactional memory. monly expected to execute serially, which might do
One approach is the combination with RCU de- much to explain TM’s issues with multithreaded
scribed in Section 17.2.3.3, while Gramoli and Guer- transactions.
raoui survey a number of other weak-transaction
approaches [GG14], for example, restricted parti- Quick Quiz 17.4: Given things like spin_trylock(), how
tioning of large “elastic” transactions into smaller does it make any sense at all to claim that TM introduces the
transactions, thus reducing conflict probabilities (al- concept of failure???
beit with tepid performance and scalability). Per-
haps further experience will show that some uses of What should TM researchers and developers do about
extra-transactional accesses can be replaced by weak all of this?
transactions. One approach is to focus on TM in the small, focusing
on small transactions where hardware assist potentially
It appears that transactions were conceived in a vacuum, provides substantial advantages over other synchronization
with no interaction required with any other synchronization primitives and on small programs where there is some
mechanism. If so, it is no surprise that much confusion evidence for increased productivity for a combined TM-
and complexity arises when combining transactions with locking approach [PAT11]. Sun took the small-transaction
non-transactional accesses. But unless transactions are to approach with its Rock research CPU [DLMN09]. Some
be confined to small updates to isolated data structures, or TM researchers seem to agree with these two small-is-
alternatively to be confined to new programs that do not beautiful approaches [SSHT93], others have much higher
interact with the huge body of existing parallel code, then hopes for TM, and yet others hint that high TM aspirations
transactions absolutely must be so combined if they are to might be TM’s worst enemy [Att10, Section 6]. It is
have large-scale practical impact in the near term. nonetheless quite possible that TM will be able to take on
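For example, the statistical counters mentioned at the beginning of this section are commonly implemented as split counters whose updates are deliberately extra-transactional, as in the following rough sketch (the names and fixed thread count are illustrative only):

	/* Sketch only: split statistical counters updated outside any transaction. */
	#define NTHREADS 64

	static unsigned long counts[NTHREADS];	/* One element per thread. */

	/* Fast path: each thread increments only its own element. */
	static inline void count_event(int tid)
	{
		__atomic_store_n(&counts[tid], counts[tid] + 1, __ATOMIC_RELAXED);
	}

	/* Slow path: sum all elements.  The result is approximate while updates
	 * are in flight, which is exactly the sort of weak-atomicity behavior
	 * discussed above. */
	unsigned long read_count(void)
	{
		unsigned long sum = 0;
		int i;

		for (i = 0; i < NTHREADS; i++)
			sum += __atomic_load_n(&counts[i], __ATOMIC_RELAXED);
		return sum;
	}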


Figure 17.9: The STM Vision

Figure 17.10: The STM Reality: Conflicts

larger problems, and this section has listed a few of the
issues that must be resolved if TM is to achieve this lofty
goal.
Of course, everyone involved should treat this as a
learning experience. It would seem that TM researchers
have a great deal to learn from practitioners who have
successfully built large software systems using traditional
synchronization primitives.
And vice versa.
Quick Quiz 17.5: What is to learn? Why not just use TM
for memory-based data structures and locking for those rare
cases featuring the many silly corner cases listed in this silly
section???

But for the moment, the current state of STM can best be
summarized with a series of cartoons. First, Figure 17.9
shows the STM vision. As always, the reality is a bit more
nuanced, as fancifully depicted by Figures 17.10, 17.11,
and 17.12.7 Less fanciful STM retrospectives are also
available [Duf10a, Duf10b].
Some commercially available hardware supports re-
stricted variants of HTM, which are addressed in the
following section.

Figure 17.11: The STM Reality: Irrevocable Operations

7 Recent academic work-in-progress has investigated lock-based STM systems for real-time use [And19, NA18], albeit without any performance results, and with some indications that real-time hybrid STM/HTM systems must choose between fast common-case performance and worst-case forward-progress guarantees [AKK+ 14, SBV10].


Figure 17.12: The STM Reality: Realtime Response

17.3 Hardware Transactional Memory

Make sure your report system is reasonably clean and efficient before you automate. Otherwise, your new computer will just speed up the mess.

Robert Townsend

As of 2021, hardware transactional memory (HTM) has been available for many years on several types of commercially available commodity computer systems [YHLR13, Mer11, JSG12, Hay20]. This section makes an attempt to identify HTM's place in the parallel programmer's toolbox.

From a conceptual viewpoint, HTM uses processor caches and speculative execution to make a designated group of statements (a “transaction”) take effect atomically from the viewpoint of any other transactions running on other processors. This transaction is initiated by a begin-transaction machine instruction and completed by a commit-transaction machine instruction. There is typically also an abort-transaction machine instruction, which squashes the speculation (as if the begin-transaction instruction and all following instructions had not executed) and commences execution at a failure handler. The location of the failure handler is typically specified by the begin-transaction instruction, either as an explicit failure-handler address or via a condition code set by the instruction itself. Each transaction executes atomically with respect to all other transactions.
Memory
17.3.1 HTM Benefits WRT Locking
Make sure your report system is reasonably clean and
efficient before you automate. Otherwise, your new
The primary benefits of HTM are (1) its avoidance of the
computer will just speed up the mess. cache misses that are often incurred by other synchro-
nization primitives, (2) its ability to dynamically partition
Robert Townsend data structures, and (3) the fact that it has a fair number
of practical applications. I break from TM tradition by
As of 2021, hardware transactional memory (HTM) not listing ease of use separately for two reasons. First,
has been available for many years on several types ease of use should stem from HTM’s primary benefits,
of commercially available commodity computer sys- which this section focuses on. Second, there has been
tems [YHLR13, Mer11, JSG12, Hay20]. This section considerable controversy surrounding attempts to test for
makes an attempt to identify HTM’s place in the parallel raw programming talent [Bor06, DBA09, PBCE20] and
programmer’s toolbox. even around the use of small programming exercises in
From a conceptual viewpoint, HTM uses processor job interviews [Bra07]. This indicates that we really do
caches and speculative execution to make a designated not have a firm grasp on what makes programming easy
group of statements (a “transaction”) take effect atomi- or hard. Therefore, the remainder of this section focuses
cally from the viewpoint of any other transactions running on the three benefits listed above.
on other processors. This transaction is initiated by a
begin-transaction machine instruction and completed by 17.3.1.1 Avoiding Synchronization Cache Misses
a commit-transaction machine instruction. There is typi-
cally also an abort-transaction machine instruction, which Most synchronization mechanisms are based on data struc-
squashes the speculation (as if the begin-transaction in- tures that are operated on by atomic instructions. Because
struction and all following instructions had not executed) these atomic instructions normally operate by first causing
and commences execution at a failure handler. The lo- the relevant cache line to be owned by the CPU that they are
cation of the failure handler is typically specified by 8 I gratefully acknowledge many stimulating discussions with the
the begin-transaction instruction, either as an explicit other authors, Maged Michael, Josh Triplett, and Jonathan Walpole, as
failure-handler address or via a condition code set by the well as with Andi Kleen.


running on, a subsequent execution of the same instance of 17.3.1.3 Practical Value
that synchronization primitive on some other CPU will re-
Some evidence of HTM’s practical value has been demon-
sult in a cache miss. These communications cache misses
strated in a number of hardware platforms, including
severely degrade both the performance and scalability
Sun Rock [DLMN09], Azul Vega [Cli09], IBM Blue
of conventional synchronization mechanisms [ABD+ 97,
Gene/Q [Mer11], Intel Haswell TSX [RD12], and IBM
Section 4.2.3].
System z [JSG12].
In contrast, HTM synchronizes by using the CPU’s
Expected practical benefits include:
cache, avoiding the need for a separate synchronization
data structure and resultant cache misses. HTM’s advan-
1. Lock elision for in-memory data access and up-
tage is greatest in cases where a lock data structure is
date [MT01, RG02].
placed in a separate cache line, in which case, convert-
ing a given critical section to an HTM transaction can 2. Concurrent access and small random updates to large
reduce that critical section’s overhead by a full cache miss. non-partitionable data structures.
These savings can be quite significant for the common
case of short critical sections, at least for those situations However, HTM also has some very real shortcomings,
where the elided lock does not share a cache line with an which will be discussed in the next section.
oft-written variable protected by that lock.
Quick Quiz 17.6: Why would it matter that oft-written 17.3.2 HTM Weaknesses WRT Locking
variables shared the cache line with the lock variable?
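One conventional way of keeping the elided lock away from oft-written data is explicit alignment, as in the following sketch, which assumes 64-byte cache lines and uses invented structure names:

	/* Sketch only: cache-line placement of an elidable lock. */
	#include <pthread.h>

	struct shared_bad {
		pthread_mutex_t lock;
		unsigned long hits;	/* Shares the lock's cache line, so every
					 * critical section still writes that line,
					 * forfeiting the elision savings. */
	};

	struct shared_better {
		pthread_mutex_t lock __attribute__((aligned(64)));
		unsigned long hits __attribute__((aligned(64)));  /* Separate line. */
	};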
The concept of HTM is quite simple: A group of accesses
and updates to memory occurs atomically. However, as
17.3.1.2 Dynamic Partitioning of Data Structures is the case with many simple ideas, complications arise
when you apply it to real systems in the real world. These
A major obstacle to the use of some conventional synchro- complications are as follows:
nization mechanisms is the need to statically partition data
structures. There are a number of data structures that are 1. Transaction-size limitations.
trivially partitionable, with the most prominent example
being hash tables, where each hash chain constitutes a 2. Conflict handling.
partition. Allocating a lock for each hash chain then triv-
3. Aborts and rollbacks.
ially parallelizes the hash table for operations confined to
a given chain.9 Partitioning is similarly trivial for arrays, 4. Lack of forward-progress guarantees.
radix trees, skiplists, and several other data structures.
However, partitioning for many types of trees and 5. Irrevocable operations.
graphs is quite difficult, and the results are often quite
6. Semantic differences.
complex [Ell80]. Although it is possible to use two-
phased locking and hashed arrays of locks to partition Each of these complications is covered in the following
general data structures, other techniques have proven sections, followed by a summary.
preferable [Mil06], as will be discussed in Section 17.3.3.
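For reference, the trivially partitionable hash-table case called out above, including footnote 9's hash-order acquisition for operations spanning multiple chains, might be coded as in the following sketch (names and sizes are illustrative):

	/* Sketch only: per-hash-chain locking. */
	#include <pthread.h>

	#define NBUCKETS 1024

	struct elem {
		struct elem *next;
		unsigned long key;
	};

	struct bucket {
		pthread_mutex_t lock;	/* Initialize via pthread_mutex_init(). */
		struct elem *head;
	};

	static struct bucket table[NBUCKETS];

	static unsigned long bucketof(unsigned long key)
	{
		return key % NBUCKETS;
	}

	/* Single-chain operation: one lock, so disjoint chains run in parallel. */
	void table_insert(struct elem *ep)
	{
		struct bucket *bp = &table[bucketof(ep->key)];

		pthread_mutex_lock(&bp->lock);
		ep->next = bp->head;
		bp->head = ep;
		pthread_mutex_unlock(&bp->lock);
	}

	/* Two-chain operation: acquire the locks in index order to avoid deadlock. */
	void table_lock_two(unsigned long key1, unsigned long key2)
	{
		unsigned long b1 = bucketof(key1);
		unsigned long b2 = bucketof(key2);

		if (b1 > b2) {
			unsigned long tmp = b1;

			b1 = b2;
			b2 = tmp;
		}
		pthread_mutex_lock(&table[b1].lock);
		if (b2 != b1)
			pthread_mutex_lock(&table[b2].lock);
	}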
Given its avoidance of synchronization cache misses,
17.3.2.1 Transaction-Size Limitations
HTM is therefore a very real possibility for large non-
partitionable data structures, at least assuming relatively The transaction-size limitations of current HTM imple-
small updates. mentations stem from the use of the processor caches to
Quick Quiz 17.7: Why are relatively small updates important hold the data affected by the transaction. Although this
to HTM performance and scalability? allows a given CPU to make the transaction appear atomic
to other CPUs by executing the transaction within the
confines of its cache, it also means that any transaction
that does not fit cannot commit. Furthermore, events that
9 And it is also easy to extend this scheme to operations accessing change execution context, such as interrupts, system calls,
multiple hash chains by having such operations acquire the locks for all exceptions, traps, and context switches either must abort
relevant chains in hash order. any ongoing transaction on the CPU in question or must


further restrict transaction size due to the cache footprint currently available systems do not implement any of these
of the other execution context. research ideas, and perhaps for good reason.
Of course, modern CPUs tend to have large caches, and
the data required for many transactions would fit easily 17.3.2.2 Conflict Handling
in a one-megabyte cache. Unfortunately, with caches,
sheer size is not all that matters. The problem is that The first complication is the possibility of conflicts. For
most caches can be thought of hash tables implemented example, suppose that transactions A and B are defined
in hardware. However, hardware caches do not chain as follows:
their buckets (which are normally called sets), but rather Transaction A Transaction B
provide a fixed number of cachelines per set. The number
x = 1; y = 2;
of elements provided for each set in a given cache is y = 3; x = 4;
termed that cache’s associativity.
Although cache associativity varies, the eight-way as- Suppose that each transaction executes concurrently on
sociativity of the level-0 cache on the laptop I am typing its own processor. If transaction A stores to x at the same
this on is not unusual. What this means is that if a given time that transaction B stores to y, neither transaction can
transaction needed to touch nine cache lines, and if all progress. To see this, suppose that transaction A executes
nine cache lines mapped to the same set, then that trans- its store to y. Then transaction A will be interleaved
action cannot possibly complete, never mind how many within transaction B, in violation of the requirement that
megabytes of additional space might be available in that transactions execute atomically with respect to each other.
cache. Yes, given randomly selected data elements in a Allowing transaction B to execute its store to x similarly
given data structure, the probability of that transaction violates the atomic-execution requirement. This situation
being able to commit is quite high, but there can be no is termed a conflict, which happens whenever two concur-
guarantee [McK11c]. rent transactions access the same variable where at least
There has been some research work to alleviate this one of the accesses is a store. The system is therefore
limitation. Fully associative victim caches would alleviate obligated to abort one or both of the transactions in order
the associativity constraints, but there are currently strin- to allow execution to progress. The choice of exactly
gent performance and energy-efficiency constraints on the which transaction to abort is an interesting topic that will
sizes of victim caches. That said, HTM victim caches for very likely retain the ability to generate Ph.D. dissertations
unmodified cache lines can be quite small, as they need to for some time to come, see for example [ATC+ 11].10 For
retain only the address: The data itself can be written to the purposes of this section, we can assume that the system
memory or shadowed by other caches, while the address makes a random choice.
itself is sufficient to detect a conflicting write [RD12]. Another complication is conflict detection, which is
Unbounded-transactional-memory (UTM) comparatively straightforward, at least in the simplest case.
schemes [AAKL06, MBM+ 06] use DRAM as an When a processor is executing a transaction, it marks every
extremely large victim cache, but integrating such cache line touched by that transaction. If the processor’s
schemes into a production-quality cache-coherence cache receives a request involving a cache line that has been
mechanism is still an unsolved problem. In addition, marked as touched by the current transaction, a potential
use of DRAM as a victim cache may have unfortunate conflict has occurred. More sophisticated systems might
performance and energy-efficiency consequences, try to order the current processors’ transaction to precede
particularly if the victim cache is to be fully associative. that of the processor sending the request, and optimizing
Finally, the “unbounded” aspect of UTM assumes that all this process will likely also retain the ability to generate
of DRAM could be used as a victim cache, while in reality Ph.D. dissertations for quite some time. However this
the large but still fixed amount of DRAM assigned to a section assumes a very simple conflict-detection strategy.
given CPU would limit the size of that CPU’s transactions. However, for HTM to work effectively, the probability
Other schemes use a combination of hardware and of conflict must be quite low, which in turn requires
software transactional memory [KCH+ 06] and one could that the data structures be organized so as to maintain a
imagine using STM as a fallback mechanism for HTM. sufficiently low probability of conflict. For example, a
However, to the best of my knowledge, with the ex- 10 Liu’s and Spear’s paper entitled “Toxic Transactions” [LS11] is

ception of abbreviating representation of TM read sets, particularly instructive.


red-black tree with simple insertion, deletion, and search abort those of low-priority threads? If so, how is the hard-
operations fits this description, but a red-black tree that ware efficiently informed of priorities? The literature on
maintains an accurate count of the number of elements real-time use of HTM is quite sparse, perhaps because
in the tree does not.11 For another example, a red-black there are more than enough problems in making HTM
tree that enumerates all elements in the tree in a single work well in non-real-time environments.
transaction will have high conflict probabilities, degrading Because current HTM implementations might determin-
performance and scalability. As a result, many serial istically abort a given transaction, software must provide
programs will require some restructuring before HTM can fallback code. This fallback code must use some other
work effectively. In some cases, practitioners will prefer form of synchronization, for example, locking. If a lock-
to take the extra steps (in the red-black-tree case, perhaps based fallback is ever used, then all the limitations of
switching to a partitionable data structure such as a radix locking, including the possibility of deadlock, reappear.
tree or a hash table), and just use locking, particularly One can of course hope that the fallback isn’t used of-
until such time as HTM is readily available on all relevant ten, which might allow simpler and less deadlock-prone
architectures [Cli09]. locking designs to be used. But this raises the question
Quick Quiz 17.8: How could a red-black tree possibly of how the system transitions from using the lock-based
efficiently enumerate all elements of the tree regardless of fallbacks back to transactions.12 One approach is to use a
choice of synchronization mechanism??? test-and-test-and-set discipline [MT02], so that everyone
holds off until the lock is released, allowing the system to
Furthermore, the potential for conflicting accesses start from a clean slate in transactional mode at that point.
among concurrent transactions can result in failure. Han- However, this could result in quite a bit of spinning, which
dling such failure is discussed in the next section. might not be wise if the lock holder has blocked or been
preempted. Another approach is to allow transactions to
17.3.2.3 Aborts and Rollbacks proceed in parallel with a thread holding a lock [MT02],
but this raises difficulties in maintaining atomicity, espe-
Because any transaction might be aborted at any time, cially if the reason that the thread is holding the lock is
it is important that transactions contain no statements because the corresponding transaction would not fit into
that cannot be rolled back. This means that transactions cache.
cannot do I/O, system calls, or debugging breakpoints (no Finally, dealing with the possibility of aborts and roll-
single stepping in the debugger for HTM transactions!!!). backs seems to put an additional burden on the developer,
Instead, transactions must confine themselves to accessing who must correctly handle all combinations of possible
normal cached memory. Furthermore, on some systems, error conditions.
interrupts, exceptions, traps, TLB misses, and other events It is clear that users of HTM must put considerable
will also abort transactions. Given the number of bugs that validation effort into testing both the fallback code paths
have resulted from improper handling of error conditions, and transition from fallback code back to transactional
it is fair to ask what impact aborts and rollbacks have on code. Nor is there any reason to believe that the validation
ease of use. requirements of HTM hardware are any less daunting.
Quick Quiz 17.9: But why can’t a debugger emulate single
stepping by setting breakpoints at successive lines of the 17.3.2.4 Lack of Forward-Progress Guarantees
transaction, relying on the retry to retrace the steps of the
earlier instances of the transaction? Even though transaction size, conflicts, and aborts/roll-
backs can all cause transactions to abort, one might hope
Of course, aborts and rollbacks raise the question of
that sufficiently small and short-duration transactions
whether HTM can be useful for hard real-time systems.
could be guaranteed to eventually succeed. This would per-
Do the performance benefits of HTM outweigh the costs
mit a transaction to be unconditionally retried, in the same
of the aborts and rollbacks, and if so under what condi-
way that compare-and-swap (CAS) and load-linked/store-
tions? Can transactions use priority boosting? Or should
conditional (LL/SC) operations are unconditionally retried
transactions for high-priority threads instead preferentially

11 The need to update the count would result in additions to and 12 The possibility of an application getting stuck in fallback mode

deletions from the tree conflicting with each other, resulting in strong has been termed the “lemming effect”, a term that Dave Dice has been
non-commutativity [AGH+ 11a, AGH+ 11b, McK11b]. credited with coining.


in code that uses these instructions to implement atomic changes in configuration. But if this empty critical section
operations. is translated to a transaction, the result is a no-op. The
Unfortunately, other than low-clock-rate academic re- guarantee that all prior critical sections have terminated
search prototypes [SBV10], currently available HTM im- is lost. In other words, transactional lock elision pre-
plementations refuse to make any sort of forward-progress serves the data-protection semantics of locking, but loses
guarantee. As noted earlier, HTM therefore cannot be locking’s time-based messaging semantics.
used to avoid deadlock on those systems. Hopefully fu- Quick Quiz 17.10: But why would anyone need an empty
ture implementations of HTM will provide some sort of lock-based critical section???
forward-progress guarantees. Until that time, HTM must
be used with extreme caution in real-time applications. Quick Quiz 17.11: Can’t transactional lock elision trivially
The one exception to this gloomy picture as of 2021 is handle locking’s time-based messaging semantics by simply
the IBM mainframe, which provides constrained trans- choosing not to elide empty lock-based critical sections?
actions [JSG12]. The constraints are quite severe, and
are presented in Section 17.3.5.1. It will be interesting Quick Quiz 17.12: Given modern hardware [MOZ09], how
to see if HTM forward-progress guarantees migrate from can anyone possibly expect parallel software relying on timing
the mainframe to commodity CPU families. to work?

One important semantic difference between locking


17.3.2.5 Irrevocable Operations and transactions is the priority boosting that is used to
avoid priority inversion in lock-based real-time programs.
Another consequence of aborts and rollbacks is that HTM One way in which priority inversion can occur is when
transactions cannot accommodate irrevocable operations. a low-priority thread holding a lock is preempted by a
Current HTM implementations typically enforce this lim- medium-priority CPU-bound thread. If there is at least one
itation by requiring that all of the accesses in the trans- such medium-priority thread per CPU, the low-priority
action be to cacheable memory (thus prohibiting MMIO thread will never get a chance to run. If a high-priority
accesses) and aborting transactions on interrupts, traps, thread now attempts to acquire the lock, it will block.
and exceptions (thus prohibiting system calls). It cannot acquire the lock until the low-priority thread
Note that buffered I/O can be accommodated by HTM releases it, the low-priority thread cannot release the lock
transactions as long as the buffer fill/flush operations until it gets a chance to run, and it cannot get a chance to
occur extra-transactionally. The reason that this works is run until one of the medium-priority threads gives up its
that adding data to and removing data from the buffer is CPU. Therefore, the medium-priority threads are in effect
revocable: Only the actual buffer fill/flush operations are blocking the high-priority process, which is the rationale
irrevocable. Of course, this buffered-I/O approach has the for the name “priority inversion.”
effect of including the I/O in the transaction’s footprint, One way to avoid priority inversion is priority inheri-
increasing the size of the transaction and thus increasing tance, in which a high-priority thread blocked on a lock
the probability of failure. temporarily donates its priority to the lock’s holder, which
is also called priority boosting. However, priority boost-
17.3.2.6 Semantic Differences ing can be used for things other than avoiding priority
inversion, as shown in Listing 17.1. Lines 1–12 of this
Although HTM can in many cases be used as a drop-in
listing show a low-priority process that must nevertheless
replacement for locking (hence the name transactional
run every millisecond or so, while lines 14–24 of this
lock elision (TLE) [DHL+ 08]), there are subtle differences
same listing show a high-priority process that uses priority
in semantics. A particularly nasty example involving
boosting to ensure that boostee() runs periodically as
coordinated lock-based critical sections that results in
needed.
deadlock or livelock when executed transactionally was
The boostee() function arranges this by always
given by Blundell [BLM06], but a much simpler example
holding one of the two boost_lock[] locks, so that
is the empty critical section.
lines 20–21 of booster() can boost priority as needed.
In a lock-based program, an empty critical section will
guarantee that all processes that had previously been hold- Quick Quiz 17.13: But the boostee() function in List-
ing that lock have now released it. This idiom was used ing 17.1 alternatively acquires its locks in reverse order! Won’t
by the 2.4 Linux kernel’s networking stack to coordinate this result in deadlock?


Listing 17.1: Exploiting Priority Boosting
 1 void boostee(void)
 2 {
 3   int i = 0;
 4
 5   acquire_lock(&boost_lock[i]);
 6   for (;;) {
 7     acquire_lock(&boost_lock[!i]);
 8     release_lock(&boost_lock[i]);
 9     i = i ^ 1;
10     do_something();
11   }
12 }
13
14 void booster(void)
15 {
16   int i = 0;
17
18   for (;;) {
19     usleep(500); /* sleep 0.5 ms. */
20     acquire_lock(&boost_lock[i]);
21     release_lock(&boost_lock[i]);
22     i = i ^ 1;
23   }
24 }

This arrangement requires that boostee() acquire its first lock on line 5 before the system becomes busy, but this is easily arranged, even on modern hardware.

Unfortunately, this arrangement can break down in the presence of transactional lock elision. The boostee() function's overlapping critical sections become one infinite transaction, which will sooner or later abort, for example, on the first time that the thread running the boostee() function is preempted. At this point, boostee() will fall back to locking, but given its low priority and that the quiet initialization period is now complete (which after all is why boostee() was preempted), this thread might never again get a chance to run.

And if the boostee() thread is not holding the lock, then the booster() thread's empty critical section on lines 20 and 21 of Listing 17.1 will become an empty transaction that has no effect, so that boostee() never runs. This example illustrates some of the subtle consequences of transactional memory's rollback-and-retry semantics.

Given that experience will likely uncover additional subtle semantic differences, application of HTM-based lock elision to large programs should be undertaken with caution. That said, where it does apply, HTM-based lock elision can eliminate the cache misses associated with the lock variable, which has resulted in tens of percent performance increases in large real-world software systems as of early 2015. We can therefore expect to see substantial use of this technique on hardware providing reliable support for it.

Quick Quiz 17.14: So a bunch of people set out to supplant locking, and they mostly end up just optimizing locking???

17.3.2.7 Summary

Although it seems likely that HTM will have compelling use cases, current implementations have serious transaction-size limitations, conflict-handling complications, abort-and-rollback issues, and semantic differences that will require careful handling. HTM's current situation relative to locking is summarized in Table 17.1. As can be seen, although the current state of HTM alleviates some serious shortcomings of locking,13 it does so by introducing a significant number of shortcomings of its own. These shortcomings are acknowledged by leaders in the TM community [MS12].14

In addition, this is not the whole story. Locking is not normally used by itself, but is instead typically augmented by other synchronization mechanisms, including reference counting, atomic operations, non-blocking data structures, hazard pointers [Mic04a, HLM02], and RCU [MS98a, MAK+ 01, HMBW07, McK12b]. The next section looks at how such augmentation changes the equation.

13 In fairness, it is important to emphasize that locking's shortcomings do have well-known and heavily used engineering solutions, including deadlock detectors [Cor06a], a wealth of data structures that have been adapted to locking, and a long history of augmentation, as discussed in Section 17.3.3. In addition, if locking really were as horrible as a quick skim of many academic papers might reasonably lead one to believe, where did all the large lock-based parallel programs (both FOSS and proprietary) come from, anyway?

14 In addition, in early 2011, I was invited to deliver a critique of some of the assumptions underlying transactional memory [McK11e]. The audience was surprisingly non-hostile, though perhaps they were taking it easy on me due to the fact that I was heavily jet-lagged while giving the presentation.

17.3.3 HTM Weaknesses WRT Locking When Augmented

Practitioners have long used reference counting, atomic operations, non-blocking data structures, hazard pointers, and RCU to avoid some of the shortcomings of locking. For example, deadlock can be avoided in many cases by using reference counts, hazard pointers, or RCU to protect data structures, particularly for read-only critical sections [Mic04a, HLM02, DMS+ 12, GMTW08, HMBW07]. These approaches also reduce the need to partition data structures, as was seen in Chapter 10. RCU further provides contention-free bounded wait-free read-side primitives [MS98a, DMS+ 12], while


Table 17.1: Comparison of Locking and HTM ( Advantage , Disadvantage , Strong Disadvantage )

Locking Hardware Transactional Memory


Basic Idea Allow only one thread at a time to access a given Cause a given operation over a set of objects to
set of objects. execute atomically.
Scope Handles all operations. Handles revocable operations.
Irrevocable operations force fallback (typically to
locking).
Composability Limited by deadlock. Limited by irrevocable operations, transaction size,
and deadlock (assuming lock-based fallback code).
Scalability & Per- Data must be partitionable to avoid lock contention. Data must be partitionable to avoid conflicts.
formance
Partitioning must typically be fixed at design time. Dynamic adjustment of partitioning carried out
automatically down to cacheline boundaries.
Partitioning required for fallbacks (less important
for rare fallbacks).
Locking primitives typically result in expensive Transactions begin/end instructions typically do
cache misses and memory-barrier instructions. not result in cache misses, but do have memory-
ordering and overhead consequences.
Contention effects are focused on acquisition and Contention aborts conflicting transactions, even if
release, so that the critical section runs at full speed. they have been running for a long time.
Privatization operations are simple, intuitive, per- Privatized data contributes to transaction size.
formant, and scalable.
Hardware Support Commodity hardware suffices. New hardware required (and is starting to become
available).
Performance is insensitive to cache-geometry de- Performance depends critically on cache geometry.
tails.
Software Support APIs exist, large body of code and experience, APIs emerging, little experience outside of DBMS,
debuggers operate naturally. breakpoints mid-transaction can be problematic.
Interaction With Long experience of successful interaction. Just beginning investigation of interaction.
Other Mechanisms
Practical Apps Yes. Yes.
Wide Applicability Yes. Jury still out.


Table 17.2: Comparison of Locking (Augmented by RCU or Hazard Pointers) and HTM ( Advantage , Disadvantage ,
Strong Disadvantage )

Locking with Userspace RCU or Hazard Pointers Hardware Transactional Memory


Basic Idea Allow only one thread at a time to access a given set Cause a given operation over a set of objects to execute
of objects. atomically.
Scope Handles all operations. Handles revocable operations.
Irrevocable operations force fallback (typically to lock-
ing).
Composability Readers limited only by grace-period-wait operations. Limited by irrevocable operations, transaction size,
and deadlock. (Assuming lock-based fallback code.)
Updaters limited by deadlock. Readers reduce dead-
lock.
Scalability & Per- Data must be partitionable to avoid lock contention Data must be partitionable to avoid conflicts.
formance among updaters.
Partitioning not needed for readers.
Partitioning for updaters must typically be fixed at Dynamic adjustment of partitioning carried out auto-
design time. matically down to cacheline boundaries.
Partitioning not needed for readers. Partitioning required for fallbacks (less important for
rare fallbacks).
Updater locking primitives typically result in expensive Transactions begin/end instructions typically do not
cache misses and memory-barrier instructions. result in cache misses, but do have memory-ordering
and overhead consequences.
Update-side contention effects are focused on acquisi- Contention aborts conflicting transactions, even if they
tion and release, so that the critical section runs at full have been running for a long time.
speed.
Readers do not contend with updaters or with each
other.
Read-side primitives are typically bounded wait-free Read-only transactions subject to conflicts and roll-
with low overhead. (Lock-free with low overhead for backs. No forward-progress guarantees other than
hazard pointers.) those supplied by fallback code.
Privatization operations are simple, intuitive, perfor- Privatized data contributes to transaction size.
mant, and scalable when data is visible only to updaters.
Privatization operations are expensive (though still
intuitive and scalable) for reader-visible data.
Hardware Support Commodity hardware suffices. New hardware required (and is starting to become
available).
Performance is insensitive to cache-geometry details. Performance depends critically on cache geometry.
Software Support APIs exist, large body of code and experience, debug- APIs emerging, little experience outside of DBMS,
gers operate naturally. breakpoints mid-transaction can be problematic.
Interaction With Long experience of successful interaction. Just beginning investigation of interaction.
Other Mechanisms
Practical Apps Yes. Yes.
Wide Applicability Yes. Jury still out.


hazard pointers provides lock-free read-side primitives [Mic02, HLM02, Mic04a]. Adding these considerations to Table 17.1 results in the updated comparison between augmented locking and HTM shown in Table 17.2. A summary of the differences between the two tables is as follows:

1. Use of non-blocking read-side mechanisms alleviates deadlock issues.

2. Read-side mechanisms such as hazard pointers and RCU can operate efficiently on non-partitionable data.

3. Hazard pointers and RCU do not contend with each other or with updaters, allowing excellent performance and scalability for read-mostly workloads.

4. Hazard pointers and RCU provide forward-progress guarantees (lock freedom and bounded wait-freedom, respectively).

5. Privatization operations for hazard pointers and RCU are straightforward.

For those with good eyesight, Table 17.3 combines Tables 17.1 and 17.2.

Quick Quiz 17.15: Tables 17.1 and 17.2 state that hardware is only starting to become available. But hasn't HTM hardware support been widely available for almost a full decade?

Of course, it is also possible to augment HTM, as discussed in the next section.

17.3.4 Where Does HTM Best Fit In?

Although it will likely be some time before HTM's area of applicability can be as crisply delineated as that shown for RCU in Figure 9.33 on page 176, that is no reason not to start moving in that direction.

HTM seems best suited to update-heavy workloads involving relatively small changes to disparate portions of relatively large in-memory data structures running on large multiprocessors, as this meets the size restrictions of current HTM implementations while minimizing the probability of conflicts and attendant aborts and rollbacks. This scenario is also one that is relatively difficult to handle given current synchronization primitives.

Use of locking in conjunction with HTM seems likely to overcome HTM's difficulties with irrevocable operations, while use of RCU or hazard pointers might alleviate HTM's transaction-size limitations for read-only operations that traverse large fractions of the data structure [PMDY20]. Current HTM implementations unconditionally abort an update transaction that conflicts with an RCU or hazard-pointer reader, but perhaps future HTM implementations will interoperate more smoothly with these synchronization mechanisms. In the meantime, the probability of an update conflicting with a large RCU or hazard-pointer read-side critical section should be much smaller than the probability of conflicting with the equivalent read-only transaction.15 Nevertheless, it is quite possible that a steady stream of RCU or hazard-pointer readers might starve updaters due to a corresponding steady stream of conflicts. This vulnerability could be eliminated (at significant hardware cost and complexity) by giving extra-transactional reads the pre-transaction copy of the memory location being loaded.

The fact that HTM transactions must have fallbacks might in some cases force static partitionability of data structures back onto HTM. This limitation might be alleviated if future HTM implementations provide forward-progress guarantees, which might eliminate the need for fallback code in some cases, which in turn might allow HTM to be used efficiently in situations with higher conflict probabilities.

In short, although HTM is likely to have important uses and applications, it is another tool in the parallel programmer's toolbox, not a replacement for the toolbox in its entirety.

17.3.5 Potential Game Changers

Game changers that could greatly increase the need for HTM include the following:

1. Forward-progress guarantees.

2. Transaction-size increases.

3. Improved debugging support.

4. Weak atomicity.

These are expanded upon in the following sections.

15 It is quite ironic that strictly transactional mechanisms are appearing in shared-memory systems at just about the time that NoSQL databases are relaxing the traditional database-application reliance on strict transactions. Nevertheless, HTM has in fact realized the ease-of-use promise of TM, albeit for black-hat attacks on the Linux kernel's address-space randomization defense mechanism [JLK16a, JLK16b].


Table 17.3: Comparison of Locking (Plain and Augmented) and HTM (Advantage, Disadvantage, Strong Disadvantage)
(Each row lists, in order: Locking; Locking with Userspace RCU or Hazard Pointers; Hardware Transactional Memory.)

Basic Idea: Allow only one thread at a time to access a given set of objects. | Allow only one thread at a time to access a given set of objects. | Cause a given operation over a set of objects to execute atomically.

Scope: Handles all operations. | Handles all operations. | Handles revocable operations. Irrevocable operations force fallback (typically to locking).

Composability: Limited by deadlock. | Readers limited only by grace-period-wait operations. Updaters limited by deadlock. Readers reduce deadlock. | Limited by irrevocable operations, transaction size, and deadlock. (Assuming lock-based fallback code.)

Scalability & Performance: Data must be partitionable to avoid lock contention. | Data must be partitionable to avoid lock contention among updaters. Partitioning not needed for readers. | Data must be partitionable to avoid conflicts.

Scalability & Performance: Partitioning must typically be fixed at design time. | Partitioning for updaters must typically be fixed at design time. Partitioning not needed for readers. | Dynamic adjustment of partitioning carried out automatically down to cacheline boundaries. Partitioning required for fallbacks (less important for rare fallbacks).

Scalability & Performance: Locking primitives typically result in expensive cache misses and memory-barrier instructions. | Updater locking primitives typically result in expensive cache misses and memory-barrier instructions. | Transaction begin/end instructions typically do not result in cache misses, but do have memory-ordering and overhead consequences.

Scalability & Performance: Contention effects are focused on acquisition and release, so that the critical section runs at full speed. | Update-side contention effects are focused on acquisition and release, so that the critical section runs at full speed. Readers do not contend with updaters or with each other. | Contention aborts conflicting transactions, even if they have been running for a long time.

Scalability & Performance: — | Read-side primitives are typically bounded wait-free with low overhead. (Lock-free with low overhead for hazard pointers.) | Read-only transactions subject to conflicts and rollbacks. No forward-progress guarantees other than those supplied by fallback code.

Scalability & Performance: Privatization operations are simple, intuitive, performant, and scalable. | Privatization operations are simple, intuitive, performant, and scalable when data is visible only to updaters. Privatization operations are expensive (though still intuitive and scalable) for reader-visible data. | Privatized data contributes to transaction size.

Hardware Support: Commodity hardware suffices. | Commodity hardware suffices. | New hardware required (and is starting to become available).

Hardware Support: Performance is insensitive to cache-geometry details. | Performance is insensitive to cache-geometry details. | Performance depends critically on cache geometry.

Software Support: APIs exist, large body of code and experience, debuggers operate naturally. | APIs exist, large body of code and experience, debuggers operate naturally. | APIs emerging, little experience outside of DBMS, breakpoints mid-transaction can be problematic.

Interaction With Other Mechanisms: Long experience of successful interaction. | Long experience of successful interaction. | Just beginning investigation of interaction.

Practical Apps: Yes. | Yes. | Yes.

Wide Applicability: Yes. | Yes. | Jury still out.
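The "fallback (typically to locking)" entries in the HTM column can be made concrete with a short sketch. The fragment below is illustrative only and is not taken from this book's CodeSamples: it attempts a hardware transaction using Intel's RTM intrinsics from <immintrin.h> and subscribes to a fallback flag, reverting to a global pthread_mutex_t after repeated aborts. Production-quality elision needs far more careful retry policies, including avoidance of the "lemming effect" discussed in Section 17.3.2.3, and must be built with -mrtm on RTM-capable hardware.

    /* Minimal sketch of an HTM critical section with a global-lock
     * fallback.  Illustrative only:  omits retry heuristics and
     * lemming-effect avoidance. */
    #include <immintrin.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_int fallback_held;  /* Read inside transactions so that
                                       * a fallback holder aborts them. */

    static void critsect_begin(void)
    {
        for (int tries = 0; tries < 8; tries++) {
            if (_xbegin() == _XBEGIN_STARTED) {
                if (!atomic_load(&fallback_held))
                    return;          /* Transactional path. */
                _xabort(0xff);       /* Fallback lock held:  abort. */
            }
            /* Transaction aborted:  retry a few times, then fall back. */
        }
        pthread_mutex_lock(&fallback_lock);
        atomic_store(&fallback_held, 1);  /* Aborts conflicting transactions. */
    }

    static void critsect_end(void)
    {
        if (_xtest()) {
            _xend();                 /* Commit the transaction. */
        } else {
            atomic_store(&fallback_held, 0);
            pthread_mutex_unlock(&fallback_lock);
        }
    }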



17.3.5.1 Forward-Progress Guarantees

As was discussed in Section 17.3.2.4, current HTM implementations lack forward-progress guarantees, which requires that fallback software is available to handle HTM failures. Of course, it is easy to demand guarantees, but not always easy to provide them. In the case of HTM, obstacles to guarantees can include cache size and associativity, TLB size and associativity, transaction duration and interrupt frequency, and scheduler implementation.

Cache size and associativity was discussed in Section 17.3.2.1, along with some research intended to work around current limitations. However, HTM forward-progress guarantees would come with size limits, large though these limits might one day be. So why don't current HTM implementations provide forward-progress guarantees for small transactions, for example, limited to the associativity of the cache? One potential reason might be the need to deal with hardware failure. For example, a failing cache SRAM cell might be handled by deactivating the failing cell, thus reducing the associativity of the cache and therefore also the maximum size of transactions that can be guaranteed forward progress. Given that this would simply decrease the guaranteed transaction size, it seems likely that other reasons are at work. Perhaps providing forward progress guarantees on production-quality hardware is more difficult than one might think, an entirely plausible explanation given the difficulty of making forward-progress guarantees in software. Moving a problem from software to hardware does not necessarily make it easier to solve [JSG12].

Given a physically tagged and indexed cache, it is not enough for the transaction to fit in the cache. Its address translations must also fit in the TLB. Any forward-progress guarantees must therefore also take TLB size and associativity into account.

Given that interrupts, traps, and exceptions abort transactions in current HTM implementations, it is necessary that the execution duration of a given transaction be shorter than the expected interval between interrupts. No matter how little data a given transaction touches, if it runs too long, it will be aborted. Therefore, any forward-progress guarantees must be conditioned not only on transaction size, but also on transaction duration.

Forward-progress guarantees depend critically on the ability to determine which of several conflicting transactions should be aborted. It is all too easy to imagine an endless series of transactions, each aborting an earlier transaction only to itself be aborted by a later transaction, so that none of the transactions actually commit. The complexity of conflict handling is evidenced by the large number of HTM conflict-resolution policies that have been proposed [ATC+11, LS11]. Additional complications are introduced by extra-transactional accesses, as noted by Blundell [BLM06]. It is easy to blame the extra-transactional accesses for all of these problems, but the folly of this line of thinking is easily demonstrated by placing each of the extra-transactional accesses into its own single-access transaction. It is the pattern of accesses that is the issue, not whether or not they happen to be enclosed in a transaction.

Finally, any forward-progress guarantees for transactions also depend on the scheduler, which must let the thread executing the transaction run long enough to successfully commit.

So there are significant obstacles to HTM vendors offering forward-progress guarantees. However, the impact of any of them doing so would be enormous. It would mean that HTM transactions would no longer need software fallbacks, which would mean that HTM could finally deliver on the TM promise of deadlock elimination.

However, in late 2012, the IBM Mainframe announced an HTM implementation that includes constrained transactions in addition to the usual best-effort HTM implementation [JSG12]. A constrained transaction starts with the tbeginc instruction instead of the tbegin instruction that is used for best-effort transactions. Constrained transactions are guaranteed to always complete (eventually), so if a transaction aborts, rather than branching to a fallback path (as is done for best-effort transactions), the hardware instead restarts the transaction at the tbeginc instruction.

The Mainframe architects needed to take extreme measures to deliver on this forward-progress guarantee. If a given constrained transaction repeatedly fails, the CPU might disable branch prediction, force in-order execution, and even disable pipelining. If the repeated failures are due to high contention, the CPU might disable speculative fetches, introduce random delays, and even serialize execution of the conflicting CPUs. "Interesting" forward-progress scenarios involve as few as two CPUs or as many as one hundred CPUs. Perhaps these extreme measures provide some insight as to why other CPUs have thus far refrained from offering constrained transactions.

As the name implies, constrained transactions are in fact severely constrained:

1. The maximum data footprint is four blocks of memory, where each block can be no larger than 32 bytes.

2. The maximum code footprint is 256 bytes.


3. If a given 4K page contains a constrained transaction's code, then that page may not contain that transaction's data.

4. The maximum number of assembly instructions that may be executed is 32.

5. Backwards branches are forbidden.

Nevertheless, these constraints support a number of important data structures, including linked lists, stacks, queues, and arrays. Constrained HTM therefore seems likely to become an important tool in the parallel programmer's toolbox.

Note that these forward-progress guarantees need not be absolute. For example, suppose that a use of HTM uses a global lock as fallback. Assuming that the fallback mechanism has been carefully designed to avoid the "lemming effect" discussed in Section 17.3.2.3, then if HTM rollbacks are sufficiently infrequent, the global lock will not be a bottleneck. That said, the larger the system, the longer the critical sections, and the longer the time required to recover from the "lemming effect", the more rare "sufficiently infrequent" needs to be.

17.3.5.2 Transaction-Size Increases

Forward-progress guarantees are important, but as we saw, they will be conditional guarantees based on transaction size and duration. There has been some progress, for example, some commercially available HTM implementations use approximation techniques to support extremely large HTM read sets [RD12]. For another example, POWER8 HTM supports suspended transactions, which avoid adding irrelevant accesses to the suspended transaction's read and write sets [LGW+15]. This capability has been used to produce a high performance reader-writer lock [FIMR16].

It is important to note that even small-sized guarantees will be quite useful. For example, a guarantee of two cache lines is sufficient for a stack, queue, or dequeue. However, larger data structures require larger guarantees, for example, traversing a tree in order requires a guarantee equal to the number of nodes in the tree. Therefore, even modest increases in the size of the guarantee also increase the usefulness of HTM, thereby increasing the need for CPUs to either provide it or provide good-and-sufficient workarounds.

17.3.5.3 Improved Debugging Support

Another inhibitor to transaction size is the need to debug the transactions. The problem with current mechanisms is that a single-step exception aborts the enclosing transaction. There are a number of workarounds for this issue, including emulating the processor (slow!), substituting STM for HTM (slow and slightly different semantics!), playback techniques using repeated retries to emulate forward progress (strange failure modes!), and full support of debugging HTM transactions (complex!).

Should one of the HTM vendors produce an HTM system that allows straightforward use of classical debugging techniques within transactions, including breakpoints, single stepping, and print statements, this will make HTM much more compelling. Some transactional-memory researchers started to recognize this problem in 2013, with at least one proposal involving hardware-assisted debugging facilities [GKP13]. Of course, this proposal depends on readily available hardware gaining such facilities [Hay20, Int20b]. Worse yet, some cutting-edge debugging facilities are incompatible with HTM [OHOC20].

17.3.5.4 Weak Atomicity

Given that HTM is likely to face some sort of size limitations for the foreseeable future, it will be necessary for HTM to interoperate smoothly with other mechanisms. HTM's interoperability with read-mostly mechanisms such as hazard pointers and RCU would be improved if extra-transactional reads did not unconditionally abort transactions with conflicting writes—instead, the read could simply be provided with the pre-transaction value. In this way, hazard pointers and RCU could be used to allow HTM to handle larger data structures and to reduce conflict probabilities.

This is not necessarily simple, however. The most straightforward way of implementing this requires an additional state in each cache line and on the bus, which is a non-trivial added expense. The benefit that goes along with this expense is permitting large-footprint readers without the risk of starving updaters due to continual conflicts. An alternative approach, applied to great effect to binary search trees by Siakavaras et al. [SNGK17], is to use RCU for read-only traversals and HTM only for the actual updates themselves. This combination outperformed other transactional-memory techniques by up to 220 %, a speedup similar to that observed by Howard and Walpole [HW11] when they combined RCU with STM. In both cases, the weak atomicity is implemented in software rather than in hardware. It would nevertheless be interesting to see what additional speedups could be obtained by implementing weak atomicity in both hardware and software.
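A minimal sketch of this division of labor, assuming a userspace-RCU-style API (for example, liburcu) and Intel RTM intrinsics, and therefore not taken from this book's CodeSamples, might look as follows: lookups run entirely within RCU read-side critical sections, while inserts confine the published pointer update to a tiny hardware transaction. A real implementation would also need node removal, revalidation of the predecessor once removal is added, and a lock-based fallback path.

    /* Illustrative RCU-reader / HTM-updater sketch for a sorted list. */
    #include <immintrin.h>
    #include <urcu.h>            /* Assumed userspace-RCU API. */

    struct node {
        int key;
        struct node *next;
    };

    static struct node *head;

    static int list_lookup(int key)  /* RCU reader:  never aborts updates. */
    {
        struct node *p;
        int ret = 0;

        rcu_read_lock();
        for (p = rcu_dereference(head); p; p = rcu_dereference(p->next))
            if (p->key == key) {
                ret = 1;
                break;
            }
        rcu_read_unlock();
        return ret;
    }

    static int list_insert(struct node *newp)  /* 0: caller retries or falls back. */
    {
        struct node **pp = &head;
        struct node *p;
        int ret = 0;

        rcu_read_lock();
        while ((p = rcu_dereference(*pp)) != NULL && p->key < newp->key)
            pp = &p->next;
        if (_xbegin() == _XBEGIN_STARTED) {
            if (*pp == p) {          /* Revalidate the insertion point. */
                newp->next = p;
                *pp = newp;          /* Both stores commit atomically. */
                ret = 1;
            }
            _xend();
        }
        rcu_read_unlock();
        return ret;
    }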


17.3.6 Conclusions

Although current HTM implementations have delivered real performance benefits in some situations, they also have significant shortcomings. The most significant shortcomings appear to be limited transaction sizes, the need for conflict handling, the need for aborts and rollbacks, the lack of forward-progress guarantees, the inability to handle irrevocable operations, and subtle semantic differences from locking. There are also reasons for lingering concerns surrounding HTM-implementation reliability [JSG12, Was14, Int20a, Int21, Lar21, Int20c].

Some of these shortcomings might be alleviated in future implementations, but it appears that there will continue to be a strong need to make HTM work well with the many other types of synchronization mechanisms, as noted earlier [MMW07, MMTW10]. Although there has been some work using HTM with RCU [SNGK17, SBN+20, GGK18, PMDY20], there has been little evidence of progress towards HTM working better with RCU and with other deferred-reclamation mechanisms.

In short, current HTM implementations appear to be welcome and useful additions to the parallel programmer's toolbox, and much interesting and challenging work is required to make use of them. However, they cannot be considered to be a magic wand with which to wave away all parallel-programming problems.

17.4 Formal Regression Testing?

Theory without experiments: Have we gone too far?

Michael Mitzenmacher

Formal verification has long proven useful in a number of production environments [LBD+04, BBC+10, Coo18, SAE+18, DFLO19]. However, it is an open question as to whether hard-core formal verification will ever be included in the automated regression-test suites used for continuous integration within complex concurrent codebases, such as the Linux kernel. Although there is already a proof of concept for Linux-kernel SRCU [Roy17], this test is for a small portion of one of the simplest RCU implementations, and has proven difficult to keep current with the ever-changing Linux kernel. It is therefore worth asking what would be required to incorporate formal verification as first-class members of the Linux kernel's regression tests.

The following list is a good start [McK15a, slide 34]:

1. Any required translation must be automated.

2. The environment (including memory ordering) must be correctly handled.

3. The memory and CPU overhead must be acceptably modest.

4. Specific information leading to the location of the bug must be provided.

5. Information beyond the source code and inputs must be modest in scope.

6. The bugs located must be relevant to the code's users.

This list builds on, but is somewhat more modest than, Richard Bornat's dictum: "Formal-verification researchers should verify the code that developers write, in the language they write it in, running in the environment that it runs in, as they write it." The following sections discuss each of the above requirements, followed by a section presenting a scorecard of how well a few tools stack up against these requirements.

Quick Quiz 17.16: This list is ridiculously utopian! Why not stick to the current state of the formal-verification art?

17.4.1 Automatic Translation

Although Promela and spin are invaluable design aids, if you need to formally regression-test your C-language program, you must hand-translate to Promela each time you would like to re-verify your code. If your code happens to be in the Linux kernel, which releases every 60–90 days, you will need to hand-translate from four to six times each year. Over time, human error will creep in, which means that the verification won't match the source code, rendering the verification useless. Repeated verification clearly requires either that the formal-verification tooling input your code directly, or that there be bug-free automatic translation of your code to the form required for verification.

PPCMEM and herd can in theory directly input assembly language and C++ code, but these tools work only on very small litmus tests, which normally means that you must extract the core of your mechanism—by hand. As with Promela and spin, both PPCMEM and herd are extremely useful, but they are not well-suited for regression suites.

In contrast, cbmc and Nidhugg can input C programs of reasonable (though still quite limited) size, and if


their capabilities continue to grow, could well become excellent additions to regression suites. The Coverity static-analysis tool also inputs C programs, and of very large size, including the Linux kernel. Of course, Coverity's static analysis is quite simple compared to that of cbmc and Nidhugg. On the other hand, Coverity had an all-encompassing definition of "C program" that posed special challenges [BBC+10]. Amazon Web Services uses a variety of formal-verification tools, including cbmc, and applies some of these tools to regression testing [Coo18]. Google uses a number of relatively simple static analysis tools directly on large Java code bases, which are arguably less diverse than C code bases [SAE+18]. Facebook uses more aggressive forms of formal verification against its code bases, including analysis of concurrency [DFLO19, O'H19], though not yet on the Linux kernel. Finally, Microsoft has long used static analysis on its code bases [LBD+04].

Given this list, it is clearly possible to create sophisticated formal-verification tools that directly consume production-quality source code.

However, one shortcoming of taking C code as input is that it assumes that the compiler is correct. An alternative approach is to take the binary produced by the C compiler as input, thereby accounting for any relevant compiler bugs. This approach has been used in a number of verification efforts, perhaps most notably by the SEL4 project [SM13].

Quick Quiz 17.17: Given the groundbreaking nature of the various verifiers used in the SEL4 project, why doesn't this chapter cover them in more depth?

However, verifying directly from either the source or the binary has the advantage of eliminating human translation errors, which is critically important for reliable regression testing.

This is not to say that tools with special-purpose languages are useless. On the contrary, they can be quite helpful for design-time verification, as was discussed in Chapter 12. However, such tools are not particularly helpful for automated regression testing, which is in fact the topic of this section.

17.4.2 Environment

It is critically important that formal-verification tools correctly model their environment. One all-too-common omission is the memory model, where a great many formal-verification tools, including Promela/spin, are restricted to sequential consistency. The QRCU experience related in Section 12.1.4.6 is an important cautionary tale.

Promela and spin assume sequential consistency, which is not a good match for modern computer systems, as was seen in Chapter 15. In contrast, one of the great strengths of PPCMEM and herd is their detailed modeling of various CPU families' memory models, including x86, Arm, Power, and, in the case of herd, a Linux-kernel memory model [AMM+18], which was accepted into Linux-kernel version v4.17.

The cbmc and Nidhugg tools provide some ability to select memory models, but do not provide the variety that PPCMEM and herd do. However, it is likely that the larger-scale tools will adopt a greater variety of memory models as time goes on.

In the longer term, it would be helpful for formal-verification tools to include I/O [MDR16], but it may be some time before this comes to pass.

Nevertheless, tools that fail to match the environment can still be useful. For example, a great many concurrency bugs would still be bugs on a mythical sequentially consistent system, and these bugs could be located by a tool that over-approximates the system's memory model with sequential consistency. Nevertheless, these tools will fail to find bugs involving missing memory-ordering directives, as noted in the aforementioned cautionary tale of Section 12.1.4.6.

17.4.3 Overhead

Almost all hard-core formal-verification tools are exponential in nature, which might seem discouraging until you consider that many of the most interesting software questions are in fact undecidable. However, there are differences in degree, even among exponentials.

PPCMEM by design is unoptimized, in order to provide greater assurance that the memory models of interest are accurately represented. The herd tool optimizes more aggressively, as described in Section 12.3, and is thus orders of magnitude faster than PPCMEM. Nevertheless, both PPCMEM and herd target very small litmus tests rather than larger bodies of code.

In contrast, Promela/spin, cbmc, and Nidhugg are designed for (somewhat) larger bodies of code. Promela/spin was used to verify the Curiosity rover's filesystem [GHH+14] and, as noted earlier, both cbmc and Nidhugg were applied to Linux-kernel RCU.

If advances in heuristics continue at the rate of the past three decades, we can look forward to large reductions in overhead for formal verification. That said, combinatorial explosion is still combinatorial explosion, which would be


expected to sharply limit the size of programs that could be verified, with or without continued improvements in heuristics.

Listing 17.2: Emulating Locking with cmpxchg_acquire()
 1 C C-SB+l-o-o-u+l-o-o-u-C
 2
 3 {}
 4
 5 P0(int *sl, int *x0, int *x1)
 6 {
 7     int r2;
 8     int r1;
 9
10     r2 = cmpxchg_acquire(sl, 0, 1);
11     WRITE_ONCE(*x0, 1);
12     r1 = READ_ONCE(*x1);
13     smp_store_release(sl, 0);
14 }
15
16 P1(int *sl, int *x0, int *x1)
17 {
18     int r2;
19     int r1;
20
21     r2 = cmpxchg_acquire(sl, 0, 1);
22     WRITE_ONCE(*x1, 1);
23     r1 = READ_ONCE(*x0);
24     smp_store_release(sl, 0);
25 }
26
27 filter (0:r2=0 /\ 1:r2=0)
28 exists (0:r1=0 /\ 1:r1=0)

Table 17.4: Emulating Locking: Performance (s)

    # Threads    Locking    cmpxchg_acquire
        2          0.004              0.022
        3          0.041              0.743
        4          0.374             59.565
        5          4.905

However, the flip side of combinatorial explosion is Philip II of Macedon's timeless advice: "Divide and rule." If a large program can be divided and the pieces verified, the result can be combinatorial implosion [McK11e]. One natural place to divide is on API boundaries, for example, those of locking primitives. One verification pass can then verify that the locking implementation is correct, and additional verification passes can verify correct use of the locking APIs.

The performance benefits of this approach can be demonstrated using the Linux-kernel memory model [AMM+18]. This model provides spin_lock() and spin_unlock() primitives, but these primitives can also be emulated using cmpxchg_acquire() and smp_store_release(), as shown in Listing 17.2 (C-SB+l-o-o-u+l-o-o-*u.litmus and C-SB+l-o-o-u+l-o-o-u*-C.litmus). Table 17.4 compares the performance and scalability of using the model's spin_lock() and spin_unlock() against emulating these primitives as shown in the listing. The difference is not insignificant: At four processes, the model is more than two orders of magnitude faster than emulation!

Quick Quiz 17.18: Why bother with a separate filter command on line 27 of Listing 17.2 instead of just adding the condition to the exists clause? And wouldn't it be simpler to use xchg_acquire() instead of cmpxchg_acquire()?

It would of course be quite useful for tools to automatically divide up large programs, verify the pieces, and then verify the combinations of pieces. In the meantime, verification of large programs will require significant manual intervention. This intervention will preferably be mediated by scripting, the better to reliably carry out repeated verifications on each release, and preferably eventually in a manner well-suited for continuous integration. And Facebook's Infer tool has taken important steps towards doing just that, via compositionality and abstraction [BGOS18, DFLO19].

In any case, we can expect formal-verification capabilities to continue to increase over time, and any such increases will in turn increase the applicability of formal verification to regression testing.

17.4.4 Locate Bugs

Any software artifact of any size contains bugs. Therefore, a formal-verification tool that reports only the presence or absence of bugs is not particularly useful. What is needed is a tool that gives at least some information as to where the bug is located and the nature of that bug.

The cbmc output includes a traceback mapping back to the source code, similar to Promela/spin's, as does Nidhugg. Of course, these tracebacks can be quite long, and analyzing them can be quite tedious. However, doing so is usually quite a bit faster and more pleasant than locating bugs the old-fashioned way.

In addition, one of the simplest tests of formal-verification tools is bug injection. After all, not only could any of us write printf("VERIFIED\n"), but the plain fact is that developers of formal-verification tools are just as bug-prone as are the rest of us. Therefore, formal-verification tools that just proclaim that a bug exists are fundamentally less trustworthy because it is more difficult to verify them on real-world code.

All that aside, people writing formal-verification tools are permitted to leverage existing tools. For example, a


tool designed to determine only the presence or absence of a serious but rare bug might leverage bisection. If an old version of the program under test did not contain the bug, but a new version did, then bisection could be used to quickly locate the commit that inserted the bug, which might be sufficient information to find and fix the bug. Of course, this sort of strategy would not work well for common bugs because in this case bisection would fail due to all commits having at least one instance of the common bug.

Therefore, the execution traces provided by many formal-verification tools will continue to be valuable, particularly for complex and difficult-to-understand bugs. In addition, recent work applies incorrectness-logic formalism reminiscent of the traditional Hoare logic used for full-up correctness proofs, but with the sole purpose of finding bugs [O'H19].

17.4.5 Minimal Scaffolding

In the old days, formal-verification researchers demanded a full specification against which the software would be verified. Unfortunately, a mathematically rigorous specification might well be larger than the actual code, and each line of specification is just as likely to contain bugs as is each line of code. A formal verification effort proving that the code faithfully implemented the specification would be a proof of bug-for-bug compatibility between the two, which might not be all that helpful.

Worse yet, the requirements for a number of software artifacts, including Linux-kernel RCU, are empirical in nature [McK15h, McK15e, McK15f].16 For this common type of software, a complete specification is a polite fiction. Nor are complete specifications any less fictional for hardware, as was made clear by the late-2017 Meltdown and Spectre side-channel attacks [Hor18].

This situation might cause one to give up all hope of formal verification of real-world software and hardware artifacts, but it turns out that there is quite a bit that can be done. For example, design and coding rules can act as a partial specification, as can assertions contained in the code. And in fact formal-verification tools such as cbmc and Nidhugg both check for assertions that can be triggered, implicitly treating these assertions as part of the specification. However, the assertions are also part of the code, which makes it less likely that they will become obsolete, especially if the code is also subjected to stress tests.17 The cbmc tool also checks for array-out-of-bound references, thus implicitly adding them to the specification. The aforementioned incorrectness logic can also be thought of as using an implicit bugs-not-present specification [O'H19].

This implicit-specification approach makes quite a bit of sense, particularly if you look at formal verification not as a full proof of correctness, but rather an alternative form of validation with a different set of strengths and weaknesses than the common case, that is, testing. From this viewpoint, software will always have bugs, and therefore any tool of any kind that helps to find those bugs is a very good thing indeed.

17.4.6 Relevant Bugs

Finding bugs—and fixing them—is of course the whole point of any type of validation effort. Clearly, false positives are to be avoided. But even in the absence of false positives, there are bugs and there are bugs.

For example, suppose that a software artifact had exactly 100 remaining bugs, each of which manifested on average once every million years of runtime. Suppose further that an omniscient formal-verification tool located all 100 bugs, which the developers duly fixed. What happens to the reliability of this software artifact?

The answer is that the reliability decreases.

To see this, keep in mind that historical experience indicates that about 7 % of fixes introduce a new bug [BJ12]. Therefore, fixing the 100 bugs, which had a combined mean time to failure (MTBF) of about 10,000 years, will introduce seven more bugs. Historical statistics indicate that each new bug will have an MTBF much less than 70,000 years. This in turn suggests that the combined MTBF of these seven new bugs will most likely be much less than 10,000 years, which in turn means that the well-intentioned fixing of the original 100 bugs actually decreased the reliability of the overall software.

Quick Quiz 17.19: How do we know that the MTBFs of known bugs are a good estimate of the MTBFs of bugs that have not yet been located?

Quick Quiz 17.20: But the formal-verification tools should immediately find all the bugs introduced by the fixes, so why is this a problem?

Worse yet, imagine another software artifact with one bug that fails once every day on average and 99 more

16 Or, in formal-verification parlance, Linux-kernel RCU has an incomplete specification.

17 And you do stress-test your code, don't you?


that fail every million years each. Suppose that a formal-verification tool located the 99 million-year bugs, but failed to find the one-day bug. Fixing the 99 bugs located will take time and effort, decrease reliability, and do nothing at all about the pressing each-day failure that is likely causing embarrassment and perhaps much worse besides.

Therefore, it would be best to have a validation tool that preferentially located the most troublesome bugs. However, as noted in Section 17.4.4, it is permissible to leverage additional tools. One powerful tool is none other than plain old testing. Given knowledge of the bug, it should be possible to construct specific tests for it, possibly also using some of the techniques described in Section 11.6.4 to increase the probability of the bug manifesting. These techniques should allow calculation of a rough estimate of the bug's raw failure rate, which could in turn be used to prioritize bug-fix efforts.

Quick Quiz 17.21: But many formal-verification tools can only find one bug at a time, so that each bug must be fixed before the tool can locate the next. How can bug-fix efforts be prioritized given such a tool?

There has been some recent formal-verification work that prioritizes executions having fewer preemptions, under the reasonable assumption that smaller numbers of preemptions are more likely.

Identifying relevant bugs might sound like too much to ask, but it is what is really required if we are to actually increase software reliability.

17.4.7 Formal Regression Scorecard

Table 17.5 shows a rough-and-ready scorecard for the formal-verification tools covered in this chapter. Shorter wavelengths are better than longer wavelengths.

Promela requires hand translation and supports only sequential consistency, so its first two cells are red. It has reasonable overhead (for formal verification, anyway) and provides a traceback, so its next two cells are yellow. Despite requiring hand translation, Promela handles assertions in a natural way, so its fifth cell is green.

PPCMEM usually requires hand translation due to the small size of litmus tests that it supports, so its first cell is orange. It handles several memory models, so its second cell is green. Its overhead is quite high, so its third cell is red. It provides a graphical display of relations among operations, which is not as helpful as a traceback, but is still quite useful, so its fourth cell is yellow. It requires constructing an exists clause and cannot take intra-process assertions, so its fifth cell is also yellow.

The herd tool has size restrictions similar to those of PPCMEM, so herd's first cell is also orange. It supports a wide variety of memory models, so its second cell is blue. It has reasonable overhead, so its third cell is yellow. Its bug-location and assertion capabilities are quite similar to those of PPCMEM, so herd also gets yellow for the next two cells.

The cbmc tool inputs C code directly, so its first cell is blue. It supports a few memory models, so its second cell is yellow. It has reasonable overhead, so its third cell is also yellow; however, perhaps SAT-solver performance will continue improving. It provides a traceback, so its fourth cell is green. It takes assertions directly from the C code, so its fifth cell is blue.

Nidhugg also inputs C code directly, so its first cell is also blue. It supports only a couple of memory models, so its second cell is orange. Its overhead is quite low (for formal verification), so its third cell is green. It provides a traceback, so its fourth cell is green. It takes assertions directly from the C code, so its fifth cell is blue.

So what about the sixth and final row? It is too early to tell how any of the tools do at finding the right bugs, so they are all yellow with question marks.

Quick Quiz 17.22: How would testing stack up in the scorecard shown in Table 17.5?

Quick Quiz 17.23: But aren't there a great many more formal-verification systems than are shown in Table 17.5?

Once again, please note that this table rates these tools for use in regression testing. Just because many of them are a poor fit for regression testing does not at all mean that they are useless; in fact, many of them have proven their worth many times over.18 Just not for regression testing.

However, this might well change. After all, formal verification tools made impressive strides in the 2010s. If that progress continues, formal verification might well become an indispensable tool in the parallel programmer's validation toolbox.

18 For but one example, Promela was used to verify the file system of none other than the Curiosity Rover. Was your formal verification tool used on software that currently runs on Mars???


Table 17.5: Formal Regression Scorecard
(Columns, in order: Promela, PPCMEM, herd, cbmc, Nidhugg. Cell colors run from blue (best) through green, yellow, and orange to red (worst); shorter wavelengths are better.)

(1) Automated:            red / orange / orange / blue / blue
(2) Environment (MM):     red / green / blue / yellow / orange
(3) Overhead:             yellow / red / yellow / yellow (SAT) / green
(4) Locate Bugs:          yellow / yellow / yellow / green / green
(5) Minimal Scaffolding:  green / yellow / yellow / blue / blue
(6) Relevant Bugs:        yellow ??? / yellow ??? / yellow ??? / yellow ??? / yellow ???
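To make the "assertions taken directly from the C code" cells concrete, consider the following toy store-buffering program, a hypothetical example with made-up names rather than one of this book's CodeSamples. A cbmc- or Nidhugg-style tool can be pointed at such a file directly: the embedded assertion cannot fire under sequential consistency, but the code contains data races and the assertion can fire on weakly ordered (or even TSO) systems, so the verdict depends on the memory model that the tool assumes.

    /* Store-buffering pattern with an embedded assertion. */
    #include <assert.h>
    #include <pthread.h>

    int x0, x1;
    int r0, r1;

    static void *thread0(void *arg)
    {
        x0 = 1;
        r0 = x1;
        return NULL;
    }

    static void *thread1(void *arg)
    {
        x1 = 1;
        r1 = x0;
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;

        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Forbidden under sequential consistency, but permitted by
         * real hardware given the data races above. */
        assert(!(r0 == 0 && r1 == 0));
        return 0;
    }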

17.5 Functional Programming for Parallelism

The curious failure of functional programming for parallel applications.

Malte Skarupke

When I took my first-ever functional-programming class in the early 1980s, the professor asserted that the side-effect-free functional-programming style was well-suited to trivial parallelization and analysis. Thirty years later, this assertion remains, but mainstream production use of parallel functional languages is minimal, a state of affairs that might not be entirely unrelated to the professor's additional assertion that programs should neither maintain state nor do I/O. There is niche use of functional languages such as Erlang, and multithreaded support has been added to several other functional languages, but mainstream production usage remains the province of procedural languages such as C, C++, Java, and Fortran (usually augmented with OpenMP, MPI, or coarrays).

This situation naturally leads to the question "If analysis is the goal, why not transform the procedural language into a functional language before doing the analysis?" There are of course a number of objections to this approach, of which I list but three:

1. Procedural languages often make heavy use of global variables, which can be updated independently by different functions, or, worse yet, by multiple threads. Note that Haskell's monads were invented to deal with single-threaded global state, and that multi-threaded access to global state inflicts additional violence on the functional model.

2. Multithreaded procedural languages often use synchronization primitives such as locks, atomic operations, and transactions, which inflict added violence upon the functional model.

3. Procedural languages can alias function arguments, for example, by passing a pointer to the same structure via two different arguments to the same invocation of a given function. This can result in the function unknowingly updating that structure via two different (and possibly overlapping) code sequences, which greatly complicates analysis.

Of course, given the importance of global state, synchronization primitives, and aliasing, clever functional-programming experts have proposed any number of attempts to reconcile the functional programming model to them, monads being but one case in point.

Another approach is to compile the parallel procedural program into a functional program, then to use functional-programming tools to analyze the result. But it is possible to do much better than this, given that any real computation is a large finite-state machine with finite input that runs for a finite time interval. This means that any real program can be transformed into an expression, albeit possibly an impractically large one [DHK12].

However, a number of the low-level kernels of parallel algorithms transform into expressions that are small enough to fit easily into the memories of modern computers. If such an expression is coupled with an assertion, checking to see if the assertion would ever fire becomes a satisfiability problem. Even though satisfiability problems are NP-complete, they can often be solved in much less time than would be required to generate the full state space. In addition, the solution time appears to be only weakly dependent on the underlying memory model, so that algorithms running on weakly ordered systems can also be checked [AKT13].

The general approach is to transform the program into single-static-assignment (SSA) form, so that each assignment to a variable creates a separate version of that variable.
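As a small, hypothetical illustration of SSA form (not taken from this book's CodeSamples), consider the following single-threaded fragment together with its SSA rendering in the comments: asking whether the assertion can fire amounts to asking whether x3 = 2 * (a + 1) can equal zero, and a satisfiability solver would answer "yes, when a == -1".

    /* Hypothetical fragment and its SSA rendering (in comments). */
    #include <assert.h>

    int x;                      /* SSA:  x1 = 0 (initial value) */

    void example(int a)
    {
        x = a + 1;              /* SSA:  x2 = a + 1             */
        x = x * 2;              /* SSA:  x3 = x2 * 2            */
        assert(x != 0);         /* SSA:  assert(x3 != 0)        */
    }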


This applies to assignments from all the active threads, so that the resulting expression embodies all possible executions of the code in question. The addition of an assertion entails asking whether any combination of inputs and initial values can result in the assertion firing, which, as noted above, is exactly the satisfiability problem.

One possible objection is that it does not gracefully handle arbitrary looping constructs. However, in many cases, this can be handled by unrolling the loop a finite number of times. In addition, perhaps some loops will also prove amenable to collapse via inductive methods.

Another possible objection is that spinlocks involve arbitrarily long loops, and any finite unrolling would fail to capture the full behavior of the spinlock. It turns out that this objection is easily overcome. Instead of modeling a full spinlock, model a trylock that attempts to obtain the lock, and aborts if it fails to immediately do so. The assertion must then be crafted so as to avoid firing in cases where a spinlock aborted due to the lock not being immediately available. Because the logic expression is independent of time, all possible concurrency behaviors will be captured via this approach.

A final objection is that this technique is unlikely to be able to handle a full-sized software artifact such as the millions of lines of code making up the Linux kernel. This is likely the case, but the fact remains that exhaustive validation of each of the much smaller parallel primitives within the Linux kernel would be quite valuable. And in fact the researchers spearheading this approach have applied it to non-trivial real-world code, including the Tree RCU implementation in the Linux kernel [LMKM16, KS17a].

It remains to be seen how widely applicable this technique is, but it is one of the more interesting innovations in the field of formal verification. Although it might well be that the functional-programming advocates are at long last correct in their assertion of the inevitable dominance of functional programming, it is clearly the case that this long-touted methodology is starting to see credible competition on its formal-verification home turf. There is therefore continued reason to doubt the inevitability of functional-programming dominance.

17.6 Summary

This chapter has taken a quick tour of a number of possible futures, including multicore, transactional memory, formal verification as a regression test, and concurrent functional programming. Any of these futures might come true, but it is more likely that, as in the past, the future will be far stranger than we can possibly imagine.

Chapter 18

Looking Forward and Back

History is the sum total of things that could have been avoided.

Konrad Adenauer

You have arrived at the end of this book, well done! I hope that your journey was a pleasant but challenging and worthwhile one.

For your editor and contributors, this is the end of the journey to the Second Edition, but for those willing to join in, it is also the start of the journey to the Third Edition. Either way, it is good to recap this past journey.

Chapter 1 covered what this book is about, along with some alternatives for those interested in something other than low-level parallel programming.

Chapter 2 covered parallel-programming challenges and high-level approaches for addressing them. It also touched on ways of avoiding these challenges while nevertheless still gaining most of the benefits of parallelism.

Chapter 3 gave a high-level overview of multicore hardware, especially those aspects that pose challenges for concurrent software. This chapter puts the blame for these challenges where it belongs, very much on the laws of physics and rather less on intransigent hardware architects and designers. However, there might be some things that hardware architects and engineers can do, and this chapter discusses a few of them. In the meantime, software architects and engineers must do their part to meet these challenges, as discussed in the rest of the book.

Chapter 4 gave a quick overview of the tools of the low-level concurrency trade. Chapter 5 then demonstrated use of those tools—and, more importantly, use of parallel-programming design techniques—on the simple but surprisingly challenging task of concurrent counting. So challenging, in fact, that a number of concurrent counting algorithms are in common use, each specialized for a different use case.

Chapter 6 dug more deeply into the most important parallel-programming design technique, namely partitioning the problem at the highest possible level. This chapter also overviewed a number of points in this design space.

Chapter 7 expounded on that parallel-programming workhorse (and villain), locking. This chapter covered a number of types of locking and presented some engineering solutions to many well-known and aggressively advertised shortcomings of locking.

Chapter 8 discussed the uses of data ownership, where synchronization is supplied by the association of a given data item with a specific thread. Where it applies, this approach combines excellent performance and scalability with profound simplicity.

Chapter 9 showed how a little procrastination can greatly improve performance and scalability, while in a surprisingly large number of cases also simplifying the code. A number of the mechanisms presented in this chapter take advantage of the ability of CPU caches to replicate read-only data, thus sidestepping the laws of physics that cruelly limit the speed of light and the smallness of atoms.

Chapter 10 looked at concurrent data structures, with emphasis on hash tables, which have a long and honorable history in parallel programs.

Chapter 11 dug into code-review and testing methods, and Chapter 12 overviewed formal verification. Whichever side of the formal-verification/testing divide you might be on, if code has not been thoroughly validated, it does not work. And that goes at least double for concurrent code.

Chapter 13 presented a number of situations where combining concurrency mechanisms with each other or with other design tricks can greatly ease parallel programmers' lives. Chapter 14 looked at advanced synchronization methods, including lockless programming, non-blocking synchronization, and parallel real-time computing. Chapter 15 dug into the critically important topic of memory ordering, presenting techniques and tools to help you not only solve memory-ordering problems, but also to avoid them completely. Chapter 16 presented a brief overview of the surprisingly important topic of ease of use.

Last, but definitely not least, Chapter 17 expounded on with many excellent innovations and improvements from
a number of conflicting visions of the future, including throughout the community. The thought of writing a book
CPU-technology trends, transactional memory, hardware occurred to Paul from time to time, but life was flowing
transactional memory, use of formal verification in re- fast, so he made no progress on this project.
gression testing, and the long-standing prediction that In 2006, Paul was invited to a conference on Linux
the future of parallel programming belongs to functional- scalability, and was granted the privilege of asking the
programming languages. last question of panel of esteemed parallel-programming
But now that we have recapped the contents of this experts. Paul began his question by noting that in the
Second Edition, how did this book get started? 15 years from 1991 to 2006, the price of a parallel system
Paul’s parallel-programming journey started in earnest had dropped from that of a house to that of a mid-range
in 1990, when he joined Sequent Computer Systems, Inc. bicycle, and it was clear that there was much more room for
Sequent used an apprenticeship-like program in which additional dramatic price decreases over the next 15 years
newly hired engineers were placed in cubicles surrounded extending to the year 2021. He also noted that decreasing
by experienced engineers, who mentored them, reviewed price should result in greater familiarity and faster progress
their code, and gave copious quantities of advice on a in solving parallel-programming problems. This led to
variety of topics. A few of the newly hired engineers his question: “In the year 2021, why wouldn’t parallel
were greatly helped by the fact that there were no on-chip programming have become routine?”
caches in those days, which meant that logic analyzers The first panelist seemed quite disdainful of anyone who
could easily display a given CPU’s instruction stream would ask such an absurd question, and quickly responded
and memory accesses, complete with accurate timing with a soundbite answer. To which Paul gave a soundbite
information. Of course, the downside of this transparency response. They went back and forth for some time, for
was that CPU core clock frequencies were 100 times example, the panelist’s sound-bite answer “Deadlock”
slower than those of the twenty-first century. Between provoked Paul’s sound-bite response “Lock dependency
apprenticeship and hardware performance transparency, checker”.
these newly hired engineers became productive parallel The panelist eventually ran out of soundbites, impro-
programmers within two or three months, and some were vising a final “People like you should be hit over the head
doing ground-breaking work within a couple of years. with a hammer!”
Sequent understood that its ability to quickly train new Paul’s response was of course “You will have to get in
engineers in the mysteries of parallelism was unusual, so line for that!”
it produced a slim volume that crystalized the company’s Paul turned his attention to the next panelist, who
parallel-programming wisdom [Seq88], which joined a seemed torn between agreeing with the first panelist and
pair of groundbreaking papers that had been written a few not wishing to have to deal with Paul’s series of responses.
years earlier [BK85, Inm85]. People already steeped in He therefore have a short non-committal speech. And so
these mysteries saluted this book and these papers, but it went through the rest of the panel.
novices were usually unable to benefit much from them, Until it was the turn of the last panelist, who was
invariably making highly creative and quite destructive someone you might have heard of who goes by the name
errors that were not explicitly prohibited by either the of Linus Torvalds. Linus noted that three years earlier (that
book or the papers.1 This situation of course caused Paul is, 2003), the initial version of any concurrency-related
to start thinking in terms of writing an improved book, patch was usually quite poor, having design flaws and
but his efforts during this time were limited to internal many bugs. And even when it was cleaned up enough
training materials and to published papers. to be accepted, bugs still remained. Linus contrasted
By the time Sequent was acquired by IBM in 1999, this with the then-current situation in 2006, in which
many of the world’s largest database instances ran on he said that it was not unusual for the first version of a
Sequent hardware. But times change, and by 2001 many concurrency-related patch to be well-designed with few or
of Sequent’s parallel programmers had shifted their focus even no bugs. He then suggested that if tools continued to
to the Linux kernel. After some initial reluctance, the improve, then maybe parallel programming would become
Linux kernel community embraced concurrency both routine by the year 2021.2
enthusiastically and effectively [BWCM+ 10, McK12a],
2 Tools have in fact continued to improve, including fuzzers, lock
1 “But why on earth would you do that???” “Well, why not?” dependency checkers, static analyzers, formal verification, memory


Figure 18.1: The Most Important Lesson

The conference then concluded. Paul was not surprised
to be given wide berth by many audience members, es-
pecially those who saw the world in the same way as
did the first panelist. Paul was also not surprised that
a few audience members thanked him for the question.
However, he was quite surprised when one man came up
to say “thank you” with tears streaming down his face,
sobbing so hard that he could barely speak.
You see, this man had worked several years at Sequent,
and thus very well understood parallel programming.
Furthermore, he was currently assigned to a group whose
job it was to write parallel code. Which was not going well.
You see, it wasn’t that they had trouble understanding his
explanations of parallel programming.
It was that they refused to listen to him at all.
In short, his group was treating this man in the same
way that the first panelist attempted to treat Paul. And so
in that moment, Paul went from “I should write a book
some day” to “I will do whatever it takes to write this
book”. Paul is embarrassed to admit that he does not
remember the man’s name, if in fact he ever knew it.
This book is nevertheless for that man.
And this book is also for everyone else who would
like to add low-level concurrency to their skillset. If you
remember nothing else from this book, let it be the lesson
of Figure 18.1.
For the rest of us, when someone tries to show us a solution to a pressing problem, perhaps we should at the very least do them the courtesy of listening!


Appendix A

Important Questions

Ask me no questions, and I'll tell you no fibs.

"She Stoops to Conquer", Oliver Goldsmith

The following sections discuss some important questions relating to SMP programming. Each section also shows how to avoid worrying about the corresponding question, which can be extremely important if your goal is to simply get your SMP code working as quickly and painlessly as possible—which is an excellent goal, by the way!

Although the answers to these questions are often less intuitive than they would be in a single-threaded setting, with a bit of work, they are not that difficult to understand. If you managed to master recursion, there is nothing here that should pose an overwhelming challenge.

With that, here are the questions:

1. Why aren't parallel programs always faster? (Appendix A.1)

2. Why not remove locking? (Appendix A.2)

3. What time is it? (Appendix A.3)

4. What does "after" mean? (Appendix A.4)

5. How much ordering is needed? (Appendix A.5)

6. What is the difference between "concurrent" and "parallel"? (Appendix A.6)

7. Why is software buggy? (Appendix A.7)

Read on to learn some answers. Improve upon these answers if you can!

A.1 Why Aren't Parallel Programs Always Faster?

The short answer is "because parallel execution often requires communication, and communication is not free".

For more information on this question, see Chapter 3, Section 5.1, and especially Chapter 6, each of which presents ways of slowing down your code by ineptly parallelizing it. Of course, much of this book deals with ways of ensuring that your parallel programs really are faster than their sequential counterparts.

However, never forget that parallel programs can be quite fast while at the same time being quite simple, with the example in Section 4.1 being a case in point. Also never forget that parallel execution is but one optimization of many, and there are programs for which other optimizations produce better results.

A.2 Why Not Remove Locking?

There can be no doubt that many have cast locking as the evil villain of parallel programming, and not entirely without reason. And there are important examples where lockless code does much better than its locked counterpart, a few of which are discussed in Section 14.2.

However, lockless algorithms are not guaranteed to perform and scale well, as shown by Figure 5.1 on page 50. Furthermore, as a general rule, the more complex the algorithm, the greater the advantage of combining locking with selected lockless techniques, even with significant hardware support, as shown in Table 17.3 on page 392. Section 14.2 looks more deeply at non-blocking synchronization, which is a popular lockless methodology.

As a more general rule, a sound-bite approach to parallel programming is not likely to end well. Some would argue that this is also true of many other fields of endeavor.
407

v2022.09.25a
408 APPENDIX A. IMPORTANT QUESTIONS
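As one concrete illustration of combining locking with a selected lockless technique, consider the following minimal sketch, which uses POSIX threads and C11 atomics rather than this book's own primitives; the names cfg_lock, cfg_value, cfg_update(), and cfg_read() are purely illustrative. Updates are serialized by a lock, so arbitrarily complex read-modify-write logic stays simple, while readers perform a single lockless load and must therefore tolerate values that may be slightly stale.

#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic long cfg_value;

/* Updaters serialize on cfg_lock, so the read-modify-write is safe. */
void cfg_update(long delta)
{
	long v;

	pthread_mutex_lock(&cfg_lock);
	v = atomic_load_explicit(&cfg_value, memory_order_relaxed);
	atomic_store_explicit(&cfg_value, v + delta, memory_order_release);
	pthread_mutex_unlock(&cfg_lock);
}

/* Readers never acquire the lock, so they scale, but they may observe
 * a value that is an instant out of date. */
long cfg_read(void)
{
	return atomic_load_explicit(&cfg_value, memory_order_acquire);
}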

Figure A.1: What Time Is It? (Cartoon: “What time is it?” “Uh. When did you ask?”)

A.3 What Time Is It?

A key issue with timekeeping on multicore computer systems is illustrated by Figure A.1. One problem is that it takes time to read out the time. An instruction might read from a hardware clock, and might have to go off-core (or worse yet, off-socket) to complete this read operation. It might also be necessary to do some computation on the value read out, for example, to convert it to the desired format, to apply network time protocol (NTP) adjustments, and so on. So does the time eventually returned correspond to the beginning of the resulting time interval, the end, or somewhere in between?

Worse yet, the thread reading the time might be interrupted or preempted. Furthermore, there will likely be some computation between reading out the time and the actual use of the time that has been read out. Both of these possibilities further extend the interval of uncertainty.

One approach is to read the time twice, and take the arithmetic mean of the two readings, perhaps one on each side of the operation being timestamped. The difference between the two readings is then a measure of uncertainty of the time at which the intervening operation occurred.

Of course, in many cases, the exact time is not necessary. For example, when printing the time for the benefit of a human user, we can rely on slow human reflexes to render internal hardware and software delays irrelevant. Similarly, if a server needs to timestamp the response to a client, any time between the reception of the request and the transmission of the response will do equally well.

There is an old saying that those who have but one clock always know the time, but those who have several clocks can never be sure. And there was a time when the typical low-end computer’s sole software-visible clock was its program counter, but those days are long gone. This is not a bad thing, considering that on modern computer systems, the program counter is a truly horrible clock [MOZ09].

In addition, different clocks provide different tradeoffs of performance, accuracy, precision, and ordering. For example, in the Linux kernel, the jiffies counter1 provides high-speed access to a coarse-grained counter (at best one-millisecond accuracy and precision) that imposes almost no ordering on either the compiler or the hardware. In contrast, the x86 HPET hardware provides an accurate and precise clock, but at the price of slow access. The x86 time-stamp counter has a checkered past, but is more recently held out as providing a good combination of precision, accuracy, and performance. Unfortunately, for all of these counters, ordering against all effects of prior and subsequent code requires expensive memory-barrier instructions. And this expense appears to be an unavoidable consequence of the complex superscalar nature of modern computer systems.

1 The jiffies variable is a location in normal memory that is incremented by software in response to events such as the scheduling-clock interrupt.

In addition, all of these clock sources provide their own timebase, so that (for example) the jiffies counter on one system is not necessarily compatible with that of another. Not only did they start counting at different times, but they might well be counting at different rates. This brings up the topic of synchronizing a given system’s counters with some real-world notion of time, but that topic is beyond the scope of this book.

In short, time is a slippery topic that causes untold confusion to parallel programmers and to their code.
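The read-the-time-twice approach described above can be sketched as follows, assuming the POSIX clock_gettime() interface; the helper names timestamp_op(), ts2d(), and struct bracketed_time are illustrative only. The operation's timestamp is reported as the midpoint of the two readings, and half their difference bounds the uncertainty.

#include <time.h>

struct bracketed_time {
	double midpoint;	/* estimated time of the operation (seconds) */
	double uncertainty;	/* half the width of the bracketing interval */
};

static double ts2d(const struct timespec *ts)
{
	return ts->tv_sec + ts->tv_nsec / 1e9;
}

/* Timestamp the operation op(arg) by bracketing it with two clock reads. */
struct bracketed_time timestamp_op(void (*op)(void *), void *arg)
{
	struct timespec before, after;
	struct bracketed_time bt;

	clock_gettime(CLOCK_MONOTONIC, &before);
	op(arg);
	clock_gettime(CLOCK_MONOTONIC, &after);
	bt.midpoint = (ts2d(&before) + ts2d(&after)) / 2.0;
	bt.uncertainty = (ts2d(&after) - ts2d(&before)) / 2.0;
	return bt;
}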

A.4 What Does “After” Mean?

“After” is an intuitive, but surprisingly difficult concept. An important non-intuitive issue is that code can be delayed at any point for any amount of time. Consider a producing and a consuming thread that communicate using a global struct with a timestamp “t” and integer fields “a”, “b”, and “c”. The producer loops recording the current time (in seconds since 1970 in decimal), then updating the values of “a”, “b”, and “c”, as shown in Listing A.1. The consumer code loops, also recording the current time, but also copying the producer’s timestamp along with the fields “a”, “b”, and “c”, as shown in Listing A.2. At the end of the run, the consumer outputs a list of anomalous recordings, e.g., where time has appeared to go backwards.

Listing A.1: “After” Producer Function

 1 /* WARNING: BUGGY CODE. */
 2 void *producer(void *ignored)
 3 {
 4   int i = 0;
 5
 6   producer_ready = 1;
 7   while (!goflag)
 8     sched_yield();
 9   while (goflag) {
10     ss.t = dgettimeofday();
11     ss.a = ss.c + 1;
12     ss.b = ss.a + 1;
13     ss.c = ss.b + 1;
14     i++;
15   }
16   printf("producer exiting: %d samples\n", i);
17   producer_done = 1;
18   return (NULL);
19 }

Listing A.2: “After” Consumer Function

 1 /* WARNING: BUGGY CODE. */
 2 void *consumer(void *ignored)
 3 {
 4   struct snapshot_consumer curssc;
 5   int i = 0;
 6   int j = 0;
 7
 8   consumer_ready = 1;
 9   while (ss.t == 0.0) {
10     sched_yield();
11   }
12   while (goflag) {
13     curssc.tc = dgettimeofday();
14     curssc.t = ss.t;
15     curssc.a = ss.a;
16     curssc.b = ss.b;
17     curssc.c = ss.c;
18     curssc.sequence = curseq;
19     curssc.iserror = 0;
20     if ((curssc.t > curssc.tc) ||
21         modgreater(ssc[i].a, curssc.a) ||
22         modgreater(ssc[i].b, curssc.b) ||
23         modgreater(ssc[i].c, curssc.c) ||
24         modgreater(curssc.a, ssc[i].a + maxdelta) ||
25         modgreater(curssc.b, ssc[i].b + maxdelta) ||
26         modgreater(curssc.c, ssc[i].c + maxdelta)) {
27       i++;
28       curssc.iserror = 1;
29     } else if (ssc[i].iserror)
30       i++;
31     ssc[i] = curssc;
32     curseq++;
33     if (i + 1 >= NSNAPS)
34       break;
35   }
36   printf("consumer exited loop, collected %d items %d\n",
37          i, curseq);
38   if (ssc[0].iserror)
39     printf("0/%ld: %.6f %.6f (%.3f) %ld %ld %ld\n",
40            ssc[0].sequence,
41            ssc[j].t, ssc[j].tc,
42            (ssc[j].tc - ssc[j].t) * 1000000,
43            ssc[j].a, ssc[j].b, ssc[j].c);
44   for (j = 0; j <= i; j++)
45     if (ssc[j].iserror)
46       printf("%d/%ld: %.6f (%.3f) %ld %ld %ld\n",
47              j, ssc[j].sequence,
48              ssc[j].t, (ssc[j].tc - ssc[j].t) * 1000000,
49              ssc[j].a - ssc[j - 1].a,
50              ssc[j].b - ssc[j - 1].b,
51              ssc[j].c - ssc[j - 1].c);
52   consumer_done = 1;
53 }

Table A.1: “After” Program Sample Output

    seq    time (seconds)       delta     a     b     c
 17563:   1152396.251585    (−16.928)    27    27    27
 18004:   1152396.252581    (−12.875)    24    24    24
 18163:   1152396.252955    (−19.073)    18    18    18
 18765:   1152396.254449   (−148.773)   216   216   216
 19863:   1152396.256960     (−6.914)    18    18    18
 21644:   1152396.260959     (−5.960)    18    18    18
 23408:   1152396.264957    (−20.027)    15    15    15

Quick Quiz A.1: What SMP coding errors can you see in these examples? See time.c for full code.

One might intuitively expect that the difference between the producer and consumer timestamps would be quite small, as it should not take much time for the producer to record the timestamps or the values. An excerpt of some sample output on a dual-core 1 GHz x86 is shown in Table A.1. Here, the “seq” column is the number of times through the loop, the “time” column is the time of the anomaly in seconds, the “delta” column is the number of seconds the consumer’s timestamp follows that of the producer (where a negative value indicates that the consumer has collected its timestamp before the producer did), and the columns labelled “a”, “b”, and “c” show the amount that these variables increased since the prior snapshot collected by the consumer.

Why is time going backwards? The number in parentheses is the difference in microseconds, with a large number exceeding 10 microseconds, and one exceeding even 100 microseconds! Please note that this CPU can potentially execute more than 100,000 instructions in that time.

One possible reason is given by the following sequence of events:

1. Consumer obtains timestamp (Listing A.2, line 13).
2. Consumer is preempted.
3. An arbitrary amount of time passes.
4. Producer obtains timestamp (Listing A.1, line 10).
5. Consumer starts running again, and picks up the producer’s timestamp (Listing A.2, line 14).

In this scenario, the producer’s timestamp might be an arbitrary amount of time after the consumer’s timestamp.

How do you avoid agonizing over the meaning of “after” in your SMP code?

Simply use SMP primitives as designed.

In this example, the easiest fix is to use locking, for example, acquire a lock in the producer before line 10 in Listing A.1 and in the consumer before line 13 in Listing A.2. This lock must also be released after line 13 in Listing A.1 and after line 17 in Listing A.2. These locks cause the code segments in lines 10–13 of Listing A.1 and in lines 13–17 of Listing A.2 to exclude each other, in other words, to run atomically with respect to each other. This is represented in Figure A.2: The locking prevents any of the boxes of code from overlapping in time, so that the consumer’s timestamp must be collected after the prior producer’s timestamp. The segments of code in each box in this figure are termed “critical sections”; only one such critical section may be executing at a given time.

Figure A.2: Effect of Locking on Snapshot Collection (time flows downward through non-overlapping critical sections: Producer: ss.t = dgettimeofday(); ss.a = ss.c + 1; ss.b = ss.a + 1; ss.c = ss.b + 1; then Consumer: curssc.tc = gettimeofday(); curssc.t = ss.t; curssc.a = ss.a; curssc.b = ss.b; curssc.c = ss.c; then Producer again.)

This addition of locking results in output as shown in Table A.2. Here there are no instances of time going backwards, instead, there are only cases with more than 1,000 counts difference between consecutive reads by the consumer.

Table A.2: Locked “After” Program Sample Output

     seq    time (seconds)    delta       a      b      c
  58597:   1156521.556296   (3.815)    1485   1485   1485
 403927:   1156523.446636   (2.146)    2583   2583   2583

Quick Quiz A.2: How could there be such a large gap between successive consumer reads? See timelocked.c for full code.

In summary, if you acquire an exclusive lock, you know that anything you do while holding that lock will appear to happen after anything done by any prior holder of that lock, at least give or take transactional lock elision (see Section 17.3.2.6). No need to worry about which CPU did or did not execute a memory barrier, no need to worry about the CPU or compiler reordering operations—life is simple. Of course, the fact that this locking prevents these two pieces of code from running concurrently might limit the program’s ability to gain increased performance on multiprocessors, possibly resulting in a “safe but slow” situation. Chapter 6 describes ways of gaining performance and scalability in many situations.

In short, in many parallel programs, the really important definition of “after” is ordering of operations, which is covered in dazzling detail in Chapter 15.

However, in most cases, if you find yourself worrying about what happens before or after a given piece of code, you should take this as a hint to make better use of the standard primitives. Let these primitives do the worrying for you.
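For concreteness, here is one minimal, self-contained sketch of the locking fix described above, using POSIX threads; the names snap, snaplock, producer_step(), and consumer_step() are illustrative only, and the full fixed program is in timelocked.c.

#include <pthread.h>
#include <sys/time.h>

struct snap {
	double t;
	long a;
	long b;
	long c;
};

static struct snap ss;
static pthread_mutex_t snaplock = PTHREAD_MUTEX_INITIALIZER;

static double read_clock(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

/* One producer iteration: the timestamp and the updates of "a", "b",
 * and "c" form a single critical section. */
void producer_step(void)
{
	pthread_mutex_lock(&snaplock);
	ss.t = read_clock();
	ss.a = ss.c + 1;
	ss.b = ss.a + 1;
	ss.c = ss.b + 1;
	pthread_mutex_unlock(&snaplock);
}

/* One consumer iteration: the snapshot is taken in a single critical
 * section, so it cannot interleave with a producer update. */
void consumer_step(struct snap *out, double *tc)
{
	pthread_mutex_lock(&snaplock);
	*tc = read_clock();
	*out = ss;
	pthread_mutex_unlock(&snaplock);
}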

A.5 How Much Ordering Is Needed?

Perhaps you have carefully constructed a strongly ordered concurrent system, only to find that it neither performs nor scales well. Or perhaps you threw caution to the wind, only to find that your brilliantly fast and scalable software is also unreliable. Is there a happy medium with both robust reliability on the one hand and powerful performance augmented by scintillating scalability on the other?

The answer, as is so often the case, is “it depends”. One approach is to construct a strongly ordered system, then examine its performance and scalability. If these suffice, the system is good and sufficient, and no more need be done. Otherwise, undertake careful analysis (see Section 11.7) and attack each bottleneck until the system’s performance is good and sufficient.

This approach can work very well, especially in contrast to the all-too-common approach of optimizing random components of the system in the hope of achieving significant system-wide benefits. However, starting with strong ordering can also be quite wasteful, given that weakening ordering of the system’s bottleneck can require that large portions of the rest of the system be redesigned and rewritten to accommodate the weakening. Worse yet, eliminating one bottleneck often exposes another, which in turn needs to be weakened and which in turn can result in wholesale redesigns and rewrites of other parts of the system. Perhaps even worse is the approach, also common, of starting with a fast but unreliable system and then playing whack-a-mole with an endless succession of concurrency bugs, though in the latter case, Chapters 11 and 12 are always there for you.

It would be better to have design-time tools to determine which portions of the system could use weak ordering, and at the same time, which portions actually benefit from weak ordering. These tasks are taken up by the following sections.

A.5.1 Where is the Defining Data?

One way to do this is to keep firmly in mind that the region of consistency engendered by strong ordering cannot extend out past the boundaries of the system.2 Portions of the system whose role is to track the state of the outside world can usually feature weak ordering, given that speed-of-light delays will force the within-system state to lag that of the outside world. There is often no point in incurring large overheads to force a consistent view of data that is inherently out of date. In these cases, the methods of Chapter 9 can be quite helpful, as can some of the data structures described in Chapter 10.

2 Which might well be a distributed system.

Nevertheless, it is wise to adopt some meaningful semantics that are visible to those accessing the data, for example, a given function’s return value might be:

1. Some value between the conceptual value at the time of the call to the function and the conceptual value at the time of the return from that function. For example, see the statistical counters discussed in Section 5.2, keeping in mind that such counters are normally monotonic, at least between consecutive overflows.

2. The actual value at some time between the call to and the return from that function. For example, see the single-variable atomic counter shown in Listing 5.2.

3. If the values used by that function remain unchanged during the time between that function’s call and return, the expected value, otherwise some approximation to the expected value. Precise specification of the bounds on the approximation can be quite challenging. For example, consider a function combining values from different elements of an RCU-protected linked data structure, as described in Section 10.3.

Weaker ordering usually implies weaker semantics, and you should be able to give some sort of promise to your users as to how this weakening affects them. At the same time, unless the caller holds a lock across both the function call and the use of any values computed by that function, even fully ordered implementations normally cannot do any better than the semantics given by the options above.

Quick Quiz A.3: But if fully ordered implementations cannot offer stronger guarantees than the better performing and more scalable weakly ordered implementations, why bother with full ordering?

Some might argue that useful computing deals only with the outside world, and therefore that all computing can use weak ordering. Such arguments are incorrect. For example, the value of your bank account is defined within your bank’s computers, and people often prefer exact computations involving their account balances, especially those who might suspect that any such approximations would be in the bank’s favor.

In short, although data tracking external state can be an attractive candidate for weakly ordered access, please think carefully about exactly what is being tracked and what is doing the tracking.
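The first of the return-value semantics listed above is exactly what a split statistical counter naturally provides. The following minimal sketch is in the spirit of the counters of Section 5.2, but uses C11 atomics; the names counter[], inc_count(), read_count(), and the fixed thread bound are illustrative only.

#include <stdatomic.h>

#define NR_THREADS_MAX 128

static _Atomic unsigned long counter[NR_THREADS_MAX];

/* Each thread increments only its own element, so updates never contend. */
void inc_count(int tid)
{
	atomic_fetch_add_explicit(&counter[tid], 1, memory_order_relaxed);
}

/* Sum all elements with no ordering whatsoever.  Because each element
 * increases monotonically, the result is some value lying between the
 * conceptual total at the start of this call and the conceptual total
 * at its end, which is often all the semantics that callers need. */
unsigned long read_count(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NR_THREADS_MAX; i++)
		sum += atomic_load_explicit(&counter[i], memory_order_relaxed);
	return sum;
}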

A.5.2 Consistent Data Used Consistently?

Another hint that weakening is safe can appear in the guise of data that is computed while holding a lock, but then used after the lock is released. The computed result clearly becomes at best an approximation as soon as the lock is released, which suggests computing an approximate result in the first place, possibly permitting use of weaker ordering. To this end, Chapter 5 covers numerous approximate methods for counting.

Great care is required, however. Is the use of data following lock release a hint that weak-ordering optimizations might be helpful? Or is it instead a bug in which the lock was released too soon?

A.5.3 Is the Problem Partitionable?

Suppose that the system holds the defining instance of the data, or that using a computed value past lock release proved to be a bug. What then?

One approach is to partition the system, as discussed in Chapter 6. Partitioning can provide excellent scalability and in its more extreme form, per-CPU performance rivaling that of a sequential program, as discussed in Chapter 8. Partial partitioning is often mediated by locking, which is the subject of Chapter 7.

A.5.4 None of the Above?

The previous sections described the easier ways to gain performance and scalability, sometimes using weaker ordering and sometimes not. But the plain fact is that multicore systems are under no compunction to make life easy. But perhaps the advanced topics covered in Chapters 14 and 15 will prove helpful.

But please proceed with care, as it is all too easy to destabilize your codebase optimizing non-bottlenecks. Once again, Section 11.7 can help. It might also be worth your time to review other portions of this book, as it contains much information on handling a number of tricky situations.

A.6 What is the Difference Between “Concurrent” and “Parallel”?

From a classic computing perspective, “concurrent” and “parallel” are clearly synonyms. However, this has not stopped many people from drawing distinctions between the two, and it turns out that these distinctions can be understood from a couple of different perspectives.

The first perspective treats “parallel” as an abbreviation for “data parallel”, and treats “concurrent” as pretty much everything else. From this perspective, in parallel computing, each partition of the overall problem can proceed completely independently, with no communication with other partitions. In this case, little or no coordination among partitions is required. In contrast, concurrent computing might well have tight interdependencies, in the form of contended locks, transactions, or other synchronization mechanisms.

Quick Quiz A.4: Suppose a portion of a program uses RCU read-side primitives as its only synchronization mechanism. Is this parallelism or concurrency?

This of course begs the question of why such a distinction matters, which brings us to the second perspective, that of the underlying scheduler. Schedulers come in a wide range of complexities and capabilities, and as a rough rule of thumb, the more tightly and irregularly a set of parallel processes communicate, the higher the level of sophistication required from the scheduler. As such, parallel computing’s avoidance of interdependencies means that parallel-computing programs run well on the least-capable schedulers. In fact, a pure parallel-computing program can run successfully after being arbitrarily subdivided and interleaved onto a uniprocessor.3 In contrast, concurrent-computing programs might well require extreme subtlety on the part of the scheduler.

3 Yes, this does mean that data-parallel-computing programs are best-suited for sequential execution. Why did you ask?

One could argue that we should simply demand a reasonable level of competence from the scheduler, so that we could simply ignore any distinctions between parallelism and concurrency. Although this is often a good strategy, there are important situations where efficiency, performance, and scalability concerns sharply limit the level of competence that the scheduler can reasonably offer. One important example is when the scheduler is implemented in hardware, as it often is in SIMD units or GPGPUs. Another example is a workload where the units of work are quite short, so that even a software-based scheduler must make hard choices between subtlety on the one hand and efficiency on the other.

Now, this second perspective can be thought of as making the workload match the available scheduler, with parallel workloads able to use simple schedulers and concurrent workloads requiring sophisticated schedulers.

Unfortunately, this perspective does not always align


with the dependency-based distinction put forth by the
first perspective. For example, a highly interdependent
lock-based workload with one thread per CPU can make
do with a trivial scheduler because no scheduler decisions
are required. In fact, some workloads of this type can
even be run one after another on a sequential machine.
Therefore, such a workload would be labeled “concurrent”
by the first perspective and “parallel” by many taking the
second perspective.
Quick Quiz A.5: In what part of the second (scheduler-
based) perspective would the lock-based single-thread-per-
CPU workload be considered “concurrent”?

Which is just fine. No rule that humankind writes


carries any weight against the objective universe, not even
rules dividing multiprocessor programs into categories
such as “concurrent” and “parallel”.
This categorization failure does not mean such rules
are useless, but rather that you should take on a suitably
skeptical frame of mind when attempting to apply them
to new situations. As always, use such rules where they
apply and ignore them otherwise.
In fact, it is likely that new categories will arise in
addition to parallel, concurrent, map-reduce, task-based,
and so on. Some will stand the test of time, but good luck
guessing which!

A.7 Why Is Software Buggy?


The short answer is “because it was written by humans,
and to err is human”. This does not necessarily mean
that automated code generation is the answer, because
the program that does the code generation will have
been written by humans. In addition, one of the biggest
problems in producing software is working out what that
software is supposed to do, and this task has thus far
proven rather resistant to automation.
Nevertheless, automation is an important part of the
process of reducing the number of bugs in software. For
but one example, despite their many flaws, it is almost
always better to use a compiler than to write in assembly
language.
Furthermore, careful validation can be very helpful in
finding bugs, as discussed in Chapters 11–12.

The only difference between men and boys is the
price of their toys.

M. Hébert

Appendix B

“Toy” RCU Implementations

The toy RCU implementations in this appendix are designed not for high performance, practicality, or any kind of production use,1 but rather for clarity. Nevertheless, you will need a thorough understanding of Chapters 2, 3, 4, 6, and 9 for even these toy RCU implementations to be easily understandable.

1 However, production-quality user-level RCU implementations are available [Des09b, DMS+ 12].

This appendix provides a series of RCU implementations in order of increasing sophistication, from the viewpoint of solving the existence-guarantee problem. Appendix B.1 presents a rudimentary RCU implementation based on simple locking, while Appendices B.2 through B.9 present a series of simple RCU implementations based on locking, reference counters, and free-running counters. Finally, Appendix B.10 provides a summary and a list of desirable RCU properties.

B.1 Lock-Based RCU

Perhaps the simplest RCU implementation leverages locking, as shown in Listing B.1 (rcu_lock.h and rcu_lock.c).

Listing B.1: Lock-Based RCU Implementation

 1 static void rcu_read_lock(void)
 2 {
 3   spin_lock(&rcu_gp_lock);
 4 }
 5
 6 static void rcu_read_unlock(void)
 7 {
 8   spin_unlock(&rcu_gp_lock);
 9 }
10
11 void synchronize_rcu(void)
12 {
13   spin_lock(&rcu_gp_lock);
14   spin_unlock(&rcu_gp_lock);
15 }

In this implementation, rcu_read_lock() acquires a global spinlock, rcu_read_unlock() releases it, and synchronize_rcu() acquires it then immediately releases it.

Because synchronize_rcu() does not return until it has acquired (and released) the lock, it cannot return until all prior RCU read-side critical sections have completed, thus faithfully implementing RCU semantics. Of course, only one RCU reader may be in its read-side critical section at a time, which almost entirely defeats the purpose of RCU. In addition, the lock operations in rcu_read_lock() and rcu_read_unlock() are extremely heavyweight, with read-side overhead ranging from about 100 nanoseconds on a single POWER5 CPU up to more than 17 microseconds on a 64-CPU system. Worse yet, these same lock operations permit rcu_read_lock() to participate in deadlock cycles. Furthermore, in absence of recursive locks, RCU read-side critical sections cannot be nested, and, finally, although concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.

Quick Quiz B.1: Why wouldn’t any deadlock in the RCU implementation in Listing B.1 also be a deadlock in any other RCU implementation?

Quick Quiz B.2: Why not simply use reader-writer locks in the RCU implementation in Listing B.1 in order to allow RCU readers to proceed in parallel?

It is hard to imagine this implementation being useful in a production setting, though it does have the virtue of being implementable in almost any user-level application. Furthermore, similar implementations having one lock per CPU or using reader-writer locks have been used in production in the 2.4 Linux kernel.
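As a concrete illustration of the existence guarantee that even this toy implementation provides, consider the following minimal usage sketch; gptr, struct my_data, reader(), and updater() are illustrative names, READ_ONCE() and WRITE_ONCE() are the wrappers used elsewhere in this book's code samples, and a single updater is assumed. A reader dereferences the shared pointer only within a read-side critical section, and the updater waits for a grace period before freeing the old version.

struct my_data {
	int a;
};

struct my_data *gptr;	/* RCU-protected pointer */

/* Reader: the structure cannot be freed while this critical section runs. */
int reader(void)
{
	struct my_data *p;
	int ret = -1;

	rcu_read_lock();
	p = READ_ONCE(gptr);	/* rcu_dereference() in the Linux kernel */
	if (p)
		ret = p->a;
	rcu_read_unlock();
	return ret;
}

/* Single updater: publish the new version, wait for pre-existing
 * readers, then free the old version. */
void updater(struct my_data *newp)
{
	struct my_data *oldp = gptr;

	WRITE_ONCE(gptr, newp);	/* rcu_assign_pointer() in the Linux kernel */
	synchronize_rcu();	/* wait for all pre-existing readers */
	free(oldp);
}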

Listing B.2: Per-Thread Lock-Based RCU Implementation

 1 static void rcu_read_lock(void)
 2 {
 3   spin_lock(&__get_thread_var(rcu_gp_lock));
 4 }
 5
 6 static void rcu_read_unlock(void)
 7 {
 8   spin_unlock(&__get_thread_var(rcu_gp_lock));
 9 }
10
11 void synchronize_rcu(void)
12 {
13   int t;
14
15   for_each_running_thread(t) {
16     spin_lock(&per_thread(rcu_gp_lock, t));
17     spin_unlock(&per_thread(rcu_gp_lock, t));
18   }
19 }

Listing B.3: RCU Implementation Using Single Global Reference Counter

 1 atomic_t rcu_refcnt;
 2
 3 static void rcu_read_lock(void)
 4 {
 5   atomic_inc(&rcu_refcnt);
 6   smp_mb();
 7 }
 8
 9 static void rcu_read_unlock(void)
10 {
11   smp_mb();
12   atomic_dec(&rcu_refcnt);
13 }
14
15 void synchronize_rcu(void)
16 {
17   smp_mb();
18   while (atomic_read(&rcu_refcnt) != 0) {
19     poll(NULL, 0, 10);
20   }
21   smp_mb();
22 }

A modified version of this one-lock-per-CPU approach, but instead using one lock per thread, is described in the next section.

B.2 Per-Thread Lock-Based RCU

Listing B.2 (rcu_lock_percpu.h and rcu_lock_percpu.c) shows an implementation based on one lock per thread. The rcu_read_lock() and rcu_read_unlock() functions acquire and release, respectively, the current thread’s lock. The synchronize_rcu() function acquires and releases each thread’s lock in turn. Therefore, all RCU read-side critical sections running when synchronize_rcu() starts must have completed before synchronize_rcu() can return.

This implementation does have the virtue of permitting concurrent RCU readers, and does avoid the deadlock condition that can arise with a single global lock. Furthermore, the read-side overhead, though high at roughly 140 nanoseconds, remains at about 140 nanoseconds regardless of the number of CPUs. However, the update-side overhead ranges from about 600 nanoseconds on a single POWER5 CPU up to more than 100 microseconds on 64 CPUs.

Quick Quiz B.3: Wouldn’t it be cleaner to acquire all the locks, and then release them all in the loop from lines 15–18 of Listing B.2? After all, with this change, there would be a point in time when there were no readers, simplifying things greatly.

Quick Quiz B.4: Is the implementation shown in Listing B.2 free from deadlocks? Why or why not?

Quick Quiz B.5: Isn’t one advantage of the RCU algorithm shown in Listing B.2 that it uses only primitives that are widely available, for example, in POSIX pthreads?

This approach could be useful in some situations, given that a similar approach was used in the Linux 2.4 kernel [MM00].

The counter-based RCU implementation described next overcomes some of the shortcomings of the lock-based implementation.

B.3 Simple Counter-Based RCU

A slightly more sophisticated RCU implementation is shown in Listing B.3 (rcu_rcg.h and rcu_rcg.c). This implementation makes use of a global reference counter rcu_refcnt defined on line 1. The rcu_read_lock() primitive atomically increments this counter, then executes a memory barrier to ensure that the RCU read-side critical section is ordered after the atomic increment. Similarly, rcu_read_unlock() executes a memory barrier to confine the RCU read-side critical section, then atomically decrements the counter. The synchronize_rcu() primitive spins waiting for the reference counter to reach zero, surrounded by memory barriers. The poll() on line 19 merely provides pure delay, and from a pure RCU-semantics point of view could be omitted. Again, once synchronize_rcu() returns, all prior RCU read-side critical sections are guaranteed to have completed.

In happy contrast to the lock-based implementation shown in Appendix B.1, this implementation allows par-


allel execution of RCU read-side critical sections. In Listing B.4: RCU Global Reference-Count Pair Data
happy contrast to the per-thread lock-based implemen- 1 DEFINE_SPINLOCK(rcu_gp_lock);
2 atomic_t rcu_refcnt[2];
tation shown in Appendix B.2, it also allows them to 3 atomic_t rcu_idx;
be nested. In addition, the rcu_read_lock() primitive 4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
cannot possibly participate in deadlock cycles, as it never
spins nor blocks.
Listing B.5: RCU Read-Side Using Global Reference-Count
Quick Quiz B.6: But what if you hold a lock across a call to Pair
synchronize_rcu(), and then acquire that same lock within 1 static void rcu_read_lock(void)
2 {
an RCU read-side critical section? 3 int i;
4 int n;
5
However, this implementation still has some serious 6 n = __get_thread_var(rcu_nesting);
shortcomings. First, the atomic operations in rcu_ 7 if (n == 0) {
8 i = atomic_read(&rcu_idx);
read_lock() and rcu_read_unlock() are still quite 9 __get_thread_var(rcu_read_idx) = i;
heavyweight, with read-side overhead ranging from about 10 atomic_inc(&rcu_refcnt[i]);
11 }
100 nanoseconds on a single POWER5 CPU up to almost 12 __get_thread_var(rcu_nesting) = n + 1;
40 microseconds on a 64-CPU system. This means that 13 smp_mb();
14 }
the RCU read-side critical sections have to be extremely 15
long in order to get any real read-side parallelism. On 16 static void rcu_read_unlock(void)
17 {
the other hand, in the absence of readers, grace periods 18 int i;
elapse in about 40 nanoseconds, many orders of magni- 19 int n;
20
tude faster than production-quality implementations in the 21 smp_mb();
Linux kernel. 22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
Quick Quiz B.7: How can the grace period possibly elapse 24 i = __get_thread_var(rcu_read_idx);
25 atomic_dec(&rcu_refcnt[i]);
in 40 nanoseconds when synchronize_rcu() contains a 26 }
10-millisecond delay? 27 __get_thread_var(rcu_nesting) = n - 1;
28 }

Second, if there are many concurrent rcu_read_


lock() and rcu_read_unlock() operations, there will
be extreme memory contention on rcu_refcnt, resulting a variation on the reference-counting scheme that is more
in expensive cache misses. Both of these first two short- favorable to writers.
comings largely defeat a major purpose of RCU, namely
to provide low-overhead read-side synchronization primi-
tives.
B.4 Starvation-Free Counter-
Finally, a large number of RCU readers with long read- Based RCU
side critical sections could prevent synchronize_rcu()
from ever completing, as the global counter might never Listing B.5 (rcu_rcpg.h) shows the read-side primitives
reach zero. This could result in starvation of RCU updates, of an RCU implementation that uses a pair of reference
which is of course unacceptable in production settings. counters (rcu_refcnt[]), along with a global index
that selects one counter out of the pair (rcu_idx), a
Quick Quiz B.8: Why not simply make rcu_read_lock() per-thread nesting counter rcu_nesting, a per-thread
wait when a concurrent synchronize_rcu() has been wait- snapshot of the global index (rcu_read_idx), and a
ing too long in the RCU implementation in Listing B.3?
global lock (rcu_gp_lock), which are themselves shown
Wouldn’t that prevent synchronize_rcu() from starving?
in Listing B.4.

Therefore, it is still hard to imagine this implementation Design It is the two-element rcu_refcnt[] array that
being useful in a production setting, though it has a provides the freedom from starvation. The key point
bit more potential than the lock-based mechanism, for is that synchronize_rcu() is only required to wait
example, as an RCU implementation suitable for a high- for pre-existing readers. If a new reader starts after
stress debugging environment. The next section describes a given instance of synchronize_rcu() has already


begun execution, then that instance of synchronize_ Listing B.6: RCU Update Using Global Reference-Count Pair
rcu() need not wait on that new reader. At any given 1 void synchronize_rcu(void)
2 {
time, when a given reader enters its RCU read-side critical 3 int i;
section via rcu_read_lock(), it increments the element 4
5 smp_mb();
of the rcu_refcnt[] array indicated by the rcu_idx 6 spin_lock(&rcu_gp_lock);
variable. When that same reader exits its RCU read-side 7 i = atomic_read(&rcu_idx);
8 atomic_set(&rcu_idx, !i);
critical section via rcu_read_unlock(), it decrements 9 smp_mb();
whichever element it incremented, ignoring any possible 10 while (atomic_read(&rcu_refcnt[i]) != 0) {
11 poll(NULL, 0, 10);
subsequent changes to the rcu_idx value. 12 }
This arrangement means that synchronize_rcu() 13 smp_mb();
14 atomic_set(&rcu_idx, i);
can avoid starvation by complementing the value of rcu_ 15 smp_mb();
idx, as in rcu_idx = !rcu_idx. Suppose that the 16 while (atomic_read(&rcu_refcnt[!i]) != 0) {
17 poll(NULL, 0, 10);
old value of rcu_idx was zero, so that the new value 18 }
is one. New readers that arrive after the complement 19 spin_unlock(&rcu_gp_lock);
20 smp_mb();
operation will increment rcu_refcnt[1], while the old 21 }
readers that previously incremented rcu_refcnt[0] will
decrement rcu_refcnt[0] when they exit their RCU
read-side critical sections. This means that the value of section does not bleed out before the rcu_read_lock()
rcu_refcnt[0] will no longer be incremented, and thus code.
will be monotonically decreasing.2 This means that all Similarly, the rcu_read_unlock() function executes
that synchronize_rcu() need do is wait for the value a memory barrier at line 21 to ensure that the RCU
of rcu_refcnt[0] to reach zero. read-side critical section does not bleed out after the rcu_
With the background, we are ready to look at the read_unlock() code. Line 22 picks up this thread’s
implementation of the actual primitives. instance of rcu_nesting, and if line 23 finds that this is
the outermost rcu_read_unlock(), then lines 24 and 25
Implementation The rcu_read_lock() primitive pick up this thread’s instance of rcu_read_idx (saved by
atomically increments the member of the rcu_refcnt[] the outermost rcu_read_lock()) and atomically decre-
pair indexed by rcu_idx, and keeps a snapshot of this in- ments the selected element of rcu_refcnt. Regardless of
dex in the per-thread variable rcu_read_idx. The rcu_ the nesting level, line 27 decrements this thread’s instance
read_unlock() primitive then atomically decrements of rcu_nesting.
whichever counter of the pair that the corresponding rcu_ Listing B.6 (rcu_rcpg.c) shows the corresponding
read_lock() incremented. However, because only one synchronize_rcu() implementation. Lines 6 and 19
value of rcu_idx is remembered per thread, additional acquire and release rcu_gp_lock in order to prevent
measures must be taken to permit nesting. These addi- more than one concurrent instance of synchronize_
tional measures use the per-thread rcu_nesting variable rcu(). Lines 7 and 8 pick up the value of rcu_idx and
to track nesting. complement it, respectively, so that subsequent instances
To make all this work, line 6 of rcu_read_lock() of rcu_read_lock() will use a different element of
in Listing B.5 picks up the current thread’s instance of rcu_refcnt than did preceding instances. Lines 10–12
rcu_nesting, and if line 7 finds that this is the outermost then wait for the prior element of rcu_refcnt to reach
rcu_read_lock(), then lines 8–10 pick up the current zero, with the memory barrier on line 9 ensuring that
value of rcu_idx, save it in this thread’s instance of the check of rcu_refcnt is not reordered to precede
rcu_read_idx, and atomically increment the selected the complementing of rcu_idx. Lines 13–18 repeat
element of rcu_refcnt. Regardless of the value of this process, and line 20 ensures that any subsequent
rcu_nesting, line 12 increments it. Line 13 executes a reclamation operations are not reordered to precede the
memory barrier to ensure that the RCU read-side critical checking of rcu_refcnt.
Quick Quiz B.9: Why the memory barrier on line 5 of
2 There is a race condition that this “monotonically decreasing” synchronize_rcu() in Listing B.6 given that there is a
statement ignores. This race condition will be dealt with by the code for spin-lock acquisition immediately after?
synchronize_rcu(). In the meantime, I suggest suspending disbelief.


Quick Quiz B.10: Why is the counter flipped twice in List- Listing B.7: RCU Per-Thread Reference-Count Pair Data
ing B.6? Shouldn’t a single flip-and-wait cycle be sufficient? 1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
This implementation avoids the update-starvation issues 5 DEFINE_PER_THREAD(int, rcu_read_idx);
that could occur in the single-counter implementation
shown in Listing B.3. Listing B.8: RCU Read-Side Using Per-Thread Reference-Count
Pair
1 static void rcu_read_lock(void)
Discussion There are still some serious shortcomings. 2 {
First, the atomic operations in rcu_read_lock() and 3 int i;
4 int n;
rcu_read_unlock() are still quite heavyweight. In fact, 5
they are more complex than those of the single-counter 6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
variant shown in Listing B.3, with the read-side primitives 8 i = atomic_read(&rcu_idx);
consuming about 150 nanoseconds on a single POWER5 9 __get_thread_var(rcu_read_idx) = i;
10 __get_thread_var(rcu_refcnt)[i]++;
CPU and almost 40 microseconds on a 64-CPU system. 11 }
The update-side synchronize_rcu() primitive is more 12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
costly as well, ranging from about 200 nanoseconds on 14 }
a single POWER5 CPU to more than 40 microseconds 15
16 static void rcu_read_unlock(void)
on a 64-CPU system. This means that the RCU read-side 17 {
critical sections have to be extremely long in order to get 18 int i;
19 int n;
any real read-side parallelism. 20
Second, if there are many concurrent rcu_read_ 21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
lock() and rcu_read_unlock() operations, there will 23 if (n == 1) {
be extreme memory contention on the rcu_refcnt ele- 24 i = __get_thread_var(rcu_read_idx);
25 __get_thread_var(rcu_refcnt)[i]--;
ments, resulting in expensive cache misses. This further 26 }
extends the RCU read-side critical-section duration re- 27 __get_thread_var(rcu_nesting) = n - 1;
28 }
quired to provide parallel read-side access. These first
two shortcomings defeat the purpose of RCU in most
situations.
Third, the need to flip rcu_idx twice imposes sub- B.5 Scalable Counter-Based RCU
stantial overhead on updates, especially if there are large
numbers of threads. Listing B.8 (rcu_rcpl.h) shows the read-side primitives
Finally, despite the fact that concurrent RCU updates of an RCU implementation that uses per-thread pairs of
could in principle be satisfied by a common grace period, reference counters. This implementation is quite similar
this implementation serializes grace periods, preventing to that shown in Listing B.5, the only difference being
grace-period sharing. that rcu_refcnt is now a per-thread array (as shown
in Listing B.7). As with the algorithm in the previous
Quick Quiz B.11: Given that atomic increment and decre-
section, use of this two-element array prevents readers
ment are so expensive, why not just use non-atomic increment
from starving updaters. One benefit of per-thread rcu_
on line 10 and a non-atomic decrement on line 25 of List-
ing B.5? refcnt[] array is that the rcu_read_lock() and rcu_
read_unlock() primitives no longer perform atomic
Despite these shortcomings, one could imagine this operations.
variant of RCU being used on small tightly coupled multi- Quick Quiz B.12: Come off it! We can see the atomic_
processors, perhaps as a memory-conserving implementa- read() primitive in rcu_read_lock()!!! So why are you
tion that maintains API compatibility with more complex trying to pretend that rcu_read_lock() contains no atomic
implementations. However, it would not likely scale well operations???
beyond a few CPUs.
The next section describes yet another variation on the Listing B.9 (rcu_rcpl.c) shows the implementa-
reference-counting scheme that provides greatly improved tion of synchronize_rcu(), along with a helper
read-side performance and scalability. function named flip_counter_and_wait(). The


Listing B.9: RCU Update Using Per-Thread Reference-Count Listing B.10: RCU Read-Side Using Per-Thread Reference-
Pair Count Pair and Shared Update Data
1 static void flip_counter_and_wait(int i) 1 DEFINE_SPINLOCK(rcu_gp_lock);
2 { 2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 int t; 3 long rcu_idx;
4 4 DEFINE_PER_THREAD(int, rcu_nesting);
5 atomic_set(&rcu_idx, !i); 5 DEFINE_PER_THREAD(int, rcu_read_idx);
6 smp_mb();
7 for_each_thread(t) {
8 while (per_thread(rcu_refcnt, t)[i] != 0) {
9 poll(NULL, 0, 10);
10 } This implementation still has several shortcomings.
11 } First, the need to flip rcu_idx twice imposes substantial
12 smp_mb();
13 } overhead on updates, especially if there are large numbers
14 of threads.
15 void synchronize_rcu(void)
16 { Second, synchronize_rcu() must now examine a
17 int i; number of variables that increases linearly with the number
18
19 smp_mb(); of threads, imposing substantial overhead on applications
20 spin_lock(&rcu_gp_lock); with large numbers of threads.
21 i = atomic_read(&rcu_idx);
22 flip_counter_and_wait(i); Third, as before, although concurrent RCU updates
23 flip_counter_and_wait(!i); could in principle be satisfied by a common grace period,
24 spin_unlock(&rcu_gp_lock);
25 smp_mb(); this implementation serializes grace periods, preventing
26 } grace-period sharing.
Finally, as noted in the text, the need for per-thread
variables and for enumerating threads may be problematic
synchronize_rcu() function resembles that shown in in some software environments.
Listing B.6, except that the repeated counter flip is re- That said, the read-side primitives scale very nicely,
placed by a pair of calls on lines 22 and 23 to the new requiring about 115 nanoseconds regardless of whether
helper function. running on a single-CPU or a 64-CPU POWER5 system.
The new flip_counter_and_wait() function up- As noted above, the synchronize_rcu() primitive does
dates the rcu_idx variable on line 5, executes a memory not scale, ranging in overhead from almost a microsecond
barrier on line 6, then lines 7–11 spin on each thread’s on a single POWER5 CPU up to almost 200 microseconds
prior rcu_refcnt element, waiting for it to go to zero. on a 64-CPU system. This implementation could con-
Once all such elements have gone to zero, it executes ceivably form the basis for a production-quality user-level
another memory barrier on line 12 and returns. RCU implementation.
The next section describes an algorithm permitting
This RCU implementation imposes important new re-
more efficient concurrent RCU updates.
quirements on its software environment, namely, (1) that
it be possible to declare per-thread variables, (2) that these
per-thread variables be accessible from other threads, and
(3) that it is possible to enumerate all threads. These
B.6 Scalable Counter-Based RCU
requirements can be met in almost all software envi- With Shared Grace Periods
ronments, but often result in fixed upper bounds on the
number of threads. More-complex implementations might Listing B.11 (rcu_rcpls.h) shows the read-side prim-
avoid such bounds, for example, by using expandable hash itives for an RCU implementation using per-thread ref-
tables. Such implementations might dynamically track erence count pairs, as before, but permitting updates to
threads, for example, by adding them on their first call to share grace periods. The main difference from the earlier
rcu_read_lock(). implementation shown in Listing B.8 is that rcu_idx
is now a long that counts freely, so that line 8 of List-
Quick Quiz B.13: Great, if we have 𝑁 threads, we can have ing B.11 must mask off the low-order bit. We also switched
2𝑁 ten-millisecond waits (one set per flip_counter_and_ from using atomic_read() and atomic_set() to using
wait() invocation, and even that assumes that we wait only READ_ONCE(). The data is also quite similar, as shown
once for each thread). Don’t we need the grace period to
in Listing B.10, with rcu_idx now being a long instead
complete much more quickly?
of an atomic_t.


Listing B.11: RCU Read-Side Using Per-Thread Reference- Listing B.12: RCU Shared Update Using Per-Thread Reference-
Count Pair and Shared Update Count Pair
1 static void rcu_read_lock(void) 1 static void flip_counter_and_wait(int ctr)
2 { 2 {
3 int i; 3 int i;
4 int n; 4 int t;
5 5
6 n = __get_thread_var(rcu_nesting); 6 WRITE_ONCE(rcu_idx, ctr + 1);
7 if (n == 0) { 7 i = ctr & 0x1;
8 i = READ_ONCE(rcu_idx) & 0x1; 8 smp_mb();
9 __get_thread_var(rcu_read_idx) = i; 9 for_each_thread(t) {
10 __get_thread_var(rcu_refcnt)[i]++; 10 while (per_thread(rcu_refcnt, t)[i] != 0) {
11 } 11 poll(NULL, 0, 10);
12 __get_thread_var(rcu_nesting) = n + 1; 12 }
13 smp_mb(); 13 }
14 } 14 smp_mb();
15 15 }
16 static void rcu_read_unlock(void) 16
17 { 17 void synchronize_rcu(void)
18 int i; 18 {
19 int n; 19 int ctr;
20 20 int oldctr;
21 smp_mb(); 21
22 n = __get_thread_var(rcu_nesting); 22 smp_mb();
23 if (n == 1) { 23 oldctr = READ_ONCE(rcu_idx);
24 i = __get_thread_var(rcu_read_idx); 24 smp_mb();
25 __get_thread_var(rcu_refcnt)[i]--; 25 spin_lock(&rcu_gp_lock);
26 } 26 ctr = READ_ONCE(rcu_idx);
27 __get_thread_var(rcu_nesting) = n - 1; 27 if (ctr - oldctr >= 3) {
28 } 28 spin_unlock(&rcu_gp_lock);
29 smp_mb();
30 return;
31 }
Listing B.12 (rcu_rcpls.c) shows the implementation 32 flip_counter_and_wait(ctr);
33 if (ctr - oldctr < 2)
of synchronize_rcu() and its helper function flip_ 34 flip_counter_and_wait(ctr + 1);
counter_and_wait(). These are similar to those in 35 spin_unlock(&rcu_gp_lock);
36 smp_mb();
Listing B.9. The differences in flip_counter_and_ 37 }
wait() include:
1. Line 6 uses WRITE_ONCE() instead of atomic_ two counter flips while the lock was being acquired.
set(), and increments rather than complementing. On the other hand, if there were two counter flips,
some other thread did one full wait for all the counters
2. A new line 7 masks the counter down to its bottom
to go to zero, so only one more is required.
bit.
With this approach, if an arbitrarily large number of
The changes to synchronize_rcu() are more perva-
threads invoke synchronize_rcu() concurrently, with
sive:
one CPU for each thread, there will be a total of only three
1. There is a new oldctr local variable that captures waits for counters to go to zero.
the pre-lock-acquisition value of rcu_idx on line 20. Despite the improvements, this implementation of RCU
still has a few shortcomings. First, as before, the need
2. Line 23 uses READ_ONCE() instead of atomic_ to flip rcu_idx twice imposes substantial overhead on
read(). updates, especially if there are large numbers of threads.
3. Lines 27–30 check to see if at least three counter flips Second, each updater still acquires rcu_gp_lock, even
were performed by other threads while the lock was if there is no work to be done. This can result in a
being acquired, and, if so, releases the lock, does a severe scalability limitation if there are large numbers of
memory barrier, and returns. In this case, there were concurrent updates. There are ways of avoiding this, as
two full waits for the counters to go to zero, so those was done in a production-quality real-time implementation
other threads already did all the required work. of RCU for the Linux kernel [McK07a].
Third, this implementation requires per-thread variables
4. At lines 33–34, flip_counter_and_wait() is and the ability to enumerate threads, which again can be
only invoked a second time if there were fewer than problematic in some software environments.


Finally, on 32-bit machines, a given update thread might Listing B.13: Data for Free-Running Counter Using RCU
be preempted long enough for the rcu_idx counter to 1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
overflow. This could cause such a thread to force an 3 DEFINE_PER_THREAD(long, rcu_reader_gp);
unnecessary pair of counter flips. However, even if each 4 DEFINE_PER_THREAD(long, rcu_reader_gp_snap);
grace period took only one microsecond, the offending
thread would need to be preempted for more than an hour, Listing B.14: Free-Running Counter Using RCU
in which case an extra pair of counter flips is likely the 1 static inline void rcu_read_lock(void)
least of your worries. 2 {
3 __get_thread_var(rcu_reader_gp) =
As with the implementation described in Appendix B.3, 4 READ_ONCE(rcu_gp_ctr) + 1;
the read-side primitives scale extremely well, incurring 5 smp_mb();
6 }
roughly 115 nanoseconds of overhead regardless of the 7

number of CPUs. The synchronize_rcu() primitive 8 static inline void rcu_read_unlock(void)


9 {
is still expensive, ranging from about one microsecond 10 smp_mb();
up to about 16 microseconds. This is nevertheless much 11 __get_thread_var(rcu_reader_gp) =
12 READ_ONCE(rcu_gp_ctr);
cheaper than the roughly 200 microseconds incurred by 13 }
14
the implementation in Appendix B.5. So, despite its short- 15 void synchronize_rcu(void)
comings, one could imagine this RCU implementation 16 {
being used in production in real-life applications. 17 int t;
18
19 smp_mb();
Quick Quiz B.14: All of these toy RCU implementations 20 spin_lock(&rcu_gp_lock);
have either atomic operations in rcu_read_lock() and rcu_ 21 WRITE_ONCE(rcu_gp_ctr, rcu_gp_ctr + 2);
read_unlock(), or synchronize_rcu() overhead that in- 22 smp_mb();
23 for_each_thread(t) {
creases linearly with the number of threads. Under what cir- 24 while ((per_thread(rcu_reader_gp, t) & 0x1) &&
cumstances could an RCU implementation enjoy lightweight 25 ((per_thread(rcu_reader_gp, t) -
implementations for all three of these primitives, all having 26 rcu_gp_ctr) < 0)) {
27 poll(NULL, 0, 10);
deterministic (O (1)) overheads and latencies? 28 }
29 }
Referring back to Listing B.11, we see that there is 30 spin_unlock(&rcu_gp_lock);
31 smp_mb();
one global-variable access and no fewer than four ac- 32 }
cesses to thread-local variables. Given the relatively
high cost of thread-local accesses on systems implement-
ing POSIX threads, it is tempting to collapse the three the rcu_reader_gp per-thread variable. Line 5 executes
thread-local variables into a single structure, permitting a memory barrier to prevent the content of the subsequent
rcu_read_lock() and rcu_read_unlock() to access RCU read-side critical section from “leaking out”.
their thread-local data with a single thread-local-storage The rcu_read_unlock() implementation is similar.
access. However, an even better approach would be to Line 10 executes a memory barrier, again to prevent the
reduce the number of thread-local accesses to one, as is prior RCU read-side critical section from “leaking out”.
done in the next section. Lines 11 and 12 then copy the rcu_gp_ctr global variable
to the rcu_reader_gp per-thread variable, leaving this
per-thread variable with an even-numbered value so that a
B.7 RCU Based on Free-Running concurrent instance of synchronize_rcu() will know
Counter to ignore it.
Quick Quiz B.15: If any even value is sufficient to tell
Listing B.14 (rcu.h and rcu.c) shows an RCU imple- synchronize_rcu() to ignore a given task, why don’t
mentation based on a single global free-running counter lines 11 and 12 of Listing B.14 simply assign zero to rcu_
that takes on only even-numbered values, with data shown reader_gp?
in Listing B.13.
The resulting rcu_read_lock() implementation is Thus, synchronize_rcu() could wait for all of the
extremely straightforward. Lines 3 and 4 simply add per-thread rcu_reader_gp variables to take on even-
the value one to the global free-running rcu_gp_ctr numbered values. However, it is possible to do much better
variable and stores the resulting odd-numbered value into than that because synchronize_rcu() need only wait


Thus, synchronize_rcu() could wait for all of the per-thread rcu_reader_gp variables to take on even-numbered values. However, it is possible to do much better than that because synchronize_rcu() need only wait on pre-existing RCU read-side critical sections. Line 19 executes a memory barrier to prevent prior manipulations of RCU-protected data structures from being reordered (by either the CPU or the compiler) to follow the increment on line 21. Line 20 acquires the rcu_gp_lock (and line 30 releases it) in order to prevent multiple synchronize_rcu() instances from running concurrently. Line 21 then increments the global rcu_gp_ctr variable by two, so that all pre-existing RCU read-side critical sections will have corresponding per-thread rcu_reader_gp variables with values less than that of rcu_gp_ctr, modulo the machine's word size. Recall also that threads with even-numbered values of rcu_reader_gp are not in an RCU read-side critical section, so that lines 23-29 scan the rcu_reader_gp values until they all are either even (line 24) or are greater than the global rcu_gp_ctr (lines 25-26). Line 27 blocks for a short period of time to wait for a pre-existing RCU read-side critical section, but this can be replaced with a spin-loop if grace-period latency is of the essence. Finally, the memory barrier at line 31 ensures that any subsequent destruction will not be reordered into the preceding loop.

Quick Quiz B.16: Why are the memory barriers on lines 19 and 31 of Listing B.14 needed? Aren't the memory barriers inherent in the locking primitives on lines 20 and 30 sufficient?

This approach achieves much better read-side performance, incurring roughly 63 nanoseconds of overhead regardless of the number of POWER5 CPUs. Updates incur more overhead, ranging from about 500 nanoseconds on a single POWER5 CPU to more than 100 microseconds on 64 such CPUs.

Quick Quiz B.17: Couldn't the update-side batching optimization described in Appendix B.6 be applied to the implementation shown in Listing B.14?

This implementation suffers from some serious shortcomings in addition to the high update-side overhead noted earlier. First, it is no longer permissible to nest RCU read-side critical sections, a topic that is taken up in the next section. Second, if a reader is preempted at line 3 of Listing B.14 after fetching from rcu_gp_ctr but before storing to rcu_reader_gp, and if the rcu_gp_ctr counter then runs through more than half but less than all of its possible values, then synchronize_rcu() will ignore the subsequent RCU read-side critical section. Third and finally, this implementation requires that the enclosing software environment be able to enumerate threads and maintain per-thread variables.

Quick Quiz B.18: Is the possibility of readers being preempted in lines 3-4 of Listing B.14 a real problem, in other words, is there a real sequence of events that could lead to failure? If not, why not? If so, what is the sequence of events, and how can the failure be addressed?

B.8 Nestable RCU Based on Free-Running Counter

Listing B.16 (rcu_nest.h and rcu_nest.c) shows an RCU implementation based on a single global free-running counter, but that permits nesting of RCU read-side critical sections. This nestability is accomplished by reserving the low-order bits of the global rcu_gp_ctr to count nesting, using the definitions shown in Listing B.15. This is a generalization of the scheme in Appendix B.7, which can be thought of as having a single low-order bit reserved for counting nesting depth. Two C-preprocessor macros are used to arrange this, RCU_GP_CTR_NEST_MASK and RCU_GP_CTR_BOTTOM_BIT. These are related: RCU_GP_CTR_NEST_MASK = RCU_GP_CTR_BOTTOM_BIT - 1. The RCU_GP_CTR_BOTTOM_BIT macro contains a single bit that is positioned just above the bits reserved for counting nesting, and the RCU_GP_CTR_NEST_MASK has all one bits covering the region of rcu_gp_ctr used to count nesting. Obviously, these two C-preprocessor macros must reserve enough of the low-order bits of the counter to permit the maximum required nesting of RCU read-side critical sections, and this implementation reserves seven bits, for a maximum RCU read-side critical-section nesting depth of 127, which should be well in excess of that needed by most applications.

Listing B.15: Data for Nestable RCU Using a Free-Running Counter
  1 DEFINE_SPINLOCK(rcu_gp_lock);
  2 #define RCU_GP_CTR_SHIFT 7
  3 #define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
  4 #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
  5 #define MAX_GP_ADV_DISTANCE (RCU_GP_CTR_NEST_MASK << 8)
  6 unsigned long rcu_gp_ctr = 0;
  7 DEFINE_PER_THREAD(unsigned long, rcu_reader_gp);
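To make the counter layout concrete, here is a small sketch (not part of the original listings; the helper names are invented for illustration) showing how a value of rcu_reader_gp splits into a nesting count and a grace-period count under the Listing B.15 definitions:

  /* Illustration only: decompose a counter value per Listing B.15. */
  static inline unsigned long rrgp_nesting(unsigned long v)
  {
          return v & RCU_GP_CTR_NEST_MASK;   /* low-order seven bits */
  }

  static inline unsigned long rrgp_gp_count(unsigned long v)
  {
          return v >> RCU_GP_CTR_SHIFT;      /* grace-period counter proper */
  }

  /* For example, after three nested rcu_read_lock() calls taken while
   * rcu_gp_ctr held 42 << RCU_GP_CTR_SHIFT, rcu_reader_gp would hold
   * (42 << 7) + 3, so rrgp_nesting() returns 3 and rrgp_gp_count()
   * returns 42. */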


Listing B.16: Nestable RCU Using a Free-Running Counter
  1 static void rcu_read_lock(void)
  2 {
  3   unsigned long tmp;
  4   unsigned long *rrgp;
  5
  6   rrgp = &__get_thread_var(rcu_reader_gp);
  7   tmp = *rrgp;
  8   if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
  9     tmp = READ_ONCE(rcu_gp_ctr);
 10   tmp++;
 11   WRITE_ONCE(*rrgp, tmp);
 12   smp_mb();
 13 }
 14
 15 static void rcu_read_unlock(void)
 16 {
 17   smp_mb();
 18   __get_thread_var(rcu_reader_gp)--;
 19 }
 20
 21 void synchronize_rcu(void)
 22 {
 23   int t;
 24
 25   smp_mb();
 26   spin_lock(&rcu_gp_lock);
 27   WRITE_ONCE(rcu_gp_ctr, rcu_gp_ctr +
 28              RCU_GP_CTR_BOTTOM_BIT);
 29   smp_mb();
 30   for_each_thread(t) {
 31     while (rcu_gp_ongoing(t) &&
 32            ((READ_ONCE(per_thread(rcu_reader_gp, t)) -
 33              rcu_gp_ctr) < 0)) {
 34       poll(NULL, 0, 10);
 35     }
 36   }
 37   spin_unlock(&rcu_gp_lock);
 38   smp_mb();
 39 }

The resulting rcu_read_lock() implementation is still reasonably straightforward. Line 6 places a pointer to this thread's instance of rcu_reader_gp into the local variable rrgp, minimizing the number of expensive calls to the pthreads thread-local-state API. Line 7 records the current value of rcu_reader_gp into another local variable tmp, and line 8 checks to see if the low-order bits are zero, which would indicate that this is the outermost rcu_read_lock(). If so, line 9 places the global rcu_gp_ctr into tmp because the current value previously fetched by line 7 is likely to be obsolete. In either case, line 10 increments the nesting depth, which you will recall is stored in the seven low-order bits of the counter. Line 11 stores the updated counter back into this thread's instance of rcu_reader_gp, and, finally, line 12 executes a memory barrier to prevent the RCU read-side critical section from bleeding out into the code preceding the call to rcu_read_lock().

In other words, this implementation of rcu_read_lock() picks up a copy of the global rcu_gp_ctr unless the current invocation of rcu_read_lock() is nested within an RCU read-side critical section, in which case it instead fetches the contents of the current thread's instance of rcu_reader_gp. Either way, it increments whatever value it fetched in order to record an additional nesting level, and stores the result in the current thread's instance of rcu_reader_gp.

Interestingly enough, despite their rcu_read_lock() differences, the implementation of rcu_read_unlock() is broadly similar to that shown in Appendix B.7. Line 17 executes a memory barrier in order to prevent the RCU read-side critical section from bleeding out into code following the call to rcu_read_unlock(), and line 18 decrements this thread's instance of rcu_reader_gp, which has the effect of decrementing the nesting count contained in rcu_reader_gp's low-order bits. Debugging versions of this primitive would check (before decrementing!) that these low-order bits were non-zero.

The implementation of synchronize_rcu() is quite similar to that shown in Appendix B.7. There are two differences. The first is that lines 27 and 28 add RCU_GP_CTR_BOTTOM_BIT to the global rcu_gp_ctr instead of adding the constant "2", and the second is that the comparison on line 31 has been abstracted out to a separate function, where it checks the bits indicated by RCU_GP_CTR_NEST_MASK instead of unconditionally checking the low-order bit.

This approach achieves read-side performance almost equal to that shown in Appendix B.7, incurring roughly 65 nanoseconds of overhead regardless of the number of POWER5 CPUs. Updates again incur more overhead, ranging from about 600 nanoseconds on a single POWER5 CPU to more than 100 microseconds on 64 such CPUs.

Quick Quiz B.19: Why not simply maintain a separate per-thread nesting-level variable, as was done in the previous section, rather than having all this complicated bit manipulation?

This implementation suffers from the same shortcomings as does that of Appendix B.7, except that nesting of RCU read-side critical sections is now permitted. In addition, on 32-bit systems, this approach shortens the time required to overflow the global rcu_gp_ctr variable. The following section shows one way to greatly increase the time required for overflow to occur, while greatly reducing read-side overhead.

Quick Quiz B.20: Given the algorithm shown in Listing B.16, how could you double the time required to overflow the global rcu_gp_ctr?

Quick Quiz B.21: Again, given the algorithm shown in Listing B.16, is counter overflow fatal? Why or why not? If it is fatal, what can be done to fix it?
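The rcu_gp_ongoing() helper invoked on line 31 of Listing B.16 is not reproduced in this excerpt. Based on the description above (it checks the bits indicated by RCU_GP_CTR_NEST_MASK rather than unconditionally checking the low-order bit), a minimal sketch consistent with the text might be:

  /* Sketch only: does thread t currently appear to be inside an RCU
   * read-side critical section?  A non-zero nesting field says yes. */
  static inline int rcu_gp_ongoing(int t)
  {
          return (READ_ONCE(per_thread(rcu_reader_gp, t)) &
                  RCU_GP_CTR_NEST_MASK) != 0;
  }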


B.9 RCU Based on Quiescent States

Listing B.18 (rcu_qs.h) shows the read-side primitives used to construct a user-level implementation of RCU based on quiescent states, with the data shown in Listing B.17. As can be seen from lines 1-7 in the listing, the rcu_read_lock() and rcu_read_unlock() primitives do nothing, and can in fact be expected to be inlined and optimized away, as they are in server builds of the Linux kernel. This is due to the fact that quiescent-state-based RCU implementations approximate the extents of RCU read-side critical sections using the aforementioned quiescent states. Each of these quiescent states contains a call to rcu_quiescent_state(), which is shown from lines 9-15 in the listing. Threads entering extended quiescent states (for example, when blocking) may instead call rcu_thread_offline() (lines 17-23) when entering an extended quiescent state and then call rcu_thread_online() (lines 25-28) when leaving it. As such, rcu_thread_online() is analogous to rcu_read_lock() and rcu_thread_offline() is analogous to rcu_read_unlock(). In addition, rcu_quiescent_state() can be thought of as a rcu_thread_online() immediately followed by a rcu_thread_offline().3 It is illegal to invoke rcu_quiescent_state(), rcu_thread_offline(), or rcu_thread_online() from an RCU read-side critical section.

Footnote 3: Although the code in the listing is consistent with rcu_quiescent_state() being the same as rcu_thread_online() immediately followed by rcu_thread_offline(), this relationship is obscured by performance optimizations.

Listing B.17: Data for Quiescent-State-Based RCU
  1 DEFINE_SPINLOCK(rcu_gp_lock);
  2 long rcu_gp_ctr = 0;
  3 DEFINE_PER_THREAD(long, rcu_reader_qs_gp);

Listing B.18: Quiescent-State-Based RCU Read Side
  1 static void rcu_read_lock(void)
  2 {
  3 }
  4
  5 static void rcu_read_unlock(void)
  6 {
  7 }
  8
  9 static void rcu_quiescent_state(void)
 10 {
 11   smp_mb();
 12   __get_thread_var(rcu_reader_qs_gp) =
 13     READ_ONCE(rcu_gp_ctr) + 1;
 14   smp_mb();
 15 }
 16
 17 static void rcu_thread_offline(void)
 18 {
 19   smp_mb();
 20   __get_thread_var(rcu_reader_qs_gp) =
 21     READ_ONCE(rcu_gp_ctr);
 22   smp_mb();
 23 }
 24
 25 static void rcu_thread_online(void)
 26 {
 27   rcu_quiescent_state();
 28 }

In rcu_quiescent_state(), line 11 executes a memory barrier to prevent any code prior to the quiescent state (including possible RCU read-side critical sections) from being reordered into the quiescent state. Lines 12-13 pick up a copy of the global rcu_gp_ctr, using READ_ONCE() to ensure that the compiler does not employ any optimizations that would result in rcu_gp_ctr being fetched more than once, and then add one to the value fetched and store it into the per-thread rcu_reader_qs_gp variable, so that any concurrent instance of synchronize_rcu() will see an odd-numbered value, thus becoming aware that a new RCU read-side critical section has started. Instances of synchronize_rcu() that are waiting on older RCU read-side critical sections will thus know to ignore this new one. Finally, line 14 executes a memory barrier, which prevents subsequent code (including a possible RCU read-side critical section) from being reordered with lines 12-13.

Quick Quiz B.22: Doesn't the additional memory barrier shown on line 14 of Listing B.18 greatly increase the overhead of rcu_quiescent_state()?

Some applications might use RCU only occasionally, but use it very heavily when they do use it. Such applications might choose to use rcu_thread_online() when starting to use RCU and rcu_thread_offline() when no longer using RCU. The time between a call to rcu_thread_offline() and a subsequent call to rcu_thread_online() is an extended quiescent state, so that RCU will not expect explicit quiescent states to be registered during this time.

The rcu_thread_offline() function simply sets the per-thread rcu_reader_qs_gp variable to the current value of rcu_gp_ctr, which has an even-numbered value. Any concurrent instances of synchronize_rcu() will thus know to ignore this thread.

Quick Quiz B.23: Why are the two memory barriers on lines 11 and 14 of Listing B.18 needed?
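As a usage illustration (not taken from the original text), a thread using this quiescent-state-based API might be structured as follows; more_work(), do_work(), and wait_for_more_work() are hypothetical application functions:

  /* Hypothetical worker loop using the quiescent-state primitives above. */
  void worker_loop(void)
  {
          rcu_thread_online();             /* this thread will report quiescent states */
          while (more_work()) {
                  rcu_read_lock();
                  do_work();               /* reads RCU-protected data */
                  rcu_read_unlock();
                  rcu_quiescent_state();   /* announce a quiescent state */
                  rcu_thread_offline();    /* extended quiescent state while blocking */
                  wait_for_more_work();    /* may block indefinitely */
                  rcu_thread_online();
          }
          rcu_thread_offline();            /* stop delaying grace periods */
  }

Note that the offline/online pair brackets the potentially unbounded blocking call, which is exactly the "extended quiescent state" usage described above.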


The rcu_thread_online() function simply invokes rcu_quiescent_state(), thus marking the end of the extended quiescent state.

Listing B.19 (rcu_qs.c) shows the implementation of synchronize_rcu(), which is quite similar to that of the preceding sections.

Listing B.19: RCU Update Side Using Quiescent States
  1 void synchronize_rcu(void)
  2 {
  3   int t;
  4
  5   smp_mb();
  6   spin_lock(&rcu_gp_lock);
  7   WRITE_ONCE(rcu_gp_ctr, rcu_gp_ctr + 2);
  8   smp_mb();
  9   for_each_thread(t) {
 10     while (rcu_gp_ongoing(t) &&
 11            ((per_thread(rcu_reader_qs_gp, t)
 12              - rcu_gp_ctr) < 0)) {
 13       poll(NULL, 0, 10);
 14     }
 15   }
 16   spin_unlock(&rcu_gp_lock);
 17   smp_mb();
 18 }

This implementation has blazingly fast read-side primitives, with an rcu_read_lock()-rcu_read_unlock() round trip incurring an overhead of roughly 50 picoseconds. The synchronize_rcu() overhead ranges from about 600 nanoseconds on a single-CPU POWER5 system up to more than 100 microseconds on a 64-CPU system.

Quick Quiz B.24: To be sure, the clock frequencies of POWER systems in 2008 were quite high, but even a 5 GHz clock frequency is insufficient to allow loops to be executed in 50 picoseconds! What is going on here?

However, this implementation requires that each thread either invoke rcu_quiescent_state() periodically or invoke rcu_thread_offline() for extended quiescent states. The need to invoke these functions periodically can make this implementation difficult to use in some situations, such as for certain types of library functions.

Quick Quiz B.25: Why would the fact that the code is in a library make any difference for how easy it is to use the RCU implementation shown in Listings B.18 and B.19?

Quick Quiz B.26: But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section? This should be a deadlock, but how can a primitive that generates absolutely no code possibly participate in a deadlock cycle?

In addition, this implementation does not permit concurrent calls to synchronize_rcu() to share grace periods. That said, one could easily imagine a production-quality RCU implementation based on this version of RCU.

B.10 Summary of Toy RCU Implementations

If you made it this far, congratulations! You should now have a much clearer understanding not only of RCU itself, but also of the requirements of enclosing software environments and applications. Those wishing an even deeper understanding are invited to read descriptions of production-quality RCU implementations [DMS+12, McK07a, McK08b, McK09a].

The preceding sections listed some desirable properties of the various RCU primitives. The following list is provided for easy reference for those wishing to create a new RCU implementation.

1. There must be read-side primitives (such as rcu_read_lock() and rcu_read_unlock()) and grace-period primitives (such as synchronize_rcu() and call_rcu()), such that any RCU read-side critical section in existence at the start of a grace period has completed by the end of the grace period.

2. RCU read-side primitives should have minimal overhead. In particular, expensive operations such as cache misses, atomic instructions, memory barriers, and branches should be avoided.

3. RCU read-side primitives should have O(1) computational complexity to enable real-time use. (This implies that readers run concurrently with updaters.)

4. RCU read-side primitives should be usable in all contexts (in the Linux kernel, they are permitted everywhere except in the idle loop). An important special case is that RCU read-side primitives be usable within an RCU read-side critical section, in other words, that it be possible to nest RCU read-side critical sections.

5. RCU read-side primitives should be unconditional, with no failure returns. This property is extremely important, as failure checking increases complexity and complicates testing and validation.

6. Any operation other than a quiescent state (and thus a grace period) should be permitted in an RCU read-side critical section. In particular, irrevocable operations such as I/O should be permitted.

7. It should be possible to update an RCU-protected data structure while executing within an RCU read-side critical section.

8. Both RCU read-side and update-side primitives should be independent of memory allocator design and implementation, in other words, the same RCU implementation should be able to protect a given data structure regardless of how the data elements are allocated and freed.

9. RCU grace periods should not be blocked by threads that halt outside of RCU read-side critical sections. (But note that most quiescent-state-based implementations violate this desideratum.)

Quick Quiz B.27: Given that grace periods are prohibited within RCU read-side critical sections, how can an RCU data structure possibly be updated while in an RCU read-side critical section?
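As a small illustration of item 4 above (invented for this excerpt, not taken from the original text), nesting arises naturally whenever code that already holds an RCU read lock calls a helper that protects itself the same way:

  /* Hypothetical example of nested read-side critical sections. */
  struct config;                          /* some RCU-protected data            */
  struct config *lookup_config(int key);  /* assumed to do its own              */
                                          /* rcu_read_lock()/rcu_read_unlock()  */
  void handle(struct config *cfg);        /* assumed application function       */

  void process_request(int key)
  {
          struct config *cfg;

          rcu_read_lock();                /* outer critical section */
          cfg = lookup_config(key);       /* inner lock/unlock nests harmlessly */
          if (cfg)
                  handle(cfg);
          rcu_read_unlock();
  }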

Appendix C

Why Memory Barriers?

    Order! Order in the court!
        Unknown

So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers?

In short, because reordering memory references allows much better performance, courtesy of the finite speed of light and the non-zero size of atoms noted in Section 3.2, and particularly in the hardware-performance question posed by Quick Quiz 3.7. Therefore, memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references.

Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well. The following sections:

1. Present the structure of a cache,

2. Describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,

3. Outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.

C.1 Cache Structure

Figure C.1: Modern Computer System Cache Structure (diagram not reproduced: CPU 0 and CPU 1, each with its own cache, attached via an interconnect to memory)

Modern CPUs are much faster than are modern memory systems. A 2006 CPU might be capable of executing ten instructions per nanosecond, but will require many tens of nanoseconds to fetch a data item from main memory. This disparity in speed, more than two orders of magnitude, has resulted in the multi-megabyte caches found on modern CPUs. These caches are associated with the CPUs as shown in Figure C.1, and can typically be accessed in a few cycles.1

Footnote 1: It is standard practice to use multiple levels of cache, with a small level-one cache close to the CPU with single-cycle access time, and a larger level-two cache with a longer access time, perhaps roughly ten clock cycles. Higher-performance CPUs often have three or even four levels of cache.

Data flows among the CPUs' caches and memory in fixed-length blocks called "cache lines", which are normally a power of two in size, ranging from 16 to 256 bytes. When a given data item is first accessed by a given CPU, it will be absent from that CPU's cache, meaning that a "cache miss" (or, more specifically, a "startup" or "warmup" cache miss) has occurred. The cache miss means that the CPU will have to wait (or be "stalled") for hundreds of cycles while the item is fetched from memory.


However, the item will be loaded into that CPU's cache, so that subsequent accesses will find it in the cache and therefore run at full speed.

After some time, the CPU's cache will fill, and subsequent misses will likely need to eject an item from the cache in order to make room for the newly fetched item. Such a cache miss is termed a "capacity miss", because it is caused by the cache's limited capacity. However, most caches can be forced to eject an old item to make room for a new item even when they are not yet full. This is due to the fact that large caches are implemented as hardware hash tables with fixed-size hash buckets (or "sets", as CPU designers call them) and no chaining, as shown in Figure C.2.

Figure C.2: CPU Cache Structure

         Way 0        Way 1
  0x0    0x12345000
  0x1    0x12345100
  0x2    0x12345200
  0x3    0x12345300
  0x4    0x12345400
  0x5    0x12345500
  0x6    0x12345600
  0x7    0x12345700
  0x8    0x12345800
  0x9    0x12345900
  0xA    0x12345A00
  0xB    0x12345B00
  0xC    0x12345C00
  0xD    0x12345D00
  0xE    0x12345E00   0x43210E00
  0xF

This cache has sixteen "sets" and two "ways" for a total of 32 "lines", each entry containing a single 256-byte "cache line", which is a 256-byte-aligned block of memory. This cache line size is a little on the large side, but makes the hexadecimal arithmetic much simpler. In hardware parlance, this is a two-way set-associative cache, and is analogous to a software hash table with sixteen buckets, where each bucket's hash chain is limited to at most two elements. The size (32 cache lines in this case) and the associativity (two in this case) are collectively called the cache's "geometry". Since this cache is implemented in hardware, the hash function is extremely simple: Extract four bits from the memory address.

In Figure C.2, each box corresponds to a cache entry, which can contain a 256-byte cache line. However, a cache entry can be empty, as indicated by the empty boxes in the figure. The rest of the boxes are flagged with the memory address of the cache line that they contain. Since the cache lines must be 256-byte aligned, the low eight bits of each address are zero, and the choice of hardware hash function means that the next-higher four bits match the hash line number.

The situation depicted in the figure might arise if the program's code were located at address 0x43210E00 through 0x43210EFF, and this program accessed data sequentially from 0x12345000 through 0x12345EFF. Suppose that the program were now to access location 0x12345F00. This location hashes to line 0xF, and both ways of this line are empty, so the corresponding 256-byte line can be accommodated. If the program were to access location 0x1233000, which hashes to line 0x0, the corresponding 256-byte cache line can be accommodated in way 1. However, if the program were to access location 0x1233E00, which hashes to line 0xE, one of the existing lines must be ejected from the cache to make room for the new cache line. If this ejected line were accessed later, a cache miss would result. Such a cache miss is termed an "associativity miss".

Thus far, we have been considering only cases where a CPU reads a data item. What happens when it does a write? Because it is important that all CPUs agree on the value of a given data item, before a given CPU writes to that data item, it must first cause it to be removed, or "invalidated", from other CPUs' caches. Once this invalidation has completed, the CPU may safely modify the data item. If the data item was present in this CPU's cache, but was read-only, this process is termed a "write miss". Once a given CPU has completed invalidating a given data item from other CPUs' caches, that CPU may repeatedly write (and read) that data item.

Later, if one of the other CPUs attempts to access the data item, it will incur a cache miss, this time because the first CPU invalidated the item in order to write to it. This type of cache miss is termed a "communication miss", since it is usually due to several CPUs using the data items to communicate (for example, a lock is a data item that is used to communicate among CPUs using a mutual-exclusion algorithm).

Clearly, much care must be taken to ensure that all CPUs maintain a coherent view of the data. With all this fetching, invalidating, and writing, it is easy to imagine data being lost or (perhaps worse) different CPUs having conflicting values for the same data item in their respective caches. These problems are prevented by "cache-coherency protocols", described in the next section.

C.2 Cache-Coherence Protocols

Cache-coherence protocols manage cache-line states so as to prevent inconsistent or lost data. These protocols can be quite complex, with many tens of states,2 but for our purposes we need only concern ourselves with the four-state MESI cache-coherence protocol.

Footnote 2: See Culler et al. [CSG99] pages 670 and 671 for the nine-state and 26-state diagrams for SGI Origin2000 and Sequent (now IBM) NUMA-Q, respectively. Both diagrams are significantly simpler than real life.

C.2.1 MESI States

MESI stands for "modified", "exclusive", "shared", and "invalid", the four states a given cache line can take on using this protocol. Caches using this protocol therefore maintain a two-bit state "tag" on each cache line in addition to that line's physical address and data.

A line in the "modified" state has been subject to a recent memory store from the corresponding CPU, and the corresponding memory is guaranteed not to appear in any other CPU's cache. Cache lines in the "modified" state can thus be said to be "owned" by the CPU. Because this cache holds the only up-to-date copy of the data, this cache is ultimately responsible for either writing it back to memory or handing it off to some other cache, and must do so before reusing this line to hold other data.

The "exclusive" state is very similar to the "modified" state, the single exception being that the cache line has not yet been modified by the corresponding CPU, which in turn means that the copy of the cache line's data that resides in memory is up-to-date. However, since the CPU can store to this line at any time, without consulting other CPUs, a line in the "exclusive" state can still be said to be owned by the corresponding CPU. That said, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.

A line in the "shared" state might be replicated in at least one other CPU's cache, so that this CPU is not permitted to store to the line without first consulting with other CPUs. As with the "exclusive" state, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.

A line in the "invalid" state is empty, in other words, it holds no data. When new data enters the cache, it is placed into a cache line that was in the "invalid" state if possible. This approach is preferred because replacing a line in any other state could result in an expensive cache miss should the replaced line be referenced in the future.

Since all CPUs must maintain a coherent view of the data carried in the cache lines, the cache-coherence protocol provides messages that coordinate the movement of cache lines through the system.

C.2.2 MESI Protocol Messages

Many of the transitions described in the previous section require communication among the CPUs. If the CPUs are on a single shared bus, the following messages suffice:

Read: The "read" message contains the physical address of the cache line to be read.

Read Response: The "read response" message contains the data requested by an earlier "read" message. This "read response" message might be supplied either by memory or by one of the other caches. For example, if one of the caches has the desired data in "modified" state, that cache must supply the "read response" message.

Invalidate: The "invalidate" message contains the physical address of the cache line to be invalidated. All other caches must remove the corresponding data from their caches and respond.

Invalidate Acknowledge: A CPU receiving an "invalidate" message must respond with an "invalidate acknowledge" message after removing the specified data from its cache.

Read Invalidate: The "read invalidate" message contains the physical address of the cache line to be read, while at the same time directing other caches to remove the data. Hence, it is a combination of a "read" and an "invalidate", as indicated by its name. A "read invalidate" message requires both a "read response" and a set of "invalidate acknowledge" messages in reply.

Writeback: The "writeback" message contains both the address and the data to be written back to memory (and perhaps "snooped" into other CPUs' caches along the way). This message permits caches to eject lines in the "modified" state as needed to make room for other data.

Quick Quiz C.1: Where does a writeback message originate from and where does it go to?

Interestingly enough, a shared-memory multiprocessor system really is a message-passing computer under the covers. This means that clusters of SMP machines that use distributed shared memory are using message passing to implement shared memory at two different levels of the system architecture.

Quick Quiz C.2: What happens if two CPUs attempt to invalidate the same cache line concurrently?

Quick Quiz C.3: When an "invalidate" message appears in a large multiprocessor, every CPU must give an "invalidate acknowledge" response. Wouldn't the resulting "storm" of "invalidate acknowledge" responses totally saturate the system bus?

Quick Quiz C.4: If SMP machines are really using message passing anyway, why bother with SMP at all?

C.2.3 MESI State Diagram

A given cache line's state changes as protocol messages are sent and received, as shown in Figure C.3.

Figure C.3: MESI Cache-Coherency State Diagram (diagram not reproduced: the four states M, E, S, and I with transition arcs labeled a through l among them)

The transition arcs in this figure are as follows:

Transition (a): A cache line is written back to memory, but the CPU retains it in its cache and further retains the right to modify it. This transition requires a "writeback" message.

Transition (b): The CPU writes to the cache line that it already had exclusive access to. This transition does not require any messages to be sent or received.

Transition (c): The CPU receives a "read invalidate" message for a cache line that it has modified. The CPU must invalidate its local copy, then respond with both a "read response" and an "invalidate acknowledge" message, both sending the data to the requesting CPU and indicating that it no longer has a local copy.

Transition (d): The CPU does an atomic read-modify-write operation on a data item that was not present in its cache. It transmits a "read invalidate", receiving the data via a "read response". The CPU can complete the transition once it has also received a full set of "invalidate acknowledge" responses.

Transition (e): The CPU does an atomic read-modify-write operation on a data item that was previously read-only in its cache. It must transmit "invalidate" messages, and must wait for a full set of "invalidate acknowledge" responses before completing the transition.

Transition (f): Some other CPU reads the cache line, and it is supplied from this CPU's cache, which retains a read-only copy, possibly also writing it back to memory. This transition is initiated by the reception of a "read" message, and this CPU responds with a "read response" message containing the requested data.

Transition (g): Some other CPU reads a data item in this cache line, and it is supplied either from this CPU's cache or from memory. In either case, this CPU retains a read-only copy. This transition is initiated by the reception of a "read" message, and this CPU responds with a "read response" message containing the requested data.

Transition (h): This CPU realizes that it will soon need to write to some data item in this cache line, and thus transmits an "invalidate" message. The CPU cannot complete the transition until it receives a full set of "invalidate acknowledge" responses, indicating that no other CPU has this cacheline in its cache. In other words, this CPU is the only CPU caching it.

Transition (i): Some other CPU does an atomic read-modify-write operation on a data item in a cache line held only in this CPU's cache, so this CPU invalidates it from its cache. This transition is initiated by the reception of a "read invalidate" message, and this CPU responds with both a "read response" and an "invalidate acknowledge" message.

Transition (j): This CPU does a store to a data item in a cache line that was not in its cache, and thus transmits a "read invalidate" message. The CPU cannot complete the transition until it receives the "read response" and a full set of "invalidate acknowledge" messages. The cache line will presumably transition to "modified" state via transition (b) as soon as the actual store completes.

Transition (k): This CPU loads a data item in a cache line that was not in its cache. The CPU transmits a "read" message, and completes the transition upon receiving the corresponding "read response".

Transition (l): Some other CPU does a store to a data item in this cache line, but holds this cache line in read-only state due to its being held in other CPUs' caches (such as the current CPU's cache). This transition is initiated by the reception of an "invalidate" message, and this CPU responds with an "invalidate acknowledge" message.

Quick Quiz C.5: How does the hardware handle the delayed transitions described above?

C.2.4 MESI Protocol Example

Let's now look at this from the perspective of a cache line's worth of data, initially residing in memory at address 0, as it travels through the various single-line direct-mapped caches in a four-CPU system. Table C.1 shows this flow of data, with the first column showing the sequence of operations, the second the CPU performing the operation, the third the operation being performed, the next four the state of each CPU's cache line (memory address followed by MESI state), and the final two columns whether the corresponding memory contents are up to date ("V") or not ("I").

Initially, the CPU cache lines in which the data would reside are in the "invalid" state, and the data is valid in memory. When CPU 0 loads the data at address 0, it enters the "shared" state in CPU 0's cache, and is still valid in memory. CPU 3 also loads the data at address 0, so that it is in the "shared" state in both CPUs' caches, and is still valid in memory. Next CPU 0 loads some other cache line (at address 8), which forces the data at address 0 out of its cache via an invalidation, replacing it with the data at address 8. CPU 2 now does a load from address 0, but this CPU realizes that it will soon need to store to it, and so it uses a "read invalidate" message in order to gain an exclusive copy, invalidating it from CPU 3's cache (though the copy in memory remains up to date). Next CPU 2 does its anticipated store, changing the state to "modified". The copy of the data in memory is now out of date. CPU 1 does an atomic increment, using a "read invalidate" to snoop the data from CPU 2's cache and invalidate it, so that the copy in CPU 1's cache is in the "modified" state (and the copy in memory remains out of date). Finally, CPU 1 reads the cache line at address 8, which uses a "writeback" message to push address 0's data back out to memory.

Note that we end with data in some of the CPU's caches.

Quick Quiz C.6: What sequence of operations would put the CPUs' caches all back into the "invalid" state?

Table C.1: Cache Coherence Example

                                        CPU Cache               Memory
  Sequence #  CPU #  Operation      0     1     2     3       0    8
  0                  Initial State  -/I   -/I   -/I   -/I     V    V
  1           0      Load           0/S   -/I   -/I   -/I     V    V
  2           3      Load           0/S   -/I   -/I   0/S     V    V
  3           0      Invalidation   8/S   -/I   -/I   0/S     V    V
  4           2      RMW            8/S   -/I   0/E   -/I     V    V
  5           2      Store          8/S   -/I   0/M   -/I     I    V
  6           1      Atomic Inc     8/S   0/M   -/I   -/I     I    V
  7           1      Writeback      8/S   8/S   -/I   -/I     V    V

C.3 Stores Result in Unnecessary Stalls

Although the cache structure shown in Figure C.1 provides good performance for repeated reads and writes from a given CPU to a given item of data, its performance for the first write to a given cache line is quite poor. To see this, consider Figure C.4, which shows a timeline of a write by CPU 0 to a cacheline held in CPU 1's cache. Since CPU 0 must wait for the cache line to arrive before it can write to it, CPU 0 must stall for an extended period of time.3

Footnote 3: The time required to transfer a cache line from one CPU's cache to another's is typically a few orders of magnitude more than that required to execute a simple register-to-register instruction.

Figure C.4: Writes See Unnecessary Stalls (diagram not reproduced: CPU 0's write sends an "invalidate" message to CPU 1, and CPU 0 stalls until the acknowledgement returns)

But there is no real reason to force CPU 0 to stall for so long. After all, regardless of what data happens to be in the cache line that CPU 1 sends it, CPU 0 is going to unconditionally overwrite it.

C.3.1 Store Buffers

One way to prevent this unnecessary stalling of writes is to add "store buffers" between each CPU and its cache, as shown in Figure C.5. With the addition of these store buffers, CPU 0 can simply record its write in its store buffer and continue executing. When the cache line does finally make its way from CPU 1 to CPU 0, the data will be moved from the store buffer to the cache line.

Figure C.5: Caches With Store Buffers (diagram not reproduced: each of CPU 0 and CPU 1 has a store buffer between it and its cache, with the caches attached via an interconnect to memory)

Quick Quiz C.7: But then why do uniprocessors also have store buffers?

Please note that the store buffer does not necessarily operate on full cache lines. The reason for this is that a given store-buffer entry need only contain the value stored, not the other data contained in the corresponding cache line. Which is a good thing, because the CPU doing the store has no idea what that other data might be! But once the corresponding cache line arrives, any values from the store buffer that update that cache line can be merged into it, and the corresponding entries can then be removed from the store buffer. Any other data in that cache line is of course left intact.

Quick Quiz C.8: So store-buffer entries are variable length? Isn't that difficult to implement in hardware?
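To make the preceding paragraph concrete, here is a rough sketch (invented for illustration, not from the original text) of what a store-buffer entry might record and how it could be merged into an arriving cache line:

  #include <string.h>

  #define CACHE_LINE_SIZE 256

  /* One pending store: only the bytes actually written are recorded. */
  struct store_buffer_entry {
          unsigned long addr;     /* address of the first byte written    */
          unsigned char len;      /* number of bytes written (1, 2, 4, 8) */
          unsigned char data[8];  /* the value being stored               */
  };

  /* When the cache line covering an entry finally arrives, merge the
   * buffered bytes into it; all other bytes of the line are left intact. */
  static void merge_into_line(unsigned char *line, unsigned long line_addr,
                              const struct store_buffer_entry *e)
  {
          unsigned long offset = e->addr - line_addr; /* assumed to fit in the line */

          memcpy(line + offset, e->data, e->len);
  }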

These store buffers are local to a given CPU or, on systems with hardware multithreading, local to a given core. Either way, a given CPU is permitted to access only the store buffer assigned to it. For example, in Figure C.5, CPU 0 cannot access CPU 1's store buffer and vice versa. This restriction simplifies the hardware by separating concerns: The store buffer improves performance for consecutive writes, while the responsibility for communicating among CPUs (or cores, as the case may be) is fully shouldered by the cache-coherence protocol. However, even given this restriction, there are complications that must be addressed, which are covered in the next two sections.

C.3.2 Store Forwarding

To see the first complication, a violation of self-consistency, consider the following code with variables "a" and "b" both initially zero, and with the cache line containing variable "a" initially owned by CPU 1 and that containing "b" initially owned by CPU 0:

  1 a = 1;
  2 b = a + 1;
  3 assert(b == 2);

One would not expect the assertion to fail. However, if one were foolish enough to use the very simple architecture shown in Figure C.5, one would be surprised. Such a system could potentially see the following sequence of events:

1. CPU 0 starts executing the a = 1.

2. CPU 0 looks "a" up in the cache, and finds that it is missing.

3. CPU 0 therefore sends a "read invalidate" message in order to get exclusive ownership of the cache line containing "a".

4. CPU 0 records the store to "a" in its store buffer.

5. CPU 1 receives the "read invalidate" message, and responds by transmitting the cache line and removing that cacheline from its cache.

6. CPU 0 starts executing the b = a + 1.

7. CPU 0 receives the cache line from CPU 1, which still has a value of zero for "a".

8. CPU 0 loads "a" from its cache, finding the value zero.

9. CPU 0 applies the entry from its store buffer to the newly arrived cache line, setting the value of "a" in its cache to one.

10. CPU 0 adds one to the value zero loaded for "a" above, and stores it into the cache line containing "b" (which we will assume is already owned by CPU 0).

11. CPU 0 executes assert(b == 2), which fails.

The problem is that we have two copies of "a", one in the cache and the other in the store buffer.

This example breaks a very important guarantee, namely that each CPU will always see its own operations as if they happened in program order. Breaking this guarantee is violently counter-intuitive to software types, so much so that the hardware guys took pity and implemented "store forwarding", where each CPU refers to (or "snoops") its store buffer as well as its cache when performing loads, as shown in Figure C.6. In other words, a given CPU's stores are directly forwarded to its subsequent loads, without having to pass through the cache.

Figure C.6: Caches With Store Forwarding (diagram not reproduced: as in Figure C.5, but with each CPU's loads also consulting its own store buffer)

With store forwarding in place, item 8 in the above sequence would have found the correct value of 1 for "a" in the store buffer, so that the final value of "b" would have been 2, as one would hope.

C.3.3 Store Buffers and Memory Barriers

To see the second complication, a violation of global memory ordering, consider the following code sequences with variables "a" and "b" initially zero:

  1 void foo(void)
  2 {
  3   a = 1;
  4   b = 1;
  5 }
  6
  7 void bar(void)
  8 {
  9   while (b == 0) continue;
 10   assert(a == 1);
 11 }

Suppose CPU 0 executes foo() and CPU 1 executes bar(). Suppose further that the cache line containing "a" resides only in CPU 1's cache, and that the cache line containing "b" is owned by CPU 0. Then the sequence of operations might be as follows:

1. CPU 0 executes a = 1. The cache line is not in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits a "read invalidate" message.

2. CPU 1 executes while (b == 0) continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.

3. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.

4. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.

5. CPU 1 receives the cache line containing "b" and installs it in its cache.

6. CPU 1 can now finish executing while (b == 0) continue, and since it finds that the value of "b" is 1, it proceeds to the next statement.

7. CPU 1 executes the assert(a == 1), and, since CPU 1 is working with the old value of "a", this assertion fails.

8. CPU 1 receives the "read invalidate" message, and transmits the cache line containing "a" to CPU 0 and invalidates this cache line from its own cache. But it is too late.

9. CPU 0 receives the cache line containing "a" and applies the buffered store just in time to fall victim to CPU 1's failed assertion.

Quick Quiz C.9: In step 1 above, why does CPU 0 need to issue a "read invalidate" rather than a simple "invalidate"? After all, foo() will overwrite the variable a in any case, so why should it care about the old value of a?

Quick Quiz C.10: In step 9 above, did bar() read a stale value from a, or did its reads of b and a get reordered?

The hardware designers cannot help directly here, since the CPUs have no idea which variables are related, let alone how they might be related. Therefore, the hardware designers provide memory-barrier instructions to allow the software to tell the CPU about such relations. The program fragment must be updated to contain the memory barrier:

  1 void foo(void)
  2 {
  3   a = 1;
  4   smp_mb();
  5   b = 1;
  6 }
  7
  8 void bar(void)
  9 {
 10   while (b == 0) continue;
 11   assert(a == 1);
 12 }

The memory barrier smp_mb() will cause the CPU to flush its store buffer before applying each subsequent store to its variable's cache line. The CPU could either simply stall until the store buffer was empty before proceeding, or it could use the store buffer to hold subsequent stores until all of the prior entries in the store buffer had been applied.

With this latter approach the sequence of operations might be as follows:

1. CPU 0 executes a = 1. The cache line is not in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits a "read invalidate" message.

2. CPU 1 executes while (b == 0) continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.

3. CPU 0 executes smp_mb(), and marks all current store-buffer entries (namely, the a = 1).

4. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), but there is a marked entry in the store buffer. Therefore, rather than store the new value of "b" in the cache line, it instead places it in the store buffer (but in an unmarked entry).

5. CPU 0 receives the "read" message, and transmits the cache line containing the original value of "b" to CPU 1. It also marks its own copy of this cache line as "shared".

6. CPU 1 receives the cache line containing "b" and installs it in its cache.

7. CPU 1 can now load the value of "b", but since it finds that the value of "b" is still 0, it repeats the while statement. The new value of "b" is safely hidden in CPU 0's store buffer.

8. CPU 1 receives the "read invalidate" message, and transmits the cache line containing "a" to CPU 0 and invalidates this cache line from its own cache.

9. CPU 0 receives the cache line containing "a" and applies the buffered store, placing this line into the "modified" state.

10. Since the store to "a" was the only entry in the store buffer that was marked by the smp_mb(), CPU 0 can also store the new value of "b", except for the fact that the cache line containing "b" is now in "shared" state.

11. CPU 0 therefore sends an "invalidate" message to CPU 1.

12. CPU 1 receives the "invalidate" message, invalidates the cache line containing "b" from its cache, and sends an "acknowledgement" message to CPU 0.

13. CPU 1 executes while (b == 0) continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message to CPU 0.

14. CPU 0 receives the "acknowledgement" message, and puts the cache line containing "b" into the "exclusive" state. CPU 0 now stores the new value of "b" into the cache line.

15. CPU 0 receives the "read" message, and transmits the cache line containing the new value of "b" to CPU 1. It also marks its own copy of this cache line as "shared".

16. CPU 1 receives the cache line containing "b" and installs it in its cache.

17. CPU 1 can now load the value of "b", and since it finds that the value of "b" is 1, it exits the while loop and proceeds to the next statement.

18. CPU 1 executes the assert(a == 1), but the cache line containing "a" is no longer in its cache. Once it gets this cache from CPU 0, it will be working with the up-to-date value of "a", and the assertion therefore passes.

Quick Quiz C.11: After step 15 in Appendix C.3.3 on page 437, both CPUs might drop the cache line containing the new value of "b". Wouldn't that cause this new value to be lost?

As you can see, this process involves no small amount of bookkeeping. Even something intuitively simple, like "load the value of a" can involve lots of complex steps in silicon.

C.4 Store Sequences Result in Unnecessary Stalls

Unfortunately, each store buffer must be relatively small, which means that a CPU executing a modest sequence of stores can fill its store buffer (for example, if all of them result in cache misses). At that point, the CPU must once again wait for invalidations to complete in order to drain its store buffer before it can continue executing. This same situation can arise immediately after a memory barrier, when all subsequent store instructions must wait for invalidations to complete, regardless of whether or not these stores result in cache misses.

This situation can be improved by making invalidate acknowledge messages arrive more quickly. One way of accomplishing this is to use per-CPU queues of invalidate messages, or "invalidate queues".
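As a preview of the mechanism elaborated in the following subsections, an invalidate queue can be thought of as a small per-CPU FIFO that allows the invalidate to be acknowledged before it has actually been applied. The following sketch is illustrative only (the helper functions are assumed, and real hardware does all of this in silicon):

  /* Toy per-CPU invalidate queue: acknowledge first, apply later. */
  #define INVALIDATE_QUEUE_SLOTS 16

  struct invalidate_queue {
          unsigned long addr[INVALIDATE_QUEUE_SLOTS];
          unsigned int head, tail;
  };

  extern void send_invalidate_ack(int to_cpu, unsigned long addr); /* assumed */
  extern void cache_invalidate(unsigned long addr);                /* assumed */

  /* On receipt of an "invalidate" message: queue it, then acknowledge at once.
   * (A real implementation would also handle a full queue.) */
  static void handle_invalidate(struct invalidate_queue *q, int from_cpu,
                                unsigned long addr)
  {
          q->addr[q->tail++ % INVALIDATE_QUEUE_SLOTS] = addr;
          send_invalidate_ack(from_cpu, addr);
  }

  /* Applied later, and always before sending further messages about addr. */
  static void drain_invalidate_queue(struct invalidate_queue *q)
  {
          while (q->head != q->tail)
                  cache_invalidate(q->addr[q->head++ % INVALIDATE_QUEUE_SLOTS]);
  }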

C.4.1 Invalidate Queues

One reason that invalidate acknowledge messages can take so long is that they must ensure that the corresponding cache line is actually invalidated, and this invalidation can be delayed if the cache is busy, for example, if the CPU is intensively loading and storing data, all of which resides in the cache. In addition, if a large number of invalidate messages arrive in a short time period, a given CPU might fall behind in processing them, thus possibly stalling all the other CPUs.

However, the CPU need not actually invalidate the cache line before sending the acknowledgement. It could instead queue the invalidate message with the understanding that the message will be processed before the CPU sends any further messages regarding that cache line.

C.4.2 Invalidate Queues and Invalidate Acknowledge

Figure C.7 shows a system with invalidate queues. A CPU with an invalidate queue may acknowledge an invalidate message as soon as it is placed in the queue, instead of having to wait until the corresponding line is actually invalidated. Of course, the CPU must refer to its invalidate queue when preparing to transmit invalidation messages: if an entry for the corresponding cache line is in the invalidate queue, the CPU cannot immediately transmit the invalidate message; it must instead wait until the invalidate-queue entry has been processed.

Figure C.7: Caches With Invalidate Queues (diagram not reproduced: as in Figure C.5, but with an invalidate queue between each CPU's cache and the interconnect)

Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. As long as the corresponding data structures are not highly contended, the CPU will rarely be inconvenienced by such a promise.

However, the fact that invalidate messages can be buffered in the invalidate queue provides additional opportunity for memory-misordering, as discussed in the next section.

C.4.3 Invalidate Queues and Memory Barriers

Let us suppose that CPUs queue invalidation requests, but respond to them immediately. This approach minimizes the cache-invalidation latency seen by CPUs doing stores, but can defeat memory barriers, as seen in the following example.

Suppose the values of "a" and "b" are initially zero, that "a" is replicated read-only (MESI "shared" state), and that "b" is owned by CPU 0 (MESI "exclusive" or "modified" state). Then suppose that CPU 0 executes foo() while CPU 1 executes function bar() in the following code fragment:

  1 void foo(void)
  2 {
  3   a = 1;
  4   smp_mb();
  5   b = 1;
  6 }
  7
  8 void bar(void)
  9 {
 10   while (b == 0) continue;
 11   assert(a == 1);
 12 }

Then the sequence of operations might be as follows:

1. CPU 0 executes a = 1. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message in order to flush the corresponding cache line from CPU 1's cache.

2. CPU 1 executes while (b == 0) continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.


3. CPU 1 receives CPU 0's "invalidate" message, queues it, and immediately responds to it.

4. CPU 0 receives the response from CPU 1, and is therefore free to proceed past the smp_mb() on line 4 above, moving the value of "a" from its store buffer to its cache line.

5. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.

6. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.

7. CPU 1 receives the cache line containing "b" and installs it in its cache.

8. CPU 1 can now finish executing while (b == 0) continue, and since it finds that the value of "b" is 1, it proceeds to the next statement.

9. CPU 1 executes the assert(a == 1), and, since the old value of "a" is still in CPU 1's cache, this assertion fails.

10. Despite the assertion failure, CPU 1 processes the queued "invalidate" message, and (tardily) invalidates the cache line containing "a" from its own cache.

Quick Quiz C.12: In step 1 of the first scenario in Appendix C.4.3, why is an "invalidate" sent instead of a "read invalidate" message? Doesn't CPU 0 need the values of the other variables that share this cache line with "a"?

There is clearly not much point in accelerating invalidation responses if doing so causes memory barriers to effectively be ignored. However, the memory-barrier instructions can interact with the invalidate queue, so that when a given CPU executes a memory barrier, it marks all the entries currently in its invalidate queue, and forces any subsequent load to wait until all marked entries have been applied to the CPU's cache. Therefore, we can add a memory barrier to function bar as follows:

  1 void foo(void)
  2 {
  3   a = 1;
  4   smp_mb();
  5   b = 1;
  6 }
  7
  8 void bar(void)
  9 {
 10   while (b == 0) continue;
 11   smp_mb();
 12   assert(a == 1);
 13 }

Quick Quiz C.13: Say what??? Why do we need a memory barrier here, given that the CPU cannot possibly execute the assert() until after the while loop completes?

With this change, the sequence of operations might be as follows:

1. CPU 0 executes a = 1. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message in order to flush the corresponding cache line from CPU 1's cache.

2. CPU 1 executes while (b == 0) continue, but the cache line containing "b" is not in its cache. It therefore transmits a "read" message.

3. CPU 1 receives CPU 0's "invalidate" message, queues it, and immediately responds to it.

4. CPU 0 receives the response from CPU 1, and is therefore free to proceed past the smp_mb() on line 4 above, moving the value of "a" from its store buffer to its cache line.

5. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.

6. CPU 0 receives the "read" message, and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its own cache.

7. CPU 1 receives the cache line containing "b" and installs it in its cache.

8. CPU 1 can now finish executing while (b == 0) continue, and since it finds that the value of "b" is 1, it proceeds to the next statement, which is now a memory barrier.

9. CPU 1 must now stall until it processes all pre-existing messages in its invalidation queue.

10. CPU 1 now processes the queued "invalidate" message, and invalidates the cache line containing "a" from its own cache.

11. CPU 1 executes the assert(a == 1), and, since the cache line containing "a" is no longer in CPU 1's cache, it transmits a "read" message.

12. CPU 0 responds to this "read" message with the cache line containing the new value of "a".

13. CPU 1 receives this cache line, which contains a value of 1 for "a", so that the assertion does not trigger.

With much passing of MESI messages, the CPUs arrive at the correct answer. This section illustrates why CPU designers must be extremely careful with their cache-coherence optimizations. The key requirement is that the memory barriers provide the appearance of ordering to the software. As long as these appearances are maintained, the hardware can carry out whatever queueing, buffering, marking, stallings, and flushing optimizations it likes.

Quick Quiz C.14: Instead of all of this marking of invalidation-queue entries and stalling of loads, why not simply force an immediate flush of the invalidation queue?

C.5 Read and Write Memory Barriers

In the previous section, memory barriers were used to mark entries in both the store buffer and the invalidate queue. But in our code fragment, foo() had no reason to do anything with the invalidate queue, and bar() similarly had no reason to do anything with the store buffer.

Many CPU architectures therefore provide weaker memory-barrier instructions that do only one or the other of these two. Roughly speaking, a "read memory barrier" marks only the invalidate queue (and snoops entries in the store buffer) and a "write memory barrier" marks only the store buffer, while a full-fledged memory barrier does all of the above.

The software-visible effect of these hardware mechanisms is that a read memory barrier orders only loads on the CPU that executes it, so that all loads preceding the read memory barrier will appear to have completed before any load following the read memory barrier. Similarly, a write memory barrier orders only stores, again on the CPU that executes it, and again so that all stores preceding the write memory barrier will appear to have completed before any store following the write memory barrier. A full-fledged memory barrier orders both loads and stores, but again only on the CPU executing the memory barrier.

Quick Quiz C.15: But can't full memory barriers impose global ordering? After all, isn't that needed to provide the ordering shown in Listing 12.27?

If we update foo and bar to use read and write memory barriers, they appear as follows:

  1 void foo(void)
  2 {
  3   a = 1;
  4   smp_wmb();
  5   b = 1;
  6 }
  7
  8 void bar(void)
  9 {
 10   while (b == 0) continue;
 11   smp_rmb();
 12   assert(a == 1);
 13 }

Some computers have even more flavors of memory barriers, but understanding these three variants will provide a good introduction to memory barriers in general.

C.6 Example Memory-Barrier Sequences

This section presents some seductive but subtly broken uses of memory barriers. Although many of them will work most of the time, and some will work all the time on some specific CPUs, these uses must be avoided if the goal is to produce code that works reliably on all CPUs. To help us better see the subtle breakage, we first need to focus on an ordering-hostile architecture.

C.6.1 Ordering-Hostile Architecture

A number of ordering-hostile computer systems have been produced over the decades, but the nature of the hostility has always been extremely subtle, and understanding it has required detailed knowledge of the specific hardware.
v2022.09.25a
C.6. EXAMPLE MEMORY-BARRIER SEQUENCES 441

Rather than picking on a specific hardware vendor, and as Node 0 Node 1


a presumably attractive alternative to dragging the reader CPU 0 CPU 1 CPU 2 CPU 3
through detailed technical specifications, let us instead
design a mythical but maximally memory-ordering-hostile Cache Cache
computer architecture.4
This hardware must obey the following ordering con-
straints [McK05a, McK05b]: CPU 0 CPU 1 CPU 2 CPU 3
Message Message Message Message
1. Each CPU will always perceive its own memory Queue Queue Queue Queue
accesses as occurring in program order.
Interconnect
2. CPUs will reorder a given operation with a store
only if the two operations are referencing different
locations. Memory

3. All of a given CPU’s loads preceding a read memory


barrier (smp_rmb()) will be perceived by all CPUs Figure C.8: Example Ordering-Hostile Architecture
to precede any loads following that read memory
barrier. Listing C.1: Memory Barrier Example 1
4. All of a given CPU’s stores preceding a write memory CPU 0 CPU 1 CPU 2
barrier (smp_wmb()) will be perceived by all CPUs a = 1;
smp_wmb(); while (b == 0);
to precede any stores following that write memory b = 1; c = 1; z = c;
barrier. smp_rmb();
x = a;
assert(z == 0 || x == 1);
5. All of a given CPU’s accesses (loads and stores)
preceding a full memory barrier (smp_mb()) will
be perceived by all CPUs to precede any accesses
following that memory barrier. C.6.2 Example 1

Quick Quiz C.16: Does the guarantee that each CPU sees Listing C.1 shows three code fragments, executed concur-
its own memory accesses in order also guarantee that each rently by CPUs 0, 1, and 2. Each of “a”, “b”, and “c” are
user-level thread will see its own memory accesses in order? initially zero.
Why or why not? Suppose CPU 0 recently experienced many cache
misses, so that its message queue is full, but that CPU 1
Imagine a large non-uniform cache architecture (NUCA)
has been running exclusively within the cache, so that its
system that, in order to provide fair allocation of inter-
message queue is empty. Then CPU 0’s assignment to
connect bandwidth to CPUs in a given node, provided
“a” and “b” will appear in Node 0’s cache immediately
per-CPU queues in each node’s interconnect interface, as
(and thus be visible to CPU 1), but will be blocked behind
shown in Figure C.8. Although a given CPU’s accesses
CPU 0’s prior traffic. In contrast, CPU 1’s assignment
are ordered as specified by memory barriers executed by
to “c” will sail through CPU 1’s previously empty queue.
that CPU, however, the relative order of a given pair of
Therefore, CPU 2 might well see CPU 1’s assignment to
CPUs’ accesses could be severely reordered, as we will
“c” before it sees CPU 0’s assignment to “a”, causing the
see.5
assertion to fire, despite the memory barriers.
4 Readers preferring a detailed look at real hardware architectures

are encouraged to consult CPU vendors’ manuals [SW95, Adv02, Int02a,


Therefore, portable code cannot rely on this assertion
IBM94, LHF05, SPA94, Int04b, Int04a, Int04c], Gharachorloo’s disser- not firing, as both the compiler and the CPU can reorder
tation [Gha95], Peter Sewell’s work [Sew], or the excellent hardware- the code so as to trip the assertion.
oriented primer by Sorin, Hill, and Wood [SHW11].
5 Any real hardware architect or designer will no doubt be objecting
Quick Quiz C.17: Could this code be fixed by inserting a
strenuously, as they just might be a bit upset about the prospect of
working out which queue should handle a message involving a cache
memory barrier between CPU 1’s “while” and assignment to
line that both CPUs accessed, to say nothing of the many races that this “c”? Why or why not?
example poses. All I can say is “Give me a better example”.

v2022.09.25a
442 APPENDIX C. WHY MEMORY BARRIERS?

Listing C.2: Memory Barrier Example 2 if any, are required to enable the code to work correctly, in
CPU 0 CPU 1 CPU 2 other words, to prevent the assertion from firing?
a = 1; while (a == 0);
smp_mb(); y = b;
b = 1; smp_rmb(); Quick Quiz C.19: If CPU 2 executed an
x = a; assert(e==0||c==1) in the example in Listing C.3, would
assert(y == 0 || x == 1);
this assert ever trigger?

The Linux kernel’s synchronize_rcu() primitive


C.6.3 Example 2
uses an algorithm similar to that shown in this example.
Listing C.2 shows three code fragments, executed concur-
rently by CPUs 0, 1, and 2. Both “a” and “b” are initially
zero. C.7 Are Memory Barriers Forever?
Again, suppose CPU 0 recently experienced many cache
misses, so that its message queue is full, but that CPU 1 There have been a number of recent systems that are
has been running exclusively within the cache, so that its significantly less aggressive about out-of-order execution
message queue is empty. Then CPU 0’s assignment to “a” in general and re-ordering memory references in particular.
will appear in Node 0’s cache immediately (and thus be Will this trend continue to the point where memory barriers
visible to CPU 1), but will be blocked behind CPU 0’s are a thing of the past?
prior traffic. In contrast, CPU 1’s assignment to “b” will
The argument in favor would cite proposed massively
sail through CPU 1’s previously empty queue. Therefore,
multi-threaded hardware architectures, so that each thread
CPU 2 might well see CPU 1’s assignment to “b” before
would wait until memory was ready, with tens, hundreds,
it sees CPU 0’s assignment to “a”, causing the assertion
or even thousands of other threads making progress in
to fire, despite the memory barriers.
the meantime. In such an architecture, there would be no
In theory, portable code should not rely on this example need for memory barriers, because a given thread would
code fragment, however, as before, in practice it actually simply wait for all outstanding operations to complete
does work on most mainstream computer systems. before proceeding to the next instruction. Because there
would be potentially thousands of other threads, the CPU
C.6.4 Example 3 would be completely utilized, so no CPU time would be
wasted.
Listing C.3 shows three code fragments, executed con- The argument against would cite the extremely lim-
currently by CPUs 0, 1, and 2. All variables are initially ited number of applications capable of scaling up to a
zero. thousand threads, as well as increasingly severe realtime
Note that neither CPU 1 nor CPU 2 can proceed to requirements, which are in the tens of microseconds for
line 5 until they see CPU 0’s assignment to “b” on line 3. some applications. The realtime-response requirements
Once CPU 1 and 2 have executed their memory barriers on are difficult enough to meet as is, and would be even more
line 4, they are both guaranteed to see all assignments by difficult to meet given the extremely low single-threaded
CPU 0 preceding its memory barrier on line 2. Similarly, throughput implied by the massive multi-threaded scenar-
CPU 0’s memory barrier on line 8 pairs with those of ios.
CPUs 1 and 2 on line 4, so that CPU 0 will not execute Another argument in favor would cite increasingly
the assignment to “e” on line 9 until after its assignment sophisticated latency-hiding hardware implementation
to “b” is visible to both of the other CPUs. Therefore, techniques that might well allow the CPU to provide the
CPU 2’s assertion on line 9 is guaranteed not to fire. illusion of fully sequentially consistent execution while
Quick Quiz C.18: Suppose that lines 3–5 for CPUs 1 and 2 still providing almost all of the performance advantages
in Listing C.3 are in an interrupt handler, and that the CPU 2’s of out-of-order execution. A counter-argument would
line 9 runs at process level. In other words, the code in all cite the increasingly severe power-efficiency requirements
three columns of the table runs on the same CPU, but the first presented both by battery-operated devices and by envi-
two columns run in an interrupt handler, and the third column ronmental responsibility.
runs at process level, so that the code in third column can be
Who is right? We have no clue, so we are preparing to
interrupted by the code in the first two columns. What changes,
live with either scenario.

v2022.09.25a
C.8. ADVICE TO HARDWARE DESIGNERS 443

Listing C.3: Memory Barrier Example 3


CPU 0 CPU 1 CPU 2
1 a = 1;
2 smp_wmb();
3 b = 1; while (b == 0); while (b == 0);
4 smp_mb(); smp_mb();
5 c = 1; d = 1;
6 while (c == 0);
7 while (d == 0);
8 smp_mb();
9 e = 1; assert(e == 0 || a == 1);

C.8 Advice to Hardware Designers 2. External busses that fail to transmit cache-coherence
data.
There are any number of things that hardware designers This is an even more painful variant of the above
can do to make the lives of software people difficult. Here problem, but causes groups of devices—and even
is a list of a few such things that we have encountered memory itself—to fail to respect cache coherence. It
in the past, presented here in the hope that it might help is my painful duty to inform you that as embedded
prevent future such problems: systems move to multicore architectures, we will no
doubt see a fair number of such problems arise. By
1. I/O devices that ignore cache coherence.
the year 2021, there were some efforts to address
This charming misfeature can result in DMAs from these problems with new interconnect standards, with
memory missing recent changes to the output buffer, some debate as to how effective these standards will
or, just as bad, cause input buffers to be overwritten really be [Won19].
by the contents of CPU caches just after the DMA
completes. To make your system work in face of 3. Device interrupts that ignore cache coherence.
such misbehavior, you must carefully flush the CPU This might sound innocent enough—after all, in-
caches of any location in any DMA buffer before terrupts aren’t memory references, are they? But
presenting that buffer to the I/O device. Otherwise, a imagine a CPU with a split cache, one bank of which
store from one of the CPUs might not be accounted is extremely busy, therefore holding onto the last
for in the data DMAed out through the device. This cacheline of the input buffer. If the corresponding
is a form of data corruption, which is an extremely I/O-complete interrupt reaches this CPU, then that
serious bug. CPU’s memory reference to the last cache line of the
Similarly, you need to invalidate6 the CPU caches buffer could return old data, again resulting in data
corresponding to any location in any DMA buffer corruption, but in a form that will be invisible in a
after DMA to that buffer completes. Otherwise, a later crash dump. By the time the system gets around
given CPU might see the old data still residing in to dumping the offending input buffer, the DMA will
its cache instead of the newly DMAed data that it most likely have completed.
was supposed to see. This is another form of data
corruption. 4. Inter-processor interrupts (IPIs) that ignore cache
coherence.
And even then, you need to be very careful to avoid
pointer bugs, as even a misplaced read to an input This can be problematic if the IPI reaches its destina-
buffer can result in corrupting the data input! One tion before all of the cache lines in the corresponding
way to avoid this is to invalidate all of the caches of message buffer have been committed to memory.
all of the CPUs once the DMA completes, but it is
5. Context switches that get ahead of cache coherence.
much easier and more efficient if the device DMA
participates in the cache-coherence protocol, making If memory accesses can complete too wildly out of
all of this flushing and invalidating unnecessary. order, then context switches can be quite harrowing.
If the task flits from one CPU to another before all
6 Why not flush? If there is a difference, then a CPU must have the memory accesses visible to the source CPU make
incorrectly stored to the DMA buffer in the midst of the DMA operation. it to the destination CPU, then the task could easily

v2022.09.25a
444 APPENDIX C. WHY MEMORY BARRIERS?

see the corresponding variables revert to prior values,


which can fatally confuse most algorithms.
6. Overly kind simulators and emulators.
It is difficult to write simulators or emulators that
force memory re-ordering, so software that runs just
fine in these environments can get a nasty surprise
when it first runs on the real hardware. Unfortunately,
it is still the rule that the hardware is more devious
than are the simulators and emulators, but we hope
that this situation changes.

Again, we encourage hardware designers to avoid these


practices!

v2022.09.25a
De gustibus non est disputandum.
Latin maxim
Appendix D

Style Guide

This appendix is a collection of style guides which is • \co{} for identifiers, \url{} for URLs, \path{}
intended as a reference to improve consistency in perfbook. for filenames.
It also contains several suggestions and their experimental
examples. • Dates should use an unambiguous format. Never
Appendix D.1 describes basic punctuation and spelling “mm/dd/yy” or “dd/mm/yy”, but rather “July 26, 2016”
rules. Appendix D.2 explains rules related to unit symbols. or “26 July 2016” or “26-Jul-2016” or “2016/07/26”.
Appendix D.3 summarizes LATEX-specific conventions. I tend to use yyyy.mm.ddA for filenames, for exam-
ple.
• North American rules on periods and abbreviations.
D.1 Paul’s Conventions For example neither of the following can reasonably
be interpreted as two sentences:
Following is the list of Paul’s conventions assembled from
his answers to Akira’s questions regarding perfbook’s – Say hello, to Mr. Jones.
punctuation policy. – If it looks like she sprained her ankle, call
Dr. Smith and then tell her to keep the ankle
• (On punctuations and quotations) Despite being iced and elevated.
American myself, for this sort of book, the UK
approach is better because it removes ambiguities An ambiguous example:
like the following:
If I take the cow, the pig, the horse, etc.
Type “ls -a,” look for the file “.,” and George will be upset.
file a bug if you don’t see it.
can be written with more words:
The following is much more clear:
If I take the cow, the pig, the horse, or
Type “ls -a”, look for the file “.”, and much of anything else, George will be
file a bug if you don’t see it. upset.
or:
• American English spelling: “color” rather than
“colour”. If I take the cow, the pig, the horse, etc.,
George will be upset.
• Oxford comma: “a, b, and c” rather than “a, b and c”.
This is arbitrary. Cases where the Oxford comma • I don’t like ampersand (“&”) in headings, but will
results in ambiguity should be reworded, for example, sometimes use it if doing so prevents a line break in
by introducing numbering: “a, b, and c and d” should that heading.
be “(1) a, (2) b, and (3) c and d”.
• When mentioning words, I use quotations. When
• Italic for emphasis. Use sparingly. introducing a new word, I use \emph{}.

445

v2022.09.25a
446 APPENDIX D. STYLE GUIDE

Following is a convention regarding punctuation in “A 240 GB hard drive”, rather than “a 240-GB
LATEX sources. hard drive” nor “a 240GB hard drive”.

• Place a newline after a colon (:) and the end of a Strictly speaking, NIST guide requires us to use the
sentence. This avoids the whole one-space/two-space binary prefixes “Ki”, “Mi”, or “Gi” to represent powers
food fight and also has the advantage of more clearly of 210 . However, we accept the JEDEC conventions to
showing changes to single sentences in the middle use “K”, “M”, and “G” as binary prefixes in describing
of long paragraphs. memory capacity [JED].
An acceptable example:
“8 GB of main memory”, meaning “8 GiB of
D.2 NIST Style Guide main memory”.

D.2.1 Unit Symbol Also, it is acceptable to use just “K”, “M”, or “G”
as abbreviations appended to a numerical value, e.g.,
D.2.1.1 SI Unit Symbol “4K entries”. In such cases, no space before an abbreviation
is required. For example,
NIST style guide [Nat19, Chapter 5] states the following
rules (rephrased for perfbook). “8K entries”, rather than “8 K entries”.

• When SI unit symbols such as “ns”, “MHz”, and “K” If you put a space in between, the symbol looks like
(kelvin) are used behind numerical values, narrow a unit symbol and is confusing. Note that “K” and “k”
spaces should be placed between the values and the represent 210 and 103 , respectively. “M” can represent
symbols. either 220 or 106 , and “G” can represent either 230 or 109 .
These ambiguities should not be confusing in discussing
A narrow space can be coded in LATEX by the sequence
approximate order.
of “\,”. For example,
“2.4 GHz”, rather then “2.4GHz”. D.2.1.3 Degree Symbol
• Even when the value is used in adjectival sense, a The angular-degree symbol (°) does not require any space
narrow space should be placed. For example, in front of it. NIST style guide clearly states so.
The symbol of degree can also be typeset easily by the
“a 10 ms interval”, rather than “a 10-ms help of gensymb package. A macro “\degree” can be
interval” nor “a 10ms interval”. used in both text and math modes.
Example:
The symbol of micro (µ:10−6 ) can be typeset easily by
the help of “gensymb” LATEX package. A macro “\micro” 45°, rather than 45 °.
can be used in both text and math modes. To typeset the
symbol of “microsecond”, you can do so by “\micro s”. D.2.1.4 Percent Symbol
For example,
NIST style guide treats the percent symbol (%) as the
10 µs same as SI unit symbols.
Note that math mode “\mu” is italic by default and 50 % possibility, rather than 50% possibility.
should not be used as a prefix. An improper example:
10 𝜇s (math mode “\mu”) D.2.1.5 Font Style
Quote from NIST check list [Nata, #6]:
D.2.1.2 Non-SI Unit Symbol
Variables and quantity symbols are in italic
Although NIST style guide does not cover non-SI unit type. Unit symbols are in roman type. Num-
symbols such as “KB”, “MB”, and “GB”, the same rule bers should generally be written in roman type.
should be followed. These rules apply irrespective of the typeface
Example: used in the surrounding text.

v2022.09.25a
D.3. LATEX CONVENTIONS 447

Table D.1: Digit-Grouping Style By marking up constant decimal values by \num{}


commands, the LATEX source would be exempted from any
Style Outputs of \num{}
particular conventions.
NIST/SI (English) 12 345 12.345 1 234 567.89 Because of its open-source policy, this approach should
SI (French) 12 345 12,345 1 234 567,89 give more “portability” to perfbook.
English 12,345 12.345 1,234,567.89
French 12 345 12,345 1 234 567,89
Other Europe 12.345 12,345 1.234.567,89
D.3 LATEX Conventions
Good looking LATEX documents require further considera-
For example,
tions on proper use of font styles, line break exceptions,
e (elementary charge) etc. This section summarizes guidelines specific to LATEX.

On the other hand, mathematical constants such as the


base of natural logarithms should be roman [Natb]. For D.3.1 Monospace Font
example, Monospace font (or typewriter font) is heavily used in
this textbook. First policy regarding monospace font in
e𝑥
perfbook is to avoid directly using “\texttt” or “\tt”
macro. It is highly recommended to use a macro or an
D.2.2 NIST Guide Yet To Be Followed environment indicating the reason why you want the font.
This section explains the use cases of such macros and
There are a few cases where NIST style guide is not environments.
followed. Other English conventions are followed in such
cases.
D.3.1.1 Code Snippet
D.2.2.1 Digit Grouping Because the “verbatim” environment is a primitive way
to include listings, we have transitioned to a scheme which
Quote from NIST checklist [Nata, #16]:
uses the “fancyvrb” package for code snippets.
The digits of numerical values having more than The goal of the scheme is to extract LATEX sources
four digits on either side of the decimal marker of code snippets directly from code samples under
are separated into groups of three using a thin, CodeSamples directory. It also makes it possible to
fixed space counting from both the left and right embed line labels in the code samples, which can be
of the decimal marker. Commas are not used to referenced within the LATEX sources. This reduces the
separate digits into groups of three. burden of keeping line numbers in the text consistent with
those in code snippets.
NIST Example: 15 739.012 53 ms Code-snippet extraction is handled by a couple of perl
Our convention: 15,739.01253 ms scripts and recipes in Makefile. We use the escaping
feature of the fancyvrb package to embed line labels as
In LATEX coding, it is cumbersome to place thin spaces as comments.
are recommended in NIST guide. The \num{} command We used to use the “verbbox” environment provided
provided by the “siunitx” package would be of help for us by the “verbatimbox” package. Appendix D.3.1.2 de-
to follow this rule. It would also help us overcome different scribes how verbbox can automatically generate line
conventions. We can select a specific digit-grouping style numbers, but those line numbers cannot be referenced
as a default in preamble, or specify an option to each within the LATEX sources.
\num{} command as is shown in Table D.1. Let’s start by looking at how code snippets are coded in
As are evident in Table D.1, periods and commas used the current scheme. There are three customized environ-
as other than decimal markers are confusing and should ments of “Verbatim”. “VerbatimL” is for floating snip-
be avoided, especially in documents expecting global pets within the “listing” environment. “VerbatimN” is
audiences. for inline snippets with line count enabled. “VerbatimU”

v2022.09.25a
448 APPENDIX D. STYLE GUIDE

Listing D.1: LATEX Source of Sample Code Snippet (Current) Above code results in the paragraph below:
1 \begin{listing}
2 \begin{fcvlabel}[ln:base1]
3 \begin{VerbatimL}[commandchars=\$\[\]] Lines 7 and 8 can be referred to from text.
4 /*
5 * Sample Code Snippet
6 */
Macros “\lnlbl{}” and “\lnref{}” are defined in
7 #include <stdio.h> the preamble as follows:
8 int main(void)
9 {
10 printf("Hello world!\n"); $lnlbl[printf] \newcommand{\lnlblbase}{}
11 return 0; $lnlbl[return] \newcommand{\lnlbl}[1]{%
12 } \phantomsection\label{\lnlblbase:#1}}
13 \end{VerbatimL} \newcommand{\lnrefbase}{}
14 \end{fcvlabel} \newcommand{\lnref}[1]{\ref{\lnrefbase:#1}}
15 \caption{Sample Code Snippet}
16 \label{lst:app:styleguide:Sample Code Snippet}
17 \end{listing} Environments “fcvlabel” and “fcvref” are defined
as shown below:
Listing D.2: Sample Code Snippet
1 /* \newenvironment{fcvlabel}[1][]{%
2 * Sample Code Snippet \renewcommand{\lnlblbase}{#1}%
3 */ \ignorespaces}{\ignorespacesafterend}
4 #include <stdio.h> \newenvironment{fcvref}[1][]{%
5 int main(void) \renewcommand{\lnrefbase}{#1}%
6 { \ignorespaces}{\ignorespacesafterend}
7 printf("Hello world!\n");
8 return 0;
9 } The main part of LATEX source shown on lines 2–14
in Listing D.1 can be extracted from a code sample of
Listing D.3 by a perl script utilities/fcvextract.
is for inline snippets without line count. They are defined pl. All the relevant rules of extraction are described as
in the preamble as shown below: recipes in the top level Makefile and a script to generate
dependencies (utilities/gen_snippet_d.pl).
\DefineVerbatimEnvironment{VerbatimL}{Verbatim}%
{fontsize=\scriptsize,numbers=left,numbersep=5pt,%
As you can see, Listing D.3 has meta commands in
xleftmargin=9pt,obeytabs=true,tabsize=2} comments of C (C++ style). Those meta commands
\AfterEndEnvironment{VerbatimL}{\vspace*{-9pt}}
\DefineVerbatimEnvironment{VerbatimN}{Verbatim}%
are interpreted by utilities/fcvextract.pl, which
{fontsize=\scriptsize,numbers=left,numbersep=3pt,% distinguishes the type of comment style by the suffix of
xleftmargin=5pt,xrightmargin=5pt,obeytabs=true,%
tabsize=2,frame=single}
code sample’s file name.
\DefineVerbatimEnvironment{VerbatimU}{Verbatim}% Meta commands which can be used in code samples
{fontsize=\scriptsize,numbers=none,xleftmargin=5pt,%
xrightmargin=5pt,obeytabs=true,tabsize=2,% are listed below:
samepage=true,frame=single}
• \begin{snippet}[<options>]
• \end{snippet}
The LATEX source of a sample code snippet is shown in
• \lnlbl{<label string>}
Listing D.1 and is typeset as shown in Listing D.2.
• \fcvexclude
Labels to lines are specified in “$lnlbl[]” command.
• \fcvblank
The characters specified by “commandchars” option to
VarbatimL environment are used by the fancyvrb pack- “<options>” to the \begin{snippet} meta com-
age to substitute “\lnlbl{}” for “$lnlbl[]”. Those mand is a comma-spareted list of options shown below:
characters should be selected so that they don’t appear
elsewhere in the code snippet. • labelbase=<label base string>
Labels “printf” and “return” in Listing D.2 can be • keepcomment=yes
referred to as shown below: • gobbleblank=yes
• commandchars=\X\Y\Z
\begin{fcvref}[ln:base1]
\Clnref{printf, return} can be referred
to from text. The “labelbase” option is mandatory and
\end{fcvref}
the string given to it will be passed to the

v2022.09.25a
D.3. LATEX CONVENTIONS 449

Listing D.3: Source of Code Sample with “snippet” Meta Command


1 //\begin{snippet}[labelbase=ln:base1,keepcomment=yes,commandchars=\$\[\]]
2 /*
3 * Sample Code Snippet
4 */
5 #include <stdio.h>
6 int main(void)
7 {
8 printf("Hello world!\n"); //\lnlbl{printf}
9 return 0; //\lnlbl{return}
10 }
11 //\end{snippet}

“\begin{fcvlabel}[<label base string>]” com- Once one of them appears in a litmus test, comments
mand as shown on line 2 of Listing D.1. The should be of OCaml style (“(* ... *)”). Those to-
“keepcomment=yes” option tells fcvextract.pl to kens keep the same meaning even when they appear in
keep comment blocks. Otherwise, comment blocks in C comments!
source code will be omitted. The “gobbleblank=yes” The pair of characters “{” and “}” also have special
option will remove empty or blank lines in the resulting meaning in the C flavour tests. They are used to separate
snippet. The “commandchars” option is given to the portions in a litmus test.
VerbatimL environment as is. At the moment, it is also First pair of “{” and “}” encloses initialization part.
mandatory and must come at the end of options listed Comments in this part should also be in the ocaml form.
above. Other types of options, if any, are also passed to You can’t use “{” and “}” in comments in litmus tests,
the VerbatimL environment. either.
The “\lnlbl” commands are converted along the way Examples of disallowed comments in a litmus test are
to reflect the escape-character choice.1 Source lines with shown below:
“\fcvexclude” are removed. “\fcvblank” can be used
1 // Comment at first
to keep blank lines when the “gobbleblank=yes” option 2 C C-sample
is specified. 3 // Comment with { and } characters
4 {
There can be multiple pairs of \begin{snippet} 5 x=2; // C style comment in initialization
and \end{snippet} as long as they have unique 6 }
7
“labelbase” strings. 8 P0(int *x}
Our naming scheme of “labelbase” for unique name 9 {
10 int r1;
space is as follows: 11
12 r1 = READ_ONCE(*x); // Comment with "exists"
ln:<Chapter/Subdirectory>:<File Name>:<Function Name> 13 }
14
15 [...]
16
Litmus tests, which are handled by “herdtools7” com- 17 exists (0:r1=0) // C++ style comment after test body
mands such as “litmus7” and “herd7”, were problematic
in this scheme. Those commands have particular rules To avoid parse errors, meta commands in litmus tests
of where comments can be placed and restriction on per- (C flavor) are embedded in the following way.
mitted characters in comments. They also forbid a couple
of tokens to appear in comments. (Tokens in comments 1 C C-SB+o-o+o-o
2 //\begin[snippet][labelbase=ln:base,commandchars=\%\@\$]
might sound strange, but they do have such restriction.) 3
For example, the first token in a litmus test must be one 4 {
5 1:r2=0 (*\lnlbl[initr2]*)
of “C”, “PPC”, “X86”, “LISA”, etc., which indicates the 6 }
flavor of the test. This means no comment is allowed at 7
8 P0(int *x0, int *x1) //\lnlbl[P0:b]
the beginning of a litmus test. 9 {
Similarly, several tokens such as “exists”, “filter”, 10 int r2;
11
and “locations” indicate the end of litmus test’s body. 12 WRITE_ONCE(*x0, 2);
13 r2 = READ_ONCE(*x1);
1 Characters
forming comments around the “\lnlbl” commands 14 } //\lnlbl[P0:e]
are also gobbled up regardless of the “keepcomment” setting. 15

v2022.09.25a
450 APPENDIX D. STYLE GUIDE

16 P1(int *x0, int *x1) Listing D.4: LATEX Source of Sample Code Snippet (Obsolete)
17 { 1 \begin{listing}
18 int r2; 2 { \scriptsize
19 3 \begin{verbbox}[\LstLineNo]
20 WRITE_ONCE(*x1, 2); 4 /*
21 r2 = READ_ONCE(*x0); 5 * Sample Code Snippet
22 } 6 */
23 7 #include <stdio.h>
24 //\end[snippet] 8 int main(void)
25 exists (1:r2=0 /\ 0:r2=0) (* \lnlbl[exists_] *) 9 {
10 printf("Hello world!\n");
11 return 0;
Example above is converted to the following interme- 12 }
13 \end{verbbox}
diate code by a script utilities/reorder_ltms.pl.2 14 }
The intermediate code can be handled by the common 15 \centering
16 \theverbbox
script utilities/fcvextract.pl. 17 \caption{Sample Code Snippet (Obsolete)}
18 \label{lst:app:styleguide:Sample Code Snippet (Obsolete)}
1 // Do not edit! 19 \end{listing}
2 // Generated by utillities/reorder_ltms.pl
3 //\begin{snippet}[labelbase=ln:base,commandchars=\%\@\$]
4 C C-SB+o-o+o-o Listing D.5: Sample Code Snippet (Obsolete)
5
1 /*
6 {
2 * Sample Code Snippet
7 1:r2=0 //\lnlbl{initr2}
3 */
8 }
4 #include <stdio.h>
9
5 int main(void)
10 P0(int *x0, int *x1) //\lnlbl{P0:b}
6 {
11 {
7 printf("Hello world!\n");
12 int r2;
8 return 0;
13
9 }
14 WRITE_ONCE(*x0, 2);
15 r2 = READ_ONCE(*x1);
16 } //\lnlbl{P0:e}
17
18 P1(int *x0, int *x1) The “verbatim” environment is used for listings with
19 { too many lines to fit in a column. It is also used to avoid
20 int r2;
21 overwhelming LATEX with a lot of floating objects. They
22 WRITE_ONCE(*x1, 2); are being converted to the scheme using the VerbatimN
23 r2 = READ_ONCE(*x0);
24 } environment.
25
26 exists (1:r2=0 /\ 0:r2=0) \lnlbl{exists_}
27 //\end{snippet}
D.3.1.3 Identifier

Note that each litmus test’s source file can con- We use “\co{}” macro for inline identifiers. (“co” stands
tain at most one pair of \begin[snippet] and for “code”.)
\end[snippet] because of the restriction of comments. By putting them into \co{}, underscore characters in
their names are free of escaping in LATEX source. It is
D.3.1.2 Code Snippet (Obsolete) convenient to search them in source files. Also, \co{}
macro has a capability to permit line breaks at particular
Sample LATEX source of a code snippet coded using the sequences of letters. Current definition permits a line
“verbatimbox” package is shown in Listing D.4 and is break at an underscore (_), two consecutive underscores
typeset as shown in Listing D.5. (__), a white space, or an operator ->.
The auto-numbering feature of verbbox is enabled
by the “\LstLineNo” macro specified in the option to
verbbox (line 3 in Listing D.4). The macro is defined in D.3.1.4 Identifier inside Table and Heading
the preamble of perfbook.tex as follows: Although \co{} command is convenient for inlining
within text, it is fragile because of its capability of line
\newcommand{\LstLineNo}
{\makebox[5ex][r]{\arabic{VerbboxLineNo}\hspace{2ex}}} break. When it is used inside a “tabular” environment
or its derivative such as “tabularx”, it confuses column
2 Currently, only C flavor litmus tests are supported. width estimation of those environments. Furthermore,

v2022.09.25a
D.3. LATEX CONVENTIONS 451

Table D.2: Limitation of Monospace Macro D.3.2 Cross-reference


Macro Need Escape Should Avoid Cross-references to Chapters, Sections, Listings, etc.
have been expressed by combinations of names and bare
\co, \nbco \, %, {, }
\tco # %, {, }, \
\ref{} commands in the following way:
1 Chapter~\ref{chp:Introduction},
2 Table~\ref{tab:app:styleguide:Digit-Grouping Style}
\co{} can not be safely used in section headings nor
description headings. This is a traditional way of cross-referencing. However,
As a workaround, we use “\tco{}” command inside it is tedious and sometimes error-prone to put a name man-
tables and headings. It has no capability of line break ually on every cross-reference. The cleveref package
at particular sequences, but still frees us from escaping provides a nicer way of cross-referencing. A few examples
underscores. follow:
When used in text, \tco{} permits line breaks at white 1 \Cref{chp:Introduction},
spaces. 2 \cref{sec:intro:Parallel Programming Goals},
3 \cref{chp:app:styleguide:Style Guide},
4 \cref{tab:app:styleguide:Digit-Grouping Style}, and
D.3.1.5 Other Use Cases of Monospace Font 5 \cref{lst:app:styleguide:Source of Code Sample} are
6 examples of cross\-/references.
For URLs, we use “\url{}” command provided by the
“hyperref” package. It will generate hyper references to Above code is typeset as follows:
the URLs.
For path names, we use “\path{}” command. It won’t Chapter 2, Section 2.2, Appendix D, Table D.1,
generate hyper references. and Listing D.3 are examples of cross-refer-
Both \url{} and \path{} permit line breaks at “/”, ences.
“-”, and “.”.3 As you can see, naming of cross-references is automated.
For short monospace statements not to be line broken, Current setting generates capitalized names for both of
we use the “\nbco{}” (non-breakable co) macro. \Cref{} and \cref{}, but the former should be used at
the beginning of a sentence.
D.3.1.6 Limitations We are in the middle of conversion to cleveref-style
cross-referencing.
There are a few cases where macros introduced in this
Cross-references to line numbers of code snippets
section do not work as expected. Table D.2 lists such
can be done in a similar way by using \Clnref{} and
limitations.
\clnref{} macros, which mimic cleveref. The former
While \co{} requires some characters to be escaped,
puts “Line” as the name of the reference and the latter
it can contain any character.
“line”.
On the other hand, \tco{} can not handle “%”, “{”,
Please refer to cleveref’s documentation for further
“}”, nor “\” properly. If they are escaped by a “\”, they
info on its cleverness.
appear in the end result with the escape character. The
“\verb” command can be used in running text if you need
to use monospace font for a string which contains many D.3.3 Non Breakable Spaces
characters to escape.4 In LATEX conventions, proper use of non-breakable white
spaces is highly recommended. They can prevent widow-
ing and orphaning of single digit numbers or short variable
names, which would cause the text to be confusing at first
3 Overfill can be a problem if the URL or the path name contains glance.
long runs of unbreakable characters. The thin space mentioned earlier to be placed in front
4 The \verb command is not almighty though. For example, you
of a unit symbol is non breakable.
can’t use it within a footnote. If you do so, you will see a fatal LATEX
error. A workaround would be a macro named \VerbatimFootnotes
Other cases to use a non-breakable space (“~” in LATEX
provided by the fancyvrb package. Unfortunately, perfbook can’t source, often referred to as “nbsp”) are the following
employ it due to the interference with the footnotebackref package. (inexhaustive).

v2022.09.25a
452 APPENDIX D. STYLE GUIDE

• Reference to a Chapter or a Section: x-, y-, and z-coordinates; x-, y-, and z-
Please refer to Appendix D.2. coordinates; x-, y-, and z-coordinates; x-, y-,
and z-coordinates; x-, y-, and z-coordinates; x-,
• Calling out CPU number or Thread name: y-, and z-coordinates;

After they load the pointer, CPUs 1 and 2


Example with “\-/”:
will see the stored value.
• Short variable name: x-, y-, and z-coordinates; x-, y-, and z-coordi-
nates; x-, y-, and z-coordinates; x-, y-, and z-
The results will be stored in variables a coordinates; x-, y-, and z-coordinates; x-, y-,
and b. and z-coordinates;

D.3.4 Hyphenation and Dashes Example with “\=/”:


D.3.4.1 Hyphenation in Compound Word
x-, y-, and z-coordinates; x-, y-, and z-coor-
In plain LATEX, compound words such as “high-frequency” dinates; x-, y-, and z-coordinates; x-, y-, and
can be hyphenated only at the hyphen. This sometimes z-coordinates; x-, y-, and z-coordinates; x-, y-,
results in poor typesetting. For example: and z-coordinates;

High-frequency radio wave, high-frequency ra- Note that “\=/” enables hyphenation in elements of
dio wave, high-frequency radio wave, high- compound words as the same as “\-/” does.
frequency radio wave, high-frequency radio
wave, high-frequency radio wave. D.3.4.3 Em Dash
Em dashes are used to indicate parenthetic expression. In
By using a shortcut “\-/” provided by the “extdash”
perfbook, em dashes are placed without spaces around it.
package, hyphenation in elements of compound words is
In LATEX source, an em dash is represented by “---”.
enabled in perfbook.5
Example (quote from Appendix C.1):
Example with “\-/”:
This disparity in speed—more than two or-
High-frequency radio wave, high-frequency ra- ders of magnitude—has resulted in the multi-
dio wave, high-frequency radio wave, high-fre- megabyte caches found on modern CPUs.
quency radio wave, high-frequency radio wave,
high-frequency radio wave. D.3.4.4 En Dash
In LATEX convention, en dashes (–) are used for ranges
D.3.4.2 Non Breakable Hyphen of (mostly) numbers. Past revisions of perfbook didn’t
follow this rule and used plain dashes (-) for such cases.
We want hyphenated compound terms such as “x-coordi- Now that \clnrefrange, \crefrange, and their vari-
nate”, “y-coordinate”, etc. not to be broken at the hyphen ants, which generate en dashes, are used for ranges of
following a single letter. cross-references, the remaining couple of tens of simple
To make a hyphen unbreakable, we can use a short cut dashes of other types of ranges have been converted to
“\=/” also provided by the “extdash” package. en dashes for consistency.
Example without a shortcut: Example with a simple dash:

Lines 4-12 in Listing D.4 are the contents of the


verbbox environment. The box is output by the
5 In exchange for enabling the shortcut, we can’t use plain LAT X’s
E \theverbbox macro on line 16.
shortcut “\-” to specify hyphenation points. Use pfhyphex.tex to
add such exceptions. Example with an en dash:

v2022.09.25a
D.3. LATEX CONVENTIONS 453

Lines 4–12 in Listing D.4 are the contents of D.3.5.2 Full Stop
the verbbox environment. The box is output by
LATEX treats a full stop in front of a white space as an end
the \theverbbox macro on line 16.
of a sentence and puts a slightly wider skip by default
(double spacing). There is an exception to this rule, i.e.
D.3.4.5 Numerical Minus Sign where the full stop is next to a capital letter, LATEX assumes
Numerical minus signs should be coded as math mode it represents an abbreviation and puts a normal skip.
minus signs, namely “$-$”.6 For example, To make LATEX use proper skips, one need to annotate
such exceptions. For example, given the following LATEX
−30, rather than -30. source:
\begin{quote}
D.3.5 Punctuation Lock~1 is owned by CPU~A.
Lock~2 is owned by CPU~B. (Bad.)
D.3.5.1 Ellipsis
Lock~1 is owned by CPU~A\@.
Lock~2 is owned by CPU~B\@. (Good.)
In monospace fonts, ellipses can be expressed by series of \end{quote}
periods. For example:
Great ... So how do I fix it? the output will be as the following:
However, in proportional fonts, the series of periods is
Lock 1 is owned by CPU A. Lock 2 is owned
printed with tight spaces as follows:
by CPU B. (Bad.)
Great ... So how do I fix it? Lock 1 is owned by CPU A. Lock 2 is owned
Standard LATEX defines the \dots macro for this pur- by CPU B. (Good.)
pose. However, it has a kludge in the evenness of spaces.
The “ellipsis” package redefines the \dots macro to fix On the other hand, where a full stop is following a lower
the issue.7 By using \dots, the above example is typeset case letter, e.g. as in “Mr. Smith”, a wider skip will follow
as the following: in the output unless it is properly hinted. Such hintings
can be done in one of several ways.
Great . . . So how do I fix it? Given the following source,
Note that the “xspace” option specified to the “ellipsis”
\begin{itemize}[nosep]
package adjusts the spaces after ellipses depending on \item Mr. Smith (bad)
what follows them. \item Mr.~Smith (good)
\item Mr.\ Smith (good)
For example: \item Mr.\@ Smith (good)
\end{itemize}
• He said, “I . . . really don’t remember . . .”
• Sequence A: (one, two, three, . . .) the result will look as follows:
• Sequence B: (4, 5, . . . , 𝑛) • Mr. Smith (bad)
• Mr. Smith (good)
As you can see, extra space is placed before the comma.
• Mr. Smith (good)
\dots macro can also be used in math mode:
• Mr. Smith (good)
• Sequence C: (1, 2, 3, 5, 8, . . . )
• Sequence D: (10, 12, . . . , 20) D.3.6 Floating Object Format
The \ldots macro behaves the same as the \dots D.3.6.1 Ruled Line in Table
macro.
6 This rule assumes that math mode uses the same upright glyph as They say that tables drawn by using ruled lines of plain
text mode. Our default font choice meets the assumption. LATEX look ugly.8 Vertical lines should be avoided and
7 To be exact, it is the \textellipsis macro that is redefined. The

behavior of \dots macro in math mode is not affected. The “amsmath”


package has another definition of \dots. It is not used in perfbook at 8 https://www.inf.ethz.ch/personal/markusp/

the moment. teaching/guides/guide-tables.pdf

v2022.09.25a
454 APPENDIX D. STYLE GUIDE

Table D.3: Refrigeration Power Consumption D.3.7 Improvement Candidates


Power per watt There are a few areas yet to be attempted in perfbook
Situation 𝑇 (K) 𝐶P waste heat (W) which would further improve its appearance. This section
Dry Ice 195 1.990 0.5 lists such candidates.
Liquid N2 77 0.356 2.8
Liquid H2 20 0.073 13.7 D.3.7.1 Grouping Related Figures/Listings
Liquid He 4 0.0138 72.3 To prevent a pair of closely related figures or listings from
IBM Q 0.015 0.000051 19,500.0 being placed in different pages, it is desirable to group
them into a single floating object. The “subfig” package
provides the features to do so.10
Two floating objects can be placed side by side by using
horizontal lines should be used sparingly, especially in \parbox or minipage. For example, Figures 14.10
tables of simple structure. and 14.11 can be grouped together by using a pair of
minipages as shown in Figures D.1 and D.2.
Table D.3 (corresponding to a table from a now-deleted By using subfig package, Listings 15.4 and 15.5 can
section) is drawn by using the features of “booktabs” and be grouped together as shown in Listing D.6 with sub-
“xcolor” packages. Note that ruled lines of booktabs can captions (with a minor change of blank line).
not be mixed with vertical lines in a table.9 Note that they can not be grouped in the same way as
Figures D.1 and D.2 because the “ruled” style prevents
their captions from being properly typeset.
The sub-caption can be cited by combining a \cref{}
macro and a \subref{} macro, for example, “List-
D.3.6.2 Position of Caption ing D.6 (a)”.
It can also be cited by a \cref{} macro, for example,
“Listing D.6b”. Note the difference in the resulting format.
In LATEX conventions, captions of tables are usually placed For the citing by a \cref{} to work, you need to place the
above them. The reason is the flow of your eye movement \label{} macro of the combined floating object ahead
when you look at them. Most tables have a row of heading of the definition of subfloats. Otherwise, the resulting
at the top. You naturally look at the top of a table at first. caption number would be off by one from the actual
Captions at the bottom of tables disturb this flow. The number.
same can be said of code snippets, which are read from
top to bottom. D.3.7.2 Table Layout Experiment
For code snippets, the “ruled” style chosen for listing This section presents some experimental tables using book-
environment places the caption at the top. See Listing D.2 tabs, xcolors, and arydshln packages. The corresponding
for an example. tables in the text have been converted using one of the
format shown here. The source of this section can be
As for tables, the position of caption is tweaked by
regarded as a reference to be consulted when new tables
\floatstyle{} and \restylefloat{} macros in pre-
are added in the text.
amble.
In Table D.4 (corresponding to Table 3.1), the “S”
Vertical skips below captions are reduced by setting column specifiers provided by the “siunitx” package are
a smaller value to the \abovecaptionskip variable, used to align numbers.
which would also affect captions to figures. Table D.5 (corresponding to Table 13.1) is an example
of table with a complex header. In Table D.5, the gap in
In the tables which use horizontal rules of “booktabs”
9 There is another package named “arydshln” which provides dashed
package, the vertical skips between captions and tables
lines to be used in tables. A couple of experimental examples are
are further reduced by setting a negative value to the presented in Appendix D.3.7.2.
\abovetopsep variable, which controls the behavior of 10 One problem of grouping figures might be the complexity in LAT X
E
\toprule. source.

v2022.09.25a
D.3. LATEX CONVENTIONS 455

Figure D.1: Timer Wheel at 1 kHz Figure D.2: Timer Wheel at 100 kHz

Listing D.6: Message-Passing Litmus Test (by subfig)


(a) Not Enforcing Order (b) Enforcing Order
1 C C-MP+o-wmb-o+o-o.litmus 1 C C-MP+o-wmb-o+o-rmb-o.litmus
2 2
3 { 3 {
4 } 4 }
5 5
6 P0(int* x0, int* x1) { 6 P0(int* x0, int* x1) {
7 7
8 WRITE_ONCE(*x0, 2); 8 WRITE_ONCE(*x0, 2);
9 smp_wmb(); 9 smp_wmb();
10 WRITE_ONCE(*x1, 2); 10 WRITE_ONCE(*x1, 2);
11 11
12 } 12 }
13 13
14 P1(int* x0, int* x1) { 14 P1(int* x0, int* x1) {
15 15
16 int r2; 16 int r2;
17 int r3; 17 int r3;
18 18
19 r2 = READ_ONCE(*x1); 19 r2 = READ_ONCE(*x1);
20 r3 = READ_ONCE(*x0); 20 smp_rmb();
21 21 r3 = READ_ONCE(*x0);
22 } 22
23 23 }
24 24
25 exists (1:r2=2 /\ 1:r3=0) 25 exists (1:r2=2 /\ 1:r3=0)

v2022.09.25a
456 APPENDIX D. STYLE GUIDE

Table D.4: CPU 0 View of Synchronization Mechanisms Table D.5: Synchronization and Reference Counting
on 8-Socket System With Intel Xeon Platinum 8176
CPUs @ 2.10GHz Release
Reference Hazard
Ratio Acquisition Locks RCU
Counts Pointers
Operation Cost (ns) (cost/clock)
Locks − CAMR M CA
Clock period 0.5 1.0
Reference
Best-case CAS 7.0 14.6 A AMR M A
Counts
Best-case lock 15.4 32.3
Hazard
Blind CAS 7.2 15.2 M M M M
Pointers
CAS 18.0 37.7
RCU CA MA CA M CA
Blind CAS (off-core) 47.5 99.8
CAS (off-core) 101.9 214.0
Key: A: Atomic counting
Blind CAS (off-socket) 148.8 312.5
C: Check combined with the atomic acquisition
CAS (off-socket) 442.9 930.1 operation
Comms Fabric 5,000 10,500 M: Full memory barriers required
Global Comms 195,000,000 409,500,000 MR : Memory barriers required only on release
MA : Memory barriers required on acquire

the mid-rule corresponds to the distinction which had been


represented by double vertical rules before the conversion.
The legends in the frame box appended here explain the
abbreviations used in the matrix. Two types of memory
barrier are denoted by subscripts here. The legends and
subscripts are not present in Table 13.1 since they are
redundant there.
Table D.6 (corresponding to Table C.1) is a sequence
diagram drawn as a table.
Table D.7 is a tweaked version of Table 9.3. Here,
the “Category” column in the original is removed and
the categories are indicated in rows of bold-face font just
below the mid-rules. This change makes it easier for
\rowcolors{} command of “xcolor” package to work
properly.
Table D.8 is another version which keeps original col-
umns and colors rows only where a category has multiple
rows. This is done by combining \rowcolors{} of
“xcolor” and \cellcolor{} commands of the “colortbl”
package (\cellcolor{} overrides \rowcolors{}).
In Table 9.3, the latter layout without partial row color-
ing has been chosen for simplicity.
Table D.9 (corresponding to Table 15.1) is also a se-
quence diagram drawn as a tabular object.
Table D.10 shows another version of Table D.3 with
dashed horizontal and vertical rules of the arydshln pack-
age.
In this case, the vertical dashed rules seems unnecessary.
The one without the vertical rules is shown in Table D.11.

v2022.09.25a
D.3. LATEX CONVENTIONS 457

Table D.6: Cache Coherence Example

CPU Cache Memory

Sequence # CPU # Operation 0 1 2 3 0 8

0 Initial State −/I −/I −/I −/I V V


1 0 Load 0/S −/I −/I −/I V V
2 3 Load 0/S −/I −/I 0/S V V
3 0 Invalidation 8/S −/I −/I 0/S V V
4 2 RMW 8/S −/I 0/E −/I V V
5 2 Store 8/S −/I 0/M −/I I V
6 1 Atomic Inc 8/S 0/M −/I −/I I V
7 1 Writeback 8/S 8/S −/I −/I V V

Table D.7: RCU Publish-Subscribe and Version Maintenance APIs

Primitives Availability Overhead


List traversal
list_for_each_entry_rcu() 2.5.59 Simple instructions (memory barrier on Alpha)
List update
list_add_rcu() 2.5.44 Memory barrier
list_add_tail_rcu() 2.5.44 Memory barrier
list_del_rcu() 2.5.44 Simple instructions
list_replace_rcu() 2.6.9 Memory barrier
list_splice_init_rcu() 2.6.21 Grace-period latency
Hlist traversal
hlist_for_each_entry_rcu() 2.6.8 Simple instructions (memory barrier on Alpha)
Hlist update
hlist_add_after_rcu() 2.6.14 Memory barrier
hlist_add_before_rcu() 2.6.14 Memory barrier
hlist_add_head_rcu() 2.5.64 Memory barrier
hlist_del_rcu() 2.5.64 Simple instructions
hlist_replace_rcu() 2.6.15 Memory barrier
Pointer traversal
rcu_dereference() 2.6.9 Simple instructions (memory barrier on Alpha)
Pointer update
rcu_assign_pointer() 2.6.10 Memory barrier

v2022.09.25a
458 APPENDIX D. STYLE GUIDE

Table D.8: RCU Publish-Subscribe and Version Maintenance APIs

Category Primitives Availability Overhead

List traversal list_for_each_entry_rcu() 2.5.59 Simple instructions (mem-


ory barrier on Alpha)

List update list_add_rcu() 2.5.44 Memory barrier


list_add_tail_rcu() 2.5.44 Memory barrier
list_del_rcu() 2.5.44 Simple instructions
list_replace_rcu() 2.6.9 Memory barrier
list_splice_init_rcu() 2.6.21 Grace-period latency

Hlist traversal hlist_for_each_entry_rcu() 2.6.8 Simple instructions (mem-


ory barrier on Alpha)

Hlist update hlist_add_after_rcu() 2.6.14 Memory barrier


hlist_add_before_rcu() 2.6.14 Memory barrier
hlist_add_head_rcu() 2.5.64 Memory barrier
hlist_del_rcu() 2.5.64 Simple instructions
hlist_replace_rcu() 2.6.15 Memory barrier

Pointer traversal rcu_dereference() 2.6.9 Simple instructions (mem-


ory barrier on Alpha)

Pointer update rcu_assign_pointer() 2.6.10 Memory barrier

Table D.9: Memory Misordering: Store-Buffering Sequence of Events

CPU 0 CPU 1
Instruction Store Buffer Cache Instruction Store Buffer Cache
1 (Initial state) x1==0 (Initial state) x0==0
2 x0 = 2; x0==2 x1==0 x1 = 2; x1==2 x0==0
3 r2 = x1; (0) x0==2 x1==0 r2 = x0; (0) x1==2 x0==0
4 (Read-invalidate) x0==2 x0==0 (Read-invalidate) x1==2 x1==0
5 (Finish store) x0==2 (Finish store) x1==2

v2022.09.25a
D.3. LATEX CONVENTIONS 459

Table D.10: Refrigeration Power Consumption


Power per watt
Situation 𝑇 (K) 𝐶P waste heat (W)
Dry Ice 195 1.990 0.5
Liquid N2 77 0.356 2.8
Liquid H2 20 0.073 13.7
Liquid He 4 0.0138 72.3
IBM Q 0.015 0.000051 19,500.0

Table D.11: Refrigeration Power Consumption


Power per watt
Situation 𝑇 (K) 𝐶P waste heat (W)

Dry Ice 195 1.990 0.5


Liquid N2 77 0.356 2.8
Liquid H2 20 0.073 13.7
Liquid He 4 0.0138 72.3
IBM Q 0.015 0.000051 19,500.0

D.3.7.3 Miscellaneous Candidates


Other improvement candidates are listed in the source of
this section as comments.

v2022.09.25a
460 APPENDIX D. STYLE GUIDE

v2022.09.25a
The Answer to the Ultimate Question of Life, The
Universe, and Everything.

“The Hitchhikers Guide to the Galaxy”,


Appendix E Douglas Adams

Answers to Quick Quizzes

E.1 How To Use This Book

Quick Quiz 1.1: p.2
Where are the answers to the Quick Quizzes found?

Answer:
In Appendix E starting on page 461. Hey, I thought I owed you an easy one! q

Quick Quiz 1.2: p.2
Some of the Quick Quiz questions seem to be from the viewpoint of the reader rather than the author. Is that really the intent?

Answer:
Indeed it is! Many are questions that Paul E. McKenney would probably have asked if he was a novice student in a class covering this material. It is worth noting that Paul was taught most of this material by parallel hardware and software, not by professors. In Paul's experience, professors are much more likely to provide answers to verbal questions than are parallel systems, recent advances in voice-activated assistants notwithstanding. Of course, we could have a lengthy debate over which of professors or parallel systems provide the most useful answers to these sorts of questions, but for the time being let's just agree that usefulness of answers varies widely across the population both of professors and of parallel systems.
Other quizzes are quite similar to actual questions that have been asked during conference presentations and lectures covering the material in this book. A few others are from the viewpoint of the author. q

Quick Quiz 1.3: p.2
These Quick Quizzes are just not my cup of tea. What can I do about it?

Answer:
Here are a few possible strategies:

1. Just ignore the Quick Quizzes and read the rest of the book. You might miss out on the interesting material in some of the Quick Quizzes, but the rest of the book has lots of good material as well. This is an eminently reasonable approach if your main goal is to gain a general understanding of the material or if you are skimming through the book to find a solution to a specific problem.

2. Look at the answer immediately rather than investing a large amount of time in coming up with your own answer. This approach is reasonable when a given Quick Quiz's answer holds the key to a specific problem you are trying to solve. This approach is also reasonable if you want a somewhat deeper understanding of the material, but when you do not expect to be called upon to generate parallel solutions given only a blank sheet of paper.

3. If you find the Quick Quizzes distracting but impossible to ignore, you can always clone the LaTeX source for this book from the git archive. You can then run the command make nq, which will produce a perfbook-nq.pdf. This PDF contains unobtrusive boxed tags where the Quick Quizzes would otherwise be, and gathers each chapter's Quick Quizzes at the end of that chapter in the classic textbook style.

4. Learn to like (or at least tolerate) the Quick Quizzes. Experience indicates that quizzing yourself periodically while reading greatly increases comprehension and depth of understanding.


Note that the quick quizzes are hyperlinked to the answers and vice versa. Click either the “Quick Quiz” heading or the small black square to move to the beginning of the answer. From the answer, click on the heading or the small black square to move to the beginning of the quiz, or, alternatively, click on the small white square at the end of the answer to move to the end of the corresponding quiz. q

Quick Quiz 1.4: p.2
If passively reading this book doesn't get me full problem-solving and code-production capabilities, what on earth is the point???

Answer:
For those preferring analogies, coding concurrent software is similar to playing music in that there are good uses for many different levels of talent and skill. Not everyone needs to devote their entire life to becoming a concert pianist. In fact, for every such virtuoso, there are a great many lesser pianists whose music is welcomed by their friends and families. But these lesser pianists are probably doing something else to support themselves, and so it is with concurrent coding.
One potential benefit of passively reading this book is the ability to read and understand modern concurrent code. This ability might in turn permit you to:

1. See what the kernel does so that you can check to see if a proposed use case is valid.

2. Chase down a kernel bug.

3. Use information in the kernel to more easily chase down a userspace bug.

4. Produce a fix for a kernel bug.

5. Create a straightforward kernel feature, whether from scratch or using the modern copy-pasta development methodology.

If you are proficient with straightforward uses of locks and atomic operations, passively reading this book should enable you to successfully apply modern concurrency techniques.
And finally, if your job is to coordinate the activities of developers making use of modern concurrency techniques, passively reading this book might help you understand what on earth they are talking about. q

E.2 Introduction

Quick Quiz 2.1: p.8
Come on now!!! Parallel programming has been known to be exceedingly hard for many decades. You seem to be hinting that it is not so hard. What sort of game are you playing?

Answer:
If you really believe that parallel programming is exceedingly hard, then you should have a ready answer to the question “Why is parallel programming hard?” One could list any number of reasons, ranging from deadlocks to race conditions to testing coverage, but the real answer is that it is not really all that hard. After all, if parallel programming was really so horribly difficult, how could a large number of open-source projects, ranging from Apache to MySQL to the Linux kernel, have managed to master it?
A better question might be: “Why is parallel programming perceived to be so difficult?” To see the answer, let's go back to the year 1991. Paul McKenney was walking across the parking lot to Sequent's benchmarking center carrying six dual-80486 Sequent Symmetry CPU boards, when he suddenly realized that he was carrying several times the price of the house he had just purchased.1 This high cost of parallel systems meant that parallel programming was restricted to a privileged few who worked for an employer who either manufactured or could afford to purchase machines costing upwards of $100,000—in 1991 US dollars.
In contrast, in 2020, Paul finds himself typing these words on a six-core x86 laptop. Unlike the dual-80486 CPU boards, this laptop also contains 64 GB of main memory, a 1 TB solid-state disk, a display, Ethernet, USB ports, wireless, and Bluetooth. And the laptop is more than an order of magnitude cheaper than even one of those dual-80486 CPU boards, even before taking inflation into account.
Parallel systems have truly arrived. They are no longer the sole domain of a privileged few, but something available to almost everyone.
The earlier restricted availability of parallel hardware is the real reason that parallel programming is considered so difficult. After all, it is quite difficult to learn to program even the simplest machine if you have no access to it.

1 Yes, this sudden realization did cause him to walk quite a bit more carefully. Why do you ask?


Since the age of rare and expensive parallel machines is p.8


Quick Quiz 2.4:
for the most part behind us, the age during which parallel
And if correctness, maintainability, and robustness don’t
programming is perceived to be mind-crushingly difficult
make the list, why do productivity and generality?
is coming to a close.2 q
Answer:
Quick Quiz 2.2: p.8 Given that parallel programming is perceived to be much
How could parallel programming ever be as easy as harder than sequential programming, productivity is tan-
sequential programming? tamount and therefore must not be omitted. Furthermore,
high-productivity parallel-programming environments
Answer: such as SQL serve a specific purpose, hence general-
It depends on the programming environment. SQL [Int92] ity must also be added to the list. q
is an underappreciated success story, as it permits pro-
grammers who know nothing about parallelism to keep a
Quick Quiz 2.5: p.9
large parallel system productively busy. We can expect
more variations on this theme as parallel computers con- Given that parallel programs are much harder to prove
tinue to become cheaper and more readily available. For correct than are sequential programs, again, shouldn’t
example, one possible contender in the scientific and tech- correctness really be on the list?
nical computing arena is MATLAB*P, which is an attempt
Answer:
to automatically parallelize common matrix operations.
From an engineering standpoint, the difficulty in proving
Finally, on Linux and UNIX systems, consider the
correctness, either formally or informally, would be impor-
following shell command:
tant insofar as it impacts the primary goal of productivity.
get_input | grep "interesting" | sort So, in cases where correctness proofs are important, they
are subsumed under the “productivity” rubric. q
This shell pipeline runs the get_input, grep, and
sort processes in parallel. There, that wasn’t so hard, Quick Quiz 2.6: p.9
now was it? What about just having fun?
In short, parallel programming is just as easy as se-
quential programming—at least in those environments Answer:
that hide the parallelism from the user! q Having fun is important as well, but, unless you are a
hobbyist, would not normally be a primary goal. On the
p.8 other hand, if you are a hobbyist, go wild! q
Quick Quiz 2.3:
Oh, really??? What about correctness, maintainability,
robustness, and so on? Quick Quiz 2.7: p.9
Are there no cases where parallel programming is about
Answer:
something other than performance?
These are important goals, but they are just as important
for sequential programs as they are for parallel programs. Answer:
Therefore, important though they are, they do not belong There certainly are cases where the problem to be solved
on a list specific to parallel programming. q is inherently parallel, for example, Monte Carlo meth-
ods and some numerical computations. Even in these
cases, however, there will be some amount of extra work
managing the parallelism.
Parallelism is also sometimes used for reliability. For
but one example, triple-modulo redundancy has three
systems run in parallel and vote on the result. In extreme
2 Parallel programming is in some ways more difficult than sequential cases, the three systems will be independently imple-
programming, for example, parallel validation is more difficult. But no mented using different algorithms and technologies. q
longer mind-crushingly difficult.
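To make the triple-modulo redundancy mentioned in the answer to Quick Quiz 2.7 slightly more concrete, here is a small two-out-of-three voter sketch; the function name, the integer result type, and the no-majority fallback are illustrative assumptions rather than anything from the book's CodeSamples.

    /* Two-out-of-three voter of the kind used in triple-modular redundancy.
     * The three results are assumed to have been computed independently. */
    #include <stdio.h>

    int tmr_vote(int a, int b, int c)
    {
        if (a == b || a == c)
            return a;     /* a agrees with at least one other replica.      */
        if (b == c)
            return b;     /* a is the odd one out.                          */
        return a;         /* No majority: a real system would flag an error. */
    }

    int main(void)
    {
        /* The single faulty replica (7) is outvoted by the two good ones. */
        printf("%d\n", tmr_vote(42, 42, 7));
        return 0;
    }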


p.9 3. If the low-cost parallel machine is controlling the


Quick Quiz 2.8:
operation of a valuable piece of equipment, then the
Why not instead rewrite programs from inefficient script-
cost of this piece of equipment might easily justify
ing languages to C or C++?
substantial developer effort.
Answer: 4. If the software for the low-cost parallel machine
If the developers, budget, and time is available for such produces an extremely valuable result (e.g., energy
a rewrite, and if the result will attain the required levels savings), then this valuable result might again justify
of performance on a single CPU, this can be a reasonable substantial developer cost.
approach. q
5. Safety-critical systems protect lives, which can
clearly justify very large developer effort.
Quick Quiz 2.9: p.10
6. Hobbyists and researchers might instead seek knowl-
Why all this prattling on about non-technical issues??? edge, experience, fun, or glory.
And not just any non-technical issue, but productivity of
all things? Who cares? So it is not the case that the decreasing cost of hardware
renders software worthless, but rather that it is no longer
Answer: possible to “hide” the cost of software development within
If you are a pure hobbyist, perhaps you don’t need to care. the cost of the hardware, at least not unless there are
But even pure hobbyists will often care about how much extremely large quantities of hardware. q
they can get done, and how quickly. After all, the most
popular hobbyist tools are usually those that are the best Quick Quiz 2.11: p.12
suited for the job, and an important part of the definition This is a ridiculously unachievable ideal! Why not focus
of “best suited” involves productivity. And if someone on something that is achievable in practice?
is paying you to write parallel code, they will very likely
care deeply about your productivity. And if the person Answer:
paying you cares about something, you would be most This is eminently achievable. The cellphone is a computer
wise to pay at least some attention to it! that can be used to make phone calls and to send and
Besides, if you really didn’t care about productivity, you receive text messages with little or no programming or
would be doing it by hand rather than using a computer! configuration on the part of the end user.
q This might seem to be a trivial example at first glance,
but if you consider it carefully you will see that it is both
simple and profound. When we are willing to sacrifice
Quick Quiz 2.10: p.10
generality, we can achieve truly astounding increases in
Given how cheap parallel systems have become, how productivity. Those who indulge in excessive generality
can anyone afford to pay people to program them? will therefore fail to set the productivity bar high enough
to succeed near the top of the software stack. This fact
Answer: of life even has its own acronym: YAGNI, or “You Ain’t
There are a number of answers to this question: Gonna Need It.” q

1. Given a large computational cluster of parallel ma- p.13


Quick Quiz 2.12:
chines, the aggregate cost of the cluster can easily
Wait a minute! Doesn’t this approach simply shift
justify substantial developer effort, because the de-
the development effort from you to whoever wrote the
velopment cost can be spread over the large number
existing parallel software you are using?
of machines.
Answer:
2. Popular software that is run by tens of millions of Exactly! And that is the whole point of using existing soft-
users can easily justify substantial developer effort, ware. One team’s work can be used by many other teams,
as the cost of this development can be spread over resulting in a large decrease in overall effort compared to
the tens of millions of users. Note that this includes all teams needlessly reinventing the wheel. q
things like kernels and system libraries.


p.13 3. Synchronization overhead. For many synchroniza-


Quick Quiz 2.13:
tion protocols, excessive numbers of threads can
What other bottlenecks might prevent additional CPUs
result in excessive spinning, blocking, or rollbacks,
from providing additional performance?
thus degrading performance.
Answer:
Specific applications and platforms may have any num-
There are any number of potential bottlenecks:
ber of additional limiting factors. q
1. Main memory. If a single thread consumes all avail-
able memory, additional threads will simply page Quick Quiz 2.15: p.15
themselves silly. Just what is “explicit timing”???
2. Cache. If a single thread’s cache footprint completely Answer:
fills any shared CPU cache(s), then adding more Where each thread is given access to some set of resources
threads will simply thrash those affected caches, as during an agreed-to slot of time. For example, a parallel
will be seen in Chapter 10. program with eight threads might be organized into eight-
3. Memory bandwidth. If a single thread consumes all millisecond time intervals, so that the first thread is given
available memory bandwidth, additional threads will access during the first millisecond of each interval, the
simply result in additional queuing on the system second thread during the second millisecond, and so
interconnect. on. This approach clearly requires carefully synchronized
clocks and careful control of execution times, and therefore
4. I/O bandwidth. If a single thread is I/O bound, adding should be used with considerable caution.
more threads will simply result in them all waiting In fact, outside of hard realtime environments, you al-
in line for the affected I/O resource. most certainly want to use something else instead. Explicit
timing is nevertheless worth a mention, as it is always
Specific hardware systems might have any number of there when you need it. q
additional bottlenecks. The fact is that every resource
which is shared between multiple CPUs or threads is a
Quick Quiz 2.16: p.16
potential bottleneck. q
Are there any other obstacles to parallel programming?

Quick Quiz 2.14: p.14


Other than CPU cache capacity, what might require Answer:
limiting the number of concurrent threads? There are a great many other potential obstacles to parallel
programming. Here are a few of them:
Answer:
There are any number of potential limits on the number 1. The only known algorithms for a given project might
of threads: be inherently sequential in nature. In this case,
either avoid parallel programming (there being no
1. Main memory. Each thread consumes some mem- law saying that your project has to run in parallel) or
ory (for its stack if nothing else), so that excessive invent a new parallel algorithm.
numbers of threads can exhaust memory, resulting
in excessive paging or memory-allocation failures. 2. The project allows binary-only plugins that share
the same address space, such that no one developer
2. I/O bandwidth. If each thread initiates a given amount has access to all of the source code for the project.
of mass-storage I/O or networking traffic, excessive Because many parallel bugs, including deadlocks,
numbers of threads can result in excessive I/O queu- are global in nature, such binary-only plugins pose
ing delays, again degrading performance. Some a severe challenge to current software development
networking protocols may be subject to timeouts methodologies. This might well change, but for the
or other failures if there are so many threads that time being, all developers of parallel code sharing a
networking events cannot be responded to in a timely given address space need to be able to see all of the
fashion. code running in that address space.


3. The project contains heavily used APIs that were level properties of the hardware? Wouldn’t it be easier,
designed without regard to parallelism [AGH+ 11a, better, and more elegant to remain at a higher level of
CKZ+ 13]. Some of the more ornate features of the abstraction?
System V message-queue API form a case in point.
Of course, if your project has been around for a few Answer:
decades, and its developers did not have access to It might well be easier to ignore the detailed properties
parallel hardware, it undoubtedly has at least its share of the hardware, but in most cases it would be quite
of such APIs. foolish to do so. If you accept that the only purpose of
parallelism is to increase performance, and if you further
4. The project was implemented without regard to paral- accept that performance depends on detailed properties
lelism. Given that there are a great many techniques of the hardware, then it logically follows that parallel
that work extremely well in a sequential environment, programmers are going to need to know at least a few
but that fail miserably in parallel environments, if hardware properties.
your project ran only on sequential hardware for most This is the case in most engineering disciplines. Would
of its lifetime, then your project undoubtably has at you want to use a bridge designed by an engineer who
least its share of parallel-unfriendly code. did not understand the properties of the concrete and steel
making up that bridge? If not, why would you expect
5. The project was implemented without regard to good
a parallel programmer to be able to develop competent
software-development practice. The cruel truth is
parallel software without at least some understanding of
that shared-memory parallel environments are often
the underlying hardware? q
much less forgiving of sloppy development practices
than are sequential environments. You may be well-
served to clean up the existing design and code prior Quick Quiz 3.2: p.20
to attempting parallelization. What types of machines would allow atomic operations
on multiple data elements?
6. The people who originally did the development on
your project have since moved on, and the people Answer:
remaining, while well able to maintain it or add small One answer to this question is that it is often possible to
features, are unable to make “big animal” changes. pack multiple elements of data into a single machine word,
In this case, unless you can work out a very simple which can then be manipulated atomically.
way to parallelize your project, you will probably be A more trendy answer would be machines support-
best off leaving it sequential. That said, there are a ing transactional memory [Lom77, Kni86, HM93]. By
number of simple approaches that you might use to early 2014, several mainstream systems provided limited
parallelize your project, including running multiple hardware transactional memory implementations, which
instances of it, using a parallel implementation of is covered in more detail in Section 17.3. The jury
some heavily used library function, or making use is still out on the applicability of software transactional
of some other parallel project, such as a database. memory [MMW07, PW07, RHP+ 07, CBM+ 08, DFGG11,
One can argue that many of these obstacles are non- MS12], which is covered in Section 17.2. q
technical in nature, but that does not make them any less
real. In short, parallelization of a large body of code Quick Quiz 3.3: p.20
can be a large and complex effort. As with any large So have CPU designers also greatly reduced the overhead
and complex effort, it makes sense to do your homework of cache misses?
beforehand. q
Answer:
Unfortunately, not so much. There has been some re-
duction given constant numbers of CPUs, but the finite
E.3 Hardware and its Habits speed of light and the atomic nature of matter limits their
ability to reduce cache-miss overhead for larger systems.
Quick Quiz 3.1: p.17 Section 3.3 discusses some possible avenues for possible
Why should parallel programmers bother learning low- future progress. q

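Returning to the answer to Quick Quiz 3.2, one way to pack multiple data elements into a single machine word is sketched below: two 16-bit counters share one 32-bit word, so a single compare-and-swap updates both atomically. The field layout, the names, and the use of the GCC __atomic builtins are illustrative assumptions.

    /* Pack two 16-bit counters into one 32-bit word so that a single CAS
     * updates both atomically.  Fields are assumed not to exceed 16 bits. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t packed_counts;  /* Bits 31..16: failures; bits 15..0: successes. */

    static void count_result(int success)
    {
        uint32_t old, newval;

        do {
            old = __atomic_load_n(&packed_counts, __ATOMIC_RELAXED);
            newval = old + (success ? 1U : 1U << 16);   /* Bump one field. */
        } while (!__atomic_compare_exchange_n(&packed_counts, &old, newval,
                                              0, __ATOMIC_RELAXED,
                                              __ATOMIC_RELAXED));
    }

    int main(void)
    {
        count_result(1);
        count_result(0);
        printf("successes=%u failures=%u\n",
               (unsigned)(packed_counts & 0xffffU),
               (unsigned)(packed_counts >> 16));
        return 0;
    }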

p.22 the file /sys/devices/system/cpu/cpu0/cache/


Quick Quiz 3.4:
index0/shared_cpu_list really does contain the
This is a simplified sequence of events? How could it
string 0,224. Therefore, CPU 0’s hyperthread twin really
possibly be any more complex?
is CPU 224. Some people speculate that this number-
Answer: ing allows naive applications and schedulers to perform
This sequence ignored a number of possible complications, better, citing the fact that on many workloads the second
including: hyperthread does not provide a huge amount of additional
performance. This speculation assumes that naive appli-
1. Other CPUs might be concurrently attempting to cations and schedulers would utilize CPUs in numerical
perform memory-reference operations involving this order, leaving aside the weaker hyperthread twin CPUs
same cacheline. until all cores are in use. q

2. The cacheline might have been replicated read-only


Quick Quiz 3.7: p.24
in several CPUs’ caches, in which case, it would need
to be flushed from their caches. Surely the hardware designers could be persuaded to
improve this situation! Why have they been content with
3. CPU 7 might have been operating on the cache line such abysmal performance for these single-instruction
when the request for it arrived, in which case CPU 7 operations?
might need to hold off the request until its own
operation completed. Answer:
The hardware designers have been working on this prob-
4. CPU 7 might have ejected the cacheline from its lem, and have consulted with no less a luminary than
cache (for example, in order to make room for other the late physicist Stephen Hawking. Hawking’s obser-
data), so that by the time that the request arrived, the vation was that the hardware designers have two basic
cacheline was on its way to memory. problems [Gar07]:
5. A correctable error might have occurred in the cache-
line, which would then need to be corrected at some 1. The finite speed of light, and
point before the data was used.
2. The atomic nature of matter.
Production-quality cache-coherence mechanisms are
extremely complicated due to these sorts of considera- The first problem limits raw speed, and the second
tions [HP95, CSG99, MHS12, SHW11]. q limits miniaturization, which in turn limits frequency.
And even this sidesteps the power-consumption issue that
is currently limiting production frequencies to well below
Quick Quiz 3.5: p.22
10 GHz.
Why is it necessary to flush the cacheline from CPU 7’s In addition, Table 3.1 on page 23 represents a reasonably
cache? large system with no fewer than 448 hardware threads.
Answer: Smaller systems often achieve better latency, as may be
If the cacheline was not flushed from CPU 7’s cache, then seen in Table E.1, which represents a much smaller system
CPUs 0 and 7 might have different values for the same with only 16 hardware threads. A similar view is provided
set of variables in the cacheline. This sort of incoherence by the rows of Table 3.1 down to and including the two
greatly complicates parallel software, which is why wise “Off-core” rows.
hardware architects avoid it. q Furthermore, newer small-scale single-socket systems
such as the laptop on which I am typing this also have
p.23 more reasonable latencies, as can be seen in Table E.2.
Quick Quiz 3.6:
Table 3.1 shows CPU 0 sharing a core with CPU 224. Alternatively, a 64-CPU system in the mid 1990s had
Shouldn’t that instead be CPU 1??? cross-interconnect latencies in excess of five microsec-
onds, so even the eight-socket 448-hardware-thread mon-
Answer: ster shown in Table 3.1 represents more than a five-fold
It is easy to be sympathetic to this view, but improvement over its 25-years-prior counterparts.


Table E.1: Performance of Synchronization Mechanisms For the more-expensive inter-system communications
on 16-CPU 2.8 GHz Intel X5550 (Nehalem) System latencies, use several rolls (or multiple cases) of toilet
paper to represent the communications latency.
Ratio
Operation Cost (ns) (cost/clock) Important safety tip: Make sure to account for the needs
of those you live with when appropriating toilet paper,
Clock period 0.4 1.0 especially in 2020 or during a similar time when store
Same-CPU CAS 12.2 33.8 shelves are free of toilet paper and much else besides.
Same-CPU lock 25.6 71.2 Furthermore, for those working on kernel code, a CPU
Blind CAS 12.9 35.8 disabling interrupts across a cache miss is analogous to
CAS 7.0 19.4 you holding your breath while unrolling a roll of toilet
Off-Core paper. How many rolls of toilet paper can you unroll while
Blind CAS 31.2 86.6 holding your breath? You might wish to avoid disabling
CAS 31.2 86.5 interrupts across that many cache misses.3 q

Off-Socket
Quick Quiz 3.9: p.25
Blind CAS 92.4 256.7
CAS 95.9 266.4 But individual electrons don’t move anywhere near that
fast, even in conductors!!! The electron drift velocity in
Off-System a conductor under semiconductor voltage levels is on the
Comms Fabric 2,600 7,220 order of only one millimeter per second. What gives???
Global Comms 195,000,000 542,000,000

Answer:
Integration of hardware threads in a single core and Electron drift velocity tracks the long-term movement of
multiple cores on a die have improved latencies greatly, individual electrons. It turns out that individual electrons
at least within the confines of a single core or single bounce around quite randomly, so that their instantaneous
die. There has been some improvement in overall system speed is very high, but over the long term, they don’t
latency, but only by about a factor of two. Unfortunately, move very far. In this, electrons resemble long-distance
neither the speed of light nor the atomic nature of matter commuters, who might spend most of their time traveling
has changed much in the past few years [Har16]. Therefore, at full highway speed, but over the long term go nowhere.
spatial and temporal locality are first-class concerns for These commuters’ speed might be 70 miles per hour (113
concurrent software, even when running on relatively kilometers per hour), but their long-term drift velocity
small systems. relative to the planet’s surface is zero.
Therefore, we should pay attention not to the electrons’
Section 3.3 looks at what else hardware designers might drift velocity, but to their instantaneous velocities. How-
be able to do to ease the plight of parallel programmers. ever, even their instantaneous velocities are nowhere near
q a significant fraction of the speed of light. Nevertheless,
the measured velocity of electric waves in conductors is a
substantial fraction of the speed of light, so we still have a
Quick Quiz 3.8: p.24 mystery on our hands.
These numbers are insanely large! How can I possibly The other trick is that electrons interact with each other
get my head around them? at significant distances (from an atomic perspective, any-
way), courtesy of their negative charge. This interaction
Answer: is carried out by photons, which do move at the speed of
Get a roll of toilet paper. In the USA, each roll will light. So even with electricity’s electrons, it is photons
normally have somewhere around 350–500 sheets. Tear doing most of the fast footwork.
off one sheet to represent a single clock cycle, setting it Extending the commuter analogy, a driver might use a
aside. Now unroll the rest of the roll. smartphone to inform other drivers of an accident or con-
gestion, thus allowing a change in traffic flow to propagate
The resulting pile of toilet paper will likely represent a
single CAS cache miss. 3 Kudos to Matthew Wilcox for this holding-breath analogy.


Table E.2: CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20 GHz
Ratio
Operation Cost (ns) (cost/clock) CPUs
Clock period 0.5 1.0
Same-CPU CAS 6.2 13.6 0
Same-CPU lock 13.5 29.6 0
In-core blind CAS 6.5 14.3 6
In-core CAS 16.2 35.6 6
Off-core blind CAS 22.2 48.8 1–5,7–11
Off-core CAS 53.6 117.9 1–5,7–11
Off-System
Comms Fabric 5,000 11,000
Global Comms 195,000,000 429,000,000

much faster than the instantaneous velocity of the individ- 1. Shared-memory multiprocessor systems have strict
ual cars. Summarizing the analogy between electricity size limits. If you need more than a few thousand
and traffic flow: CPUs, you have no choice but to use a distributed
system.
1. The (very low) drift velocity of an electron is similar
2. Large shared-memory systems tend to be more ex-
to the long-term velocity of a commuter, both being
pensive per unit computation than their smaller coun-
very nearly zero.
terparts.
2. The (still rather low) instantaneous velocity of an
3. Large shared-memory systems tend to have much
electron is similar to the instantaneous velocity of
longer cache-miss latencies than do smaller system.
a car in traffic. Both are much higher than the drift
To see this, compare Table 3.1 on page 23 with
velocity, but quite small compared to the rate at which
Table E.2.
changes propagate.
4. The distributed-systems communications operations
3. The (much higher) propagation velocity of an elec- do not necessarily use much CPU, so that computa-
tric wave is primarily due to photons transmitting tion can proceed in parallel with message transfer.
electromagnetic force among the electrons. Simi-
larly, traffic patterns can change quite quickly due 5. Many important problems are “embarrassingly paral-
to communication among drivers. Not that this is lel”, so that extremely large quantities of processing
necessarily of much help to the drivers already stuck may be enabled by a very small number of messages.
in traffic, any more than it is to the electrons already SETI@HOME [Uni08b] was but one example of
pooled in a given capacitor. such an application. These sorts of applications can
make good use of networks of computers despite
Of course, to fully understand this topic, you should extremely long communications latencies.
read up on electrodynamics. q
Thus, large shared-memory systems tend to be used
for applications that benefit from faster latencies than can
Quick Quiz 3.10: p.28 be provided by distributed computing, and particularly
Given that distributed-systems communication is so for those applications that benefit from a large shared
horribly expensive, why does anyone bother with such memory.
systems? It is likely that continued work on parallel applications
will increase the number of embarrassingly parallel ap-
Answer: plications that can run well on machines and/or clusters
There are a number of reasons: having long communications latencies, reductions in cost


being the driving force that it is. That said, greatly re- p.29
Quick Quiz 4.3:
duced hardware latencies would be an extremely welcome
Is there a simpler way to create a parallel shell script?
development, both for single-system and for distributed
If so, how? If not, why not?
computing. q
Answer:
One straightforward approach is the shell pipeline:
Quick Quiz 3.11: p.28
OK, if we are going to have to apply distributed- grep $pattern1 | sed -e 's/a/b/' | sort

programming techniques to shared-memory parallel


programs, why not just always use these distributed For a sufficiently large input file, grep will pattern-
techniques and dispense with shared memory? match in parallel with sed editing and with the input
processing of sort. See the file parallel.sh for a
Answer: demonstration of shell-script parallelism and pipelining.
Because it is often the case that only a small fraction q
of the program is performance-critical. Shared-memory
parallelism allows us to focus distributed-programming Quick Quiz 4.4: p.30
techniques on that small fraction, allowing simpler shared- But if script-based parallel programming is so easy, why
memory techniques to be used on the non-performance- bother with anything else?
critical bulk of the program. q
Answer:
In fact, it is quite likely that a very large fraction of
parallel programs in use today are script-based. However,
script-based parallelism does have its limitations:
E.4 Tools of the Trade
1. Creation of new processes is usually quite heavy-
weight, involving the expensive fork() and exec()
Quick Quiz 4.1: p.29 system calls.
You call these tools??? They look more like low-level 2. Sharing of data, including pipelining, typically in-
synchronization primitives to me! volves expensive file I/O.
3. The reliable synchronization primitives available to
Answer:
scripts also typically involve expensive file I/O.
They look that way because they are in fact low-level
synchronization primitives. And they are in fact the fun- 4. Scripting languages are often too slow, but are often
damental tools for building low-level concurrent software. quite useful when coordinating execution of long-
q running programs written in lower-level program-
ming languages.

Quick Quiz 4.2: p.29 These limitations require that script-based parallelism
But this silly shell script isn’t a real parallel program! use coarse-grained parallelism, with each unit of work
Why bother with such trivia??? having execution time of at least tens of milliseconds, and
preferably much longer.
Those requiring finer-grained parallelism are well ad-
Answer:
vised to think hard about their problem to see if it can be
Because you should never forget the simple stuff!
expressed in a coarse-grained form. If not, they should
Please keep in mind that the title of this book is “Is consider using other parallel-programming environments,
Parallel Programming Hard, And, If So, What Can You such as those discussed in Section 4.2. q
Do About It?”. One of the most effective things you can
do about it is to avoid forgetting the simple stuff! After all, p.30
Quick Quiz 4.5:
if you choose to do parallel programming the hard way,
Why does this wait() primitive need to be so compli-
you have no one but yourself to blame. q

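Quick Quiz 4.5, whose answer appears just below, concerns waiting for children individually; as background, here is a minimal fork()/waitpid() sketch that reaps each child and inspects its exit status. It is an illustrative stand-in, not the book's Listing 4.2.

    /* Wait for each child individually and examine how it exited. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NCHILDREN 3

    int main(void)
    {
        pid_t pid[NCHILDREN];
        int i, status;

        for (i = 0; i < NCHILDREN; i++) {
            pid[i] = fork();
            if (pid[i] < 0) {
                perror("fork");
                exit(EXIT_FAILURE);
            }
            if (pid[i] == 0)
                exit(i);                       /* Child: exit with its index. */
        }
        for (i = 0; i < NCHILDREN; i++) {
            waitpid(pid[i], &status, 0);       /* Reap this specific child... */
            if (WIFEXITED(status))             /* ...and see how it died.     */
                printf("child %d exited with %d\n", i, WEXITSTATUS(status));
        }
        return 0;
    }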

cated? Why not just make it work like the shell-script Quick Quiz 4.8: p.32
wait does? If the C language makes no guarantees in presence of a
Answer: data race, then why does the Linux kernel have so many
Some parallel applications need to take special action when data races? Are you trying to tell me that the Linux
specific children exit, and therefore need to wait for each kernel is completely broken???
child individually. In addition, some parallel applications
Answer:
need to detect the reason that the child died. As we saw in
Ah, but the Linux kernel is written in a carefully selected
Listing 4.2, it is not hard to build a waitall() function
superset of the C language that includes special GNU
out of the wait() function, but it would be impossible
extensions, such as asms, that permit safe execution even
to do the reverse. Once the information about a specific
in presence of data races. In addition, the Linux kernel
child is lost, it is lost. q
does not run on a number of platforms where data races
would be especially problematic. For an example, consider
Quick Quiz 4.6: p.31 embedded systems with 32-bit pointers and 16-bit busses.
Isn’t there a lot more to fork() and wait() than dis- On such a system, a data race involving a store to and a
cussed here? load from a given pointer might well result in the load
returning the low-order 16 bits of the old value of the
Answer:
pointer concatenated with the high-order 16 bits of the
Indeed there is, and it is quite possible that this section
new value of the pointer.
will be expanded in future versions to include messaging
Nevertheless, even in the Linux kernel, data races
features (such as UNIX pipes, TCP/IP, and shared file I/O)
can be quite dangerous and should be avoided where
and memory mapping (such as mmap() and shmget()).
feasible [Cor12]. q
In the meantime, there are any number of textbooks that
cover these primitives in great detail, and the truly moti-
vated can read manpages, existing parallel applications Quick Quiz 4.9: p.32
using these primitives, as well as the source code of the What if I want several threads to hold the same lock at
Linux-kernel implementations themselves. the same time?
It is important to note that the parent process in List-
ing 4.3 waits until after the child terminates to do its Answer:
printf(). Using printf()’s buffered I/O concurrently The first thing you should do is to ask yourself why you
to the same file from multiple processes is non-trivial, would want to do such a thing. If the answer is “because I
and is best avoided. If you really need to do concur- have a lot of data that is read by many threads, and only
rent buffered I/O, consult the documentation for your occasionally updated”, then POSIX reader-writer locks
OS. For UNIX/Linux systems, Stewart Weiss’s lecture might be what you are looking for. These are introduced
notes provide a good introduction with informative exam- in Section 4.2.4.
ples [Wei13]. q Another way to get the effect of multiple threads holding
the same lock is for one thread to acquire the lock, and
then use pthread_create() to create the other threads.
Quick Quiz 4.7: p.31
The question of why this would ever be a good idea is left
If the mythread() function in Listing 4.4 can simply to the reader. q
return, why bother with pthread_exit()?

Answer: Quick Quiz 4.10: p.33


In this simple example, there is no reason whatsoever. Why not simply make the argument to lock_reader()
However, imagine a more complex example, where on line 6 of Listing 4.5 be a pointer to a pthread_
mythread() invokes other functions, possibly separately mutex_t?
compiled. In such a case, pthread_exit() allows these
other functions to end the thread’s execution without hav- Answer:
ing to pass some sort of error return all the way back up Because we will need to pass lock_reader() to
to mythread(). q pthread_create(). Although we could cast the func-
tion when passing it to pthread_create(), function

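Because the answer to Quick Quiz 4.9 points at POSIX reader-writer locks, a minimal sketch of their use follows; the variable names and the trivial critical sections are illustrative assumptions.

    /* Minimal use of POSIX reader-writer locking. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
    static int shared_config;

    static void *reader(void *arg)
    {
        pthread_rwlock_rdlock(&rwl);    /* Many readers may hold this at once. */
        printf("read %d\n", shared_config);
        pthread_rwlock_unlock(&rwl);
        return NULL;
    }

    static void *writer(void *arg)
    {
        pthread_rwlock_wrlock(&rwl);    /* Writers get exclusive access. */
        shared_config = 42;
        pthread_rwlock_unlock(&rwl);
        return NULL;
    }

    int main(void)
    {
        pthread_t r, w;

        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }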

casts are quite a bit uglier and harder to get right than are on a multiprocessor, one would normally expect lock_
simple pointer casts. q reader() to acquire the lock first. Nevertheless, there
are no guarantees, especially on a busy system. q

Quick Quiz 4.11: p.33


What is the READ_ONCE() on lines 20 and 47 and the
Quick Quiz 4.14: p.34
WRITE_ONCE() on line 47 of Listing 4.5?
Using different locks could cause quite a bit of confu-
Answer: sion, what with threads seeing each others’ intermediate
These macros constrain the compiler so as to prevent it states. So should well-written parallel programs restrict
from carrying out optimizations that would be problematic themselves to using a single lock in order to avoid this
for concurrently accessed shared variables. They don’t kind of confusion?
constrain the CPU at all, other than by preventing reorder-
ing of accesses to a given single variable. Note that this Answer:
single-variable constraint does apply to the code shown in Although it is sometimes possible to write a program
Listing 4.5 because only the variable x is accessed. using a single global lock that both performs and scales
For more information on READ_ONCE() and WRITE_ well, such programs are exceptions to the rule. You
ONCE(), please see Section 4.2.5. For more in- will normally need to use multiple locks to attain good
formation on ordering accesses to multiple variables performance and scalability.
by multiple threads, please see Chapter 15. In the
One possible exception to this rule is “transactional
meantime, READ_ONCE(x) has much in common with
memory”, which is currently a research topic. Transac-
the GCC intrinsic __atomic_load_n(&x, __ATOMIC_
tional-memory semantics can be loosely thought of as
RELAXED) and WRITE_ONCE(x, v) has much in common
those of a single global lock with optimizations permitted
with the GCC intrinsic __atomic_store_n(&x, v, __
and with the addition of rollback [Boe09]. q
ATOMIC_RELAXED). q

Quick Quiz 4.12: p.33 p.34


Quick Quiz 4.15:
Writing four lines of code for each acquisition and release In the code shown in Listing 4.7, is lock_reader()
of a pthread_mutex_t sure seems painful! Isn’t there guaranteed to see all the values produced by lock_
a better way? writer()? Why or why not?
Answer:
Indeed! And for that reason, the pthread_mutex_ Answer:
lock() and pthread_mutex_unlock() primitives are No. On a busy system, lock_reader() might be pre-
normally wrapped in functions that do this error check- empted for the entire duration of lock_writer()’s ex-
ing. Later on, we will wrap them with the Linux kernel ecution, in which case it would not see any of lock_
spin_lock() and spin_unlock() APIs. q writer()’s intermediate states for x. q

Quick Quiz 4.13: p.33


Quick Quiz 4.16: p.34
Is “x = 0” the only possible output from the code fragment
Wait a minute here!!! Listing 4.6 didn’t initialize shared
shown in Listing 4.6? If so, why? If not, what other
variable x, so why does it need to be initialized in
output could appear, and why?
Listing 4.7?
Answer:
No. The reason that “x = 0” was output was that lock_ Answer:
reader() acquired the lock first. Had lock_writer() See line 4 of Listing 4.5. Because the code in Listing 4.6
instead acquired the lock first, then the output would have ran first, it could rely on the compile-time initialization
been “x = 3”. However, because the code fragment started of x. The code in Listing 4.7 ran next, so it had to
lock_reader() first and because this run was performed re-initialize x. q

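The answer to Quick Quiz 4.12 mentions wrapping pthread_mutex_lock() and pthread_mutex_unlock() in error-checking functions; one plausible shape for such wrappers is sketched below. The particular error handling shown is an illustrative assumption, not the book's CodeSamples implementation.

    /* Error-checking wrappers of the sort alluded to in Quick Quiz 4.12's
     * answer: fail hard rather than silently ignoring a locking error. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static inline void spin_lock(pthread_mutex_t *lp)
    {
        int ret = pthread_mutex_lock(lp);

        if (ret != 0) {
            fprintf(stderr, "pthread_mutex_lock: %d\n", ret);
            abort();
        }
    }

    static inline void spin_unlock(pthread_mutex_t *lp)
    {
        int ret = pthread_mutex_unlock(lp);

        if (ret != 0) {
            fprintf(stderr, "pthread_mutex_unlock: %d\n", ret);
            abort();
        }
    }

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    int main(void)
    {
        spin_lock(&lock);
        spin_unlock(&lock);
        return 0;
    }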

p.35 Answer:
Quick Quiz 4.17:
It depends. If the per-thread variable was accessed only
Instead of using READ_ONCE() everywhere, why not just
from its thread, and never from a signal handler, then
declare goflag as volatile on line 10 of Listing 4.8?
no. Otherwise, it is quite possible that READ_ONCE()
is needed. We will see examples of both situations in
Answer: Section 5.4.4.
A volatile declaration is in fact a reasonable alternative This leads to the question of how one thread can gain
in this particular case. However, use of READ_ONCE() has access to another thread’s __thread variable, and the
the benefit of clearly flagging to the reader that goflag answer is that the second thread must store a pointer to
is subject to concurrent reads and updates. Note that its __thread variable somewhere that the first thread has
READ_ONCE() is especially useful in cases where most of access to. One common approach is to maintain a linked
the accesses are protected by a lock (and thus not subject list with one element per thread, and to store the address
to change), but where a few of the accesses are made of each thread’s __thread variable in the corresponding
outside of the lock. Using a volatile declaration in element. q
this case would make it harder for the reader to note the
special accesses outside of the lock, and would also make Quick Quiz 4.20: p.35
it harder for the compiler to generate good code under the Isn’t comparing against single-CPU throughput a bit
lock. q harsh?

p.35 Answer:
Quick Quiz 4.18:
Not at all. In fact, this comparison was, if anything,
READ_ONCE() only affects the compiler, not the CPU.
overly lenient. A more balanced comparison would be
Don’t we also need memory barriers to make sure that
against single-CPU throughput with the locking primitives
the change in goflag’s value propagates to the CPU in
commented out. q
a timely fashion in Listing 4.8?

Answer: Quick Quiz 4.21: p.35


No, memory barriers are not needed and won’t help here. But one microsecond is not a particularly small size for
Memory barriers only enforce ordering among multiple a critical section. What do I do if I need a much smaller
memory references: They absolutely do not guarantee critical section, for example, one containing only a few
to expedite the propagation of data from one part of the instructions?
system to another.4 This leads to a quick rule of thumb:
You do not need memory barriers unless you are using Answer:
more than one variable to communicate between multiple If the data being read never changes, then you do not need
threads. to hold any locks while accessing it. If the data changes
But what about nreadersrunning? Isn’t that a second sufficiently infrequently, you might be able to checkpoint
variable used for communication? Indeed it is, and there execution, terminate all threads, change the data, then
really are the needed memory-barrier instructions buried restart at the checkpoint.
in __sync_fetch_and_add(), which make sure that the Another approach is to keep a single exclusive lock per
thread proclaims its presence before checking to see if it thread, so that a thread read-acquires the larger aggregate
should start. q reader-writer lock by acquiring its own lock, and write-
acquires by acquiring all the per-thread locks [HW92].
This can work quite well for readers, but causes writers
Quick Quiz 4.19: p.35
to incur increasingly large overheads as the number of
Would it ever be necessary to use READ_ONCE() when threads increases.
accessing a per-thread variable, for example, a variable Some other ways of efficiently handling very small
declared using GCC’s __thread storage class? critical sections are described in Chapter 9. q

4 There have been persistent rumors of hardware in which memory


Quick Quiz 4.22: p.36
barriers actually do expedite propagation of data, but no confirmed The system used is a few years old, and new hardware
sightings.

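To make the registration idea in the answer to Quick Quiz 4.19 concrete, the sketch below has each thread publish the address of its __thread counter so that the parent can read it while the threads still exist; the array, the barrier-based handshake, and all names are illustrative assumptions.

    /* One thread gaining access to other threads' __thread variables:
     * each worker publishes a pointer to its per-thread counter, and the
     * parent reads through those pointers before the workers exit. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static __thread unsigned long counter;      /* One instance per thread. */
    static unsigned long *counterp[NTHREADS];   /* Published addresses.     */
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long idx = (long)arg;

        counter = idx + 1;                  /* Pretend work tally.        */
        counterp[idx] = &counter;           /* Publish my counter.        */
        pthread_barrier_wait(&barrier);     /* Tell the parent to read.   */
        pthread_barrier_wait(&barrier);     /* Wait until it has finished. */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        unsigned long sum = 0;
        long i;

        pthread_barrier_init(&barrier, NULL, NTHREADS + 1);
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        pthread_barrier_wait(&barrier);     /* All pointers now published. */
        for (i = 0; i < NTHREADS; i++)
            sum += *counterp[i];            /* Read other threads' TLS.    */
        pthread_barrier_wait(&barrier);     /* Let the workers exit.       */
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("sum = %lu\n", sum);
        return 0;
    }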

should be faster. So why should anyone worry about Quick Quiz 4.25: p.36
reader-writer locks being slow? What happened to ACCESS_ONCE()?

Answer:
Answer:
In the 2018 v4.15 release, the Linux kernel’s ACCESS_
In general, newer hardware is improving. However, it will
ONCE() was replaced by READ_ONCE() and WRITE_
need to improve several orders of magnitude to permit
ONCE() for reads and writes, respectively [Cor12, Cor14a,
reader-writer lock to achieve ideal performance on 448
Rut17]. ACCESS_ONCE() was introduced as a helper in
CPUs. Worse yet, the greater the number of CPUs, the
RCU code, but was promoted to core API soon after-
larger the required performance improvement. The per-
ward [McK07b, Tor08]. Linux kernel’s READ_ONCE()
formance problems of reader-writer locking are therefore
and WRITE_ONCE() have evolved into complex forms that
very likely to be with us for quite some time to come. q
look quite different than the original ACCESS_ONCE()
implementation due to the need to support access-once
semantics for large structures, but with the possibility of
Quick Quiz 4.23: p.36 load/store tearing if the structure cannot be loaded and
Is it really necessary to have both sets of primitives? stored with a single machine instruction. q

Answer: Quick Quiz 4.26: p.39


Strictly speaking, no. One could implement any member What happened to the Linux-kernel equivalents to
of the second set using the corresponding member of the fork() and wait()?
first set. For example, one could implement __sync_
Answer:
nand_and_fetch() in terms of __sync_fetch_and_
They don’t really exist. All tasks executing within the
nand() as follows:
Linux kernel share memory, at least unless you want to
do a huge amount of memory-mapping work by hand. q
tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~ret & tmp; Quick Quiz 4.27: p.40
What problems could occur if the variable counter
were incremented without the protection of mutex?
It is similarly possible to implement __sync_fetch_ Answer:
and_add(), __sync_fetch_and_sub(), and __sync_ On CPUs with load-store architectures, incrementing
fetch_and_xor() in terms of their post-value counter- counter might compile into something like the following:
parts.
LOAD counter,r0
However, the alternative forms can be quite convenient, INC r0
both for the programmer and for the compiler/library STORE r0,counter

implementor. q
On such machines, two threads might simultaneously
load the value of counter, each increment it, and each
store the result. The new value of counter will then
Quick Quiz 4.24: p.36
only be one greater than before, despite two threads each
Given that these atomic operations will often be able incrementing it. q
to generate single atomic instructions that are directly
supported by the underlying instruction set, shouldn’t
Quick Quiz 4.28: p.40
they be the fastest possible way to get things done?
What is wrong with loading Listing 4.14’s global_ptr
up to three times?
Answer:
Unfortunately, no. See Chapter 5 for some stark coun- Answer:
terexamples. q Suppose that global_ptr is initially non-NULL, but that

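The load/increment/store sequence in the answer to Quick Quiz 4.27 really does lose counts when run concurrently. The sketch below contrasts a plain increment with __sync_fetch_and_add(); the thread and iteration counts are arbitrary illustrative choices.

    /* Demonstration of the lost-update problem from Quick Quiz 4.27's
     * answer: the plain counter usually falls short of the expected value,
     * while the atomic read-modify-write does not lose counts. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NLOOPS   1000000

    static unsigned long plain_counter;
    static unsigned long atomic_counter;

    static void *incrementer(void *arg)
    {
        int i;

        for (i = 0; i < NLOOPS; i++) {
            plain_counter++;                          /* Racy load/inc/store. */
            __sync_fetch_and_add(&atomic_counter, 1); /* Atomic increment.    */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, incrementer, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("expected %lu, plain %lu, atomic %lu\n",
               (unsigned long)NTHREADS * NLOOPS, plain_counter, atomic_counter);
        return 0;
    }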

some other thread sets global_ptr to NULL. Suppose In the case of volatile and atomic variables, the
further that line 1 of the transformed code (Listing 4.15) compiler is specifically forbidden from inventing writes.
executes just before global_ptr is set to NULL and line 2 q
just after. Then line 1 will conclude that global_ptr is
non-NULL, line 2 will conclude that it is less than high_ p.45
address, so that line 3 passes do_low() a NULL pointer, Quick Quiz 4.31:
which do_low() just might not be prepared to deal with. But aren’t full memory barriers very heavyweight? Isn’t
Your editor made exactly this mistake in the DYNIX/ptx there a cheaper way to enforce the ordering needed in
kernel’s memory allocator in the early 1990s. Tracking Listing 4.29?
down the bug consumed a holiday weekend not just for
Answer:
your editor, but also for several of his colleagues. In short,
As is often the case, the answer is “it depends”. However,
this is not a new problem, nor is it likely to go away on its
if only two threads are accessing the status and other_
own. q
task_ready variables, then the smp_store_release()
and smp_load_acquire() functions discussed in Sec-
Quick Quiz 4.29: p.41
tion 4.3.5 will suffice. q
Why does it matter whether do_something() and do_
something_else() in Listing 4.18 are inline func-
tions? Quick Quiz 4.32: p.46
What needs to happen if an interrupt or signal handler
Answer: might itself be interrupted?
Because gp is not a static variable, if either do_
something() or do_something_else() were sepa- Answer:
rately compiled, the compiler would have to assume that Then that interrupt handler must follow the same rules
either or both of these two functions might change the that are followed by other interrupted code. Only those
value of gp. This possibility would force the compiler handlers that cannot be themselves interrupted or that
to reload gp on line 15, thus avoiding the NULL-pointer access no variables shared with an interrupting handler
dereference. q may safely use plain accesses, and even then only if those
variables cannot be concurrently accessed by some other
Quick Quiz 4.30: p.43 CPU or thread. q
Ouch! So can’t the compiler invent a store to a normal
variable pretty much any time it likes?
Quick Quiz 4.33: p.46
Answer: How could you work around the lack of a per-thread-
Thankfully, the answer is no. This is because the compiler variable API on systems that do not provide it?
is forbidden from introducing data races. The case of
inventing a store just before a normal store is quite special: Answer:
It is not possible for some other entity, be it CPU, thread, One approach would be to create an array indexed by
signal handler, or interrupt handler, to be able to see the smp_thread_id(), and another would be to use a hash
invented store unless the code already has a data race, table to map from smp_thread_id() to an array index—
even without the invented store. And if the code already which is in fact what this set of APIs does in pthread
has a data race, it already invokes the dreaded spectre of environments.
undefined behavior, which allows the compiler to generate Another approach would be for the parent to allocate
pretty much whatever code it wants, regardless of the a structure containing fields for each desired per-thread
wishes of the developer. variable, then pass this to the child during thread cre-
But if the original store is volatile, as in WRITE_ONCE(), ation. However, this approach can impose large software-
for all the compiler knows, there might be a side effect engineering costs in large systems. To see this, imagine if
associated with the store that could signal some other all global variables in a large system had to be declared
thread, allowing data-race-free access to the variable. By in a single file, regardless of whether or not they were C
inventing the store, the compiler might be introducing a static variables! q
data race, which it is not permitted to do.

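As a sketch of the release/acquire pairing recommended by the answer to Quick Quiz 4.31, the following uses C11 atomic_store_explicit()/atomic_load_explicit() as user-space counterparts of smp_store_release() and smp_load_acquire(); the variable names echo those in the answer only as illustrative assumptions.

    /* Release/acquire message passing: the release store publishes the
     * plain store to status, and the acquire load pairs with it. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int status;                      /* Payload ("message"). */
    static atomic_int other_task_ready;     /* Flag ("mailbox").    */

    static void *producer(void *arg)
    {
        status = 42;                                    /* Plain store...     */
        atomic_store_explicit(&other_task_ready, 1,
                              memory_order_release);    /* ...then release.   */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        while (!atomic_load_explicit(&other_task_ready,
                                     memory_order_acquire))
            continue;                       /* Acquire load pairs with release. */
        printf("status = %d\n", status);    /* Guaranteed to observe 42.        */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;

        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }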

Quick Quiz 4.34: p.47 Quick Quiz 5.3: p.49


Wouldn’t the shell normally use vfork() rather than Approximate structure-allocation limit problem.
fork()? Suppose that you need to maintain a count of the number
of structures allocated in order to fail any allocations
Answer: once the number of structures in use exceeds a limit
It might well do that, however, checking is left as an (say, 10,000). Suppose further that the structures are
exercise for the reader. But in the meantime, I hope that short-lived, the limit is rarely exceeded, and a “sloppy”
we can agree that vfork() is a variant of fork(), so that approximate limit is acceptable.
we can use fork() as a generic term covering both. q
Answer:
Hint: The act of updating the counter must again be
E.5 Counting blazingly fast, but the counter is read out each time that the
counter is increased. However, the value read out need not
be accurate except that it must distinguish approximately
Quick Quiz 5.1: p.49
between values below the limit and values greater than or
Why should efficient and scalable counting be hard??? equal to the limit. See Section 5.3. q
After all, computers have special hardware for the sole
purpose of doing counting!!!
Quick Quiz 5.4: p.49
Answer:
Because the straightforward counting algorithms, for ex- Exact structure-allocation limit problem. Suppose
ample, atomic operations on a shared counter, either are that you need to maintain a count of the number of
slow and scale badly, or are inaccurate, as will be seen in structures allocated in order to fail any allocations once
Section 5.1. q the number of structures in use exceeds an exact limit
(again, say 10,000). Suppose further that these structures
p.49
are short-lived, and that the limit is rarely exceeded, that
Quick Quiz 5.2:
there is almost always at least one structure in use, and
Network-packet counting problem. Suppose that you
suppose further still that it is necessary to know exactly
need to collect statistics on the number of networking
when this counter reaches zero, for example, in order to
packets transmitted and received. Packets might be
free up some memory that is not required unless there is
transmitted or received by any CPU on the system.
at least one structure in use.
Suppose further that your system is capable of handling
millions of packets per second per CPU, and that a Answer:
systems-monitoring package reads the count every five Hint: The act of updating the counter must once again be
seconds. How would you implement this counter? blazingly fast, but the counter is read out each time that
Answer: the counter is increased. However, the value read out need
Hint: The act of updating the counter must be blazingly not be accurate except that it absolutely must distinguish
fast, but because the counter is read out only about once perfectly between values between the limit and zero on
in five million updates, the act of reading out the counter the one hand, and values that either are less than or equal
can be quite slow. In addition, the value read out normally to zero or are greater than or equal to the limit on the other
need not be all that accurate—after all, since the counter hand. See Section 5.4. q
is updated a thousand times per millisecond, we should
be able to work with a value that is within a few thousand p.49
Quick Quiz 5.5:
counts of the “true value”, whatever “true value” might
Removable I/O device access-count problem. Sup-
mean in this context. However, the value read out should
pose that you need to maintain a reference count on a
maintain roughly the same absolute error over time. For
heavily used removable mass-storage device, so that you
example, a 1 % error might be just fine when the count
can tell the user when it is safe to remove the device. As
is on the order of a million or so, but might be abso-
usual, the user indicates a desire to remove the device,
lutely unacceptable once the count reaches a trillion. See
and the system tells the user when it is safe to do so.
Section 5.2. q

v2022.09.25a
E.5. COUNTING 477

Answer: p.50
Quick Quiz 5.8:
Hint: Yet again, the act of updating the counter must be
The 8-figure accuracy on the number of failures indicates
blazingly fast and scalable in order to avoid slowing down
that you really did test this. Why would it be necessary
I/O operations, but because the counter is read out only
to test such a trivial program, especially when the bug
when the user wishes to remove the device, the counter
is easily seen by inspection?
read-out operation can be extremely slow. Furthermore,
there is no need to be able to read out the counter at all Answer:
unless the user has already indicated a desire to remove the Not only are there very few trivial parallel programs, and
device. In addition, the value read out need not be accurate most days I am not so sure that there are many trivial
except that it absolutely must distinguish perfectly between sequential programs, either.
non-zero and zero values, and even then only when the No matter how small or simple the program, if you
device is in the process of being removed. However, once haven’t tested it, it does not work. And even if you have
it has read out a zero value, it must act to keep the value at tested it, Murphy’s Law says that there will be at least a
zero until it has taken some action to prevent subsequent few bugs still lurking.
threads from gaining access to the device being removed. Furthermore, while proofs of correctness certainly do
See Section 5.4.6. q have their place, they never will replace testing, including
the counttorture.h test setup used here. After all,
p.50 proofs are only as good as the assumptions that they are
Quick Quiz 5.6:
based on. Finally, proofs can be every bit as buggy as are
One thing that could be simpler is ++ instead of that
programs! q
concatenation of READ_ONCE() and WRITE_ONCE().
Why all that extra typing???
Quick Quiz 5.9: p.50
Answer: Why doesn’t the horizontal dashed line on the x axis
See Section 4.3.4.1 on page 40 for more information meet the diagonal line at 𝑥 = 1?
on how the compiler can cause trouble, as well as how
READ_ONCE() and WRITE_ONCE() can avoid this trouble. Answer:
q Because of the overhead of the atomic operation. The
dashed line on the x axis represents the overhead of a single
non-atomic increment. After all, an ideal algorithm would
Quick Quiz 5.7: p.50 not only scale linearly, it would also incur no performance
But can’t a smart compiler prove that line 5 of Listing 5.1 penalty compared to single-threaded code.
is equivalent to the ++ operator and produce an x86 add- This level of idealism may seem severe, but if it is good
to-memory instruction? And won’t the CPU cache cause enough for Linus Torvalds, it is good enough for you. q
this to be atomic?
Quick Quiz 5.10: p.50
Answer:
But atomic increment is still pretty fast. And incre-
Although the ++ operator could be atomic, there is no
menting a single variable in a tight loop sounds pretty
requirement that it be so unless it is applied to a C11
unrealistic to me, after all, most of the program’s exe-
_Atomic variable. And indeed, in the absence of _
cution should be devoted to actually doing work, not
Atomic, GCC often chooses to load the value to a register,
accounting for the work it has done! Why should I care
increment the register, then store the value to memory,
about making this go faster?
which is decidedly non-atomic.
Furthermore, note the volatile casts in READ_ONCE() Answer:
and WRITE_ONCE(), which tell the compiler that the In many cases, atomic increment will in fact be fast enough
location might well be an MMIO device register. Because for you. In those cases, you should by all means use atomic
MMIO registers are not cached, it would be unwise for increment. That said, there are many real-world situations
the compiler to assume that the increment operation is where more elaborate counting algorithms are required.
atomic. q The canonical example of such a situation is counting
packets and bytes in highly optimized networking stacks,

v2022.09.25a
478 APPENDIX E. ANSWERS TO QUICK QUIZZES

where it is all too easy to find much of the execution time CPU 0 CPU 1 CPU 2 CPU 3
going into these sorts of accounting tasks, especially on
Cache Cache Cache Cache
large multiprocessors.
Interconnect Interconnect
In addition, as noted at the beginning of this chap-
ter, counting provides an excellent view of the issues
encountered in shared-memory parallel programs. q Memory System Interconnect Memory

Quick Quiz 5.11: p.51


Interconnect Interconnect
But why can’t CPU designers simply ship the addition
Cache Cache Cache Cache
operation to the data, avoiding the need to circulate the
CPU 4 CPU 5 CPU 6 CPU 7
cache line containing the global variable being incre-
mented?
Figure E.1: Data Flow For Global Combining-Tree
Answer: Atomic Increment
It might well be possible to do this in some cases. However,
there are a few complications:
practical. Nevertheless, we will see that in some important
1. If the value of the variable is required, then the thread special cases, software can do much better. q
will be forced to wait for the operation to be shipped
to the data, and then for the result to be shipped back.
Quick Quiz 5.12: p.51
2. If the atomic increment must be ordered with respect But doesn’t the fact that C’s “integers” are limited in
to prior and/or subsequent operations, then the thread size complicate things?
will be forced to wait for the operation to be shipped
to the data, and for an indication that the operation Answer:
completed to be shipped back. No, because modulo addition is still commutative and
associative. At least as long as you use unsigned integers.
3. Shipping operations among CPUs will likely require Recall that in the C standard, overflow of signed integers
more lines in the system interconnect, which will results in undefined behavior, never mind the fact that
consume more die area and more electrical power. machines that do anything other than wrap on overflow are
But what if neither of the first two conditions holds? Then quite rare these days. Unfortunately, compilers frequently
you should think carefully about the algorithms discussed carry out optimizations that assume that signed integers
in Section 5.2, which achieve near-ideal performance on will not overflow, so if your code allows signed integers
commodity hardware. to overflow, you can run into trouble even on modern
If either or both of the first two conditions hold, there twos-complement hardware.
is some hope for improved hardware. One could imagine That said, one potential source of additional complex-
the hardware implementing a combining tree, so that the ity arises when attempting to gather (say) a 64-bit sum
increment requests from multiple CPUs are combined by from 32-bit per-thread counters. Dealing with this added
the hardware into a single addition when the combined complexity is left as an exercise for the reader, for whom
request reaches the hardware. The hardware could also some of the techniques introduced later in this chapter
apply an order to the requests, thus returning to each CPU could be quite helpful. q
the return value corresponding to its particular atomic
increment. This results in instruction latency that varies Quick Quiz 5.13: p.51
as O (log 𝑁), where 𝑁 is the number of CPUs, as shown An array??? But doesn’t that limit the number of threads?
in Figure E.1. And CPUs with this sort of hardware
optimization started to appear in 2011.
This is a great improvement over the O (𝑁) perfor- Answer:
mance of current hardware shown in Figure 5.2, and it is It can, and in this toy implementation, it does. But it is
possible that hardware latencies might decrease further if not that hard to come up with an alternative implemen-
innovations such as three-dimensional fabrication prove tation that permits an arbitrary number of threads, for

v2022.09.25a
E.5. COUNTING 479

example, using C11’s _Thread_local facility, as shown In the worst case, the read operation completes immedi-
in Section 5.2.3. q ately, but is then delayed for 𝛥 time units before returning,
in which case the worst-case error is simply 𝑟 𝛥.
p.52 This worst-case behavior is rather unlikely, so let us
Quick Quiz 5.14:
instead consider the case where the reads from each of
What other nasty optimizations could GCC apply?
the 𝑁 counters is spaced equally over the time period 𝛥.
There will be 𝑁 + 1 intervals of duration 𝑁𝛥+1 between
Answer:
the 𝑁 reads. The error due to the delay after the read
See Sections 4.3.4.1 and 15.3 for more information. One
from the last thread’s counter will be given by 𝑁 (𝑟𝑁𝛥+1) ,
nasty optimization would be to apply common subexpres-
sion elimination to successive calls to the read_count() the second-to-last thread’s counter by 𝑁 (2𝑟𝑁𝛥+1) , the third-
function, which might come as a surprise to code expect- to-last by 𝑁 (3𝑟𝑁𝛥+1) , and so on. The total error is given by
ing changes in the values returned from successive calls the sum of the errors due to the reads from each thread’s
to that function. q counter, which is:

𝑁
Quick Quiz 5.15: p.52 𝑟𝛥 ∑︁
𝑖 (E.1)
How does the per-thread counter variable in Listing 5.3 𝑁 (𝑁 + 1) 𝑖=1
get initialized?
Expressing the summation in closed form yields:
Answer: 𝑟𝛥 𝑁 (𝑁 + 1)
The C standard specifies that the initial value of global (E.2)
𝑁 (𝑁 + 1) 2
variables is zero, unless they are explicitly initialized,
thus implicitly initializing all the instances of counter Canceling yields the intuitively expected result:
to zero. Besides, in the common case where the user is
interested only in differences between consecutive reads 𝑟𝛥
(E.3)
from statistical counters, the initial value is irrelevant. q 2
It is important to remember that error continues accu-
p.52
mulating as the caller executes code making use of the
Quick Quiz 5.16:
count returned by the read operation. For example, if the
How is the code in Listing 5.3 supposed to permit more
caller spends time 𝑡 executing some computation based
than one counter?
on the result of the returned count, the worst-case error
Answer: will have increased to 𝑟 ( 𝛥 + 𝑡).
Indeed, this toy example does not support more than one The expected error will have similarly increased to:
counter. Modifying it so that it can provide multiple  
counters is left as an exercise to the reader. q 𝛥
𝑟 +𝑡 (E.4)
2

Quick Quiz 5.17: p.52 Of course, it is sometimes unacceptable for the counter
The read operation takes time to sum up the per-thread to continue incrementing during the read operation. Sec-
values, and during that time, the counter could well tion 5.4.6 discusses a way to handle this situation.
be changing. This means that the value returned by Thus far, we have been considering a counter that is
read_count() in Listing 5.3 will not necessarily be only increased, never decreased. If the counter value is
exact. Assume that the counter is being incremented at being changed by 𝑟 counts per unit time, but in either
rate 𝑟 counts per unit time, and that read_count()’s direction, we should expect the error to reduce. However,
execution consumes 𝛥 units of time. What is the expected the worst case is unchanged because although the counter
error in the return value? could move in either direction, the worst case is when the
read operation completes immediately, but then is delayed
Answer: for 𝛥 time units, during which time all the changes in the
Let’s do worst-case analysis first, followed by a less con- counter’s value move it in the same direction, again giving
servative analysis. us an absolute error of 𝑟 𝛥.

v2022.09.25a
480 APPENDIX E. ANSWERS TO QUICK QUIZZES

There are a number of ways to compute the average the corresponding CPU exists yet, or indeed, whether the
error, based on a variety of assumptions about the patterns corresponding CPU will ever exist.
of increments and decrements. For simplicity, let’s assume A key limitation that the Linux kernel imposes is a
that the 𝑓 fraction of the operations are decrements, and compile-time maximum bound on the number of CPUs,
that the error of interest is the deviation from the counter’s namely, CONFIG_NR_CPUS, along with a typically tighter
long-term trend line. Under this assumption, if 𝑓 is less boot-time bound of nr_cpu_ids. In contrast, in user
than or equal to 0.5, each decrement will be canceled by space, there is not necessarily a hard-coded upper limit
an increment, so that 2 𝑓 of the operations will cancel each on the number of threads.
other, leaving 1 − 2 𝑓 of the operations being uncanceled Of course, both environments must handle dynamically
increments. On the other hand, if 𝑓 is greater than 0.5, 1− 𝑓 loaded code (dynamic libraries in user space, kernel mod-
of the decrements are canceled by increments, so that the ules in the Linux kernel), which increases the complexity
counter moves in the negative direction by −1 + 2 (1 − 𝑓 ), of per-thread variables.
which simplifies to 1 − 2 𝑓 , so that the counter moves an
These complications make it significantly harder for
average of 1 − 2 𝑓 per operation in either case. Therefore,
user-space environments to provide access to other threads’
that the long-term movement of the counter is given by
per-thread variables. Nevertheless, such access is highly
(1 − 2 𝑓 ) 𝑟. Plugging this into Eq. E.3 yields:
useful, and it is hoped that it will someday appear.
(1 − 2 𝑓 ) 𝑟 𝛥 In the meantime, textbook examples such as this one can
(E.5) use arrays whose limits can be easily adjusted by the user.
2
Alternatively, such arrays can be dynamically allocated
All that aside, in most uses of statistical counters, the and expanded as needed at runtime. Finally, variable-
error in the value returned by read_count() is irrelevant. length data structures such as linked lists can be used, as
This irrelevance is due to the fact that the time required for is done in the userspace RCU library [Des09b, DMS+ 12].
read_count() to execute is normally extremely small This last approach can also reduce false sharing in some
compared to the time interval between successive calls to cases. q
read_count(). q
Quick Quiz 5.19: p.53

Quick Quiz 5.18: p.53 Doesn’t the check for NULL on line 19 of Listing 5.4 add
Doesn’t that explicit counterp array in Listing 5.4 extra branch mispredictions? Why not have a variable set
reimpose an arbitrary limit on the number of threads? permanently to zero, and point unused counter-pointers
Why doesn’t the C language provide a per_thread() to that variable rather than setting them to NULL?
interface, similar to the Linux kernel’s per_cpu() prim-
itive, to allow threads to more easily access each others’ Answer:
per-thread variables? This is a reasonable strategy. Checking for the perfor-
mance difference is left as an exercise for the reader.
Answer: However, please keep in mind that the fastpath is not
Why indeed? read_count(), but rather inc_count(). q
To be fair, user-mode thread-local storage faces some
challenges that the Linux kernel gets to ignore. When Quick Quiz 5.20: p.53
a user-level thread exits, its per-thread variables all dis- Why on earth do we need something as heavyweight as
appear, which complicates the problem of per-thread- a lock guarding the summation in the function read_
variable access, particularly before the advent of user-level count() in Listing 5.4?
RCU (see Section 9.5). In contrast, in the Linux kernel,
when a CPU goes offline, that CPU’s per-CPU variables Answer:
remain mapped and accessible. Remember, when a thread exits, its per-thread variables
Similarly, when a new user-level thread is created, its disappear. Therefore, if we attempt to access a given
per-thread variables suddenly come into existence. In thread’s per-thread variables after that thread exits, we will
contrast, in the Linux kernel, all per-CPU variables are get a segmentation fault. The lock coordinates summation
mapped and initialized at boot time, regardless of whether and thread exit, preventing this scenario.

v2022.09.25a
E.5. COUNTING 481

Of course, we could instead read-acquire a reader-writer Listing E.1: Per-Thread Statistical Counters With Lockless
lock, but Chapter 9 will introduce even lighter-weight Summation
1 unsigned long __thread counter = 0;
mechanisms for implementing the required coordination. 2 unsigned long *counterp[NR_THREADS] = { NULL };
Another approach would be to use an array instead of 3 int finalthreadcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
a per-thread variable, which, as Alexey Roytman notes, 5

would eliminate the tests against NULL. However, array 6 static __inline__ void inc_count(void)
7 {
accesses are often slower than accesses to per-thread 8 WRITE_ONCE(counter, counter + 1);
variables, and use of an array would imply a fixed upper 9 }
10
bound on the number of threads. Also, note that neither 11 static __inline__ unsigned long read_count(void)
tests nor locks are needed on the inc_count() fastpath. 12 /* need to tweak counttorture! */
13 {
q 14 int t;
15 unsigned long sum = 0;
16

p.53 17 for_each_thread(t)
Quick Quiz 5.21: 18 if (READ_ONCE(counterp[t]) != NULL)
Why on earth do we need to acquire the lock in count_ 19 sum += READ_ONCE(*counterp[t]);
20 return sum;
register_thread() in Listing 5.4? It is a single 21 }
properly aligned machine-word store to a location that 22
23 void count_register_thread(unsigned long *p)
no other thread is modifying, so it should be atomic 24 {
anyway, right? 25 WRITE_ONCE(counterp[smp_thread_id()], &counter);
26 }
27
Answer: 28 void count_unregister_thread(int nthreadsexpected)
This lock could in fact be omitted, but better safe than 29 {
30 spin_lock(&final_mutex);
sorry, especially given that this function is executed only 31 finalthreadcount++;
at thread startup, and is therefore not on any critical path. 32 spin_unlock(&final_mutex);
33 while (READ_ONCE(finalthreadcount) < nthreadsexpected)
Now, if we were testing on machines with thousands of 34 poll(NULL, 0, 1);
CPUs, we might need to omit the lock, but on machines 35 }

with “only” a hundred or so CPUs, there is no need to get


fancy. q
Quick Quiz 5.23: p.55
Why doesn’t inc_count() in Listing 5.5 need to use
Quick Quiz 5.22: p.53
atomic instructions? After all, we now have multiple
Fine, but the Linux kernel doesn’t have to acquire a threads accessing the per-thread counters!
lock when reading out the aggregate value of per-CPU
counters. So why should user-space code need to do Answer:
this??? Because one of the two threads only reads, and because
the variable is aligned and machine-sized, non-atomic
Answer: instructions suffice. That said, the READ_ONCE() macro
Remember, the Linux kernel’s per-CPU variables are is used to prevent compiler optimizations that might
always accessible, even if the corresponding CPU is otherwise prevent the counter updates from becoming
offline—even if the corresponding CPU never existed and visible to eventual().5
never will exist. An older version of this algorithm did in fact use atomic
One workaround is to ensure that each thread contin- instructions, kudos to Ersoy Bayramoglu for pointing out
ues to exist until all threads are finished, as shown in that they are in fact unnecessary. However, note that
Listing E.1 (count_tstat.c). Analysis of this code is on a 32-bit system, the per-thread counter variables
left as an exercise to the reader, however, please note might need to be limited to 32 bits in order to sum them
that it requires tweaks in the counttorture.h counter- accurately, but with a 64-bit global_count variable to
evaluation scheme. (Hint: See #ifndef KEEP_GCC_ avoid overflow. In this case, it is necessary to zero the per-
THREAD_LOCAL.) Chapter 9 will introduce synchroniza- thread counter variables periodically in order to avoid
tion mechanisms that handle this situation in a much more overflow, which does require atomic instructions. It is
graceful manner. q
5A simple definition of READ_ONCE() is shown in Listing 4.9.

v2022.09.25a
482 APPENDIX E. ANSWERS TO QUICK QUIZZES

extremely important to note that this zeroing cannot be eventually-consistent counters, which would limit the
delayed too long or overflow of the smaller per-thread overhead to a single CPU, but would result in increas-
variables will result. This approach therefore imposes ing update-to-read latencies as the number of counters
real-time requirements on the underlying system, and in increased. Alternatively, that single thread could track
turn must be used with extreme care. the update rates of the counters, visiting the frequently-
In contrast, if all variables are the same size, overflow updated counters more frequently. In addition, the num-
of any variable is harmless because the eventual sum will ber of threads handling the counters could be set to some
be modulo the word size. q fraction of the total number of CPUs, and perhaps also
adjusted at runtime. Finally, each counter could specify
Quick Quiz 5.24: p.55 its latency, and deadline-scheduling techniques could be
Won’t the single global thread in the function used to provide the required latencies to each counter.
eventual() of Listing 5.5 be just as severe a bottleneck There are no doubt many other tradeoffs that could be
as a global lock would be? made. q

Answer:
Quick Quiz 5.27: p.55
In this case, no. What will happen instead is that as the
number of threads increases, the estimate of the counter What is the accuracy of the estimate returned by read_
value returned by read_count() will become more in- count() in Listing 5.5?
accurate. q
Answer:
p.55 A straightforward way to evaluate this estimate is to use
Quick Quiz 5.25:
the analysis derived in Quick Quiz 5.17, but set 𝛥 to the
Won’t the estimate returned by read_count() in List-
interval between the beginnings of successive runs of the
ing 5.5 become increasingly inaccurate as the number
eventual() thread. Handling the case where a given
of threads rises?
counter has multiple eventual() threads is left as an
Answer: exercise for the reader. q
Yes. If this proves problematic, one fix is to provide
multiple eventual() threads, each covering its own
Quick Quiz 5.28: p.55
subset of the other threads. In more extreme cases, a tree-
like hierarchy of eventual() threads might be required. What fundamental difference is there between counting
q packets and counting the total number of bytes in the
packets, given that the packets vary in size?
Quick Quiz 5.26: p.55
Answer:
Given that in the eventually-consistent algorithm shown When counting packets, the counter is only incremented
in Listing 5.5 both reads and updates have extremely by the value one. On the other hand, when counting bytes,
low overhead and are extremely scalable, why would the counter might be incremented by largish numbers.
anyone bother with the implementation described in
Why does this matter? Because in the increment-by-one
Section 5.2.2, given its costly read-side code?
case, the value returned will be exact in the sense that the
Answer: counter must necessarily have taken on that value at some
The thread executing eventual() consumes CPU time. point in time, even if it is impossible to say precisely when
As more of these eventually-consistent counters are added, that point occurred. In contrast, when counting bytes, two
the resulting eventual() threads will eventually con- different threads might return values that are inconsistent
sume all available CPUs. This implementation therefore with any global ordering of operations.
suffers a different sort of scalability limitation, with the To see this, suppose that thread 0 adds the value three to
scalability limit being in terms of the number of eventually its counter, thread 1 adds the value five to its counter, and
consistent counters rather than in terms of the number of threads 2 and 3 sum the counters. If the system is “weakly
threads or CPUs. ordered” or if the compiler uses aggressive optimizations,
Of course, it is possible to make other tradeoffs. For thread 2 might find the sum to be three and thread 3 might
example, a single thread could be created to handle all find the sum to be five. The only possible global orders of

v2022.09.25a
E.5. COUNTING 483

the sequence of values of the counter are 0,3,8 and 0,5,8, Try the formulation in Listing 5.8 with counter equal
and neither order is consistent with the results obtained. to 10 and delta equal to ULONG_MAX. Then try it again
If you missed this one, you are not alone. Michael Scott with the code shown in Listing 5.7.
used this question to stump Paul E. McKenney during A good understanding of integer overflow will be re-
Paul’s Ph.D. defense. q quired for the rest of this example, so if you have never
dealt with integer overflow before, please try several exam-
p.55
ples to get the hang of it. Integer overflow can sometimes
Quick Quiz 5.29:
be more difficult to get right than parallel algorithms! q
Given that the reader must sum all the threads’ coun-
ters, this counter-read operation could take a long time
given large numbers of threads. Is there any way that Quick Quiz 5.32: p.58
the increment operation can remain fast and scalable Why does globalize_count() zero the per-thread
while allowing readers to also enjoy not only reasonable variables, only to later call balance_count() to refill
performance and scalability, but also good accuracy? them in Listing 5.7? Why not just leave the per-thread
variables non-zero?
Answer:
One approach would be to maintain a global approxima- Answer:
tion to the value, similar to the approach described in That is in fact what an earlier version of this code did.
Section 5.2.4. Updaters would increment their per-thread But addition and subtraction are extremely cheap, and
variable, but when it reached some predefined limit, atom- handling all of the special cases that arise is quite complex.
ically add it to a global variable, then zero their per-thread Again, feel free to try it yourself, but beware of integer
variable. This would permit a tradeoff between average overflow! q
increment overhead and accuracy of the value read out. In
particular, it would allow sharp bounds on the read-side Quick Quiz 5.33: p.58
inaccuracy. Given that globalreserve counted against us in add_
Another approach makes use of the fact that readers count(), why doesn’t it count for us in sub_count()
often care only about certain transitions in value, not in in Listing 5.7?
the exact value. This approach is examined in Section 5.3.
The reader is encouraged to think up and try out other Answer:
approaches, for example, using a combining tree. q The globalreserve variable tracks the sum of all
threads’ countermax variables. The sum of these threads’
counter variables might be anywhere from zero to
Quick Quiz 5.30: p.57
globalreserve. We must therefore take a conservative
Why does Listing 5.7 provide add_count() and approach, assuming that all threads’ counter variables
sub_count() instead of the inc_count() and dec_ are full in add_count() and that they are all empty in
count() interfaces show in Section 5.2? sub_count().
But remember this question, as we will come back to it
Answer: later. q
Because structures come in different sizes. Of course,
a limit counter corresponding to a specific size of struc-
Quick Quiz 5.34: p.58
ture might still be able to use inc_count() and dec_
count(). q Suppose that one thread invokes add_count() shown
in Listing 5.7, and then another thread invokes sub_
count(). Won’t sub_count() return failure even
Quick Quiz 5.31: p.57
though the value of the counter is non-zero?
What is with the strange form of the condition on line 3
of Listing 5.7? Why not the more intuitive form of the Answer:
fastpath shown in Listing 5.8? Indeed it will! In many cases, this will be a problem,
as discussed in Section 5.3.3, and in those cases the
Answer: algorithms from Section 5.4 will likely be preferable. q
Two words. “Integer overflow.”

v2022.09.25a
484 APPENDIX E. ANSWERS TO QUICK QUIZZES

p.58 limit. To see this last point, step through the algorithm
Quick Quiz 5.35:
and watch what it does. q
Why have both add_count() and sub_count() in
Listing 5.7? Why not simply pass a negative number to
add_count()? Quick Quiz 5.38: p.61
Why is it necessary to atomically manipulate the thread’s
Answer: counter and countermax variables as a unit? Wouldn’t
Given that add_count() takes an unsigned long as its it be good enough to atomically manipulate them indi-
argument, it is going to be a bit tough to pass it a negative vidually?
number. And unless you have some anti-matter memory,
there is little point in allowing negative numbers when Answer:
counting the number of structures in use! This might well be possible, but great care is re-
All kidding aside, it would of course be possible to quired. Note that removing counter without first zeroing
combine add_count() and sub_count(), however, the countermax could result in the corresponding thread
if conditions on the combined function would be more increasing counter immediately after it was zeroed, com-
complex than in the current pair of functions, which would pletely negating the effect of zeroing the counter.
in turn mean slower execution of these fast paths. q The opposite ordering, namely zeroing countermax
and then removing counter, can also result in a non-zero
p.59
counter. To see this, consider the following sequence of
Quick Quiz 5.36:
events:
Why set counter to countermax / 2 in line 15 of List-
ing 5.9? Wouldn’t it be simpler to just take countermax 1. Thread A fetches its countermax, and finds that it
counts? is non-zero.
Answer: 2. Thread B zeroes Thread A’s countermax.
First, it really is reserving countermax counts (see
line 14), however, it adjusts so that only half of these 3. Thread B removes Thread A’s counter.
are actually in use by the thread at the moment. This
allows the thread to carry out at least countermax / 2 4. Thread A, having found that its countermax is non-
increments or decrements before having to refer back to zero, proceeds to add to its counter, resulting in a
globalcount again. non-zero value for counter.
Note that the accounting in globalcount remains
accurate, thanks to the adjustment in line 18. q Again, it might well be possible to atomically manipu-
late countermax and counter as separate variables, but
it is clear that great care is required. It is also quite likely
Quick Quiz 5.37: p.59 that doing so will slow down the fastpath.
In Figure 5.6, even though a quarter of the remaining Exploring these possibilities are left as exercises for the
count up to the limit is assigned to thread 0, only an reader. q
eighth of the remaining count is consumed, as indicated
by the uppermost dotted line connecting the center and p.61
Quick Quiz 5.39:
the rightmost configurations. Why is that?
In what way does line 7 of Listing 5.12 violate the C
Answer: standard?
The reason this happened is that thread 0’s counter Answer:
was set to half of its countermax. Thus, of the quarter It assumes eight bits per byte. This assumption does hold
assigned to thread 0, half of that quarter (one eighth) came for all current commodity microprocessors that can be
from globalcount, leaving the other half (again, one easily assembled into shared-memory multiprocessors,
eighth) to come from the remaining count. but certainly does not hold for all computer systems that
There are two purposes for taking this approach: (1) To have ever run C code. (What could you do instead in order
allow thread 0 to use the fastpath for decrements as well to comply with the C standard? What drawbacks would it
as increments and (2) To reduce the inaccuracies if all have?) q
threads are monotonically incrementing up towards the

v2022.09.25a
E.5. COUNTING 485

Quick Quiz 5.40: p.62 local_count() on line 14 of Listing 5.15 empties it?
Given that there is only one counterandmax variable,
why bother passing in a pointer to it on line 18 of
Answer:
Listing 5.12?
This other thread cannot refill its counterandmax un-
Answer: til the caller of flush_local_count() releases the
There is only one counterandmax variable per gblcnt_mutex. By that time, the caller of flush_
thread. Later, we will see code that needs to pass local_count() will have finished making use of the
other threads’ counterandmax variables to split_ counts, so there will be no problem with this other thread
counterandmax(). q refilling—assuming that the value of globalcount is
large enough to permit a refill. q
Quick Quiz 5.41: p.62
Why does merge_counterandmax() in Listing 5.12 re- Quick Quiz 5.45: p.63
turn an int rather than storing directly into an atomic_ What prevents concurrent execution of the fastpath of
t? either add_count() or sub_count() from interfer-
ing with the counterandmax variable while flush_
Answer:
local_count() is accessing it on line 27 of List-
Later, we will see that we need the int return to pass to
ing 5.15?
the atomic_cmpxchg() primitive. q
Answer:
Quick Quiz 5.42: p.62 Nothing. Consider the following three cases:
Yecch! Why the ugly goto on line 11 of Listing 5.13?
Haven’t you heard of the break statement??? 1. If flush_local_count()’s atomic_xchg() exe-
cutes before the split_counterandmax() of either
Answer: fastpath, then the fastpath will see a zero counter
Replacing the goto with a break would require keeping and countermax, and will thus transfer to the slow-
a flag to determine whether or not line 15 should return, path (unless of course delta is zero).
which is not the sort of thing you want on a fastpath. If
you really hate the goto that much, your best bet would be 2. If flush_local_count()’s atomic_xchg() ex-
to pull the fastpath into a separate function that returned ecutes after the split_counterandmax() of ei-
success or failure, with “failure” indicating a need for ther fastpath, but before that fastpath’s atomic_
the slowpath. This is left as an exercise for goto-hating cmpxchg(), then the atomic_cmpxchg() will fail,
readers. q causing the fastpath to restart, which reduces to case 1
above.
Quick Quiz 5.43: p.62
3. If flush_local_count()’s atomic_xchg() exe-
Why would the atomic_cmpxchg() primitive at
cutes after the atomic_cmpxchg() of either fast-
lines 13–14 of Listing 5.13 ever fail? After all, we
path, then the fastpath will (most likely) complete
picked up its old value on line 9 and have not changed
successfully before flush_local_count() zeroes
it!
the thread’s counterandmax variable.
Answer:
Later, we will see how the flush_local_count() Either way, the race is resolved correctly. q
function in Listing 5.15 might update this thread’s
counterandmax variable concurrently with the execu- p.64
Quick Quiz 5.46:
tion of the fastpath on lines 8–14 of Listing 5.13. q
Given that the atomic_set() primitive does a simple
store to the specified atomic_t, how can line 21 of
Quick Quiz 5.44: p.63
balance_count() in Listing 5.16 work correctly in
What stops a thread from simply refilling its face of concurrent flush_local_count() updates to
counterandmax variable immediately after flush_ this variable?

v2022.09.25a
486 APPENDIX E. ANSWERS TO QUICK QUIZZES

Answer: (c) The thread receives the signal, which also notes
The caller of both balance_count() and flush_ the REQACK state, and, because there is no
local_count() hold gblcnt_mutex, so only one may fastpath in effect, sets the state to READY.
be executing at a given time. q (d) The slowpath notes the READY state, steals the
count, and sets the state to IDLE, and completes.
Quick Quiz 5.47: p.64 (e) The fastpath sets the state to READY, disabling
But signal handlers can be migrated to some other CPU further fastpath execution for this thread.
while running. Doesn’t this possibility require that
The basic problem here is that the combined
atomic instructions and memory barriers are required
REQACK state can be referenced by both the signal
to reliably communicate between a thread and a signal
handler and the fastpath. The clear separation main-
handler that interrupts that thread?
tained by the four-state setup ensures orderly state
Answer: transitions.
No. If the signal handler is migrated to another CPU, then That said, you might well be able to make a three-state
the interrupted thread is also migrated along with it. q setup work correctly. If you do succeed, compare carefully
to the four-state setup. Is the three-state solution really
Quick Quiz 5.48: p.65 preferable, and why or why not? q
In Figure 5.7, why is the REQ theft state colored red?
Quick Quiz 5.50: p.65
In Listing 5.18’s function flush_local_count_
Answer:
sig(), why are there READ_ONCE() and WRITE_
To indicate that only the fastpath is permitted to change
ONCE() wrappers around the uses of the theft per-
the theft state, and that if the thread remains in this state
thread variable?
for too long, the thread running the slowpath will resend
the POSIX signal. q Answer:
The first one (on line 11) can be argued to be unnecessary.
Quick Quiz 5.49: p.65 The last two (lines 14 and 16) are important. If these
In Figure 5.7, what is the point of having separate REQ are removed, the compiler would be within its rights to
and ACK theft states? Why not simplify the state rewrite lines 14–16 as follows:
machine by collapsing them into a single REQACK 14 theft = THEFT_READY;
15 if (counting) {
state? Then whichever of the signal handler or the 16 theft = THEFT_ACK;
fastpath gets there first could set the state to READY. 17 }

Answer: This would be fatal, as the slowpath might see the


Reasons why collapsing the REQ and ACK states would transient value of THEFT_READY, and start stealing before
be a very bad idea include: the corresponding thread was ready. q
1. The slowpath uses the REQ and ACK states to deter-
mine whether the signal should be retransmitted. If Quick Quiz 5.51: p.65
the states were collapsed, the slowpath would have In Listing 5.18, why is it safe for line 28 to directly
no choice but to send redundant signals, which would access the other thread’s countermax variable?
have the unhelpful effect of needlessly slowing down
the fastpath. Answer:
Because the other thread is not permitted to change the
2. The following race would result: value of its countermax variable unless it holds the
gblcnt_mutex lock. But the caller has acquired this
(a) The slowpath sets a given thread’s state to lock, so it is not possible for the other thread to hold it,
REQACK. and therefore the other thread is not permitted to change
(b) That thread has just finished its fastpath, and its countermax variable. We can therefore safely access
notes the REQACK state. it—but not change it. q

v2022.09.25a
E.5. COUNTING 487

Quick Quiz 5.52: p.65 Quick Quiz 5.56: p.68


In Listing 5.18, why doesn’t line 33 check for the current What if you want an exact limit counter to be exact only
thread sending itself a signal? for its lower limit, but to allow the upper limit to be
inexact?
Answer: Answer:
There is no need for an additional check. The One simple solution is to overstate the upper limit by the
caller of flush_local_count() has already invoked desired amount. The limiting case of such overstatement
globalize_count(), so the check on line 28 will have results in the upper limit being set to the largest value that
succeeded, skipping the later pthread_kill(). q the counter is capable of representing. q

p.65 Quick Quiz 5.57: p.68


Quick Quiz 5.53:
The code shown in Listings 5.17 and 5.18 works with What else had you better have done when using a biased
GCC and POSIX. What would be required to make it counter?
also conform to the ISO C standard?
Answer:
You had better have set the upper limit to be large enough
Answer: accommodate the bias, the expected maximum number
The theft variable must be of type sig_atomic_t to of accesses, and enough “slop” to allow the counter to
guarantee that it can be safely shared between the signal work efficiently even when the number of accesses is at
handler and the code interrupted by the signal. q its maximum. q

Quick Quiz 5.58: p.68


Quick Quiz 5.54: p.66
This is ridiculous! We are read-acquiring a reader-writer
In Listing 5.18, why does line 41 resend the signal?
lock to update the counter? What are you playing at???

Answer:
Because many operating systems over several decades have Answer:
had the property of losing the occasional signal. Whether Strange, perhaps, but true! Almost enough to make you
this is a feature or a bug is debatable, but irrelevant. The think that the name “reader-writer lock” was poorly chosen,
obvious symptom from the user’s viewpoint will not be a isn’t it? q
kernel bug, but rather a user application hanging.
Quick Quiz 5.59: p.68
Your user application hanging! q
What other issues would need to be accounted for in a
real system?
Quick Quiz 5.55: p.68
Answer:
Not only are POSIX signals slow, sending one to each A huge number!
thread simply does not scale. What would you do if you
Here are a few to start with:
had (say) 10,000 threads and needed the read side to be
fast?
1. There could be any number of devices, so that the
global variables are inappropriate, as are the lack of
Answer: arguments to functions like do_io().
One approach is to use the techniques shown in Sec-
tion 5.2.4, summarizing an approximation to the overall 2. Polling loops can be problematic in real systems,
counter value in a single variable. Another approach wasting CPU time and energy. In many cases, an
would be to use multiple threads to carry out the reads, event-driven design is far better, for example, where
with each such thread interacting with a specific subset of the last completing I/O wakes up the device-removal
the updating threads. q thread.

v2022.09.25a
488 APPENDIX E. ANSWERS TO QUICK QUIZZES

3. The I/O might fail, and so do_io() will likely need Table 5.1, we should always prefer signals over atomic
a return value. operations, right?
4. If the device fails, the last I/O might never complete.
Answer:
In such cases, there might need to be some sort of
That depends on the workload. Note that on a 64-core
timeout to allow error recovery.
system, you need more than one hundred non-atomic
5. Both add_count() and sub_count() can fail, but operations (with roughly a 40-nanosecond performance
their return values are not checked. gain) to make up for even one signal (with almost a 5-
microsecond performance loss). Although there are no
6. Reader-writer locks do not scale well. One way of shortage of workloads with far greater read intensity, you
avoiding the high read-acquisition costs of reader- will need to consider your particular workload.
writer locks is presented in Chapters 7 and 9. q In addition, although memory barriers have historically
been expensive compared to ordinary instructions, you
should check this on the specific hardware you will be
Quick Quiz 5.60: p.69
running. The properties of computer hardware do change
On the count_stat.c row of Table 5.1, we see that over time, and algorithms must change accordingly. q
the read-side scales linearly with the number of threads.
How is that possible given that the more threads there
are, the more per-thread counters must be summed up? Quick Quiz 5.63: p.69
Can advanced techniques be applied to address the
lock contention for readers seen in the bottom half of
Answer: Table 5.1?
The read-side code must scan the entire fixed-size array, re-
gardless of the number of threads, so there is no difference Answer:
in performance. In contrast, in the last two algorithms, One approach is to give up some update-side perfor-
readers must do more work when there are more threads. mance, as is done with scalable non-zero indicators
In addition, the last two algorithms interpose an additional (SNZI) [ELLM07]. There are a number of other ways one
level of indirection because they map from integer thread might go about this, and these are left as exercises for the
ID to the corresponding _Thread_local variable. q reader. Any number of approaches that apply hierarchy,
which replace frequent global-lock acquisitions with local
Quick Quiz 5.61: p.69 lock acquisitions corresponding to lower levels of the
Even on the fourth row of Table 5.1, the read-side hierarchy, should work quite well. q
performance of these statistical counter implementations
is pretty horrible. So why bother with them? p.70
Quick Quiz 5.64:
Answer: The ++ operator works just fine for 1,000-digit numbers!
“Use the right tool for the job.” Haven’t you heard of operator overloading???
As can be seen from Figure 5.1, single-variable atomic
increment need not apply for any job involving heavy use of Answer:
parallel updates. In contrast, the algorithms shown in the In the C++ language, you might well be able to use ++ on
top half of Table 5.1 do an excellent job of handling update- a 1,000-digit number, assuming that you had access to a
heavy situations. Of course, if you have a read-mostly class implementing such numbers. But as of 2021, the C
situation, you should use something else, for example, an language does not permit operator overloading. q
eventually consistent design featuring a single atomically
incremented variable that can be read out using a single p.70
Quick Quiz 5.65:
load, similar to the approach used in Section 5.2.4. q
But if we are going to have to partition everything, why
bother with shared-memory multithreading? Why not
Quick Quiz 5.62: p.69 just partition the problem completely and run as multiple
Given the performance data shown in the bottom half of processes, each in its own address space?

v2022.09.25a
E.6. PARTITIONING AND SYNCHRONIZATION DESIGN 489

Answer:
P1
Indeed, multiple processes with separate address spaces
can be an excellent way to exploit parallelism, as the
proponents of the fork-join methodology and the Erlang
language would be very quick to tell you. However, there
are also some advantages to shared-memory parallelism: P5 P2

1. Only the most performance-critical portions of the


application must be partitioned, and such portions
are usually a small fraction of the application.
2. Although cache misses are quite slow compared
to individual register-to-register instructions, they
are typically considerably faster than inter-process-
P4 P3
communication primitives, which in turn are consid-
erably faster than things like TCP/IP networking.
Figure E.2: Dining Philosophers Problem, Fully Parti-
3. Shared-memory multiprocessors are readily available tioned
and quite inexpensive, so, in stark contrast to the
1990s, there is little cost penalty for use of shared-
memory parallelism. First, for algorithms in which picking up left-hand and
right-hand forks are separate operations, start with all
As always, use the right tool for the job! q
forks on the table. Then have all philosophers attempt to
pick up their first fork. Once all philosophers either have
their first fork or are waiting for someone to put down
E.6 Partitioning and Synchroniza- their first fork, have each non-waiting philosopher pick
up their second fork. At this point in any starvation-free
tion Design solution, at least one philosopher will be eating. If there
were any waiting philosophers, repeat this test, preferably
Quick Quiz 6.1: p.75 imposing random variations in timing.
Is there a better solution to the Dining Philosophers Second, create a stress test in which philosphers start
Problem? and stop eating at random times. Generate starvation and
fairness conditions and verify that these conditions are
Answer: met. Here are a couple of example starvation and fairness
One such improved solution is shown in Figure E.2, where conditions:
the philosophers are simply provided with an additional
five forks. All five philosophers may now eat simultane- 1. If all other philosophers have stopped eating 𝑁 times
ously, and there is never any need for philosophers to wait since a given philosopher attempted to pick up a
on one another. In addition, this approach offers greatly given fork, that philosopher should have succeeded
improved disease control. in picking up that fork. For high-quality solutions
This solution might seem like cheating to some, but using high-quality locking primitives (or high-quality
such “cheating” is key to finding good solutions to many atomic operations), 𝑁 = 1 is doable.
concurrency problems. q
2. Given an upper bound 𝑇 on the time any philosopher
holds onto both forks before putting them down, the
Quick Quiz 6.2: p.75 maximum waiting time for any philosopher should
How would you valididate an algorithm alleged to solve be bounded by 𝑁𝑇 for some 𝑁 that is not hugely
the Dining Philosophers Problem? larger than the number of philosophers.
Answer: 3. Generate some statistic representing the time from
Much depends on the details of the algorithm, but here when philosophers attempt to pick up their first fork
are a couple of places to start. to the time when they start eating. The smaller this

v2022.09.25a
490 APPENDIX E. ANSWERS TO QUICK QUIZZES

statistic, the better the solution. Mean, median, and p.78


Quick Quiz 6.6:
maximum are all useful statistics, but examining the
Move all the elements to the queue that became empty?
full distribution can also be enlightening.
In what possible universe is this brain-dead solution in
Readers are encouraged to actually try testing any of any way optimal???
the solutions presented in this book, and especially testing
Answer:
solutions of their own devising. q
It is optimal in the case where data flow switches direction
only rarely. It would of course be an extremely poor
Quick Quiz 6.3: p.75 choice if the double-ended queue was being emptied from
And in just what sense can this “horizontal parallelism” both ends concurrently. This of course raises another
be said to be “horizontal”? question, namely, in what possible universe emptying
from both ends concurrently would be a reasonable thing
Answer: to do. Work-stealing queues are one possible answer to
Inman was working with protocol stacks, which are nor- this question. q
mally depicted vertically, with the application on top and
the hardware interconnect on the bottom. Data flows up
and down this stack. “Horizontal parallelism” processes Quick Quiz 6.7: p.78
packets from different network connections in parallel, Why can’t the compound parallel double-ended queue
while “vertical parallelism” handles different protocol- implementation be symmetric?
processing steps for a given packet in parallel.
“Vertical parallelism” is also called “pipelining”. q Answer:
The need to avoid deadlock by imposing a lock hierarchy
p.76
forces the asymmetry, just as it does in the fork-numbering
Quick Quiz 6.4:
solution to the Dining Philosophers Problem (see Sec-
In this compound double-ended queue implementation,
tion 6.1.1). q
what should be done if the queue has become non-empty
while releasing and reacquiring the lock?
Quick Quiz 6.8: p.79
Answer: Why is it necessary to retry the right-dequeue operation
In this case, simply dequeue an item from the non-empty on line 28 of Listing 6.3?
queue, release both locks, and return. q
Answer:
Quick Quiz 6.5: p.78 This retry is necessary because some other thread might
Is the hashed double-ended queue a good solution? Why have enqueued an element between the time that this
or why not? thread dropped d->rlock on line 25 and the time that it
reacquired this same lock on line 27. q
Answer:
The best way to answer this is to run lockhdeq.c on p.79
Quick Quiz 6.9:
a number of different multiprocessor systems, and you
Surely the left-hand lock must sometimes be available!!!
are encouraged to do so in the strongest possible terms.
So why is it necessary that line 25 of Listing 6.3 uncon-
One reason for concern is that each operation on this
ditionally release the right-hand lock?
implementation must acquire not one but two locks.
The first well-designed performance study will be
Answer:
cited.6 Do not forget to compare to a sequential im-
It would be possible to use spin_trylock() to attempt to
plementation! q
acquire the left-hand lock when it was available. However,
the failure case would still need to drop the right-hand
lock and then re-acquire the two locks in order. Making
this transformation (and determining whether or not it is
6 The studies by Dalessandro et al. [DCW+ 11] and Dice et worthwhile) is left as an exercise for the reader. q
al. [DLM+ 10] are excellent starting points.

v2022.09.25a
E.6. PARTITIONING AND SYNCHRONIZATION DESIGN 491

p.79 time at each end, and in the common case requires only
Quick Quiz 6.10:
one lock acquisition per operation. Therefore, the tandem
But in the case where data is flowing in only one di-
double-ended queue should be expected to outperform the
rection, the algorithm shown in Listing 6.3 will have
hashed double-ended queue.
both ends attempting to acquire the same lock whenever
the consuming end empties its underlying double-ended Can you create a double-ended queue that allows multi-
queue. Doesn’t that mean that sometimes this algorithm ple concurrent operations at each end? If so, how? If not,
fails to provide concurrent access to both ends of the why not? q
queue even when the queue contains an arbitrarily large
number of elements?
Quick Quiz 6.13: p.80
Answer: Is there a significantly better way of handling concur-
Indeed it does! rency for double-ended queues?
But the same is true of other algorithms claiming
this property. For example, in solutions using software Answer:
transactional memory mechanisms based on hashed ar- One approach is to transform the problem to be solved so
rays of locks, the leftmost and rightmost elements’ ad- that multiple double-ended queues can be used in parallel,
dresses will sometimes happen to hash to the same lock. allowing the simpler single-lock double-ended queue to
These hash collisions will also prevent concurrent ac- be used, and perhaps also replace each double-ended
cess. For another example, solutions using hardware queue with a pair of conventional single-ended queues.
transactional memory mechanisms with software fall- Without such “horizontal scaling”, the speedup is limited
backs [YHLR13, Mer11, JSG12] often use locking within to 2.0. In contrast, horizontal-scaling designs can achieve
those software fallbacks, and thus suffer (albeit hopefully very large speedups, and are especially attractive if there
rarely) from whatever concurrency limitations that these are multiple threads working either end of the queue,
locking solutions suffer from. because in this multiple-thread case the dequeue simply
Therefore, as of 2021, all practical solutions to the cannot provide strong ordering guarantees. After all, the
concurrent double-ended queue problem fail to provide fact that a given thread removed an item first in no way
full concurrency in at least some circumstances, including implies that it will process that item first [HKLP12]. And
the compound double-ended queue. q if there are no guarantees, we may as well obtain the
performance benefits that come with refusing to provide
Quick Quiz 6.11: p.80 these guarantees.
Why are there not one but two solutions to the double- Regardless of whether or not the problem can be trans-
ended queue problem? formed to use multiple queues, it is worth asking whether
work can be batched so that each enqueue and dequeue op-
Answer: eration corresponds to larger units of work. This batching
There are actually at least three. The third, by Dominik approach decreases contention on the queue data struc-
Dingel, makes interesting use of reader-writer locking, tures, which increases both performance and scalability,
and may be found in lockrwdeq.c. q as will be seen in Section 6.3. After all, if you must incur
high synchronization overheads, be sure you are getting
p.80
your money’s worth.
Quick Quiz 6.12:
The tandem double-ended queue runs about twice as fast Other researchers are working on other ways to take ad-
as the hashed double-ended queue, even when I increase vantage of limited ordering guarantees in queues [KLP12].
the size of the hash table to an insanely large number. q
Why is that?
Quick Quiz 6.14: p.82
Answer:
The hashed double-ended queue’s locking design only Don’t all these problems with critical sections mean
permits one thread at a time at each end, and further that we should just always use non-blocking synchro-
requires two lock acquisitions for each operation. The nization [Her90], which don’t have critical sections?
tandem double-ended queue also permits one thread at a

v2022.09.25a
492 APPENDIX E. ANSWERS TO QUICK QUIZZES

Answer: 3. Use a garbage collector, in software environments


Although non-blocking synchronization can be very useful providing them, so that a structure cannot be deallo-
in some situations, it is no panacea, as discussed in cated while being referenced. This works very well,
Section 14.2. Also, non-blocking synchronization really removing the existence-guarantee burden (and much
does have critical sections, as noted by Josh Triplett. For else besides) from the developer’s shoulders, but
example, in a non-blocking algorithm based on compare- imposes the overhead of garbage collection on the
and-swap operations, the code starting at the initial load program. Although garbage-collection technology
and continuing to the compare-and-swap is analogous to has advanced considerably in the past few decades, its
a lock-based critical section. q overhead may be unacceptably high for some appli-
cations. In addition, some applications require that
Quick Quiz 6.15: p.83 the developer exercise more control over the layout
What should you do to validate a hash table? and placement of data structures than is permitted by
most garbage collected environments.
Answer:
4. As a special case of a garbage collector, use a global
Quite a bit, actually.
reference counter, or a global array of reference coun-
See Section 10.3.2 for a good starting point. q
ters. These have strengths and limitations similar to
those called out above for locks.
Quick Quiz 6.16: p.84
“Partitioning time”? Isn’t that an odd turn of phrase? 5. Use hazard pointers [Mic04a], which can be thought
of as an inside-out reference count. Hazard-pointer-
Answer: based algorithms maintain a per-thread list of point-
Perhaps so. ers, so that the appearance of a given pointer on any
But in the next section we will be partitioning space of these lists acts as a reference to the correspond-
(that is, address space) as well as time. This nomenclature ing structure. Hazard pointers are starting to see
will permit us to partition spacetime, as opposed to (say) significant production use (see Section 9.6.3.1).
partitioning space but segmenting time. q
6. Use transactional memory (TM) [HM93, Lom77,
ST95], so that each reference and modification to the
Quick Quiz 6.17: p.86 data structure in question is performed atomically.
What are some ways of preventing a structure from being Although TM has engendered much excitement in
freed while its lock is being acquired? recent years, and seems likely to be of some use
in production software, developers should exercise
Answer: some caution [BLM05, BLM06, MMW07], partic-
Here are a few possible solutions to this existence guaran- ularly in performance-critical code. In particular,
tee problem: existence guarantees require that the transaction cov-
1. Provide a statically allocated lock that is held while ers the full path from a global reference to the data
the per-structure lock is being acquired, which is an elements being updated. For more on TM, including
example of hierarchical locking (see Section 6.4.2). ways to overcome some of its weaknesses by combin-
Of course, using a single global lock for this pur- ing it with other synchronization mechanisms, see
pose can result in unacceptably high levels of lock Sections 17.2 and 17.3.
contention, dramatically reducing performance and 7. Use RCU, which can be thought of as an extremely
scalability. lightweight approximation to a garbage collector. Up-
2. Provide an array of statically allocated locks, hash- daters are not permitted to free RCU-protected data
ing the structure’s address to select the lock to be structures that RCU readers might still be referenc-
acquired, as described in Chapter 7. Given a hash ing. RCU is most heavily used for read-mostly data
function of sufficiently high quality, this avoids the structures, and is discussed at length in Section 9.5.
scalability limitations of the single global lock, but in For more on providing existence guarantees, see Chap-
read-mostly situations, the lock-acquisition overhead ters 7 and 9. q
can result in unacceptably degraded performance.

v2022.09.25a
E.6. PARTITIONING AND SYNCHRONIZATION DESIGN 493

p.86 Moral: If you have a parallel program with variable


Quick Quiz 6.18:
input, always include a check for the input size being
But won’t system boot and shutdown (or application
too small to be worth parallelizing. And when it is not
startup and shutdown) be partitioning time, even for data
helpful to parallelize, it is not helpful to incur the overhead
ownership?
required to spawn a thread, now is it? q
Answer:
You can indeed think in these terms. Quick Quiz 6.21: p.88
And if you are working on a persistent data store where What did you do to validate this matrix multiply algo-
state survives shutdown, thinking in these terms might rithm?
even be useful. q
Answer:
For this simple approach, very little.
Quick Quiz 6.19: p.88 However, the validation of production-quality matrix
How can a single-threaded 64-by-64 matrix multiple multiply requires great care and attention. Some cases
possibly have an efficiency of less than 1.0? Shouldn’t require careful handling of floating-point rounding er-
all of the traces in Figure 6.17 have efficiency of exactly rors, others involve complex sparse-matrix data structures,
1.0 when running on one thread? and still others make use of special-purpose arithmetic
hardware such as vector units or GPGPUs. Adequate
Answer: tests for handling of floating-point rounding errors can be
The matmul.c program creates the specified number of especially challenging. q
worker threads, so even the single-worker-thread case
incurs thread-creation overhead. Making the changes p.89
required to optimize away thread-creation overhead in Quick Quiz 6.22:
the single-worker-thread case is left as an exercise to the In what situation would hierarchical locking work well?
reader. q
Answer:
Quick Quiz 6.20: p.88 If the comparison on line 31 of Listing 6.8 were replaced
How are data-parallel techniques going to help with by a much heavier-weight operation, then releasing bp->
matrix multiply? It is already data parallel!!! bucket_lock might reduce lock contention enough to
outweigh the overhead of the extra acquisition and release
Answer: of cur->node_lock. q
I am glad that you are paying attention! This example
serves to show that although data parallelism can be a very Quick Quiz 6.23: p.91
good thing, it is not some magic wand that automatically Doesn’t this resource-allocator design resemble that of
wards off any and all sources of inefficiency. Linear the approximate limit counters covered in Section 5.3?
scaling at full performance, even to “only” 64 threads,
requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the Answer:
size of the partitions. For example, if you split a 64-by- Indeed it does! We are used to thinking of allocating and
64 matrix multiply across 64 threads, each thread gets freeing memory, but the algorithms in Section 5.3 are
only 64 floating-point multiplies. The cost of a floating- taking very similar actions to allocate and free “count”. q
point multiply is minuscule compared to the overhead of
thread creation, and cache-miss overhead also plays a role Quick Quiz 6.24: p.92
in spoiling the theoretically perfect scalability (and also In Figure 6.21, there is a pattern of performance rising
in making the traces so jagged). The full 448 hardware with increasing run length in groups of three samples,
threads would require a matrix with hundreds of thousands for example, for run lengths 10, 11, and 12. Why?
of rows and columns to attain good scalability, but by that
point GPGPUs become quite attractive, especially from a Answer:
price/performance viewpoint. This is due to the per-CPU target value being three. A

v2022.09.25a
494 APPENDIX E. ANSWERS TO QUICK QUIZZES

run length of 12 must acquire the global-pool lock twice,


Global Pool g-i-p(n-1)
while a run length of 13 must acquire the global-pool lock
three times. q
Per-Thread Pool i 0 p-m p-m

Quick Quiz 6.25: p.92


Per-Thread Allocation 0 0 m m
Allocation failures were observed in the two-thread tests
at run lengths of 19 and greater. Given the global-pool
size of 40 and the per-thread target pool size 𝑠 of three, n
number of threads 𝑛 equal to two, and assuming that
the per-thread pools are initially empty with none of Figure E.3: Allocator Cache Run-Length Analysis
the memory in use, what is the smallest allocation run
length 𝑚 at which failures can occur? (Recall that each
this figure, and the “extra” initializer thread’s per-thread
thread repeatedly allocates 𝑚 block of memory, and then
pool and per-thread allocations are the left-most pair of
frees the 𝑚 blocks of memory.) Alternatively, given 𝑛
boxes. The initializer thread has no blocks allocated,
threads each with pool size 𝑠, and where each thread
but has 𝑖 blocks stranded in its per-thread pool. The
repeatedly first allocates 𝑚 blocks of memory and then
rightmost two pairs of boxes are the per-thread pools and
frees those 𝑚 blocks, how large must the global pool
per-thread allocations of threads holding the maximum
size be? Note: Obtaining the correct answer will require
possible number of blocks, while the second-from-left
you to examine the smpalloc.c source code, and very
pair of boxes represents the thread currently trying to
likely single-step it as well. You have been warned!
allocate.
Answer: The total number of blocks is 𝑔, and adding up the
This solution is adapted from one put forward by Alexey per-thread allocations and per-thread pools, we see that
Roytman. It is based on the following definitions: the global pool contains 𝑔 − 𝑖 − 𝑝(𝑛 − 1) blocks. If the
allocating thread is to be successful, it needs at least 𝑚
𝑔 Number of blocks globally available. blocks in the global pool, in other words:

𝑖 Number of blocks left in the initializing thread’s per- 𝑔 − 𝑖 − 𝑝(𝑛 − 1) ≥ 𝑚 (E.8)


thread pool. (This is one reason you needed to look
at the code!) The question has 𝑔 = 40, 𝑠 = 3, and 𝑛 = 2. Equation E.7
gives 𝑖 = 4, and Eq. E.6 gives 𝑝 = 18 for 𝑚 = 18 and
𝑚 Allocation/free run length. 𝑝 = 21 for 𝑚 = 19. Plugging these into Eq. E.8 shows
𝑛 Number of threads, excluding the initialization thread. that 𝑚 = 18 will not overflow, but that 𝑚 = 19 might well
do so.
𝑝 Per-thread maximum block consumption, including The presence of 𝑖 could be considered to be a bug.
both the blocks actually allocated and the blocks After all, why allocate memory only to have it stranded in
remaining in the per-thread pool. the initialization thread’s cache? One way of fixing this
would be to provide a memblock_flush() function that
The values 𝑔, 𝑚, and 𝑛 are given. The value for 𝑝 is 𝑚 flushed the current thread’s pool into the global pool. The
rounded up to the next multiple of 𝑠, as follows: initialization thread could then invoke this function after
l𝑚m freeing all of the blocks. q
𝑝=𝑠 (E.6)
𝑠
The value for 𝑖 is as follows:
E.7 Locking

𝑔 (mod 2𝑠) = 0 : 2𝑠
𝑖= (E.7)
𝑔 (mod 2𝑠) ≠ 0 : 𝑔 (mod 2𝑠) Quick Quiz 7.1: p.101
Just how can serving as a whipping boy be considered
The relationships between these quantities are shown
to be in any way honorable???
in Figure E.3. The global pool is shown on the top of

v2022.09.25a
E.7. LOCKING 495

Answer: 2. If one of the library functions returns a pointer to a


The reason locking serves as a research-paper whipping lock that is acquired by the caller, and if the caller
boy is because it is heavily used in practice. In contrast, if acquires one of its locks while holding the library’s
no one used or cared about locking, most research papers lock, we could again have a deadlock cycle involving
would not bother even mentioning it. q both caller and library locks.

3. If one of the library functions acquires a lock and


Quick Quiz 7.2: p.102
then returns while still holding it, and if the caller
But the definition of lock-based deadlock only said that acquires one of its locks, we have yet another way
each thread was holding at least one lock and waiting to create a deadlock cycle involving both caller and
on another lock that was held by some thread. How do library locks.
you know that there is a cycle?
4. If the caller has a signal handler that acquires locks,
Answer: then the deadlock cycle can involve both caller and
Suppose that there is no cycle in the graph. We would library locks. In this case, however, the library’s locks
then have a directed acyclic graph (DAG), which would are innocent bystanders in the deadlock cycle. That
have at least one leaf node. said, please note that acquiring a lock from within a
If this leaf node was a lock, then we would have a thread signal handler is a no-no in many environments—it
that was waiting on a lock that wasn’t held by any thread, is not just a bad idea, it is unsupported. But if you
counter to the definition. In this case the thread would absolutely must acquire a lock in a signal handler,
immediately acquire the lock. be sure to block that signal while holding that same
On the other hand, if this leaf node was a thread, then lock in thread context, and also while holding any
we would have a thread that was not waiting on any lock, other locks acquired while that same lock is held. q
again counter to the definition. And in this case, the thread
would either be running or be blocked on something that is
not a lock. In the first case, in the absence of infinite-loop Quick Quiz 7.4: p.103
bugs, the thread will eventually release the lock. In the But if qsort() releases all its locks before invoking the
second case, in the absence of a failure-to-wake bug, the comparison function, how can it protect against races
thread will eventually wake up and release the lock.7 with other qsort() threads?
Therefore, given this definition of lock-based deadlock,
there must be a cycle in the corresponding graph. q Answer:
By privatizing the data elements being compared (as dis-
Quick Quiz 7.3: p.103 cussed in Chapter 8) or through use of deferral mechanisms
Are there any exceptions to this rule, so that there really such as reference counting (as discussed in Chapter 9). Or
could be a deadlock cycle containing locks from both through use of layered locking hierarchies, as described
the library and the caller, even given that the library in Section 7.1.1.3.
code never invokes any of the caller’s functions? On the other hand, changing a key in a list that is
currently being sorted is at best rather brave. q
Answer:
Indeed there are! Here are a few of them: p.105
Quick Quiz 7.5:
1. If one of the library function’s arguments is a pointer What do you mean “cannot always safely invoke the
to a lock that this library function acquires, and if the scheduler”? Either call_rcu() can or cannot safely
library function holds one of its locks while acquiring invoke the scheduler, right?
the caller’s lock, then we could have a deadlock cycle
Answer:
involving both caller and library locks.
It really does depend.
The scheduler locks are always held with interrupts
7 Of course, one type of failure-to-wake bug is a deadlock that disabled. Therefore, if call_rcu() is invoked with
involves not only locks, but also non-lock resources. But the question interrupts enabled, no scheduler locks are held, and call_
really did say “lock-based deadlock”! rcu() can safely call into the scheduler. Otherwise, if

v2022.09.25a
496 APPENDIX E. ANSWERS TO QUICK QUIZZES

interrupts are disabled, one of the scheduler locks might p.107


Quick Quiz 7.10:
be held, so call_rcu() must play it safe and refrain from
When using the “acquire needed locks first” approach de-
calling into the scheduler. q
scribed in Section 7.1.1.7, how can livelock be avoided?

Quick Quiz 7.6: p.106


Name one common situation where a pointer to a lock Answer:
is passed into a function. Provide an additional global lock. If a given thread has
repeatedly tried and failed to acquire the needed locks,
Answer: then have that thread unconditionally acquire the new
Locking primitives, of course! q global lock, and then unconditionally acquire any needed
locks. (Suggested by Doug Lea.) q
Quick Quiz 7.7: p.106
Doesn’t the fact that pthread_cond_wait() first re-
Quick Quiz 7.11: p.107
leases the mutex and then re-acquires it eliminate the
possibility of deadlock? Suppose Lock A is never acquired within a signal handler,
but Lock B is acquired both from thread context and
Answer: by signal handlers. Suppose further that Lock A is
Absolutely not! sometimes acquired with signals unblocked. Why is it
Consider a program that acquires mutex_a, and illegal to acquire Lock A holding Lock B?
then mutex_b, in that order, and then passes mutex_
a to pthread_cond_wait(). Now, pthread_cond_ Answer:
wait() will release mutex_a, but will re-acquire it before Because this would lead to deadlock. Given that Lock A
returning. If some other thread acquires mutex_a in the is sometimes held outside of a signal handler without
meantime and then blocks on mutex_b, the program will blocking signals, a signal might be handled while holding
deadlock. q this lock. The corresponding signal handler might then
acquire Lock B, so that Lock B is acquired while holding
Quick Quiz 7.8: p.106 Lock A. Therefore, if we also acquire Lock A while
Can the transformation from Listing 7.3 to Listing 7.4 holding Lock B, we will have a deadlock cycle. Note
be applied universally? that this problem exists even if signals are blocked while
holding Lock B.
Answer: This is another reason to be very careful with locks that
Absolutely not! are acquired within interrupt or signal handlers. But the
This transformation assumes that the layer_2_ Linux kernel’s lock dependency checker knows about this
processing() function is idempotent, given that it might situation and many others as well, so please do make full
be executed multiple times on the same packet when the use of it! q
layer_1() routing decision changes. Therefore, in real
life, this transformation can become arbitrarily complex.
Quick Quiz 7.12: p.107
q
How can you legally block signals within a signal han-
Quick Quiz 7.9: p.106 dler?
But the complexity in Listing 7.4 is well worthwhile Answer:
given that it avoids deadlock, right? One of the simplest and fastest ways to do so is to use the
Answer: sa_mask field of the struct sigaction that you pass
Maybe. to sigaction() when setting up the signal. q
If the routing decision in layer_1() changes often
enough, the code will always retry, never making forward p.107
Quick Quiz 7.13:
progress. This is termed “livelock” if no thread makes
If acquiring locks in signal handlers is such a bad idea,
any forward progress or “starvation” if some threads make
why even discuss ways of making it safe?
forward progress but others do not (see Section 7.1.2). q

v2022.09.25a
E.7. LOCKING 497

Answer: acquire them before actually carrying out any updates.


Because these same rules apply to the interrupt handlers If the prediction turns out to be incorrect, drop all
used in operating-system kernels and in some embedded the locks and retry with an updated prediction that
applications. includes the benefit of experience. This approach
In many application environments, acquiring locks in was discussed in Section 7.1.1.7.
signal handlers is frowned upon [Ope97]. However, that
does not stop clever developers from (perhaps unwisely) 6. Use transactional memory. This approach has a
fashioning home-brew locks out of atomic operations. number of advantages and disadvantages which will
And atomic operations are in many cases perfectly legal be discussed in Sections 17.2–17.3.
in signal handlers. q 7. Refactor the application to be more concurrency-
friendly. This would likely also have the side effect
Quick Quiz 7.14: p.108 of making the application run faster even when single-
Given an object-oriented application that passes control threaded, but might also make it more difficult to
freely among a group of objects such that there is no modify the application.
straightforward locking hierarchy,a layered or otherwise, 8. Use techniques from later chapters in addition to
how can this application be parallelized? locking. q
a Also known as “object-oriented spaghetti code.”

Answer: Quick Quiz 7.15: p.108


There are a number of approaches: How can the livelock shown in Listing 7.5 be avoided?
1. In the case of parametric search via simulation, where
a large number of simulations will be run in order Answer:
to converge on (for example) a good design for a Listing 7.4 provides some good hints. In many cases,
mechanical or electrical device, leave the simulation livelocks are a hint that you should revisit your locking
single-threaded, but run many instances of the sim- design. Or visit it in the first place if your locking design
ulation in parallel. This retains the object-oriented “just grew”.
design, and gains parallelism at a higher level, and That said, one good-and-sufficient approach due to
likely also avoids both deadlocks and synchronization Doug Lea is to use conditional locking as described in
overhead. Section 7.1.1.6, but combine this with acquiring all needed
locks first, before modifying shared data, as described
2. Partition the objects into groups such that there is no
in Section 7.1.1.7. If a given critical section retries
need to operate on objects in more than one group at
too many times, unconditionally acquire a global lock,
a given time. Then associate a lock with each group.
then unconditionally acquire all the needed locks. This
This is an example of a single-lock-at-a-time design,
avoids both deadlock and livelock, and scales reasonably
which discussed in Section 7.1.1.8.
assuming that the global lock need not be acquired too
3. Partition the objects into groups such that threads often. q
can all operate on objects in the groups in some
groupwise ordering. Then associate a lock with Quick Quiz 7.16: p.109
each group, and impose a locking hierarchy over the What problems can you spot in the code in Listing 7.6?
groups.

4. Impose an arbitrarily selected hierarchy on the locks, Answer:


and then use conditional locking if it is necessary Here are a couple:
to acquire a lock out of order, as was discussed in
Section 7.1.1.6. 1. A one-second wait is way too long for most uses.
Wait intervals should begin with roughly the time
5. Before carrying out a given group of operations, required to execute the critical section, which will
predict which locks will be acquired, and attempt to normally be in the microsecond or millisecond range.

v2022.09.25a
498 APPENDIX E. ANSWERS TO QUICK QUIZZES

2. The code does not check for overflow. On the other Answer:
hand, this bug is nullified by the previous bug: 32 Empty lock-based critical sections are rarely used, but
bits worth of seconds is more than 50 years. q they do have their uses. The point is that the semantics
of exclusive locks have two components: (1) The familiar
data-protection semantic and (2) A messaging semantic,
Quick Quiz 7.17: p.109 where releasing a given lock notifies a waiting acquisi-
Wouldn’t it be better just to use a good parallel design so tion of that same lock. An empty critical section uses
that lock contention was low enough to avoid unfairness? the messaging component without the data-protection
component.
The rest of this answer provides some example uses of
Answer: empty critical sections, however, these examples should
It would be better in some sense, but there are situations be considered “gray magic.”8 As such, empty critical
where it can be appropriate to use designs that sometimes sections are almost never used in practice. Nevertheless,
result in high lock contentions. pressing on into this gray area . . .
For example, imagine a system that is subject to a One historical use of empty critical sections appeared in
rare error condition. It might well be best to have a the networking stack of the 2.4 Linux kernel through use
simple error-handling design that has poor performance of a read-side-scalable reader-writer lock called brlock
and scalability for the duration of the rare error condition, for “big reader lock”. This use case is a way of approxi-
as opposed to a complex and difficult-to-debug design that mating the semantics of read-copy update (RCU), which
is helpful only when one of those rare error conditions is is discussed in Section 9.5. And in fact this Linux-kernel
in effect. use case has been replaced with RCU.
That said, it is usually worth putting some effort into The empty-lock-critical-section idiom can also be used
attempting to produce a design that both simple as well as to reduce lock contention in some situations. For example,
efficient during error conditions, for example by partition- consider a multithreaded user-space application where
ing the problem. q each thread processes units of work maintained in a per-
thread list, where threads are prohibited from touching
each others’ lists [McK12e]. There could also be updates
Quick Quiz 7.18: p.109
that require that all previously scheduled units of work
How might the lock holder be interfered with? have completed before the update can progress. One way
to handle this is to schedule a unit of work on each thread,
Answer: so that when all of these units of work complete, the
If the data protected by the lock is in the same cache line update may proceed.
as the lock itself, then attempts by other CPUs to acquire In some applications, threads can come and go. For
the lock will result in expensive cache misses on the part example, each thread might correspond to one user of
of the CPU holding the lock. This is a special case of the application, and thus be removed when that user
false sharing, which can also occur if a pair of variables logs out or otherwise disconnects. In many applications,
protected by different locks happen to share a cache line. threads cannot depart atomically: They must instead
In contrast, if the lock is in a different cache line than the explicitly unravel themselves from various portions of
data that it protects, the CPU holding the lock will usually the application using a specific sequence of actions. One
suffer a cache miss only on first access to a given variable. specific action will be refusing to accept further requests
Of course, the downside of placing the lock and data from other threads, and another specific action will be
into separate cache lines is that the code will incur two disposing of any remaining units of work on its list, for
cache misses rather than only one in the uncontended case. example, by placing these units of work in a global work-
As always, choose wisely! q item-disposal list to be taken by one of the remaining
threads. (Why not just drain the thread’s work-item list by
Quick Quiz 7.19: p.110 executing each item? Because a given work item might
Does it ever make sense to have an exclusive lock acqui- generate more work items, so that the list could not be
sition immediately followed by a release of that same drained in a timely fashion.)
lock, that is, an empty critical section? 8 Thanks to Alexey Roytman for this description.

v2022.09.25a
E.7. LOCKING 499

If the application is to perform and scale well, a good 1. Set a global counter to one and initialize a condition
locking design is required. One common solution is to variable to zero.
have a global lock (call it G) protecting the entire process
of departing (and perhaps other things as well), with 2. Send a message to all threads to cause them to
finer-grained locks protecting the individual unraveling atomically increment the global counter, and then to
operations. enqueue a work item. The work item will atomically
Now, a departing thread must clearly refuse to accept decrement the global counter, and if the result is zero,
further requests before disposing of the work on its list, it will set a condition variable to one.
because otherwise additional work might arrive after the
disposal action, which would render that disposal action 3. Acquire G, which will wait on any currently depart-
ineffective. So simplified pseudocode for a departing ing thread to finish departing. Because only one
thread might be as follows: thread may depart at a time, all the remaining threads
will have already received the message sent in the
1. Acquire lock G. preceding step.
2. Acquire the lock guarding communications. 4. Release G.
3. Refuse further communications from other threads.
5. Acquire the lock guarding the global work-item-
4. Release the lock guarding communications. disposal list.
5. Acquire the lock guarding the global work-item- 6. Move all work items from the global work-item-
disposal list. disposal list to this thread’s list, processing them as
6. Move all pending work items to the global work-item- needed along the way.
disposal list.
7. Release the lock guarding the global work-item-
7. Release the lock guarding the global work-item- disposal list.
disposal list.
8. Enqueue an additional work item onto this thread’s
8. Release lock G. list. (As before, this work item will atomically
Of course, a thread that needs to wait for all pre-existing decrement the global counter, and if the result is zero,
work items will need to take departing threads into account. it will set a condition variable to one.)
To see this, suppose that this thread starts waiting for all
9. Wait for the condition variable to take on the value
pre-existing work items just after a departing thread has
one.
refused further communications from other threads. How
can this thread wait for the departing thread’s work items
Once this procedure completes, all pre-existing work
to complete, keeping in mind that threads are not allowed
items are guaranteed to have completed. The empty
to access each others’ lists of work items?
critical sections are using locking for messaging as well
One straightforward approach is for this thread to ac-
as for protection of data. q
quire G and then the lock guarding the global work-item-
disposal list, then move the work items to its own list. The
thread then release both locks, places a work item on the Quick Quiz 7.20: p.112
end of its own list, and then wait for all of the work items Is there any other way for the VAX/VMS DLM to
that it placed on each thread’s list (including its own) to emulate a reader-writer lock?
complete.
This approach does work well in many cases, but if Answer:
special processing is required for each work item as it There are in fact several. One way would be to use the
is pulled in from the global work-item-disposal list, the null, protected-read, and exclusive modes. Another way
result could be excessive contention on G. One way to would be to use the null, protected-read, and concurrent-
avoid that contention is to acquire G and then immediately write modes. A third way would be to use the null,
release it. Then the process of waiting for all prior work concurrent-read, and exclusive modes. q
items look something like the following:

v2022.09.25a
500 APPENDIX E. ANSWERS TO QUICK QUIZZES

Quick Quiz 7.21: p.113 Quick Quiz 7.25: p.114


The code in Listing 7.7 is ridiculously complicated! Why not simply store zero into the lock word on line 14
Why not conditionally acquire a single global lock? of Listing 7.8?

Answer:
Answer:
This can be a legitimate implementation, but only if this
Conditionally acquiring a single global lock does work
store is preceded by a memory barrier and makes use
very well, but only for relatively small numbers of CPUs.
of WRITE_ONCE(). The memory barrier is not required
To see why it is problematic in systems with many hundreds
when the xchg() operation is used because this operation
of CPUs, look at Figure 5.1. q
implies a full memory barrier due to the fact that it returns
a value. q

Quick Quiz 7.22: p.114


Quick Quiz 7.26: p.116
Wait a minute! If we “win” the tournament on line 16
of Listing 7.7, we get to do all the work of do_force_ How can you tell if one counter is greater than another,
quiescent_state(). Exactly how is that a win, really? while accounting for counter wrap?

Answer:
In the C language, the following macro correctly handles
Answer: this:
How indeed? This just shows that in concurrency, just as
in life, one should take care to learn exactly what winning #define ULONG_CMP_LT(a, b) \
(ULONG_MAX / 2 < (a) - (b))
entails before playing the game. q

Although it is tempting to simply subtract two signed


Quick Quiz 7.23: p.114 integers, this should be avoided because signed overflow is
Why not rely on the C language’s default initialization undefined in the C language. For example, if the compiler
of zero instead of using the explicit initializer shown on knows that one of the values is positive and the other
line 2 of Listing 7.8? negative, it is within its rights to simply assume that the
positive number is greater than the negative number, even
though subtracting the negative number from the positive
Answer: number might well result in overflow and thus a negative
Because this default initialization does not apply to locks number.
allocated as auto variables within the scope of a function. How could the compiler know the signs of the two
q numbers? It might be able to deduce it based on prior
assignments and comparisons. In this case, if the per-CPU
counters were signed, the compiler could deduce that they
Quick Quiz 7.24: p.114
were always increasing in value, and then might assume
Why bother with the inner loop on lines 7–8 of List- that they would never go negative. This assumption
ing 7.8? Why not simply repeatedly do the atomic could well lead the compiler to generate unfortunate
exchange operation on line 6? code [McK12d, Reg10]. q

Answer: p.116
Quick Quiz 7.27:
Suppose that the lock is held and that several threads
Which is better, the counter approach or the flag ap-
are attempting to acquire the lock. In this situation, if
proach?
these threads all loop on the atomic exchange operation,
they will ping-pong the cache line containing the lock Answer:
among themselves, imposing load on the interconnect. In The flag approach will normally suffer fewer cache misses,
contrast, if these threads are spinning in the inner loop but a better answer is to try both and see which works best
on lines 7–8, they will each spin within their own caches, for your particular workload. q
placing negligible load on the interconnect. q

v2022.09.25a
E.8. DATA OWNERSHIP 501

Quick Quiz 7.28: p.117 Quick Quiz 8.2: p.123


How can relying on implicit existence guarantees result What synchronization remains in the example shown in
in a bug? Section 8.1?

Answer: Answer:
Here are some bugs resulting from improper use of implicit The creation of the threads via the sh & operator and the
existence guarantees: joining of thread via the sh wait command.
Of course, if the processes explicitly share memory,
1. A program writes the address of a global variable to a for example, using the shmget() or mmap() system calls,
file, then a later instance of that same program reads explicit synchronization might well be needed when acc-
that address and attempts to dereference it. This cessing or updating the shared memory. The processes
can fail due to address-space randomization, to say might also synchronize using any of the following inter-
nothing of recompilation of the program. process communications mechanisms:

2. A module can record the address of one of its vari- 1. System V semaphores.
ables in a pointer located in some other module, then 2. System V message queues.
attempt to dereference that pointer after the module
has been unloaded. 3. UNIX-domain sockets.
4. Networking protocols, including TCP/IP, UDP, and
3. A function can record the address of one of its on-
a whole host of others.
stack variables into a global pointer, which some
other function might attempt to dereference after that 5. File locking.
function has returned.
6. Use of the open() system call with the O_CREAT
I am sure that you can come up with additional possibilities. and O_EXCL flags.
q
7. Use of the rename() system call.

p.117 A complete list of possible synchronization mechanisms


Quick Quiz 7.29:
is left as an exercise to the reader, who is warned that it
What if the element we need to delete is not the first
will be an extremely long list. A surprising number of
element of the list on line 8 of Listing 7.9?
unassuming system calls can be pressed into service as
Answer: synchronization mechanisms. q
This is a very simple hash table with no chaining, so the
only element in a given bucket is the first element. The Quick Quiz 8.3: p.123
reader is invited to adapt this example to a hash table with Is there any shared data in the example shown in Sec-
full chaining. q tion 8.1?
Answer:
That is a philosophical question.
E.8 Data Ownership Those wishing the answer “no” might argue that pro-
cesses by definition do not share memory.
Those wishing to answer “yes” might list a large number
Quick Quiz 8.1: p.123 of synchronization mechanisms that do not require shared
What form of data ownership is extremely difficult to memory, note that the kernel will have some shared state,
avoid when creating shared-memory parallel programs and perhaps even argue that the assignment of process
(for example, using pthreads) in C or C++? IDs (PIDs) constitute shared data.
Such arguments are excellent intellectual exercise, and
Answer: are also a wonderful way of feeling intelligent and scoring
Use of auto variables in functions. By default, these are points against hapless classmates or colleagues, but are
private to the thread executing the current function. q mostly a way of avoiding getting anything useful done. q

v2022.09.25a
502 APPENDIX E. ANSWERS TO QUICK QUIZZES

p.124 However, there really is data that is owned by the


Quick Quiz 8.4:
eventual() thread, namely the t and sum variables
Does it ever make sense to have partial data ownership
defined on lines 19 and 20 of the listing.
where each thread reads only its own instance of a per-
For other examples of designated threads, look at the
thread variable, but writes to other threads’ instances?
kernel threads in the Linux kernel, for example, those
created by kthread_create() and kthread_run(). q
Answer:
Amazingly enough, yes. One example is a simple message- Quick Quiz 8.7: p.125
passing system where threads post messages to other Is it possible to obtain greater accuracy while still main-
threads’ mailboxes, and where each thread is responsible taining full privacy of the per-thread data?
for removing any message it sent once that message has
been acted on. Implementation of such an algorithm is Answer:
left as an exercise for the reader, as is identifying other Yes. One approach is for read_count() to add the value
algorithms with similar ownership patterns. q of its own per-thread variable. This maintains full owner-
ship and performance, but only a slight improvement in
accuracy, particularly on systems with very large numbers
Quick Quiz 8.5: p.124
of threads.
What mechanisms other than POSIX signals may be Another approach is for read_count() to use function
used for function shipping? shipping, for example, in the form of per-thread signals.
This greatly improves accuracy, but at a significant perfor-
Answer:
mance cost for read_count().
There is a very large number of such mechanisms, includ-
However, both of these methods have the advantage
ing:
of eliminating cache thrashing for the common case of
1. System V message queues. updating counters. q

2. Shared-memory dequeue (see Section 6.1.2).

3. Shared-memory mailboxes.
E.9 Deferred Processing
4. UNIX-domain sockets. p.129
Quick Quiz 9.1:
5. TCP/IP or UDP, possibly augmented by any number Why bother with a use-after-free check?
of higher-level protocols, including RPC, HTTP,
Answer:
XML, SOAP, and so on.
To greatly increase the probability of finding bugs. A
Compilation of a complete list is left as an exercise to small torture-test program (routetorture.h) that allo-
sufficiently single-minded readers, who are warned that cates and frees only one type of structure can tolerate a
the list will be extremely long. q surprisingly large amount of use-after-free misbehavior.
See Figure 11.4 on page 217 and the related discussion
in Section 11.6.4 starting on page 218 for more on the
Quick Quiz 8.6: p.125 importance of increasing the probability of finding bugs.
But none of the data in the eventual() function shown q
on lines 17–34 of Listing 5.5 is actually owned by
the eventual() thread! In just what way is this data Quick Quiz 9.2: p.129
ownership??? Why doesn’t route_del() in Listing 9.3 use reference
counts to protect the traversal to the element to be freed?
Answer:
The key phrase is “owns the rights to the data”. In this
case, the rights in question are the rights to access the per- Answer:
thread counter variable defined on line 1 of the listing. Because the traversal is already protected by the lock, so
This situation is similar to that described in Section 8.2. no additional protection is required. q

v2022.09.25a
E.9. DEFERRED PROCESSING 503

8
p.130 1x10
Quick Quiz 9.3:
Why the break in the “ideal” line at 224 CPUs in Fig- 7 ideal
1x10

Lookups per Millisecond


ure 9.2? Shouldn’t it be a straight line?
1x106
Answer:
The break is due to hyperthreading. On this particular 100000
system, the first hardware thread in each core within a
socket have consecutive CPU numbers, followed by the 10000 refcnt
first hardware threads in each core for the other sockets,
1000
and finally followed by the second hardware thread in
each core on all the sockets. On this particular system, 100
CPU numbers 0–27 are the first hardware threads in each 1 10 100
of the 28 cores in the first socket, numbers 28–55 are Number of CPUs (Threads)
the first hardware threads in each of the 28 cores in the
Figure E.4: Pre-BSD Routing Table Protected by Refer-
second socket, and so on, so that numbers 196–223 are
ence Counting, Log Scale
the first hardware threads in each of the 28 cores in the
eighth socket. Then CPU numbers 224–251 are the second
hardware threads in each of the 28 cores of the first socket, fulness of reference counting”, why are there so many
numbers 252–279 are the second hardware threads in reference counters in the Linux kernel?
each of the 28 cores of the second socket, and so on until
numbers 420–447 are the second hardware threads in each Answer:
of the 28 cores of the eighth socket. That sentence did say “reduced the usefulness”, not “elim-
Why does this matter? inated the usefulness”, now didn’t it?
Because the two hardware threads of a given core share Please see Section 13.2, which discusses some of the
resources, and this workload seems to allow a single techniques that the Linux kernel uses to take advantage of
hardware thread to consume more than half of the relevant reference counting in a highly concurrent environment. q
resources within its core. Therefore, adding the second
hardware thread of that core adds less than one might
hope. Other workloads might gain greater benefit from Quick Quiz 9.6: p.131
each core’s second hardware thread, but much depends on Given that papers on hazard pointers use the bottom bits
the details of both the hardware and the workload. q of each pointer to mark deleted elements, what is up
with HAZPTR_POISON?
Quick Quiz 9.4: p.130
Answer:
Shouldn’t the refcnt trace in Figure 9.2 be at least a little
The published implementations of hazard pointers used
bit off of the x-axis???
non-blocking synchronization techniques for insertion and
Answer: deletion. These techniques require that readers traversing
Define “a little bit.” the data structure “help” updaters complete their updates,
Figure E.4 shows the same data, but on a log-log plot. which in turn means that readers need to look at the
As you can see, the refcnt line drops below 5,000 at two successor of a deleted element.
CPUs. This means that the refcnt performance at two In contrast, we will be using locking to synchronize
CPUs is more than one thousand times smaller than the updates, which does away with the need for readers to
first y-axis tick of 5 × 106 in Figure 9.2. Therefore, the help updaters complete their updates, which in turn allows
depiction of the performance of reference counting shown us to leave pointers’ bottom bits alone. This approach
in Figure 9.2 is all too accurate. q allows read-side code to be simpler and faster. q

Quick Quiz 9.5: p.130 Quick Quiz 9.7: p.131


If concurrency has “most definitely reduced the use- Why does hp_try_record() in Listing 9.4 take a dou-

v2022.09.25a
504 APPENDIX E. ANSWERS TO QUICK QUIZZES

ble indirection to the data element? Why not void * Quick Quiz 9.11: p.134
instead of void **? Figure 9.3 shows no sign of hyperthread-induced flatten-
Answer: ing at 224 threads. Why is that?
Because hp_try_record() must check for concurrent
modifications. To do that job, it needs a pointer to a pointer Answer:
to the element, so that it can check for a modification to Modern microprocessors are complicated beasts, so signif-
the pointer to the element. q icant skepticism is appropriate for any simple answer. That
aside, the most likely reason is the full memory barriers
required by hazard-pointers readers. Any delays resulting
Quick Quiz 9.8: p.131
from those memory barriers would make time available
Why bother with hp_try_record()? Wouldn’t it be to the other hardware thread sharing the core, resulting in
easier to just use the failure-immune hp_record() greater scalability at the expense of per-hardware-thread
function? performance. q
Answer:
It might be easier in some sense, but as will be seen in the Quick Quiz 9.12: p.134
Pre-BSD routing example, there are situations for which The paper “Structured Deferral: Synchronization via
hp_record() simply does not work. q Procrastination” [McK13] shows that hazard pointers
have near-ideal performance. Whatever happened in
Quick Quiz 9.9: p.133 Figure 9.3???
Readers must “typically” restart? What are some excep-
tions? Answer:
First, Figure 9.3 has a linear y-axis, while most of the
Answer: graphs in the “Structured Deferral” paper have logscale
If the pointer emanates from a global variable or is other- y-axes. Next, that paper uses lightly-loaded hash tables,
wise not subject to being freed, then hp_record() may while Figure 9.3’s uses a 10-element simple linked list,
be used to repeatedly attempt to record the hazard pointer, which means that hazard pointers face a larger memory-
even in the face of concurrent deletions. barrier penalty in this workload than in that of the “Struc-
In certain cases, restart can be avoided by using link tured Deferral” paper. Finally, that paper used an older
counting as exemplified by the UnboundedQueue and modest-sized x86 system, while a much newer and larger
ConcurrentHashMap data structures implemented in Folly system was used to generate the data shown in Figure 9.3.
open-source library.9 q
In addition, use of pairwise asymmetric barriers [Mic08,
Cor10b, Cor18] has been proposed to eliminate the read-
Quick Quiz 9.10: p.133 side hazard-pointer memory barriers on systems sup-
But don’t these restrictions on hazard pointers also apply porting this notion [Gol18b], which might improve the
to other forms of reference counting? performance of hazard pointers beyond what is shown in
the figure.
Answer: As always, your mileage may vary. Given the difference
Yes and no. These restrictions apply only to reference- in performance, it is clear that hazard pointers give you
counting mechanisms whose reference acquisition can fail. the best performance either for very large data structures
q (where the memory-barrier overhead will at least partially
overlap cache-miss penalties) and for data structures such
as hash tables where a lookup operation needs a minimal
number of hazard pointers. q

Quick Quiz 9.13: p.134


Why isn’t this sequence-lock discussion in Chapter 7,
9 https://github.com/facebook/folly
you know, the one on locking?

v2022.09.25a
E.9. DEFERRED PROCESSING 505

Answer: Answer:
The sequence-lock mechanism is really a combination In older versions of the Linux kernel, no.
of two separate synchronization mechanisms, sequence In very new versions of the Linux kernel, line 16 could
counts and locking. In fact, the sequence-count mech- use smp_load_acquire() instead of READ_ONCE(),
anism is available separately in the Linux kernel via which in turn would allow the smp_mb() on line 17 to
the write_seqcount_begin() and write_seqcount_ be dropped. Similarly, line 41 could use an smp_store_
end() primitives. release(), for example, as follows:
However, the combined write_seqlock() and
write_sequnlock() primitives are used much more smp_store_release(&slp->seq, READ_ONCE(slp->seq) + 1);
heavily in the Linux kernel. More importantly, many
more people will understand what you mean if you say This would allow the smp_mb() on line 40 to be
“sequence lock” than if you say “sequence count”. dropped. q
So this section is entitled “Sequence Locks” so that
people will understand what it is about just from the title,
Quick Quiz 9.17: p.136
and it appears in the “Deferred Processing” because (1) of
the emphasis on the “sequence count” aspect of “sequence What prevents sequence-locking updaters from starving
locks” and (2) because a “sequence lock” is much more readers?
than merely a lock. q
Answer:
Nothing. This is one of the weaknesses of sequence
Quick Quiz 9.14: p.135 locking, and as a result, you should use sequence locking
Why not have read_seqbegin() in Listing 9.10 check only in read-mostly situations. Unless of course read-side
for the low-order bit being set, and retry internally, rather starvation is acceptable in your situation, in which case,
than allowing a doomed read to start? go wild with the sequence-locking updates! q

Answer:
Quick Quiz 9.18: p.136
That would be a legitimate implementation. However,
if the workload is read-mostly, it would likely increase What if something else serializes writers, so that the
the overhead of the common-case successful read, which lock is not needed?
could be counter-productive. However, given a sufficiently Answer:
large fraction of updates and sufficiently high-overhead In this case, the ->lock field could be omitted, as it is in
readers, having the check internal to read_seqbegin() seqcount_t in the Linux kernel. q
might be preferable. q

Quick Quiz 9.19: p.136


Quick Quiz 9.15: p.136
Why isn’t seq on line 2 of Listing 9.10 unsigned rather
Why is the smp_mb() on line 26 of Listing 9.10 needed?
than unsigned long? After all, if unsigned is good
enough for the Linux kernel, shouldn’t it be good enough
Answer: for everyone?
If it was omitted, both the compiler and the CPU would be
within their rights to move the critical section preceding Answer:
the call to read_seqretry() down below this function. Not at all. The Linux kernel has a number of special
This would prevent the sequence lock from protecting the attributes that allow it to ignore the following sequence of
critical section. The smp_mb() primitive prevents such events:
reordering. q 1. Thread 0 executes read_seqbegin(), picking up
->seq in line 16, noting that the value is even, and
Quick Quiz 9.16: p.136 thus returning to the caller.
Can’t weaker memory barriers be used in the code in
2. Thread 0 starts executing its read-side critical section,
Listing 9.10?
but is then preempted for a long time.

v2022.09.25a
506 APPENDIX E. ANSWERS TO QUICK QUIZZES

3. Other threads repeatedly invoke write_seqlock() 4. CPU 0 picks up what used to be element A’s ->
and write_sequnlock(), until the value of ->seq next pointer, gets random bits, and therefore gets a
overflows back to the value that Thread 0 fetched. segmentation fault.
4. Thread 0 resumes execution, completing its read-side One way to protect against this sort of problem requires
critical section with inconsistent data. use of “type-safe memory”, which will be discussed in Sec-
tion 9.5.4.5. Roughly similar solutions are possible using
5. Thread 0 invokes read_seqretry(), which incor-
the hazard pointers discussed in Section 9.3. But in either
rectly concludes that Thread 0 has seen a consistent
case, you would be using some other synchronization
view of the data protected by the sequence lock.
mechanism in addition to sequence locks! q
The Linux kernel uses sequence locking for things that
are updated rarely, with time-of-day information being a Quick Quiz 9.21: p.139
case in point. This information is updated at most once Why does Figure 9.7 use smp_store_release() given
per millisecond, so that seven weeks would be required to that it is storing a NULL pointer? Wouldn’t WRITE_
overflow the counter. If a kernel thread was preempted for ONCE() work just as well in this case, given that there
seven weeks, the Linux kernel’s soft-lockup code would is no structure initialization to order against the store of
be emitting warnings every two minutes for that entire the NULL pointer?
time.
In contrast, with a 64-bit counter, more than five cen- Answer:
turies would be required to overflow, even given an update Yes, it would.
every nanosecond. Therefore, this implementation uses a Because a NULL pointer is being assigned, there is noth-
type for ->seq that is 64 bits on 64-bit systems. q ing to order against, so there is no need for smp_store_
release(). In contrast, when assigning a non-NULL
Quick Quiz 9.20: p.137 pointer, it is necessary to use smp_store_release()
Can this bug be fixed? In other words, can you use in order to ensure that initialization of the pointed-to
sequence locks as the only synchronization mechanism structure is carried out before assignment of the pointer.
protecting a linked list supporting concurrent addition, In short, WRITE_ONCE() would work, and would save a
deletion, and lookup? little bit of CPU time on some architectures. However, as
we will see, software-engineering concerns will motivate
Answer: use of a special rcu_assign_pointer() that is quite
One trivial way of accomplishing this is to surround all similar to smp_store_release(). q
accesses, including the read-only accesses, with write_
seqlock() and write_sequnlock(). Of course, this Quick Quiz 9.22: p.139
solution also prohibits all read-side parallelism, resulting Readers running concurrently with each other and with
in massive lock contention, and furthermore could just as the procedure outlined in Figure 9.7 can disagree on the
easily be implemented using simple locking. value of gptr. Isn’t that just a wee bit problematic???
If you do come up with a solution that uses read_
seqbegin() and read_seqretry() to protect read-side
accesses, make sure that you correctly handle the following Answer:
sequence of events: Not necessarily.
As hinted at in Sections 3.2.3 and 3.3, speed-of-light
1. CPU 0 is traversing the linked list, and picks up a delays mean that a computer’s data is always stale com-
pointer to list element A. pared to whatever external reality that data is intended to
2. CPU 1 removes element A from the list and frees it. model.
Real-world algorithms therefore absolutely must tol-
3. CPU 2 allocates an unrelated data structure, and gets erate inconsistancies between external reality and the
the memory formerly occupied by element A. In this in-computer data reflecting that reality. Many of those
unrelated data structure, the memory previously used algorithms are also able to tolerate some degree of in-
for element A’s ->next pointer is now occupied by consistency within the in-computer data. Section 10.3.4
a floating-point number. discusses this point in more detail.

v2022.09.25a
E.9. DEFERRED PROCESSING 507

Please note that this need to tolerate inconsistent and stale data is not limited to RCU. It also applies to reference counting, hazard pointers, sequence locks, and even to some locking use cases. For example, if you compute some quantity while holding a lock, but use that quantity after releasing that lock, you might well be using stale data. After all, the data that quantity is based on might change arbitrarily as soon as the lock is released.
So yes, RCU readers can see stale and inconsistent data, but no, this is not necessarily problematic. And, when needed, there are RCU usage patterns that avoid both staleness and inconsistency [ACMS03]. q

Quick Quiz 9.23: p.139
What is an RCU-protected pointer?

Answer:
A pointer to RCU-protected data. RCU-protected data is in turn a block of dynamically allocated memory whose freeing will be deferred such that an RCU grace period will elapse between the time that there were no longer any RCU-reader-accessible pointers to that block and the time that that block is freed. This ensures that no RCU readers will have access to that block at the time that it is freed.
RCU-protected pointers must be handled carefully. For example, any reader that intends to dereference an RCU-protected pointer must use rcu_dereference() (or stronger) to load that pointer. In addition, any updater must use rcu_assign_pointer() (or stronger) to store to that pointer. q
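For concreteness, the following minimal sketch (not from the book's CodeSamples; the struct foo, gp, reader(), and updater() names are hypothetical) shows a reader loading an RCU-protected pointer with rcu_dereference() and an updater publishing a new version with rcu_assign_pointer() before deferring the free, assuming Linux-kernel RCU APIs and that updaters are serialized by the caller:

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/errno.h>

struct foo {                       /* hypothetical RCU-protected structure */
  int a;
};

static struct foo __rcu *gp;       /* the RCU-protected pointer */

static int reader(void)
{
  struct foo *p;
  int ret = -1;

  rcu_read_lock();
  p = rcu_dereference(gp);         /* subscribe to the current version */
  if (p)
    ret = p->a;                    /* use only inside the read-side critical section */
  rcu_read_unlock();
  return ret;
}

static int updater(int newval)     /* caller must serialize concurrent updaters */
{
  struct foo *newp, *oldp;

  newp = kmalloc(sizeof(*newp), GFP_KERNEL);
  if (!newp)
    return -ENOMEM;
  newp->a = newval;                             /* initialize before publication */
  oldp = rcu_dereference_protected(gp, 1);      /* update-side access to the old version */
  rcu_assign_pointer(gp, newp);                 /* publish: orders init before the pointer store */
  synchronize_rcu();                            /* wait for pre-existing readers */
  kfree(oldp);                                  /* now no reader can still hold a reference */
  return 0;
}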
Quick Quiz 9.24: p.140
What does synchronize_rcu() do if it starts at about the same time as an rcu_read_lock()?

Answer:
If a synchronize_rcu() cannot prove that it started before a given rcu_read_lock(), then it must wait for the corresponding rcu_read_unlock(). q

Quick Quiz 9.25: p.142
In Figure 9.8, the last of CPU 3's readers that could possibly have access to the old data item ended before the grace period even started! So why would anyone bother waiting until CPU 3's later context switch???

Answer:
Because that waiting is exactly what enables readers to use the same sequence of instructions that is appropriate for single-threaded situations. In other words, this additional "redundant" waiting enables excellent read-side performance, scalability, and real-time response. q

Quick Quiz 9.26: p.142
What is the point of rcu_read_lock() and rcu_read_unlock() in Listing 9.13? Why not just let the quiescent states speak for themselves?

Answer:
Recall that readers are not permitted to pass through a quiescent state. For example, within the Linux kernel, RCU readers are not permitted to execute a context switch. Use of rcu_read_lock() and rcu_read_unlock() enables debug checks for improperly placed quiescent states, making it easy to find bugs that would otherwise be difficult to find, intermittent, and quite destructive. q

Quick Quiz 9.27: p.142
What is the point of rcu_dereference(), rcu_assign_pointer() and RCU_INIT_POINTER() in Listing 9.13? Why not just use READ_ONCE(), smp_store_release(), and WRITE_ONCE(), respectively?

Answer:
The RCU-specific APIs do have similar semantics to the suggested replacements, but also enable static-analysis debugging checks that complain if an RCU-specific API is invoked on a non-RCU pointer and vice versa. q
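In the Linux kernel, for example, the pointer itself can carry the __rcu annotation so that the sparse static analyzer can flag mismatched accesses. The snippet below is a hedged sketch using a hypothetical gp pointer; the exact diagnostic text is sparse's, not guaranteed here:

#include <linux/rcupdate.h>

struct foo {
  int a;
};

static struct foo __rcu *gp;       /* __rcu marks this pointer as RCU-protected for sparse */

static int ok_reader(void)
{
  struct foo *p;
  int ret = 0;

  rcu_read_lock();
  p = rcu_dereference(gp);         /* correct: rcu_dereference() strips the __rcu marking */
  if (p)
    ret = p->a;
  rcu_read_unlock();
  return ret;
}

static int sloppy_reader(void)
{
  struct foo *p;

  p = READ_ONCE(gp);               /* sparse flags this: address-space mismatch on an __rcu pointer */
  return p ? p->a : 0;
}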
Quick Quiz 9.28: p.143
But what if the old structure needs to be freed, but the caller of ins_route() cannot block, perhaps due to performance considerations or perhaps because the caller is executing within an RCU read-side critical section?

Answer:
A call_rcu() function, which is described in Section 9.5.2.2, permits asynchronous grace-period waits. q
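For instance, a non-blocking caller can embed an rcu_head in the structure and let a callback do the freeing after the grace period. The following is a minimal sketch with a hypothetical struct foo; it is not the ins_route() code itself:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
  int key;
  struct rcu_head rh;              /* storage used by the grace-period callback */
};

static void foo_free_cb(struct rcu_head *rhp)
{
  struct foo *fp = container_of(rhp, struct foo, rh);

  kfree(fp);                       /* runs only after a grace period has elapsed */
}

/* Call after the old element has been unlinked from all RCU-visible paths. */
static void foo_retire(struct foo *oldp)
{
  call_rcu(&oldp->rh, foo_free_cb);  /* returns immediately, never blocks */
}

When the callback would do nothing but kfree(), the kernel's kfree_rcu(oldp, rh) shorthand avoids writing the callback entirely.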
Quick Quiz 9.29: p.143
Doesn't Section 9.4's seqlock also permit readers and updaters to make useful concurrent forward progress?

Answer:
Yes and no. Although seqlock readers can run concurrently with seqlock writers, whenever this happens, the read_seqretry() primitive will force the reader to retry. This means that any work done by a seqlock reader running concurrently with a seqlock updater will be discarded and then redone upon retry. So seqlock readers can run concurrently with updaters, but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in the presence of concurrent RCU updaters.
However, both reference counters (Section 9.2) and hazard pointers (Section 9.3) really do permit useful concurrent forward progress for both updaters and readers, just at somewhat greater cost. Please see Section 9.6 for a comparison of these different solutions to the deferred-reclamation problem. q

Quick Quiz 9.30: p.145
Wouldn't use of data ownership for RCU updaters mean that the updates could use exactly the same sequence of instructions as would the corresponding single-threaded code?

Answer:
Sometimes, for example, on TSO systems such as x86 or the IBM mainframe where a store-release operation emits a single store instruction. However, weakly ordered systems must also emit a memory barrier or perhaps a store-release instruction. In addition, removing data requires quite a bit of additional work because it is necessary to wait for pre-existing readers before freeing the removed data. q

Quick Quiz 9.31: p.145
But suppose that updaters are adding and removing multiple data items from a linked list while a reader is iterating over that same list. Specifically, suppose that a list initially contains elements A, B, and C, and that an updater removes element A and then adds a new element D at the end of the list. The reader might well see {A, B, C, D}, when that sequence of elements never actually ever existed! In what alternate universe would that qualify as "not disrupting concurrent readers"???

Answer:
In the universe where an iterating reader is only required to traverse elements that were present throughout the full duration of the iteration. In the example, that would be elements B and C. Because elements A and D were each present for only part of the iteration, the reader is permitted to iterate over them, but not obliged to. Note that this supports the common case where the reader is simply looking up a single item, and does not know or care about the presence or absence of other items.
If stronger consistency is required, then higher-cost synchronization mechanisms are required, for example, sequence locking or reader-writer locking. But if stronger consistency is not required (and it very often is not), then why pay the higher cost? q

Quick Quiz 9.32: p.146
What other final values of r1 and r2 are possible in Figure 9.11?

Answer:
The r1 == 0 && r2 == 0 possibility was called out in the text. Given that r1 == 0 implies r2 == 0, we know that r1 == 0 && r2 == 1 is forbidden. The following discussion will show that both r1 == 1 && r2 == 1 and r1 == 1 && r2 == 0 are possible. q

Quick Quiz 9.33: p.146
What would happen if the order of P0()'s two accesses was reversed in Figure 9.12?

Answer:
Absolutely nothing would change. The fact that P0()'s loads from x and y are in the same RCU read-side critical section suffices; their order is irrelevant. q

Quick Quiz 9.34: p.147
What would happen if P0()'s accesses in Figures 9.11-9.13 were stores?

Answer:
The exact same ordering rules would apply, that is, (1) If any part of P0()'s RCU read-side critical section preceded the beginning of P1()'s grace period, all of P0()'s RCU read-side critical section would precede the end of P1()'s grace period, and (2) If any part of P0()'s RCU read-side critical section followed the end of P1()'s grace period, all of P0()'s RCU read-side critical section would follow the beginning of P1()'s grace period.
It might seem strange to have RCU read-side critical sections containing writes, but this capability is not only permitted, but also highly useful. For example, the Linux kernel frequently carries out an RCU-protected traversal of a linked data structure and then acquires a reference to the destination data element. Because this data element must not be freed in the meantime, that element's reference counter must necessarily be incremented within the traversal's RCU read-side critical section. However, that increment entails a write to memory. Therefore, it is a very good thing that memory writes are permitted within RCU read-side critical sections.
If having writes in RCU read-side critical sections still seems strange, please review Section 5.4.6, which presented a use case for writes in reader-writer locking read-side critical sections. q

Quick Quiz 9.35: p.149
How would you modify the deletion example to permit more than two versions of the list to be active?

Answer:
One way of accomplishing this is as shown in Listing E.2. Note that this means that multiple concurrent deletions might be waiting in synchronize_rcu(). q

Listing E.2: Concurrent RCU Deletion
spin_lock(&mylock);
p = search(head, key);
if (p == NULL)
  spin_unlock(&mylock);
else {
  list_del_rcu(&p->list);
  spin_unlock(&mylock);
  synchronize_rcu();
  kfree(p);
}

Quick Quiz 9.36: p.149
How many RCU versions of a given list can be active at any given time?

Answer:
That depends on the synchronization design. If a semaphore protecting the update is held across the grace period, then there can be at most two versions, the old and the new.
However, suppose that only the search, the update, and the list_replace_rcu() were protected by a lock, so that the synchronize_rcu() was outside of that lock, similar to the code shown in Listing E.2. Suppose further that a large number of threads undertook an RCU replacement at about the same time, and that readers are also constantly traversing the data structure.
Then the following sequence of events could occur, starting from the end state of Figure 9.15:

1. Thread A traverses the list, obtaining a reference to Element C.

2. Thread B replaces Element C with a new Element F, then waits for its synchronize_rcu() call to return.

3. Thread C traverses the list, obtaining a reference to Element F.

4. Thread D replaces Element F with a new Element G, then waits for its synchronize_rcu() call to return.

5. Thread E traverses the list, obtaining a reference to Element G.

6. Thread F replaces Element G with a new Element H, then waits for its synchronize_rcu() call to return.

7. Thread G traverses the list, obtaining a reference to Element H.

8. And the previous two steps repeat quickly with additional new elements, so that all of them happen before any of the synchronize_rcu() calls return.

Thus, there can be an arbitrary number of versions active, limited only by memory and by how many updates could be completed within a grace period. But please note that data structures that are updated so frequently are not likely to be good candidates for RCU. Nevertheless, RCU can handle high update rates when necessary. q

Quick Quiz 9.37: p.150
How can RCU updaters possibly delay RCU readers, given that neither rcu_read_lock() nor rcu_read_unlock() spin or block?

Answer:
The modifications undertaken by a given RCU updater will cause the corresponding CPU to invalidate cache lines containing the data, forcing the CPUs running concurrent RCU readers to incur expensive cache misses. (Can you design an algorithm that changes a data structure without inflicting expensive cache misses on concurrent readers? On subsequent readers?) q

Quick Quiz 9.38: p.151
Why do some of the cells in Table 9.2 have exclamation marks ("!")?

Answer:
The API members with exclamation marks (rcu_read_lock(), rcu_read_unlock(), and call_rcu()) were the only members of the Linux RCU API that Paul E. McKenney was aware of back in the mid-90s. During this timeframe, he was under the mistaken impression that he knew all that there is to know about RCU. q
Quick Quiz 9.39: p.151
How do you prevent a huge number of RCU read-side critical sections from indefinitely blocking a synchronize_rcu() invocation?

Answer:
There is no need to do anything to prevent RCU read-side critical sections from indefinitely blocking a synchronize_rcu() invocation, because the synchronize_rcu() invocation need wait only for pre-existing RCU read-side critical sections. So as long as each RCU read-side critical section is of finite duration, RCU grace periods will also remain finite. q

Quick Quiz 9.40: p.151
The synchronize_rcu() API waits for all pre-existing interrupt handlers to complete, right?

Answer:
In v4.20 and later Linux kernels, yes [McK19c, McK19a]. But not in earlier kernels, and especially not when using preemptible RCU! You instead want synchronize_irq(). Alternatively, you can place calls to rcu_read_lock() and rcu_read_unlock() in the specific interrupt handlers that you want synchronize_rcu() to wait for. But even then, be careful, as preemptible RCU will not be guaranteed to wait for that portion of the interrupt handler preceding the rcu_read_lock() or following the rcu_read_unlock(). q

Quick Quiz 9.41: p.153
Under what conditions can synchronize_srcu() be safely used within an SRCU read-side critical section?

Answer:
In principle, you can use either synchronize_srcu() or synchronize_srcu_expedited() with a given srcu_struct within an SRCU read-side critical section that uses some other srcu_struct. In practice, however, doing this is almost certainly a bad idea. In particular, the code shown in Listing E.3 could still result in deadlock. q

Listing E.3: Multistage SRCU Deadlocks
idx = srcu_read_lock(&ssa);
synchronize_srcu(&ssb);
srcu_read_unlock(&ssa, idx);

/* . . . */

idx = srcu_read_lock(&ssb);
synchronize_srcu(&ssa);
srcu_read_unlock(&ssb, idx);

Quick Quiz 9.42: p.153
In a kernel built with CONFIG_PREEMPT_NONE=y, won't synchronize_rcu() wait for all trampolines, given that preemption is disabled and that trampolines never directly or indirectly invoke schedule()?

Answer:
You are quite right!
In fact, in nonpreemptible kernels, synchronize_rcu_tasks() is a wrapper around synchronize_rcu(). q

Quick Quiz 9.43: p.154
Normally, any pointer subject to rcu_dereference() must always be updated using one of the pointer-publish functions in Table 9.3, for example, rcu_assign_pointer().
What is an exception to this rule?

Answer:
One such exception is when a multi-element linked data structure is initialized as a unit while inaccessible to other CPUs, and then a single rcu_assign_pointer() is used to plant a global pointer to this data structure. The initialization-time pointer assignments need not use rcu_assign_pointer(), though any such assignments that happen after the structure is globally visible must use rcu_assign_pointer().
However, unless this initialization code is on an impressively hot code-path, it is probably wise to use rcu_assign_pointer() anyway, even though it is in theory unnecessary. It is all too easy for a "minor" change to invalidate your cherished assumptions about the initialization happening privately. q
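A sketch of this exception, using a hypothetical two-element list and Linux-kernel APIs, might look as follows: the interior pointers are assigned with plain stores while the structure is still private, and only the final publication uses rcu_assign_pointer():

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/errno.h>

struct node {
  int val;
  struct node *next;
};

static struct node __rcu *list_head;   /* hypothetical global pointer */

static int publish_two_element_list(void)
{
  struct node *first, *second;

  first = kmalloc(sizeof(*first), GFP_KERNEL);
  second = kmalloc(sizeof(*second), GFP_KERNEL);
  if (!first || !second) {
    kfree(first);
    kfree(second);
    return -ENOMEM;
  }

  /* Still private to this CPU: plain assignments suffice. */
  second->val = 2;
  second->next = NULL;
  first->val = 1;
  first->next = second;

  /* Single publication point: only now may readers see the structure. */
  rcu_assign_pointer(list_head, first);
  return 0;
}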
Quick Quiz 9.44: p.154
Are there any downsides to the fact that these traversal and update primitives can be used with any of the RCU API family members?

Answer:
It can sometimes be difficult for automated code checkers such as "sparse" (or indeed for human beings) to work out which type of RCU read-side critical section a given RCU traversal primitive corresponds to. For example, consider the code shown in Listing E.4.

Listing E.4: Diverse RCU Read-Side Nesting
rcu_read_lock();
preempt_disable();
p = rcu_dereference(global_pointer);

/* . . . */

preempt_enable();
rcu_read_unlock();

Is the rcu_dereference() primitive in a vanilla RCU critical section or an RCU Sched critical section? What would you have to do to figure this out?
But perhaps after the consolidation of the RCU flavors in the v4.20 Linux kernel we no longer need to care! q

Quick Quiz 9.45: p.156
But what if an hlist_nulls reader gets moved to some other bucket and then back again?

Answer:
One way to handle this is to always move nodes to the beginning of the destination bucket, ensuring that when the reader reaches the end of the list having a matching NULL pointer, it will have searched the entire list.
Of course, if there are too many move operations in a hash table with many elements per bucket, the reader might never reach the end of a list. One way of avoiding this in the common case is to keep hash tables well-tuned, thus with short lists. One way of detecting the problem and handling it is for the reader to terminate the search after traversing some large number of nodes, acquire the update-side lock, and redo the search, but this might introduce deadlocks. Another way of avoiding the problem entirely is for readers to search within RCU read-side critical sections, and to wait for an RCU grace period between successive updates. An intermediate position might wait for an RCU grace period every N updates, for some suitable value of N. q
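The "matching NULL pointer" check is the standard hlist_nulls restart pattern from the Linux kernel's rculist_nulls documentation. The sketch below adapts it to a hypothetical table[] whose bucket heads were initialized with INIT_HLIST_NULLS_HEAD(&table[i], i), so that each chain's nulls marker encodes its bucket index; a real user would also take a reference on the found object before leaving the read-side critical section:

#include <linux/rculist_nulls.h>
#include <linux/rcupdate.h>

#define NBUCKETS 1024

struct zobj {
  unsigned long key;
  struct hlist_nulls_node node;
};

static struct hlist_nulls_head table[NBUCKETS];

static struct zobj *zobj_lookup(unsigned long key)
{
  unsigned long slot = key % NBUCKETS;
  struct hlist_nulls_node *pos;
  struct zobj *zp;

  rcu_read_lock();
restart:
  hlist_nulls_for_each_entry_rcu(zp, pos, &table[slot], node)
    if (zp->key == key)
      goto found;
  /*
   * The chain ended in a nulls marker.  If it is not the marker for the
   * slot we started in, the tail object was moved to another bucket
   * mid-search, so restart from the head of this slot.
   */
  if (get_nulls_value(pos) != slot)
    goto restart;
  zp = NULL;
found:
  rcu_read_unlock();
  return zp;   /* caller needs its own existence guarantee beyond this point */
}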
Quick Quiz 9.46: p.159
Why isn't there a rcu_read_lock_tasks_held() for Tasks RCU?

Answer:
Because Tasks RCU does not have read-side markers. Instead, Tasks RCU read-side critical sections are bounded by voluntary context switches. q

Quick Quiz 9.47: p.161
Wait, what??? How can RCU QSBR possibly be better than ideal? Just what rubbish definition of ideal would fail to be the best of all possible results???

Answer:
This is an excellent question, and the answer is that modern CPUs and compilers are extremely complex. But before getting into that, it is well worth noting that RCU QSBR's performance advantage appears only in the one-hardware-thread-per-core regime. Once the system is fully loaded, RCU QSBR's performance drops back to ideal.
The RCU variant of the route_lookup() search loop actually has one more x86 instruction than does the sequential version, namely the lea in the sequence cmp, je, mov, cmp, lea, and jne. This extra instruction is due to the rcu_head structure at the beginning of the RCU variant's route_entry structure, so that, unlike the sequential variant, the RCU variant's ->re_next.next pointer has a non-zero offset. Back in the 1980s, this additional lea instruction might have reliably resulted in the RCU variant being slower, but we are now in the 21st century, and the 1980s are long gone.
But those of you who read Section 3.1.1 carefully already knew all of this!
These counter-intuitive results of course mean that any performance result on modern microprocessors must be subject to some skepticism. In theory, it really does not make sense to obtain performance results that are better than ideal, but it really can happen on modern microprocessors. Such results can be thought of as similar to the celebrated super-linear speedups (see Section 6.5 for one such example), that is, of interest but also of limited practical importance. Nevertheless, one of the strengths of RCU is that its read-side overhead is so low that tiny effects such as this one are visible in real performance measurements.
This raises the question as to what would happen if the rcu_head structure were to be moved so that RCU's ->re_next.next pointer also had zero offset, just the same as the sequential variant. And the answer, as can be seen in Figure E.5, is that this causes RCU QSBR's performance to decrease to where it is still very nearly ideal, but no longer super-ideal. q

[Figure E.5: Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu_head. Lookups per millisecond versus number of CPUs (threads); traces for ideal, RCU, seqlock, and hazptr.]

Quick Quiz 9.48: p.161
Given RCU QSBR's read-side performance, why bother with any other flavor of userspace RCU?

Answer:
Because RCU QSBR places constraints on the overall application that might not be tolerable, for example, requiring that each and every thread in the application regularly pass through a quiescent state. Among other things, this means that RCU QSBR is not helpful to library writers, who might be better served by other flavors of userspace RCU [MDJ13c]. q

Quick Quiz 9.49: p.164
Suppose that the nmi_profile() function was preemptible. What would need to change to make this example work correctly?

Answer:
One approach would be to use rcu_read_lock() and rcu_read_unlock() in nmi_profile(), and to replace the synchronize_sched() with synchronize_rcu(), perhaps as shown in Listing E.5.
But why on earth would an NMI handler be preemptible??? q

Listing E.5: Using RCU to Wait for Mythical Preemptible NMIs to Finish
struct profile_buffer {
  long size;
  atomic_t entry[0];
};
static struct profile_buffer *buf = NULL;

void nmi_profile(unsigned long pcvalue)
{
  struct profile_buffer *p;

  rcu_read_lock();
  p = rcu_dereference(buf);
  if (p == NULL) {
    rcu_read_unlock();
    return;
  }
  if (pcvalue >= p->size) {
    rcu_read_unlock();
    return;
  }
  atomic_inc(&p->entry[pcvalue]);
  rcu_read_unlock();
}

void nmi_stop(void)
{
  struct profile_buffer *p = buf;

  if (p == NULL)
    return;
  rcu_assign_pointer(buf, NULL);
  synchronize_rcu();
  kfree(p);
}

Quick Quiz 9.50: p.164
What is the point of the second call to synchronize_rcu() in function maint() in Listing 9.17? Isn't it OK for any cco() invocations in the clean-up phase to invoke either cco_carefully() or cco_quickly()?

Answer:
The problem is that there is no ordering between the cco() function's load from be_careful and any memory loads executed by the cco_quickly() function. Because there is no ordering, without that second call to synchronize_rcu(), memory ordering could cause loads in cco_quickly() to overlap with stores by do_maint().
Another alternative would be to compensate for the removal of that second call to synchronize_rcu() by changing the READ_ONCE() to smp_load_acquire() and the WRITE_ONCE() to smp_store_release(), thus restoring the needed ordering. q

Quick Quiz 9.51: p.164
How can you be sure that the code shown in maint() in Listing 9.17 really works?

Answer:
By one popular school of thought, you cannot.
But in this case, those willing to jump ahead to Chapter 12 and Chapter 15 might find a couple of LKMM litmus tests to be interesting (C-RCU-phased-state-change-1.litmus and C-RCU-phased-state-change-2.litmus). These tests could be argued to demonstrate that this code and a variant of it really do work. q

Quick Quiz 9.52: p.165
But what if there is an arbitrarily long series of RCU read-side critical sections in multiple threads, so that at any point in time there is at least one thread in the system executing in an RCU read-side critical section? Wouldn't that prevent any data from a SLAB_TYPESAFE_BY_RCU slab ever being returned to the system, possibly resulting in OOM events?

Answer:
There could certainly be an arbitrarily long period of time during which at least one thread is always in an RCU read-side critical section. However, the key words in the description in Section 9.5.4.5 are "in-use" and "pre-existing". Keep in mind that a given RCU read-side critical section is conceptually only permitted to gain references to data elements that were visible to readers during that critical section. Furthermore, remember that a slab cannot be returned to the system until all of its data elements have been freed, in fact, the RCU grace period cannot start until after they have all been freed.
Therefore, the slab cache need only wait for those RCU read-side critical sections that started before the freeing of the last element of the slab. This in turn means that any RCU grace period that begins after the freeing of the last element will do—the slab may be returned to the system after that grace period ends. q
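For reference, a cache with this property is requested in the Linux kernel by passing the SLAB_TYPESAFE_BY_RCU flag to kmem_cache_create(); the struct conn and cache name below are hypothetical:

#include <linux/slab.h>
#include <linux/errno.h>

struct conn {
  atomic_t refcnt;
  unsigned long key;
};

static struct kmem_cache *conn_cache;

static int conn_cache_setup(void)
{
  /*
   * Objects freed to this cache may be reused immediately for another
   * object of the same type, but the underlying slab pages are not
   * returned to the system until a grace period has elapsed.  Readers
   * must therefore re-validate identity (for example, the key and a
   * nonzero refcount) after acquiring a reference.
   */
  conn_cache = kmem_cache_create("conn_cache", sizeof(struct conn), 0,
                                 SLAB_TYPESAFE_BY_RCU, NULL);
  return conn_cache ? 0 : -ENOMEM;
}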
Quick Quiz 9.53: p.166
What if the element we need to delete is not the first element of the list on line 9 of Listing 9.18?

Answer:
As with Listing 7.9, this is a very simple hash table with no chaining, so the only element in a given bucket is the first element. The reader is again invited to adapt this example to a hash table with full chaining. Less energetic readers might wish to refer to Chapter 10. q

Quick Quiz 9.54: p.166
Why is it OK to exit the RCU read-side critical section on line 15 of Listing 9.18 before releasing the lock on line 17?

Answer:
First, please note that the second check on line 14 is necessary because some other CPU might have removed this element while we were waiting to acquire the lock. However, the fact that we were in an RCU read-side critical section while acquiring the lock guarantees that this element could not possibly have been re-allocated and re-inserted into this hash table. Furthermore, once we acquire the lock, the lock itself guarantees the element's existence, so we no longer need to be in an RCU read-side critical section.
The question as to whether it is necessary to re-check the element's key is left as an exercise to the reader. q
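The general lookup-lock-recheck pattern can be sketched as follows. This is not Listing 9.18 itself: lookup_rcu() is a hypothetical RCU-protected search, and the deleted flag stands in for whatever marking the real code uses to indicate removal:

#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct elem {
  spinlock_t lock;
  int key;
  bool deleted;                    /* set under ->lock just before removal */
};

struct elem *lookup_rcu(int key);  /* hypothetical RCU-protected lookup */

static struct elem *acquire_elem(int key)
{
  struct elem *p;

  rcu_read_lock();
  p = lookup_rcu(key);
  if (!p) {
    rcu_read_unlock();
    return NULL;
  }
  spin_lock(&p->lock);
  rcu_read_unlock();               /* RCU guaranteed existence until here; the lock takes over */
  if (p->deleted) {                /* re-check: it might have been removed while we waited */
    spin_unlock(&p->lock);
    return NULL;
  }
  return p;                        /* returned with ->lock held */
}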
Quick Quiz 9.55: p.166
Why not exit the RCU read-side critical section on line 23 of Listing 9.18 before releasing the lock on line 22?

Answer:
Suppose we reverse the order of these two lines. Then this code is vulnerable to the following sequence of events:

1. CPU 0 invokes delete(), and finds the element to be deleted, executing through line 15. It has not yet actually deleted the element, but is about to do so.

2. CPU 1 concurrently invokes delete(), attempting to delete this same element. However, CPU 0 still holds the lock, so CPU 1 waits for it at line 13.

3. CPU 0 executes lines 16 and 17, and blocks at line 18 waiting for CPU 1 to exit its RCU read-side critical section.

4. CPU 1 now acquires the lock, but the test on line 14 fails because CPU 0 has already removed the element. CPU 1 now executes line 22 (which we switched with line 23 for the purposes of this Quick Quiz) and exits its RCU read-side critical section.

5. CPU 0 can now return from synchronize_rcu(), and thus executes line 19, sending the element to the freelist.

6. CPU 1 now attempts to release a lock for an element that has been freed, and, worse yet, possibly reallocated as some other type of data structure. This is a fatal memory-corruption error. q

Quick Quiz 9.56: p.167
WTF? How the heck do you expect me to believe that RCU can have less than a 300-picosecond overhead when the clock period at 2.10 GHz is almost 500 picoseconds?

Answer:
First, consider that the inner loop used to take this measurement is as follows:

for (i = nloops; i >= 0; i--) {
  rcu_read_lock();
  rcu_read_unlock();
}

Next, consider the effective definitions of rcu_read_lock() and rcu_read_unlock():

#define rcu_read_lock() barrier()
#define rcu_read_unlock() barrier()

These definitions constrain compiler code-movement optimizations involving memory references, but emit no instructions in and of themselves. However, if the loop variable is maintained in a register, the accesses to i will not count as memory references. Furthermore, the compiler can do loop unrolling, allowing the resulting code to "execute" multiple passes through the loop body simply by incrementing i by some value larger than the value 1.
So the "measurement" of 267 picoseconds is simply the fixed overhead of the timing measurements divided by the number of passes through the inner loop containing the calls to rcu_read_lock() and rcu_read_unlock(), plus the code to manipulate i divided by the loop-unrolling factor. And therefore, this measurement really is in error, in fact, it exaggerates the overhead by an arbitrary number of orders of magnitude. After all, in terms of machine instructions emitted, the actual overheads of rcu_read_lock() and of rcu_read_unlock() are each precisely zero.
It is not just every day that a timing measurement of 267 picoseconds turns out to be an overestimate! q

Quick Quiz 9.57: p.167
Didn't an earlier edition of this book show RCU read-side overhead way down in the sub-picosecond range? What happened???

Answer:
Excellent memory!!! The overhead in some early releases was in fact roughly 100 femtoseconds.
What happened was that RCU usage spread more broadly through the Linux kernel, including into code that takes page faults. Back at that time, rcu_read_lock() and rcu_read_unlock() were complete no-ops in CONFIG_PREEMPT=n kernels. Unfortunately, that situation allowed the compiler to reorder page-faulting memory accesses into RCU read-side critical sections. Of course, page faults can block, which destroys those critical sections.
Nor was this a theoretical problem: A failure actually manifested in 2019. Herbert Xu tracked this failure down, and Linus Torvalds therefore queued a commit to upgrade rcu_read_lock() and rcu_read_unlock() to unconditionally include a call to barrier() [Tor19]. And although barrier() emits no code, it does constrain compiler optimizations. And so the price of widespread RCU usage is slightly higher rcu_read_lock() and rcu_read_unlock() overhead. As such, Linux-kernel RCU has proven to be a victim of its own success.
Of course, it is also the case that the older results were obtained on a different system than were those shown in Figure 9.25. So which change had the most effect, Linus's commit or the change in the system? This question is left as an exercise to the reader. q

Quick Quiz 9.58: p.167
Why is there such large variation for the RCU trace in Figure 9.25?

Answer:
Keep in mind that this is a log-log plot, so those large-seeming RCU variances in reality span only a few hundred picoseconds. And that is such a short time that anything could cause it. However, given that the variance decreases with both small and large numbers of CPUs, one hypothesis is that the variation is due to migrations from one CPU to another.
Yes, these measurements were taken with interrupts disabled, but they were also taken within a guest OS, so that preemption was still possible at the hypervisor level. In addition, the system featured hyperthreading and a single hardware thread running this RCU workload is able to consume more than half of the core's resources. Therefore, the overall throughput varies depending on how many of a given guest OS's CPUs share cores. Attempting to reduce these variations by running the guest OSes at real-time priority (as suggested by Joel Fernandes) is left as an exercise for the reader. q

Quick Quiz 9.59: p.168
Given that the system had no fewer than 448 hardware threads, why only 192 CPUs?

Answer:
Because the script (rcuscale.sh) that generates this data spawns a guest operating system for each set of points gathered, and on this particular system, both qemu and KVM limit the number of CPUs that may be configured into a given guest OS. Yes, it would have been possible to run a few more CPUs, but 192 is a nice round number from a binary perspective, given that 256 is infeasible. q

Quick Quiz 9.60: p.168
Why the larger error ranges for the submicrosecond durations in Figure 9.27?

Answer:
Because smaller disturbances result in greater relative errors for smaller measurements. Also, the Linux kernel's ndelay() nanosecond-scale primitive is (as of 2020) less accurate than is the udelay() primitive used for the data for durations of a microsecond or more. It is instructive to compare to the zero-length case shown in Figure 9.25. q

Quick Quiz 9.61: p.168
Is there an exception to this deadlock immunity, and if so, what sequence of events could lead to deadlock?

Answer:
One way to cause a deadlock cycle involving RCU read-side primitives is via the following (illegal) sequence of statements:

rcu_read_lock();
synchronize_rcu();
rcu_read_unlock();

The synchronize_rcu() cannot return until all pre-existing RCU read-side critical sections complete, but is enclosed in an RCU read-side critical section that cannot complete until the synchronize_rcu() returns. The result is a classic self-deadlock—you get the same effect when attempting to write-acquire a reader-writer lock while read-holding it.
Note that this self-deadlock scenario does not apply to RCU QSBR, because the context switch performed by the synchronize_rcu() would act as a quiescent state for this CPU, allowing a grace period to complete. However, this is if anything even worse, because data used by the RCU read-side critical section might be freed as a result of the grace period completing. Plus the Linux kernel's lockdep facility will yell at you.
In short, do not invoke synchronous RCU update-side primitives from within an RCU read-side critical section.
In addition, within the Linux kernel, RCU uses the scheduler and the scheduler uses RCU. In some cases, both RCU and the scheduler must take care to avoid deadlock. q
Quick Quiz 9.62: p.169
Immunity to both deadlock and priority inversion??? Sounds too good to be true. Why should I believe that this is even possible?

Answer:
It really does work. After all, if it didn't work, the Linux kernel would not run. q

Quick Quiz 9.63: p.169
But how many other algorithms really tolerate stale and inconsistent data?

Answer:
Quite a few!
Please keep in mind that the finite speed of light means that data reaching a given computer system is at least slightly stale at the time that it arrives, and extremely stale in the case of astronomical data. The finite speed of light also places a sharp limit on the consistency of data arriving from different sources via different paths.
You might as well face the fact that the laws of physics are incompatible with naive notions of perfect freshness and consistency. q

Quick Quiz 9.64: p.170
If Tasks RCU Trace might someday be priority boosted, why not also Tasks RCU and Tasks RCU Rude?

Answer:
Maybe, but these are less likely.
In the case of Tasks RCU, recall that the quiescent state is a voluntary context switch. Thus, all tasks not blocked after a voluntary context switch might need to be boosted, and the mechanics of deboosting would not likely be at all pretty.
In the case of Tasks RCU Rude, as was the case with the old RCU Sched, any preemptible region of code is a quiescent state. Thus, the only tasks that might need boosting are those currently running with preemption disabled. But boosting the priority of a preemption-disabled task has no effect. It therefore seems doubly unlikely that priority boosting will ever be introduced to Tasks RCU Rude, at least in its current form. q

Quick Quiz 9.65: p.173
Is RCU the only synchronization mechanism that combines temporal and spatial synchronization in this way?

Answer:
Not at all.
Hazard pointers can be considered to combine temporal and spatial synchronization in a similar manner. Referring to Listing 9.4, the hp_record() function's acquisition of a reference provides both spatial and temporal synchronization, subscribing to a version and marking the start of a reference, respectively. This function therefore combines the effects of RCU's rcu_read_lock() and rcu_dereference(). Referring now to Listing 9.5, the hp_clear() function's release of a reference provides temporal synchronization marking the end of a reference, and is thus similar to RCU's rcu_read_unlock(). The hazptr_free_later() function's retiring of a hazard-pointer-protected object provides temporal synchronization, similar to RCU's call_rcu(). The primitives used to mutate a hazard-pointer-protected structure provide spatial synchronization, similar to RCU's rcu_assign_pointer().
Alternatively, one could instead come at hazard pointers by analogy with reference counting. q

Quick Quiz 9.66: p.174
But wait! This is exactly the same code that might be used when thinking of RCU as a replacement for reader-writer locking! What gives?

Answer:
This is an effect of the Law of Toy Examples: Beyond a certain point, the code fragments look the same. The only difference is in how we think about the code. For example, what does an atomic_inc() operation do? It might be acquiring another explicit reference to an object to which we already have a reference, it might be incrementing an often-read/seldom-updated statistical counter, it might be checking into an HPC-style barrier, or any of a number of other things.
However, these differences can be extremely important. For but one example of the importance, consider that if we think of RCU as a restricted reference counting scheme, we would never be fooled into thinking that the updates would exclude the RCU read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement for reader-writer locking, for example, when you are replacing reader-writer locking with RCU. q

Quick Quiz 9.67: p.176
Which of these use cases best describes the Pre-BSD routing example in Section 9.5.4.1?

Answer:
Pre-BSD routing could be argued to fit into either quasi reader-writer lock, quasi reference count, or quasi multi-version concurrency control. The code is the same either way. This is similar to things like atomic_inc(), another tool that can be put to a great many uses. q

Quick Quiz 9.68: p.178
Why not just drop the lock before waiting for the grace period, or using something like call_rcu() instead of waiting for a grace period?

Answer:
The authors wished to support linearizable tree operations, so that concurrent additions to, deletions from, and searches of the tree would appear to execute in some globally agreed-upon order. In their search trees, this requires holding locks across grace periods. (It is probably better to drop linearizability as a requirement in most cases, but linearizability is a surprisingly popular (and costly!) requirement.) q

Quick Quiz 9.69: p.180
Why can't users dynamically allocate the hazard pointers as they are needed?

Answer:
They can, but at the expense of additional reader-traversal overhead and, in some environments, the need to handle memory-allocation failure. q

Quick Quiz 9.70: p.180
But don't Linux-kernel kref reference counters allow guaranteed unconditional reference acquisition?

Answer:
Yes they do, but the guarantee only applies unconditionally in cases where a reference is already held. With this in mind, please review the paragraph at the beginning of Section 9.6, especially the part saying "large enough that readers do not hold references from one traversal to another". q
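The distinction can be seen in a short kref sketch (hypothetical struct widget): kref_get() is legal only when a reference is already held, whereas kref_get_unless_zero() is the conditional form needed when the only protection is, say, an RCU read-side critical section:

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/types.h>

struct widget {
  struct kref ref;
  int data;
};

static void widget_release(struct kref *kr)
{
  kfree(container_of(kr, struct widget, ref));
}

/* Unconditional: legal only because the caller already holds a reference,
 * so the count cannot be zero here. */
static void widget_get(struct widget *w)
{
  kref_get(&w->ref);
}

/* Conditional: the count might already have reached zero, in which case
 * acquisition must fail rather than resurrect the object. */
static bool widget_tryget(struct widget *w)
{
  return kref_get_unless_zero(&w->ref);
}

static void widget_put(struct widget *w)
{
  kref_put(&w->ref, widget_release);
}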
Quick Quiz 9.71: p.181
But didn't the answer to one of the quick quizzes in Section 9.3 say that pairwise asymmetric barriers could eliminate the read-side smp_mb() from hazard pointers?

Answer:
Yes, it did. However, doing this could be argued to change hazard-pointers "Reclamation Forward Progress" row (discussed later) from lock-free to blocking because a CPU spinning with interrupts disabled in the kernel would prevent the update-side portion of the asymmetric barrier from completing. In the Linux kernel, such blocking could in theory be prevented by building the kernel with CONFIG_NO_HZ_FULL, designating the relevant CPUs as nohz_full at boot time, ensuring that only one thread was ever runnable on a given CPU at a given time, and avoiding ever calling into the kernel. Alternatively, you could ensure that the kernel was free of any bugs that might cause CPUs to spin with interrupts disabled.
Given that CPUs spinning in the Linux kernel with interrupts disabled seems to be rather rare, one might counter-argue that asymmetric-barrier hazard-pointer updates are non-blocking in practice, if not in theory. q

E.10 Data Structures

Quick Quiz 10.1: p.186
But chained hash tables are but one type of many. Why the focus on chained hash tables?

Answer:
Chained hash tables are completely partitionable, and thus well-suited to concurrent use. There are other completely-partitionable hash tables, for example, split-ordered list [SS06], but they are considerably more complex. We therefore start with chained hash tables. q
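The partitioning is visible in even a minimal sketch of a chained hash table with one lock per bucket (hypothetical names, not the book's hashtab API): operations on different buckets touch disjoint chains and disjoint locks, so they never contend.

#include <pthread.h>
#include <stddef.h>

struct ht_elem {
  struct ht_elem *next;
  unsigned long key;
};

struct ht_bucket {
  pthread_mutex_t lock;            /* one lock per chain */
  struct ht_elem *head;
};

struct hashtab {
  unsigned long nbuckets;
  struct ht_bucket *buckets;
};

static struct ht_bucket *ht_bucket(struct hashtab *htp, unsigned long key)
{
  return &htp->buckets[key % htp->nbuckets];
}

static struct ht_elem *ht_lookup(struct hashtab *htp, unsigned long key)
{
  struct ht_bucket *htbp = ht_bucket(htp, key);
  struct ht_elem *p;

  pthread_mutex_lock(&htbp->lock);
  for (p = htbp->head; p; p = p->next)
    if (p->key == key)
      break;
  pthread_mutex_unlock(&htbp->lock);
  return p;                        /* no existence guarantee once the lock is dropped */
}

static void ht_add(struct hashtab *htp, struct ht_elem *ep)
{
  struct ht_bucket *htbp = ht_bucket(htp, ep->key);

  pthread_mutex_lock(&htbp->lock);
  ep->next = htbp->head;
  htbp->head = ep;
  pthread_mutex_unlock(&htbp->lock);
}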
Quick Quiz 10.2: p.186
But isn't the double comparison on lines 10-13 in Listing 10.3 inefficient in the case where the key fits into an unsigned long?

Answer:
Indeed it is! However, hash tables quite frequently store information with keys such as character strings that do not necessarily fit into an unsigned long. Simplifying the hash-table implementation for the case where keys always fit into unsigned longs is left as an exercise for the reader. q
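One possible direction for that exercise, sketched here with hypothetical names rather than the Listing 10.3 code: store the key by value in the element so that a single integer comparison replaces the hash-then-callback double comparison.

struct ht_elem {
  struct ht_elem *next;
  unsigned long key;               /* key stored by value, no comparison callback needed */
};

static struct ht_elem *ht_search_bucket(struct ht_elem *head, unsigned long key)
{
  struct ht_elem *p;

  for (p = head; p; p = p->next)
    if (p->key == key)             /* one comparison per element */
      return p;
  return NULL;
}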
Quick Quiz 10.3: p.188
Instead of simply increasing the number of hash buckets, wouldn't it be better to cache-align the existing hash buckets?

Answer:
The answer depends on a great many things. If the hash table has a large number of elements per bucket, it would clearly be better to increase the number of hash buckets. On the other hand, if the hash table is lightly loaded, the answer depends on the hardware, the effectiveness of the hash function, and the workload. Interested readers are encouraged to experiment. q

Quick Quiz 10.4: p.188
Given the negative scalability of the Schrödinger's Zoo application across sockets, why not just run multiple copies of the application, with each copy having a subset of the animals and confined to run on a single socket?

Answer:
You can do just that! In fact, you can extend this idea to large clustered systems, running one copy of the application on each node of the cluster. This practice is called "sharding", and is heavily used in practice by large web-based retailers [DHJ+07].
However, if you are going to shard on a per-socket basis within a multisocket system, why not buy separate smaller and cheaper single-socket systems, and then run one shard of the database on each of those systems? q
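A sharding scheme of this kind only needs a deterministic mapping from key to shard. The helper below is a hypothetical sketch (any reasonable hash function would do): each shard then runs its own per-socket or per-node instance, and requests are routed by shard_of().

#include <stdint.h>

static inline unsigned int shard_of(uint64_t key, unsigned int nshards)
{
  /* Mix the bits so that clustered keys spread evenly, then reduce. */
  key ^= key >> 33;
  key *= 0xff51afd7ed558ccdULL;
  key ^= key >> 33;
  return (unsigned int)(key % nshards);
}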
Quick Quiz 10.5: p.189
But if elements in a hash table can be removed concurrently with lookups, doesn't that mean that a lookup could return a reference to a data element that was removed immediately after it was looked up?

Answer:
Yes it can! This is why hashtab_lookup() must be invoked within an RCU read-side critical section, and it is why hashtab_add() and hashtab_del() must also use RCU-aware list-manipulation primitives. Finally, this is why the caller of hashtab_del() must wait for a grace period (e.g., by calling synchronize_rcu()) before freeing the removed element. This will ensure that all RCU readers that might reference the newly removed element have completed before that element is freed. q
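The same protocol in miniature, using Linux-kernel RCU list primitives rather than the book's hashtab code (the struct animal and bucket types are hypothetical): the reader runs under rcu_read_lock(), the updater unlinks with an RCU-aware primitive, waits for a grace period, and only then frees.

#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct animal {
  struct hlist_node node;
  unsigned long key;
};

struct bucket {
  spinlock_t lock;
  struct hlist_head head;
};

/* Reader: must be called under rcu_read_lock(), and the result must not be
 * used after rcu_read_unlock() unless some other protection was acquired. */
static struct animal *animal_lookup(struct bucket *b, unsigned long key)
{
  struct animal *a;

  hlist_for_each_entry_rcu(a, &b->head, node)
    if (a->key == key)
      return a;
  return NULL;
}

/* Updater: unlink, wait for pre-existing readers, then free. */
static void animal_del(struct bucket *b, struct animal *a)
{
  spin_lock(&b->lock);
  hlist_del_rcu(&a->node);
  spin_unlock(&b->lock);
  synchronize_rcu();               /* all readers that might still see 'a' have finished */
  kfree(a);
}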
Quick Quiz 10.6: p.190
The hashtorture.h file contains more than 1,000 lines! Is that a comprehensive test or what???

Answer:
What.
The hashtorture.h tests are a good start and suffice for a textbook algorithm. If this code was to be used in production, much more testing would be required:

1. Have some subset of elements that always reside in the table, and verify that lookups always find these elements regardless of the number and type of concurrent updates in flight.

2. Pair an updater with one or more readers, verifying that after an element is added, once a reader successfully looks up that element, all later lookups succeed. The definition of "later" will depend on the table's consistency requirements.

3. Pair an updater with one or more readers, verifying that after an element is deleted, once a reader's lookup of that element fails, all later lookups also fail.

There are many more tests where those came from, the exact nature of which depends on the details of the requirements on your particular hash table. q

Quick Quiz 10.7: p.192
How can we be so sure that the hash-table size is at fault here, especially given that Figure 10.4 on page 188 shows that varying hash-table size has almost no effect? Might the problem instead be something like false sharing?

Answer:
Excellent question!
False sharing requires writes, which are not featured in the unsynchronized and RCU runs of this lookup-only benchmark. The problem is therefore not false sharing.
Still unconvinced? Then look at the log-log plot in Figure E.6, which shows performance for 448 CPUs as a function of the hash-table size, that is, number of buckets and maximum number of elements. A hash-table of size 1,024 has 1,024 buckets and contains at most 1,024 elements, with the average occupancy being 512 elements. Because this is a read-only benchmark, the actual occupancy is always equal to the average occupancy.

[Figure E.6: Read-Only RCU-Protected Hash-Table Performance For Schrödinger's Zoo at 448 CPUs, Varying Table Size. Log-log plot of total lookups per millisecond versus hash-table size (buckets and maximum elements); traces for ideal, unsync, and QSBR/RCU/hazptr.]

This figure shows near-ideal performance below about 8,000 elements, that is, when the hash table comprises less than 1 MB of data. This near-ideal performance is consistent with that for the pre-BSD routing table shown in Figure 9.21 on page 161, even at 448 CPUs. However, the performance drops significantly (this is a log-log plot) at about 8,000 elements, which is where the 1,048,576-byte L2 cache overflows. Performance falls off a cliff (even on this log-log plot) at about 300,000 elements, where the 40,370,176-byte L3 cache overflows. This demonstrates that the memory-system bottleneck is profound, degrading performance by well in excess of an order of magnitude for the large hash tables. This should not be a surprise, as the size-8,388,608 hash table occupies about 1 GB of memory, overflowing the L3 caches by a factor of 25.
The reason that Figure 10.4 on page 188 shows little effect is that its data was gathered from bucket-locked hash tables, where locking overhead and contention drowned out cache-capacity effects. In contrast, both RCU and hazard-pointers readers avoid stores to shared data, which means that the cache-capacity effects come to the fore.
Still not satisfied? Find a multi-socket system and run this code, making use of whatever performance-counter hardware is available. This hardware should allow you to track down the precise cause of any slowdowns exhibited on your particular system. The experience gained by doing this exercise will be extremely valuable, giving you a significant advantage over those whose understanding of this issue is strictly theoretical.[10] q

[10] Of course, a theoretical understanding beats no understanding.

Quick Quiz 10.8: p.192
The memory system is a serious bottleneck on this big system. Why bother putting 448 CPUs on a system without giving them enough memory bandwidth to do something useful???

Answer:
It would indeed be a bad idea to use this large and expensive system for a workload consisting solely of simple hash-table lookups of small data elements. However, this system is extremely useful for a great many workloads that feature more processing and less memory accessing. For example, some in-memory databases run extremely well on this class of system, albeit when running much more complex sets of queries than performed by the benchmarks in this chapter. For example, such systems might be processing images or video streams stored in each element, providing further performance benefits due to the fact that the resulting sequential memory accesses will make better use of the available memory bandwidth than will a pure pointer-following workload.
But let this be a lesson to you. Modern computer systems come in a great many shapes and sizes, and great care is frequently required to select one that suits your application. And perhaps even more frequently, significant care and work is required to adjust your application to the specific computer systems at hand. q

Quick Quiz 10.9: p.193
The dangers of extrapolating from 28 CPUs to 448 CPUs were made quite clear in Section 10.2.3. Would extrapolating up from 448 CPUs be any safer?

Answer:
In theory, no, it isn't any safer, and a useful exercise would be to run these programs on larger systems. In practice, there are only a very few systems with more than 448 CPUs, in contrast to the huge number having more than 28 CPUs. This means that although it is dangerous to extrapolate beyond 448 CPUs, there is very little need to do so.
In addition, other testing has shown that RCU read-side primitives offer consistent performance and scalability up to at least 1024 CPUs. However, it is useful to review Figure E.6 and its associated commentary. You see, unlike the 448-CPU system that provided this data, the system enjoying linear scalability up to 1024 CPUs boasted excellent memory bandwidth. q

Quick Quiz 10.10: p.197
How does the code in Listing 10.10 protect against the resizing process progressing past the selected bucket?

Answer:
It does not provide any such protection. That is instead the job of the update-side concurrency-control functions described next. q

Quick Quiz 10.11: p.198
Suppose that one thread is inserting an element into the hash table during a resize operation. What prevents this insertion from being lost due to a subsequent resize operation completing before the insertion does?

Answer:
The second resize operation will not be able to move beyond the bucket into which the insertion is taking place due to the insertion holding the lock(s) on one or both of the hash buckets in the hash tables. Furthermore, the insertion operation takes place within an RCU read-side critical section. As we will see when we examine the hashtab_resize() function, this means that each resize operation uses synchronize_rcu() invocations to wait for the insertion's read-side critical section to complete. q

Quick Quiz 10.12: p.198
The hashtab_lookup() function in Listing 10.12 ignores concurrent resize operations. Doesn't this mean that readers might miss an element that was previously added during a resize operation?

Answer:
No. As we will see soon, the hashtab_add() and hashtab_del() functions keep the old hash table up-to-date while a resize operation is in progress. q

Quick Quiz 10.13: p.198
The hashtab_add() and hashtab_del() functions in Listing 10.12 can update two hash buckets while a resize operation is progressing. This might cause poor performance if the frequency of resize operations is not negligible. Isn't it possible to reduce the cost of updates in such cases?

Answer:
Yes, at least assuming that a slight increase in the cost of hashtab_lookup() is acceptable. One approach is shown in Listings E.6 and E.7 (hash_resize_s.c).
This version of hashtab_add() adds an element to either the old bucket if it is not resized yet, or to the new bucket if it has been resized, and hashtab_del() removes the specified element from any buckets into which it has been inserted. The hashtab_lookup() function searches the new bucket if the search of the old bucket fails, which has the disadvantage of adding overhead to the lookup fastpath. The alternative hashtab_lock_mod() returns the locking state of the new bucket in ->hbp[0] and ->hls_idx[0] if a resize operation is in progress, instead of the perhaps more natural choice of ->hbp[1] and ->hls_idx[1]. However, this less-natural choice has the advantage of simplifying hashtab_add().
Further analysis of the code is left as an exercise for the reader. q

Listing E.6: Resizable Hash-Table Access Functions (Fewer Updates)
struct ht_elem *
hashtab_lookup(struct hashtab *htp_master, void *key)
{
  struct ht *htp;
  struct ht_elem *htep;

  htp = rcu_dereference(htp_master->ht_cur);
  htep = ht_search_bucket(htp, key);
  if (htep)
    return htep;
  htp = rcu_dereference(htp->ht_new);
  if (!htp)
    return NULL;
  return ht_search_bucket(htp, key);
}

void hashtab_add(struct ht_elem *htep,
                 struct ht_lock_state *lsp)
{
  struct ht_bucket *htbp = lsp->hbp[0];
  int i = lsp->hls_idx[0];

  htep->hte_next[!i].prev = NULL;
  cds_list_add_rcu(&htep->hte_next[i], &htbp->htb_head);
}

void hashtab_del(struct ht_elem *htep,
                 struct ht_lock_state *lsp)
{
  int i = lsp->hls_idx[0];

  if (htep->hte_next[i].prev) {
    cds_list_del_rcu(&htep->hte_next[i]);
    htep->hte_next[i].prev = NULL;
  }
  if (lsp->hbp[1] && htep->hte_next[!i].prev) {
    cds_list_del_rcu(&htep->hte_next[!i]);
    htep->hte_next[!i].prev = NULL;
  }
}

Listing E.7: Resizable Hash-Table Update-Side Locking Function (Fewer Updates)
static void
hashtab_lock_mod(struct hashtab *htp_master, void *key,
                 struct ht_lock_state *lsp)
{
  long b;
  unsigned long h;
  struct ht *htp;
  struct ht_bucket *htbp;

  rcu_read_lock();
  htp = rcu_dereference(htp_master->ht_cur);
  htbp = ht_get_bucket(htp, key, &b, &h);
  spin_lock(&htbp->htb_lock);
  lsp->hbp[0] = htbp;
  lsp->hls_idx[0] = htp->ht_idx;
  if (b > READ_ONCE(htp->ht_resize_cur)) {
    lsp->hbp[1] = NULL;
    return;
  }
  htp = rcu_dereference(htp->ht_new);
  htbp = ht_get_bucket(htp, key, &b, &h);
  spin_lock(&htbp->htb_lock);
  lsp->hbp[1] = lsp->hbp[0];
  lsp->hls_idx[1] = lsp->hls_idx[0];
  lsp->hbp[0] = htbp;
  lsp->hls_idx[0] = htp->ht_idx;
}

Quick Quiz 10.14: p.200
In the hashtab_resize() function in Listing 10.13, what guarantees that the update to ->ht_new on line 29 will be seen as happening before the update to ->ht_resize_cur on line 40 from the perspective of hashtab_add() and hashtab_del()? In other words, what prevents hashtab_add() and hashtab_del() from dereferencing a NULL pointer loaded from ->ht_new?

Answer:
The synchronize_rcu() on line 30 of Listing 10.13 ensures that all pre-existing RCU readers have completed between the time that we install the new hash-table reference on line 29 and the time that we update ->ht_resize_cur on line 40. This means that any reader that sees a non-negative value of ->ht_resize_cur cannot have started before the assignment to ->ht_new, and thus must be able to see the reference to the new hash table.
And this is why the update-side hashtab_add() and hashtab_del() functions must be enclosed in RCU read-side critical sections, courtesy of hashtab_lock_mod() and hashtab_unlock_mod() in Listing 10.11. q

Quick Quiz 10.15: p.200
Why is there a WRITE_ONCE() on line 40 in Listing 10.13?

Answer:
Together with the READ_ONCE() on line 16 in hashtab_lock_mod() of Listing 10.11, it tells the compiler that the non-initialization accesses to ->ht_resize_cur must remain because reads from ->ht_resize_cur really can race with writes, just not in a way to change the "if" conditions. q

Quick Quiz 10.16: p.200
How much of the difference in performance between the large and small hash tables shown in Figure 10.19 was due to long hash chains and how much was due to memory-system bottlenecks?

Answer:
The easy way to answer this question is to do another run with 2,097,152 elements, but this time also with 2,097,152 buckets, thus bringing the average number of elements per bucket back down to unity.
The results are shown by the triple-dashed new trace in the middle of Figure E.7. The other six traces are identical to their counterparts in Figure 10.19 on page 200. The gap between this new trace and the lower set of three traces is a rough measure of how much of the difference in performance was due to hash-chain length, and the gap between the new trace and the upper set of three traces is a rough measure of how much of that difference was due to memory-system bottlenecks. The new trace starts out slightly below its 262,144-element counterpart at a single CPU, showing that cache capacity is degrading performance slightly even on that single CPU.[11] This is to be expected, given that unlike its smaller counterpart, the 2,097,152-bucket hash table does not fit into the L3 cache. This new trace rises just past 28 CPUs, which is also to be expected. This rise is due to the fact that the 29th CPU is on another socket, which brings with it an additional 39 MB of cache as well as additional memory bandwidth.

[Figure E.7: Effect of Memory-System Bottlenecks on Hash Tables. Lookups per millisecond versus number of CPUs (threads); trace labels include 262,144 and 2,097,152.]

But the large hash table's advantage over that of the hash table with 524,288 buckets (but still 2,097,152 elements) decreases with additional CPUs, which is consistent with the bottleneck residing in the memory system. Above about 400 CPUs, the 2,097,152-bucket hash table is actually outperformed slightly by the 524,288-bucket hash table. This should not be a surprise because the memory system is the bottleneck and the larger number of buckets increases this workload's memory footprint.
The alert reader will have noted the word "rough" above and might be interested in a more detailed analysis. Such readers are invited to run similar benchmarks, using whatever performance counters or hardware-analysis tools they might have available. This can be a long and complex journey, but those brave enough to embark on it will be rewarded with detailed knowledge of hardware performance and its effect on software. q

[11] Yes, as far as hardware architects are concerned, caches are part of the memory system.

Quick Quiz 10.17: p.204
Couldn't the hashtorture.h code be modified to accommodate a version of hashtab_lock_mod() that subsumes the ht_get_bucket() functionality?

Answer:
It probably could, and doing so would benefit all of the per-bucket-locked hash tables presented in this chapter. Making this modification is left as an exercise for the reader. q

Quick Quiz 10.18: p.204
How much do these specializations really save? Are they really worth it?

Answer:
The answer to the first question is left as an exercise to the reader. Try specializing the resizable hash table and see how much performance improvement results. The second question cannot be answered in general, but must instead be answered with respect to a specific use case. Some use cases are extremely sensitive to performance and scalability, while others are less so. q

E.11 Validation

Quick Quiz 11.1: p.208
When in computing is it necessary to follow a fragmentary plan?

Answer:
There are any number of situations, but perhaps the most important situation is when no one has ever created anything resembling the program to be developed. In this case, the only way to create a credible plan is to implement the program, create the plan, and implement it a second time. But whoever implements the program for the first time has no choice but to follow a fragmentary plan because any detailed plan created in ignorance cannot survive first contact with the real world.
And perhaps this is one reason why evolution has favored insanely optimistic human beings who are happy to follow fragmentary plans! q

Quick Quiz 11.2: p.208
Who cares about the organization? After all, it is the project that is important!

Answer:
Yes, projects are important, but if you like being paid for your work, you need organizations as well as projects. q

Quick Quiz 11.3: p.209
Suppose that you are writing a script that processes the output of the time command, which looks as follows:

real 0m0.132s
user 0m0.040s
sys 0m0.008s

The script is required to check its input for errors, and to give appropriate diagnostics if fed erroneous time output. What test inputs should you provide to this program to test it for use with time output generated by single-threaded programs?

Answer:
Can you say "Yes" to all the following questions?

1. Do you have a test case in which all the time is consumed in user mode by a CPU-bound program?

2. Do you have a test case in which all the time is consumed in system mode by a CPU-bound program?

3. Do you have a test case in which all three times are zero?

4. Do you have a test case in which the "user" and "sys" times sum to more than the "real" time? (This would of course be completely legitimate in a multithreaded program.)

5. Do you have a set of test cases in which one of the times uses more than one second?

6. Do you have a set of test cases in which one of the times uses more than ten seconds?

7. Do you have a set of test cases in which one of the times has non-zero minutes? (For example, "15m36.342s".)

8. Do you have a set of test cases in which one of the times has a seconds value of greater than 60?

9. Do you have a set of test cases in which one of the times overflows 32 bits of milliseconds? 64 bits of milliseconds?

10. Do you have a set of test cases in which one of the times is negative?
11. Do you have a set of test cases in which one of the times has a positive minutes value but a negative seconds value?
12. Do you have a set of test cases in which one of the times omits the "m" or the "s"?
13. Do you have a set of test cases in which one of the times is non-numeric? (For example, "Go Fish".)
14. Do you have a set of test cases in which one of the lines is omitted? (For example, where there is a "real" value and a "sys" value, but no "user" value.)
15. Do you have a set of test cases where one of the lines is duplicated? Or duplicated, but with a different time value for the duplicate?
16. Do you have a set of test cases where a given line has more than one time value? (For example, "real 0m0.132s 0m0.008s".)
17. Do you have a set of test cases containing random characters?
18. In all test cases involving invalid input, did you generate all permutations?
19. For each test case, do you have an expected outcome for that test?

If you did not generate test data for a substantial number of the above cases, you will need to cultivate a more destructive attitude in order to have a chance of generating high-quality tests.

Of course, one way to economize on destructiveness is to generate the tests with the to-be-tested source code at hand, which is called white-box testing (as opposed to black-box testing). However, this is no panacea: You will find that it is all too easy to find your thinking limited by what the program can handle, thus failing to generate truly destructive inputs. q

Quick Quiz 11.4: p.210
You are asking me to do all this validation BS before I even start coding??? That sounds like a great way to never get started!!!

Answer:
If it is your project, for example, a hobby, do what you like. Any time you waste will be your own, and you have no one else to answer to for it. And there is a good chance that the time will not be completely wasted. For example, if you are embarking on a first-of-a-kind project, the requirements are in some sense unknowable anyway. In this case, the best approach might be to quickly prototype a number of rough solutions, try them out, and see what works best.

On the other hand, if you are being paid to produce a system that is broadly similar to existing systems, you owe it to your users, your employer, and your future self to validate early and often. q

Quick Quiz 11.5: p.210
Are you actually suggesting that it is possible to test correctness into software??? Everyone knows that is impossible!!!

Answer:
Please note that the text used the word "validation" rather than the word "testing". The word "validation" includes formal methods as well as testing, for more on which please see Chapter 12.

But as long as we are bringing up things that everyone should know, let's remind ourselves that Darwinian evolution is not about correctness, but rather about survival. As is software. My goal as a developer is not that my software be attractive from a theoretical viewpoint, but rather that it survive whatever its users throw at it.

Although the notion of correctness does have its uses, its fundamental limitation is that the specification against which correctness is judged will also have bugs. This means nothing more nor less than that traditional correctness proofs prove that the code in question contains the intended set of bugs!

Alternative definitions of correctness instead focus on the lack of problematic properties, for example, proving that the software has no use-after-free bugs, no NULL pointer dereferences, no array-out-of-bounds references, and so on. Make no mistake, finding and eliminating such classes of bugs can be highly useful. But the fact remains that the lack of certain classes of bugs does nothing to demonstrate fitness for any specific purpose.

Therefore, usage-driven validation remains critically important.

Besides, it is also impossible to verify correctness into your software, especially given the problematic need to verify both the verifier and the specification. q

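As a concrete (if tiny) illustration of the destructive test inputs discussed in the answer to Quick Quiz 11.3, the following C sketch rejects a few of the malformed time-output cases from that list. The check_time_line() helper, its sscanf() format, and its thresholds are assumptions made for this example rather than anything taken from the book's CodeSamples, and a real script would of course need to cover many more of the listed cases.

/* Minimal sketch: validate one line of time(1) output, e.g. "user 0m0.040s". */
#include <stdio.h>
#include <string.h>

/* Returns 0 if the line looks plausible, -1 otherwise. */
static int check_time_line(const char *line)
{
	char label[8];
	long min;
	double sec;
	char m, s;

	if (sscanf(line, "%7s %ld%c%lf%c", label, &min, &m, &sec, &s) != 5)
		return -1;	/* Missing field or non-numeric input. */
	if (strcmp(label, "real") && strcmp(label, "user") && strcmp(label, "sys"))
		return -1;	/* Unknown label. */
	if (m != 'm' || s != 's')
		return -1;	/* Omitted "m" or "s". */
	if (min < 0 || sec < 0 || sec >= 60.0)
		return -1;	/* Negative times or out-of-range seconds. */
	return 0;
}

int main(void)
{
	printf("%d\n", check_time_line("user 0m0.040s"));	/* 0: valid. */
	printf("%d\n", check_time_line("sys 0m71.008s"));	/* -1: seconds >= 60. */
	printf("%d\n", check_time_line("real Go Fish"));	/* -1: non-numeric. */
	return 0;
}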
Quick Quiz 11.6: p.212
How can you implement WARN_ON_ONCE()?

Answer:
If you don't mind WARN_ON_ONCE() sometimes warning more than once, simply maintain a static variable that is initialized to zero. If the condition triggers, check the variable, and if it is non-zero, return. Otherwise, set it to one, print the message, and return.

If you really need the message to never appear more than once, you can use an atomic exchange operation in place of "set it to one" above. Print the message only if the atomic exchange operation returns zero. q

Quick Quiz 11.7: p.213
Just what invalid assumptions are you accusing Linux kernel hackers of harboring???

Answer:
Those wishing a complete answer to this question are encouraged to search the Linux kernel git repository for commits containing the string "Fixes:". There were many thousands of them just in the year 2020, including fixes for the following invalid assumptions:

1. Testing for a non-zero denominator will prevent divide-by-zero errors. (Hint: Suppose that the test uses 64-bit arithmetic but that the division uses 32-bit arithmetic.)
2. Userspace can be trusted to zero out versioned data structures used to communicate with the kernel. (Hint: Sometimes userspace has no idea how large the data structure is.)
3. Outdated TCP duplicate selective acknowledgement (D-SACK) packets can be completely ignored. (Hint: These packets might also contain other information.)
4. All CPUs are little-endian.
5. Once a data structure is no longer needed, all of its memory may be immediately freed.
6. All devices can be initialized while in standby mode.
7. Developers can be trusted to consistently do correct hexadecimal arithmetic.

Those who look at these commits in greater detail will conclude that invalid assumptions are the rule, not the exception. q

Quick Quiz 11.8: p.214
Why would anyone bother copying existing code in pen on paper??? Doesn't that just increase the probability of transcription errors?

Answer:
If you are worried about transcription errors, please allow me to be the first to introduce you to a really cool tool named diff. In addition, carrying out the copying can be quite valuable:

1. If you are copying a lot of code, you are probably failing to take advantage of an opportunity for abstraction. The act of copying code can provide great motivation for abstraction.
2. Copying the code gives you an opportunity to think about whether the code really works in its new setting. Is there some non-obvious constraint, such as the need to disable interrupts or to hold some lock?
3. Copying the code also gives you time to consider whether there is some better way to get the job done.

So, yes, copy the code! q

Quick Quiz 11.9: p.215
This procedure is ridiculously over-engineered! How can you expect to get a reasonable amount of software written doing it this way???

Answer:
Indeed, repeatedly copying code by hand is laborious and slow. However, when combined with heavy-duty stress testing and proofs of correctness, this approach is also extremely effective for complex parallel code where ultimate performance and reliability are required and where debugging is difficult. The Linux-kernel RCU implementation is a case in point.

On the other hand, if you are writing a simple single-threaded shell script, then you would be best-served by a different methodology. For example, enter each command one at a time into an interactive shell with a test data set to make sure that it does what you want, then copy-and-paste the successful commands into your script. Finally, test the script as a whole.

If you have a friend or colleague who is willing to help out, pair programming can work very well, as can any number of formal design- and code-review processes. And if you are writing code as a hobby, then do whatever you like.

In short, different types of software need different development methodologies. q

Quick Quiz 11.10: p.215
What do you do if, after all the pen-on-paper copying, you find a bug while typing in the resulting code?

Answer:
The answer, as is often the case, is "it depends". If the bug is a simple typo, fix that typo and continue typing. However, if the bug indicates a design flaw, go back to pen and paper. q

Quick Quiz 11.11: p.215
Wait! Why on earth would an abstract piece of software fail only sometimes???

Answer:
Because complexity and concurrency can produce results that are indistinguishable from randomness [MOZ09]. For example, a bug in Linux-kernel RCU required the following to hold before that bug would manifest:

1. The kernel was built for HPC or real-time use, so that a given CPU's RCU work could be offloaded to some other CPU.
2. An offloaded CPU went offline just after generating a large quantity of RCU work.
3. A special rcu_barrier() API was invoked just at this time.
4. The RCU work from the newly offlined CPU was still being processed after rcu_barrier() returned.
5. One of these remaining RCU work items was related to the code invoking the rcu_barrier().

Making this bug manifest therefore required considerable luck or great testing skill. But the testing skill could be effective only if the bug was known, which of course it was not. Therefore, the manifesting of this bug was very well modeled as a probabilistic process. q

Quick Quiz 11.12: p.215
Suppose that you had a very large number of systems at your disposal. For example, at current cloud prices, you can purchase a huge amount of CPU time at low cost. Why not use this approach to get close enough to certainty for all practical purposes?

Answer:
This approach might well be a valuable addition to your validation arsenal. But it does have limitations that rule out "for all practical purposes":

1. Some bugs have extremely low probabilities of occurrence, but nevertheless need to be fixed. For example, suppose that the Linux kernel's RCU implementation had a bug that is triggered only once per million years of machine time on average. A million years of CPU time is hugely expensive even on the cheapest cloud platforms, but we could expect this bug to result in more than 50 failures per day on the more than 20 billion Linux instances in the world as of 2017.
2. The bug might well have zero probability of occurrence on your particular cloud-computing test setup, which means that you won't see it no matter how much machine time you burn testing it. For but one example, there are RCU bugs that appear only in preemptible kernels, and also other RCU bugs that appear only in non-preemptible kernels.

Of course, if your code is small enough, formal validation may be helpful, as discussed in Chapter 12. But beware: Formal validation of your code will not find errors in your assumptions, misunderstanding of the requirements, misunderstanding of the software or hardware primitives you use, or errors that you did not think to construct a proof for. q

Quick Quiz 11.13: p.216
Say what??? When I plug the earlier five-test 10 %-failure-rate example into the formula, I get 59,050 % and that just doesn't make sense!!!

Answer:
You are right, that makes no sense at all.

Remember that a probability is a number between zero and one, so that you need to divide a percentage by 100 to get a probability. So 10 % is a probability of 0.1, which plugged into the formula gives 0.4095, which rounds to 41 %, which quite sensibly matches the earlier result. q

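To spell out the arithmetic in the preceding answer, and assuming that the formula in question is the P = 1 - (1 - f)^n expression from Section 11.6, the five-test 10 %-failure-rate example works out as follows:

    P = 1 - (1 - f)^n = 1 - (1 - 0.1)^5 = 1 - 0.59049 = 0.40951 \approx 41\,\%

Plugging in 10 rather than 0.1 instead gives 1 - (1 - 10)^5 = 59,050, which is exactly the nonsensical 59,050 % called out in the question.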
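Returning to the answer to Quick Quiz 11.6, a userspace rendition of the never-warn-twice variant might look like the following sketch. It uses GCC-style statement expressions and atomic builtins rather than the Linux kernel's internal machinery, and the macro name, message format, and relaxed memory ordering are illustrative assumptions rather than anything lifted from the kernel.

/* Sketch of a userspace WARN_ON_ONCE() using GCC atomic builtins. */
#include <stdio.h>

#define WARN_ON_ONCE(cond)                                              \
	({                                                              \
		static int warned;                                      \
		int _c = !!(cond);                                      \
		                                                        \
		/* Atomic exchange: only the first warner sees zero. */ \
		if (_c && !__atomic_exchange_n(&warned, 1, __ATOMIC_RELAXED)) \
			fprintf(stderr, "Warning: %s at %s:%d\n",       \
				#cond, __FILE__, __LINE__);             \
		_c;                                                     \
	})

int main(void)
{
	for (int i = 0; i < 3; i++)
		WARN_ON_ONCE(i >= 0);	/* Warns only on the first iteration. */
	return 0;
}

Because the static variable lives inside the statement expression, each call site gets its own once-only flag, which matches the per-call-site behavior described in the answer.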
Quick Quiz 11.14: p.217
In Eq. 11.6, are the logarithms base-10, base-2, or base-e?

Answer:
It does not matter. You will get the same answer no matter what base of logarithms you use because the result is a pure ratio of logarithms. The only constraint is that you use the same base for both the numerator and the denominator. q

Quick Quiz 11.15: p.218
Suppose that a bug causes a test failure three times per hour on average. How long must the test run error-free to provide 99.9 % confidence that the fix significantly reduced the probability of failure?

Answer:
We set n to 3 and P to 99.9 in Eq. 11.11, resulting in:

    T = - \frac{1}{3} \ln \frac{100 - 99.9}{100} = 2.3    (E.9)

If the test runs without failure for 2.3 hours, we can be 99.9 % certain that the fix reduced the probability of failure. q

Quick Quiz 11.16: p.218
Doing the summation of all the factorials and exponentials is a real pain. Isn't there an easier way?

Answer:
One approach is to use the open-source symbolic manipulation program named "maxima". Once you have installed this program, which is a part of many Linux distributions, you can run it and give the load(distrib); command followed by any number of bfloat(cdf_poisson(m,l)); commands, where the m is replaced by the desired value of m (the actual number of failures in the actual test) and the l is replaced by the desired value of λ (the expected number of failures in the actual test).

In particular, the bfloat(cdf_poisson(2,24)); command results in 1.181617112359357b-8, which matches the value given by Eq. 11.13.

Another approach is to recognize that in this real world, it is not all that useful to compute (say) the duration of a test having two or fewer errors that would give a 76.8 % confidence of a 349.2x improvement in reliability. Instead, human beings tend to focus on specific values, for example, a 95 % confidence of a 10x improvement. People also greatly prefer error-free test runs, and so should you because doing so reduces your required test durations. Therefore, it is quite possible that the values in Table E.3 will suffice. Simply look up the desired confidence and degree of improvement, and the resulting number will give you the required error-free test duration in terms of the expected time for a single error to appear. So if your pre-fix testing suffered one failure per hour, and the powers that be require a 95 % confidence of a 10x improvement, you need a 30-hour error-free run.

Alternatively, you can use the rough-and-ready method described in Section 11.6.2. q

Table E.3: Human-Friendly Poisson-Function Display

                        Improvement
    Certainty (%)     Any     10x     100x
             90.0     2.3    23.0    230.0
             95.0     3.0    30.0    300.0
             99.0     4.6    46.1    460.5
             99.9     6.9    69.1    690.7

Quick Quiz 11.17: p.218
But wait!!! Given that there has to be some number of failures (including the possibility of zero failures), shouldn't Eq. 11.13 approach the value 1 as m goes to infinity?

Answer:
Indeed it should. And it does.

To see this, note that e^{-λ} does not depend on i, which means that it can be pulled out of the summation as follows:

    e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^i}{i!}    (E.10)

The remaining summation is exactly the Taylor series for e^{λ}, yielding:

    e^{-\lambda} e^{\lambda}    (E.11)

The two exponentials are reciprocals, and therefore cancel, resulting in exactly 1, as required. q

Quick Quiz 11.18: p.219
How is this approach supposed to help if the corruption

affected some unrelated pointer, which then caused the corruption???

Answer:
Indeed, that can happen. Many CPUs have hardware-debugging facilities that can help you locate that unrelated pointer. Furthermore, if you have a core dump, you can search the core dump for pointers referencing the corrupted region of memory. You can also look at the data layout of the corruption, and check pointers whose type matches that layout.

You can also step back and test the modules making up your program more intensively, which will likely confine the corruption to the module responsible for it. If this makes the corruption vanish, consider adding additional argument checking to the functions exported from each module.

Nevertheless, this is a hard problem, which is why I used the words "a bit of a dark art". q

Quick Quiz 11.19: p.219
But I did the bisection, and ended up with a huge commit. What do I do now?

Answer:
A huge commit? Shame on you! This is but one reason why you are supposed to keep the commits small.

And that is your answer: Break up the commit into bite-sized pieces and bisect the pieces. In my experience, the act of breaking up the commit is often sufficient to make the bug painfully obvious. q

Quick Quiz 11.20: p.220
Why don't conditional-locking primitives provide this spurious-failure functionality?

Answer:
There are locking algorithms that depend on conditional-locking primitives telling them the truth. For example, if conditional-lock failure signals that some other thread is already working on a given job, spurious failure might cause that job to never get done, possibly resulting in a hang. q

Quick Quiz 11.21: p.222
That is ridiculous!!! After all, isn't getting the correct answer later than one would like better than getting an incorrect answer???

Answer:
This question fails to consider the option of choosing not to compute the answer at all, and in doing so, also fails to consider the costs of computing the answer. For example, consider short-term weather forecasting, for which accurate models exist, but which require large (and expensive) clustered supercomputers, at least if you want to actually run the model faster than the weather.

And in this case, any performance bug that prevents the model from running faster than the actual weather prevents any forecasting. Given that the whole purpose of purchasing the large clustered supercomputers was to forecast weather, if you cannot run the model faster than the weather, you would be better off not running the model at all.

More severe examples may be found in the area of safety-critical real-time computing. q

Quick Quiz 11.22: p.222
But if you are going to put in all the hard work of parallelizing an application, why not do it right? Why settle for anything less than optimal performance and linear scalability?

Answer:
Although I do heartily salute your spirit and aspirations, you are forgetting that there may be high costs due to delays in the program's completion. For an extreme example, suppose that a 40 % performance shortfall from a single-threaded application is causing one person to die each day. Suppose further that in a day you could hack together a quick and dirty parallel program that ran 50 % faster on an eight-CPU system than the sequential version, but that an optimal parallel program would require four months of painstaking design, coding, debugging, and tuning.

It is safe to say that more than 100 people would prefer the quick and dirty version. q

Quick Quiz 11.23: p.224
But what about other sources of error, for example, due to interactions between caches and memory layout?

Answer:
Changes in memory layout can indeed result in unrealistic decreases in execution time. For example, suppose that a given microbenchmark almost always overflows the L0 cache's associativity, but with just the right memory layout, it all fits. If this is a real concern, consider running

your microbenchmark using huge pages (or within the kernel or on bare metal) in order to completely control the memory layout.

But note that there are many different possible memory-layout bottlenecks. Benchmarks sensitive to memory bandwidth (such as those involving matrix arithmetic) should spread the running threads across the available cores and sockets to maximize memory parallelism. They should also spread the data across NUMA nodes, memory controllers, and DRAM chips to the extent possible. In contrast, benchmarks sensitive to memory latency (including most poorly scaling applications) should instead maximize locality, filling each core and socket in turn before adding another one. q

Quick Quiz 11.24: p.225
Wouldn't the techniques suggested to isolate the code under test also affect that code's performance, particularly if it is running within a larger application?

Answer:
Indeed it might, although in most microbenchmarking efforts you would extract the code under test from the enclosing application. Nevertheless, if for some reason you must keep the code under test within the application, you will very likely need to use the techniques discussed in Section 11.7.6. q

Quick Quiz 11.25: p.227
This approach is just plain weird! Why not use means and standard deviations, like we were taught in our statistics classes?

Answer:
Because mean and standard deviation were not designed to do this job. To see this, try applying mean and standard deviation to the following data set, given a 1 % relative error in measurement:

49,548.4 49,549.4 49,550.2 49,550.9 49,550.9
49,551.0 49,551.5 49,552.1 49,899.0 49,899.3
49,899.7 49,899.8 49,900.1 49,900.4 52,244.9
53,333.3 53,333.3 53,706.3 53,706.3 54,084.5

The problem is that mean and standard deviation do not rest on any sort of measurement-error assumption, and they will therefore see the difference between the values near 49,500 and those near 49,900 as being statistically significant, when in fact they are well within the bounds of estimated measurement error.

Of course, it is possible to create a script similar to that in Listing 11.2 that uses standard deviation rather than absolute difference to get a similar effect, and this is left as an exercise for the interested reader. Be careful to avoid divide-by-zero errors arising from strings of identical data values! q

Quick Quiz 11.26: p.227
But what if all the y-values in the trusted group of data are exactly zero? Won't that cause the script to reject any non-zero value?

Answer:
Indeed it will! But if your performance measurements often produce a value of exactly zero, perhaps you need to take a closer look at your performance-measurement code.

Note that many approaches based on mean and standard deviation will have similar problems with this sort of dataset. q

E.12 Formal Verification

Quick Quiz 12.1: p.236
Why is there an unreached statement in locker? After all, isn't this a full state-space search?

Answer:
The locker process is an infinite loop, so control never reaches the end of this process. However, since there are no monotonically increasing variables, Promela is able to model this infinite loop with a small number of states. q

Quick Quiz 12.2: p.236
What are some Promela code-style issues with this example?

Answer:
There are several:

1. The declaration of sum should be moved to within the init block, since it is not used anywhere else.
2. The assertion code should be moved outside of the initialization loop. The initialization loop can then be placed in an atomic block, greatly reducing the state space (by how much?).

3. The atomic block covering the assertion code should 5. The second updater fetches the value of ctr[1],
be extended to include the initialization of sum and which is now one.
j, and also to cover the assertion. This also reduces
the state space (again, by how much?). q 6. The second updater now incorrectly concludes that
it is safe to proceed on the fastpath, despite the fact
that the original reader has not yet completed. q

Quick Quiz 12.3: p.237


Is there a more straightforward way to code the do-od
Quick Quiz 12.6: p.239
statement?
A compression rate of 0.48 % corresponds to a 200-to-
Answer: 1 decrease in memory occupied by the states! Is the
Yes. Replace it with if-fi and remove the two break state-space search really exhaustive???
statements. q
Answer:
According to Spin’s documentation, yes, it is.
Quick Quiz 12.4: p.238
As an indirect evidence, let’s compare the results of
Why are there atomic blocks at lines 12–21 and runs with -DCOLLAPSE and with -DMA=88 (two readers
lines 44–56, when the operations within those atomic and three updaters). The diff of outputs from those runs
blocks have no atomic implementation on any current is shown in Listing E.8. As you can see, they agree on the
production microprocessor? numbers of states (stored and matched). q

Answer:
Quick Quiz 12.7: p.241
Because those operations are for the benefit of the assertion
only. They are not part of the algorithm itself. There But different formal-verification tools are often designed
is therefore no harm in marking them atomic, and so to locate particular classes of bugs. For example, very
marking them greatly reduces the state space that must be few formal-verification tools will find an error in the
searched by the Promela model. q specification. So isn’t this “clearly untrustworthy” judg-
ment a bit harsh?

Quick Quiz 12.5: p.238 Answer:


Is the re-summing of the counters on lines 24–27 really It is certainly true that many formal-verification tools are
necessary? specialized in some way. For example, Promela does
not handle realistic memory models (though they can be
Answer: programmed into Promela [DMD13]), CBMC [CKL04]
Yes. To see this, delete these lines and run the model. does not detect probabilistic hangs and deadlocks, and
Alternatively, consider the following sequence of steps: Nidhugg [LSLK14] does not detect bugs involving data
nondeterminism. But this means that these tools cannot
be trusted to find bugs that they are not designed to locate.
1. One process is within its RCU read-side critical
And therefore people creating formal-verification tools
section, so that the value of ctr[0] is zero and the
should “tell the truth on the label”, clearly calling out
value of ctr[1] is two.
what classes of bugs their tools can and cannot detect.
2. An updater starts executing, and sees that the sum Otherwise, the first time a practitioner finds a tool failing to
of the counters is two so that the fastpath cannot be detect a bug, that practitioner is likely to make extremely
executed. It therefore acquires the lock. harsh and extremely public denunciations of that tool.
Yes, yes, there is something to be said for putting your
3. A second updater starts executing, and fetches the best foot forward, but putting it too far forward without
value of ctr[0], which is zero. appropriate disclaimers can easily trigger a land mine of
negative reaction that your tool might or might not be able
4. The first updater adds one to ctr[0], flips the index to recover from.
(which now becomes zero), then subtracts one from You have been warned! q
ctr[1] (which now becomes one).


Quick Quiz 12.8: p.241


Given that we have two independent proofs of correctness
for the QRCU algorithm described herein, and given that
the proof of incorrectness covers what is known to be a
different algorithm, why is there any room for doubt?

Answer:
There is always room for doubt. In this case, it is important
Listing E.8: Spin Output Diff of -DCOLLAPSE and -DMA=88
to keep in mind that the two proofs of correctness preceded
@@ -1,6 +1,6 @@
(Spin Version 6.4.6 -- 2 December 2016) the formalization of real-world memory models, raising
+ Partial Order Reduction the possibility that these two proofs are based on incorrect
- + Compression
+ + Graph Encoding (-DMA=88) memory-ordering assumptions. Furthermore, since both
proofs were constructed by the same person, it is quite
Full statespace search for:
never claim - (none specified) possible that they contain a common error. Again, there
@@ -9,27 +9,22 @@ is always room for doubt. q
invalid end states +

State-vector 88 byte, depth reached 328014, errors: 0


+MA stats: -DMA=77 is sufficient p.242
+Minimized Automaton: 2084798 nodes and 6.38445e+06 edges
Quick Quiz 12.9:
1.8620286e+08 states, stored Yeah, that’s just great! Now, just what am I supposed to
1.7759831e+08 states, matched
3.6380117e+08 transitions (= stored+matched)
do if I don’t happen to have a machine with 40 GB of
1.3724093e+08 atomic steps main memory???
-hash conflicts: 1.1445626e+08 (resolved)

Stats on memory usage (in Megabytes): Answer:


20598.919 equivalent memory usage for states
(stored*(State-vector + overhead))
Relax, there are a number of lawful answers to this ques-
- 8418.559 actual memory usage for states tion:
- (compression: 40.87%)
- state-vector as stored =
- 19 byte + 28 byte overhead 1. Try compiler flags -DCOLLAPSE and -DMA=N to re-
- 2048.000 memory used for hash table (-w28) duce memory consumption. See Section 12.1.4.1.
+ 204.907 actual memory usage for states
+ (compression: 0.99%)
17.624 memory used for DFS stack (-m330000) 2. Further optimize the model, reducing its memory
- 1.509 memory lost to fragmentation
-10482.675 total actual memory usage consumption.
+ 222.388 total actual memory usage
3. Work out a pencil-and-paper proof, perhaps starting
-nr of templates: [ 0:globals 1:chans 2:procs ]
-collapse counts: [ 0:1021 2:32 3:1869 4:2 ] with the comments in the code in the Linux kernel.
unreached in proctype qrcu_reader
(0 of 18 states)
unreached in proctype qrcu_updater 4. Devise careful torture tests, which, though they can-
@@ -38,5 +33,5 @@ not prove the code correct, can find hidden bugs.
unreached in init
(0 of 23 states)
5. There is some movement towards tools that do model
-pan: elapsed time 369 seconds checking on clusters of smaller machines. However,
-pan: rate 505107.58 states/second please note that we have not actually used such tools
+pan: elapsed time 2.68e+03 seconds ourselves, courtesy of some large machines that Paul
+pan: rate 69453.282 states/second myself, courtesy of some large machines that Paul
has occasional access to.

6. Wait for memory sizes of affordable systems to ex-


pand to fit your problem.

7. Use one of a number of cloud-computing services to


rent a large system for a short time period. q


Quick Quiz 12.10: p.243 Quick Quiz 12.14: p.245


Why not simply increment rcu_update_flag, and then Isn’t it a bit strange to model rcu_exit_nohz() fol-
only increment dynticks_progress_counter if the lowed by rcu_enter_nohz()? Wouldn’t it be more
old value of rcu_update_flag was zero??? natural to instead model entry before exit?

Answer: Answer:
This fails in presence of NMIs. To see this, suppose It probably would be more natural, but we will need this
an NMI was received just after rcu_irq_enter() in- particular order for the liveness checks that we will add
cremented rcu_update_flag, but before it incremented later. q
dynticks_progress_counter. The instance of rcu_
irq_enter() invoked by the NMI would see that the
Quick Quiz 12.15: p.246
original value of rcu_update_flag was non-zero, and
would therefore refrain from incrementing dynticks_ Wait a minute! In the Linux kernel, both dynticks_
progress_counter. This would leave the RCU grace- progress_counter and rcu_dyntick_snapshot
period machinery no clue that the NMI handler was are per-CPU variables. So why are they instead be-
executing on this CPU, so that any RCU read-side crit- ing modeled as single global variables?
ical sections in the NMI handler would lose their RCU
Answer:
protection.
Because the grace-period code processes each CPU’s
The possibility of NMI handlers, which, by definition
dynticks_progress_counter and rcu_dyntick_
cannot be masked, does complicate this code. q
snapshot variables separately, we can collapse the state
onto a single CPU. If the grace-period code were instead
Quick Quiz 12.11: p.243 to do something special given specific values on specific
But if line 7 finds that we are the outermost inter- CPUs, then we would indeed need to model multiple
rupt, wouldn’t we always need to increment dynticks_ CPUs. But fortunately, we can safely confine ourselves to
progress_counter? two CPUs, the one running the grace-period processing
and the one entering and leaving dynticks-idle mode. q
Answer:
Not if we interrupted a running task! In that case,
Quick Quiz 12.16: p.246
dynticks_progress_counter would have already
been incremented by rcu_exit_nohz(), and there would Given there are a pair of back-to-back changes to grace_
be no need to increment it again. q period_state on lines 25 and 26, how can we be sure
that line 25’s changes won’t be lost?

Quick Quiz 12.12: p.244


Answer:
Can you spot any bugs in any of the code in this section? Recall that Promela and Spin trace out every possible
sequence of state changes. Therefore, timing is irrelevant:
Promela/Spin will be quite happy to jam the entire rest
Answer:
of the model between those two statements unless some
Read the next section to see if you were correct. q
state variable specifically prohibits doing so. q

Quick Quiz 12.13: p.245


Quick Quiz 12.17: p.249
Why isn’t the memory barrier in rcu_exit_nohz()
and rcu_enter_nohz() modeled in Promela? But what would you do if you needed the statements
in a single EXECUTE_MAINLINE() group to execute
Answer: non-atomically?
Promela assumes sequential consistency, so it is not neces-
sary to model memory barriers. In fact, one must instead Answer:
explicitly model lack of memory barriers, for example, as The easiest thing to do would be to put each such statement
shown in Listing 12.13 on page 237. q in its own EXECUTE_MAINLINE() statement. q


Quick Quiz 12.18: p.249 Quick Quiz 12.21: p.253


But what if the dynticks_nohz() process had “if” or Does Paul always write his code in this painfully incre-
“do” statements with conditions, where the statement bod- mental manner?
ies of these constructs needed to execute non-atomically?
Answer:
Not always, but more and more frequently. In this case,
Answer: Paul started with the smallest slice of code that included
One approach, as we will see in a later section, is to use an interrupt handler, because he was not sure how best to
explicit labels and “goto” statements. For example, the model interrupts in Promela. Once he got that working,
construct: he added other features. (But if he was doing it again, he
would start with a “toy” handler. For example, he might
if have the handler increment a variable twice and have the
:: i == 0 -> a = -1; mainline code verify that the value was always even.)
:: else -> a = -2;
fi; Why the incremental approach? Consider the following,
attributed to Brian W. Kernighan:
could be modeled as something like: Debugging is twice as hard as writing the code
in the first place. Therefore, if you write the code
EXECUTE_MAINLINE(stmt1, as cleverly as possible, you are, by definition,
if
:: i == 0 -> goto stmt1_then; not smart enough to debug it.
:: else -> goto stmt1_else;
fi)
stmt1_then: skip;
This means that any attempt to optimize the production
EXECUTE_MAINLINE(stmt1_then1, a = -1; goto stmt1_end) of code should place at least 66 % of its emphasis on
stmt1_else: skip;
EXECUTE_MAINLINE(stmt1_then1, a = -2)
optimizing the debugging process, even at the expense of
stmt1_end: skip; increasing the time and effort spent coding. Incremental
coding and testing is one way to optimize the debugging
process, at the expense of some increase in coding effort.
However, it is not clear that the macro is helping much
Paul uses this approach because he rarely has the luxury
in the case of the “if” statement, so these sorts of situations
of devoting full days (let alone weeks) to coding and
will be open-coded in the following sections. q
debugging. q

Quick Quiz 12.19: p.250 p.254


Quick Quiz 12.22:
Why are lines 46 and 47 (the “in_dyntick_irq = 0;” But what happens if an NMI handler starts running
and the “i++;”) executed atomically? before an IRQ handler completes, and if that NMI
handler continues running until a second IRQ handler
Answer: starts?
These lines of code pertain to controlling the model, not
to the code being modeled, so there is no reason to model Answer:
them non-atomically. The motivation for modeling them This cannot happen within the confines of a single CPU.
atomically is to reduce the size of the state space. q The first IRQ handler cannot complete until the NMI
handler returns. Therefore, if each of the dynticks and
dynticks_nmi variables have taken on an even value
Quick Quiz 12.20: p.250 during a given time interval, the corresponding CPU
What property of interrupts is this dynticks_irq() really was in a quiescent state at some time during that
process unable to model? interval. q

Answer: Quick Quiz 12.23: p.256


One such property is nested interrupts, which are handled This is still pretty complicated. Why not just have a
in the following section. q cpumask_t with per-CPU bits, clearing the bit when


entering an IRQ or NMI handler, and setting it upon terminating the model with the initial value of 2 in P0’s
exit? r3 register, which will not trigger the exists assertion.
There is some debate about whether this trick is univer-
Answer: sally applicable, but I have not seen an example where it
Although this approach would be functionally correct, it fails. q
would result in excessive IRQ entry/exit overhead on large
machines. In contrast, the approach laid out in this section
Quick Quiz 12.27: p.259
allows each CPU to touch only per-CPU data on IRQ and
NMI entry/exit, resulting in much lower IRQ entry/exit Does the Arm Linux kernel have a similar bug?
overhead, especially on large machines. q
Answer:
Arm does not have this particular bug because it places
Quick Quiz 12.24: p.257 smp_mb() before and after the atomic_add_return()
But x86 has strong memory ordering, so why formalize function’s assembly-language implementation. PowerPC
its memory model? no longer has this bug; it has long since been fixed [Her11].
q
Answer:
Actually, academics consider the x86 memory model to p.259
Quick Quiz 12.28:
be weak because it can allow prior stores to be reordered
Does the lwsync on line 10 in Listing 12.23 provide
with subsequent loads. From an academic viewpoint, a
sufficient ordering?
strong memory model is one that allows absolutely no
reordering, so that all threads agree on the order of all Answer:
operations visible to them. It depends on the semantics required. The rest of this
Plus it really is the case that developers are sometimes answer assumes that the assembly language for P0 in
confused about x86 memory ordering. q Listing 12.23 is supposed to implement a value-returning
atomic operation.
Quick Quiz 12.25: p.257 As is discussed in Chapter 15, Linux kernel’s memory
Why does line 8 of Listing 12.23 initialize the registers? consistency model requires value-returning atomic RMW
Why not instead initialize them on lines 4 and 5? operations to be fully ordered on both sides. The ordering
provided by lwsync is insufficient for this purpose, and so
Answer: sync should be used instead. This change has since been
Either way works. However, in general, it is better made [Fen15] in response to an email thread discussing a
to use initialization than explicit instructions. The ex- couple of other litmus tests [McK15g]. Finding any other
plicit instructions are used in this example to demonstrate bugs that the Linux kernel might have is left as an exercise
their use. In addition, many of the litmus tests available for the reader.
on the tool’s web site (https://www.cl.cam.ac.uk/ In other environments providing weaker semantics,
~pes20/ppcmem/) were automatically generated, which lwsync might be sufficient. But not for the Linux kernel’s
generates explicit initialization instructions. q value-returning atomic operations! q

Quick Quiz 12.26: p.258 Quick Quiz 12.29: p.261


But whatever happened to line 17 of Listing 12.23, the What do you have to do to run herd on litmus tests like
one that is the Fail1: label? that shown in Listing 12.29?
Answer: Answer:
The implementation of powerpc version of atomic_add_ Get version v4.17 (or later) of the Linux-kernel source
return() loops when the stwcx instruction fails, which code, then follow the instructions in tools/memory-
it communicates by setting non-zero status in the condition- model/README to install the needed tools. Then follow
code register, which in turn is tested by the bne instruction. the further instructions to run these tools on the litmus
Because actually modeling the loop would result in state- test of your choice. q
space explosion, we instead branch to the Fail: label,


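As a concrete follow-up to the preceding answer to Quick Quiz 12.29, Linux-kernel litmus tests are written in a C-like format, along the lines of the message-passing example below, which is patterned after the MP tests shipped in tools/memory-model/litmus-tests (the file name and the herd7 invocation mentioned afterwards are from memory, so please check them against that directory's README):

C MP+pooncerelease+poacquireonce

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(y, 1);
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(y);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)

Running something like "herd7 -conf linux-kernel.cfg MP+pooncerelease+poacquireonce.litmus" from within tools/memory-model should report that the exists clause is never satisfied, because the store-release/load-acquire pairing forbids that outcome.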
Table E.4: Locking: Modeling vs. Emulation Time (s) p.262


Quick Quiz 12.31:
Model Emulate Wait!!! Isn’t leaking pointers out of an RCU read-side
critical section a critical bug???
# Proc.

filter exists
cmpxchg xchg cmpxchg xchg Answer:
2 0.004 0.022 0.027 0.039 0.058 Yes, it usually is a critical bug. However, in this case,
3 0.041 0.743 0.968 1.653 3.203 the updater has been cleverly constructed to properly
4 0.374 59.565 74.818 151.962 500.960 handle such pointer leaks. But please don’t make a habit
5 4.905 of doing this sort of thing, and especially don’t do this
without having put a lot of thought into making some
more conventional approach work. q

Quick Quiz 12.30: p.261


Quick Quiz 12.32: p.263
Why bother modeling locking directly? Why not simply
emulate locking with atomic operations? In Listing 12.32, why couldn’t a reader fetch c just before
P1() zeroed it on line 45, and then later store this same
Answer: value back into c just after it was zeroed, thus defeating
In a word, performance, as can be seen in Table E.4. the zeroing operation?
The first column shows the number of herd processes
Answer:
modeled. The second column shows the herd runtime
Because the reader advances to the next element on line 24,
when modeling spin_lock() and spin_unlock() di-
thus avoiding storing a pointer to the same element as was
rectly in herd’s cat language. The third column shows
fetched. q
the herd runtime when emulating spin_lock() with
cmpxchg_acquire() and spin_unlock() with smp_
Quick Quiz 12.33: p.263
store_release(), using the herd filter clause to
reject executions that fail to acquire the lock. The fourth In Listing 12.32, why not have just one call to
column is like the third, but using xchg_acquire() synchronize_rcu() immediately before line 48?
instead of cmpxchg_acquire(). The fifth and sixth col-
umns are like the third and fourth, but instead using the Answer:
herd exists clause to reject executions that fail to acquire Because this results in P0() accessing a freed element.
the lock. But don’t take my word for this, try it out in herd! q
Note also that use of the filter clause is about twice
as fast as is use of the exists clause. This is no surprise Quick Quiz 12.34: p.263
because the filter clause allows early abandoning of ex- Also in Listing 12.32, can’t line 48 be WRITE_ONCE()
cluded executions, where the executions that are excluded instead of smp_store_release()?
are the ones in which the lock is concurrently held by
more than one process. Answer:
That is an excellent question. As of late 2021, the answer
More important, modeling spin_lock() and spin_ is “no one knows”. Much depends on the semantics of
unlock() directly ranges from five times faster to more Armv8’s conditional-move instruction. While awaiting
than two orders of magnitude faster than modeling emu- clarity on these semantics, smp_store_release() is
lated locking. This should also be no surprise, as direct the safe choice. q
modeling raises the level of abstraction, thus reducing the
number of events that herd must model. Because almost
Quick Quiz 12.35: p.265
everything that herd does is of exponential computational
complexity, modest reductions in the number of events But shouldn’t sufficiently low-level software be for all
produces exponentially large reductions in runtime. intents and purposes immune to being exploited by black
hats?
Thus, in formal verification even more than in parallel
programming itself, divide and conquer!!! q Answer:
Unfortunately, no.


At one time, Paul E. McKenney felt that Linux-kernel E.13 Putting It All Together
RCU was immune to such exploits, but the advent of Row
Hammer showed him otherwise. After all, if the black
hats can hit the system’s DRAM, they can hit any and all Quick Quiz 13.1: p.270
low-level software, even including RCU. Why not implement reference-acquisition using a sim-
And in 2018, this possibility passed from the realm ple compare-and-swap operation that only acquires a
of theoretical speculation into the hard and fast realm of reference if the reference counter is non-zero?
objective reality [McK19a]. q
Answer:
Although this can resolve the race between the release of
Quick Quiz 12.36: p.265
the last reference and acquisition of a new reference, it
In light of the full verification of the L4 microkernel, does absolutely nothing to prevent the data structure from
isn’t this limited view of formal verification just a little being freed and reallocated, possibly as some completely
bit obsolete? different type of structure. It is quite likely that the “sim-
Answer: ple compare-and-swap operation” would give undefined
Unfortunately, no. results if applied to the differently typed structure.
The first full verification of the L4 microkernel was In short, use of atomic operations such as compare-and-
a tour de force, with a large number of Ph.D. students swap absolutely requires either type-safety or existence
hand-verifying code at a very slow per-student rate. This guarantees.
level of effort could not be applied to most software But what if it is absolutely necessary to let the type
projects because the rate of change is just too great. change?
Furthermore, although the L4 microkernel is a large One approach is for each such type to have the refer-
software artifact from the viewpoint of formal verification, ence counter at the same location, so that as long as the
it is tiny compared to a great number of projects, including reallocation results in an object from this group of types,
LLVM, GCC, the Linux kernel, Hadoop, MongoDB, all is well. If you do this in C, make sure you comment
and a great many others. In addition, this verification the reference counter in each structure in which it appears.
did have limits, as the researchers freely admit, to their In C++, use inheritance and templates. q
credit: https://docs.sel4.systems/projects/
sel4/frequently-asked-questions.html#does- p.272
Quick Quiz 13.2:
sel4-have-zero-bugs.
Why isn’t it necessary to guard against cases where one
Although formal verification is finally starting to show
CPU acquires a reference just after another CPU releases
some promise, including more-recent L4 verifications
the last reference?
involving greater levels of automation, it currently has no
chance of completely displacing testing in the foreseeable Answer:
future. And although I would dearly love to be proven Because a CPU must already hold a reference in order
wrong on this point, please note that such proof will be in to legally acquire another reference. Therefore, if one
the form of a real tool that verifies real software, not in CPU releases the last reference, there had better not be
the form of a large body of rousing rhetoric. any CPU acquiring a new reference! q
Perhaps someday formal verification will be used heav-
ily for validation, including for what is now known as
Quick Quiz 13.3: p.272
regression testing. Section 17.4 looks at what would be
required to make this possibility a reality. q Suppose that just after the atomic_sub_and_test()
on line 22 of Listing 13.2 is invoked, that some other
CPU invokes kref_get(). Doesn’t this result in that
other CPU now having an illegal reference to a released
object?

Answer:
This cannot happen if these functions are used correctly.
It is illegal to invoke kref_get() unless you already


hold a reference, in which case the kref_sub() could p.275


Quick Quiz 13.7:
not possibly have decremented the counter to zero. q
Why don’t all sequence-locking use cases replicate the
data in this fashion?

Quick Quiz 13.4: p.272 Answer:


Suppose that kref_sub() returns zero, indicating that Such replication is impractical if the data is too large, as
the release() function was not invoked. Under what it might be in the Schrödinger’s-zoo example described in
conditions can the caller rely on the continued existence Section 13.4.2.
of the enclosing object? Such replication is unnecessary if delays are prevented,
for example, when updaters disable interrupts when run-
ning on bare-metal hardware (that is, without the use of a
Answer:
vCPU-preemption-prone hypervisor).
The caller cannot rely on the continued existence of the
object unless it knows that at least one reference will Alternatively, if readers can tolerate the occasional
continue to exist. Normally, the caller will have no delay, then replication is again unnecessary. Consider the
way of knowing this, and must therefore carefully avoid example of reader-writer locking, where writers always
referencing the object after the call to kref_sub(). delay readers and vice versa.
However, if the data to be replicated is reasonably small,
Interested readers are encouraged to work around this
if delays are possible, and if readers cannot tolerate these
limitation using RCU, in particular, call_rcu(). q
delays, replicating the data is an excellent approach. q

p.272 Quick Quiz 13.8: p.276


Quick Quiz 13.5:
Why not just pass kfree() as the release function? Is it possible to write-acquire the sequence lock on the
new element before it is inserted instead of acquiring
that of the old element before it is removed?
Answer:
Because the kref structure normally is embedded in a Answer:
larger structure, and it is necessary to free the entire Yes, and the details are left as an exercise to the reader.
structure, not just the kref field. This is normally ac- The term tombstone is sometimes used to refer to the
complished by defining a wrapper function that does a element with the old name after its sequence lock is
container_of() and then a kfree(). q acquired. Similarly, the term birthstone is sometimes
used to refer to the element with the new name while its
sequence lock is still held. q
Quick Quiz 13.6: p.273
Why can’t the check for a zero reference count be made
Quick Quiz 13.9: p.276
in a simple “if” statement with an atomic increment in
its “then” clause? Is it possible to avoid the global lock?

Answer:
Answer:
Yes, and one way to do this would be to use per-hash-chain
Suppose that the “if” condition completed, finding the
locks. The updater could acquire lock(s) corresponding
reference counter value equal to one. Suppose that a
to both the old and the new element, acquiring them in
release operation executes, decrementing the reference
address order. In this case, the insertion and removal
counter to zero and therefore starting cleanup operations.
operations would of course need to refrain from acquiring
But now the “then” clause can increment the counter back
and releasing these same per-hash-chain locks. This
to a value of one, allowing the object to be used after it
complexity can be worthwhile if rename operations are
has been cleaned up.
frequent, and of course can allow rename operations to
This use-after-cleanup bug is every bit as bad as a execute concurrently. q
full-fledged use-after-free bug. q


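The check-then-act pitfall called out in the answer to Quick Quiz 13.6, and the conditional acquisition asked about in Quick Quiz 13.1, can be made concrete with the following sketch. It uses C11 atomics rather than the Linux kernel's kref API, the obj_get_unless_zero() name is an assumption made for this example, and, as the answer to Quick Quiz 13.1 emphasizes, it still provides no protection against the object being freed and reallocated out from under the caller.

/* Sketch: acquire a reference only if the counter is already non-zero. */
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int refcnt;
	/* ... payload ... */
};

/* Returns true if a reference was acquired, false if the count was zero. */
static bool obj_get_unless_zero(struct obj *p)
{
	int old = atomic_load(&p->refcnt);

	do {
		if (old == 0)
			return false;	/* Too late: teardown has begun. */
	} while (!atomic_compare_exchange_weak(&p->refcnt, &old, old + 1));
	return true;
}

The compare-and-swap loop closes the window between the zero check and the increment, but it helps only when type-safety or existence guarantees (for example, RCU or type-stable memory) ensure that the object is still of the expected type while the loop runs.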
p.277
Listing E.9: Localized Correlated Measurement Fields
Quick Quiz 13.10: 1 struct measurement {
Why on earth did we need that global lock in the first 2 double meas_1;
place? 3 double meas_2;
4 double meas_3;
5 };
Answer: 6
7 struct animal {
A given thread’s __thread variables vanish when that 8 char name[40];
thread exits. It is therefore necessary to synchronize any 9 double age;
10 struct measurement *mp;
operation that accesses other threads’ __thread variables 11 struct measurement meas;
with thread exit. Without such synchronization, accesses 12 char photo[0]; /* large bitmap. */
13 };
to __thread variable of a just-exited thread will result in
segmentation faults. q

Quick Quiz 13.13: p.278


Quick Quiz 13.11: p.278
Wow! Listing 13.5 contains 70 lines of code, compared
Hey!!! Line 48 of Listing 13.5 modifies a value in a
to only 42 in Listing 5.4. Is this extra complexity really
pre-existing countarray structure! Didn’t you say that
worth it?
this structure, once made available to read_count(),
remained constant??? Answer:
This of course needs to be decided on a case-by-case basis.
Answer:
If you need an implementation of read_count() that
Indeed I did say that. And it would be possible to make
scales linearly, then the lock-based implementation shown
count_register_thread() allocate a new structure,
in Listing 5.4 simply will not work for you. On the other
much as count_unregister_thread() currently does.
hand, if calls to read_count() are sufficiently rare, then
But this is unnecessary. Recall the derivation of the
the lock-based version is simpler and might thus be better,
error bounds of read_count() that was based on the
although much of the size difference is due to the structure
snapshots of memory. Because new threads start with
definition, memory allocation, and NULL return checking.
initial counter values of zero, the derivation holds even
Of course, a better question is “Why doesn’t the lan-
if we add a new thread partway through read_count()’s
guage implement cross-thread access to __thread vari-
execution. So, interestingly enough, when adding a new
ables?” After all, such an implementation would make
thread, this implementation gets the effect of allocating
both the locking and the use of RCU unnecessary. This
a new structure, but without actually having to do the
would in turn enable an implementation that was even
allocation.
simpler than the one shown in Listing 5.4, but with all the
On the other hand, count_unregister_thread()
scalability and performance benefits of the implementation
can result in the outgoing thread’s work being double
shown in Listing 13.5! q
counted. This can happen when read_count() is in-
voked between lines 65 and 66. There are efficient ways
of avoiding this double-counting, but these are left as an Quick Quiz 13.14: p.280
exercise for the reader. q But cant’t the approach shown in Listing 13.9 result
in extra cache misses, in turn resulting in additional
Quick Quiz 13.12: p.278 read-side overhead?
Given the fixed-size counterp array, exactly how does Answer:
this code avoid a fixed upper bound on the number of Indeed it can.
threads??? One way to avoid this cache-miss overhead is shown in
Answer: Listing E.9: Simply embed an instance of a measurement
You are quite right, that array does in fact reimpose structure named meas into the animal structure, and point
the fixed upper limit. This limit may be avoided by the ->mp field at this ->meas field.
tracking threads with a linked list, as is done in userspace Measurement updates can then be carried out as follows:
RCU [DMS+ 12]. Doing something similar for this code
is left as an exercise for the reader. q 1. Allocate a new measurement structure and place
the new measurements into it.

v2022.09.25a
538 APPENDIX E. ANSWERS TO QUICK QUIZZES

2. Use rcu_assign_pointer() to point ->mp to this One way to handle this is to have a reference count
new structure. on each set of buckets, which is initially set to the value
one. A full-table scan would acquire a reference at the
3. Wait for a grace period to elapse, for example using beginning of the scan (but only if the reference is non-zero)
either synchronize_rcu() or call_rcu(). and release it at the end of the scan. The resizing would
4. Copy the measurements from the new measurement populate the new buckets, release the reference, wait for
structure into the embedded ->meas field. a grace period, and then wait for the reference to go to
zero. Once the reference was zero, the resizing could let
5. Use rcu_assign_pointer() to point ->mp back updaters forget about the old hash buckets and then free it.
to the old embedded ->meas field. Actual implementation is left to the interested reader,
who will gain much insight from this task. q
6. After another grace period elapses, free up the new
measurement structure.

This approach uses a heavier weight update procedure E.14 Advanced Synchronization
to eliminate the extra cache miss in the common case. The
extra cache miss will be incurred only while an update is p.286
Quick Quiz 14.1:
actually in progress. q
Given that there will always be a sharply limited number
of CPUs available, is population obliviousness really
Quick Quiz 13.15: p.280 useful?
But how does this scan work while a resizable hash table
Answer:
is being resized? In that case, neither the old nor the
Given the surprisingly limited scalability of any num-
new hash table is guaranteed to contain all the elements
ber of NBS algorithms, population obliviousness can be
in the hash table!
surprisingly useful. Nevertheless, the overall point of
Answer: the question is valid. It is not normally helpful for an
True, resizable hash tables as described in Section 10.4 algorithm to scale beyond the size of the largest system it
cannot be fully scanned while being resized. One simple is ever going to run on. q
way around this is to acquire the hashtab structure’s
->ht_lock while scanning, but this prevents more than Quick Quiz 14.2: p.287
one scan from proceeding concurrently. Wait! In order to dequeue all elements, both the ->head
Another approach is for updates to mutate the old hash and ->tail pointers must be changed, which cannot be
table as well as the new one while resizing is in progress. done atomically on typical computer systems. So how
This would allow scans to find all elements in the old is this supposed to work???
hash table. Implementing this is left as an exercise for the
reader. q Answer:
One pointer at a time!
Quick Quiz 13.16: p.283 First, atomically exchange the ->head pointer with
But how would this work with a resizable hash table, NULL. If the return value from the atomic exchange
such as the one described in Section 10.4? operation is NULL, the queue was empty and you are done.
And if someone else attempts a dequeue-all at this point,
Answer: they will get back a NULL pointer.
In this case, more care is required because the hash table Otherwise, atomically exchange the ->tail pointer
might well be resized during the time that we momentarily with a pointer to the now-NULL ->head pointer. The
exited the RCU read-side critical section. Worse yet, return value from the atomic exchange operation is a
the resize operation can be expected to free the old hash pointer to the ->next field of the eventual last element on
buckets, leaving us pointing to the freelist. the list.
But it is not sufficient to prevent the old hash buckets Producing and testing actual code is left as an exercise
from being freed. It is also necessary to ensure that those for the interested and enthusiastic reader, as are strategies
buckets continue to be updated. for handling half-enqueued elements. q

v2022.09.25a
E.14. ADVANCED SYNCHRONIZATION 539

Quick Quiz 14.3: p.288 Quick Quiz 14.5: p.292


So why not ditch antique languages like C and C++ for It seems like the various members of the NBS hierarchy
something more modern? are rather useless. So why bother with them at all???

Answer: Answer:
That won’t help unless the more-modern languages pro- One advantage of the members of the NBS hierarchy is
ponents are energetic enough to write their own compiler that they are reasonably simple to define and use from
backends. The usual practice of re-using existing back- a theoretical viewpoint. We can hope that work done in
ends also reuses charming properties such as refusal to the NBS arena will help lay the groundwork for analysis
support pointers to lifetime-ended objects. q of real-world forward-progress guarantees for concurrent
real-time programs. However, as of 2022 it appears that
p.289 trace-based methodologies are in the lead [dOCdO19].
Quick Quiz 14.4:
So why bother learning about NBS at all?
Why does anyone care about demonic schedulers?
Because a great many people know of it, and are vaguely
Answer: aware that it is somehow related to real-time computing.
A demonic scheduler is one way to model an insanely Their response to your carefully designed real-time con-
overloaded system. After all, if you have an algorithm that straints might well be of the form “Bah, just use wait-free
you can prove runs reasonably given a demonic scheduler, algorithms!”. In the all-too-common case where they are
mere overload should be no problem, right? very convincing to your management, you will need to
On the other hand, it is only reasonable to ask if a understand NBS in order to bring the discussion back to
demonic scheduler is really the best way to model overload reality. I hope that this section has provided you with the
conditions. And perhaps it is time for more accurate required depth of understanding.
models. For one thing, a system might be overloaded in Another thing to note is that learning about the NBS
any of a number of ways. After all, an NBS algorithm that hierarchy is probably no more harmful than learning
works fine on a demonic scheduler might or might not about transfinite numbers of the computational-complexity
do well in out-of-memory conditions, when mass storage hierarchy. In all three cases, it is important to avoid over-
fills, or when the network is congested. applying the theory. Which is in and of itself good
Except that systems’ core counts have been increasing, practice! q
which means that an overloaded system is quite likely to
be running more than one concurrent program.12 In that Quick Quiz 14.6: p.294
case, even if a demonic scheduler is not so demonic as But what about battery-powered systems? They don’t
to inject idle cycles while there are runnable tasks, it is require energy flowing into the system as a whole.
easy to imagine such a scheduler consistently favoring
the other program over yours. If both programs could Answer:
consume all available CPU, then this scheduler might not Sooner or later, the battery must be recharged, which
run your program at all. requires energy to flow into the system. q
One way to avoid these issues is to simply avoid over-
load conditions. This is often the preferred approach in p.295
Quick Quiz 14.7:
production, where load balancers direct traffic away from
But given the results from queueing theory, won’t low
overloaded systems. And if all systems are overloaded,
utilization merely improve the average response time
it is not unheard of to simply shed load, that is, to drop
rather than improving the worst-case response time?
the low-priority incoming requests. Nor is this approach
And isn’t worst-case response time all that most real-
limited to computing, as those who have suffered through
time systems really care about?
a rolling blackout can attest. But load-shedding is often
considered a bad thing by those whose load is being shed. Answer:
As always, choose wisely! q Yes, but . . .
12 As a point of reference, back in the mid-1990s, Paul witnessed Those queueing-theory results assume infinite “calling
a 16-CPU system running about 20 instances of a certain high-end populations”, which in the Linux kernel might correspond
proprietary database. to an infinite number of tasks. As of early 2021, no real

v2022.09.25a
540 APPENDIX E. ANSWERS TO QUICK QUIZZES

system supports an infinite number of tasks, so results p.296


Quick Quiz 14.9:
assuming infinite calling populations should be expected
Differentiating real-time from non-real-time based on
to have less-than-infinite applicability.
what can “be achieved straightforwardly by non-real-
Other queueing-theory results have finite calling
time systems and applications” is a travesty! There is
populations, which feature sharply bounded response
absolutely no theoretical basis for such a distinction!!!
times [HL86]. These results better model real systems,
Can’t we do better than that???
and these models do predict reductions in both average
and worst-case response times as utilizations decrease. Answer:
These results can be extended to model concurrent sys- This distinction is admittedly unsatisfying from a strictly
tems that use synchronization mechanisms such as lock- theoretical perspective. But on the other hand, it is exactly
ing [Bra11, SM04a]. what the developer needs in order to decide whether the
In short, queueing-theory results that accurately de- application can be cheaply and easily developed using
scribe real-world real-time systems show that worst-case standard non-real-time approaches, or whether the more
response time decreases with decreasing utilization. q difficult and expensive real-time approaches are required.
In other words, although theory is quite important, for
Quick Quiz 14.8: p.296 those of us called upon to complete practical projects,
Formal verification is already quite capable, benefiting theory supports practice, never the other way around. q
from decades of intensive study. Are additional advances
really required, or is this just a practitioner’s excuse to p.304
Quick Quiz 14.10:
continue to lazily ignore the awesome power of formal
But if you only allow one reader at a time to read-acquire
verification?
a reader-writer lock, isn’t that the same as an exclusive
Answer: lock???
Perhaps this situation is just a theoretician’s excuse to avoid
diving into the messy world of real software? Perhaps Answer:
more constructively, the following advances are required: Indeed it is, other than the API. And the API is important
because it allows the Linux kernel to offer real-time capa-
1. Formal verification needs to handle larger software bilities without having the -rt patchset grow to ridiculous
artifacts. The largest verification efforts have been sizes.
for systems of only about 10,000 lines of code, and However, this approach clearly and severely limits read-
those have been verifying much simpler properties side scalability. The Linux kernel’s -rt patchset was long
than real-time latencies. able to live with this limitation for several reasons: (1) Re-
2. Hardware vendors will need to publish formal tim- al-time systems have traditionally been relatively small,
ing guarantees. This used to be common practice (2) Real-time systems have generally focused on process
back when hardware was much simpler, but today’s control, thus being unaffected by scalability limitations in
complex hardware results in excessively complex ex- the I/O subsystems, and (3) Many of the Linux kernel’s
pressions for worst-case performance. Unfortunately, reader-writer locks have been converted to RCU.
energy-efficiency concerns are pushing vendors in However, the day came when it was absolutely necessary
the direction of even more complexity. to permit concurrent readers, as described in the text
following this quiz. q
3. Timing analysis needs to be integrated into develop-
ment methodologies and IDEs.
Quick Quiz 14.11: p.305
All that said, there is hope, given recent work for-
Suppose that preemption occurs just after the load from
malizing the memory models of real computer sys-
t->rcu_read_unlock_special.s on line 12 of List-
tems [AMP+ 11, AKNT13]. On the other hand, formal
ing 14.3. Mightn’t that result in the task failing to
verification has just as much trouble as does testing with
invoke rcu_read_unlock_special(), thus failing to
the astronomical number of variants of the Linux kernel
remove itself from the list of tasks blocking the current
that can be constructed from different combinations of its
grace period, in turn causing that grace period to extend
tens of thousands of Kconfig options. Sometimes life is
indefinitely?
hard! q

v2022.09.25a
E.15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING 541

Answer: p.311
Quick Quiz 14.15:
That is a real problem, and it is solved in RCU’s scheduler
Don’t you need some kind of synchronization to protect
hook. If that scheduler hook sees that the value of t->
update_cal()?
rcu_read_lock_nesting is negative, it invokes rcu_
read_unlock_special() if needed before allowing the Answer:
context switch to complete. q Indeed you do, and you could use any of a number of
techniques discussed earlier in this book. One of those
Quick Quiz 14.12: p.309 techniques is use of a single updater thread, which would
But isn’t correct operation despite fail-stop bugs a valu- result in exactly the code shown in update_cal() in
able fault-tolerance property? Listing 14.6. q

Answer:
Yes and no.
Yes in that non-blocking algorithms can provide fault E.15 Advanced Synchronization:
tolerance in the face of fail-stop bugs, but no in that this Memory Ordering
is grossly insufficient for practical fault tolerance. For
example, suppose you had a wait-free queue, and further
suppose that a thread has just dequeued an element. If Quick Quiz 15.1: p.313
that thread now succumbs to a fail-stop bug, the element it This chapter has been rewritten since the first edition.
has just dequeued is effectively lost. True fault tolerance Did memory ordering change all that since 2014?
requires way more than mere non-blocking properties,
and is beyond the scope of this book. q Answer:
The earlier memory-ordering section had its roots in a
p.309
pair of Linux Journal articles [McK05a, McK05b] dating
Quick Quiz 14.13:
back to 2005. Since then, the C and C++ memory mod-
I couldn’t help but spot the word “includes” before this
els [Bec11] have been formalized (and critiqued [BS14,
list. Are there other constraints?
BD14, VBC+ 15, BMN+ 15, LVK+ 17, BGV17]), exe-
Answer: cutable formal memory models for computer systems have
Indeed there are, and lots of them. However, they tend to become the norm [MSS12, McK11d, SSA+ 11, AMP+ 11,
be specific to a given situation, and many of them can be AKNT13, AKT13, AMT14, MS14, FSP+ 17, ARM17],
thought of as refinements of some of the constraints listed and there is even a memory model for the Linux ker-
above. For example, the many constraints on choices of nel [AMM+ 17a, AMM+ 17b, AMM+ 18], along with a
data structure will help meeting the “Bounded time spent paper describing differences between the C11 and Linux
in any given critical section” constraint. q memory models [MWPF18].
Given all this progress since 2005, it was high time for
a full rewrite! q
Quick Quiz 14.14: p.310
Given that real-time systems are often used for safety-
Quick Quiz 15.2: p.313
critical applications, and given that runtime memory
allocation is forbidden in many safety-critical situations, The compiler can also reorder Thread P0()’s and
what is with the call to malloc()??? Thread P1()’s memory accesses in Listing 15.1, right?

Answer:
In early 2016, projects forbidding runtime memory al- Answer:
location were also not at all interested in multithreaded In general, compiler optimizations carry out more exten-
computing. So the runtime memory allocation is not an sive and profound reorderings than CPUs can. However,
additional obstacle to safety criticality. in this case, the volatile accesses in READ_ONCE() and
However, by 2020 runtime memory allocation in multi- WRITE_ONCE() prevent the compiler from reordering.
core real-time systems was gaining some traction. q And also from doing much else as well, so the examples
in this section will be making heavy use of READ_ONCE()

v2022.09.25a
542 APPENDIX E. ANSWERS TO QUICK QUIZZES

and WRITE_ONCE(). See Section 15.3 for more detail on The *_dereference() row captures the address
the need for READ_ONCE() and WRITE_ONCE(). q and data dependency ordering provided by rcu_
dereference() and friends. Again, these dependen-
cies must been constructed carefully, as described in
Quick Quiz 15.3: p.315
Section 15.3.2.
But wait!!! On row 2 of Table 15.1 both x0 and x1 each
have two values at the same time, namely zero and two. The “Successful *_acquire()” row captures the fact
How can that possibly work??? that many CPUs have special “acquire” forms of loads
and of atomic RMW instructions, and that many other
Answer: CPUs have lightweight memory-barrier instructions that
There is an underlying cache-coherence protocol that order prior loads against subsequent loads and stores.
straightens things out, which are discussed in Appen- The “Successful *_release()” row captures the fact
dix C.2. But if you think that a given variable having two that many CPUs have special “release” forms of stores
values at the same time is surprising, just wait until you and of atomic RMW instructions, and that many other
get to Section 15.2.1! q CPUs have lightweight memory-barrier instructions that
order prior loads and stores against subsequent stores.
Quick Quiz 15.4: p.315 The smp_rmb() row captures the fact that many CPUs
But don’t the values also need to be flushed from the have lightweight memory-barrier instructions that order
cache to main memory? prior loads against subsequent loads. Similarly, the smp_
wmb() row captures the fact that many CPUs have light-
Answer: weight memory-barrier instructions that order prior stores
Perhaps surprisingly, not necessarily! On some systems, against subsequent stores.
if the two variables are being used heavily, they might be None of the ordering operations thus far require prior
bounced back and forth between the CPUs’ caches and stores to be ordered against subsequent loads, which means
never land in main memory. q that these operations need not interfere with store buffers,
whose main purpose in life is in fact to reorder prior
p.316 stores against subsequent loads. The lightweight nature
Quick Quiz 15.5:
of these operations is precisely due to their policy of
The rows in Table 15.3 seem quite random and confused.
store-buffer non-interference. However, as noted earlier, it
Whatever is the conceptual basis of this table???
is sometimes necessary to interfere with the store buffer in
Answer: order to prevent prior stores from being reordered against
The rows correspond roughly to hardware mechanisms of later stores, which brings us to the remaining rows in this
increasing power and overhead. table.
The WRITE_ONCE() row captures the fact that accesses The smp_mb() row corresponds to the full memory
to a single variable are always fully ordered, as indicated by barrier available on most platforms, with Itanium being
the “SV”column. Note that all other operations providing the exception that proves the rule. However, even on
ordering against accesses to multiple variables also provide Itanium, smp_mb() provides full ordering with respect
this same-variable ordering. to READ_ONCE() and WRITE_ONCE(), as discussed in
The READ_ONCE() row captures the fact that (as of Section 15.5.4.
2021) compilers and CPUs do not indulge in user-visible The “Successful full-strength non-void RMW” row
speculative stores, so that any store whose address, data, captures the fact that on some platforms (such as x86)
or execution depends on a prior load is guaranteed to atomic RMW instructions provide full ordering both be-
happen after that load completes. However, this guarantee fore and after. The Linux kernel therefore requires that
assumes that these dependencies have been constructed full-strength non-void atomic RMW operations provide
carefully, as described in Sections 15.3.2 and 15.3.3. full ordering in cases where these operations succeed.
The “_relaxed() RMW operation” row captures the (Full-strength atomic RMW operation’s names do not end
fact that a value-returning _relaxed() RMW has done in _relaxed, _acquire, or _release.) As noted earlier,
a load and a store, which are every bit as good as a the case where these operations do not succeed is covered
READ_ONCE() and a WRITE_ONCE(), respectively. by the “_relaxed() RMW operation” row.

v2022.09.25a
E.15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING 543

However, the Linux kernel does not require that either p.318
Quick Quiz 15.7:
void or _relaxed() atomic RMW operations provide
But how can I know that a given project can be designed
any ordering whatsoever, with the canonical example
and coded within the confines of these rules of thumb?
being atomic_inc(). Therefore, these operations, along
with failing non-void atomic RMW operations may be
preceded by smp_mb__before_atomic() and followed Answer:
by smp_mb__after_atomic() to provide full ordering Much of the purpose of the remainder of this chapter is to
for any accesses preceding or following both. No ordering answer exactly that question! q
need be provided for accesses between the smp_mb__
before_atomic() (or, similarly, the smp_mb__after_ p.318
Quick Quiz 15.8:
atomic()) and the atomic RMW operation, as indicated
How can you tell which memory barriers are strong
by the “a” entries on the smp_mb__before_atomic()
enough for a given use case?
and smp_mb__after_atomic() rows of the table.
In short, the structure of this table is dictated by the Answer:
properties of the underlying hardware, which are con- Ah, that is a deep question whose answer requires most
strained by nothing other than the laws of physics, which of the rest of this chapter. But the short answer is that
were covered back in Chapter 3. That is, the table is not smp_mb() is almost always strong enough, albeit at some
random, although it is quite possible that you are confused. cost. q
q
Quick Quiz 15.9: p.319
Quick Quiz 15.6: p.318 Wait!!! Where do I find this tooling that automatically
Why is Table 15.3 missing smp_mb__after_unlock_ analyzes litmus tests???
lock() and smp_mb__after_spinlock()?
Answer:
Answer: Get version v4.17 (or later) of the Linux-kernel source
These two primitives are rather specialized, and at present code, then follow the instructions in tools/memory-
seem difficult to fit into Table 15.3. The smp_mb__after_ model/README to install the needed tools. Then follow
unlock_lock() primitive is intended to be placed im- the further instructions to run these tools on the litmus
mediately after a lock acquisition, and ensures that all test of your choice. q
CPUs see all accesses in prior critical sections as happen-
ing before all accesses following the smp_mb__after_ Quick Quiz 15.10: p.319
unlock_lock() and also before all accesses in later What assumption is the code fragment in Listing 15.3
critical sections. Here “all CPUs” includes those CPUs making that might not be valid on real hardware?
not holding that lock, and “prior critical sections” in-
cludes all prior critical sections for the lock in question Answer:
as well as all prior critical sections for all other locks The code assumes that as soon as a given CPU stops seeing
that were released by the same CPU that executed the its own value, it will immediately see the final agreed-upon
smp_mb__after_unlock_lock(). value. On real hardware, some of the CPUs might well
The smp_mb__after_spinlock() provides the same see several intermediate results before converging on the
guarantees as does smp_mb__after_unlock_lock(), final value. The actual code used to produce the data in
but also provides additional visibility guarantees for other the figures discussed later in this section was therefore
accesses performed by the CPU that executed the smp_ somewhat more complex. q
mb__after_spinlock(). Given any store S performed
prior to any earlier lock acquisition and any load L Quick Quiz 15.11: p.320
performed after the smp_mb__after_spinlock(), all How could CPUs possibly have different views of the
CPUs will see S as happening before L. In other words, value of a single variable at the same time?
if a CPU performs a store S, acquires a lock, executes an
smp_mb__after_spinlock(), then performs a load L, Answer:
all CPUs will see S as happening before L. q As discussed in Section 15.1.1, many CPUs have store

v2022.09.25a
544 APPENDIX E. ANSWERS TO QUICK QUIZZES

buffers that record the values of recent stores, which do not mance price of unnecessary smp_rmb() and smp_
become globally visible until the corresponding cache line wmb() invocations? Shouldn’t weakly ordered systems
makes its way to the CPU. Therefore, it is quite possible shoulder the full cost of their misordering choices???
for each CPU to see its own value for a given variable
(in its own store buffer) at a single point in time—and Answer:
for main memory to hold yet another value. One of the That is in fact exactly what happens. On strongly ordered
reasons that memory barriers were invented was to allow systems, smp_rmb() and smp_wmb() emit no instructions,
software to deal gracefully with situations like this one. but instead just constrain the compiler. Thus, in this case,
Fortunately, software rarely cares about the fact that weakly ordered systems do in fact shoulder the full cost
multiple CPUs might see multiple values for the same of their memory-ordering choices. q
variable. q

Quick Quiz 15.15: p.324


Quick Quiz 15.12: p.320
But how do we know that all platforms really avoid trig-
Why do CPUs 2 and 3 come to agreement so quickly,
gering the exists clauses in Listings 15.10 and 15.11?
when it takes so long for CPUs 1 and 4 to come to the
party?
Answer:
Answer:
Answering this requires identifying three major groups
CPUs 2 and 3 are a pair of hardware threads on the same
of platforms: (1) Total-store-order (TSO) platforms,
core, sharing the same cache hierarchy, and therefore have
(2) Weakly ordered platforms, and (3) DEC Alpha.
very low communications latencies. This is a NUMA, or,
more accurately, a NUCA effect. The TSO platforms order all pairs of memory references
This leads to the question of why CPUs 2 and 3 ever except for prior stores against later loads. Because the
disagree at all. One possible reason is that they each address dependency on lines 18 and 19 of Listing 15.10
might have a small amount of private cache in addition to a is instead a load followed by another load, TSO platforms
larger shared cache. Another possible reason is instruction preserve this address dependency. They also preserve the
reordering, given the short 10-nanosecond duration of address dependency on lines 17 and 18 of Listing 15.11
the disagreement and the total lack of memory-ordering because this is a load followed by a store. Because address
operations in the code fragment. q dependencies must start with a load, TSO platforms im-
plicitly but completely respect them, give or take compiler
optimizations, hence the need for READ_ONCE().
Quick Quiz 15.13: p.322
Weakly ordered platforms don’t necessarily maintain
But why make load-load reordering visible to the user? ordering of unrelated accesses. However, the address
Why not just use speculative execution to allow execution dependencies in Listings 15.10 and 15.11 are not unrelated:
to proceed in the common case where there are no There is an address dependency. The hardware tracks
intervening stores, in which case the reordering cannot dependencies and maintains the needed ordering.
be visible anyway? There is one (famous) exception to this rule for weakly
Answer: ordered platforms, and that exception is DEC Alpha for
They can and many do, otherwise systems containing load-to-load address dependencies. And this is why, in
strongly ordered CPUs would be slow indeed. However, Linux kernels predating v4.15, DEC Alpha requires the
speculative execution does have its downsides, especially explicit memory barrier supplied for it by the now-obsolete
if speculation must be rolled back frequently, particularly lockless_dereference() on line 18 of Listing 15.10.
on battery-powered systems. But perhaps future systems However, DEC Alpha does track load-to-store address
will be able to overcome these disadvantages. Until then, dependencies, which is why line 17 of Listing 15.11 does
we can expect vendors to continue producing weakly not need a lockless_dereference(), even in Linux
ordered CPUs. q kernels predating v4.15.
To sum up, current platforms either respect address
p.323 dependencies implicitly, as is the case for TSO platforms
Quick Quiz 15.14:
(x86, mainframe, SPARC, . . .), have hardware tracking for
Why should strongly ordered systems pay the perfor-
address dependencies (Arm, PowerPC, MIPS, . . .), have

v2022.09.25a
E.15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING 545

the required memory barriers supplied by READ_ONCE() Listing E.10: Litmus Test Distinguishing Multicopy Atomic
(DEC Alpha in Linux kernel v4.15 and later), or supplied From Other Multicopy Atomic
by rcu_dereference() (DEC Alpha in Linux kernel 1 C C-MP-OMCA+o-o-o+o-rmb-o
2
v4.14 and earlier). q 3 {}
4
5 P0(int *x, int *y)
6 {
Quick Quiz 15.16: p.324
7 int r0;
SP, MP, LB, and now S. Where do all these litmus-test 8
9 WRITE_ONCE(*x, 1);
abbreviations come from and how can anyone keep track 10 r0 = READ_ONCE(*x);
of them? 11 WRITE_ONCE(*y, r0);
12 }
13
Answer: 14 P1(int *x, int *y)
The best scorecard is the infamous test6.pdf [SSA+ 11]. 15 {
16 int r1;
Unfortunately, not all of the abbreviations have catchy 17 int r2;
expansions like SB (store buffering), MP (message pass- 18
19 r1 = READ_ONCE(*y);
ing), and LB (load buffering), but at least the list of 20 smp_rmb();
abbreviations is readily available. q 21 r2 = READ_ONCE(*x);
22 }
23
24 exists (1:r1=1 /\ 1:r2=0)
Quick Quiz 15.17: p.325
But wait!!! Line 17 of Listing 15.12 uses READ_ONCE(),
which marks the load as volatile, which means that the Answer:
compiler absolutely must emit the load instruction even Yes, it would. Feel free to modify the exists clause to
if the value is later multiplied by zero. So how can the check for that outcome and see what happens. q
compiler possibly break this data dependency?

Answer: Quick Quiz 15.20: p.327


Yes, the compiler absolutely must emit a load instruction Can you give a specific example showing different be-
for a volatile load. But if you multiply the value loaded havior for multicopy atomic on the one hand and other-
by zero, the compiler is well within its rights to substitute multicopy atomic on the other?
a constant zero for the result of that multiplication, which
will break the data dependency on many platforms. Answer:
Worse yet, if the dependent store does not use WRITE_ Listing E.10 (C-MP-OMCA+o-o-o+o-rmb-o.litmus)
ONCE(), the compiler could hoist it above the load, which shows such a test.
would cause even TSO platforms to fail to provide ordering. On a multicopy-atomic platform, P0()’s store to x
q on line 9 must become visible to both P0() and P1()
simultaneously. Because this store becomes visible to
Quick Quiz 15.18: p.326 P0() on line 10, before P0()’s store to y on line 11,
Wouldn’t control dependencies be more robust if they P0()’s store to x must become visible before its store to
were mandated by language standards??? y everywhere, including P1(). Therefore, if P1()’s load
from y on line 19 returns the value 1, so must its load from
Answer: x on line 21, given that the smp_rmb() on line 20 forces
But of course! And perhaps in the fullness of time they these two loads to execute in order. Therefore, the exists
will be so mandated. q clause on line 24 cannot trigger on a multicopy-atomic
platform.
Quick Quiz 15.19: p.326 In contrast, on an other-multicopy-atomic platform,
But in Listing 15.15, wouldn’t be just as bad if P2()’s P0() could see its own store early, so that there would be
r1 and r2 obtained the values 2 and 1, respectively, no constraint on the order of visibility of the two stores
while P3()’s r3 and r4 obtained the values 1 and 2, from P1(), which in turn allows the exists clause to
respectively? trigger. q

v2022.09.25a
546 APPENDIX E. ANSWERS TO QUICK QUIZZES

p.328
Listing E.11: R Litmus Test With Write Memory Barrier (No
Quick Quiz 15.21: Ordering)
Then who would even think of designing a system with 1 C C-R+o-wmb-o+o-mb-o
shared store buffers??? 2
3 {}
4
Answer: 5 P0(int *x0, int *x1)
This is in fact a very natural design for any system hav- 6 {
7 WRITE_ONCE(*x0, 1);
ing multiple hardware threads per core. Natural from a 8 smp_wmb();
hardware point of view, that is! q 9 WRITE_ONCE(*x1, 1);
10 }
11
12 P1(int *x0, int *x1)
Quick Quiz 15.22: p.328 13 {
14 int r2;
But just how is it fair that P0() and P1() must share a 15
store buffer and a cache, but P2() gets one each of its 16 WRITE_ONCE(*x1, 2);
17 smp_mb();
very own??? 18 r2 = READ_ONCE(*x0);
19 }
20
Answer: 21 exists (1:r2=0 /\ x1=2)
Presumably there is a P3(), as is in fact shown in Fig-
ure 15.8, that shares P2()’s store buffer and cache. But
not necessarily. Some platforms allow different cores to
Quick Quiz 15.24: p.330
disable different numbers of threads, allowing the hard-
ware to adjust to the needs of the workload at hand. For But it is not necessary to worry about propagation unless
example, a single-threaded critical-path portion of the there are at least three threads in the litmus test, right?
workload might be assigned to a core with only one thread
enabled, thus allowing the single thread running that por- Answer:
tion of the workload to use the entire capabilities of that Wrong.
core. Other more highly parallel but cache-miss-prone
Listing E.11 (C-R+o-wmb-o+o-mb-o.litmus) shows
portions of the workload might be assigned to cores with
a two-thread litmus test that requires propagation due to
all hardware threads enabled to provide improved through-
the fact that it only has store-to-store and load-to-store
put. This improved throughput could be due to the fact
links between its pair of threads. Even though P0() is
that while one hardware thread is stalled on a cache miss,
fully ordered by the smp_wmb() and P1() is fully ordered
the other hardware threads can make forward progress.
by the smp_mb(), the counter-temporal nature of the links
In such cases, performance requirements override quaint
means that the exists clause on line 21 really can trigger.
human notions of fairness. q
To prevent this triggering, the smp_wmb() on line 8 must
become an smp_mb(), bringing propagation into play
Quick Quiz 15.23: p.328 twice, once for each non-temporal link. q
Referring to Table 15.4, why on earth would P0()’s store
take so long to complete when P1()’s store complete
Quick Quiz 15.25: p.330
so quickly? In other words, does the exists clause on
line 28 of Listing 15.16 really trigger on real systems? But given that smp_mb() has the propagation property,
why doesn’t the smp_mb() on line 25 of Listing 15.18
prevent the exists clause from triggering?
Answer:
You need to face the fact that it really can trigger. Akira Answer:
Yokosawa used the litmus7 tool to run this litmus test on a As a rough rule of thumb, the smp_mb() barrier’s propaga-
POWER8 system. Out of 1,000,000,000 runs, 4 triggered tion property is sufficient to maintain ordering through only
the exists clause. Thus, triggering the exists clause one load-to-store link between processes. Unfortunately,
is not merely a one-in-a-million occurrence, but rather a Listing 15.18 has not one but two load-to-store links, with
one-in-a-hundred-million occurrence. But it nevertheless the first being from the READ_ONCE() on line 17 to the
really does trigger on real systems. q WRITE_ONCE() on line 24 and the second being from the
READ_ONCE() on line 26 to the WRITE_ONCE() on line 7.

v2022.09.25a
E.15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING 547

Listing E.12: 2+2W Litmus Test (No Ordering) Listing E.13: LB Litmus Test With No Acquires
1 C C-2+2W+o-o+o-o 1 C C-LB+o-data-o+o-data-o+o-data-o
2 2
3 {} 3 {
4 4 x1=1;
5 P0(int *x0, int *x1) 5 x2=2;
6 { 6 }
7 WRITE_ONCE(*x0, 1); 7
8 WRITE_ONCE(*x1, 2); 8 P0(int *x0, int *x1)
9 } 9 {
10 10 int r2;
11 P1(int *x0, int *x1) 11
12 { 12 r2 = READ_ONCE(*x0);
13 WRITE_ONCE(*x1, 1); 13 WRITE_ONCE(*x1, r2);
14 WRITE_ONCE(*x0, 2); 14 }
15 } 15
16 16 P1(int *x1, int *x2)
17 exists (x0=1 /\ x1=1) 17 {
18 int r2;
19
20 r2 = READ_ONCE(*x1);
Therefore, preventing the exists clause from triggering 21 WRITE_ONCE(*x2, r2);
22 }
should be expected to require not one but two instances 23

of smp_mb(). 24 P2(int *x2, int *x0)


25 {
As a special exception to this rule of thumb, a release- 26 int r2;
acquire chain can have one load-to-store link between 27
28 r2 = READ_ONCE(*x2);
processes and still prohibit the cycle. q 29 WRITE_ONCE(*x0, r2);
30 }
31

p.331 32 exists (0:r2=2 /\ 1:r2=0 /\ 2:r2=1)


Quick Quiz 15.26:
But for litmus tests having only ordered stores, as
shown in Listing 15.20 (C-2+2W+o-wmb-o+o-wmb-
Answer:
o.litmus), research shows that the cycle is prohib-
Listing E.13 shows a somewhat nonsensical but very real
ited, even in weakly ordered systems such as Arm and
example. Creating a more useful (but still real) litmus test
Power [SSA+ 11]. Given that, are store-to-store really
is left as an exercise for the reader. q
always counter-temporal???

Answer:
Quick Quiz 15.28: p.333
This litmus test is indeed a very interesting curiosity.
Its ordering apparently occurs naturally given typical Suppose we have a short release-acquire chain along
weakly ordered hardware design, which would normally with one load-to-store link and one store-to-store link,
be considered a great gift from the relevant laws of physics like that shown in Listing 15.25. Given that there is only
and cache-coherency-protocol mathematics. one of each type of non-store-to-load link, the exists
Unfortunately, no one has been able to come up with a cannot trigger, right?
software use case for this gift that does not have a much
better alternative implementation. Therefore, neither the Answer:
C11 nor the Linux kernel memory models provide any Wrong. It is the number of non-store-to-load links that
guarantee corresponding to Listing 15.20. This means matters. If there is only one non-store-to-load link, a
that the exists clause on line 19 can trigger. release-acquire chain can prevent the exists clause from
Of course, without the barrier, there are no ordering triggering. However, if there is more than one non-store-
guarantees, even on real weakly ordered hardware, as to-load link, be they store-to-store, load-to-store, or any
shown in Listing E.12 (C-2+2W+o-o+o-o.litmus). q combination thereof, it is necessary to have at least one
full barrier (smp_mb() or better) between each non-store-
to-load link. In Listing 15.25, preventing the exists
Quick Quiz 15.27: p.332
clause from triggering therefore requires an additional full
Can you construct a litmus test like that in Listing 15.21 barrier between either P0()’s or P1()’s accesses. q
that uses only dependencies?

v2022.09.25a
548 APPENDIX E. ANSWERS TO QUICK QUIZZES

p.334
Listing E.14: Breakable Dependencies With Non-Constant
Quick Quiz 15.29: Comparisons
There are store-to-load links, load-to-store links, and 1 int *gp1;
store-to-store links. But what about load-to-load links? 2 int *p;
3 int *q;
4
5 p = rcu_dereference(gp1);
Answer: 6 q = get_a_pointer();
7 if (p == q)
The problem with the concept of load-to-load links is 8 handle_equality(p);
that if the two loads from the same variable return the 9 do_something_with(*p);
same value, there is no way to determine their ordering.
The only way to determine their ordering is if they return Listing E.15: Broken Dependencies With Non-Constant Com-
different values, in which case there had to have been an parisons
intervening store. And that intervening store means that 1 int *gp1;
2 int *p;
there is no load-to-load link, but rather a load-to-store link 3 int *q;
followed by a store-to-load link. q 4
5 p = rcu_dereference(gp1);
6 q = get_a_pointer();
7 if (p == q) {
Quick Quiz 15.30: p.335 8 handle_equality(q);
9 do_something_with(*q);
Why not place a barrier() call immediately before a 10 } else {
plain store to prevent the compiler from inventing stores? 11 do_something_with(*p);
12 }

Answer:
Because it would not work. Although the compiler Listing E.14. The compiler is within its rights to transform
would be prevented from inventing a store prior to the this code into that shown in Listing E.15, and might
barrier(), nothing would prevent it from inventing a well make this transformation due to register pressure
store between that barrier() and the plain store. q if handle_equality() was inlined and needed a lot
of registers. Line 9 of this transformed code uses q,
which although equal to p, is not necessarily tagged by
Quick Quiz 15.31: p.336
the hardware as carrying a dependency. Therefore, this
Why can’t you simply dereference the pointer before com- transformed code does not necessarily guarantee that line 9
paring it to &reserve_int on line 6 of Listing 15.28? is ordered after line 5.13 q

Answer: Quick Quiz 15.33: p.337


For first, it might be necessary to invoke handle_ But doesn’t the condition in line 35 supply a control
reserve() before do_something_with(). dependency that would keep line 36 ordered after line 34?
But more relevant to memory ordering, the compiler is
often within its rights to hoist the comparison ahead of
the dereferences, which would allow the compiler to use Answer:
&reserve_int instead of the variable p that the hardware Yes, but no. Yes, there is a control dependency, but control
has tagged with a dependency. q dependencies do not order later loads, only later stores. If
you really need ordering, you could place an smp_rmb()
between lines 35 and 36. Or better yet, have updater()
Quick Quiz 15.32: p.336 allocate two structures instead of reusing the structure.
But it should be safe to compare two pointer variables, For more information, see Section 15.3.3. q
right? After all, the compiler doesn’t know the value of
either, so how can it possibly learn anything from the p.340
Quick Quiz 15.34:
comparison?
Can’t you instead add an smp_mb() to P1() in List-
Answer: ing 15.32?
Unfortunately, the compiler really can learn enough to
break your dependency chain, for example, as shown in 13 Kudos to Linus Torvalds for providing this example.

v2022.09.25a
E.15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING 549

Answer:
Not given the Linux kernel memory model. (Try it!)
However, you can instead replace P0()’s WRITE_ONCE()
with smp_store_release(), which usually has less
overhead than does adding an smp_mb(). q

Quick Quiz 15.35: p.341


But doesn’t PowerPC have weak unlock-lock ordering
properties within the Linux kernel, allowing a write
before the unlock to be reordered with a read after the Listing E.16: Accesses Between Multiple Different-CPU Criti-
lock? cal Sections
1 C Lock-across-unlock-lock-3
Answer: 2

Yes, but only from the perspective of a third thread not 3 {}


4
holding that lock. In contrast, memory allocators need 5 P0(int *x, spinlock_t *sp)
only concern themselves with the two threads migrating 6 {
7 spin_lock(sp);
the memory. It is after all the developer’s responsibility 8 WRITE_ONCE(*x, 1);
to properly synchronize with any other threads that need 9 spin_unlock(sp);
10 }
access to the newly migrated block of memory. q 11
12 P1(int *x, int *y, int *z, spinlock_t *sp)
13 {
Quick Quiz 15.36: p.343 14 int r1;
15
But if there are three critical sections, isn’t it true that 16 spin_lock(sp);
CPUs not holding the lock will observe the accesses 17 r1 = READ_ONCE(*x);
18 WRITE_ONCE(*z, 1);
from the first and the third critical section as being 19 spin_unlock(sp);
ordered? 20 }
21
22 P2(int *x, int *y, int *z, spinlock_t *sp)
Answer: 23 {
No. 24 int r1;
25 int r2;
Listing E.16 shows an example three-critical-section 26
chain (Lock-across-unlock-lock-3.litmus). Run- 27 spin_lock(sp);
28 r1 = READ_ONCE(*z);
ning this litmus test shows that the exists clause can still 29 r2 = READ_ONCE(*y);
be satisfied, so this additional critical section is still not 30 spin_unlock(sp);
31 }
sufficient to force ordering. 32
However, as the reader can verify, placing an smp_mb__ 33 P3(int *x, int *y, spinlock_t *sp)
34 {
after_spinlock() after either P1()’s or P2()’s lock 35 int r1;
acquisition does suffice to force ordering. q 36
37 WRITE_ONCE(*y, 1);
38 smp_mb();
39 r1 = READ_ONCE(*x);
Quick Quiz 15.37: p.344
40 }
But if spin_is_locked() returns false, don’t we 41
42 exists (1:r1=1 /\ 2:r1=1 /\ 2:r2=0 /\ 3:r1=0)
also know that no other CPU or thread is holding the
corresponding lock?

Answer:
No. By the time that the code inspects the return value
from spin_is_locked(), some other CPU or thread
might well have acquired the corresponding lock. q

Quick Quiz 15.38: p.347


Wait a minute! In QSBR implementations of RCU, no

v2022.09.25a
550 APPENDIX E. ANSWERS TO QUICK QUIZZES

code is emitted for rcu_read_lock() and rcu_read_ Answer:


unlock(). This means that the RCU read-side critical The cycle would again be forbidden. Further analysis is
section in Listing 15.45 isn’t just empty, it is completely left as an exercise for the reader. q
nonexistent!!! So how can something that doesn’t exist
at all possibly have any effect whatsoever on ordering??? p.353
Quick Quiz 15.41:
What happens to code between an atomic operation and
Answer: an smp_mb__after_atomic()?
Because in QSBR, RCU read-side critical sections don’t
actually disappear. Instead, they are extended in both Answer:
directions until a quiescent state is encountered. For First, please don’t do this!
example, in the Linux kernel, the critical section might be But if you do, this intervening code will either be
extended back to the most recent schedule() call and ordered after the atomic operation or before the smp_
ahead to the next schedule() call. Of course, in non- mb__after_atomic(), depending on the architecture,
QSBR implementations, rcu_read_lock() and rcu_ but not both. This also applies to smp_mb__before_
read_unlock() really do emit code, which can clearly atomic() and smp_mb__after_spinlock(), that is,
provide ordering. And within the Linux kernel, even both the uncertain ordering of the intervening code and
the QSBR implementation has a compiler barrier() in the plea to avoid such code. q
rcu_read_lock() and rcu_read_unlock(), which is
necessary to prevent the compiler from moving memory
Quick Quiz 15.42: p.354
accesses that might result in page faults into the RCU
read-side critical section. Why does Alpha’s READ_ONCE() include an mb() rather
Therefore, strange though it might seem, empty RCU than rmb()?
read-side critical sections really can and do provide some
degree of ordering. q Answer:
Alpha has only mb and wmb instructions, so smp_rmb()
would be implemented by the Alpha mb instruction in
Quick Quiz 15.39: p.348
either case. In addition, at the time that the Linux kernel
Can P1()’s accesses be reordered in the litmus tests started relying on dependency ordering, it was not clear
shown in Listings 15.43, 15.44, and 15.45 in the same that Alpha ordered dependent stores, and thus smp_mb()
way that they were reordered going from Listing 15.38 was therefore the safe choice.
to Listing 15.39? However, given the aforementioned v5.9 changes to
READ_ONCE() and a few of Alpha’s atomic read-modify-
Answer:
write operations, no Linux-kernel core code need con-
No, because none of these later litmus tests have more than
cern itself with DEC Alpha, thus greatly reducing Paul
one access within their RCU read-side critical sections.
E. McKenney’s incentive to remove Alpha support from
But what about swapping the accesses, for example, in
the kernel. q
Listing 15.43, placing P1()’s WRITE_ONCE() within its
critical section and the READ_ONCE() before its critical
section? Quick Quiz 15.43: p.354
Swapping the accesses allows both instances of r2 to Isn’t DEC Alpha significant as having the weakest pos-
have a final value of zero, in other words, although RCU sible memory ordering?
read-side critical sections’ ordering properties can extend
outside of those critical sections, the same is not true of Answer:
their reordering properties. Checking this with herd and Although DEC Alpha does take considerable flak, it
explaining why is left as an exercise for the reader. q does avoid reordering reads from the same CPU to the
same variable. It also avoids the out-of-thin-air problem
Quick Quiz 15.40: p.350 that plagues the Java and C11 memory models [BD14,
What would happen if the smp_mb() was instead added BMN+ 15, BS14, Boe20, Gol19, Jef14, MB20, MJST16,
between P2()’s accesses in Listing 15.47? Š11, VBC+ 15]. q

v2022.09.25a
E.15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING 551

Listing E.17: Userspace RCU Code Reordering at about the same time. Then consider the following
1 static inline int rcu_gp_ongoing(unsigned long *ctr) sequence of events:
2 {
3 unsigned long v;
4 1. CPU 0 acquires the lock at line 29.
5 v = LOAD_SHARED(*ctr);
6 return v && (v != rcu_gp_ctr); 2. Line 27 determines that CPU 0 was online, so it
7 }
8 clears its own counter at line 28. (Recall that lines 27
9 static void update_counter_and_wait(void) and 28 have been reordered by the compiler to follow
10 {
11 struct rcu_reader *index; line 29).
12
13 STORE_SHARED(rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR); 3. CPU 0 invokes update_counter_and_wait()
14 barrier();
15 list_for_each_entry(index, &registry, node) { from line 30.
16 while (rcu_gp_ongoing(&index->ctr))
17 msleep(10); 4. CPU 0 invokes rcu_gp_ongoing() on itself at
18 }
19 } line 16, and line 5 sees that CPU 0 is in a quies-
20 cent state. Control therefore returns to update_
21 void synchronize_rcu(void)
22 { counter_and_wait(), and line 15 advances to
23 unsigned long was_online; CPU 1.
24
25 was_online = rcu_reader.ctr;
26 smp_mb(); 5. CPU 1 invokes synchronize_rcu(), but because
27 if (was_online) CPU 0 already holds the lock, CPU 1 blocks wait-
28 STORE_SHARED(rcu_reader.ctr, 0);
29 mutex_lock(&rcu_gp_lock); ing for this lock to become available. Because the
30 update_counter_and_wait(); compiler reordered lines 27 and 28 to follow line 29,
31 mutex_unlock(&rcu_gp_lock);
32 if (was_online) CPU 1 does not clear its own counter, despite having
33 STORE_SHARED(rcu_reader.ctr, LOAD_SHARED(rcu_gp_ctr)); been online.
34 smp_mb();
35 }
6. CPU 0 invokes rcu_gp_ongoing() on CPU 1 at
line 16, and line 5 sees that CPU 1 is not in a quiescent
state. The while loop at line 16 therefore never exits.
Quick Quiz 15.44: p.356
Given that hardware can have a half memory barrier, So the compiler’s reordering results in a deadlock. In
why don’t locking primitives allow the compiler to move contrast, hardware reordering is temporary, so that CPU 1
memory-reference instructions into lock-based critical might undertake its first attempt to acquire the mutex
sections? on line 29 before executing lines 27 and 28, but it will
eventually execute lines 27 and 28. Because hardware
Answer: reordering only results in a short delay, it can be tolerated.
In fact, as we saw in Section 15.5.3 and will see in On the other hand, because compiler reordering results in
Section 15.5.6, hardware really does implement partial a deadlock, it must be prohibited.
memory-ordering instructions and it also turns out that Some research efforts have used hardware transactional
these really are used to construct locking primitives. How- memory to allow compilers to safely reorder more aggres-
ever, these locking primitives use full compiler barriers, sively, but the overhead of hardware transactions has thus
thus preventing the compiler from reordering memory- far made such optimizations unattractive. q
reference instructions both out of and into the correspond-
ing critical section. p.360
Quick Quiz 15.45:
To see why the compiler is forbidden from doing reorder- Why is it necessary to use heavier-weight ordering for
ing that is permitted by hardware, consider the following load-to-store and store-to-store links, but not for store-
sample code in Listing E.17. This code is based on the to-load links? What on earth makes store-to-load links
userspace RCU update-side code [DMS+ 12, Supplemen- so special???
tary Materials Figure 5].
Suppose that the compiler reordered lines 27 and 28 Answer:
into the critical section starting at line 29. Now suppose Recall that load-to-store and store-to-store links can be
that two updaters start executing synchronize_rcu() counter-temporal, as illustrated by Figures 15.10 and 15.11

v2022.09.25a
552 APPENDIX E. ANSWERS TO QUICK QUIZZES

in Section 15.2.7.2. This counter-temporal nature of to work in a given situation. However, even in these cases,
load-to-store and store-to-store links necessitates strong it may be very worthwhile to spend a little time trying
ordering. to come up with a simpler algorithm! After all, if you
In constrast, store-to-load links are temporal, as illus- managed to invent the first algorithm to do some task, it
trated by Listings 15.12 and 15.13. This temporal nature shouldn’t be that hard to go on to invent a simpler one. q
of store-to-load links permits use of minimal ordering. q

E.17 Conflicting Visions of the Fu-


E.16 Ease of Use ture
Quick Quiz 16.1: p.365
Quick Quiz 17.1: p.374
Can a similar algorithm be used when deleting elements? But suppose that an application exits while holding a
pthread_mutex_lock() that happens to be located in
Answer: a file-mapped region of memory?
Yes. However, since each thread must hold the locks of
Answer:
three consecutive elements to delete the middle one, if
Indeed, in this case the lock would persist, much to the
there are 𝑁 threads, there must be 2𝑁 + 1 elements (rather
consternation of other processes attempting to acquire this
than just 𝑁 + 1) in order to avoid deadlock. q
lock that is held by a process that no longer exists. Which
is why great care is required when using pthread_mutex
Quick Quiz 16.2: p.365 objects located in file-mapped memory regions. q
Yetch! What ever possessed someone to come up with
an algorithm that deserves to be shaved as much as this p.375
Quick Quiz 17.2:
one does???
What about non-persistent primitives represented by data
Answer: structures in mmap() regions of memory? What happens
That would be Paul. when there is an exec() within a critical section of such
He was considering the Dining Philosopher’s Prob- a primitive?
lem, which involves a rather unsanitary spaghetti dinner
attended by five philosophers. Given that there are five Answer:
plates and but five forks on the table, and given that each If the exec()ed program maps those same regions of
philosopher requires two forks at a time to eat, one is memory, then this program could in principle simply
supposed to come up with a fork-allocation algorithm that release the lock. The question as to whether this approach
avoids deadlock. Paul’s response was “Sheesh! Just get is sound from a software-engineering viewpoint is left as
five more forks!” an exercise for the reader. q
This in itself was OK, but Paul then applied this same
solution to circular linked lists. Quick Quiz 17.3: p.380
This would not have been so bad either, but he had to MV-RLU looks pretty good! Doesn’t it beat RCU hands
go and tell someone about it! q down?
Answer:
Quick Quiz 16.3: p.365 One might get that impression from a quick read of the
Give an exception to this rule. abstract, but more careful readers will notice the “for a
wide range of workloads” phrase in the last sentence. It
Answer: turns out that this phrase is quite important:
One exception would be a difficult and complex algorithm
that was the only one known to work in a given situation. 1. Their RCU evaluation uses synchronous grace pe-
Another exception would be a difficult and complex algo- riods, which needlessly throttle updates, as noted
rithm that was nonetheless the simplest of the set known in their Section 6.2.1. See Figure 10.11 page 193

v2022.09.25a
E.17. CONFLICTING VISIONS OF THE FUTURE 553

of this book to see that the venerable asynchronous presents an all-too-rare example of good scalability com-
call_rcu() primitive enables RCU to perform and bined with strong read-side coherence. They are also to be
scale quite well with large numbers of updaters. Fur- congratulated on overcoming the traditional academic prej-
thermore, in Section 3.7 of their paper, the authors udice against asynchronous grace periods, which greatly
admit that asynchronous grace periods are important aided their scalability.
to MV-RLU scalability. A fair comparison would Interestingly enough, RLU and RCU take different
also allow RCU the benefits of asynchrony. approaches to avoid the inherent limitations of STM noted
by Hagit Attiya et al. [AHM09]. RCU avoids providing
2. They use a poorly tuned 1,000-bucket hash table con- strict serializability and RLU avoids providing invisible
taining 10,000 elements. In addition, their 448 hard- read-only transactions, both thus avoiding the limitations.
ware threads need considerably more than 1,000 buck- q
ets to avoid the lock contention that they correctly
state limits RCU performance in their benchmarks.
Quick Quiz 17.4: p.381
A useful comparison would feature a properly tuned
hash table. Given things like spin_trylock(), how does it make
any sense at all to claim that TM introduces the concept
3. Their RCU hash table used per-bucket locks, which of failure???
they call out as a bottleneck, which is not a surprise
given the long hash chains and small ratio of buckets Answer:
to threads. A number of their competing mecha- When using locking, spin_trylock() is a choice, with
nisms instead use lockfree techniques, thus avoiding a corresponding failure-free choice being spin_lock(),
the per-bucket-lock bottleneck, which cynics might which is used in the common case, as in there are more than
claim sheds some light on the authors’ otherwise 100 times as many calls to spin_lock() than to spin_
inexplicable choice of poorly tuned hash tables. The trylock() in the v5.11 Linux kernel. When using TM,
first graph in the middle row of the authors’ Figure 4 the only failure-free choice is the irrevocable transaction,
shows what RCU can achieve if not hobbled by ar- which is not used in the common case. In fact, the
tificial bottlenecks, as does the first portion of the irrevocable transaction is not even available in all TM
second graph in that same row. implementations. q

4. Their linked-list operation permits RLU to do con- p.382


Quick Quiz 17.5:
current modifications of different elements in the
What is to learn? Why not just use TM for memory-
list, while RCU is forced to serialize updates. Again,
based data structures and locking for those rare cases
RCU has always worked just fine in conjunction with
featuring the many silly corner cases listed in this silly
lockless updaters, a fact that has been set forth in
section???
academic literature that the authors cited [DMS+ 12].
A fair comparison would use the same style of update Answer:
for RCU as it does for MV-RLU. The year 2005 just called, and it says that it wants its
incandescent TM marketing hype back.
5. The authors fail to consider combining RCU and
In the year 2021, TM still has significant proving to
sequence locking, which is used in the Linux kernel to
do, even with the advent of HTM, which is covered in the
give readers coherent views of multi-pointer updates.
upcoming Section 17.3. q
6. The authors fail to consider RCU-based solutions to
the Issaquah Challenge [McK16a], which also gives Quick Quiz 17.6: p.384
readers a coherent view of multi-pointer updates, Why would it matter that oft-written variables shared
albeit with a weaker view of “coherent”. the cache line with the lock variable?

It is surprising that the anonymous reviewers of this Answer:


paper did not demand an apples-to-apples comparison If the lock is in the same cacheline as some of the variables
of MV-RLU and RCU. Nevertheless, the authors should that it is protecting, then writes to those variables by
be congratulated on producing an academic paper that one CPU will invalidate that cache line for all the other


CPUs. These invalidations will generate large numbers of The program is now in the else-clause instead of the
conflicts and retries, perhaps even degrading performance then-clause.
and scalability compared to locking. q This is not what I call an easy-to-use debugger. q

Quick Quiz 17.7: p.384 p.387


Quick Quiz 17.10:
Why are relatively small updates important to HTM But why would anyone need an empty lock-based critical
performance and scalability? section???
Answer: Answer:
The larger the updates, the greater the probability of See the answer to Quick Quiz 7.19 in Section 7.2.1.
conflict, and thus the greater probability of retries, which However, it is claimed that given a strongly atomic HTM
degrade performance. q implementation without forward-progress guarantees, any
memory-based locking design based on empty critical
Quick Quiz 17.8: p.386 sections will operate correctly in the presence of transac-
How could a red-black tree possibly efficiently enu- tional lock elision. Although I have not seen a proof of
merate all elements of the tree regardless of choice of this statement, there is a straightforward rationale for this
synchronization mechanism??? claim. The main idea is that in a strongly atomic HTM
implementation, the results of a given transaction are not
Answer: visible until after the transaction completes successfully.
In many cases, the enumeration need not be exact. In these Therefore, if you can see that a transaction has started, it
cases, hazard pointers or RCU may be used to protect is guaranteed to have already completed, which means
readers, which provides low probability of conflict with that a subsequent empty lock-based critical section will
any given insertion or deletion. q successfully “wait” on it—after all, there is no waiting
required.
This line of reasoning does not apply to weakly atomic
Quick Quiz 17.9: p.386
systems (including many STM implementations), and it
But why can’t a debugger emulate single stepping by also does not apply to lock-based programs that use means
setting breakpoints at successive lines of the transaction, other than memory to communicate. One such means
relying on the retry to retrace the steps of the earlier is the passage of time (for example, in hard real-time
instances of the transaction? systems) or flow of priority (for example, in soft real-time
Answer: systems).
This scheme might work with reasonably high probability, Locking designs that rely on priority boosting are of
but it can fail in ways that would be quite surprising to particular interest. q
most users. To see this, consider the following transaction:
Quick Quiz 17.11: p.387
1 begin_trans();
2 if (a) { Can’t transactional lock elision trivially handle locking’s
3 do_one_thing();
4 do_another_thing();
time-based messaging semantics by simply choosing not
5 } else { to elide empty lock-based critical sections?
6 do_a_third_thing();
7 do_a_fourth_thing();
8 } Answer:
9 end_trans(); It could do so, but this would be both unnecessary and
insufficient.
Suppose that the user sets a breakpoint at line 4, which It would be unnecessary in cases where the empty
triggers, aborting the transaction and entering the debugger. critical section was due to conditional compilation. Here,
Suppose that between the time that the breakpoint triggers it might well be that the only purpose of the lock was to
and the debugger gets around to stopping all the threads, protect data, so eliding it completely would be the right
some other thread sets the value of a to zero. When the thing to do. In fact, leaving the empty lock-based critical
poor user attempts to single-step the program, surprise! section would degrade performance and scalability.


On the other hand, it is possible for a non-empty lock-based critical section to be relying on both the data-protection and time-based messaging semantics of locking. Using transactional lock elision in such a case would be incorrect, and would result in bugs. q

Quick Quiz 17.12: p.387
Given modern hardware [MOZ09], how can anyone possibly expect parallel software relying on timing to work?

Answer:
The short answer is that on commonplace commodity hardware, synchronization designs based on any sort of fine-grained timing are foolhardy and cannot be expected to operate correctly under all conditions.
That said, there are systems designed for hard real-time use that are much more deterministic. In the (very unlikely) event that you are using such a system, here is a toy example showing how time-based synchronization can work. Again, do not try this on commodity microprocessors, as they have highly nondeterministic performance characteristics.
This example uses multiple worker threads along with a control thread. Each worker thread corresponds to an outbound data feed, and records the current time (for example, from the clock_gettime() system call) in a per-thread my_timestamp variable after executing each unit of work. The real-time nature of this example results in the following set of constraints:

1. It is a fatal error for a given worker thread to fail to update its timestamp for a time period of more than MAX_LOOP_TIME.

2. Locks are used sparingly to access and update global state.

3. Locks are granted in strict FIFO order within a given thread priority.

When worker threads complete their feed, they must disentangle themselves from the rest of the application and place a status value in a per-thread my_status variable that is initialized to −1. Threads do not exit; they instead are placed on a thread pool to accommodate later processing requirements. The control thread assigns (and re-assigns) worker threads as needed, and also maintains a histogram of thread statuses. The control thread runs at a real-time priority no higher than that of the worker threads.

Worker threads' code is as follows:

 1 int my_status = -1;  /* Thread local. */
 2
 3 while (continue_working()) {
 4   enqueue_any_new_work();
 5   wp = dequeue_work();
 6   do_work(wp);
 7   my_timestamp = clock_gettime(...);
 8 }
 9
10 acquire_lock(&departing_thread_lock);
11
12 /*
13  * Disentangle from application, might
14  * acquire other locks, can take much longer
15  * than MAX_LOOP_TIME, especially if many
16  * threads exit concurrently.
17  */
18 my_status = get_return_status();
19 release_lock(&departing_thread_lock);
20
21 /* thread awaits repurposing. */

The control thread's code is as follows:

 1 for (;;) {
 2   for_each_thread(t) {
 3     ct = clock_gettime(...);
 4     d = ct - per_thread(my_timestamp, t);
 5     if (d >= MAX_LOOP_TIME) {
 6       /* thread departing. */
 7       acquire_lock(&departing_thread_lock);
 8       release_lock(&departing_thread_lock);
 9       i = per_thread(my_status, t);
10       status_hist[i]++;  /* Bug if TLE! */
11     }
12   }
13   /* Repurpose threads as needed. */
14 }

Line 5 uses the passage of time to deduce that the thread has exited, executing lines 6 and 10 if so. The empty lock-based critical section on lines 7 and 8 guarantees that any thread in the process of exiting completes (remember that locks are granted in FIFO order!).
Once again, do not try this sort of thing on commodity microprocessors. After all, it is difficult enough to get this right on systems specifically designed for hard real-time use! q

Quick Quiz 17.13: p.387
But the boostee() function in Listing 17.1 alternatively acquires its locks in reverse order! Won't this result in deadlock?

Answer:
No deadlock will result. To arrive at deadlock, two different threads must each acquire the two locks in opposite orders, which does not happen in this example. However, deadlock detectors such as lockdep [Cor06a] will flag this as a false positive. q
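The false-positive scenario in the answer to Quick Quiz 17.13 can be sketched in user space. The following hypothetical example (lock_a, lock_b, phase_one(), and phase_two() are invented names, and this is not the book's Listing 17.1) has a single thread acquire two pthread mutexes in opposite orders at different times. Because no two threads ever take these locks in conflicting orders concurrently, deadlock is impossible, yet a lock-ordering checker in the style of lockdep would record both the A-then-B and B-then-A dependencies and report an apparent inversion.

/* Hypothetical sketch: deadlock-free, but flagged by lock-order checkers. */
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void phase_one(void)
{
	pthread_mutex_lock(&lock_a);
	pthread_mutex_lock(&lock_b);	/* Dependency A -> B recorded. */
	/* ... critical section ... */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
}

void phase_two(void)
{
	pthread_mutex_lock(&lock_b);
	pthread_mutex_lock(&lock_a);	/* Dependency B -> A recorded. */
	/* ... critical section ... */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
}

int main(void)
{
	phase_one();	/* Completes before phase_two() starts, so the   */
	phase_two();	/* two acquisition orders can never interleave.  */
	return 0;
}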


p.388
Table E.5: Emulating Locking: Performance Compari-
Quick Quiz 17.14: son (s)
So a bunch of people set out to supplant locking, and
they mostly end up just optimizing locking??? cmpxchg_acquire() xchg_acquire()

Answer: # Lock filter exists filter exists


At least they accomplished something useful! And perhaps 2 0.004 0.022 0.039 0.027 0.058
there will continue to be additional HTM progress over 3 0.041 0.743 1.653 0.968 3.203
time [SNGK17, SBN+ 20, GGK18, PMDY20]. q 4 0.374 59.565 151.962 74.818 500.96
5 4.905
Quick Quiz 17.15: p.391
Tables 17.1 and 17.2 state that hardware is only starting
to become available. But hasn’t HTM hardware support coarse-grained locking that is similar to the Linux kernel’s
been widely available for almost a full decade? old Big Kernel Lock (BKL). There will hopefully come a
day when it makes sense to add SEL4’s verifiers to a book
Answer: on parallel programming, but this is not yet that day. q
Yes and no. It appears that implementing even the HTM
subset of TM in real hardware is a bit trickier than it
appears [JSG12, Was14, Int20a, Int21, Lar21]. Therefore, Quick Quiz 17.18: p.397
the sad fact is that “starting to become available” is all Why bother with a separate filter command on line 27
too accurate as of 2021. In fact, vendors are beginning to of Listing 17.2 instead of just adding the condition to
deprecate their HTM implementations [Int20c, Book III the exists clause? And wouldn’t it be simpler to use
Appendix A]. q xchg_acquire() instead of cmpxchg_acquire()?

Quick Quiz 17.16: p.395 Answer:


This list is ridiculously utopian! Why not stick to the The filter clause causes the herd tool to discard ex-
current state of the formal-verification art? ecutions at an earlier stage of processing than does the
exists clause, which provides significant speedups.
Answer: As for xchg_acquire(), this atomic operation will do
You are welcome to your opinion on what is and is not a write whether or not lock acquisition succeeds, which
utopian, but I will be paying more attention to people means that a model using xchg_acquire() will have
actually making progress on the items in that list than to more operations than one using cmpxchg_acquire(),
anyone who might be objecting to them. This might have which won’t do a write in the failed-acquisition case. More
something to do with my long experience with people writes means more combinatorial to explode, as shown in
attempting to talk me out of specific things that their writes means more combinatorial explosion, as shown in
favorite tools cannot handle. l-o-o-u+l-o-o-u*-C.litmus, C-SB+l-o-o-u+l-
In the meantime, please feel free to read the papers o-o-u*-CE.litmus, C-SB+l-o-o-u+l-o-o-u*-X.
written by the people who are actually making progress, litmus, and C-SB+l-o-o-u+l-o-o-u*-XE.litmus).
for example, this one [DFLO19]. q This table clearly shows that cmpxchg_acquire() out-
performs xchg_acquire() and that use of the filter
Quick Quiz 17.17: p.396 clause outperforms use of the exists clause. q
Given the groundbreaking nature of the various verifiers
used in the SEL4 project, why doesn’t this chapter cover
Quick Quiz 17.19: p.398
them in more depth?
How do we know that the MTBFs of known bugs is a
Answer: good estimate of the MTBFs of bugs that have not yet
There can be no doubt that the verifiers used by the SEL4 been located?
project are quite capable. However, SEL4 started as
a single-CPU project. And although SEL4 has gained Answer:
multi-processor capabilities, it is currently using very We don’t, but it does not matter.


To see this, note that the 7 % figure only applies to p.399


Quick Quiz 17.22:
injected bugs that were subsequently located: It neces-
How would testing stack up in the scorecard shown in
sarily ignores any injected bugs that were never found.
Table 17.5?
Therefore, the MTBF statistics of known bugs is likely to
be a good approximation of that of the injected bugs that Answer:
are subsequently located. It would be blue all the way down, with the possible
A key point in this whole section is that we should exception of the third row (overhead) which might well be
be more concerned about bugs that inconvenience users marked down for testing’s difficulty finding improbable
than about other bugs that never actually manifest. This bugs.
of course is not to say that we should completely ignore On the other hand, improbable bugs are often also
bugs that have not yet inconvenienced users, just that we irrelevant bugs, so your mileage may vary.
should properly prioritize our efforts so as to fix the most Much depends on the size of your installed base. If your
important and urgent bugs first. q code is only ever going to run on (say) 10,000 systems,
Murphy can actually be a really nice guy. Everything that
can go wrong, will. Eventually. Perhaps in geologic time.
Quick Quiz 17.20: p.398 But if your code is running on 20 billion systems, like
But the formal-verification tools should immediately the Linux kernel was said to be by late 2017, Murphy can
find all the bugs introduced by the fixes, so why is this a be a real jerk! Everything that can go wrong, will, and it
problem? can go wrong really quickly!!! q

Answer: Quick Quiz 17.23: p.399


It is a problem because real-world formal-verification tools But aren’t there a great many more formal-verification
(as opposed to those that exist only in the imaginations of systems than are shown in Table 17.5?
the more vociferous proponents of formal verification) are
not omniscient, and thus are only able to locate certain Answer:
types of bugs. For but one example, formal-verification Indeed there are! This table focuses on those that Paul
tools are unlikely to spot a bug corresponding to an has used, but others are proving to be useful. Formal veri-
omitted assertion or, equivalently, a bug corresponding to fication has been heavily used in the seL4 project [SM13],
an undiscovered portion of the specification. q and its tools can now handle modest levels of concurrency.
More recently, Catalin Marinas used Lamport’s TLA
tool [Lam02] to locate some forward-progress bugs in
Quick Quiz 17.21: p.399 the Linux kernel’s queued spinlock implementation. Will
But many formal-verification tools can only find one Deacon fixed these bugs [Dea18], and Catalin verified
bug at a time, so that each bug must be fixed before the Will’s fixes [Mar18].
tool can locate the next. How can bug-fix efforts be Lighter-weight formal verification tools have been
prioritized given such a tool? used heavily in production [LBD+ 04, BBC+ 10, Coo18,
SAE+ 18, DFLO19]. q
Answer:
One approach is to provide a simple fix that might not be
suitable for a production environment, but which allows E.18 Important Questions
the tool to locate the next bug. Another approach is to
restrict configuration or inputs so that the bugs located
Quick Quiz A.1: p.409
thus far cannot occur. There are a number of similar
approaches, but the common theme is that fixing the bug What SMP coding errors can you see in these examples?
from the tool’s viewpoint is usually much easier than See time.c for full code.
constructing and validating a production-quality fix, and
Answer:
the key point is to prioritize the larger efforts required to
Here are errors you might have found:
construct and validate the production-quality fixes. q
1. Missing barrier() or volatile on tight loops.


2. Missing memory barriers on update side. p.412


Quick Quiz A.4:
3. Lack of synchronization between producer and con- Suppose a portion of a program uses RCU read-side
sumer. q primitives as its only synchronization mechanism. Is
this parallelism or concurrency?

Quick Quiz A.2: p.410 Answer:


How could there be such a large gap between successive Yes. q
consumer reads? See timelocked.c for full code.

Answer: p.413
Quick Quiz A.5:
Here are a few reasons for such gaps:
In what part of the second (scheduler-based) perspective
would the lock-based single-thread-per-CPU workload
1. The consumer might be preempted for long time
be considered “concurrent”?
periods.

2. A long-running interrupt might delay the consumer. Answer:


The people who would like to arbitrarily subdivide and
3. Cache misses might delay the consumer. interleave the workload. Of course, an arbitrary subdi-
4. The producer might also be running on a faster CPU vision might end up separating a lock acquisition from
than is the consumer (for example, one of the CPUs the corresponding lock release, which would prevent any
might have had to decrease its clock frequency due to other thread from acquiring that lock. If the locks were
heat-dissipation or power-consumption constraints). pure spinlocks, this could even result in deadlock. q
q

p.411
Quick Quiz A.3: E.19 “Toy” RCU Implementations
But if fully ordered implementations cannot offer
stronger guarantees than the better performing and more
scalable weakly ordered implementations, why bother
Quick Quiz B.1: p.415
with full ordering?
Why wouldn’t any deadlock in the RCU implementation
Answer: in Listing B.1 also be a deadlock in any other RCU
Because strongly ordered implementations are sometimes implementation?
able to provide greater consistency among sets of calls to
functions accessing a given data structure. For example, Answer:
compare the atomic counter of Listing 5.2 to the statistical Suppose the functions foo() and bar() in Listing E.18
counter of Section 5.2. Suppose that one thread is adding are invoked concurrently from different CPUs. Then
the value 3 and another is adding the value 5, while two foo() will acquire my_lock() on line 3, while bar()
other threads are concurrently reading the counter’s value. will acquire rcu_gp_lock on line 13.
With atomic counters, it is not possible for one of the
readers to obtain the value 3 while the other obtains the When foo() advances to line 4, it will attempt to
value 5. With statistical counters, this outcome really can acquire rcu_gp_lock, which is held by bar(). Then
happen. In fact, in some computing environments, this when bar() advances to line 14, it will attempt to acquire
outcome can happen even on relatively strongly ordered my_lock, which is held by foo().
hardware such as x86. Each function is then waiting for a lock that the other
Therefore, if your users happen to need this admittedly holds, a classic deadlock.
unusual level of consistency, you should avoid weakly
ordered statistical counters. q Other RCU implementations neither spin nor block in
rcu_read_lock(), hence avoiding deadlocks. q
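The statistical-counter behavior described in the answer to Quick Quiz A.3 can be illustrated with a minimal sketch (not the book's count_stat.c; the names and the fixed NR_THREADS layout are invented for illustration). Each updater adds only to its own slot, and readers sum the slots, so the sum is not an atomic snapshot and two concurrent readers can see different totals.

/* Minimal per-thread ("statistical") counter sketch using C11 atomics. */
#include <stdatomic.h>

#define NR_THREADS 4

static atomic_ulong counter[NR_THREADS];

/* Called only by thread "tid". */
static void count_add(int tid, unsigned long n)
{
	atomic_fetch_add_explicit(&counter[tid], n, memory_order_relaxed);
}

/* May be called by any thread. */
static unsigned long count_read(void)
{
	unsigned long sum = 0;

	for (int t = 0; t < NR_THREADS; t++)
		sum += atomic_load_explicit(&counter[t], memory_order_relaxed);
	return sum;  /* Concurrent readers may observe different totals. */
}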


Listing E.18: Deadlock in Lock-Based RCU Implementation within an RCU read-side critical section. However, this
1 void foo(void) situation could deadlock any correctly designed RCU
2 {
3 spin_lock(&my_lock); implementation. After all, the synchronize_rcu()
4 rcu_read_lock(); primitive must wait for all pre-existing RCU read-side
5 do_something();
6 rcu_read_unlock(); critical sections to complete, but if one of those critical
7 do_something_else(); sections is spinning on a lock held by the thread executing
8 spin_unlock(&my_lock);
9 } the synchronize_rcu(), we have a deadlock inherent
10 in the definition of RCU.
11 void bar(void)
12 { Another deadlock happens when attempting to nest
13 rcu_read_lock();
14 spin_lock(&my_lock);
RCU read-side critical sections. This deadlock is peculiar
15 do_some_other_thing(); to this implementation, and might be avoided by using
16 spin_unlock(&my_lock);
17 do_whatever();
recursive locks, or by using reader-writer locks that are
18 rcu_read_unlock(); read-acquired by rcu_read_lock() and write-acquired
19 }
by synchronize_rcu().
However, if we exclude the above two cases, this im-
plementation of RCU does not introduce any deadlock
Quick Quiz B.2: p.415
situations. This is because the only time some other thread's
Why not simply use reader-writer locks in the RCU lock is acquired is when executing synchronize_rcu(),
implementation in Listing B.1 in order to allow RCU and in that case, the lock is immediately released, pro-
readers to proceed in parallel? hibiting a deadlock cycle that does not involve a lock held
across the synchronize_rcu() which is the first case
Answer: above. q
One could in fact use reader-writer locks in this manner.
However, textbook reader-writer locks suffer from memory
contention, so that the RCU read-side critical sections Quick Quiz B.5: p.416
would need to be quite long to actually permit parallel Isn’t one advantage of the RCU algorithm shown in
execution [McK03]. Listing B.2 that it uses only primitives that are widely
On the other hand, use of a reader-writer lock that available, for example, in POSIX pthreads?
is read-acquired in rcu_read_lock() would avoid the
deadlock condition noted above. q Answer:
This is indeed an advantage, but do not forget that rcu_
p.416 dereference() and rcu_assign_pointer() are still
Quick Quiz B.3:
required, which means volatile manipulation for rcu_
Wouldn’t it be cleaner to acquire all the locks, and
dereference() and memory barriers for rcu_assign_
then release them all in the loop from lines 15–18 of
pointer(). Of course, many Alpha CPUs require mem-
Listing B.2? After all, with this change, there would be
ory barriers for both primitives. q
a point in time when there were no readers, simplifying
things greatly.
Quick Quiz B.6: p.417
Answer:
But what if you hold a lock across a call to
Making this change would re-introduce the deadlock, so
synchronize_rcu(), and then acquire that same lock
no, it would not be cleaner. q
within an RCU read-side critical section?

Quick Quiz B.4: p.416 Answer:


Is the implementation shown in Listing B.2 free from Indeed, this would deadlock any legal RCU implemen-
deadlocks? Why or why not? tation. But is rcu_read_lock() really participating in
the deadlock cycle? If you believe that it is, then please
Answer: ask yourself this same question when looking at the RCU
One deadlock is where a lock is held across implementation in Appendix B.9. q
synchronize_rcu(), and that same lock is acquired


p.417 Listing B.6 are really needed. See Chapter 12 for informa-
Quick Quiz B.7:
tion on using these tools. The first correct and complete
How can the grace period possibly elapse in 40
response will be credited. q
nanoseconds when synchronize_rcu() contains a
10-millisecond delay?
Quick Quiz B.10: p.419
Answer:
Why is the counter flipped twice in Listing B.6?
The update-side test was run in absence of readers, so the
Shouldn’t a single flip-and-wait cycle be sufficient?
poll() system call was never invoked. In addition, the
actual code has this poll() system call commented out,
the better to evaluate the true overhead of the update-side Answer:
code. Any production uses of this code would be better Both flips are absolutely required. To see this, consider
served by using the poll() system call, but then again, the following sequence of events:
production uses would be even better served by other
implementations shown later in this section. q 1 Line 8 of rcu_read_lock() in Listing B.5 picks
up rcu_idx, finding its value to be zero.
Quick Quiz B.8: p.417
2 Line 8 of synchronize_rcu() in Listing B.6 com-
Why not simply make rcu_read_lock() wait when plements the value of rcu_idx, setting its value to
a concurrent synchronize_rcu() has been waiting one.
too long in the RCU implementation in Listing B.3?
Wouldn’t that prevent synchronize_rcu() from starv- 3 Lines 10–12 of synchronize_rcu() find that the
ing? value of rcu_refcnt[0] is zero, and thus returns.
(Recall that the question is asking what happens if
Answer: lines 13–20 are omitted.)
Although this would in fact eliminate the starvation, it
would also mean that rcu_read_lock() would spin or 4 Lines 9 and 10 of rcu_read_lock() store the value
block waiting for the writer, which is in turn waiting on zero to this thread’s instance of rcu_read_idx and
readers. If one of these readers is attempting to acquire a increments rcu_refcnt[0], respectively. Execu-
lock that the spinning/blocking rcu_read_lock() holds, tion then proceeds into the RCU read-side critical
we again have deadlock. section.
In short, the cure is worse than the disease. See Appen-
dix B.4 for a proper cure. q 5 Another instance of synchronize_rcu() again
complements rcu_idx, this time setting its value
to zero. Because rcu_refcnt[1] is zero,
Quick Quiz B.9: p.418
synchronize_rcu() returns immediately. (Re-
Why the memory barrier on line 5 of synchronize_ call that rcu_read_lock() incremented rcu_
rcu() in Listing B.6 given that there is a spin-lock refcnt[0], not rcu_refcnt[1]!)
acquisition immediately after?
6 The grace period that started in step 5 has been
Answer: allowed to end, despite the fact that the RCU read-
The spin-lock acquisition only guarantees that the spin- side critical section that started beforehand in step 4
lock’s critical section will not “bleed out” to precede the has not completed. This violates RCU semantics, and
acquisition. It in no way guarantees that code preceding could allow the update to free a data element that the
the spin-lock acquisition won’t be reordered into the RCU read-side critical section was still referencing.
critical section. Such reordering could cause a removal
from an RCU-protected list to be reordered to follow the Exercise for the reader: What happens if rcu_read_
complementing of rcu_idx, which could allow a newly lock() is preempted for a very long time (hours!) just
starting RCU read-side critical section to see the recently after line 8? Does this implementation operate correctly
removed data element. in that case? Why or why not? The first correct and
Exercise for the reader: Use a tool such as Promela/spin complete response will be credited. q
to determine which (if any) of the memory barriers in


p.419 However, if you are stress-testing code that uses RCU,


Quick Quiz B.11:
you might want to comment out the poll() statement
Given that atomic increment and decrement are so expen-
in order to better catch bugs that incorrectly retain a
sive, why not just use non-atomic increment on line 10
reference to an RCU-protected data element outside of an
and a non-atomic decrement on line 25 of Listing B.5?
RCU read-side critical section. q

Answer: Quick Quiz B.14: p.422


Using non-atomic operations would cause increments and All of these toy RCU implementations have either
decrements to be lost, in turn causing the implementation atomic operations in rcu_read_lock() and rcu_
to fail. See Appendix B.5 for a safe way to use non- read_unlock(), or synchronize_rcu() overhead
atomic operations in rcu_read_lock() and rcu_read_ that increases linearly with the number of threads. Un-
unlock(). q der what circumstances could an RCU implementation
enjoy lightweight implementations for all three of these
Quick Quiz B.12: p.419 primitives, all having deterministic (O (1)) overheads
Come off it! We can see the atomic_read() primi- and latencies?
tive in rcu_read_lock()!!! So why are you trying
Answer:
to pretend that rcu_read_lock() contains no atomic
Special-purpose uniprocessor implementations of RCU
operations???
can attain this ideal [McK09a]. q
Answer:
The atomic_read() primitives does not actually execute Quick Quiz B.15: p.422
atomic machine instructions, but rather does a normal If any even value is sufficient to tell synchronize_
load from an atomic_t. Its sole purpose is to keep rcu() to ignore a given task, why don’t lines 11 and 12
the compiler’s type-checking happy. If the Linux kernel of Listing B.14 simply assign zero to rcu_reader_gp?
ran on 8-bit CPUs, it would also need to prevent “store
tearing”, which could happen due to the need to store a
16-bit pointer with two eight-bit accesses on some 8-bit Answer:
systems. But thankfully, it seems that no one runs Linux Assigning zero (or any other even-numbered constant)
on 8-bit systems. q would in fact work, but assigning the value of rcu_gp_
ctr can provide a valuable debugging aid, as it gives the
p.420
developer an idea of when the corresponding thread last
Quick Quiz B.13:
exited an RCU read-side critical section. q
Great, if we have 𝑁 threads, we can have 2𝑁 ten-
millisecond waits (one set per flip_counter_and_
wait() invocation, and even that assumes that we wait Quick Quiz B.16: p.423
only once for each thread). Don’t we need the grace Why are the memory barriers on lines 19 and 31 of List-
period to complete much more quickly? ing B.14 needed? Aren’t the memory barriers inherent
in the locking primitives on lines 20 and 30 sufficient?
Answer:
Keep in mind that we only wait for a given thread if that
thread is still in a pre-existing RCU read-side critical sec- Answer:
tion, and that waiting for one hold-out thread gives all the These memory barriers are required because the locking
other threads a chance to complete any pre-existing RCU primitives are only guaranteed to confine the critical
read-side critical sections that they might still be executing. section. The locking primitives are under absolutely no
So the only way that we would wait for 2𝑁 intervals would obligation to keep other code from bleeding in to the
be if the last thread still remained in a pre-existing RCU critical section. The pair of memory barriers are therefore
read-side critical section despite all the waiting for all the requires to prevent this sort of code motion, whether
prior threads. In short, this implementation will not wait performed by the compiler or by the CPU. q
unnecessarily.


Quick Quiz B.17: p.423 Quick Quiz B.21: p.424


Couldn’t the update-side batching optimization de- Again, given the algorithm shown in Listing B.16, is
scribed in Appendix B.6 be applied to the implementa- counter overflow fatal? Why or why not? If it is fatal,
tion shown in Listing B.14? what can be done to fix it?

Answer: Answer:
Indeed it could, with a few modifications. This work is It can indeed be fatal. To see this, consider the following
left as an exercise for the reader. q sequence of events:
1. Thread 0 enters rcu_read_lock(), determines that
Quick Quiz B.18: p.423
it is not nested, and therefore fetches the value of
Is the possibility of readers being preempted in lines 3–4 the global rcu_gp_ctr. Thread 0 is then preempted
of Listing B.14 a real problem, in other words, is there a for an extremely long time (before storing to its
real sequence of events that could lead to failure? If not, per-thread rcu_reader_gp variable).
why not? If so, what is the sequence of events, and how
can the failure be addressed? 2. Other threads repeatedly invoke synchronize_
rcu(), so that the new value of the global rcu_gp_
Answer: ctr is now RCU_GP_CTR_BOTTOM_BIT less than it
It is a real problem, there is a sequence of events leading was when thread 0 fetched it.
to failure, and there are a number of possible ways of
addressing it. For more details, see the Quick Quizzes 3. Thread 0 now starts running again, and stores into its
near the end of Appendix B.8. The reason for locating per-thread rcu_reader_gp variable. The value it
the discussion there is to (1) give you more time to think stores is RCU_GP_CTR_BOTTOM_BIT+1 greater than
about it, and (2) because the nesting support added in that that of the global rcu_gp_ctr.
section greatly reduces the time required to overflow the
4. Thread 0 acquires a reference to RCU-protected data
counter. q
element A.

Quick Quiz B.19: p.424 5. Thread 1 now removes the data element A that
Why not simply maintain a separate per-thread nesting- thread 0 just acquired a reference to.
level variable, as was done in previous section, rather 6. Thread 1 invokes synchronize_rcu(), which in-
than having all this complicated bit manipulation? crements the global rcu_gp_ctr by RCU_GP_CTR_
Answer: BOTTOM_BIT. It then checks all of the per-thread
The apparent simplicity of the separate per-thread variable rcu_reader_gp variables, but thread 0’s value (in-
is a red herring. This approach incurs much greater correctly) indicates that it started after thread 1’s call
complexity in the guise of careful ordering of operations, to synchronize_rcu(), so thread 1 does not wait
especially if signal handlers are to be permitted to contain for thread 0 to complete its RCU read-side critical
RCU read-side critical sections. But don’t take my word section.
for it, code it up and see what you end up with! q 7. Thread 1 then frees up data element A, which thread 0
is still referencing.
Quick Quiz B.20: p.424
Note that scenario can also occur in the implementation
Given the algorithm shown in Listing B.16, how could presented in Appendix B.7.
you double the time required to overflow the global One strategy for fixing this problem is to use 64-bit
rcu_gp_ctr? counters so that the time required to overflow them would
Answer: exceed the useful lifetime of the computer system. Note
One way would be to replace the magnitude compar- that non-antique members of the 32-bit x86 CPU family
ison on lines 32 and 33 with an inequality check of allow atomic manipulation of 64-bit counters via the
the per-thread rcu_reader_gp variable against rcu_gp_ cmpxchg64b instruction.
ctr+RCU_GP_CTR_BOTTOM_BIT. q Another strategy is to limit the rate at which grace
periods are permitted to occur in order to achieve a similar


effect. For example, synchronize_rcu() could record However, this memory barrier is absolutely required so
the last time that it was invoked, and any subsequent that other threads will see the store on lines 12–13 before
invocation would then check this time and block as needed any subsequent RCU read-side critical sections executed
to force the desired spacing. For example, if the low-order by the caller. q
four bits of the counter were reserved for nesting, and if
grace periods were permitted to occur at most ten times
Quick Quiz B.23: p.425
per second, then it would take more than 300 days for the
counter to overflow. However, this approach is not helpful Why are the two memory barriers on lines 11 and 14 of
if there is any possibility that the system will be fully Listing B.18 needed?
loaded with CPU-bound high-priority real-time threads
for the full 300 days. (A remote possibility, perhaps, but Answer:
best to consider it ahead of time.) The memory barrier on line 11 prevents any RCU read-
side critical sections that might precede the call to rcu_
A third approach is to administratively abolish real-
thread_offline() won’t be reordered by either the com-
time threads from the system in question. In this case,
piler or the CPU to follow the assignment on lines 12–13.
the preempted process will age up in priority, thus getting
The memory barrier on line 14 is, strictly speaking, unnec-
to run long before the counter had a chance to overflow.
essary, as it is illegal to have any RCU read-side critical
Of course, this approach is less than helpful for real-time
sections following the call to rcu_thread_offline().
applications.
q
A fourth approach would be for rcu_read_lock() to
recheck the value of the global rcu_gp_ctr after storing
Quick Quiz B.24: p.426
to its per-thread rcu_reader_gp counter, retrying if the
new value of the global rcu_gp_ctr is inappropriate. To be sure, the clock frequencies of POWER systems
This works, but introduces non-deterministic execution in 2008 were quite high, but even a 5 GHz clock fre-
time into rcu_read_lock(). On the other hand, if your quency is insufficient to allow loops to be executed in
application is being preempted long enough for the counter 50 picoseconds! What is going on here?
to overflow, you have no hope of deterministic execution
time in any case! Answer:
Since the measurement loop contains a pair of empty
A fifth approach is for the grace period process to wait
functions, the compiler optimizes it away. The measure-
for all readers to become aware of the new grace period.
ment loop takes 1,000 passes between each call to rcu_
This works nicely in theory, but hangs if a reader blocks
quiescent_state(), so this measurement is roughly
indefinitely outside of an RCU read-side critical section.
one thousandth of the overhead of a single call to rcu_
A final approach is, oddly enough, to use a single-bit quiescent_state(). q
grace-period counter and for each call to synchronize_
rcu() to take two passes through its algorithm. This is
the approach used by userspace RCU [Des09b], and is
described in detail in the journal article and supplementary Why would the fact that the code is in a library make
materials [DMS+ 12, Appendix D]. q any difference for how easy it is to use the RCU imple-
mentation shown in Listings B.18 and B.19?

Quick Quiz B.22: p.425


Answer:
Doesn’t the additional memory barrier shown on line 14 A library function has absolutely no control over the
of Listing B.18 greatly increase the overhead of rcu_ caller, and thus cannot force the caller to invoke rcu_
quiescent_state? quiescent_state() periodically. On the other hand,
a library function that made many references to a given
Answer: RCU-protected data structure might be able to invoke
Indeed it does! An application using this implementation rcu_thread_online() upon entry, rcu_quiescent_
of RCU should therefore invoke rcu_quiescent_state state() periodically, and rcu_thread_offline()
sparingly, instead using rcu_read_lock() and rcu_ upon exit. q
read_unlock() most of the time.


p.426 cache—or even from a cache that might be shared among


Quick Quiz B.26:
several CPUs. The key point is that a given cache does
But what if you hold a lock across a call to
not have room for a given data item, so some other piece
synchronize_rcu(), and then acquire that same lock
of data must be ejected from the cache to make room. If
within an RCU read-side critical section? This should
there is some other piece of data that is duplicated in some
be a deadlock, but how can a primitive that generates
other cache or in memory, then that piece of data may be
absolutely no code possibly participate in a deadlock
simply discarded, with no writeback message required.
cycle?
On the other hand, if every piece of data that might
Answer: be ejected has been modified so that the only up-to-date
Please note that the RCU read-side critical section is copy is in this cache, then one of those data items must be
in effect extended beyond the enclosing rcu_read_ copied somewhere else. This copy operation is undertaken
lock() and rcu_read_unlock(), out to the previ- using a “writeback message”.
ous and next call to rcu_quiescent_state(). This The destination of the writeback message has to be
rcu_quiescent_state can be thought of as an rcu_ something that is able to store the new value. This might
read_unlock() immediately followed by an rcu_read_ be main memory, but it also might be some other cache. If
lock(). it is a cache, it is normally a higher-level cache for the same
Even so, the actual deadlock itself will involve the CPU, for example, a level-1 cache might write back to a
lock acquisition in the RCU read-side critical section and level-2 cache. However, some hardware designs permit
the synchronize_rcu(), never the rcu_quiescent_ cross-CPU writebacks, so that CPU 0’s cache might send
state(). q a writeback message to CPU 1. This would normally
be done if CPU 1 had somehow indicated an interest in
p.427 the data, for example, by having recently issued a read
Quick Quiz B.27:
request.
Given that grace periods are prohibited within RCU read-
side critical sections, how can an RCU data structure In short, a writeback message is sent from some part of
possibly be updated while in an RCU read-side critical the system that is short of space, and is received by some
section? other part of the system that can accommodate the data.
q
Answer:
This situation is one reason for the existence of asynchro-
Quick Quiz C.2: p.432
nous grace-period primitives such as call_rcu(). This
primitive may be invoked within an RCU read-side critical What happens if two CPUs attempt to invalidate the
section, and the specified RCU callback will in turn be same cache line concurrently?
invoked at a later time, after a grace period has elapsed.
The ability to perform an RCU update while within Answer:
an RCU read-side critical section can be extremely con- One of the CPUs gains access to the shared bus first, and
venient, and is analogous to a (mythical) unconditional that CPU “wins”. The other CPU must invalidate its copy
read-to-write upgrade for reader-writer locking. q of the cache line and transmit an “invalidate acknowledge”
message to the other CPU.
Of course, the losing CPU can be expected to immedi-
ately issue a “read invalidate” transaction, so the winning
E.20 Why Memory Barriers? CPU’s victory will be quite ephemeral. q

Quick Quiz C.1: p.432 p.432


Quick Quiz C.3:
Where does a writeback message originate from and When an “invalidate” message appears in a large mul-
where does it go to? tiprocessor, every CPU must give an “invalidate ac-
knowledge” response. Wouldn’t the resulting “storm” of
Answer:
“invalidate acknowledge” responses totally saturate the
The writeback message originates from a given CPU, or
system bus?
in some designs from a given level of a given CPU’s


Answer: p.434
Quick Quiz C.7:
It might, if large-scale multiprocessors were in fact im-
But then why do uniprocessors also have store buffers?
plemented that way. Larger multiprocessors, particularly
NUMA machines, tend to use so-called “directory-based”
cache-coherence protocols to avoid this and other prob- Answer:
lems. q Because the purpose of store buffers is not just to
hide acknowledgement latencies in multiprocessor cache-
Quick Quiz C.4: p.432 coherence protocols, but to hide memory latencies in
If SMP machines are really using message passing general. Because memory is much slower than is cache
anyway, why bother with SMP at all? on uniprocessors, store buffers on uniprocessors can help
to hide write-miss memory latencies. q
Answer:
There has been quite a bit of controversy on this topic Quick Quiz C.8: p.434
over the past few decades. One answer is that the cache- So store-buffer entries are variable length? Isn’t that
coherence protocols are quite simple, and therefore can difficult to implement in hardware?
be implemented directly in hardware, gaining bandwidths
and latencies unattainable by software message passing. Answer:
Another answer is that the real truth is to be found in Here are two ways for hardware to easily handle variable-
economics due to the relative prices of large SMP machines length stores.
and that of clusters of smaller SMP machines. A third First, each store-buffer entry could be a single byte wide.
answer is that the SMP programming model is easier to Then an 64-bit store would consume eight store-buffer
use than that of distributed systems, but a rebuttal might entries. This approach is simple and flexible, but one
note the appearance of HPC clusters and MPI. And so the disadvantage is that each entry would need to replicate
argument continues. q much of the address that was stored to.
Second, each store-buffer entry could be double the
Quick Quiz C.5: p.433 size of a cache line, with half of the bits containing the
How does the hardware handle the delayed transitions values stored, and the other half indicating which bits
described above? had been stored to. So, assuming a 32-bit cache line,
a single-byte store of 0x5a to the low-order byte of a
Answer: given cache line would result in 0xXXXXXX5a for the
Usually by adding additional states, though these addi- first half and 0x000000ff for the second half, where
tional states need not be actually stored with the cache the values labeled X are arbitrary because they would
line, due to the fact that only a few lines at a time will be ignored. This approach allows multiple consecutive
be transitioning. The need to delay transitions is but one stores corresponding to a given cache line to be merged
issue that results in real-world cache coherence protocols into a single store-buffer entry, but is space-inefficient for
being much more complex than the over-simplified MESI random stores of single bytes.
protocol described in this appendix. Hennessy and Patter- Much more complex and efficient schemes are of course
son’s classic introduction to computer architecture [HP95] used by actual hardware designers. q
covers many of these issues. q
Quick Quiz C.9: p.436
Quick Quiz C.6: p.433 In step 1 above, why does CPU 0 need to issue a “read
What sequence of operations would put the CPUs’ caches invalidate” rather than a simple “invalidate”? After all,
all back into the “invalid” state? foo() will overwrite the variable a in any case, so why
should it care about the old value of a?
Answer:
There is no such sequence, at least in absence of special Answer:
“flush my cache” instructions in the CPU’s instruction set. Because the cache line in question contains more data
Most CPUs do have such instructions. q than just the variable a. Issuing “invalidate” instead of the
needed “read invalidate” would cause that other data to be


lost, which would constitute a serious bug in the hardware. Answer:


q Suppose that memory barrier was omitted.
Keep in mind that CPUs are free to speculatively execute
p.436
later loads, which can have the effect of executing the
Quick Quiz C.10: assertion before the while loop completes. Furthermore,
In step 9 above, did bar() read a stale value from a, or compilers assume that only the currently executing thread
did its reads of b and a get reordered? is updating the variables, and this assumption allows the
compiler to hoist the load of a to precede the loop.
Answer:
In fact, some compilers would transform the loop to a
It could be either, depending on the hardware implemen-
branch around an infinite loop as follows:
tation. And it really does not matter which. After all, the
bar() function’s assert() cannot tell the difference! q 1 void foo(void)
2 {
3 a = 1;
Quick Quiz C.11: p.437 4 smp_mb();
After step 15 in Appendix C.3.3 on page 437, both CPUs 5 b = 1;
6 }
might drop the cache line containing the new value of 7
“b”. Wouldn’t that cause this new value to be lost? 8 void bar(void)
9 {
Answer: 10 if (b == 0)
It might, and that is why real hardware takes steps to 11 for (;;)
12 continue;
avoid this problem. A traditional approach, pointed out by 13 assert(a == 1);
Vasilevsky Alexander, is to write this cache line back to 14 }
main memory before marking the cache line as “shared”.
A more efficient (though more complex) approach is to use
Given this optimization, the code would behave in a
additional state to indicate whether or not the cache line
completely different way than the original code. If bar()
is “dirty”, allowing the writeback to happen. Year-2000
observed “b == 0”, the assertion could of course not
systems went further, using much more state in order
be reached at all due to the infinite loop. However, if
to avoid redundant writebacks [CSG99, Figure 8.42]. It
bar() loaded the value “1” just as “foo()” stored it,
would be reasonable to assume that complexity has not
the CPU might still have the old zero value of “a” in its
decreased in the meantime. q
cache, which would cause the assertion to fire. You should
of course use volatile casts (for example, those volatile
Quick Quiz C.12: p.439 casts implied by the C11 relaxed atomic load operation)
In step 1 of the first scenario in Appendix C.4.3, why to prevent the compiler from optimizing your parallel
is an “invalidate” sent instead of a ”read invalidate” code into oblivion. But volatile casts would not prevent
message? Doesn’t CPU 0 need the values of the other a weakly ordered CPU from loading the old value for “a”
variables that share this cache line with “a”? from its cache, which means that this code also requires
the explicit memory barrier in “bar()”.
Answer: In short, both compilers and CPUs aggressively apply
CPU 0 already has the values of these variables, given that code-reordering optimizations, so you must clearly com-
it has a read-only copy of the cache line containing “a”. municate your constraints using the compiler directives
Therefore, all CPU 0 need do is to cause the other CPUs and memory barriers provided for this purpose. q
to discard their copies of this cache line. An “invalidate”
message therefore suffices. q p.440
Quick Quiz C.14:
Instead of all of this marking of invalidation-queue
Quick Quiz C.13: p.439 entries and stalling of loads, why not simply force an
Say what??? Why do we need a memory barrier immediate flush of the invalidation queue?
here, given that the CPU cannot possibly execute the
Answer:
assert() until after the while loop completes?
An immediate flush of the invalidation queue would do


the trick. Except that the common-case super-scalar CPU between CPU 1’s “while” and assignment to “c”? Why
is executing many instructions at once, and not necessarily or why not?
even in the expected order. So what would “immediate”
even mean? The answer is clearly “not much”. Answer:
Nevertheless, for simpler CPUs that execute instruc- No. Such a memory barrier would only force ordering
tions serially, flushing the invalidation queue might be a local to CPU 1. It would have no effect on the relative
reasonable implementation strategy. q ordering of CPU 0’s and CPU 1’s accesses, so the asser-
tion could still fail. However, all mainstream computer
Quick Quiz C.15: p.440 systems provide one mechanism or another to provide
But can’t full memory barriers impose global ordering? “transitivity”, which provides intuitive causal ordering: If
After all, isn’t that needed to provide the ordering shown B saw the effects of A’s accesses, and C saw the effects
in Listing 12.27? of B’s accesses, then C must also see the effects of A’s
accesses. In short, hardware designers have taken at least
Answer: a little pity on software developers. q
Sort of.
Note well that this litmus test has not one but two
Quick Quiz C.18: p.442
full memory-barrier instructions, namely the two sync
instructions executed by P2 and P3. Suppose that lines 3–5 for CPUs 1 and 2 in Listing C.3
It is the interaction of those two instructions that pro- are in an interrupt handler, and that the CPU 2’s line 9
vides the global ordering, not just their individual execu- runs at process level. In other words, the code in all
tion. For example, each of those two sync instructions three columns of the table runs on the same CPU, but
might stall waiting for all CPUs to process their invali- the first two columns run in an interrupt handler, and
dation queues before allowing subsequent instructions to the third column runs at process level, so that the code
execute.14 q in third column can be interrupted by the code in the
first two columns. What changes, if any, are required
p.441
to enable the code to work correctly, in other words, to
Quick Quiz C.16:
prevent the assertion from firing?
Does the guarantee that each CPU sees its own memory
accesses in order also guarantee that each user-level
Answer:
thread will see its own memory accesses in order? Why
The assertion must ensure that the load of “e” precedes
or why not?
that of “a”. In the Linux kernel, the barrier() primitive
Answer: may be used to accomplish this in much the same way
No. Consider the case where a thread migrates from one that the memory barrier was used in the assertions in the
CPU to another, and where the destination CPU perceives previous examples. For example, the assertion can be
the source CPU’s recent memory operations out of order. modified as follows:
To preserve user-mode sanity, kernel hackers must use
memory barriers in the context-switch path. However, r1 = e;
barrier();
the locking already required to safely do a context switch
assert(r1 == 0 || a == 1);
should automatically provide the memory barriers needed
to cause the user-level task to see its own accesses in
order. That said, if you are designing a super-optimized No changes are needed to the code in the first two
scheduler, either in the kernel or at user level, please keep columns, because interrupt handlers run atomically from
this scenario in mind! q the perspective of the interrupted code. q

Quick Quiz C.17: p.441


Quick Quiz C.19: p.442
Could this code be fixed by inserting a memory barrier
If CPU 2 executed an assert(e==0||c==1) in the
14 Real-life hardware of course applies many optimizations to mini-
example in Listing C.3, would this assert ever trigger?
mize the resulting stalls.


Answer:
The result depends on whether the CPU supports “transi-
tivity”. In other words, CPU 0 stored to “e” after seeing
CPU 1’s store to “c”, with a memory barrier between
CPU 0’s load from “c” and store to “e”. If some other
CPU sees CPU 0’s store to “e”, is it also guaranteed to
see CPU 1’s store?
All CPUs I am aware of claim to provide transitivity. q
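The transitivity property discussed in this answer can be sketched as a user-space analogue of the scenario (not the book's Listing C.3), using C11 atomics and seq_cst fences in place of kernel-style smp_mb(); the thread and variable names mirror the CPUs and the variables "c" and "e" from the answer. If cpu2() sees cpu0()'s store to e, the fences guarantee that it also sees cpu1()'s store to c, so the assertion cannot fire.

/* Hedged sketch of write-to-read causality (transitivity) with C11 atomics. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int c, e;

static void *cpu1(void *arg)	/* Stores to c. */
{
	atomic_store_explicit(&c, 1, memory_order_relaxed);
	return NULL;
}

static void *cpu0(void *arg)	/* Sees c, then stores to e. */
{
	while (!atomic_load_explicit(&c, memory_order_relaxed))
		continue;
	atomic_thread_fence(memory_order_seq_cst);	/* "Full barrier". */
	atomic_store_explicit(&e, 1, memory_order_relaxed);
	return NULL;
}

static void *cpu2(void *arg)	/* Sees e; transitivity implies it sees c. */
{
	while (!atomic_load_explicit(&e, memory_order_relaxed))
		continue;
	atomic_thread_fence(memory_order_seq_cst);	/* "Full barrier". */
	assert(atomic_load_explicit(&c, memory_order_relaxed) == 1);
	return NULL;
}

int main(void)
{
	pthread_t t[3];

	pthread_create(&t[0], NULL, cpu1, NULL);
	pthread_create(&t[1], NULL, cpu0, NULL);
	pthread_create(&t[2], NULL, cpu2, NULL);
	for (int i = 0; i < 3; i++)
		pthread_join(t[i], NULL);
	return 0;
}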

Glossary

Dictionaries are inherently circular in nature.

"Self Reference in word definitions", David Levary et al.

Acquire Load: A read from memory that has acquire semantics. Normal use cases pair an acquire load with a release store, in which case if the load returns the value stored, then all code executed by the loading CPU after that acquire load will see the effects of all memory-reference instructions executed by the storing CPU prior to that release store. Acquiring a lock provides similar memory-ordering semantics, hence the “acquire” in “acquire load”. (See also “memory barrier” and “release store”.)

Amdahl's Law: If sufficient numbers of CPUs are used to run a job that has both a sequential portion and a concurrent portion, performance and scalability will be limited by the overhead of the sequential portion.

Associativity: The number of cache lines that can be held simultaneously in a given cache, when all of these cache lines hash identically in that cache. A cache that could hold four cache lines for each possible hash value would be termed a “four-way set-associative” cache, while a cache that could hold only one cache line for each possible hash value would be termed a “direct-mapped” cache. A cache whose associativity was equal to its capacity would be termed a “fully associative” cache. Fully associative caches have the advantage of eliminating associativity misses, but, due to hardware limitations, fully associative caches are normally quite limited in size. The associativity of the large caches found on modern microprocessors typically ranges from two-way to eight-way.

Associativity Miss: A cache miss incurred because the corresponding CPU has recently accessed more data hashing to a given set of the cache than will fit in that set. Fully associative caches are not subject to associativity misses (or, equivalently, in fully associative caches, associativity and capacity misses are identical).

Atomic: An operation is considered “atomic” if it is not possible to observe any intermediate state. For example, on most CPUs, a store to a properly aligned pointer is atomic, because other CPUs will see either the old value or the new value, but are guaranteed not to see some mixed value containing some pieces of the new and old values.

Atomic Read-Modify-Write Operation: An atomic operation that both reads and writes memory is considered an atomic read-modify-write operation, or atomic RMW operation for short. Although the value written usually depends on the value read, atomic_xchg() is the exception that proves this rule.

Bounded Wait Free: A forward-progress guarantee in which every thread makes progress within a specific finite period of time, the specific time being the bound.

Bounded Population-Oblivious Wait Free: A forward-progress guarantee in which every thread makes progress within a specific finite period of time, the specific time being the bound, where this bound is independent of the number of threads.

Cache: In modern computer systems, CPUs have caches in which to hold frequently used data. These caches can be thought of as hardware hash tables with very simple hash functions, but in which each hash bucket (termed a “set” by hardware types) can hold only a limited number of data items. The number of data items that can be held by each of a cache's hash buckets is termed the cache's “associativity”. These data items are normally called “cache lines”, which can be thought of as fixed-length blocks of data that circulate among the CPUs and memory.

Cache Coherence: A property of most modern SMP machines where all CPUs will observe a sequence of values for a given variable that is consistent with at least one global order of values for that variable. Cache coherence also guarantees that at the end of a group of stores to a given variable, all CPUs will agree on the final value for that variable. Note that cache coherence applies only to the series of values taken on by a single variable. In contrast, the memory consistency model for a given machine describes the order in which loads and stores to groups of variables will appear to occur. See Section 15.2.6 for more information.
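
The “Acquire Load” entry above (together with the matching “Release Store” entry later in this glossary) describes a pairing that can be sketched in a few lines. The sketch below is illustrative only: it uses C11 atomics, whereas the Linux kernel would use smp_load_acquire() and smp_store_release(); the producer/consumer names and the value 42 are arbitrary.

	/* Sketch: an acquire load paired with a release store. */
	#include <assert.h>
	#include <pthread.h>
	#include <stdatomic.h>

	static int payload;	/* Ordinary data published via the flag below. */
	static atomic_int ready;

	static void *producer(void *arg)
	{
		payload = 42;	/* Plain store... */
		atomic_store_explicit(&ready, 1, memory_order_release); /* ...published by a release store. */
		return NULL;
	}

	static void *consumer(void *arg)
	{
		if (atomic_load_explicit(&ready, memory_order_acquire)) /* Acquire load. */
			assert(payload == 42);	/* Guaranteed whenever the load returned 1. */
		return NULL;
	}

	int main(void)
	{
		pthread_t p, c;

		pthread_create(&p, NULL, producer, NULL);
		pthread_create(&c, NULL, consumer, NULL);
		pthread_join(p, NULL);
		pthread_join(c, NULL);
		return 0;
	}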

Cache-Coherence Protocol: A communications protocol, normally implemented in hardware, that enforces memory consistency and ordering, preventing different CPUs from seeing inconsistent views of data held in their caches.

Cache Geometry: The size and associativity of a cache is termed its geometry. Each cache may be thought of as a two-dimensional array, with rows of cache lines (“sets”) that have the same hash value, and columns of cache lines (“ways”) in which every cache line has a different hash value. The associativity of a given cache is its number of columns (hence the name “way”: a two-way set-associative cache has two “ways”), and the size of the cache is its number of rows multiplied by its number of columns.

Cache Line: (1) The unit of data that circulates among the CPUs and memory, usually a moderate power of two in size. Typical cache-line sizes range from 16 to 256 bytes.
(2) A physical location in a CPU cache capable of holding one cache-line unit of data.
(3) A physical location in memory capable of holding one cache-line unit of data, but that is also aligned on a cache-line boundary. For example, the address of the first word of a cache line in memory will end in 0x00 on systems with 256-byte cache lines.

Cache Miss: A cache miss occurs when data needed by the CPU is not in that CPU's cache. The data might be missing because of a number of reasons, including: (1) This CPU has never accessed the data before (“startup” or “warmup” miss), (2) This CPU has recently accessed more data than would fit in its cache, so that some of the older data had to be removed (“capacity” miss), (3) This CPU has recently accessed more data in a given set1 than that set could hold (“associativity” miss), (4) Some other CPU has written to the data (or some other data in the same cache line) since this CPU has accessed it (“communication miss”), or (5) This CPU attempted to write to a cache line that is currently read-only, possibly due to that line being replicated in other CPUs' caches.

1 In hardware-cache terminology, the word “set” is used in the same way that the word “bucket” is used when discussing software caches.

Capacity Miss: A cache miss incurred because the corresponding CPU has recently accessed more data than will fit into the cache.

CAS: Compare-and-swap operation, which is an atomic operation that takes a pointer, an old value, and a new value. If the pointed-to value is equal to the old value, it is atomically replaced with the new value. There is some variety in the CAS API. One variation returns the actual pointed-to value, so that the caller compares the CAS return value to the specified old value, with equality indicating a successful CAS operation. Another variation returns a boolean success indication, in which case a pointer to the old value may be passed in, and if so, the old value is updated in the CAS failure case.

Clash Free: A forward-progress guarantee in which, in the absence of contention, at least one thread makes progress within a finite period of time.

Code Locking: A simple locking design in which a “global lock” is used to protect a set of critical sections, so that access by a given thread to that set is granted or denied based only on the set of threads currently occupying the set of critical sections, not based on what data the thread intends to access. The scalability of a code-locked program is limited by the code; increasing the size of the data set will normally not increase scalability (in fact, will typically decrease scalability by increasing “lock contention”). Contrast with “data locking”.

Communication Miss: A cache miss incurred because some other CPU has written to the cache line since the last time this CPU accessed it.

Concurrent: In this book, a synonym of parallel. Please see Appendix A.6 on page 412 for a discussion of the recent distinction between these two terms.

Critical Section: A section of code guarded by some synchronization mechanism, so that its execution is constrained by that primitive. For example, if a set of critical sections are guarded by the same global lock, then only one of those critical sections may be executing at a given time. If a thread is executing in one such critical section, any other threads must wait until the first thread completes before executing any of the critical sections in the set.
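
The two API variations described in the “CAS” entry above can be seen side by side in the following sketch. The boolean-returning flavor is C11's own atomic_compare_exchange_strong(); the value-returning flavor is a hypothetical wrapper written here purely for illustration.

	/* Sketch of the two common compare-and-swap API flavors. */
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int x;

	/* Value-returning flavor: the return value equals "old" exactly when
	 * the CAS succeeded; otherwise it is the actual pointed-to value. */
	static int cas_val(atomic_int *p, int old, int new)
	{
		int expected = old;

		atomic_compare_exchange_strong(p, &expected, new);
		return expected;
	}

	int main(void)
	{
		atomic_store(&x, 5);

		/* Boolean-returning flavor: "expected" is updated on failure. */
		int expected = 5;
		if (atomic_compare_exchange_strong(&x, &expected, 6))
			printf("boolean CAS succeeded, x is now 6\n");

		/* Value-returning flavor: compare the return value to the old value. */
		if (cas_val(&x, 6, 7) == 6)
			printf("value-returning CAS succeeded, x is now 7\n");
		return 0;
	}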

Data Locking: A scalable locking design in which each instance of a given data structure has its own lock. If each thread is using a different instance of the data structure, then all of the threads may be executing in the set of critical sections simultaneously. Data locking has the advantage of automatically scaling to increasing numbers of CPUs as the number of instances of data grows. Contrast with “code locking”.

Data Race: A race condition in which several CPUs or threads access a variable concurrently, and in which at least one of those accesses is a store and at least one of those accesses is a plain access. It is important to note that while the presence of data races often indicates the presence of bugs, the absence of data races in no way implies the absence of bugs. (See “Plain access” and “Race condition”.)

Deadlock: A failure mode in which each of several threads is unable to make progress until some other thread makes progress. For example, if two threads acquire a pair of locks in opposite orders, deadlock can result. More information is provided in Section 7.1.1.

Deadlock Free: A forward-progress guarantee in which, in the absence of failures, at least one thread makes progress within a finite period of time.

Direct-Mapped Cache: A cache with only one way, so that it may hold only one cache line with a given hash value.

Efficiency: A measure of effectiveness normally expressed as a ratio of some metric actually achieved to some maximum value. The maximum value might be a theoretical maximum, but in parallel programming is often based on the corresponding measured single-threaded metric.

Embarrassingly Parallel: A problem or algorithm where adding threads does not significantly increase the overall cost of the computation, resulting in linear speedups as threads are added (assuming sufficient CPUs are available).

Energy Efficiency: Shorthand for “energy-efficient use”, in which the goal is to carry out a given computation with reduced energy consumption. Sublinear scalability can be an obstacle to energy-efficient use of a multicore system.

Epoch-Based Reclamation (EBR): An RCU implementation style put forward by Keir Fraser [Fra03, Fra04, FH07].

Existence Guarantee: An existence guarantee is provided by a synchronization mechanism that prevents a given dynamically allocated object from being freed for the duration of that guarantee. For example, RCU provides existence guarantees for the duration of RCU read-side critical sections. A similar but strictly weaker guarantee is provided by type-safe memory.

Exclusive Lock: An exclusive lock is a mutual-exclusion mechanism that permits only one thread at a time into the set of critical sections guarded by that lock.

False Sharing: If two CPUs each frequently write to one of a pair of data items, but the pair of data items are located in the same cache line, this cache line will be repeatedly invalidated, “ping-ponging” back and forth between the two CPUs' caches. This is a common cause of “cache thrashing”, also called “cacheline bouncing” (the latter most commonly in the Linux community). False sharing can dramatically reduce both performance and scalability.

Forward-Progress Guarantee: Algorithms or programs that guarantee that execution will progress at some rate under specified conditions. Academic forward-progress guarantees are grouped into a formal hierarchy shown in Section 14.2. A wide variety of practical forward-progress guarantees are provided by real-time systems, as discussed in Section 14.3.

Fragmentation: A memory pool that has a large amount of unused memory, but that is not laid out to permit satisfying a relatively small request, is said to be fragmented. External fragmentation occurs when the space is divided up into small fragments lying between allocated blocks of memory, while internal fragmentation occurs when specific requests or types of requests have been allotted more memory than they actually requested.

Fully Associative Cache: A fully associative cache contains only one set, so that it can hold any subset of memory that fits within its capacity.
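
The contrast drawn in the “Code Locking” and “Data Locking” entries can be made concrete with a short pthreads sketch. The struct and function names are hypothetical, chosen only for illustration.

	/* Sketch: code locking vs. data locking. */
	#include <pthread.h>

	struct counter {
		pthread_mutex_t lock;	/* Data locking: one lock per instance. */
		unsigned long count;
	};

	/* Code locking: a single global lock serializes every call, no matter
	 * which counter instance is involved. */
	pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

	void inc_code_locked(struct counter *c)
	{
		pthread_mutex_lock(&global_lock);
		c->count++;
		pthread_mutex_unlock(&global_lock);
	}

	/* Data locking: threads operating on different instances proceed in
	 * parallel, so adding instances adds scalability. */
	void inc_data_locked(struct counter *c)
	{
		pthread_mutex_lock(&c->lock);
		c->count++;
		pthread_mutex_unlock(&c->lock);
	}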

Grace Period: A grace period is any contiguous time interval such that any RCU read-side critical section that began before the start of that interval has completed before the end of that same interval. Many RCU implementations define a grace period to be a time interval during which each thread has passed through at least one quiescent state. Since RCU read-side critical sections by definition cannot contain quiescent states, these two definitions are almost always interchangeable.

Hardware Transactional Memory (HTM): A transactional-memory system based on hardware instructions provided for this purpose, as discussed in Section 17.3. (See “Transactional memory”.)

Hazard Pointer: A scalable counterpart to a reference counter in which an object's reference count is represented implicitly by a count of the number of special hazard pointers referencing that object.

Heisenbug: A timing-sensitive bug that disappears from sight when you add print statements or tracing in an attempt to track it down.

Hot Spot: A data structure that is very heavily used, resulting in high levels of contention on the corresponding lock. One example of this situation would be a hash table with a poorly chosen hash function.

Humiliatingly Parallel: A problem or algorithm where adding threads significantly decreases the overall cost of the computation, resulting in large superlinear speedups as threads are added (assuming sufficient CPUs are available).

Immutable: In this book, a synonym for read-mostly.

Invalidation: When a CPU wishes to write to a data item, it must first ensure that this data item is not present in any other CPUs' cache. If necessary, the item is removed from the other CPUs' caches via “invalidation” messages from the writing CPUs to any CPUs having a copy in their caches.

IPI: Inter-processor interrupt, which is an interrupt sent from one CPU to another. IPIs are used heavily in the Linux kernel, for example, within the scheduler to alert CPUs that a high-priority process is now runnable.

IRQ: Interrupt request, often used as an abbreviation for “interrupt” within the Linux kernel community, as in “irq handler”.

Latency: The wall-clock time required for a given operation to complete.

Linearizable: A sequence of operations is “linearizable” if there is at least one global ordering of the sequence that is consistent with the observations of all CPUs and/or threads. Linearizability is much prized by many researchers, but less useful in practice than one might expect [HKLP12].

Livelock: A failure mode in which each of several threads is able to execute, but in which a repeating series of failed operations prevents any of the threads from making any useful forward progress. For example, incorrect use of conditional locking (for example, spin_trylock() in the Linux kernel) can result in livelock. More information is provided in Section 7.1.2.

Lock: A software abstraction that can be used to guard critical sections and is, as such, an example of a “mutual exclusion mechanism”. An “exclusive lock” permits only one thread at a time into the set of critical sections guarded by that lock, while a “reader-writer lock” permits any number of reading threads, or but one writing thread, into the set of critical sections guarded by that lock. (Just to be clear, the presence of a writer thread in any of a given reader-writer lock's critical sections will prevent any reader from entering any of that lock's critical sections and vice versa.)

Lock Contention: A lock is said to be suffering contention when it is being used so heavily that there is often a CPU waiting on it. Reducing lock contention is often a concern when designing parallel algorithms and when implementing parallel programs.

Lock Free: A forward-progress guarantee in which at least one thread makes progress within a finite period of time.

Marked Access: A source-code memory access that uses a special function or macro, such as READ_ONCE(), WRITE_ONCE(), atomic_inc(), and so on, in order to protect that access from compiler and/or hardware optimizations. In contrast, a plain access simply mentions the name of the object being accessed, so that in the following, line 2 is the plain-access equivalent of line 1:

	1 WRITE_ONCE(a, READ_ONCE(b) + READ_ONCE(c));
	2 a = b + c;
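
The “Livelock” entry's warning about conditional locking is easy to reproduce. In the pthreads-based sketch below (the lock and function names are hypothetical), two threads call do_both() with the lock arguments in opposite orders; each keeps executing, yet under sufficiently unlucky timing neither ever completes.

	/* Sketch: a livelock-prone conditional-locking retry loop. */
	#include <pthread.h>

	pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
	pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

	/* Thread 1 calls do_both(&lock_a, &lock_b); thread 2 calls it with the
	 * arguments reversed.  An iteration that loses the race drops its first
	 * lock and retries, so both threads stay busy while making no forward
	 * progress: livelock rather than deadlock. */
	void do_both(pthread_mutex_t *first, pthread_mutex_t *second)
	{
		for (;;) {
			pthread_mutex_lock(first);
			if (pthread_mutex_trylock(second) == 0) {
				/* ... critical section using both locks ... */
				pthread_mutex_unlock(second);
				pthread_mutex_unlock(first);
				return;
			}
			pthread_mutex_unlock(first);	/* Back off and retry. */
		}
	}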

Memory: From the viewpoint of memory models, the main memory, caches, and store buffers in which values might be stored. However, this term is often used to denote the main memory itself, excluding caches and store buffers.

Memory Barrier: A compiler directive that might also include a special memory-barrier instruction. The purpose of a memory barrier is to order memory-reference instructions executed before the memory barrier so that they precede those that will execute following that memory barrier. (See also “read memory barrier” and “write memory barrier”.)

Memory Consistency: A set of properties that impose constraints on the order in which accesses to groups of variables appear to occur. Memory-consistency models range from sequential consistency, a very constraining model popular in academic circles, through process consistency, release consistency, and weak consistency.

MESI Protocol: The cache-coherence protocol featuring modified, exclusive, shared, and invalid (MESI) states, so that this protocol is named after the states that the cache lines in a given cache can take on. A modified line has been recently written to by this CPU, and is the sole representative of the current value of the corresponding memory location. An exclusive cache line has not been written to, but this CPU has the right to write to it at any time, as the line is guaranteed not to be replicated into any other CPU's cache (though the corresponding location in main memory is up to date). A shared cache line is (or might be) replicated in some other CPUs' cache, meaning that this CPU must interact with those other CPUs before writing to this cache line. An invalid cache line contains no value, instead representing “empty space” in the cache into which data from memory might be loaded.

Moore's Law: A 1965 empirical projection by Gordon Moore that transistor density increases exponentially over time [Moo65].

Mutual-Exclusion Mechanism: A software abstraction that regulates threads' access to “critical sections” and corresponding data.

NMI: Non-maskable interrupt. As the name indicates, this is an extremely high-priority interrupt that cannot be masked. These are used for hardware-specific purposes such as profiling. The advantage of using NMIs for profiling is that it allows you to profile code that runs with interrupts disabled.

Non-Blocking: A group of academic forward-progress guarantees that includes bounded population-oblivious wait free, bounded wait free, wait free, lock free, obstruction free, clash free, starvation free, and deadlock free. See Section 14.2 for more information.

Non-Blocking Synchronization (NBS): The use of algorithms, mechanisms, or techniques that provide non-blocking forward-progress guarantees. NBS is often used in a more restrictive sense of providing one of the stronger forward-progress guarantees, usually wait free or lock free, but sometimes also obstruction free. (See “Non-blocking”.)

NUCA: Non-uniform cache architecture, where groups of CPUs share caches and/or store buffers. CPUs in a group can therefore exchange cache lines with each other much more quickly than they can with CPUs in other groups. Systems comprised of CPUs with hardware threads will generally have a NUCA architecture.

NUMA: Non-uniform memory architecture, where memory is split into banks and each such bank is “close” to a group of CPUs, the group being termed a “NUMA node”. An example NUMA machine is Sequent's NUMA-Q system, where each group of four CPUs had a bank of memory nearby. The CPUs in a given group can access their memory much more quickly than another group's memory.

NUMA Node: A group of closely placed CPUs and associated memory within a larger NUMA machine.

Obstruction Free: A forward-progress guarantee in which, in the absence of contention, every thread makes progress within a finite period of time.

Overhead: Operations that must be executed, but which do not contribute directly to the work that must be accomplished. For example, lock acquisition and release is normally considered to be overhead, and specifically to be synchronization overhead.

Parallel: In this book, a synonym of concurrent. Please see Appendix A.6 on page 412 for a discussion of the recent distinction between these two terms.
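
To give the non-blocking vocabulary above (“Lock Free”, “Non-Blocking”, and the related forward-progress entries) a concrete shape, here is a minimal lock-free update sketch using C11 atomics; the counter name is arbitrary. In production code a plain atomic_fetch_add() would do the same job; the explicit retry loop is shown only to make the forward-progress argument visible.

	/* Sketch: a lock-free counter update based on a CAS retry loop. */
	#include <stdatomic.h>

	static atomic_ulong counter;

	/* Whenever two updates conflict, at least one CAS succeeds, so some
	 * thread always makes progress: lock free, though not wait free. */
	void lock_free_add(unsigned long delta)
	{
		unsigned long old = atomic_load_explicit(&counter, memory_order_relaxed);

		while (!atomic_compare_exchange_weak(&counter, &old, old + delta))
			continue;	/* "old" was refreshed by the failed CAS; retry. */
	}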

Performance: Rate at which work is done, expressed as work per unit time. If this work is fully serialized, then the performance will be the reciprocal of the mean latency of the work items.

Pipelined CPU: A CPU with a pipeline, which is an internal flow of instructions that is in some way similar to an assembly line, with many of the same advantages and disadvantages. In the 1960s through the early 1980s, pipelined CPUs were the province of supercomputers, but started appearing in microprocessors (such as the 80486) in the late 1980s.

Plain Access: A source-code memory access that simply mentions the name of the object being accessed. (See “Marked access”.)

Process Consistency: A memory-consistency model in which each CPU's stores appear to occur in program order, but in which different CPUs might see accesses from more than one CPU as occurring in different orders.

Program Order: The order in which a given thread's instructions would be executed by a now-mythical “in-order” CPU that completely executed each instruction before proceeding to the next instruction. (The reason such CPUs are now the stuff of ancient myths and legends is that they were extremely slow. These dinosaurs were one of the many victims of Moore's-Law-driven increases in CPU clock frequency. Some claim that these beasts will roam the earth once again, others vehemently disagree.)

Quiescent State: In RCU, a point in the code where there can be no references held to RCU-protected data structures, which is normally any point outside of an RCU read-side critical section. Any interval of time during which all threads pass through at least one quiescent state each is termed a “grace period”.

Quiescent-State-Based Reclamation (QSBR): An RCU implementation style characterized by explicit quiescent states. In QSBR implementations, read-side markers (rcu_read_lock() and rcu_read_unlock() in the Linux kernel) are no-ops [MS98a, SM95]. Hooks in other parts of the software (for example, the Linux-kernel scheduler) provide the quiescent states.

Race Condition: Any situation where multiple CPUs or threads can interact, though this term is often used in cases where such interaction is undesirable. (See “Data race”.)

RCU-Protected Data: A block of dynamically allocated memory whose freeing will be deferred such that an RCU grace period will elapse between the time that there were no longer any RCU-reader-accessible pointers to that block and the time that that block is freed. This ensures that no RCU readers will have access to that block at the time that it is freed.

RCU-Protected Pointer: A pointer to RCU-protected data. Such pointers must be handled carefully, for example, any reader that intends to dereference an RCU-protected pointer must use rcu_dereference() (or stronger) to load that pointer, and any updater must use rcu_assign_pointer() (or stronger) to store to that pointer. More information is provided in Section 15.3.2.

RCU Read-Side Critical Section: A section of code protected by RCU, for example, beginning with rcu_read_lock() and ending with rcu_read_unlock(). (See “Read-side critical section”.)

Read-Copy Update (RCU): A synchronization mechanism that can be thought of as a replacement for reader-writer locking or reference counting. RCU provides extremely low-overhead access for readers, while writers incur additional overhead maintaining old versions for the benefit of pre-existing readers. Readers neither block nor spin, and thus cannot participate in deadlocks; however, they also can see stale data and can run concurrently with updates. RCU is thus best-suited for read-mostly situations where stale data can either be tolerated (as in routing tables) or avoided (as in the Linux kernel's System V IPC implementation).

Read Memory Barrier: A memory barrier that is only guaranteed to affect the ordering of load instructions, that is, reads from memory. (See also “memory barrier” and “write memory barrier”.)

Read Mostly: Read-mostly data is (again, as the name implies) rarely updated. However, it might be updated at any time.

Read Only: Read-only data is, as the name implies, never updated except by beginning-of-time initialization. In this book, a synonym for immutable.
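
The RCU-related entries above fit together as in the following kernel-style sketch. It assumes an environment that supplies rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() (for example, the Linux kernel or liburcu), and it further assumes that updaters are serialized by some lock that is not shown; the struct and function names are hypothetical.

	/* Sketch: an RCU-protected pointer to dynamically allocated data. */
	#include <stdlib.h>

	struct foo {
		int data;
	};
	struct foo *global_foo;		/* RCU-protected pointer. */

	int reader(void)		/* RCU read-side critical section. */
	{
		struct foo *p;
		int d;

		rcu_read_lock();
		p = rcu_dereference(global_foo);
		d = p ? p->data : -1;
		rcu_read_unlock();
		return d;
	}

	void updater(int new_data)	/* Publish a new version, then reclaim the old. */
	{
		struct foo *newp = malloc(sizeof(*newp));
		struct foo *oldp = global_foo;

		newp->data = new_data;
		rcu_assign_pointer(global_foo, newp);	/* Readers now see the new version. */
		synchronize_rcu();	/* Wait for a grace period: pre-existing readers are done. */
		free(oldp);		/* Safe: no reader can still hold a reference. */
	}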

Read-Side Critical Section: A section of code guarded by read-acquisition of some reader-writer synchronization mechanism. For example, if one set of critical sections are guarded by read-acquisition of a given global reader-writer lock, while a second set of critical sections are guarded by write-acquisition of that same reader-writer lock, then the first set of critical sections will be the read-side critical sections for that lock. Any number of threads may concurrently execute the read-side critical sections, but only if no thread is executing one of the write-side critical sections. (See also “RCU read-side critical section”.)

Reader-Writer Lock: A reader-writer lock is a mutual-exclusion mechanism that permits any number of reading threads, or but one writing thread, into the set of critical sections guarded by that lock. Threads attempting to write must wait until all pre-existing reading threads release the lock, and, similarly, if there is a pre-existing writer, any threads attempting to write must wait for the writer to release the lock. A key concern for reader-writer locks is “fairness”: Can an unending stream of readers starve a writer or vice versa?

Real Time: A situation in which getting the correct result is not sufficient, but where this result must also be obtained within a given amount of time.

Reference Count: A counter that tracks the number of users of a given object or entity. Reference counters provide existence guarantees and are sometimes used to implement garbage collectors.

Release Store: A write to memory that has release semantics. Normal use cases pair an acquire load with a release store, in which case if the load returns the value stored, then all code executed by the loading CPU after that acquire load will see the effects of all memory-reference instructions executed by the storing CPU prior to that release store. Releasing a lock provides similar memory-ordering semantics, hence the “release” in “release store”. (See also “acquire load” and “memory barrier”.)

Scalability: A measure of how effectively a given system is able to utilize additional resources. For parallel computing, the additional resources are usually additional CPUs.

Sequence Lock: A reader-writer synchronization mechanism in which readers retry their operations if a writer was present.

Sequential Consistency: A memory-consistency model where all memory references appear to occur in an order consistent with a single global order, and where each CPU's memory references appear to all CPUs to occur in program order.

Software Transactional Memory (STM): A transactional-memory system capable of running on computer systems without special hardware support. (See “Transactional memory”.)

Starvation: A condition where at least one CPU or thread is unable to make progress due to an unfortunate series of resource-allocation decisions, as discussed in Section 7.1.2. For example, in a multisocket system, CPUs on one socket having privileged access to the data structure implementing a given lock could prevent CPUs on other sockets from ever acquiring that lock.

Starvation Free: A forward-progress guarantee in which, in the absence of failures, every thread makes progress within a finite period of time.

Store Buffer: A small set of internal registers used by a given CPU to record pending stores while the corresponding cache lines are making their way to that CPU. Also called “store queue”.

Store Forwarding: An arrangement where a given CPU refers to its store buffer as well as its cache so as to ensure that the software sees the memory operations performed by this CPU as if they were carried out in program order.

Superscalar CPU: A scalar (non-vector) CPU capable of executing multiple instructions concurrently. This is a step up from a pipelined CPU that executes multiple instructions in an assembly-line fashion; in a superscalar CPU, each stage of the pipeline would be capable of handling more than one instruction. For example, if the conditions were exactly right, the Intel Pentium Pro CPU from the mid-1990s could execute two (and sometimes three) instructions per clock cycle. Thus, a 200 MHz Pentium Pro CPU could “retire”, or complete the execution of, up to 400 million instructions per second.
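
The retry-based reader described in the “Sequence Lock” entry looks as follows in Linux-kernel style. This is a sketch that assumes the kernel's seqlock API (read_seqbegin(), read_seqretry(), write_seqlock(), and write_sequnlock()) and a seqlock initialized elsewhere; the variable names are hypothetical.

	/* Kernel-style sketch: readers retry if a writer was present. */
	seqlock_t foo_seqlock;	/* Assume DEFINE_SEQLOCK() or seqlock_init(). */
	int foo_x, foo_y;

	void foo_read(int *x, int *y)
	{
		unsigned int seq;

		do {
			seq = read_seqbegin(&foo_seqlock);
			*x = foo_x;	/* Speculative reads... */
			*y = foo_y;
		} while (read_seqretry(&foo_seqlock, seq));	/* ...retried if a writer intervened. */
	}

	void foo_write(int x, int y)
	{
		write_seqlock(&foo_seqlock);
		foo_x = x;
		foo_y = y;
		write_sequnlock(&foo_seqlock);
	}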

Synchronization: Means for avoiding destructive interactions among CPUs or threads. Synchronization mechanisms include atomic RMW operations, memory barriers, locking, reference counting, hazard pointers, sequence locking, RCU, non-blocking synchronization, and transactional memory.

Teachable: A topic, concept, method, or mechanism that teachers believe that they understand completely and are therefore comfortable teaching.

Throughput: A performance metric featuring work items completed per unit time.

Transactional Lock Elision (TLE): The use of transactional memory to emulate locking. Synchronization is instead carried out by conflicting accesses to the data to be protected by the lock. In some cases, this can increase performance because TLE avoids contention on the lock word [PD11, Kle14, FIMR16, PMDY20].

Transactional Memory (TM): A synchronization mechanism that gathers groups of memory accesses so as to execute them atomically from the viewpoint of transactions on other CPUs or threads, discussed in Sections 17.2 and 17.3.

Type-Safe Memory: Type-safe memory [GC96] is provided by a synchronization mechanism that prevents a given dynamically allocated object from changing to an incompatible type. Note that the object might well be freed and then reallocated, but the reallocated object is guaranteed to be of a compatible type. Within the Linux kernel, type-safe memory is provided within RCU read-side critical sections for memory allocated from slabs marked with the SLAB_TYPESAFE_BY_RCU flag. The strictly stronger existence guarantee also prevents freeing of the protected object.

Unbounded Transactional Memory (UTM): A transactional-memory system based on hardware instructions provided for this purpose, but with special hardware or software capabilities that allow a given transaction to have a very large memory footprint. Such a system would at least partially avoid HTM's transaction-size limitations called out in Section 17.3.2.1. (See “Hardware transactional memory”.)

Unfairness: A condition where the progress of at least one CPU or thread is impeded by an unfortunate series of resource-allocation decisions, as discussed in Section 7.1.2. Extreme levels of unfairness are termed “starvation”.

Unteachable: A topic, concept, method, or mechanism that the teacher does not understand well and is therefore uncomfortable teaching.

Vector CPU: A CPU that can apply a single instruction to multiple items of data concurrently. In the 1960s through the 1980s, only supercomputers had vector capabilities, but the advent of MMX in x86 CPUs and VMX in PowerPC CPUs brought vector processing to the masses.

Wait Free: A forward-progress guarantee in which every thread makes progress within a finite period of time.

Write Memory Barrier: A memory barrier that is only guaranteed to affect the ordering of store instructions, that is, writes to memory. (See also “memory barrier” and “read memory barrier”.)

Write Miss: A cache miss incurred because the corresponding CPU attempted to write to a cache line that is read-only, most likely due to its being replicated in other CPUs' caches.

Write Mostly: Write-mostly data is (yet again, as the name implies) frequently updated.

Write-Side Critical Section: A section of code guarded by write-acquisition of some reader-writer synchronization mechanism. For example, if one set of critical sections are guarded by write-acquisition of a given global reader-writer lock, while a second set of critical sections are guarded by read-acquisition of that same reader-writer lock, then the first set of critical sections will be the write-side critical sections for that lock. Only one thread may execute in the write-side critical section at a time, and even then only if there are no threads executing concurrently in any of the corresponding read-side critical sections.
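
The read-side and write-side critical sections defined above take a concrete form with a POSIX reader-writer lock, as in the following sketch; the variable and function names are hypothetical.

	/* Sketch: read-side and write-side critical sections with pthreads. */
	#include <pthread.h>

	pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
	int table_size;

	int read_size(void)	/* Read-side critical section: many may run at once. */
	{
		int size;

		pthread_rwlock_rdlock(&table_lock);
		size = table_size;
		pthread_rwlock_unlock(&table_lock);
		return size;
	}

	void set_size(int size)	/* Write-side critical section: one writer, no readers. */
	{
		pthread_rwlock_wrlock(&table_lock);
		table_size = size;
		pthread_rwlock_unlock(&table_lock);
	}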

Bibliography

[AA14] Maya Arbel and Hagit Attiya. Concurrent updates with RCU: Search tree as
an example. In Proceedings of the 2014 ACM Symposium on Principles of
Distributed Computing, PODC ’14, page 196–205, Paris, France, 2014. ACM.
[AAKL06] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, and Charles E.
Leiserson. Unbounded transactional memory. IEEE Micro, pages 59–69,
January-February 2006.
[AB13] Samy Al Bahra. Nonblocking algorithms and scalable multicore programming.
Commun. ACM, 56(7):50–61, July 2013.
[ABD+ 97] Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat,
Monika R. Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T. Vande-
voorde, Carl A. Waldspurger, and William E. Weihl. Continuous profiling:
Where have all the cycles gone? In Proceedings of the 16th ACM Symposium
on Operating Systems Principles, pages 1–14, New York, NY, October 1997.
[ACA+ 18] A. Aljuhni, C. E. Chow, A. Aljaedi, S. Yusuf, and F. Torres-Reyes. Towards
understanding application performance and system behavior with the full
dynticks feature. In 2018 IEEE 8th Annual Computing and Communication
Workshop and Conference (CCWC), pages 394–401, 2018.
[ACHS13] Dan Alistarh, Keren Censor-Hillel, and Nir Shavit. Are lock-free concurrent
algorithms practically wait-free?, December 2013. ArXiv:1311.3200v2.
[ACMS03] Andrea Arcangeli, Mingming Cao, Paul E. McKenney, and Dipankar Sarma.
Using read-copy update techniques for System V IPC in the Linux 2.5 kernel.
In Proceedings of the 2003 USENIX Annual Technical Conference (FREENIX
Track), pages 297–310, San Antonio, Texas, USA, June 2003. USENIX
Association.
[Ada11] Andrew Adamatzky. Slime mould solves maze in one pass . . . assisted by
gradient of chemo-attractants, August 2011. arXiv:1108.4956.
[ADF+ 19] Jade Alglave, Will Deacon, Boqun Feng, David Howells, Daniel Lustig, Luc
Maranget, Paul E. McKenney, Andrea Parri, Nicholas Piggin, Alan Stern,
Akira Yokosawa, and Peter Zijlstra. Who’s afraid of a big bad optimizing
compiler?, July 2019. Linux Weekly News.
[Adv02] Advanced Micro Devices. AMD x86-64 Architecture Programmer’s Manual
Volumes 1–5, 2002.
[AGH+ 11a] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M.
Michael, and Martin Vechev. Laws of order: Expensive synchronization in
concurrent algorithms cannot be eliminated. In 38th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 487–498, Austin, TX, USA, 2011. ACM.
[AGH+ 11b] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M.
Michael, and Martin Vechev. Laws of order: Expensive synchronization in
concurrent algorithms cannot be eliminated. SIGPLAN Not., 46(1):487–498,
January 2011.
[AHM09] Hagit Attiya, Eshcar Hillel, and Alessia Milani. Inherent limitations on disjoint-
access parallel implementations of transactional memory. In Proceedings of the
twenty-first annual symposium on Parallelism in algorithms and architectures,
SPAA ’09, pages 69–78, Calgary, AB, Canada, 2009. ACM.
[AHS+ 03] J. Appavoo, K. Hui, C. A. N. Soules, R. W. Wisniewski, D. M. Da Silva,
O. Krieger, M. A. Auslander, D. J. Edelsohn, B. Gamsa, G. R. Ganger,
P. McKenney, M. Ostrowski, B. Rosenburg, M. Stumm, and J. Xenidis.
Enabling autonomic behavior in systems software with hot swapping. IBM
Systems Journal, 42(1):60–76, January 2003.
[AKK+ 14] Dan Alistarh, Justin Kopinsky, Petr Kuznetsov, Srivatsan Ravi, and Nir Shavit.
Inherent limitations of hybrid transactional memory. CoRR, abs/1405.5689,
2014.
[AKNT13] Jade Alglave, Daniel Kroening, Vincent Nimal, and Michael Tautschnig.
Software verification for weak memory via program transformation. In
Proceedings of the 22nd European conference on Programming Languages
and Systems, ESOP’13, pages 512–532, Rome, Italy, 2013. Springer-Verlag.
[AKT13] Jade Alglave, Daniel Kroening, and Michael Tautschnig. Partial orders for
efficient Bounded Model Checking of concurrent software. In Computer
Aided Verification (CAV), volume 8044 of LNCS, pages 141–157. Springer,
2013.
[Ale79] Christopher Alexander. The Timeless Way of Building. Oxford University
Press, New York, 1979.
[Alg13] Jade Alglave. Weakness is a virtue. In (EC)2 2013: 6th International Workshop
on Exploiting Concurrency Efficiently and Correctly, page 3, 2013.
[AM15] Maya Arbel and Adam Morrison. Predicate RCU: An RCU for scalable
concurrent updates. SIGPLAN Not., 50(8):21–30, January 2015.
[Amd67] Gene Amdahl. Validity of the single processor approach to achieving large-
scale computing capabilities. In AFIPS Conference Proceedings, AFIPS ’67
(Spring), pages 483–485, Atlantic City, New Jersey, 1967. Association for
Computing Machinery.
[AMD20] AMD. Professional compute products - GPUOpen, March 2020. https:
//gpuopen.com/professional-compute/.
[AMM+ 17a] Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and Alan
Stern. A formal kernel memory-ordering model (part 1), April 2017. https:
//lwn.net/Articles/718628/.
[AMM+ 17b] Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and Alan
Stern. A formal kernel memory-ordering model (part 2), April 2017. https:
//lwn.net/Articles/720550/.

[AMM+ 18] Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and Alan Stern.
Frightening small children and disconcerting grown-ups: Concurrency in the
Linux kernel. In Proceedings of the Twenty-Third International Conference
on Architectural Support for Programming Languages and Operating Systems,
ASPLOS ’18, pages 405–418, Williamsburg, VA, USA, 2018. ACM.
[AMP+ 11] Jade Alglave, Luc Maranget, Pankaj Pawan, Susmit Sarkar, Peter Sewell, Derek
Williams, and Francesco Zappa Nardelli. PPCMEM/ARMMEM: A tool for
exploring the POWER and ARM memory models, June 2011. https://www.
cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf.
[AMT14] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats: Modelling,
simulation, testing, and data-mining for weak memory. In Proceedings of
the 35th ACM SIGPLAN Conference on Programming Language Design and
Implementation, PLDI ’14, pages 40–40, Edinburgh, United Kingdom, 2014.
ACM.
[And90] T. E. Anderson. The performance of spin lock alternatives for shared-memory
multiprocessors. IEEE Transactions on Parallel and Distributed Systems,
1(1):6–16, January 1990.
[And91] Gregory R. Andrews. Concurrent Programming, Principles, and Practices.
Benjamin Cummins, 1991.
[And19] Jim Anderson. Software transactional memory for real-time systems, August
2019. https://www.cs.unc.edu/~anderson/projects/rtstm.html.
[ARM10] ARM Limited. ARM Architecture Reference Manual: ARMv7-A and ARMv7-R
Edition, 2010.
[ARM17] ARM Limited. ARM Architecture Reference Manual (ARMv8, for ARMv8-A
architecture profile), 2017.
[Ash15] Mike Ash. Concurrent memory deallocation in the objective-c runtime, May
2015. mikeash.com: just this guy, you know?
[ATC+ 11] Ege Akpinar, Sasa Tomic, Adrian Cristal, Osman Unsal, and Mateo Valero. A
comprehensive study of conflict resolution policies in hardware transactional
memory. In TRANSACT 2011, New Orleans, LA, USA, June 2011. ACM
SIGPLAN.
[ATS09] Ali-Reza Adl-Tabatabai and Tatiana Shpeisman. Draft specification of transac-
tional language constructs for C++, August 2009. URL: https://software.
intel.com/sites/default/files/ee/47/21569 (may need to append
.pdf to view after download).
[Att10] Hagit Attiya. The inherent complexity of transactional memory and what to
do about it. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium
on Principles of Distributed Computing, PODC ’10, pages 1–5, Zurich,
Switzerland, 2010. ACM.
[BA01] Jeff Bonwick and Jonathan Adams. Magazines and vmem: Extending the slab
allocator to many CPUs and arbitrary resources. In USENIX Annual Technical
Conference, General Track 2001, pages 15–33, 2001.
[Bah11a] Samy Al Bahra. ck_epoch: Support per-object destructors, Oc-
tober 2011. https://github.com/concurrencykit/ck/commit/
10ffb2e6f1737a30e2dcf3862d105ad45fcd60a4.

[Bah11b] Samy Al Bahra. ck_hp.c, February 2011. Hazard pointers: https://github.com/concurrencykit/ck/blob/master/src/ck_hp.c.
[Bah11c] Samy Al Bahra. ck_sequence.h, February 2011. Sequence
locking: https://github.com/concurrencykit/ck/blob/master/
include/ck_sequence.h.
[Bas18] JF Bastien. P1152R0: Deprecating volatile, October 2018. http://www.
open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1152r0.html.
[BBC+ 10] Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem,
Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. A
few billion lines of code later: Using static analysis to find bugs in the real
world. Commun. ACM, 53(2):66–75, February 2010.
[BCR03] David F. Bacon, Perry Cheng, and V. T. Rajan. A real-time garbage collector
with low overhead and consistent utilization. SIGPLAN Not., 38(1):285–298,
2003.
[BD13] Paolo Bonzini and Mike Day. RCU implementation for Qemu, August
2013. https://lists.gnu.org/archive/html/qemu-devel/2013-
08/msg02055.html.
[BD14] Hans-J. Boehm and Brian Demsky. Outlawing ghosts: Avoiding out-of-thin-
air results. In Proceedings of the Workshop on Memory Systems Performance
and Correctness, MSPC ’14, pages 7:1–7:6, Edinburgh, United Kingdom,
2014. ACM.
[Bec11] Pete Becker. Working draft, standard for programming language C++,
February 2011. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2011/n3242.pdf.
[BG87] D. Bertsekas and R. Gallager. Data Networks. Prentice-Hall, Inc., 1987.
[BGHZ16] Oana Balmau, Rachid Guerraoui, Maurice Herlihy, and Igor Zablotchi. Fast
and robust memory reclamation for concurrent data structures. In Proceedings
of the 28th ACM Symposium on Parallelism in Algorithms and Architectures,
SPAA ’16, pages 349–359, Pacific Grove, California, USA, 2016. ACM.
[BGOS18] Sam Blackshear, Nikos Gorogiannis, Peter W. O’Hearn, and Ilya Sergey.
Racerd: Compositional static race detection. Proc. ACM Program. Lang.,
2(OOPSLA), October 2018.
[BGV17] Hans-J. Boehm, Olivier Giroux, and Viktor Vafeiades. P0668r1: Revising
the C++ memory model, July 2017. http://www.open-std.org/jtc1/
sc22/wg21/docs/papers/2017/p0668r1.html.
[Bha14] Srivatsa S. Bhat. percpu_rwlock: Implement the core design of per-CPU
reader-writer locks, February 2014. https://patchwork.kernel.org/
patch/2157401/.
[BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency
Control and Recovery in Database Systems. Addison Wesley Publishing
Company, 1987.
[BHS07] Frank Buschmann, Kevlin Henney, and Douglas C. Schmidt. Pattern-Oriented
Software Architecture Volume 4: A Pattern Language for Distributed Comput-
ing. Wiley, Chichester, West Sussex, England, 2007.
[Bir89] Andrew D. Birrell. An Introduction to Programming with Threads. Digital
Systems Research Center, January 1989.

[BJ12] Rex Black and Capers Jones. Economics of software quality: An interview
with Capers Jones, part 1 of 2 (podcast transcript), January 2012. https:
//www.informit.com/articles/article.aspx?p=1824791.
[BK85] Bob Beck and Bob Kasten. VLSI assist in building a multiprocessor UNIX
system. In USENIX Conference Proceedings, pages 255–275, Portland, OR,
June 1985. USENIX Association.
[BLM05] C. Blundell, E. C. Lewis, and M. Martin. Deconstructing transactional seman-
tics: The subtleties of atomicity. In Annual Workshop on Duplicating, De-
constructing, and Debunking (WDDD), June 2005. Available: http://acg.
cis.upenn.edu/papers/wddd05_atomic_semantics.pdf [Viewed Feb-
ruary 28, 2021].
[BLM06] C. Blundell, E. C. Lewis, and M. Martin. Subtleties of transactional
memory and atomicity semantics. Computer Architecture Letters, 5(2),
2006. Available: http://acg.cis.upenn.edu/papers/cal06_atomic_
semantics.pdf [Viewed February 28, 2021].
[BM18] JF Bastien and Paul E. McKenney. P0750r1: Consume, February
2018. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/
2018/p0750r1.html.
[BMMM05] Luke Browning, Thomas Mathews, Paul E. McKenney, and James Moody.
Apparatus, method, and computer program product for converting simple locks
in a multiprocessor system. US Patent 6,842,809, Assigned to International
Business Machines Corporation, Washington, DC, January 2005.
[BMN+ 15] Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod,
and Peter Sewell. The problem of programming language concurrency
semantics. In Jan Vitek, editor, Programming Languages and Systems, volume
9032 of Lecture Notes in Computer Science, pages 283–307. Springer Berlin
Heidelberg, 2015.
[BMP08] R. F. Berry, P. E. McKenney, and F. N. Parr. Responsive systems: An
introduction. IBM Systems Journal, 47(2):197–206, April 2008.
[Boe05] Hans-J. Boehm. Threads cannot be implemented as a library. SIGPLAN Not.,
40(6):261–268, June 2005.
[Boe09] Hans-J. Boehm. Transactional memory should be an implementation technique,
not a programming interface. In HOTPAR 2009, page 6, Berkeley, CA, USA,
March 2009. Available: https://www.usenix.org/event/hotpar09/
tech/full_papers/boehm/boehm.pdf [Viewed May 24, 2009].
[Boe20] Hans Boehm. “Undefined behavior” and the concurrency memory
model, August 2020. http://www.open-std.org/jtc1/sc22/wg21/
docs/papers/2020/p2215r0.pdf.
[Boh01] Kristoffer Bohmann. Response time still matters, July 2001. URL: http:
//www.bohmann.dk/articles/response_time_still_matters.html
[broken, November 2016].
[Bon13] Paolo Bonzini. seqlock: introduce read-write seqlock, Sep-
tember 2013. https://git.qemu.org/?p=qemu.git;a=commit;h=
ea753d81e8b085d679f13e4a6023e003e9854d51.
[Bon15] Paolo Bonzini. rcu: add rcu library, February
2015. https://git.qemu.org/?p=qemu.git;a=commit;h=
7911747bd46123ef8d8eef2ee49422bb8a4b274f.

[Bon21a] Paolo Bonzini. An introduction to lockless algorithms, February 2021. Available: https://lwn.net/Articles/844224/ [Viewed February 19, 2021].
[Bon21b] Paolo Bonzini. Lockless patterns: an introduction to compare-and-swap,
March 2021. Available: https://lwn.net/Articles/847973/ [Viewed
March 13, 2021].
[Bon21c] Paolo Bonzini. Lockless patterns: full memory barriers, March 2021. Avail-
able: https://lwn.net/Articles/847481/ [Viewed March 8, 2021].
[Bon21d] Paolo Bonzini. Lockless patterns: more read-modify-write operations, March
2021. Available: https://lwn.net/Articles/849237/ [Viewed March
19, 2021].
[Bon21e] Paolo Bonzini. Lockless patterns: relaxed access and partial memory bar-
riers, February 2021. Available: https://lwn.net/Articles/846700/
[Viewed February 27, 2021].
[Bon21f] Paolo Bonzini. Lockless patterns: some final topics, March 2021. Available:
https://lwn.net/Articles/850202/ [Viewed March 19, 2021].
[Bor06] Richard Bornat. Dividing the sheep from the goats, January 2006. Seminar at
School of Computing, Univ. of Kent. Abstract is available at https://www.
cs.kent.ac.uk/seminar_archive/2005_06/abs_2006_01_24.html.
Retracted in July 2014: http://www.eis.mdx.ac.uk/staffpages/r_
bornat/papers/camel_hump_retraction.pdf.
[Bos10] Keith Bostic. Switch lockless programming style from epoch to hazard refer-
ences, January 2010. https://github.com/wiredtiger/wiredtiger/
commit/dddc21014fc494a956778360a14d96c762495e09.
[BPP+ 16] Adam Belay, George Prekas, Mia Primorac, Ana Klimovic, Samuel Grossman,
Christos Kozyrakis, and Edouard Bugnion. The IX operating system: Com-
bining low latency, high throughput, and efficiency in a protected dataplane.
ACM Trans. Comput. Syst., 34(4):11:1–11:39, December 2016.
[Bra07] Reg Braithwaite. Don’t overthink fizzbuzz, January 2007. http://weblog.
raganwald.com/2007/01/dont-overthink-fizzbuzz.html.
[Bra11] Björn Brandenburg. Scheduling and Locking in Multiprocessor Real-Time
Operating Systems. PhD thesis, The University of North Carolina at
Chapel Hill, 2011. URL: https://www.cs.unc.edu/~anderson/diss/
bbbdiss.pdf.
[Bro15a] Neil Brown. Pathname lookup in Linux, June 2015. https://lwn.net/
Articles/649115/.
[Bro15b] Neil Brown. RCU-walk: faster pathname lookup in Linux, July 2015.
https://lwn.net/Articles/649729/.
[Bro15c] Neil Brown. A walk among the symlinks, July 2015. https://lwn.net/
Articles/650786/.
[BS75] Paul J. Brown and Ronald M. Smith. Shared data controlled by a plurality of
users, May 1975. US Patent 3,886,525, filed June 29, 1973.
[BS14] Mark Batty and Peter Sewell. The thin-air problem, February 2014. https:
//www.cl.cam.ac.uk/~pes20/cpp/notes42.html.

[But97] David Butenhof. Programming with POSIX Threads. Addison-Wesley, Boston, MA, USA, 1997.
[BW14] Silas Boyd-Wickizer. Optimizing Communications Bottlenecks in Multipro-
cessor Operating Systems Kernels. PhD thesis, Massachusetts Institute of
Technology, 2014. https://pdos.csail.mit.edu/papers/sbw-phd-
thesis.pdf.
[BWCM+ 10] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev,
M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of
Linux scalability to many cores. In 9th USENIX Symposium on Operating
System Design and Implementation, pages 1–16, Vancouver, BC, Canada,
October 2010. USENIX.
[CAK+ 96] Crispin Cowan, Tito Autrey, Charles Krasic, Calton Pu, and Jonathan Walpole.
Fast concurrent dynamic linking for an adaptive operating system. In Interna-
tional Conference on Configurable Distributed Systems (ICCDS’96), pages
108–115, Annapolis, MD, May 1996.
[CBF13] UPC Consortium, Dan Bonachea, and Gary Funck. UPC language and library
specifications, version 1.3. Technical report, UPC Consortium, November
2013.
[CBM+ 08] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu,
Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory:
Why is it only a research toy? ACM Queue, September 2008.
[Chi22] A.A. Chien. Computer Architecture for Scientists. Cambridge University
Press, 2022.
[CHP71] P. J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with “readers”
and “writers”. Communications of the ACM, 14(10):667–668, October 1971.
[CKL04] Edmund Clarke, Daniel Kroening, and Flavio Lerda. A tool for checking
ANSI-C programs. In Kurt Jensen and Andreas Podelski, editors, Tools
and Algorithms for the Construction and Analysis of Systems (TACAS 2004),
volume 2988 of Lecture Notes in Computer Science, pages 168–176. Springer,
2004.
[CKZ12] Austin Clements, Frans Kaashoek, and Nickolai Zeldovich. Scalable address
spaces using RCU balanced trees. In Architectural Support for Programming
Languages and Operating Systems (ASPLOS 2012), pages 199–210, London,
UK, March 2012. ACM.
[CKZ+ 13] Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T.
Morris, and Eddie Kohler. The scalable commutativity rule: Designing
scalable software for multicore processors. In Proceedings of the Twenty-
Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages
1–17, Farminton, Pennsylvania, 2013. ACM.
[Cli09] Cliff Click. And now some hardware transactional memory comments..., Feb-
ruary 2009. URL: http://www.cliffc.org/blog/2009/02/25/and-
now-some-hardware-transactional-memory-comments/.
[CLRS01] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to
Algorithms, Second Edition. MIT electrical engineering and computer science
series. MIT Press, 2001.

[CnRR18] Armando Castañeda, Sergio Rajsbaum, and Michel Raynal. Unifying con-
current objects and distributed tasks: Interval-linearizability. J. ACM, 65(6),
November 2018.
[Com01] Compaq Computer Corporation. Shared memory, threads, inter-
process communication, August 2001. Zipped archive: wiz_
2637.txt in https://www.digiater.nl/openvms/freeware/v70/
ask_the_wizard/wizard.zip.
[Coo18] Byron Cook. Formal reasoning about the security of amazon web services. In
Hana Chockler and Georg Weissenbacher, editors, Computer Aided Verifica-
tion, pages 38–47, Cham, 2018. Springer International Publishing.
[Cor02] Compaq Computer Corporation. Alpha Architecture Reference Manual. Digital
Press, fourth edition, 2002.
[Cor03] Jonathan Corbet. Driver porting: mutual exclusion with seqlocks, February
2003. https://lwn.net/Articles/22818/.
[Cor04a] Jonathan Corbet. Approaches to realtime Linux, October 2004. URL:
https://lwn.net/Articles/106010/.
[Cor04b] Jonathan Corbet. Finding kernel problems automatically, June 2004. https:
//lwn.net/Articles/87538/.
[Cor04c] Jonathan Corbet. Realtime preemption, part 2, October 2004. URL: https:
//lwn.net/Articles/107269/.
[Cor06a] Jonathan Corbet. The kernel lock validator, May 2006. Available: https:
//lwn.net/Articles/185666/ [Viewed: March 26, 2010].
[Cor06b] Jonathan Corbet. Priority inheritance in the kernel, April 2006. Available:
https://lwn.net/Articles/178253/ [Viewed June 29, 2009].
[Cor10a] Jonathan Corbet. Dcache scalability and RCU-walk, December 2010. Avail-
able: https://lwn.net/Articles/419811/ [Viewed May 29, 2017].
[Cor10b] Jonathan Corbet. sys_membarrier(), January 2010. https://lwn.net/
Articles/369567/.
[Cor11] Jonathan Corbet. How to ruin linus’s vacation, July 2011. Available: https:
//lwn.net/Articles/452117/ [Viewed May 29, 2017].
[Cor12] Jonathan Corbet. ACCESS_ONCE(), August 2012. https://lwn.net/
Articles/508991/.
[Cor13] Jonathan Corbet. (Nearly) full tickless operation in 3.10, May 2013. https:
//lwn.net/Articles/549580/.
[Cor14a] Jonathan Corbet. ACCESS_ONCE() and compiler bugs, December 2014.
https://lwn.net/Articles/624126/.
[Cor14b] Jonathan Corbet. MCS locks and qspinlocks, March 2014. https://lwn.
net/Articles/590243/.
[Cor14c] Jonathan Corbet. Relativistic hash tables, part 1: Algorithms, September
2014. https://lwn.net/Articles/612021/.
[Cor14d] Jonathan Corbet. Relativistic hash tables, part 2: Implementation, September
2014. https://lwn.net/Articles/612100/.
[Cor16a] Jonathan Corbet. Finding race conditions with KCSAN, June 2016. https:
//lwn.net/Articles/691128/.

[Cor16b] Jonathan Corbet. Time to move to C11 atomics?, June 2016. https:
//lwn.net/Articles/691128/.
[Cor18] Jonathan Corbet. membarrier(2), October 2018. https://man7.org/
linux/man-pages/man2/membarrier.2.html.
[Cra93] Travis Craig. Building FIFO and priority-queuing spin locks from atomic swap.
Technical Report 93-02-02, University of Washington, Seattle, Washington,
February 1993.
[CRKH05] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux
Device Drivers. O’Reilly Media, Inc., third edition, 2005. URL: https:
//lwn.net/Kernel/LDD3/.
[CSG99] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer
Architecture: a Hardware/Software Approach. Morgan Kaufman, 1999.
[cut17] crates.io user ticki. conc v0.5.0: Hazard-pointer-based concurrent memory
reclamation, August 2017. https://crates.io/crates/conc.
[Dat82] C. J. Date. An Introduction to Database Systems, volume 1. Addison-Wesley
Publishing Company, 1982.
[DBA09] Saeed Dehnadi, Richard Bornat, and Ray Adams. Meta-analysis of the effect of
consistency on success in early learning of programming. In PPIG 2009, pages
1–13, University of Limerick, Ireland, June 2009. Psychology of Programming
Interest Group.
[DCW+ 11] Luke Dalessandro, Francois Carouge, Sean White, Yossi Lev, Mark Moir,
Michael L. Scott, and Michael F. Spear. Hybrid NOrec: A case study in the
effectiveness of best effort hardware transactional memory. In Proceedings of
the 16th International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), ASPLOS ’11, page 39–52,
Newport Beach, CA, USA, 2011. ACM.
[Dea18] Will Deacon. [PATCH 00/10] kernel/locking: qspinlock improve-
ments, April 2018. https://lkml.kernel.org/r/1522947547-24081-
[email protected].
[Dea19] Will Deacon. Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates,
August 2019. https://lore.kernel.org/lkml/20190821103200.
kpufwtviqhpbuv2n@willie-the-truck/.
[Den15] Peter Denning. Perspectives on OS foundations. In SOSP History Day 2015,
SOSP ’15, pages 3:1–3:46, Monterey, California, 2015. ACM.
[Dep06] Department of Computing and Information Systems, University of Melbourne.
CSIRAC, 2006. https://cis.unimelb.edu.au/about/csirac/.
[Des09a] Mathieu Desnoyers. Low-Impact Operating System Tracing. PhD
thesis, Ecole Polytechnique de Montréal, December 2009. Available:
https://lttng.org/files/thesis/desnoyers-dissertation-
2009-12-v27.pdf [Viewed February 27, 2021].
[Des09b] Mathieu Desnoyers. [RFC git tree] userspace RCU (urcu) for Linux, February
2009. https://liburcu.org.
[DFGG11] Aleksandar Dragojević, Pascal Felber, Vincent Gramoli, and Rachid Guerraoui.
Why STM can be more than a research toy. Communications of the ACM,
pages 70–77, April 2011.
[DFLO19] Dino Distefano, Manuel Fähndrich, Francesco Logozzo, and Peter W. O’Hearn.
Scaling static analyses at Facebook. Commun. ACM, 62(8):62–70, July 2019.
[DHJ+ 07] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakula-
pati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter
Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value
store. SIGOPS Oper. Syst. Rev., 41(6):205–220, October 2007.
[DHK12] Vijay D’Silva, Leopold Haller, and Daniel Kroening. Satisfiability solvers are
static analyzers. In Static Analysis Symposium (SAS), volume 7460 of LNCS,
pages 317–333. Springer, 2012.
[DHL+ 08] Dave Dice, Maurice Herlihy, Doug Lea, Yossi Lev, Victor Luchangco, Wayne
Mesard, Mark Moir, Kevin Moore, and Dan Nussbaum. Applications of the
adaptive transactional memory test platform. In 3rd ACM SIGPLAN Workshop
on Transactional Computing, pages 1–10, Salt Lake City, UT, USA, February
2008.
[Dij65] E. W. Dijkstra. Solution of a problem in concurrent programming control.
Communications of the ACM, 8(9):569, September 1965.
[Dij68] Edsger W. Dijkstra. Letters to the editor: Go to statement considered harmful.
Commun. ACM, 11(3):147–148, March 1968.
[Dij71] Edsger W. Dijkstra. Hierarchical ordering of sequential processes. Acta
Informatica, 1(2):115–138, 1971. Available: https://www.cs.utexas.
edu/users/EWD/ewd03xx/EWD310.PDF [Viewed January 13, 2008].
[DKS89] Alan Demers, Srinivasan Keshav, and Scott Shenker. Analysis and simulation
of a fair queuing algorithm. SIGCOMM ’89, pages 1–12, 1989.
[DLM+ 10] Dave Dice, Yossi Lev, Virendra J. Marathe, Mark Moir, Dan Nussbaum,
and Marek Olszewski. Simplifying concurrent algorithms by exploiting
hardware transactional memory. In Proceedings of the 22nd ACM symposium
on Parallelism in algorithms and architectures, SPAA ’10, pages 325–334,
Thira, Santorini, Greece, 2010. ACM.
[DLMN09] Dave Dice, Yossi Lev, Mark Moir, and Dan Nussbaum. Early experience with
a commercial hardware transactional memory implementation. In Fourteenth
International Conference on Architectural Support for Programming Lan-
guages and Operating Systems (ASPLOS ’09), pages 157–168, Washington,
DC, USA, March 2009.
[DMD13] Mathieu Desnoyers, Paul E. McKenney, and Michel R. Dagenais. Multi-core
systems modeling for formal verification of parallel algorithms. SIGOPS Oper.
Syst. Rev., 47(2):51–65, July 2013.
[DMLP79] Richard A. De Millo, Richard J. Lipton, and Alan J. Perlis. Social processes
and proofs of theorems and programs. Commun. ACM, 22(5):271–280, May
1979.
[DMS+ 12] Mathieu Desnoyers, Paul E. McKenney, Alan Stern, Michel R. Dagenais, and
Jonathan Walpole. User-level implementations of read-copy update. IEEE
Transactions on Parallel and Distributed Systems, 23:375–382, 2012.
[dO18a] Daniel Bristot de Oliveira. Deadline scheduler part 2 – details and usage,
January 2018. URL: https://lwn.net/Articles/743946/.
[dO18b] Daniel Bristot de Oliveira. Deadline scheduling part 1 – overview and theory,
January 2018. URL: https://lwn.net/Articles/743740/.
[dOCdO19] Daniel Bristot de Oliveira, Tommaso Cucinotta, and Rômulo Silva de Oliveira.
Modeling the behavior of threads in the PREEMPT_RT Linux kernel using
automata. SIGBED Rev., 16(3):63–68, November 2019.
[Don21] Jason Donenfeld. Introduce WireGuardNT, August 2021. Git
commit: https://git.zx2c4.com/wireguard-nt/commit/?id=
d64c53776d7f72751d7bd580ead9846139c8f12f.
[Dov90] Ken F. Dove. A high capacity TCP/IP in parallel STREAMS. In UKUUG
Conference Proceedings, London, June 1990.
[Dow20] Travis Downs. Gathering intel on Intel AVX-512 transitions, Jan-
uary 2020. https://travisdowns.github.io/blog/2020/01/17/
avxfreq1.html.
[Dre11] Ulrich Drepper. Futexes are tricky. Technical Report FAT2011, Red Hat, Inc.,
Raleigh, NC, USA, November 2011.
[DSS06] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In Proc.
International Symposium on Distributed Computing. Springer Verlag, 2006.
[Duf10a] Joe Duffy. A (brief) retrospective on transactional memory,
January 2010. http://joeduffyblog.com/2010/01/03/a-brief-
retrospective-on-transactional-memory/.
[Duf10b] Joe Duffy. More thoughts on transactional memory, May
2010. http://joeduffyblog.com/2010/05/16/more-thoughts-on-
transactional-memory/.
[Dug10] Abhinav Duggal. Stopping data races using Redflag. Master’s thesis, Stony
Brook University, 2010.
[Eas71] William B. Easton. Process synchronization without long-term interlock. In
Proceedings of the Third ACM Symposium on Operating Systems Principles,
SOSP ’71, pages 95–100, Palo Alto, California, USA, 1971. Association for
Computing Machinery.
[Edg13] Jake Edge. The future of realtime Linux, November 2013. URL: https:
//lwn.net/Articles/572740/.
[Edg14] Jake Edge. The future of the realtime patch set, October 2014. URL:
https://lwn.net/Articles/617140/.
[EGCD03] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC language specifica-
tions v1.1, May 2003. URL: http://upc.gwu.edu [broken, February 27,
2021].
[EGMdB11] Stephane Eranian, Eric Gouriou, Tipp Moseley, and Willem de Bruijn. Linux
kernel profiling with perf, June 2011. https://perf.wiki.kernel.org/
index.php/Tutorial.
[Ell80] Carla Schlatter Ellis. Concurrent search and insertion in AVL trees. IEEE
Transactions on Computers, C-29(9):811–817, September 1980.
[ELLM07] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: scalable
NonZero indicators. In Proceedings of the twenty-sixth annual ACM symposium
on Principles of distributed computing, PODC ’07, pages 13–22, Portland,
Oregon, USA, 2007. ACM.
[EMV+ 20a] Marco Elver, Paul E. McKenney, Dmitry Vyukov, Andrey Konovalov, Alexan-
der Potapenko, Kostya Serebryany, Alan Stern, Andrea Parri, Akira Yokosawa,
Peter Zijlstra, Will Deacon, Daniel Lustig, Boqun Feng, Joel Fernandes,
Jade Alglave, and Luc Maranget. Concurrency bugs should fear the big bad
data-race detector (part 1), April 2020. Linux Weekly News.
[EMV+ 20b] Marco Elver, Paul E. McKenney, Dmitry Vyukov, Andrey Konovalov, Alexan-
der Potapenko, Kostya Serebryany, Alan Stern, Andrea Parri, Akira Yokosawa,
Peter Zijlstra, Will Deacon, Daniel Lustig, Boqun Feng, Joel Fernandes,
Jade Alglave, and Luc Maranget. Concurrency bugs should fear the big bad
data-race detector (part 2), April 2020. Linux Weekly News.
[Eng68] Douglas Engelbart. The demo, December 1968. URL: http://thedemo.
org/.
[ENS05] Ryan Eccles, Blair Nonneck, and Deborah A. Stacey. Exploring parallel
programming knowledge in the novice. In HPCS ’05: Proceedings of the
19th International Symposium on High Performance Computing Systems and
Applications, pages 97–102, Guelph, Ontario, Canada, 2005. IEEE Computer
Society.
[Eri08] Christer Ericson. Aiding pathfinding with cellular automata, June 2008.
http://realtimecollisiondetection.net/blog/?p=57.
[ES90] Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference
Manual. Addison Wesley, 1990.
[ES05] Ryan Eccles and Deborah A. Stacey. Understanding the parallel programmer.
In HPCS ’05: Proceedings of the 19th International Symposium on High
Performance Computing Systems and Applications, pages 156–160, Guelph,
Ontario, Canada, 2005. IEEE Computer Society.
[ETH11] ETH Zurich. Parallel solver for a perfect maze, March
2011. URL: http://nativesystems.inf.ethz.ch/pub/Main/
WebHomeLecturesParallelProgrammingExercises/pp2011hw04.pdf
[broken, November 2016].
[Eva11] Jason Evans. Scalable memory allocation using jemalloc, Janu-
ary 2011. https://engineering.fb.com/2011/01/03/core-data/
scalable-memory-allocation-using-jemalloc/.
[Fel50] W. Feller. An Introduction to Probability Theory and its Applications. John
Wiley, 1950.
[Fen73] J. Fennel. Instruction selection in a two-program counter instruction unit.
Technical Report US Patent 3,728,692, Assigned to International Business
Machines Corp, Washington, DC, April 1973.
[Fen15] Boqun Feng. powerpc: Make value-returning atomics fully ordered, November
2015. Git commit: https://git.kernel.org/linus/49e9cf3f0c04.
[FH07] Keir Fraser and Tim Harris. Concurrent programming without locks. ACM
Trans. Comput. Syst., 25(2):1–61, 2007.
[FIMR16] Pascal Felber, Shady Issa, Alexander Matveev, and Paolo Romano. Hardware
read-write lock elision. In Proceedings of the Eleventh European Conference on
Computer Systems, EuroSys ’16, London, United Kingdom, 2016. Association
for Computing Machinery.
[Fos10] Ron Fosner. Scalable multithreaded programming with tasks. MSDN Magazine,
2010(11):60–69, November 2010. http://msdn.microsoft.com/en-us/
magazine/gg309176.aspx.
[FPB79] Frederick P. Brooks, Jr. The Mythical Man-Month. Addison-Wesley, 1979.
[Fra03] Keir Anthony Fraser. Practical Lock-Freedom. PhD thesis, King’s College,
University of Cambridge, 2003.
[Fra04] Keir Fraser. Practical lock-freedom. Technical Report UCAM-CL-TR-579,
University of Cambridge, Computer Laboratory, February 2004.
[FRK02] Hubertus Francke, Rusty Russell, and Matthew Kirkwood. Fuss, futexes
and furwocks: Fast userlevel locking in Linux. In Ottawa Linux Symposium,
pages 479–495, June 2002. Available: https://www.kernel.org/doc/
ols/2002/ols2002-pages-479-495.pdf [Viewed May 22, 2011].
[FSP+ 17] Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc
Maranget, Kathryn E. Gray, Ali Sezgin, Mark Batty, and Peter Sewell.
Mixed-size concurrency: ARM, POWER, C/C++11, and SC. SIGPLAN Not.,
52(1):429–442, January 2017.
[GAJM15] Alex Groce, Iftekhar Ahmed, Carlos Jensen, and Paul E. McKenney. How
verified is my code? Falsification-driven verification (T). In Proceedings of
the 2015 30th IEEE/ACM International Conference on Automated Software
Engineering (ASE), ASE ’15, pages 737–748, Washington, DC, USA, 2015.
IEEE Computer Society.
[Gar90] Arun Garg. Parallel STREAMS: a multi-processor implementation. In
USENIX Conference Proceedings, pages 163–176, Berkeley CA, February
1990. USENIX Association. Available: https://archive.org/details/
1990-proceedings-winter-dc/page/163/mode/2up.
[Gar07] Bryan Gardiner. IDF: Gordon Moore predicts end of Moore’s law (again),
September 2007. Available: https://www.wired.com/2007/09/idf-
gordon-mo-1/ [Viewed: February 27, 2021].
[GC96] Michael Greenwald and David R. Cheriton. The synergy between non-blocking
synchronization and operating system structure. In Proceedings of the Second
Symposium on Operating Systems Design and Implementation, pages 123–136,
Seattle, WA, October 1996. USENIX Association.
[GDZE10] Olga Golovanevsky, Alon Dayan, Ayal Zaks, and David Edelsohn. Trace-based
data layout optimizations for multi-core processors. In Proceedings of the 5th
International Conference on High Performance Embedded Architectures and
Compilers, HiPEAC’10, pages 81–95, Pisa, Italy, 2010. Springer-Verlag.
[GG14] Vincent Gramoli and Rachid Guerraoui. Democratizing transactional pro-
gramming. Commun. ACM, 57(1):86–93, January 2014.
[GGK18] Christina Giannoula, Georgios Goumas, and Nectarios Koziris. Combining
HTM with RCU to speed up graph coloring on multicore platforms. In Rio
Yokota, Michèle Weiland, David Keyes, and Carsten Trinitis, editors, High
Performance Computing, pages 350–369, Cham, 2018. Springer International
Publishing.
[GGL+ 19] Rachid Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, and
Vasileios Trigonakis. Lock–unlock: Is that all? A pragmatic analysis of
locking in software systems. ACM Trans. Comput. Syst., 36(1):1:1–1:149,
March 2019.
[Gha95] Kourosh Gharachorloo. Memory consistency models for shared-memory multi-
processors. Technical Report CSL-TR-95-685, Computer Systems Laboratory,
Departments of Electrical Engineering and Computer Science, Stanford Univer-
sity, Stanford, CA, December 1995. Available: https://www.hpl.hp.com/
techreports/Compaq-DEC/WRL-95-9.pdf [Viewed: October 11, 2004].
[GHH+ 14] Alex Groce, Klaus Havelund, Gerard J. Holzmann, Rajeev Joshi, and Ru-Gang
Xu. Establishing flight software reliability: testing, model checking, constraint-
solving, monitoring and learning. Ann. Math. Artif. Intell., 70(4):315–349,
2014.
[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley,
1995.
[GKAS99] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado:
Maximizing locality and concurrency in a shared memory multiprocessor
operating system. In Proceedings of the 3rd Symposium on Operating System
Design and Implementation, pages 87–100, New Orleans, LA, February 1999.
[GKP13] Justin Gottschlich, Rob Knauerhase, and Gilles Pokam. But how do we really
debug transactional memory? In 5th USENIX Workshop on Hot Topics in
Parallelism (HotPar 2013), San Jose, CA, USA, June 2013.
[GKPS95] Ben Gamsa, Orran Krieger, E. Parsons, and Michael Stumm. Performance
issues for multiprocessor operating systems, November 1995. Technical Re-
port CSRI-339, Available: ftp://ftp.cs.toronto.edu/pub/reports/
csri/339/339.ps.
[Gla18] Stjepan Glavina. Merge remaining subcrates, November 2018.
https://github.com/crossbeam-rs/crossbeam/commit/
d9b1e3429450a64b490f68c08bd191417e68f00c.
[Gle10] Thomas Gleixner. Realtime Linux: academia v. reality, July 2010. URL:
https://lwn.net/Articles/397422/.
[Gle12] Thomas Gleixner. Linux -rt kvm guest demo, December 2012. Personal
communication.
[GMTW08] D. Guniguntala, P. E. McKenney, J. Triplett, and J. Walpole. The read-copy-
update mechanism for supporting real-time applications on shared-memory
multiprocessor systems with Linux. IBM Systems Journal, 47(2):221–236,
May 2008.
[Gol18a] David Goldblatt. Add the Seq module, a simple seqlock implementa-
tion, April 2018. https://github.com/jemalloc/jemalloc/tree/
06a8c40b36403e902748d3f2a14e6dd43488ae89.
[Gol18b] David Goldblatt. P1202: Asymmetric fences, October 2018. http://www.
open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf.
[Gol19] David Goldblatt. There might not be an elegant OOTA fix, October
2019. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/
2019/p1916r0.pdf.
[GPB+ 07] Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and
Doug Lea. Java: Concurrency in Practice. Addison Wesley, Upper Saddle
River, NJ, USA, 2007.
[Gra91] Jim Gray. The Benchmark Handbook for Database and Transaction Processing
Systems. Morgan Kaufmann, 1991.
[Gra02] Jim Gray. Super-servers: Commodity computer clusters pose a software chal-
lenge, April 2002. Available: http://research.microsoft.com/en-
us/um/people/gray/papers/superservers(4t_computers).doc
[Viewed: June 23, 2004].
[Gre19] Brendan Gregg. BPF Performance Tools: Linux System and Application
Observability. Addison-Wesley Professional, 1st edition, 2019.
[Gri00] Scott Griffen. Internet pioneers: Doug Engelbart, May 2000. Available:
https://www.ibiblio.org/pioneers/englebart.html [Viewed No-
vember 28, 2008].
[Gro01] The Open Group. Single UNIX specification, July 2001. http://www.
opengroup.org/onlinepubs/007908799/index.html.
[Gro07] Dan Grossman. The transactional memory / garbage collection analogy. In
OOPSLA ’07: Proceedings of the 22nd annual ACM SIGPLAN conference on
Object oriented programming systems and applications, pages 695–706, Mont-
real, Quebec, Canada, October 2007. ACM. Available: https://homes.cs.
washington.edu/~djg/papers/analogy_oopsla07.pdf [Viewed Feb-
ruary 27, 2021].
[GRY12] Alexey Gotsman, Noam Rinetzky, and Hongseok Yang. Verify-
ing highly concurrent algorithms with grace (extended version), July
2012. https://software.imdea.org/~gotsman/papers/recycling-
esop13-ext.pdf.
[GRY13] Alexey Gotsman, Noam Rinetzky, and Hongseok Yang. Verifying concurrent
memory reclamation algorithms with grace. In ESOP’13: European Sympo-
sium on Programming, pages 249–269, Rome, Italy, 2013. Springer-Verlag.
[GT90] Gary Graunke and Shreekant Thakkar. Synchronization algorithms for shared-
memory multiprocessors. IEEE Computer, 23(6):60–69, June 1990.
[Gui18] Hugo Guiroux. Understanding the performance of mutual exclusion algorithms
on modern multicore machines. PhD thesis, Université Grenoble Alpes, 2018.
https://hugoguiroux.github.io/assets/these.pdf.
[Gwy15] David Gwynne. introduce srp, which according to the manpage i wrote is
short for “shared reference pointers”., July 2015. https://github.com/
openbsd/src/blob/HEAD/sys/kern/kern_srp.c.
[GYW+ 19] Jinyu Gu, Qianqian Yu, Xiayang Wang, Zhaoguo Wang, Binyu Zang, Haibing
Guan, and Haibo Chen. Pisces: A scalable and efficient persistent transactional
memory. In Proceedings of the 2019 USENIX Conference on Usenix Annual
Technical Conference, USENIX ATC ’19, pages 913–928, Renton, WA, USA,
2019. USENIX Association.
[Har16] "No Bugs" Hare. Infographics: Operation costs in CPU clock cycles, Sep-
tember 2016. http://ithare.com/infographics-operation-costs-
in-cpu-clock-cycles/.
[Hay20] Timothy Hayes. A shift to concurrency, October 2020. https:
//community.arm.com/developer/research/b/articles/posts/
arms-transactional-memory-extension-support-.
[HCS+ 05] Lorin Hochstein, Jeff Carver, Forrest Shull, Sima Asgari, and Victor Basili. Par-
allel programmer productivity: A case study of novice parallel programmers.
In SC ’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing,
page 35, Seattle, WA, USA, 2005. IEEE Computer Society.
[Hei27] W. Heisenberg. Über den anschaulichen Inhalt der quantentheoretischen
Kinematik und Mechanik. Zeitschrift für Physik, 43(3-4):172–198, 1927.
English translation in “Quantum theory and measurement” by Wheeler and
Zurek.
[Her90] Maurice P. Herlihy. A methodology for implementing highly concurrent
data structures. In Proceedings of the 2nd ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, pages 197–206, Seattle,
WA, USA, March 1990.
[Her91] Maurice Herlihy. Wait-free synchronization. ACM TOPLAS, 13(1):124–149,
January 1991.
[Her93] Maurice Herlihy. A methodology for implementing highly concurrent data ob-
jects. ACM Transactions on Programming Languages and Systems, 15(5):745–
770, November 1993.
[Her05] Maurice Herlihy. The transactional manifesto: software engineering and
non-blocking synchronization. In PLDI ’05: Proceedings of the 2005 ACM
SIGPLAN conference on Programming language design and implementation,
pages 280–280, Chicago, IL, USA, 2005. ACM Press.
[Her11] Benjamin Herrenschmidt. powerpc: Fix atomic_xxx_return barrier seman-
tics, November 2011. Git commit: https://git.kernel.org/linus/
b97021f85517.
[HHK+ 13] A. Haas, T.A. Henzinger, C.M. Kirsch, M. Lippautz, H. Payer, A. Sezgin,
and A. Sokolova. Distributed queues in shared memory—multicore perfor-
mance and scalability through quantitative relaxation. In Proc. International
Conference on Computing Frontiers, Ischia, Italy, 2013. ACM.
[HKLP12] Andreas Haas, Christoph M. Kirsch, Michael Lippautz, and Hannes Payer.
How FIFO is your concurrent FIFO queue? In Proceedings of the Workshop
on Relaxing Synchronization for Multicore and Manycore Scalability, Tucson,
AZ USA, October 2012.
[HL86] Frederick S. Hillier and Gerald J. Lieberman. Introduction to Operations
Research. Holden-Day, 1986.
[HLM02] Maurice Herlihy, Victor Luchangco, and Mark Moir. The repeat offender
problem: A mechanism for supporting dynamic-sized, lock-free data structures.
In Proceedings of 16th International Symposium on Distributed Computing,
pages 339–353, Toulouse, France, October 2002.
[HLM03] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-free syn-
chronization: Double-ended queues as an example. In Proceedings of the 23rd
IEEE International Conference on Distributed Computing Systems (ICDCS),
pages 73–82, Providence, RI, May 2003. The Institute of Electrical and
Electronics Engineers, Inc.
[HM93] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural
support for lock-free data structures. In ISCA ’93: Proceeding of the 20th
Annual International Symposium on Computer Architecture, pages 289–300,
San Diego, CA, USA, May 1993.
[HMB06] Thomas E. Hart, Paul E. McKenney, and Angela Demke Brown. Making lock-
less synchronization fast: Performance implications of memory reclamation.
In 20th IEEE International Parallel and Distributed Processing Symposium,
Rhodes, Greece, April 2006. Available: http://www.rdrop.com/users/
paulmck/RCU/hart_ipdps06.pdf [Viewed April 28, 2008].
[HMBW07] Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan
Walpole. Performance of memory reclamation for lockless synchronization. J.
Parallel Distrib. Comput., 67(12):1270–1285, 2007.
[HMDZ06] David Howells, Paul E. McKenney, Will Deacon, and Peter Zijlstra. Linux
kernel memory barriers, March 2006. https://www.kernel.org/doc/
Documentation/memory-barriers.txt.
[Hoa74] C. A. R. Hoare. Monitors: An operating system structuring concept. Commu-
nications of the ACM, 17(10):549–557, October 1974.
[Hol03] Gerard J. Holzmann. The Spin Model Checker: Primer and Reference Manual.
Addison-Wesley, Boston, MA, USA, 2003.
[Hor18] Jann Horn. Reading privileged memory with a side-channel, Jan-
uary 2018. https://googleprojectzero.blogspot.com/2018/01/
reading-privileged-memory-with-side.html.
[HOS89] James P. Hennessy, Damian L. Osisek, and Joseph W. Seigh II. Passive
serialization in a multitasking environment. Technical Report US Patent
4,809,168, Assigned to International Business Machines Corp, Washington,
DC, February 1989.
[How12] Phil Howard. Extending Relativistic Programming to Multiple Writers. PhD
thesis, Portland State University, 2012.
[HP95] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach. Morgan Kaufman, 1995.
[HP11] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach, Fifth Edition. Morgan Kaufman, 2011.
[HP17] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach, Sixth Edition. Morgan Kaufman, 2017.
[Hra13] Adam Hraška. Read-copy-update for HelenOS. Master’s thesis, Charles
University in Prague, Faculty of Mathematics and Physics, Department of
Distributed and Dependable Systems, 2013.
[HS08] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming.
Morgan Kaufmann, Burlington, MA, USA, 2008.
[HSLS20] Maurice Herlihy, Nir Shavit, Victor Luchangco, and Michael Spear. The Art
of Multiprocessor Programming, 2nd Edition. Morgan Kaufmann, Burlington,
MA, USA, 2020.
[HW90] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a correctness
condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463–
492, July 1990.
[HW92] Wilson C. Hsieh and William E. Weihl. Scalable reader-writer locks for
parallel systems. In Proceedings of the 6th International Parallel Processing
Symposium, pages 216–230, Beverly Hills, CA, USA, March 1992.
[HW11] Philip W. Howard and Jonathan Walpole. A relativistic enhancement to
software transactional memory. In Proceedings of the 3rd USENIX conference
on Hot topics in parallelism, HotPar’11, pages 1–6, Berkeley, CA, 2011.
USENIX Association.
[HW14] Philip W. Howard and Jonathan Walpole. Relativistic red-black trees. Con-
currency and Computation: Practice and Experience, 26(16):2684–2712,
November 2014.
[IBM94] IBM Microelectronics and Motorola. PowerPC Microprocessor Family: The
Programming Environments, 1994.
[Inm85] Jack Inman. Implementing loosely coupled functions on tightly coupled
engines. In USENIX Conference Proceedings, pages 277–298, Portland, OR,
June 1985. USENIX Association.
[Inm07] Bill Inmon. Time value of information, January 2007. URL: http://www.b-
eye-network.com/view/3365 [broken, February 2021].
[Int92] International Standards Organization. Information Technology - Data-
base Language SQL. ISO, 1992. Available (Second informal review
draft of ISO/IEC 9075:1992): http://www.contrib.andrew.cmu.edu/
~shadow/sql/sql1992.txt [Viewed September 19, 2008].
[Int02a] Intel Corporation. Intel Itanium Architecture Software Developer’s Manual
Volume 2: System Architecture, 2002.
[Int02b] Intel Corporation. Intel Itanium Architecture Software Developer’s Manual
Volume 3: Instruction Set Reference, 2002.
[Int04a] Intel Corporation. IA-32 Intel Architecture Software Developer’s Manual
Volume 2B: Instruction Set Reference, N-Z, 2004.
[Int04b] Intel Corporation. IA-32 Intel Architecture Software Developer’s Manual
Volume 3: System Programming Guide, 2004.
[Int04c] International Business Machines Corporation. z/Architecture principles of
operation, May 2004. Available: http://publibz.boulder.ibm.com/
epubs/pdf/dz9zr003.pdf [Viewed: February 16, 2005].
[Int07] Intel Corporation. Intel 64 Architecture Memory Ordering White Paper, 2007.
[Int11] Intel Corporation. Intel 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 3A: System Programming Guide, Part 1, 2011.
Available: http://www.intel.com/Assets/PDF/manual/253668.pdf
[Viewed: February 12, 2011].
[Int16] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A: System Programming Guide, Part 1, 2016.
[Int20a] Intel. Desktop 4th Generation Intel® Core™ Processor Family, Desktop Intel®
Pentium® Processor Family, and Desktop Intel® Celeron® Processor Family,
April 2020. http://www.intel.com/content/dam/www/public/us/
en/documents/specification-updates/4th-gen-core-family-
desktop-specification-update.pdf.
[Int20b] Intel Corporation. Intel Transactional Synchronization Extensions
(Intel TSX) Programming Considerations, 2021.1 edition, December
2020. In Intel C++ Compiler Classic Developer Guide and Reference,
https://software.intel.com/content/dam/develop/external/

[Int20c] International Business Machines Corporation. Power ISA™ Version 3.1, 2020.
[Int21] Intel. Performance monitoring impact of Intel® Transactional
Synchronization Extension memory ordering issue, June 2021.
https://www.intel.com/content/dam/support/us/en/documents/
processors/Performance-Monitoring-Impact-of-TSX-Memory-
Ordering-Issue-604224.pdf.
[Jac88] Van Jacobson. Congestion avoidance and control. In SIGCOMM ’88, pages
314–329, August 1988.
[Jac93] Van Jacobson. Avoid read-side locking via delayed free, September 1993.
private communication.
[Jac08] Daniel Jackson. MapReduce course, January 2008. Available: https:
//sites.google.com/site/mriap2008/ [Viewed January 3, 2013].
[JED] JEDEC. mega (M) (as a prefix to units of semiconductor storage capacity)
[online].
[Jef14] Alan Jeffrey. JMM revision status, July 2014. https://mail.openjdk.
java.net/pipermail/jmm-dev/2014-July/000072.html.
[JJKD21] Ralf Jung, Jacques-Henri Jourdan, Robbert Krebbers, and Derek Dreyer. Safe
systems programming in Rust. Commun. ACM, 64(4):144–152, March 2021.
[JLK16a] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel ad-
dress space layout randomization (KASLR) with Intel TSX, July
2016. Black Hat USA 2016 https://www.blackhat.com/us-
16/briefings.html#breaking-kernel-address-space-layout-
randomization-kaslr-with-intel-tsx.
[JLK16b] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel address space
layout randomization with Intel TSX. In Proceedings of the 2016 ACM
SIGSAC Conference on Computer and Communications Security, CCS ’16,
pages 380–392, Vienna, Austria, 2016. ACM.
[JMRR02] Benedict Joseph Jackson, Paul E. McKenney, Ramakrishnan Rajamony, and
Ronald Lynn Rockhold. Scalable interruptible queue locks for shared-memory
multiprocessor. US Patent 6,473,819, Assigned to International Business
Machines Corporation, Washington, DC, October 2002.
[Joh77] Stephen Johnson. Lint, a C program checker, December 1977. Computer
Science Technical Report 65, Bell Laboratories.
[Joh95] Aju John. Dynamic vnodes – design and implementation. In USENIX Winter
1995, pages 11–23, New Orleans, LA, January 1995. USENIX Associa-
tion. Available: https://www.usenix.org/publications/library/
proceedings/neworl/full_papers/john.a [Viewed October 1, 2010].
[Jon11] Dave Jones. Trinity: A system call fuzzer. In 13th Ottawa Linux Symposium,
Ottawa, Canada, June 2011. Project repository: https://github.com/
kernelslacker/trinity.
[JSG12] Christian Jacobi, Timothy Slegel, and Dan Greiner. Transactional mem-
ory architecture and implementation for IBM System z. In Proceedings of
the 45th Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO 45, pages 25–36, Vancouver B.C. Canada, December 2012. Presenta-
tion slides: https://www.microarch.org/micro45/talks-posters/

[Kaa15] Frans Kaashoek. Parallel computing and the OS. In SOSP History Day, October
2015.
[KCH+ 06] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and
Anthony Nguyen. Hybrid transactional memory. In Proceedings of the
ACM SIGPLAN 2006 Symposium on Principles and Practice of Parallel
Programming, New York, New York, United States, 2006. ACM SIGPLAN.
[KDI20] Alex Kogan, Dave Dice, and Shady Issa. Scalable range locks for scalable
address spaces and beyond. In Proceedings of the Fifteenth European
Conference on Computer Systems, EuroSys ’20, Heraklion, Greece, 2020.
Association for Computing Machinery.
[Kel17] Michael J. Kelly. How might the manufacturability of the hardware at
device level impact on exascale computing?, 2017. Keynote speech at
Multicore World 2017, URL: https://openparallel.com/multicore-
world-2017/program-2017/abstracts2017/.
[Ken20] Chris Kennelly. TCMalloc overview, February 2020. https://google.
github.io/tcmalloc/overview.html.
[KFC11] KFC. Memristor processor solves mazes, March 2011. URL: https:
//www.technologyreview.com/2011/03/03/196572/memristor-
processor-solves-mazes/.
[Khi14] Maxim Khizhinsky. Memory management schemes, June 2014.
https://kukuruku.co/post/lock-free-data-structures-the-
inside-memory-management-schemes/.
[Khi15] Max Khiszinsky. Lock-free data structures. The inside. RCU, February
2015. https://kukuruku.co/post/lock-free-data-structures-
the-inside-rcu/.
[Kis14] Jan Kiszka. Real-time virtualization - how crazy are we? In Linux Plumbers
Conference, Duesseldorf, Germany, October 2014. URL: https://blog.
linuxplumbersconf.org/2014/ocw/proposals/1935.
[Kiv13] Avi Kivity. rcu: add basic read-copy-update implementation, Au-
gust 2013. https://github.com/cloudius-systems/osv/commit/
94b69794fb9e6c99d78ca9a58ddaee1c31256b43.
[Kiv14a] Avi Kivity. rcu hashtable, July 2014. https:
//github.com/cloudius-systems/osv/commit/
7fa2728e5d03b2174b4a39d94b21940d11926e90.
[Kiv14b] Avi Kivity. rcu: introduce an rcu list type, April 2014.
https://github.com/cloudius-systems/osv/commit/
4e46586093aeaf339fef8e08d123a6f6b0abde5b.
[KL80] H. T. Kung and Philip L. Lehman. Concurrent manipulation of binary search
trees. ACM Transactions on Database Systems, 5(3):354–382, September
1980.
[Kle14] Andi Kleen. Scaling existing lock-based applications with lock elision.
Commun. ACM, 57(3):52–56, March 2014.
[Kle17] Matt Klein. Envoy threading model, July 2017. https://blog.
envoyproxy.io/envoy-threading-model-a8d44b922310.
[KLP12] Christoph M. Kirsch, Michael Lippautz, and Hannes Payer. Fast and scalable
k-FIFO queues. Technical Report 2012-04, University of Salzburg, Salzburg,
Austria, June 2012.
[KM13] Konstantin Khlebnikov and Paul E. McKenney. RCU: non-atomic assignment
to long/pointer variables in gcc, January 2013. https://lore.kernel.
org/lkml/[email protected]/.
[KMK+ 19] Jaeho Kim, Ajit Mathew, Sanidhya Kashyap, Madhava Krishnan Ramanathan,
and Changwoo Min. MV-RLU: Scaling read-log-update with multi-versioning.
In Proceedings of the Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS ’19,
pages 779–792, Providence, RI, USA, 2019. ACM.
[Kni86] Tom Knight. An architecture for mostly functional languages. In Proceedings
of the 1986 ACM Conference on LISP and Functional Programming, LFP ’86,
pages 105–112, Cambridge, Massachusetts, USA, 1986. ACM.
[Kni08] John U. Knickerbocker. 3D chip technology. IBM Journal of Research and
Development, 52(6), November 2008. URL: http://www.research.ibm.
com/journal/rd52-6.html [Link to each article is broken as of November
2016; Available via https://ieeexplore.ieee.org/xpl/tocresult.
jsp?isnumber=5388557].
[Knu73] Donald Knuth. The Art of Computer Programming. Addison-Wesley, 1973.
[Kra17] Vlad Krasnov. On the dangers of Intel’s frequency scaling, No-
vember 2017. https://blog.cloudflare.com/on-the-dangers-of-
intels-frequency-scaling/.
[KS08] Daniel Kroening and Ofer Strichman. Decision Procedures: An Algorithmic
Point of View. Springer Publishing Company, Incorporated, 1 edition, 2008.
[KS17a] Michalis Kokologiannakis and Konstantinos Sagonas. Stateless model
checking of the Linux kernel’s hierarchical read-copy update (Tree RCU).
Technical report, National Technical University of Athens, January 2017.
https://github.com/michalis-/rcu/blob/master/rcupaper.pdf.
[KS17b] Michalis Kokologiannakis and Konstantinos Sagonas. Stateless model check-
ing of the Linux kernel’s hierarchical read-copy-update (Tree RCU). In
Proceedings of International SPIN Symposium on Model Checking of Soft-
ware, SPIN 2017, New York, NY, USA, July 2017. ACM.
[KS19] Michalis Kokologiannakis and Konstantinos Sagonas. Stateless model check-
ing of the Linux kernel’s read-copy update (RCU). Int. J. Softw. Tools Technol.
Transf., 21(3):287–306, June 2019.
[KWS97] Leonidas Kontothanassis, Robert W. Wisniewski, and Michael L. Scott.
Scheduler-conscious synchronization. ACM Transactions on Computer Sys-
tems, 15(1):3–40, February 1997.
[LA94] Beng-Hong Lim and Anant Agarwal. Reactive synchronization algorithms
for multiprocessors. In Proceedings of the sixth international conference
on Architectural support for programming languages and operating systems,
ASPLOS VI, pages 25–35, San Jose, California, USA, October 1994. ACM.
[Lam74] Leslie Lamport. A new solution of Dijkstra’s concurrent programming
problem. Communications of the ACM, 17(8):453–455, August 1974.
[Lam77] Leslie Lamport. Concurrent reading and writing. Commun. ACM, 20(11):806–
811, November 1977.
[Lam02] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for
Hardware and Software Engineers. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 2002.
[Lar21] Michael Larabel. Intel to disable TSX by default on more CPUs with new
microcode, June 2021. https://www.phoronix.com/scan.php?page=
news_item&px=Intel-TSX-Off-New-Microcode.
[LBD+ 04] James R. Larus, Thomas Ball, Manuvir Das, Robert DeLine, Manuel Fahndrich,
Jon Pincus, Sriram K. Rajamani, and Ramanathan Venkatapathy. Righting
software. IEEE Softw., 21(3):92–100, May 2004.
[Lea97] Doug Lea. Concurrent Programming in Java: Design Principles and Patterns.
Addison Wesley Longman, Reading, MA, USA, 1997.
[Lem18] Daniel Lemire. AVX-512: when and how to use these new instructions, Sep-
tember 2018. https://lemire.me/blog/2018/09/07/avx-512-when-
and-how-to-use-these-new-instructions/.
[LGW+ 15] H. Q. Le, G. L. Guthrie, D. E. Williams, M. M. Michael, B. G. Frey, W. J.
Starke, C. May, R. Odaira, and T. Nakaike. Transactional memory support in
the IBM POWER8 processor. IBM J. Res. Dev., 59(1):8:1–8:14, January 2015.
[LHF05] Michael Lyons, Bill Hay, and Brad Frey. PowerPC storage model and AIX
programming, November 2005. http://www.ibm.com/developerworks/
systems/articles/powerpc.html.
[Lis88] Barbara Liskov. Distributed programming in Argus. Commun. ACM,
31(3):300–312, 1988.
[LLO09] Yossi Lev, Victor Luchangco, and Marek Olszewski. Scalable reader-writer
locks. In SPAA ’09: Proceedings of the twenty-first annual symposium on
Parallelism in algorithms and architectures, pages 101–110, Calgary, AB,
Canada, 2009. ACM.
[LLS13] Yujie Liu, Victor Luchangco, and Michael Spear. Mindicators: A scalable
approach to quiescence. In Proceedings of the 2013 IEEE 33rd International
Conference on Distributed Computing Systems, ICDCS ’13, pages 206–215,
Washington, DC, USA, 2013. IEEE Computer Society.
[LMKM16] Lihao Liang, Paul E. McKenney, Daniel Kroening, and Tom Melham. Ver-
ification of the tree-based hierarchical read-copy update in the Linux ker-
nel. Technical report, Cornell University Library, October 2016. https:
//arxiv.org/abs/1610.03052.
[LMKM18] Lihao Liang, Paul E. McKenney, Daniel Kroening, and Tom Melham. Verifi-
cation of tree-based hierarchical Read-Copy Update in the Linux Kernel. In
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE),
Dresden, Germany, March 2018.
[Loc02] Doug Locke. Priority inheritance: The real story, July 2002. URL:
http://www.linuxdevices.com/articles/AT5698775833.html [bro-
ken, November 2016], page capture available at https://www.math.unipd.
it/%7Etullio/SCD/2007/Materiale/Locke.pdf.
[Lom77] D. B. Lomet. Process structuring, synchronization, and recovery using
atomic actions. SIGSOFT Softw. Eng. Notes, 2(2):128–137, 1977. URL:
http://portal.acm.org/citation.cfm?id=808319#.
[LR80] Butler W. Lampson and David D. Redell. Experience with processes and
monitors in Mesa. Communications of the ACM, 23(2):105–117, 1980.
[LS86] Vladimir Lanin and Dennis Shasha. A symmetric concurrent B-tree algorithm.
In ACM ’86: Proceedings of 1986 ACM Fall joint computer conference, pages
380–389, Dallas, Texas, United States, 1986. IEEE Computer Society Press.
[LS11] Yujie Liu and Michael Spear. Toxic transactions. In TRANSACT 2011, San
Jose, CA, USA, June 2011. ACM SIGPLAN.
[LSLK14] Carl Leonardsson, Kostis Sagonas, Truc Nguyen Lam, and Michalis Kokolo-
giannakis. Nidhugg, July 2014. https://github.com/nidhugg/nidhugg.
[LVK+ 17] Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer.
Repairing sequential consistency in C/C++11. SIGPLAN Not., 52(6):618–632,
June 2017.
[LZC14] Ran Liu, Heng Zhang, and Haibo Chen. Scalable read-mostly synchroniza-
tion using passive reader-writer locks. In 2014 USENIX Annual Technical
Conference (USENIX ATC 14), pages 219–230, Philadelphia, PA, June 2014.
USENIX Association.
[MAK+ 01] Paul E. McKenney, Jonathan Appavoo, Andi Kleen, Orran Krieger, Rusty
Russell, Dipankar Sarma, and Maneesh Soni. Read-copy update. In Ottawa
Linux Symposium, July 2001. URL: https://www.kernel.org/doc/ols/
2001/read-copy.pdf, http://www.rdrop.com/users/paulmck/RCU/
rclock_OLS.2001.05.01c.pdf.
[Mar17] Luc Maranget. AArch64 model vs. hardware, May 2017. http://pauillac.
inria.fr/~maranget/cats7/model-aarch64/specific.html.
[Mar18] Catalin Marinas. Queued spinlocks model, March 2018. https://
git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-
tla.git.
[Mas92] H. Massalin. Synthesis: An Efficient Implementation of Fundamental Op-
erating System Services. PhD thesis, Columbia University, New York, NY,
1992.
[Mat17] Norm Matloff. Programming on Parallel Machines. University of California,
Davis, Davis, CA, USA, 2017.
[MB20] Paul E. McKenney and Hans Boehm. P2055R0: A relaxed guide to mem-
ory_order_relaxed, January 2020. http://www.open-std.org/jtc1/
sc22/wg21/docs/papers/2020/p2055r0.pdf.
[MBM+ 06] Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and
David A. Wood. LogTM: Log-based transactional memory. In Proceed-
ings of the 12th Annual International Symposium on High Performance
Computer Architecture (HPCA-12), Austin, Texas, United States, 2006.
IEEE. Available: http://www.cs.wisc.edu/multifacet/papers/
hpca06_logtm.pdf [Viewed December 21, 2006].
[MBWW12] Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. RCU
usage in the Linux kernel: One decade later, September 2012. Tech-
nical report paulmck.2012.09.17, http://rdrop.com/users/paulmck/
techreports/survey.2012.09.17a.pdf.
[McK90] Paul E. McKenney. Stochastic fairness queuing. In IEEE INFO-
COM’90 Proceedings, pages 733–740, San Francisco, June 1990. The
Institute of Electrical and Electronics Engineers, Inc. Revision avail-
able: http://www.rdrop.com/users/paulmck/scalability/paper/
sfq.2002.06.04.pdf [Viewed May 26, 2008].
[McK95] Paul E. McKenney. Differential profiling. In MASCOTS 1995, pages 237–241,
Toronto, Canada, January 1995.
[McK96a] Paul E. McKenney. Pattern Languages of Program Design, volume 2, chapter
31: Selecting Locking Designs for Parallel Programs, pages 501–531. Addison-
Wesley, June 1996. Available: http://www.rdrop.com/users/paulmck/
scalability/paper/mutexdesignpat.pdf [Viewed February 17, 2005].
[McK96b] Paul E. McKenney. Selecting locking primitives for parallel programs.
Communications of the ACM, 39(10):75–82, October 1996.
[McK99] Paul E. McKenney. Differential profiling. Software - Practice and Experience,
29(3):219–234, 1999.
[McK01] Paul E. McKenney. RFC: patch to allow lock-free traversal of lists with
insertion, October 2001. Available: https://lore.kernel.org/lkml/
[email protected]/ [Viewed
January 05, 2021].
[McK03] Paul E. McKenney. Using RCU in the Linux 2.5 kernel. Linux Journal,
1(114):18–26, October 2003. Available: https://www.linuxjournal.
com/article/6993 [Viewed November 14, 2007].
[McK04] Paul E. McKenney. Exploiting Deferred Destruction: An Analysis of Read-
Copy-Update Techniques in Operating System Kernels. PhD thesis, OGI
School of Science and Engineering at Oregon Health and Sciences University,
2004.
[McK05a] Paul E. McKenney. Memory ordering in modern microprocessors, part I.
Linux Journal, 1(136):52–57, August 2005. Available: https://www.
linuxjournal.com/article/8211 http://www.rdrop.com/users/
paulmck/scalability/paper/ordering.2007.09.19a.pdf [Viewed
November 30, 2007].
[McK05b] Paul E. McKenney. Memory ordering in modern microprocessors, part II.
Linux Journal, 1(137):78–82, September 2005. Available: https://www.
linuxjournal.com/article/8212 http://www.rdrop.com/users/
paulmck/scalability/paper/ordering.2007.09.19a.pdf [Viewed
November 30, 2007].
[McK05c] Paul E. McKenney. A realtime preemption overview, August 2005. URL:
https://lwn.net/Articles/146861/.
[McK06] Paul E. McKenney. Sleepable RCU, October 2006. Available:
https://lwn.net/Articles/202847/ Revised: http://www.rdrop.
com/users/paulmck/RCU/srcu.2007.01.14a.pdf [Viewed August 21,
2006].
[McK07a] Paul E. McKenney. The design of preemptible read-copy-update, October
2007. Available: https://lwn.net/Articles/253651/ [Viewed October
25, 2007].
[McK07b] Paul E. McKenney. Immunize rcu_dereference() against crazy compiler
writers, October 2007. Git commit: https://git.kernel.org/linus/
97b430320ce7.
[McK07c] Paul E. McKenney. [PATCH] QRCU with lockless fastpath, February 2007.
Available: https://lkml.org/lkml/2007/2/25/18 [Viewed March 27,
2008].
[McK07d] Paul E. McKenney. Priority-boosting RCU read-side critical sections, February
2007. https://lwn.net/Articles/220677/.
[McK07e] Paul E. McKenney. RCU and unloadable modules, January 2007. Available:
https://lwn.net/Articles/217484/ [Viewed November 22, 2007].
[McK07f] Paul E. McKenney. Using Promela and Spin to verify parallel algorithms,
August 2007. Available: https://lwn.net/Articles/243851/ [Viewed
September 8, 2007].
[McK08a] Paul E. McKenney. Efficient support of consistent cyclic search with read-
copy update (lapsed). Technical Report US Patent 7,426,511, Assigned to
International Business Machines Corp, Washington, DC, September 2008.
[McK08b] Paul E. McKenney. Hierarchical RCU, November 2008. https://lwn.net/
Articles/305782/.
[McK08c] Paul E. McKenney. rcu: fix rcu_try_flip_waitack_needed() to prevent
grace-period stall, May 2008. Git commit: https://git.kernel.org/
linus/d7c0651390b6.
[McK08d] Paul E. McKenney. rcu: fix misplaced mb() in rcu_enter/exit_
nohz(), March 2008. Git commit: https://git.kernel.org/linus/
ae66be9b71b1.
[McK08e] Paul E. McKenney. RCU part 3: the RCU API, January 2008. Available:
https://lwn.net/Articles/264090/ [Viewed January 10, 2008].
[McK08f] Paul E. McKenney. "Tree RCU": scalable classic RCU implementa-
tion, December 2008. Git commit: https://git.kernel.org/linus/
64db4cfff99c.
[McK08g] Paul E. McKenney. What is RCU? part 2: Usage, January 2008. Available:
https://lwn.net/Articles/263130/ [Viewed January 4, 2008].
[McK09a] Paul E. McKenney. Re: [PATCH fyi] RCU: the bloatwatch edition, Janu-
ary 2009. Available: https://lkml.org/lkml/2009/1/14/449 [Viewed
January 15, 2009].
[McK09b] Paul E. McKenney. Transactional memory everywhere?, September 2009.
https://paulmck.livejournal.com/13841.html.
[McK10] Paul E. McKenney. Efficient support of consistent cyclic search with read-
copy update (lapsed). Technical Report US Patent 7,814,082, Assigned to
International Business Machines Corp, Washington, DC, October 2010.
[McK11a] Paul E. McKenney. 3.0 and RCU: what went wrong, July 2011. https:
//lwn.net/Articles/453002/.
[McK11b] Paul E. McKenney. Concurrent code and expensive instructions, January
2011. Available: https://lwn.net/Articles/423994 [Viewed January
28, 2011].
[McK11c] Paul E. McKenney. Transactional memory everywhere: HTM and
cache geometry, June 2011. https://paulmck.livejournal.com/tag/
transactional%20memory%20everywhere.
[McK11d] Paul E. McKenney. Validating memory barriers and atomic instructions,
December 2011. https://lwn.net/Articles/470681/.
[McK11e] Paul E. McKenney. Verifying parallel software: Can theory meet practice?,
January 2011. http://www.rdrop.com/users/paulmck/scalability/
paper/VericoTheoryPractice.2011.01.28a.pdf.
[McK12a] Paul E. McKenney. Beyond expert-only parallel programming? In Proceedings
of the 2012 ACM workshop on Relaxing synchronization for multicore and
manycore scalability, RACES ’12, pages 25–32, Tucson, Arizona, USA, 2012.
ACM.
[McK12b] Paul E. McKenney. Making RCU safe for battery-powered devices, Feb-
ruary 2012. Available: http://www.rdrop.com/users/paulmck/RCU/
RCUdynticks.2012.02.15b.pdf [Viewed March 1, 2012].
[McK12c] Paul E. McKenney. Retrofitted parallelism considered grossly sub-optimal. In
4th USENIX Workshop on Hot Topics on Parallelism, page 7, Berkeley, CA,
USA, June 2012.
[McK12d] Paul E. McKenney. Signed overflow optimization hazards in the kernel, August
2012. https://lwn.net/Articles/511259/.
[McK12e] Paul E. McKenney. Transactional memory everywhere: Hardware
transactional lock elision, May 2012. Available: https://paulmck.
livejournal.com/32267.html [Viewed January 28, 2021].
[McK13] Paul E. McKenney. Structured deferral: synchronization via procrastination.
Commun. ACM, 56(7):40–49, July 2013.
[McK14a] Paul E. McKenney. C++ memory model meets high-update-rate data struc-
tures, September 2014. http://www2.rdrop.com/users/paulmck/RCU/
C++Updates.2014.09.11a.pdf.
[McK14b] Paul E. McKenney. Efficient support of consistent cyclic search with read-
copy update (lapsed). Technical Report US Patent 8,874,535, Assigned to
International Business Machines Corp, Washington, DC, October 2014.
[McK14c] Paul E. McKenney. Is Parallel Programming Hard, And, If So, What
Can You Do About It? (First Edition). kernel.org, Corvallis, OR, USA,
2014. https://kernel.org/pub/linux/kernel/people/paulmck/
perfbook/perfbook-e1.html.
[McK14d] Paul E. McKenney. N4037: Non-transactional implementation of atomic tree
move, May 2014. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2014/n4037.pdf.
[McK14e] Paul E. McKenney. Proper care and feeding of return values from
rcu_dereference(), February 2014. https://www.kernel.org/doc/
Documentation/RCU/rcu_dereference.txt.
[McK14f] Paul E. McKenney. The RCU API, 2014 edition, September 2014. https:
//lwn.net/Articles/609904/.
[McK14g] Paul E. McKenney. Recent read-mostly research, November 2014. https:
//lwn.net/Articles/619355/.
[McK15a] Paul E. McKenney. Formal verification and Linux-kernel concurrency. In Com-
positional Verification Methods for Next-Generation Concurrency, Dagstuhl
Seminar Proceedings, Dagstuhl, Germany, 2015. Schloss Dagstuhl - Leibniz-
Zentrum fuer Informatik, Germany.
[McK15b] Paul E. McKenney. High-performance and scalable updates: The Issaquah
challenge, January 2015. http://www2.rdrop.com/users/paulmck/
scalability/paper/Updates.2015.01.16b.LCA.pdf.
[McK15c] Paul E. McKenney. [PATCH tip/core/rcu 01/10] rcu: Make
rcu_nmi_enter() handle nesting, January 2015. https:
//lore.kernel.org/lkml/1420651257-553-1-git-send-email-
[email protected]/.
[McK15d] Paul E. McKenney. Practical experience with formal verification tools. In Veri-
fied Trustworthy Software Systems Specialist Meeting. The Royal Society, April
2015. http://www.rdrop.com/users/paulmck/scalability/paper/
Validation.2016.04.06e.SpecMtg.pdf.
[McK15e] Paul E. McKenney. RCU requirements part 2 — parallelism and software
engineering, August 2015. https://lwn.net/Articles/652677/.
[McK15f] Paul E. McKenney. RCU requirements part 3, August 2015. https://lwn.
net/Articles/653326/.
[McK15g] Paul E. McKenney. Re: [patch tip/locking/core v4 1/6] powerpc:
atomic: Make *xchg and *cmpxchg a full barrier, October 2015. Email
thread: https://lore.kernel.org/lkml/20151014201916.GB3910@
linux.vnet.ibm.com/.
[McK15h] Paul E. McKenney. Requirements for RCU part 1: the fundamentals, July
2015. https://lwn.net/Articles/652156/.
[McK16a] Paul E. McKenney. Beyond the Issaquah challenge: High-performance
scalable complex updates, September 2016. http://www2.rdrop.com/
users/paulmck/RCU/Updates.2016.09.19i.CPPCON.pdf.
[McK16b] Paul E. McKenney. High-performance and scalable updates: The Issaquah
challenge, June 2016. http://www2.rdrop.com/users/paulmck/RCU/
Updates.2016.06.01e.ACM.pdf.
[McK17] Paul E. McKenney. Verification challenge 6: Linux-kernel Tree RCU, June
2017. https://paulmck.livejournal.com/46993.html.
[McK19a] Paul E. McKenney. A critical RCU safety property is... Ease of use! In
Proceedings of the 12th ACM International Conference on Systems and Storage,
SYSTOR ’19, pages 132–143, Haifa, Israel, 2019. ACM.
[McK19b] Paul E. McKenney. The RCU API, 2019 edition, January 2019. https:
//lwn.net/Articles/777036/.
[McK19c] Paul E. McKenney. RCU’s first-ever CVE, and how I lived to tell the
tale, January 2019. linux.conf.au Slides: http://www.rdrop.com/users/
paulmck/RCU/cve.2019.01.23e.pdf Video: https://www.youtube.
com/watch?v=hZX1aokdNiY.
[MCM02] Paul E. McKenney, Kevin A. Closson, and Raghupathi Malige. Lingering locks
with fairness control for multi-node computer systems. US Patent 6,480,918,
Assigned to International Business Machines Corporation, Washington, DC,
November 2002.
[MCS91] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable syn-
chronization on shared-memory multiprocessors. ACM Transactions on Computer
Systems, 9(1):21–65, February 1991.
[MD92] Paul E. McKenney and Ken F. Dove. Efficient demultiplexing of incoming TCP
packets. In SIGCOMM ’92, Proceedings of the Conference on Communications
Architecture & Protocols, pages 269–279, Baltimore, MD, August 1992.
Association for Computing Machinery.
[MDJ13a] Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan. URCU-protected
hash tables, November 2013. https://lwn.net/Articles/573431/.
[MDJ13b] Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan. URCU-protected
queues and stacks, November 2013. https://lwn.net/Articles/
573433/.
[MDJ13c] Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan. User-space RCU,
November 2013. https://lwn.net/Articles/573424/.
[MDR16] Paul E. McKenney, Will Deacon, and Luis R. Rodriguez. Semantics of MMIO
mapping attributes across architectures, August 2016. https://lwn.net/
Articles/698014/.
[MDSS20] Hans Meuer, Jack Dongarra, Erich Strohmaier, and Horst Simon. Top 500: The
list, November 2020. Available: https://top500.org/lists/ [Viewed
March 6, 2021].
[Men16] Alexis Menard. Move OneWriterSeqLock and SharedMemorySe-
qLockBuffer from content/ to device/base/synchronization, September
2016. https://source.chromium.org/chromium/chromium/src/+/
b39a3082846d5877a15e8b7e18d66cb142abe8af.
[Mer11] Rick Merritt. IBM plants transactional memory in CPU, August 2011.
EE Times https://www.eetimes.com/ibm-plants-transactional-
memory-in-cpu/.
[Met99] Panagiotis Takis Metaxas. Fast dithering on a data-parallel computer. In
Proceedings of the IASTED International Conference on Parallel and Distrib-
uted Computing and Systems, pages 570–576, Cambridge, MA, USA, 1999.
IASTED.
[MG92] Paul E. McKenney and Gary Graunke. Efficient buffer allocation on shared-
memory multiprocessors. In IEEE Workshop on the Architecture and Imple-
mentation of High Performance Communication Subsystems, pages 194–199,
Tucson, AZ, February 1992. The Institute of Electrical and Electronics Engi-
neers, Inc.
[MGM+ 09] Paul E. McKenney, Manish Gupta, Maged M. Michael, Phil Howard, Joshua
Triplett, and Jonathan Walpole. Is parallel programming hard, and if so,
why? Technical Report TR-09-02, Portland State University, Portland, OR,
USA, February 2009. URL: https://archives.pdx.edu/ds/psu/10386
[Viewed February 13, 2021].
[MHS12] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip coherence
is here to stay. Communications of the ACM, 55(7):78–89, July 2012.
[Mic02] Maged M. Michael. Safe memory reclamation for dynamic lock-free objects
using atomic reads and writes. In Proceedings of the 21st Annual ACM
Symposium on Principles of Distributed Computing, pages 21–30, August
2002.
[Mic03] Maged M. Michael. CAS-based lock-free algorithm for shared deques. In
Harald Kosch, László Böszörményi, and Hermann Hellwagner, editors, Euro-
Par, volume 2790 of Lecture Notes in Computer Science, pages 651–660.
Springer, 2003.
[Mic04a] Maged M. Michael. Hazard pointers: Safe memory reclamation for lock-free
objects. IEEE Transactions on Parallel and Distributed Systems, 15(6):491–
504, June 2004.
[Mic04b] Maged M. Michael. Scalable lock-free dynamic memory allocation. SIGPLAN
Not., 39(6):35–46, 2004.
[Mic08] Microsoft. FlushProcessWriteBuffers function, 2008.
https://docs.microsoft.com/en-us/windows/desktop/
api/processthreadsapi/nf-processthreadsapi-
flushprocesswritebuffers.
[Mic18] Maged Michael. Rewrite from experimental, use of de-
terministic schedule, improvements, June 2018. Hazard
pointers: https://github.com/facebook/folly/commit/
d42832d2a529156275543c7fa7183e1321df605d.
[Mil06] David S. Miller. Re: [PATCH, RFC] RCU : OOM avoidance and lower
latency, January 2006. Available: https://lkml.org/lkml/2006/1/7/22
[Viewed February 29, 2012].
[MJST16] Paul E. McKenney, Alan Jeffrey, Ali Sezgin, and Tony Tye. Out-of-thin-
air execution is vacuous, July 2016. http://www.open-std.org/jtc1/
sc22/wg21/docs/papers/2016/p0422r0.html.
[MK88] Marshall Kirk McKusick and Michael J. Karels. Design of a general purpose
memory allocator for the 4.3BSD UNIX kernel. In USENIX Conference
Proceedings, Berkeley CA, June 1988.
[MKM12] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache craftiness
for fast multicore key-value storage. In Proceedings of the 7th ACM Euro-
pean Conference on Computer Systems, EuroSys ’12, pages 183–196, Bern,
Switzerland, 2012. ACM.
[ML82] Udi Manber and Richard E. Ladner. Concurrency control in a dynamic search
structure. Technical Report 82-01-01, Department of Computer Science,
University of Washington, Seattle, Washington, January 1982.
[ML84] Udi Manber and Richard E. Ladner. Concurrency control in a dynamic search
structure. ACM Transactions on Database Systems, 9(3):439–455, September
1984.
[MLH94] Peter Magnusson, Anders Landin, and Erik Hagersten. Efficient software
synchronization on large cache coherent multiprocessors. Technical Report
T94:07, Swedish Institute of Computer Science, Kista, Sweden, February
1994.
[MM00] Ingo Molnar and David S. Miller. brlock, March 2000. URL:
http://kernel.nic.funet.fi/pub/linux/kernel/v2.3/patch-
html/patch-2.3.49/linux_include_linux_brlock.h.html.
[MMM+ 20] Paul E. McKenney, Maged Michael, Jens Maurer, Peter Sewell, Martin
Uecker, Hans Boehm, Hubert Tong, Niall Douglas, Thomas Rodgers, Will
Deacon, Michael Wong, David Goldblatt, Kostya Serebryany, and Anthony
Williams. P1726R4: Pointer lifetime-end zap, July 2020. http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2020/p1726r4.pdf.
[MMS19] Paul E. McKenney, Maged Michael, and Peter Sewell. N2369: Pointer
lifetime-end zap, April 2019. http://www.open-std.org/jtc1/sc22/
wg14/www/docs/n2369.pdf.
[MMTW10] Paul E. McKenney, Maged M. Michael, Josh Triplett, and Jonathan Walpole.
Why the grass may not be greener on the other side: a comparison of locking
vs. transactional memory. ACM Operating Systems Review, 44(3), July 2010.
[MMW07] Paul E. McKenney, Maged Michael, and Jonathan Walpole. Why the grass may
not be greener on the other side: A comparison of locking vs. transactional
memory. In Programming Languages and Operating Systems, pages 1–5,
Stevenson, Washington, USA, October 2007. ACM SIGOPS.
[Mol05] Ingo Molnar. Index of /pub/linux/kernel/projects/rt, February 2005. URL:
https://www.kernel.org/pub/linux/kernel/projects/rt/.
[Mol06] Ingo Molnar. Lightweight robust futexes, March 2006. Available: https://
www.kernel.org/doc/Documentation/robust-futexes.txt [Viewed
February 14, 2021].
[Moo65] Gordon E. Moore. Cramming more components onto integrated circuits.
Electronics, 38(8):114–117, April 1965.
[Moo03] Gordon Moore. No exponential is forever–but we can delay forever. In IBM
Academy of Technology 2003 Annual Meeting, San Francisco, CA, October
2003.
[Mor07] Richard Morris. Sir Tony Hoare: Geek of the week, August
2007. https://www.red-gate.com/simple-talk/opinion/geek-of-
the-week/sir-tony-hoare-geek-of-the-week/.
[MOZ09] Nicholas Mc Guire, Peter Odhiambo Okech, and Qingguo Zhou. Analysis
of inherent randomness of the Linux kernel. In Eleventh Real Time Linux
Workshop, Dresden, Germany, September 2009.
[MP15a] Paul E. McKenney and Aravinda Prasad. Recent read-mostly research in 2015,
December 2015. https://lwn.net/Articles/667593/.
[MP15b] Paul E. McKenney and Aravinda Prasad. Some more details on read-log-
update, December 2015. https://lwn.net/Articles/667720/.
[MPA+ 06] Paul E. McKenney, Chris Purcell, Algae, Ben Schumin, Gaius Cornelius,
Qwertyus, Neil Conway, Sbw, Blainster, Canis Rufus, Zoicon5, Anome, and
Hal Eisen. Read-copy update, July 2006. https://en.wikipedia.org/
wiki/Read-copy-update.
[MPI08] MPI Forum. Message passing interface forum, September 2008. Available:
http://www.mpi-forum.org/ [Viewed September 9, 2008].
[MR08] Paul E. McKenney and Steven Rostedt. Integrating and validating dynticks and
preemptable RCU, April 2008. Available: https://lwn.net/Articles/
279077/ [Viewed April 24, 2008].
[MRP+ 17] Paul E. McKenney, Torvald Riegel, Jeff Preshing, Hans Boehm, Clark Nelson,
Olivier Giroux, Lawrence Crowl, JF Bastian, and Michael Wong. Marking
memory order consume dependency chains, February 2017. http://www.
open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0462r1.pdf.
[MS93] Paul E. McKenney and Jack Slingwine. Efficient kernel memory allocation
on shared-memory multiprocessors. In USENIX Conference Proceedings,
pages 295–306, Berkeley CA, February 1993. USENIX Association. Avail-
able: http://www.rdrop.com/users/paulmck/scalability/paper/
mpalloc.pdf [Viewed January 30, 2005].
[MS95] Maged M. Michael and Michael L. Scott. Correction of a memory management
method for lock-free data structures, December 1995. Technical Report TR599.
[MS96] Maged M. Michael and Michael L. Scott. Simple, fast, and practical non-blocking
and blocking concurrent queue algorithms. In Proc of the Fifteenth ACM
Symposium on Principles of Distributed Computing, pages 267–275, May
1996.
[MS98a] Paul E. McKenney and John D. Slingwine. Read-copy update: Using execution
history to solve concurrency problems. In Parallel and Distributed Computing
and Systems, pages 509–518, Las Vegas, NV, October 1998.
[MS98b] Maged M. Michael and Michael L. Scott. Nonblocking algorithms and
preemption-safe locking on multiprogrammed shared memory multiprocessors.
J. Parallel Distrib. Comput., 51(1):1–26, 1998.
[MS01] Paul E. McKenney and Dipankar Sarma. Read-copy update mutual exclusion
in Linux, February 2001. Available: http://lse.sourceforge.net/
locking/rcu/rcupdate_doc.html [Viewed October 18, 2004].
[MS08] MySQL AB and Sun Microsystems. MySQL Downloads, November 2008.
Available: http://dev.mysql.com/downloads/ [Viewed November 26,
2008].
[MS09] Paul E. McKenney and Raul Silvera. Example POWER im-
plementation for C/C++ memory model, February 2009. Avail-
able: http://www.rdrop.com/users/paulmck/scalability/paper/
N2745r.2009.02.27a.html [Viewed: April 5, 2009].
[MS12] Alexander Matveev and Nir Shavit. Towards a fully pessimistic STM model.
In TRANSACT 2012, San Jose, CA, USA, February 2012. ACM SIGPLAN.
[MS14] Paul E. McKenney and Alan Stern. Axiomatic validation of memory barri-
ers and atomic instructions, August 2014. https://lwn.net/Articles/
608550/.
[MS18] Luc Maranget and Alan Stern. lock.cat, May 2018. https://github.com/
torvalds/linux/blob/master/tools/memory-model/lock.cat.
[MSA+ 02] Paul E. McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen, Orran
Krieger, and Rusty Russell. Read-copy update. In Ottawa Linux Symposium,
pages 338–367, June 2002. Available: https://www.kernel.org/doc/
ols/2002/ols2002-pages-338-367.pdf [Viewed February 14, 2021].
[MSFM15] Alexander Matveev, Nir Shavit, Pascal Felber, and Patrick Marlier. Read-log-
update: A lightweight synchronization mechanism for concurrent program-
ming. In Proceedings of the 25th Symposium on Operating Systems Principles,
SOSP ’15, pages 168–183, Monterey, California, 2015. ACM.
[MSK01] Paul E. McKenney, Jack Slingwine, and Phil Krueger. Experience with an
efficient parallel kernel memory allocator. Software – Practice and Experience,
31(3):235–257, March 2001.
[MSM05] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns
for Parallel Programming. Addison Wesley, Boston, MA, USA, 2005.
[MSS04] Paul E. McKenney, Dipankar Sarma, and Maneesh Soni. Scaling dcache with
RCU. Linux Journal, 1(118):38–46, January 2004.
[MSS12] Luc Maranget, Susmit Sarkar, and Peter Sewell. A tutorial introduction to
the ARM and POWER relaxed memory models, October 2012. https:
//www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf.
[MT01] Jose F. Martinez and Josep Torrellas. Speculative locks for concurrent execu-
tion of critical sections in shared-memory multiprocessors. In Workshop on
Memory Performance Issues, International Symposium on Computer Archi-
tecture, Gothenburg, Sweden, June 2001. Available: https://iacoma.cs.
uiuc.edu/iacoma-papers/wmpi_locks.pdf [Viewed June 23, 2004].
[MT02] Jose F. Martinez and Josep Torrellas. Speculative synchronization: Applying
thread-level speculation to explicitly parallel applications. In Proceedings of
the 10th International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 18–29, San Jose, CA, October 2002.
[Mud01] Trevor Mudge. Power: A first-class architectural design constraint. IEEE
Computer, 34(4):52–58, April 2001.
[Mus04] Museum Victoria Australia. CSIRAC: Australia’s first computer, 2004. URL:
http://museumvictoria.com.au/csirac/.
[MW05] Paul E. McKenney and Jonathan Walpole. RCU semantics: A first attempt,
January 2005. Available: http://www.rdrop.com/users/paulmck/RCU/
rcu-semantics.2005.01.30a.pdf [Viewed December 6, 2009].
[MW07] Paul E. McKenney and Jonathan Walpole. What is RCU, fundamentally?, De-
cember 2007. Available: https://lwn.net/Articles/262464/ [Viewed
December 27, 2007].
[MW11] Paul E. McKenney and Jonathan Walpole. Efficient support of consistent
cyclic search with read-copy update and parallel updates (lapsed). Technical
Report US Patent 7,953,778, Assigned to International Business Machines
Corp, Washington, DC, May 2011.
[MWB+ 17] Paul E. McKenney, Michael Wong, Hans Boehm, Jens Maurer, Jeffrey Yasskin,
and JF Bastien. P0190R4: Proposal for new memory_order_consume
definition, July 2017. http://www.open-std.org/jtc1/sc22/wg21/
docs/papers/2017/p0190r4.pdf.
[MWPF18] Paul E. McKenney, Ulrich Weigand, Andrea Parri, and Boqun Feng. Linux-
kernel memory model, September 2018. http://www.open-std.org/
jtc1/sc22/wg21/docs/papers/2018/p0124r6.html.
[Mye79] Glenford J. Myers. The Art of Software Testing. Wiley, 1979.
[NA18] Catherine E. Nemitz and James H. Anderson. Work-in-progress: Lock-based
software transactional memory for real-time systems. In 2018 IEEE Real-Time
Systems Symposium, RTSS’18, pages 147–150, Nashville, TN, USA, 2018.
IEEE.
[Nag18] Honnappa Nagarahalli. rcu: add RCU library supporting QSBR mechanism,
May 2018. https://git.dpdk.org/dpdk/tree/lib/librte_rcu.
[Nata] National Institute of Standards and Technology. SI Unit rules and style
conventions [online].
[Natb] National Institute of Standards and Technology. Typefaces for Symbols in
Scientific Manuscripts [online].
[Nat19] National Institute of Standards and Technology. The international system of
units (SI). Technical Report NIST Special Publication 330 2019 EDITION,
U.S. Department of Commerce, Washington, D.C., 2019.
[Nes06a] Oleg Nesterov. Re: [patch] cpufreq: mark cpufreq_tsc() as
core_initcall_sync, November 2006. Available: https://lkml.org/
lkml/2006/11/19/69 [Viewed May 28, 2007].
[Nes06b] Oleg Nesterov. Re: [rfc, patch 1/2] qrcu: "quick" srcu implementation,
November 2006. Available: https://lkml.org/lkml/2006/11/29/330
[Viewed November 26, 2008].
[NSHW20] Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer
on Memory Consistency and Cache Coherence, Second Edition. Synthesis
Lectures on Computer Architecture. Morgan & Claypool, 2020.
[NVi17a] NVidia. Accelerated computing — training, January 2017. https://
developer.nvidia.com/accelerated-computing-training.
[NVi17b] NVidia. Existing university courses, January 2017. https://developer.
nvidia.com/educators/existing-courses.
[O’H19] Peter W. O’Hearn. Incorrectness logic. Proc. ACM Program. Lang., 4(POPL),
December 2019.
[OHOC20] Robert O’Callahan, Kyle Huey, Devon O’Dell, and Terry Coatta. To catch
a failure: The record-and-replay approach to debugging: A discussion
with Robert O’Callahan, Kyle Huey, Devon O’Dell, and Terry Coatta. Queue,
18(1):61–79, February 2020.
[ON07] Robert Olsson and Stefan Nilsson. TRASH: A dynamic LC-trie and hash
data structure. In Workshop on High Performance Switching and Routing
(HPSR’07), May 2007.
[ONH+ 96] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and
Kunyung Chang. The case for a single-chip multiprocessor. In ASPLOS VII,
Cambridge, MA, USA, October 1996.
[Ope97] Open Group. The single UNIX specification, version 2: Threads, 1997.
Available: http://www.opengroup.org/onlinepubs/007908799/
xsh/threads.html [Viewed September 19, 2008].
[ORY01] Peter W. O’Hearn, John C. Reynolds, and Hongseok Yang. Local reasoning
about programs that alter data structures. In Proceedings of the 15th Inter-
national Workshop on Computer Science Logic, CSL ’01, page 1–19, Berlin,
Heidelberg, 2001. Springer-Verlag.
[PAB+ 95] Calton Pu, Tito Autrey, Andrew Black, Charles Consel, Crispin Cowan, Jon
Inouye, Lakshmi Kethana, Jonathan Walpole, and Ke Zhang. Optimistic
incremental specialization: Streamlining a commercial operating system. In
15th ACM Symposium on Operating Systems Principles (SOSP’95), pages
314–321, Copper Mountain, CO, December 1995.
[Pat10] David Patterson. The trouble with multicore. IEEE Spectrum, 2010:28–32,
52–53, July 2010.
[PAT11] V. Pankratius and A. R. Adl-Tabatabai. A study of transactional memory vs.
locks in practice. In Proceedings of the 23rd ACM symposium on Parallelism
in algorithms and architectures (2011), SPAA ’11, pages 43–52, San Jose,
CA, USA, 2011. ACM.
[PBCE20] Elizabeth Patitsas, Jesse Berlin, Michelle Craig, and Steve Easterbrook.
Evidence that computer science grades are not bimodal. Commun. ACM,
63(1):91–98, January 2020.
[PD11] Martin Pohlack and Stephan Diestelhorst. From lightweight hardware transac-
tional memory to lightweight lock elision. In TRANSACT 2011, San Jose, CA,
USA, June 2011. ACM SIGPLAN.
[Pen18] Roman Penyaev. [PATCH v2 01/26] introduce list_next_or_null_rr_rcu(),
May 2018. https://lkml.kernel.org/r/20180518130413.16997-2-
[email protected].
[Pet06] Jeremy Peters. From Reuters, automatic trading linked to news events, Decem-
ber 2006. URL: http://www.nytimes.com/2006/12/11/technology/
11reuters.html?ei=5088&en=e5e9416415a9eeb2&ex=1323493200.
..
[Pig06] Nick Piggin. [patch 3/3] radix-tree: RCU lockless readside, June 2006.
Available: https://lkml.org/lkml/2006/6/20/238 [Viewed March 25,
2008].
[Pik17] Fedor G. Pikus. Read, copy, update... Then what?, September 2017. https:
//www.youtube.com/watch?v=rxQ5K9lo034.
[PMDY20] SeongJae Park, Paul E. McKenney, Laurent Dufour, and Heon Y. Yeom.
An HTM-based update-side synchronization for RCU on NUMA systems. In
Proceedings of the Fifteenth European Conference on Computer Systems,
EuroSys ’20, Heraklion, Greece, 2020. Association for Computing Machinery.
[Pod10] Andrej Podzimek. Read-copy-update for OpenSolaris. Master’s thesis, Charles
University in Prague, 2010.
[Pok16] Michael Pokorny. The deadlock empire, February 2016. https://
deadlockempire.github.io/.
[Pos08] PostgreSQL Global Development Group. PostgreSQL, November 2008.
Available: https://www.postgresql.org/ [Viewed November 26, 2008].
[Pug90] William Pugh. Concurrent maintenance of skip lists. Technical Report
CS-TR-2222.1, Institute of Advanced Computer Science Studies, Department
of Computer Science, University of Maryland, College Park, Maryland, June
1990.
[Pug00] William Pugh. Reordering on an Alpha processor, 2000. Available: https://
www.cs.umd.edu/~pugh/java/memoryModel/AlphaReordering.html
[Viewed: June 23, 2004].
[Pul00] Geoffrey K. Pullum. How Dr. Seuss would prove the halting problem
undecidable. Mathematics Magazine, 73(4):319–320, 2000. http://www.
lel.ed.ac.uk/~gpullum/loopsnoop.html.
[PW07] Donald E. Porter and Emmett Witchel. Lessons from large
transactional systems, December 2007. Personal communication
<[email protected]>.
[Ras14] Mindaugas Rasiukevicius. NPF—progress and perspective. In AsiaBSDCon,
Tokyo, Japan, March 2014.
[Ras16] Mindaugas Rasiukevicius. Quiescent-state and epoch based reclamation, July
2016. https://github.com/rmind/libqsbr.
[Ray99] Eric S. Raymond. The Cathedral and the Bazaar: Musings on Linux and
Open Source by an Accidental Revolutionary. O’Reilly, 1999.
[RC15] Pedro Ramalhete and Andreia Correia. Poor man’s URCU, August
2015. https://github.com/pramalhe/ConcurrencyFreaks/blob/
master/papers/poormanurcu-2015.pdf.
[RD12] Ravi Rajwar and Martin Dixon. Intel transactional synchronization extensions,
September 2012. Intel Developer Forum (IDF) 2012 ARCS004.
[Reg10] John Regehr. A guide to undefined behavior in C and C++, part 1, July 2010.
https://blog.regehr.org/archives/213.
[Rei07] James Reinders. Intel Threading Building Blocks. O’Reilly, Sebastopol, CA,
USA, 2007.
[RG01] Ravi Rajwar and James R. Goodman. Speculative lock elision: Enabling
highly concurrent multithreaded execution. In Proceedings of the 34th An-
nual ACM/IEEE International Symposium on Microarchitecture, MICRO 34,
pages 294–305, Austin, TX, December 2001. The Institute of Electrical and
Electronics Engineers, Inc.
[RG02] Ravi Rajwar and James R. Goodman. Transactional lock-free execution of
lock-based programs. In Proceedings of the 10th International Conference on
Architectural Support for Programming Languages and Operating Systems,
pages 5–17, Austin, TX, October 2002.
[RH02] Zoran Radović and Erik Hagersten. Efficient synchronization for nonuniform
communication architectures. In Proceedings of the 2002 ACM/IEEE Confer-
ence on Supercomputing, pages 1–13, Baltimore, Maryland, USA, November
2002. The Institute of Electrical and Electronics Engineers, Inc.
[RH03] Zoran Radović and Erik Hagersten. Hierarchical backoff locks for nonuniform
communication architectures. In Proceedings of the Ninth International
Symposium on High Performance Computer Architecture (HPCA-9), pages
241–252, Anaheim, California, USA, February 2003.
[RH18] Geoff Romer and Andrew Hunter. An RAII interface for deferred reclama-
tion, March 2018. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2018/p0561r4.html.
[RHP+ 07] Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E.
Ramadan, Aditya Bhandari, and Emmett Witchel. TxLinux: Using and
managing hardware transactional memory in an operating system. In
SOSP’07: Twenty-First ACM Symposium on Operating Systems Princi-
ples, Stevenson, WA, USA, October 2007. ACM SIGOPS. Available:
http://www.sosp2007.org/papers/sosp056-rossbach.pdf [Viewed
October 21, 2007].
[Rin13] Martin Rinard. Parallel synchronization-free approximate data structure
construction. In Proceedings of the 5th USENIX Conference on Hot Topics in
Parallelism, HotPar’13, page 6, San Jose, CA, 2013. USENIX Association.
[RKM+ 10] Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I.
August. Speculative parallelization using software multi-threaded transactions.
SIGARCH Comput. Archit. News, 38(1):65–76, 2010.
[RLPB18] Yuxin Ren, Guyue Liu, Gabriel Parmer, and Björn Brandenburg. Scalable
memory reclamation for multi-core, real-time systems. In Proceedings of the
2018 IEEE Real-Time and Embedded Technology and Applications Symposium
(RTAS), page 12, Porto, Portugal, April 2018. IEEE.
[RMF19] Federico Reghenzani, Giuseppe Massari, and William Fornaciari. The real-
time Linux kernel: A survey on PREEMPT_RT. ACM Comput. Surv.,
52(1):18:1–18:36, February 2019.
[Ros06] Steven Rostedt. Lightweight PI-futexes, June 2006. Available: https:
//www.kernel.org/doc/html/latest/locking/pi-futex.html
[Viewed February 14, 2021].
[Ros10a] Steven Rostedt. tracing: Harry Potter and the Deathly Macros, December
2010. Available: https://lwn.net/Articles/418710/ [Viewed: August
28, 2011].
[Ros10b] Steven Rostedt. Using the TRACE_EVENT() macro (part 1), March 2010.
Available: https://lwn.net/Articles/379903/ [Viewed: August 28,
2011].
[Ros10c] Steven Rostedt. Using the TRACE_EVENT() macro (part 2), March 2010.
Available: https://lwn.net/Articles/381064/ [Viewed: August 28,
2011].
[Ros10d] Steven Rostedt. Using the TRACE_EVENT() macro (part 3), April 2010.
Available: https://lwn.net/Articles/383362/ [Viewed: August 28,
2011].
[Ros11] Steven Rostedt. lockdep: How to read its cryptic output, September 2011.
http://www.linuxplumbersconf.org/2011/ocw/sessions/153.
[Roy17] Lance Roy. rcutorture: Add CBMC-based formal verification for
SRCU, January 2017. URL: https://www.spinics.net/lists/kernel/
msg2421833.html.
[RR20] Sergio Rajsbaum and Michel Raynal. Mastering concurrent computing through
sequential thinking. Commun. ACM, 63(1):78–87, January 2020.
[RSB+ 97] Rajeev Rastogi, S. Seshadri, Philip Bohannon, Dennis W. Leinbaugh, Abraham
Silberschatz, and S. Sudarshan. Logical and physical versioning in main
memory databases. In Proceedings of the 23rd International Conference on
Very Large Data Bases, VLDB ’97, pages 86–95, San Francisco, CA, USA,
August 1997. Morgan Kaufmann Publishers Inc.
[RTY+ 87] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron,
David Black, William Bolosky, and Jonathan Chew. Machine-independent
virtual memory management for paged uniprocessor and multiprocessor
architectures. In 2nd Symposium on Architectural Support for Programming
Languages and Operating Systems, pages 31–39, Palo Alto, CA, October
1987. Association for Computing Machinery.
[Rus00a] Rusty Russell. Re: modular net drivers, June 2000. URL: http://oss.
sgi.com/projects/netdev/archive/2000-06/msg00250.html [bro-
ken, February 15, 2021].
[Rus00b] Rusty Russell. Re: modular net drivers, June 2000. URL: http://oss.
sgi.com/projects/netdev/archive/2000-06/msg00254.html [bro-
ken, February 15, 2021].
[Rus03] Rusty Russell. Hanging out with smart people: or... things I learned being a
kernel monkey, July 2003. 2003 Ottawa Linux Symposium Keynote https://
ozlabs.org/~rusty/ols-2003-keynote/ols-keynote-2003.html.
[Rut17] Mark Rutland. compiler.h: Remove ACCESS_ONCE(), November 2017. Git
commit: https://git.kernel.org/linus/b899a850431e.
[SAE+ 18] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and
Ciera Jaspan. Lessons from building static analysis tools at google. Commun.
ACM, 61(4):58–66, March 2018.
[SAH+ 03] Craig A. N. Soules, Jonathan Appavoo, Kevin Hui, Dilma Da Silva, Gre-
gory R. Ganger, Orran Krieger, Michael Stumm, Robert W. Wisniewski, Marc
Auslander, Michal Ostrowski, Bryan Rosenburg, and Jimi Xenidis. System
support for online reconfiguration. In Proceedings of the 2003 USENIX
Annual Technical Conference, pages 141–154, San Antonio, Texas, USA, June
2003. USENIX Association.
[SATG+ 09] Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Robert Geva, Yang Ni, and
Adam Welc. Towards transactional memory semantics for C++. In SPAA ’09:
Proceedings of the twenty-first annual symposium on Parallelism in algorithms
and architectures, pages 49–58, Calgary, AB, Canada, 2009. ACM.
[SBN+ 20] Dimitrios Siakavaras, Panagiotis Billis, Konstantinos Nikas, Georgios Goumas,
and Nectarios Koziris. Efficient concurrent range queries in B+-trees using
RCU-HTM. In Proceedings of the 32nd ACM Symposium on Parallelism in
Algorithms and Architectures, SPAA ’20, page 571–573, Virtual Event, USA,
2020. Association for Computing Machinery.
[SBV10] Martin Schoeberl, Florian Brandner, and Jan Vitek. RTTM: Real-time
transactional memory. In Proceedings of the 2010 ACM Symposium on
Applied Computing, pages 326–333, 01 2010.
[Sch35] E. Schrödinger. Die gegenwärtige Situation in der Quantenmechanik. Natur-
wissenschaften, 23:807–812; 823–828; 844–849, November 1935.
[Sch94] Curt Schimmel. UNIX Systems for Modern Architectures: Symmetric Multi-
processing and Caching for Kernel Programmers. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1994.
[Sco06] Michael Scott. Programming Language Pragmatics. Morgan Kaufmann,
Burlington, MA, USA, 2006.
[Sco13] Michael L. Scott. Shared-Memory Synchronization. Morgan & Claypool, San
Rafael, CA, USA, 2013.
[Sco15] Michael Scott. Programming Language Pragmatics, 4th Edition. Morgan
Kaufmann, Burlington, MA, USA, 2015.
[Seq88] Sequent Computer Systems, Inc. Guide to Parallel Programming, 1988.
[Sew] Peter Sewell. Relaxed-memory concurrency. Available: https://www.cl.
cam.ac.uk/~pes20/weakmemory/ [Viewed: February 15, 2021].
[Sey12] Justin Seyster. Runtime Verification of Kernel-Level Concurrency Using
Compiler-Based Instrumentation. PhD thesis, Stony Brook University, 2012.
[SF95] Janice M. Stone and Robert P. Fitzgerald. Storage in the PowerPC. IEEE
Micro, 15(2):50–58, April 1995.
[Sha11] Nir Shavit. Data structures in the multicore age. Commun. ACM, 54(3):76–84,
March 2011.
[She06] Gautham R. Shenoy. [patch 4/5] lock_cpu_hotplug: Redesign - lightweight
implementation of lock_cpu_hotplug, October 2006. Available: https:
//lkml.org/lkml/2006/10/26/73 [Viewed January 26, 2009].
[SHW11] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Con-
sistency and Cache Coherence. Synthesis Lectures on Computer Architecture.
Morgan & Claypool, 2011.
[Slo10] Lubos Slovak. First steps for utilizing userspace RCU library,
July 2010. https://gitlab.labs.nic.cz/knot/knot-dns/commit/
f67acc0178ee9a781d7a63fb041b5d09eb5fb4a2.
[SM95] John D. Slingwine and Paul E. McKenney. Apparatus and method for
achieving reduced overhead mutual exclusion and maintaining coherency in
a multiprocessor system utilizing execution history and thread monitoring.
Technical Report US Patent 5,442,758, Assigned to International Business
Machines Corp, Washington, DC, August 1995.
[SM97] John D. Slingwine and Paul E. McKenney. Method for maintaining data co-
herency using thread activity summaries in a multicomputer system. Technical
Report US Patent 5,608,893, Assigned to International Business Machines
Corp, Washington, DC, March 1997.
[SM98] John D. Slingwine and Paul E. McKenney. Apparatus and method for
achieving reduced overhead mutual exclusion and maintaining coherency in
a multiprocessor system utilizing execution history and thread monitoring.
Technical Report US Patent 5,727,209, Assigned to International Business
Machines Corp, Washington, DC, March 1998.
[SM04a] Dipankar Sarma and Paul E. McKenney. Issues with selected scalability
features of the 2.6 kernel. In Ottawa Linux Symposium, page 16, July
2004. https://www.kernel.org/doc/ols/2004/ols2004v2-pages-
195-208.pdf.
[SM04b] Dipankar Sarma and Paul E. McKenney. Making RCU safe for deep sub-
millisecond response realtime applications. In Proceedings of the 2004
USENIX Annual Technical Conference (FREENIX Track), pages 182–191,
Boston, MA, USA, June 2004. USENIX Association.
[SM13] Thomas Sewell and Toby Murray. Above and beyond: seL4 noninterference
and binary verification, May 2013. https://cps-vo.org/node/7706.
[Smi19] Richard Smith. Working draft, standard for programming language C++,
January 2019. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2019/n4800.pdf.
[SMS08] Michael Spear, Maged Michael, and Michael Scott. Inevitability mech-
anisms for software transactional memory. In 3rd ACM SIGPLAN Work-
shop on Transactional Computing, Salt Lake City, Utah, February 2008.
ACM. Available: http://www.cs.rochester.edu/u/scott/papers/
2008_TRANSACT_inevitability.pdf [Viewed January 10, 2009].


binary search trees. In 12th ACM SIGPLAN Workshop on Transactional
Computing, Austin, TX, USA, February 2017.
[SPA94] SPARC International. The SPARC Architecture Manual, 1994. Avail-
able: https://sparc.org/wp-content/uploads/2014/01/SPARCV9.
pdf.gz.
[Spi77] Keith R. Spitz. Tell which is which and you’ll be rich, 1977. Inscription on
wall of dungeon.
[Spr01] Manfred Spraul. Re: RFC: patch to allow lock-free traversal of lists with
insertion, October 2001. URL: http://lkml.iu.edu/hypermail/linux/
kernel/0110.1/0410.html.
[Spr08] Manfred Spraul. [RFC, PATCH] state machine based rcu, August 2008.
Available: https://lkml.org/lkml/2008/8/21/336 [Viewed December
8, 2008].
[SR84] Z. Segall and L. Rudolf. Dynamic decentralized cache schemes for MIMD
parallel processors. In 11th Annual International Symposium on Computer
Architecture, pages 340–347, June 1984.
[SRK+ 11] Justin Seyster, Prabakar Radhakrishnan, Samriti Katoch, Abhinav Duggal,
Scott D. Stoller, and Erez Zadok. Redflag: a framework for analysis of
kernel-level concurrency. In Proceedings of the 11th international conference
on Algorithms and architectures for parallel processing - Volume Part I,
ICA3PP’11, pages 66–79, Melbourne, Australia, 2011. Springer-Verlag.
[SRL90] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority inheritance
protocols: An approach to real-time synchronization. IEEE Transactions on
Computers, 39(9):1175–1185, 1990.
[SS94] Duane Szafron and Jonathan Schaeffer. Experimentally assessing the usability
of parallel programming systems. In IFIP WG10.3 Programming Environments
for Massively Parallel Distributed Systems, pages 19.1–19.7, Monte Verita,
Ascona, Switzerland, 1994.
[SS06] Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash
tables. J. ACM, 53(3):379–405, May 2006.
[SSA+ 11] Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams.
POWER and ARM litmus tests, 2011. https://www.cl.cam.ac.uk/
~pes20/ppc-supplemental/test6.pdf.
[SSHT93] Janice S. Stone, Harold S. Stone, Philip Heidelberger, and John Turek.
Multiple reservations and the Oklahoma update. IEEE Parallel and Distributed
Technology Systems and Applications, 1(4):58–71, November 1993.
[SSRB00] Douglas C. Schmidt, Michael Stal, Hans Rohnert, and Frank Buschmann.
Pattern-Oriented Software Architecture Volume 2: Patterns for Concurrent
and Networked Objects. Wiley, Chichester, West Sussex, England, 2000.
[SSVM02] S. Swaminathan, John Stultz, Jack Vogel, and Paul E. McKenney. Fairlocks –
a high performance fair locking scheme. In Proceedings of the 14th IASTED
International Conference on Parallel and Distributed Computing and Systems,
pages 246–251, Cambridge, MA, USA, November 2002.
[ST87] William E. Snaman and David W. Thiel. The VAX/VMS distributed lock
manager. Digital Technical Journal, 5:29–44, September 1987.
[ST95] Nir Shavit and Dan Touitou. Software transactional memory. In Proceedings
of the 14th Annual ACM Symposium on Principles of Distributed Computing,
pages 204–213, Ottawa, Ontario, Canada, August 1995.
[Ste92] W. Richard Stevens. Advanced Programming in the UNIX Environment.
Addison Wesley, 1992.
[Ste13] W. Richard Stevens. Advanced Programming in the UNIX Environment, 3rd
Edition. Addison Wesley, 2013.
[Sut08] Herb Sutter. Effective concurrency, 2008. Series in Dr. Dobbs Journal.
[Sut13] Adrian Sutton. Concurrent programming with the Disruptor, January 2013.
Presentation at Linux.conf.au 2013, URL: https://www.youtube.com/
watch?v=ItpT_vmRHyI.
[SW95] Richard L. Sites and Richard T. Witek. Alpha AXP Architecture. Digital Press,
second edition, 1995.
[SWS16] Harshal Sheth, Aashish Welling, and Nihar Sheth. Read-copy up-
date in a garbage collected environment, 2016. MIT PRIMES
program: https://math.mit.edu/research/highschool/primes/
materials/2016/conf/10-1%20Sheth-Welling-Sheth.pdf.
[SZJ12] KC Sivaramakrishnan, Lukasz Ziarek, and Suresh Jagannathan. Eliminating
read barriers through procrastination and cleanliness. In Proceedings of the
2012 International Symposium on Memory Management, ISMM ’12, pages
49–60, Beijing, China, 2012. ACM.
[Tal07] Nassim Nicholas Taleb. The Black Swan. Random House, 2007.
[TDV15] Joseph Tassarotti, Derek Dreyer, and Victor Vafeiadis. Verifying read-copy-
update in a logic for weak memory. In Proceedings of the 2015 Proceedings
of the 36th annual ACM SIGPLAN conference on Programming Language
Design and Implementation, PLDI ’15, pages 110–120, New York, NY, USA,
June 2015. ACM.
[The08] The Open MPI Project. Open MPI, November 2008. Available: http:
//www.open-mpi.org/software/ [Viewed November 26, 2008].
[The11] The Valgrind Developers. Valgrind, November 2011. http://www.
valgrind.org/.
[The12a] The NetBSD Foundation. pserialize(9), October 2012. http://netbsd.
gw.com/cgi-bin/man-cgi?pserialize+9+NetBSD-current.
[The12b] The OProfile Developers. OProfile, April 2012. http://oprofile.
sourceforge.net.
[TMW11] Josh Triplett, Paul E. McKenney, and Jonathan Walpole. Resizable, scalable,
concurrent hash tables via relativistic programming. In Proceedings of the
2011 USENIX Annual Technical Conference, pages 145–158, Portland, OR
USA, June 2011. The USENIX Association.
[Tor01] Linus Torvalds. Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal of
lists with insertion, October 2001. URL: https://lkml.org/lkml/2001/
10/13/105, https://lkml.org/lkml/2001/10/13/82.
[Tor02] Linus Torvalds. Linux 2.5.43, October 2002. Available: https://lkml.
org/lkml/2002/10/15/425 [Viewed March 30, 2008].
[Tor03] Linus Torvalds. Linux 2.6, August 2003. Available: https://kernel.org/
pub/linux/kernel/v2.6 [Viewed February 16, 2021].
[Tor08] Linus Torvalds. Move ACCESS_ONCE() to <linux/compiler.h>, May 2008.
Git commit: https://git.kernel.org/linus/9c3cdc1f83a6.
[Tor19] Linus Torvalds. rcu: locking and unlocking need to always be at least
barriers, June 2019. Git commit: https://git.kernel.org/linus/
66be4e66a7f4.
[Tra01] Transaction Processing Performance Council. TPC, 2001. Available: http:
//www.tpc.org/ [Viewed December 7, 2008].
[Tre86] R. K. Treiber. Systems programming: Coping with parallelism, April 1986.
RJ 5118.
[Tri12] Josh Triplett. Relativistic Causal Ordering: A Memory Model for Scalable
Concurrent Data Structures. PhD thesis, Portland State University, 2012.
[TS93] Hiroaki Takada and Ken Sakamura. A bounded spin lock algorithm with
preemption. Technical Report 93-02, University of Tokyo, Tokyo, Japan,
1993.
[TS95] H. Takada and K. Sakamura. Real-time scalability of nested spin locks. In
Proceedings of the 2nd International Workshop on Real-Time Computing
Systems and Applications, RTCSA ’95, pages 160–167, Tokyo, Japan, 1995.
IEEE Computer Society.
[Tur37] Alan M. Turing. On computable numbers, with an application to the Entschei-
dungsproblem. In Proceedings of the London Mathematical Society, volume 42
of 2, pages 230–265, 1937.
[TZK+ 13] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Mad-
den. Speedy transactions in multicore in-memory databases. In Proceedings of
the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP
’13, pages 18–32, Farminton, Pennsylvania, 2013. ACM.
[Ung11] David Ungar. Everything you know (about parallel programming) is wrong!:
A wild screed about the future. In Dynamic Languages Symposium 2011,
Portland, OR, USA, October 2011. Invited talk presentation.
[Uni08a] University of California, Berkeley. BOINC: compute for science, October
2008. Available: http://boinc.berkeley.edu/ [Viewed January 31,
2008].
[Uni08b] University of California, Berkeley. SETI@HOME, December 2008. Available:
http://setiathome.berkeley.edu/ [Viewed January 31, 2008].
[Uni10] University of Maryland. Parallel maze solving, November 2010. URL: http:
//www.cs.umd.edu/class/fall2010/cmsc433/p3/ [broken, February
2021].
[Val95] John D. Valois. Lock-free linked lists using compare-and-swap. In Proceedings
of the Fourteenth Annual ACM Symposium on Principles of Distributed
Computing, PODC ’95, pages 214–222, Ottawa, Ontario, Canada, 1995.
ACM.
[Van18] Michal Vaner. ArcSwap, April 2018. https://crates.io/crates/arc-
swap.
[VBC+ 15] Viktor Vafeiadis, Thibaut Balabonski, Soham Chakraborty, Robin Morisset,
and Francesco Zappa Nardelli. Common compiler optimisations are invalid
in the C11 memory model and what we can do about it. SIGPLAN Not.,
50(1):209–220, January 2015.
[VGS08] Haris Volos, Neelam Goyal, and Michael M. Swift. Pathological interac-
tion of locks with transactional memory. In 3rd ACM SIGPLAN Work-
shop on Transactional Computing, Salt Lake City, Utah, USA, Febru-
ary 2008. ACM. Available: http://www.cs.wisc.edu/multifacet/
papers/transact08_txlock.pdf [Viewed September 7, 2009].
[Vog09] Werner Vogels. Eventually consistent. Commun. ACM, 52:40–44, January
2009.
[Š11] Jaroslav Ševčík. Safe optimisations for shared-memory concurrent programs.
SIGPLAN Not., 46(6):306–316, June 2011.
[Was14] Scott Wasson. Errata prompts Intel to disable TSX in Haswell, early Broadwell
CPUs, August 2014. https://techreport.com/news/26911/errata-
prompts-intel-to-disable-tsx-in-haswell-early-broadwell-
cpus/.
[Wav16] Wave Computing, Inc. MIPS® Architecture For Programmers Volume II-A:
The MIPS64® Instruction Set Reference Manual, 2016. URL: https://www.
mips.com/downloads/the-mips64-instruction-set-v6-06/.
[Wei63] J. Weizenbaum. Symmetric list processor. Commun. ACM, 6(9):524–536,
September 1963.
[Wei12] Frédéric Weisbecker. Interruption timer périodique, 2012. http:
//www.dailymotion.com/video/xtxtew_interruption-timer-
periodique-frederic-weisbecker-kernel-recipes-12_tech.
[Wei13] Stewart Weiss. Unix lecture notes, May 2013. Available:
http://www.compsci.hunter.cuny.edu/~sweiss/course_
materials/unix_lecture_notes/ [Viewed April 8, 2014].
[Wik08] Wikipedia. Zilog Z80, 2008. Available: https://en.wikipedia.org/
wiki/Z80 [Viewed: December 7, 2008].
[Wik12] Wikipedia. Labyrinth, January 2012. https://en.wikipedia.org/wiki/
Labyrinth.
[Wil12] Anthony Williams. C++ Concurrency in Action: Practical Multithreading.
Manning, Shelter Island, NY, USA, 2012.
[Wil19] Anthony Williams. C++ Concurrency in Action, 2nd Edition. Manning,
Shelter Island, NY, USA, 2019.
[WKS94] Robert W. Wisniewski, Leonidas Kontothanassis, and Michael L. Scott.
Scalable spin locks for multiprogrammed systems. In 8th IEEE Int’l. Par-
allel Processing Symposium, Cancun, Mexico, April 1994. The Institute of
Electrical and Electronics Engineers, Inc.
[Won19] William G. Wong. Vhs or betamax. . . ccix or cxl. . . so many choices,
March 2019. https://www.electronicdesign.com/industrial-
automation/article/21807721/vhs-or-betamaxccix-or-cxlso-
many-choices.
[WTS96] Cai-Dong Wang, Hiroaki Takada, and Ken Sakamura. Priority inheritance
spin locks for multiprocessor real-time systems. In Proceedings of the 2nd
International Symposium on Parallel Architectures, Algorithms, and Networks,
ISPAN ’96, pages 70–76, Beijing, China, 1996. IEEE Computer Society.
[xen14] xenomai.org. Xenomai, December 2014. URL: http://xenomai.org/.
[Xu10] Herbert Xu. bridge: Add core IGMP snooping support, February
2010. Available: https://marc.info/?t=126719855400006&r=1&w=2
[Viewed March 20, 2011].
[YHLR13] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar.
Performance evaluation of Intel® Transactional Synchronization Extensions
for high-performance computing. In Proceedings of SC13: International
Conference for High Performance Computing, Networking, Storage and
Analysis, SC ’13, pages 19:1–19:11, Denver, Colorado, 2013. ACM.
[Yod04a] Victor Yodaiken. Against priority inheritance, September 2004. Avail-
able: https://www.yodaiken.com/papers/inherit.pdf [Viewed May
26, 2007].
[Yod04b] Victor Yodaiken. Temporal inventory and real-time synchronization in RTLin-
uxPro, September 2004. URL: https://www.yodaiken.com/papers/
sync.pdf.
[Zel11] Cyril Zeller. CUDA C/C++ basics: Supercomputing 2011 tutorial, Novem-
ber 2011. https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-
basics.pdf.
[Zha89] Lixia Zhang. A New Architecture for Packet Switching Network Protocols.
PhD thesis, Massachusetts Institute of Technology, July 1989.
[Zij14] Peter Zijlstra. Another go at speculative page faults, October 2014. https:
//lkml.org/lkml/2014/10/20/620.

Credits

    If I have seen further it is by standing on the
    shoulders of giants.

        Isaac Newton, modernized

LaTeX Advisor

Akira Yokosawa is this book’s LaTeX advisor, which perhaps most notably includes the care and feeding of the style guide laid out in Appendix D. This work includes table layout, listings, fonts, rendering of math, acronyms, bibliography formatting, epigraphs, hyperlinks, and paper size. Akira also perfected the cross-referencing of quick quizzes, allowing easy and exact navigation between quick quizzes and their answers. He also added build options that permit quick quizzes to be hidden and to be gathered at the end of each chapter, textbook style.

This role also includes the build system, which Akira has optimized and made much more user-friendly. His enhancements have included automating response to bibliography changes, automatically determining which source files are present, and automatically generating listings (with automatically generated hyperlinked line-number references) from the source files.

Reviewers

• Alan Stern (Chapter 15).
• Andy Whitcroft (Section 9.5.2, Section 9.5.3).
• Artem Bityutskiy (Chapter 15, Appendix C).
• Dave Keck (Appendix C).
• David S. Horner (Section 12.1.5).
• Gautham Shenoy (Section 9.5.2, Section 9.5.3).
• “jarkao2”, AKA LWN guest #41960 (Section 9.5.3).
• Jonathan Walpole (Section 9.5.3).
• Josh Triplett (Chapter 12).
• Michael Factor (Section 17.2).
• Mike Fulton (Section 9.5.2).
• Peter Zijlstra (Section 9.5.4).
• Richard Woodruff (Appendix C).
• Suparna Bhattacharya (Chapter 12).
• Vara Prasad (Section 12.1.5).

Reviewers whose feedback took the extremely welcome form of a patch are credited in the git logs.

Machine Owners

Readers might have noticed some graphs showing scalability data out to several hundred CPUs, courtesy of my current employer, with special thanks to Paul Saab, Yashar Bayani, Joe Boyd, and Kyle McMartin.

From back in my time at IBM, a great debt of thanks goes to Martin Bligh, who originated the Advanced Build and Test (ABAT) system at IBM’s Linux Technology Center, as well as to Andy Whitcroft, Dustin Kirkland, and many others who extended this system. Many thanks go also to a great number of machine owners: Andrew Theurer, Andy Whitcroft, Anton Blanchard, Chris McDermott, Cody Schaefer, Darrick Wong, David “Shaggy” Kleikamp, Jon M. Tollefson, Jose R. Santos, Marvin Heffler, Nathan Lynch, Nishanth Aravamudan, Tim Pepper, and Tony Breeds.

Original Publications

1. Section 2.4 (“What Makes Parallel Programming Hard?”) on page 13 originally appeared in a Portland State University Technical Report [MGM+ 09].
2. Section 4.3.4.1 (“Shared-Variable Shenanigans”) on page 40 originally appeared in Linux Weekly News [ADF+ 19].
3. Section 6.5 (“Retrofitted Parallelism Considered Grossly Sub-Optimal”) on page 93 originally appeared in 4th USENIX Workshop on Hot Topics in Parallelism [McK12c].
4. Section 9.5.2 (“RCU Fundamentals”) on page 144 originally appeared in Linux Weekly News [MW07].
5. Section 9.5.3 (“RCU Linux-Kernel API”) on page 150 originally appeared in Linux Weekly News [McK08e].
6. Section 9.5.4 (“RCU Usage”) on page 160 originally appeared in Linux Weekly News [McK08g].
7. Section 9.5.5 (“RCU Related Work”) on page 177 originally appeared in Linux Weekly News [McK14g].
8. Section 9.5.5 (“RCU Related Work”) on page 177 originally appeared in Linux Weekly News [MP15a].
9. Chapter 12 (“Formal Verification”) on page 229 originally appeared in Linux Weekly News [McK07f, MR08, McK11d].
10. Section 12.3 (“Axiomatic Approaches”) on page 260 originally appeared in Linux Weekly News [MS14].
11. Section 13.5.4 (“Correlated Fields”) on page 280 originally appeared in Oregon Graduate Institute [McK04].
12. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 313 originally appeared in the Linux kernel [HMDZ06].
13. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 313 originally appeared in Linux Weekly News [AMM+ 17a, AMM+ 17b].
14. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 313 originally appeared in ASPLOS ’18 [AMM+ 18].
15. Section 15.3.2 (“Address- and Data-Dependency Difficulties”) on page 335 originally appeared in the Linux kernel [McK14e].
16. Section 15.5 (“Memory-Barrier Instructions For Specific CPUs”) on page 351 originally appeared in Linux Journal [McK05a, McK05b].

Figure Credits

1. Figure 3.1 (p 17) by Melissa Broussard.
2. Figure 3.2 (p 18) by Melissa Broussard.
3. Figure 3.3 (p 18) by Melissa Broussard.
4. Figure 3.5 (p 19) by Melissa Broussard.
5. Figure 3.6 (p 20) by Melissa Broussard.
6. Figure 3.7 (p 20) by Melissa Broussard.
7. Figure 3.8 (p 21) by Melissa Broussard.
8. Figure 3.9 (p 21) by Melissa Broussard.
9. Figure 3.11 (p 25) by Melissa Broussard.
10. Figure 5.3 (p 51) by Melissa Broussard.
11. Figure 6.1 (p 73) by Kornilios Kourtis.
12. Figure 6.2 (p 74) by Melissa Broussard.
13. Figure 6.3 (p 74) by Kornilios Kourtis.
14. Figure 6.4 (p 75) by Kornilios Kourtis.
15. Figure 6.13 (p 84) by Melissa Broussard.
16. Figure 6.14 (p 85) by Melissa Broussard.
17. Figure 6.15 (p 85) by Melissa Broussard.
18. Figure 7.1 (p 102) by Melissa Broussard.
19. Figure 7.2 (p 102) by Melissa Broussard.
20. Figure 10.13 (p 194) by Melissa Broussard.
21. Figure 10.14 (p 194) by Melissa Broussard.
22. Figure 11.1 (p 209) by Melissa Broussard.
23. Figure 11.2 (p 209) by Melissa Broussard.
24. Figure 11.3 (p 216) by Melissa Broussard.
25. Figure 11.6 (p 227) by Melissa Broussard.
26. Figure 14.1 (p 293) by Melissa Broussard.
27. Figure 14.2 (p 293) by Melissa Broussard.
28. Figure 14.3 (p 294) by Melissa Broussard.
29. Figure 14.10 (p 302) by Melissa Broussard.
30. Figure 14.11 (p 302) by Melissa Broussard.
31. Figure 14.14 (p 304) by Melissa Broussard.
32. Figure 14.15 (p 311) by Sarah McKenney.
33. Figure 14.16 (p 311) by Sarah McKenney.
34. Figure 15.2 (p 315) by Melissa Broussard.
35. Figure 15.5 (p 321) by Akira Yokosawa.
36. Figure 15.18 (p 355) by Melissa Broussard.
37. Figure 16.2 (p 365) by Melissa Broussard.
38. Figure 17.1 (p 367) by Melissa Broussard.
39. Figure 17.2 (p 368) by Melissa Broussard.
40. Figure 17.3 (p 368) by Melissa Broussard.
41. Figure 17.4 (p 368) by Melissa Broussard.
42. Figure 17.5 (p 368) by Melissa Broussard, remixed.
43. Figure 17.9 (p 382) by Melissa Broussard.
44. Figure 17.10 (p 382) by Melissa Broussard.
45. Figure 17.11 (p 382) by Melissa Broussard.
46. Figure 17.12 (p 383) by Melissa Broussard.
47. Figure 18.1 (p 405) by Melissa Broussard.
48. Figure A.1 (p 408) by Melissa Broussard.
49. Figure E.2 (p 489) by Kornilios Kourtis.

Figure 9.33 was adapted from Fedor Pikus’s “When to use RCU” slide [Pik17]. The discussion of mechanical reference counters in Section 9.2 stemmed from a private conversation with Dave Regan.

Other Support

We owe thanks to many CPU architects for patiently explaining the instruction- and memory-reordering features of their CPUs, particularly Wayne Cardoza, Ed Silha, Anton Blanchard, Tim Slegel, Juergen Probst, Ingo Adlung, Ravi Arimilli, Cathy May, Derek Williams, H. Peter Anvin, Andy Glew, Leonid Yegoshin, Richard Grisenthwaite, and Will Deacon. Wayne deserves special thanks for his patience in explaining Alpha’s reordering of dependent loads, a lesson that Paul resisted quite strenuously!

The bibtex-generation service of the Association for Computing Machinery has saved us a huge amount of time and effort compiling the bibliography, for which we are grateful. Thanks are also due to Stamatis Karnouskos, who convinced me to drag my antique bibliography database kicking and screaming into the 21st century. Any technical work of this sort owes thanks to the many individuals and organizations that keep Internet and the World Wide Web up and running, and this one is no exception.

Portions of this material are based upon work supported by the National Science Foundation under Grant No. CNS-0719851.

Acronyms

CAS compare and swap, 22, 27, 36, 46, 258, 270, 386, 468, 535, 570
CBMC C bounded model checker, 179, 263, 395, 529
EBR epoch-based reclamation, 3, 178, 183, 571
HTM hardware transactional memory, 383, 384, 554, 556, 572
IPI inter-processor interrupt, 137, 354, 443, 572
IRQ interrupt request, 253, 303, 572
NBS non-blocking synchronization, 80, 119, 177, 286, 371, 403, 407, 573
NMI non-maskable interrupt, 175, 242, 369, 573
NUCA non-uniform cache architecture, 441, 544, 573
NUMA non-uniform memory architecture, 109, 179, 189, 380, 544, 573
QSBR quiescent-state-based reclamation, 141, 161, 178, 179, 183, 191, 346, 574
RAII resource acquisition is initialization, 112
RCU read-copy update, 3, 137, 554, 571, 572, 574
STM software transactional memory, 385, 554, 575
TLE transactional lock elision, 387, 410, 576
TM transactional memory, 576
UTM unbounded transactional memory, 385, 576

Index
Bold: Major reference.
Underline: Definition.

Acquire load, 46, 145, 324, 569
Ahmed, Iftekhar, 179
Alglave, Jade, 241, 257, 260, 345, 348
Amdahl’s Law, 7, 81, 97, 569
Anti-Heisenbug, see Heisenbug, anti-
Arbel, Maya, 178
Ash, Mike, 178
Associativity, see Cache associativity
Associativity miss, see Cache miss, associativity
Atomic, 19, 28, 36, 37, 46, 50, 55, 61, 569
Atomic read-modify-write operation, 316, 317, 432, 569
Attiya, Hagit, 178, 553

Belay, Adam, 178
Bhat, Srivatsa, 178
Bonzini, Paolo, 4
Bornat, Richard, 3
Bounded population-oblivious wait free, see Wait free, bounded population-oblivious
Bounded wait free, see Wait free, bounded
Butenhof, David R., 3

C bounded model checker (CBMC), 179, 263, 395, 529
Cache, 569
    direct-mapped, 433, 571
    fully associative, 385, 571
Cache associativity, 385, 430, 569
Cache coherence, 326, 355, 385, 569
Cache geometry, 430, 570
Cache line, 21, 50, 115, 205, 315, 328, 353, 383, 429, 570
Cache miss, 570
    associativity, 430, 569
    capacity, 430, 570
    communication, 430, 570
    write, 430, 576
Cache-coherence protocol, 431, 570
Cache-invalidation latency, see Latency, cache-invalidation
Cache-miss latency, see Latency, cache-miss
Capacity miss, see Cache miss, capacity
Chen, Haibo, 178
Chien, Andrew, 4
Clash free, 286, 570
Clements, Austin, 177
Code locking, see Locking, code
Communication miss, see Cache miss, communication
Compare and swap (CAS), 22, 27, 36, 258, 270, 386, 468, 535, 570
Concurrent, 412, 570
Consistency
    memory, 353, 573
    process, 574
    sequential, 275, 396, 575
    weak, 356
Corbet, Jonathan, 3
Correia, Andreia, 178
Critical section, 20, 35, 81, 84, 89, 109, 116, 570
    RCU read-side, 139, 145, 574
    read-side, 111, 135, 575
    write-side, 576

Data locking, see Locking, data
Data race, 32, 40, 101, 212, 335, 571
Deacon, Will, 41
Deadlock, 7, 15, 76, 101, 141, 197, 306, 337, 364, 376, 386, 571
Deadlock cycle, 415, 417
Deadlock free, 286, 571
Desnoyers, Mathieu, 177, 178
Dijkstra, Edsger W., 1, 74
Dining philosophers problem, 73
Direct-mapped cache, see Cache, direct-mapped
Dreyer, Derek, 179
Dufour, Laurent, 177

Efficiency, 9, 80, 86, 115, 412, 571
    energy, 25, 223, 571
Embarrassingly parallel, 12, 86, 93, 571
Epoch-based reclamation (EBR), 178, 183, 571
Exclusive lock, see Lock, exclusive
Existence guarantee, 116, 165, 166, 180, 270, 492, 571

False sharing, 24, 78, 97, 192, 205, 480, 498, 518, 571
Felber, Pascal, 178
Forward-progress guarantee, 121, 178, 181, 286, 571
Fragmentation, 92, 571
Fraser, Keir, 178, 571
Full memory barrier, see Memory barrier, full
Fully associative cache, see Cache, fully associative

Generality, 8, 10, 27, 80
Giannoula, Christina, 179, 379
Gotsman, Alexey, 179
Grace period, 140, 151, 182, 190, 212, 241, 261, 274, 305, 344, 370, 415, 572
Grace-period latency, see Latency, grace-period
Groce, Alex, 179

Hardware transactional memory (HTM), 383, 384, 554, 556, 572
Hawking, Stephen, 8
Hazard pointer, 131, 143, 149, 179, 189, 206, 274, 310, 372, 388, 492, 572
Heisenberg, Werner, 194, 218
Heisenbug, 218, 572
    anti-, 218
Hennessy, John L., 4, 17
Herlihy, Maurice P., 3
Hot spot, 86, 192, 572
Howard, Phil, 177
Hraska, Adam, 178
Humiliatingly parallel, 96, 572
Hunter, Andrew, 179

Immutable, 572
Inter-processor interrupt (IPI), 137, 354, 443, 572
Interrupt request (IRQ), 253, 303, 572
Invalidation, 430, 437, 554, 572

Jensen, Carlos, 179

Kaashoek, Frans, 177
Kim, Jaeho, 178
Knuth, Donald, 3, 177, 371
Kogan, Alex, 179
Kohler, Eddie, 177
Kokologiannakis, Michalis, 179
Kroah-Hartman, Greg, 3
Kroening, Daniel, 179
Kung, H. T., 3, 177

Latency, 19, 24, 296, 572
    cache-invalidation, 438
    cache-miss, 25
    grace-period, 151, 423
    memory, 369
    memory-barrier, 191
    message, 81
    scheduling, 288
Lea, Doug, 4
Lehman, Philip L., 3, 177
Liang, Lihao, 179
Linearizable, 178, 286, 516, 572
Liskov, Barbara, 177
Liu, Ran, 178
Liu, Yujie, 178
Livelock, 7, 15, 101, 108, 231, 387, 496, 572
Lock, 572
    exclusive, 34, 110, 410, 571
    reader-writer, 34, 110, 178, 575
    sequence, 575
Lock contention, 56, 69, 77, 81, 84, 90, 109, 572
Lock free, 178, 286, 572
Locking, 101
    code, 82, 83, 90, 570
    data, 15, 82, 92, 571
Luchangco, Victor, 3, 178

Madden, Samuel, 177
Mao, Yandong, 177
Maranget, Luc, 257
Marked access, 572
Marlier, Patrick, 178
Matloff, Norm, 3
Mattson, Timothy G., 3
Matveev, Alexander, 178
McKenney, Paul E., 179
Melham, Tom, 179
Memory, 573
Memory barrier, 20, 36, 65, 81, 109, 132, 181, 191, 234, 270, 315, 369, 410, 416, 422, 429, 573
    full, 137, 316, 342, 350, 352, 542
    read, 338, 352, 440, 574
    write, 352, 440, 576
Memory consistency, see Consistency, memory
Memory latency, see Latency, memory
Memory-barrier latency, see Latency, memory-barrier
Memory-barrier overhead, see Overhead, memory-barrier
MESI protocol, 431, 573
Message latency, see Latency, message
Moore’s Law, 7, 9, 13, 17, 19, 25, 27, 82, 367, 369, 573
Morris, Robert, 177
Morrison, Adam, 178
Mutual-exclusion mechanism, 573

Nardelli, Francesco Zappa, 257
Nidhugg, 264, 395, 529
Non-blocking, 573
Non-blocking synchronization (NBS), 80, 119, 177, 286, 371, 403, 407, 573
Non-maskable interrupt (NMI), 175, 242, 369, 573
Non-uniform cache architecture (NUCA), 441, 544, 573
Non-uniform memory architecture (NUMA), 109, 179, 189, 380, 544, 573
NUMA node, 15, 528, 573

Obstruction free, 286, 573
Overhead, 7, 21, 573
    memory-barrier, 20

Parallel, 412, 573
Park, SeongJae, 179, 379
Patterson, David A., 4, 17
Pawan, Pankaj, 257
Penyaev, Roman, 261
Performance, 8, 80, 412, 574
Pikus, Fedor, 625
Pipelined CPU, 574
Plain access, 40, 47, 144, 334, 574
Podzimek, Andrej, 178
Process consistency, see Consistency, process
Productivity, 8, 10, 80, 308, 378
Program order, 574
Promela, 229, 529

Quiescent state, 141, 254, 376, 425, 574
Quiescent-state-based reclamation (QSBR), 141, 161, 179, 183, 191, 346, 574

Race condition, 7, 117, 219, 229, 230, 279, 319, 418, 574
Ramalhete, Pedro, 178
RCU read-side critical section, see Critical section, RCU read-side
RCU-protected data, 507, 574
RCU-protected pointer, 139, 574
Read memory barrier, see Memory barrier, read
Read mostly, 574
Read only, 574
Read-copy update (RCU), 137, 554, 574
Read-side critical section, see Critical section, read-side
Reader-writer lock, see Lock, reader-writer
Real time, 575
Reference count, 46, 49, 128, 174, 179, 270, 281, 379, 415, 476, 575
Regan, Dave, 625
Reinders, James, 4
Release store, 46, 338, 575
Reliability, 308
Resource acquisition is initialization (RAII), 112
Rinetzky, Noam, 179
Romer, Geoff, 179
Roy, Lance, 179
Rubini, Alessandro, 3

Sagonas, Konstantinos, 179
Sarkar, Susmit, 257
Scalability, 9, 412, 575
Scheduling latency, see Latency, scheduling
Schimmel, Curt, 4
Schmidt, Douglas C., 3
Scott, Michael, 3
Sequence lock, see Lock, sequence
Sequential consistency, see Consistency, sequential
Sewell, Peter, 257
Shavit, Nir, 3, 178
Shenoy, Gautham, 178
Siakavaras, Dimitrios, 179, 379
Sivaramakrishnan, KC, 178
Software transactional memory (STM), 385, 554, 575
Sorin, Daniel, 4
Spear, Michael, 3, 178
Spin, 229
Starvation, 73, 101, 108, 114, 137, 275, 304, 380, 417, 496, 575
Starvation free, 286, 575
Stevens, W. Richard, 3
Store buffer, 575
Store forwarding, 575
Superscalar CPU, 575
Sutter, Herb, 4
Synchronization, 576

Tassarotti, Joseph, 179
Teachable, 576
Throughput, 576
Torvalds, Linus, 477, 514, 548
Transactional lock elision (TLE), 387, 410, 576
Transactional memory (TM), 576
Triplett, Josh, 177
Tu, Stephen, 177
Type-safe memory, 151, 165, 270, 576

Unbounded transactional memory (UTM), 385, 576
Uncertainty principle, 218
Unfairness, 101, 109, 114, 137, 576
Unteachable, 576

Vafeiadis, Viktor, 179
Vector CPU, 576

Wait free, 286, 576
    bounded, 181, 286, 569
    bounded population-oblivious, 286, 569
Walpole, Jon, 177
Weak consistency, see Consistency, weak
Weiss, Stewart, 3
Weizenbaum, Joseph, 177
Williams, Anthony, 3
Williams, Derek, 257
Write memory barrier, see Memory barrier, write
Write miss, see Cache miss, write
Write mostly, 576
Write-side critical section, see Critical section, write-side

Xu, Herbert, 177, 195, 514

Yang, Hongseok, 179

Zeldovich, Nickolai, 177
Zhang, Heng, 178
Zheng, Wenting, 177
Zijlstra, Peter, 177

API Index
(c): Cxx standard, (g): GCC extension, (k): Linux kernel,
(kh): Linux kernel historic, (pf): perfbook CodeSamples,
(px): POSIX, (ur): userspace RCU.

_Thread_local (c), 37, 46, 52
__ATOMIC_ACQUIRE (g), 37
__ATOMIC_ACQ_REL (g), 37
__ATOMIC_CONSUME (g), 37
__ATOMIC_RELAXED (g), 37
__ATOMIC_RELEASE (g), 37
__ATOMIC_SEQ_CST (g), 37
__atomic_load() (g), 37
__atomic_load_n() (g), 36, 37
__atomic_store() (g), 37
__atomic_store_n() (g), 36, 37
__atomic_thread_fence() (g), 37
__get_thread_var() (pf), 47, 51
__sync_add_and_fetch() (g), 36
__sync_and_and_fetch() (g), 36
__sync_bool_compare_and_swap() (g), 36
__sync_fetch_and_add() (g), 36, 473, 474
__sync_fetch_and_and() (g), 36
__sync_fetch_and_nand() (g), 36, 474
__sync_fetch_and_or() (g), 36
__sync_fetch_and_sub() (g), 36, 474
__sync_fetch_and_xor() (g), 36, 474
__sync_nand_and_fetch() (g), 36, 474
__sync_or_and_fetch() (g), 36
__sync_sub_and_fetch() (g), 36
__sync_synchronize() (g), 36
__sync_val_compare_and_swap() (g), 36
__sync_xor_and_fetch() (g), 36
__thread (g), 35, 37, 46, 52, 473

ACCESS_ONCE() (kh), 36, 474
atomic_add() (k), 46
atomic_add_return() (k), 46
atomic_add_unless() (k), 46
atomic_cmpxchg() (k), 46, 62
atomic_compare_exchange_strong() (c), 37
atomic_compare_exchange_weak() (c), 37
atomic_dec() (k), 46
atomic_dec_and_test() (k), 46
atomic_exchange() (c), 37
atomic_fetch_add() (c), 37
atomic_fetch_and() (c), 37
atomic_fetch_sub() (c), 37
atomic_fetch_xor() (c), 37
atomic_inc() (k), 46
atomic_inc_not_zero() (k), 46
atomic_load() (c), 36
atomic_load_explicit() (c), 37
atomic_read() (k), 46
atomic_set() (k), 46
atomic_signal_fence() (c), 36
atomic_store() (c), 36
atomic_sub() (k), 46
atomic_sub_and_test() (k), 46
atomic_t (k), 46, 61
atomic_thread_fence() (c), 36
atomic_xchg() (k), 46

barrier() (k), 36, 43–45

call_rcu() (k), 105, 139, 151
call_rcu_tasks() (k), 153
call_srcu() (k), 153
cds_list_add() (ur), 190
cds_list_add_rcu() (ur), 190
cds_list_del_init() (ur), 190
cds_list_del_rcu() (ur), 190
cds_list_for_each_entry() (ur), 189
cds_list_for_each_entry_rcu() (ur), 189
create_thread() (pf), 38

DECLARE_PER_THREAD() (pf), 47
DEFINE_PER_CPU() (k), 46
DEFINE_PER_THREAD() (pf), 46
destroy_rcu_head() (k), 159
destroy_rcu_head_on_stack() (k), 159

exec() (px), 120
exit() (px), 30

for_each_running_thread() (pf), 38
for_each_thread() (pf), 38, 52
fork() (px), 30, 39, 47, 120, 474, 476

get_nulls_value() (k), 156

hlist_del_rcu() (k), 159
hlist_for_each_entry_rcu() (k), 159

init_per_thread() (pf), 46, 47
init_rcu_head() (k), 159
init_rcu_head_on_stack() (k), 159
is_a_nulls() (k), 156

kfree() (k), 158
kill() (px), 30
kmem_cache_create() (k), 151
kthread_create() (k), 38
kthread_should_stop() (k), 38
kthread_stop() (k), 38

list_add_rcu() (k), 159
list_for_each_entry_rcu() (k), 159
list_replace_rcu() (k), 156, 158
lockless_dereference() (kh), 324, 544

NR_THREADS (pf), 38

per_cpu() (k), 46
per_thread() (pf), 47, 52
pthread_atfork() (px), 120
pthread_cond_wait() (px), 106
pthread_create() (px), 31
pthread_exit() (px), 31
pthread_getspecific() (px), 37
pthread_join() (px), 31
pthread_key_create() (px), 37
pthread_key_delete() (px), 37
pthread_kill() (px), 67
pthread_mutex_init() (px), 32
PTHREAD_MUTEX_INITIALIZER (px), 32
pthread_mutex_lock() (px), 32, 107
pthread_mutex_t (px), 32, 34, 106
pthread_mutex_unlock() (px), 32
pthread_rwlock_init() (px), 34
PTHREAD_RWLOCK_INITIALIZER (px), 34
pthread_rwlock_rdlock() (px), 34
pthread_rwlock_t (px), 34
pthread_rwlock_unlock() (px), 34
pthread_rwlock_wrlock() (px), 34
pthread_setspecific() (px), 37
pthread_t (px), 37

rcu_access_pointer() (k), 154
rcu_assign_pointer() (k), 139, 154
rcu_barrier() (k), 151
rcu_barrier_tasks() (k), 153
rcu_cpu_stall_reset() (k), 159
rcu_dereference() (k), 139, 154
rcu_dereference_check() (k), 154
rcu_dereference_protected() (k), 154
rcu_dereference_raw() (k), 155
rcu_dereference_raw_notrace() (k), 155
rcu_head (k), 159
rcu_head_after_call_rcu() (k), 159
rcu_head_init() (k), 159
rcu_init() (ur), 37
RCU_INIT_POINTER() (k), 154
rcu_is_watching() (k), 159
RCU_LOCKDEP_WARN() (k), 159
RCU_NONIDLE() (k), 159
rcu_pointer_handoff() (k), 154
RCU_POINTER_INITIALIZER() (k), 154
rcu_read_lock() (k), 139, 151
rcu_read_lock_bh() (k), 151
rcu_read_lock_bh_held() (k), 159
rcu_read_lock_held() (k), 159
rcu_read_lock_sched() (k), 151
rcu_read_lock_sched_held() (k), 159
rcu_read_unlock() (k), 139, 151
rcu_read_unlock_bh() (k), 151
rcu_read_unlock_sched() (k), 151
rcu_register_thread() (ur), 37
rcu_replace_pointer() (k), 154
rcu_sleep_check() (k), 159
rcu_unregister_thread() (ur), 38
READ_ONCE() (k), 33, 35–37, 41, 42, 44–46, 472–474

schedule() (k), 153
schedule_timeout_interruptible() (k), 38
sig_atomic_t (c), 42
SLAB_TYPESAFE_BY_RCU (k), 151
smp_init() (pf), 37
smp_load_acquire() (k), 46, 475
smp_mb() (k), 45
smp_store_release() (k), 43, 46, 475
smp_thread_id() (pf), 38, 39, 475
smp_wmb() (k), 43
spin_lock() (k), 39, 40
spin_lock_init() (k), 39
spin_trylock() (k), 39, 106
spin_unlock() (k), 39, 40
spinlock_t (k), 39
srcu_barrier() (k), 153
srcu_read_lock() (k), 151
srcu_read_lock_held() (k), 159
srcu_read_unlock() (k), 151
srcu_struct (k), 151
struct task_struct (k), 38
synchronize_irq() (k), 510
synchronize_net() (k), 151
synchronize_rcu() (k), 139, 151
synchronize_rcu_expedited() (k), 151
synchronize_rcu_tasks() (k), 153
synchronize_srcu() (k), 153
synchronize_srcu_expedited() (k), 153

this_cpu_ptr() (k), 46
thread_id_t (pf), 38

unlikely() (k), 43

vfork() (px), 47, 476
volatile (c), 43, 44, 48

wait() (px), 30, 31, 39, 47, 474
wait_all_threads() (pf), 38, 39
wait_thread() (pf), 38, 39
waitall() (px), 30
WRITE_ONCE() (k), 33, 36, 41, 42, 44–46, 472, 474, 475
