C++ and The Perils of Double-Checked Locking
Part I
Google the newsgroups or Web for the names of design patterns, and you're sure to
find that one of the most commonly mentioned is Singleton. Try to put Singleton into
practice, however, and you're all but certain to bump into a significant limitation: As
traditionally implemented, Singleton isn't thread safe.
Much effort has been put into addressing this shortcoming. One of the most popular
approaches is a design pattern in its own right, the Double-Checked Locking Pattern
(DCLP); see Douglas C. Schmidt et al., "Double-Checked Locking" and Douglas C.
Schmidt et al., Pattern-Oriented Software Architecture, Volume 2. DCLP is designed
to add efficient thread safety to initialization of a shared resource (such as a
Singleton), but it has a problem—it's not reliable. Furthermore, there's virtually no
portable way to make it reliable in C++ (or in C) without substantively modifying the
conventional pattern implementation. To make matters even more interesting, DCLP
can fail for different reasons on uniprocessor and multiprocessor architectures.
In this two-part article, we explain why Singleton isn't thread safe, how DCLP
attempts to address that problem, why DCLP may fail on both uni- and multiprocessor
architectures, and why you can't (portably) do anything about it. Along the way, we
clarify the relationships among statement ordering in source code, sequence points,
compiler and hardware optimizations, and the actual order of statement execution.
Finally, in the next installment, we conclude with some suggestions regarding how to
add thread safety to Singleton (and similar constructs) such that the resulting code is
both reliable and efficient.
Singleton* Singleton::instance() {
    if (pInstance == 0) {
        pInstance = new Singleton;
    }
    return pInstance;
}
At some point later, Thread A is allowed to continue running, and the first thing it does
is execute the assignment pInstance = new Singleton, conjuring up another Singleton
object and making pInstance point to it. It should be clear that this violates the meaning of a
Singleton, as there are now two Singleton objects.
The crux of DCLP is the observation that most calls to instance will see that pInstance is
non-null and, hence, need not even try to initialize it. Therefore, DCLP tests pInstance for nullness
before trying to acquire a lock. Only if the test succeeds (that is, if pInstance has not
yet been initialized) is the lock acquired. After that, the test is performed again to
ensure pInstance is still null (hence, the "double-checked" locking). The second test is
necessary because it is possible that another thread happened to
initialize pInstance between the time pInstance was first tested and the time the lock
was acquired.
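Put into code, the pattern looks like this (a sketch only, with std::mutex and std::lock_guard standing in for the scope-based Lock class used in the original papers):

```cpp
#include <cassert>
#include <mutex>

class Singleton {
public:
    static Singleton* instance();
private:
    Singleton() {}
    static Singleton* pInstance;
    static std::mutex mtx; // stands in for the papers' Lock class
};

Singleton* Singleton::pInstance = 0;
std::mutex Singleton::mtx;

Singleton* Singleton::instance() {
    if (pInstance == 0) {                     // first check, no lock held
        std::lock_guard<std::mutex> lock(mtx);
        if (pInstance == 0) {                 // second check, lock held
            pInstance = new Singleton;
        }
    }
    return pInstance;
}
```

This compiles and appears to work, but, as the remainder of the article shows, it is not reliably thread safe.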
Of critical importance is the observation that compilers are not constrained to perform
these steps in this order! In particular, compilers are sometimes allowed to swap Steps
2 and 3. Why they might want to do that is a question we'll address in a moment. For
now, let's focus on what happens if they do.
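The steps in question come from the statement pInstance = new Singleton, which performs three of them: Step 1 allocates memory, Step 2 constructs a Singleton in that memory, and Step 3 makes pInstance point to it. The sketch below shows what a compiler that swaps Steps 2 and 3 might effectively generate; the function and variable names are ours:

```cpp
#include <cassert>
#include <new>

struct Singleton { int x; Singleton() : x(5) {} };
Singleton* pInstance = 0;

void initializeReordered() {
    // Step 1: allocate raw memory for the Singleton
    void* mem = operator new(sizeof(Singleton));
    // Step 3: make pInstance point to the memory (moved above Step 2!)
    pInstance = static_cast<Singleton*>(mem);
    // Step 2: construct the Singleton in the allocated memory
    new (pInstance) Singleton;
}
```

Between Steps 3 and 2, another thread reading pInstance would see a non-null pointer to an object that has not yet been constructed.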
In general, this is not a valid translation of the original DCLP source code because
the Singleton constructor called in Step 2 might throw an exception. And, if an
exception is thrown, it's important that pInstance has not yet been modified. That's
why, in general, compilers cannot move Step 3 above Step 2. However, there are
conditions under which this transformation is legitimate. Perhaps the simplest such
condition is when a compiler can prove that the Singleton constructor cannot throw
(via postinlining flow analysis, for instance), but that is not the only condition. Some
constructors that throw can also have their instructions reordered such that this
problem arises.
DCLP works only if Steps 1 and 2 are completed before Step 3 is performed, but there
is no way to express this constraint in C or C++. That's the dagger in the heart of
DCLP—you need to define a constraint on relative instruction ordering, but the
languages give you no way to express the constraint.
Yes, the C and C++ Standards (see ISO/IEC 9899:1999 International Standard and
ISO/IEC 14882:1998(E), respectively) do define sequence points, which define
constraints on the order of evaluation. For example, paragraph 7 of Section 1.9 of the
C++ Standard encouragingly states:
At certain specified points in the execution sequence called sequence points, all side
effects of previous evaluations shall be complete and no side effects of subsequent
evaluations shall have taken place.
Furthermore, both Standards state that a sequence point occurs at the end of each
statement. So it seems that if you're just careful with how you sequence your
statements, everything falls into place.
Oh, Odysseus, don't let thyself be lured by sirens' voices; for much trouble is waiting
for thee and thy mates!
Both Standards define correct program behavior in terms of the "observable behavior"
of an abstract machine. But not everything about this machine is observable. For
example, consider function Foo in Example 5 (which looks silly, but might plausibly
be the result of inlining some other functions called by Foo).
#include <cstdio>

void Foo() {
    int x = 0, y = 0;        // Statement 1
    x = 5;                   // Statement 2
    y = 10;                  // Statement 3
    printf("%d, %d", x, y);  // Statement 4
}
Example 5: This code could be the result of inlining some other functions called by Foo.
In both C and C++, the Standards guarantee that Foo prints "5, 10". But that's about
the extent of what we're guaranteed. We don't know whether statements 1-3 will be
executed at all and, in fact, a good optimizer will get rid of them. If statements 1-3 are
executed, we know that statement 1 precedes statements 2-4 and—assuming that the
call to printf isn't inlined and the result further optimized—that statement 4 follows
statements 2 and 3. But we know nothing about the relative ordering of statements 2
and 3: compilers might choose to execute statement
2 first, statement 3 first, or even to execute them both in parallel, assuming the
hardware has some way to do it. Which it might well have. Modern processors have a
large word size and several execution units. Two or more arithmetic units are
common. (For example, the Pentium 4 has three integer ALUs, PowerPC's G4e has
four, and Itanium has six.) Their machine language allows compilers to generate code
that yields parallel execution of two or more instructions in a single clock cycle.
When performing these kinds of optimizations, C/C++ compilers and linkers are
constrained only by the dictates of observable behavior on the abstract machines
defined by the language standards, and—this is the important bit—those abstract
machines are implicitly single threaded. As languages, neither C nor C++ has
threads, so compilers don't have to worry about breaking threaded programs when
they are optimizing. It should, therefore, not surprise you that they sometimes do.
That being the case, how can you write C and C++ multithreaded programs that
actually work? By using system-specific libraries defined for that purpose. Libraries
such as POSIX threads (pthreads) (see ANSI/IEEE 1003.1c-1995) give precise
specifications for the execution semantics of various synchronization primitives.
These libraries impose restrictions on the code that library-conformant compilers are
permitted to generate, thus forcing such compilers to emit code that respects the
execution ordering constraints on which those libraries depend. That's why threading
packages have parts written in assembler or issue system calls that are themselves
written in assembler (or in some unportable language): You have to go outside
Standard C and C++ to express the ordering constraints that multithreaded programs
require. DCLP tries to get by using only language constructs. That's why DCLP isn't
reliable.
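The attempts below refer to a variant of instance that tries to force the correct ordering through a local temporary. A sketch (again with std::mutex standing in for a scope-based Lock class):

```cpp
#include <cassert>
#include <mutex>

class Singleton {
public:
    static Singleton* instance();
private:
    Singleton() {}
    static Singleton* pInstance;
    static std::mutex mtx;
};

Singleton* Singleton::pInstance = 0;
std::mutex Singleton::mtx;

Singleton* Singleton::instance() {
    if (pInstance == 0) {
        std::lock_guard<std::mutex> lock(mtx);
        if (pInstance == 0) {
            Singleton* temp = new Singleton; // construct first...
            pInstance = temp;                // ...then publish
        }
    }
    return pInstance;
}
```

The hope is that the temporary forces construction to complete before pInstance is written, but a compiler that can see temp is unnecessary may simply optimize it away.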
If you reach for bigger ammo and try moving temp to a larger scope (say, by making it
file static), the compiler can still perform the same analysis and come to the same
conclusion. Scope, schmope. Game over. You lose. So you call for backup. You
declare temp extern and define it in a separate translation unit, thus preventing your
compiler from seeing what you are doing. Alas, some compilers have the optimizing
equivalent of night-vision goggles: They perform interprocedural analysis, discover
your ruse with temp, and again optimize it out of existence. Remember, these
are optimizing compilers. They're supposed to track down unnecessary code and
eliminate it. Game over. You lose.
So you try to disable inlining by defining a helper function in a different file, thus
forcing the compiler to assume that the constructor might throw an exception and,
therefore, delay the assignment to pInstance. Nice try, but some build environments
perform link-time inlining followed by more code optimizations (see Bruno De Bus et
al., "Post-pass Compaction Techniques;" Robert Cohn et al., "Spike: An Optimizer for
Alpha/NT Executables;" and Matt Pietrek, "Link-Time Code Generation"). Game
over. You lose.
Nothing you do can alter the fundamental problem: You need to be able to specify a
constraint on instruction ordering, and your language gives you no way to do it.
Next Month
In the next installment of this two-part article, we'll examine the role of
the volatile keyword, see what impact DCLP has on multiprocessor machines, and
conclude with a few suggestions.
References
Bruno De Bus, Daniel Kaestner, Dominique Chanet, Ludo Van Put, and Bjorn De
Sutter. "Post-pass Compaction Techniques." Communications of the ACM, 46(8):41-
46, August 2003. ISSN 0001-0782. http://doi.acm.org/10.1145/859670.859696.
Robert Cohn, David Goodwin, P. Geoffrey Lowney, and Norman Rubin. "Spike: An
Optimizer for Alpha/NT Executables."
http://www.usenix.org/publications/library/proceedings/usenix-nt97/presentations/goodwin/index.htm, August 1997.
Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns:
Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
Scott is author of Effective C++ and consulting editor for the Addison-Wesley
Effective Software Development series. He can be contacted at http://aristeia.com/.
Andrei is the author of Modern C++ Design and a columnist for the C/C++ Users
Journal. He can be contacted at http://moderncppdesign.com/.
Part II
In this installment, Scott and Andrei examine the relationship between thread safety
and the volatile keyword.
In the first installment of this two-part article, we examined why the Singleton pattern
isn't thread safe, and how the Double-Checked Locking Pattern addresses that
problem. This month, we look at the role the volatile keyword plays in this, and why
DCLP may fail on both uni- and multiprocessor architectures.
The desire for specific instruction ordering makes you wonder whether
the volatile keyword might be of help with multithreading in general and with DCLP
in particular. In what follows, we restrict our attention to the semantics of volatile in
C++ and, further, to its impact on DCLP.
Section 1.9 of the C++ Standard (see ISO/IEC 14882:1998(E)) includes this
information (emphasis is ours):
The observable behavior of the [C++] abstract machine is its sequence of reads and
writes to volatile data and calls to library I/O functions.
Accessing an object designated by a volatile lvalue, modifying an object, calling a
library I/O function, or calling a function that does any of those operations are
all side effects, which are changes in the state of the execution environment.
In conjunction with our earlier observations that the Standard guarantees that all side
effects will have taken place when sequence points are reached and that a sequence
point occurs at the end of each C++ statement, it would seem that all we need to do to
ensure correct instruction order is to volatile-qualify the appropriate data and
sequence our statements carefully. Our earlier analysis shows that pInstance needs to
be declared volatile, and this point is made in the papers on DCLP (see Douglas C.
Schmidt et al., "Double-Checked Locking" and Douglas C. Schmidt et al., Pattern-
Oriented Software Architecture, Volume 2). However, Sherlock Holmes would
certainly notice that, to ensure correct instruction order,
the Singleton object itself must also be volatile. This is not noted in the original
DCLP papers and that's an important oversight. To appreciate how
declaring pInstance alone volatile is insufficient, consider Example 7 (Examples 1-6
appeared in Part I of this article).
class Singleton {
public:
    static Singleton* instance();
    ...
private:
    static Singleton* volatile pInstance; // volatile added
    int x;
    Singleton() : x(5) {}
};

// from the implementation file
Singleton* volatile Singleton::pInstance = 0; // volatile added to match the declaration

Singleton* Singleton::instance() {
    if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
            Singleton* volatile temp = new Singleton; // volatile added
            pInstance = temp;
        }
    }
    return pInstance;
}
After inlining the constructor, the code looks like Example 8. Though temp is
volatile, *temp is not, and that means that temp->x isn't, either. Because you now
understand that assignments to nonvolatile data may sometimes be reordered, it is
easy to see that compilers could reorder temp->x's assignment with regard to the
assignment to pInstance. If they did, pInstance would be assigned before the data it
pointed to had been initialized, leading again to the possibility that a different thread
would read an uninitialized x.
if (pInstance == 0) {
    Lock lock;
    if (pInstance == 0) {
        Singleton* volatile temp =
            static_cast<Singleton*>(operator new(sizeof(Singleton)));
        temp->x = 5; // inlined Singleton constructor
        pInstance = temp;
    }
}
First, the Standard's constraints on observable behavior are only for an abstract
machine defined by the Standard, and that abstract machine has no notion of multiple
threads of execution. As a result, though the Standard prevents compilers from
reordering reads and writes to volatile data within a thread, it imposes no constraints
at all on such reorderings across threads. At least, that's how most compiler
implementers interpret things. In practice, then, many compilers may generate
thread-unsafe code from the aforementioned source. If your multithreaded code works
properly with volatile and doesn't work without it, then either your C++
implementation carefully implemented volatile to work with threads (less likely) or
you simply got lucky (more likely). In either case, your code is not portable.
Moreover, an object being constructed does not become volatile until its constructor
has run to completion, and that means that we're back in a situation where instructions
for memory allocation and object initialization may be arbitrarily reordered.
Unfortunately, all this does nothing to address the first problem—C++'s abstract
machine is single threaded, and C++ compilers may choose to generate thread-unsafe
code from source like that just mentioned, anyway. Otherwise, lost optimization
opportunities lead to too big an efficiency hit. After all this, we're back to square one.
But wait, there's more—more processors.
Technically, you don't need full bidirectional barriers. The first barrier must prevent
downwards migration of Singleton's construction (by another thread); the second
barrier must prevent upwards migration of pInstance's initialization. These are
called "acquire" and "release" operations, and may yield better performance than full
barriers on hardware (such as Itanium) that makes the distinction.
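Facilities standardized after this article was written (C++11's std::atomic) can express exactly these acquire and release constraints. A sketch of DCLP rewritten with them:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Singleton {
public:
    static Singleton* instance();
private:
    Singleton() {}
    static std::atomic<Singleton*> pInstance;
    static std::mutex mtx;
};

std::atomic<Singleton*> Singleton::pInstance{nullptr};
std::mutex Singleton::mtx;

Singleton* Singleton::instance() {
    // Acquire: a thread that sees a non-null pointer also sees the
    // fully constructed object it points to.
    Singleton* tmp = pInstance.load(std::memory_order_acquire);
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(mtx);
        tmp = pInstance.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Singleton;
            // Release: the construction above cannot migrate below
            // this store, which publishes the pointer.
            pInstance.store(tmp, std::memory_order_release);
        }
    }
    return tmp;
}
```

The acquire load prevents downward migration of reads past the test; the release store prevents upward migration of the publication above the construction.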
Conclusion
There are several lessons to be learned here. First, remember that timeslice-based
parallelism on uniprocessors is not the same as true parallelism across multiple
processors. That's why a thread-safe solution for a particular compiler on a
uniprocessor architecture may not be thread safe on a multiprocessor architecture, not
even if you stick with the same compiler. (This is a general observation—it's not
specific to DCLP.)
(a)
Singleton::instance()->transmogrify();
Singleton::instance()->metamorphose();
Singleton::instance()->transmute();
(b)
Singleton* const instance =
Singleton::instance(); // cache instance pointer
instance->transmogrify();
instance->metamorphose();
instance->transmute();
Example 13: Instead of writing code like (a), clients should use something like (b).
Third, avoid using a lazily initialized Singleton unless you really need it. The
classic Singleton implementation is based on not initializing a resource until that
resource is requested. An alternative is to use eager initialization; that is, to initialize a
resource at the beginning of the program run. Because multithreaded programs
typically start running as a single thread, this approach can push some object
initializations into the single-threaded startup portion of the code, thus eliminating the
need to worry about threading during the initialization. In many cases, initializing
a Singleton resource during single-threaded program startup (that is, prior to
executing main) is the simplest way to offer fast, thread-safe Singleton access.
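One way to arrange such eager initialization is to call instance during static initialization, before main begins; a sketch (the helper name forceEarlyInit is ours):

```cpp
#include <cassert>

class Singleton {
public:
    static Singleton* instance() {
        if (pInstance == 0) pInstance = new Singleton;
        return pInstance;
    }
private:
    Singleton() {}
    static Singleton* pInstance;
};

Singleton* Singleton::pInstance = 0;

// Runs during single-threaded startup, so instance() needs no lock here.
namespace {
    Singleton* const forceEarlyInit = Singleton::instance();
}
```

This relies on the program creating no threads before main runs; if other translation units spawn threads during static initialization, the usual ordering caveats apply.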
A different way to employ eager initialization is to replace the Singleton Pattern with
the Monostate Pattern (see Steve Ball et al., "Monostate Classes: The Power of One").
Monostate, however, has different problems, especially when it comes to controlling
the order of initialization of the nonlocal static objects that make up its state. Effective
C++ (see "References") describes these problems and, ironically, suggests using a
variant of Singleton to escape them. (The variant is not guaranteed to be thread safe;
see Pattern Hatching: Design Patterns Applied by John Vlissides.)
Finally, DCLP and its problems in C++ and C exemplify the inherent difficulty in
writing thread-safe code in a language with no notion of threading (or any other form
of concurrency). Multithreading considerations are pervasive because they affect the
very core of code generation. As Peter Buhr pointed out in "Are Safe Concurrency
Libraries Possible?" (see "References"), the desire to keep multi-threading out of the
language and tucked away in libraries is a chimera. Do that, and either the libraries
will end up putting constraints on the way compilers generate code
(as Pthreads already does), or compilers and other code-generation tools will be
prohibited from performing useful optimizations, even on single-threaded code. You
can pick only two of the troika formed by multithreading, a thread-unaware language,
and optimized code generation. Java and the .NET CLI, for example, address the
tension by introducing thread awareness into the language and language
infrastructure, respectively (see Doug Lea's Concurrent Programming in Java and
Arch D. Robison's "Memory Consistency & .NET").
Acknowledgments
Drafts of this article were read by Doug Lea, Kevlin Henney, Doug Schmidt, Chuck
Allison, Petru Marginean, Hendrik Schober, David Brownell, Arch Robison, Bruce
Leasure, and James Kanze. Their comments, insights, and explanations greatly
improved the article and led us to our current understanding of DCLP, multithreading,
instruction ordering, and compiler optimizations.
References
David Bacon, Joshua Bloch, Jeff Bogda, Cliff Click, Paul Hahr, Doug Lea, Tom May,
Jan-Willem Maessen, John D. Mitchell, Kelvin Nilsen, Bill Pugh, and Emin Gun
Sirer. The "Double-Checked Locking is Broken" Declaration.
http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html.
Steve Ball and John Crawford. "Monostate Classes: The Power of One." C++ Report,
May 1997. Reprinted in More C++ Gems, Robert C. Martin, ed., Cambridge
University Press, 2000.
Arch D. Robison. "Memory Consistency & .NET." Dr. Dobb's Journal, April 2003.
Douglas C. Schmidt and Tim Harrison. "Double-Checked Locking." In Pattern
Languages of Program Design 3, Robert Martin, Dirk Riehle, and Frank Buschmann,
editors. Addison-Wesley, 1998. http://www.cs.wustl.edu/~schmidt/PDF/DC-Locking.pdf.
To find the roots of volatile, let's go back to the 1970s, when Gordon Bell
(of PDP-11 fame) introduced the concept of memory-mapped I/O (MMIO). Before
that, processors allocated pins and defined special instructions for
performing port I/O. The idea behind MMIO is to use the same pins and
instructions for both memory and port access. Hardware outside the processor
intercepts specific memory addresses and transforms them into I/O requests; so
dealing with ports became simply reading from and writing to machine-specific
memory addresses.
What a great idea. Reducing pin count is good—pins slow down signals, increase
the defect rate, and complicate packaging. Also, MMIO doesn't require special
instructions for ports. Programs just use the memory, and the hardware takes
care of the rest.
Or almost.
*p = a;
*p = b;
The code writes two words to *p, but the optimizer might assume that *p is ordinary
memory and apply dead-assignment elimination, removing the first store entirely.
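Declaring p a pointer to volatile data tells the compiler that every access through p matters, so the first store may not be removed. A sketch, with an ordinary variable standing in for a memory-mapped port:

```cpp
#include <cassert>

unsigned int fakePort = 0; // stands in for a memory-mapped device register

void writeTwice(unsigned int a, unsigned int b) {
    // Pointer to volatile: each write through p must be performed, in order,
    // so dead-assignment elimination may not remove the store of a.
    volatile unsigned int* p = &fakePort;
    *p = a;
    *p = b;
}
```

A real device would observe both writes, in this order; with a plain pointer the compiler is free to keep only the second.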