
ThreadSanitizer – data race detection in practice

Konstantin Serebryany              Timur Iskhodzhanov
OOO Google                         MIPT
7 Balchug st.                      9 Institutskii per.
Moscow, 115035, Russia             Dolgoprudny, 141700, Russia
[email protected]                  [email protected]

ABSTRACT

Data races are a particularly unpleasant kind of threading bug. They are hard to find and reproduce – you may not observe a bug during the entire testing cycle and will only see it in production as rare unexplainable failures. This paper presents ThreadSanitizer – a dynamic detector of data races. We describe the hybrid algorithm (based on happens-before and locksets) used in the detector. We introduce what we call dynamic annotations – a sort of race detection API that allows a user to inform the detector about any tricky synchronization in the user program. Various practical aspects of using ThreadSanitizer for testing multi-threaded C++ code at Google are also discussed.

Categories and Subject Descriptors

D.2.5 [Software Engineering]: Testing and Debugging — Testing tools.

General Terms

Algorithms, Testing, Reliability.

Keywords

Concurrency Bugs, Dynamic Data Race Detection, Valgrind.

1. INTRODUCTION

A data race is a situation when two threads concurrently access a shared memory location and at least one of the accesses is a write. Such bugs are often difficult to find because they happen only under very specific circumstances which are hard to reproduce. In other words, a successful pass of all tests doesn't guarantee the absence of data races. Since races can result in data corruption or a segmentation fault, it is important to have tools for finding existing data races and for catching new ones as soon as they appear in the source code.

The problem of precise race detection is known to be NP-hard (see [20]). However, it is possible to create tools for finding data races with acceptable precision (such tools will miss some races and/or report false warnings).

Virtually every C++ application developed at Google is multithreaded. Most of the code is covered with tests, ranging from tiny unit tests to huge integration and regression tests. However, our codebase had never been studied using a data race detector. Our main task was to implement and deploy a continuous process for finding data races.

2. RELATED WORK

There are a number of approaches to data race detection. The three basic types of detection techniques are: static, on-the-fly and postmortem. On-the-fly and postmortem techniques are often referred to as dynamic.

Static data race detectors analyze the source code of a program (e.g. [11]). It seems unlikely that static detectors will work effectively in our environment: Google's code is large and complex enough that it would be expensive to add the annotations required by a typical static detector.

Dynamic data race detectors analyze the trace of a particular program execution. On-the-fly race detectors process the program's events in parallel with the execution [14, 22]. The postmortem technique consists in writing such events into a temporary file and then analyzing this file after the actual program execution [18].

Most dynamic data race detection tools are based on one of the following algorithms: happens-before, lockset or both (the hybrid type). A detailed description of these algorithms is given in [21]. Each of these algorithms can be used in the on-the-fly and postmortem analysis.

3. HISTORY OF THE PROJECT

Late in 2007 we tried several publicly available race detectors, but all of them failed to work properly "out of the box". The best of these tools was Helgrind 3.3 [8] which had a hybrid algorithm. But even Helgrind had too many false positives and missed many real races. Early in 2008 we modified Helgrind's hybrid algorithm and also introduced an optional pure happens-before mode. The happens-before mode had fewer false positives but missed even more data races than the initial hybrid algorithm. Also, we introduced dynamic annotations (section 5) which helped eliminate false positive reports even in the hybrid mode.

Still, Helgrind did not work for us as effectively as we would like it to — it was still too slow, missed too many races in the pure happens-before mode and was too noisy in the hybrid mode¹.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WBIA '09, Dec 12, New York City, NY
Copyright 2009 ACM 978-1-60558-793-6/12/09 ...$10.00.

So, later in 2008 we implemented our own race detector. We called this tool "ThreadSanitizer". ThreadSanitizer uses a new simple hybrid algorithm which can easily be used in a pure happens-before mode. It supports the dynamic annotations we have suggested for Helgrind. Also, we have tried to make the race reports as informative as possible to make the tool easier to use.

4. ALGORITHM

ThreadSanitizer is implemented as a Valgrind [19] tool². It observes the program execution as a sequence of events. The most important events are memory access events and synchronization events. Memory access events are Read and Write. Synchronization events are either locking events or happens-before events. Locking events are WrLock, RdLock, WrUnlock and RdUnlock. Happens-before events are Signal and Wait³.

These events, generated by the running program, are observed by ThreadSanitizer with the help of the underlying binary translation framework (Valgrind). The detector keeps the state based on the history of the observed events and updates it using a certain state machine. To formally describe the state and the state machine we will need some definitions.

4.1 Definitions

Tid (thread ID): a unique number identifying a thread of the running program.

ID: a unique ID of a memory location⁴.

EventType: one of Read, Write, WrLock, RdLock, WrUnlock, RdUnlock, Signal, Wait.

Event: a triple {EventType, Tid, ID}. We will write EventType_Tid(ID) or EventType(ID) if the Tid is obvious from the context.

Lock: an ID that appeared in a locking event. A lock L is write-held by a thread T at a given point of time if the number of events WrLock_T(L) observed so far is greater than the number of events WrUnlock_T(L). A lock L is read-held by a thread T if it is write-held by T or if the number of events RdLock_T(L) is greater than the number of events RdUnlock_T(L).

Lock Set (LS): a set of locks.

Writer Lock Set (LS_Wr): the set of all write-held locks of a given thread.

Reader Lock Set (LS_Rd): the set of all read-held locks of a given thread.

Event Lock Set: LS_Wr for a Write event and LS_Rd for a Read event.

Event context: the information that allows the user to understand where the given event has appeared. Usually, the event context is a stack trace.

Segment: a sequence of events of one thread that contains only memory access events (i.e. no synchronization events). The context of a segment is the context of the first event in the segment. Each segment has its writer and reader LockSets (LS_Wr and LS_Rd). Each memory access event belongs to exactly one segment.

Happens-before arc: a pair of events X = Signal_TX(A_X) and Y = Wait_TY(A_Y) such that A_X = A_Y, T_X ≠ T_Y and X is observed first.

Happens-before: a partial order on the set of events. Given two events X = TypeX_TX(A_X) and Y = TypeY_TY(A_Y), the event X happens-before or precedes the event Y (in short, X ≺ Y; ⪯ and ⊀ are defined naturally) if X has been observed before Y and at least one of the following statements is true:

• T_X = T_Y.
• {X, Y} is a happens-before arc.
• ∃ E1, E2 : X ⪯ E1 ≺ E2 ⪯ Y (i.e. ≺ is transitive).

The happens-before relation can be naturally defined for segments since segments don't contain synchronization events. Figure 1 shows three different threads divided into segments.

[Figure 1 is a diagram of three threads: T1 = (S1, Signal(H1), S4); T2 = (S2, Wait(H1), S5, Signal(H2), S6); T3 = (S3, Wait(H2), S7).]
Figure 1: Example of happens-before relation. S1 ≺ S4 (same thread); S1 ≺ S5 (happens-before arc Signal_T1(H1) – Wait_T2(H1)); S1 ≺ S7 (happens-before is transitive); S4 ⊀ S2 (no relation).

Segment Set: a set of N segments {S1, S2, ..., SN} such that ∀ i, j : Si ⋠ Sj.

Concurrent: two memory access events X and Y are concurrent if X ⋠ Y, Y ⋠ X and the intersection of the lock sets of these events is empty.

Data Race: a data race is a situation when two threads concurrently access a shared memory location (i.e. there are two concurrent memory access events) and at least one of the accesses is a Write.

4.2 Hybrid state machine

The state of ThreadSanitizer consists of global and per-ID states. The global state is the information about the synchronization events that have been observed so far (lock sets, happens-before arcs). Per-ID state (also called shadow memory or metadata) is the information about each memory location of the running program.

ThreadSanitizer's per-ID state consists of two segment sets: the writer segment set SS_Wr and the reader segment set SS_Rd. SS_Wr of a given ID is a set of segments where the writes to this ID appeared. SS_Rd is a set of all segments where the reads from the given ID appeared, such that ∀ Sr ∈ SS_Rd, Sw ∈ SS_Wr : Sr ⋠ Sw (i.e. all segments in SS_Rd happen-after or are unrelated to segments in SS_Wr).

¹ The current version of Helgrind (3.5) is different; it is faster but has only a pure happens-before mode.
² At some point it was a PIN [15] tool, but the Valgrind-based variant has proved to be twice as fast.
³ In the original Lamport's paper [14] these are called Send and Receive.
⁴ In the current implementation ID represents one byte of memory, so on a 64-bit system it is a 64-bit pointer.

Each memory access is processed with the following procedure. It adds and removes segments from SS_Wr and SS_Rd so that SS_Wr and SS_Rd still match their definitions. At the end, this procedure checks if the current state represents a race.

Handle-Read-Or-Write-Event(IsWrite, Tid, ID)
 1  ▷ Handle event Read_Tid(ID) or Write_Tid(ID)
 2  (SS_Wr, SS_Rd) ← Get-Per-ID-State(ID)
 3  Seg ← Get-Current-Segment(Tid)
 4  if IsWrite
 5    then ▷ Write event: update SS_Wr and SS_Rd
 6      SS_Rd ← {s : s ∈ SS_Rd ∧ s ⋠ Seg}
 7      SS_Wr ← {s : s ∈ SS_Wr ∧ s ⋠ Seg} ∪ {Seg}
 8    else ▷ Read event: update SS_Rd
 9      SS_Rd ← {s : s ∈ SS_Rd ∧ s ⋠ Seg} ∪ {Seg}
10  Set-Per-ID-State(ID, SS_Wr, SS_Rd)
11  if Is-Race(SS_Wr, SS_Rd)
12    then ▷ Report a data race on ID
13      Report-Race(IsWrite, Tid, Seg, ID)

Checking for race follows the definition of race (4.1). Note that the intersection of lock sets happens in this procedure, and not earlier (see also 4.5).

Is-Race(SS_Wr, SS_Rd)
 1  ▷ Check if we have a race.
 2  NW ← Segment-Set-Size(SS_Wr)
 3  for i ← 1 to NW
 4    do W1 ← SS_Wr[i]
 5      LS1 ← Get-Writer-Lock-Set(W1)
 6      ▷ Check all write-write pairs.
 7      for j ← i + 1 to NW
 8        do W2 ← SS_Wr[j]
 9          LS2 ← Get-Writer-Lock-Set(W2)
10          Assert(W1 ⋠ W2 and W2 ⋠ W1)
11          if LS1 ∩ LS2 = ∅
12            then return true
13      ▷ Check all write-read pairs.
14      for R ∈ SS_Rd
15        do LSR ← Get-Reader-Lock-Set(R)
16          if W1 ⋠ R and LS1 ∩ LSR = ∅
17            then return true
18  return false

Our ultimate goal is the race-reporting routine. It prints the contexts of all memory accesses involved in a race and all locks that were held during each of the accesses. See appendix B for an example of output. Once a data race is reported on ID, we ignore the consequent accesses to ID.

Report-Race(IsWrite, Tid, Seg, ID)
 1  (SS_Wr, SS_Rd) ← Get-Per-ID-State(ID)
 2  Print("Possible data race: ")
 3  Print(IsWrite ? "Write" : "Read")
 4  Print(" at address ", ID)
 5  Print-Current-Context(Tid)
 6  Print-Current-Lock-Sets(Tid)
 7  for S ∈ SS_Wr \ Seg
 8    do Print("Concurrent writes: ")
 9      Print-Segment-Context(S)
10      Print-Segment-Lock-Sets(S)
11  if not IsWrite
12    then return
13  for S ∈ SS_Rd \ Seg
14    do if S ⋠ Seg
15      then Print("Concurrent reads: ")
16        Print-Segment-Context(S)
17        Print-Segment-Lock-Sets(S)

4.3 Segments and context

As defined in (4.1), the segment is a sequence of memory access events and the context of the segment is the context of its first event. Recording the segment contexts is critical because without them race reports will be less informative. ThreadSanitizer has three different modes⁵ with regard to creation of segments:

1 (default): Segments are created each time the program enters a new super-block (single-entry multiple-exit region) of code. So, the contexts of all events in a segment belong to a small range of code, always within the same function. In practice, this means that the stack trace of the previous access is nearly precise: the line number of the topmost stack frame may be wrong, but all other line numbers and all function names in the stack traces are exact.

0 (fast): Segments are created only after synchronization events. This means that events inside a segment may have very different contexts and the context of the segment may be much different from the contexts of other events. When reporting a race in this mode, the contexts of the previous accesses are not printed. This mode is useful only for regression testing. For performance data see 7.2.

2 (precise, slow): Segments are created on each memory access (i.e. each segment contains just one event). This mode gives precise stack traces for all previous accesses, but is very slow. In practice, this level of precision is almost never required.

4.4 Variations of the state machine

The state machine described above is quite simple but flexible. With small modifications it can be used as a pure happens-before detector or else it can be enhanced with a special state similar to the initialization state described in [22]. ThreadSanitizer can use either of these modifications (adjustable by a command line flag).

4.4.1 Pure happens-before state machine

As any hybrid state machine, the state machine described above has false positives (see 6.4). It is possible to avoid most (but not all) false positives by using the pure happens-before mode.

Extended happens-before arc: a pair of events (X, Y) such that X is observed before Y and one of the following is true:

• X = WrUnlock_T1(L), Y = WrLock_T2(L).
• X = WrUnlock_T1(L), Y = RdLock_T2(L).
• X = RdUnlock_T1(L), Y = WrLock_T2(L).

⁵ Controlled by the --keep-history=[012] command flag; Memcheck and Helgrind also have similar modes controlled by the flags --track-origins=yes|no and --history-level=none|approx|full respectively.

[Figure 2 is a diagram of two threads: T1 performs RdLock(L), RdUnlock(L); T2 then performs WrLock(L), WrUnlock(L); an arc connects RdUnlock_T1(L) to WrLock_T2(L).]
Figure 2: Extended happens-before arc.

• (X, Y) is a happens-before arc.

If we use the extended happens-before arc in the definition of happens-before relation, we will get the pure happens-before state machine similar to the one described in [9]⁶.

The following example explains the difference between pure happens-before and hybrid modes.

Thread1                Thread2
obj->UpdateMe();       mu.Lock();
mu.Lock();             bool f = flag;
flag = true;           mu.Unlock();
mu.Unlock();           if (f) obj->UpdateMe();

The first thread accesses an object without any lock and then sets the flag under a lock. The second thread checks the flag under a lock and then, if the flag is true, accesses the object again. The correctness of this code depends on the initial value of the flag. If it is false, the two accesses to the object are synchronized correctly; otherwise we have a race. ThreadSanitizer cannot distinguish between these two cases. In the hybrid mode, the tool will always report a data race on such code. In the pure happens-before mode, ThreadSanitizer will behave differently: if the race is real, the race may or may not be reported (it depends on timing, which is why the pure happens-before mode is less predictable); if there is no race, the tool will be silent.

4.4.2 Fast-mode state machine

In most real programs, the majority of memory locations are never shared between threads. It is natural to optimize the race detector for this case. Such an optimization is implemented in ThreadSanitizer and is called fast mode⁷.

Memory IDs in ThreadSanitizer are grouped into cache lines. Each cache line contains 64 IDs and the Tid of the thread which made the first access to this cache line. In fast mode, we ignore all accesses to a cache line until we see an access from another thread. This indeed makes the detection faster — according to our measurements it may increase the performance by up to 2x, see 7.2.

This optimization affects accuracy. Eraser [22] has the initialization state that reduces the number of false positives produced by the lock-set algorithm. Similarly, ThreadSanitizer's fast mode reduces the number of false positives in the hybrid state machine. Both these techniques may also hide real races.

The fast mode may be applied to the pure happens-before state machine, but we don't do this because the resulting detector will miss too many real races.

4.5 Comparison with other state machines

ThreadSanitizer and Eraser [22, 13] use locksets differently. In Eraser, the per-ID state stores the intersection of locksets. In ThreadSanitizer, the per-ID state contains the original locksets (locksets are stored in segments, which are stored in segment sets and, hence, in the per-ID state) and the lockset intersection is computed each time we check for a race. This way we are able to report all locks involved in a race. Surprisingly enough, this extra computation adds only a negligible overhead.

This difference also allows our hybrid state machine to avoid a false report on the following code. The accesses in three different threads do not have any common lock, yet they are correctly synchronized⁸.

Thread1             Thread2             Thread3
mu1.Lock();         mu2.Lock();         mu1.Lock();
mu2.Lock();         mu3.Lock();         mu3.Lock();
obj->Change();      obj->Change();      obj->Change();
mu2.Unlock();       mu3.Unlock();       mu3.Unlock();
mu1.Unlock();       mu2.Unlock();       mu1.Unlock();

ThreadSanitizer's pure happens-before mode finds the same races as the classical Lamport's detector [14] (we did not try to prove it formally though). On our set of unit tests [7], it behaves the same way as other pure happens-before detectors (see appendix A)⁹. The noticeable advantage of ThreadSanitizer in the pure happens-before mode is that it also reports all locks involved in a race — the classical pure happens-before detector knows nothing about locks and can't include them in the report.

5. DYNAMIC ANNOTATIONS

Any dynamic race detector must understand the synchronization mechanisms used by the tested program, otherwise the detector will not work. For programs that use only POSIX mutexes, it is quite possible to hard-code the knowledge about the POSIX API into the detector (most popular detectors do this). However, if the tested program uses other means of synchronization, we have to explain them to the detector. For this purpose we have created a set of dynamic annotations — a kind of race detection API.

Each dynamic annotation is a C macro definition. The macro definitions are expanded into some code which is later intercepted and interpreted by the tool¹⁰. You can find our implementation of the dynamic annotations at [7].

The most important annotations are:

• ANNOTATE_HAPPENS_BEFORE(ptr),
• ANNOTATE_HAPPENS_AFTER(ptr)
These annotations create, respectively, Signal(ptr) and Wait(ptr) events for the current thread; they are used to annotate cases where a hybrid algorithm may produce false reports, as well as to annotate lock-free synchronization. Examples are provided in section 6.4.

⁶ Controlled by the --pure-happens-before command line flag.
⁷ Controlled by the --fast-mode command line flag.
⁸ These cases are rare. During our experiments with Helgrind 3.3, which reported false positives on such code, we saw this situation only twice.
⁹ It would be interesting to compare the accuracy of the detectors on real programs, but in our case it appeared to be too difficult. Other detectors either did not work with our OS and compiler or did not support our custom synchronization utilities and specific synchronization idioms (e.g. synchronization via I/O). Thus we have limited the comparison to the unit tests.
¹⁰ Currently, the dynamic annotations are expanded into function calls, but this is subject to change.

Other annotations include:

• ANNOTATE_PURE_HAPPENS_BEFORE_MUTEX(lock)
Tells the detector to treat lock as in pure happens-before mode (even if all other locks are handled as in hybrid mode). Using this annotation with the hybrid mode we can selectively apply pure happens-before mode to some locks. In the pure happens-before mode this annotation is a no-op.

• ANNOTATE_CONDVAR_LOCK_WAIT(cv, mu)
Creates a Wait(cv) event that matches the cv.Signal() event (cv is a conditional variable, see 6.4.1).

• ANNOTATE_BENIGN_RACE(ptr)
Tells that races on the address "ptr" are benign.

• ANNOTATE_IGNORE_WRITES_BEGIN,
• ANNOTATE_IGNORE_WRITES_END
Tell the tool to ignore all writes between these two annotations. Similar annotations for reads also exist.

• ANNOTATE_RWLOCK_CREATE(lock),
• ANNOTATE_RWLOCK_DESTROY(lock),
• ANNOTATE_RWLOCK_ACQUIRED(lock, isRW),
• ANNOTATE_RWLOCK_RELEASED(lock, isRW)
Are used to annotate a custom implementation of a lock primitive.

• ANNOTATE_PUBLISH_MEMORY_RANGE(ptr, size)
Reports that the bytes in the range [ptr, ptr + size) are about to be published safely. The race detector will create a happens-before arc from this call to subsequent accesses to this memory. Usually required only for hybrid detectors.

• ANNOTATE_UNPUBLISH_MEMORY_RANGE(ptr, size)
Opposite to ANNOTATE_PUBLISH_MEMORY_RANGE. Reports that the bytes in the range [ptr, ptr + size) are not shared between threads any more and can be safely used by the current thread w/o synchronization. The race detector will create a happens-before arc from all previous accesses to this memory to this call. Usually required only for hybrid detectors.

• ANNOTATE_NEW_MEMORY(ptr, size)
Tells that new memory has been allocated by a custom allocator.

• ANNOTATE_THREAD_NAME(name)
Tells the name of the current thread to the detector.

• ANNOTATE_EXPECT_RACE(ptr)
Is used to write unit tests for the race detector.

With the dynamic annotations we can eliminate all false reports of the hybrid detector and hide benign races. As a result, ThreadSanitizer will find more races (as compared to a pure happens-before detector) but will not report false races.

6. RACE DETECTION IN PRACTICE

6.1 Performance

Performance is critical for the successful use of race detectors, especially in large organizations like Google. First, if a detector is too slow, it will be inconvenient to use it for manual testing or debugging. Second, a slower detector will require more machine resources for regular testing, and the machine resources cost money. Third, most of the C++ applications at Google are time-sensitive and will simply fail due to protocol timeouts if slowed down too much.

When we first tried Helgrind 3.3 on a large set of unit tests, almost 90% of them failed due to slowdown. With our improved variant of Helgrind and, later, with ThreadSanitizer we were able to achieve more than a 95% pass rate. In order to make the remaining tests pass, we had to change various timeout values in the tests¹¹.

On an average Google unit test or application the slowdown is 20-50 times, but in extreme cases the slowdown could be as high as 10000 times (as for example an artificial stress test for a race detector) or as low as 2 times (the test mostly sleeps or waits for I/O). See also section 7.2.

ThreadSanitizer spends almost all the time intercepting and analyzing memory accesses. If a given memory location has been accessed by just one thread, the analysis is fast (especially in the fast mode, see section 4.4.2). If a memory location has been accessed by many threads and there have been a lot of synchronization events, the analysis is slow.

So, there are two major ways to speed up the tool: make the analysis of one memory access faster and analyze fewer memory accesses. In order to make the analysis of one access faster, we used various well known techniques and algorithms such as vector time-stamps ([9]) and caching. We also limited the size of a segment set with a small constant (currently, 4) to avoid huge slowdowns in corner cases. But whatever we do to speed up the analysis, the overhead will always remain significant: remember that we replace a memory access instruction with a call to a quite sophisticated function that usually runs for a few hundred CPU cycles.

A much more attractive approach is to reduce the number of analyzed memory accesses. For example, ThreadSanitizer does not instrument the internals of the threading library (there is no sense in analysing races on the internal representation of a mutex). The tool also supports a mechanism to ignore parts of the program marked as safe by the user¹². In some cases this allows speeding up the run by 2-3 times by ignoring a single hot spot.

Adaptive instrumentation [16] seems promising. We also plan to use static analysis performed by a compiler to skip instrumentation when we can prove thread-safety statically.

Another way to reduce the number of analyzed memory accesses is to run the tool on an optimized binary. Unfortunately, the current implementation does not work well with fully optimized code (e.g. gcc -O2)¹³, but the following gcc flags give 50%-100% speedup compared to a non-optimized compilation while maintaining the same level of usability: gcc -O1 -g -fno-inline -fno-omit-frame-pointer -fno-builtin¹⁴.

¹¹ This had a nice side effect. Many of the tests that were failing regularly under ThreadSanitizer were known to be flaky (they were sometimes failing when running natively) and ThreadSanitizer helped to find the reason of that flakiness just by making tests slower.
¹² Controlled by the --ignore command line flag.
¹³ This is a limitation of gcc, Valgrind and ThreadSanitizer.

6.2 Memory consumption

The memory consumption of ThreadSanitizer consists mostly of the following overhead:

• A constant size buffer that stores segments, including stack traces. By default, there are 2²³ segments and each occupies ≈100 bytes (≈50 bytes in 32-bit mode). So, the buffer is ≈800M. Decreasing this size may lead to losing some data races. If we are not tracking the contexts of previous accesses (see 4.3), the segments occupy much less memory (≈250M).

• Vector time clocks attached to each segment. This memory is limited by the number of threads times the number of segments, but in most cases it is quite small.

• Per-ID state. In the fast mode, the memory required for per-ID state linearly depends on the amount of memory shared between more than one thread. In the full hybrid and in the pure happens-before modes, the footprint is a linear function of all memory in the program. However, these are the worst case assumptions and in practice a simple compression technique reduces the memory usage significantly.

• Segment sets and locksets may potentially occupy an arbitrarily large amount of memory, but in reality they constitute only a small fraction of the overhead.

All these objects are automatically recycled when applicable. On an average Google unit test the memory overhead is within 3x-4x (compared to a native run). Obviously, a test will fail under ThreadSanitizer if there is not enough RAM in the machine. Almost all unit tests we have tried require less than 4G when running under ThreadSanitizer. Real applications may require 8G and more. See also 7.2 for the actual numbers.

6.2.1 Flushing state

Even though the memory overhead of ThreadSanitizer is sane on average, there are cases when the tool would consume all the memory it could get. In order to stay robust, ThreadSanitizer flushes all its internal state when the memory overhead is above a certain limit (supplied by the user or derived from ulimit) or when the tool has used all available segments and none of them can be recycled. Obviously, if a flush happens between two memory accesses which race with each other, such a race will be missed, but the probability of such a situation is low.

6.3 Common real races

In this section we will show examples of the most frequent races found in our C++ code. The detailed analysis of some of these races is given at [7].

6.3.1 Simple race

The simplest possible data race is the most frequent one: two threads are accessing a variable of a built-in type without any synchronization. Quite frequently, such races are benign (the code counts some statistic that is allowed to be imprecise). But sometimes such races are extremely harmful (e.g. see 7.1).

Thread1       Thread2
int v;        ...
v++;          v++;

6.3.2 Race on a complex type

Another popular race happens when two threads access a non-thread-safe complex object (e.g. an STL container) without synchronization. These are almost always dangerous.

Thread1                    Thread2
std::map<int,int> m;       ...
m[123] = 1;                m[345] = 0;

6.3.3 Notification

A data race occurs when a boolean or an integer variable is used to send notifications between threads. This may work correctly with some combination of compiler and hardware, but for portability we do not recommend programmers to assume implicit semantics of the target architecture.

Thread1               Thread2
bool done = false;    ...
while (!done)         done = true;
  sleep(1);

6.3.4 Publishing objects without synchronization

One thread initializes an object pointer (which was initially null) with a new value, another thread spins until the object pointer becomes non-null. Without proper synchronization, the compiler may do surprising transformations (code motion) with such code which will lead to (occasional) failures. In addition to that, on some architectures this race may cause failures due to cache-related effects.

Thread1                 Thread2
MyObj *obj = NULL;      while (obj == NULL)
...                       yield();
obj = new MyObj();      obj->DoSomething();

6.3.5 Initializing objects without synchronization

static MyObj *obj = NULL;
void InitObj() {
  if (!obj)
    obj = new MyObj();
}

Thread1       Thread2
InitObj();    InitObj();

This may lead e.g. to memory leaks (the object may be constructed twice).

6.3.6 Write during a ReaderLock

Updates happening under a reader lock.

Thread1                 Thread2
mu.ReaderLock();        mu.ReaderLock();
var++;                  var++;
mu.ReaderUnlock();      mu.ReaderUnlock();

¹⁴ This applies to gcc 4.4 on x86_64.

6.3.7 Adjacent bit fields
The code below looks correct at first glance. But if x is "struct { int a:4, b:4; }", we have a bug.

    Thread1    Thread2
    x.a++;     x.b++;

6.3.8 Double-checked locking
The so-called double-checked locking is well known to be an anti-pattern ([17]), but we still find it occasionally (mostly in old code).

bool inited = false;
void Init() {
  // May be called by multiple threads.
  if (!inited) {
    mu.Lock();
    if (!inited) {
      // .. initialize something
    }
    inited = true;
    mu.Unlock();
  }
}

6.3.9 Race during destruction
Sometimes objects are created on the stack, passed to another thread and then destroyed without waiting for the second thread to finish its work.

void Thread1() {
  SomeType object;
  ExecuteCallbackInThread2(
      SomeCallback, &object);
  ...
  // "object" is destroyed when
  // leaving its scope.
}

6.3.10 Race on vptr
Class A has a function Done(), virtual function F() and a virtual destructor. The destructor waits for the event generated by Done(). There is also a class B, which inherits A and overrides A::F().

class A {
 public:
  A() {
    sem_init(&sem_, 0, 0);
  }
  virtual void F() {
    printf("A::F\n");
  }
  void Done() {
    sem_post(&sem_);
  }
  virtual ~A() {
    sem_wait(&sem_);
    sem_destroy(&sem_);
  }
 private:
  sem_t sem_;
};

class B : public A {
 public:
  virtual void F() {
    printf("B::F\n");
  }
  virtual ~B() { }
};

static A *obj = new B;

An object obj of static type A and dynamic type B is created. One thread executes obj->F() and then signals to the second thread. The second thread calls delete obj (i.e. B::~B) which then calls A::~A, which, in turn, waits for the signal from the first thread. The destructor A::~A overwrites the vptr (pointer to virtual function table) to A::vptr. So, if the first thread executes obj->F() after the second thread started executing A::~A, then A::F will be called instead of B::F.

    Thread1        Thread2
    obj->F();
                   delete obj;
    obj->Done();

6.4 Common false positives
Here we show the three most common types of false positives, i.e. the situations where the code is correctly synchronized, but ThreadSanitizer will report a race. The annotations given in the code examples explain the synchronization to the tool; with these annotations no reports will appear.

6.4.1 Condition variable

    Thread1             Thread2
    obj->UpdateMe();    mu.Lock();
    mu.Lock();          while (!c)
    c = true;             cv.Wait(&mu);
    cv.Signal();        ANNOTATE_CONDVAR_LOCK_WAIT(&cv, &mu);
    mu.Unlock();        mu.Unlock();
                        obj->UpdateMe();

This is a typical usage of a condition variable [12]: the two accesses to obj are serialized. Unfortunately, it may be misunderstood by the hybrid detector. For example, Thread1 may set the condition to "true" and leave the critical section before Thread2 enters the critical section for the first time and blocks on the condition variable. The condition of the while(!c) loop will never be true and the cv.Wait() method won't be called. As a result, the happens-before dependency will be missed.

6.4.2 Message queue
Some message queues may also be unfriendly to the hybrid detector.

class Queue {
 public:
  void Put(int *ptr) {
    mu_.Lock();
    queue_.push_back(ptr);
    ANNOTATE_HAPPENS_BEFORE(ptr);
    mu_.Unlock();
  }
  int *Get() {
    int *res = NULL;
    mu_.Lock();
    if (!queue_.empty()) {
      res = queue_.front();
      ANNOTATE_HAPPENS_AFTER(res);
      queue_.pop_front();
    }
    mu_.Unlock();
    return res;
  }
 private:
  std::deque<int*> queue_;  // a container with push_back()/pop_front()
  Mutex mu_;
};

The queue implementation above does not use any happens-before synchronization mechanism but it does actually create a happens-before dependency between Put() and Get().
    Thread1          Thread2
    *ptr = ...;      ptr = queue.Get();
    queue.Put(ptr);  if (ptr)
                       *ptr = ...;

A message queue may be implemented via atomic operations (i.e. without any Mutex). In this case even a pure happens-before detector may report false positives.

6.4.3 Reference counting
Another frequent cause of false positives is reference counting. As with message queues, mutex-based reference counting will result in false positives in the hybrid mode, while reference counting implemented via atomics will confuse even the pure happens-before mode. And again, the annotations allow the tool to understand the synchronization.

class SomeReferenceCountedClass {
 public:
  void Unref() {
    ANNOTATE_HAPPENS_BEFORE(&ref_);
    if (AtomicIncrement(&ref_, -1) == 0) {
      ANNOTATE_HAPPENS_AFTER(&ref_);
      delete this;
    }
  }
  ...
 private:
  int ref_;
};

6.5 General advice
Applying a data race detector to an arbitrary C++ program may be arbitrarily hard. However, if the developers follow several simple rules, race detectors can be used at full power. Here we summarize the recommendations we give to C++ developers at Google.

First of all, variables shared between threads are best protected by a mutex. Always use a mutex unless you know for sure that it causes a significant performance loss.

When possible, try to reuse the existing standard synchronization primitives (e.g. message queues, reference counting utilities, etc.) instead of re-inventing the wheel. If you really need your own synchronization mechanism, annotate it with dynamic annotations (section 5).

Avoid using condition variables directly as they are not friendly to hybrid detectors. Instead, wrap the condition loop while(!c) cv.Wait(&mu) into a separate function and annotate it (6.4.1). In Google's internal C++ library such a function is a part of the Mutex API.

Try not to use atomic operations directly. Instead, wrap the atomic operations into functions or classes that implement certain synchronization patterns.

Remember that dynamic data race detection (as well as most other kinds of dynamic analysis) is slow. Do not hardcode any timeout values into your program. Instead, allow the timeout values to be changed via command line flags, environment variables or configuration files.

Never use sleep() as synchronization between threads, even in unit tests.

Don't over-synchronize. Excessive synchronization may be just as incorrect as no synchronization at all, but it may hide real races from data race detectors.

6.5.1 Choosing the mode
Which of the three modes of ThreadSanitizer should one choose?

If you are testing an existing software project, we suggest you start with the pure happens-before mode (4.4.1). Unless you have lock-free synchronization (which you will have to annotate), every reported race will be real.

Once you have fixed all reports from the pure happens-before mode (or if you are starting a new project), switch to the fast mode (4.4.2). You may see a few false reports (6.4), which can be easily eliminated. If your aim is to find the maximal number of bugs and you agree to spend some more time on annotations, use the full hybrid mode (4.2).

For regression testing, prefer the hybrid mode (either full or fast) because it is more predictable. It is often the case that a race is detected only on one of 10-100 runs by the pure happens-before mode, while the hybrid mode finds it in each run.

7. RACE DETECTION FOR CHROMIUM
One of the applications we test with ThreadSanitizer is Chromium [1], an open-source browser project.

The code of the Chromium browser is covered by a large number of tests, including unit tests, integration tests and interactive tests running the real application. All these tests are continuously run on a large number of test machines with different operating systems. Some of these machines run tests under Memcheck (the Valgrind tool which finds memory-related errors, see [8]) and ThreadSanitizer. When a new error (either a test failure or a race report from ThreadSanitizer) is found after a commit to the repository, the committer of the change is notified. These reports are available for other developers and maintainers as well.

We have found and fixed a few dozen data races in Chromium itself, and in some third party components used by this project. You may find all these bugs by searching for label:ThreadSanitizer at www.crbug.com.

7.1 Top crasher
One of the first data races we found in Chromium happened to be the cause of a serious bug, which had been observed for several months but had not been understood nor fixed. (Footnote 15: See the bug entries http://crbug.com/18488 and http://crbug.com/15577, describing the race and the crashes, respectively.) The data race happened on a class called RefCounted. The reference counter was incremented and decremented from multiple threads without synchronization. When the race actually occurred (which happened very rarely), the value of the counter became incorrect. This resulted in either a memory leak or in two calls of delete on the same memory. In the latter case, the internals of the memory allocator were corrupted and one of the subsequent calls to malloc failed with a segmentation fault.

The cause of these failures was not understood for a long time because the failure never happened during debugging, and the failure stack traces were in a different place. ThreadSanitizer found this data race in a single run.

The fix for this data race was simple. Instead of the RefCounted class we needed to use RefCountedThreadSafe, the class which implements reference counting using atomic instructions.
Table 1: Time and space overhead compared to Helgrind and Memcheck on Chromium tests.
The performance of ThreadSanitizer is close to Memcheck. On large tests (e.g. unit), ThreadSanitizer can be twice as fast as Helgrind. The memory consumption is also comparable to Memcheck and Helgrind. The native row gives absolute run time and memory; every other cell is a pair of time and memory multipliers relative to native.
app base ipc net unit
native 3s 172M 77s 1811M 5s 325M 50s 808M 43s 914M
Memcheck-no-hist 6.7x 2.0x 1.7x 1.1x 5.2x 1.1x 3.0x 1.6x 14.8x 1.7x
Memcheck 10.5x 2.6x 2.2x 1.1x 8.2x 1.2x 5.1x 2.3x 29.7x 1.9x
Helgrind-no-hist 13.9x 2.7x 1.8x 1.8x 5.4x 1.5x 4.5x 2.2x 48.7x 3.4x
Helgrind 14.9x 3.8x 1.7x 1.9x 6.7x 1.7x 11.9x 2.5x 62.3x 3.8x
TS-fast-no-hist 6.2x 4.2x 2.2x 1.2x 11.1x 1.8x 3.9x 1.7x 19.2x 2.2x
TS-fast 7.9x 7.6x 2.4x 1.5x 12.0x 3.6x 4.7x 2.4x 21.6x 2.8x
TS-full-no-hist 8.4x 4.2x 2.4x 1.2x 11.3x 1.8x 4.7x 1.6x 22.3x 2.3x
TS-full 13.8x 7.4x 2.8x 1.5x 11.9x 3.6x 6.3x 2.3x 28.6x 2.5x
TS-phb-no-hist 8.3x 4.2x 2.8x 1.2x 11.2x 1.8x 4.7x 1.8x 23.0x 6.2x
TS-phb 14.2x 7.4x 2.6x 1.5x 11.8x 3.6x 6.2x 2.3x 28.6x 2.5x
7.2 Performance evaluation on Chromium
We used Chromium unit tests for the performance evaluation of ThreadSanitizer. We compared our tool with Helgrind and Memcheck 3.5.0 [8]. Even though Memcheck is not a race detector, it performs similar instrumentation; this tool is well known for its high quality and practical usefulness.

Table 1 gives the summary of the results. ThreadSanitizer was run in three modes: --pure-happens-before=yes (phb), --fast-mode=yes (fast) and --fast-mode=no (full). Similarly to ThreadSanitizer, Helgrind and Memcheck have modes where the history of previous accesses is not tracked (4.3). In the table, such modes are marked with no-hist.

The tests were built using the gcc -O1 -g -fno-inline -fno-omit-frame-pointer -fno-builtin flags for the x86_64 platform and run on an Intel Core 2 Q6600 with 8 GB of RAM.

As may be seen from Table 1, the performance of ThreadSanitizer is close to Memcheck's. The average slowdown compared to the native run is less than 30x. On large tests like unit, ThreadSanitizer can be twice as fast as Helgrind.

The memory consumption is also comparable to Memcheck and Helgrind. ThreadSanitizer allocates a large constant-size buffer of segments (see 6.2), hence on small tests it consumes more memory than the other tools.

ThreadSanitizer flushes its state (see 6.2.1) 90 times on unit, 34 times on net and 4 times on base test sets when running in the full or pure happens-before modes with history tracking enabled. In the fast mode and with disabled history tracking ThreadSanitizer never flushes its state on these tests.

8. CONCLUSIONS
In this paper we have presented ThreadSanitizer, a dynamic detector of data races. ThreadSanitizer uses a new algorithm; it has several modes of operation, ranging from the most conservative mode (which has few false positives but also misses real races) to a very aggressive one (which has more false positives but detects the largest number of real races). To the best of our knowledge, ThreadSanitizer has the most detailed output and it is the only dynamic race detector with both hybrid and pure happens-before modes.

We have introduced the dynamic annotations, a sort of API for a race detector. Using the dynamic annotations together with the most aggressive mode of ThreadSanitizer enables us to find the largest number of real races while keeping zero noise level (no false positives or benign races are reported).

ThreadSanitizer is heavily used at Google for testing various C++ applications, including Chromium. In this paper we discussed a number of practical issues which we have faced while deploying ThreadSanitizer.

We believe that ThreadSanitizer has noticeable advantages over other dynamic race detectors in terms of practical use. The current implementation of ThreadSanitizer is built on top of the Valgrind binary translation framework and can be used to test C/C++ programs on Linux and Mac. The source code of ThreadSanitizer is published under the GPL license and can be downloaded at [7].

9. ACKNOWLEDGMENTS
We would like to thank Mike Burrows, the co-author of Eraser [22], for his great support of our project at Google and for many algorithmic suggestions, and Julian Seward, the author of Valgrind and Helgrind [8, 19], for his amazing tools and fruitful discussions.

10. REFERENCES
[1] Chromium project. http://dev.chromium.org.
[2] Intel Parallel Studio. http://software.intel.com/en-us/intel-parallel-studio-home.
[3] Intel Thread Checker. http://software.intel.com/en-us/intel-thread-checker.
[4] Multi-Thread Run-time Analysis Tool for Java. http://www.alphaworks.ibm.com/tech/mtrat.
[5] Pin - a dynamic binary instrumentation tool. http://www.pintool.org.
[6] Sun Studio. http://developers.sun.com/sunstudio.
[7] ThreadSanitizer project: documentation, source code, dynamic annotations, unit tests. http://code.google.com/p/data-race-test.
[8] Valgrind project. Home of Memcheck, Helgrind and DRD. http://www.valgrind.org.
[9] U. Banerjee, B. Bliss, Z. Ma, and P. Petersen. A theory of data race detection. In PADTAD '06: Proceedings of the 2006 workshop on Parallel and distributed systems: testing and debugging, pages 69-78, New York, NY, USA, 2006. ACM.
[10] U. Banerjee, B. Bliss, Z. Ma, and P. Petersen. Unraveling Data Race Detection in the Intel Thread
Checker. In First Workshop on Software Tools for Multi-core Systems (STMCS), in conjunction with IEEE/ACM International Symposium on Code Generation and Optimization (CGO), March, volume 26, 2006.
[11] D. Engler and K. Ashcraft. RacerX: effective, static detection of race conditions and deadlocks. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 237-252, New York, NY, USA, 2003. ACM.
[12] F. Garcia and J. Fernandez. POSIX thread libraries. Linux J., 70es (Feb. 2000):36, 2000.
[13] J. J. Harrow. Runtime checking of multithreaded applications with Visual Threads. In Proceedings of the 7th International SPIN Workshop on SPIN Model Checking and Software Verification, pages 331-342, London, UK, 2000. Springer-Verlag.
[14] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558-565, 1978.
[15] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, pages 190-200, New York, NY, USA, 2005. ACM.
[16] D. Marino, M. Musuvathi, and S. Narayanasamy. LiteRace: effective sampling for lightweight data-race detection. In PLDI '09: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, pages 134-143, New York, NY, USA, 2009. ACM.
[17] S. Meyers and A. Alexandrescu. C++ and the Perils of Double-Checked Locking: Part I. Dr. Dobb's Journal, 29:46-49, 2004.
[18] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data races using replay analysis. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 22-31, New York, NY, USA, 2007. ACM.
[19] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 89-100, New York, NY, USA, 2007. ACM.
[20] R. H. B. Netzer and B. P. Miller. What are race conditions?: Some issues and formalizations. ACM Letters on Programming Languages and Systems (LOPLAS), 1(1):74-88, 1992.
[21] R. O'Callahan and J.-D. Choi. Hybrid dynamic data race detection. In PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 167-178, New York, NY, USA, 2003. ACM.
[22] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: a dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst., 15(4):391-411, 1997.

APPENDIX

A. OTHER RACE DETECTORS
Here we briefly describe some of the race detectors available for download.

Helgrind is a tool based on Valgrind [19, 8]. Helgrind 3.5 is a pure happens-before detector; it supports a subset of the dynamic annotations described in section 5. Part of ThreadSanitizer's instrumentation code is derived from Helgrind. DRD is one more Valgrind-based race detector with similar properties.

Intel Thread Checker [3, 10, 9] is a pure happens-before race detector. It supports an analog of dynamic annotations (a subset). It works on Linux and Windows. Thread Checker's latest reincarnation is called Intel Parallel Inspector [2] and is based on PIN [15, 5]. As of November 2009, the Parallel Inspector is available only for Windows.

Sun Thread Analyzer, a part of Sun Studio [6], is a hybrid race detector. It supports an analog of dynamic annotations (a small subset). It works only together with the Sun Studio compiler (so we did not try it for our real tasks).

IBM MTRAT [4] is a race detector for Java. It uses some variant of a hybrid state machine and does not support any annotations. As of the version from March 2009, the noise level seems to be rather high.

B. EXAMPLE OF OUTPUT
Here we give a simple test case where a wrong mutex is used in one place. For more examples refer to [7].

Mutex mu1;  // This Mutex guards var.
Mutex mu2;  // This Mutex is not related to var.
int var;

// Runs in thread named 'test-thread-1'
void Thread1() {
  mu1.Lock();  // Correct Mutex.
  var = 1;
  mu1.Unlock();
}

// Runs in thread named 'test-thread-2'
void Thread2() {
  mu2.Lock();  // Wrong Mutex.
  var = 2;
  mu2.Unlock();
}

The output of ThreadSanitizer will contain stack traces for both memory accesses, the names of both threads, information about the locks held during each access and a description of the memory location.

WARNING: Possible data race during write of size 4
   T2 (test-thread-2) (locks held: {L134}):
    #0 Thread2() racecheck_unittest.cc:7034
    #1 MyThread::ThreadBody(MyThread*) ...
  Concurrent write(s) happened at these points:
   T1 (test-thread-1) (locks held: {L133}):
    #0 Thread1() racecheck_unittest.cc:7029
    #1 MyThread::ThreadBody(MyThread*) ...
  Address 0x63F260 is 0 bytes inside data symbol "var"

  Locks involved in this report: {L133, L134}
   L133
    #0 Mutex::Lock() ...
    #1 Thread1() racecheck_unittest.cc:7028 ...
   L134
    #0 Mutex::Lock() ...
    #1 Thread2() racecheck_unittest.cc:7033 ...