Memory Model For Multithreaded C++
Andrei Alexandrescu, Hans Boehm, Kevlin Henney, Doug Lea, Bill Pugh
WG21/N1680=J16/04-0120
Date: 10 September 2004
Reply to: Andrei Alexandrescu
          andrei@metalanguage.com
          2216 NE 46th St. Apt. A
          Seattle WA 98105 USA
          Hans Boehm
          Kevlin Henney
          Bill Pugh
Abstract
The C++ Standard defines single-threaded program execution. Fundamentally, multithreaded execution requires a much more refined memory and execution model. C++ threading libraries are in the awkward situation of specifying (implicitly or explicitly) an extended memory model
for C++ in order to specify program execution. We propose integrating
a memory model suitable for multithreaded execution in the C++ Standard. On top of that model, we propose a standard threading library.
Introduction
The current C++ Standard says nothing about multithreaded execution, so many
basic questions about the behavior of multithreaded C++ programs simply do
not have answers. Superstition, myths, bad advice, and "hey, it worked for
me" stories are rampant.
Absent a clearly defined memory model as a common ground between the
compiler, the hardware, the threading library, and the programmer, multithreaded C++ code is fundamentally at odds with compiler- and processor-level optimizations [4]. Even relatively simple uses can fail unexpectedly due
to unforeseen compiler optimizations. These optimizations are safe in a single-threaded context, but not with multiple threads. Additionally, the specification
does not address at all the interactions of memory operations with atomic update operations or constructs such as locks built out of them. (The pthreads
specification does not help much here, since locking semantics are not fully
integrated with the semantics of other memory operations.)
These problems can be solved without introducing additional keywords or
syntactic constructs. The main plan of attack is:
1. Specification of an abstract memory model describing the interactions
between threads and memory.
2. Application of this model to existing aspects of the C++ specification to
replace the current implicitly sequential semantics. This will entail new
constraints on how compilers can emit and optimize code. In particular,
this will entail a reworking of the specification of volatile to provide
useful multithreaded semantics.
3. Introduction of a small number of standard library classes providing standardized access to atomic update operations (such as compare_and_set).
These classes will have multithreaded semantics integrated with the above
specifications for other memory operations. Thus, compilers will need to
treat these as intrinsics. These operations form the low-level basis for
modern multithreaded synchronization constructs such as locks, and are
also required in the construction of efficient non-blocking data structures.
4. Definition of a standard thread library that provides similar functionality
to pthreads and Win32 threads, but meshes with the rest of the C++
standard.
This is an ambitious proposal, and the current draft represents only the first
step of a long process. In the remainder of this draft, we briefly describe the
fundamental issues, some concrete steps in addressing them, and sketch out
remaining work.
A Memory Model
A memory model describes the behavior of threads with respect to basic memory
operations, mainly reads and writes of variables potentially accessible across
multiple threads. The main questions addressed by a memory model concern
which writes a given read may observe, and in what order the effects of one
thread's operations become visible to other threads.
Memory Effects
A memory model defines categories of memory actions. For example, actions that
act as acquire and release operations guarantee visibility of a set of updates by
one thread to another. The next step for a language specification is to map
these notions to all of the memory-related constructions in the language. This
process entails nailing down a large set of small issues that are necessary for
programmers to be able to predict and control effects. Areas that we have so
far identified include:
Atomicity A given platform may guarantee atomicity only for reads and writes
of certain bit widths and alignments. The spec must permit these to
vary, and must therefore provide some means for programs to query these
properties.
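Purely for illustration, a query facility of that kind might look like the
following sketch; every name here is invented for this example and is not
part of the proposed interface.

// Hypothetical sketch only: one way a program might query the platform's
// atomic access guarantees. All names are invented for illustration.
#include <cstddef>

struct atomic_access_traits {
    // largest size (in bytes) for which a suitably aligned load or store
    // is guaranteed to be atomic on this platform
    static const std::size_t max_atomic_size = sizeof(void*);
    // whether natural alignment is required for that guarantee
    static const bool natural_alignment_required = true;
};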
Extra writes There are several cases in C++ in which compilers and machines
have historically been permitted to issue writes that are not obvious from
inspection of source code. The most notable examples involve structures
with small fields. For example, given:
struct S { short a; char b; char c; } s;
an assignment such as s.a = 0 might be executed as if the code were
*(int*)&s = 0 if a compiler infers from context that b and c are zero as
well, as in the following example:
void Fun(S& s) {
    if (s.b == 0 && s.c == 0) {
        s.a = 0;
    }
}
Another instance of this problem is with hardware that doesn't support
memory writes below a certain size (often, sizeof(int)). In that case, if
S is packed such that sizeof(S) == sizeof(int), the compiler must
generate code for s.a = 0 to read the entire s object into a register, mask
a portion of it, and write back the entire object to memory. Effectively,
simply writing the field S::a becomes a (possibly non-atomic) read-write
operation as far as the neighboring fields S::b and S::c are concerned.
This would be unexpected at best in a multithreaded context in which the
other fields were also being assigned concurrently. A spec must clearly
define whether and when such compiler transformations remain legal.
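To make the argument concrete, the following sketch (ours, not part of the
proposal text) spells out the read-modify-write sequence such a compiler must
emit for s.a = 0, assuming sizeof(S) == sizeof(int) and a little-endian
layout with a in the low-order bytes.

// Illustration only: the effective code for "s.a = 0" on hardware that
// cannot write units smaller than sizeof(int).
struct S { short a; char b; char c; };

void assign_a_zero(S& s) {
    int whole = *reinterpret_cast<int*>(&s);  // read the entire object
    whole &= ~0xFFFF;                         // clear the bytes holding 'a'
                                              // (layout assumed for illustration)
    *reinterpret_cast<int*>(&s) = whole;      // write the entire object back
    // A concurrent assignment to s.b or s.c made between the read and the
    // write-back is silently overwritten with the stale values.
}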
Volatile data In the current language spec, the volatile qualifier is mainly
used to indicate a guaranteed order of reads and writes within single-threaded
semantics, for example for device control registers, memory-mapped I/O, or
opaque flow (as in setjmp or interrupts). In a multithreaded language, it may
be useful for volatile to take on the extra burden of constraining inter-thread
visibility and ordering properties. There are a few options for the detailed
semantics. In the simplest, volatile reads act as acquire operations and
volatile writes as release operations. This has the virtue of being relatively
easy to use by programmers who are not intimately familiar with memory models.
For example, the infamous double-checked locking idiom [4] works as expected
under these rules if references are declared as volatile (and the other
lock-based rules below are followed). This has the disadvantage of imposing
heavier constraints on the compiler and processor than necessary in very
performance-sensitive applications. However, optimizers can often eliminate
unnecessary operations (such as consolidating several consecutive acquire and
release operations into one). To help optimizers in the few remaining
situations, weaker forms could also be provided as operations on atomic
variables, as discussed below.
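For concreteness, here is a sketch (ours) of the double-checked locking idiom
as it would be written under these proposed volatile semantics; Mutex and
Lock are placeholders for whatever locking facility is used.

// Sketch only: double-checked locking under the proposed rule that
// volatile reads act as acquire and volatile writes as release.
class Mutex { /* unspecified mutual-exclusion primitive */ };
class Lock {                           // scoped-locking placeholder
public:
    explicit Lock(Mutex&) { /* acquire the mutex */ }
    ~Lock()               { /* release the mutex */ }
};

class Singleton {
public:
    static Singleton* instance() {
        if (pInstance == 0) {              // volatile read: acquire
            Lock guard(mutex);
            if (pInstance == 0) {
                pInstance = new Singleton; // volatile write: release,
            }                              // ordered after construction
        }
        return pInstance;
    }
private:
    static Singleton* volatile pInstance;
    static Mutex mutex;
};

Singleton* volatile Singleton::pInstance = 0;
Mutex Singleton::mutex;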
Opaque calls One concern about moving to multithreaded specifications is
that compilers may become overly conservative when compiling code with
opaque function calls: flushing and reloading registers and/or issuing
memory barriers in case the called function's effects depend on this. It
may be desirable to allow programmers to control this using some kind
of qualifier. Options include those with defaults in both directions; for
example, assuming lack of effects unless a function is qualified as, say,
volatile; versus assuming effects unless qualified with some extended form
of const. Alternatively, or in addition, the spec could include a means for
programmers to tell compilers that a certain program is either definitely
single-threaded or definitely multithreaded, as a way of controlling certain
optimizations. Further exploration of the options and their consequences is
needed. These considerations are closely related to an existing C++
standardization proposal [2].
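A small example (ours, not from the proposal) of the compilation problem such
a qualifier would address: because the compiler cannot see whether the opaque
call touches the shared variable, it must conservatively reload it around
every call.

// Illustration of the opaque-call problem. The compiler cannot see whether
// external_work() reads or writes 'shared', directly or via another thread.
extern int shared;
void external_work();     // defined in another translation unit

int sum_until_negative() {
    int total = 0;
    while (shared >= 0) {
        total += shared;
        external_work();  // may the compiler keep 'shared' in a register
                          // across this call? A qualifier could say so.
    }
    return total;
}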
Atomics
One possible approach is to provide, for example:

namespace std {
    class atomic_int {
    public:
        int get();
        int set(int v);
        bool compare_and_set(int expected_value, int new_value);
        int weak_get();
        int weak_set(int v);
        bool weak_compare_and_set(int expected_value,
                                  int new_value);
        // other minor convenience functions, including:
        int get_and_increment();
        int get_and_add(int v);
        // ...
    };
}
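As a usage sketch of the interface above (which is, again, only proposed),
a simple spinlock can be built directly from compare_and_set; we assume here
that compare_and_set reports whether the update took place and that
atomic_int is default-constructible.

// Sketch only: a test-and-set spinlock built on the proposed atomic_int
// interface. No backoff or yielding; not production code.
class SpinLock {
public:
    SpinLock() { state.set(0); }   // 0 == unlocked
    void lock() {
        while (!state.compare_and_set(0, 1)) {
            // busy-wait until the 0 -> 1 transition succeeds
        }
    }
    void unlock() {
        state.set(0);              // release the lock
    }
private:
    std::atomic_int state;
};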
The main attraction of this approach is that it appears to be implementable
on essentially any platform. Even those machines without such primitives can
emulate them using private locks. And even though some machines (such
as PowerPC) support LL/SC (load-linked, store-conditional) instead of CAS
(compare-and-set), in practice, nearly all usages of LL/SC are to perform CAS
(the reverse is impossible), so there would rarely be motivation to resort to
non-standardized, non-portable constructions even on these platforms.
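For illustration, the lock-based fallback mentioned above might look roughly
like the following sketch, using a pthreads mutex purely as an example of an
available locking substrate.

// Sketch only: emulating compare_and_set with a private lock on a platform
// that lacks a hardware CAS or LL/SC primitive.
#include <pthread.h>

class emulated_atomic_int {
public:
    explicit emulated_atomic_int(int initial = 0) : value(initial) {
        pthread_mutex_init(&guard, 0);
    }
    ~emulated_atomic_int() { pthread_mutex_destroy(&guard); }

    bool compare_and_set(int expected_value, int new_value) {
        pthread_mutex_lock(&guard);
        const bool success = (value == expected_value);
        if (success) value = new_value;
        pthread_mutex_unlock(&guard);
        return success;
    }
private:
    int value;
    pthread_mutex_t guard;   // private lock protecting 'value'
};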
The idea of the weak versions is to permit finer control of atomics and
barriers than otherwise available using volatile or other constructions. For
example, a weak_set need only perform a store ordering barrier, not a full
release, which may be cheaper on some machines. (Details are yet to be fully
worked out. Additionally, a refinement of this approach relying more heavily on
Thread Library
References
[1] Tim Lindholm et al. Java Specification Request 133: Memory Model and Thread Specification Revision. Available at http://www.jcp.org/jsr/detail/133.jsp.
[2] Walter E. Brown et al. Toward Improved Optimization Opportunities in C++0X. Document WG21/N1664 = J16/04-0104; available at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1664.pdf, July 2004.
[3] IEEE Standard for Information Technology. Portable Operating System Interface (POSIX), System Application Program Interface (API), Amendment 2: Threads Extension (C Language). ANSI/IEEE 1003.1c-1995, 1995.
[4] Scott Meyers and Andrei Alexandrescu. C++ and the Perils of Double-Checked Locking. Doctor Dobb's Journal, July 2004.