Concurrent Programming Without Locks
KEIR FRASER
University of Cambridge Computer Laboratory
and
TIM HARRIS
Microsoft Research Cambridge
Mutual exclusion locks remain the de facto mechanism for concurrency control on shared-memory
data structures. However, their apparent simplicity is deceptive: it is hard to design scalable
locking strategies because locks can harbour problems such as priority inversion, deadlock and
convoying. Furthermore, scalable lock-based systems are not readily composable when building
compound operations. In looking for solutions to these problems, interest has developed in non-
blocking systems which have promised scalability and robustness by eschewing mutual exclusion
while still ensuring safety. However, existing techniques for building non-blocking systems are
rarely suitable for practical use, imposing substantial storage overheads, serialising non-conflicting
operations, or requiring instructions not readily available on today’s CPUs.
In this paper we present three APIs which make it easier to develop non-blocking implemen-
tations of arbitrary data structures. The first API is a multi-word compare-and-swap operation
(MCAS) which atomically updates a set of memory locations. This can be used to advance a
data structure from one consistent state to another. The second API is a word-based software
transactional memory (WSTM) which can allow sequential code to be re-used more directly than
with MCAS and which provides better scalability when locations are being read rather than being
updated. The third API is an object-based software transactional memory (OSTM). OSTM allows
a more streamlined implementation than WSTM, but at the cost of re-engineering the code to
use OSTM objects.
We present practical implementations of all three of these APIs, built from operations available
across all of today’s major CPU families. We illustrate the use of these APIs by using them
to build highly concurrent skip-lists and red-black trees. We compare the performance of the
resulting implementations against one another and against high-performance lock-based systems.
These results demonstrate that it is possible to build useful non-blocking data structures with
performance comparable to, or better than, sophisticated lock-based designs.
1. INTRODUCTION
Mutual-exclusion locks are one of the most widely used and fundamental abstrac-
tions for synchronisation. This popularity is largely due to their apparently simple
programming model and the availability of implementations which are efficient and
scalable. Unfortunately, without specialist programming care, these benefits rarely
hold for systems containing more than a handful of locks:
— For correctness, programmers must ensure that threads hold the necessary
locks to avoid conflicting operations being executed concurrently. To avoid mis-
takes, this favours the development of simple locking strategies which pessimisti-
cally serialise some non-conflicting operations.
— For liveness, programmers must be careful to avoid introducing deadlock and
consequently they may cause software to hold locks for longer than would otherwise
be necessary. Also, without scheduler support, programmers must be aware of
priority inversion problems.
— For high performance, programmers must balance the granularity at which
locking operates against the time that the application will spend acquiring and
releasing locks.
This paper is concerned with the design and implementation of software which is
safe for use on multi-threaded multi-processor shared-memory machines but which
does not involve the use of locking. Instead, we present three different APIs for
making atomic accesses to sets of memory locations. These enable the direct de-
velopment of concurrent data structures from sequential ones. We believe this
makes it easier to build multi-threaded systems which are correct. Furthermore,
our implementations are non-blocking (meaning that even if any set of threads is
stalled then the remaining threads can still make progress) and they generally allow
disjoint-access parallelism (meaning that updates made to non-overlapping sets of
locations will be able to execute concurrently).
To introduce these APIs we shall sketch their use in a code fragment that inserts
items into a singly-linked list which holds integers in ascending order. In each case
the list is structured with sentinel head and tail nodes whose keys are respectively
less than and greater than all other values. Each node’s key remains constant after
insertion. For comparison Figure 1 shows the corresponding insert operation when
implemented for single-threaded use. In that figure, as in each of our examples,
the insert operation proceeds by identifying nodes prev and curr between which the
new node is to be placed.
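Figure 1 itself is not reproduced here; a minimal single-threaded version of this insert operation might look as follows. This is a sketch only: the node layout and sentinel keys are assumptions based on the description above.

```c
#include <stdlib.h>

/* Assumed node layout: sentinel head and tail nodes carry keys below and
   above all valid values, so every insertion lands between two nodes. */
typedef struct node { int key; struct node *next; } node;

/* Single-threaded insert keeping the list in ascending key order;
   duplicate keys are ignored. */
void list_insert(node *head, int key) {
    node *prev = head, *curr = head->next;
    /* Find prev and curr with prev->key < key <= curr->key. */
    while (curr->key < key) {
        prev = curr;
        curr = curr->next;
    }
    if (curr->key == key)
        return;                       /* key already present */
    node *item = malloc(sizeof *item);
    item->key = key;
    item->next = curr;                /* link the new node between them */
    prev->next = item;
}
```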
Our three alternative APIs all follow a common optimistic style [Kung and Robin-
son 1981] in which the core sequential code is wrapped in a loop which retries the
insertion until it succeeds in committing the updates to memory.
Our first API provides multi-word compare-and-swap (MCAS) which generalises
the single-word CAS operation found on many processors: it atomically updates
one or more memory locations from a set of expected values to a set of new values.
Figure 2 shows how the insertion could be expressed using MCAS.
There are two fundamental changes from the sequential algorithm. Firstly, in-
stead of updating shared locations individually, the code must call MCAS to perform
the complete set of memory accesses that need to be made atomically.
Fig. 2. Insertion into a sorted list managed using MCAS. In this case the arrays specifying the
update need contain only a single element.
In the example there is only a single update to be made with MCAS, but a corresponding
delete operation would pass two updates to MCAS: one to excise the node from the
list and a second to clear its next field to NULL to prevent concurrent insertion after
a deleted node. Secondly, the code must call MCASRead whenever it reads from a
location that might be subject to a concurrent update by MCAS in another thread.
MCAS and MCASRead present a rather low level API: programmers must be
careful to use MCASRead where necessary and they must also remember that the
subsequent MCAS is not aware of the locations that have been read. This can lead
to cumbersome code in which the program keeps lists of the locations it has read
from, and the values that it has seen in them, and then passes these lists to MCAS
to confirm that the values represent a consistent view of shared memory.
The second abstraction provides a word-based software transactional memory
(WSTM) which avoids some of these problems by allowing a series of reads and
writes to be grouped as a software transaction and applied to the heap atomi-
cally [Harris and Fraser 2003]. Figure 3 shows our list example using WSTM. The
changes from sequential code are that reads and writes to shared locations are
performed through WSTMRead and WSTMWrite functions and that this whole set
of memory accesses is wrapped between calls to WSTMStartTransaction and
WSTMCommitTransaction.
Fig. 3. Insertion into a sorted list managed using WSTM. The structure mirrors Figure 2 except
that the WSTM implementation tracks which locations have been accessed based on the calls to
WSTMRead and WSTMWrite.
The third abstraction provides an object-based software transactional memory
(OSTM) which allows a thread to ‘open’ a set of objects for transactional accesses
and, once more, to commit updates to them atomically [Fraser 2003]. Figure 4 illus-
trates this style of programming: each object is accessed through an OSTM handle
which must be subject to an OSTMOpenForReading or OSTMOpenForWriting call
in order to obtain access to the underlying data. In short examples the code looks
more verbose than WSTM, but the OSTM implementation is more straightforward
and often runs more quickly.
While these techniques do not provide a silver bullet for designing scalable concurrent
data structures, they represent a shift of responsibility away from the programmer:
the API’s implementation is responsible for correctly ensuring that conflicting
operations do not proceed concurrently and for preventing deadlock and priority-
inversion between concurrent operations. The API’s caller remains responsible for
ensuring scalability by making it unlikely that concurrent operations will need to
modify overlapping sets of locations. However, this is a performance problem rather
than a correctness or liveness one and, in our experience, even straightforward data
structures, developed directly from sequential code, offer performance which com-
petes with and often surpasses state-of-the-art lock-based designs.
1.1 Goals
We set ourselves a number of goals in order to ensure that our designs are practical
and perform well when compared with lock-based schemes:
Concreteness. We must consider the full implementation path down to the in-
structions available on commodity CPUs. This means we build from atomic single-
word read, write and compare-and-swap (CAS) operations. We define CAS to
atomically compare the contents of a memory location against an expected value
and, if they match, replace them with a new value, returning the value originally
read.
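For reference, the behaviour we assume of CAS can be modelled sequentially as follows. This is an illustrative sketch; in reality CAS is a single atomic hardware instruction, not a C function.

```c
typedef unsigned long word_t;

/* Sequential model of CAS: compare *a with the expected value e and, on a
   match, store n; either way return the value originally read.  A real CAS
   performs all of this as one indivisible hardware operation. */
word_t cas_model(word_t *a, word_t e, word_t n) {
    word_t old = *a;
    if (old == e)
        *a = n;
    return old;
}
```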
Fig. 4. Insertion into a sorted list managed using OSTM. The code is more verbose than Figure 3
because data is accessed by indirection through OSTM handles which must be opened before use.
Table I. Assessment of our implementations of these three APIs against our goals.
                           MCAS                 WSTM                       OSTM
Disjoint-access            when accessing       probabilistically, when    when accessing
parallelism                disjoint sets of     accessing disjoint         disjoint sets of
                           words                sets of words              objects
Herlihy et al.’s ‘Repeat Offender Problem’ [2005] and Michael’s ‘Safe Memory Reclamation’ [2002] both allow threads
to issue tentative deallocation requests that are deferred until it is established that
no other thread can access the memory involved.
1.2 Source code availability
Source code for our MCAS, WSTM and OSTM systems, data structure implemen-
tations and test harnesses is available for Alpha, Intel IA-32, Intel IA-64, MIPS,
PowerPC and SPARC processor families at http://www.cl.cam.ac.uk/netos/lock-free.
1.3 Structure of this paper
In Section 2 we present the three alternative APIs and compare and contrast their
features and the techniques for using them effectively. We discuss previous work
with respect to our goals in Section 3. In Section 4 we describe our overall design
method and the common facets of each of our designs. In Sections 5–7 we explore
the details of these three APIs in turn and present our design, its relationship to
previous work and, where applicable, to contemporary work which has had similar
goals of practicability.
In Section 8 we evaluate the performance of data structures built over each of the
APIs, both in comparison with one another and in comparison with sophisticated
lock-based schemes. We use skip-lists and red-black trees as running examples,
highlighting any particular issues that arise when adapting a sequential implemen-
tation for concurrent use.
2. PROGRAMMING APIS
In this section we present the programming interfaces for using MCAS (Section 2.1),
WSTM (Section 2.2) and OSTM (Section 2.3). These each provide mechanisms for
accessing and/or modifying multiple unrelated words in a single atomic step; how-
ever, they differ in the way in which those accesses are specified and the adaptation
required to make a sequential operation safe for multi-threaded use.
After presenting the APIs themselves, Section 2.4 discusses how they may be
used in practice in shared-memory multi-threaded programs.
2.1 Multi-word compare-and-swap (MCAS)
Multi-word compare-&-swap (MCAS) extends the well-known hardware CAS prim-
itive to operate on an arbitrary number of memory locations simultaneously. As
with the linked-list example shown in Figure 2, it is typically used by preparing a
list of updates to make in a thread-private phase before invoking MCAS to apply
them to the heap. MCAS is defined to operate on N distinct memory locations
(ai), expected values (ei), and new values (ni): each ai is updated to value ni if and
only if each ai contains the expected value ei before the operation. MCAS returns
TRUE if these updates are made and FALSE otherwise.
Heap accesses to words which may be subject to a concurrent MCAS must be
performed by calling MCASRead. This restriction is needed because, as we show in
Section 5, the MCAS implementation places its own values in these locations while
they are being updated. Furthermore, the MCAS implementation reserves two bits
in each location that it may work on. In practice this means that these locations
Fig. 5. The need for read consistency: a move-to-front linked list subject to two searches for node
3. In snapshot (a), search A is preempted while passing over node 1. Meanwhile, in snapshot (b),
search B succeeds and moves node 3 to the head of the list. When A continues execution, it will
incorrectly report that 3 is not in the list.
must hold aligned pointer values in which at least the two low-order bits are ordinarily
clear on machines with 32-bit or 64-bit words. The full API is consequently:
// Update locations a[0]..a[N-1] from e[0]..e[N-1] to n[0]..n[N-1]
bool MCAS (int N, word **a[ ], word *e[ ], word *n[ ]);
// Read the contents of location a
word *MCASRead (word **a);
This API is effective when a small number of locations can be identified which
need to be accessed to update a data structure from one consistent state to another.
Using MCAS also allows expert programmers to reduce contention between con-
current operations by paring down the set of locations passed to each atomic update,
or by decomposing a series of related operations into a series of MCAS calls. For
instance, when inserting a node into a sorted linked-list, we relied on the structure
of the list and the immutability of key fields to allow us to update just one location
rather than needing to check that the complete chain of pointers traversed has not
been modified by a concurrent thread. However, this flexibility presents a potential
pit-fall for programmers directly using MCAS.
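The shape of such MCAS-based code can be sketched as follows. The MCAS and MCASRead bodies here are only a sequential model of the API's specification, standing in for the non-blocking implementation of Section 5, and the node layout is an assumption.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned long word;
typedef struct node { word key; struct node *next; } node;

/* Sequential model of the MCAS API: each location a[i] is updated to n[i]
   iff every a[i] currently holds e[i].  The real implementation of
   Section 5 performs this atomically and without blocking. */
static bool MCAS(int N, word **a[], word *e[], word *n[]) {
    for (int i = 0; i < N; i++)
        if (*a[i] != e[i]) return false;     /* some expected value mismatched */
    for (int i = 0; i < N; i++)
        *a[i] = n[i];                        /* apply all new values */
    return true;
}
static word *MCASRead(word **a) { return *a; }   /* model: a plain read */

/* Insertion in the style of Figure 2: search privately, then attempt to
   commit the single pointer update, retrying from scratch on failure. */
void list_insert(node *head, word key) {
    node *item = malloc(sizeof *item);
    item->key = key;
    for (;;) {
        node *prev = head;
        node *curr = (node *)MCASRead((word **)&prev->next);
        while (curr->key < key) {
            prev = curr;
            curr = (node *)MCASRead((word **)&prev->next);
        }
        if (curr->key == key) { free(item); return; }  /* already present */
        item->next = curr;
        word **a[] = { (word **)&prev->next };
        word *e[]  = { (word *)curr };
        word *nv[] = { (word *)item };
        if (MCAS(1, a, e, nv)) return;       /* retry if a conflict occurred */
    }
}
```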
The API also precludes our goal of composability: there is no way to combine two separate MCAS operations into a single atomic update.
2.2 Word-based software transactional memory (WSTM)
Although MCAS eases the burden of ensuring correct synchronisation of updates,
many data structures also require consistency among groups of read operations
and it is cumbersome for the application to track these calls to MCASRead and to
use the results to build arrays of ‘no-op’ updates to pass to MCAS. For instance,
consider searching within a move-to-front list, in which a successful search promotes
the discovered node to the head of the list. As indicated in Figure 5, a naïve
algorithm which does not consider synchronisation between concurrent searches
may incorrectly fail.
Software transactional memories provide a way of dealing with these problems by
grouping shared-memory accesses into transactions. These transactions succeed or
fail atomically. Furthermore, composability is gained by allowing nested transac-
tions: a series of WSTM transactions can be composed by bracketing them within
a further transaction.
Typically, our implementation of the WSTM API allows a transaction to commit
so long as no other thread has committed an update to one of the locations that has
been accessed. However, as we show in Section 6.1, this must be considered a prob-
ACM Journal Name, Vol. V, No. N, M 20YY.
Concurrent Programming Without Locks · 9
abilistic guarantee rather than a hard one because false conflicts can be introduced
if there are collisions under a hash function used in the WSTM implementation.
Within a transaction, data accesses are performed by WSTMRead and WSTMWrite
operations. As with MCAS, the caller is responsible for using these operations when
accessing words which may be subject to a concurrent WSTMCommitTransaction.
Unlike MCAS, our WSTM implementation does not reserve space in each word,
allowing it to act on full word-size data rather than just pointer-valued fields in
which ‘spare’ bits can be reserved. The full API is:
// Transaction management
wstm_transaction *WSTMStartTransaction();
bool WSTMCommitTransaction(wstm_transaction *tx);
bool WSTMValidateTransaction(wstm_transaction *tx);
void WSTMAbortTransaction(wstm_transaction *tx);
// Data access
word WSTMRead(wstm_transaction *tx, word *a);
void WSTMWrite(wstm_transaction *tx, word *a, word d);
As we will show later, the interface often results in reduced performance compared
with MCAS.
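The structure of WSTM-based code described above can be sketched as follows. The transaction functions here are a trivial single-threaded model of the API (every commit succeeds); the real implementation logs the accesses made through WSTMRead and WSTMWrite and commits the whole transaction atomically. The node layout is an assumption.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned long word;
typedef struct { int unused; } wstm_transaction;

/* Trivial single-threaded model of the WSTM API: accesses are applied
   directly and every commit succeeds. */
static wstm_transaction *WSTMStartTransaction(void) {
    return malloc(sizeof(wstm_transaction));
}
static bool WSTMCommitTransaction(wstm_transaction *tx) { free(tx); return true; }
static word WSTMRead(wstm_transaction *tx, word *a) { (void)tx; return *a; }
static void WSTMWrite(wstm_transaction *tx, word *a, word d) { (void)tx; *a = d; }

typedef struct node { word key; struct node *next; } node;

/* Insertion in the style of Figure 3: every shared access goes through
   WSTMRead/WSTMWrite and the operation retries until a commit succeeds.
   Pointers are assumed to fit in a word. */
void list_insert(node *head, word key) {
    node *item = malloc(sizeof *item);
    item->key = key;
    for (;;) {
        wstm_transaction *tx = WSTMStartTransaction();
        node *prev = head;
        node *curr = (node *)WSTMRead(tx, (word *)&prev->next);
        while (curr->key < key) {
            prev = curr;
            curr = (node *)WSTMRead(tx, (word *)&prev->next);
        }
        if (curr->key == key) {              /* already present */
            WSTMCommitTransaction(tx);
            free(item);
            return;
        }
        item->next = curr;
        WSTMWrite(tx, (word *)&prev->next, (word)item);
        if (WSTMCommitTransaction(tx)) return;
    }
}
```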
2.4.3 Optimizations and hints. The final aspect we consider is the set of additional
tuning facilities available to a programmer to improve the performance of an algorithm
using our APIs.
The key problem is false contention where operations built using the APIs are
deemed to conflict even though logically they commute – for instance, if a set of
integers is held in numerical order in a linked list, then a thread transactionally
inserting 15 between 10 and 20 will perform updates that conflict with reads from
a thread searching through that point for a higher value.
It is not clear that this particular example can be improved automatically when
a tool is generating calls on our APIs; realising that the operations do not logically
conflict relies on knowledge of their semantics and the set’s representation. Notwith-
standing this, the ideas of disjoint-access parallelism and read-parallelism allow the
programmer to reason about which operations will be able to run concurrently.
However, our APIs can be extended with operations for use by expert program-
mers. As with Herlihy et al.'s DSTM [Herlihy et al. 2003], our OSTM supports
an additional early release operation that discards an object from the sets of ac-
cesses that the implementation uses for conflict detection. For instance, in our list
example, a thread searching the list could release the list's nodes as it traverses
them, eventually trying to commit a minimal transaction containing only the node
it seeks (if it exists) and its immediate predecessor in the list. Similarly, as we dis-
cuss in Section 6.5, WSTM supports discard operations to remove addresses from
a transaction.
These operations all require great care: once released or discarded, data plays
no part in the transaction’s commit or abort. A general technique for using them
correctly is for the programmer to ensure that (i) as a transaction runs, it always
holds enough data for invalidity to be detected, (ii) when a transaction commits,
the operation it is performing is correct given only the data that is still held. For
instance, in the case of searching a sorted linked list, it would need to hold a pair
of adjacent nodes to act as a ‘witness’ of the operation’s result. However, such
extreme use of these optimization APIs loses many of the benefits of performing atomic
multi-word updates: the linked-list example becomes comparable in complexity to a list
built directly from CAS [Harris 2001] or from sophisticated locking [Heller et al. 2005].
3. RELATED WORK
The literature contains several designs for abstractions such as MCAS, WSTM and
OSTM. However, many of the foundational designs have not shared our recent goals
of practicality – for instance much work builds on instructions such as strong-LL/SC
or DCAS [Motorola 1985] which are not available as primitives in contemporary
hardware. Our experience is that although this work has identified the problems
which exist and has introduced terminology and conventions for presenting and
reasoning about algorithms, it has not been possible to effectively implement or use
these algorithms by layering them above software implementations of strong-LL/SC
or DCAS. For instance when considering strong-LL/SC, Jayanti and Petrovic’s
recent design reserves four words of storage per thread for each word that may be
accessed [Jayanti and Petrovic 2003]. Other designs reserve N or log N bits of
storage within each word when used with N threads: such designs can only be used
when N is small. When considering DCAS, it appears no easier to build a general
purpose DCAS operation than it is to implement our MCAS design.
In discussing related work, the section is split into three parts. Firstly, in Sec-
tion 3.1 we introduce the terminology of non-blocking systems and describe the
progress guarantees that they make. These properties underpin the liveness guar-
antees that are provided to users of our algorithms. Secondly, in Section 3.2 we
discuss the design of ‘universal’ transformations that build non-blocking systems
from sequential code or from lock-based code. Finally, in Section 3.3, we present
previous designs for multi-word abstractions such as MCAS, WSTM and OSTM
and we assess them against our goals.
1 We say eventually because the transaction may have to be re-executed before this occurs (just as
an obstruction-free algorithm may involve internal re-try steps to remove obstructions). Crucially
it cannot run in isolation an unbounded number of times before committing.
4. DESIGN METHOD
Our implementations of the three APIs in Sections 2.1–2.3 have to solve a set of
common problems and, unsurprisingly, use a number of similar techniques.
The key problem is that of ensuring that a set of memory accesses appears to
occur atomically when it is implemented by a series of individual instructions ac-
cessing one word at a time. Our fundamental approach is to deal with this problem
by decoupling the notion of a location’s physical contents in memory from its logical
contents when accessed through one of the APIs. The physical contents can, of
course, only be updated one word at a time. However, as we shall show, we arrange
that the logical contents of a set of locations can be updated atomically.
For each of the APIs there is only one operation which updates the logical con-
tents of memory locations: MCAS, WSTMCommitTransaction and OSTMCommit-
Transaction. We call these operations collectively the commit operations and they
are the main source of complexity in our designs.
For each of the APIs we present our design in a series of four steps:
(1) Define the format of the heap, the temporary data structures that are used and
how an application goes about allocating and deallocating memory for data
structures that will be accessed through the API.
(2) Define the notion of logical contents in terms of these structures and show
how it can be computed using a series of single-word accesses. This underpins
the implementation of all functions other than the commit operations. In this
step we are particularly concerned with ensuring non-blocking progress and
read-parallelism so that, for instance, two threads can perform WSTMRead
operations to the same location at the same time without producing conflicts
in the memory hierarchy.
(3) Show how the commit operation arranges to atomically update the logical con-
tents of a set of locations when it executes without interference from concurrent
commit operations. In this stage we are particularly concerned with ensuring
disjoint-access parallelism so that threads can commit updates to disjoint sets
of locations at the same time.
(4) Show how contention is resolved when one commit operation’s progress is im-
peded by a conflicting commit operation. In this step we are concerned with
ensuring non-blocking progress so that the progress is not prevented if, for
example, the thread performing the existing commit operation has been pre-
empted.
Before considering the details of the three different APIs we discuss the common
aspects of each of these four steps in Sections 4.1–4.4 respectively.
2 This does not mean that the designs can only be used in languages that traditionally provide
garbage collection. For instance, in our evaluation in Section 8.1.3, we use reference counting [Jones
and Lins 1996] on the descriptors to allow prompt memory reuse and affinity between descriptors
and threads.
The second property is that, aside from its status field, a descriptor’s contents
are unchanged once it is made reachable from shared memory. This means that if
one thread t1 encounters a descriptor allocated by another thread t2, then t1 can
read a series of values from it and be sure of receiving mutually consistent
results. For instance, in the case of an MCAS descriptor, t1 can read details about
a location that t2 accessed and the value that t2 proposes to write there.
The third property is that, once the outcome of a particular commit operation
has been decided then the descriptor’s status field remains constant: if a thread
wishes to retry a commit operation, e.g. if the code in Figures 2–4 loops, then each
retry uses a fresh descriptor. This means that threads reading from a descriptor
and seeing that the outcome has been decided can be sure that the status field will
not subsequently change.
The combination of the first two properties is important because it allows us
to avoid many A-B-A problems in which a thread is about to perform a CAS
conditional on a location holding a value A, but then a series of operations by other
threads changes the value to B and then back to A allowing the delayed CAS to
succeed. These two properties mean that there is effectively a one-to-one association
between descriptor references and the intent to perform a given atomic update.
Our implementations rely on being able to distinguish pointers to descriptors
from other values. In our pseudo-code in Sections 5–7 we abstract these tests with
predicates, for instance IsMCASDesc to test if a pointer refers to an MCAS descrip-
tor. We discuss ways in which these predicates can be implemented in Section 8.1.2.
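One common way to implement such predicates, sketched here under the assumption that descriptors are word-aligned so that ordinary pointer values have their low-order bits clear, is to tag descriptor references with a reserved bit. The names below are illustrative, not the paper's.

```c
#include <stdint.h>

/* Descriptors are assumed word-aligned, so genuine pointers to them have
   their low-order bits clear; we reserve the lowest bit to mark a value
   as a descriptor reference rather than an ordinary pointer. */
#define MCAS_DESC_TAG ((uintptr_t)1)

static inline void *make_mcas_desc_ref(void *desc) {
    return (void *)((uintptr_t)desc | MCAS_DESC_TAG);   /* set the tag bit */
}
static inline int IsMCASDesc(void *p) {
    return ((uintptr_t)p & MCAS_DESC_TAG) != 0;         /* test the tag bit */
}
static inline void *strip_mcas_desc_ref(void *p) {
    return (void *)((uintptr_t)p & ~MCAS_DESC_TAG);     /* recover the pointer */
}
```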
4.2 Logical contents
Each of our API implementations uses descriptors to define the logical contents
of memory locations by providing a mechanism for a descriptor to own a set of
memory locations.
In general, when a commit operation relating to it is not in progress, then a
location is unowned and it holds its logical contents directly. Otherwise, when a
location is owned, the logical contents are taken from the descriptor and chosen
from the ‘before’ and ‘after’ versions based on the descriptor’s status field. This
means that updating the status field has the effect of updating the logical contents
of the whole set of locations that the descriptor owns.
Each of our designs uses a different mechanism to represent the ownership rela-
tionship between locations and transactions. This forms the key distinction between
them and we cover the details in Sections 5 (MCAS), 6 (WSTM) and 7 (OSTM).
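The rule above can be sketched as follows for a single owned location; the descriptor layout is a simplification for illustration, since real descriptors cover a set of locations.

```c
typedef unsigned long word;
typedef enum { UNDECIDED, READ_CHECK, SUCCESSFUL, FAILED } status_t;

/* Simplified descriptor owning one location: it records the value the
   location held before the operation and the value proposed by it. */
typedef struct {
    word before, after;
    status_t status;
} desc_model;

/* Logical contents of an owned location: selected from the descriptor's
   'before'/'after' values by its status.  Updating the single status field
   therefore changes the logical contents of every location the descriptor
   owns at once. */
word logical_contents(const desc_model *d) {
    return (d->status == SUCCESSFUL) ? d->after : d->before;
}
```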
4.3 Uncontended commit operations
The commit operations themselves are each structured in three phases. A first phase
acquires exclusive ownership of the locations being updated; a second read-check
phase ensures that locations that have been read but not updated still hold the values
expected in them. This is followed by the decision point at which the outcome
of the commit operation is decided and made visible to other threads through
the descriptor’s status field, and then by the final release phase in which the thread
relinquishes ownership of the locations being updated.
There are four status values: UNDECIDED, READ-CHECK, SUCCESSFUL and
FAILED. A descriptor’s status field is initially UNDECIDED at the start of a commit
operation.
Fig. 6. Timeline for the three phases used in commit operations. The grey bar indicates when the
commit operation is executed; prior to this the thread prepares the heap accesses that it wants
to commit. In this example location a1 has been read but not updated and location a2 has been
updated. The first phase acquires exclusive access to the locations being updated. The second
phase checks that locations read have not been updated by concurrent threads. The third phase
releases exclusive access after making any updates. The read-check made at point 2 ensures that
a1 is not updated between 0 and 2. The acquisition of a2 ensures exclusive access to it between 1
and 3.
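The three phases can be summarised in a skeleton like the following. This is straight-line pseudo-C for exposition: the real commit operations perform each step with CAS or CCAS and must tolerate helping and contention, and the callbacks here are placeholders.

```c
#include <stdbool.h>

/* C identifiers cannot contain '-', so READ-CHECK becomes READ_CHECK. */
typedef enum { UNDECIDED, READ_CHECK, SUCCESSFUL, FAILED } status_t;

typedef struct {
    status_t status;
    /* ...a descriptor also records the read-only and updated locations... */
} commit_descriptor;

/* Skeleton of the three-phase commit: acquire, read-check, decide, release. */
bool commit(commit_descriptor *d,
            bool (*acquire_updates)(commit_descriptor *),
            bool (*read_check)(commit_descriptor *),
            void (*release)(commit_descriptor *, bool ok)) {
    d->status = UNDECIDED;
    bool ok = acquire_updates(d);       /* phase 1: own the locations to update */
    if (ok) {
        d->status = READ_CHECK;
        ok = read_check(d);             /* phase 2: validate locations only read */
    }
    d->status = ok ? SUCCESSFUL : FAILED;  /* decision point */
    release(d, ok);                     /* phase 3: relinquish ownership */
    return ok;
}

/* Trivial stand-in callbacks, for illustration only. */
static bool always_ok(commit_descriptor *d) { (void)d; return true; }
static bool always_fail(commit_descriptor *d) { (void)d; return false; }
static void no_release(commit_descriptor *d, bool ok) { (void)d; (void)ok; }
```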
The linearization point of a commit operation that uses a read-check phase lies at
the start of that phase, whereas its decision point, at which the outcome is actually
signaled to other threads, occurs at the end of the read-check phase.
3 In the presence of helping, the linearization point is defined with reference to the thread that
completes the operation.
This choice of linearization point may appear perverse for two reasons:
(1) The linearization point is before its decision point – how can an operation
appear to commit its updates before its outcome is decided?
The rationale for this is that holding ownership of the locations being updated
ensures that these remain under the control of this descriptor from acquisition
until release (1 until 3 in Figure 6). Similarly, read-checks ensure that any
locations accessed in a read-only mode have not been updated4 between points
0 and 2. Both of these intervals include the proposed linearization point, even
though it precedes the decision point.
(2) If the operation occurs atomically at its linearization point, then what are the
logical contents of the locations involved until the descriptor’s status is updated
at the decision point?
Following the definition in Section 4.2, the logical contents are dependent on
the descriptor’s status field and so updates are not revealed to other threads
until the decision point is reached. We reconcile this definition with the use of
a read-check phase by ensuring that concurrent readers help commit operations
to complete, retrying the read operation once the transaction has reached its
decision point. This means that the logical contents do not need to be defined
during the read-check phase because they are never required.
4 Of course, for this argument to hold, the read-checks cannot simply consider the
values in the locations: that would allow A-B-A problems to emerge if the locations are
updated multiple times between 0 and 2. Our WSTM and OSTM designs, which use
read-check phases, must check versioning information rather than just values.
Fig. 7. An example of a dependent cycle of two operations, A and B. Each needs the other to
exit its read-check phase before it can complete its own.
This CCAS operation is a special case of the DCAS primitive that some processors
have provided [Motorola 1985] and which has often been used in related work
on building MCAS. However, unlike the more general DCAS (or even a double-
compare single-swap), this restricted double-word operation has a straightforward
implementation using CAS; we present this implementation in Section 5.4.
The implementation of MCAS is simpler than the two STMs because it does not
involve a read-check phase. If the arrays passed to MCAS happen to specify the
same value for ei and ni then this is treated as an update between two identical
values.
Fig. 8. MCASRead operation used to read from locations which may be subject to concurrent
MCAS operations.
The need to distinguish descriptor references from ordinary values is why the MCAS implementation needs two reserved bits in the locations that it
may update. However, for simplicity in the pseudo-code versions of our algorithms,
we use predicates to abstract these bitwise operations. Many alternative implemen-
tation techniques are available: for instance some languages provide run-time type
information, and in other cases descriptors of a given kind can be placed in given
regions of the process’ virtual address space.
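To make the reserved-bit idea concrete, here is a small Python sketch (entirely ours, not the paper's code) that tags the two low-order bits of an aligned pointer-sized word so that predicates such as IsMCASDesc and IsCCASDesc become simple bit tests:

```python
# Hypothetical tagging scheme: aligned pointers have their two low-order
# bits free, so those bits can encode what kind of descriptor (if any)
# a word refers to. The tag values are illustrative.
PLAIN, MCAS_DESC, CCAS_DESC = 0, 1, 2

def tag(ptr, kind):
    assert ptr & 3 == 0        # requires at least 4-byte alignment
    return ptr | kind

def untag(word):
    return word & ~3

def is_mcas_desc(word):
    return word & 3 == MCAS_DESC

def is_ccas_desc(word):
    return word & 3 == CCAS_DESC

p = 0x1000                      # a word-aligned address
assert not is_mcas_desc(tag(p, PLAIN))
assert is_mcas_desc(tag(p, MCAS_DESC))
assert is_ccas_desc(tag(p, CCAS_DESC))
assert untag(tag(p, CCAS_DESC)) == p
```

Any scheme of this shape works provided descriptors are allocated at aligned addresses, so the low bits of genuine references are otherwise always zero.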
[Figure 9: sequence of heap diagrams omitted. Each shows the application heap (locations a1 and a2) alongside MCAS descriptor tx1, whose entries are a1: 100 −> 200 and a2: 200 −> 100; only the step captions below are recoverable.]
(a) The operation executes in private until it invokes MCAS. The MCAS descriptor holds the
updates being proposed: in this case the contents of a1 and a2 are to be swapped.
(b) CCAS is used to acquire ownership of address a1, replacing the value expected there with
a reference to the MCAS descriptor. The update is conditional on the descriptor remaining
UNDECIDED in order to guarantee that the location’s logical contents do not change.
(d) CAS is used to set the status to SUCCESSFUL. This has the effect of atomically updating
the locations’ logical contents.
Fig. 9. An uncontended commit swapping the contents of a1 and a2. Grey boxes show where CAS
and CCAS operations are to be performed at each step. While a location is owned, its logical
contents remain available through the MCAS descriptor.
 1  typedef struct {
 2      word **a, *e, *n, *cond;
 3  } ccas_descriptor;
 4  word *CCAS (word **a, word *e, word *n, word *cond) {
 5      ccas_descriptor *d := new ccas_descriptor();
 6      word *v;
 7      (d→a, d→e, d→n, d→cond) := (a, e, n, cond);
 8      while ( (v := CAS(d→a, d→e, d)) ≠ d→e ) {
 9          if ( ¬IsCCASDesc(v) ) return v;
10          CCASHelp((ccas_descriptor *)v);
11      }
12      CCASHelp(d);
13      return v;
14  }
15  word *CCASRead (word **a) {
16      word *v := *a;
17      while ( IsCCASDesc(v) ) {
18          CCASHelp((ccas_descriptor *)v);
19          v := *a;
20      }
21      return v;
22  }
23  void CCASHelp (ccas_descriptor *d) {
24      bool success := (*d→cond = 0);
25      CAS(d→a, d, success ? d→n : d→e);
26  }
Fig. 11. Conditional compare-&-swap (CCAS). CCASRead is used to read from locations which
may be subject to concurrent CCAS operations.
The first thread to reach the decision point for a descriptor must succeed in
installing SUCCESSFUL or FAILED. If the MCAS has failed then the point at which
an unexpected value was seen forms the linearization point of the operation: the
unexpected value was the logical contents of the location and it contradicts the
expected value ei for that location. Otherwise, if the MCAS has succeeded, note
that when the status field is updated (line 23) then all of the locations ai must
hold references to the descriptor and consequently the single status update changes
the logical contents of all of the locations. This is because the update is made by
the first thread to reach line 23 for the descriptor and so no threads can yet have
reached lines 25–27 and started releasing the addresses.
The final phase then is to release the locations, replacing the references to the de-
scriptor with the new or old values according to whether the MCAS has succeeded.
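The three phases just described (acquire with CCAS, decide, release) can be modelled in a few lines of single-threaded Python; Descriptor, mcas and the dict-based heap are our illustrative stand-ins, and the per-location CCAS/CAS steps are simulated by plain assignments:

```python
# Single-threaded model of the MCAS commit phases. Names are ours,
# not the paper's API.
UNDECIDED, FAILED, SUCCESSFUL = 0, 1, 2

class Descriptor:
    def __init__(self, updates):          # updates: {addr: (expected, new)}
        self.updates = updates
        self.status = UNDECIDED

def mcas(heap, desc):
    # Phase 1: acquire each location (stand-in for CCAS), checking the
    # expected value and leaving a reference to the descriptor in place.
    for addr, (exp, new) in sorted(desc.updates.items()):
        if heap[addr] != exp:
            desc.status = FAILED          # decision point: operation fails
            break
        heap[addr] = desc                 # location now "owned" by desc
    else:
        desc.status = SUCCESSFUL          # decision point: one atomic update
    # Phase 2: release, writing back new or old values as appropriate.
    for addr, (exp, new) in desc.updates.items():
        if heap[addr] is desc:
            heap[addr] = new if desc.status == SUCCESSFUL else exp
    return desc.status == SUCCESSFUL

heap = {0xa1: 100, 0xa2: 200}
assert mcas(heap, Descriptor({0xa1: (100, 200), 0xa2: (200, 100)}))
assert heap == {0xa1: 200, 0xa2: 100}
assert not mcas(heap, Descriptor({0xa1: (999, 0)}))   # expected value mismatch
```

A real implementation must of course perform each step atomically and cope with helping; the model only shows how the single status update mediates the logical contents of all the locations.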
[Figure 12: heap diagrams omitted. They show CCAS descriptor cc1 – with fields cond, a, e: 100 and n – being installed at location a1 on behalf of MCAS descriptor tx1 (status UNDECIDED).]
(a) CAS is used to try to install a pointer to the CCAS descriptor into the update location, in
this case replacing the expected value 100.
(b) The conditional location is checked against 0 and the thread (or threads) acting on this
CCAS descriptor individually decide the outcome and use CAS to store the new value (if the
check succeeds) or the expected value (if the check fails).
(c) Note that the one-shot use of descriptors means that once one thread has removed the
reference to the CCAS descriptor then any concurrent threads’ CAS operations attempting to do
so will fail: this is why there is no need for consensus between concurrent threads helping with a
CCAS.
Fig. 12. The steps involved in performing the first CCAS operation needed in Figure 9. In this
case the first location a1 is being updated to refer to MCAS descriptor tx1, conditional on a1
holding 100 and the descriptor being UNDECIDED.
CCAS begins by attempting to install a pointer to its descriptor in the update location
(line 8, Figure 12(a)). This ensures that the location’s logical contents remain the expected
value while the conditional location is being tested, so that a successful
CCAS operation linearises when the conditional location is read from. If the
update location doesn’t contain the expected value then CCAS fails (line 9); if it
contains another CCAS descriptor then that operation is helped to complete before
retrying (line 10).
If the update location is successfully acquired, the conditional location is tested
(line 24, Figure 12(b)). Depending on the contents of this location, the descriptor
is either replaced with the new value, or restored to the original value (line 25,
Figure 12(c)). CAS is used so that this update is performed exactly once even
when the CCAS operation is helped to complete by other processes.
There is a subtle aspect of the design and implementation of CCAS:
it does not return a boolean indicating whether or not it succeeded. Our MCAS
algorithm does not need such a return value and, in fact, it is non-trivial to extend
CCAS to provide one. The reason is that one thread’s CCAS operation may be
helped to completion by another thread and so it would be necessary to commu-
nicate the success/failure result back to the first thread. This cannot be done by
extending the descriptor with a boolean field for the result: there may be multi-
ple threads concurrently helping the same descriptor, each executing line 25 with
different result values.
CCASRead proceeds by reading the contents of the supplied address. If this is
a CCAS descriptor then the descriptor’s operation is helped to completion and the
CCASRead is retried. CCASRead returns a non-descriptor value once one is read.
Notice that CCASHelp ensures that the descriptor passed to it has been removed
from the address by the time that it returns, so CCASRead can only loop while
other threads are performing new CCAS operations on the same address.
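The behaviour of CCAS, CCASRead and CCASHelp can be mimicked in single-threaded Python, with a dict standing in for shared memory and a cas helper standing in for the hardware primitive (all names here are our own):

```python
UNDECIDED = 0

class CCASDesc:
    def __init__(self, addr, expected, new, cond):
        self.addr, self.expected, self.new, self.cond = addr, expected, new, cond

def cas(heap, addr, exp, new):
    old = heap[addr]
    if old == exp:
        heap[addr] = new
    return old                      # CAS returns the value it saw

def ccas_help(heap, d):
    # Decide: keep the new value only if the condition word is still
    # UNDECIDED; the CAS makes the replacement happen exactly once.
    success = heap[d.cond] == UNDECIDED
    cas(heap, d.addr, d, d.new if success else d.expected)

def ccas(heap, addr, expected, new, cond):
    d = CCASDesc(addr, expected, new, cond)
    while True:
        v = cas(heap, addr, expected, d)
        if v == expected:
            break                   # descriptor installed
        if not isinstance(v, CCASDesc):
            return v                # unexpected value: fail
        ccas_help(heap, v)          # help the conflicting CCAS, then retry
    ccas_help(heap, d)
    return v

def ccas_read(heap, addr):
    v = heap[addr]
    while isinstance(v, CCASDesc):  # help and retry until a real value
        ccas_help(heap, v)
        v = heap[addr]
    return v

heap = {'status': UNDECIDED, 'a1': 100}
assert ccas(heap, 'a1', 100, 'tx1', 'status') == 100   # condition held: updated
assert ccas_read(heap, 'a1') == 'tx1'
heap2 = {'status': 2, 'a2': 200}                       # condition already decided
assert ccas(heap2, 'a2', 200, 'tx1', 'status') == 200  # same return value...
assert ccas_read(heap2, 'a2') == 200                   # ...but a2 was restored
```

The last two assertions illustrate the point made above: the caller sees the expected value returned whether or not the conditional check succeeded, so CCAS itself cannot be used to report success.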
5.5 Discussion
There are a number of final points to consider in our design for MCAS. The first is
to observe that when committing an update to a set of N locations, and proceeding
without experiencing contention, the basic operation performs 3N + 1 updates using
CAS: the N CCAS operations make 2N updates in total (two each), a further N CAS
operations are used to release ownership, and a single CAS updates the
status field. Although this is more than a factor-of-three increase over
updating the locations directly, it is worth noting that the three batches of N
updates all act on the same locations: unless evicted during the MCAS operation,
the cache lines holding those locations need only be fetched once.
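The accounting above is easy to sanity-check; for the two-location swap of Figure 9 it gives seven CAS operations:

```python
def cas_count(n):
    # 2 CAS per CCAS acquisition, plus 1 CAS on the status field,
    # plus 1 CAS per location to release ownership.
    return 2 * n + 1 + n

assert cas_count(2) == 7                                # the swap in Figure 9
assert all(cas_count(n) == 3 * n + 1 for n in range(1, 100))
```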
We did develop an alternative implementation of CCAS which uses an ordinary
write in place of its second CAS. This involves leaving the CCAS descriptor linked
in to the location being updated and recording the success or failure of the CCAS
within that descriptor. This 2N + 1 scheme is not a worthwhile improvement over
the 3N + 1 design: it writes to more distinct cache lines and it makes it difficult
to re-use CCAS descriptors in the way we describe in Section 8.1.3. However, this
direction may be useful if there are systems in which CAS operates substantially
more slowly than an ordinary write.
Moir explained how to build an obstruction-free 2N + 1 MCAS which follows
the same general structure as our lock-free 3N + 1 design [Moir 2002]. His design
uses CAS in place of CCAS to acquire ownership while still preserving the logical
contents of the location being updated. The weaker progress guarantee makes this
possible by avoiding recursive helping: if t2 encounters t1 performing an MCAS
then t2 causes t1’s operation to abort if it is still UNDECIDED. This avoids the
need to CCAS because only the thread initiating an MCAS can now update its
status field to SUCCESSFUL: there is no need to check it upon each acquisition.
Finally, notice that algorithms built over MCAS will not meet the goal of read-
parallelism from Section 1.1. This is because MCAS must still perform CAS op-
erations on addresses for which identical old and new values are supplied: these
CAS operations force the address’s cache line to be held in exclusive mode on the
processor executing the MCAS.
[Figure 13: heap diagram omitted. It shows application locations a1 (100), a2 (200), a101 (300) and a102 (400); ownership records r1 (owned by tx1) and r2 (version 21); and descriptor tx1 with status UNDECIDED and entries a1: (100,15) −> (200,16) and a2: (200,21) −> (100,22).]
Fig. 13. The heap structure used in the lock-based WSTM. The commit operation acting on
descriptor tx1 is mid-way through an update to locations a1 and a2: 200 is being written to a1
and 100 to a2. Locations a101 and a102 are examples of other locations which happen to map to
the same ownership records, but which are not part of the update.
Each ownership record (orec) holds either a version number or a reference to a
transaction descriptor. It holds a version number when the orec is quiescent – that
is, when no transaction is trying to commit an update to a location associated with
it. Orec r2 is an example. Alternatively, when an update is being committed, the
orec refers to the descriptor of the transaction involved. In the figure transaction
tx1 is committing an update to addresses a1 and a2 – it has acquired orec r1
and is about to acquire r2. Within the transaction descriptor we indicate memory
accesses using the notation ai: (oi, voi) → (ni, vni), meaning that address ai is
being updated from value oi at version number voi to value ni at version number
vni. For a read-only access, oi = ni and voi = vni. For an update, vni = voi + 1.
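These version-number rules are mechanical enough to express as two tiny constructors (hypothetical helpers, not part of the WSTM API):

```python
# Each entry is (addr, (old_value, old_version), (new_value, new_version)).
def read_entry(addr, value, version):
    # A read-only access: value and version are unchanged.
    return (addr, (value, version), (value, version))

def update_entry(addr, old_value, old_version, new_value):
    # An update: the version number advances by exactly one.
    return (addr, (old_value, old_version), (new_value, old_version + 1))

# The a2 entry of the descriptor in Figure 13: a2: (200,21) -> (100,22).
assert update_entry('a2', 200, 21, 100) == ('a2', (200, 21), (100, 22))
r = read_entry('a1', 100, 15)
assert r[1] == r[2]       # for a read, oi = ni and voi = vni
```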
Figure 14 shows the definition of the data types involved. A wstm transaction
comprises a status field and a list of wstm entry structures. As indicated in the
figure, each entry provides the old and new values and the old and new version
numbers for a given memory access. In addition, the obstruction-free version of
WSTM includes a prev ownership field to co-ordinate helping between transactions.
The lock-based WSTM uses orecs of type orec basic: each orec is a simple
union holding either a version field or a reference to the owning transaction. The
obstruction-free WSTM requires orecs of type orec obsfree holding either a version
field or a pair containing an integer count alongside a reference to the owning trans-
action5 . In both implementations the current usage of a union can be determined
by applying the predicate IsWSTMDesc: if it evaluates true then the orec contains
a reference to a transaction, else it contains a version number. As before, these
predicates can be implemented by a reserved bit in the space holding the union.
A descriptor is well formed if for each orec it either (i) contains at most one entry
associated with that orec, or (ii) contains multiple entries associated with that orec,
but the old version number is the same in all of them and the new version number
is the same in all of them.
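This well-formedness condition can be captured directly as a predicate over a descriptor's entry list; addr_to_orec here is an assumed hash from addresses onto a small orec table, and the entry layout is the illustrative tuple form (addr, (old value, old version), (new value, new version)):

```python
def addr_to_orec(addr, n_orecs=64):
    # Assumed many-to-one mapping from addresses to ownership records.
    return (addr >> 2) % n_orecs

def well_formed(entries):
    seen = {}   # orec -> (old_version, new_version) first seen for it
    for (addr, (_, old_v), (_, new_v)) in entries:
        o = addr_to_orec(addr)
        if o in seen and seen[o] != (old_v, new_v):
            return False
        seen[o] = (old_v, new_v)
    return True

# Two addresses that map to the same orec under this hash:
assert well_formed([(0x100, (1, 5), (2, 6)), (0x500, (7, 5), (8, 6))])
# Same orec, but conflicting version numbers: not well formed.
assert not well_formed([(0x100, (1, 5), (2, 6)), (0x500, (7, 9), (8, 10))])
```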
5 The maximum count needed is the maximum number of concurrent commit operations. In
practice this means that orecs in the obstruction-free design are two words wide. Both IA-32 and
32-bit SPARC provide double-word-width CAS. On 64-bit machines without double-word-width
CAS, it is either possible to reserve sufficient high-order bits in a sparse 64-bit address space or
to add a level of indirection between orecs and temporary double-word structures.
 1  typedef struct {
 2      word status; // UNDECIDED, READ_CHECK, SUCCESSFUL, or FAILED
 3      list<wstm_entry*> *entries;
 4  } wstm_transaction;
 5  typedef struct {
 6      word *addr;
 7      word old_value, old_version;
 8      word new_value, new_version;
 9      orec_obsfree prev_ownership; // Only used in obstruction-free variant
10  } wstm_entry;
11  typedef union {
12      word version;
13      wstm_transaction *owner;
14  } orec_basic; // Used in the basic variant
15  typedef union {
16      word version;
17      <word, wstm_transaction*> <count, owner>;
18  } orec_obsfree; // Used in the obstruction-free variant
19  // Ensure tx outcome is decided - either FAILED or SUCCESSFUL
20  void force_decision(wstm_transaction *tx) {
21      CAS(&tx→status, UNDECIDED, FAILED);
22      CAS(&tx→status, READ_CHECK, FAILED);
23  }
24  // Mapping from addresses to orecs
25  orec *addr_to_orec(word *addr);
26  // Search a list of wstm_entry structures for the first one relating
27  // to a specified address (search_addr), or for the first one
28  // relating to any address associated with a specified orec
29  // (search_orec). In either case, a NULL return indicates that there
30  // is no match.
31  wstm_entry *search_addr(list<wstm_entry*> *entries, word *addr);
32  wstm_entry *search_orec(list<wstm_entry*> *entries, orec *orec_ptr);
33  // Filter a list of wstm_entry structures to obtain sub-lists
34  // relating to a given orec, or containing only reads (same
35  // old_version and new_version) or only updates (different old_version
36  // and new_version)
37  list<wstm_entry*> *filter_for_orec(list<wstm_entry*> *entries, orec *orec_ptr);
38  list<wstm_entry*> *filter_for_updates(list<wstm_entry*> *entries);
39  list<wstm_entry*> *filter_for_reads(list<wstm_entry*> *entries);
Fig. 14. Data structures and helper functions used in the WSTM implementation.
Figure 15 shows how the logical contents of a location are determined in the
WSTMRead and WSTMWrite functions. As usual, since we cannot determine the
logical contents during a READ-CHECK phase, we rely on threads encountering such
a descriptor to help decide its outcome (reaching states SUCCESSFUL or FAILED
and hence LS2 or LS3).
Both of these are built over get entry which finds (or adds) a wstm entry structure
to the given list of entries. get entry begins by checking whether or not the given
address is already in the list (lines 2–3). If not then a new entry is needed, holding
the logical contents along with the version number associated with that value. Lines
5–29 determine this following the structure of cases LS1–3. The properties from
Fig. 15. WSTMRead and WSTMWrite functions built over get entry.
Section 4.1 are key to allowing one thread to read another’s transaction descriptor
– recall that, aside from the status field, the descriptor is unchanged once it is made
reachable in shared storage. Note that we call force decision at line 14 to ensure that
any other descriptor we encounter is in either the FAILED or SUCCESSFUL state;
this ensures that (i) the descriptor cannot change status while we are reading from
it, and (ii) as in Section 4.4, we do not read from a descriptor in the READ-CHECK
state, as the logical contents cannot be determined.
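The guarantee used here – that after force decision a descriptor is stably FAILED or SUCCESSFUL – can be checked on a toy single-threaded model of the status field (cas_status stands in for CAS):

```python
UNDECIDED, READ_CHECK, SUCCESSFUL, FAILED = range(4)

class TxStatus:
    def __init__(self, status):
        self.status = status

def cas_status(tx, expected, new):
    # Single-threaded stand-in for CAS on the status field.
    if tx.status == expected:
        tx.status = new

def force_decision(tx):
    cas_status(tx, UNDECIDED, FAILED)    # line 21 of Figure 14
    cas_status(tx, READ_CHECK, FAILED)   # line 22 of Figure 14

# Whatever state the descriptor starts in, it ends up decided, and an
# already-decided descriptor is left untouched.
for before, after in [(UNDECIDED, FAILED), (READ_CHECK, FAILED),
                      (SUCCESSFUL, SUCCESSFUL), (FAILED, FAILED)]:
    tx = TxStatus(before)
    force_decision(tx)
    assert tx.status == after
```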
Lines 32-38 ensure that the descriptor will remain well formed when the new
entry is added to it. This is done by searching for any other entries relating to
the same orec (line 33). If there is an existing entry then the old version numbers
must match (line 34). If the numbers do not match then a concurrent transaction
has committed an update to a location involved with the same orec – tx is doomed
to fail (line 35). Line 36 ensures the descriptor remains well formed even if it is
doomed to fail. Line 37 ensures that the new entry has the same new version as the
existing entries – e.g. if there was an earlier WSTMWrite in the same transaction
that was made to another address associated with this orec.
WSTMRead (lines 42–46) simply returns the new value for the entry for addr.
WSTMWrite (lines 47-53) updates the entry’s new value (lines 48–50). It must
ensure that the new version number indicates that the orec has been updated (lines
51–52), both in the entry for addr and, to ensure well formedness, in any other
entries for the same orec.
6.3 Uncontended commit operations
Figure 16 shows the overall structure of WSTMCommitTransaction built using helper
functions for actually acquiring and releasing ownership of orecs. The structure
follows the design method in Section 4: orecs associated with updated locations are
acquired (lines 25–27), then a read-check phase checks that the version numbers in
orecs associated with reads are still current (lines 28–31). If both phases succeed
then an attempt is made to set the descriptor status to SUCCESSFUL (lines 33–34)
and, if that succeeds, the updates are made (lines 35–37). Finally, ownership of
the orecs is released (lines 38–39). Notice how the definition of LS2 means that
setting the status to SUCCESSFUL atomically updates the logical contents of all of
the locations written by the transaction.
The read-check phase uses read check orec to check that the current version num-
ber associated with an orec matches the old version in the entries in a transaction
descriptor. As with get entry in Figure 15, if it encounters another transaction
descriptor then it ensures that its outcome is decided (line 12) before examining it.
In our lock-based WSTM implementation the helper functions prepare descriptor,
acquire orec and release orec are straightforward, as shown in Figure 16.
This implementation is based on using the orecs as mutual exclusion locks, al-
lowing at most one transaction to own an orec at a given time. A transaction owns
an orec when the orec contains a pointer to the transaction descriptor. To avoid
deadlock, prepare descriptor (lines 1–4) ensures that the entries are sorted, for in-
stance by address. Ownership acquisition involves two cases: (i) the orec is already
owned by the current transaction because its descriptor holds multiple entries for
the same orec (line 9), (ii) the orec is not yet acquired by us, in which case CAS
is used to replace the current transaction’s old version number with a reference to
its descriptor (lines 11–13). Note that the loop at line 13 will spin while the value
seen in the orec is another transaction’s descriptor. Releasing an orec is straight-
forward: mutual exclusion allows us to directly update any orecs that we acquired
(lines 19–22).
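A single-threaded Python sketch of this protocol (our names; spinning, helping and failure clean-up omitted) shows the essential shape: entries sorted to give a global acquisition order, acquisition by swapping a version number for the transaction, and release writing back the appropriate version:

```python
class Tx:
    def __init__(self, entries):
        # entries: {orec_id: (old_version, new_version)}, sorted as in
        # prepare_descriptor so all transactions acquire in the same order.
        self.entries = dict(sorted(entries.items()))
        self.status = 'UNDECIDED'

def acquire_all(orecs, tx):
    for o, (old_v, _) in tx.entries.items():
        if orecs[o] != old_v:      # stand-in for the acquiring CAS
            return False           # owned by someone else, or version moved on
        orecs[o] = tx              # orec now points at the descriptor
    return True

def release_all(orecs, tx):
    for o, (old_v, new_v) in tx.entries.items():
        if orecs[o] is tx:
            orecs[o] = new_v if tx.status == 'SUCCESSFUL' else old_v

# The commit of Figure 13: tx owns r1 (version 15) and r2 (version 21).
orecs = {'r1': 15, 'r2': 21}
tx = Tx({'r2': (21, 22), 'r1': (15, 16)})
assert acquire_all(orecs, tx) and orecs['r1'] is tx
tx.status = 'SUCCESSFUL'
release_all(orecs, tx)
assert orecs == {'r1': 16, 'r2': 22}
```

In the real implementation the failed-CAS case spins while another transaction owns the orec, and a transaction that fails part-way releases whatever it did acquire; both details are elided here.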
Figure 18 shows graphically how an uncontended commit operation can proceed
for the transaction in Figure 13.
Fig. 17. Basic lock-based implementation of the helper functions for WSTMCommitTransaction.
[Figure 18: sequence of heap diagrams omitted. They show descriptor tx1 – with entries a1: (100,15) −> (200,16) and a2: (200,21) −> (100,22) – acquiring ownership records r1 (version 15) and r2 (version 21), setting its status from UNDECIDED to SUCCESSFUL, writing the new values 200 and 100 back to a1 and a2, and finally releasing the orecs with the new version numbers; only the captions below are recoverable.]
(b) CAS is used to acquire ownership record r1 replacing the expected version number with a
pointer to the transaction descriptor.
(e) The updates are written back to the heap. There is no need for the two writes to be atomic
with one another.
Fig. 18. An uncontended commit swapping the contents of a1 and a2, showing where updates are
to be performed at each step.
Fig. 19. Obstruction-free implementation of the prepare descriptor helper function for WSTMCom-
mitTransaction.
Fig. 20. Obstruction-free implementation of the acquire orec and release orec helper functions for
WSTMCommitTransaction.
The main complexity in the obstruction-free design comes from allowing own-
ership to transfer directly from one descriptor to another: notice that if a thread
committing one transaction, say t2 performing tx2, encounters an orec that has
been acquired by another transaction, say tx1, then t2 cannot simply replace the
<count, owner> pair <1, tx1> with <2, tx2> – addresses whose logical contents
were determined from tx1 under case LS2 may change because tx2 may not have
entries for those addresses.
Figure 19 shows the obstruction-free prepare descriptor function that enables safe
ownership stealing. This is based on the idea that, before a transaction tx starts
acquiring any ownerships, its descriptor is extended to include relevant entries from
descriptors from which it may need to steal. This means that the logical contents
of these entries’ addresses will be unchanged if stealing occurs. Of course, good
run-time performance depends on contention – and hence stealing – being rare.
prepare descriptor works in two stages. Firstly, it builds up a set containing the
orecs that it will need to acquire (lines 2–8), recording the previous value so that, if
stealing is necessary, it is possible to check that the ownership has not changed since
the value was stashed in prev ownership at line 6. Secondly, for each of those orecs,
it checks if it is currently owned (lines 11–12). If it is owned then the owner is forced
to a decision so that its descriptor can be examined (line 14), and the descriptor is
searched for entries relating to the orec. For each such entry the logical contents
and version number are determined (lines 16–22), it is checked that tx will remain
well formed if those entries are merged into it (lines 23–24), and it is checked that
tx does not already contain an entry for the same address (line 25). If all these
checks succeed then a new entry is added to tx containing the logical contents of
the address (lines 26–29).
Figure 20 shows how acquire orec and release orec are modified to enable stealing.
A third case is added to acquire orec: lines 13-17 steal ownership so long as the
contents of the orec have not changed since the victim’s entries were merged into tx
in prepare descriptor (i.e. the orec’s current owner matches the prev ownership value
recorded during prepare descriptor). Note that we re-use the prev ownership field in
acquire orec to indicate which entries led to a successful acquisition – this is needed
in release orec to determine which entries to release. release orec itself is now split
into three cases. Firstly, if the count field will remain above 0, the count is simply
decremented because other threads may still be able to perform delayed writes to
locations controlled by the orec (lines 27–28). Secondly, if the count is to return to
0 and our descriptor tx is still the owner, we take the old or new version number as
appropriate (lines 30–31).
The third case is that the count is to return to 0, but ownership has been stolen
from our descriptor (lines 33–36). In this case we must re-perform the updates from
the current owner before releasing the orec (lines 34–36). This ensures that the
current logical contents are written back to the locations, overwriting any delayed
writes from threads that released the orec earlier. Note that we do not need to call
force decision before reading from the current owner: the count is 1, meaning that
the thread committing tx is the only one working on that orec and so the descriptor
referred to from the orec must have already been decided (otherwise there would
be at least two threads working on the orec).
6.5 Discussion
The obstruction-free design from Figures 19 and 20 is clearly extremely complicated.
Aside from its complexity, the design has an undesirable property under high con-
tention: if a thread is pre-empted between calling acquire orec and release orec then
the logical contents of locations associated with that orec cannot revert to being
held in the application heap until the thread is re-scheduled.
Although we do not present them here in detail, there are a number of extensions
to the WSTM interface which add to the range of settings in which it can be used.
In particular, we provide a WSTMDiscardUpdate operation which takes an address and acts as
a hint that the WSTM implementation is permitted (but not required) to discard
any updates made to that address in the current transaction. This can simplify
the implementation of some data structures in which shared ‘write-only’ locations
exist. For instance, in the red-black trees we use in Section 8, the implementation
of rotations within the trees is made easier if references to dummy nodes are used in
place of NULL pointers. If a single dummy node is used then updates to its parent
pointer produce contention between logically non-conflicting transactions: in this
case we can either use separate nodes or use WSTMDiscardUpdate on the dummy
node’s parent pointer.
[Figure 21: diagrams omitted. Part (a) shows list nodes chained through OSTM handles, which are private structures of the STM; part (b) shows a transaction descriptor with status UNDECIDED, an empty read-only list, and a sorted write list whose single entry refers to node 2.]
(a) Example OSTM-based linked list structure used by pseudocode in Figure 4. List nodes are
chained together via OSTM handles, which are private to the STM. References to OSTM handles
must be converted to list-node references using OSTMOpenForReading or OSTMOpenForWriting
within the scope of a transaction.
(b) Example of a transaction attempting to delete node 3 from the list introduced above. The
transaction has accessed one object (node 2) which it has opened for writing. The read-only list
is therefore empty, while the sorted write list contains one entry describing the modified node 2.
Entries in the sorted write list hold a pointer to a thread-local shadow copy of the data-block.
Figure 21(b) illustrates the use of transaction descriptors and OSTM handles by
showing a transaction in the process of deleting a node from an ordered linked list.
Ordinarily, OSTM handles refer to the current version of the object’s data via
a pointer to the current data-block. However, if a transaction is in the process of
committing an update to the object, then the handle refers to the descriptor for the
owning transaction. If the transaction in Figure 21(b) were attempting to commit
then the OSTM handle for node 2 would be updated to contain a pointer to the
transaction’s descriptor. Concurrent reads can still determine the object’s current
value by searching the sorted write list and returning the appropriate data-block
depending on the transaction’s status. Once again notice that, unlike MCAS, a
commit is co-ordinated without needing to use reserved values in the application’s
data structures, and so full word-size values can be held in OSTM objects.
As usual, a predicate IsOSTMDesc distinguishes between references to a data-
block and references to a transaction descriptor.
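The open-for-writing path can be pictured with a small Python model (Handle, Tx and open_for_writing are our illustrative names, not OSTM's API): the handle keeps pointing at the current data-block while the transaction mutates a private shadow copy recorded in its write list:

```python
import copy

class Handle:
    def __init__(self, block):
        self.ref = block          # current data-block (or a tx descriptor)

class Tx:
    def __init__(self):
        self.write_list = {}      # handle -> private shadow copy

def open_for_writing(tx, handle):
    # First open of this object takes a shadow copy; later opens in the
    # same transaction return the same shadow.
    if handle not in tx.write_list:
        tx.write_list[handle] = copy.deepcopy(handle.ref)
    return tx.write_list[handle]

h = Handle({'value': 3, 'next': None})
tx = Tx()
shadow = open_for_writing(tx, h)
shadow['value'] = 4
assert h.ref['value'] == 3                     # heap copy unchanged so far
assert open_for_writing(tx, h)['value'] == 4   # same shadow returned
```

On commit the shadow copies recorded in the write list would be installed by swinging each handle, which is exactly the point at which the descriptor pointer is published.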
Fig. 22. OSTM’s OSTMOpenForReading and OSTMOpenForWriting interface calls. Algorithms for
read and sorted write lists are not given here. Instead, search, insert and remove operations are
assumed to exist, e.g. acting on linked lists of obj entry structures.
7.5 Discussion
Our lock-free OSTM was developed concurrently with an obstruction-free design by
Herlihy et al [Herlihy et al. 2003]. We include both in our experimental evaluation.
The two designs are similar in the use of handles as a point of indirection and the
use of transaction descriptors to publish the updates that a transaction proposes
to make.
The key difference lies in how transactions proceed before they attempt to com-
mit. In our scheme transactions operate entirely in private and so descriptors are
only revealed when a transaction is ready to commit. In Herlihy et al’s DSTM
design each open operation causes the transaction to acquire the object in
question. This allows a wider range of contention management strategies because
contention is detected earlier than with our scheme. However, it means that their
design cannot be made lock-free: in our scheme threads can help one another’s
commit operations, but in their scheme it would be necessary for threads to help
one another’s entire transactions.
It is interesting to note that, unlike MCAS, we cannot obtain useful simplifica-
tions of our OSTM implementation by moving from lock freedom to obstruction
freedom. This is because the data pointers in OSTM handles serve to uniquely
identify a given object-state and so lock-freedom can be obtained without needing
CCAS to avoid A-B-A problems when acquiring ownership.
8. EVALUATION
There is a considerable gap between the pseudocode designs presented for MCAS,
WSTM and OSTM and a useful implementation of those algorithms on which to
base our evaluation. In this section we highlight a number of these areas elided in
the pseudocode and then assess the practical performance of our implementations
by using them to build concurrent skip-lists and red-black trees.
8.1 Implementation concerns
We consider three particular implementation problems: supporting nested trans-
actions for composition (Section 8.1.1), distinguishing descriptors from application
data (Section 8.1.2) and managing the memory within which descriptors are con-
tained (Section 8.1.3).
8.1.1 Nested transactions. In order to allow composition of STM-based opera-
tions we introduce limited support for nested transactions. This takes the simple
form of counting the number of StartTransaction invocations that are outstand-
ing in the current thread and only performing an actual CommitTransaction when
the count is returned to zero. This means that it is impossible to abort an inner
transaction without aborting its enclosing transactions.
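This flat-nesting scheme amounts to a per-thread counter; a minimal Python sketch (our names, with commit_fn standing in for the underlying CommitTransaction):

```python
class TxContext:
    def __init__(self, commit_fn):
        self.depth = 0
        self.commit_fn = commit_fn

    def start(self):
        # Corresponds to StartTransaction: just track the nesting depth.
        self.depth += 1

    def commit(self):
        # Corresponds to CommitTransaction: only the outermost commit
        # performs real work.
        self.depth -= 1
        if self.depth == 0:
            return self.commit_fn()
        return None                   # inner commit: nothing happens yet

log = []
ctx = TxContext(lambda: log.append('committed') or True)
ctx.start(); ctx.start()       # outer, then nested
assert ctx.commit() is None    # inner commit is a no-op
assert ctx.commit() is True    # outer commit does the real work
assert log == ['committed']
```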
An alternative implementation would be to use separate descriptors for enclosed
transactions and, upon commit, to merge these into the descriptor of the next
enclosing transaction. This would allow an enclosed transaction to be aborted and retried
without requiring that all of the enclosing transactions be aborted.
8.1.2 Descriptor identification. To allow implementation of the IsMCASDesc,
IsCCASDesc, IsWSTMDesc and IsOSTMDesc predicates from Sections 5–7, there
needs to be a way to distinguish pointers to descriptors from other valid memory
values.
We do this by reserving the two low-order bits in each pointer that may refer
to a descriptor. This limits CCAS and MCAS to only operate on pointer-typed
locations, as dynamically distinguishing a descriptor reference from an integer with
the same representation is not generally possible. However, OSTM descriptors
are only ever installed in place of data-block pointers, so OSTM trivially complies with this requirement.
6 As we explain below, this paper considers workloads with at least one CPU available for each
thread. Schemes like SMR or pass-the-buck would be necessary for prompt memory re-use in
workloads with more threads than processors.
The same storage method is used for the per-transaction object lists maintained
by OSTM. Each transaction descriptor contains a pool of embedded entries that
are sequentially allocated as required. If a transaction opens a very large number
of objects then further descriptors are allocated and chained together to extend the
list.
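The embedded-pool idea can be sketched as a chain of fixed-size blocks (the pool size and names here are illustrative):

```python
POOL_SIZE = 4   # illustrative; a real implementation would tune this

class EntryBlock:
    def __init__(self):
        self.entries = []     # sequentially allocated embedded entries
        self.next = None      # overflow chains a further block

def add_entry(block, entry):
    # Walk to the last block, extending the chain when the pool is full.
    while True:
        if len(block.entries) < POOL_SIZE:
            block.entries.append(entry)
            return
        if block.next is None:
            block.next = EntryBlock()
        block = block.next

head = EntryBlock()
for i in range(10):
    add_entry(head, i)
assert head.entries == [0, 1, 2, 3]
assert head.next.entries == [4, 5, 6, 7]
assert head.next.next.entries == [8, 9]
```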
Each result is averaged over a number of runs with error bars indicating the minimum and maximum results achieved.
For brevity our experiments only consider non-multiprogrammed workloads where
there are sufficient CPUs for running the threads in parallel. The main reason for
this setting is that it enables a fairer comparison between our non-blocking designs
and those based on locks. If we did use more threads than available CPUs then,
when testing lock-based designs, a thread could be pre-empted at a point where
it holds locks, potentially obstructing the progress of the thread scheduled in its
place. Conversely, when testing non-blocking designs, even obstruction freedom
precludes pre-empted threads from obstructing others – in all our designs, the time
taken to remove or help an obstructing thread is vastly less than typical scheduling
quanta. Comparisons of multiprogrammed workloads would consequently be highly
dependent on how well the lock implementation is integrated with the scheduler.
In addition to gathering the performance figures presented here, our benchmark
harness can run in a testing mode, logging the inputs, results and invocation and
response timestamps for each operation. Logging every operation incurs a significant
performance penalty, so this mode is of course never enabled in
normal use. We used an off-line checker to ensure that these observations are
linearizable. Although this problem is NP-complete in general [Wing and Gong 1993],
a greedy algorithm which executes a depth-first search to determine a satisfactory
ordering for the invocations works well in practice [Fraser 2003]. This was invalu-
able for finding implementation errors such as missing memory-ordering barriers,
complementing other techniques such as proofs and model checking which ensure
algorithmic correctness.
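A toy version of such a checker, specialised to a set of integers, might look like this; the record layout and the greedy precedence test are our simplification of the approach, not the checker from [Fraser 2003]:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy off-line linearizability check: depth-first search for a
 * sequential order of logged operations that (a) respects real-time
 * precedence (A precedes B if A's response came before B's
 * invocation) and (b) is legal for a sequential set model. */

typedef struct { int inv, res; char op; int key; bool ok; } record_t;
/* op: 'i' insert, 'l' lookup; ok: the result the operation returned. */

#define MAXOPS 16
#define MAXKEY 64

/* Apply one operation to the model; true if its result was legal. */
static bool model_step(bool set[MAXKEY], const record_t *r) {
    if (r->op == 'i') {
        bool was = set[r->key];
        set[r->key] = true;
        return (!was) == r->ok;        /* insert succeeds iff absent */
    }
    return set[r->key] == r->ok;       /* lookup reports membership */
}

static bool dfs(const record_t *h, int n, bool done[], bool set[MAXKEY]) {
    int remaining = 0;
    for (int i = 0; i < n; i++) remaining += !done[i];
    if (remaining == 0) return true;
    for (int i = 0; i < n; i++) {
        if (done[i]) continue;
        bool minimal = true;   /* no unscheduled op finished before i began */
        for (int j = 0; j < n; j++)
            if (!done[j] && h[j].res < h[i].inv) { minimal = false; break; }
        if (!minimal) continue;
        bool saved[MAXKEY];
        memcpy(saved, set, sizeof saved);
        if (model_step(set, &h[i])) {
            done[i] = true;
            if (dfs(h, n, done, set)) return true;
            done[i] = false;
        }
        memcpy(set, saved, sizeof saved);   /* undo and try another op */
    }
    return false;
}

bool linearizable(const record_t *h, int n) {
    bool done[MAXOPS] = { false }, set[MAXKEY] = { false };
    return dfs(h, n, done, set);
}
```

The search backtracks on any ordering whose results are inconsistent with the model, so a history is accepted exactly when some valid linearization exists.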
We compare 14 different set implementations: 6 based on red-black trees and
8 based on skip lists. Many of these are lock-based and, in the absence of pub-
lished designs, were created for the purpose of running these tests to provide as
strong contenders as possible; we have made their source code publicly available for
inspection and Fraser describes the contenders in more detail as part of his PhD
dissertation [Fraser 2003]. Fraser also considers general binary search trees and
develops a range of non-blocking and lock-based designs.
As well as our own STM designs, we also include Herlihy’s DSTM coupled with
a simple ‘polite’ contention manager that uses exponential backoff to deal with
conflicts [Herlihy et al. 2003]. We include this primarily because it is a configuration
that has been widely studied in the literature and so it serves as a good common
point of comparison between different designs. As others have shown, DSTM results
are sensitive to the contention management strategies and so our results should not
be seen as a thorough comparison between DSTM and other STMs.
Where needed by lock-based algorithms we use Mellor-Crummey and Scott’s
(MCS) scalable queue-based spinlocks which avoid unnecessary cache-line transfers
between CPUs that are spinning on the same lock [Mellor-Crummey and Scott
1991a]. Although seemingly complex, the MCS operations are highly competitive
even when the lock is not contended; an uncontended lock is acquired or released
with a single read-modify-write access. Furthermore, contended MCS locks create
far less memory traffic than standard test-and-set or test-and-test-and-set locks.
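The acquire/release protocol can be sketched with C11 atomics; the type and field names below are ours, and memory orderings are left at the sequentially consistent defaults for clarity:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of an MCS queue lock: each waiter spins on a flag in its own
 * queue node, so contended waiting does not bounce a shared cache
 * line between CPUs.  Uncontended acquire and release each perform a
 * single atomic read-modify-write. */

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    mcs_node_t *pred = atomic_exchange(&l->tail, me);   /* single RMW */
    if (pred != NULL) {               /* contended: enqueue and spin */
        atomic_store(&me->locked, true);
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))
            ;                         /* spin on our own node only */
    }
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                   /* no successor: single RMW */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                         /* successor is mid-enqueue */
    }
    atomic_store(&succ->locked, false);   /* hand the lock on */
}
```

Because each thread spins only on its own node's flag, the lock hand-off touches one remote cache line rather than invalidating a line shared by every waiter.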
Where multi-reader locks are required we use another queue-based design by the
same authors which allows adjacently-queued readers to enter their critical regions
simultaneously when the first of the sequence reaches the head of the queue [Mellor-
Crummey and Scott 1991b].
In summary, the 14 set implementations considered here are:
(1) Skip lists with per-pointer locks. Pugh describes a highly-concurrent skip list
implementation which uses per-pointer mutual-exclusion locks [Pugh 1990]. Any
update to a pointer must be protected by its lock. Deleted nodes have their pointers
updated to link backwards thus ensuring that a search correctly backtracks if it
traverses into a defunct node.
(2) Skip lists with per-node locks. Although per-pointer locking successfully lim-
its the possibility of conflicts, the overhead of acquiring and releasing so many locks
is an important consideration. We therefore include Pugh’s design using per-node
locks. The operations are identical to those for per-pointer locks, except that a
node’s lock is acquired before it is first updated and continuously held until after
the final update to the node. Although this slightly increases the possibility of
conflict between threads, in many cases this is more than repaid by the reduced
locking overheads.
(3) Skip lists built directly from CAS. The direct-CAS design performs compos-
ite update operations using a sequence of individual CAS instructions, with no
need for dynamically-allocated descriptors. This means that great care is needed
to ensure that updates occur atomically and consistently. Fraser [2003] and Sundell
and Tsigas [2004] both provide pseudocode algorithms for non-blocking versions of
the usual skip-list operations. In outline, list membership is defined according to
presence in the lowest level. Insertion or deletion is performed on each level in turn
as an independent linked list, using Harris’s marking technique [Harris 2001] to
logically delete a node from each level of the skip list. This implementation is used
to show the performance gains that are possible using an intricate non-blocking
system when compared with one built from MCAS, WSTM or OSTM. Of course,
the STM-based implementations do allow composability whereas the CAS-based
design does not.
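The marking technique can be illustrated at a single list level; the helper names below are ours, and the sketch omits the search and physical-unlinking phases of the full algorithm:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of Harris-style logical deletion: a node is deleted by
 * atomically setting a mark bit in its next pointer.  Any concurrent
 * CAS on that pointer (e.g. an insertion after the node) then fails,
 * because the expected unmarked value no longer matches. */

typedef struct node { int key; _Atomic uintptr_t next; } node_t;

#define MARK 1u
bool is_marked(uintptr_t p)   { return (p & MARK) != 0; }
node_t *unmarked(uintptr_t p) { return (node_t *)(p & ~(uintptr_t)MARK); }

/* Logically delete n: set the mark bit, retrying if next changes. */
bool logically_delete(node_t *n) {
    for (;;) {
        uintptr_t nxt = atomic_load(&n->next);
        if (is_marked(nxt)) return false;   /* already deleted */
        if (atomic_compare_exchange_strong(&n->next, &nxt, nxt | MARK))
            return true;
    }
}

/* Insert n directly after prev; fails if prev has been marked. */
bool insert_after(node_t *prev, node_t *n) {
    uintptr_t nxt = atomic_load(&prev->next);
    if (is_marked(nxt)) return false;
    atomic_store(&n->next, nxt);
    return atomic_compare_exchange_strong(&prev->next, &nxt, (uintptr_t)n);
}
```

Applying this independently at each level of the skip list gives the per-level deletion step described above.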
(4) Skip lists built using MCAS. Insertions and deletions proceed by building up
batches of memory updates that are applied through a single MCAS invocation. As with
Pugh’s schemes, pointers within deleted nodes are reversed to aid concurrent searches.
(5-6) Skip lists built using WSTM. Skip lists can be built straightforwardly from
single-threaded code using WSTM. We consider two variants: a non-blocking WSTM
built using double-word-width compare and swap and a version using the suspen-
sion scheme described in Section 6.4.
(7-8) Skip lists built using OSTM. Skip lists can be built straightforwardly using
OSTM by representing each list node as a separate OSTM object. We consider two
variants: the lock-free OSTM scheme described in Section 7 and Herlihy et al.'s
obstruction-free STM [Herlihy et al. 2003]. We couple the obstruction-free STM
with a ‘polite’ contention manager which introduces exponential backoff to deal
with conflicts [Scherer III and Scott 2005].
(9) Red-black trees with serialised writers. Unlike skip lists there has been little
practical work on parallelism in balanced trees. Our first design [Fraser 2003]
builds on Hanke’s [Hanke 1999] and uses lock-coupling when searching down the
tree, upgrading to a write mode when performing rebalancing (taking care to avoid
deadlock by upgrading in down-the-tree order). A global mutual-exclusion lock is
used to serialise concurrent writers.
(10) Red-black trees with concurrent writers. Our second scheme allows concur-
rent writing by decomposing tree operations into a series of local updates on tree
fragments [Fraser 2003]. It is similar to Hanke’s relaxed red-black tree in that it
decouples the steps of rebalancing the tree from the actual insertion or deletion of a
node [Hanke et al. 1997]. Although lock-based, the style of the design is reminiscent
of optimistic concurrency control because each local update is preceded by checking
part of the tree in private to identify the sets of locks needed, retrying this stage if
inconsistency is observed.
(11-12) Red-black trees built using WSTM. As with skip lists, red-black trees can
be built straightforwardly from single-threaded code using WSTM. However, there
is one caveat. In order to reduce the number of cases to consider during rotations,
and in common with standard designs, we use a black sentinel node in place of
NULL child pointers in the leaves of the tree. We use write discard so that needless
updates to the sentinel's parent pointer do not introduce contention.
(13-14) Red-black trees built using OSTM. As with skip lists, each node is rep-
resented by a separate OSTM object, so nodes must be opened for the appropriate
type of access as the tree is traversed. We consider implementations using OSTM
from Section 7 and Herlihy et al.'s obstruction-free STM [Herlihy et al. 2003] cou-
pled with the ‘polite’ contention manager. As before, we use write discard on the
sentinel node.⁷
We now consider our performance results under a series of scenarios. Section 8.2.1
looks at scalability under low contention. This shows the performance of our non-
blocking systems when they are running on machines with few CPUs, or when they
are being used carefully to reduce the likelihood that concurrent operations conflict.
Our second set of results, in Section 8.2.2, considers performance under increasing
levels of contention.
8.2.1 Scalability under low contention. The first set of results measure perfor-
mance when contention between concurrent operations is very low. Each experi-
ment runs with a mean of 2¹⁹ keys in the set, which is sufficient to ensure that
parallel writers are extremely unlikely to update overlapping sections of the data
structure. A well-designed algorithm which provides disjoint-access parallelism will
avoid introducing contention between these logically non-conflicting operations.
Note that all the graphs in this section show a significant drop in performance
when parallelism increases beyond 5 to 10 threads. This is due to the architecture
of the underlying hardware: small benchmark runs execute within one or two CPU
⁷ Herlihy et al.'s OSTM cannot readily support write discard because only one thread may have
an OSTM object open for writing at a time. Their early release scheme applies only to read-only
accesses. To avoid contention on the sentinel node we augmented their STM with a mechanism
for registering objects with non-transactional semantics: such objects can be opened for writing
by multiple threads, but the shadow copies remain thread-private and are discarded on commit
or abort.
[Figure 24 appears here: two graphs of CPU time per operation (µs) against number of processors, (a) up to 90 processors and (b) up to 5 processors, comparing WSTM (with and without suspension), Herlihy-OSTM, OSTM, the per-pointer and per-node lock-based designs, MCAS and the CAS-based design.]
Fig. 24. Graph (a) shows the performance of large skip lists (K = 2¹⁹) as parallelism is increased
to 90 threads. Graph (b) is a ‘zoom’ of (a), showing the performance of up to 5 threads. As with
all our graphs, lines marked with boxes represent lock-based implementations, circles are OSTMs,
triangles are WSTMs and crosses are implementations built from MCAS or directly from CAS.
The ordering in the key reflects the ordering of the lines at the right-hand side of the graph: lower
lines are achieved by faster implementations.
‘quads’, each of which has its own on-board memory. Most or all memory reads
in small runs are therefore serviced from local memory which is considerably faster
than transferring cache lines across the switched inter-quad backplane.
Figure 24 shows the performance of each of the skip-list implementations. As ex-
pected, the STM-based implementations perform poorly compared with the other
lock-free schemes; this demonstrates that there are significant overheads associated
with the read and write operations (in WSTM) or with maintaining the lists of
opened objects and constructing shadow copies of updated objects (in OSTM).
Additionally, access validation is necessary in these cases, unlike in lock-based schemes.
The lock-free CAS-based and MCAS-based designs perform extremely well be-
cause, unlike the STMs, they add only minor overheads on each memory access.
Interestingly, under low contention the MCAS-based design has almost identical
performance to the much more complicated CAS-based design — the extra complex-
ity of using hardware primitives directly is not always worthwhile. Both schemes
surpass the two lock-based designs, of which the finer-grained scheme is slower be-
cause of the costs associated with traversing and manipulating the larger number
of locks.
Figure 25, presenting results for red-black trees, gives the clearest indication of
the kinds of setting where our APIs are effective. Neither of the lock-based schemes
scales effectively with increasing parallelism; indeed, both OSTM and WSTM-based
trees out-perform the schemes using locking with only 2 concurrent threads. Of
course, the difficulty of designing effective lock-based trees motivated the develop-
ment of skip lists, so it is interesting to observe that a straightforward tree imple-
mentation, layered over STM, does scale well and often performs better than a skip
list implemented over the same STM.
Surprisingly, the lock-based scheme that permits parallel updates performs hardly
any better than the much simpler and more conservative design with serialised
updates. This is because the main performance bottleneck in both schemes is
contention when accessing the multi-reader lock at the root of the tree. Although
multiple readers can enter their critical region simultaneously, there is significant
contention for updating the shared synchronisation fields within the lock itself. Put
simply, using a more permissive type of lock (i.e., multi-reader) does not improve
performance because the bottleneck is caused by cache-line contention rather than
lock contention.
In contrast, the STM schemes scale very well because transactional reads do
not cause potentially-conflicting memory writes in the underlying synchronisation
primitives. We suspect that, under low contention, OSTM is faster than Herlihy's
design due to better cache locality. Herlihy’s STM requires a double indirection
when opening a transactional object: thus three cache lines are accessed when
reading a field within a previously-unopened object. In contrast our scheme accesses
two cache lines; more levels of the tree fit inside each CPU’s caches and, when
traversing levels that do not fit in the cache, 50% fewer lines must be fetched from
main memory.
8.2.2 Performance under varying contention. The second set of results shows
how performance is affected by increasing contention — a particular concern for
non-blocking algorithms, which usually assume that conflicts are rare. This assump-
tion allows the use of optimistic techniques for concurrency control; when conflicts
do occur they are handled using a fairly heavyweight mechanism such as recursive
helping or interaction with the thread scheduler. Contrast this with using locks,
where an operation assumes the worst and ‘announces’ its intent before accessing
shared data: that approach introduces unnecessary overheads when contention is
[Figure 25 appears here: two graphs of CPU time per operation (µs) against number of processors, (a) up to 90 processors and (b) up to 5 processors, comparing the lock-based designs (serialised and concurrent writers), WSTM (with and without suspension), Herlihy-OSTM and OSTM.]
Fig. 25. Graph (a) shows the performance of large red-black trees (K = 2¹⁹) as parallelism is
increased to 90 threads. Graph (b) is a ‘zoom’ of (a), showing the performance of up to 5 threads.
low because fine-grained locking requires expensive juggling of acquire and release
invocations. The results here allow us to investigate whether these overheads pay
off as contention increases.
All experiments are executed with 90 parallel threads (P = 90). Contention
is varied by adjusting the benchmark parameter K and hence the mean size of
the data structure under test. Although not a general-purpose contention metric,
this is sufficient to allow us to compare the performance of a single data structure
implemented over different concurrency-control mechanisms.
[Two graphs appear here: CPU time per operation (µs) against mean set size (2³ to 2¹⁹) with 90 threads. The first compares the skip-list implementations (WSTM with and without suspension, Herlihy-OSTM, OSTM, per-pointer and per-node locks, MCAS, CAS-based); the second compares the red-black trees (lock-based with concurrent and serialised writers, WSTM with and without suspension, Herlihy-OSTM, OSTM).]
Herlihy’s STM performs comparatively poorly under high contention when using
an initial contention-handling mechanism which introduces exponential backoff to
‘politely’ deal with conflicts. Other schemes may work better [Scherer III and Scott
2005] and, since DSTM is obstruction-free, they would be expected to influence
its performance more than OSTM's. Once again, the results here are intended
primarily to compare our designs with lock-based alternatives; we include DSTM
in this sample configuration because it has been widely studied in the literature.
Marathe et al perform further comparisons of DSTM and OSTM [Marathe et al.
2004].
Note that, using this basic contention manager, the execution times of individual
operations are very variable, which explains the performance ‘spike’ at the left-hand
side of the graph. This low and variable performance is caused by sensitivity to
the choice of backoff rate: our implementation uses the same values as the original
authors, but these were chosen for a Java-based implementation of red-black trees
and they do not discuss how to choose a more appropriate set of values for different
circumstances.
9. CONCLUSION
We have presented three APIs for building non-blocking concurrency-safe software
and have demonstrated that these can match or surpass the performance of state-of-
the-art lock-based alternatives. Thus, not only do non-blocking systems have many
functional advantages compared with locks (such as freedom from deadlock and
unfortunate scheduler interactions), but they can also be implemented on modern
multiprocessor systems without the run-time overheads that have traditionally been
feared.
Furthermore, APIs such as STM have benefits in ease of use compared with
traditional direct use of mutual-exclusion locks. An STM avoids the need to consider
issues such as granularity of locking, the order in which locks should be acquired
to avoid deadlock, and composability of different data structures or subsystems.
This ease of use is in contrast to traditional implementations of non-blocking data
structures based directly on hardware primitives such as CAS.
In conclusion, using the APIs that we have presented in this paper, it is now
practical to deploy lock-free techniques, with all their attendant advantages, in
situations where lock-based synchronisation would traditionally be the only viable
option.
ACKNOWLEDGMENT
This work has been supported by donations from the Scalable Synchronization
Research Group at Sun Labs Massachusetts. The evaluation was carried out on the
Cambridge-Cranfield High Performance Computing Facility.
REFERENCES
Ananian, C. S., Asanović, K., Kuszmaul, B. C., Leiserson, C. E., and Lie, S. 2005. Un-
bounded transactional memory. In Proceedings of the 11th International Symposium on High-
Performance Computer Architecture (HPCA’05). San Francisco, California, 316–327.
Anderson, J. H. and Moir, M. 1995. Universal constructions for multi-object operations. In
Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing
(PODC ’95). 184–193.
Anderson, J. H., Ramamurthy, S., and Jain, R. 1997. Implementing wait-free objects on
priority-based systems. In Proceedings of the 16th Annual ACM Symposium on Principles of
Distributed Computing (PODC ’97). 229–238.
Barnes, G. 1993. A method for implementing lock-free data structures. In Proceedings of the
5th Annual ACM Symposium on Parallel Algorithms and Architectures. 261–270.
Fich, F., Luchangco, V., Moir, M., and Shavit, N. 2005. Obstruction-free algorithms can be
practically wait-free. In Distributed algorithms, P. Fraigniaud, Ed. Lecture Notes In Computer
Science, vol. 3724. 78–92.
Fraser, K. 2003. Practical lock freedom. Ph.D. thesis, Computer Laboratory, University of
Cambridge. Also available as Technical Report UCAM-CL-TR-639, Cambridge University.
Greenwald, M. 1999. Non-blocking synchronization and system design. Ph.D. thesis, Stanford
University. Also available as Technical Report STAN-CS-TR-99-1624, Stanford University,
Computer Science Department.
Greenwald, M. 2002. Two-handed emulation: How to build non-blocking implementations of
complex data structures using DCAS. In Proceedings of the 21st Annual ACM Symposium on
Principles of Distributed Computing (PODC ’02). 260–269.
Hammond, L., Carlstrom, B. D., Wong, V., Hertzberg, B., Chen, M., Kozyrakis, C., and
Olukotun, K. 2004. Programming with transactional coherence and consistency (TCC). In
ASPLOS-XI: Proceedings of the 11th international conference on Architectural support for
programming languages and operating systems. ACM Press, 1–13.
Hanke, S. 1999. The performance of concurrent red-black tree algorithms. In Proceedings of
the 3rd Workshop on Algorithm Engineering. Lecture Notes in Computer Science, vol. 1668.
Springer-Verlag, 287–301.
Hanke, S., Ottmann, T., and Soisalon-Soininen, E. 1997. Relaxed balanced red-black trees.
In Proceedings of the 3rd Italian Conference on Algorithms and Complexity. Lecture Notes in
Computer Science, vol. 1203. Springer-Verlag, 193–204.
Harris, T. 2001. A pragmatic implementation of non-blocking linked lists. In Proceedings of the
15th International Symposium on Distributed Computing (DISC ’01). Springer-Verlag, 300–
314.
Harris, T. 2004. Exceptions and side-effects in atomic blocks. In Proceedings of the 2004 PODC
Workshop on Synchronization in Java Programs. 46–53. Proceedings published as Memorial
University of Newfoundland CS Technical Report 2004-01.
Harris, T. and Fraser, K. 2003. Language support for lightweight transactions. In Proceedings
of the 18th Annual ACM-SIGPLAN Conference on Object-Oriented Programming, Systems,
Languages & Applications (OOPSLA ’03). 388–402.
Harris, T. and Fraser, K. 2005. Revocable locks for non-blocking programming. In PPoPP
’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel
programming. ACM Press, New York, NY, USA, 72–82.
Harris, T., Marlow, S., Peyton-Jones, S., and Herlihy, M. 2005. Composable memory
transactions. In PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles
and practice of parallel programming. ACM Press, New York, NY, USA, 48–60.
Heller, S., Herlihy, M., Luchangco, V., Moir, M., Scherer, B., and Shavit, N. 2005. A
lazy concurrent list-based set algorithm. In 9th International Conference on Principles of
Distributed Systems (OPODIS).
Hennessy, J. L. and Patterson, D. A. 2003. Computer Architecture – A Quantitative Approach,
3rd ed. Morgan Kaufmann Publishers, San Francisco, CA, USA.
Herlihy, M. 1993. A methodology for implementing highly concurrent data objects. ACM
Transactions on Programming Languages and Systems 15, 5 (Nov.), 745–770.
Herlihy, M. 2005. SXM1.1: Software transactional memory package for C#. Tech. rep., Brown
University & Microsoft Research. May.
Herlihy, M., Luchangco, V., Martin, P., and Moir, M. 2005. Nonblocking memory manage-
ment support for dynamic-sized data structures. ACM Trans. Comput. Syst. 23, 2, 146–196.
Herlihy, M., Luchangco, V., and Moir, M. 2003. Obstruction-free synchronization: Double-
ended queues as an example. In Proceedings of the 23rd IEEE International Conference on
Distributed Computing Systems (ICDCS). IEEE, 522–529.
Herlihy, M., Luchangco, V., Moir, M., and Scherer, W. 2003. Software transactional memory
for dynamic-sized data structures. In Proceedings of the 22nd Annual ACM Symposium on
Principles of Distributed Computing (PODC ’03). 92–101.
Herlihy, M. and Moss, J. E. B. 1993. Transactional memory: Architectural support for lock-
free data structures. In Proceedings of the 20th Annual International Symposium on Computer
Architecture (ISCA ’93). ACM Press, 289–301.
Herlihy, M. and Wing, J. M. 1990. Linearizability: A correctness condition for concurrent
objects. ACM Transactions on Programming Languages and Systems 12, 3 (July), 463–492.
Hoare, C. A. R. 1972. Towards a theory of parallel programming. In Operating Systems Tech-
niques. A.P.I.C. Studies in Data Processing, vol. 9. Academic Press, 61–71.
Israeli, A. and Rappoport, L. 1994. Disjoint-access-parallel implementations of strong shared
memory primitives. In Proceedings of the 13th Annual ACM Symposium on Principles of
Distributed Computing (PODC ’94). 151–160.
Jayanti, P. and Petrovic, S. 2003. Efficient and practical constructions of LL/SC variables.
In Proceedings of the twenty-second annual symposium on Principles of distributed computing.
ACM Press, 285–294.
Jones, R. and Lins, R. 1996. Garbage Collection: Algorithms for Automatic Dynamic Memory
Management. John Wiley and Sons.
Kung, H. T. and Lehman, P. L. 1980. Concurrent manipulation of binary search trees. ACM
Transactions on Database Systems 5, 3 (Sept.), 354–382.
Kung, H. T. and Robinson, J. T. 1981. On optimistic methods for concurrency control. ACM
Trans. Database Syst. 6, 2, 213–226.
Marathe, V. J., Scherer III, W. N., and Scott, M. L. 2004. Design tradeoffs in modern
software transactional memory systems. In Proceedings of the 7th Workshop on Languages,
Compilers, and Run-time Support for Scalable Systems.
McDonald, A., Chung, J., Chafi, H., Cao Minh, C., Carlstrom, B. D., Hammond, L.,
Kozyrakis, C., and Olukotun, K. 2005. Characterization of TCC on chip-multiprocessors.
In Proceedings of the 14th International Conference on Parallel Architectures and Compilation
Techniques.
Mellor-Crummey, J. and Scott, M. 1991a. Algorithms for scalable synchronization on shared-
memory multiprocessors. ACM Transactions on Computer Systems 9, 1, 21–65.
Mellor-Crummey, J. and Scott, M. 1991b. Scalable reader-writer synchronization for shared-
memory multiprocessors. In Proceedings of the 3rd ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming. 106–113.
Michael, M. M. 2002. Safe memory reclamation for dynamic lock-free objects using atomic reads
and writes. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed
Computing (PODC ’02).
Michael, M. M. and Scott, M. 1995. Correction of a memory management method for lock-free
data structures. Tech. Rep. TR599, University of Rochester, Computer Science Department.
Dec.
Moir, M. 1997. Transparent support for wait-free transactions. In Distributed Algorithms, 11th
International Workshop. Lecture Notes in Computer Science, vol. 1320. Springer-Verlag, 305–
319.
Moir, M. 2002. Personal communication.
Moore, K. E., Hill, M. D., and Wood, D. A. 2005. Thread-level transactional memory. Tech-
nical Report: CS-TR-2005-1524, Dept. of Computer Sciences, University of Wisconsin, 1–11.
Motorola. 1985. MC68020 32-Bit Microprocessor User’s Manual.
Pugh, W. 1990. Concurrent maintenance of skip lists. Technical Report CS-TR-2222, Department
of Computer Science, University of Maryland. June.
Rajwar, R. and Goodman, J. R. 2002. Transactional lock-free execution of lock-based programs.
ACM SIGPLAN Notices 37, 10 (Oct.), 5–17.
Rajwar, R., Herlihy, M., and Lai, K. 2005. Virtualizing transactional memory. In Proceedings of
the 32nd Annual International Symposium on Computer Architecture. IEEE Computer Society,
494–505.
Ringenburg, M. F. and Grossman, D. 2005. AtomCaml: first-class atomicity via rollback. In
ICFP ’05: Proceedings of the tenth ACM SIGPLAN international conference on Functional
programming. ACM Press, New York, NY, USA, 92–104.
Scherer III, W. N. and Scott, M. L. 2005. Advanced contention management for dynamic
software transactional memory. In PODC ’05: Proceedings of the twenty-fourth annual ACM
SIGACT-SIGOPS symposium on Principles of distributed computing. ACM Press, New York,
NY, USA, 240–248.
Shalev, O. and Shavit, N. 2003. Split-ordered lists: Lock-free extensible hash tables. In Pro-
ceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing (PODC
’03). 102–111.
Shavit, N. and Touitou, D. 1995. Software transactional memory. In Proceedings of the 14th
Annual ACM Symposium on Principles of Distributed Computing (PODC ’95). 204–213.
Sundell, H. and Tsigas, P. 2004. Scalable and lock-free concurrent dictionaries. In SAC ’04:
Proceedings of the 2004 ACM symposium on Applied computing. ACM Press, New York, NY,
USA, 1438–1445.
Turek, J., Shasha, D., and Prakash, S. 1992. Locking without blocking: Making lock-based
concurrent data structure algorithms nonblocking. In Proceedings of the 11th ACM Symposium
on Principles of Database Systems. 212–222.
Welc, A., Jagannathan, S., and Hosking, T. 2004. Transactional monitors for concurrent ob-
jects. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP).
519–542.
Wing, J. M. and Gong, C. 1993. Testing and verifying concurrent objects. Journal of Parallel
and Distributed Computing 17, 1 (Jan.), 164–182.