Lock-Free and Practical Deques Using Single-Word Compare-And-Swap

Technical Report in Computing Science 2004-02
Chalmers University of Technology and Göteborg University
Göteborg, 2004
therefore do not suffer from the problems that blocking could generate. Lock-free implementations are non-blocking and guarantee that, regardless of the contention caused by concurrent operations and the interleaving of their sub-operations, always at least one operation will progress. However, there is a risk of starvation, as the progress of some operations could cause some other operations to never finish. Wait-free [8] algorithms are lock-free and moreover avoid starvation as well, as all operations are guaranteed to finish in a limited number of their own steps.

Footnote (continued): ends, and does not allow concurrent invocations of the push operation and a pop operation on the opposite end.
Footnote 2: A CAS2 operation can atomically read-and-possibly-update the contents of two non-adjacent memory words. This operation is also sometimes called DCAS in the literature.
Footnote 3: The standard CAS operation can atomically read-and-possibly-update the contents of a single memory word.
Footnote 4: It is available on the Intel IA-32, but not on the Sparc or MIPS microprocessor architectures.
Footnote 5: There is a general and formal definition called disjoint-access-parallel by Israeli and Rappoport [11].
[Figure 1 (schematic): processors 1..n, each with its local memory, connected through an interconnection network to the shared memory.]

Figure 1. Shared Memory Multiprocessor System Structure

We have performed experiments that compare the performance of our algorithm with two of the most efficient algorithms of lock-free deques known; [13] and [12], the latter implemented using results from [3] and [6]. Experiments were performed on three different multiprocessor systems equipped with either 2, 4 or 29 processors. All of the systems use different operating systems. Our results show that the CAS-based algorithms outperform the CAS2-based implementations (see Footnote 6) for any number of threads, and for the system with full concurrency and non-uniform memory architecture our algorithm performs significantly better than the algorithm in [13].

Footnote 6: The CAS2 operation was implemented in software, using either mutual exclusion or the results from [6], which presented a software CASn (CAS for n non-adjacent words) implementation.

function TAS(value:pointer to word):boolean
atomic do
  if *value = 0 then
    *value := 1;
    return true;
  else return false;

procedure FAA(address:pointer to word, number:integer)
atomic do
  *address := *address + number;

function CAS(address:pointer to word, oldvalue:word, newvalue:word):boolean
atomic do
  if *address = oldvalue then
    *address := newvalue;
    return true;
  else return false;

Figure 2. The Test-And-Set (TAS), Fetch-And-Add (FAA) and Compare-And-Swap (CAS) atomic primitives.
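On most current platforms the primitives of Figure 2 are exposed through the C11 <stdatomic.h> interface (or equivalent compiler builtins). The following is a minimal sketch of such a mapping, not the code used in the experiments; the function names and types are ours:

/* Hedged sketch: Figure 2 primitives expressed with C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* TAS: atomically set a word to 1 and report whether it was 0 before. */
static bool tas(atomic_int *value) {
    int expected = 0;
    return atomic_compare_exchange_strong(value, &expected, 1);
}

/* FAA: atomically add 'number' to the word at 'address'. */
static void faa(atomic_int *address, int number) {
    atomic_fetch_add(address, number);
}

/* CAS: atomically replace *address with newvalue if it still equals oldvalue. */
static bool cas(atomic_uintptr_t *address, uintptr_t oldvalue, uintptr_t newvalue) {
    return atomic_compare_exchange_strong(address, &oldvalue, newvalue);
}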
2 System Description

A typical abstraction of a shared memory multiprocessor system configuration is depicted in Figure 1. Each node of the system contains a processor together with its local memory. All nodes are connected to the shared memory via an interconnection network. A set of co-operating tasks is running on the system, performing their respective operations. Each task is sequentially executed on one of the processors, while each processor can serve (run) many tasks at a time. The co-operating tasks, possibly running on different processors, use shared data objects built in the shared memory to co-ordinate and communicate. Tasks synchronize their operations on the shared data objects through sub-operations on top of a cache-coherent shared memory. The shared memory, though, may not be uniformly accessible for all nodes in the system; processors can have different access times on different parts of the memory.

3 Algorithm

The algorithm is based on a doubly-linked list data structure. To use the data structure as a deque, every node contains a value. The fields of each node item are described in Figure 4 as used in this implementation.

In order to make the deque construction concurrent and non-blocking, we use three of the standard atomic synchronization primitives: Test-And-Set (TAS), Fetch-And-Add (FAA) and Compare-And-Swap (CAS). Figure 2 gives the specification of these primitives, which are available on most modern platforms.

To insert or delete a node from the list we have to change the respective set of prev and next pointers. These have to be changed consistently, but not necessarily all at once. Our solution is to treat the doubly-linked list as being a singly-linked list with auxiliary information in the prev pointers, with the next pointers being updated before the prev pointers. Thus, the next pointers always form a consistent singly-linked list, but the prev pointers only give hints for where to find the previous node. This is possible because of the observation that a "late" non-updated prev pointer will always point to a node that is directly or some steps previous of the current node, and from that "hint" position it is always possible to traverse (see Footnote 7) through the next pointers to reach the directly previous node.

Footnote 7: As will be shown later, we have defined the deque data structure in a way that makes it possible to traverse even through deleted nodes, as long as they are referenced in some way.

One problem, which is general for non-blocking implementations based on the singly-linked list structure, arises when inserting a new node into the list. Because of the linked-list structure one has to make sure that the previous node is not about to be deleted. If we change the next pointer of this previous node atomically with CAS, to point to the new node, and immediately afterwards the previous node is deleted - then the new node will be deleted as well, as illustrated in Figure 3. There are several solutions to this problem. One solution is to use the CAS2 operation, as it can change two pointers atomically, but this operation is not available in any modern multiprocessor system. A second solution is to insert auxiliary nodes [17] between every two normal nodes, and the latest method, introduced by Harris [7], is to use a deletion mark. This deletion mark is updated atomically together with the next pointer. Any concurrent insert operation will then be notified about the possibly set deletion mark, when its CAS operation fails on updating the next pointer of the to-be-previous node. For our doubly-linked list we need to be informed also when inserting using the prev pointer. In order to be able to atomically update both the prev and the next pointer together with the deletion mark, all of these have to be put together in one memory word. For a 32-bit word this means a maximum of 32 768 (or 2 147 483 648 for a 64-bit word) possibly addressable nodes using the prev or next pointers. However, as will be shown later in Section 3.4, the algorithm can easily be extended to handle dynamic maximum sizes, thus making this limit obsolete.
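For concreteness, one way such a link word can be realized in C is sketched below: two 15-bit node indices and the deletion mark packed into a single 32-bit word, so that prev, next and the mark can be read and updated together with one single-word CAS. This only illustrates the layout constraint discussed above; the field widths and helper names are our own assumptions, not the paper's definitions.

/* Illustrative layout: bit 30 = deletion mark, bits 29..15 = prev index,
   bits 14..0 = next index. With 15-bit indices at most 32 768 nodes can
   be addressed, matching the limit mentioned in the text. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t link_word;

#define LINK_DEL  (1u << 30)
#define IDX_BITS  15u
#define IDX_MASK  ((1u << IDX_BITS) - 1)

static link_word link_make(uint32_t prev, uint32_t next, bool del) {
    return (del ? LINK_DEL : 0) | ((prev & IDX_MASK) << IDX_BITS) | (next & IDX_MASK);
}
static uint32_t link_prev(link_word w) { return (w >> IDX_BITS) & IDX_MASK; }
static uint32_t link_next(link_word w) { return w & IDX_MASK; }
static bool     link_del (link_word w) { return (w & LINK_DEL) != 0; }

/* Changing the next field (or the mark) together with the rest of the
   word using a single-word CAS: */
static bool link_cas(_Atomic link_word *addr, link_word old, link_word new_) {
    return atomic_compare_exchange_strong(addr, &old, new_);
}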
3.1 Memory Management

As we are concurrently (with possible preemptions) traversing nodes that will be continuously allocated and reclaimed, we have to consider several aspects of memory management. No node should be reclaimed and then later re-allocated while some other process is (or will be) traversing that node. This can be solved for example by careful reference counting. We have selected the lock-free memory management scheme invented by Valois [17] and corrected by Michael and Scott [14], which makes use of the FAA and CAS atomic synchronization primitives. Using this scheme we can assure that a node can only be reclaimed when there is no prev or next pointer in the list that points to it. One problem with this scheme is that it cannot handle cyclic garbage (i.e. two or more nodes that should be recycled but reference each other, and therefore each node keeps a positive reference count, although they are not referenced by the main structure). Our solution is to make sure to break potential cyclic references directly before a node is possibly recycled.

Another memory management issue is how to de-reference pointers safely. If we simply de-reference the pointer, it might be that the corresponding node has been reclaimed before we could access it. It can also be that the deletion mark that is connected to the prev or next pointer was set, thus marking that the node is deleted. The following functions are defined for safe handling of the memory management:

function MALLOC_NODE(): pointer to Node
function READ_PREV(address:pointer to Link): pointer to Node
function READ_NEXT(address:pointer to Link): pointer to Node
function READ_PREV_DEL(address:pointer to Link): pointer to Node
function READ_NEXT_DEL(address:pointer to Link): pointer to Node
function COPY_NODE(node:pointer to Node): pointer to Node
procedure RELEASE_NODE(node:pointer to Node)

The function MALLOC_NODE allocates a new node from the memory pool of pre-allocated nodes. The function RELEASE_NODE decrements the reference counter of the given node. If the reference count reaches zero, the function then calls the ReleaseReferences function, which recursively calls RELEASE_NODE on the nodes that this node owned pointers to, and then reclaims the node. The function COPY_NODE increases the reference counter of the given node. READ_PREV, READ_PREV_DEL, READ_NEXT and READ_NEXT_DEL atomically de-reference the given link in the corresponding direction and increase the reference counter of the corresponding node. In case the deletion mark of the link is set, the functions READ_PREV and READ_NEXT return NULL.
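A rough sketch of how a safe de-reference in the style of READ_NEXT can be built from FAA and re-validation under the Valois/Michael-Scott scheme is given below. It is a simplification for illustration only (it reuses the index-based link helpers from the earlier sketch; the structure fields and names are assumptions, and the real scheme also has to interact with the free list and the claim handling of [17, 14]):

/* Simplified sketch (our own, not the paper's code): claim a reference on the
   target of a next link, then re-read the link to be sure it still refers to
   that node and is not marked as deleted. Nodes live in a pre-allocated pool,
   so touching the reference counter of a reclaimed node is memory-safe and
   the re-validation catches any change. */
#include <stdatomic.h>

typedef struct Node {
    _Atomic link_word link;   /* packed <prev, next, d>, as in the sketch above */
    atomic_int        refcnt; /* reference counter used by the memory manager */
    void             *value;
} Node;

extern Node node_pool[1u << IDX_BITS];  /* pre-allocated pool, indexed by link fields */
void release_node(Node *node);          /* decrements refcnt, reclaims at zero */

static Node *read_next(_Atomic link_word *address) {
    for (;;) {
        link_word w = atomic_load(address);
        if (link_del(w))
            return NULL;                      /* link is marked as deleted */
        Node *node = &node_pool[link_next(w)];
        atomic_fetch_add(&node->refcnt, 1);   /* FAA: claim a reference */
        if (atomic_load(address) == w)        /* still the same unmarked link? */
            return node;
        release_node(node);                   /* lost the race; undo and retry */
    }
}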
3.2 Pushing and Popping Nodes

The PushLeft operation, see Figure 4, first repeatedly tries in lines L4-L15 to insert the new node (node) between the head node (prev) and the leftmost node (next).

union Link
    _: word
    ⟨prev, next, d⟩: ⟨pointer to Node, pointer to Node, boolean⟩

structure Node
    value: pointer to word
    link: union Link

// Global variables
head, tail: pointer to Node
// Local variables
node, prev, prev2, next, next2: pointer to Node
link1, link2, lastlink: union Link

function CreateNode(value:pointer to word):pointer to Node
C1   node:=MALLOC_NODE();
C2   node.value:=value;
C3   return node;

procedure PushCommon(node: pointer to Node, next: pointer to Node)
P1   while true do
P2     link1:=next.link;
P3     link2:=⟨node,link1.next,false⟩;
P4     if link1.d = true or node.link.d = true
P5         or node.link.next ≠ next then
P6       break;
P7     if CAS(&next.link,link1,link2) then
P8       COPY_NODE(node);
P9       RELEASE_NODE(link1.prev);
P10      if node.link.d = true then
P11        prev2:=COPY_NODE(node);
P12        prev2:=HelpInsert(prev2,next);
P13        RELEASE_NODE(prev2);
P14      break;
P15    Back-Off
P16  RELEASE_NODE(next);
P17  RELEASE_NODE(node);
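The PushCommon procedure above follows the retry pattern used throughout the algorithm: read the shared link, validate it, attempt a single CAS, and back off on failure before retrying. A generic, hedged sketch of that pattern in C is shown below; the bounded exponential back-off and all names here are illustrative, not taken from the paper:

/* Generic shape of the retry-loops (e.g. P1-P15): read a shared word, check
   the conditions that would make the update pointless, try one CAS, and back
   off before retrying on contention. Purely illustrative. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sched.h>

static void backoff(unsigned *delay) {
    for (unsigned i = 0; i < *delay; i++)
        sched_yield();                 /* or a pause/no-op spin */
    if (*delay < 1024) *delay *= 2;    /* bounded exponential back-off */
}

static bool update_link(_Atomic uint32_t *link, uint32_t desired,
                        bool (*still_valid)(uint32_t observed)) {
    unsigned delay = 1;
    for (;;) {
        uint32_t observed = atomic_load(link);
        if (!still_valid(observed))
            return false;              /* e.g. deletion mark set: give up */
        if (atomic_compare_exchange_strong(link, &observed, desired))
            return true;               /* our CAS won; the loop exits */
        backoff(&delay);               /* lost the race; retry after back-off */
    }
}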
Definition 4 The value v is present (∃i.Q[i] = v) in the abstract internal state Q of our implementation, when there is a connected chain of next pointers (i.e. prev.link.next) from a present node (or the head node) in the doubly linked list that connects to a node that contains the value v, and this node is not marked as deleted (i.e. node.link.d = false).

Definition 5 The decision point of an operation is defined as the atomic statement where the result of the operation is finitely decided, i.e. independent of the result of any sub-operations after the decision point, the operation will have the same result. We define the state-read point of an operation to be the atomic statement where a sub-state of the deque is read, and this sub-state is the state on which the decision point depends. We also define the state-change point as the atomic statement where the operation changes the abstract internal state of the deque after it has passed the corresponding decision point.

We will now use these points in order to show the existence and location in execution history of a point where the concurrent operation can be viewed as if it occurred atomically, i.e. the linearizability point.

Lemma 1 A PushRight operation (R(v)) takes effect atomically at one statement.

Proof: The decision, state-read and state-change point for a PushRight operation which succeeds (R(v)) is when the CAS sub-operation in line R11 (see Figure 4) succeeds. The state of the deque was (Qt1 = Q1) directly before the passing of the decision point. The prev node was the very last present node, as it pointed (verified by R6 and the CAS in R11) to the tail node directly before the passing of the decision point. The state of the deque directly after passing the decision point will be Qt2 = Q1 + [v], as the next pointer of the prev node then points to the new node containing the value v. Consequently, the linearizability point will be the CAS sub-operation in line R11. ✷

Lemma 3 A PopRight operation which fails (PR() = ⊥) takes effect atomically at one statement.

Proof: The decision point for a PopRight operation which fails (PR() = ⊥) is the check in line PR9. Passing of the decision point together with the verification in line PR5 gives that the next pointer of the head node must have been pointing to the tail node (Qt1 = ∅) directly before the read sub-operation of the link field in line PR4, i.e. the state-read point. Consequently, the linearizability point will be the read sub-operation in line PR4. ✷

Lemma 4 A PopRight operation which succeeds (PR() = v) takes effect atomically at one statement.

Proof: The decision point for a PopRight operation which succeeds (PR() = v) is when the CAS sub-operation in line PR15 succeeds. Passing of the decision point together with the verification in line PR5 gives that the next pointer of the to-be-deleted node must have been pointing to the tail node (Qt1 = Q1 + [v]) directly before the CAS sub-operation in line PR15, i.e. the state-read point. Directly after passing the CAS sub-operation (i.e. the state-change point) the to-be-deleted node will be marked as deleted and therefore not present in the deque (Qt2 = Q1). Consequently, the linearizability point will be the CAS sub-operation in line PR15. ✷

Lemma 5 A PopLeft operation which fails (PL() = ⊥) takes effect atomically at one statement.

Proof: The decision point for a PopLeft operation which fails (PL() = ⊥) is the check in line PL4. Passing of the decision point gives that the next pointer of the head node must have been pointing to the tail node (Qt1 = ∅) directly before the read sub-operation of the link field in line PL3, i.e. the state-read point. Consequently, the linearizability point will be the read sub-operation in line PL3. ✷
Lemma 6 A PopLeft operation which succeeds (PL() = v) takes effect atomically at one statement.

Proof: The decision point for a PopLeft operation which succeeds (PL() = v) is when the CAS sub-operation in line PL15 succeeds. Passing of the decision point together with the verification in line PL9 gives that the next pointer of the head node must have been pointing to the present to-be-deleted node (Qt1 = [v] + Q1) directly before the read sub-operation in line PL3, i.e. the state-read point. Directly after passing the CAS sub-operation in line PL15 (i.e. the state-change point) the to-be-deleted node will be marked as deleted and therefore not present in the deque (¬∃i.Qt2[i] = v). Unfortunately this does not match the semantic definition of the operation.

However, none of the other concurrent operations' linearizability points is dependent on the to-be-deleted node's state as marked or not marked during the time interval from the state-read to the state-change point. Clearly, the linearizability points of Lemmas 1 and 2 are independent, as the to-be-deleted node would be part (or not part, if not present) of the corresponding Q1 terms. The linearizability points of Lemmas 3 and 5 are independent, as those linearizability points depend on the head node's next pointer pointing to the tail node or not. Finally, the linearizability points of Lemma 4 as well as of this lemma are independent, as the to-be-deleted node would be part (or not part, if not present) of the corresponding Q1 terms; otherwise the CAS sub-operation in line PL15 of this operation would have failed.

Therefore, all together, we can safely interpret the to-be-deleted node to be not present already directly after passing the state-read point (Qt2 = Q1). Consequently, the linearizability point will be the read sub-operation in line PL3. ✷

Lemma 7 When the deque is idle (i.e. no operations are being performed), all next pointers of present nodes are matched with a correct prev pointer from the corresponding present node (i.e. all linked nodes from the head or tail node are present in the deque).

Proof: We have to show that each operation takes responsibility for that the affected prev pointer will finally be correct after changing the corresponding next pointer. After successfully changing the next pointer in the PushLeft (PushRight) operation in line L12 (R11), the corresponding prev pointer is repeatedly tried to be changed in line P7 until i) it succeeds, ii) either the next or this node is deleted as detected in line P4, or iii) a new node is inserted as detected in line P5. If a new node is inserted, the corresponding PushLeft (PushRight) operation will make sure that the prev pointer is corrected. If either the next or this node is deleted, the corresponding PopLeft (PopRight) operation will make sure that the prev pointer is corrected. If the prev pointer was successfully changed, it is possible that this node was deleted before we changed the prev pointer of the next node. If this is detected in line P10, then the prev pointer of the next node is corrected by the HelpInsert function.

After successfully marking the to-be-deleted nodes in line PL15 (PR15), the PopLeft (PopRight) functions will make sure that the connecting next pointer of the prev node will be changed to point to the closest present node to the right, by calling the DeleteNext procedure in line PL16 (PR16). They will also make sure that the corresponding prev pointer of the next node will be corrected, by calling the HelpInsert function in line PL17 (PR17).

The DeleteNext procedure will repeatedly try to change the next pointer of the prev node that points to the deleted node, until it either succeeds changing the next pointer in line DN28 or some concurrent DeleteNext has already succeeded, as detected in line DN5.

The HelpInsert procedure will repeatedly try to change the prev pointer of the node to match with the next pointer of the prev node, until it either succeeds changing the prev pointer in line HI23 or the node is deleted as detected in line HI13. If it succeeded with changing the prev pointer, the prev node might have been deleted directly before changing the prev pointer, and therefore it is detected whether the prev node is marked in line HI26, and then the procedure will continue trying to correctly change the prev pointer. ✷

Lemma 8 When the deque is idle, all previously deleted nodes are garbage collected.

Proof: We have to show that each PopRight or PopLeft operation takes responsibility for that the deleted node will finally have no references to it. The possible references are caused by other nodes pointing to it. Following Lemma 7, we know that no present nodes will reference the deleted node. It remains to show that all paths of references from a deleted node will finally reference a present node, i.e. there is no cyclic referencing. After the node is deleted in lines PL16 and PL17 (PR16 and PR17), the PopLeft (PopRight) operation assures, by calling the RemoveCrossReference procedure in line PL25 (PR25), that both the next and prev pointers are pointing to a present node. If any of those present nodes are deleted before the referencing deleted node is garbage collected, the RemoveCrossReference procedures called by the corresponding PopLeft or PopRight operation will assure that the next and prev pointers of the previously present node will point to present nodes, and so on recursively. The RemoveCrossReference procedure repeatedly tries to change prev pointers to point to the previous node of the referenced node until the referenced node is present, detected in line RC4 and possibly changed in line RC6. The next pointer is correspondingly detected in line RC10 and possibly changed in line RC12. ✷
Lemma 9 The path of prev pointers from a node is always pointing to a present node that is left of the current node.

Proof: We will look at all possibilities where the prev pointer is set or changed. The setting in line L10 (R9) is clearly to the left, as it is verified by L6 and L12 (R5 and R11). The change of the prev pointer in line P7 is to the left, as verified by P5 and by the fact that nodes are never moved relatively to each other. The change of the prev pointer in line HI23 is to the left, as verified by lines HI3 and HI16. Finally, the change of the prev pointer in line RC6 is to the left, as it is changed to the prev pointer of the previous node. ✷

Lemma 10 All operations will terminate if exposed to a limited number of concurrent changes to the deque.

Proof: The amount of changes an operation could experience is limited. Because of the reference counting, none of the nodes which are referenced by local variables can be garbage collected. When traversing through prev or next pointers, the memory management guarantees atomicity of the operations, thus no newly inserted or deleted nodes will be missed. We also know that the relative positions of nodes that are referenced by local variables will not change, as nodes are never moved in the deque. Most loops in the operations retry because a change in the state of some node(s) was detected in the ending CAS sub-operation, and then retry by re-reading the local variables (and possibly correcting the state of the nodes) until no concurrent changes were detected by the CAS sub-operation and therefore the CAS succeeded and the loop terminated. Those loops will clearly terminate after a limited number of concurrent changes. Included in that type of loops are L4-L15, R4-R14, P1-P15, PL2-PL24 and PR2-PR24.

The loop DN4-DN32 will terminate if either the prev node is equal to the next node in line DN5 or the CAS sub-operation in line DN28 succeeds. We know from the start of the execution of the loop that the prev node is left of the to-be-deleted node, which in turn is left of the next node. Following from Lemma 9, this order will hold by traversing the prev node through its prev pointer and traversing the next node through its next pointer. Consequently, traversing the prev node through the next pointer will finally cause the prev node to be directly left of the to-be-deleted node if this is not already deleted (and the CAS sub-operation in line DN28 will finally succeed), otherwise the prev node will finally be directly left of the next node (and in the next step the equality in line DN5 will hold). As long as the prev node is marked it will be traversed to the left in line DN16, and if it is the left-most marked node the prev node will be deleted by recursively calling DeleteNext in line DN14. If the prev node is not marked it will be traversed to the right. As there is a limited number of changes and thus a limited number of marked nodes left of the to-be-deleted node, the prev node will finally traverse to the right and either of the termination criteria will be fulfilled.

The loop HI2-HI28 will terminate if either the to-be-corrected node is marked in line HI13 or if the CAS sub-operation in line HI23 succeeds and the prev node is not marked. We know from the start of the execution of the loop that the prev node is left of the to-be-corrected node. Following from Lemma 9, this order will hold by traversing the prev node through its prev pointer. Consequently, traversing the prev node through the next pointer will finally cause the prev node to be directly left of the to-be-corrected node if this is not deleted (and the CAS sub-operation in line HI23 will finally succeed), otherwise line HI13 will succeed. As long as the prev node is marked it will be traversed to the left in line HI8, and if it is the left-most marked node the prev node will be deleted by calling DeleteNext in line HI6. If the prev node is not marked it will be traversed to the right. As there is a limited number of changes and thus a limited number of marked nodes left of the to-be-corrected node, the prev node will finally traverse to the right and either of the termination criteria will be fulfilled.

The loop RC1-RC15 will terminate if both the prev node and the next node of the to-be-deleted node are not marked in line RC4 respectively line RC10. We know from the start of the execution of the loop that the prev node is left of the to-be-deleted node and the next node is right of the to-be-deleted node. Following from Lemma 9, traversing the prev node through the next pointer will finally reach a not marked node or the head node (which is not marked), and traversing the next node through the next pointer will finally reach a not marked node or the tail node (which is not marked), and both of the termination criteria will be fulfilled. ✷
Lemma 11 With respect to the retries caused by synchronization, one operation will always make progress regardless of the actions by the other concurrent operations.

Proof: We now examine the possible execution paths of our implementation. There are several potentially unbounded loops that can delay the termination of the operations. We call these loops retry-loops. If we omit the conditions that are due to the semantics of the operations (i.e. searching for the correct criteria etc.), a loop retries when sub-operations detect that a shared variable has changed value. This is detected either by a subsequent read sub-operation or a failed CAS. These shared variables are only changed concurrently by other CAS sub-operations. According to the definition of CAS, for any number of concurrent CAS sub-operations, exactly one will succeed. This means that for any subsequent retry, there must be one CAS that succeeded. As this succeeding CAS will cause its retry-loop to exit, and our implementation does not contain any cyclic dependencies between retry-loops that exit with CAS, this means that the corresponding PushRight, PushLeft, PopRight or PopLeft operation will progress. Consequently, independent of any number of concurrent operations, one operation will always progress. ✷

Theorem 1 The algorithm implements a correct, memory stable, lock-free and linearizable deque.

Proof: Following from Lemmas 1, 2, 3, 4, 5 and 6 and by using the respective linearizability points, we can create an identical (with the same semantics) sequential execution that preserves the partial order of the operations in a concurrent execution. Following from Definition 3, the implementation is therefore linearizable.

Lemmas 10 and 11 give that our implementation is lock-free.

Following from Lemmas 10, 1, 2, 3, 4, 5 and 6 we can conclude that all operations will terminate with the correct result.

Following from Lemma 8 we know that the maximum memory usage will be proportional to the number of present values in the deque. ✷

5 Experimental Evaluation

In our experiments, each concurrent thread performed 1000 randomly chosen sequential operations on a shared deque, with a distribution of 1/4 PushRight, 1/4 PushLeft, 1/4 PopRight and 1/4 PopLeft operations. Each experiment was repeated 50 times, and an average execution time for each experiment was estimated. Exactly the same sequential operations were performed for all different implementations compared. Besides our implementation, we also performed the same experiment with the lock-free implementation by Michael [13] and the implementation by Martin et al. [12], two of the most efficient lock-free deques that have been proposed. The algorithm by Martin et al. [12] was implemented together with the corresponding memory management scheme by Detlefs et al. [3]. However, as both [12] and [3] use the atomic operation CAS2, which is not available in any modern system, the CAS2 operation was implemented in software using two different approaches. The first approach was to implement CAS2 using mutual exclusion (as proposed in [12]), which should match the optimistic performance of an imaginary CAS2 implementation in hardware. The other approach was to implement CAS2 using one of the most efficient software implementations of CASN known that could meet the needs of [12] and [3], i.e. the implementation by Harris et al. [6].

A clean-cache operation was performed just before each sub-experiment using a different implementation. All implementations are written in C and compiled with the highest optimization level. The atomic primitives are written in assembler.

The experiments were performed using different numbers of threads, varying from 1 to 28 in increasing steps. Three different platforms were used, with varying numbers of processors and levels of shared memory distribution. To get a highly pre-emptive environment, we performed our experiments on a Compaq dual-processor Pentium II PC running Linux, and on a Sun Ultra 80 system running Solaris 2.7 with 4 processors. In order to evaluate our algorithm with full concurrency we also used an SGI Origin 2000 system running Irix 6.5 with 29 MIPS R10000 processors at 250 MHz. The results from the experiments are shown in Figure 6. The average execution time is drawn as a function of the number of threads.

Our results show that both of the CAS-based algorithms outperform the CAS2-based implementations for any number of threads. For the systems with low or medium concurrency and uniform memory architecture, [13] has the best performance. However, for the system with full concurrency and non-uniform memory architecture our algorithm performs significantly better than [13] from 2 threads and upwards, as a direct consequence of the disjoint-parallel accessible nature of our algorithm.

6 Conclusions

We have presented the first lock-free algorithmic implementation of a concurrent deque that has all the following features: i) it is disjoint-parallel accessible with retained parallelism, ii) it uses a fully described lock-free memory management scheme, and iii) it uses atomic primitives which are available in modern computer systems, even when extended for dynamic maximum sizes.

We have performed experiments that compare the performance of our algorithm with two of the most efficient algorithms for lock-free deques known, using full implementations of those algorithms. The experiments show that our implementation performs significantly better on systems with high concurrency and non-uniform memory architecture.

We believe that our implementation is of highly practical interest for multi-processor applications. We are currently incorporating it into the NOBLE [16] library.
[Figure 6: six plot panels, "Deque with High Contention", showing execution time (ms) versus number of threads (0-30) for NEW ALGORITHM, MICHAEL, HAT-TRICK MUTEX and HAT-TRICK CASN on an SGI MIPS system with 29 processors, a SUN Solaris system with 4 processors, and a Linux system with 2 processors; linear scales to the left, logarithmic scales to the right.]

Figure 6. Experiment with deques and high contention. Logarithmic scales to the right.
References

[1] O. Agesen, D. Detlefs, C. H. Flood, A. Garthwaite, P. Martin, N. Shavit, and G. L. Steele Jr., "DCAS-based concurrent deques," in ACM Symposium on Parallel Algorithms and Architectures, 2000, pp. 137-146.

[2] N. S. Arora, R. D. Blumofe, and C. G. Plaxton, "Thread scheduling for multiprogrammed multiprocessors," in ACM Symposium on Parallel Algorithms and Architectures, 1998, pp. 119-129.

[3] D. Detlefs, P. Martin, M. Moir, and G. Steele Jr., "Lock-free reference counting," in Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing, Aug. 2001.

[4] D. Detlefs, C. H. Flood, A. Garthwaite, P. Martin, N. Shavit, and G. L. Steele Jr., "Even better DCAS-based concurrent deques," in International Symposium on Distributed Computing, 2000, pp. 59-73.

[5] M. Greenwald, "Non-blocking synchronization and system design," Ph.D. dissertation, Stanford University, Palo Alto, CA, 1999.

[6] T. Harris, K. Fraser, and I. Pratt, "A practical multi-word compare-and-swap operation," in Proceedings of the 16th International Symposium on Distributed Computing, 2002.

[7] T. L. Harris, "A pragmatic implementation of non-blocking linked lists," in Proceedings of the 15th International Symposium on Distributed Computing, Oct. 2001.

[8] M. Herlihy, "Wait-free synchronization," ACM Transactions on Programming Languages and Systems, vol. 11, no. 1, pp. 124-149, Jan. 1991.

[9] M. Herlihy, V. Luchangco, and M. Moir, "Obstruction-free synchronization: Double-ended queues as an example," in Proceedings of the 23rd International Conference on Distributed Computing Systems, 2003.

[10] M. Herlihy and J. Wing, "Linearizability: a correctness condition for concurrent objects," ACM Transactions on Programming Languages and Systems, vol. 12, no. 3, pp. 463-492, 1990.

[11] A. Israeli and L. Rappoport, "Disjoint-access-parallel implementations of strong shared memory primitives," in Proceedings of the 13th Annual ACM Symposium on Principles of Distributed Computing, Aug. 1994, pp. 151-160.

[12] P. Martin, M. Moir, and G. Steele, "DCAS-based concurrent deques supporting bulk allocation," Sun Microsystems, Tech. Rep. TR-2002-111, 2002.

[13] M. M. Michael, "CAS-based lock-free algorithm for shared deques," in Proceedings of the 9th International Euro-Par Conference, ser. Lecture Notes in Computer Science. Springer Verlag, Aug. 2003.

[14] M. M. Michael and M. L. Scott, "Correction of a memory management method for lock-free data structures," Computer Science Department, University of Rochester, Tech. Rep., 1995.

[15] A. Silberschatz and P. Galvin, Operating System Concepts. Addison Wesley, 1994.

[16] H. Sundell and P. Tsigas, "NOBLE: A non-blocking inter-process communication library," in Proceedings of the 6th Workshop on Languages, Compilers and Run-time Systems for Scalable Computers, ser. Lecture Notes in Computer Science. Springer Verlag, 2002.

[17] J. D. Valois, "Lock-free data structures," Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, New York, 1995.

A The algorithm for dynamic maximum sizes

This section shows the full details of how to extend the algorithm for dynamic maximum sizes, following the guidelines presented earlier in Section 3.4.

As the link structure now can contain full pointer values, see Figure 7, the following functions are added for safe handling of the memory management:

function READ_NODE(address:pointer to Link): pointer to Node
function READ_DEL_NODE(address:pointer to Link): pointer to Node

The functions READ_NODE and READ_DEL_NODE atomically de-reference the given link and increase the reference counter of the corresponding node. In case the deletion mark of the link is set, the function READ_NODE returns NULL.

The remaining details of the extended algorithm are shown in Figures 7 and 8.
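In the dynamic-size version, a link word holds a full node pointer together with the deletion mark. A common way to represent such a ⟨p, d⟩ pair in C, sketched below purely for illustration (it is not taken from the paper's code), is to store the mark in the least significant bit of a suitably aligned pointer:

/* Illustrative <p, d> encoding: nodes are at least 2-byte aligned, so the
   lowest pointer bit is free and can carry the deletion mark. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t link_t;            /* the whole link fits in one CAS-able word */

#define DEL_MASK ((uintptr_t)1)

static link_t  link_make(void *p, bool d) { return (uintptr_t)p | (d ? DEL_MASK : 0); }
static void   *link_ptr (link_t w)        { return (void *)(w & ~DEL_MASK); }
static bool    link_del (link_t w)        { return (w & DEL_MASK) != 0; }

/* Example: atomically set the deletion mark of a next link, in the spirit of
   line PL13 of PopLeft in the extended algorithm listed below. */
static bool try_mark(_Atomic link_t *next, link_t observed) {
    if (link_del(observed))
        return false;                             /* already marked */
    return atomic_compare_exchange_strong(next, &observed,
                                          link_make(link_ptr(observed), true));
}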
union Link
    _: word
    ⟨p, d⟩: ⟨pointer to Node, boolean⟩

structure Node
    value: pointer to word
    prev: union Link
    next: union Link

// Global variables
head, tail: pointer to Node
// Local variables
node, prev, prev2, next, next2: pointer to Node
link1, lastlink: union Link

function CreateNode(value: pointer to word):pointer to Node
C1   node:=MALLOC_NODE();
C2   node.value:=value;
C3   return node;

procedure ReleaseReferences(node: pointer to Node)
RR1  RELEASE_NODE(node.prev.p);
RR2  RELEASE_NODE(node.next.p);

procedure PushLeft(value: pointer to word)
L1   node:=CreateNode(value);
L2   prev:=COPY_NODE(head);
L3   next:=READ_NODE(&prev.next);
L4   while true do
L5     if prev.next ≠ ⟨next,false⟩ then
L6       RELEASE_NODE(next);
L7       next:=READ_NODE(&prev.next);
L8       continue;
L9     node.prev:=⟨prev,false⟩;
L10    node.next:=⟨next,false⟩;
L11    if CAS(&prev.next,⟨next,false⟩,⟨node,false⟩) then
L12      COPY_NODE(node);
L13      break;
L14    Back-Off
L15  PushCommon(node,next);

procedure PushRight(value: pointer to word)
R1   node:=CreateNode(value);
R2   next:=COPY_NODE(tail);
R3   prev:=READ_NODE(&next.prev);
R4   while true do
R5     if prev.next ≠ ⟨next,false⟩ then
R6       prev:=HelpInsert(prev,next);
R7       continue;
R8     node.prev:=⟨prev,false⟩;
R9     node.next:=⟨next,false⟩;
R10    if CAS(&prev.next,⟨next,false⟩,⟨node,false⟩) then
R11      COPY_NODE(node);
R12      break;
R13    Back-Off
R14  PushCommon(node,next);

procedure PushCommon(node, next: pointer to Node)
P1   while true do
P2     link1:=next.prev;
P3     if link1.d = true or node.next ≠ ⟨next,false⟩ then
P4       break;
P5     if CAS(&next.prev,link1,⟨node,false⟩) then
P6       COPY_NODE(node);
P7       RELEASE_NODE(link1.p);
P8       if node.prev.d = true then
P9         prev2:=COPY_NODE(node);
P10        prev2:=HelpInsert(prev2,next);
P11        RELEASE_NODE(prev2);
P12      break;
P13    Back-Off
P14  RELEASE_NODE(next);
P15  RELEASE_NODE(node);

function PopLeft(): pointer to word
PL1  prev:=COPY_NODE(head);
PL2  while true do
PL3    node:=READ_NODE(&prev.next);
PL4    if node = tail then
PL5      RELEASE_NODE(node);
PL6      RELEASE_NODE(prev);
PL7      return ⊥;
PL8    link1:=node.next;
PL9    if link1.d = true then
PL10     DeleteNext(node);
PL11     RELEASE_NODE(node);
PL12     continue;
PL13   if CAS(&node.next,link1,⟨link1.p,true⟩) then
PL14     DeleteNext(node);
PL15     next:=READ_DEL_NODE(&node.next);
PL16     prev:=HelpInsert(prev,next);
PL17     RELEASE_NODE(prev);
PL18     RELEASE_NODE(next);
PL19     value:=node.value;
PL20     break;
PL21   RELEASE_NODE(node);
PL22   Back-Off
PL23 RemoveCrossReference(node);
PL24 RELEASE_NODE(node);
PL25 return value;

function PopRight(): pointer to word
PR1  next:=COPY_NODE(tail);
PR2  node:=READ_NODE(&next.prev);
PR3  while true do
PR4    if node.next ≠ ⟨next,false⟩ then
PR5      node:=HelpInsert(node,next);
PR6      continue;
PR7    if node = head then
PR8      RELEASE_NODE(node);
PR9      RELEASE_NODE(next);
PR10     return ⊥;