The Distributed Computing Column: C Onvergent and Commutative Replicated D Ata Types
The Distributed Computing Column: C Onvergent and Commutative Replicated D Ata Types
The Distributed Computing Column: C Onvergent and Commutative Replicated D Ata Types
CORE
by Metadata, citation and similar papers at core.ac.uk
Provided by Universidade do Minho: RepositoriUM
Panagiota Fatourou
Department of Computer Science, University of Crete
P.O. Box 2208 GR-714 09 Heraklion, Crete, Greece
and
Institute of Computer Science (ICS)
Foundation for Research and Technology (FORTH)
N. Plastira 100. Vassilika Vouton
GR-700 13 Heraklion, Crete, Greece
[email protected]
∗
This research was supported in part by ANR project ConcoRDanT (ANR-10-BLAN 0208),
and a Google Research Award 2009. Marek Zawirski is a recipient of the Google Europe Fellow-
ship in Distributed Computing, and this research is supported in part by this Google Fellowship.
Carlos Baquero is partially supported by FCT project Castor (PTDC/EIA-EIA/104022/2008).
†
INRIA & LIP6, Paris, France
‡
CITI, Universidade Nova de Lisboa, Portugal
§
Universidade do Minho, Portugal
¶
INRIA & UPMC, Paris, France
Abstract
1 Introduction
Strong consistency protocols serialise updates in a global total order [5, 15]. This
constitutes a performance and scalability bottleneck. Furthermore, strong consis-
tency conflicts with availability and partition-tolerance [9].
When network delays are large or partitioning is an issue, as in delay-tolerant
networks, disconnected operation, cloud computing, or P2P systems, eventual
consistency has better availability and performance [23, 29]. An update happens
at a replica, without synchronisation; then, it is sent to the other replicas. All
updates eventually take effect at all replicas, asynchronously and possibly in dif-
ferent orders. Conflicts are resolved by a background consensus algorithm [3, 28].
This weaker consistency is considered acceptable for some classes of applications.
However, conflict resolution is hard; there is little guidance on designing a correct
optimistic system, and ad-hoc approaches are brittle and error-prone.1
We propose a simple, theoretically-sound approach to eventual consistency:
to leverage simple mathematical properties that ensure absence of conflict: the
values of state-based objects are monotonic in a semilattice, and the concurrent
updates of operation-based objects commute. A trivial example is a replicated
counter, which converges because its increment and decrement operations com-
mute (assuming no overflow). Data types designed this way are called conver-
gent or commutative replicated data types (CRDTs). CRDT updates do not re-
quire synchronisation, and its replicas provably converge to a common state that
1
Consider for example the anomalies of the Amazon Shopping Cart [7].
68
tpb2
source
f(x1)
x merge
S
x1 M
g(xx2)
S
x2
merge merge
x3 M M
is always enabled, this implies eventual effect. For op-based objects, we must
prove that the downstream precondition is eventually enabled at every replica. We
give hereafter sufficient conditions for convergence, and must prove that the object
satisfies such conditions.
71
BEATCS no 104 THE EATCS COLUMNS
A type with these properties will be called a Convergent Replicated Data Type
or CvRDT. In a CvRDT, we require that compare(x, y) return x ≤v y, that x ≤v
y ∧ y ≤v x ⇒ x ≡ y, and that merge be always enabled.
Proposition 2.1. Two CvRDT replicas eventually converge, assuming the commu-
nication subsystem delivers payload infinitely often between them.
We refer to a companion technical report for the proof, which basically for-
malises the above discussion [27].
The communication subsystem of CvRDTs may have very weak properties.
Since merge is idempotent and commutative, messages may be lost, received out
of order, or multiple times, as long as new state eventually reaches all replicas,
either directly or indirectly.
In op-based specifications (e.g., Figure 12), the payload, initial and query clauses
have the same meaning as in the state-based case. The at-source phase is marked
atSource. Its (optional) source pre-condition, marked pre, must be true in the
source state. It executes atomically. It is not allowed to make side effects, but it
may send additional arguments downstream. The downstream phase executes at
some replica only if and when its downstream precondition is true: immediately
at the source, and after the update is delivered, at all other replicas. It updates the
downstream state atomically; thus the update takes effect.
Definition 2.4 (Causal History (op-based)). The causal history of replica xi is
defined as follows: (a) Initially, C(xi ) = ∅. (b) Atomically with executing the
downstream phase of f at xi , C(xi ) := C(xi ) ∪ { f }.
Again, happened-before is defined by f → g ⇔ (∀i : g ∈ C(xi ) ⇒ f ∈ C(xi )).
Operations are concurrent if not ordered by happened-before; formally: f g ⇔
f → g ∧ g → f .
Definition 2.5 (Commutativity). Updates f and g commute, iff for any reachable
replica state S where their downstream pre-condition is enabled, the downstream
precondition of f (resp. g) remains enabled in state S ·g (resp. S · f ), and S · f ·g ≡
S · g · f.
Causal delivery (defined as follows: at any replica, if f → g then f is deliv-
ered at any replica before g is delivered) is sufficient to ensure that the downstream
precondition is true, for all objects in this paper, and operations take effect in that
order. Thus, two operations that are causally related execute their downstream
72
The Bulletin of the EATCS
phase in the same order at all replicas, and the final state is the same. Operations
that are not related are concurrent; if they commute, the final states are equivalent.
Thus, a sufficient condition for convergence of an op-based object is that all
its concurrent operations commute. An object satisfying this condition is called a
Commutative Replicated Data Type (CmRDT).
Proposition 2.2. Assuming a communication subsystem that reliably delivers up-
dates in causal order, replicas of a CmRDT converge.
Recall that reliable causal delivery does not require agreement. It is immune
to partitioning, in the sense that replicas in a connected subset can deliver each
other’s updates, and that updates are eventually delivered to all replicas.
3 Example CRDTs
We now present a number of example CRDT designs: Registers, Sets, and Graphs.
We refer the reader to a technical report for further examples, e.g., Counters,
Maps, Monotonic DAGs, and Sequences [27].
Our specifications are written with clarity in mind, not efficiency. In many
cases, there are clearly more efficient ways, but we preferred the more easily-
understood version.
We write either state- or op-based specifications, as convenient. Proofs that
objects fulfill the convergence conditions is generally trivial for the types here-
after.
3.1 Registers
A register is a memory cell storing an opaque atom or object (noted type X here-
after). It supports assign to update its value, and value to query it. Non-concurrent
assigns preserve sequential semantics: the later one overwrites the earlier one. To
make concurrent updates commute, two approaches are possible: either one takes
precedence over the other (LWW-Register), or both are retained (MV-Register).
73
x1≔(1,3) x1≔(1,3)
x1= (0,0) S M
x2≔(2,1)
x2= (0,0) S
x3≔(3,2)
x3 = (0,0) S M M
x3≔(3,2) x3≔(1,3)
75
3.1.2 Multi-Value Register (MV-Register)
x1≔{1} x1≔{1,2}
{0[0,0]}
{1[1,0]} {1[2,0], 2[2,0]}
{0 [0,0]} M M
3.2 Sets
We now present clean specifications of Sets. Sets constitute one of the most basic
data structures. Containers, Maps, Graphs and Sequences are all based on Sets.
We consider mutating operations to add or remove an element. Unfortunately,
the underlying union and set-minus do not commute with each other. Therefore,
3
By symmetry with value, assign takes a set of values.
{} add(a) {aα} rmv (a) {} add((aβ) {aβ}
S S D
add(a)
{}
S
a Set-like CRDT can only approximate the intuitive sequential specification. Fig-
ure 10 illustrates the issue with a naïve set implementation. Two replicas concur-
rently add and remove the same element, but the result depends on the order of
delivery.
We now examine a few Set variants, which differ mainly in the result of con-
current add(e) with remove(e). The 2P-Set gives precedence to remove, OR-Set
to add.
3.2.1 2P-Set
The simplest approach is the Add-Only-Set (G-Set), which avoids the problematic
remove altogether [27, Section 3.3.1]. G-Set is useful as a building block for more
complex constructions.
In a Two-Phase Set (2P-Set), an element may be added, then removed, but not
added again, as specified in Figure 8. It combines a G-Set for adding with another
for removing; the latter is colloquially known as the tombstone set.
payload set S -- Unique + causal delivery ⇒ no tombstones
initial ∅
query lookup (element e) : boolean b
let b = (e ∈ S )
update add (element e)
atSource (e)
pre e is unique
downstream (e)
S := S ∪ {e}
update remove (element e)
atSource (e)
pre lookup(e) -- 2P-Set precondition
downstream (e)
pre add(e) has been delivered -- Causal order suffices
S := S \ {e}
78
The Bulletin of the EATCS
When a client calls remove(e), the set of unique tags associated with e at the
source is recorded. All such pairs are removed from the downstream payload.
Thus, when remove(e) happens-after any number of add(e), all the corresponding
pairs are removed, and the element is not in the set any more, as expected intu-
itively. When add(e) is concurrent with remove(e), the add takes precedence, as
the unique tag generated by add cannot be observed by remove.
This behaviour is illustrated in Figure 11, noting α, β, . . . the unique tags.
The remove(a) called at the top replica translates to removing (a, α) downstream.
79
BEATCS no 104 THE EATCS COLUMNS
The add called at the second replica is concurrent to the remove of the first one,
therefore (a, β) remains in the final state.
3.3 Graphs
A graph is a pair of sets (V, E) (called vertices and edges respectively) such that
E ⊆ V × V. Any of the Set implementations described above can be used for V
and E.
Because of the invariant E ⊆ V × V, operations on vertices and edges are
not independent. At source, an edge may be added only if the corresponding
vertices exist; conversely, a vertex may be removed only if it supports no edge.
The specification in Figure 14 uses one 2P-Set for vertices and another for edges.
The dependencies between them are resolved by causal delivery. Even if ver-
tices are unique, we do not use U-Set because tombstones are needed to guard
addEdge against concurrent removeVertex. In case of a concurrent addEdge and
removeVertex, the effect of removeVertex takes precedence, as an edge only exists
if their vertices have not been removed (as defined in edge lookup).
payload set S -- triplets (isbn k, integer n, unique-tag u), . . .
initial ∅
query get (isbn k) : integer n
let N = {n |(k , n , u ) ∈ S ∧ k = k}
if N = ∅ then
let n = 0
else
let n = max(N)
update add (isbn k, integer n)
atSource (k, n)
pre n > 0
let u = unique()
let R = {(k , n , u ) ∈ S |k = k}
downstream (R, k, n, u)
pre ∀(k , n , u ) ∈ R :
add(k , n , u) has been delivered
-- OR-Set remove precondition
S := (S \ R) ∪ {(k, n, u)}
-- Replace elements observed at source
update remove (isbn k)
atSource (k)
let R = {(k , n , u ) ∈ S |k = k}
downstream (R)
pre ∀(k, n, u) ∈ R : add(k, n, u) has been delivered
-- OR-Set precondition
S := S \ R -- Remove elements observed at source
80
The Bulletin of the EATCS
4 Garbage collection
Some CRDTs tend to become less efficient over time, as tombstones accumu-
late and internal data structures become unbalanced [16, 19]. Garbage collection
(GC) alleviates these problems; it may require synchronisation, but its liveness
is not essential. We investigate two classes of GC mechanisms, with different
synchronisation requirements.
An update f sometimes adds information r( f ) to the payload in order to deal
cleanly with concurrent operations, e.g. in Graph, remove leaves a tombstone to
handle concurrent addBetweens. Our first class of GC discards such r( f ) when it
does not serve any useful purpose any more:
Definition 4.1 (Stability). Update f is stable at replica xi (noted Φi ( f )) if all
updates concurrent to f have taken effect at xi . Formally, Φi ( f ) ⇔ ∀ j : f ∈
C(x j ) ∧ (g ∈ C(x j ) \ C(xi ) : f g).
Liveness of Φ requires that the set of replicas be known and that they do
not crash permanently (undetectably). Under these assumptions, the algorithm of
Wuu and Bernstein [32] can be adapted to detect stability of f and thus discard
r( f ). We note that this information is generally available when using a reliable
broadcast channel.4 Importantly, GC based on Φ can be performed in the back-
ground, so its liveness is not critical for correctness.
A second class of GC problems resets the payload across all replicas. An
example is removing tombstones from a 2P-Set (thus allowing to re-add deleted
elements again), removing entries from a version vector, or rebalancing a repli-
cated tree [16]. This requires a commitment protocol. To alleviate the strong
4
Note furthermore that such a channel already does GC internally, often making a CmRDT
simpler than the corresponding CvRDT.
81
BEATCS no 104 THE EATCS COLUMNS
A shopping cart maps a book number (ISBN) to the quantity that the user wants.
Any of the Set CRDTs presented earlier extends readily to a Map; we choose to
extend OR-Set (Section 3.2.2). This design is simple, and does not have the cost
of the version vectors needed by Dynamo’s MV-Register.
Figure 15 presents an op-based OR-Cart. The payload is a set of triplets
(key, value, unique-identifier). Operation remove discards all existing mappings
for the given ISBN: the source records the triplets associated with that key, to
be removed, downstream, from the payload. Operation add overwrites by first
discarding existing mappings as above, then inserting a unique triplet. As in OR-
Set, causal delivery is sufficient to satisfy the downstream precondition.
We now show informally that concurrent updates commute. Two removes
commute, as the downstream set-minus operations are either independent or idem-
potent. The triplets created by concurrent adds cannot be in the removal set of the
other, and (similarly to remove), their downstream set-minuses commute. Oper-
ation add is independent from, or idempotent with, a concurrent remove, as the
triplet added by the former is disjoint from the triplets removed by the latter.
The bookstore maps user accounts to OR-Carts, using a U-Map (derived from
U-Set in the obvious way). A shopping cart is added when the account is first
created, and removed when it is deleted.
82
The Bulletin of the EATCS
When the user chooses book b, the user interface calls add(b, 1) against some
replica. To change the quantity to q > 0, it calls add(b, q). If the user cancels the
book, or brings the quantity to zero, the interface calls remove(b).
Non-concurrent updates have the expected semantics, i.e., later ones take
precedence. Even though the user interface may address updates to different repli-
cas (which may be out of sync with one another [7]), concurrent updates have
clear, understandable semantics, i.e., it is the largest value that is chosen.
Although the concept itself was identified only recently, previous CRDT designs
have been published. Johnson and Thomas invented the LWW-Register [13].
They propose a database of registers that can be created, updated and deleted,
using the LWW rule to arbitrate between concurrent assignments and removes
(i.e., a removed element can be recreated). LWW ensures a total order of oper-
ations, but it is an arbitrary extension of happened-before, so, inherently, some
updates are lost.
Wuu and Bernstein [32] describe Dictionary and Log CRDTs. Their Dictio-
nary is a Map CmRDT, similar to our U-Set. Their Log serves as a reliable broad-
cast channel for Dictionary. They study how to propagate the log effectively; to
limit log growth, they propose the algorithm to detect when an entry is stable and
can be collected, used in Section 4.
Concurrent editing has been the focus of CRDT and related research. WOOT
is a Graph CRDT designed for collaborative editing [18]. The same authors de-
signed the Logoot Sequence CRDT that supports an undo mechanism based on a
CRDT Counter [31]. Preguiça and Shapiro propose Treedoc, a Sequence CRDT
83
BEATCS no 104 THE EATCS COLUMNS
for concurrent editing [19]. They later identified the GC issue, and studied how to
move it into the background [16].
The CRDT concept was invented by Shapiro and Preguiça [26]. Other work has
used similar ideas.
Ellis and Gibbs’ [8] Operational Transformation (OT) studies op-based Se-
quences for shared editing. To ensure responsiveness, a local operation executes
immediately. Operations are not designed to commute; however, a replica re-
ceiving an update transforms it against previously-executed concurrent updates to
achieve a similar result. Many OT algorithms have been proposed; Oster et al.
show that most OT algorithms for a decentralised architecture are incorrect [17].
We believe that designing for commutativity from the start is both cleaner and
simpler.
The foundations of CvRDTs were introduced by Baquero and Moura [1, 2].
We extend their work with a specification language, by considering CmRDTs, by
studying more complex examples, and by considering GC.
Roh et al. [21, 22] independently developed the Replicated Abstract Data
Type concept, which is quite similar to CRDT. They generalise LWW to a partial
order of updates, which they leverage to build several LWW-style classes; we
allow any LUB merge function.
84
The Bulletin of the EATCS
7 Conclusion
We presented the concept of a CRDT, a replicated data type for which some simple
mathematical properties guarantee eventual consistency. In the state-based style,
the successive states of an object should form a monotonic semilattice, and merge
should compute a least upper bound. In the op-based style, concurrent operations
should commute. Assuming only that the communication subsystem eventually
delivers, both styles of CRDTs are guaranteed to converge towards a common,
correct state, without requiring any synchronisation.
We specified a number of interesting data types, in a high-level specification
language based on simple logic. In particular, we focused on Set types with clean
semantics for add and remove operations; Maps, Graphs, and Sequences can be
built above Sets. Our bookstore example shows how CRDTs might be used prac-
tically.
Eventual consistency is a critical technique in many large-scale distributed
systems, including delay-tolerant networks, sensor networks, peer-to-peer net-
works, collaborative computing, cloud computing, and so on. However, work
on eventual consistency was mostly ad-hoc so far. Although some of our CRDTs
were known before in the literature or in the folklore, this is the first work to en-
gage in a systematic study. We believe this is required if eventual consistency is
to gain a solid theoretical and practical foundation.
Future work is both theoretical and practical. On the theory side, this will
include understanding the class of computations that can be accomplished by
CRDTs, the complexity classes of CRDTs, the classes of invariants that can be
supported by a CRDT, the relations between CRDTs and concepts such as self-
stabilisation and aggregation, and so on. On the practical side, we plan to imple-
ment the data types specified herein as a library, to use them in practical applica-
tions, and to evaluate their performance analytically and experimentally. Another
direction is to support support infrequent, non-critical synchronous operations,
such as committing a state or performing a global reset. We will also look into
stronger global invariants, possibly using probabilistic or heuristic techniques.
References
[1] Carlos Baquero and Francisco Moura. Specification of convergent abstract
data types for autonomous mobile computing. Technical report, Departa-
mento de Informática, Universidade do Minho, October 1997.
85
BEATCS no 104 THE EATCS COLUMNS
[2] Carlos Baquero and Francisco Moura. Using structural characteristics for
autonomous operation. Operating Systems Review, 33(4):90–96, 1999.
[4] Nikolaj Bjørner. Models and software model checking of a distributed file
replication system. In Formal Methods and Hybrid Real-Time Systems,
pages 1–23, 2007.
[5] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest
failure detector for solving consensus. Journal of the ACM, 43(4):685–722,
1996.
[9] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility
of consistent, available, partition-tolerant web services. SIGACT News,
33(2):51–59, 2002.
[10] Jim Gray, Pat Helland, Patrick O’Neil, and Dennis Shasha. The dangers of
replication and a solution. In Int. Conf. on the Mgt. of Data (SIGMOD),
pages 173–182, Montréal, Canada, June 1996. ACM SIGMOD, ACM Press.
[11] Pat Helland and David Campbell. Building on quicksand. In Biennial Conf.
on Innovative DataSystems Research (CIDR), Asilomar, Pacific Grove CA,
USA, June 2009.
86
The Bulletin of the EATCS
87
BEATCS no 104 THE EATCS COLUMNS
[23] Yasushi Saito and Marc Shapiro. Optimistic replication. ACM Computing
Surveys, 37(1):42–81, March 2005.
[25] Marco Serafini, Dan Dobre, Matthias Majuntke, Péter Bokor, and Neeraj
Suri. Eventually linearizable shared objects. In Symp. on Principles of Dist.
Comp. (PODC), pages 95–104, Zürich, Switzerland, 2010. Assoc. for Comp.
Machinery.
[26] Marc Shapiro and Nuno Preguiça. Designing a commutative replicated data
type. Rapport de recherche RR-6320, Institut Nat. de la Recherche en Infor-
matique et Automatique (INRIA), Rocquencourt, France, October 2007.
[27] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. A com-
prehensive study of Convergent and Commutative Replicated Data Types.
Rapport de recherche 7506, Institut Nat. de la Recherche en Informatique et
Automatique (INRIA), Rocquencourt, France, January 2011.
[31] Stephane Weiss, Pascal Urso, and Pascal Molli. Logoot-undo: Distributed
collaborative editing system on P2P networks. IEEE Trans. on Parallel and
Dist. Sys. (TPDS), 21:1162–1174, 2010.
[32] Gene T. J. Wuu and Arthur J. Bernstein. Efficient solutions to the replicated
log and dictionary problems. In Symp. on Principles of Dist. Comp. (PODC),
pages 233–242, Vancouver, BC, Canada, August 1984.
88