Introduction To Fault Tolerant

Download as pdf or txt
Download as pdf or txt
You are on page 1of 114

Model Checking of Fault-Tolerant Distributed Algorithms

Part I: Fault-Tolerant Distributed Algorithms


Annu Gmeiner Igor Konnov Ulrich Schmid
Helmut Veith Josef Widder

TMPA 2014, Kostroma, Russia

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

1 / 52

114

Figure 7.1: DARTS prototype board, comprising 8 interconnected HITS chips

7.1 Assessing and validating the standard node HITS design

Distributed Systems

Figure 7.1: DARTS prototype board, comprising 8 interconnected HITS chips

114

7.1 Assessing and validating the standard node HITS design

Distributed Systems

Are they always working?


Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

2 / 52

No. . . some failing systems


Therac-25 (1985)
radiation therapy machine
gave massive overdoses, e.g., due to race conditions of concurrent tasks

Quantas Airbus in-flight Learmonth upset (2008)


1 out of 3 replicated components failed
computer initiated dangerous altitude drop

Ariane 501 maiden flight (1996)


primary/backup, i.e., 2 replicated computers
both run into the same integer overflow

Netflix outages due to Amazons cloud (ongoing)


one is not sure what is going on there
hundreds of computers involved
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

3 / 52

Why do they fail?

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

4 / 52

Why do they fail?

faults at design/implementation phase

faults at runtime
outside of control of designer/developer
e.g., to the right: crack in a diode in the
data link interface of the Space Shuttle
led to erroneous messages being sent

Driscoll (Honeywell)
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

4 / 52

Why do they fail?

faults at design/implementation phase


approach:
find and fix faults before operation
model checking

faults at runtime
outside of control of designer/developer
e.g., to the right: crack in a diode in the
data link interface of the Space Shuttle
led to erroneous messages being sent

Driscoll (Honeywell)
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

4 / 52

Why do they fail?

faults at design/implementation phase


approach:
find and fix faults before operation
model checking

faults at runtime
outside of control of designer/developer
e.g., to the right: crack in a diode in the
data link interface of the Space Shuttle
led to erroneous messages being sent

approach:
keep system operational despite faults
fault-tolerant distributed algorithms
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

Driscoll (Honeywell)
TMPA14, Nov. 2014

4 / 52

Bringing both together

Goal: automatically verified fault-tolerant distributed algorithms

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

5 / 52

Bringing both together

Goal: automatically verified fault-tolerant distributed algorithms

model checking FTDAs is a research challenge:


computers run independently at different speeds
exchange messages with uncertain delays
faults
parameterization

. . . fault-tolerance makes model checking harder

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

5 / 52

Lecture overview

Part I: Fault-tolerant distributed algorithms


introduction to distributed algorithms
details of our case study algorithm
motivation why model checking is cool

Part II: Modeling fault-tolerant distributed algorithms


model checking challenges in distributed algorithms
Promela, control flow automata, etc.
model checking of small instances with Spin

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

6 / 52

Lecture overview cont.


We will have to skip due to time limitations:
Part III: Introduction into Parameterized model checking
the general problem statement
well-known undecidability results
Igor Konnov:
Part IV: Parameterized model checking of FTDAs by abstraction
parametric interval abstraction (PIA)
PIA data and counter abstraction
counterexample-guided abstraction refinement (CEGAR)
Part V: Parameterized model checking of FTDAs by BMC
bounding the diameter
bounded model checking as a complete method
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

7 / 52

Part I: Fault-Tolerant Distributed


Algorithms

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

8 / 52

Distributed Systems are everywhere


What they allow to do
share resources
communicate
increase performance
speed
fault tolerance

Difference to centralized systems


independent activities (concurrency)
components do not have access to the global state (only local view)
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

9 / 52

Application areas
buzzwords from the 60ies
operating systems
(distributed) data base systems
communication networks
multiprocessor architectures
control systems
New buzzwords
cloud computing
social networks
multi core
cyber-physical systems
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

10 / 52

Major challenge
Uncertainty
computers run independently at different speeds
exchange messages with (unknown) delays
faults

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

11 / 52

Major challenge
Uncertainty
computers run independently at different speeds
exchange messages with (unknown) delays
faults
challenge in design of distributed algorithms
a process has access only to its local state
but one wants to achieve some global property

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

11 / 52

Major challenge
Uncertainty
computers run independently at different speeds
exchange messages with (unknown) delays
faults
challenge in design of distributed algorithms
a process has access only to its local state
but one wants to achieve some global property
challenge in proving them correct
large degree of non-determinism
large execution and state space
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

11 / 52

From dependability to a distributed system

Process P provides a service. We want to access it reliably


but P may fail

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

12 / 52

From dependability to a distributed system

replication
P

P1

P2
P3

Process P provides a service. We want to access it reliably


but P may fail
canonical approach: replication, i.e., several copies of P
Due to non-determinism, the behavior of the copies might deviate
(e.g. in a replicated database, transactions are committed in different
orders at different sites)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

12 / 52

From dependability to a distributed system

replication
P

P1

P2

consistency

P3

P
P

Process P provides a service. We want to access it reliably


but P may fail
canonical approach: replication, i.e., several copies of P
Due to non-determinism, the behavior of the copies might deviate
(e.g. in a replicated database, transactions are committed in different
orders at different sites)
we have to enforce that the copies behave as one.
Consistency in a distributed system: what does it mean to behave
as one.
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

12 / 52

Replication distributed systems

replication
P

P1

P2
P3

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

13 / 52

Distributed message passing system


multiple distributed processes pi

pi

receive

send

internal

dots represent states


a step of a process can be
a send step (a process sends messages to other processes)
a receive step (a process receives a subset of messages sent to it)
an internal step (a local computation)

steps are the atomic (indivisible) units of computations


Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

14 / 52

Types of Distributed Algorithms:


Synchronous vs. Asynchronous
Synchronous
all processes move in lock-step
rounds
a message sent in a round is received in the same round
idealized view
impossible or expensive to implement
Asynchronous
only one process moves at a time
arbitrary interleavings of steps
a message sent is received eventually

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

15 / 52

Types of Distributed Algorithms:


Synchronous vs. Asynchronous
Synchronous
all processes move in lock-step
rounds
a message sent in a round is received in the same round
idealized view
impossible or expensive to implement
Asynchronous
only one process moves at a time
arbitrary interleavings of steps
a message sent is received eventually
important problems not solvable (Fischer et al., 1985)!

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

15 / 52

Types of Distributed Algorithms:


Synchronous vs. Asynchronous
Synchronous
all processes move in lock-step
rounds
a message sent in a round is received in the same round
idealized view
impossible or expensive to implement
Asynchronous
only one process moves at a time
arbitrary interleavings of steps
a message sent is received eventually
important problems not solvable (Fischer et al., 1985)!
We focus on asynchronous algorithms here. . .
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

15 / 52

Asynchronous system
has very mild restrictions on the environment
interleaving semantics
unbounded message delays
very little can be done. . .
there is no distributed algorithm that solves consensus in the presence
of one faulty process
(as we will see, consensus is the paradigm of consistency)
folklore explanation:
you cannot distinguish a slow process from a crashed one
real explanation:
see intricate proof by Fischer, Lynch, and Paterson (JACM 1985)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

16 / 52

Where we stand

replication
P

P1

P2
P3

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

17 / 52

What we still need. . .

P1

P2

consistency

P3

Josef Widder (www.forsyte.at)

P
P

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

18 / 52

What we still need. . .

P1

P2

consistency

P3

P
P

consistency requirements have been formalized under several names,


e.g.,
consensus
atomic broadcast
Byzantine Generals problem
Byzantine agreement
atomic commitment

definitions are similar but may have subtle differences

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

18 / 52

What we still need. . .

P1

P2

consistency

P3

P
P

consistency requirements have been formalized under several names,


e.g.,
consensus
atomic broadcast
Byzantine Generals problem
Byzantine agreement
atomic commitment

definitions are similar but may have subtle differences


We use the famous Byzantine Generals to introduce this problem
domain. . .
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

18 / 52

Fault tolerance The Byzantine generals problem

Wiktionary:
Byzantine: adj. of a devious, usually stealthy manner, of practice.

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

19 / 52

Fault tolerance The Byzantine generals problem


Lamport (this years Turing laureate), Shostak, and Pease wrote in their
Dijkstra Prize in Distributed Computing winning paper (Lamport et al.,
1982):
[. . .] several divisions of the Byzantine army are camped outside
an enemy city, each division commanded by its own general. [. . .]
However, some of the generals may be traitors [. . .]
if the divisions of loyal generals attack together, the city falls
if only some loyal generals attack, their armies fall
generals communicate by obedient messengers

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

20 / 52

Fault tolerance The Byzantine generals problem


Lamport (this years Turing laureate), Shostak, and Pease wrote in their
Dijkstra Prize in Distributed Computing winning paper (Lamport et al.,
1982):
[. . .] several divisions of the Byzantine army are camped outside
an enemy city, each division commanded by its own general. [. . .]
However, some of the generals may be traitors [. . .]
if the divisions of loyal generals attack together, the city falls
if only some loyal generals attack, their armies fall
generals communicate by obedient messengers
The Byzantine generals problem:
the loyal generals have to agree on whether to attack.
if all want to attack they must attack, if no-one wants to attack they
must not attack

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

20 / 52

Fault tolerance The Byzantine generals problem


Lamport (this years Turing laureate), Shostak, and Pease wrote in their
Dijkstra Prize in Distributed Computing winning paper (Lamport et al.,
1982):
[. . .] several divisions of the Byzantine army are camped outside
an enemy city, each division commanded by its own general. [. . .]
However, some of the generals may be traitors [. . .]
if the divisions of loyal generals attack together, the city falls
if only some loyal generals attack, their armies fall
generals communicate by obedient messengers
The Byzantine generals problem:
the loyal generals have to agree on whether to attack.
if all want to attack they must attack, if no-one wants to attack they
must not attack
metaphor for a distributed system where correct processes (loyal generals)
act as one in the presence of faulty processes (traitors)
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

20 / 52

Byzantine generals problem cont.


In the absence of faults it is trivial to solve:
send proposed plan (attack or not attack) to all
wait until received messages from everyone
if a process proposed attack decide to attack
otherwise, decide to not attack

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

21 / 52

Byzantine generals problem cont.


In the absence of faults it is trivial to solve:
send proposed plan (attack or not attack) to all
wait until received messages from everyone
if a process proposed attack decide to attack
otherwise, decide to not attack
In the presence of faults it becomes tricky
if a process may crash, some processes may not receive messages
from everyone (but some may)
if a process may send faulty messages, contradictory information may
be received, e.g.,
A tells B that C told A that C wants to attack, while C tells B
that C does not want to attack
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

21 / 52

Byzantine generals problem cont.


In the absence of faults it is trivial to solve:
send proposed plan (attack or not attack) to all
wait until received messages from everyone
if a process proposed attack decide to attack
otherwise, decide to not attack
In the presence of faults it becomes tricky
if a process may crash, some processes may not receive messages
from everyone (but some may)
if a process may send faulty messages, contradictory information may
be received, e.g.,
A tells B that C told A that C wants to attack, while C tells B
Who is lying to whom?
that C does not want to attack
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

21 / 52

Fault-tolerant distributed algorithms

n processes communicate by messages (reliable communication)


all processes know that at most t of them might be faulty
f are actually faulty
resilience conditions, e.g., n > 3t t f 0
no masquerading: the processes know the origin of incoming messages
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

22 / 52

Fault-tolerant distributed algorithms

n
?
t

n processes communicate by messages (reliable communication)


all processes know that at most t of them might be faulty
f are actually faulty
resilience conditions, e.g., n > 3t t f 0
no masquerading: the processes know the origin of incoming messages
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

22 / 52

Fault-tolerant distributed algorithms

n
?
?
?
t f
n processes communicate by messages (reliable communication)
all processes know that at most t of them might be faulty
f are actually faulty
resilience conditions, e.g., n > 3t t f 0
no masquerading: the processes know the origin of incoming messages
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

22 / 52

Fault models abstractions of reality


clean crashes:

least severe

faulty processes prematurely halt after/before send to all

crash faults:
faulty processes prematurely halt (also) in the middle of send to all

omission faults:
faulty processes follow the algorithm, but some messages sent by them
might be lost

symmetric faults:
faulty processes send arbitrarily to all or nobody

Byzantine faults:

most severe

faulty processes can do anything

encompass all behaviors of above models

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

23 / 52

Fault models the ugly truth


A photo of a Byzantine fault:

photo by Driscoll (Honeywell)


he reports Byzantine behavior on the Space Shuttle computer network
other sources of faults: bit-flips in memory, power outage, disconnection
from the network, etc.
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

24 / 52

Model vs. reality: impossibilities


Hence, we would like the weakest assumptions possible. But there are
theoretical limits on how weak assumptions can be made:
consensus is impossible in asynchronous systems if there may be a
crash fault, i.e., t = 1 (Fischer et al., 1985)
consensus is possible in synchronous systems in the presence of
Byzantine faults iff n > 3t (Lamport et al., 1982)
consensus is impossible in (synchronous) round-based systems if
bn/2c messages can be lost per round (Santoro & Widmayer, 1989)
fast Byzantine consensus is solvable iff n > 5t (Martin & Alvisi, 2006)
32 different degrees of synchrony and whether consensus can be
solved in the presence of how many faults investigated in (Dolev
et al., 1987)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

25 / 52

Model vs. reality: impossibilities


Hence, we would like the weakest assumptions possible. But there are
theoretical limits on how weak assumptions can be made:
consensus is impossible in asynchronous systems if there may be a
crash fault, i.e., t = 1 (Fischer et al., 1985)
consensus is possible in synchronous systems in the presence of
Byzantine faults iff n > 3t (Lamport et al., 1982)
consensus is impossible in (synchronous) round-based systems if
bn/2c messages can be lost per round (Santoro & Widmayer, 1989)
fast Byzantine consensus is solvable iff n > 5t (Martin & Alvisi, 2006)
32 different degrees of synchrony and whether consensus can be
solved in the presence of how many faults investigated in (Dolev
et al., 1987)
arithmetic resilience conditions play crucial role!
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

25 / 52

After this excursion to faults, lets


go back to the problem of defining
consistency
(asynchronous systems)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

26 / 52

Defining consistency e.g., binary consensus


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

27 / 52

Defining consistency e.g., binary consensus


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.


either all attack or no-one

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

27 / 52

Defining consistency e.g., binary consensus


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.


either all attack or no-one
validity. If all correct processes have the same initial value v , then v
is the only possible decision value
the decision on whether to attack must be consistent with
the will of at least one loyal general

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

27 / 52

Defining consistency e.g., binary consensus


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.


either all attack or no-one
validity. If all correct processes have the same initial value v , then v
is the only possible decision value
the decision on whether to attack must be consistent with
the will of at least one loyal general
termination. Every correct process eventually decides.
at some point negotiations must be over

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

27 / 52

Defining consistency e.g., binary consensus


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.


either all attack or no-one
validity. If all correct processes have the same initial value v , then v
is the only possible decision value
the decision on whether to attack must be consistent with
the will of at least one loyal general
termination. Every correct process eventually decides.
at some point negotiations must be over
Interplay of safety and liveness makes the problem hard. . .
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

27 / 52

What if only two properties have to be satisfied?


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

validity. If all correct processes have the same initial value v , then v
is the only possible decision value.
termination. Every correct process eventually decides.
Give an algorithm that solves validity and termination!

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

28 / 52

What if only two properties have to be satisfied?


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

validity. If all correct processes have the same initial value v , then v
is the only possible decision value.
termination. Every correct process eventually decides.
Solution: decide my own proposed value. (no need to agree)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

28 / 52

What if only two properties have to be satisfied?


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.

termination. Every correct process eventually decides.


Give an algorithm that solves agreement and termination!

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

28 / 52

What if only two properties have to be satisfied?


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.

termination. Every correct process eventually decides.


Solution: decide 0. (no relation to initial values required)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

28 / 52

What if only two properties have to be satisfied?


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.


validity. If all correct processes have the same initial value v , then v
is the only possible decision value.

Give an algorithm that solves agreement and validity!

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

28 / 52

What if only two properties have to be satisfied?


Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:

agreement. No two correct processes decide on different value.


validity. If all correct processes have the same initial value v , then v
is the only possible decision value.

Solution: do nothing (doing nothing is always safe)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

28 / 52

Wrap-up: Intro to FTDAs

distributed systems
replication and consistency
synchronous vs. asynchronous
fault models
example for an agreement problem: Byzantine Generals

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

29 / 52

Our case study. . .

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

30 / 52

Asynchronous FTDAs
In this lecture we consider methods for asynchronous FTDAs that either
solve problems that are less hard than consensus:
reliable broadcast. termination required only for specific initial state
(Srikanth & Toueg, 1987). [Verified in Parts II, III, V]
condition-based consensus properties required only in runs from
specific initial states (Mostefaoui et al., 2003)
[Verified in Part II]
The Paxos idea fault-tolerant distributed algorithms that are safe and
make progress only if you are lucky (Lamport, 1998)
[Serious challenge]
are asynchronous but use information on faults as a black box
failure detector based atomic commitment. distributed databases
(Raynal, 1997)
[Challenge]

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

31 / 52

Asynchronous FTDAs
In this lecture we consider methods for asynchronous FTDAs that either
solve problems that are less hard than consensus:
reliable broadcast. termination required only for specific initial state
(Srikanth & Toueg, 1987). [Verified in Parts II, III, V]
condition-based consensus properties required only in runs from
specific initial states (Mostefaoui et al., 2003)
[Verified in Part II]
The Paxos idea fault-tolerant distributed algorithms that are safe and
make progress only if you are lucky (Lamport, 1998)
[Serious challenge]
are asynchronous but use information on faults as a black box
failure detector based atomic commitment. distributed databases
(Raynal, 1997)
[Challenge]
We use the algorithm from (Srikanth & Toueg, 1987) as running example
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

31 / 52

Asynchronous Reliable Broadcast (Srikanth & Toueg, 87)


The core of the classic broadcast algorithm from the DA literature.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

32 / 52

Assumptions from (Srikanth & Toueg, 87)

asynchronous interleaving

reliable message passing (no bounds on message delays)

at most t Byzantine faults

resilience condition: n > 3t t f

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

33 / 52

The spec of our case-study

Unforgeability. If vi = false for all correct processes i, then for all correct
processes j, acceptj remains false forever.

Completeness. If vi = true for all correct processes i, then there is a


correct process j that eventually sets acceptj to true.
Relay. If a correct process i sets accepti to true, then eventually all
correct processes j set acceptj to true.

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

34 / 52

The spec of our case-study

Unforgeability. If vi = false for all correct processes i, then for all correct
processes j, acceptj remains false forever.
if no loyal general wants to attack, then traitors should not
be able to force one.
Completeness. If vi = true for all correct processes i, then there is a
correct process j that eventually sets acceptj to true.
If all loyal generals want to attack, there shall be an attack.
Relay. If a correct process i sets accepti to true, then eventually all
correct processes j set acceptj to true.
If one loyal general attacks, then all loyal generals should attack.

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

34 / 52

The spec of our case-study

Unforgeability. If vi = false for all correct processes i, then for all correct
processes j, acceptj remains false forever.
if no loyal general wants to attack, then traitors should not
be able to force one.
Completeness. If vi = true for all correct processes i, then there is a
correct process j that eventually sets acceptj to true.
If all loyal generals want to attack, there shall be an attack.
Relay. If a correct process i sets accepti to true, then eventually all
correct processes j set acceptj to true.
If one loyal general attacks, then all loyal generals should attack.

These are the specs as given in literature: they can be formalized in LTL
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

34 / 52

Reliable broadcast vs. Consensus


Reliable broadcast: Completeness. If vi = true for all correct processes i,
then there is a correct process j that eventually sets acceptj
to true.
Consensus: Termination. Every correct process eventually decides.
Difference:
Completeness requires to do something only if i. vi = true,
i.e., only for one specific initial state
Termination requires to do something in all runs (from all initial
states)
weakening of spec makes reliable broadcast solvable in async,
while consensus is not solvable
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

35 / 52

Asynchronous Reliable Broadcast (Srikanth & Toueg, 87)


The core of the classic broadcast algorithm from the DA literature.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

36 / 52

Threshold-Guarded Distributed Algorithms


Standard construct: quantified guards (t=f=0)
Existential Guard
if received m from some process then ...
Universal Guard
if received m from all processes then ...

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

37 / 52

Threshold-Guarded Distributed Algorithms


Standard construct: quantified guards (t=f=0)
Existential Guard
if received m from some process then ...
Universal Guard
if received m from all processes then ...
what if faults might occur?

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

37 / 52

Threshold-Guarded Distributed Algorithms


Standard construct: quantified guards (t=f=0)
Existential Guard
if received m from some process then ...
Universal Guard
if received m from all processes then ...
what if faults might occur?
Fault-Tolerant Algorithms: n processes, at most t are Byzantine
Threshold Guard
if received m from n t processes then ...
(the processes cannot refer to f!)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

37 / 52

Basic mechanisms used by the algorithm: thresholds


t +1

if received m from t + 1 processes then ...

t f

(threshold)

Correct processes count distinct incoming messages

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

38 / 52

Basic mechanisms used by the algorithm: thresholds


t +1

if received m from t + 1 processes then ...

t f

(threshold)

Correct processes count distinct incoming messages

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

38 / 52

Basic mechanisms used by the algorithm: thresholds


t +1

at least one non-faulty sent the message


if received m from t + 1 processes then ...

t f

(threshold)

Correct processes count distinct incoming messages

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

38 / 52

Classic correctness argument


hand-written proofs

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

39 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

By contradiction assume a
correct process sets acceptj = 1

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16
Thus it has received n t
messages by distinct processes

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16
Thus it has received n t
messages by distinct processes
That means messages by at
n 2t correct processes

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16
Thus it has received n t
messages by distinct processes
That means messages by at
n 2t correct processes
Let p be the first correct
processes that has sent (echo)

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16
Thus it has received n t
messages by distinct processes
That means messages by at
n 2t correct processes
Let p be the first correct
processes that has sent (echo)
It did not send in line 7, as
vp = 0 by assumption

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16
Thus it has received n t
messages by distinct processes
That means messages by at
n 2t correct processes
Let p be the first correct
processes that has sent (echo)
It did not send in line 7, as
vp = 0 by assumption
Thus, p sent in line 12

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16
Thus it has received n t
messages by distinct processes
That means messages by at
n 2t correct processes
Let p be the first correct
processes that has sent (echo)
It did not send in line 7, as
vp = 0 by assumption
Thus, p sent in line 12
Based on t + 1 messages, i.e., 1
sent by a correct processes

TMPA14, Nov. 2014

40 / 52

Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

By contradiction assume a
correct process sets acceptj = 1
Thus it has executed line 16
Thus it has received n t
messages by distinct processes
That means messages by at
n 2t correct processes
Let p be the first correct
processes that has sent (echo)
It did not send in line 7, as
vp = 0 by assumption
Thus, p sent in line 12
Based on t + 1 messages, i.e., 1
sent by a correct processes
contradiction to p being the
first one.
TMPA14, Nov. 2014

40 / 52

Proof: Completeness
If vi = true for all correct processes i, then there is a correct process j
that eventually sets acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

41 / 52

Proof: Completeness
If vi = true for all correct processes i, then there is a correct process j
that eventually sets acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

all, i.e., at least n t correct


processes execute line 7

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

41 / 52

Proof: Completeness
If vi = true for all correct processes i, then there is a correct process j
that eventually sets acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

all, i.e., at least n t correct


processes execute line 7
by reliable communication all
correct processes receive all
messages sent by correct
processes

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

41 / 52

Proof: Completeness
If vi = true for all correct processes i, then there is a correct process j
that eventually sets acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

all, i.e., at least n t correct


processes execute line 7
by reliable communication all
correct processes receive all
messages sent by correct
processes
Thus, a correct process receives
n t (echo) messages

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

41 / 52

Proof: Completeness
If vi = true for all correct processes i, then there is a correct process j
that eventually sets acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

all, i.e., at least n t correct


processes execute line 7
by reliable communication all
correct processes receive all
messages sent by correct
processes
Thus, a correct process receives
n t (echo) messages
Thus, a correct process executes
line 16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

41 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

Correct process executes line 16

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

Correct process executes line 16


Thus it has received n t
messages by distinct processes

4
5
6
7

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

Correct process executes line 16


Thus it has received n t
messages by distinct processes
That means messages by n 2t
correct processes

An atomic step:
i f vi = 1
then send ( echo ) to all ;

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

Correct process executes line 16


Thus it has received n t
messages by distinct processes
That means messages by n 2t
correct processes

An atomic step:
i f vi = 1
then send ( echo ) to all ;

By the resilience condition


n > 3t, we have n 2t t + 1

8
9
10
11
12

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

11
12

Thus it has received n t


messages by distinct processes
That means messages by n 2t
correct processes

An atomic step:
i f vi = 1
then send ( echo ) to all ;

10

Correct process executes line 16

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

By the resilience condition


n > 3t, we have n 2t t + 1
Thus at least t + 1 correct
processes have sent (echo)

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

11
12

Thus it has received n t


messages by distinct processes
That means messages by n 2t
correct processes

An atomic step:
i f vi = 1
then send ( echo ) to all ;

10

Correct process executes line 16

i f received (echo) from a t l e a s t


t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

By the resilience condition


n > 3t, we have n 2t t + 1
Thus at least t + 1 correct
processes have sent (echo)
By reliable communication,
these messages are received by
all correct processes

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

11
12

i f vi = 1
then send ( echo ) to all ;
i f received (echo) from a t l e a s t
t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

Thus it has received n t


messages by distinct processes
That means messages by n 2t
correct processes

An atomic step:

10

Correct process executes line 16

By the resilience condition


n > 3t, we have n 2t t + 1
Thus at least t + 1 correct
processes have sent (echo)
By reliable communication,
these messages are received by
all correct processes
Thus, all correct processes send
(echo) in line 12

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

11
12

i f vi = 1
then send ( echo ) to all ;
i f received (echo) from a t l e a s t
t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Thus it has received n t


messages by distinct processes
That means messages by n 2t
correct processes

An atomic step:

10

Correct process executes line 16

Checking Fault-Tolerant Distributed Algos

By the resilience condition


n > 3t, we have n 2t t + 1
Thus at least t + 1 correct
processes have sent (echo)
By reliable communication,
these messages are received by
all correct processes
Thus, all correct processes send
(echo) in line 12
There are at least n t correct

TMPA14, Nov. 2014

42 / 52

Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3

Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0

4
5
6
7

11
12

i f vi = 1
then send ( echo ) to all ;
i f received (echo) from a t l e a s t
t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
then send ( echo ) to all ;

13
14
15
16

i f received ( echo ) from a t l e a s t


n - t distinct p r o c e s s e s
then accepti := 1 ;
Josef Widder (www.forsyte.at)

Thus it has received n t


messages by distinct processes
That means messages by n 2t
correct processes

An atomic step:

10

Correct process executes line 16

Checking Fault-Tolerant Distributed Algos

By the resilience condition


n > 3t, we have n 2t t + 1
Thus at least t + 1 correct
processes have sent (echo)
By reliable communication,
these messages are received by
all correct processes
Thus, all correct processes send
(echo) in line 12
There are at least n t correct
Thus, all correct processes
eventually execute line 16
TMPA14, Nov. 2014

42 / 52

Problems with hand-written proofs


code inspection becomes confusing for larger algorithms

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

43 / 52

Bracha & Touegs algorithm (JACM 1985)

Part II, Part V

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

44 / 52

Condition-based consensus (Mostefaoui et al., 2003)

Part II, Part V


Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

45 / 52

Problems with hand-written proofs


code inspection becomes confusing for larger algorithms
hidden assumptions
resilience condition
reliable communication (fairness)
non-masquerading
failure model

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

46 / 52

Problems with hand-written proofs


code inspection becomes confusing for larger algorithms
hidden assumptions
resilience condition
reliable communication (fairness)
non-masquerading
failure model

re-using proofs if one of the ingredients changes?


if I cannot prove it correct, that does not mean the algorithm is wrong
. . . how to come up with counterexamples?
ultimate goal: verify the actual source code.
. . . it is not realistic that developers do mathematical proofs.
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

46 / 52

We have convinced a human, . . .


. . . why should we convince a computer?
it is easy to make mistakes in proofs

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

47 / 52

We have convinced a human, . . .


. . . why should we convince a computer?
it is easy to make mistakes in proofs
it is easier to overlook mistakes in proofs
distributed algorithms require non-centralized thinking
(untypical for the human mind)
many issues to consider at the same time
(interleaving of steps, faults, timing assumptions)

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

47 / 52

We have convinced a human, . . .


. . . why should we convince a computer?
it is easy to make mistakes in proofs
it is easier to overlook mistakes in proofs
distributed algorithms require non-centralized thinking
(untypical for the human mind)
many issues to consider at the same time
(interleaving of steps, faults, timing assumptions)

people who tried to convince computers found bugs in published. . .


Byzantine agreement algorithm (Lincoln & Rushby, 1993)
clock synchronization algorithm (Malekpour & Siminiceanu, 2006)
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

47 / 52

End of Part I

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

48 / 52

References I

Dolev, Danny, Dwork, Cynthia, & Stockmeyer, Larry. 1987.


On the minimal synchronism needed for distributed consensus.
J. ACM, 34, 7797.
http://doi.acm.org/10.1145/7531.7533.
Fischer, Michael J., Lynch, Nancy A., & Paterson, M. S. 1985.
Impossibility of Distributed Consensus with one Faulty Process.
J. ACM, 32(2), 374382.
http://doi.acm.org/10.1145/3149.214121.
Lamport, Leslie. 1998.
The part-time parliament.
ACM Trans. Comput. Syst., 16, 133169.
http://doi.acm.org/10.1145/279227.279229.
Lamport, Leslie, Shostak, Robert E., & Pease, Marshall C. 1982.
The Byzantine Generals Problem.
ACM Trans. Program. Lang. Syst., 4(3), 382401.

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

49 / 52

References II
Lincoln, P., & Rushby, J. 1993.
A formally verified algorithm for interactive consistency under a hybrid fault model.
Pages 402411 of: FTCS-23.
http://dx.doi.org/10.1109/FTCS.1993.627343.
Malekpour, Mahyar R., & Siminiceanu, Radu. 2006.
Comments on the Byzantine Self-Stabilizing Pulse Synchronization Protocol:
Counterexamples.
Tech. rept. TM-2006-213951. NASA.
Martin, Jean-Philippe, & Alvisi, Lorenzo. 2006.
Fast Byzantine Consensus.
IEEE Trans. Dependable Sec. Comput., 3(3), 202215.
Most
efaoui, Achour, Mourgaya, Eric, Parv
edy, Philippe Raipin, & Raynal, Michel. 2003.
Evaluating the Condition-Based Approach to Solve Consensus.
Pages 541550 of: DSN.
Raynal, Michel. 1997.
A Case Study of Agreement Problems in Distributed Systems: Non-Blocking Atomic
Commitment.
Pages 209214 of: HASE.
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

50 / 52

References III

Santoro, Nicola, & Widmayer, Peter. 1989.


Time is Not a Healer.
Pages 304313 of: STACS.
LNCS, vol. 349.
http://dx.doi.org/10.1007/BFb0028994.
Srikanth, T. K., & Toueg, Sam. 1987.
Optimal clock synchronization.
J. ACM, 34, 626645.
http://doi.acm.org/10.1145/28869.28876.

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

51 / 52

Model vs. reality: assumption coverage


Every assumption has a probability that it is satisfied in the actual system:
n > 3t
less likely than n > t
every message sent is received within bounded time
less likely than that it is eventually received
processes fail by crashing
less likely than they deviate arbitrarily from the prescribed behavior
non-masquerading
less likely than processes that can pretend to be someone else

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

52 / 52

Model vs. reality: assumption coverage


Every assumption has a probability that it is satisfied in the actual system:
n > 3t
less likely than n > t
every message sent is received within bounded time
less likely than that it is eventually received
processes fail by crashing
less likely than they deviate arbitrarily from the prescribed behavior
non-masquerading
less likely than processes that can pretend to be someone else
To use a distributed algorithm in practice:
one must ensure that an assumption is suitable for a given system
the probability that the system is working correctly is the probability
that the assumptions hold
(given that the distributed algorithm actually is correct)
Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

52 / 52

You might also like