Introduction To Fault Tolerant

Model Checking of Fault-Tolerant Distributed Algorithms
Part I: Fault-Tolerant Distributed Algorithms

Annu Gmeiner Igor Konnov Ulrich Schmid
Helmut Veith Josef Widder
TMPA 2014, Kostroma, Russia
Josef Widder (www.forsyte.at)
Checking Fault-Tolerant Distributed Algos
TMPA14, Nov. 2014
1 / 52
114
Figure 7.1: DARTS prototype board, comprising 8 interconnected HITS chips
7.1 Assessing and validating the standard node HITS design
Distributed Systems
Figure 7.1: DARTS prototype board, comprising 8 interconnected HITS chips
114
7.1 Assessing and validating the standard node HITS design
Distributed Systems
Are they always working?

TMPA14, Nov. 2014
2 / 52
No. . . some failing systems

Therac-25 (1985)
radiation therapy machine
gave massive overdoses, e.g., due to race conditions of concurrent tasks
Quantas Airbus in-flight Learmonth upset (2008)

1 out of 3 replicated components failed
computer initiated dangerous altitude drop
Ariane 501 maiden flight (1996)

primary/backup, i.e., 2 replicated computers
both run into the same integer overflow
Netflix outages due to Amazons cloud (ongoing)

one is not sure what is going on there
hundreds of computers involved
TMPA14, Nov. 2014
3 / 52
Why do they fail?
TMPA14, Nov. 2014
4 / 52
Why do they fail?
faults at design/implementation phase
faults at runtime
outside of control of designer/developer
e.g., to the right: crack in a diode in the
data link interface of the Space Shuttle
led to erroneous messages being sent
Driscoll (Honeywell)
TMPA14, Nov. 2014
4 / 52
Why do they fail?

approach:
find and fix faults before operation
model checking
faults at runtime
TMPA14, Nov. 2014
4 / 52
Why do they fail?

approach:
find and fix faults before operation
model checking
faults at runtime
approach:
keep system operational despite faults
fault-tolerant distributed algorithms
TMPA14, Nov. 2014
4 / 52
Bringing both together
Goal: automatically verified fault-tolerant distributed algorithms
TMPA14, Nov. 2014
5 / 52
Bringing both together
Goal: automatically verified fault-tolerant distributed algorithms
model checking FTDAs is a research challenge:

computers run independently at different speeds
exchange messages with uncertain delays
faults
parameterization
. . . fault-tolerance makes model checking harder
TMPA14, Nov. 2014
5 / 52
Lecture overview
Part I: Fault-tolerant distributed algorithms

introduction to distributed algorithms
details of our case study algorithm
motivation why model checking is cool
Part II: Modeling fault-tolerant distributed algorithms

model checking challenges in distributed algorithms
Promela, control flow automata, etc.
model checking of small instances with Spin
TMPA14, Nov. 2014
6 / 52
Lecture overview cont.

We will have to skip due to time limitations:
Part III: Introduction into Parameterized model checking
the general problem statement
well-known undecidability results
Igor Konnov:
Part IV: Parameterized model checking of FTDAs by abstraction
parametric interval abstraction (PIA)
PIA data and counter abstraction
counterexample-guided abstraction refinement (CEGAR)
Part V: Parameterized model checking of FTDAs by BMC
bounding the diameter
bounded model checking as a complete method
TMPA14, Nov. 2014
7 / 52
Part I: Fault-Tolerant Distributed

Algorithms
TMPA14, Nov. 2014
8 / 52
Distributed Systems are everywhere

What they allow to do
share resources
communicate
increase performance
speed
fault tolerance
Difference to centralized systems

independent activities (concurrency)
components do not have access to the global state (only local view)
TMPA14, Nov. 2014
9 / 52
Application areas
buzzwords from the 60ies
operating systems
(distributed) data base systems
communication networks
multiprocessor architectures
control systems
New buzzwords
cloud computing
social networks
multi core
cyber-physical systems
TMPA14, Nov. 2014
10 / 52
Major challenge
Uncertainty
exchange messages with (unknown) delays
faults
TMPA14, Nov. 2014
11 / 52
Major challenge
Uncertainty
faults
challenge in design of distributed algorithms
a process has access only to its local state
but one wants to achieve some global property
TMPA14, Nov. 2014
11 / 52
Major challenge
Uncertainty
faults
challenge in design of distributed algorithms
a process has access only to its local state
but one wants to achieve some global property
challenge in proving them correct
large degree of non-determinism
large execution and state space
TMPA14, Nov. 2014
11 / 52
From dependability to a distributed system
Process P provides a service. We want to access it reliably

but P may fail
TMPA14, Nov. 2014
12 / 52
replication
P
P1
P2
P3

but P may fail
canonical approach: replication, i.e., several copies of P
Due to non-determinism, the behavior of the copies might deviate
(e.g. in a replicated database, transactions are committed in different
orders at different sites)
TMPA14, Nov. 2014
12 / 52
replication
P
P1
P2
consistency
P3
P
P

but P may fail
canonical approach: replication, i.e., several copies of P
Due to non-determinism, the behavior of the copies might deviate
(e.g. in a replicated database, transactions are committed in different
orders at different sites)
we have to enforce that the copies behave as one.
Consistency in a distributed system: what does it mean to behave
as one.
TMPA14, Nov. 2014
12 / 52
Replication distributed systems
replication
P
P1
P2
P3
TMPA14, Nov. 2014
13 / 52
Distributed message passing system

multiple distributed processes pi
pi
receive
send
internal
dots represent states

a step of a process can be
a send step (a process sends messages to other processes)
a receive step (a process receives a subset of messages sent to it)
an internal step (a local computation)
steps are the atomic (indivisible) units of computations

TMPA14, Nov. 2014
14 / 52
Types of Distributed Algorithms:

Synchronous vs. Asynchronous
Synchronous
all processes move in lock-step
rounds
a message sent in a round is received in the same round
idealized view
impossible or expensive to implement
Asynchronous
only one process moves at a time
arbitrary interleavings of steps
a message sent is received eventually
TMPA14, Nov. 2014
15 / 52

Synchronous
rounds
idealized view
Asynchronous
important problems not solvable (Fischer et al., 1985)!
TMPA14, Nov. 2014
15 / 52

Synchronous
rounds
idealized view
Asynchronous
important problems not solvable (Fischer et al., 1985)!
We focus on asynchronous algorithms here. . .
TMPA14, Nov. 2014
15 / 52
Asynchronous system
has very mild restrictions on the environment
interleaving semantics
unbounded message delays
very little can be done. . .
there is no distributed algorithm that solves consensus in the presence
of one faulty process
(as we will see, consensus is the paradigm of consistency)
folklore explanation:
you cannot distinguish a slow process from a crashed one
real explanation:
see intricate proof by Fischer, Lynch, and Paterson (JACM 1985)
TMPA14, Nov. 2014
16 / 52
Where we stand
replication
P
P1
P2
P3
TMPA14, Nov. 2014
17 / 52
What we still need. . .
P1
P2
consistency
P3
P
P
TMPA14, Nov. 2014
18 / 52
P1
P2
consistency
P3
P
P
consistency requirements have been formalized under several names,

e.g.,
consensus
atomic broadcast
Byzantine Generals problem
Byzantine agreement
atomic commitment
definitions are similar but may have subtle differences
TMPA14, Nov. 2014
18 / 52
P1
P2
consistency
P3
P
P
consistency requirements have been formalized under several names,

e.g.,
consensus
atomic broadcast
Byzantine Generals problem
Byzantine agreement
atomic commitment
definitions are similar but may have subtle differences

We use the famous Byzantine Generals to introduce this problem
domain. . .
TMPA14, Nov. 2014
18 / 52
Fault tolerance The Byzantine generals problem
Wiktionary:
Byzantine: adj. of a devious, usually stealthy manner, of practice.
TMPA14, Nov. 2014
19 / 52

Lamport (this years Turing laureate), Shostak, and Pease wrote in their
Dijkstra Prize in Distributed Computing winning paper (Lamport et al.,
1982):
[. . .] several divisions of the Byzantine army are camped outside
an enemy city, each division commanded by its own general. [. . .]
However, some of the generals may be traitors [. . .]
if the divisions of loyal generals attack together, the city falls
if only some loyal generals attack, their armies fall
generals communicate by obedient messengers
TMPA14, Nov. 2014
20 / 52

1982):
The Byzantine generals problem:
the loyal generals have to agree on whether to attack.
if all want to attack they must attack, if no-one wants to attack they
must not attack
TMPA14, Nov. 2014
20 / 52

1982):
The Byzantine generals problem:
the loyal generals have to agree on whether to attack.
if all want to attack they must attack, if no-one wants to attack they
must not attack
metaphor for a distributed system where correct processes (loyal generals)
act as one in the presence of faulty processes (traitors)
TMPA14, Nov. 2014
20 / 52
Byzantine generals problem cont.

In the absence of faults it is trivial to solve:
send proposed plan (attack or not attack) to all
wait until received messages from everyone
if a process proposed attack decide to attack
otherwise, decide to not attack
TMPA14, Nov. 2014
21 / 52

In the presence of faults it becomes tricky
if a process may crash, some processes may not receive messages
from everyone (but some may)
if a process may send faulty messages, contradictory information may
be received, e.g.,
A tells B that C told A that C wants to attack, while C tells B
that C does not want to attack
TMPA14, Nov. 2014
21 / 52

In the presence of faults it becomes tricky
if a process may crash, some processes may not receive messages
from everyone (but some may)
if a process may send faulty messages, contradictory information may
be received, e.g.,
A tells B that C told A that C wants to attack, while C tells B
Who is lying to whom?
that C does not want to attack
TMPA14, Nov. 2014
21 / 52
Fault-tolerant distributed algorithms
n processes communicate by messages (reliable communication)

all processes know that at most t of them might be faulty
f are actually faulty
resilience conditions, e.g., n > 3t t f 0
no masquerading: the processes know the origin of incoming messages
TMPA14, Nov. 2014
22 / 52
n
?
t

TMPA14, Nov. 2014
22 / 52
n
?
?
?
t f
TMPA14, Nov. 2014
22 / 52
Fault models abstractions of reality

clean crashes:
least severe
faulty processes prematurely halt after/before send to all
crash faults:
faulty processes prematurely halt (also) in the middle of send to all
omission faults:
faulty processes follow the algorithm, but some messages sent by them
might be lost
symmetric faults:
faulty processes send arbitrarily to all or nobody
Byzantine faults:
most severe
faulty processes can do anything
encompass all behaviors of above models
TMPA14, Nov. 2014
23 / 52
Fault models the ugly truth

A photo of a Byzantine fault:
photo by Driscoll (Honeywell)

he reports Byzantine behavior on the Space Shuttle computer network
other sources of faults: bit-flips in memory, power outage, disconnection
from the network, etc.
TMPA14, Nov. 2014
24 / 52
Model vs. reality: impossibilities

Hence, we would like the weakest assumptions possible. But there are
theoretical limits on how weak assumptions can be made:
consensus is impossible in asynchronous systems if there may be a
crash fault, i.e., t = 1 (Fischer et al., 1985)
consensus is possible in synchronous systems in the presence of
Byzantine faults iff n > 3t (Lamport et al., 1982)
consensus is impossible in (synchronous) round-based systems if
bn/2c messages can be lost per round (Santoro & Widmayer, 1989)
fast Byzantine consensus is solvable iff n > 5t (Martin & Alvisi, 2006)
32 different degrees of synchrony and whether consensus can be
solved in the presence of how many faults investigated in (Dolev
et al., 1987)
TMPA14, Nov. 2014
25 / 52
Model vs. reality: impossibilities

Hence, we would like the weakest assumptions possible. But there are
theoretical limits on how weak assumptions can be made:
consensus is impossible in asynchronous systems if there may be a
crash fault, i.e., t = 1 (Fischer et al., 1985)
consensus is possible in synchronous systems in the presence of
Byzantine faults iff n > 3t (Lamport et al., 1982)
consensus is impossible in (synchronous) round-based systems if
bn/2c messages can be lost per round (Santoro & Widmayer, 1989)
fast Byzantine consensus is solvable iff n > 5t (Martin & Alvisi, 2006)
32 different degrees of synchrony and whether consensus can be
solved in the presence of how many faults investigated in (Dolev
et al., 1987)
arithmetic resilience conditions play crucial role!
TMPA14, Nov. 2014
25 / 52
After this excursion to faults, lets

go back to the problem of defining
consistency
(asynchronous systems)
TMPA14, Nov. 2014
26 / 52
Defining consistency e.g., binary consensus

Every process has some initial value v {0, 1} and has to decide
irrevocably on some value in concordance with the following properties:
TMPA14, Nov. 2014
27 / 52

agreement. No two correct processes decide on different value.

either all attack or no-one
TMPA14, Nov. 2014
27 / 52


validity. If all correct processes have the same initial value v , then v
is the only possible decision value
the decision on whether to attack must be consistent with
the will of at least one loyal general
TMPA14, Nov. 2014
27 / 52


termination. Every correct process eventually decides.
at some point negotiations must be over
TMPA14, Nov. 2014
27 / 52


at some point negotiations must be over
Interplay of safety and liveness makes the problem hard. . .
TMPA14, Nov. 2014
27 / 52
What if only two properties have to be satisfied?

is the only possible decision value.
Give an algorithm that solves validity and termination!
TMPA14, Nov. 2014
28 / 52

Solution: decide my own proposed value. (no need to agree)
TMPA14, Nov. 2014
28 / 52


Give an algorithm that solves agreement and termination!
TMPA14, Nov. 2014
28 / 52


Solution: decide 0. (no relation to initial values required)
TMPA14, Nov. 2014
28 / 52


Give an algorithm that solves agreement and validity!
TMPA14, Nov. 2014
28 / 52


Solution: do nothing (doing nothing is always safe)
TMPA14, Nov. 2014
28 / 52
Wrap-up: Intro to FTDAs
distributed systems
replication and consistency
synchronous vs. asynchronous
fault models
example for an agreement problem: Byzantine Generals
TMPA14, Nov. 2014
29 / 52
Our case study. . .
TMPA14, Nov. 2014
30 / 52
Asynchronous FTDAs
In this lecture we consider methods for asynchronous FTDAs that either
solve problems that are less hard than consensus:
reliable broadcast. termination required only for specific initial state
(Srikanth & Toueg, 1987). [Verified in Parts II, III, V]
condition-based consensus properties required only in runs from
specific initial states (Mostefaoui et al., 2003)
[Verified in Part II]
The Paxos idea fault-tolerant distributed algorithms that are safe and
make progress only if you are lucky (Lamport, 1998)
[Serious challenge]
are asynchronous but use information on faults as a black box
failure detector based atomic commitment. distributed databases
(Raynal, 1997)
[Challenge]
TMPA14, Nov. 2014
31 / 52
Asynchronous FTDAs
In this lecture we consider methods for asynchronous FTDAs that either
solve problems that are less hard than consensus:
reliable broadcast. termination required only for specific initial state
(Srikanth & Toueg, 1987). [Verified in Parts II, III, V]
condition-based consensus properties required only in runs from
specific initial states (Mostefaoui et al., 2003)
[Verified in Part II]
The Paxos idea fault-tolerant distributed algorithms that are safe and
make progress only if you are lucky (Lamport, 1998)
[Serious challenge]
are asynchronous but use information on faults as a black box
failure detector based atomic commitment. distributed databases
(Raynal, 1997)
[Challenge]
We use the algorithm from (Srikanth & Toueg, 1987) as running example
TMPA14, Nov. 2014
31 / 52
Asynchronous Reliable Broadcast (Srikanth & Toueg, 87)

The core of the classic broadcast algorithm from the DA literature.
1
2
3
Variables of process i
vi : { 0 , 1} i n i t i a l l y 0 or 1
accepti : { 0 , 1} i n i t i a l l y 0
4
5
6
7
An atomic step:
i f vi = 1
then send ( echo ) to all ;
8
9
10
11
12
i f received (echo) from a t l e a s t

t + 1 distinct p r o c e s s e s
and not s e n t ( echo ) b e f o r e
13
14
15
16
i f received ( echo ) from a t l e a s t

n - t distinct p r o c e s s e s
then accepti := 1 ;
TMPA14, Nov. 2014
32 / 52
Assumptions from (Srikanth & Toueg, 87)
asynchronous interleaving
reliable message passing (no bounds on message delays)
at most t Byzantine faults
resilience condition: n > 3t t f
TMPA14, Nov. 2014
33 / 52
The spec of our case-study
Unforgeability. If vi = false for all correct processes i, then for all correct
processes j, acceptj remains false forever.
Completeness. If vi = true for all correct processes i, then there is a

correct process j that eventually sets acceptj to true.
Relay. If a correct process i sets accepti to true, then eventually all
correct processes j set acceptj to true.
TMPA14, Nov. 2014
34 / 52
if no loyal general wants to attack, then traitors should not
be able to force one.
If all loyal generals want to attack, there shall be an attack.
If one loyal general attacks, then all loyal generals should attack.
TMPA14, Nov. 2014
34 / 52
if no loyal general wants to attack, then traitors should not
be able to force one.
If all loyal generals want to attack, there shall be an attack.
If one loyal general attacks, then all loyal generals should attack.
These are the specs as given in literature: they can be formalized in LTL
TMPA14, Nov. 2014
34 / 52
Reliable broadcast vs. Consensus

Reliable broadcast: Completeness. If vi = true for all correct processes i,
then there is a correct process j that eventually sets acceptj
to true.
Consensus: Termination. Every correct process eventually decides.
Difference:
Completeness requires to do something only if i. vi = true,
i.e., only for one specific initial state
Termination requires to do something in all runs (from all initial
states)
weakening of spec makes reliable broadcast solvable in async,
while consensus is not solvable
TMPA14, Nov. 2014
35 / 52
Asynchronous Reliable Broadcast (Srikanth & Toueg, 87)

The core of the classic broadcast algorithm from the DA literature.
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
36 / 52
Threshold-Guarded Distributed Algorithms

Standard construct: quantified guards (t=f=0)
Existential Guard
if received m from some process then ...
Universal Guard
if received m from all processes then ...
TMPA14, Nov. 2014
37 / 52

Existential Guard
Universal Guard
what if faults might occur?
TMPA14, Nov. 2014
37 / 52

Existential Guard
Universal Guard
what if faults might occur?
Fault-Tolerant Algorithms: n processes, at most t are Byzantine
Threshold Guard
if received m from n t processes then ...
(the processes cannot refer to f!)
TMPA14, Nov. 2014
37 / 52
Basic mechanisms used by the algorithm: thresholds

t +1
if received m from t + 1 processes then ...
t f
(threshold)
Correct processes count distinct incoming messages
TMPA14, Nov. 2014
38 / 52

t +1
t f
(threshold)
TMPA14, Nov. 2014
38 / 52

t +1
at least one non-faulty sent the message

t f
(threshold)
TMPA14, Nov. 2014
38 / 52
Classic correctness argument

hand-written proofs
TMPA14, Nov. 2014
39 / 52
Proof: Unforgeability
If vi = false for all correct processes i, then for all correct processes j,
acceptj remains false forever.
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
By contradiction assume a
correct process sets acceptj = 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
Thus it has executed line 16
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
Thus it has received n t
messages by distinct processes
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
That means messages by at
n 2t correct processes
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

Let p be the first correct
processes that has sent (echo)
13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

It did not send in line 7, as
vp = 0 by assumption
13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

Thus, p sent in line 12
13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
Based on t + 1 messages, i.e., 1
sent by a correct processes
TMPA14, Nov. 2014
40 / 52
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
Based on t + 1 messages, i.e., 1
sent by a correct processes
contradiction to p being the
first one.
TMPA14, Nov. 2014
40 / 52
Proof: Completeness
If vi = true for all correct processes i, then there is a correct process j
that eventually sets acceptj to true.
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
41 / 52
Proof: Completeness
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
all, i.e., at least n t correct

processes execute line 7
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
41 / 52
Proof: Completeness
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12


by reliable communication all
correct processes receive all
messages sent by correct
processes
13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
41 / 52
Proof: Completeness
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12


processes
Thus, a correct process receives
n t (echo) messages
13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
41 / 52
Proof: Completeness
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

processes
Thus, a correct process receives
n t (echo) messages
Thus, a correct process executes
line 16

then accepti := 1 ;
TMPA14, Nov. 2014
41 / 52
Proof: Relay
If a correct process i sets accepti to true, then eventually all correct
processes j set acceptj to true.
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
Correct process executes line 16
4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1

4
5
6
7
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7

That means messages by n 2t
correct processes
An atomic step:
i f vi = 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7

correct processes
An atomic step:
i f vi = 1
By the resilience condition

n > 3t, we have n 2t t + 1
8
9
10
11
12

13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
11
12

correct processes
An atomic step:
i f vi = 1
10


Thus at least t + 1 correct
processes have sent (echo)
13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
11
12

correct processes
An atomic step:
i f vi = 1
10


By reliable communication,
these messages are received by
all correct processes
13
14
15
16

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
11
12
i f vi = 1
13
14
15
16

correct processes
An atomic step:
10

Thus, all correct processes send
(echo) in line 12

then accepti := 1 ;
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
11
12
i f vi = 1
13
14
15
16

then accepti := 1 ;

correct processes
An atomic step:
10

(echo) in line 12
There are at least n t correct
TMPA14, Nov. 2014
42 / 52
Proof: Relay
1
2
3
vi : { 0 , 1} i n i t i a l l y 0 or 1
4
5
6
7
11
12
i f vi = 1
13
14
15
16

then accepti := 1 ;

correct processes
An atomic step:
10

(echo) in line 12
There are at least n t correct
Thus, all correct processes
eventually execute line 16
TMPA14, Nov. 2014
42 / 52
Problems with hand-written proofs

code inspection becomes confusing for larger algorithms
TMPA14, Nov. 2014
43 / 52
Bracha & Touegs algorithm (JACM 1985)
Part II, Part V
TMPA14, Nov. 2014
44 / 52
Condition-based consensus (Mostefaoui et al., 2003)
Part II, Part V

TMPA14, Nov. 2014
45 / 52

hidden assumptions
resilience condition
reliable communication (fairness)
non-masquerading
failure model
TMPA14, Nov. 2014
46 / 52

hidden assumptions
resilience condition
reliable communication (fairness)
non-masquerading
failure model
re-using proofs if one of the ingredients changes?

if I cannot prove it correct, that does not mean the algorithm is wrong
. . . how to come up with counterexamples?
ultimate goal: verify the actual source code.
. . . it is not realistic that developers do mathematical proofs.
TMPA14, Nov. 2014
46 / 52
We have convinced a human, . . .

. . . why should we convince a computer?
it is easy to make mistakes in proofs
TMPA14, Nov. 2014
47 / 52

it is easier to overlook mistakes in proofs
distributed algorithms require non-centralized thinking
(untypical for the human mind)
many issues to consider at the same time
(interleaving of steps, faults, timing assumptions)
TMPA14, Nov. 2014
47 / 52

it is easier to overlook mistakes in proofs
distributed algorithms require non-centralized thinking
(untypical for the human mind)
many issues to consider at the same time
(interleaving of steps, faults, timing assumptions)
people who tried to convince computers found bugs in published. . .

Byzantine agreement algorithm (Lincoln & Rushby, 1993)
clock synchronization algorithm (Malekpour & Siminiceanu, 2006)
TMPA14, Nov. 2014
47 / 52
End of Part I
TMPA14, Nov. 2014
48 / 52
References I
Dolev, Danny, Dwork, Cynthia, & Stockmeyer, Larry. 1987.

On the minimal synchronism needed for distributed consensus.
J. ACM, 34, 7797.
http://doi.acm.org/10.1145/7531.7533.
Fischer, Michael J., Lynch, Nancy A., & Paterson, M. S. 1985.
Impossibility of Distributed Consensus with one Faulty Process.
J. ACM, 32(2), 374382.
http://doi.acm.org/10.1145/3149.214121.
Lamport, Leslie. 1998.
The part-time parliament.
ACM Trans. Comput. Syst., 16, 133169.
http://doi.acm.org/10.1145/279227.279229.
Lamport, Leslie, Shostak, Robert E., & Pease, Marshall C. 1982.
The Byzantine Generals Problem.
ACM Trans. Program. Lang. Syst., 4(3), 382401.
TMPA14, Nov. 2014
49 / 52
References II
Lincoln, P., & Rushby, J. 1993.
A formally verified algorithm for interactive consistency under a hybrid fault model.
Pages 402411 of: FTCS-23.
http://dx.doi.org/10.1109/FTCS.1993.627343.
Malekpour, Mahyar R., & Siminiceanu, Radu. 2006.
Comments on the Byzantine Self-Stabilizing Pulse Synchronization Protocol:
Counterexamples.
Tech. rept. TM-2006-213951. NASA.
Martin, Jean-Philippe, & Alvisi, Lorenzo. 2006.
Fast Byzantine Consensus.
IEEE Trans. Dependable Sec. Comput., 3(3), 202215.
Most
efaoui, Achour, Mourgaya, Eric, Parv
edy, Philippe Raipin, & Raynal, Michel. 2003.
Evaluating the Condition-Based Approach to Solve Consensus.
Pages 541550 of: DSN.
Raynal, Michel. 1997.
A Case Study of Agreement Problems in Distributed Systems: Non-Blocking Atomic
Commitment.
Pages 209214 of: HASE.
TMPA14, Nov. 2014
50 / 52
References III
Santoro, Nicola, & Widmayer, Peter. 1989.

Time is Not a Healer.
Pages 304313 of: STACS.
LNCS, vol. 349.
http://dx.doi.org/10.1007/BFb0028994.
Srikanth, T. K., & Toueg, Sam. 1987.
Optimal clock synchronization.
J. ACM, 34, 626645.
http://doi.acm.org/10.1145/28869.28876.
TMPA14, Nov. 2014
51 / 52
Model vs. reality: assumption coverage

Every assumption has a probability that it is satisfied in the actual system:
n > 3t
less likely than n > t
every message sent is received within bounded time
less likely than that it is eventually received
processes fail by crashing
less likely than they deviate arbitrarily from the prescribed behavior
non-masquerading
less likely than processes that can pretend to be someone else
TMPA14, Nov. 2014
52 / 52
Model vs. reality: assumption coverage

Every assumption has a probability that it is satisfied in the actual system:
n > 3t
less likely than n > t
every message sent is received within bounded time
less likely than that it is eventually received
processes fail by crashing
less likely than they deviate arbitrarily from the prescribed behavior
non-masquerading
less likely than processes that can pretend to be someone else
To use a distributed algorithm in practice:
one must ensure that an assumption is suitable for a given system
the probability that the system is working correctly is the probability
that the assumptions hold
(given that the distributed algorithm actually is correct)
TMPA14, Nov. 2014
52 / 52

Introduction To Fault Tolerant

Uploaded by

Copyright:

Available Formats

Introduction To Fault Tolerant

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Introduction To Fault Tolerant

Uploaded by

Copyright:

Available Formats

Model Checking of Fault-Tolerant Distributed Algorithms

Part I: Fault-Tolerant Distributed Algorithms

TMPA 2014, Kostroma, Russia

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Figure 7.1: DARTS prototype board, comprising 8 interconnected HITS chips

7.1 Assessing and validating the standard node HITS design

Figure 7.1: DARTS prototype board, comprising 8 interconnected HITS chips

7.1 Assessing and validating the standard node HITS design

Are they always working?

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

No. . . some failing systems

Quantas Airbus in-flight Learmonth upset (2008)

Ariane 501 maiden flight (1996)

Netflix outages due to Amazons cloud (ongoing)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Why do they fail?

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Why do they fail?

faults at design/implementation phase

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Why do they fail?

faults at design/implementation phase

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Why do they fail?

faults at design/implementation phase

Checking Fault-Tolerant Distributed Algos

Bringing both together

Goal: automatically verified fault-tolerant distributed algorithms

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Bringing both together

Goal: automatically verified fault-tolerant distributed algorithms

model checking FTDAs is a research challenge:

. . . fault-tolerance makes model checking harder

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Part I: Fault-tolerant distributed algorithms

Part II: Modeling fault-tolerant distributed algorithms

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Lecture overview cont.

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014

Part I: Fault-Tolerant Distributed

Josef Widder (www.forsyte.at)

Checking Fault-Tolerant Distributed Algos

TMPA14, Nov. 2014