Fault Tolerance
Part I Introduction
Part II Process Resilience
Part III Reliable Communication
Part IV Distributed Commit
Part V Recovery
Chapter 8
FAULT TOLERANCE
Part I
Introduction
FAULT TOLERANCE
A distributed system (DS) should be fault-tolerant
⚫ It should be able to continue functioning in the presence of faults
Fault tolerance is closely related to dependability, which covers:
⚫ Availability
⚫ Reliability
⚫ Safety
⚫ Maintainability
AVAILABILITY & RELIABILITY (1)
Availability: A measurement of whether a system is ready to be
used immediately
⚫ System is up and running at any given moment
FAULT TOLERANCE
Part II
Process Resilience
PROCESS RESILIENCE
Mask process failures by replication
FAULT TOLERANCE
Part III
Reliable Communication
RELIABLE GROUP
COMMUNICATION
For process resilience to take place, there has to be reliable
process communication.
Reliable multicast services guarantee that messages are
delivered to all members in a process group
⚫ When a group is static and processes do not fail
[Figure: the group view changes from Gi = (A, B, C) to Gi+1 = (B, C)]
VIRTUAL SYNCHRONY
IMPLEMENTATION: [BIRMAN ET AL., 1991]
Only stable messages are delivered
Stable message: a message received by all processes in the
message’s group view
Assumptions (can be ensured by using TCP):
⚫ Point-to-point communication is reliable
⚫ Point-to-point communication ensures FIFO-ordering
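The stability rule above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from Birman et al.: the class `StabilityTracker` and its method names are hypothetical, and receipts are recorded by direct method calls instead of arriving over reliable FIFO channels. A message is delivered only once every member of the current view is known to have received it.

```python
class StabilityTracker:
    """Buffers messages until every member of the group view has received them."""

    def __init__(self, view):
        self.view = frozenset(view)   # members of the current group view
        self.received_by = {}         # message id -> members known to hold it
        self.delivered = []           # message ids delivered, in delivery order

    def record_receipt(self, msg_id, member):
        """Note that `member` has received `msg_id`; deliver it once stable."""
        if member not in self.view:
            raise ValueError(f"{member} is not in the current view")
        self.received_by.setdefault(msg_id, set()).add(member)
        if self.is_stable(msg_id) and msg_id not in self.delivered:
            self.delivered.append(msg_id)

    def is_stable(self, msg_id):
        """A message is stable when all members of the view have received it."""
        return self.received_by.get(msg_id, set()) == self.view

    def unstable(self):
        """Message ids that are still buffered because they are not stable."""
        return sorted(m for m in self.received_by if not self.is_stable(m))


tracker = StabilityTracker({"P1", "P2", "P3"})
tracker.record_receipt("m1", "P1")
tracker.record_receipt("m1", "P2")
print(tracker.delivered, tracker.unstable())   # m1 is still buffered
tracker.record_receipt("m1", "P3")
print(tracker.delivered, tracker.unstable())   # m1 is now stable and delivered
```

In a real system the receipt information would come from acknowledgements exchanged among the group members; the sketch only shows the bookkeeping.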
VIRTUAL SYNCHRONY
IMPLEMENTATION: EXAMPLE
Every process
⚫ After receiving a flush message from all processes in Gi+1, installs Gi+1
IMPLEMENTING VIRTUAL SYNCHRONY
If the sender of m to G fails before completing its multicast,
there must be another way of ensuring that m reaches those
processes in G that have not yet received it.
Every process in G keeps m until it knows for sure that
all members in G have received it.
If m has been received by all members in G, m is said to
be stable. Only stable messages are allowed to be
delivered.
To ensure stability, it is sufficient to select an arbitrary
(operational) process in G and request it to send m to all
other processes.
When a process P receives the view-change message for
Gi+1, it first forwards a copy of any unstable message
from Gi that it still has to every process in Gi+1, and
subsequently marks it as being stable.
To indicate that P no longer has any unstable messages
and that it is prepared to install Gi+1 as soon as the other
processes can do that as well, it multicasts a flush
message for Gi+1.
After P has received a flush message for Gi+1 from each
other process, it can safely install the new view.
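The three steps of the view change (forward unstable messages, multicast a flush, install the view) can be simulated as follows. This is a simplified sketch under strong assumptions: the names `Process` and `view_change` are hypothetical, message passing is replaced by direct in-memory updates, the steps run sequentially, and no process fails during the view change.

```python
class Process:
    def __init__(self, name, unstable=()):
        self.name = name
        self.view = None                  # currently installed view
        self.received = set(unstable)     # every message this process holds
        self.unstable = set(unstable)     # held messages not yet known to be stable
        self.flushes = set()              # senders of flush messages for the new view


def view_change(processes, new_view):
    """Failure-free simulation of installing `new_view` on its members."""
    new_view = frozenset(new_view)
    members = [p for p in processes if p.name in new_view]

    # Step 1: each member forwards its unstable messages to every member
    # of the new view, then marks them as stable.
    for p in members:
        for msg in sorted(p.unstable):
            for q in members:
                q.received.add(msg)       # duplicates are absorbed by the set
        p.unstable.clear()

    # Step 2: each member multicasts a flush message for the new view.
    for p in members:
        for q in members:
            if q is not p:
                q.flushes.add(p.name)

    # Step 3: a member installs the view once it has a flush from every other member.
    for p in members:
        if p.flushes == new_view - {p.name}:
            p.view = new_view
            p.flushes.clear()


# Hypothetical scenario: a sender crashed after m reached only P1.
p1, p2, p3 = Process("P1", unstable={"m"}), Process("P2"), Process("P3")
view_change([p1, p2, p3], {"P1", "P2", "P3"})
for p in (p1, p2, p3):
    print(p.name, sorted(p.received), sorted(p.view))
```

After the call, every surviving process holds m and has installed the same view, which is the property the flush step is meant to provide.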
Major flaw in this protocol:
⚫ It cannot deal with process failures while a new view
change is being announced
Solution: announce view changes for any view Gi+k
even while previous changes have not yet been installed
by all processes.
DISTRIBUTED COMMIT
The atomic multicasting problem discussed previously is an
example of a more general problem, known as distributed
commit.