
Unit 4


Fault Tolerance:

In a single-machine system, if there is a crash, everything fails. A distributed system has multiple nodes, so the question is how partial failures can be tolerated and recovered from. If a system has n nodes and the probability that any single node fails is p, then the probability that there is at least one failure in the system is:
P(failure) = 1 − (1 − p)^n
As n grows, this probability converges to 1. In other words, there will almost always be a failed node in a large enough distributed system.
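As a quick numerical illustration of this formula, here is a minimal Python sketch (the function name and the example values of n and p are ours, chosen only for illustration):

# Probability that at least one of n independent nodes has failed,
# given a per-node failure probability p: P(failure) = 1 - (1 - p)^n.
def prob_any_failure(n: int, p: float) -> float:
    return 1.0 - (1.0 - p) ** n

if __name__ == "__main__":
    # Even with a modest per-node failure probability of 0.1%, a large
    # cluster almost surely contains at least one failed node.
    for n in (10, 100, 1000, 10000):
        print(n, round(prob_any_failure(n, p=0.001), 4))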
Basic Concepts
As computing has become more pervasive, we need to make systems more reliable. A system’s dependability is evaluated based on:
• Availability: the percentage of time for which the system is available. The gold standard is 99.999% ("five nines") availability; a short calculation of what this allows is given after this list.
• Reliability: the system must run continuously without failure.
• Safety: system failures must not compromise the safety of client systems or lead to catastrophic consequences.
• Maintainability: system failures should be easy to repair.
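To make the availability figure concrete, the following short Python calculation (illustrative only) converts an availability percentage into the downtime it permits per year; "five nines" leaves roughly five minutes of downtime per year:

# Convert an availability target into the maximum downtime allowed per year.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    return (1.0 - availability) * MINUTES_PER_YEAR

if __name__ == "__main__":
    for a in (0.99, 0.999, 0.9999, 0.99999):
        # 99.999% availability allows roughly 5.3 minutes of downtime a year.
        print(f"{a:.3%}: {downtime_minutes_per_year(a):.1f} min/year")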
Faults and Failures
There are many types of faults, including:
• Transient faults: the system sees occasional errors while running but continues to run; the errors come and go without bringing the system down.
• Intermittent faults: the system may die occasionally, but if you restart it, it comes back up.
• Permanent faults: the system is dead and not coming back up.

Figure 1: Different types of failures


The different failure models are shown in Figure 1. Typically, fault tolerance mechanisms are assumed to provide safety against crash failures.
Arbitrary failures can also be thought of as Byzantine failures, where different behaviour is observed at different times. These faults are typically very expensive to tolerate.
Failure masking by redundancy
Figure 2: Triple modular redundancy

• Fault tolerance may be achieved by means of redundant computations and per-stage voting. The circuit shown in Figure 2 demonstrates this: each computation of the stages A, B and C is replicated and the results are aggregated by voters.
• Each device is replicated three times, and each stage in the circuit is followed by a triplicated voter. The voter is replicated as well, because a voter can also fail, and replicating it makes the system more resilient.
• Each voter is a circuit with three inputs and one output. If two or three of the inputs are the same, the output equals that input; if all three inputs are different, the output is undefined.
• This circuit is capable of tolerating one failure per stage of computation.
• To deal with crash faults only, a replication degree of 2 suffices, because a node is assumed to always produce the correct result while it is alive.
• A replication degree of 3 is needed to deal with Byzantine faults.
• Suppose that element A2 fails. Each of the voters V1, V2 and V3 gets two good (identical) inputs and one rogue input, and each of them outputs the correct value to the second stage. In essence, the effect of A2 failing is completely masked.
• Now consider what happens if B3 and C1 are also faulty, in addition to A2. These failures are also masked, because no stage has more than one fault, so the three final outputs are still correct. A sketch of the majority voter is given after this list.
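The voter logic described above can be sketched in Python as follows (an illustration only; the function names and the example "rogue replica" are ours):

from collections import Counter

def vote(a, b, c):
    """Majority voter: if at least two of the three inputs agree, output that
    value; if all three differ, the output is undefined (None here)."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None

def tmr_stage(replicas, x):
    """One stage of triple modular redundancy: run three replicas of the same
    computation and let the voter mask a single faulty replica."""
    outputs = [f(x) for f in replicas]
    return vote(*outputs)

if __name__ == "__main__":
    good = lambda x: x + 1
    rogue = lambda x: 999          # e.g. A2 producing an arbitrary wrong value
    # Two correct replicas outvote the faulty one, so the failure is masked.
    print(tmr_stage([good, rogue, good], 41))   # -> 42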
Process Resilience and Design Issues

• Protection against process failures is achieved by replicating processes into groups.
• The key approach to tolerating a faulty process is to organize several identical processes into a group.
• The key property that all groups share is that when a message is sent to the group itself, all members of the group receive it. If one process in the group fails, another process can hopefully take over for it.
• Process groups may be dynamic: new groups can be created and old groups can be destroyed, and a process can join or leave a group during system operation. A process can be a member of several groups at the same time. Consequently, mechanisms are needed for managing groups and group membership.
• The purpose of introducing groups is to allow processes to deal with collections of processes as a single abstraction.
Flat Groups versus Hierarchical Groups
Flat Groups:

• In flat groups, all processes are equal: no one is boss and all decisions are made collectively.
• Advantage: the flat group is symmetrical and has no single point of failure. If one of the processes crashes, the group simply becomes smaller but can otherwise continue.
• Disadvantage: decision making is more complicated.
Hierarchical Groups

• In hierarchical groups, some kind of hierarchy exists. For example, one process is the coordinator and all the others are workers. In this model, when a request for work is generated, either by an external client or by one of the workers, it is sent to the coordinator, which then decides which worker is best suited to carry it out and forwards it there.
• Advantage: as long as the coordinator is running, it can make decisions without bothering everyone else.
• Disadvantage: loss of the coordinator brings the entire group to a grinding halt.

Figure 3: (a) Communication in a flat group. (b) Communication in a simple hierarchical group.


Group Membership
Two approaches:

• Manage group membership in a centralized way: a group server maintains a complete database of all the groups and their exact membership. This is simple, but the server is a single point of failure.
• Manage group membership in a distributed way: an outsider sends a message to all group members announcing its wish to join the group; to leave, a member just sends a goodbye message to everyone.
• Leaving and joining have to be synchronized with the data messages being sent:
o Once a process has joined a group, it must receive all messages sent to that group.
o As soon as a process has left a group, it must not receive any more messages from the group, and the other members must not receive any more messages from it.
• If so many machines go down that the group can no longer function at all, some protocol is needed to rebuild the group. A minimal sketch of the decentralized join/leave scheme is given after this list.
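A minimal Python sketch of the decentralized join/leave scheme, assuming each member keeps its own view of the group and that join/goodbye announcements reach every current member (the class and message names are illustrative, not a real protocol):

class Member:
    """A group member that maintains its own view of the membership."""
    def __init__(self, name, view):
        self.name = name
        self.view = set(view)

    def deliver(self, msg, sender):
        if msg == "join":
            self.view.add(sender)
        elif msg == "goodbye":
            self.view.discard(sender)

def announce(group, msg, sender):
    # In a real system, join/leave must be ordered relative to data messages;
    # here the announcement is simply delivered to every current member.
    for m in group:
        m.deliver(msg, sender)

if __name__ == "__main__":
    p1 = Member("P1", {"P1", "P2"})
    p2 = Member("P2", {"P1", "P2"})
    group = [p1, p2]
    announce(group, "join", "P3")              # P3 asks to join
    group.append(Member("P3", {"P1", "P2", "P3"}))
    announce(group, "goodbye", "P1")           # P1 leaves the group
    print(sorted(p2.view))                     # ['P2', 'P3']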

Failure Masking and Replication

• Primary-based replication for fault tolerance generally appears in the form of a primary-backup protocol. A group of processes is organized in a hierarchical fashion in which a primary process coordinates all write operations.
• In practice, the primary is fixed, although its role can be taken over by one of the backups if need be. In effect, when the primary crashes, the backups execute some election algorithm to choose a new primary; a minimal sketch is given below.
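A much-simplified Python sketch of the primary-backup idea (the election rule "lowest-id live replica becomes primary" and all class names are our own assumptions, not a specific protocol):

class Replica:
    def __init__(self, rid):
        self.rid = rid
        self.store = {}
        self.alive = True

class PrimaryBackupGroup:
    def __init__(self, replicas):
        self.replicas = replicas

    def primary(self):
        # Simplified election: the live replica with the lowest id is primary.
        return min((r for r in self.replicas if r.alive), key=lambda r: r.rid)

    def write(self, key, value):
        p = self.primary()
        p.store[key] = value                     # primary applies the write...
        for r in self.replicas:
            if r.alive and r is not p:
                r.store[key] = value             # ...and forwards it to backups

if __name__ == "__main__":
    group = PrimaryBackupGroup([Replica(1), Replica(2), Replica(3)])
    group.write("x", 10)
    group.replicas[0].alive = False              # the primary crashes
    group.write("y", 20)                         # replica 2 has taken over
    print(group.primary().rid, group.primary().store)   # 2 {'x': 10, 'y': 20}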
Agreement in Faulty Systems

• The two main types of faults are crash failures and Byzantine faults. Tolerating crash failures means dealing with servers that crash silently; such failures can be detected by sending "heartbeat" messages.
• In a system with only silent (crash) faults, if up to k faults can occur simultaneously, then k + 1 nodes in total are needed to reach agreement.
• Under Byzantine faults, a server may produce arbitrary responses at arbitrary times, and a higher degree of replication is needed to deal with them: detecting k Byzantine faults requires 2k + 1 processes. Byzantine faults are much more difficult to deal with. These bounds are summarized in the small helper below.
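The replication requirements stated above (and refined in the next subsection, where 3k + 1 processes are needed to reach agreement under Byzantine faults) can be captured in a small helper; a sketch:

def processes_needed(k: int, fault_model: str) -> int:
    """Total number of processes needed to cope with k simultaneous faults,
    following the bounds stated in these notes."""
    if fault_model == "crash":             # silent (fail-stop) faults
        return k + 1
    if fault_model == "byzantine-detect":  # merely detect that a fault exists
        return 2 * k + 1
    if fault_model == "byzantine-agree":   # still reach agreement
        return 3 * k + 1
    raise ValueError(f"unknown fault model: {fault_model}")

if __name__ == "__main__":
    print([processes_needed(1, m)
           for m in ("crash", "byzantine-detect", "byzantine-agree")])  # [2, 3, 4]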
Byzantine Generals Problem (consensus problem)

• In this case we do not assume that the processes are faulty, but rather that the network is faulty: the network can deliver a message, delete it, or change it.
• Consider first the two-generals scenario: two generals in spatially separated camps want to reach consensus on whether to attack a fort. The attack succeeds only if both generals attack. They communicate by sending messengers: each general sends a vote message containing "attack" or "retreat," and the other general sends an "ack" for the vote.
• However, the messengers are sometimes killed by the enemy before delivering the message. The communication channel is therefore unreliable, and neither general knows whether its vote or ack was reliably received. In this setting, it is provably impossible for the generals to reach consensus.
• In the example above, the network was unreliable. If the communication network is reliable but there are N generals of whom M might be traitors, one solution for reaching consensus is as follows.
Byzantine nodes can be thought of as traitors that try to prevent the system from reaching consensus. The solution is as follows:

• Each node collects information from all other nodes and sends it back to all the others, so that each node can see the view of the world from the perspective of every other node.
• By simple voting, each node can then either accept a single correct value or identify a Byzantine failure.
• In a system with k such faults, 2k + 1 total nodes are needed merely to detect that a fault is present, while 3k + 1 total nodes are needed to reach agreement despite the faults.
Figure 4: The Byzantine agreement problem for three nonfaulty and one faulty process. (a)
Each process sends their value to the others. (b) The vectors that each process assembles based
on (a). (c) The vectors that each process receives in step 3.

In Fig. 4 (a) we see that process 1 reports 1, process 2 reports 2, process 3 lies to everyone,
giving x, y, and z, respectively, and process 4 reports a value of 4.
In step 2, the results of the announcements of step 1 are collected together in the form of the
vectors of Fig. 4 (b).
Step 3 consists of every process passing its vector from Fig. 4 (b) to every other process. In this
way, every process gets three vectors, one from every other process.
Here, too, process 3 lies, inventing 12 new values, a through l. The results of step 3 are shown
in Fig. 4 (c).
Finally, in step 4, each process examines the i-th element of each of the newly received vectors. If any value has a majority, that value is put into the result vector.
From Fig. 4 (c) we see that processes 1, 2, and 4 all come to agreement on the values for v1, v2, and v4 (the values received from the non-faulty processes), which is the correct result. What these processes conclude regarding v3 (the value received from the faulty process) cannot be decided, but it is also irrelevant. A sketch of this exchange in code is given below.
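The exchange in Figure 4 can be reconstructed as the following Python sketch for n = 4 processes and one traitor (the way the traitor lies and all names are assumptions made for the example):

from collections import Counter

N = 4
TRAITOR = 3                              # process 3 lies to everyone
true_value = {1: 1, 2: 2, 3: 3, 4: 4}

def announced_value(sender, receiver):
    # Step 1: every process announces its value; the traitor may tell each
    # receiver something different (x, y, z in the figure).
    return f"lie-to-{receiver}" if sender == TRAITOR else true_value[sender]

# Step 2: each process assembles a vector of what it was told.
vectors = {p: {q: (true_value[p] if q == p else announced_value(q, p))
               for q in range(1, N + 1)}
           for p in range(1, N + 1)}

def relayed_vector(sender, receiver):
    # Step 3: every process forwards its vector; the traitor invents new
    # values (a through l in the figure).
    if sender == TRAITOR:
        return {q: f"fake-{receiver}-{q}" for q in range(1, N + 1)}
    return vectors[sender]

def decide(p):
    received = [relayed_vector(q, p) for q in range(1, N + 1) if q != p]
    result = {}
    for q in range(1, N + 1):
        # Step 4: majority over the q-th element of the newly received vectors.
        value, count = Counter(v[q] for v in received).most_common(1)[0]
        result[q] = value if count > len(received) // 2 else "UNKNOWN"
    return result

if __name__ == "__main__":
    for p in (1, 2, 4):          # the non-faulty processes agree on v1, v2, v4
        print(p, decide(p))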

Failure Detection

• When it comes to detecting process failures, there are essentially only two mechanisms: either processes actively send "are you alive?" messages to each other, or they passively wait until messages come in from other processes.
• The latter approach makes sense only when it can be guaranteed that there is enough communication between processes. In practice, actively pinging processes is the usual approach.
• What it all boils down to is that a timeout mechanism is used to check whether a process has failed. In real settings there are two major problems with this approach. First, due to unreliable networks, simply declaring that a process has failed because it does not answer a ping message may be wrong: the answer might be lost in transit, or the link to the probing process might be broken.
• In other words, it is quite easy to generate false positives; in the worst case, a perfectly healthy process may be removed from the membership list.
• A better solution is, on noticing a timeout on a ping message, to ask other neighbours whether they can reach the presumably failed node. Positive information can also be shared: if a node is still alive, that information can be forwarded to other interested parties.
• Failure detection can also take place through gossiping, in which each node regularly announces to its neighbours that it is still up and running. A sketch of a timeout-based detector with neighbour confirmation is given after this list.
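A minimal Python sketch of a timeout-based failure detector that asks neighbours before declaring a node failed (the timeout value and the neighbour-probe interface are assumptions for illustration):

import time

SUSPECT_TIMEOUT = 2.0   # seconds without a heartbeat before we suspect a node

class FailureDetector:
    def __init__(self, neighbour_probes):
        # neighbour_probes: callables node -> bool ("can you still reach it?")
        self.neighbour_probes = neighbour_probes
        self.last_heard = {}

    def heartbeat(self, node):
        # Record an "I am alive" message (or any message) from the node.
        self.last_heard[node] = time.monotonic()

    def status(self, node):
        elapsed = time.monotonic() - self.last_heard.get(node, float("-inf"))
        if elapsed < SUSPECT_TIMEOUT:
            return "alive"
        # Our ping timed out; ask neighbours before declaring failure, to
        # avoid a false positive caused by a lost reply or a broken link.
        if any(probe(node) for probe in self.neighbour_probes):
            return "alive"
        return "failed"

if __name__ == "__main__":
    fd = FailureDetector(neighbour_probes=[lambda node: False])
    fd.heartbeat("P2")
    print(fd.status("P2"))        # "alive": the heartbeat is recent
    print(fd.status("P9"))        # "failed": never heard from, nobody reaches it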

RELIABLE GROUP COMMUNICATION


Reaching Agreement
If message delivery is unbounded, no agreement can be reached even if only one process fails, because slow processes are indistinguishable from faulty ones. If the processes themselves can be faulty, an appropriate fault model can be used, such as BAR fault tolerance, in which nodes can be Byzantine, altruistic, or rational.

Reliable One-To-One Communication


One-to-one communication involves communication between a client process and a server process using RPCs, RMIs, etc. To reach agreement we need one-to-many communication (multicast or broadcast), so the one-to-one scenario must be extended to one-to-many in order to solve the agreement problem. Let us first distinguish between five different classes of failures that can occur in RPC systems.

Figure 1: Types of failures in the one-to-one scenario. (a) Client unable to locate server. (b) Lost request
messages. (c) Server crashes after receiving request. (d) Lost reply messages. (e) Client crashes after
sending request.
Figure 1 depicts several failure modes in the one-to-one scenario. These failures can be dealt with by (1) using reliable transport protocols such as TCP, or (2) handling failures at the application layer. TCP masks omission failures, which occur in the form of lost messages, by using acknowledgments and retransmissions. However, crash failures of connections are not masked. A crash failure occurs when a TCP connection is abruptly broken so that no more messages can be transmitted through the channel. In most cases, the client is informed that the channel has crashed by raising an exception. The only way to mask such failures is to let the distributed system attempt to set up a new connection automatically, by simply resending a connection request. A client-side retry sketch is given below.
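A hedged sketch for the "lost request" and "lost reply" cases: retransmit after a timeout and treat a broken channel as a separate condition. Retrying blindly is only safe if the operation is idempotent or the server filters duplicates; send_request is a placeholder callable, not a real RPC library API.

import socket

def call_with_retries(send_request, request, retries=3, timeout=1.0):
    """Retry an RPC-style request when no reply arrives within the timeout.

    Note: a lost reply is indistinguishable from a server that crashed after
    executing the request, so retries assume the operation is idempotent.
    """
    last_error = None
    for _ in range(retries):
        try:
            return send_request(request, timeout=timeout)
        except socket.timeout as exc:       # lost request or lost reply
            last_error = exc
            continue
        except ConnectionError as exc:      # channel crashed: caller must reconnect
            last_error = exc
            break
    raise RuntimeError(f"request failed after retries: {last_error}")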
Reliable One-To-Many Communication
Broadcast is sending a message to all nodes in a network; multicast is sending to a subset of the nodes. If messages are lost due to network problems, they need to be retransmitted after a timeout. There are two ways to organize this: ACK-based schemes and NACK-based schemes.
ACK-based schemes:
• The receiver sends an acknowledgement (ACK) for each message it receives. If the sender does not receive an ACK from a receiver, it retransmits the message after a timeout.
• The sender becomes a bottleneck, so ACK-based schemes do not scale well: as the number of receivers in the multicast group grows (say to 1,000 or 10,000), the number of ACK messages that must be processed grows with it.
• ACK-based retransmission works well for one-to-one communication but does not scale to one-to-many communication; a large amount of bandwidth is consumed by acknowledgments, resulting in an ACK explosion.

NACK-based schemes:
• NACK-based schemes deal with the sender becoming a bottleneck and with the ACK-explosion issue.
• An ACK indicates that a packet was received; a NACK indicates that a packet was missed.

Figure 2: A simple solution to reliable multicasting when all receivers are known and are assumed not
to fail. (a) Message transmission. (b) Reporting feedback.
In Fig. 2 (a), all the receivers have received packet #24 as their last packet, except receiver 3, which missed it; its last packet is therefore #23. As soon as it receives packet #25, it knows it missed packet #24.
In Fig. 2 (b), each receiver sends feedback to the sender, either acknowledging packet #25 or reporting the missing packet #24. As we can see, for a single packet the sender receives N feedback messages.
• Scheme explanation: the sender multicasts a packet to the group; receivers that receive the packet do nothing. If a receiver notices a missing packet, it sends a NACK to nearby receivers as well as to the sender. The sender or a neighbouring receiver then retransmits the missed packet. This optimization works only if the neighbouring receivers keep received packets in a buffer.
• The sender receives complaints only about missed packets, so this scheme scales well for multicast: the number of NACKs is far smaller than the number of ACKs unless there is massive packet loss.
• NACK-based schemes are therefore much more scalable than ACK-based schemes. A sketch of gap detection at a receiver is given after this list.
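A receiver-side Python sketch of NACK-based gap detection using sequence numbers (the class name and the callback are illustrative):

class NackReceiver:
    """Detect gaps in the packet sequence and complain only about misses."""
    def __init__(self, first_expected, send_nack):
        self.expected = first_expected   # next sequence number we expect
        self.send_nack = send_nack       # callback to the sender / nearby receivers

    def receive(self, seq):
        # Everything between what we expected and what just arrived was missed.
        for missing in range(self.expected, seq):
            self.send_nack(missing)
        self.expected = max(self.expected, seq + 1)

if __name__ == "__main__":
    r = NackReceiver(first_expected=24, send_nack=lambda s: print("NACK for", s))
    r.receive(25)        # packet #24 was lost, as for receiver 3 in the figure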

Figure 3: Each receiver now suppresses their ACK feedback. Only receiver 3 sends a NACK to other
receivers and the sender.
This scheme only addresses how to send a message to all members of the group; it does not address other properties of multicast, such as:
• FIFO order: messages from a given sender are delivered in the same order in which they were sent.
• Total order: all processes receive messages in the same order. Total order by itself does not require FIFO order.
• Causal order: based on the happens-before relationship. If send(m1) happens before send(m2), then receive(m1) must also happen before receive(m2) at the receiving processes.
Scalability in Reliable Multicasting
The main problem with the reliable multicast scheme just described is that it cannot support large
numbers of receivers. If there are N receivers, the sender must be prepared to accept at least N
acknowledgments. With many receivers, the sender may be swamped with such feedback messages,
which is also referred to as a feedback implosion.
One solution to this problem is not to have receivers acknowledge the receipt of a message. Instead, a
receiver returns a feedback message only to inform the sender it is missing a message. Returning only
such negative acknowledgments can be shown to generally scale better, but no hard guarantees can be
given that feedback implosions will never happen.
Another problem with returning only negative acknowledgments is that the sender will, in theory, be
forced to keep a message in its history buffer forever. Because the sender can never know if a message
has been correctly delivered to all receivers, it should always be prepared for a receiver requesting the
retransmission of an old message. In practice, the sender will remove a message from its history buffer
after some time has elapsed to prevent the buffer from overflowing. However, removing a message is
done at the risk of a request for a retransmission not being honoured.
The key issue to scalable solutions for reliable multicasting is to reduce the number of feedback
messages that are returned to the sender. A popular model that has been applied to several wide-area
applications is feedback suppression. This scheme underlies the Scalable Reliable Multicasting (SRM)
protocol.
In SRM, receivers never acknowledge the successful delivery of a multicast message; instead, they report only when they are missing a message. How message loss is detected is left to the application. Only negative acknowledgments are returned as feedback, and whenever a receiver notices that it missed a message, it multicasts its feedback to the rest of the group. A sketch of how such feedback can be suppressed is given below.
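One common way to realize feedback suppression (as in SRM) is to delay a NACK by a small random amount and cancel it if another receiver's NACK for the same message is heard first. A minimal Python sketch, with the timing driven explicitly by the caller (all names are illustrative):

import random

class SuppressingReceiver:
    """Schedule NACKs after a random delay and suppress duplicates."""
    def __init__(self, multicast_nack, max_delay=1.0):
        self.multicast_nack = multicast_nack   # NACK goes to the whole group
        self.max_delay = max_delay
        self.pending = {}                      # seq -> time at which to send NACK

    def detected_missing(self, seq, now):
        self.pending.setdefault(seq, now + random.uniform(0.0, self.max_delay))

    def heard_nack(self, seq):
        # Another receiver already complained about this message: suppress ours.
        self.pending.pop(seq, None)

    def tick(self, now):
        for seq, fire_at in list(self.pending.items()):
            if now >= fire_at:
                self.multicast_nack(seq)
                del self.pending[seq]

if __name__ == "__main__":
    r = SuppressingReceiver(multicast_nack=lambda s: print("NACK for", s))
    r.detected_missing(24, now=0.0)
    r.heard_nack(24)        # someone else's NACK arrives first
    r.tick(now=2.0)         # nothing is sent: our NACK was suppressed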

Atomic multicast
What is often needed in a distributed system is the guarantee that a message is delivered either to all processes or to none at all.
Atomic multicast provides exactly this all-or-nothing guarantee: either every process in a group receives the message, or no process does.
In addition, it is generally also required that all messages are delivered in the same order to all processes. Together, this is known as the atomic multicast problem.
Replicated databases: we cannot have a scenario in which M out of N database replicas have executed some update while the rest have not. It must be ensured that every update to the database is applied by all replicas or by none.
Group view: each message is uniquely associated with a group of processes, namely the view in which it is sent.
If there is a crash:
• Either every process blocks, because the 'all' constraint can no longer be satisfied.
• Or all remaining members agree on a group change, and the crashed process is ejected from the group.
• If the process rejoins, it has to run techniques to resynchronize with the group so that it is in a consistent state.
Figure 3: Atomic multicast
As shown in Figure 3, initially all processes are up and are part of the group {P1, P2, P3, P4}, and all messages are reliably multicast to each of the processes. At dotted line 2, P3 crashes while sending a message. From this point onwards, the group {P1, P2, P3, P4} can no longer maintain the 'all' property of atomic multicast. Hence P1, P2 and P4 agree on a group change and then continue atomic multicast amongst themselves (the new group). At a later point P3 recovers and rejoins; at that point it runs synchronization algorithms to bring itself up to date with the other members of the group it wants to rejoin. A simplified sketch of this view-change behaviour follows.
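A much-simplified Python sketch of the view-change behaviour in Figure 3 (delivery within the current view only; real atomic multicast additionally needs buffering and agreement on which messages belong to which view):

class Group:
    """Deliver multicasts only within the current group view."""
    def __init__(self, members):
        self.view = set(members)
        self.delivered = {m: [] for m in members}

    def multicast(self, msg):
        # All-or-nothing within the current view: deliver to every member.
        for m in self.view:
            self.delivered[m].append(msg)

    def member_crashed(self, member):
        # The survivors agree on a view change that ejects the crashed member.
        self.view.discard(member)

if __name__ == "__main__":
    g = Group({"P1", "P2", "P3", "P4"})
    g.multicast("m1")                 # delivered to all four processes
    g.member_crashed("P3")            # P3 crashes; new view is {P1, P2, P4}
    g.multicast("m2")                 # delivered only within the new view
    print(sorted(g.view), g.delivered["P3"], g.delivered["P4"])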
Message Ordering
Virtual synchrony allows an application developer to think about multicasts as taking place in epochs
that are separated by group membership changes. However, nothing has yet been said concerning the
ordering of multicasts. In general, four different orderings are distinguished:

1. Unordered multicasts
2. FIFO-ordered multicasts
3. Causally-ordered multicasts
4. Totally-ordered multicasts

A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given
concerning the order in which received messages are delivered by different processes.

Figure 4: Three communicating processes in the same group. The ordering of events per process is
shown along the vertical axis.
Now suppose a sender P1 multicasts two messages to a group while two other processes in that group
are waiting for messages to arrive, as shown in Fig. 4. Assuming that processes do not crash or leave
the group during these multicasts, it is possible that the communication layer at P2 first receives message
m1 and then m2. Because there are no message-ordering constraints, the messages may be delivered to
P2 in the order that they are received. In contrast, the communication layer at P3 may first receive
message m2 followed by m1, and delivers these two in this same order to P3.

Figure 5: Four processes in the same group with two different senders, and a possible delivery order of
messages under FIFO-ordered multicasting.
In the case of reliable FIFO-ordered multicasts, the communication layer is forced to deliver
incoming messages from the same process in the same order as they have been sent (Fig 5). With FIFO
ordering, the only thing that matters is that message m1 is always delivered before m2, and, likewise,
that message m3 is always delivered before m4. This rule has to be obeyed by all processes in the group.
In other words, when the communication layer at P3 receives m2 first, it will wait with delivery to P3
until it has received and delivered m1.
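A Python sketch of the hold-back logic for FIFO-ordered delivery, using per-sender sequence numbers (an illustration of the idea, not a complete protocol):

from collections import defaultdict

class FifoReceiver:
    """Deliver messages from each sender in the order they were sent."""
    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected seq number per sender
        self.holdback = defaultdict(dict)  # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, msg):
        self.holdback[sender][seq] = msg
        # Deliver as many consecutive in-order messages from this sender as possible.
        while self.next_seq[sender] in self.holdback[sender]:
            self.delivered.append(self.holdback[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1

if __name__ == "__main__":
    r = FifoReceiver()
    r.receive("P1", 1, "m2")    # m2 arrives first but is held back
    r.receive("P1", 0, "m1")    # now m1 and then m2 are delivered
    print(r.delivered)          # ['m1', 'm2']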

Finally, reliable causally-ordered multicast delivers messages so that potential causality between
different messages is preserved. In other words, if a message m1 causally precedes another message
m2, regardless of whether they were multicast by the same sender, then the communication layer at
each receiver will always deliver m2 after it has received and delivered m1. Note that causally ordered
multicasts can be implemented using vector timestamps.
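A sketch of causally ordered delivery with vector timestamps, following the usual delivery rule: a message from sender j carrying vector V is delivered once V[j] equals the local clock's entry for j plus one and V[k] is no greater than the local entry for every other k (all names are illustrative):

class CausalReceiver:
    """Hold back a message until all messages it causally depends on arrive."""
    def __init__(self, n):
        self.clock = [0] * n          # how many messages we delivered per sender
        self.queue = []               # (sender, vector, msg) awaiting delivery
        self.delivered = []

    def _deliverable(self, sender, vec):
        return (vec[sender] == self.clock[sender] + 1 and
                all(vec[k] <= self.clock[k]
                    for k in range(len(vec)) if k != sender))

    def receive(self, sender, vec, msg):
        self.queue.append((sender, vec, msg))
        progress = True
        while progress:
            progress = False
            for item in list(self.queue):
                s, v, m = item
                if self._deliverable(s, v):
                    self.queue.remove(item)
                    self.delivered.append(m)
                    self.clock[s] += 1
                    progress = True

if __name__ == "__main__":
    r = CausalReceiver(n=3)
    # m2 (from process 1) causally depends on m1 (from process 0) but arrives
    # first, so it is held back until m1 has been delivered.
    r.receive(1, [1, 1, 0], "m2")
    r.receive(0, [1, 0, 0], "m1")
    print(r.delivered)               # ['m1', 'm2']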

Total-ordered delivery means that regardless of whether message delivery is unordered, FIFO ordered,
or causally ordered, it is required additionally that when messages are delivered, they are delivered in
the same order to all group members.
