UNIT 4 DC Final Copy

UNIT IV
CONSENSUS AND RECOVERY
Consensus and Agreement Algorithms: Problem Definition –

Overview of Results – Agreement in a Failure-Free
System(Synchronous and Asynchronous) – Agreement in
Synchronous Systems with Failures; Check pointing and Rollback
Recovery: Introduction – Background and Definitions – Issues in
Failure Recovery – Checkpoint-based Recovery – Coordinated
Check pointing Algorithm – – Algorithm for Asynchronous
Checkpointing and Recovery
Problem Definition
Agreement among the processes in a distributed system is a fundamental

requirement for a wide range of applications.
Many forms of coordination require the processes to exchange information to

negotiate with one another and eventually reach a common understanding or
agreement, before taking application-specific actions.
A classical example is that of the commit decision in database systems, wherein the
processes collectively decide whether to commit or abort a transaction that they
participate in.
We first state some assumptions underlying our study of agreement algorithms:
• Failure models Among the n processes in the system, at most f processes can be
faulty. A faulty process can behave in any manner allowed by the failure model
assumed. The various failure models – fail-stop, send omission and receive omission,
and Byzantine failures.
• Synchronous/asynchronous communication If a failure-prone process chooses

to send a message to process Pi but fails, then Pi cannot detect the non-arrival of the
message in an asynchronous system.
• In a synchronous system, however, the scenario in which a message has not been
sent can be recognized by the intended recipient, at the end of the round.
• Network connectivity The system has full logical connectivity, i.e., each
process can communicate with any other by direct message passing.
• Sender identification A process that receives a message always knows the

identity of the sender process.
• Channel reliability The channels are reliable, and only the processes may fail
(under one of various failure models).
• Authenticated vs. non-authenticated messages With unauthenticated

messages, when a faulty process relays a message to other processes,
• (i) it can forge the message and claim that it was received from another process,
and
• (ii) it can also tamper with the contents of a received message before relaying it.
When a process receives a message, it has no way to verify its authenticity.
• An unauthenticated message is also called an oral message or an unsigned

message. Using authentication via techniques such as digital signatures, it is easier to
solve the agreement problem because, if some process forges a message or tampers
with the contents of a received message before relaying it, the recipient can detect the
forgery or tampering.
• Thus, faulty processes can inflict less damage.
• Agreement variable The agreement variable may be boolean or multivalued,

and need not be an integer.
The Byzantine agreement
The Byzantine agreement problem requires a designated process, called the source
process, with an initial value
Problem definition agreement with the other processes about its initial value, subject
to the following conditions:
• Agreement All non-faulty processes must agree on the same value.
• Validity If the source process is non-faulty, then the agreed upon value by all the non-
faulty processes must be the same as the initial value of the source.
• Termination Each non-faulty process must eventually decide on a value. The validity
condition rules out trivial solutions, such as one in which the agreed upon value is a
constant.
The consensus problem
The consensus problem differs from the Byzantine agreement problem in that each
process has an initial value and all the correct processes must agree on a single value
• Agreement All non-faulty processes must agree on the same (single) value.
• Validity If all the non-faulty processes have the same initial value, then the agreed
upon value by all the non-faulty processes must be that same value.
• Termination Each non-faulty process must eventually decide on a value.
The interactive consistency problem
The interactive consistency problem differs from the Byzantine agreement problem in
that each process has an initial value, and all the correct processes must agree upon a
set of values, with one value for each process.
• Agreement All non-faulty processes must agree on the same array of values
A[v1…vn]
• Validity If process i is non-faulty and its initial value is vi, then all nonfaulty
processes agree on vi as the ith element of the array A. If process j is faulty, then the
non-faulty processes can agree on any value for A[j].
• Termination Each non-faulty process must eventually decide on the array A

Overview of results:
Failure mode Synchronous system Asynchronous system

(message-passing and (message-passing and shared
shared memory) memory)
No Failure Agreementattainable;common agreement attainable;
knowledge attainable
concurrent common knowledge
Crash Failure agreement attainable f < n agreement not attainable
processes
Byzantine agreement attainable agreement not attainable
Failure f ≤ [(n - 1)/3] Byzantine
processes
AGREEMENT IN A FAILURE-FREE SYSTEM (SYNCHRONOUS OR
ASYNCHRONOUS)
In a failure-free system, consensus can be reached by collecting information from

the different processes, arriving at a “decision,” and distributing this decision in the
system.
A distributed mechanism would have each process broadcast its values to others,
and each process computes the same function on the values received.
The decision can be reached by using an applicationspecific function – some

simple examples being the majority, max, and min functions.
Algorithms to collect the initial values and then distribute the decision may be
based on the token circulation on a logical ring, or the three phase
Consensus and agreement algorithms tree-based broadcast–converge cast–

broadcast, or direct communication with all nodes.
AGREEMENT IN (MESSAGE-PASSING) SYNCHRONOUS SYSTEMS WITH

FAILURES CONSENSUS ALGORITHM FOR CRASH FAILURES
(SYNCHRONOUS SYSTEM)
• The consensus algorithm for n processes where up to f processes where f < n

may fail in a fail stop failure model.
• Here the consensus variable x is integer value; each process has initial value xi.
• If up to f failures are to be tolerated than algorithm has f+1 rounds, in each

round a process i sense the value of its variable xi to all other processes if that
value has not been sent before.
• So, of all the values received within that round and its own value xi at that start
of the round the process takes minimum and updates xi occur f + 1 rounds the
local value xi guaranteed to be the consensus value.
• If one process is faulty, among three processes then f = 1. So the agreement

requires f + 1 that is equal to two rounds.
• If it is faulty let us say it will send 0 to 1 process and 1 to another process i, j
and k. Now,on receiving one on receiving 0 it will broadcast 0 over here and this
particular process on receiving 1 it will broadcast 1 over here.
• So, this will complete one round in this one round and this particular process on
receiving 1 it will send 1 over here and this on the receiving 0 it will send 0 over
here.
The agreement condition is satisfied because in the f+1 rounds ,there must be atleast
one round in which no process failed.
• In this round, say round r,all the processes that have not failed so far succeed in
broadcasting their values, and all these processes take the minimum of the values
broadcast and received in that round.
• Thus,the local values at the end of the round are the same,say x ri for all non-failed
processes.
• In further rounds,only this value maybe sent by each process atmost once,and no
process I will update its value x ri.
• The validity condition is satisfied because processes do not send fictitious values in
this failure model.
• For all i,if the initialvalue is identical,then the only value sent by any process is the
value that has been agreed upon as per the agreement condition.
• The termination condition is seen to be satisfied.

Complexity: The complexity of this particular algorithm is it requires f + 1 rounds where f < n and the
number of messages is O(n2)in each round and each message has one integers hence the total number
of messages is O((f+1)· n 2 ) is the total number of rounds and in each round n 2 messages are required.
Consensus algorithms for Byzantine failures (synchronous system)
STEPS FOR BYZANTINE GENERALS (ITERATIVE FORMULATION),

SYNCHRONOUS, MESSAGE-PASSING:
STEPS FOR BYZANTINE GENERALS (RECURSIVE FORMULATION),
SYNCHRONOUS, MESSAGE-PASSING:
CODE FOR THE PHASE KING ALGORITHM:
Each phase has a unique "phase king" derived, say, from

PID. Each phase has two rounds:
• 1 in 1st round, each process sends its estimate to all other processes.
• 2 in 2nd round, the "Phase king" process arrives at an estimate based on the
values it received in 1st round, and broadcasts its new estimate to all others.
Fig. Message pattern for the phase -king algorithm.
PHASE KING ALGORITHM CODE:
(f + 1) phases, (f + 1)[( n - 1)( n + 1)] messages, and can tolerate up to f < dn=4e
malicious processes
Correctness Argument
• 1 Among f + 1 phases, at least one phase k where phase-king is non-malicious.
• 2 In phase k, all non-malicious processes Pi and Pj will have same

estimate of consensus value as Pk does.
• Pi and Pj use their own majority values. Pi 's mult> n=2 + f )
• Pi uses its majority value; Pj uses phase-king's tie-breaker value. (Pi’s mult> n=2 +
f ,Pj 's mult> n=2 for same value)
• Pi and Pj use the phase-king's tie-breaker value. (In the phase in which Pk is non-
malicious, it sends same value to Pi and Pj )
In all 3 cases, argue that Pi and Pj end up with same value as estimate
• If all non-malicious processes have the value x at the start of a phase, they will
continue to have x as the consensus value at the end of the phase.
Check pointing and rollback recovery: Introduction
• Rollback recovery protocols restore the system back to a consistent state after a failure,
• It achieves fault tolerance by periodically saving the state of a process during the
failure- free execution
• It treats a distributed system application as a collection of processes that communicate
over a network Checkpoints
The saved state is called a checkpoint, and the procedure of restarting from a
previously check pointed state is called rollback recovery.
A checkpoint can be saved on either the stable storage or the volatile storage
Why is rollback recovery of distributed systems complicated?
Messages induce inter-process dependencies during failure-free operation
Rollback propagation
The dependencies among messages may force some of the processes that did not fail
to roll back. This phenomenon of cascaded rollback is called the domino effect.
Uncoordinated check pointing
If each process takes its checkpoints independently, then the system cannot avoid the
domino effect – this scheme is called independent or uncoordinated check pointing
Techniques that avoid domino effect
1. Coordinated check pointing rollback recovery - Processes coordinate their checkpoints
to form a system-wide consistent state
2. Communication-induced check pointing rollback recovery - Forces each process to
take checkpoints based on information piggybacked on the application.
3. Log-based rollback recovery - Combines check pointing with logging of non-
deterministic events
• relies on piecewise deterministic (PWD) assumption.
Background and definitions
System model
• A distributed system consists of a fixed number of processes, P1, P2,…_ PN , which
communicate only through messages.
•
Processes cooperate to execute a distributed application and interact with the outside
world by receiving and sending input and output messages, respectively
• Rollback-recovery protocols generally make assumptions about the reliability of the

inter process communication.
• Some protocols assume that the communication uses first-in-first-out (FIFO) order,
while other protocols assume that the communication subsystem can lose, duplicate,
or reorder messages
–if a process‟s state reflects a message receipt, then the state of the
corresponding sender must reflect the sending of the message
• A global checkpoint is a set of local checkpoints, one from each process

• A consistent global checkpoint is a global checkpoint such that no message is sent by a
process after taking its local point that is received by another process before taking its
checkpoint.
• For instance, Figure shows two examples of global states.

• The state in fig (a) is consistent and the state in Figure (b) is inconsistent.
• Note that the consistent state in Figure (a) shows message m1 to have been sent but not
yet received, but that is alright.
• The state in Figure (a) is consistent because it represents a situation in which every
message that has been received, there is a corresponding message send event.
• The state in Figure (b) is inconsistent because process P2 is shown to have received m2
but the state of process P1 does not reflect having sent it.
• Such a state is impossible in any failure-free, correct computation. Inconsistent states

occur because of failures.
Interactions with outside world

A distributed system often interacts with the outside world to receive input data or
deliver the outcome of a computation. If a failure occurs, the outside world cannot be
expected to roll back.
For example, a printer cannot roll back the effects of printing a character
Outside World Process (OWP)
• It is a special process that interacts with the rest of the system through message
passing.
• It is therefore necessary that the outside world see a consistent behavior of the system
despite failures.
• Thus, before sending output to the OWP, the system must ensure that the state from
which the output is sent will be recovered despite any future failure.
A common approach is to save each input message on the stable storage before
allowing the application program to process it.
An interaction with the outside world to deliver the outcome of a computation is
shown on the process-line by the symbol “||”.
Different types of Messages
1. In-transit message
• messages that have been sent but not yet received
2. Lost messages
• messages whose “send‟ is done but “receive‟ is undone due to rollback
3. Delayed messages
• messages whose “receive‟ is not recorded because the receiving process was
either down or themessage arrived after rollback
4. Orphan messages
• messages with “receive‟ recorded but message“send‟ not recorded
• do not arise if processes roll back to a consistent global state
5. Duplicate messages
• arise due to message logging and replaying during process recovery
In-transit messages
In Figure , the global state {C1,8 , C2, 9 , C3,8,C4,8} shows that messagem1 has been sent but
not yet received. We call such a message anin-transit message. Messagem2 is also an in-transit
message.
Delayed messages
Messages whose receive is not recorded because the receiving process was either down or the
message arrived after the rollback of the receiving process, are calleddelayed messages. For
example, messages m2 and m5 inFigure are delayed messages.
Lost messages
Messages whose send is not undone but receive is undone due to rollback are called
lost messages. This type of messages occurs when the process rolls back to a
checkpoint prior to reception of the message while the sender does not rollback
beyond the send operation of the message.
In Figure , message m1 is a lost message.
Duplicate messages
• Duplicate messages arise due to message logging and replaying during process
recovery. For example, in Figure, message m4 was sent and received before the
rollback.
• However, due to the rollback of process P4 to C4,8 and process P3 to C3,8, both send
and receipt of message m4 are undone.
• When process P3 restarts from C3,8, it will resend message m4.
• Therefore, P4 should not replay message m4 from its log.
• If P4 replays message m4, then message m4 is called a duplicate message.
• Issues in failure recovery

In a failure recovery, we must not only restore the system to a consistent state, but
also appropriately handle messages that are left in an abnormal state due to the failure
and recovery
• The computation comprises of three processes Pi, Pj , and Pk, connected through a
communication network. The processes communicate solely by exchanging messages
over fault free, FIFO communication channels.
• Processes Pi, Pj , and Pk, have taken checkpoints {Ci,0, Ci,1}, {Cj,0, Cj,1, Cj,2}, and
{Ck,0, Ck,1}, respectively, and these processes have exchanged messages A to J
Suppose process Pi fails at the instance indicated in the figure. All the contents of the
volatile memory of Pi are lost and, after Pi has recovered from the failure, the system
needs to be restored to a consistent global state from where the processes can resume
their execution.
• Process Pi’s state is restored to a valid state by rolling it back to its most recent
checkpoint Ci,1.
To restore the system to a consistent state, the process Pj rolls back to checkpoint
Cj,1 because the rollback of process Pi to checkpoint Ci,1 created an orphan message
H (the receive event of H is recorded at process Pj while the send event of H has
been undone at process Pi).
• Pj does not roll back to checkpoint Cj,2 but to checkpoint Cj,1. An orphan message
I is created due to the roll back of process Pj to checkpoint Cj,1.
To eliminate this orphan message, process Pk rolls back to checkpoint Ck,1.
• Messages C, D, E, and F are potentially problematic. Message C is in transit during the

failure and it is a delayed message.
• The delayed message C has several possibilities: C might arrive at process Pi before it
recovers, it might arrive while Pi is recovering, or it might arrive after Pi has
completed recovery. Each of these cases must be dealt with correctly.
• Message D is a lost message since the send event for D is recorded in the restored state
for process Pj , but the receive event has been undone at process Pi. Process Pj will
not resend D without an additional mechanism.
• Messages E and F are delayed orphan messages and pose perhaps the most serious
problem of all the messages. When messages E and F arrive at their respective
destinations, they must be discarded since their send events have been undone.
• Processes, after resuming execution from their checkpoints, will generate both of these
messages.
• Lost messages like D can be handled by having processes keep a message log of all the
sent messages.
• So when a process restores to a checkpoint, it replays the messages from its log to
handle the lost message problem.
• Overlapping failures further complicate the recovery process. If overlapping failures

are to be tolerated, a mechanism must be introduced to deal with amnesia and the
resulting inconsistencies.
Checkpoint-based recovery
Checkpoint-based rollback-recovery techniques can be classified into three categories:
1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing
1.Uncoordinated Checkpointing
• Each process has autonomy in deciding when to take checkpoints
• Advantages
The lower runtime overhead during normal execution
• Disadvantages
1. Domino effect during a recovery
2. Recovery from a failure is slow because processes need to iterate to find a consistent
set of checkpoints
3. Each process maintains multiple checkpoints and periodically invoke a garbage

collection algorithm
4. Not suitable for application with frequent output commits

• The processes record the dependencies among their checkpoints caused by message
exchange during failure-free operation
• The following direct dependency tracking technique is commonly used in

uncoordinated checkpointing.
Direct dependency tracking technique

• Assume each process 𝑃𝑖 starts its execution with an initial checkpoint 𝐶𝑖,0
• 𝐼𝑖,𝑥: checkpoint interval, interval between 𝐶𝑖,𝑥−1 and 𝐶𝑖,𝑥
• When 𝑃𝑗 receives a message m during 𝐼𝑗,𝑦, it records the dependency from 𝐼𝑖,𝑥 to
𝐼𝑗,𝑦, which is later saved onto stable storage when 𝑃𝑗 takes 𝐶𝑗,𝑦
• When a failure occurs, the recovering process initiates rollback by broadcasting a
dependency request message to collect all the dependency information maintained by
each process.
• When a process receives this message, it stops its execution and replies with the
dependency information saved on the stable storage as well as with the dependency
information, if any, which is associated with its current state.
• The initiator then calculates the recovery line based on the global dependency
information and broadcasts a rollback request message containing the recovery line.
• Upon receiving this message, a process whose current state belongs to the recovery
line simply resumes execution;
• otherwise, it rolls back to an earlier checkpoint as indicated by the recovery line.
2.Coordinated Checkpointing
In coordinated checkpointing, processes orchestrate their checkpointing activities so
that all local checkpoints form a consistent global state
Types
1. Blocking Checkpointing:
After a process takes a local checkpoint, to prevent orphan messages, it remains
blocked until the entire check pointing activity is complete
Disadvantages: The computation is blocked during the check pointing
2. Non-blocking Checkpointing:
The processes need not stop their execution while taking checkpoints. A
fundamental problem in coordinated checkpointing is to prevent a process from
receiving application messages that could make the checkpoint inconsistent.
Example (a) : Checkpoint inconsistency
• Message m is sent by 𝑃0 after receiving a checkpoint request from the checkpoint
• Assume m reaches 𝑃1 before the checkpoint request

coordinator
• This situation results in an inconsistent checkpoint since checkpoint 𝐶1,𝑥shows the

receipt of message m from 𝑃0, while checkpoint 𝐶0,𝑥 does not show m being sent
from 𝑃0
Example (b) : A solution with FIFO channels
• If channels are FIFO, this problem can be avoided by preceding the first post-
checkpoint message on each channel by a checkpoint request, forcing each process to
take a checkpoint before receiving the first post-checkpoint message
Impossibility of min-process non-blocking checkpointing

• A min-process, non-blocking checkpointing algorithm is one that forces only a
minimum number of processes to take a new checkpoint, and at the same time it does
not force any process to suspend its computation.
Algorithm
• The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and
sends them a request.
• Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on,
until no more processes can be identified.
• During the second phase, all processes identified in the first phase take a checkpoint.
The result is a consistent checkpoint that involves only the participating processes.
• In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the
checkpoint has been taken is allowable.
3.Communication-induced Checkpointing
Communication-induced checkpointing is another way to avoid the domino
effect, while allowing processes to take some of their checkpoints independently.
Processes may be forced to take additional checkpoints
Two types of checkpoints

1. Autonomous checkpoints
2. Forced checkpoints
The checkpoints that a process takes independently are called local checkpoints,
while those that a process is forced to take are called forced checkpoints.
• Communication-induced check pointing piggybacks protocol- related information on
each application message.
• The receiver of each application message uses the piggybacked information to

determine if it has to take a forced checkpoint to advance the global recovery line.
• The forced checkpoint must be taken before the application may process the contents
of the message.
• In contrast with coordinated check pointing, no special coordination messages are

exchanged
Two types of communication-induced checkpointing

1.Model-based checkpointing
2.Index-based checkpointing.
Model-based checkpointing
• Model-based checkpointing prevents patterns of communications and checkpoints that
could result in inconsistent states among the existing checkpoints.
• No control messages are exchanged among the processes during normal operation. All
information necessary to execute the protocol is piggybacked on application messages
• There are several domino-effect-free checkpoint and communication model.
• The MRS (mark, send, and receive) model of Russell avoids the domino effect by
ensuring that within every checkpoint interval all message receiving events precede
all message-sending events.
Index-based checkpointing.
• Index-based communication-induced checkpointing assigns monotonically increasing
indexes to checkpoints, such that the checkpoints having the same index at different
processes form a consistent state.
KOO AND TOUEG COORDINATED CHECKPOINTING AND RECOVERY

TECHNIQUE:
• Koo and Toueg coordinated check pointing and recovery technique takes a consistent
set of checkpoints and avoids the domino effect and livelock problems during the
recovery.
• Includes 2 parts: the check pointing algorithm and the recovery algorithm
A.The Checkpointing Algorithm

The checkpoint algorithm makes the following assumptions about the distributed
system:
• Processes communicate by exchanging messages through communication channels.
• Communication channels are FIFO.
• Assume that end-to-end protocols (the sliding window protocol) exist to handle with
message loss due to rollback recovery and communication failure.
• Communication failures do not divide the network.
The checkpoint algorithm takes two kinds of checkpoints on the stable storage:
Permanent and Tentative.
A permanent checkpoint is a local checkpoint at a process and is a part of a consistent

global checkpoint.
A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint
on the successful termination of the checkpoint algorithm.
The algorithm consists of two phases.

First Phase
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to
take tentative checkpoints. Each process informs Pi whether it succeeded in taking a
tentative checkpoint.
2. A process says “no” to a request if it fails to take a tentative checkpoint
3. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi
decides that all tentative checkpoints should be made permanent; otherwise, Pi
decides that all the tentative checkpoints should be thrown-away.
Second Phase
1. Pi informs all the processes of the decision it reached at the end of the first phase.
2. A process, on receiving the message from Pi will act accordingly.
3. Either all or none of the processes advance the checkpoint by taking permanent
checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint, it cannot
send messages related to the basic computation until it is informed of Pi’s decision.
Correctness: for two reasons
i.Either all or none of the processes take permanent checkpoint

ii.No process sends message after taking permanent checkpoint
An Optimization
The above protocol may cause a process to take a checkpoint even when it is not
necessary for consistency. Since taking a checkpoint is an expensive operation, we
avoid taking checkpoints.
B.The Rollback Recovery Algorithm

The rollback recovery algorithm restores the system state to a consistent state after a
failure.
The rollback recovery algorithm assumes that a single process invokes the algorithm.
It assumes that the checkpoint and the rollback recovery algorithms are not invoked
concurrently.
The rollback recovery algorithm has two phases.
First Phase
1. An initiating process Pi sends a message to all other processes to check if they all are
willing to restart from their previous checkpoints.
2. A process may reply “no” to a restart request due to any reason (e.g., it is already
participating in a check pointing or a recovery process initiated by some other
process).
3. If Pi learns that all processes are willing to restart from their previous checkpoints, Pi
decides that all processes should roll back to their previous checkpoints. Otherwise,
4. Pi aborts the roll back attempt and it may attempt a recovery at a later time.
Second Phase
1. Pi propagates its decision to all the processes.
2. On receiving Pi’s decision, a process acts accordingly
The message transmission delay is arbitrary, but finite
Underlying computation/application is event-driven: process P is at state s, receives

message m, processes the message, moves to state s’ and send messages out. So the
triplet (s, m, msgs_sent) represents the state of P
Two type of log storage are maintained:

a Volatile log: short time to access but lost if processor crash. Move to stable log
periodically.
b Stable log: longer time to access but remained if crashed
A.Asynchronous Check pointing
– After executing an event, the triplet is recorded without any synchronization with
other processes.
– Local checkpoint consist of set of records, first are stored in volatile log, then moved to
stable log.
B.The Recovery Algorithm Notations and data structure
The following notations and data structure are used by the algorithm:
• RCVDi←j(CkPti) represents the number of messages received by processor pi
from processor pj , from the beginning of the computation till the checkpoint CkPti.
• SENTi→j(CkPti) represents the number of messages sent by processor pi to

processor pj , from the beginning of the computation till the checkpoint CkPti.
Basic idea
• Since the algorithm is based on asynchronous check pointing, the main issue in the
recovery is to find a consistent set of checkpoints to which the system can be restored.
• The recovery algorithm achieves this by making each processor keep track of both the
number of messages it has sent to other processors as well as the number of messages
it has received from other processors.
• Whenever a processor rolls back, it is necessary for all other processors to find out if
any message has become an orphan message.
•
• Orphan messages are discovered by comparing the number of messages sent to and
received from neighbouring processors.
For example, if RCVDi←j(CkPti) >SENTj→i(CkPtj) (that is, the number of messages

received by processor pi from processor pj is greater than the number of messages
sent by processor pj to processor pi, according to the current states the processors),
then one or more messages at processor pj are orphan messages.
The Algorithm
When a processor restarts after a failure, it broadcasts a ROLLBACK message that it

had failed
Procedure RollBack_Recovery processor pi executes the following:
STEP (a)
if processor pi is recovering after a failure then CkPti := latest event logged in
the stable storage else
CkPti := latest event that took place in pi {The latest event at pi can be either in stable
or in volatile storage.} end if
STEP (b)
for k = 1 1 to N {N is the number of processors in the system} do
for each neighboring processor pj do
compute SENTi→j(CkPti)
send a ROLLBACK(i, SENTi→j(CkPti)) message to pj

end for
for every ROLLBACK(j, c) message received from a neighbor j do
if RCVDi←j(CkPti) > c {Implies the presence of orphan messages} then
find the latest event e such that RCVDi←j(e) = c {Such an event e may be in the volatile storage
or stable storage.}
CkPti := e
end if
end for
end for{for k}
D. An Example
Consider an example shown in Figure 2 consisting of three processors. Suppose processor Y
fails and restarts. If event ey2 is the latest checkpointed event at Y, then Y will restart from the
state corresponding to ey2.
Figure 2: An example of Juan-Venkatesan algorithm.
• Because of the broadcast nature of ROLLBACK messages, the recovery algorithm is

initiated at processors X and Z.
• Initially, X, Y, and Z set CkPtX ← ex3, CkPtY ← ey2 and CkPtZ ← ez2, respectively,
and X, Y, and Z send the following messages during the first iteration:
• Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z;

UNIT 4
2 MARKS:
1. State the use of Rollback recovery.
Restore the system back to a consistent state after a failure
. • Achieve fault tolerance by periodically saving the state of a

process during the failure-free execution.
 Treats a distributed system application as communicate over a network.
2. What is consensus in distributed system?
Each process has an initial value and all the correct processes must agree
on a single
value.
3. Write the purpose of using checkpoints.
Check pointing is most typically used to provide fault tolerance to

applications. Check pointing techniques are useful not only for availability, but
also for program debugging, process migration, and load balancing.
4. What do you mean by agreement problem in distributed system?
In the agreement problem, to achieve overall system reliability in the

presence of a number of faulty processes and single process has the initial value.
5. What is the difference between agreement and consensus problem?
The difference between the agreement problem and the consensus problem is
that, in the agreement problem, a single process has the initial value, whereas in the
consensus problem, all processes have an initial value.
6. Define recovery.
Recovery refers to restoring a system to its normal operational state. Once a

failure has occurred, it is essential that the process where the failure happened
recover to a correct state. Fundamental to fault tolerance is the recovery from an
error.
7. Explain two types of checkpoints.
1. Tentative: A temporary checkpoint that is made a permanent checkpoint

on the successful termination of the checkpoint algorithm.
2. Permanent: A local checkpoint at a process.

8. List drawback of synchronous check pointing.
1. Additional messages must be exchanged to coordinate check pointing
2. Synchronization delays are introduced during normal operations
3. No computational messages can be sent while the check pointing

algorithm is in progress.
4. If failure rarely occurs between successive checkpoints, then the

checkpoint algorithm places an unnecessary extra load on the system, which can
significantly affect performance.
9. How shadow versions are helpful in recovery?
Shadow version uses a map to locate versions of the server's objects in a file
called a version store. The map associates the identifiers of the server's objects
with the positions of their current versions in the version store. The versions
written by each transaction are shadows of the previous committed versions. The
transaction status entries and intentions lists are stored separately. When a
transaction commits, a new map is made by copying the old map and entering the
positions of the shadow versions. To complete the commit process, the new map
replaces the old map.
10. Define fault and failure. What are different approaches to fault-tolerance?
Fault: Anomalous physical condition, eg design errors, manufacturing

problems, damage, external disturbances.
Failure of a system occurs when the system does not perform its service in
the manner specified.
11. List the requirements of consensus algorithm to hold for
execution. The requirements of consensus algorithm to
hold for execution are
1. Termination
2. Agreement and
3. Integrity.
12. What are the performance aspects of agreement
protocols? Following metrics are used
1. Time: No of rounds needed to reach an agreement.
2. Message traffic: Number of messages exchanged to reach an

agreement.
3. Storage overhead: Amount of information that needs to stored at
processors during execution of the protocol.

13. What are the application of agreement
algorithm? Applications of agreement
algorithms
 Fault-tolerant clock synchronization.

 Distributed systems require physical clocks to synchronized
 Physical clocks have drift problem.
 Agreement protocols may help to reach a common clock value. •
Synchronizing distributed clocks:
 At any time, values of clocks of all non-faulty processes must be
approximately equal.
 There is a small bound on amount by which the clock of a non-
faulty process is changed during re-synchronization.
14. State Byzantine agreement problem.
In the Byzantine agreement problem, n processors communicate with each

other in order to reach an agreement on a binary value b. There are bad processors.
that may collaborate with each other in order to prevent an admissible agreement.
Each processor has an initial binary value. The agreement must reflect to a certain
extent the majority among the initial value.
15. What is local checkpoints?
A process may take a local check point anytime during the execution. The
local checkpoints of different processes are not coordinated to form a global
consistent checkpoint.
16. What is forced checkpoints?
To guard against the domino effect, a communication induced checkpoint

protocol piggybacks protocol-speci?c information to application messages that
processes exchange. Each process examines the information and occasionally is
forced to take a checkpoint according to the protocol.
17. Explain useless checkpoints.
A useless checkpoint of a process is one that will never be part of a global

consistent state. Useless checkpoints are not desirable because they do not
contribute to the recovery of the system from failures, but they consume resources
and cause performance overhead.
18. What is checkpoint intervals?
A checkpoint interval is the sequence of events between two consecutive

checkpoints in the execution of a process.
19. Define orphan messages.
Messages with receive recorded but message send not recorded are called the
orphan messages.
20. What is the basic idea behind task assignment
approach? Basic idea:
a. A process has already been split up into pieces called tasks:
b. The amount of computation required by each task and the are known.
c. The cost of processing each task on every node is known.
d. The IPC costs between every pair of tasks is known..
e. Precedence relationships among the taks are known.
f. Reassignment of tasks is not possible. Mention some motivations for

replication.
13 MARKS:
1. Explain the Solution to Byzantine Agreement Problem.
 Impossible Scenario
 Lamport – Shostak – Pease Algorithm
o Example
2. Explain the Consistent Set of Checkpoint.
 Definition
o Strongly Consistent Set of Checkpoint.
o Consistent Set of Checkpoint
o Checkpoint Notation
 Synchronous Checkpoint and Recovery
o Checkpointing Algorithm
 Types
o Synchronous Checkpointing Disadvantages
 The Rollback Recovery Algorithm
o Phase one
o Phase two
 Message Types
3. Discuss about the Checkpoint – Based Recovery.
 Uncoordinated Checkpointing
o Direct dependency tracking technique
 Coordinated Checkpointing
o Blocking Checkpointing
o Non – Blocking Checkpointing
 Communication – induced Checkpointing
4. Describe the Issues in Failure Recovery.
 Basic Concept
 Recovery
o System Failure
o Erroneous System State
o Error
o Fault
5. Explain Byzantine Agreement Problem.
 Introduction
o The Problem
o Validity
 Consensus Problem
o Agreement
o Validity
 Interactive Consistency Problem

UNIT 4 DC Final Copy

Uploaded by

Copyright:

Available Formats

UNIT 4 DC Final Copy

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

UNIT 4 DC Final Copy

Uploaded by

Copyright:

Available Formats

UNIT IV

CONSENSUS AND RECOVERY

Consensus and Agreement Algorithms: Problem Definition –

Agreement among the processes in a distributed system is a fundamental

Many forms of coordination require the processes to exchange information to

We first state some assumptions underlying our study of agreement algorithms:

• Synchronous/asynchronous communication If a failure-prone process chooses

• Sender identification A process that receives a message always knows the

• Authenticated vs. non-authenticated messages With unauthenticated

• An unauthenticated message is also called an oral message or an unsigned

• Thus, faulty processes can inflict less damage.

• Agreement variable The agreement variable may be boolean or multivalued,

The Byzantine agreement

• Agreement All non-faulty processes must agree on the same value.

• Termination Each non-faulty process must eventually decide on a value.

The interactive consistency problem

• Termination Each non-faulty process must eventually decide on the array A

Failure mode Synchronous system Asynchronous system

In a failure-free system, consensus can be reached by collecting information from

The decision can be reached by using an applicationspecific function – some

Consensus and agreement algorithms tree-based broadcast–converge cast–

AGREEMENT IN (MESSAGE-PASSING) SYNCHRONOUS SYSTEMS WITH

• The consensus algorithm for n processes where up to f processes where f < n

• If up to f failures are to be tolerated than algorithm has f+1 rounds, in each

• If one process is faulty, among three processes then f = 1. So the agreement

• The termination condition is seen to be satisfied.

Consensus algorithms for Byzantine failures (synchronous system)

STEPS FOR BYZANTINE GENERALS (ITERATIVE FORMULATION),

CODE FOR THE PHASE KING ALGORITHM:

Each phase has a unique "phase king" derived, say, from

PHASE KING ALGORITHM CODE:

• 1 Among f + 1 phases, at least one phase k where phase-king is non-malicious.

• 2 In phase k, all non-malicious processes Pi and Pj will have same

Check pointing and rollback recovery: Introduction

Why is rollback recovery of distributed systems complicated?

Messages induce inter-process dependencies during failure-free operation

• Rollback-recovery protocols generally make assumptions about the reliability of the

• A global checkpoint is a set of local checkpoints, one from each process

• For instance, Figure shows two examples of global states.

• Such a state is impossible in any failure-free, correct computation. Inconsistent states

Interactions with outside world

Different types of Messages

• messages whose “send‟ is done but “receive‟ is undone due to rollback

In Figure , message m1 is a lost message.

• Issues in failure recovery

To eliminate this orphan message, process Pk rolls back to checkpoint Ck,1.

• Messages C, D, E, and F are potentially problematic. Message C is in transit during the

• Overlapping failures further complicate the recovery process. If overlapping failures

3. Each process maintains multiple checkpoints and periodically invoke a garbage

4. Not suitable for application with frequent output commits

• The following direct dependency tracking technique is commonly used in

Direct dependency tracking technique

• otherwise, it rolls back to an earlier checkpoint as indicated by the recovery line.

• Assume m reaches 𝑃1 before the checkpoint request

• This situation results in an inconsistent checkpoint since checkpoint 𝐶1,𝑥shows the

Impossibility of min-process non-blocking checkpointing

Two types of checkpoints

• The receiver of each application message uses the piggybacked information to

• In contrast with coordinated check pointing, no special coordination messages are

Two types of communication-induced checkpointing