


UNIT IV CONSENSUS AND RECOVERY

Consensus and Agreement Algorithms: Problem Definition – Overview of Results –
Agreement in a Failure-Free System (Synchronous and Asynchronous) – Agreement in
Synchronous Systems with Failures; Checkpointing and Rollback Recovery: Introduction –
Background and Definitions – Issues in Failure Recovery – Checkpoint-based Recovery –
Coordinated Checkpointing Algorithm – Algorithm for Asynchronous Checkpointing and
Recovery

Must study topics:

1) Issues in failure recovery with examples.
2) Agreement in synchronous systems with failures.
3) Checkpoint-based recovery.
4a) Coordinated checkpointing algorithm (Koo–Toueg algorithm).
4b) Algorithm for asynchronous checkpointing and recovery.

Part-A
1) Outline the issues in failure recovery. [April/May 2024]
2) What are the advantages and disadvantages of asynchronous computing? [April/May 2024]
3) What is a checkpoint in the recovery system? How is it useful? [Nov/Dec 2023]
4) State the issues in failure recovery. State the means to identify the failure at an early
stage. [Nov/Dec 2023]

Part-B
April/May 2024:

Nov/Dec 2023

Some base terms:


Agreement: group of processes agree on any common value.
Example protocols: 2PC
Consensus: group of processes agree on one of the proposed values.
Example protocols: Paxos, Raft


Real-Life Analogy
• Agreement: Imagine a group of friends trying to decide where to go for dinner. They
discuss options, and if they can’t agree on a restaurant, they default to a familiar spot
they all accept (a pre-agreed fallback).
• Consensus: Now imagine they’re trying to pick a restaurant, but they can only choose
one that someone explicitly suggested. They need to discuss and vote until they all
agree on one of the proposed options.

Assumptions underlying our study of agreement algorithms:

i) Failure models: In distributed systems, failure models define how faulty processes behave,
impacting the ability to reach consensus. Key models include:

1. Fail-stop model: A process may crash during execution, potentially leaving some tasks
incomplete.
2. Send/Receive omission: Processes may fail to send or receive messages, leading to
communication gaps.
3. Byzantine failure model: Faulty processes can behave arbitrarily, sending incorrect or
misleading information.

The complexity of reaching consensus depends on the failure model, with Byzantine failures
being the hardest to manage due to their unpredictability and potential for malicious behavior.

ii) Asynchronous communication and synchronous communication:

In distributed systems, communication can be either synchronous or asynchronous, which
affects the ability to detect message failures.

• Asynchronous communication: If a process fails to send a message, the recipient
cannot distinguish between a failure and a message delayed in transit, making it difficult
to detect the non-arrival of messages. This leads to challenges in reaching agreement,
especially under failure conditions.
• Synchronous communication: In this model, the recipient can detect when a message
has not been sent by the end of a predefined round. The recipient can handle this by
assuming a default message and continuing the algorithm, making agreement more
feasible in synchronous systems.


• Reaching consensus in asynchronous systems is inherently more challenging due to the
indistinguishability between message delay and failure. In summary, synchronous
communication makes consensus easier because failures and delays are predictable.
Asynchronous communication introduces uncertainty, making consensus more
complex and requiring more sophisticated algorithms like Paxos or Raft to achieve
agreement.

iii) Network Connectivity: The system has full logical connectivity, i.e., each process can
communicate with any other by direct message passing.
iv) Sender Identification: Every message received by a process contains the identity of the
sender. This feature helps mitigate some risks of Byzantine faults (faulty behavior where
processes can send false or malicious messages) by ensuring the recipient can at least identify
the source. While the content may be tampered with, the sender's identity is still revealed by
the underlying network protocols.
v) Multiple Messages and Scheduling: When a process is expected to receive multiple
messages from the same sender during a round, the system assumes a scheduling mechanism
that handles the messages in sub-rounds. This ensures each message is uniquely identifiable,
even when multiple messages are sent in a short span of time.

Example:

• Sub-round 1: Lock the records.


• Sub-round 2: Update the balance.
• Sub-round 3: Release the lock.

This ensures that the transaction steps occur in the correct sequence, even in the face of network
delays or faults, preventing issues like lost updates or deadlocks.

vi) Channel Reliability: The communication channels are assumed to be reliable, meaning


only processes can fail. This simplifies the study of failure models because it removes the
complexities of handling unreliable communication links, allowing the focus to be on process
failures.

vii) Authenticated vs. Non-authenticated Messages:

• Unauthenticated Messages: Also referred to as "oral" or "unsigned" messages,


unauthenticated messages pose challenges because faulty processes can forge or tamper
with messages, making it difficult for recipients to trust the content or origin.
• Authenticated Messages: With authentication (e.g., digital signatures), it becomes
easier to solve agreement problems. Faulty processes that try to forge or tamper with
messages can be detected, thus reducing the impact of such processes.

viii) Agreement Variable: The agreement variable is the value that processes must agree on in
consensus algorithms. It can be boolean (true/false) or multi-valued. For simplicity in some
algorithms, the boolean form is used, although the results can extend to other data types.

Why is it difficult to achieve consensus in a distributed system?

This example illustrates the Byzantine Generals Problem, a classic issue in distributed
systems that highlights the challenge of achieving consensus in the presence of faulty or
malicious actors. Four generals of an army, camped around the enemy fort, need to agree on a


time to attack, as success depends on their synchronized actions. However, communication


between the generals occurs via messengers, and in an asynchronous system, messengers can
take an indefinite amount of time, or be "lost" (captured by the enemy).

A Byzantine process represents a traitorous general who tries to undermine consensus by


providing false or inconsistent information to different generals. For example, the traitor may
tell one general to attack at 10 a.m., while informing others to attack at noon, or fail to send
messages altogether. This behavior leads to confusion and makes it difficult for the generals to
coordinate.

The challenge is to determine whether the generals can still reach agreement despite the
presence of Byzantine actors, and if so, under what conditions. This situation reflects the core
problem of fault tolerance in distributed systems, where consensus protocols must be devised
to ensure agreement, even when some processes behave maliciously or fail.

The diagram shows four generals (G1, G2, G3, and G4) involved in the Byzantine Generals
Problem, where each general is sending potentially conflicting or misleading messages to the
others. The arrows between the generals indicate the messages being exchanged, and the
numbers along the arrows represent the values being communicated (either 0 or 1).

Key points from the diagram:

• G1 sends conflicting information: a "1" to G2 and G4, but a "0" to G3.


• G2 communicates "0" to G1, “1” to G3, and G4.
• G3 sends "1" to G1 but "0" to G2 and G4.
• G4 sends "0" to G1, G2, and G3.

This example illustrates how Byzantine behavior (traitorous generals) can lead to confusion,
as the messages received by each general are inconsistent. This inconsistency makes it difficult
to determine which messages are reliable, ultimately challenging the generals' ability to reach
consensus on a common decision (whether to attack or not).

The problem shown in the diagram reflects the core difficulty in achieving agreement in
distributed systems when some participants (generals) act maliciously or unpredictably. It
highlights the importance of robust consensus protocols that can overcome the
uncertainty introduced by Byzantine failures.


The Byzantine agreement and other problems

1. Byzantine Agreement Problem:

The Byzantine Agreement Problem involves a designated source process with an initial
value, and the goal is for all non-faulty processes to agree on the source’s value, even if some
processes are faulty (Byzantine). The key conditions are:

• Agreement: All non-faulty processes must agree on the same value.


• Validity: If the source process is non-faulty, the agreed-upon value must be the
source’s initial value.
• Termination: All non-faulty processes must eventually decide on a value.

If the source is faulty, the non-faulty processes can agree on any value. Faulty processes'
decisions are irrelevant, and the system ensures that all correct processes reach a consensus.

Use Case:

Aviation Systems: Byzantine agreement is used in safety-critical systems like flight control
computers in aircraft, where some computers might provide incorrect data due to faults, but the
majority of systems need to agree on a single course of action to ensure safe flight.

2. Consensus Problem:

The Consensus Problem differs from Byzantine agreement in that each process has its own
initial value. The goal is for all non-faulty processes to agree on a single value, based on their
initial values. The conditions are:

• Agreement: All non-faulty processes must agree on the same value.


• Validity: If all non-faulty processes have the same initial value, they must agree on
that value.
• Termination: Each non-faulty process must eventually decide on a value.

In this case, all correct processes must come to a consensus on one value, even though each
process may initially propose a different value.

Use Case:

• Distributed Databases (e.g., Raft, Paxos): In distributed systems like databases,


consensus is needed to ensure that all nodes agree on the order of transactions (for
consistency) even if some nodes crash.

3. Interactive Consistency Problem:

The Interactive Consistency Problem extends the consensus problem by requiring that non-
faulty processes agree on a set of values, one for each process (including possibly faulty
ones). The conditions are:

• Agreement: All non-faulty processes must agree on the same array of values
A = [v1, v2, ..., vn].


• Validity: If process i is non-faulty, the agreed value for that process vi must be the
same as its initial value. For faulty processes, non-faulty processes can agree on any
value.
• Termination: Each non-faulty process must eventually decide on the entire array A.

In this problem, instead of agreeing on a single value, the goal is to agree on a set of values
corresponding to the initial value of each process.

Use Case:

Sensor Networks: In sensor networks where sensors collect data from different sources, some
sensors might be faulty or provide incorrect readings. The system needs to achieve interactive
consistency to ensure all non-faulty nodes agree on the values received from each sensor,
ensuring accurate aggregation of data (e.g., for climate control systems in smart buildings).

Equivalence of the problems and notation:

The formal difference between the agreement and consensus problems is that:

• In the agreement problem, only a single process (source) has an initial value that all
other processes must agree on.
• In the consensus problem, all processes start with their own initial values and must
agree on one value.

Despite this formal distinction, the terms "agreement" and "consensus" are often used
interchangeably in the literature because they address similar goals—achieving agreement
among non-faulty processes.

Reductions:

• Since these problems are equivalent, a solution to one problem can be transformed into
a solution for the others, implying that they share core characteristics, such as how non-
faulty processes reach a decision, handle failures, and ensure agreement.

Overview of results

Table 14.1 (the table summarizes results on achieving agreement in distributed systems under
different failure modes for both synchronous and asynchronous systems) gives an overview of
the results and lower bounds on solving the consensus problem under different assumptions. It
is worth understanding the relation between the consensus problem and the problem of
attaining common knowledge of the agreement value. For the “no failure” case, consensus is
attainable. Further, in a synchronous system, common knowledge of the consensus value is
also attainable, whereas in the asynchronous case, concurrent common knowledge of the
consensus value is attainable. Consensus is not solvable in asynchronous systems even if one
process can fail by crashing. To circumvent this impossibility result, weaker variants of the
consensus problem are considered.


The following table demonstrates that, although consensus is generally challenging in
asynchronous systems, specific techniques and constructs can be applied in both
message-passing and shared memory settings to achieve approximate or limited forms of consensus.

Agreement in a Failure-Free System (Synchronous and Asynchronous)

In a failure-free system, consensus can be achieved relatively easily by having processes share
their information, compute a decision collectively, and then disseminate that decision across
the system. Here's a breakdown of how this process works:

Mechanism:

1. Information Collection and Decision:


o Each process broadcasts its initial value to all other processes in the system.
o Each process receives the values from other processes and computes a decision
based on an application-specific function. Simple examples include:
▪ Majority function: The decision is the value that occurs most
frequently.


▪ Max function: The decision is the maximum value among all the
broadcast values.
▪ Min function: The decision is the minimum value among all the
broadcast values.
2. Distribution of Decision:
o Once the decision is made, it must be distributed to all processes to ensure that
the system reaches agreement.
o This can be done through different communication protocols:
▪ Token Circulation: Processes are arranged in a logical ring, and a token
is passed around the ring carrying the decision.
▪ Three-phase Tree-based Broadcast–Convergecast–Broadcast:
Processes form a tree-like structure where decisions are broadcast,
collected, and re-broadcast.
▪ Direct Communication: Each process communicates directly with all
other processes.

Asynchronous Systems:

• Constant Number of Rounds: In asynchronous systems, where message delivery


times may vary, consensus can still be reached in a constant number of rounds (the
number of rounds depends on the logical topology and the specific algorithm used).
• Concurrent Common Knowledge: Once a decision is made, an additional round of
communication can be used to achieve common knowledge (where every process
knows that all other processes also know the decision value).

Failure-Free Systems vs. Failure-Prone Systems:

• In failure-free systems, achieving consensus is straightforward because there are no


failures to contend with. The challenge of reaching agreement arises primarily in
failure-prone systems, which include scenarios where processes may crash or behave
maliciously (Byzantine failures). This is where more complex protocols and fault-
tolerant mechanisms are required to ensure consensus.

In conclusion, while consensus is straightforward in failure-free systems using basic


communication mechanisms, much of the complexity and interest in distributed consensus
algorithms arises when processes are prone to failures, necessitating more sophisticated
approaches.

Agreement in Synchronous Systems with Failures

1) Algorithm 1: Consensus algorithm for crash failures (synchronous system)


• Objective: Achieve consensus among n processes, where up to f processes may crash


(fail-stop model).
• Initial State: Each process Pi has an initial integer value xi.
• Rounds: The algorithm runs for f+1 rounds.

Steps:
1. Broadcasting: In each round, process Pi broadcasts its value xi to all other processes,
if it hasn’t broadcast it before.
2. Receiving and Updating: Each process updates its local value to the minimum of its
own value and the values it receives from other processes in that round.
3. Termination: After f+1 rounds, each process outputs its local value as the consensus
value.

Algorithm 14.1 is a consensus algorithm for n processes that can tolerate up to f crash failures.
The algorithm runs for f+1 rounds to ensure that consensus is reached even if processes crash.
Each process starts with an initial value and, in each round:

1. Broadcasts its value to all other processes if it hasn’t done so yet.


2. Updates its value to the minimum of the values it has received and its own value.


Conditions:
• Agreement: At least one of the f+1 rounds will be failure-free. In that round, all non-failed
processes will exchange their values and update to the same minimum value, ensuring they
agree on the same consensus value.
• Validity: Crashed processes only send correct values before crashing, so the decided value
is the initial value of some process.
• Termination: After f+1 rounds, each non-failed process outputs its final value as the
consensus value.
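To make the round structure concrete, here is a minimal Python sketch that simulates this
min-value consensus under crash failures. It is illustrative only: the synchronous network is
replaced by a simple round loop, the function name min_consensus and the crash_round
parameter are invented for the example, and a crashed process is simply dropped from the
round (a real crash could deliver its broadcast to only a subset of receivers).

```python
# Minimal simulation of f+1-round min-value consensus under crash (fail-stop)
# failures.  Assumption: a global round loop stands in for the synchronous network.

def min_consensus(initial_values, f, crash_round=None):
    """initial_values: dict pid -> int; crash_round: dict pid -> round of crash."""
    crash_round = crash_round or {}
    values = dict(initial_values)                    # current local value per process
    already_sent = {pid: set() for pid in values}    # values each process has broadcast
    alive = set(values)

    for rnd in range(f + 1):
        alive -= {pid for pid, r in crash_round.items() if r == rnd}

        # broadcast phase: send the current value if it has not been sent before
        inbox = {pid: [] for pid in values}
        for pid in alive:
            v = values[pid]
            if v not in already_sent[pid]:
                already_sent[pid].add(v)
                for other in alive:
                    if other != pid:
                        inbox[other].append(v)

        # receive/update phase: keep the minimum of own and received values
        for pid in alive:
            values[pid] = min([values[pid]] + inbox[pid])

    # after f+1 rounds each surviving process outputs its local value
    return {pid: values[pid] for pid in alive}


print(min_consensus({"P1": 5, "P2": 3, "P3": 7}, f=1, crash_round={"P3": 0}))
# all surviving processes report the same (minimum) decided value
```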

2) Upper bound on Byzantine processes:

In a system of n processes, the Byzantine agreement problem (as also the other variants of the
agreement problem) can be solved in a synchronous system only if the number of Byzantine
processes f is such that f ≤ ⌊(n − 1)/3⌋, i.e., n ≥ 3f + 1.


This establishes that to solve the Byzantine agreement problem, the number of processes n
must be greater than 3f, meaning that the system requires more than three times the number
of potentially faulty processes to achieve consensus.

3a) Byzantine agreement tree algorithm: exponential (synchronous system)

Recursive formulation
We begin with an informal description of how agreement can be achieved with n = 4 processes
and f = 1 traitor, as depicted in Figure 14.4.

In the first round, the commander Pc sends its value to the other three lieutenants, as shown by
dotted arrows. In the second round, each lieutenant relays to the other two lieutenants the value
it received from the commander in the first round. At the end of the second round, a lieutenant
takes the majority of the values it received (i) directly from the commander in the first round,
and (ii) from the other two lieutenants in the second round. The majority gives a correct
estimate of the “commander’s” value. Consider Figure 14.4(a) where the commander is a
traitor. The values that get transmitted in the two rounds are as shown. All three lieutenants
take the majority of (1, 0, 0) which is “0,” the agreement value.


In Figure 14.4(b), lieutenant Pd is malicious. Despite its behavior as shown, lieutenants Pa and
Pb agree on “0,” the value of the commander.


Recursion Unfolding (f > 0):

• For each process in the system (except the commander), the process receives a
value from the commander or other processes, and it forwards this value to others.
This continues as long as f>0, and at each level, the recursion depth decreases as f
is reduced.


Recursion Folding (base case f = 0):

• Once the recursion reaches f=0, each process stops forwarding values. Instead, it
looks at the values it received from other processes.
• Each process then uses majority voting to decide the final value:
o If a process received conflicting values (e.g., 1 from one source, 0 from
another), it uses the majority value.
o If no value was received from a particular process, the algorithm assumes a
default value (usually 0).
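The recursive unfolding and folding described above can be summarised by the following
Python sketch of the oral-messages style recursion (a local simulation, not a networked
implementation). The helper names om, send, and majority, the DEFAULT value, and the
traitor model (a traitor simply flips every value it relays, whereas a real Byzantine process
may send different values to different receivers) are all assumptions made for illustration.

```python
# Simulation sketch of the recursive oral-messages style Byzantine agreement.
from collections import Counter

DEFAULT = 0   # value assumed when no usable value is received

def majority(votes):
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT                      # break ties with the default value
    return counts[0][0]

def send(value, sender, traitors):
    # what 'sender' actually transmits: traitors flip the value they relay
    return 1 - value if sender in traitors else value

def om(f, commander, lieutenants, order, traitors):
    """Return dict lieutenant -> value decided for the commander's order."""
    # round 1: the commander sends its order to every lieutenant
    received = {lt: send(order, commander, traitors) for lt in lieutenants}
    if f == 0:
        return received                     # base case: use the value as received
    # unfolding: each lieutenant relays what it received, acting as commander
    # of OM(f-1) towards the remaining lieutenants
    relayed = {j: om(f - 1, j, [lt for lt in lieutenants if lt != j],
                     received[j], traitors)
               for j in lieutenants}
    # folding: each lieutenant takes the majority of its own value and the
    # values the other lieutenants report for the commander
    decided = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decided[i] = majority(votes)
    return decided


# Figure 14.4(b)-style scenario: n = 4, f = 1, lieutenant Pd is the traitor
print(om(1, "Pc", ["Pa", "Pb", "Pd"], order=0, traitors={"Pd"}))
# the loyal lieutenants Pa and Pb both decide 0, the loyal commander's value
```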

Real-World Applications:
1. Blockchain Systems: Blockchain platforms like Hyperledger Fabric use Byzantine
fault-tolerant algorithms (such as PBFT) to ensure consensus on transaction blocks,
even in the presence of malicious nodes.

3b) Byzantine agreement tree algorithm: exponential (synchronous system), iterative formulation:

The Byzantine Agreement (BA) problem involves reaching a consensus among


distributed nodes (or processes) in the presence of some faulty nodes that may behave
maliciously.

The Byzantine Agreement Tree Algorithm is an approach used to solve the Byzantine
Agreement problem in synchronous systems by ensuring that non-faulty nodes agree on the
same value despite a subset of faulty nodes.

In the tree-based Byzantine Agreement algorithm (exponential in synchronous systems),


the processes use a tree structure to reach agreement iteratively. This formulation is
exponential in communication complexity due to the recursive messaging among nodes and
becomes feasible under certain synchronous conditions.



Explanation:


Algorithm:


4) Phase-King algorithm:

Objective:

The Phase-King algorithm is designed to solve the Byzantine Agreement problem in a
synchronous system with at most f faulty (malicious) processes, where the total number of
processes is n > 4f. It allows all correct processes to reach consensus on a value, even if some
processes are malicious. The algorithm is so called because it operates in f + 1 phases, each
with two rounds, and a unique process plays an asymmetrical role as a leader (the "phase
king") in each phase.



Correctness of the Phase-King algorithm:

Complexity:

The algorithm requires f + 1 phases with two sub-rounds in each phase, and (f + 1)(n − 1)(n + 1)
messages.
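A rough Python simulation of the two-round phase structure is sketched below. The function
name phase_king, the way faulty processes are modelled (answering with random values), and
the use of a single shared loop in place of a real synchronous network are illustrative
assumptions; the threshold test strength > n/2 + f follows the description above.

```python
# Rough simulation of the phase-king structure: f+1 phases, two rounds each,
# a distinct king per phase, n > 4f.  The adversary model here is deliberately crude.
import random

def phase_king(initial_values, faulty, f, seed=0):
    rng = random.Random(seed)
    pids = sorted(initial_values)
    n = len(pids)
    v = dict(initial_values)

    def reported(sender):
        # value the sender claims to hold (faulty processes answer arbitrarily)
        return rng.randint(0, 1) if sender in faulty else v[sender]

    for phase in range(f + 1):
        king = pids[phase]
        # Round 1: everyone broadcasts; each process computes the majority value
        # of what it heard and how strong that majority was.
        majority_val, strength = {}, {}
        for p in pids:
            ones = sum(reported(q) for q in pids)
            majority_val[p] = 1 if ones > n // 2 else 0
            strength[p] = max(ones, n - ones)
        # Round 2: the king broadcasts its majority value; a process keeps its
        # own value only if its majority was stronger than n/2 + f, otherwise
        # it adopts the king's value.
        king_value = rng.randint(0, 1) if king in faulty else majority_val[king]
        for p in pids:
            if p in faulty:
                continue
            v[p] = majority_val[p] if strength[p] > n // 2 + f else king_value

    return {p: v[p] for p in pids if p not in faulty}


# n = 5, f = 1 (so n > 4f), one faulty process
print(phase_king({"P1": 1, "P2": 0, "P3": 1, "P4": 1, "P5": 0}, faulty={"P5"}, f=1))
# all correct processes end up with the same value
```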

Checkpointing and Rollback Recovery

Introduction: In distributed systems, rollback recovery protocols aim to restore a system to a
consistent state after a failure. Distributed systems often face failures that disrupt operations,
so achieving fault tolerance is critical.


Rollback recovery works by saving the state of processes periodically (called checkpoints), so
that in the event of failure, the system can restart from the last saved state, minimizing the
loss of work.

Key Concepts:

1. Checkpoints: Saved states of a process, stored in either stable or volatile storage, to


facilitate recovery after failure.
2. Rollback Propagation: In distributed systems, processes depend on each other through
message exchanges. When one process rolls back to a previous checkpoint, other
processes that received messages from it may also need to roll back, creating a chain
reaction of rollbacks. This is called the domino effect and can lead to loss of all
previous work.
3. Uncoordinated Checkpointing: If each process takes checkpoints independently, the
system is vulnerable to the domino effect.
4. Coordinated Checkpointing: Processes coordinate to take checkpoints together,
forming a consistent global state, which prevents the domino effect. In case of failure,
the system restores from this consistent set of checkpoints.

Example: A multiplayer game where all players' states are saved at the same time to
ensure no player has an inconsistent game state after a crash.

5. Communication-Induced Checkpointing: Processes are forced to take checkpoints


based on message-piggybacked information, ensuring that a consistent state is always saved
and the domino effect is avoided.

Example:

• Sensor A sends a message to Sensor B saying, "The temperature is 25°C."


• Sensor B logs this message and updates its state.
• If Sensor A crashes later and rolls back to a state before it sent this message, Sensor B
can replay the "25°C" message to restore Sensor A’s state to what it was before the
crash.

By using logs and message replay, the system avoids unnecessary rollbacks and maintains
efficiency during recovery.

6. Log-Based Rollback Recovery: Combines checkpointing with logging of non-


deterministic events (e.g., interactions with the outside world). These events are logged so they
can be replayed exactly, allowing recovery even beyond the most recent checkpoints. This
method is especially useful for applications that interact with external devices, which cannot
roll back.

Example:

• In a stock trading platform, the system logs non-deterministic events like stock price
updates and user trade actions. If a process fails, it rolls back to the last checkpoint and
replays the logged events (e.g., received stock price updates) to restore the state without
losing trades made by users.
• Online Transaction Processing in banking system(fund transfers, deposits, and
payments)


The primary focus is on ensuring reliability and availability in distributed systems, where faults
can occur. Different techniques, like coordinated checkpointing and log-based rollback
recovery, aim to mitigate the trade-offs of checkpointing and avoid the domino effect.

Background and Definitions


System model
The system consists of multiple processes (e.g., P1, P2, P3) that communicate exclusively
through messages.
Processes work together to execute distributed applications and interact with the outside world
by receiving input and sending output messages.
The reliability of inter-process communication is crucial, with some protocols assuming
reliable message delivery in FIFO order, while others account for possible message loss,
duplication, or reordering.

Checkpointing:

• Local checkpoint: Each process periodically saves its state, known as a local
checkpoint, which is stored on stable storage for recovery in case of failure.
• Checkpointing is represented by the symbol “ |” in process-line diagrams.
• A process may maintain multiple checkpoints or a single one at a time, depending on
the method used. It can roll back to any of its checkpoints if necessary.

Rollback-Recovery Protocols:

• These protocols ensure that, after a failure, the system's internal state is consistent
with its observable behavior before the failure.
• The protocols maintain information about both internal interactions (between
processes) and external interactions (with the outside world).

Generic Correctness Condition:

• A system is considered to recover correctly if its state after recovery is consistent with
its behavior prior to the failure.

Consistent system states

A global state in a distributed system is the collection of all individual process states and the
states of communication channels. A consistent global state is one that could occur during a


failure-free execution, meaning that if a message is received by a process, the corresponding


sender’s state must show that the message was sent.

Consistent and inconsistent states are illustrated in Figure 13.2:

• Consistent state (a): Message m1 has been sent but not yet received, which is still valid
since every received message has a corresponding send event.
• Inconsistent state (b): Process P2 has received m2, but P1’s state does not show that
it sent the message, making the state inconsistent.

A local checkpoint is a snapshot of a process’s state, while a global checkpoint is a set of


local checkpoints.

A consistent global checkpoint is one where no message is received by any process before
the sender of that message has taken its local checkpoint.[Make sure no receive(M1) event
without a send(M1)]
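The "no receive without a matching send" rule can be checked mechanically. The sketch below
assumes each local checkpoint is summarised by the sets of message identifiers sent and
received up to that point; this data layout and the function name is_consistent are illustrative,
not part of the formal model.

```python
# Sketch: checking whether a set of local checkpoints forms a consistent global checkpoint.

def is_consistent(global_checkpoint):
    """global_checkpoint: dict pid -> {"sent": set, "received": set}"""
    all_sent = set()
    for state in global_checkpoint.values():
        all_sent |= state["sent"]
    # every message recorded as received must also be recorded as sent
    for pid, state in global_checkpoint.items():
        orphans = state["received"] - all_sent
        if orphans:
            print(f"{pid} recorded orphan message(s): {orphans}")
            return False
    return True


# Figure 13.2(b)-style case: m2 is recorded as received by P2, but P1's
# checkpoint does not record sending it, so the global state is inconsistent.
snapshot = {
    "P1": {"sent": {"m1"}, "received": set()},
    "P2": {"sent": set(), "received": {"m2"}},
}
print(is_consistent(snapshot))   # False
```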

Interaction with the Outside World:

Distributed applications often interact with external entities (like printers or ATMs) to either
receive inputs or deliver outputs. However, these external entities, referred to as outside world
processes (OWPs), cannot roll back actions if a failure occurs. For example, once a printer
prints a character or an ATM dispenses cash, those actions cannot be undone.

Modeling the Outside World: To simplify understanding how rollback-recovery protocols


interact with external systems, the outside world is modeled as a special process, OWP, that
communicates with the distributed system via message passing.
Consistency and Failures: The system must ensure that the outside world sees consistent
behavior, even in the event of a failure. This means that:

• Output Commit Problem: Before sending output to the OWP, the system must ensure
that the state from which the output is sent can be recovered after any future failure.
• Handling Input Messages:(Input reproducibility) Input messages from the OWP
might not be reproducible during recovery, so these must be saved to stable storage
before the application processes them. This ensures they can be replayed in case of a
failure.


Different types of messages:

In a distributed system with multiple processes that communicate through messages, a process
failure and subsequent recovery can lead to inconsistencies in the handling of messages. These
inconsistencies arise because the rollback of processes for recovery may involve undoing the
sending or receiving of certain messages. The messages affected by the rollback fall into
several categories, as illustrated in the example from Figure 13.3

Here’s a summary of the types of messages involved:

1. In-Transit Messages:
• These are messages that have been sent but not yet received at the time of recovery. In
Figure 13.3, messages m2 and m5 are examples of in-transit messages.
• In-transit messages do not cause inconsistencies as long as the system guarantees their
delivery(reliable). In systems with reliable communication channels, they must be
delivered after recovery. In systems with lossy communication channels, they might be
lost.

2. Lost Messages:
• These occur when the sending of a message is not undone by the rollback, but the
receiving of it is undone. In this case, the sender does not roll back to a state before the
send operation, while the receiver does roll back to a state before it receives the
message.
• In the example, m1 is a lost message because its receipt was undone during recovery.

3. Delayed Messages:
• These are messages whose receipt is not recorded because the receiving process was
either down or the message arrived after the rollback.
• For instance, m2 and m5 in the example are delayed messages. They arrived after the
receiving process had rolled back.

4. Orphan Messages:(receive(M1) recorded without recording send(M1))


• Orphan messages occur when the receipt of a message is recorded, but the send
operation is not because the sending process rolled back to a state before it sent the
message.
• Orphan messages do not arise if the system is rolled back to a consistent global state
(where all process states are aligned).

5. Duplicate Messages:
• These occur due to message logging and replaying during process recovery. After a
rollback, both the sending and receiving of a message may be undone. When the sender
process restarts, it resends the message, potentially causing a duplicate.
• In the example, m5 could become a duplicate if it is resent after rollback.

Issues in failure recovery

In the context of failure recovery in a distributed system, several issues arise regarding message
handling when processes fail and rollback to a prior checkpoint.

During recovery, the system must handle various types of messages that are left in abnormal
states, such as delayed, lost, orphan, and duplicate messages.

The below scenario involves three processes Pi, Pj, and Pk exchanging messages over fault-
free, FIFO communication channels. The failure recovery issues are illustrated in Figure 13.4,
where process Pi fails, and the system must restore consistency.

Key Points:
1. Process Rollback and Orphan Messages:
o After Pi fails, it rolls back to checkpoint Ci1, and Pj rolls back to Cj1 to handle
the orphan message H (receive event recorded at Pj, but send event undone at
Pi).
o Pk then rolls back to Ck1 to eliminate another orphan message I created by Pj
’s rollback.
2. Types of Messages:


o Normal Messages: These are messages whose send and receive events have
been recorded, and they remain consistent after the rollback. In this example,
A, B, and J are normal messages.
o Vanished Messages: Messages whose send and receive events have been fully
undone by the rollback. In the example, G, H, and I are vanished messages.
o Delayed Messages: Messages that were in transit during the failure. For
instance, C was in transit and might arrive at different times during or after
recovery. Each scenario needs proper handling.
o Lost Messages: Messages whose send events are recorded but the receive
events are undone due to the rollback. D is a lost message because it was sent
by Pj, but the receive event at Pi was undone.
o Delayed Orphan Messages: Messages like E and F, where the send events
have been undone by rollback but may still arrive at their destination. These
must be discarded since the system will regenerate them.
3. Handling of Problematic Messages:
o Lost Messages: Process logs are used to replay lost messages, but this can lead
to duplicate messages, such as when Pj resends J, which Pk has already
received.
o Duplicate Messages: Duplicate messages arise from the replay of logs during
recovery and must be handled to maintain system consistency.
o Delayed Messages: Messages in transit during failure, like C, must be carefully
managed depending on when they arrive (before, during, or after recovery).
4. Overlapping Failures:
o Overlapping failures complicate the recovery process. For example, if Pj starts
recovery in response to Pi 's failure but then fails itself, it may "forget" about Pi
's failure. This introduces further inconsistencies, requiring mechanisms to
address overlapping failures.

----------------------------------------------------------------------------------------------------------------

Checkpoint-based Recovery

Checkpoint-based recovery involves periodically saving the state of each process
and its communication channels to enable system restoration to a globally
consistent state after a failure.

This method does not depend on the Piecewise Deterministic (PWD) assumption
and avoids logging or replaying non-deterministic events. While simpler than
log-based rollback recovery, it does not ensure deterministic re-execution after
rollback, making it unsuitable for applications with frequent external
interactions.

Checkpoint-based techniques are divided into three categories:
uncoordinated, coordinated, and communication-induced checkpointing.

1. Uncoordinated Checkpointing:
Each process independently decides when to take checkpoints without coordinating with other
processes. This autonomy reduces runtime overhead and allows processes to checkpoint at
convenient times. However, it introduces several issues:

• Domino Effect: During recovery, multiple processes might need to roll back to earlier
checkpoints, potentially losing a significant amount of useful work.
• Slow Recovery: Processes must iterate through checkpoints to find a globally
consistent state, increasing recovery time.
• Useless Checkpoints: Some checkpoints may never contribute to recovery and incur
unnecessary overhead.
• Multiple Checkpoints: Processes must store several checkpoints and periodically run
garbage collection to discard obsolete ones.
• Not Suitable for Frequent Output Commit Applications: These applications require
global coordination, reducing the benefits of autonomous checkpointing.

To handle dependencies between processes, checkpoints track message exchanges. When a
process sends a message, it includes its checkpoint identifier, and the receiving process records
the dependency between its own checkpoint and the sender's. In the event of a failure,
processes share their saved dependencies to determine a consistent global recovery line.
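A minimal sketch of this dependency recording is shown below, assuming each message
piggybacks the sender's identity and current checkpoint index; the Process class, its field
names, and the triple format of the dependency records are illustrative choices, not a
prescribed data structure.

```python
# Sketch of dependency recording for uncoordinated checkpointing.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.ckpt_index = 0                 # current checkpoint interval
        self.dependencies = set()           # {(sender, sender_interval, my_interval)}

    def take_checkpoint(self):
        # saving state to stable storage is omitted; just start a new interval
        self.ckpt_index += 1

    def send(self, payload):
        # piggyback the sender id and its current checkpoint interval
        return {"from": self.pid, "interval": self.ckpt_index, "payload": payload}

    def receive(self, packet):
        # the current interval now depends on the sender's interval
        self.dependencies.add((packet["from"], packet["interval"], self.ckpt_index))
        return packet["payload"]


p0, p1 = Process("P0"), Process("P1")
m = p0.send("m1")
p1.receive(m)
print(p1.dependencies)
# {('P0', 0, 0)}: P1's interval 0 depends on P0's interval 0, which is exactly
# the information the recovery initiator collects to compute the recovery line.
```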

How is a consistent global state determined?

Figure 13.5: Checkpoint index and checkpoint interval, illustrates the dependencies between
checkpoints in a system with uncoordinated checkpointing.


Rollback and Recovery:


• In case of a failure:
1. The recovering process broadcasts a dependency request message to all
processes, asking for their saved dependency information.
2. When a process receives this request, it stops its execution and responds with:
▪ Dependency information stored on stable storage.
▪ Any dependency information associated with the process's current state.
3. The initiator collects this global dependency information and calculates the
recovery line.
4. The initiator broadcasts a rollback request containing the recovery line.
Processes that are part of the recovery line resume execution. Processes outside
the recovery line roll back to an earlier checkpoint.

2. Coordinated Checkpointing:

Types:

i)Blocking

ii)Non-blocking

i)Blocking coordinated checkpointing:



In coordinated checkpointing, blocking communication simplifies synchronization by preventing


orphan messages, ensuring that processes are in a consistent state. The outlined approach involves a
two-phase commit protocol:

1. Initiation: The coordinator initiates checkpointing by taking its checkpoint and broadcasting a
request for all processes to take a checkpoint.
2. Checkpoint and Acknowledgment: Upon receiving this request, each process halts execution,
flushes its communication channels (to clear any in-transit messages), takes a tentative
checkpoint, and sends an acknowledgment to the coordinator.
3. Commit Phase: Once the coordinator gathers acknowledgments from all processes, it sends a
commit message. Each process then finalizes the checkpoint (making the tentative one
permanent), discards any old checkpoints, and resumes normal execution.

Drawbacks of Blocking Checkpointing


The main downside of this approach is the blocking of computation, as all processes must pause until
the entire checkpoint process is complete. This can lead to significant delays in distributed systems,
especially if there are many processes or if the checkpointing process itself is time-consuming. Non-
blocking checkpointing schemes, which allow computation to continue during checkpointing, are
generally more desirable for systems that require high availability and minimal interruptions.

ii) Non-blocking Coordinated Checkpointing:


In non-blocking coordinated checkpointing, processes can continue their execution while
taking checkpoints. A common issue arises when processes receive application messages
during the checkpointing process, which can lead to inconsistent checkpoints.

1. Issue of Inconsistent Checkpoints (Figure 13.6a)

In a distributed system, different processes communicate with each other. If one process, say
P0, sends a message after it takes a checkpoint, but another process, say P1, hasn't yet received
the checkpoint request, then P1 might receive the message before taking its checkpoint. This
leads to an inconsistent state:

• Process P0's checkpoint will not show the message as having been sent.
• Process P1's checkpoint will show the message as received.


This inconsistency complicates recovery, as different processes have different views of what
has happened in the system.

Solution to the above inconsistent checkpoints is available for both FIFO and non-FIFO
channels.

FIFO Channels Solution (Figure 13.6(b))

The problem can be avoided if the communication channels between processes are FIFO
(First-In-First-Out). In this case:

• Each process sends a checkpoint request (marker) before sending any message after
its checkpoint.
• When a process receives the checkpoint request, it takes its own checkpoint before
processing any further messages.

This guarantees that no messages will be recorded inconsistently across checkpoints.

Example: Chandy-Lamport Snapshot Algorithm

The Chandy-Lamport algorithm is a non-blocking, coordinated checkpointing protocol. Here’s how


it works:

1. An initiator process begins the checkpointing by taking a checkpoint and sending a marker
to all the other processes it communicates with.
2. When a process receives this marker, it:
o Takes a checkpoint (if it hasn’t already),
o Sends a marker to all the processes it communicates with before sending any other
messages.
3. The system continues to operate while the checkpointing process occurs in the background,
ensuring that each process records a consistent global state.

This protocol works seamlessly when channels are reliable and FIFO, meaning messages are received
in the order they were sent.
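The marker rule, seen from a single process, can be sketched as follows. This is a simplified
illustration: the network object, method names, and message format are assumptions, the
initiator simply calls take_checkpoint_and_propagate() itself, and the recording of in-transit
channel state (which the full Chandy-Lamport snapshot also performs) is omitted.

```python
# Sketch of the marker rule from one process's point of view, assuming
# 'network' is any object with a send(dst, msg) method over reliable FIFO channels.

class SnapshotProcess:
    def __init__(self, pid, neighbours, network):
        self.pid = pid
        self.neighbours = neighbours
        self.network = network
        self.checkpointed = False

    def take_checkpoint_and_propagate(self):
        if self.checkpointed:
            return                          # checkpoint at most once per snapshot
        self.checkpointed = True
        # record the local state here (omitted), then send a marker on every
        # outgoing channel BEFORE any further application message
        for n in self.neighbours:
            self.network.send(n, {"type": "MARKER", "from": self.pid})

    def on_message(self, msg):
        if msg["type"] == "MARKER":
            self.take_checkpoint_and_propagate()
        else:
            # an application message; on FIFO channels it can never overtake a
            # marker, so it is always recorded consistently with the checkpoints
            self.handle_application(msg)

    def handle_application(self, msg):
        pass                                # application-specific processing
```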

Non-FIFO Channels Solution

When channels are non-FIFO, two approaches are used:

1. Piggyback Markers: Each post-checkpoint message carries a piggybacked marker.


When a process receives such a message, it acts as if it received both the marker and
the application message, ensuring checkpoints are taken correctly.
2. Checkpoint Indices: Alternatively, processes can attach checkpoint indices to
messages. If a process receives a message with a higher checkpoint index than its own,
it takes a checkpoint to keep up with the sender’s state.
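The checkpoint-index rule can be sketched as a small receive handler: if an incoming message
carries a higher index than the receiver's, the receiver takes a forced checkpoint and catches up
before delivering the message. The dictionary-based process state and field names below are
illustrative.

```python
# Sketch of the checkpoint-index rule for non-FIFO channels.

def take_checkpoint(process):
    # stand-in for saving the process state to stable storage
    process.setdefault("checkpoints", []).append(process["ckpt_index"])

def on_receive(process, packet):
    if packet["ckpt_index"] > process["ckpt_index"]:
        take_checkpoint(process)                       # forced checkpoint first
        process["ckpt_index"] = packet["ckpt_index"]   # catch up with the sender
    print(f"{process['pid']} delivers {packet['payload']}")


p = {"pid": "P1", "ckpt_index": 2}
on_receive(p, {"ckpt_index": 3, "payload": "m7"})
print(p["ckpt_index"])   # 3: a forced checkpoint was taken before delivery
```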

Note: Impossibility of min-process non-blocking checkpointing:

A min-process, non-blocking checkpointing algorithm is one that forces only a minimum


number of processes to take a new checkpoint, and at the same time it does not force any
process to suspend its computation. Clearly, such checkpointing algorithms will be very
attractive. Cao and Singhal showed that it is impossible to design a min-process, non-blocking
checkpointing algorithm


Z-dependency essentially tracks message dependencies between processes across checkpoint
intervals. It helps in identifying which processes should checkpoint to maintain a consistent
state, minimizing the number of checkpoints required. However, because receivers lack full
knowledge of dependencies, it is impossible to maintain this consistency in a non-blocking
way, hence the impossibility of a min-process, non-blocking checkpointing algorithm (i.e.,
z-dependency tracking works only with blocking checkpointing).

3. Communication-induced checkpointing:

Communication-induced checkpointing (CIC) is a strategy used to prevent the domino effect


in distributed systems, allowing processes to take some of their checkpoints independently,
while also being forced to take additional checkpoints when necessary. This approach ensures
a consistent global recovery line without requiring the exchange of explicit coordination
messages, unlike coordinated checkpointing.

How will a process take checkpoints?

• Autonomous (local) checkpoints: Taken independently by the process.


• Forced checkpoints: Triggered by information piggybacked on application messages,
requiring the process to take a checkpoint before processing the message.

Types of communication-induced checkpointing:

i) Model-based checkpointing

ii) Index-based checkpointing

i) Model-based checkpointing:

How it works:
Processes monitor their communication patterns and take forced checkpoints whenever there's a risk
of creating an inconsistent recovery state. No coordination messages are exchanged; instead, decision-
making happens locally based on system behavior.


Key Idea:
A process takes a forced checkpoint when it detects that its communication might lead to
an inconsistent state later. This ensures that checkpoints always lead to a domino-effect-free
recovery, where only necessary rollbacks happen.

In the MRS model (Mark, Send, and Receive), the goal is to avoid the domino effect by
ensuring that message-sending events always occur before message-receiving events within
the same checkpoint interval.

Here’s a clearer explanation of the MRS model:

• The model forces processes to take a checkpoint before sending a message if no


checkpoint has been taken since the last message was received. This ensures that all
outgoing messages from a process happen after the most recent checkpoint.

Example: :
Imagine a group of people playing a multiplayer online game. Every time one player fights
another, the game ensures both players’ progress is saved (checkpointed) so that if one player
disconnects, both players can resume from the same point after recovery. The game
automatically forces checkpoints before big events like battles, ensuring the game doesn’t
get confused about the state of each player.

Best for:
Systems with complex communication patterns where dynamic, real-time interactions
between processes need to be closely monitored, like multiplayer gaming or real-time
simulations.

ii) Index-based checkpointing:

• How it works:
Checkpoints are assigned increasing numbers (indexes), and processes force
checkpoints when they receive a message with a higher index than their current
checkpoint. This way, all processes maintain checkpoints that are globally consistent.
• Key Idea:
Processes only take forced checkpoints when their checkpoint index is lower than the
one attached to a received message, ensuring all processes stay in sync. This reduces
unnecessary checkpoints.
• Example (easy explanation):
Imagine several bank branches processing transactions. Each branch takes a
snapshot (checkpoint) after a few transactions. If one branch finishes processing
and sends an update (with a checkpoint number) to another branch, the receiving
branch checks its own checkpoint number. If its checkpoint is behind, it takes a
new one to keep both branches on the same page. This prevents confusion about
which transactions have been processed if a failure happens.
• Best for:
Systems where checkpoints can be tracked using a simple number system, like
transactional databases or financial systems.


Koo–Toueg coordinated checkpointing algorithm

The Koo–Toueg coordinated checkpointing algorithm is designed for distributed systems to


ensure consistent global checkpoints while avoiding issues like the domino effect during
recovery. The key idea is to have all processes in the system take coordinated checkpoints such
that the set of all checkpoints forms a consistent global state.

Types of Checkpoints:

o Permanent Checkpoints: Local checkpoints at processes that form part of a


consistent global state.
o Tentative Checkpoints: Temporary checkpoints that will only become
permanent if the entire system successfully completes the checkpointing
process.

Assumptions:

o The system uses FIFO communication channels.


o End-to-end protocols are in place to handle message loss and communication
failures.
o No network partitioning, and no process failures during checkpointing.

i) The checkpointing algorithm:

Phases of the Algorithm

First Phase: Taking Tentative Checkpoints

1. Initiation: A process (Pi) initiates the algorithm by taking a tentative checkpoint and
sends requests to all other processes to do the same.
2. Response Handling: Each process either takes a tentative checkpoint and informs Pi
of success, or it replies with a "no" if it cannot take a tentative checkpoint.
o If Pi gets success responses from all processes, it decides to make all
checkpoints permanent.
o If any process fails, Pi discards the tentative checkpoints.

Second Phase: Decision Propagation

1. Decision Broadcast: Pi sends its decision to all processes, informing them whether the
tentative checkpoints should be made permanent or discarded.
2. Action: Upon receiving Pi's decision:
o If the decision is to make the checkpoints permanent, each process converts its
tentative checkpoint to a permanent one.
o If the decision is to discard the checkpoints, each process discards its tentative
checkpoint.

Key Benefits
• Avoids the domino effect: Ensures that processes do not have to roll back repeatedly,
thereby avoiding cascading rollbacks.
• Consistency: All or none of the processes advance their checkpoints, ensuring global
consistency.
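The initiator's side of the two phases described above can be sketched as follows; the helper
functions ask() and tell() stand in for the request/acknowledgement and decision messages, the
Proc class is a stub, and failures during the protocol itself are not modelled.

```python
# Sketch of the initiator's side of the two-phase Koo-Toueg checkpointing flow.

class Proc:
    def __init__(self, name): self.name = name
    def take_tentative_checkpoint(self): print(f"{self.name}: tentative checkpoint taken")
    def make_checkpoint_permanent(self): print(f"{self.name}: checkpoint made permanent")
    def discard_tentative_checkpoint(self): print(f"{self.name}: tentative checkpoint discarded")

def ask(p, request):                 # phase 1: "take a tentative checkpoint?"
    p.take_tentative_checkpoint()
    return "yes"                     # a real process could answer "no"

def tell(p, decision):               # phase 2: propagate and apply the decision
    if decision == "COMMIT":
        p.make_checkpoint_permanent()
    else:
        p.discard_tentative_checkpoint()

def coordinated_checkpoint(initiator, others):
    # Phase 1: initiator takes a tentative checkpoint and asks everyone else to
    initiator.take_tentative_checkpoint()
    willing = all(ask(p, "TAKE_TENTATIVE_CHECKPOINT") == "yes" for p in others)
    # Phase 2: decision propagation
    decision = "COMMIT" if willing else "ABORT"
    for p in others:
        tell(p, decision)
    if willing:
        initiator.make_checkpoint_permanent()
    else:
        initiator.discard_tentative_checkpoint()
    return decision


print(coordinated_checkpoint(Proc("Pi"), [Proc("Pj"), Proc("Pk")]))
```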


Correctness of the Algorithm


The correctness of the Koo–Toueg coordinated checkpointing algorithm is guaranteed for two
key reasons:

1. All-or-none checkpointing: Either all processes take a permanent checkpoint, or none


do. This ensures that the system remains in a globally consistent state after
checkpointing. No partial checkpoint advancement is allowed, which prevents
inconsistency.
2. No messages after tentative checkpoints: Once a process has taken a tentative
checkpoint, it cannot send messages related to the computation until it receives a
decision from the initiating process. This restriction prevents the occurrence of a
situation where there is a record of receiving a message but no record of sending it (a
classic source of inconsistency). Therefore, the checkpoint set will always be
consistent.

These two aspects ensure that the algorithm avoids situations that could lead to an
inconsistent global state, such as the domino effect or orphan messages.

Optimization

While the protocol guarantees correctness, it may result in unnecessary checkpoints being
taken, which can be inefficient. Checkpointing is computationally expensive, so the goal is
to minimize unnecessary checkpoints.

Fig. 13.11 note: Checkpoint Z2 can be avoided as the process is not involved in the sending or
receiving of message M. Thus, the number of checkpoints can be reduced.

ii) The rollback recovery algorithm


The algorithm ensures that, in the event of a failure, all processes in the system restart from a
consistent state (i.e., the most recent set of permanent checkpoints). The process assumes the
existence of checkpoints and coordinates the rollback procedure to avoid inconsistency.


Phases of the Rollback Recovery Algorithm:


First Phase: Check for Willingness to Roll Back

1. Initiation:
o A process Pi initiates the rollback recovery by sending a message to all other
processes in the system, requesting them to check whether they are willing to
restart from their most recent permanent checkpoints.
2. Responses from Processes:
o Each process replies either "yes" or "no":
▪ Yes: The process agrees to roll back to its previous checkpoint.
▪ No: The process is unable or unwilling to roll back, possibly due to being
involved in another ongoing checkpointing or recovery process.
3. Decision:
o If Pi receives "yes" responses from all processes, it decides that all processes
should roll back to their previous checkpoints.
o If any process replies "no", Pi aborts the rollback attempt and may try again
later.

Second Phase: Propagation of the Decision

1. Decision Propagation:
o Once Pi makes a decision, it propagates this decision to all other processes:
▪ If the decision is to roll back, the processes must restart from their
previous permanent checkpoints.
▪ If the rollback is aborted, the processes do not change their state, and no
rollback occurs.
2. Process Actions:
o Upon receiving the decision from Pi, each process performs the appropriate
action:
▪ If instructed to roll back, the process rolls back to its last permanent
checkpoint and resumes execution from there.
▪ If the rollback is aborted, the process continues from its current state.
3. Message Sending Restriction:
o While waiting for the rollback decision, processes are not allowed to send
messages related to the underlying computation. This restriction is necessary to
ensure that no inconsistent messages are sent during the rollback process.

Correctness

The correctness of the rollback recovery algorithm is guaranteed by the following:

1. Consistent Restart:
o All processes restart from a consistent global state, as the checkpointing
algorithm ensures that all checkpoints form a consistent set.
o Since either all processes roll back to their previous checkpoints or none do,
no inconsistent state can arise during the recovery process.
2. No Inconsistent Messages:
o The restriction on message sending during the recovery process ensures that no
messages are sent based on an inconsistent state, preventing issues such as


orphan messages (messages received without the corresponding send event


recorded).

Algorithm for Asynchronous Checkpointing and Recovery (Juang–Venkatesan)

The Juang–Venkatesan algorithm is a popular approach for asynchronous checkpointing and


recovery in distributed systems. In systems with processes communicating over messages, the
primary aim of this algorithm is to enable fault tolerance by allowing processes to take
checkpoints independently without requiring global coordination.

Assumptions:
The system uses reliable, FIFO-ordered communication channels with infinite buffers.
Message transmission delays are arbitrary but finite.

To facilitate recovery after a process failure and restore the system to a consistent state, two
types of log storage are maintained, volatile log and stable log. Accessing the volatile log takes
less time than accessing the stable log, but the contents of the volatile log are lost if the
corresponding processor fails. The contents of the volatile log are periodically flushed to the
stable storage.


Algorithm:

Note: the sender of the ROLLBACK() message is Y.

ROLLBACK(process ID of the recovering process, count of messages the recovering process has
sent to the destination, according to its restored state)

Receivers of the ROLLBACK() message: X and Z.

At node X:
Compute rcvd = 3 (total number of messages received so far by node X from Y).

At node Z:
Compute rcvd = 2.

If (sent < rcvd)
then roll back to the previous event and make it the new checkpoint.

Explanation:

In diagram 13.14, when process Y fails it sends ROLLBACK(Y,2) to process X and
ROLLBACK(Y,1) to process Z.

At X:
if (sent < rcvd), i.e., if (2 < 3),
then X rolls back to event ex2 and makes it the new checkpoint of process X.

Similar steps are executed at Z.
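The orphan-message test that each receiver of ROLLBACK performs can be sketched as below.
The event-log layout, field names, and the function handle_rollback are illustrative, but the
comparison (roll back to the latest event whose received-from-Y count does not exceed the count
Y reports having sent) follows the steps above.

```python
# Sketch of the orphan-message test performed on receiving ROLLBACK(y, sent_by_y).
# Assumption: each logged event records, per sender, how many messages had been
# received from that sender up to the event (RCVD counters).

def handle_rollback(event_log, y, sent_by_y):
    """event_log: events in order, each {"name": str, "rcvd": {pid: count}}."""
    latest = event_log[-1]
    if latest["rcvd"].get(y, 0) <= sent_by_y:
        return latest["name"]               # no orphan messages from y
    # otherwise roll back to the latest event whose received-from-y count does
    # not exceed what y reports having sent, and treat it as the new checkpoint
    for event in reversed(event_log):
        if event["rcvd"].get(y, 0) <= sent_by_y:
            return event["name"]
    return "initial checkpoint"


# Figure 13.14-style numbers: X has received 3 messages from Y, but the
# recovering Y reports only 2 sends, so X rolls back to e_x2.
x_log = [
    {"name": "e_x1", "rcvd": {"Y": 1}},
    {"name": "e_x2", "rcvd": {"Y": 2}},
    {"name": "e_x3", "rcvd": {"Y": 3}},
]
print(handle_rollback(x_log, "Y", sent_by_y=2))   # -> e_x2
```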

