Distributed Chapter 4
Distributed Chapter 4
org/agreement-protocol-in-
distributed-systems/ Agreement Protocols in DS
The Byzantine Agreement Problem occurs in distributed systems where processes must reach a
consensus despite the presence of faulty or malicious processes (called Byzantine faults). These
faulty processes may send incorrect or conflicting messages to disrupt the system.
Key Characteristics:
Applications:
Distributed databases.
Blockchain systems (e.g., Bitcoin, Ethereum).
Fault-tolerant systems like flight control systems.
Consensus Problem
The Consensus Problem is a broader challenge in distributed systems where all processes
(faulty or not) must agree on a single value. Unlike the Byzantine Agreement, it does not
specifically focus on malicious faults.
Key Properties:
The Consensus Problem assumes non-malicious faults like crashes or delays, while Byzantine
Agreement handles malicious faults.
Key Requirements:
Example:
A classic solution to the Byzantine Agreement Problem is the Byzantine Generals Algorithm
(based on message passing). One popular approach is Lamport, Shostak, and Pease's
algorithm.
Key Assumptions:
Steps:
1. Initialization: Each process proposes a value and sends it to all other processes.
2. Message Exchange: Processes exchange messages to share their received values:
o Each process collects messages from all other processes.
o This step repeats f+1 times to ensure enough information is gathered to identify faults.
3. Decision:
o Each process applies a majority rule or predefined decision rule to determine the agreed
value.
o Faulty or conflicting values are discarded, ensuring consistency.
Example:
In a system of 7 processes (n=7), up to 2 can be Byzantine (f=2). The algorithm ensures that all
non-faulty processes agree on a consistent value.
Limitations:
Summary Table
Interactive Agree on a vector of values from all Handles Byzantine Similar to Byzantine
Consistency processes. faults. Agreement.
For exam writing, explain each problem with simple definitions, examples, and the main steps
of the solution (e.g., the Byzantine Generals Algorithm). Use diagrams or examples of message
exchange for clarity.
4o
Distributed Transactions
Distributed transaction refers to a transaction in which multiple servers are involved.
Multiple servers are called by a client in Simple Distributed Transaction whereas a server
calls another server in Nested Transaction. The execution of a transaction at many sites,
must either be committed at all sites or aborted at all sites. However, this should not be
the case when a transaction is committed at one site and aborted at another site.
Distributed site systems use distributed commitment rules to ensure atomicity across
sites. Atomic commitment is a channel of need for cooperation across a variety of
systems.
n distributed systems, recovery mechanisms are crucial for handling failures and maintaining
system reliability. The two main approaches are forward recovery and backward recovery:
1. Forward Recovery:
Forward recovery focuses on correcting the errors and continuing operations from the
current state. Instead of undoing previous actions, this method applies corrective or
compensating measures to address the impact of failures. Forward recovery is commonly
used in scenarios where preserving progress is critical, such as financial transactions or
real-time applications. However, it is complex to implement as it requires a detailed
understanding of the failure and its effects on the system.
2. Backward Recovery:
Backward recovery restores the system to a previously saved consistent state by rolling
back all changes made after the failure. This approach relies on techniques like
checkpoints or logs to return to a known good state. It is simpler and widely used in
systems like databases, where maintaining data consistency is essential. However,
backward recovery may lead to the loss of recent progress made before the failure
occurred.
DataBase Recovery
In order to recuperate from database failure, database management systems resort to a number
of recovery management techniques. In this chapter, we will study the different approaches for
database recovery. The typical strategies for database recovery are.
In case of soft failures that result in inconsistency of database, recovery strategy includes
transaction undo or rollback. However, sometimes, transaction redo may also be adopted to
recover to a consistent state of the transaction.
In case of immediate update mode, the recovery manager takes the following actions −
Transactions which are in active list and failed list are undone and written on the abort
list.
Transactions which are in before-commit list are redone.
No action is taken for transactions in commit or abort lists.
In case of deferred update mode, the recovery manager takes the following actions −
Transactions which are in the active list and failed list are written onto the abort list. No
undo operations are required since the changes have not been written to the disk yet.
Transactions which are in before-commit list are redone.
No action is taken for transactions in commit or abort lists.
The transactions in the commit list and before-commit list are redone and written onto
the commit list in the transaction log.
The transactions in the active list and failed list are undone and written onto the abort list
in the transaction log.
Explore our latest online courses and learn new skills at your own pace. Enroll and become a
certified expert to boost your career.
Checkpointing
Checkpoint is a point of time at which a record is written onto the database from the buffers. As
a consequence, in case of a system crash, the recovery manager does not have to redo the
transactions that have been committed before checkpoint. Periodical checkpointing shortens the
recovery process.
Consistent checkpointing
Fuzzy checkpointing
Consistent Checkpointing
If in step 4, the transaction log is archived as well, then this checkpointing aids in recovery from
disk failures and power failures, otherwise it aids recovery from only power failures.
Fuzzy Checkpointing
In fuzzy checkpointing, at the time of checkpoint, all the active transactions are written in the log.
In case of power failure, the recovery manager processes only those transactions that were
active during checkpoint and later. The transactions that have been committed before checkpoint
are written to the disk and hence need not be redone.
Example of Checkpointing
Let us consider that in system the time of checkpointing is tcheck and the time of system crash is
tfail. Let there be four transactions Ta, Tb, Tc and Td such that −
UNDO all faulty transactions and transactions that may be affected by the faulty
transactions.
REDO all transactions that are not faulty but have been undone due to the faulty
transactions.
1. Fault Detection
Fault Detection is the first phase where the system is monitored continuously. The
outcomes are being compared with the expected output. During monitoring if any faults
are identified they are being notified. These faults can occur due to various reasons such
as hardware failure, network failure, and software issues. The main aim of the first phase
is to detect these faults as soon as they occur so that the work being assigned will not be
delayed.
2. Fault Diagnosis
Fault diagnosis is the process where the fault that is identified in the first phase will be
diagnosed properly in order to get the root cause and possible nature of the faults. Fault
diagnosis can be done manually by the administrator or by using automated Techniques
in order to solve the fault and perform the given task.
3. Evidence Generation
Evidence generation is defined as the process where the report of the fault is prepared
based on the diagnosis done in an earlier phase. This report involves the details of the
causes of the fault, the nature of faults, the solutions that can be used for fixing, and other
alternatives and preventions that need to be considered.
4. Assessment
Assessment is the process where the damages caused by the faults are analyzed. It can
be determined with the help of messages that are being passed from the component that
has encountered the fault. Based on the assessment further decisions are made.
5. Recovery
Recovery is the process where the aim is to make the system fault free. It is the step to
make the system fault free and restore it to state forward recovery and backup recovery.
Some of the common recovery techniques such as reconfiguration and resynchronization
can be used.
Each slave sends a ‘DONE’ message to the controlling site after each slave has
completed its transaction.
After sending the ‘DONE’ message, the slaves start waiting for a ‘Commit’ or ‘Abort’
response from the controlling site.
After receiving the ‘DONE’ message from all the slaves, then the controlling site
decides whether they have to commit or abort. Here the confirmation of the message is
done. Then it sends a message to every slave.
Then after the slave performs the operation as instructed by the controlling site, they
send an acknowledgement to the controlling site.
One-Phase Commit Protocol
Two-Phase Commit
It is the second type of commit protocol in DBMS. It was introduced to reduce the
vulnerabilities of the one phase commit protocol. There are two phases in the two-phase
commit protocol.
Prepare Phase
Each slave sends a ‘DONE’ message to the controlling site after each slave has
completed its transaction.
After getting 'DONE' message from all the slaves, it sends a "prepare" message to all the
slaves.
Then the slaves share their vote or opinion whether they want to commit or not. If a slave
wants to commit, it sends a message which is "Ready".
If the slaves doesn't want to commit, it then sends a "Not Ready" message.
Prepare Phase
Commit/Abort Phase
Controlling Site, after receiving "Ready" message from all the slaves
The controlling site sends a message "Global Commit" to all the slaves. The message
contains the details of the transaction which needs to be stored in the databases.
Then each slave completes the transaction and returns an acknowledgement message
back to the controlling site.
The controlling site after receiving acknowledgement from all the slaves, which means
the transaction is completed.
Commit/Abort Phase
Dynamic voting protocols extend this concept by adapting to changes in the system, such as
node failures or recovery. In these protocols, the size of the quorum required for reads and writes
is adjusted dynamically based on the availability of nodes. This flexibility enhances fault
tolerance, as the system can continue functioning even with fewer operational nodes. Dynamic
voting ensures that decisions remain consistent despite changes in the system state, making it
suitable for highly dynamic and unreliable environments. However, implementing dynamic
voting can be complex due to the need to track and manage node states in real time.
4o