Distributed Chapter 4

Uploaded by Sahil Chourasiya

Agreement Protocols in Distributed Systems
(Source: https://www.geeksforgeeks.org/agreement-protocol-in-distributed-systems/)

Byzantine Agreement Problem

The Byzantine Agreement Problem occurs in distributed systems where processes must reach a
consensus despite the presence of faulty or malicious processes (called Byzantine faults). These
faulty processes may send incorrect or conflicting messages to disrupt the system.

Key Characteristics:

 There are n processes, and up to f of them may behave maliciously.


 The goal is for all non-faulty processes to agree on the same value, ensuring system consistency.

Conditions for Byzantine Agreement:

1. Agreement: All non-faulty processes must agree on the same value.


2. Validity: If all non-faulty processes propose the same value, that value must be chosen.
3. Fault Tolerance: The algorithm must work even if up to f processes are faulty.

Applications:

 Distributed databases.
 Blockchain systems (e.g., Bitcoin, Ethereum).
 Fault-tolerant systems like flight control systems.

Consensus Problem

The Consensus Problem is a broader challenge in distributed systems where all processes
(faulty or not) must agree on a single value. Unlike the Byzantine Agreement, it does not
specifically focus on malicious faults.

Key Properties:

1. Agreement: All processes must agree on the same value.


2. Termination: All processes must decide within a finite time.
3. Integrity: If a process decides a value, that value must have been proposed by at least one
process.

Difference from Byzantine Agreement:

 The Consensus Problem assumes non-malicious faults like crashes or delays, while Byzantine
Agreement handles malicious faults.

Interactive Consistency Problem

The Interactive Consistency Problem is a specialized version of the Byzantine Agreement


Problem, where each process needs to agree on a vector of values from all processes.

Key Requirements:

 Every non-faulty process must agree on the same vector of values.


 The value for a non-faulty process must be its original input value.
 Faulty processes may provide conflicting inputs, but agreement must still be achieved.

Example:

 A network of nodes exchanging state information for consistent updates.
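The target property can be illustrated in the failure-free case: every process broadcasts its input, and each process assembles the same vector, whose entries match the senders' original values. This is only a sketch of the desired outcome (the function name and message model are illustrative), not a fault-tolerant algorithm:

```python
def exchange_values(inputs):
    """Failure-free illustration of interactive consistency:
    every process broadcasts its input and assembles the vector
    of all inputs, so all resulting vectors are identical and
    each entry equals the sender's original value."""
    n = len(inputs)
    # vectors[i] is process i's view of what every process proposed
    vectors = [[inputs[j] for j in range(n)] for _ in range(n)]
    return vectors

inputs = ["a", "b", "c", "d"]
vectors = exchange_values(inputs)
# Every non-faulty process ends up with the same vector,
# and each entry is the corresponding process's own input.
assert all(v == inputs for v in vectors)
```

With Byzantine processes present, achieving this same outcome requires the full agreement protocol for every vector entry.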

Solution to Byzantine Agreement Problem

A classic solution to the Byzantine Agreement Problem is the Byzantine Generals Algorithm
(based on message passing). One popular approach is Lamport, Shostak, and Pease's
algorithm.

Key Assumptions:

 There are n processes, and up to f are Byzantine.


 For the algorithm to succeed, the total number of processes n must satisfy n > 3f.

Steps:

1. Initialization: Each process proposes a value and sends it to all other processes.
2. Message Exchange: Processes exchange messages to share their received values:
o Each process collects messages from all other processes.
o This step repeats f+1 times to ensure enough information is gathered to identify faults.
3. Decision:
o Each process applies a majority rule or predefined decision rule to determine the agreed
value.
o Faulty or conflicting values are discarded, ensuring consistency.
Example:

In a system of 7 processes (n=7), up to 2 can be Byzantine (f=2). The algorithm ensures that all
non-faulty processes agree on a consistent value.
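As a rough illustration of the decision step only, the majority rule can be sketched as follows. This is not the full f+1-round exchange, just the final vote over the values one correct process has collected; the function name and tie-breaking default are illustrative assumptions:

```python
from collections import Counter

def majority_decision(received_values, default=0):
    """Decide on the most frequent value among the collected messages.
    If no value has a strict majority, fall back to a predefined
    default so that all correct processes decide identically."""
    value, count = Counter(received_values).most_common(1)[0]
    if count > len(received_values) // 2:
        return value
    return default

# n = 7 processes, f = 2 Byzantine (n > 3f holds): the two faulty
# values cannot outvote the five values from correct processes.
votes = [1, 1, 1, 1, 1, 0, 0]
print(majority_decision(votes))  # -> 1
```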

Limitations:

 High communication overhead.


 Requires f+1 rounds of message exchange.
 Assumes reliable communication between processes.

Summary Table

 Byzantine Agreement: goal is to achieve agreement despite malicious processes; handles Byzantine faults; example algorithm: Byzantine Generals Algorithm.
 Consensus Problem: goal is to agree on a value despite crashes or delays; handles crash faults and delays; example algorithms: Paxos, Raft.
 Interactive Consistency: goal is to agree on a vector of values from all processes; handles Byzantine faults; example algorithm: similar to Byzantine Agreement.

For exam writing, explain each problem with simple definitions, examples, and the main steps
of the solution (e.g., the Byzantine Generals Algorithm). Use diagrams or examples of message
exchange for clarity.


Distributed Transactions
Distributed transaction refers to a transaction in which multiple servers are involved. In a simple distributed transaction, a client calls multiple servers, whereas in a nested transaction, one server calls another server. A transaction executed at many sites must either be committed at all sites or aborted at all sites; it must never be committed at one site and aborted at another. Distributed systems use atomic commitment protocols to ensure this atomicity across sites, which requires cooperation among the participating systems.

Properties of the Distributed Transactions


The distributed transactions must possess the following ACID properties, like any other
transaction. In distributed systems, a transaction manager is used to coordinate the
distinct operations and then commit/rollback the transaction as needed.
 Atomicity: All data changes are treated as if they were a single operation. That is,
either all of the modifications are made, or none of them are made. The atomicity
feature assures that if a debit is successfully made from one account, the matching
credit is made to the other account in an application that transfers money from one
account to another.
 Consistency: This property implies that when a transaction begins and ends, the state
of data is consistent. For example, it ensures that the value remains consistent at the
start and end of a transaction in an application that transfers funds from one account to
another.
 Isolation: In this property, concurrently running transactions appear to be serialized.
For example, the isolation property assures that the transferred funds between two
accounts can be seen by another transaction in either one of the accounts, but not
both, or neither.
 Durability: Changes to data persist when a transaction completes successfully and are
not undone, even if the system fails. It assures that modifications made to each
account will not be reversed in an application that transfers money from one account to
another.
Coordination in Distributed Transactions
During coordination in distributed transactions, one of the servers becomes the
coordinator, and the rest of the servers become workers.
 In a simple transaction, the first server acts as the Coordinator.
 In the nested transaction, the top-level server acts as the Coordinator.
 Role of Coordinator: The coordinator keeps track of participating servers, gathers
results from workers, and makes a decision to ensure transaction consistency.
 Role of Workers: Workers are aware of the coordinator’s existence and in addition,
communicate their outcome to the coordinator and then follow the coordinator’s
decision.
Atomic Commit
The atomic commit procedure should meet the following requirements:
 All participants who make a choice reach the same conclusion.
 If any participant decides to commit, then all other participants must have voted yes.
 If all participants vote yes and no failure occurs, then all participants decide to commit.
Distributed One-Phase Commit
A one-phase commit protocol involves a coordinator that communicates with the servers,
repeatedly instructing them to perform or abort the transaction's actions.
Distributed Two-Phase Commit
There are two phases for the commit procedure to work:
Phase 1: Voting
 A “prepare message” is sent to each participating worker by the coordinator.
 The coordinator must wait until a response whether ready or not ready is received from
each worker, or a timeout occurs.
 Workers must wait until the coordinator sends the “prepare” message.
 If a transaction is ready to commit then a “ready” message is sent to the coordinator.
 If a transaction is not ready to commit, then a “no” message is sent to the coordinator,
resulting in the transaction being aborted.
Phase 2: Completion of the voting result
 In this phase, the coordinator checks the votes: a “commit” message is sent to each
worker only if every worker sent a “ready” message; otherwise, an “abort” message is
sent to each worker.
 Now, wait for acknowledgment until it is received from each worker.
 In this phase, Workers wait until the coordinator sends a “commit” or “abort” message;
then act according to the message received.
 At last, Workers send an acknowledgment to the Coordinator.
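The two phases above can be sketched in a few lines. The `Worker` interface (prepare/commit/abort) is a hypothetical stand-in for real participant servers, and timeouts, acknowledgments, and failure handling are omitted:

```python
class Worker:
    """Hypothetical participant: votes in phase 1, obeys in phase 2."""
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"
    def prepare(self):
        # Phase 1: reply "ready" if able to commit, "no" otherwise.
        return "ready" if self.can_commit else "no"
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(workers):
    """Coordinator side of 2PC (timeouts and crashes omitted)."""
    # Phase 1: voting - collect a vote from every worker.
    votes = [w.prepare() for w in workers]
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    # Phase 2: completion - broadcast the decision to every worker.
    for w in workers:
        w.commit() if decision == "commit" else w.abort()
    return decision

# One worker votes "no", so all workers must abort.
workers = [Worker(True), Worker(True), Worker(False)]
print(two_phase_commit(workers))  # -> abort
```

Note that the decision is all-or-nothing: a single “no” vote forces every worker to abort, which is exactly the atomic commit requirement.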

What is a Distributed File System?


A Distributed File System(DFS) is a networked file system that allows files to be stored,
accessed, and managed across multiple nodes or computers in a distributed manner.
Unlike traditional file systems that are typically localized to a single machine or server, a
distributed file system spans multiple machines, often geographically dispersed,
connected via a network.
Characteristics of Distributed File System
Below are the characteristics of distributed file system:
 Scalability: DFS can scale horizontally by adding more nodes to accommodate
growing storage needs and increasing access demands.
 Fault Tolerance: DFS maintains data availability even if some nodes or parts of the
network fail. Data redundancy, replication, and fault detection mechanisms are often
implemented to ensure reliability.
 Transparency: Users perceive the distributed file system as a single unified file system
regardless of the physical location of data or the underlying infrastructure.
 Concurrency: Multiple users and applications can access and modify files concurrently
while maintaining data consistency and integrity.
 Performance Optimization: DFS employs strategies such as caching, load balancing,
and data locality to enhance performance and minimize latency.
 Security: DFS incorporates security mechanisms to protect data during transmission
and storage, including authentication, encryption, access control, and data integrity
verification.

Key Components of Distributed File System (DFS)


Below are the key components of Distributed File System:
Metadata Management:
o Purpose: Metadata includes information about files, such as file names,
locations, permissions, and other attributes. Managing metadata efficiently is
crucial for locating and accessing files across distributed nodes.
o Components: Metadata servers or services maintain metadata consistency
and provide lookup services to map file names to their physical locations.
 Data Distribution and Replication
o Purpose: Distributing data across multiple nodes improves performance and
fault tolerance. Replication ensures data availability even if some nodes fail.
o Components: Replication protocols define how data copies are
synchronized and maintained across distributed nodes. Strategies include
primary-backup replication, quorum-based replication, or consistency models
like eventual consistency or strong consistency.
 Consistency Models
o Purpose: Ensuring data consistency across distributed nodes is challenging
due to network delays and potential conflicts. Consistency models define how
updates to data are propagated and synchronized to maintain a coherent
view.
o Components: Models range from strong consistency (where all nodes see
the same data simultaneously) to eventual consistency (where updates
eventually propagate to all nodes).
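Quorum-based replication mentioned above can be sketched with a toy in-memory register: choosing read and write quorum sizes with R + W > N guarantees that every read quorum overlaps every write quorum, so a read always observes the latest write. The class below is an illustrative simplification (single process, fixed replica choice), not a real distributed implementation:

```python
class QuorumStore:
    """Toy quorum-replicated register over n in-memory replicas.
    The rule R + W > N guarantees read/write quorum overlap."""
    def __init__(self, n, w, r):
        assert r + w > n, "quorums must overlap"
        self.n, self.w, self.r = n, w, r
        self.replicas = [(0, None)] * n   # (version, value) per replica

    def write(self, value):
        # Tag the write with a higher version and store it on W replicas.
        version = max(v for v, _ in self.replicas) + 1
        for i in range(self.w):
            self.replicas[i] = (version, value)

    def read(self):
        # Consult R replicas and return the highest-versioned value.
        return max(self.replicas[: self.r])[1]

store = QuorumStore(n=5, w=3, r=3)
store.write("x1")
store.write("x2")
print(store.read())  # -> x2
```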

Mechanism to build Distributed File Systems


Building a Distributed File System (DFS) involves implementing various mechanisms and
architectures to ensure scalability, fault tolerance, performance optimization, and security.
Let’s see the mechanisms typically used in constructing a DFS:
1. Centralized File System Architecture
In a centralized DFS architecture, all file operations and management are handled by a
single centralized server. Clients access files through this central server, which manages
metadata and data storage.
1.1. Key Components and Characteristics:
 Central Server: Manages file metadata, including file names, permissions, and
locations.
 Client-Server Communication: Clients communicate with the central server to
perform read, write, and metadata operations.
 Simplicity: Easier to implement and manage compared to distributed architectures.
 Scalability Limitations: Limited scalability due to the centralized nature, as the server
can become a bottleneck for file operations.
 Examples: Network File System (NFS) is a classic example of a centralized DFS
where a central NFS server manages file access for multiple clients in a network.
1.2. Advantages:
 Centralized Control: Simplifies management and administration of file operations.
 Consistency: Ensures strong consistency since all operations go through the central
server.
 Security: Centralized security mechanisms can be implemented effectively.
1.3. Disadvantages:
 Scalability: Limited scalability as all operations are bottlenecked by the central
server’s capacity.
 Single Point of Failure: The central server becomes a single point of failure, impacting
system reliability.
 Performance: Potential performance bottleneck due to increased network traffic and
server load.

2. Distributed File System Architecture
Distributed File Systems distribute files and their metadata across multiple nodes in a
network. This approach allows for parallel access and improves scalability and fault
tolerance.
2.1. Key Components and Characteristics:
 Distributed Storage: Files are divided into smaller units (blocks) and distributed
across multiple nodes or servers in the network.
 Metadata Handling: Metadata is distributed or managed by dedicated metadata
servers or distributed across storage nodes.
 Data Replication : Copies of data blocks may be replicated across nodes to ensure
fault tolerance and data availability.
 Load Balancing: Techniques are employed to distribute read and write operations
evenly across distributed nodes.
 Examples: Google File System (GFS), Hadoop Distributed File System (HDFS), and
Amazon S3 (Simple Storage Service) are prominent examples of distributed file
systems used in cloud computing and big data applications.
2.2. Advantages:
 Scalability: Easily scales by adding more storage nodes without centralized
bottlenecks.
 Fault Tolerance: Redundancy and data replication ensure high availability and
reliability.
 Performance: Parallel access to distributed data blocks improves read/write
performance.
 Load Balancing: Efficient load distribution across nodes enhances overall system
performance.
2.3. Disadvantages:
 Complexity: Increased complexity in managing distributed data and ensuring
consistency across nodes.
 Consistency Challenges: Consistency models must be carefully designed to manage
concurrent updates and maintain data integrity.
 Security: Ensuring data security and access control across distributed nodes can be
more challenging than in centralized systems.

3. Hybrid File System Architectures


Hybrid architectures combine elements of both centralized and distributed approaches,
offering flexibility and optimization based on specific use cases.
3.1. Key Components and Characteristics:
 Combination of Models: Utilizes centralized servers for managing metadata and
control, while data storage and access may be distributed across nodes.
 Client-Side Caching: Clients may cache frequently accessed data locally to reduce
latency and improve performance.
 Examples: Andrew File System (AFS) combines centralized metadata management
with distributed data storage for efficient file access and management.
3.2. Advantages:
 Flexibility: Adaptable to different workload demands and scalability requirements.
 Performance Optimization: Combines centralized control with distributed data access
for optimized performance.
 Fault Tolerance: Can leverage data replication strategies while maintaining centralized
metadata management.
3.3. Disadvantages:
 Complexity: Hybrid architectures can introduce additional complexity in system design
and management.
 Consistency: Ensuring consistency between centralized metadata and distributed data
can be challenging.
 Integration Challenges: Integration of centralized and distributed components
requires careful coordination and synchronization.
4. Implementation Considerations
 Scalability: Choose architecture based on scalability requirements, considering
potential growth in data volume and user access.
 Fault Tolerance: Implement redundancy and data replication strategies to ensure high
availability and reliability.
 Performance: Optimize data access patterns, employ caching mechanisms, and utilize
load balancing techniques to enhance performance.
 Security: Implement robust security measures, including encryption, access controls,
and authentication mechanisms, to protect data across distributed nodes.
Challenges and Solutions in building Distributed File Systems
Below are the challenges and solution in building distributed file systems:
 Data Consistency and Integrity
o Challenge: Ensuring consistent data across distributed nodes due to network
delays and concurrent updates.
o Solution: Implement appropriate consistency models (e.g., strong or
eventual consistency) and concurrency control mechanisms (e.g., distributed
locking, transactions).
 Fault Tolerance and Reliability
o Challenge: Dealing with node failures, network partitions, and
hardware/software issues.
o Solution: Use data replication strategies (e.g., primary-backup, quorum-
based), automated failure detection, and recovery mechanisms to ensure
data availability and system reliability.
 Scalability and Performance
o Challenge: Scaling to handle large data volumes and increasing user access
without performance degradation.
o Solution: Scale horizontally by adding more nodes, employ load balancing
techniques, and optimize data partitioning and distribution to enhance system
performance.
 Security and Access Control
o Challenge: Protecting data integrity, confidentiality, and enforcing access
control across distributed nodes.
o Solution: Utilize encryption for data at rest and in transit, implement robust
authentication and authorization mechanisms, and monitor access logs for
security compliance.

Recovery Techniques in Distributed Systems


Recovery techniques in distributed systems are essential for ensuring that the system can
return to a stable state after encountering errors or failures. These techniques can be
broadly categorized into the following:
 Checkpointing: Periodically saving the system’s state to a stable storage, so that in
the event of a failure, the system can be restored to the last known good state.
Checkpointing is a key aspect of backward recovery.
 Rollback Recovery: Involves reverting the system to a previous checkpointed state
upon detecting an error. This technique is useful for undoing the effects of errors and is
often combined with checkpointing.
 Forward Recovery: Instead of reverting to a previous state, forward recovery attempts
to move the system from an erroneous state to a new, correct state. This requires
anticipating possible errors and having strategies in place to correct them on the fly.
 Logging and Replay: Keeping logs of system operations and replaying them from a
certain point to recover the system’s state. This is useful in scenarios where a complete
rollback might not be feasible.
 Replication: Maintaining multiple copies of data or system components across
different nodes. If one component fails, another can take over, ensuring continuity of
service.
 Error Detection and Correction: Incorporating mechanisms that detect errors and
automatically correct them before they lead to system failure. This is a proactive
approach that enhances system resilience.

In distributed systems, recovery mechanisms are crucial for handling failures and maintaining
system reliability. The two main approaches are forward recovery and backward recovery:

1. Forward Recovery:
Forward recovery focuses on correcting the errors and continuing operations from the
current state. Instead of undoing previous actions, this method applies corrective or
compensating measures to address the impact of failures. Forward recovery is commonly
used in scenarios where preserving progress is critical, such as financial transactions or
real-time applications. However, it is complex to implement as it requires a detailed
understanding of the failure and its effects on the system.
2. Backward Recovery:
Backward recovery restores the system to a previously saved consistent state by rolling
back all changes made after the failure. This approach relies on techniques like
checkpoints or logs to return to a known good state. It is simpler and widely used in
systems like databases, where maintaining data consistency is essential. However,
backward recovery may lead to the loss of recent progress made before the failure
occurred.

Recovery With Concurrent Transactions


Last Updated : 25 Aug, 2020



Concurrency control allows multiple transactions to execute at the same time,
producing interleaved log records. Because interleaving can affect transaction
results, the order of execution of those transactions must be maintained.
During recovery, it would be very difficult for the recovery system to backtrack
through all the logs before starting to recover.
Recovery with concurrent transactions can be done in the following four ways.
1. Interaction with concurrency control
2. Transaction rollback
3. Checkpoints
4. Restart recovery
Interaction with concurrency control :
In this scheme, the recovery scheme depends greatly on the concurrency
control scheme that is used. So, to rollback a failed transaction, we must undo
the updates performed by the transaction.
Transaction rollback :
 In this scheme, we roll back a failed transaction by using the log.
 The system scans the log backward for the failed transaction; for every log record
found, the system restores the data item to its previous value.
Checkpoints :
 Checkpointing is the process of saving a snapshot of the application's state so that
it can restart from that point in case of failure.
 Checkpoint is a point of time at which a record is written onto the database
form the buffers.
 Checkpoint shortens the recovery process.
 When a checkpoint is reached, the transactions' updates are written into the
database, and the log records up to that point can be removed from the log file.
The log file then records new transaction steps until the next checkpoint, and
so on.
 The checkpoint is used to declare the point before which the DBMS was in the
consistent state, and all the transactions were committed.
Restart recovery :
 When the system recovers from a crash, it constructs two lists.
 The undo-list consists of transactions to be undone, and the redo-list consists
of transactions to be redone.
 The system constructs the two lists as follows: Initially, they are both empty.
The system scans the log backward, examining each record, until it finds the
first <checkpoint> record.
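The backward scan described above can be sketched as follows, assuming a simplified log of (record-type, transaction) tuples in which the checkpoint record carries the list of transactions that were active at the checkpoint:

```python
def build_recovery_lists(log):
    """Scan the log backward until the first checkpoint record,
    building the undo-list and redo-list. Records are tuples:
    ("start", T), ("commit", T), or ("checkpoint", active_list)."""
    undo, redo = [], []
    for record in reversed(log):
        kind = record[0]
        if kind == "checkpoint":
            # Transactions active at the checkpoint that did not
            # commit afterwards must also be undone.
            for t in record[1]:
                if t not in redo and t not in undo:
                    undo.append(t)
            break
        tid = record[1]
        if kind == "commit":
            redo.append(tid)
        elif kind == "start" and tid not in redo:
            undo.append(tid)
    return undo, redo

log = [
    ("start", "T1"), ("commit", "T1"),   # before checkpoint: ignored
    ("checkpoint", ["T2"]),              # T2 active at the checkpoint
    ("start", "T3"), ("commit", "T3"),   # committed after checkpoint
    ("start", "T4"),                     # still active at the crash
]
undo, redo = build_recovery_lists(log)
print(undo, redo)  # -> ['T4', 'T2'] ['T3']
```

T1, which committed before the checkpoint, is never examined, which is exactly why checkpointing shortens recovery.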

DataBase Recovery
In order to recover from database failure, database management systems resort to a number
of recovery management techniques. In this chapter, we will study the different approaches for
database recovery. The typical strategies for database recovery are:

 In case of soft failures that result in inconsistency of the database, the recovery strategy
includes transaction undo or rollback. However, sometimes transaction redo may also be
adopted to recover to a consistent state of the transaction.

 In case of hard failures resulting in extensive damage to database, recovery strategies


encompass restoring a past copy of the database from archival backup. A more current
state of the database is obtained through redoing operations of committed transactions
from transaction log.

Recovery from Power Failure


Power failure causes loss of information in the non-persistent memory. When power is restored,
the operating system and the database management system restart. Recovery manager initiates
recovery from the transaction logs.

In case of immediate update mode, the recovery manager takes the following actions −

 Transactions which are in active list and failed list are undone and written on the abort
list.
 Transactions which are in before-commit list are redone.
 No action is taken for transactions in commit or abort lists.

In case of deferred update mode, the recovery manager takes the following actions −

 Transactions which are in the active list and failed list are written onto the abort list. No
undo operations are required since the changes have not been written to the disk yet.
 Transactions which are in before-commit list are redone.
 No action is taken for transactions in commit or abort lists.

Recovery from Disk Failure


A disk failure or hard crash causes a total database loss. To recover from this hard crash, a new
disk is prepared, then the operating system is restored, and finally the database is recovered
using the database backup and transaction log. The recovery method is same for both immediate
and deferred update modes.

The recovery manager takes the following actions −

 The transactions in the commit list and before-commit list are redone and written onto
the commit list in the transaction log.
 The transactions in the active list and failed list are undone and written onto the abort list
in the transaction log.


Checkpointing
Checkpoint is a point of time at which a record is written onto the database from the buffers. As
a consequence, in case of a system crash, the recovery manager does not have to redo the
transactions that have been committed before checkpoint. Periodical checkpointing shortens the
recovery process.

The two types of checkpointing techniques are −

 Consistent checkpointing
 Fuzzy checkpointing
Consistent Checkpointing

Consistent checkpointing creates a consistent image of the database at the checkpoint. During
recovery, only those transactions that come after the last checkpoint are undone or
redone. The transactions that completed before the last consistent checkpoint are already
committed and need not be processed again. The actions taken for checkpointing are −

 The active transactions are suspended temporarily.


 All changes in main-memory buffers are written onto the disk.
 A “checkpoint” record is written in the transaction log.
 The transaction log is written to the disk.
 The suspended transactions are resumed.

If in step 4, the transaction log is archived as well, then this checkpointing aids in recovery from
disk failures and power failures, otherwise it aids recovery from only power failures.

Fuzzy Checkpointing

In fuzzy checkpointing, at the time of checkpoint, all the active transactions are written in the log.
In case of power failure, the recovery manager processes only those transactions that were
active during checkpoint and later. The transactions that have been committed before checkpoint
are written to the disk and hence need not be redone.

Example of Checkpointing

Let us consider a system in which the time of checkpointing is tcheck and the time of system
crash is tfail. Let there be four transactions Ta, Tb, Tc and Td such that −

 Ta commits before checkpoint.


 Tb starts before checkpoint and commits before system crash.
 Tc starts after checkpoint and commits before system crash.
 Td starts after checkpoint and was active at the time of system crash.

The situation can be depicted on a timeline running up to tfail, with the checkpoint at tcheck (the diagram is not reproduced here).

The actions that are taken by the recovery manager are −


 Nothing is done with Ta.
 Transaction redo is performed for Tb and Tc.
 Transaction undo is performed for Td.
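The recovery manager's classification in this example can be expressed as a small rule over each transaction's start and commit times relative to tcheck (the numeric timestamps below are illustrative):

```python
def recovery_action(start, commit, t_check):
    """Classify a transaction by its timestamps relative to the
    checkpoint time t_check; commit=None means it was still
    active when the system crashed."""
    if commit is not None and commit <= t_check:
        return "nothing"   # changes already on disk at the checkpoint
    if commit is not None:
        return "redo"      # committed between checkpoint and crash
    return "undo"          # active at the time of the crash

t_check = 50
transactions = {
    "Ta": (10, 40),    # commits before the checkpoint
    "Tb": (30, 70),    # starts before, commits after the checkpoint
    "Tc": (60, 80),    # starts and commits after the checkpoint
    "Td": (65, None),  # active at the time of the crash
}
for name, (start, commit) in transactions.items():
    print(name, recovery_action(start, commit, t_check))
```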

Transaction Recovery Using UNDO / REDO


Transaction recovery is done to eliminate the adverse effects of faulty transactions rather than to
recover from a failure. Faulty transactions include all transactions that have changed the
database into undesired state and the transactions that have used values written by the faulty
transactions.

Transaction recovery in these cases is a two-step process −

 UNDO all faulty transactions and transactions that may be affected by the faulty
transactions.
 REDO all transactions that are not faulty but have been undone due to the faulty
transactions.

What is Fault Tolerance?


Fault Tolerance is defined as the ability of the system to function properly even in the
presence of any failure. Distributed systems consist of multiple components due to which
there is a high risk of faults occurring. Due to the presence of faults, the overall
performance may degrade.
Types of Faults
 Transient Faults: Transient Faults are the type of faults that occur once and then
disappear. These types of faults do not harm the system to a great extent but are very
difficult to find or locate. Processor fault is an example of transient fault.
 Intermittent Faults: Intermittent Faults are the type of faults that come again and
again: the fault occurs, vanishes on its own, and then reappears. An example of an
intermittent fault is when a working computer hangs.
 Permanent Faults: Permanent Faults are the type of faults that remain in the system
until the component is replaced by another. These types of faults can cause very
severe damage to the system but are easy to identify. A burnt-out chip is an example of
a permanent Fault.

Need for Fault Tolerance in Distributed Systems
Fault Tolerance is required in order to provide below four features.
1. Availability: Availability is defined as the property where the system is readily
available for its use at any time.
2. Reliability: Reliability is defined as the property where the system can work
continuously without any failure.
3. Safety: Safety is defined as the property where the system can remain safe from
unauthorized access even if any failure occurs.
4. Maintainability: Maintainability is defined as the property that states how easily and
quickly a failed node or system can be repaired.
Fault Tolerance in Distributed Systems
In order to implement the techniques for fault tolerance in distributed systems, the design,
configuration and relevant applications need to be considered. Below are the phases
carried out for fault tolerance in any distributed systems.

Phases of Fault Tolerance in Distributed Systems

1. Fault Detection
Fault Detection is the first phase where the system is monitored continuously. The
outcomes are being compared with the expected output. During monitoring if any faults
are identified they are being notified. These faults can occur due to various reasons such
as hardware failure, network failure, and software issues. The main aim of the first phase
is to detect these faults as soon as they occur so that the work being assigned will not be
delayed.
2. Fault Diagnosis
Fault diagnosis is the process where the fault that is identified in the first phase will be
diagnosed properly in order to get the root cause and possible nature of the faults. Fault
diagnosis can be done manually by the administrator or by using automated Techniques
in order to solve the fault and perform the given task.
3. Evidence Generation
Evidence generation is the process in which a report on the fault is prepared based on
the diagnosis done in the earlier phase. The report details the causes of the fault, its
nature, the solutions that can be used to fix it, and other alternatives and preventive
measures that should be considered.
4. Assessment
Assessment is the process in which the damage caused by the fault is analyzed. It can
be determined with the help of messages passed by the component that encountered
the fault. Based on the assessment, further decisions are made.
5. Recovery
Recovery is the process whose aim is to make the system fault-free and restore it to a
consistent state. The two broad approaches are forward recovery and backward
(rollback) recovery. Common recovery techniques such as reconfiguration and
resynchronization can be used.
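As a minimal illustration of backward (rollback) recovery, the sketch below checkpoints a toy component's state and restores it after a simulated fault. The `Counter` class and the use of `pickle` are illustrative assumptions, not part of any particular system:

```python
import pickle

class Counter:
    """A toy stateful component whose state we checkpoint (hypothetical)."""
    def __init__(self):
        self.value = 0

    def work(self):
        self.value += 1

# Backward (rollback) recovery: save state periodically, restore after a fault.
c = Counter()
c.work()
c.work()
checkpoint = pickle.dumps(c.value)   # save a checkpoint of the current state

c.work()                             # more work after the checkpoint...
c.value = -999                       # ...then a simulated fault corrupts the state

c.value = pickle.loads(checkpoint)   # roll back to the last good checkpoint
print(c.value)                       # 2
```

In a real system the checkpoint would be written to stable storage so it survives a crash; here it lives in memory only to keep the sketch short.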

Types of Fault Tolerance in Distributed Systems


1. Hardware Fault Tolerance: Hardware fault tolerance involves keeping backups for
hardware devices such as memory, hard disks, CPUs, and other hardware peripheral
devices. It does not examine faults or runtime errors; it can only provide hardware
backup. The two approaches used in hardware fault tolerance are fault masking and
dynamic recovery.
2. Software Fault Tolerance: Software fault tolerance uses dedicated software to
detect invalid output, runtime errors, and programming errors. It makes use of static
and dynamic methods for detecting faults and providing solutions, and it also relies
on additional mechanisms such as checkpoints and recovery rollback.
3. System Fault Tolerance: System fault tolerance covers the whole system. Its
advantage is that it stores not only checkpoints but also memory blocks and program
checkpoints, and it detects errors in applications automatically. If the system
encounters any type of fault or error, it provides the required recovery mechanism.
System fault tolerance is therefore reliable and efficient.
Fault Tolerance Strategies
Fault tolerance strategies are essential for ensuring that distributed systems continue to
operate smoothly even when components fail. Here are the key strategies commonly
used:
 Redundancy and Replication
o Data Replication: Data is duplicated across multiple nodes or locations to
ensure availability and durability. If one node fails, the system can still access
the data from another node.
o Component Redundancy: Critical system components are duplicated so
that if one component fails, others can take over. This includes redundant
servers, network paths, or services.
 Failover Mechanisms
o Active-Passive Failover: One component (active) handles the workload
while another component (passive) remains on standby. If the active
component fails, the passive component takes over.
o Active-Active Failover: Multiple components actively handle workloads and
share the load. If one component fails, others continue to handle the
workload.
 Error Detection Techniques
o Heartbeat Mechanisms: Regular signals (heartbeats) are sent between
components to detect failures. If a component stops sending heartbeats, it is
considered failed.
o Checkpointing: Periodic saving of the system’s state so that if a failure
occurs, the system can be restored to the last saved state.
 Error Recovery Methods
o Rollback Recovery: The system reverts to a previous state after detecting
an error, using saved checkpoints or logs.
o Forward Recovery: The system attempts to correct or compensate for the
failure to continue operating. This may involve reprocessing or reconstructing
data.
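The heartbeat mechanism from the error detection strategies above can be sketched as follows. The `HeartbeatMonitor` class, node names, and timeout value are hypothetical:

```python
import time

class HeartbeatMonitor:
    """Declares a node failed if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = {}   # node id -> timestamp of its last heartbeat

    def beat(self, node, now=None):
        """Record a heartbeat from `node` (explicit `now` eases testing)."""
        self.last_beat[node] = now if now is not None else time.time()

    def failed_nodes(self, now=None):
        """Return nodes whose last heartbeat is older than the timeout."""
        now = now if now is not None else time.time()
        return [n for n, t in self.last_beat.items() if now - t > self.timeout]

# Simulated timeline: node "b" stops sending heartbeats after t = 0.
mon = HeartbeatMonitor(timeout=3.0)
mon.beat("a", now=0.0)
mon.beat("b", now=0.0)
mon.beat("a", now=5.0)            # "a" keeps beating; "b" goes silent
print(mon.failed_nodes(now=6.0))  # ['b']
```

Real monitors run this check on a timer and trigger failover when a node appears in the failed list; the explicit `now` parameter here just makes the timeline deterministic.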

Fault tolerance can present a number of challenges, including:


 Cost
Fault-tolerant systems require additional hardware, equipment, and software licenses, which can
increase costs.
 Performance
Redundant resources and failover mechanisms can impact system performance, especially when
demand is high.
 Complexity
Fault tolerance mechanisms can make a system more complex, which can make it harder to design,
deploy, and maintain.
 Consistency and availability
Maintaining a balance between data consistency and system availability can be challenging.
 Latency
Some fault tolerance mechanisms can introduce latency, which may not be acceptable for real-time
applications.
 Tools
There is a lack of tools to help programmers make reliable systems.
 Testing and fault detection
Fault tolerance can make it more difficult to know if components are performing as expected.

Commit Protocol in DBMS


The concept of the commit protocol was developed in the context of database systems.
Commit protocols are algorithms used in distributed systems to ensure that a
transaction is either completed entirely or not at all. They help maintain data integrity,
atomicity, and consistency, and make it possible to build robust, efficient, and reliable
systems.
One-Phase Commit
It is the simplest commit protocol. In this protocol there is one controlling site and a
number of slave sites where the transaction is performed. The steps followed in the
one-phase commit protocol are the following:

 Each slave sends a ‘DONE’ message to the controlling site after it has completed
its part of the transaction.
 After sending the ‘DONE’ message, each slave waits for a ‘Commit’ or ‘Abort’
response from the controlling site.
 After receiving ‘DONE’ messages from all the slaves, the controlling site decides
whether to commit or abort and sends its decision to every slave.
 Each slave then performs the operation as instructed by the controlling site and
sends an acknowledgement back to it.
One-Phase Commit Protocol
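The steps above can be sketched as a toy simulation. The `Slave` class and the message strings are hypothetical stand-ins for real network messages:

```python
class Slave:
    """Hypothetical slave site: executes its part, then applies the decision."""
    def execute(self):
        return "DONE"             # report completion to the controlling site

    def apply(self, decision):
        self.state = decision     # commit (or abort) the local transaction
        return "ACK"              # acknowledge the controlling site

def one_phase_commit(slaves):
    """Toy one-phase commit: the controlling site decides unilaterally
    once every slave has reported DONE (there is no voting round)."""
    done = [s.execute() for s in slaves]
    assert all(msg == "DONE" for msg in done)

    decision = "COMMIT"           # the controlling site's unilateral decision
    acks = [s.apply(decision) for s in slaves]
    return decision if all(a == "ACK" for a in acks) else "UNKNOWN"

print(one_phase_commit([Slave(), Slave()]))   # COMMIT
```

Note that the slaves never vote: this is exactly the vulnerability that motivates the two-phase protocol below, where a slave can refuse to commit.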

Two-Phase Commit
It is the second type of commit protocol in DBMS. It was introduced to reduce the
vulnerabilities of the one-phase commit protocol. There are two phases in the
two-phase commit protocol.
Prepare Phase
Each slave sends a ‘DONE’ message to the controlling site after it has completed its
transaction.
After receiving ‘DONE’ messages from all the slaves, the controlling site sends a
‘Prepare’ message to all the slaves.
Each slave then votes on whether it wants to commit: a slave that wants to commit
replies ‘Ready’; a slave that does not want to commit replies ‘Not Ready’.
Prepare Phase

Commit/Abort Phase
If the controlling site receives a ‘Ready’ vote from every slave:
 The controlling site sends a ‘Global Commit’ message to all the slaves. The message
contains the details of the transaction to be stored in the databases.
 Each slave then completes the transaction and returns an acknowledgement message
back to the controlling site.
 Once the controlling site has received acknowledgements from all the slaves, the
transaction is complete.
If any slave votes ‘Not Ready’, the controlling site instead sends a ‘Global Abort’
message and the transaction is rolled back.
Commit/Abort Phase
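The two phases can be sketched as a toy coordinator loop. The `Participant` class and the message strings are illustrative assumptions:

```python
class Participant:
    """Hypothetical slave site that votes in the prepare phase."""
    def __init__(self, ready=True):
        self.ready = ready
        self.outcome = None

    def prepare(self):
        return "READY" if self.ready else "NOT_READY"

    def finish(self, decision):
        self.outcome = decision   # apply Global Commit / Global Abort locally

def two_phase_commit(participants):
    """Toy 2PC coordinator: commit only if every participant votes Ready."""
    # Prepare phase: collect every participant's vote.
    votes = [p.prepare() for p in participants]

    # Commit/abort phase: a single "Not Ready" vote aborts globally.
    decision = ("GLOBAL_COMMIT" if all(v == "READY" for v in votes)
                else "GLOBAL_ABORT")
    for p in participants:
        p.finish(decision)
    return decision

print(two_phase_commit([Participant(), Participant()]))             # GLOBAL_COMMIT
print(two_phase_commit([Participant(), Participant(ready=False)]))  # GLOBAL_ABORT
```

The sketch omits what makes 2PC blocking in practice: if the coordinator crashes after the votes are cast but before the decision is sent, prepared participants must wait, which is the problem the three-phase protocol addresses.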

Three Phase Commit Protocol


It is the third type of commit protocol in DBMS. It was introduced to address the issue
of blocking in the two-phase commit. This protocol has three phases:
Prepare Phase
It is the same as the prepare phase of the two-phase commit protocol. In this phase
every site must be prepared (ready) to commit.
Prepare to Commit Phase
In this phase the controlling site sends an "Enter Prepared State" message to all the
slaves. In response, each slave replies with an "OK" message.
Commit/Abort Phase
This phase consists of the same steps as in the two-phase commit, except that no
acknowledgement is collected after the commit.
Three-Phase Commit Protocol
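The three phases can be sketched in the same toy style. The `Site` class and message names are hypothetical:

```python
class Site:
    """Hypothetical slave site for the three-phase protocol."""
    def __init__(self, ready=True):
        self.ready = ready
        self.log = []             # messages received from the coordinator

    def vote(self):
        return "READY" if self.ready else "NOT_READY"

    def deliver(self, msg):
        self.log.append(msg)      # record the coordinator's message
        return "OK"

def three_phase_commit(sites):
    """Toy 3PC coordinator: the extra pre-commit round lets sites learn the
    decision even if the coordinator fails between the vote and the commit."""
    # Phase 1 (prepare): collect votes, exactly as in 2PC.
    if not all(s.vote() == "READY" for s in sites):
        for s in sites:
            s.deliver("GLOBAL_ABORT")
        return "GLOBAL_ABORT"

    # Phase 2 (prepare to commit): every site acknowledges with OK.
    assert all(s.deliver("ENTER_PREPARED_STATE") == "OK" for s in sites)

    # Phase 3 (commit): no acknowledgement is collected in this phase.
    for s in sites:
        s.deliver("GLOBAL_COMMIT")
    return "GLOBAL_COMMIT"

print(three_phase_commit([Site(), Site()]))   # GLOBAL_COMMIT
```

Because every site reaches the prepared state before any site commits, a surviving site that sees "Enter Prepared State" knows the decision must have been commit, which is what removes the blocking of 2PC.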

Advantages of Commit Protocol in DBMS


 The commit protocol in DBMS helps ensure that changes to the database remain
consistent throughout the database.
 It also helps ensure that the integrity of the data is maintained throughout the
database.
 It helps maintain atomicity: either all the operations in a transaction complete
successfully, or none of them are performed at all.
 The commit protocol provides mechanisms for system recovery in the case of system
failures.
In distributed systems, voting protocols are used to maintain consistency and fault tolerance
during operations like reading or writing shared data. In these protocols, nodes maintain replicas
of the data, and a quorum (a majority or predefined subset of nodes) must agree before an
operation is performed. Voting protocols rely on two types of quorums: a read quorum (R)
and a write quorum (W). The relationship R + W > N, where N is the
total number of nodes, ensures that at least one node involved in any read operation has the most
recent data from the write operation. This guarantees consistency even in the presence of
failures. Voting protocols are resilient to node failures and support distributed decision-making
but can suffer from high communication overhead.
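The quorum-intersection condition can be verified with a short sketch; `quorum_ok` and `quorums_always_overlap` are illustrative helper names, not a standard API:

```python
from itertools import combinations

def quorum_ok(N, R, W):
    """Read/write quorum condition: any read quorum must overlap any
    write quorum, which holds exactly when R + W > N."""
    return R + W > N

def quorums_always_overlap(N, R, W):
    """Brute-force check: every size-R subset of the N nodes
    intersects every size-W subset."""
    nodes = range(N)
    return all(set(r) & set(w)
               for r in combinations(nodes, R)
               for w in combinations(nodes, W))

# With N = 5 replicas, R = 3 and W = 3 always intersect (3 + 3 > 5)...
print(quorum_ok(5, 3, 3), quorums_always_overlap(5, 3, 3))   # True True
# ...but R = 2 and W = 3 can be disjoint, e.g. {0, 1} vs {2, 3, 4}.
print(quorum_ok(5, 2, 3), quorums_always_overlap(5, 2, 3))   # False False
```

The brute-force check agrees with the arithmetic condition for any small N, which is the whole content of the R + W > N rule: two subsets of sizes R and W drawn from N nodes can be disjoint only when R + W ≤ N.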

Dynamic voting protocols extend this concept by adapting to changes in the system, such as
node failures or recovery. In these protocols, the size of the quorum required for reads and writes
is adjusted dynamically based on the availability of nodes. This flexibility enhances fault
tolerance, as the system can continue functioning even with fewer operational nodes. Dynamic
voting ensures that decisions remain consistent despite changes in the system state, making it
suitable for highly dynamic and unreliable environments. However, implementing dynamic
voting can be complex due to the need to track and manage node states in real time.
