Unit 3


DNS

An application layer protocol defines how application processes running on different systems
pass messages to each other.

o DNS stands for Domain Name System.

o DNS is a directory service that provides a mapping between the name of a host on the
network and its numerical address.
o DNS is required for the functioning of the internet.

o Each node in a tree has a domain name, and a full domain name is a sequence of symbols
specified by dots.

o DNS is a service that translates the domain name into IP addresses. This allows the users of
networks to utilize user-friendly names when looking for other hosts instead of remembering
the IP addresses.

o For example, suppose the FTP site at EduSoft has the IP address 132.147.165.50. Most
people would reach this site by specifying ftp.EduSoft.com, because the domain name is
easier to remember and more stable than a raw IP address.

DNS is a TCP/IP protocol used on different platforms. The domain name space is divided into three
different sections: generic domains, country domains, and inverse domain.

Generic Domains

o It defines the registered hosts according to their generic behavior.

o Each node in a tree defines the domain name, which is an index to the DNS database.

o It uses three-character labels, and these labels describe the organization type.

Label    Description

aero     Airlines and aerospace companies
biz      Businesses or firms
com      Commercial organizations
coop     Cooperative business organizations
edu      Educational institutions
gov      Government institutions
info     Information service providers
int      International organizations
mil      Military groups
museum   Museums and other nonprofit organizations
name     Personal names
net      Network support centers
org      Nonprofit organizations
pro      Professional individual organizations


Country Domain

The format of a country domain is the same as that of a generic domain, but it uses two-character
country abbreviations (e.g., us for the United States) in place of the three-character organizational abbreviations.

Inverse Domain

The inverse domain is used to map an address to a name. For example, a server that holds
files for authorized clients only may receive a request from a client and need to determine
whether that client is on the authorized list. To do so, it sends a query to the DNS server
asking for the name that corresponds to the client's address.

Working of DNS

o DNS is a client/server network communication protocol. DNS clients send requests to the
server, and DNS servers send responses back to the clients.

o A request containing a name to be converted into an IP address is known as a forward
DNS lookup, while a request containing an IP address to be converted into a name is known
as a reverse DNS lookup.

o DNS implements a distributed database to store the name of all the hosts available on the
internet.
o If a client such as a web browser sends a request containing a hostname, a piece of
software called the DNS resolver sends a request to a DNS server to obtain the IP address of
that hostname. If the DNS server does not hold the IP address associated with the hostname,
it forwards the request to another DNS server. Once the IP address arrives at the resolver,
the client completes its request over the Internet Protocol.
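The two lookup directions can be sketched with a toy in-memory zone table (the second hostname and both addresses here are examples only; a real resolver queries DNS servers over the network):

```python
# Toy zone table illustrating forward and reverse DNS lookups.
FORWARD_ZONE = {
    "ftp.EduSoft.com": "132.147.165.50",   # example mapping from the text
    "www.example.com": "93.184.216.34",    # invented second entry
}

# Build the reverse table from the forward one.
REVERSE_ZONE = {ip: name for name, ip in FORWARD_ZONE.items()}

def forward_lookup(hostname: str) -> str:
    """Forward DNS lookup: name -> IP address."""
    return FORWARD_ZONE[hostname]

def reverse_lookup(ip: str) -> str:
    """Reverse DNS lookup: IP address -> name."""
    return REVERSE_ZONE[ip]
```

A real resolver would follow referrals between servers instead of consulting a single local table, but the forward/reverse distinction is exactly this pair of mappings.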

Directory & discovery services :-


In distributed computing, directory and discovery services are essential for locating
resources, such as services, data, or devices, across a network of multiple systems. These
services ensure that applications can interact with the right resources efficiently, even in
large and dynamic environments. Let's break down the main components and concepts
involved:
1. Directory Services
 A directory service stores information about network resources and provides
mechanisms to retrieve this information based on specific criteria.
 This service acts as a centralized or distributed repository for storing metadata about
resources, such as their locations, interfaces, and access credentials.
 Examples include the Lightweight Directory Access Protocol (LDAP), Active Directory,
and Domain Name System (DNS) servers, which provide resource and service
location information.
Key Features:
 Resource Identification: Resources are assigned unique identifiers (UIDs) that allow
directory services to locate them on the network.
 Organized Hierarchically: Information is often organized in a tree-like structure,
facilitating search and retrieval.
 Access Control: Provides access restrictions to ensure security, allowing only
authorized users or services to retrieve certain information.
2. Discovery Services
 Discovery services dynamically locate and keep track of resources and services
available in the network. This is essential in environments where resources may
frequently join or leave.
 Discovery services are often decentralized, making them well-suited for distributed
computing environments where network topology changes frequently.
 Examples include Service Location Protocol (SLP), Universal Plug and Play (UPnP), and
Zero-configuration networking (Bonjour, for instance).
Key Features:
 Dynamic Availability: Resources can register and deregister as needed, allowing
discovery services to track active resources.
 Service Announcement and Browsing: Resources can announce their presence, and
clients can browse for available services.
 Scalability and Fault Tolerance: Discovery services are typically designed to handle
numerous resources and maintain functionality even with some node failures.
3. How Directory and Discovery Services Work Together
 Directory services are more static, often used to store persistent information about
resources, such as configurations, policies, or certificates.
 Discovery services are more dynamic, allowing for real-time updates on resource
availability.
 In practice, both types of services work together to enable efficient resource
management: directory services provide a stable reference, while discovery services
adapt to dynamic network changes.
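The dynamic side of this picture can be sketched as a minimal in-memory service registry (the register/deregister/discover API and the service names are assumptions for illustration; real systems such as Consul or etcd add leases, health checks, and replication):

```python
# A minimal sketch of a discovery-service registry.
class ServiceRegistry:
    def __init__(self):
        self._services = {}          # service name -> set of endpoints

    def register(self, name: str, endpoint: str) -> None:
        """A resource announces its presence (service announcement)."""
        self._services.setdefault(name, set()).add(endpoint)

    def deregister(self, name: str, endpoint: str) -> None:
        """A resource leaves the network."""
        self._services.get(name, set()).discard(endpoint)

    def discover(self, name: str) -> list:
        """Clients browse for the currently available endpoints."""
        return sorted(self._services.get(name, set()))
```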
4. Common Applications and Protocols
 LDAP: Common for user authentication and authorization in enterprise
environments.
 DNS-SD (DNS Service Discovery): Used to discover services over IP networks.
 UPnP: Allows devices to dynamically join a network, obtain IP addresses, and
discover each other.
 Consul, Zookeeper, etcd: Used in cloud environments for service discovery,
configuration, and leader election.
5. Challenges
 Scalability: Managing large numbers of services in a distributed network.
 Fault Tolerance: Ensuring availability and resilience despite network failures.
 Security: Protecting directory and discovery information from unauthorized access or
tampering.
In summary, directory services are foundational for organized, structured resource
information storage, while discovery services allow for the dynamic interaction and real-time
location of resources in distributed systems.

time and global states in distributed system:-


In distributed systems, managing time and global state is challenging because no single
node has a complete view of the system at any moment, and each node typically has its own
independent clock. These factors make it hard to ensure consistent ordering of events and a
unified view of the system's state. Here’s an overview of the concepts, techniques, and
challenges involved.
1. Time in Distributed Systems
 Problem of Synchronization: In distributed systems, physical clocks on different
nodes tend to drift apart, leading to discrepancies in time. Achieving precise clock
synchronization across nodes is difficult, yet necessary for tasks that rely on time
ordering (e.g., logs, transactions).
 Types of Timekeeping:
o Physical Clocks: These are system clocks maintained by each node. Protocols
like the Network Time Protocol (NTP) help synchronize physical clocks to
some extent, but perfect accuracy is impossible.
o Logical Clocks: Logical clocks, such as Lamport timestamps and vector clocks,
are used to establish a causal ordering of events rather than absolute time.
Logical clocks provide a way to order events based on cause and effect rather
than precise timestamps.
Techniques for Managing Time:
 Lamport Timestamps: Each process maintains a counter that increments with each
event. When messages are sent, the counter value is passed along, allowing the
receiver to adjust its counter to ensure a causal ordering.
 Vector Clocks: An extension of Lamport timestamps where each node maintains a
vector that tracks its own clock and the clocks of other nodes. Vector clocks are more
precise and can detect concurrent events (events without a causal relationship).
 Hybrid Clocks: These clocks combine physical and logical clocks to maintain both a
close approximation of real-time and causal ordering (e.g., Google’s TrueTime).
2. Global States in Distributed Systems
 Concept of Global State: The global state is the combined state of all nodes in the
distributed system at a specific point in time. This includes all processes, variables,
and messages in transit.
 Challenges of Capturing Global State:
o Lack of Shared Memory: Since each node operates independently, there is no
single memory or storage location representing the entire system’s state.
o Concurrency and Asynchrony: Nodes operate asynchronously, so capturing a
consistent global snapshot that reflects a specific point in time across the
system is challenging.
Techniques for Capturing Global State:
 Snapshot Algorithms: Algorithms like Chandy-Lamport Snapshot are used to capture
consistent global states without halting the system.
o Chandy-Lamport Algorithm: A process initiates the snapshot and sends
markers to other processes. Each process records its state upon receiving the
marker and logs any incoming messages received after this point, ensuring
consistency.
 Consistent Cuts: A consistent cut is a subset of events from each node that does not
violate causality (i.e., an event and its cause are both part of the cut). This ensures
the snapshot reflects a logically consistent state.
 Checkpointing: Periodically saving the state of each node so that, in the event of a
failure, the system can revert to a recent consistent state.
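The consistent-cut rule above can be sketched as a small check (the process names, event indices, and message tuples here are invented for illustration):

```python
# A cut assigns to each process the number of its events included in the
# snapshot. It is consistent if every message received inside the cut was
# also sent inside the cut (no effect without its cause).

def is_consistent(cut, messages):
    """
    cut: dict process -> number of events included from that process
    messages: list of (sender, send_index, receiver, recv_index),
              where indices are 1-based event positions.
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received = recv_idx <= cut.get(receiver, 0)
        sent = send_idx <= cut.get(sender, 0)
        if received and not sent:
            return False    # a receive event whose send lies outside the cut
    return True
```

Note that the converse (a send inside the cut whose receive lies outside) is fine: the message is simply recorded as in transit.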
3. Applications and Importance
 Fault Tolerance: Time and global state management help detect and recover from
failures by providing consistent snapshots and checkpoints.
 Distributed Debugging: Logical clocks and consistent cuts assist in debugging by
tracking causality across distributed nodes.
 Event Ordering and Consistency: Time ordering is essential for distributed databases
and applications where consistency and causality need to be maintained.
4. Challenges and Trade-offs
 Trade-off between Accuracy and Overhead: Highly synchronized clocks and frequent
snapshots incur higher communication and processing overhead.
 Network Latency and Clock Drift: Physical clock synchronization is limited by
network latency and the drift of hardware clocks, making perfect global time
ordering impossible.
 Concurrency: Concurrency increases the complexity of maintaining a consistent
global state without introducing blocking or latency.
In summary, managing time and global state in distributed systems is about achieving a
logical ordering of events and snapshots of system state that are consistent across nodes,
despite the inherent limitations of independent clocks and distributed control.

clocks, events, process states :-


In distributed systems, the concepts of clocks, events, and process states are fundamental
to understanding how the system’s activities are organized and how processes coordinate
despite being distributed across different nodes.
1. Clocks
Clocks in distributed systems help provide a way to track and order events, which is crucial
since there is no single, global clock that can be referenced by all nodes simultaneously.
Types of Clocks
 Physical Clocks: These are real-time clocks present in each system’s hardware, but
they tend to drift over time due to slight inaccuracies in each clock’s crystal oscillator.
This drift means that physical clocks across different systems can fall out of sync.
o Synchronization Protocols: Protocols like the Network Time Protocol (NTP) or
Precision Time Protocol (PTP) help synchronize physical clocks across nodes,
but they are not perfect and can introduce small time discrepancies.
 Logical Clocks: Logical clocks are designed to order events without reference to
physical time. They track causal relationships rather than absolute time.
o Lamport Timestamps: Introduced by Leslie Lamport, these timestamps assign
an integer counter to each event. Each process increments its own counter
with each event and synchronizes with others when messages are exchanged.
Lamport timestamps help in maintaining a partial ordering of events, ensuring
that causally related events are ordered.
o Vector Clocks: These clocks extend Lamport timestamps by maintaining a
vector for each process that tracks causal dependencies more accurately
across multiple nodes. Each element in the vector represents the logical clock
of a different node, enabling detection of concurrency and a more precise
ordering of events.
 Hybrid Clocks: Hybrid clocks, such as Google’s TrueTime, combine physical and
logical clocks to balance accuracy in timekeeping with causality tracking. TrueTime
provides an interval for event timestamps to account for any synchronization
uncertainty between nodes.

2. Events
 In distributed systems, events refer to any significant action that occurs within a
process, such as sending or receiving a message, updating a variable, or initiating a
computation. Events are discrete points in time and serve as key markers for ordering
and synchronization.
 Types of Events:
o Internal Events: These are events that happen within a single process and
don’t involve interaction with other processes, such as updating local data or
executing a computation.
o External Events: These involve interaction between processes, such as
sending and receiving messages. External events are critical for coordinating
across nodes and often trigger updates to logical clocks.
 Event Ordering:
o Causal Ordering: Events are causally ordered if one event influences or
depends on another. For instance, if event A on node 1 causes event B on
node 2, then event A should appear before event B in the causal order. Causal
ordering is tracked using logical clocks.
o Concurrent Events: Events are considered concurrent if they have no causal
relationship, meaning they happen independently. With vector clocks,
concurrency can be detected when neither clock vector is “less than” the
other.
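The comparison described above can be sketched as two small predicates (a minimal sketch, assuming vector clock values are equal-length lists with one entry per node):

```python
def happened_before(u, v):
    """u -> v iff every entry of u is <= the matching entry of v
    and at least one entry is strictly smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def concurrent(u, v):
    """Events are concurrent iff neither vector is 'less than' the other."""
    return not happened_before(u, v) and not happened_before(v, u)
```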
3. Process States
 Process State: The process state is the complete set of information that describes a
process at a specific point in time, including its variables, data structures, and other
resources it is holding. A process's state can change over time as it processes internal
events or interacts with other processes.
 In distributed systems, the states of individual processes, when combined, represent
the system's global state at a particular moment. However, capturing a global state is
challenging because processes operate asynchronously and there is no instant way to
collect all process states at once.
Capturing Process States and Global State
 Snapshots: To capture a global state, snapshot algorithms like the Chandy-Lamport
algorithm record process states and message histories in a way that respects
causality. Snapshots help in understanding the distributed system's behavior and are
essential for debugging, rollback, and fault tolerance.
 Consistent Cuts: A consistent cut is a subset of all events and process states that do
not violate causality. For example, if one process’s state reflects receiving a message,
the sender’s state should reflect having sent it. Consistent cuts help create a
coherent snapshot of the system state at any point in time.
States in Fault Tolerance and Recovery
 Checkpointing: This is the process of saving the state of a process at regular intervals,
so in the case of a failure, the system can roll back to a known good state.
Checkpointing is essential for fault tolerance and is often combined with snapshot
techniques to ensure that the saved states across processes are consistent.
 Event Logs: Systems may also log events to support replaying operations from a
certain point, which aids in recovery and debugging. Event logs help maintain a
sequence of actions that can reconstruct a process’s state at any historical point.
Key Interrelationships
 Clocks and Events: Clocks help order events, which is crucial for establishing causality
in distributed systems.
 Events and Process States: Events alter process states, so tracking events allows us to
understand how process states evolve over time.
 Global State and Consistent Cut: Global state is an aggregate of all process states at a
given time. Consistent cuts ensure a logically coherent global state by tracking
process state changes and message transfers.
In summary, clocks provide a means to order events, events signify changes within or
between processes, and process states reflect the current configuration of a process.
Together, these components help maintain coordination and consistency in distributed
systems.

Clock Synchronization in Distributed System :-


Introduction
Clock synchronization is the mechanism of synchronizing the time of all the computers in a
distributed environment or system.
Assume that there are three systems in a distributed environment. To send, receive, and
manage data between the systems in a time-consistent manner, their clocks must be
synchronized; this process is known as clock synchronization.
Synchronization in a distributed system is more complicated than in a centralized system
because it must rely on distributed algorithms.
Properties of distributed algorithms for clock synchronization:
 Relevant and correct information is scattered among multiple machines.
 Processes make decisions based only on local information.
 A single point of failure in the system must be avoided.
 No common clock or other source of precise global time exists.
 In distributed systems, time is therefore ambiguous.

Since each system in a distributed environment has its own clock, the times reported by the
clocks may vary. It is therefore necessary to synchronize all the clocks in the distributed environment.
Types of Clock Synchronization
 Physical clock synchronization
 Logical clock synchronization
 Mutual exclusion synchronization

Physical Synchronization:
 In physical clock synchronization, all the computers have their own clocks.
 The physical clocks are adjusted to bring the nodes' times into agreement. All the nodes in the
system can share their local time with all other nodes in the system.
 The time is set based on UTC (Coordinated Universal Time).
 The difference between the readings of two clocks at a given moment is known as clock skew,
while the gradual divergence of a clock's rate over time is known as clock drift. Synchronization is necessary to bound both.

Physical clocks: In physical synchronization, physical clocks are used to timestamp events
on each computer.
If two events E1 and E2 have different timestamps t1 and t2, what matters is the order in
which the events occurred, not the exact time or day at which they occurred.
Several methods are used to synchronize physical clocks in distributed systems:
1. UTC (Coordinated Universal Time)
2. Cristian's algorithm
3. The Berkeley algorithm

Coordinated Universal Time (UTC)


 All the computers are generally synchronized to a standard time called Coordinated
Universal Time (UTC).
 UTC is the primary time standard by which time and clocks are regulated
worldwide. It is available via radio signals, telephone lines, and satellites (GPS).
 UTC is broadcast via satellites.
 Computer servers and online services with access to UTC sources can be synchronized by
the satellite broadcast.
 UTC is kept within 0.9 seconds of UT1.

UT0 - Mean solar time on the Greenwich meridian, obtained from astronomical observation.
UT1 - UT0 corrected for polar motion.
UT2 - UT1 corrected for seasonal variations in the earth's rotation.
UTC - Civil time, measured on an atomic time scale.
Cristian’s Algorithm:
 The simplest algorithm for setting the time: the client issues a remote procedure call (RPC)
to the time server and obtains the current time.
 A machine whose clock must stay within d seconds of UTC must resynchronize at least
every d/ρ seconds, where ρ is the maximum drift rate of its clock.
 The time server sends a reply containing the current UTC when it receives a request from
the client; the client compensates for network delay by adding half the measured round-trip time.
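The delay compensation in Cristian's algorithm can be sketched with injectable timestamps so the arithmetic is easy to check (the values are illustrative; a real client would measure T0 and T1 around an actual RPC to the time server):

```python
# Cristian's algorithm: the client records local time T0 when it sends the
# request and T1 when the reply arrives. Assuming symmetric network delay,
# the server's reported time is roughly half a round trip old, so the client
# sets its clock to t_server plus half the round-trip time.

def cristian_adjusted_time(t0, t1, t_server):
    """Estimate current UTC from the server's reply and the measured RTT."""
    round_trip = t1 - t0
    return t_server + round_trip / 2
```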

What is the logical time :-


The logical time in distributed systems is used to maintain the consistent ordering of events.
The concept of causality, i.e. the causal precedence relationship, is fundamental for
distributed systems. Usually, it is tracked using physical time, but physical clocks are hard to
maintain in a distributed system, so logical clocks are used instead. The idea that makes
logical clocks different is that these are designed to maintain the information about the
order of events rather than pertaining to the same notion of time as the physical clocks.
Rules
A logical clock must specify the following two rules to implement proper functionality:
 Rule 1 determines how a local process updates its clock when an event occurs.
 Rule 2 determines how a local process updates its clock upon receiving a
message from another process, so as to update its view of global time.
Lamport clock algorithm
The Lamport clock is a simple logical-clock implementation: each process keeps a counter that
it increments on every event, attaches to outgoing messages, and advances past the timestamps
carried by incoming messages.
Applications
Logical time has many applications in distributed systems.
 It is used to resolve conflicts, track individual events, and take concurrency measures.
 Logical time also has its application in debugging distributed systems. With all the
information about the causal relationship of events, it is easy to spot which event
caused the error.

Logical Clock in Distributed System :-


In distributed systems, ensuring synchronized events across multiple nodes is crucial for
consistency and reliability. Enter logical clocks, a fundamental concept that orchestrates
event ordering without relying on physical time. By assigning logical timestamps to events,
these clocks enable systems to reason about causality and sequence events accurately, even
across network delays and varied system clocks. This article explores how logical clocks
enhance distributed system design.
What are Logical Clocks?
Logical clocks are a concept used in distributed systems to order events without relying on
physical time synchronization. They provide a way to establish a partial ordering of events
based on causality rather than real-time clock values.
 By assigning logical timestamps to events, logical clocks allow distributed systems to
maintain consistency and coherence across different nodes, despite varying clock
speeds and network delays.
 This ensures that events can be correctly ordered and coordinated, facilitating fault
tolerance and reliable operation in distributed computing environments.
Differences Between Physical and Logical Clocks
Physical clocks and logical clocks serve distinct purposes in distributed systems:
1. Nature of Time:
 Physical Clocks: These rely on real-world time measurements and are
typically synchronized using protocols like NTP (Network Time Protocol). They
provide accurate timestamps but can be affected by clock drift and network
delays.
 Logical Clocks: These are not tied to real-world time and instead use logical
counters or timestamps to order events based on causality. They are resilient
to clock differences between nodes but may not provide real-time accuracy.
2. Usage:
 Physical Clocks: Used for tasks requiring real-time synchronization and
precise timekeeping, such as scheduling tasks or logging events with accurate
timestamps.
 Logical Clocks: Used in distributed systems to order events across different
nodes in a consistent and causal manner, enabling synchronization and
coordination without strict real-time requirements.
3. Dependency:
 Physical Clocks: Dependent on accurate timekeeping hardware and
synchronization protocols to maintain consistency across distributed nodes.
 Logical Clocks: Dependent on the logic of event ordering and causality,
ensuring that events can be correctly sequenced even when nodes have
different physical time readings.
Types of Logical Clocks in Distributed System
1. Lamport Clocks
Lamport clocks provide a simple way to order events in a distributed system. Each node
maintains a counter that increments with each event. When nodes communicate, they
update their counters based on the maximum value seen, ensuring a consistent order of
events.
Characteristics of Lamport Clocks:
 Simple to implement.
 Guarantees that causally related events are ordered; a total order can be imposed only
by breaking timestamp ties (e.g., with process IDs).
 Not suitable for detecting concurrency: from the timestamps alone, one cannot tell
whether two events are causally related or concurrent.
Algorithm of Lamport Clocks:
1. Initialization: Each node initializes its clock L to 0.
2. Internal Event: When a node performs an internal event, it increments its clock L.
3. Send Message: When a node sends a message, it increments its clock L and
includes this value in the message.
4. Receive Message: When a node receives a message with timestamp T, it sets
L = max(L, T) + 1.
Advantages of Lamport Clocks:
 Simple to implement and understand.
 Ensures a consistent ordering of causally related events.
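The four rules above can be sketched as a minimal Lamport clock (a sketch only; a real system would attach these timestamps to its messaging layer):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def internal_event(self):
        """Rule for an internal event: increment the local counter."""
        self.time += 1
        return self.time

    def send(self):
        """Increment the counter and return it as the message timestamp."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """On receipt of timestamp T, set L = max(L, T) + 1."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```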
2. Vector Clocks
Vector clocks use an array of integers, where each element corresponds to a node in the
system. Each node maintains its own vector clock and updates it by incrementing its own
entry and incorporating values from other nodes during communication.
Characteristics of Vector Clocks:
 Captures causality and concurrency between events.
 Requires more storage and communication overhead compared to Lamport clocks.
Algorithm of Vector Clocks:
1. Initialization: Each node P_i initializes its vector clock V_i to a vector of zeros.
2. Internal Event: When a node performs an internal event, it increments its own entry
in the vector clock, V_i[i].
3. Send Message: When a node P_i sends a message, it includes its vector clock
V_i in the message.
4. Receive Message: When a node P_i receives a message with vector clock V_j:
 It updates each entry: V_i[k] = max(V_i[k], V_j[k])
 It increments its own entry: V_i[i] = V_i[i] + 1
Advantages of Vector Clocks:
 Accurately captures causality and concurrency.
 Detects concurrent events, which Lamport clocks cannot do.
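The vector-clock rules above can be sketched as follows (a minimal sketch; it also treats a send as an event that increments the local entry, a common convention not spelled out in the rules above):

```python
class VectorClock:
    def __init__(self, node_id, n):
        self.i = node_id
        self.v = [0] * n             # one entry per node in the system

    def internal_event(self):
        """Increment the node's own entry."""
        self.v[self.i] += 1

    def send(self):
        """Increment the own entry and return a copy to travel with the message."""
        self.v[self.i] += 1
        return list(self.v)

    def receive(self, other):
        """Merge entry-wise maxima, then increment the own entry."""
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.i] += 1
```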
3. Matrix Clocks
Matrix clocks extend vector clocks by maintaining a matrix where each entry captures the
history of vector clocks. This allows for more detailed tracking of causality relationships.
Characteristics of Matrix Clocks:
 More detailed tracking of event dependencies.
 Higher storage and communication overhead compared to vector clocks.
Algorithm of Matrix Clocks:
1. Initialization: Each node P_i initializes its matrix clock M_i to a matrix of
zeros.
2. Internal Event: When a node performs an internal event, it increments its own entry
in the matrix clock, M_i[i][i].
3. Send Message: When a node P_i sends a message, it includes its matrix clock
M_i in the message.
4. Receive Message: When a node P_i receives a message with matrix clock M_j:
 It updates each entry: M_i[k][l] = max(M_i[k][l], M_j[k][l])
 It increments its own entry: M_i[i][i] = M_i[i][i] + 1
Advantages of Matrix Clocks:
 Detailed history tracking of event causality.
 Can provide more information about event dependencies than vector clocks.
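The simplified matrix-clock rules above can be sketched as follows (a sketch only; a full matrix clock also maintains each node's own row as its vector clock and uses the other rows to reason about what other nodes know):

```python
class MatrixClock:
    def __init__(self, node_id, n):
        self.i = node_id
        self.m = [[0] * n for _ in range(n)]   # n x n matrix of counters

    def internal_event(self):
        """Increment the node's own diagonal entry."""
        self.m[self.i][self.i] += 1

    def send(self):
        """Increment the own entry; a deep copy travels with the message."""
        self.m[self.i][self.i] += 1
        return [row[:] for row in self.m]

    def receive(self, other):
        """Merge entry-wise maxima, then increment the own diagonal entry."""
        n = len(self.m)
        self.m = [[max(self.m[k][l], other[k][l]) for l in range(n)]
                  for k in range(n)]
        self.m[self.i][self.i] += 1
```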
4. Hybrid Logical Clocks (HLCs)
Hybrid logical clocks combine physical and logical clocks to provide both causality and real-
time properties. They use physical time as a base and incorporate logical increments to
maintain event ordering.
Characteristics of Hybrid Logical Clocks:
 Combines real-time accuracy with causality.
 More complex to implement compared to pure logical clocks.
Algorithm of Hybrid Logical Clocks:
1. Initialization: Each node initializes its clock H with the current physical time.
2. Internal Event: When a node performs an internal event, it increments the logical part
of its HLC.
3. Send Message: When a node sends a message, it includes its HLC in the message.
4. Receive Message: When a node receives a message with HLC T, it updates
H = max(H, T) + 1.


Advantages of Hybrid Logical Clocks:
 Balances real-time accuracy and causal consistency.
 Suitable for systems requiring both properties, such as databases and distributed
ledgers.
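A simplified HLC following the rules above can be sketched with an injectable physical-time source, so it behaves deterministically (an assumption for testability; real HLC implementations also keep a separate logical counter alongside the physical part rather than folding both into one value):

```python
class HybridClock:
    def __init__(self, now_fn):
        self.now = now_fn            # injected physical-time source
        self.h = now_fn()            # initialize with current physical time

    def event(self):
        """Internal or send event: never run behind physical time, always advance."""
        self.h = max(self.h, self.now()) + 1
        return self.h

    def receive(self, t):
        """On receiving HLC value T: H = max(H, T, physical now) + 1."""
        self.h = max(self.h, t, self.now()) + 1
        return self.h
```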
Applications of Logical Clocks
Logical clocks play a crucial role in distributed systems by providing a way to order events
and maintain consistency. Here are some key applications:
 Event Ordering
o Causal Ordering: Logical clocks help establish a causal relationship between
events, ensuring that messages are processed in the correct order.
o Total Ordering: In some systems, it’s essential to have a total order of events.
Logical clocks can be used to assign unique timestamps to events, ensuring a
consistent order across the system.
 Causal Consistency
o Consistency Models: In distributed databases and storage systems, logical
clocks are used to ensure causal consistency. They help track dependencies
between operations, ensuring that causally related operations are seen in the
same order by all nodes.
 Distributed Debugging and Monitoring
o Tracing and Logging: Logical clocks can be used to timestamp logs and trace
events across different nodes in a distributed system. This helps in debugging
and understanding the sequence of events leading to an issue.
o Performance Monitoring: By using logical clocks, it’s possible to monitor the
performance of distributed systems, identifying bottlenecks and delays.
 Distributed Snapshots
o Checkpointing: Logical clocks are used in algorithms for taking consistent
snapshots of the state of a distributed system, which is essential for fault
tolerance and recovery.
o Global State Detection: They help detect global states and conditions such as
deadlocks or stable properties in the system.
 Concurrency Control
o Optimistic Concurrency Control: Logical clocks help detect conflicts in
transactions by comparing timestamps, allowing systems to resolve conflicts
and maintain data integrity.
o Versioning: In versioned storage systems, logical clocks can be used to
maintain different versions of data, ensuring that updates are applied
correctly and consistently.
Challenges and Limitations with Logical Clocks
Logical clocks are essential for maintaining order and consistency in distributed systems, but
they come with their own set of challenges and limitations:
 Scalability Issues
o Vector Clock Size: In systems using vector clocks, the size of the vector grows
with the number of nodes, leading to increased storage and communication
overhead.
o Management Complexity: Managing and maintaining logical clocks across a
large number of nodes can be complex and resource-intensive.
 Synchronization Overhead
o Communication Overhead: Synchronizing logical clocks requires additional
messages between nodes, which can increase network traffic and latency.
o Processing Overhead: Updating and maintaining logical clock values can add
computational overhead, impacting the system’s overall performance.
 Handling Failures and Network Partitions
o Clock Inconsistency: In the presence of network partitions or node failures,
maintaining consistent logical clock values can be challenging.
o Recovery Complexity: When nodes recover from failures, reconciling logical
clock values to ensure consistency can be complex.
 Partial Ordering
o Limited Ordering Guarantees: Logical clocks, especially Lamport clocks, only
provide partial ordering of events, which may not be sufficient for all
applications requiring a total order.
o Conflict Resolution: Resolving conflicts in operations may require additional
mechanisms beyond what logical clocks can provide.
 Complexity in Implementation
o Algorithm Complexity: Implementing logical clocks, particularly vector and
matrix clocks, can be complex and error-prone, requiring careful design and
testing.
o Application-Specific Adjustments: Different applications may require
customized logical clock implementations to meet their specific requirements.
 Storage Overhead
o Vector and Matrix Clocks: These clocks require storing a vector or matrix of
timestamps, which can consume significant memory, especially in systems
with many nodes.
o Snapshot Storage: For some applications, maintaining snapshots of logical
clock values can add to the storage overhead.
 Propagation Delay
o Delayed Updates: Updates to logical clock values may not propagate instantly
across all nodes, leading to temporary inconsistencies.
o Latency Sensitivity: Applications that are sensitive to latency may be
impacted by the delays in propagating logical clock updates.
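To make the vector-clock storage overhead mentioned above concrete, here is a minimal Python sketch (the class and method names are illustrative, not from any library): each node keeps one counter per node, so storage and message size grow linearly with the number of nodes.

```python
# Minimal vector clock sketch: O(N) storage per node, where N = node count.

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes  # one entry per node -> grows with N

    def tick(self):
        # Local event: increment our own component.
        self.clock[self.node_id] += 1

    def merge(self, other):
        # On message receipt: element-wise max with the sender's clock, then tick.
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.tick()

# Usage: two nodes in a 3-node system.
a = VectorClock(0, 3)
b = VectorClock(1, 3)
a.tick()          # a.clock == [1, 0, 0]
b.merge(a.clock)  # b.clock == [1, 1, 0]
```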
Coordination and Agreement
Overview
We start by addressing the question of why processes need to coordinate their actions and
agree on values in various scenarios.
1. Consider a mission critical application that requires several computers to
communicate and decide whether to proceed with or abort a mission. Clearly, all
must come to agreement about the fate of the mission.
2. Consider the Berkeley algorithm for time synchronization. One of the participating
computers serves as the coordinator. Suppose that coordinator fails. The remaining
computers must elect a new coordinator.
3. Broadcast networks like Ethernet and wireless must agree on which nodes can send
at any given time. If they do not agree, the result is a collision and no message is
transmitted successfully.
4. Like other broadcast networks, sensor networks face the challenge of agreeing on
which nodes will send at any given time. In addition, many sensor network
algorithms require that nodes elect coordinators that take on a server-like
responsibility. Choosing these nodes is particularly challenging in sensor networks
because of the battery constraints of the nodes.
5. Many applications, such as banking, require that nodes coordinate their access of a
shared resource. For example, a bank balance should only be accessed and updated
by one computer at a time.
Failure Assumptions and Detection
Coordination in a synchronous system with no failures is comparatively easy. We'll look at
some algorithms targeted toward this environment. However, if a system is asynchronous,
meaning that messages may be delayed an indefinite amount of time, or failures may occur,
then coordination and agreement become much more challenging.
A correct process "is one that exhibits no failures at any point in the execution under
consideration." If a process fails, it can fail in one of two ways: a crash failure or a byzantine
failure. A crash failure implies that a node stops working and does not respond to any
messages. A byzantine failure implies that a node exhibits arbitrary behavior. For example, it
may continue to function but send incorrect values.
Failure Detection
One possible algorithm for detecting failures is as follows:
 Every t seconds, each process sends an "I am alive" message to all other processes.
 Process p knows that process q is either unsuspected, suspected, or failed.
 If p sees q's message, it sets q's status to unsuspected.
This seems ok if there are no failures. What happens if a failure occurs? In this case, q will
not send a message. In a synchronous system, p waits for d seconds (where d is the
maximum delay in message delivery) and if it does not hear from q then it knows that q has
failed. In an asynchronous system, q can be suspected of failure after a timeout, but there is
no guarantee that a failure has occurred.
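The timeout-based detector just described can be sketched as follows (the `Status` and `FailureDetector` names are illustrative, not from any library; `d` is the maximum message delay, which only exists in the synchronous case):

```python
# Hedged sketch of a heartbeat/timeout failure detector.
from enum import Enum

class Status(Enum):
    UNSUSPECTED = "unsuspected"
    SUSPECTED = "suspected"

class FailureDetector:
    def __init__(self, d):
        self.d = d           # max message delay (synchronous assumption)
        self.last_seen = {}  # process -> time its last "I am alive" arrived

    def heartbeat(self, q, now):
        # Called when p receives q's "I am alive" message.
        self.last_seen[q] = now

    def status(self, q, now):
        # Synchronous system: a reliable verdict after d seconds of silence.
        # Asynchronous system: only a suspicion, not a guarantee of failure.
        if now - self.last_seen.get(q, float("-inf")) <= self.d:
            return Status.UNSUSPECTED
        return Status.SUSPECTED

fd = FailureDetector(d=5)
fd.heartbeat("q", now=0)
fd.status("q", now=3)   # UNSUSPECTED: heard from q within the delay bound
fd.status("q", now=10)  # SUSPECTED: silent for more than d seconds
```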
Mutual Exclusion
The first set of coordination algorithms we'll consider deal with mutual exclusion. How can
we ensure that two (or more) processes do not access a shared resource simultaneously?
This problem comes up in the OS domain and is addressed by negotiating with shared
objects (locks). In a distributed system, nodes must negotiate via message passing.
Each of the following algorithms attempts to ensure the following:
 Safety: At most one process may execute in the critical section (CS) at a time.
 Liveness: Requests to enter and exit the critical section eventually succeed.
 Causal ordering: If one request to enter the CS happened-before another, then entry
to the CS is granted in that order.
Central Server
The first algorithm uses a central server to manage access to the shared resource. To enter a
critical section, a process sends a request to the server. The server behaves as follows:
 If no one is in a critical section, the server returns a token. When the process exits
the critical section, the token is returned to the server.
 If someone already has the token, the request is queued.
Requests are serviced in FIFO order.
If no failures occur, this algorithm ensures safety and liveness. However, ordering is not
preserved (why?). The central server is also a bottleneck and a single point of failure.
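The server's token-and-queue behavior can be sketched in a few lines of single-threaded Python (class and method names are illustrative; a real server would handle these requests as network messages):

```python
# Minimal sketch of the central-server mutual exclusion protocol.
from collections import deque

class LockServer:
    def __init__(self):
        self.holder = None     # process currently holding the token
        self.queue = deque()   # FIFO: queued requests serviced in order

    def request(self, pid):
        # Returns True if the token is granted immediately.
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)  # token is taken: queue the request
        return False

    def release(self, pid):
        # Process returns the token; grant it to the next waiter, if any.
        assert self.holder == pid
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder

s = LockServer()
s.request("p1")  # True: p1 gets the token
s.request("p2")  # False: p2 is queued
s.release("p1")  # token passes to p2
```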
Token Ring
The token ring algorithm arranges processes in a logical ring. A token is passed clockwise
around the ring. When a process receives the token it can enter its critical section. If it does
not need to enter a critical section, it immediately passes the token to the next process.
This algorithm also achieves safety and liveness, but not ordering, in the case when no
failures occur. However, a significant amount of bandwidth is used because the token is
passed continuously even when no process needs to enter a CS.
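A toy simulation of the token's circulation (function and parameter names are illustrative) shows both properties: only the token holder enters its critical section, and the token keeps moving even past processes with nothing to do:

```python
# Toy round-based token ring simulation.

def circulate(n, wants_cs, rounds=1):
    """Simulate `rounds` full passes of the token around an n-process ring;
    return the order in which processes entered their critical sections."""
    entered = []
    for _ in range(rounds):
        for pid in range(n):          # token moves around the ring: 0..n-1
            if pid in wants_cs:
                entered.append(pid)   # hold token, run CS, then pass it on
            # otherwise: pass the token along immediately (wasted message)
    return entered

circulate(4, wants_cs={1, 3})  # [1, 3]
```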
Multicast and Logical Clocks
Each process has a unique identifier and maintains a logical clock. A process can be in one of
three states: released, waiting, or held. When a process wants to enter a CS it does the
following:
 sets its state to waiting
 sends a message to all other processes containing its ID and timestamp
 once all other processes respond, it can enter the CS
When a message is received from another process, it does the following:
 if the receiver process state is held, the message is queued
 if the receiver process state is waiting and the timestamp of the message is after the
local timestamp, the message is queued (if the timestamps are the same, the process
ID is used to order messages)
 else - reply immediately
When a process exits a CS, it does the following:
 sets its state to released
 replies to queued requests
This algorithm provides safety, liveness, and ordering. However, it cannot deal with failure
and has problems of scale.
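The receive-side rule above (in the style of Ricart-Agrawala) can be sketched as a single decision function; the `(timestamp, id)` tuple comparison implements the tie-break by process ID (names here are illustrative):

```python
# Sketch of the receive-side rule in the multicast mutual exclusion algorithm.

RELEASED, WAITING, HELD = "released", "waiting", "held"

def on_request(state, my_ts, my_id, msg_ts, msg_id, deferred):
    """Reply immediately, or defer the request until we exit the CS.
    Ties in timestamps are broken by comparing process IDs."""
    if state == HELD or (state == WAITING and (my_ts, my_id) < (msg_ts, msg_id)):
        deferred.append((msg_ts, msg_id))  # we have priority: queue it
        return "queued"
    return "reply"

deferred = []
on_request(WAITING, my_ts=3, my_id=1, msg_ts=5, msg_id=2, deferred=deferred)   # queued
on_request(RELEASED, my_ts=0, my_id=1, msg_ts=5, msg_id=2, deferred=deferred)  # reply
```

On exiting the CS, the process would set its state to released and reply to everything in `deferred`.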
None of the algorithms discussed are appropriate for a system in which failures may occur. In
order to handle this situation, we would need to first detect that a failure has occurred and
then reorganize the processes (e.g., form a new token ring) and reinitialize appropriate state
(e.g., create a new token).
Election
An election algorithm determines which process will play the role of coordinator or server.
All processes need to agree on the selected process. Any process can start an election, for
example if it notices that the previous coordinator has failed. The requirements of an
election algorithm are as follows:
 Safety: Only one process is chosen -- the one with the largest identifying value. The
value could be load, uptime, a random number, etc.
 Liveness: All processes eventually choose a winner or crash.
Ring-based
Processes are arranged in a logical ring. A process starts an election by placing its ID and
value in a message and sending the message to its neighbor. When a message is received, a
process does the following:
 If the value is greater than its own, it saves the ID and forwards the message to its
neighbor.
 Else if its own value is greater and it has not yet participated in the election, it
replaces the ID and value with its own and forwards the message.
 Else if it has already participated, it discards the message.
 If a process receives its own ID and value, it knows it has been elected. It then sends
an elected message to its neighbor.
 When an elected message is received, it is forwarded to the next neighbor.
Safety is guaranteed - only one value can be largest and make it all the way through the ring.
Liveness is guaranteed if there are no failures. However, the algorithm does not work if there
are failures.
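The failure-free case can be simulated directly (names are illustrative; values double as IDs here to keep the sketch short):

```python
# Simulation of the ring-based election on a failure-free ring.

def ring_election(values, starter=0):
    """values[i] is process i's value; returns the index of the winner."""
    n = len(values)
    msg = (starter, values[starter])  # (ID, value) sent to the neighbor
    i = (starter + 1) % n
    while True:
        pid, val = msg
        if pid == i:
            return i                  # saw our own ID and value: elected
        if values[i] > val:
            msg = (i, values[i])      # replace with our own larger value
        i = (i + 1) % n               # forward to the next neighbor

ring_election([3, 7, 2, 5])  # process 1 (value 7) wins
```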
Bully
The bully algorithm can deal with crash failures, but not communication failures. When a
process notices that the coordinator has failed, it sends an election message to all higher-
numbered processes. If no one replies, it declares itself the coordinator and sends a new
coordinator message to all processes. If someone replies, it does nothing else. When a
process receives an election message from a lower-numbered process it returns a reply and
starts an election. This algorithm guarantees safety and liveness and can deal with crash
failures.
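The bully algorithm's control flow can be caricatured as a recursion over live process IDs (a deliberate simplification: real implementations exchange election/reply/coordinator messages and rely on timeouts to detect silence):

```python
# Toy sketch of the bully algorithm among the set of alive process IDs.

def start_election(pid, alive):
    """Process pid starts an election; returns the elected coordinator."""
    higher = sorted(q for q in alive if q > pid)
    if not higher:
        return pid  # no higher process replies: declare self coordinator
    # A higher-numbered process replies and starts its own election,
    # "bullying" pid out of the running.
    return start_election(higher[0], alive)

start_election(1, {1, 2, 4})  # 4 wins after the old coordinator 5 crashed
```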
Consensus
All of the previous algorithms are examples of the consensus problem: how can we get all
processes to agree on a state? Here, we look at when the consensus problem is solvable.
The system model considers a collection of processes pi (i = 1, 2, ..., N). Communication is
reliable, but processes may fail. Failures may be crash failures or byzantine failures.
The goals of consensus are as follows:
 Termination: Every correct process eventually decides on a value.
 Agreement: All processes agree on a value.
 Integrity: If all correct processes propose the same value, that value is the one
selected.
We consider the Byzantine Generals problem. A set of generals must agree on whether to
attack or retreat. Commanders can be treacherous (faulty). This is similar to consensus, but
differs in that a single process proposes a value that the others must agree on. The
requirements are:
 Termination: All correct processes eventually decide on a value.
 Agreement: All correct processes agree on a value.
 Integrity: If the commander is correct, all correct processes agree on what the
commander proposed.
If communication is unreliable, consensus is impossible. Remember the blue army discussion
from the second lecture period. With reliable communication, we can solve consensus in a
synchronous system with crash failures.
We can solve Byzantine Generals in a synchronous system as long as less than 1/3 of the
processes fail. The commander sends the command to all of the generals and each general
sends the command to all other generals. If each correct process chooses the majority of all
commands, the requirements are met. Note that the requirements do not specify that the
processes must detect that the commander is faulty.
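The majority-vote step can be sketched for the smallest interesting case, N = 4 generals with one traitor (f = 1 < N/3); the helper name is illustrative:

```python
# Majority-vote step of Byzantine Generals: each loyal general decides by
# taking the majority over all the commands it heard.
from collections import Counter

def decide(commands_heard):
    """Majority over the relayed commands (ties not handled in this toy)."""
    return Counter(commands_heard).most_common(1)[0][0]

# A loyal commander orders "attack"; the traitorous general relays
# "retreat". A loyal general hears the commander's value plus the other
# two generals' relays, and the traitor is outvoted.
decide(["attack", "attack", "retreat"])  # "attack"
```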
It is impossible to guarantee consensus in an asynchronous system, even in the presence of
1 crash failure. That means that we can design systems that reach consensus most of the
time, but cannot guarantee that they will reach consensus every time. Techniques for
reaching consensus in an asynchronous system include the following:
 Masking faults - Hide failures by using persistent storage to store state and restarting
processes when they crash.
 Failure detectors - Treat an unresponsive process (that may still be alive) as failed.
 Randomization - Use randomized behavior to confuse byzantine processes.