
DISTRIBUTED COMPUTING

CS3551
SEMESTER - V
ANNA UNIVERSITY REGULATION 2021

PREPARED BY
PRADEEP.A. BSc., MSc
LECTURER
GOJAN SCHOOL OF BUSINESS AND TECHNOLOGY
UNIT I – INTRODUCTION
Definition - Relation to Computer System Components - Motivation - Message-Passing
Systems versus Shared Memory Systems - Primitives for Distributed Communication -
Synchronous versus Asynchronous Executions - Design Issues and Challenges; A Model of
Distributed Computations: A Distributed Program - A Model of Distributed Executions -
Models of Communication Networks - Global State of a Distributed System

Course Outcomes

CO1: Explain the foundations of distributed systems (K2)

Course Objective

To introduce the computation and communication models of distributed systems

Important Topics

1. Explain how a parallel system differs from a distributed system


2. Illustrate the difference between message passing and shared memory process
communication model
3. Discuss the design issues and challenges in distributed system from a system
perspective.

UNIT I – INTRODUCTION

Definition
Distributed computing is a field of computer science where a collection of independent
computers works together as a single system to achieve a common goal. These computers,
often called nodes, communicate and coordinate their actions by passing messages over a
network. To the users, the distributed system appears as a single coherent unit, even though
its components are physically separated and may be geographically dispersed

Relation to Computer System Components


A distributed system is composed of multiple hardware and software components:
 Nodes: Independent computers (servers, desktops, etc.)
 Network: The communication infrastructure (LAN, WAN, Internet) that connects the
nodes and allows them to exchange data.
 Middleware: Software that provides common services and capabilities to
applications outside of what's offered by the operating system, such as
communication, authentication, and data management.
 Applications: Programs that utilize the distributed resources to perform tasks
collaboratively.
A distributed system is fundamentally an integration of various computer system
components, both hardware and software, that work together to achieve a common
objective. Unlike traditional centralized systems, the components of a distributed system
are physically separated, often geographically dispersed, and connected via a
communication network. Despite this separation, the system is designed to present itself as
a single, coherent unit to users and applications.
The primary components of a distributed system include:
 Nodes: These are independent computers such as servers, desktops, or embedded
devices. Each node runs its own operating system and may host one or more
distributed applications. Nodes are the basic building blocks, providing computation,
storage, and local resources.
 Network: The network (LAN, WAN, or the Internet) forms the communication
backbone, enabling nodes to exchange data and coordinate actions. The network’s
reliability, latency, and bandwidth play a crucial role in the overall performance and
robustness of the distributed system.
 Middleware: Middleware is specialized software that sits between the operating
system and the distributed applications. It provides essential services such as
communication, authentication, data management, and message passing. Middleware
abstracts the complexities of the underlying network and hardware, offering a uniform

interface to applications and facilitating interoperability among heterogeneous
components.
 Applications: These are the programs that leverage distributed resources to perform
collaborative tasks. Applications in a distributed system are designed to run across
multiple nodes, coordinate via message passing, and handle partial failures gracefully.
This layered structure (nodes, network, middleware, and applications) mirrors the
architecture of modern computer systems, where each layer builds upon the services provided
by the one below. In distributed systems, the interaction among these components is managed
through well-defined protocols and interfaces, allowing for resource sharing, scalability, fault
tolerance, and high availability. The seamless integration of these components is what
enables distributed systems to function efficiently and transparently, masking the complexity
of distribution from end users and applications

[Figure: A typical distributed system - processors (P), each with its own memory (M), connected by a communication network (WAN/LAN)]

Motivation
The main motivations for distributed computing include:
 Resource Sharing: Different nodes can share hardware, software, and data.
 Scalability: Systems can grow easily by adding more nodes.
 Fault Tolerance: If one node fails, others can take over, improving reliability.
 Performance: Tasks can be divided and executed in parallel, reducing overall computation time.
 Geographical Distribution: Enables collaboration and resource access across
different locations

[Figure: Interaction of the software components at a node - the distributed application and distributed software (middleware libraries) run above the operating system and the network protocol stack (application, transport, network, and data link layers); the extent of the distributed protocols spans the middleware and the upper layers]

Message-Passing Systems versus Shared Memory Systems
 Message-Passing Systems: Nodes communicate by explicitly sending and receiving
messages over a network. Each node has its own local memory, and there is no direct
access to another node’s memory.
 Shared Memory Systems: Multiple processors access a common memory space.
Communication occurs through reading and writing to shared variables, typically
within a single physical machine or tightly coupled environment.
 Comparison: Message-passing is more suitable for distributed systems where nodes
are physically separated, while shared memory is common in multi-core or multi-
processor systems within a single machine
Two fundamental models define how processes communicate and coordinate: Message-
Passing Systems and Shared Memory Systems. Understanding the differences between these
models is essential for designing efficient distributed algorithms and architectures.
Message-Passing Systems
In a message-passing system, each node (or process) in the distributed system has its own
local memory and does not have direct access to the memory of other nodes. Communication
and coordination among nodes occur exclusively through the explicit sending and receiving
of messages over a network. This model is natural for distributed systems where nodes may
be physically separated and connected via local area networks (LANs), wide area networks
(WANs), or the Internet.
Key Features:
 Explicit Communication: Processes use operations like send and receive to exchange
information.
 No Shared Memory: Each node maintains its own private memory space.
 Network Dependency: The underlying communication network plays a crucial role in
performance, reliability, and scalability.
 Fault Tolerance: Message-passing systems can be more resilient to node failures, as
nodes are loosely coupled.
Use Cases:
Message-passing is ideal for large-scale distributed systems, cloud computing, microservices
architectures, and scenarios where nodes are geographically dispersed.
Shared Memory Systems
In a shared memory system, multiple processors or processes communicate by reading from
and writing to a common memory space. This model is typical in tightly coupled

environments, such as multi-core or multi-processor computers, where all processors have
access to the same physical memory.

Key Features:
 Implicit Communication: Processes communicate by updating shared variables in
memory.
 Direct Memory Access: All processors can access the shared memory directly.
 Synchronization Mechanisms: Mechanisms like locks, semaphores, and monitors are
used to coordinate access to shared data and prevent conflicts.
 Limited Scalability: Shared memory systems are typically limited to a single physical
machine or tightly coupled cluster due to hardware constraints.
Use Cases:
Shared memory is common in parallel computing, high-performance computing, and real-
time systems where low-latency communication is required.

Aspect            | Message-Passing Systems                         | Shared Memory Systems
Communication     | Explicit (send/receive messages)                | Implicit (shared variables)
Memory Access     | Private, local to each node                     | Common, accessible by all processors
Scalability       | High (suitable for large, distributed networks) | Limited (usually within a single machine)
Fault Tolerance   | High (nodes can fail independently)             | Lower (failure can affect the whole system)
Synchronization   | Achieved via message protocols                  | Achieved via locks, semaphores, monitors
Typical Use       | Distributed systems, cloud, microservices       | Parallel computing, multi-core systems

 Message-passing systems are best suited for distributed environments where nodes are
physically separated and communicate over a network.
 Shared memory systems are ideal for tightly coupled, parallel computing
environments within a single machine.
 The choice between the two models depends on system architecture, scalability
requirements, and the physical distribution of resources.
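The contrast can be made concrete with a small sketch. The following Python fragment (illustrative only; threads stand in for processes, and the variable and function names are invented) shows the two styles side by side: shared-memory communication through a lock-protected variable, and message passing through an explicit FIFO channel.

import threading, queue

# Shared memory style: two threads communicate implicitly through a shared variable.
shared_counter = 0
lock = threading.Lock()              # synchronization primitive guarding the shared data

def shared_memory_worker():
    global shared_counter
    with lock:                       # read/write the common variable directly
        shared_counter += 1

# Message passing style: threads exchange data only through explicit messages.
channel = queue.Queue()              # stands in for a FIFO network channel

def sender():
    channel.put("hello")             # send(receiver, m)

def receiver():
    msg = channel.get()              # receive(sender, m); blocks until a message arrives
    print("received:", msg)

if __name__ == "__main__":
    threads = [threading.Thread(target=f)
               for f in (shared_memory_worker, shared_memory_worker, sender, receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("shared_counter =", shared_counter)   # 2 once both increments complete

In the shared-memory half, correctness depends on the lock; in the message-passing half, the channel itself serializes communication, which is why the model extends naturally to physically separated nodes.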
Primitives for Distributed Communication
Distributed systems use a set of basic operations (primitives) for communication:
 Send: Transmit a message from one node to another.
 Receive: Accept a message sent by another node.
 Broadcast: Send a message to all nodes in the system.
 Multicast: Send a message to a specific group of nodes.
These primitives form the foundation for higher-level communication protocols and
synchronization mechanisms.
Communication primitives are the fundamental operations that enable processes running on
different nodes to exchange information and coordinate their activities. Since distributed
systems lack shared memory and nodes are often geographically separated, communication is
achieved through explicit message passing. These primitives form the building blocks for
higher-level protocols and synchronization mechanisms, ensuring that distributed
applications can function correctly and efficiently.
The most basic communication primitives are send and receive. The send primitive allows a
process to transmit a message to another process over the network, while the receive
primitive enables a process to accept a message sent by another node. Beyond these,
distributed systems often employ broadcast and multicast primitives. A broadcast operation
sends a message to all nodes in the system, ensuring that every process receives the same
information simultaneously. Multicast, on the other hand, targets a specific group of nodes,
allowing efficient group communication without flooding the entire network. These
primitives are essential for implementing coordination, synchronization, and consistency
protocols in distributed environments.
The choice and implementation of communication primitives directly affect the design and
performance of distributed systems. For example, primitives must handle issues such as
message ordering, reliability, and fault tolerance. They also underpin more complex
operations like distributed mutual exclusion, consensus, and global state recording. By
providing a standardized set of operations for message exchange, communication primitives
abstract the complexities of the underlying network, enabling developers to build robust and
scalable distributed applications.
 Send and receive are the fundamental primitives for message exchange in distributed
systems.
 Broadcast and multicast enable efficient group communication, supporting
coordination among multiple nodes.
 Primitives form the foundation for higher-level protocols such as synchronization,
consensus, and global state capture.

 Proper implementation must address issues of ordering, reliability, and fault tolerance
to ensure correct system behaviour.
 Communication primitives abstract network complexities, allowing distributed
applications to be built in a scalable and robust manner.
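A minimal sketch of these primitives, assuming a toy single-machine simulation in Python (the Network class and the process identifiers P1-P3 are invented for illustration, not part of any standard library):

import queue

class Network:
    """Toy network: one FIFO inbox per process identifier."""
    def __init__(self, pids):
        self.inbox = {pid: queue.Queue() for pid in pids}

    def send(self, dst, msg):            # unicast: transmit msg to one node
        self.inbox[dst].put(msg)

    def receive(self, pid):              # blocking receive at node pid
        return self.inbox[pid].get()

    def multicast(self, group, msg):     # deliver msg to a specific group of nodes
        for dst in group:
            self.send(dst, msg)

    def broadcast(self, msg):            # deliver msg to every node in the system
        self.multicast(list(self.inbox), msg)

net = Network(["P1", "P2", "P3"])
net.send("P2", "hello P2")
net.multicast(["P2", "P3"], "group update")
net.broadcast("system-wide notice")
print(net.receive("P2"))                 # -> hello P2 (FIFO order per channel)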

Synchronous versus Asynchronous Executions


 Synchronous Execution: All nodes operate in lockstep, often coordinated by a global
clock. Message delivery and process execution happen within known time bounds.
 Asynchronous Execution: Nodes operate independently without a global clock.
Message delivery times and process execution speeds are unpredictable.
 Implications: Asynchronous systems are more flexible and realistic for large-scale
distributed systems, but they introduce challenges in coordination and consistency.

In a synchronous execution model, all processes or nodes in the distributed system operate in
lockstep, often coordinated by a global clock or a well-defined timing mechanism. Each
process performs its actions in discrete rounds or steps, and message delivery between nodes
is guaranteed to occur within known, bounded time intervals. This means that every process
waits for all others to complete their current step before advancing to the next one.
Synchronous execution simplifies the design of distributed algorithms because it eliminates
uncertainty about the order and timing of events, making it easier to reason about system
behaviour, detect failures, and maintain consistency. However, achieving true synchrony in
real-world, geographically distributed systems is challenging due to network delays and
varying process speeds.
Asynchronous Execution
In contrast, asynchronous execution allows each process to operate independently, without
relying on a global clock or synchronized steps. Processes execute at their own pace, and
there are no guarantees about the relative speeds of execution or the delivery times of
messages. This model is more realistic for large-scale distributed systems, where network
latency, process scheduling, and hardware differences make synchronization impractical.
Asynchronous execution provides greater flexibility and scalability, but it introduces
significant challenges in coordination, consistency, and fault tolerance. Algorithms must be
carefully designed to cope with unpredictable delays, out-of-order message delivery, and the
absence of a shared notion of time.
Implications and Comparison
The choice between synchronous and asynchronous execution has profound implications for
distributed system design. Synchronous systems are easier to reason about and can provide
stronger guarantees for consistency and coordination, but they may suffer from performance
bottlenecks and limited scalability. Asynchronous systems, while more adaptable and
scalable, require more complex algorithms to handle uncertainty and ensure correct
operation. Most practical distributed systems adopt an asynchronous execution model,
sometimes incorporating partial synchronization for critical operations.

 Synchronous execution relies on a global clock or timing mechanism, ensuring
coordinated, step-by-step progress across all nodes.
 Asynchronous execution allows nodes to operate independently, with unpredictable
message delivery and process speeds.
 Synchronous models simplify algorithm design by providing predictable event
ordering and easier failure detection.
 Asynchronous models are more flexible and scalable, but introduce challenges in
coordination, consistency, and fault tolerance.
 Most real-world distributed systems are asynchronous, requiring robust
algorithms to handle uncertainty and ensure reliable operation.
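The difference can be simulated in a few lines of Python (a rough sketch; threads model processes, a barrier models the global round boundary, and the delays are arbitrary):

import threading, time, random

N_PROCS, N_ROUNDS = 3, 2
barrier = threading.Barrier(N_PROCS)          # models the lockstep round boundary

def synchronous_process(pid):
    for r in range(N_ROUNDS):
        print(f"sync  P{pid}: round {r}")
        barrier.wait()                        # nobody enters round r+1 until all finish round r

def asynchronous_process(pid):
    for step in range(N_ROUNDS):
        time.sleep(random.uniform(0, 0.05))   # unpredictable speed / message delay
        print(f"async P{pid}: step {step}")   # steps of different processes interleave freely

for target in (synchronous_process, asynchronous_process):
    threads = [threading.Thread(target=target, args=(i,)) for i in range(N_PROCS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The synchronous run prints every process's round 0 before any round 1, while the asynchronous run may produce a different interleaving on each execution.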

Design Issues and Challenges


Designing distributed systems involves addressing several key challenges:
 Heterogeneity: Different hardware, operating systems, and network technologies
must work together.
 Transparency: The system should hide its distributed nature from users and
applications.
 Fault Tolerance: The system must continue to function despite failures of nodes or
network links.
 Scalability: The system should perform well as the number of nodes increases.
 Security: Protect data and resources from unauthorized access and attacks.
 Concurrency: Multiple processes may operate simultaneously, requiring
synchronization to avoid conflicts.
 No Global Clock: Coordinating actions without a shared notion of time is difficult.

Designing distributed systems involves addressing a range of complex issues and challenges
that arise from the very nature of distribution, heterogeneity, and the need for coordination
among independent components. The primary design concerns stem from the fact that
distributed systems are composed of multiple, geographically dispersed nodes that must work
together seamlessly, often under varying network conditions and with differing hardware and
software environments. These factors introduce unique obstacles that are not present in
centralized or tightly coupled systems.
One of the foremost challenges is heterogeneity, where different nodes may run on diverse
hardware architectures, operating systems, and network technologies. Ensuring
interoperability and smooth communication across such varied environments requires careful
standardization and the use of middleware. Another critical issue is transparency—the system
must hide its distributed nature from users and applications, making resource access, location,

replication, and failures invisible whenever possible. Achieving fault tolerance is also
essential, as the system must continue to function correctly even when individual nodes or
network links fail. This requires redundancy, failure detection, and recovery mechanisms to
be built into the system’s design
Scalability is a further challenge, as distributed systems should maintain performance and
reliability as the number of nodes increases. This requires efficient algorithms and protocols
that do not become bottlenecks as the system grows. Security is another major concern:
distributed systems must protect data and resources from unauthorized access and attacks,
which is complicated by the open and networked nature of these environments. Concurrency
poses additional difficulties, as multiple processes may operate simultaneously, necessitating
robust synchronization to prevent conflicts and ensure data consistency. Finally, the absence
of a global clock complicates coordination and event ordering, making it difficult to
synchronize actions and maintain a consistent global state across the system

 Heterogeneity: Integrating diverse hardware, operating systems, and network


technologies to ensure seamless interoperability.
 Transparency: Concealing the complexities of distribution from users and
applications, including access, location, replication, and failure transparency.
 Fault Tolerance: Ensuring the system remains operational in the face of node or
network failures through redundancy and recovery mechanisms.
 Scalability and Security: Maintaining performance as the system grows and
protecting resources from unauthorized access and attacks.
 Concurrency and Lack of Global Clock: Managing simultaneous operations and
coordinating actions without a shared notion of time, which complicates
synchronization and event ordering

Model of Distributed Computations


A distributed program consists of multiple processes running on different nodes, cooperating
to solve a problem. Each process executes its own code and communicates with others using
message-passing primitives. The correctness and efficiency of a distributed program depend
on proper coordination and communication among processes.
A model of distributed computations provides a formal framework for understanding,
designing, and analysing how multiple processes interact and cooperate to solve problems in
a distributed system. In this model, a distributed program is viewed as a collection of
processes, each running on a different node, that communicate and coordinate their actions
solely through message passing. Each process executes its own code independently, but the
correctness and efficiency of the overall computation depend on proper coordination and
communication among all participating processes.
At the core of this model is the concept that each process can perform two types of actions:
local computations and communication events. These actions are represented as a sequence of

events at each node. The order in which events occur both within a single process and across
the entire system is crucial for understanding the system’s behaviour. Since there is no global
clock or shared memory, the model uses the concept of logical clocks (like Lamport clocks or
vector clocks) to capture the causal relationships between events. This allows the system to
reason about the "happened-before" relationships, ensuring that the effects of one event are
properly reflected in subsequent events that depend on it.
The model of distributed computations also considers the underlying communication
network, which can be fully connected, partially connected (such as ring, mesh, or star
topologies), or dynamic (where the network topology changes over time). The properties of
the network, such as latency, reliability, and bandwidth, directly impact how processes
interact and how quickly information can propagate throughout the system. By modelling
distributed computations in this structured way, designers and researchers can analyse
algorithms for correctness, performance, and fault tolerance, and can develop techniques for
capturing global state, achieving consensus, and handling failures.

 A distributed computation consists of multiple processes running on different nodes,


each performing local computations and communicating via message passing.
 Events in distributed computations are modelled as local actions or message
send/receive events, with their order crucial for system correctness.
 Logical clocks (such as Lamport or vector clocks) are used to capture causal
relationships between events in the absence of a global clock.
 The communication network topology (fully connected, partially connected, or
dynamic) influences how processes interact and affects the system’s performance and
reliability.
 This model enables analysis of correctness, coordination, and fault tolerance in
distributed algorithms, forming the foundation for advanced topics like global state
capture, consensus, and recovery.

Model of Distributed Executions


Distributed executions are modelled as a series of events occurring at different nodes. Each
event can be a local computation or a message send/receive action. The order of events is
crucial for understanding system behaviour, especially since there is no global clock. The
concept of “logical clocks” (such as Lamport clocks) is used to capture the causal
relationships between events.
The model of distributed executions provides a formal way to describe and analyse how
events unfold across multiple, independent nodes working together to solve a problem.
Unlike centralized systems, distributed systems lack a global clock and shared memory,
making it challenging to reason about the order and causality of events. The model of
distributed executions addresses these challenges by representing the system’s behaviour as a

series of events occurring at different nodes, each of which may be a local computation or a
communication action such as sending or receiving a message.
Each node in the system executes its own sequence of events independently. These events are
typically classified as either local events (internal computations) or communication events
(message sends and receives). The order in which these events occur, both within a single
node and across the entire system, is crucial for understanding and verifying the correctness
of distributed algorithms. Since there is no global time reference, distributed systems use
concepts like logical clocks (e.g., Lamport clocks or vector clocks) to capture the causal
relationships between events. These logical clocks help define the "happened-before"
relationship, allowing the system to reason about which events could have influenced others,
even in the absence of synchronized physical time.
The model of distributed executions is foundational for analysing distributed algorithms,
especially for ensuring properties like safety, liveness, and consistency. By modelling
executions as partially ordered sets of events, it becomes possible to prove correctness, detect
potential issues such as deadlocks or race conditions, and design mechanisms for capturing
global states (for example, using snapshot algorithms). This model also underpins advanced
topics such as global predicate detection, distributed debugging, and recovery from failures.

 Distributed executions are modelled as sequences of events (local computations and


message send/receive actions) occurring at different nodes.
 There is no global clock or shared memory, so the order and causality of events are
captured using logical clocks and the "happened-before" relationship.
 The model enables reasoning about system behaviour, including correctness, safety,
and liveness properties of distributed algorithms.
 It is essential for designing and analysing algorithms for global state capture,
deadlock detection, and fault tolerance.
 By representing executions as partially ordered sets of events, the model provides a
rigorous foundation for understanding and verifying distributed system behaviour.
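As a simple example of the happened-before relation: if event e1 is process P1 sending a message m and event e2 is process P2 receiving m, then e1 happened-before e2, and any correct logical clock must assign C(e1) < C(e2); a purely local event e3 at a third process that exchanges no messages with P1 or P2 is concurrent with both, which is why a distributed execution forms a partial order rather than a total order.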

Models of Communication Networks


Distributed systems can be built on various network models:
 Fully Connected Network: Every node can directly communicate with every other
node.
 Partially Connected Network: Nodes are connected in a specific topology
(e.g., ring, star, mesh), and messages may need to be forwarded through intermediate
nodes.
 Dynamic Networks: The network topology can change over time due to node
mobility or failures.
 Properties: Networks are characterized by their latency, bandwidth, reliability, and
fault tolerance capabilities.

The communication network is the backbone that connects independent nodes, enabling them
to exchange messages and coordinate their actions to achieve a common goal. The way these
nodes are interconnected and the properties of the network play a crucial role in determining
the system’s efficiency, reliability, and scalability. Different models of communication
networks are used to represent and analyse how nodes interact, how messages are routed, and
how resilient the system is to failures or changes in topology.
There are several fundamental models of communication networks commonly considered in
distributed computing. In a fully connected network, every node has a direct link to every
other node, allowing for straightforward communication but resulting in a large number of
connections as the system grows. Partially connected networks use specific topologies such
as ring, star, mesh, or tree, where not all nodes are directly connected; instead, messages may
be forwarded through intermediate nodes. These topologies balance scalability, cost, and fault
tolerance. Dynamic networks are those in which the topology can change over time, such as
in mobile ad hoc networks or peer-to-peer systems, requiring distributed algorithms to adapt
dynamically to maintain communication and system consistency.
The effectiveness of a communication network in a distributed system is characterized by
several key properties: latency (the time taken for a message to travel from sender to
receiver), bandwidth (the volume of data that can be transmitted in a given period), reliability
(the guarantee of accurate and in-order message delivery), and fault tolerance (the ability to
continue functioning despite node or link failures). The choice of a network model and its
properties directly influence the design of communication protocols, the performance of
distributed applications, and the system’s robustness against failures and dynamic changes.

 Fully connected networks allow direct links between all nodes, offering maximum
redundancy but limited scalability for large systems.
 Partially connected networks use topologies like ring, star, mesh, or tree, balancing
scalability, cost, and fault tolerance.
 Dynamic networks support changing topologies, enabling flexibility and adaptability
but requiring robust, adaptive algorithms.
 Key properties such as latency, bandwidth, reliability, and fault tolerance determine
network effectiveness in distributed systems.
 The choice of network model directly impacts the design, performance, and reliability
of distributed algorithms and applications.

Global State of a Distributed System
The global state is the collective state of all processes and communication channels in the
system at a given instant. Capturing the global state is challenging because there is no global
clock and processes do not share memory. Techniques such as snapshot algorithms are
used to record a consistent global state, which is essential for debugging, checkpointing, and
recovery.
The global state is defined as the collective state of all processes and the communication
channels between them at a specific instant in time. Unlike centralized systems, distributed
systems lack a global clock and shared memory, making it challenging to capture a consistent
snapshot of the entire system’s state. Each process operates independently and may only have
partial knowledge of the overall system, which complicates the task of determining the global
state.
Capturing the global state is crucial for several reasons. It enables system-wide debugging,
checkpointing for fault tolerance, deadlock detection, and recovery from failures. To record a
consistent global state, distributed systems use specialized algorithms called snapshot
algorithms (such as the Chandy-Lamport algorithm). These algorithms allow each process to
record its local state and the state of the communication channels in such a way that the
combination of these local states forms a meaningful and consistent global snapshot. This
snapshot represents a state that could have occurred during a legal execution of the system,
even though the actual recording of local states happens asynchronously.
The main challenge in capturing the global state lies in ensuring consistency—that is, the
snapshot must not reflect impossible or partial system configurations. For example, it should
not show a message as received by one process but not yet sent by another. Snapshot
algorithms address this by coordinating the recording of states and messages, typically using
marker messages to delineate the boundaries of the snapshot. The ability to obtain a
consistent global state is fundamental for reliable distributed computing, as it underpins
system monitoring, error recovery, and the maintenance of data consistency across distributed
processes.
 The global state is the combination of all process states and communication channel
states at a particular instant in a distributed system.
 Capturing the global state is challenging due to the absence of a global clock and
shared memory in distributed environments.
 Snapshot algorithms (e.g., Chandy-Lamport) are used to record a consistent global
state without halting the system.
 A consistent global state is essential for debugging, checkpointing, deadlock
detection, and recovery in distributed systems.

 Ensuring consistency means the snapshot must not show impossible situations (e.g., a
message received but not sent), which is achieved through careful coordination during
the snapshot process.
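Stated formally (a restatement of the condition above), a recorded global state is consistent if, for every message m, receive(m) is included only when send(m) is also included: a message may appear as sent but not yet received (it is then recorded as in transit on its channel), but never as received without having been sent.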

What is a distributed system

A distributed system is one in which components located at networked computers


communicate and coordinate their actions only by passing messages. The components
interact with each other in order to achieve a common goal.

What do you mean by message passing

Message passing is a fundamental mechanism for communication in distributed systems. It


enables processes or nodes to exchange messages and coordinate their actions. There are
several types of message-passing models, including synchronous, asynchronous, and hybrid
approaches.

Define Distributed Program

A computer program that runs within a distributed system is called a distributed program, and
distributed programming is the process of writing such programs.

What do you mean by synchronous and asynchronous execution

Asynchronous is a non-blocking architecture, so the execution of one task isn't dependent on


another. Tasks can run simultaneously. Synchronous is a blocking architecture, so the
execution of each operation depends on completing the one before it. Each task requires an
answer before moving on to the next iteration.

List out the features of distributed systems


 Performance.
 Scalability.
 High availability.
 Data integrity.
 High reliability.
 Security.
 User mobility.

Write down the principles of distributed systems

Distributed file systems are an important part of any organization's data storage and access
needs. The design of the system should be based on the principles of scalability, availability,
reliability, performance, and security

.
State the objectives of resource sharing model

The primary objective of resource sharing is to maximize the overall resource base: the
hardware, software, data, and services of the participating nodes. Each participant benefits
from the resources of the others in addition to its own.

What are the significant consequences of distributed systems

The components of a distributed system interact with one another in order to achieve a
common goal. Three significant challenges of distributed systems are: maintaining
concurrency of components, overcoming the lack of a global clock, and managing the
independent failure of components

What are the challenges of distributed systems

 Heterogeneity
 Openness
 Security
 Scalability
 Failure handling
 Concurrency
 Transparency
 Quality of service

Define Transparency. What are its types

Transparency is defined as the concealment from the user and the application programmer of
the separation of components in a distributed system, so that the system is perceived as a
whole rather than as a collection of independent components.

 Access transparency
 Location transparency
 Concurrency transparency
 Replication transparency
 Failure transparency
 Mobility transparency
 Performance transparency

What is the need of openness in distributed system

Openness: The openness of the distributed system is determined primarily by the degree to
which new resource-sharing services can be made available to the users. Open systems are
characterized by the fact that their key interfaces are published

List any two resources of hardware and software, which can be shared in
distributed systems with examples

Hardware resources that can be shared include printers, plotters, storage space, CD/DVD
drives, and processing power; software and data resources include files, databases, and
application programs. For example, a shared printer receives text and graphics from any
computer on the network and prints them on standard-size paper.

Differentiate between buffering and caching

Buffering is a process of temporarily holding data in memory or a buffer before writing it to a


permanent storage location. Caching is a process of temporarily storing data in memory for
quick access or retrieval. A cache stores a copy of the data; caches exist in the processor and
can also be implemented in RAM or on disk.

Differentiate between synchronous and asynchronous execution

Synchronous code executes one line of code after the other, while asynchronous code allows
multiple lines of code to run at the same time. Asynchronous code can be much more
efficient than synchronous code for certain types of programs, but it
is also more complex and harder to debug.

What is the role of middleware in a distributed system

Middleware is an intermediate layer of software that sits between the application and the
network. It is used in distributed systems to provide common services, such as authentication,
authorization, compilation for best performance on particular architectures, input/output
translation, and error handling.

Name some services and examples of middleware

Common middleware examples include database middleware, application server middleware,


message-oriented middleware, web middleware, and transaction processing monitors.

What is open in distributed system

An Open Distributed System is made up of components that may be obtained from
a number of different sources, which together work as a single distributed system.

Describe what is meant by a scalable system

A system is scalable when it has the capacity to accommodate a greater amount of usage.
Some systems aren't at all scalable, and can only handle exactly the amount of usage they
were designed for. Scalable systems can handle extra usage, but their capacity varies.

What is replication transparency

Replication transparency is the ability to create multiple copies of objects without any effect
of the replication seen by applications that use the objects. It should not be possible for an
application to determine the number of replicas, or to be able to see the identities of specific
replica instances.

Define access transparency

Access Transparency allows the same operations to be used to access local and remote
resources.

Explain how a parallel system differs from a distributed system

Parallel and distributed systems are two important architectures designed to enhance
computational performance by dividing tasks among multiple processing units. While both
aim to achieve concurrency and efficient resource utilization, they differ significantly in their
structure, communication mechanisms, memory organization, scalability, and applications
Parallel System
A parallel system consists of multiple processors within a single computer or tightly coupled
system. These processors work simultaneously to solve a problem by dividing it into smaller
sub-tasks, each executed in parallel. The processors typically share a common memory and
communicate through this shared memory or a high-speed bus.
Key Characteristics:
 Tightly Coupled: Processors are closely linked, often within the same physical
machine.
 Shared Memory: All processors access a common memory space, enabling fast data
sharing.
 Single System Image: The system appears as a single entity to the user.

 Synchronization: Processors are synchronized using a master clock, ensuring
coordinated execution.
 Limited Scalability: The number of processors is restricted by hardware constraints
such as memory bandwidth and bus capacity.
 Fault Tolerance: Limited; failure of one processor may affect the entire system,
though some architectures provide redundancy.
 Typical Applications: Scientific simulations, graphics rendering, real-time data
processing, supercomputers

Distributed System

A distributed system comprises multiple autonomous computers (nodes) connected


via a network. Each node has its own memory and processor. The system works
collaboratively to solve a problem by dividing it into tasks assigned to different nodes.
Communication between nodes occurs through message passing over the network.

Key Characteristics:
 Loosely Coupled: Nodes are independent and may be geographically dispersed.
 Distributed Memory: Each node maintains its own private memory; there is no
global shared memory.
 Multiple System Images: Each node operates its own operating system and may
appear as a separate entity.
 Communication: Nodes communicate by passing messages over LAN, WAN, or the
internet.
 Scalability: Highly scalable; new nodes can be added easily to increase
computational power.
 Fault Tolerance: High; failure of one or more nodes typically does not halt the entire
system, as tasks can be reassigned.
 Synchronization: No global clock; synchronization is achieved through algorithms
and protocols.
 Typical Applications: Web services, cloud computing, distributed databases, large-
scale data analysis, collaborative applications

Illustrative Example
 Parallel System Example: A supercomputer performing weather simulations uses
hundreds of processors to compute different parts of the simulation simultaneously,
all accessing a common memory
 Distributed System Example: The SETI@home project divides radio signal data
into chunks and sends them to volunteers' computers worldwide. Each computer
analyses its chunk independently and sends results back to a central server

Key Differences Explained

1. Architecture:
 Parallel systems use multiple processors within a single machine, sharing
memory and resources.
 Distributed systems use multiple independent computers, each with its own
resources, connected via a network.
2. Communication:
 Parallel systems communicate via shared memory or high-speed buses.
 Distributed systems communicate via message passing over networks
3. Scalability:
 Parallel systems are less scalable due to hardware limits.
 Distributed systems are highly scalable; nodes can be added as needed
4. Fault Tolerance:
 Parallel systems have limited fault tolerance; failure may affect the whole
system.
 Distributed systems are more fault-tolerant; failure of one node does not halt
the system
5. Synchronization:
 Parallel systems use a master clock for synchronization.
 Distributed systems require complex synchronization algorithms due to lack of
a global clock

While both parallel and distributed systems aim to improve computational efficiency by
executing tasks concurrently, they differ fundamentally in architecture, memory organization,
communication, scalability, and fault tolerance. Parallel systems are best suited for tasks
requiring high-speed computation within a single system, whereas distributed systems excel
in scalability and reliability, making them ideal for large-scale, geographically distributed
applications

Illustrate the difference between message passing and
shared memory process communication model

Inter-process communication (IPC) is crucial in distributed and parallel systems, enabling


processes to coordinate and share data. The two primary IPC models are Message
Passing and Shared Memory. Each has distinct mechanisms, advantages, and use cases

Shared Memory Communication Model


Definition:
In the shared memory model, multiple processes communicate by reading and writing data to
a common memory region accessible to all involved processes
Key Points:
 Implicit Communication: Data exchange is done by directly accessing shared
variables.
 Synchronization: Mechanisms like semaphores, locks, and monitors are used to
ensure consistency and avoid race conditions

 Speed: Generally faster for local communication since no kernel or network
intervention is needed
 Typical Use: Multi-core systems, symmetric multiprocessing (SMP).

Equation
If process P1 writes value x to shared variable S, and process P2 reads from S:
S ← x (by P1)
y ← S (by P2)

Here, y = x after both operations, assuming proper synchronization.

Message Passing Communication Model

Definition:
In the message passing model, processes communicate by explicitly sending and receiving
messages, typically through the operating system or network
Key Points:
 Explicit Communication: Data is packed (marshalled) into messages, sent by one
process, and unpacked (unmarshalled) by the receiver
 Synchronization: Achieved via send and receive primitives; processes may block
until communication completes.
 Flexibility: Suitable for distributed systems where processes may reside on different
machines
 Typical Use: Distributed systems, networked clusters.

Equation
If process P1 sends a message m to process P2:
send (P2, m) (by P1)
receive (P1, m) (by P2)

Illustrative Example
Shared Memory Example
 Two processes increment a shared counter:
 counter←counter+1
 Requires a lock to avoid race conditions.
Message Passing Example
 Process A sends a data packet to Process B:
 A: send(B,"Hello")
 B: receive(A,msg)
Shared memory is efficient for tightly coupled systems but requires careful synchronization.
Message passing is flexible and essential for distributed systems, enabling communication
across heterogeneous platforms with explicit control over data exchange
Both models are foundational in distributed computing, and the choice depends on system
architecture and application requirements.
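The marshalling and unmarshalling steps mentioned above can be sketched as follows (illustrative Python; the field names and the choice of JSON encoding are assumptions, not a prescribed format):

import json

def marshal(payload: dict) -> bytes:
    """Pack structured data into a byte sequence suitable for send()."""
    return json.dumps(payload).encode("utf-8")

def unmarshal(raw: bytes) -> dict:
    """Unpack a received byte sequence back into structured data."""
    return json.loads(raw.decode("utf-8"))

# Process A side:
message = marshal({"op": "increment", "amount": 1})   # then: send(B, message)
# Process B side, after receive(A, message):
request = unmarshal(message)
print(request["op"], request["amount"])                # -> increment 1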

Discuss the design issues and challenges in distributed systems from a system perspective.

Distributed systems consist of multiple autonomous computers that communicate and


coordinate their actions by exchanging messages over a network. While these systems offer
scalability, reliability, and resource sharing, their design introduces several complex
challenges that must be addressed to ensure correct and efficient operation

Key Design Issues and Challenges
Communication
 Challenge: Ensuring reliable, efficient, and secure communication among processes
across a network.
 Issues: Network failures, message loss, variable latency, and protocol heterogeneity.
 Example Mechanisms: Remote Procedure Call (RPC), message-oriented
middleware, stream-oriented communication
 If process P1 sends message m to P2:
 send(P2, m) → Network → receive(P1, m)

Process Management
Challenge: Managing processes and threads across distributed nodes.
Issues: Process creation, migration, termination, and load balancing.
Example: Dynamic allocation of tasks to nodes for optimal resource utilization

Naming and Resource Location


Challenge: Assigning unique, robust names to resources and processes for easy identification
and access.
Issues: Scalability, mobility, and dynamic changes in network topology
[Client] --(name lookup) --> [Naming Service] --(address)--> [Resource]

Synchronization
Challenge: Coordinating actions of distributed processes to maintain consistency.
Issues: Lack of global clock, race conditions, mutual exclusion, leader election, and deadlock
detection.
Equation (Lamport Clock)
a → b ⟹ C(a) < C(b)

Where C(x) is the logical clock value of event x.

Data Storage and Consistency


Challenge: Ensuring data remains consistent across distributed nodes despite concurrent
updates and failures.
Issues: Eventual consistency, replication, conflict resolution, CAP theorem trade-offs
(Consistency, Availability, Partition tolerance)

[Figure: data replicated and kept consistent across NODE_1, NODE_2, and NODE_3]

Scalability
Challenge: Handling growth in users, data, and workload without performance degradation.
Approaches: Horizontal scaling (adding nodes), vertical scaling (upgrading node resources).

Fault Tolerance and Reliability
Challenge: Ensuring system continues to function despite node or network failures.
Issues: Redundancy, replication, consensus algorithms (Paxos, Raft), failure detection.

Security
Challenge: Protecting data and communication from unauthorized access and attacks.
Issues: Authentication, authorization, encryption, secure protocols.

Openness and Heterogeneity


Challenge: Supporting interoperability among diverse hardware, operating systems, and
network protocols.

Issues: Middleware, standardization, API compatibility.

Observability and Monitoring


Challenge: Detecting and diagnosing faults, performance bottlenecks, and misconfigurations.
Issues: Logging, distributed tracing, health checks

Challenge            | Description                          | Example/Equation/Diagram
Communication        | Reliable message passing             | send(P2, m)
Process Management   | Process creation, migration          | Load balancing diagram
Naming               | Unique resource identification       | Naming service diagram
Synchronization      | Coordinating distributed actions     | Lamport clocks
Data Consistency     | Replication, conflict resolution     | CAP theorem
Scalability          | Handling growth                      | Horizontal scaling diagram
Fault Tolerance      | Surviving failures                   | Availability equation
Security             | Protecting data and communication    | Encryption protocols
Openness             | Supporting heterogeneity             | Middleware diagram
Observability        | Monitoring and diagnostics           | Tracing/logging tools

Designing distributed systems involves addressing multifaceted challenges, including


communication, synchronization, consistency, scalability, fault tolerance, and security. Each
challenge requires careful consideration of trade-offs and the use of specialized algorithms,
protocols, and design patterns to build robust, efficient, and reliable distributed applications

Explain about Chandy-Lamport snapshot algorithms for FIFO channels

Each distributed system has a number of processes running on a number of different physical
servers. These processes communicate with each other over communication channels by
passing messages. These processes have neither shared memory nor a common physical clock,
which makes determining the instantaneous global state difficult.
A process could record its own local state at a given time, but messages that are in transit
(on their way to being delivered) would not be included in the recorded state; hence, the
recorded state of the system would be incorrect once the in-transit messages are delivered.
Chandy and Lamport were the first to propose an algorithm to capture a consistent global state
of a distributed system. The main idea behind the proposed algorithm is that if we know that all
messages that have been sent by one process have been received by another, then we can
record the global state of the system.
Any process in the distributed system can initiate this global state recording algorithm using a
special message called a MARKER. The marker traverses the distributed system across all
communication channels and causes each process to record its own state. In the end, the state of
the entire system (the global state) is recorded. The algorithm does not interfere with the normal
execution of processes.
Assumptions of the algorithm:
 There are a finite number of processes in the distributed system, and they do not share
memory or clocks.
 There are a finite number of communication channels; they are unidirectional and
FIFO ordered.
 There exists a communication path between any two processes in the system.
 On a channel, messages are received in the same order as they are sent.
Algorithm:
Marker sending rule for a process P:
 Process P records its own local state.
 For each outgoing channel C from process P, P sends a marker along C before sending
any other messages along C.

Marker receiving rule for a process Q:


 If process Q has not yet recorded its own local state, then
 Q records its own local state and records the state of the incoming channel C1 as an
empty sequence (null).
 After recording its state and the state of channel C1, process Q follows the
marker sending rule.
 If process Q has already recorded its state
 Record the state of incoming channel C1 as the sequence of messages received
along channel C1 after the state of Q was recorded and before Q received the
marker along C1 from process P.
Need for taking a snapshot (recording the global state of the system):
 Checkpointing: It helps in creating checkpoints. If the application fails, the latest
checkpoint can be reused for recovery.
 Garbage collection: It can be used to remove objects that no longer have any references.
 It can be used in deadlock and termination detection.
 It is also helpful in debugging.

A Global Snapshot or a global state consists of local states of each process in the distributed
system along with the in-transit messages on communication channels. The snapshot problem
in a distributed system tries to determine the “current” snapshot of the global
state without stopping the system. A Global snapshot is important, in many scenarios such as:

 determine safety points for a distributed database


 deadlock detection
 termination of an algorithm
 garbage collection of objects
 to find out the current load of a distributed system
Two major problems in determining a global snapshot are:
1. One cannot catch all processes at the same time
2. Messages that are on the way cannot be seen

The classical algorithm that is used to determine a global snapshot in a distributed system is
the Chandy-Lamport Global Snapshot Algorithm, 1985.
The assumptions of the algorithm are as follows:
 Neither channels nor processes fail, communication is reliable
 Channels are unidirectional and provide FIFO-ordered delivery
 There is a communication path between any two processes in the system
 Any process may initiate the snapshot algorithm
 The snapshot algorithm does not interfere with the normal execution of the processes
 Each process in the system records its local state and the state of its incoming
channels
 Global state is collected in a distributed manner

The snapshot algorithm works using marker messages. The marker message is a special
control message and not an application message. It does not interfere with application
messages. The algorithm has three phases: Initiation, Propagation and Termination and works
as follows:
1. Initiating a Snapshot
 Process Pi initiates the snapshot.
 The initiator, Pi records its own state and creates the marker message.
 Pi then forwards the marker message to all other processes using the N-1
outbound channels Cij (j = 1 to N & j != i).
 Pi starts recording all incoming messages from channels Cji (j = 1 to N & j != i).
2. Propagating a Snapshot
If a process Pj receives a Marker message on an incoming channel Ckj, and if this is the first
Marker that Pj is seeing:
 Pj records its own state and marks Ckj as empty.
 Pj then forwards the marker message to all other processes using the N-1 outbound
channels.
 Pj starts recording all incoming messages from channels Clj for l not equal to j or k.
If the process Pj has already seen a Marker message:
 Mark the state of channel Ckj as all the messages that have arrived on it since the
recording was turned on for Ckj.

3. Terminating a Snapshot
The algorithm terminates when:
 All processes have received a marker (which implies that process state has been
recorded)
 All processes have received a marker on all the N-1 incoming channels (which
implies that state of all channels has been recorded)
 Later, if needed, the central server collects all these partial states to build the global
snapshot.
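The rules above can be condensed into a small, single-machine simulation (a sketch in Python under the stated FIFO assumptions; the Process class, the bank-balance state, and the delivery loop are all invented for illustration):

from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.balance = 100                      # example application state
        self.recorded_state = None              # local state captured in the snapshot
        self.channel_state = {}                 # incoming channel -> messages recorded in transit
        self.recording = {}                     # incoming channel -> still recording?

    def initiate_snapshot(self, net):
        self._record_and_send_markers(net)

    def _record_and_send_markers(self, net):
        self.recorded_state = self.balance                   # record own local state first
        for q in self.peers:
            net[(self.pid, q)].append((self.pid, MARKER))    # marker before any later message
        for q in self.peers:                                 # start recording every incoming channel
            self.recording[(q, self.pid)] = True
            self.channel_state[(q, self.pid)] = []

    def deliver(self, src, msg, net):
        chan = (src, self.pid)
        if msg == MARKER:
            if self.recorded_state is None:                  # first marker seen:
                self._record_and_send_markers(net)           #   record state, forward markers,
                self.channel_state[chan] = []                #   and record this channel as empty
            self.recording[chan] = False                     # a marker closes recording on its channel
        else:
            self.balance += msg                              # ordinary application message
            if self.recording.get(chan):
                self.channel_state[chan].append(msg)         # captured as an in-transit message

# Two processes connected by two unidirectional FIFO channels.
net = {("P1", "P2"): deque(), ("P2", "P1"): deque()}
p1, p2 = Process("P1", ["P2"]), Process("P2", ["P1"])
procs = {"P1": p1, "P2": p2}

p1.balance -= 10
net[("P1", "P2")].append(("P1", 10))     # P1 has sent 10 to P2; it is still in transit
p1.initiate_snapshot(net)                # P1 initiates the snapshot

while any(net.values()):                 # FIFO delivery of every pending message
    for (src, dst), q in list(net.items()):
        if q:
            s, m = q.popleft()
            procs[dst].deliver(s, m, net)

print(p1.recorded_state, p2.recorded_state, p2.channel_state)
# -> 90 110 {('P1', 'P2'): []} : the recorded balances plus in-transit messages total 200,
#    matching the initial global total, so the snapshot is consistent.

In this run the recorded state of channel (P1, P2) is empty because, with FIFO delivery, the application message 10 reaches P2 before the marker and is therefore reflected in P2's recorded balance rather than as an in-transit message.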
Consistent Cut

A cut in a space-time diagram is a line joining an arbitrary point on each process line that
slices the space‐time diagram into a PAST and a FUTURE. A consistent global state
corresponds to a cut in which every message received in the PAST of the cut was sent in the
PAST of that cut. Such a cut is known as a consistent cut. A consistent cut obeys causality
and any run of the Chandy-Lamport Global Snapshot algorithm creates a consistent cut.

UNIT II – LOGICAL TIME AND GLOBAL STATE

Logical Time: Physical Clock Synchronization: NTP – A Framework for a System of Logical
Clocks – Scalar Time – Vector Time; Message Ordering and Group Communication:
Message Ordering Paradigms – Asynchronous Execution with Synchronous Communication
– Synchronous Program Order on Asynchronous System – Group Communication – Causal
Order – Total Order; Global State and Snapshot Recording Algorithms: Introduction –
System Model and Definitions – Snapshot Algorithms for FIFO Channels.

Course Outcome
CO2: Solve synchronization and state consistency problems (K3)

Course Objective

To illustrate the issues of synchronization and collection of information in distributed systems

Important Topics

1. Explain about Message ordering paradigms


2. Explain the types of Group communication used in distributed system
3. Elucidate on the total and causal order in distributed system with a neat diagram
4. Explain about Chandy Lamport Snapshot algorithms for FIFO channels

UNIT II – LOGICAL TIME AND GLOBAL STATE

NTP – A Framework for a System of Logical Clocks


In distributed systems, physical clocks on different machines can drift apart, leading to
inconsistencies. The Network Time Protocol (NTP) is used to synchronize clocks across
systems using UTC as a reference. However, due to inherent limitations, distributed systems
often rely on logical clocks instead of physical clocks to maintain event ordering. A
framework for logical clocks defines a time domain T and a mapping function C that
assigns timestamps to events, ensuring that causal relationships are preserved (if event a
causally affects event b, then C(a) < C(b)).
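
For reference, NTP estimates a client's clock offset and the round-trip delay from the four timestamps of one request/response exchange (client send, server receive, server send, client receive). The sketch below shows the standard formulas; the function name and example values are illustrative.

# Simplified NTP-style offset and delay estimation from one request/response exchange.
# t1: client send time, t2: server receive time, t3: server send time, t4: client receive time
def ntp_offset_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # estimated offset of the client clock from the server
    delay = (t4 - t1) - (t3 - t2)            # estimated round-trip network delay
    return offset, delay

# Example: the client's clock is roughly 0.05 s behind the server, with about 0.02 s of delay.
print(ntp_offset_delay(10.000, 10.060, 10.061, 10.021))   # approximately (0.05, 0.02)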

1. Need for Logical Clocks in Distributed Systems
Distributed systems consist of multiple independent computers (nodes) that communicate and
coordinate their actions by passing messages over a network. Unlike centralized systems,
distributed systems lack a single, global clock to synchronize actions across all nodes. This
absence creates significant challenges:
 Event Ordering: Determining the sequence in which events occur across different
nodes.
 Causality: Understanding which events may have influenced others.
 Synchronization: Coordinating actions such as resource sharing, mutual exclusion
To address these issues, distributed systems rely on logical clocks—mechanisms that assign
timestamps to events in a way that reflects their causal relationships, even if actual physical
time is not known.
2. Overview of the Framework
The heading "NTP – A Framework for a System of Logical Clocks" combines two related ideas:
NTP, which synchronizes physical clocks across machines, and a general framework for a
system of logical clocks, which orders events without relying on physical time. The logical
clock framework is designed to:
 Provide a systematic method to assign logical timestamps to events.
 Ensure that the causal relationships between events are preserved.
 Enable processes to determine the partial or total order of events in the absence of a
global clock.
This framework is foundational for algorithms that implement logical clocks, such as
Lamport’s scalar clocks and vector clocks.

3. Key Components of the Framework


a. Scalar Time (Lamport Clocks)
 Each process maintains a local counter (logical clock).
 The counter is incremented for each local event.
 When a process sends a message, it includes its current clock value.
 Upon receiving a message, the recipient updates its clock to be greater than both its
own and the received value, then increments it.
 Guarantees: If event A causally precedes event B, then the timestamp of A is less than
that of B
b. Vector Clocks

 Each process maintains a vector of counters, one for each process in the system.
 Provides more detailed information about causality than scalar clocks.
 Allows detection of concurrent events (events that are not causally related).
c. Paradigms and Execution Models
 Asynchronous Execution: Nodes operate independently; message
delays are unpredictable.
 Synchronous Communication: Some systems may impose bounds on message
delivery times, simplifying ordering but not eliminating the need for logical clocks
d. Orderings
 Causal Order: Ensures that messages or events that are causally related are processed
in the correct order.
 Total Order: All events are ordered, even those that are not causally related, which is
important for certain applications like distributed databases.
4. Applications of Logical Clocks in Distributed Systems
Logical clocks are used in various distributed system algorithms and protocols, including:
 Mutual Exclusion Algorithms: Ensuring only one process accesses a resource at a
time (e.g., Lamport’s and Ricart-Agrawala’s algorithms).
 Snapshot Algorithms: Capturing a consistent global state of the system for
checkpointing and recovery.
 Deadlock Detection: Identifying cycles of resource waiting among processes.
 Group Communication: Maintaining order in message delivery for replicated services

5. Importance and Challenges


 No Global Clock: Logical clocks provide a way to reason about time and order in
systems where a global clock is impossible due to network delays and independent
process execution.
 Concurrency: They help manage and understand concurrent operations, which is
critical for correctness.
 Fault Tolerance: Logical clocks are essential for recovery and debugging, as they
allow reconstruction of event histories

Scalar Time

Scalar time, introduced by Lamport, provides a simple mechanism for ordering events in
distributed systems. Each process maintains a counter (Lamport clock), incrementing it for
every event. When sending a message, the current counter value is included. On receiving a
message, the process updates its counter to the maximum of its own and the received value,
then increments it. This maintains a consistent causal order but cannot capture concurrent
events
Scalar Time in distributed computing refers to a logical clock mechanism that assigns a
single, monotonically increasing integer (scalar value) to each event occurring within a
distributed system. This approach was introduced by Leslie Lamport and is commonly known
as the Lamport Clock. In environments where there is no global physical clock, scalar time
provides a way to establish a partial ordering of events across different nodes. Each process
in the system maintains its own local counter, which is incremented for every local event.
When a process sends a message, it includes its current scalar time value; upon receiving a
message, the recipient process updates its counter to be greater than both its own and the
received value, then increments it. This ensures that causally related events are assigned
timestamps that reflect their order of occurrence.
The primary purpose of scalar time is to capture the causal relationships between events in a
distributed system. If event A causally precedes event B, then the scalar timestamp of A will
be less than that of B. However, scalar time does not provide information about events that
are concurrent. As a result, while scalar time can establish a partial order of events, it cannot
distinguish between concurrent events, which may receive arbitrary relative timestamps
depending on the order in which messages are processed.
Scalar time plays a foundational role in the design and analysis of distributed algorithms,
particularly those that require coordination, mutual exclusion, or consistent snapshots of the
system state. By providing a mechanism to order events without relying on synchronized
physical clocks, scalar time enables distributed systems to operate correctly and efficiently,
even in the presence of network delays and asynchronous execution. Its simplicity and
effectiveness have made it a standard tool for reasoning about event ordering and causality in
distributed computing

Paradigms
Paradigms in distributed computing refer to the fundamental models and approaches that
define how distributed systems are structured, how their components interact, and how
communication and coordination are achieved. The choice of paradigm shapes the design,
implementation, and performance of a distributed system. Two primary paradigms dominate
the field: message-passing systems and shared memory systems. In message-passing systems,
independent nodes communicate by explicitly sending and receiving messages over a
network, with each node maintaining its own local memory. In contrast, shared memory
systems allow multiple processors to access a common memory space, enabling
communication through shared variables—this is more common in tightly coupled
environments or within a single physical machine

Distributed systems also rely on a set of communication primitives that form the basis for
higher-level protocols and synchronization mechanisms. These primitives include operations
such as send, receive, broadcast, and multicast, which enable nodes to exchange information
and coordinate actions. The execution of distributed programs can follow either synchronous
or asynchronous paradigms. In synchronous execution, nodes operate in lockstep, often
coordinated by a global clock, and message delivery occurs within predictable time bounds.
Asynchronous execution is more flexible and realistic for large-scale systems, as nodes
operate independently and message delivery times are unpredictable, but this introduces
additional challenges in coordination and consistency
The design and operation of distributed systems are further influenced by the underlying
models of communication networks (such as fully connected, partially connected, or dynamic
networks) and by the need to capture the global state of the system, which is the collective
state of all processes and communication channels at a given instant. Capturing a consistent
global state is challenging due to the lack of a global clock and the absence of shared
memory, necessitating techniques like snapshot algorithms for debugging, checkpointing, and
recovery

 Message-passing and shared memory are the two core paradigms, each suited to
different system architectures and requirements
 Communication primitives (send, receive, broadcast, multicast) are foundational for
building distributed protocols and ensuring process coordination
 Synchronous execution offers predictability but is less flexible, while asynchronous
execution is more scalable and realistic but harder to coordinate
 Network models (fully/partially connected, dynamic) directly impact communication
strategies and system robustness
 Capturing global state is essential for system reliability and recovery, but is
complicated by the distributed nature and lack of a global clock

Asynchronous Execution with Synchronous Communication


In an asynchronous system, processes execute at arbitrary speeds, and message delays are
unpredictable. However, synchronous communication can be emulated in such systems using
protocols that ensure messages are delivered in a predictable order, often by introducing
rounds or synchronizers
In distributed computing, asynchronous execution with synchronous communication
describes a hybrid paradigm where processes or nodes operate independently and without a
shared global clock (asynchronous execution), but the communication between them follows
a synchronous protocol. In such systems, each process executes its tasks at its own pace, and
there is no guarantee about the relative speeds of processes or the timing of their actions. This
reflects the reality of distributed environments, where network delays, process scheduling,
and hardware heterogeneity make global synchronization impractical. However, when these
processes need to exchange information, they do so using synchronous communication
primitives, meaning that both the sender and receiver coordinate their actions to ensure a
message is successfully delivered and received, often requiring both to be ready for the
communication to proceed.
This paradigm offers a balance between the flexibility and realism of asynchronous execution
and the reliability of synchronous communication. Asynchronous execution allows
distributed systems to scale and adapt to unpredictable workloads and network conditions, as
each node can proceed independently. At the same time, synchronous communication ensures
that critical interactions—such as coordination, agreement, or data consistency—occur in a
controlled manner. For example, in a synchronous message-passing model, a process sending
a message may be blocked until the recipient is ready to receive it, guaranteeing that the
message is not lost and that both parties are synchronized at the moment of exchange. This is
particularly useful in scenarios that require strong consistency or coordinated actions, such as
distributed transactions or leader election.
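
As a small illustration of this blocking behaviour, the sketch below emulates a rendezvous-style send between two otherwise independent threads using Python queues; the explicit acknowledgement channel is an assumption made here to model the requirement that both parties complete the exchange together.

# Synchronous (blocking) send between asynchronous threads, emulated with an acknowledgement.
import queue
import threading

data_channel = queue.Queue()   # carries application messages
ack_channel = queue.Queue()    # carries delivery acknowledgements

def sync_send(msg):
    # Send a message and block until the receiver has taken delivery.
    data_channel.put(msg)
    ack_channel.get()          # blocks until the receiver acknowledges

def receiver():
    msg = data_channel.get()   # the receiver proceeds at its own pace
    print("received:", msg)
    ack_channel.put("ack")     # the acknowledgement unblocks the sender

t = threading.Thread(target=receiver)
t.start()
sync_send("hello")             # returns only after the receiver has the message
t.join()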
Despite its advantages, this hybrid approach also introduces challenges. The blocking nature
of synchronous communication can reduce system concurrency and introduce bottlenecks,
especially if some processes are slower or become unavailable. Careful system design is
required to avoid deadlocks and ensure that the benefits of asynchronous execution are not
negated by excessive synchronization overhead. Nevertheless, asynchronous execution with
synchronous communication remains a powerful model, enabling distributed systems to
combine the best aspects of both paradigms for robust, scalable, and reliable operation.

 Asynchronous execution allows processes to operate independently, without relying


on a global clock or synchronized steps.
 Synchronous communication requires both sender and receiver to be ready for
message exchange, ensuring reliable delivery and coordination.
 This hybrid model balances flexibility and scalability (from asynchrony) with
reliability and consistency (from synchrony).
 It is particularly useful in scenarios where independent progress is desired but certain
operations require strict coordination or agreement.
 The main challenges include potential blocking, reduced concurrency, and the risk of
deadlocks if communication is not carefully managed.

Synchronous Program
A synchronous program assumes a tightly controlled environment where all processes
proceed in lockstep, and message delivery times are bounded and predictable. This simplifies
reasoning about program behaviour but is less realistic for large-scale distributed systems

A Synchronous Program in distributed computing is characterized by the coordinated
execution of processes, where all participating nodes or processes operate in lockstep, often
under the control of a global clock or a well-defined timing mechanism. In this model, the
execution of each process is divided into rounds or steps, and every process performs its
actions for a given round before moving on to the next. Communication between processes is
also synchronized—messages sent in one round are guaranteed to be delivered and processed
before the start of the next round. This strict timing discipline simplifies reasoning about the
system’s behaviour and makes it easier to design protocols that require strong guarantees
about message delivery and event ordering
The synchronous programming paradigm is particularly advantageous for developing
distributed algorithms that require deterministic behaviour, such as consensus protocols,
coordinated checkpointing, and certain types of fault-tolerant systems. Since all processes
progress together, it becomes straightforward to detect failures (e.g., if a process does not
respond within a round, it is considered faulty), and to ensure that all nodes have a consistent
view of the system state at each step. Synchronous programs are less susceptible to the
uncertainties and complexities introduced by variable message delays and process speeds,
which are common in asynchronous systems
However, designing and implementing synchronous programs in real-world distributed
environments can be challenging. Achieving perfect synchronization across geographically
dispersed nodes requires reliable and low-latency communication networks, which may not
always be feasible. Additionally, the need for all processes to wait for the slowest participant
in each round can limit system scalability and performance. Despite these challenges,
synchronous programs remain fundamental in distributed computing theory and are often
used as a baseline for analysing and comparing the correctness and efficiency of more
complex, asynchronous algorithms
 All processes advance in coordinated rounds, typically governed by a global clock or
timing mechanism.
 Message delivery and processing are guaranteed to occur within fixed time bounds,
simplifying protocol design.
 Synchronous programs are well-suited for consensus, coordinated checkpointing, and
fault detection.
 They provide strong guarantees about consistency and event ordering across the
system.
 Scalability and performance can be limited by the need to synchronize all processes,
especially in heterogeneous or unreliable networks.

Order on Asynchronous System


Ordering events in an asynchronous system is challenging due to the lack of global time.
Logical clocks (scalar, vector, matrix) are used to infer causal relationships. Lamport clocks
provide a total order, while vector clocks capture causality more precisely, distinguishing
concurrent events

In distributed computing, asynchronous systems are environments where there is no global
clock, and processes operate independently without any guarantee about the relative speed of
execution or message delivery times. This lack of synchronization introduces a fundamental
challenge: establishing a meaningful order of events across different nodes. Unlike
synchronous systems, where a global clock or synchronized rounds can be used to sequence
events, asynchronous systems must rely on alternative mechanisms to determine the order in
which events occur and to reason about causality.
To address this, asynchronous systems use concepts such as logical clocks (e.g., Lamport
clocks and vector clocks) to assign timestamps to events. These logical clocks help capture
the causal relationships between events, allowing the system to establish a partial order. In
this context, if an event A causally influences event B (for example, A is the sending of a
message and B is the receipt), then the logical timestamp of A will be less than that of B.
However, for events that are independent and occur concurrently on different nodes, there is
no inherent way to determine their order—these events are simply considered concurrent.
This partial ordering is crucial for ensuring consistency, detecting causality, and
implementing protocols such as mutual exclusion, checkpointing, and deadlock detection.
The absence of a global order in asynchronous systems means that total ordering—where
every event is assigned a unique position in a single, global sequence—is generally not
possible without additional coordination mechanisms. Some distributed algorithms introduce
extra communication or consensus protocols to achieve total order, when necessary, such as
in group communication or distributed databases. However, these solutions can be costly in
terms of performance and scalability. As a result, understanding and managing the order of
events in asynchronous systems is a central concern in distributed computing, impacting the
design and correctness of nearly all distributed algorithms.

 No Global Clock: Asynchronous systems lack a shared notion of time, making it


difficult to establish a global order of events.
 Partial Order via Logical Clocks: Logical clocks (Lamport, vector clocks) are used
to capture causal relationships and establish a partial order among events.
 Concurrency: Events that are independent and occur on different nodes may be
concurrent, with no defined order between them.
 Total Order Requires Extra Coordination: Achieving a total order of events often
needs additional protocols, such as consensus algorithms, which can impact system
performance.
 Impact on Protocol Design: The challenge of ordering events in asynchronous
systems directly influences the design of distributed algorithms for consistency,
coordination, and fault tolerance.

Group Communication

Group communication involves sending messages to multiple processes as a group. This
requires mechanisms to ensure consistency in message delivery, such as reliable multicast
and group membership protocols. It is essential for applications like replicated databases and
collaborative tools
Group Communication in distributed systems refers to the mechanisms and protocols that
enable a set of processes or nodes to communicate and coordinate as a collective unit, rather
than as isolated pairs. Unlike basic point-to-point communication, group communication
allows a message to be delivered simultaneously to multiple recipients, supporting operations
such as broadcast (sending to all nodes) and multicast (sending to a selected group of
nodes). This capability is fundamental for building reliable, scalable, and coordinated
distributed applications, such as replicated databases, collaborative tools, and distributed
control systems.
The primary goal of group communication is to ensure that all members of a group receive
messages in a consistent and predictable manner, even in the presence of failures or network
partitions. This often involves enforcing specific ordering guarantees, such as causal order
(messages are delivered respecting causal dependencies) or total order (all messages are
delivered to all group members in the same sequence). These guarantees are essential for
maintaining consistency among distributed replicas and for coordinating actions like leader
election, distributed transactions, and consensus protocols. Group communication systems
also address challenges such as dynamic group membership, fault tolerance, and efficient
message dissemination.
Implementing group communication requires careful consideration of the underlying network
model, the desired reliability and ordering properties, and the potential for node or network
failures. Protocols supporting group communication must handle message loss, duplication,
and delays, while providing mechanisms for processes to join or leave groups dynamically.
These protocols form the backbone of many higher-level distributed algorithms and services,
enabling robust coordination and collaboration across distributed environments.

 Enables broadcast and multicast, allowing messages to be sent to all or a subset of


nodes simultaneously.
 Supports ordering guarantees to maintain consistency and coordination among group
members.
 Essential for replicated systems, distributed consensus, and collaborative applications.
 Handles dynamic group membership, allowing nodes to join or leave groups as
needed.
 Addresses fault tolerance and efficient message dissemination, ensuring reliable
communication even in the presence of failures.

Causal Order

Causal order ensures that messages are delivered respecting the causal relationships among
events. If message A causally precedes message B, then all processes must see A before B.
Vector clocks are typically used to implement causal ordering in distributed systems
In distributed systems, causal order refers to the principle that messages and events are
delivered and processed in a manner that preserves their cause-and-effect relationships.
Unlike a simple chronological or total order, causal order specifically ensures that if one
event causally influences another—such as a process sending a message and another process
receiving and acting upon it—then all processes in the system observe these events in the
same causal sequence. This concept is crucial because, in asynchronous distributed systems,
there is no global clock, and events may occur concurrently or be delayed due to network
variability. As a result, maintaining causal order is fundamental for ensuring consistency,
correctness, and predictability in distributed applications.
Causal order is typically enforced using logical clock mechanisms, such as vector clocks.
These clocks assign timestamps to events in a way that captures the "happened-before"
relationship defined by Leslie Lamport. For instance, if event A happens before event B (A
→ B), then the system must guarantee that all nodes observe A before B. This is especially
important in collaborative systems, replicated databases, and distributed debugging, where
the order of operations can affect the final state or outcome. By preserving causal order,
distributed systems can avoid anomalies such as delivering a reply before the original request
or applying updates in an inconsistent sequence.
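
The vector clock mechanism mentioned above can be sketched as follows. This is an illustrative Python outline (the class and function names are assumptions, not a standard API) showing the tick and merge rules and how causal precedence and concurrency can be tested from the timestamps.

# Minimal vector clock sketch for a system of n processes.
class VectorClock:
    def __init__(self, n, pid):
        self.v = [0] * n     # one counter per process
        self.pid = pid       # index of the owning process

    def tick(self):
        # Increment own entry for a local or send event; return a copy to piggyback on a message.
        self.v[self.pid] += 1
        return list(self.v)

    def merge(self, received):
        # On receipt: component-wise maximum with the received timestamp, then increment own entry.
        self.v = [max(a, b) for a, b in zip(self.v, received)]
        self.v[self.pid] += 1

def happened_before(a, b):
    # True if timestamp a causally precedes timestamp b.
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)

# Example: two independent events on different processes are concurrent.
print(concurrent([1, 0, 0], [0, 1, 0]))        # True
print(happened_before([1, 0, 0], [2, 1, 0]))   # True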
Implementing causal order adds complexity, as it requires processes to track dependencies
and sometimes delay the delivery of messages until all causally prior events have been
observed. Despite this, causal order is less restrictive than total order and is often more
efficient for applications where only causal consistency is required. It strikes a balance
between the flexibility of asynchronous execution and the need for meaningful coordination,
making it a foundational concept in the design of robust distributed algorithms and protocols.

1. Preserves cause-and-effect relationships between events, ensuring that causally


related messages are delivered in the correct order.
2. Enforced using logical clocks (especially vector clocks), which track dependencies
and the "happened-before" relationship among events.
3. Essential for consistency in collaborative applications, replicated databases, and
distributed debugging, where the order of operations matters.
4. Less restrictive than total order, allowing greater concurrency and efficiency while
still preventing causality violations.
5. Requires careful message tracking and delivery management, as messages may need
to be delayed until all causally prior events are received.

Total Order Global State
Total order requires that all messages are delivered to all processes in the same order,
regardless of causality. This is stricter than causal order and is vital for applications like
distributed transactions and state machine replication. Achieving total order often involves
consensus protocols
The concept of Total Order Global State refers to the challenge of capturing and reasoning
about the collective state of all processes and communication channels in a manner that
respects a single, unified sequence of events—known as total order. Unlike in centralized or
synchronous systems, distributed systems lack a global clock and shared memory, making it
difficult to determine a single, consistent view of the system at any given instant. The global
state is a snapshot that reflects the local states of all processes and the status of all
communication channels. Achieving a total order means that all events (such as message
sends, receives, and internal computations) are arranged in a single sequence that is agreed
upon by all nodes, regardless of the inherent concurrency and asynchrony of the system.
Total order is essential for certain classes of distributed applications, such as replicated
databases, distributed transactions, and group communication protocols, where all
participants must agree on the exact sequence of events to ensure consistency and
correctness. To establish a total order, distributed systems often employ protocols and
algorithms like total order broadcast, consensus algorithms (e.g., Paxos, Raft), or logical time
mechanisms that assign unique, monotonically increasing timestamps to events. These tools
enable the system to resolve ambiguities caused by concurrent or overlapping events and to
present a coherent, agreed-upon sequence of actions to all participants.
Capturing a total order global state is particularly challenging because messages may be
delayed, reordered, or lost, and processes may operate at different speeds. Techniques such as
snapshot algorithms (e.g., Chandy-Lamport algorithm) are used to record a consistent global
state without halting the system. These algorithms ensure that the snapshot reflects a state
that could have occurred during a legal execution of the system, even if the actual events
were interleaved. By maintaining a total order of events, distributed systems can provide
strong guarantees about consistency, facilitate debugging and recovery, and support complex
coordination tasks across geographically dispersed nodes.

 Total order arranges all events in a single, agreed-upon sequence across the
distributed system, essential for consistency in critical applications.
 Global state is the combined state of all processes and communication channels at a
particular instant, crucial for debugging, checkpointing, and recovery.
 Achieving total order is challenging due to the lack of a global clock and
unpredictable message delays in asynchronous distributed systems.
 Protocols like total order broadcast, consensus algorithms, and logical clocks are used
to establish and maintain total order among events.

 Snapshot algorithms (e.g., Chandy-Lamport) help capture a consistent global state
without stopping the system, ensuring that the recorded state is meaningful and usable
for system management tasks.

System Model and Definitions


A distributed system is modelled as a collection of processes communicating over a network.
Key definitions include:
 Process: Independent computational entity.
 Event: An action within a process (e.g., computation, message send/receive).
 Channel: Communication link between processes, can be FIFO or non-FIFO.
 Global State: The combined state of all processes and channels at a given instant

A system model in distributed computing provides a formal framework to describe, analyse,


and design distributed systems. It defines the structure, components, communication
mechanisms, and assumptions about the environment in which the distributed system
operates. The system model is crucial for understanding the behaviour of distributed
programs, for proving correctness, and for comparing different algorithms and architectures.

1. Definition of Distributed System


A distributed system is a collection of independent computers (nodes) that work together to
achieve a common goal. These nodes communicate and coordinate their actions by passing
messages over a network. To users and applications, the distributed system appears as a
single coherent unit, even though its components may be physically separated and
geographically dispersed.
2. Components of a Distributed System
 Nodes: Independent computers such as servers, desktops, or embedded devices.
 Network: The communication infrastructure that connects the nodes and enables data
exchange.
 Middleware: Software that provides common services to applications, sitting
between the operating system and the application layer.
 Applications: Programs that utilize distributed resources to perform collaborative
tasks.

3. Communication Models
Distributed systems use two primary models for inter-process communication:
 Message-Passing Systems: Each node has its own local memory and communicates
by explicitly sending and receiving messages. There is no shared memory space.

 Shared Memory Systems: Multiple processors access a common memory space and
communicate by reading and writing shared variables. This is typical in tightly
coupled systems.
4. Execution Models
 Synchronous Execution: All nodes operate in lockstep, often coordinated by a global
clock. Message delivery and process execution occur within known time bounds.
 Asynchronous Execution: Nodes operate independently without a global clock.
Message delivery times and process execution speeds are unpredictable, making
coordination more challenging but allowing greater flexibility and scalability.
5. Primitives for Distributed Communication
Distributed systems rely on basic communication primitives:
 Send: Transmit a message from one node to another.
 Receive: Accept a message sent by another node.
 Broadcast: Send a message to all nodes.
 Multicast: Send a message to a specific group of nodes.
These primitives form the foundation for higher-level protocols and synchronization
mechanisms.
6. Models of Communication Networks
Distributed systems can be built on various network topologies:
 Fully Connected Network: Every node can directly communicate with every other
node.
 Partially Connected Network: Nodes are connected in specific topologies and
messages may pass through intermediate nodes.
 Dynamic Networks: The network topology can change over time due to node
mobility or failures.
7. Global State
The global state of a distributed system is the collective state of all processes and
communication channels at a given instant. Capturing a consistent global state is challenging
due to the lack of a global clock and shared memory. Snapshot algorithms are used to record
global states for debugging, checkpointing, and recovery.

8. Design Issues and Challenges


Key challenges in designing distributed systems include:
 Heterogeneity: Integrating different hardware, operating systems, and network
technologies.

 Transparency: Hiding the distributed nature from users and applications.
 Fault Tolerance: Ensuring the system functions despite failures.
 Scalability: Maintaining performance as the number of nodes increases.
 Security: Protecting data and resources from unauthorized access.
 Concurrency: Managing simultaneous operations and avoiding conflicts.
 No Global Clock: Coordinating actions without a shared notion of time.

Snapshot
A snapshot is a mechanism to record a consistent global state of a distributed system. The
Chandy-Lamport algorithm is a classic approach for taking snapshots in systems with FIFO
channels. It allows processes to record their states and the states of communication channels
without halting the system, enabling applications like checkpointing, garbage collection, and
deadlock detection
In distributed computing, a snapshot refers to a technique or algorithm used to record a
consistent global state of a distributed system at a particular point in time. Since distributed
systems consist of multiple independent nodes that communicate via message passing and
operate without a global clock, capturing the exact state of all processes and communication
channels simultaneously is a significant challenge. Snapshots are essential for applications
such as distributed debugging, checkpointing, failure recovery, and detecting global
properties like deadlocks or termination.
The most well-known snapshot algorithm is the Chandy-Lamport Snapshot Algorithm. This
algorithm allows a distributed system to record its global state without halting the system or
requiring synchronized clocks. The process begins when a designated process (the initiator)
records its own local state and sends a special "marker" message along all its outgoing
communication channels. When other processes receive a marker for the first time, they
record their local state and propagate the marker on their outgoing channels. If a process
receives a marker after it has already recorded its state, it records the state of the channel as
the messages received between its local snapshot and the arrival of the marker. This ensures
that the global state captured is consistent and could have occurred during a legal execution
of the system.
Snapshots are vital for checkpointing (saving the system state so it can be restored after a
failure), recovery (rolling back to a consistent state after errors), and for analysing the
behaviour of distributed applications. By enabling the system to capture a consistent global
state, snapshot algorithms help in ensuring reliability and correctness in distributed
environments, where asynchrony and the absence of a global clock make such tasks
inherently complex.

 A snapshot captures a consistent global state of all processes and communication


channels in a distributed system.

 The Chandy-Lamport algorithm is the standard method for taking snapshots without
stopping the system or requiring synchronized clocks.
 Snapshots are essential for checkpointing, recovery, and debugging, enabling
distributed systems to recover from failures and analyse system behaviour.
 Markers are used to coordinate the recording of local states and channel states,
ensuring the consistency of the global snapshot.
 Snapshots address the challenge of asynchrony and lack of a global clock, making
them fundamental to the reliability and correctness of distributed computing.

Algorithms for FIFO Channels


FIFO (First-In-First-Out) channels preserve the order of messages between any pair of
processes. Algorithms like Chandy-Lamport's snapshot exploit FIFO properties to ensure that
the recorded global state is consistent and can be used for recovery or analysis
In distributed systems, FIFO (First-In, First-Out) channels are communication links that
guarantee messages sent from one process to another are delivered in the exact order in which
they were sent. This property is fundamental for maintaining consistency and simplifying the
design of distributed algorithms, as it ensures that the sequence of events observed by the
receiver matches the sequence intended by the sender. FIFO channels are particularly
important in message-passing systems, where processes interact solely by exchanging
messages over a network.

Importance of FIFO Channels


FIFO channels are essential for many distributed algorithms, including those for mutual
exclusion, global state recording (such as snapshot algorithms), and reliable multicast or
broadcast protocols. When using FIFO channels, developers can assume that messages will
not be reordered in transit, which simplifies reasoning about causality and event ordering. For
example, if a process sends two updates to a shared resource, FIFO guarantees that all other
processes will receive the updates in the same order, preventing inconsistencies.

Common Algorithms Utilizing FIFO Channels

1. Chandy-Lamport Snapshot Algorithm:

This algorithm relies on FIFO channels to ensure that the global state recorded is
consistent. When a marker message is sent to initiate the snapshot, the FIFO property
guarantees that all messages sent before the marker are received before the marker,
and any messages received after the marker were sent after the snapshot started. This
property is crucial for accurately recording the state of communication channels.
2. Reliable Broadcast Algorithms:
In reliable broadcast or multicast protocols, FIFO channels ensure that all recipients
receive messages from a given sender in the same order. This is especially important
in replicated systems and group communication, where consistency across nodes is
required.
3. Distributed Mutual Exclusion Algorithms:
Algorithms such as Lamport’s and Ricart-Agrawala’s mutual exclusion protocols
often assume FIFO message delivery to simplify the coordination logic. FIFO
channels help prevent deadlocks and ensure that requests for critical sections are
processed in the correct order.
4. Causal and Total Ordering Protocols:
While causal and total order protocols may require additional mechanisms, FIFO
channels provide a baseline guarantee that helps in constructing higher-level ordering
guarantees.

Implementing FIFO Channels


In practice, implementing FIFO channels may involve adding sequence numbers to messages
and buffering out-of-order messages until missing ones arrive. Many network protocols, such
as TCP, inherently provide FIFO guarantees, while others (like UDP) do not, requiring
additional software layers to enforce ordering.
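
The sequence-number-and-buffer approach can be sketched as follows; the FifoReceiver class and its method names are illustrative assumptions, not part of any particular library.

# Enforcing FIFO delivery per sender with sequence numbers and a hold-back buffer.
class FifoReceiver:
    def __init__(self):
        self.expected = {}   # sender -> next expected sequence number
        self.buffer = {}     # sender -> {seq: message} held back until in order

    def on_message(self, sender, seq, msg, deliver):
        # deliver(msg) is called only when the message is next in the sender's order.
        nxt = self.expected.get(sender, 0)
        self.buffer.setdefault(sender, {})[seq] = msg
        # Deliver every buffered message that is now in order.
        while nxt in self.buffer[sender]:
            deliver(self.buffer[sender].pop(nxt))
            nxt += 1
        self.expected[sender] = nxt

# Example: messages 1 and 0 from the same sender arrive out of order.
r = FifoReceiver()
r.on_message("P1", 1, "second", print)   # held back, nothing delivered yet
r.on_message("P1", 0, "first", print)    # delivers "first" and then "second"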

 FIFO channels guarantee in-order delivery of messages between any pair of


processes, simplifying distributed algorithm design.
 Snapshot algorithms (like Chandy-Lamport) rely on FIFO channels to capture
consistent global states.
 Reliable broadcast and multicast protocols use FIFO channels to ensure all recipients
observe messages from a sender in the same order.
 Mutual exclusion and coordination algorithms benefit from FIFO properties to
prevent deadlocks and maintain fairness.
 FIFO can be implemented via sequence numbers and buffering, or by using network
protocols (like TCP) that provide in-order delivery natively.

What are the issues in distributed system
There is no global time in a distributed system, so the clocks on different computers do not
necessarily give the same time as one another. All communication between processes is
achieved by means of messages. Message communication over a computer network can be
affected by delays, can suffer from a variety of failures and is vulnerable to security attacks.

What is meant by group communication in distributed system


Group communication occurs when a single source process simultaneously attempts to
communicate with multiple other processes. A group is an abstract collection of interrelated
processes. This abstraction hides the message passing such that the communication seems to
be a standard procedure call.

What is meant by asynchronous programming


Asynchronous programming provides opportunities for a program to continue running other
code while waiting for a long-running task to complete.

Write an application of causal order


The causal ordering of messages describes the causal relationship between a message send
event and a message receive event. For example, if send(M1) -> send(M2) then every
recipient of both the messages M1 and M2 must receive the message M1 before receiving the
message M2.

What is synchronous order


Synchronous execution means the first task in a program must finish processing before
moving on to executing the next task

Define Scalar Time

Scalar time (Lamport’s logical clock) assigns a single integer timestamp to every event. If two
events with scalar timestamps are independent (i.e., they are not causally related), they can be
ordered using any arbitrary criterion without violating the causality relation. Therefore, the
total order obtained from scalar time is consistent with the causality relation.

What is clock skew


Clock skew (sometimes called timing skew) is a phenomenon in synchronous digital circuit
systems (such as computer systems) in which the same sourced clock signal arrives at
different components at different times due to gate or, in more advanced semiconductor
technology, wire signal propagation delay.

What is clock drift rate


Clock Drift: No two clocks have exactly the same rate of oscillation, i.e., their clock rates
differ. The difference in clock rates is called clock drift.

What is clock tick


Clock Tick: After a predefined number of oscillations, the timer generates a clock tick.
This clock tick generates a hardware interrupt that causes the computer's operating system to
enter a special routine in which it can update the software clock and run the process
scheduler.

What is logical Clock


Logical Clocks refer to implementing a protocol on all machines within your distributed
system, so that the machines are able to maintain consistent ordering of events within some
virtual timespan. A logical clock is a mechanism for capturing chronological and causal
relationships in a distributed system.

What is global state of the distributed system


The global state of a distributed system is the set of local states of each individual process
involved in the system plus the states of the communication channels.

Write the happened-before relation


The happened-before relation is an irreflexive partial ordering on the set of all events
happening in the system, i.e., (a → a) is not true for any event a.
Lamport related this idea to Einstein’s special theory of relativity, where events are ordered in
terms of the messages that could possibly be sent between them.

What is vector clock


A vector clock is an algorithm that generates a partial ordering of events and detects
causality violations in a distributed system.

What is Chandy Lamport algorithm


Chandy and Lamport were the first to propose an algorithm to capture a consistent global state
of a distributed system. The main idea behind the proposed algorithm is that if we know that all
messages that have been sent by one process have been received by another, then we can
record the global state of the system.

Explain about Message ordering paradigms
Message Ordering Paradigms in Distributed Systems
Message ordering paradigms define
how messages are delivered to processes in a distributed system, ensuring consistency,
coordination, and correctness. They are crucial for applications like distributed databases,
group communication, and collaborative tools, where the order of operations affects the
global state.

Types of Message Ordering Paradigms


a) No Ordering (Unordered Delivery)
 Messages may arrive in any order, possibly different at each receiver. Suitable for
applications where order does not matter.
b) FIFO (First-In-First-Out) Ordering
 Messages sent from one process to another are delivered in the order in which they were sent.
 If process P sends message m1 and then m2 to process Q, then:
if send(m1) precedes send(m2) at the sender, then deliver(m1) precedes deliver(m2) at the receiver.
c) Causal Ordering
 Messages are delivered respecting the "happened-before" relationship.
 If sending m1 causally precedes sending m2, all processes deliver m1 before m2.
 a → b ⟹ deliver(a) before deliver(b), where a → b denotes the causal relationship.
d) Total Ordering
 All messages are delivered to all processes in the exact same order, regardless of causality:
for all Pi, Pj and all mx, my, if both deliver mx and my, they do so in the same order.

e) Atomic (Reliable) Ordering
 Ensures that messages are delivered to all or none of the intended recipients, often
combined with total order.

1. Algorithms for Message Ordering


Lamport’s Total Order Algorithm
 Each message is timestamped using Lamport clocks.
 Messages are delivered in order of increasing timestamp.
 If two messages have the same timestamp, process IDs are used to
break ties.
Algorithm Steps:

 Process sends a timestamped message to all, including itself.
 On receiving, each process queues the message and sends an acknowledgment.
 Message is delivered when it has the smallest timestamp in the queue and all
acknowledgments are received
 The timestamp of a message m is computed as C(m) = max(C_local, C_received) + 1.

Centralized Sequencer Algorithm


 A coordinator assigns sequence numbers to messages, ensuring total order.
 All messages pass through the sequencer before delivery (a minimal sketch is given below).
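
The sketch below uses illustrative class names; a real protocol would also multicast the sequenced messages to the group and handle sequencer failure.

# Centralized sequencer: every multicast gets a global sequence number,
# and receivers deliver strictly in sequence-number order.
class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def assign(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        return seq, msg          # in a real system this pair is multicast to the group

class TotalOrderReceiver:
    def __init__(self):
        self.expected = 0
        self.holdback = {}

    def on_sequenced(self, seq, msg, deliver):
        self.holdback[seq] = msg
        while self.expected in self.holdback:
            deliver(self.holdback.pop(self.expected))
            self.expected += 1

# Example: a receiver delivers in the sequencer's order even if messages arrive shuffled.
seq = Sequencer()
a = seq.assign("update A")
b = seq.assign("update B")
r = TotalOrderReceiver()
r.on_sequenced(*b, deliver=print)   # held back
r.on_sequenced(*a, deliver=print)   # prints "update A" then "update B"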

2. Practical Considerations and Challenges

 Scalability: Centralized approaches may become bottlenecks; decentralized


algorithms improve scalability by distributing ordering tasks.
 Fault Tolerance: Sequencer or coordinator failures must be handled gracefully.
 Efficiency: Message ordering can add latency; trade-offs exist between strictness of
order and performance.
 Group Communication: Overlapping group memberships require careful sequencing
to avoid inconsistent orders

Paradigm   | Guarantees            | Use Case Example
Unordered  | No order              | Logging, stateless events
FIFO       | Per-sender order      | Chat between two users
Causal     | Causality preserved   | Collaborative editing
Total      | Same order for all    | Replicated databases
Atomic     | All-or-none delivery  | Distributed transactions

Explain the types of Group communication used in distributed system

In distributed systems, efficient group communication is crucial for coordinating activities


among multiple entities. This section explores the challenges and solutions involved in
facilitating reliable and ordered message delivery among members of a group spread across
different nodes or networks.
What is Group Communication in Distributed Systems?
Group communication in distributed systems refers to the process where multiple nodes or
entities communicate with each other as a group.
 Instead of sending messages to individual recipients, group communication allows a
sender to transmit information to all members of a group simultaneously.
 This method is essential for coordinating actions, sharing data, and ensuring that all
participants in the system are informed and synchronized. It's particularly useful in
scenarios like collaborative applications and real-time updates
Importance of Group Communication in Distributed Systems
Group communication is critically important in distributed systems due to several key
reasons:
 Multiple nodes must collaborate and synchronize their actions. Group communication
helps them exchange information and stay updated.
 Different nodes can create data that needs to be shared. Group communication helps
quickly send this information to everyone involved, reducing delays and keeping data
consistent.
 Group communication protocols enhance reliability by allowing messages to be
replicated or acknowledged across multiple nodes. This ensures robust
communication, even during failures or network issues.
 As distributed systems expand, effective scaling is crucial. Group communication
mechanisms can manage more nodes and messages without sacrificing performance,
keeping the system efficient and responsive.

Unicast Communication
Unicast communication is the point-to-point transmission of data between two nodes in a
network. In the context of distributed systems:
 Unicast is when a sender sends a message to a specific recipient, using its unique network
address.
 Each message targets one recipient, creating a direct connection between the sender and the
receiver.
 Unicast is commonly seen in client-server setups, where a client makes requests and receives
responses, as well as in direct connections between peers.
 This method makes good use of network resources, is easy to implement, and keeps latency
low because messages go straight to the intended recipient.
 Unicast is not efficient for sending messages to many recipients at once, as it requires separate
messages for each one, leading to more work.

Multicast Communication
Multicast communication involves sending a single message from one sender to multiple
receivers simultaneously within a network. It is particularly useful in distributed systems
where broadcasting information to a group of nodes is necessary:
 Multicast lets a sender share a message with a specific group of recipients that have joined
the group.
 This way, the sender can reach many recipients at once, which is more efficient than sending
separate messages.
 This approach is often used to send updates to subscribers or in collaborative applications
where real-time sharing of changes is needed.
 By sending data just once to a group, multicast saves bandwidth, simplifies communication,
and can easily handle a larger number of recipients.
 Managing group membership is necessary to ensure reliable message delivery, and multicast
can run into issues if there are network problems that affect everyone in the group.

Broadcast Communication
Broadcast is when a sender sends a message to every node in the network without targeting
specific recipients:
 Messages are delivered to all nodes at once using a special address designed for this purpose.
 It is often used for network management tasks, like sending status updates, or for emergency
alerts that need to reach everyone quickly.
 Broadcast ensures that every node receives the message without needing to specify who the
recipients are, making it efficient for sharing information widely.
 It can cause network congestion in larger networks and raises security concerns since anyone
on the network can access the broadcast message, which might lead to unauthorized access.

Reliable Multicast Protocols for Group Communication


Messages sent from a sender to multiple recipients should be delivered reliably, consistently,
and in a specified order. Types of Reliable Multicast Protocols include:
 FIFO Ordering:

o Ensures that messages are delivered to all group members in the order they
were sent by the sender.
o Achieved by sequencing messages and delivering them sequentially to
maintain the correct order.
 Causal Ordering:
o Preserves the causal relationships between messages based on their
dependencies.
o Ensures that messages are delivered in an order that respects the causal
dependencies observed by the sender.
 Total Order and Atomicity:
o Guarantees that all group members receive messages in the same global order.
o Ensures that operations based on the multicast messages (like updates to
shared data) appear atomic or indivisible to all recipients.
Scalability for Group Communication
Scalability and performance are vital for effective group communication in distributed
systems. It’s essential for the system to handle more nodes, messages, and participants while
still operating efficiently. Here’s a closer look at these important aspects:
1. Scalability
Scalability refers to the system's ability to grow without losing efficiency. This includes:
 Managing an increasing number of nodes or participants while keeping
communication smooth.
 Handling more messages exchanged among group members, ensuring timely and
responsive communication.
 Supporting connections across distant nodes or networks, which can introduce latency
and bandwidth issues.
2. Challenges in Scalability
As the group size grows, several challenges arise:
 The management of group membership and message routing becomes more complex,
adding overhead.
 The network must have enough bandwidth to support the higher traffic from a larger
group to avoid congestion.
 Keeping distributed nodes consistent and synchronized gets more complicated as the
system scales.
3. Strategies for Scalability
To tackle these challenges, various strategies can be employed:

 Partitioning and Sharding: Breaking the system into smaller parts can make
communication and management more manageable.
 Load Balancing: Evenly distributing the workload across nodes helps prevent
bottlenecks and optimizes resource use.
 Replication and Caching: Duplicating data or messages across nodes can lower
access times and enhance fault tolerance, aiding scalability.
Performance for Group Communication
Performance in group communication is crucial for optimizing message speed, resource
utilization, and addressing challenges like message ordering and concurrent access, ensuring
efficient collaboration in distributed systems.
1. Performance
In group communication, performance focuses on a few key areas:
 It’s crucial to minimize the time it takes for messages to reach their intended
recipients.
 We want to enhance the rate at which messages are handled and sent out.
 Efficient use of bandwidth, CPU, and memory helps keep communication fast and
effective.
2. Challenges in Performance
There are challenges that come with achieving high performance:
 Making sure messages arrive in the right order can be tough, especially with strict
protocols.
 Handling multiple users trying to access shared resources at the same time can lead to
slowdowns.
 Communication needs to adjust based on changing conditions, like slow bandwidth or
lost packets.
3. Strategies for Improvement
To boost performance, consider these strategies:
 Smart Routing: Using efficient routing methods can reduce delays by cutting down
on the number of hops messages take.
 Asynchronous Communication: This allows senders and receivers to work
independently, improving responsiveness.
 Pre-storing Data: Keeping frequently accessed messages or data ready can help
lower delays and speed up responses.

Challenges of Group Communication in Distributed Systems

Group communication in distributed systems comes with several challenges due to the need
to coordinate multiple nodes that may be spread out or connected through unreliable
networks. Key challenges include:
 Reliability: Messages must reach all intended recipients, even during network
failures or node crashes, which can be complicated when nodes frequently join or
leave.
 Scalability: As the number of participants grows, managing communication
effectively becomes harder, leading to issues with bandwidth usage and processing
delays.
 Concurrency and Consistency: Keeping shared data consistent while allowing
simultaneous updates is tricky, requiring strong synchronization to avoid conflicts.
 Fault Tolerance: The system must handle node failures and communication issues
without losing reliability. This means having mechanisms to detect failures and
manage changes in group membership.

Elucidate on the total and causal order in distributed system with a neat diagram

 Same example as previously, but now causal ordering means that
(a) everyone must see m1 before m3 (as with FIFO), and
(b) everyone must see m1 before m2 (due to happens-before)
 Is this OK? No: m1→m2, but P2 sees m2 before m1. To be correct, P2 must hold back (delay) the delivery of m2. But how do we know this?
 It turns out this is pretty easy!
 Start with the receive algorithm for FIFO multicast and replace sequence numbers with vector clocks, as sketched below.
 Some care is needed with dynamic groups.
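The rule just described can be made concrete with a small sketch of causal-order delivery using vector clocks. The CausalMulticast class, its hold-back list, and the network.multicast() helper below are illustrative assumptions, not a specific library API.

```python
# A minimal sketch of causal-order multicast delivery using vector clocks.
class CausalMulticast:
    def __init__(self, pid, n):
        self.pid = pid                # this process's id (0..n-1)
        self.vc = [0] * n             # entry k = multicasts from process k delivered so far
        self.holdback = []            # received messages waiting for their causal predecessors

    def send(self, payload, network):
        self.vc[self.pid] += 1        # our own multicast counts as delivered locally
        # (assumes the network delivers to the other group members, not back to us)
        network.multicast(sender=self.pid, vc=list(self.vc), payload=payload)

    def on_receive(self, sender, vc, payload):
        self.holdback.append((sender, vc, payload))
        self._try_deliver()

    def _deliverable(self, sender, vc):
        # next-in-sequence from the sender, and we have seen everything it causally depends on
        return (vc[sender] == self.vc[sender] + 1 and
                all(vc[k] <= self.vc[k] for k in range(len(vc)) if k != sender))

    def _try_deliver(self):
        delivered_something = True
        while delivered_something:                 # keep sweeping until nothing more is deliverable
            delivered_something = False
            for sender, vc, payload in list(self.holdback):
                if self._deliverable(sender, vc):
                    self.holdback.remove((sender, vc, payload))
                    self.vc[sender] = vc[sender]   # record the delivery
                    print(f"P{self.pid} delivers {payload}")
                    delivered_something = True
```

In the scenario above, m2's vector clock would show that it causally depends on m1, so P2's check fails and m2 sits in the hold-back list until m1 has been delivered.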

Total ordering
Sometimes we want all processes to see exactly the same, FIFO, sequence of messages, particularly for state machine replication (see later).
• One way is to have a ‘can send’ token:
Token passed round-robin between processes
Only the process holding the token can send (if it wants to)
• Or use a dedicated sequencer process
Other processes ask it for a global sequence number (GSN) and then send with this number in the packet
Use the FIFO ordering algorithm, but on GSNs (see the sketch below)
• Can also build non-FIFO total-order multicast by having processes generate GSNs
themselves and resolving ties
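A minimal sketch of the sequencer approach mentioned above: one process hands out global sequence numbers (GSNs) and every receiver delivers in GSN order using a hold-back buffer. The Sequencer and Receiver classes and the in-memory loop are illustrative assumptions, not a specific API.

```python
# A minimal sketch of sequencer-based total-order multicast.
class Sequencer:
    def __init__(self):
        self.next_gsn = 0

    def order(self, msg):
        gsn, self.next_gsn = self.next_gsn, self.next_gsn + 1
        return gsn, msg                              # (global sequence number, message)

class Receiver:
    def __init__(self, pid):
        self.pid = pid
        self.expected = 0                            # next GSN this receiver may deliver
        self.holdback = {}                           # gsn -> message awaiting delivery

    def on_sequenced(self, gsn, msg):
        self.holdback[gsn] = msg
        while self.expected in self.holdback:        # FIFO ordering, but on GSNs
            delivered = self.holdback.pop(self.expected)
            print(f"P{self.pid} delivers #{self.expected}: {delivered}")
            self.expected += 1

# Usage: every multicast is first stamped by the sequencer, then fanned out
seq = Sequencer()
receivers = [Receiver(0), Receiver(1)]
for m in ["m1", "m2", "m3"]:
    gsn, msg = seq.order(m)
    for r in receivers:
        r.on_sequenced(gsn, msg)                     # in practice this travels over the network
```

Because every receiver delivers strictly in GSN order, all processes see exactly the same sequence of messages.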

Explain about the Chandy-Lamport snapshot algorithm for FIFO channels
Each distributed system has a number of processes running on a number of different physical servers. These processes communicate with each other over communication channels using message passing. They have neither shared memory nor a common physical clock, which makes determining an instantaneous global state difficult.
A process could record its own local state at a given time, but the messages that are in transit (on their way to being delivered) would not be included in the recorded state, and hence the recorded state of the system would be incorrect once an in-transit message is delivered.
Chandy and Lamport were the first to propose an algorithm to capture a consistent global state of a distributed system. The main idea behind the algorithm is that if we know that all messages sent by one process have been received by another, then we can record the global state of the system.
Any process in the distributed system can initiate this global state recording algorithm using a special message called a MARKER. The marker traverses the distributed system across all communication channels and causes each process to record its own state. In the end, the state of the entire system (the global state) is recorded. The algorithm does not interfere with the normal execution of the processes.
Assumptions of the algorithm:
 There are a finite number of processes in the distributed system, and they do not share memory or clocks.
 There are a finite number of communication channels, and they are unidirectional and FIFO-ordered.
 There exists a communication path between any two processes in the system
 On a channel, messages are received in the same order as they are sent.
Algorithm:
 Marker sending rule for a process P
 Process P records its own local state
 For each outgoing channel C from process P, P sends marker along C before sending
any other messages along C.

 Marker receiving rule for a process Q


 If process Q has not yet recorded its own local state, then
 It records the state of the incoming channel C1 (on which the marker arrived) as an empty sequence or null.
 After recording the state of C1, process Q follows the marker sending rule.
 If process Q has already recorded its state
 Record the state of incoming channel C1 as the sequence of messages received along
channel C1 after the state of Q was recorded and before Q received the marker along
C1 from process P.

Need of taking snapshot or recording global state of the system:


 Checkpointing: It helps in creating checkpoints. If the application fails, a saved checkpoint can be reused.
 Garbage collection: It can be used to remove objects that do not have any references.
 It can be used in deadlock and termination detection.
 It is also helpful in debugging.

A Global Snapshot or a global state consists of local states of each process in the distributed
system along with the in-transit messages on communication channels. The snapshot problem
in a distributed system tries to determine the “current” snapshot of the global
state without stopping the system. A Global snapshot is important, in many scenarios such
as:
 determine safety points for a distributed database
 deadlock detection
 termination of an algorithm
 garbage collection of objects
 to find out the current load of a distributed system
Two major problems in determining a global snapshot are:
 One cannot catch all processes at the same time
 Messages that are on the way cannot be seen
The classical algorithm that is used to determine a global snapshot in a distributed system is
the Chandy-Lamport Global Snapshot Algorithm, 1985.
The assumptions of the algorithm are as follows:
 Neither channels nor processes fail, communication is reliable
 Channels are unidirectional and provide FIFO-ordered delivery
 There is a communication path between any two processes in the system
 Any process may initiate the snapshot algorithm
 The snapshot algorithm does not interfere with the normal execution of the processes
 Each process in the system records its local state and the state of its incoming
channels
 Global state is collected in a distributed manner
The snapshot algorithm works using marker messages. The marker message is a special
control message and not an application message. It does not interfere with application
messages. The algorithm has three phases: Initiation, Propagation and Termination and works
as follows:
1. Initiating a Snapshot
 Process Pi initiates the snapshot.
 The initiator, Pi records its own state and creates the marker message.
 Pi then forwards the marker message to all other processes using the N-1 outbound channels Cij (j = 1 to N & j != i).
 Pi starts recording all incoming messages from channels Cji (j = 1 to N & j != i).
2. Propagating a Snapshot
If a process Pj receives a Marker message on an incoming channel Ckj, and if this is the first
Marker that Pj is seeing:
 Pj records its own state and marks Ckj as empty.
 Pj then forwards the marker message to all other processes using the N-1 outbound
channels.
 Pj starts recording all incoming messages from channels Clj for l not equal to j or k.
If the process Pj has already seen a Marker message:
 Mark the state of channel Ckj as all the messages that have arrived on it since the
recording was turned on for Ckj.
3. Terminating a Snapshot
The algorithm terminates when:
 All processes have received a marker (which implies that process state has been
recorded)
 All processes have received a marker on all the N-1 incoming channels (which
implies that state of all channels has been recorded)
 Later, if needed, the central server collects all these partial states to build the global
snapshot.
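The marker rules above can be summarised in code. Below is a minimal, single-process sketch of the Chandy-Lamport rules, assuming a send(channel, message) helper and named channels; record_local_state() is a placeholder standing in for whatever application state a real process would save.

```python
# A minimal sketch of one process's Chandy-Lamport bookkeeping.
class SnapshotProcess:
    def __init__(self, pid, incoming, outgoing):
        self.pid = pid
        self.incoming = incoming          # identifiers of incoming channels
        self.outgoing = outgoing          # identifiers of outgoing channels
        self.local_state = None
        self.channel_state = {}           # channel id -> messages recorded as "in transit"
        self.recording = {}               # channel id -> currently recording?

    def initiate_snapshot(self, send):
        self._record_and_send_markers(send)

    def on_marker(self, channel, send):
        if self.local_state is None:                     # first marker this process sees
            self.channel_state[channel] = []             # that channel is recorded as empty
            self.recording[channel] = False
            self._record_and_send_markers(send)
        else:
            self.recording[channel] = False              # stop recording this channel

    def on_app_message(self, channel, msg):
        if self.recording.get(channel):
            self.channel_state[channel].append(msg)      # message was in transit at snapshot time
        # ... normal application processing of msg continues here

    def _record_and_send_markers(self, send):
        self.local_state = self.record_local_state()
        for ch in self.outgoing:
            send(ch, "MARKER")                           # marker goes out before any later message
        for ch in self.incoming:
            if ch not in self.channel_state:             # start recording the remaining channels
                self.channel_state[ch] = []
                self.recording[ch] = True

    def record_local_state(self):
        return {"pid": self.pid}                         # placeholder for real local state
```

Together, each process's local_state and channel_state form its part of the global snapshot, which a central server can later assemble.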

Consistent Cut
A cut in a space-time diagram is a line joining an arbitrary point on each process line that
slices the space‐time diagram into a PAST and a FUTURE. A consistent global state
corresponds to a cut in which every message received in the PAST of the cut was sent in the
PAST of that cut. Such a cut is known as a consistent cut. A consistent cut obeys causality
and any run of the Chandy-Lamport Global Snapshot algorithm creates a consistent cut.

UNIT III - DISTRIBUTED MUTEX AND DEADLOCK
Distributed Mutual exclusion Algorithms: Introduction – Preliminaries – Lamport’s
algorithm – Ricart- Agarwala’s Algorithm –– Token-Based Algorithms – Suzuki-Kasami’s
Broadcast Algorithm; Deadlock Detection in Distributed Systems: Introduction – System
Model – Preliminaries – Models of Deadlocks – Chandy-Misra-Haas Algorithm for the AND
model and OR Model.

Course Outcome
CO3 Use resource sharing techniques in distributed systems (K3)
Course Objective
To describe distributed mutual exclusion and distributed deadlock detection techniques

Important Topics

1. Lamport’s Algorithm (Working, Diagram, Message Complexity)


2. Ricart-Agarwala’s Algorithm (Working, Comparison with Lamport)
3. Suzuki-Kasami’s Algorithm (Steps, Token Structure, Example)
4. Chandy-Misra-Haas Algorithm (For AND & OR Models, Steps, Example)
5. Models of Deadlocks (AND, OR, Single Resource)
6. Comparison of Distributed Mutual Exclusion Algorithms (Table or Points)
7. Differences Between Token-based and Permission-based Algorithms
8. Deadlock Detection System Model and Wait-for Graph

UNIT - III
Introduction to Distributed Mutual Exclusion Algorithms

Distributed mutual exclusion algorithms coordinate access to shared resources in systems


without centralized control, ensuring that only one process enters a critical section (CS) at a
time. These algorithms address challenges like network delays, failures, and the absence of
shared memory. Key requirements include:
 Safety: At most one process executes in the CS concurrently.
 Liveness: Every request eventually enters the CS.
 Fairness: Requests are granted in a bounded time or by request order.

Key Challenges in Distributed Systems


 No Shared Memory: Processes communicate via message passing, complicating state
coordination.
 Partial Failures: Processes or links may fail independently.
 Variable Latency: Network delays cause unpredictable message arrival times.
 Scalability: Solutions must work efficiently as the system grows.

Algorithm Classifications
Permission-Based:
Processes request explicit consent from others to enter the CS. Examples:
Lamport’s Algorithm: Uses logical timestamps and total ordering of requests.
Ricart-Agarwala’s Algorithm: Optimizes messages by deferring replies.
Token-Based:
A unique token circulates; only its holder enters the CS. Examples:
Suzuki-Kasami’s Algorithm: Uses broadcasts and token passing.
Raymond’s Tree-Based Algorithm: Organizes processes in a tree for efficient token routing.
Design Trade-offs
Message Overhead: Permission-based algorithms (e.g., Lamport’s) use more messages than
token-based ones.
Fault Tolerance: Token loss requires recovery mechanisms.
Scalability: Hierarchical structures (e.g., trees) reduce coordination overhead.

Lamport's algorithm
Lamport's algorithm, developed by Leslie Lamport in 1978, is a foundational distributed
mutual exclusion algorithm that enables processes in a distributed system to safely access
shared resources without centralized coordination. It relies on logical clocks to order requests
and ensure fairness. Below is a structured explanation:
Key Principles
1. Logical Clocks:
 Each process maintains a logical clock (counter) that is incremented:
 before initiating a new event, and
 upon receiving a message, to max(local clock, received timestamp) + 1.

2. Request Messaging:
When a process Pi seeks entry to the critical section, it:
 sends a timestamped REQUEST message to all other processes, and
 adds the request to its local queue (ordered by timestamp).

3. Handling Requests:
Upon receiving a REQUEST from Pj, Pi:
 places Pj's request in its own timestamp-ordered queue, and
 sends a timestamped REPLY (acknowledgment) back to Pj.

4. Critical Section Entry:
Pi enters the critical section only after:
 receiving a message with a larger timestamp (e.g., a REPLY) from every other process, and
 having its own request at the head of its timestamp-ordered queue.

5. Exit and Release:
Upon exiting the critical section, Pi:
 removes its request from its own queue, and
 sends a timestamped RELEASE message to all other processes, which then remove Pi's request from their queues.

Example Workflow
1. Process A (logical clock = 5) sends REQUEST to B and C, and (5, A) is placed in every queue.
2. Process B has its own pending request with timestamp 3, so (3, B) precedes (5, A) in the queues.
3. B and C return timestamped REPLY messages, so A soon has acknowledgments from everyone, but (3, B) is still at the head of A's queue, so A waits.
4. B enters and then exits its critical section and sends RELEASE to A and C.
5. A's request is now at the head of every queue, so A enters the critical section.
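To make the REQUEST/REPLY/RELEASE flow concrete, here is a minimal sketch of one process's bookkeeping for Lamport's algorithm. The send(peer, message) transport and the class and method names are assumptions made for this illustration.

```python
# A minimal sketch of Lamport's mutual exclusion bookkeeping for one process.
import heapq

class LamportMutex:
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers                  # ids of the other processes
        self.clock = 0
        self.queue = []                     # request queue ordered by (timestamp, pid)
        self.replies = set()
        self.my_request = None

    def request_cs(self, send):
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        heapq.heappush(self.queue, self.my_request)
        self.replies.clear()
        for p in self.peers:
            send(p, ("REQUEST", self.my_request))

    def on_message(self, sender, msg, send):
        kind, (ts, pid) = msg
        self.clock = max(self.clock, ts) + 1                 # logical clock update rule
        if kind == "REQUEST":
            heapq.heappush(self.queue, (ts, pid))            # queue the request ...
            send(sender, ("REPLY", (self.clock, self.pid)))  # ... and acknowledge it
        elif kind == "REPLY":
            self.replies.add(sender)
        elif kind == "RELEASE":
            self.queue = [r for r in self.queue if r[1] != pid]
            heapq.heapify(self.queue)

    def can_enter_cs(self):
        # own request at the head of the queue, and a reply from every other process
        return (self.my_request is not None and self.queue and
                self.queue[0] == self.my_request and
                len(self.replies) == len(self.peers))

    def release_cs(self, send):
        self.queue = [r for r in self.queue if r[1] != self.pid]
        heapq.heapify(self.queue)
        self.clock += 1
        for p in self.peers:
            send(p, ("RELEASE", (self.clock, self.pid)))
        self.my_request = None
```

Each critical-section entry thus costs (N-1) REQUEST, (N-1) REPLY, and (N-1) RELEASE messages, matching the 3(N-1) figure below.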

Properties
Safety: Mutual exclusion is guaranteed (only the lowest-timestamped request proceeds).
Liveness: Deadlock-free (requests are eventually granted).
Fairness: Requests are granted in happened-before order (timestamps enforce priority).

Limitations
Message Overhead: Requires 3(N-1) messages per critical section entry (N = number of processes).
Failure Sensitivity: Progress relies on all processes being responsive; a single unresponsive process can block entry.
Lamport's algorithm laid the groundwork for more advanced distributed mutual exclusion
techniques, balancing simplicity with robustness in asynchronous systems

Ricart-Agarwala’s Algorithm
Introduction
Ricart–Agarwala’s algorithm is a well-known distributed mutual exclusion algorithm
designed to allow only one process at a time to enter the critical section (CS) in a distributed
system. Unlike Lamport’s algorithm, it reduces message complexity by combining request
and reply messages, making it more efficient for mutual exclusion.

Assumptions
 There are N processes in the system, each with a unique ID.
 Processes communicate using reliable message-passing.
 Each process maintains a logical clock (Lamport’s timestamp).

Algorithm Steps
1. Requesting the Critical Section
 When a process Pi wants to enter the CS:
 It increments its logical clock.
 Sends a REQUEST message (with its timestamp and ID) to all other
processes.
 Waits for REPLY messages from all other processes before entering the
CS.
2. Receiving a REQUEST
 When a process Pj receives a REQUEST from Pi:
 If Pj is neither in the CS nor requesting it, it immediately sends a REPLY.
 If Pj is in the CS, it defers the REPLY until it exits the CS.
 If Pj is also requesting the CS, it compares timestamps:
 If Pi’s timestamp is earlier, Pj sends a REPLY.
 If Pj’s timestamp is earlier, it defers the REPLY.
3. Exiting the Critical Section
 Pi enters the CS after receiving REPLY from all other processes.
 Upon exiting the CS, Pi sends REPLY messages to all deferred requests.

Message Exchange Example


Suppose there are three processes: P1, P2, and P3.
 P1 wants to enter CS:
 Sends REQUEST to P2 and P3.
 P2 and P3 receive REQUEST:
 If not interested in CS, send REPLY to P1.
 If interested, compare timestamps and decide to REPLY or defer.
 P1 enters CS after receiving REPLY from both.
 On exit, P1 sends any deferred REPLYs.
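The same exchange can be sketched in code. Below is a minimal, illustrative implementation of the Ricart-Agrawala rules for a single process, assuming a send(peer, message) transport; the class and method names are assumptions made for this example.

```python
# A minimal sketch of the Ricart-Agrawala rules for one process.
class RicartAgrawala:
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers
        self.clock = 0
        self.requesting = False
        self.in_cs = False
        self.request_ts = None
        self.replies = set()
        self.deferred = []                        # peers whose REPLY we have postponed

    def request_cs(self, send):
        self.clock += 1
        self.request_ts = self.clock
        self.requesting = True
        self.replies.clear()
        for p in self.peers:
            send(p, ("REQUEST", self.request_ts, self.pid))

    def on_request(self, ts, sender, send):
        self.clock = max(self.clock, ts) + 1
        we_have_priority = self.requesting and (self.request_ts, self.pid) < (ts, sender)
        if self.in_cs or we_have_priority:
            self.deferred.append(sender)          # defer: our use of the CS comes first
        else:
            send(sender, ("REPLY", self.pid))     # reply immediately

    def on_reply(self, sender):
        self.replies.add(sender)
        if self.requesting and len(self.replies) == len(self.peers):
            self.in_cs = True                     # all permissions collected: enter the CS

    def release_cs(self, send):
        self.in_cs = False
        self.requesting = False
        for p in self.deferred:                   # now answer the requests we postponed
            send(p, ("REPLY", self.pid))
        self.deferred.clear()
```

Note how the REPLY doubles as both acknowledgment and permission, which is exactly why the per-entry message count drops to 2(N-1).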

Features and Advantages


 Message Complexity: Only 2(N-1) messages per critical section entry (N-1
REQUESTs + N-1 REPLYs).
 Fairness: Requests are granted in timestamp order (first-come, first-served).
 No Deadlock: The algorithm is deadlock-free and starvation-free.
 No Central Coordinator: Fully distributed, no single point of failure.

Limitations
 Scalability: Each process must communicate with all others; not efficient for very
large systems.
 Process Failures: Assumes reliable processes; failures can cause indefinite waiting.

Ricart–Agarwala’s algorithm is a classic solution for distributed mutual exclusion, improving on Lamport’s algorithm by reducing message overhead. It is simple, fair, and effective for small to medium-sized distributed systems.

Suzuki-Kasami’s Algorithm

Suzuki-Kasami’s Algorithm is a classic token-based distributed mutual exclusion algorithm designed for distributed systems. It efficiently ensures that only one process at a time can enter its critical section (CS), even when processes are spread across a network. Below is a detailed explanation of the algorithm, covering its steps, the token structure, and an example.

Overview
 Type: Distributed Mutual Exclusion (DME)
 Category: Token-based, Broadcast
 Key Feature: Only one token exists in the system; possession of the token grants
access to the CS.
 Efficiency: Minimizes message complexity, especially in environments with
infrequent CS requests.

Steps of Suzuki-Kasami’s Algorithm


1. Data Structures Maintained by Each Process
Each process Pi maintains:
 Request Number Array RNi[1..N]: the highest request number received from each process.
 Token (if held), which contains:
 Last Array LN[1..N]: the last request number from each process for which the CS was granted.
 Queue Q: FIFO queue of requesting process IDs.
2. Requesting the Critical Section
 If a process Pi wants to enter the CS and does not have the token:
 It increments RNi[i] (its own request number).
 It broadcasts a REQUEST(i, RNi[i]) message to all other processes.
 If Pi already has the token, it enters the CS directly.
3. Receiving a REQUEST Message
 When Pj receives REQUEST(k, n):
 It updates RNj[k] = max(RNj[k], n).
 If Pj holds the token, is not in the CS, and RNj[k] = LN[k] + 1, it sends the token to Pk.
4. Receiving the Token
 When a process Pi receives the token:
 It can enter the CS.
 Upon exiting the CS:
 It updates LN[i] = RNi[i].
 For each process Pk, if RNi[k] = LN[k] + 1, it adds k to the queue Q (if not already present).
 If Q is not empty, it removes the head k and sends the token to Pk.

Token Structure
The token contains:
 LN[1..N]: Last request number for which each process was granted the CS.
 Q: Queue of process IDs waiting for the token (FIFO order).
This structure allows the token holder to decide to whom the token should be passed next,
based on outstanding requests.
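A minimal sketch of these rules for one process is shown below; the broadcast()/send() transport and the dictionary used to represent the token are assumptions made for illustration.

```python
# A minimal sketch of the Suzuki-Kasami rules for one process.
from collections import deque

class SuzukiKasami:
    def __init__(self, pid, n, has_token=False):
        self.pid = pid
        self.n = n
        self.rn = [0] * n                                   # RN: highest request number seen per process
        self.token = {"ln": [0] * n, "q": deque()} if has_token else None
        self.in_cs = False

    def request_cs(self, broadcast):
        if self.token is not None:
            self.in_cs = True                               # already hold the token: enter directly
            return
        self.rn[self.pid] += 1
        broadcast(("REQUEST", self.pid, self.rn[self.pid]))

    def on_request(self, k, num, send):
        self.rn[k] = max(self.rn[k], num)
        if (self.token is not None and not self.in_cs and
                self.rn[k] == self.token["ln"][k] + 1):
            token, self.token = self.token, None
            send(k, ("TOKEN", token))                       # pass the token to the requester

    def on_token(self, token):
        self.token = token
        self.in_cs = True

    def release_cs(self, send):
        self.in_cs = False
        ln, q = self.token["ln"], self.token["q"]
        ln[self.pid] = self.rn[self.pid]                    # record our completed request
        for k in range(self.n):                             # enqueue any outstanding requesters
            if k not in q and self.rn[k] == ln[k] + 1:
                q.append(k)
        if q:                                               # hand the token to the head of the queue
            nxt = q.popleft()
            token, self.token = self.token, None
            send(nxt, ("TOKEN", token))
```

The example that follows traces exactly this flow for three processes.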

Example Execution
Assume a system with 3 processes: P1, P2, P3.
Step 1: P1 wants the CS and holds the token; it enters the CS directly.
Step 2: P2 requests the CS; RN2[2] is incremented to 1 and REQUEST(2, 1) is broadcast.
Step 3: P1 receives REQUEST(2, 1) and updates RN1[2] = 1.
Step 4: P1 exits the CS; it updates LN, finds RN1[2] = LN[2] + 1, adds 2 to Q, and sends the token to P2.
Step 5: P2 receives the token and enters the CS.

Important Points
 Message Complexity: In the worst case, a process broadcasts (N-1) REQUEST messages per CS entry; the token itself is sent only when needed.
 Fairness: Requests are served in the order they are received (FIFO via Q).
 No Starvation: Every requesting process will eventually receive the token.
 Single Token: Only one token exists; mutual exclusion is guaranteed.
 Scalability: Efficient for systems with many processes and infrequent CS requests.

Chandy-Misra-Haas Algorithm
Chandy-Misra-Haas Algorithm is a fundamental distributed deadlock detection algorithm that operates under both AND and OR resource models. Below is a detailed explanation covering both models, the algorithm steps, and an illustrative example.
1. Overview
 Purpose: Detects deadlocks in distributed systems where resources and processes are
spread across multiple sites.
 Applicability: Works for both AND (all requested resources needed) and OR (any
one requested resource suffices) models.
 Key Idea: Uses probe messages to trace possible cycles in the wait-for graph (WFG)
distributed across sites.
2. AND Model
Algorithm Steps
1. Initiation:
When a process Pi is blocked waiting for resources held by Pj, it initiates a probe if it is not already waiting for a probe response.
2. Probe Structure:
Each probe is a tuple ⟨initiator, sender, receiver⟩, where:
 initiator: The process that started the probe (potentially deadlocked).
 sender: The process sending the probe.
 receiver: The process to which the probe is sent.
3. Propagation:
 If Pj is also waiting for another process Pk, it forwards the probe as ⟨initiator, Pj, Pk⟩.
 This continues along the wait-for chain.

4. Detection:
 If a probe returns to the initiator (i.e., the receiver is the initiator), a cycle
exists in the WFG, indicating a deadlock.
5. Termination:
 If the probe reaches a process not waiting for any other process, it is discarded.

Key Points
 No centralized controller: All processes participate equally.
 Multiple probes: May be in transit simultaneously.
 Deadlock detection: Based on probe cycles.
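The probe rules for the AND model can be sketched as follows; the waiting_for and dependent sets and the send() transport are illustrative assumptions rather than a fixed API.

```python
# A minimal sketch of Chandy-Misra-Haas probe handling for the AND model.
class CMHProcess:
    def __init__(self, pid):
        self.pid = pid
        self.waiting_for = set()        # processes this process is currently blocked on
        self.dependent = set()          # initiators already forwarded for (avoid re-sending)
        self.deadlocked = False

    def initiate_probe(self, send):
        # a blocked process starts detection by probing everything it waits for
        for j in self.waiting_for:
            send(j, ("PROBE", self.pid, self.pid, j))    # (initiator, sender, receiver)

    def on_probe(self, initiator, sender, receiver, send):
        if receiver != self.pid:
            return                                       # not addressed to us
        if initiator == self.pid:
            self.deadlocked = True                       # probe returned: cycle in the WFG
            return
        if self.waiting_for and initiator not in self.dependent:
            self.dependent.add(initiator)                # forward at most once per initiator
            for k in self.waiting_for:
                send(k, ("PROBE", initiator, self.pid, k))
        # a process that is not blocked simply discards the probe
```

The dependent set is what keeps the number of probe messages bounded: each process forwards a given initiator's probe along its wait-for edges at most once.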

3. OR Model
Differences from AND Model
 In the OR model, a process is waiting for any one of several resources.
 The WFG may have hyperedges (one process waiting for any in a set).
Algorithm Steps
1. Initiation:
 When a process is blocked, it sends probes to all processes it is waiting for.
2. Probe Structure:
 Probes carry a list of visited processes to track possible cycles.
3. Propagation:
 Each process that receives a probe appends itself to the probe’s list and
forwards the probe to all processes it is waiting for.
4. Detection:
 If a probe returns to the initiator (i.e., the initiator’s ID is already in the
probe’s list), a deadlock is detected.

Key Points
 Handles hyperedges: Tracks all possible wait chains.
 Probe lists: Prevents redundant probe forwarding and false positives.

4. Example (AND Model)


Suppose we have three processes and resources:
 P1 waits for P2
 P2 waits for P3
 P3 waits for P1

Probe Flow
Step 1: P1 initiates the probe ⟨P1, P1, P2⟩.
Step 2: P2 forwards ⟨P1, P2, P3⟩.
Step 3: P3 forwards ⟨P1, P3, P1⟩.
Step 4: P1 receives the probe; a cycle is detected (deadlock).

5. Example (OR Model)


Suppose P1 waits for either P2 or P3:
 P1 sends probes to both P2 and P3.
 Each probe tracks the path it has taken, e.g., [P1, P2] and [P1, P3].
 If a probe returns to P1 via any chain, a deadlock is detected.
Important Points
 Probe messages are lightweight and only traverse wait chains.
 No global WFG construction is required.
 Scalable: Suitable for large distributed systems.

Deadlocks
Deadlocks in distributed systems are situations where a set of processes are each waiting
for resources held by others, forming a cycle of dependencies that prevents any of them from
proceeding. The modelling of deadlocks is essential to understand, detect, and resolve such
situations in distributed environments. The primary models used to represent deadlocks are
the single-resource (or edge) model, the AND model, the OR model, and the P-out-of-Q
model.
In the single-resource model, each resource can be held by only one process at a time, and
each process waits for exactly one resource. This model is often represented using a wait-for
graph (WFG), where nodes are processes and edges indicate waiting relationships. A
deadlock occurs if and only if there is a cycle in the WFG. The AND model generalizes this
by allowing a process to wait for multiple resources simultaneously, requiring all of them to
proceed. Here, a process is blocked until it acquires all the resources it requests, and deadlock
is represented by the presence of a cycle in the WFG where each process in the cycle is
waiting for resources held by the next process.
The OR model further extends deadlock modelling by allowing a process to proceed if it
acquires any one of several requested resources. In this case, the WFG may contain
hyperedges, as a process can be waiting for any member of a set of resources. Deadlock
detection in the OR model is more complex, as it involves finding knots or more intricate
structures than simple cycles. The P-out-of-Q model generalizes both the AND and OR
models, where a process can proceed if it acquires any P resources out of a requested Q.
Understanding these models is crucial for designing algorithms that can effectively detect and
resolve deadlocks in distributed systems.
Comparison of Distributed Mutual Exclusion Algorithms
Each algorithm below is compared on its type, message complexity per CS entry, synchronization delay, key data structure, fault tolerance, fairness, scalability, and notable drawbacks.
 Lamport's Algorithm: permission-based; 3(N-1) messages per CS entry; high synchronization delay; key data structure: timestamped request queue; fault tolerance: low; fair: yes; scalability: moderate; drawback: high message overhead.
 Ricart–Agrawala: permission-based; 2(N-1) messages; moderate synchronization delay; timestamped request queue; fault tolerance: low; fair: yes; scalability: moderate; fewer messages than Lamport's, but still O(N).
 Singhal's Algorithm: permission-based; O(N) messages (dynamic, often < 2N); moderate delay; dynamic request set; fault tolerance: low; fair: yes; scalability: high; reduces messages using dynamic request sets.
 Maekawa's Algorithm: quorum-based; O(√N) messages; low delay; quorum set; fault tolerance: low; fair: yes; scalability: high; can deadlock and requires careful quorum design.
 Suzuki–Kasami: token-based; O(N) messages in the worst case; low delay; token (LN array and queue); fault tolerance: low; fair: yes; scalability: high; token loss is problematic.
 Raymond's Algorithm: token-based; O(log N) messages; low delay; token passed along a tree structure; fault tolerance: moderate; fair: yes; scalability: high; efficient for large N, but tree management is required.
 Lodha–Kshemkalyani: permission-based; O(N) messages; moderate delay; fairness queue; fault tolerance: low; fair: yes; scalability: high; focuses on fairness.
Differences Between Token-based and Permission-based
Algorithms

Token-based Algorithms
In token-based algorithms, a unique token circulates among all the processes in the
distributed system. Only the process holding the token is allowed to enter its critical section.
When a process wishes to enter the critical section, it waits until it receives the token. After
exiting the critical section, the token is passed to the next requesting process. Examples
include Suzuki-Kasami’s and Raymond’s algorithms. Token loss or duplication can cause
issues, so token management is crucial.
Permission-based Algorithms
Permission-based algorithms require a process to obtain permission from a majority or all
other processes before entering the critical section. This is typically done by sending request
messages and waiting for replies (permissions) from the relevant processes. Mutual exclusion
is ensured because only one process can collect all the necessary permissions at a time.
Examples include Lamport’s and Ricart-Agrawala’s algorithms. These algorithms generally
involve more message exchanges, especially in large systems.
Important Points
 Token-based: Mutual exclusion is achieved by possession of a unique token; fewer
messages if requests are infrequent; sensitive to token loss.
 Permission-based: Mutual exclusion is achieved by collecting permissions from
others; more robust to message loss but higher message overhead.
 Scalability: Token-based algorithms often scale better in large systems with low
contention, while permission-based algorithms may suffer from high message
complexity as the number of processes increases.
 Fault Tolerance: Permission-based algorithms can handle failures more gracefully,
as token loss in token-based algorithms can halt progress until recovery.
Deadlock Detection System Model and Wait-for Graph

Deadlock Detection System Model


A deadlock occurs when a group of processes are each waiting for resources held by others in
the group, creating a cycle of dependencies where none can proceed. The deadlock detection
system model provides a formal framework to understand and detect such situations. This
model assumes a distributed environment where resources and processes may be located at
different sites, and processes may request and hold resources dynamically. The system must
monitor the state of resource allocation and process requests, typically using local agents or
coordinators at each site, to detect the presence of deadlocks.
Key aspects of the model include:
 Processes that may request and release resources at any time.
 Resources that can be held by one or more processes, depending on the resource type
(exclusive or shared).
 Communication between sites to exchange information about resource requests and
allocations, since the global state is not readily available at any single site.
 Detection algorithms that periodically or on-demand analyze the state to find
deadlocks.

Wait-for Graph (WFG)


The Wait-for Graph (WFG) is a fundamental tool used in deadlock detection. It is a directed
graph where:
 Nodes represent processes in the system.
 Edges (P₁ → P₂) indicate that process P₁ is waiting for a resource currently held by
process P₂.
In the WFG, a cycle indicates a potential deadlock (in the single-resource or AND model), as
it means a set of processes are waiting on each other in a circular chain, so none can proceed.
In more complex models (OR, P-out-of-Q), the WFG may include hyperedges or more
intricate structures, and deadlock detection involves finding knots or other patterns.
WFGs can be constructed locally at each site or globally by aggregating local graphs.
Distributed deadlock detection algorithms rely on exchanging WFG information or probe
messages to detect cycles that span multiple sites.
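Once a WFG (or an aggregated global WFG) is available, deadlock detection in the single-resource and AND models reduces to cycle detection. A minimal depth-first-search sketch is shown below, assuming the graph is given as an adjacency dictionary.

```python
# A minimal sketch of deadlock detection on a wait-for graph via depth-first search.
def has_deadlock(wfg):
    """wfg: dict mapping each process to the set of processes it is waiting for."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def dfs(p):
        color[p] = GREY                          # p is on the current DFS path
        for q in wfg.get(p, ()):
            if color.get(q, WHITE) == GREY:      # back edge: a cycle, hence a deadlock
                return True
            if color.get(q, WHITE) == WHITE and dfs(q):
                return True
        color[p] = BLACK                         # fully explored, no cycle through p
        return False

    return any(color.get(p, WHITE) == WHITE and dfs(p) for p in wfg)

# Example: P1 -> P2 -> P3 -> P1 is a cycle, so a deadlock is reported
print(has_deadlock({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}))   # True
```

In the OR and P-out-of-Q models a simple cycle is not sufficient evidence of deadlock, which is why those models require more involved checks such as knot detection.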
Summary of Key Points
 The deadlock detection system model defines how processes, resources, and
communication are organized for detecting deadlocks in distributed systems.
 The wait-for graph is the primary abstraction for representing process dependencies
and identifying deadlocks via cycles or knots.
 Efficient deadlock detection relies on periodically analyzing the WFG, either locally
or globally, and using distributed algorithms to cope with the lack of a centralized
global state.
UNIT IV - CONSENSUS AND RECOVERY

Consensus and Agreement Algorithms: Problem Definition – Overview of Results –
Agreement in a Failure-Free System (Synchronous and Asynchronous) – Agreement in
Synchronous Systems with Failures; Checkpointing and Rollback Recovery: Introduction –
Background and Definitions – Issues in Failure Recovery – Checkpoint-based Recovery –
Coordinated Checkpointing Algorithm -- Algorithm for Asynchronous Checkpointing and
Recovery
Course Outcome
CO4: Apply working model of consensus and reliability of distributed systems (K3)
Course Objective
To elucidate agreement protocols and fault tolerance mechanisms in distributed systems
Important Topics
1. Consensus and Agreement Problem Definition & Results
2. Agreement in Synchronous and Asynchronous Systems (with and without Failures)
3. Checkpointing and Rollback Recovery (Types & Algorithms)
UNIT - IV

Consensus and Agreement Algorithms: Problem Definition
Consensus and agreement algorithms address the fundamental challenge of getting a group of
distributed processes to agree on a single data value, even in the face of unreliable
communication and possible process failures. The classic consensus problem requires that all
non-faulty processes eventually decide on the same value, and that value must be one that
was proposed by some process. This is crucial for tasks like committing transactions, leader
election, or state machine replication in distributed systems.
The problem is defined by three properties: Termination (every correct process eventually
decides), Agreement (no two correct processes decide differently), and Validity (if all
propose the same value, that value is chosen). Achieving consensus is straightforward in ideal
conditions but becomes challenging under asynchrony and failures, making it a central
problem in distributed computing theory.
Important Points:
 Consensus ensures all non-faulty processes agree on a single value.
 Three properties: Termination, Agreement, Validity.
 Fundamental for reliability in distributed systems.
 Complexity increases with failures and asynchrony.

Overview of Results
Research in consensus has led to several key results. In synchronous systems (where message
delivery and process speed are bounded), consensus can be achieved even if some processes
fail. In asynchronous systems (where there are no timing guarantees), the famous FLP result
(Fischer, Lynch, Paterson) proves that no deterministic algorithm can guarantee consensus if
even one process may fail.
Despite the FLP impossibility, practical systems use randomized algorithms, failure
detectors, or partial synchrony assumptions to circumvent theoretical limitations. These
results shape the design of real-world distributed systems, balancing between theoretical
guarantees and practical needs.
Important Points:
 Consensus is possible in synchronous systems with bounded failures.
 FLP result: Impossible to guarantee consensus in fully asynchronous systems with
one faulty process.
 Practical systems use randomization or failure detectors.
 Results guide protocol design in real systems.
Agreement in a Failure-Free System (Synchronous and Asynchronous)
In failure-free systems, achieving agreement is more straightforward. In synchronous
systems, processes can exchange messages in rounds and easily coordinate to agree on a
value, since message delays and process execution times are bounded. Each process can wait
for messages from all others before deciding.
In asynchronous systems without failures, processes can still reach agreement, but they must
be designed to handle arbitrary message delays. Since there are no failures, all messages will
eventually be delivered, and all processes will participate in the protocol, ensuring agreement.
Important Points:
 Synchronous, failure-free: Agreement is simple using rounds of communication.
 Asynchronous, failure-free: Agreement possible, but must handle unpredictable
message delays.
 No failures mean all processes participate and all messages are delivered.
 Simpler algorithms suffice compared to failure-prone systems.

Agreement in Synchronous Systems with Failures


In synchronous systems where failures can occur, consensus algorithms must tolerate a
certain number of faulty processes. These algorithms typically operate in rounds, where each
process collects information from others and updates its state based on majority or quorum
rules. The classic result is that consensus is possible if the number of faulty processes is less
than one-third of the total (for Byzantine failures) or less than half (for crash failures).
The system’s synchrony allows processes to detect failures (e.g., by missing messages in a
round), enabling them to proceed with the consensus protocol despite some processes being
unresponsive or malicious.
Important Points:
 Synchronous systems can tolerate a bounded number of failures.
 Algorithms use rounds and majority/quorum rules.
 Consensus possible if faulty processes are less than a threshold.
 Synchrony helps detect and handle failures.
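As a rough illustration of the round-based idea, the sketch below simulates f+1 rounds of value flooding with clean crash failures (a crashed process simply stops sending). This is an illustrative assumption: real executions run the rounds over the network and must also handle crashes that occur mid-broadcast, which this simplified simulation does not model.

```python
# A simplified, single-place simulation of round-based flooding consensus for crash failures.
def flooding_consensus(initial_values, crashed=frozenset(), f=1):
    """initial_values: dict pid -> proposed value; crashed: pids that never send."""
    known = {p: {v} for p, v in initial_values.items()}      # values each process has seen
    for _ in range(f + 1):                                    # f+1 rounds tolerate up to f crashes
        snapshot = {p: set(vals) for p, vals in known.items()}
        for sender, vals in snapshot.items():
            if sender in crashed:
                continue                                      # crashed processes send nothing
            for receiver in known:
                known[receiver] |= vals                       # broadcast known values to everyone
    # every correct process applies the same deterministic rule to the same final set
    return {p: min(vals) for p, vals in known.items() if p not in crashed}

# Example: process 2 crashes; the correct processes 1 and 3 still decide the same value
print(flooding_consensus({1: "a", 2: "b", 3: "b"}, crashed={2}, f=1))
```

The key point carried over from the theory is the f+1 rounds: with at most f crashes, at least one round completes without a new crash, after which all correct processes hold the same set of values and therefore decide identically.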
Checkpointing and Rollback Recovery: Introduction
Checkpointing and rollback recovery are techniques for ensuring fault tolerance in distributed
systems. The idea is to periodically save the state of each process (a checkpoint), so that if a
failure occurs, the system can roll back to a consistent set of checkpoints and resume
execution without losing all progress.
This approach is essential in long-running distributed applications, where failures are
inevitable. By recovering from the last consistent checkpoint, the system avoids restarting
from scratch, thus improving reliability and efficiency.
Important Points:
 Checkpointing saves process states for recovery after failures.
 Rollback recovery restores the system to a consistent state.
 Essential for fault tolerance in distributed systems.
 Reduces loss of computation due to failures.

Background and Definitions


A checkpoint is a saved state of a process at a particular point in time. A consistent global
checkpoint is a collection of checkpoints, one from each process, such that the system could
have been in that state during normal execution. Rollback recovery uses these checkpoints to
restore the system after a failure.
Key definitions also include orphan messages (messages received after the last checkpoint
but sent before it) and the domino effect (when recovery cascades back to the beginning of
computation). Understanding these terms is crucial for designing effective checkpointing
protocols.
Important Points:
 Checkpoint: Saved state of a process.
 Consistent global checkpoint: Set of checkpoints that form a valid system state.
 Orphan message: Received after checkpoint, sent before it.
 Domino effect: Recovery cascades, causing extensive rollback.
Issues in Failure Recovery
Failure recovery in distributed systems faces several challenges. Ensuring that the system
rolls back to a consistent global state is critical; otherwise, processes may have inconsistent
views, leading to errors. Avoiding the domino effect, minimizing performance overhead, and
managing orphan messages are key concerns.
Another issue is determining when and how often to take checkpoints. Frequent checkpoints
incur overhead, while infrequent checkpoints increase the amount of lost work after a failure.
Efficient algorithms balance these trade-offs for optimal performance and reliability.
Important Points:
 Consistency of global state is crucial during recovery.
 Domino effect and orphan messages must be avoided.
 Checkpoint frequency affects overhead and recovery efficiency.
 Trade-off between performance and reliability.

Checkpoint-based Recovery
Checkpoint-based recovery involves restoring the system to a previously saved consistent
state after a failure. There are two main approaches: uncoordinated checkpointing, where
each process saves its state independently, and coordinated checkpointing, where all
processes synchronize to take a consistent global checkpoint.
While uncoordinated checkpointing is simple, it risks inconsistency and the domino effect.
Coordinated checkpointing ensures a consistent recovery point but requires synchronization
among processes, which may introduce delays.
Important Points:
 Recovery uses saved checkpoints to restore system state.
 Uncoordinated: Simple but risks inconsistency.
 Coordinated: Ensures consistency, avoids domino effect.
 Choice depends on application requirements.
Coordinated Checkpointing Algorithm

A coordinated checkpointing algorithm ensures that all processes in a distributed system take
checkpoints in a manner that guarantees a consistent global state. Typically, a process
initiates the checkpoint, and all other processes are notified to take checkpoints, either
immediately or after completing certain actions.
This coordination prevents the domino effect and ensures that, after a failure, the system can
recover to a state where all processes are consistent with each other. The main challenge is to
minimize the performance overhead and the delay caused by synchronization.
Important Points:
 All processes synchronize to take checkpoints.
 Prevents domino effect and ensures global consistency.
 Initiation and coordination can be centralized or distributed.
 Overhead and delays must be managed.
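The two-phase structure described above (tentative checkpoints first, then a commit or abort decision) can be sketched as follows; the Participant class and method names are assumptions made for illustration, loosely in the style of two-phase coordinated checkpointing protocols.

```python
# A minimal, simulated sketch of two-phase coordinated checkpointing.
class Participant:
    def __init__(self, name):
        self.name = name
        self.state = {"work_done": 0}       # the process's current application state
        self.tentative = None
        self.permanent = None

    def take_tentative_checkpoint(self):
        self.tentative = dict(self.state)   # phase 1: save a tentative checkpoint
        return True                         # report success back to the coordinator

    def commit(self):
        self.permanent = self.tentative     # phase 2: make the checkpoint permanent
        self.tentative = None

    def abort(self):
        self.tentative = None               # phase 2: discard the tentative checkpoint

def coordinated_checkpoint(coordinator, participants):
    everyone = [coordinator] + participants
    # Phase 1: everyone (coordinator included) takes a tentative checkpoint
    ok = all(p.take_tentative_checkpoint() for p in everyone)
    # Phase 2: commit only if all succeeded, otherwise abort everywhere
    for p in everyone:
        p.commit() if ok else p.abort()
    return ok

# Example: one coordinator and two participants agree on a consistent checkpoint
print(coordinated_checkpoint(Participant("P0"), [Participant("P1"), Participant("P2")]))
```

Because checkpoints become permanent only when every process succeeded, the committed set always forms a consistent global state, which is what rules out the domino effect.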

Algorithm for Asynchronous Checkpointing and Recovery


Asynchronous checkpointing allows processes to take checkpoints independently, without
global coordination. This increases flexibility and reduces synchronization overhead but
introduces the challenge of ensuring that a consistent global state can still be constructed
during recovery.
To address this, algorithms track dependencies between checkpoints and use message logging
or dependency vectors. During recovery, the system uses this information to select a set of
checkpoints that together form a consistent global state, ensuring correctness without the
need for all processes to roll back to their latest checkpoint.
Important Points:
 Processes checkpoint independently, increasing flexibility.
 Dependency tracking is required for consistency.
 Message logging or dependency vectors are used.
 Recovery assembles a consistent global state from independent checkpoints.
Consensus and Agreement Problem Definition

The consensus problem is a fundamental issue in distributed computing, where a collection
of processes (or nodes) must agree on a single value, even in the presence of failures and
unreliable communication. The problem is formally defined by three properties:
1. Termination: Every non-faulty (correct) process must eventually decide on some
value.
2. Agreement: No two correct processes decide on different values.
3. Validity: If all correct processes propose the same value, then any decided value must
be that value.
Consensus is essential for many distributed system tasks, such as atomic commit in
databases, leader election, state machine replication, and reliable broadcast. Achieving
consensus ensures consistency and coordination among distributed processes, which is vital
for system reliability.
Overview of Results
The solvability of consensus depends on the system model, particularly synchrony and failure
assumptions. In synchronous systems (where message delivery and process execution times
are bounded), consensus can be achieved even if a limited number of processes fail. For
example, in the presence of crash failures, consensus is possible if less than half the processes
can fail; for Byzantine (arbitrary) failures, consensus is possible if less than one-third of
processes are faulty.
However, in asynchronous systems (where there are no timing guarantees), the famous FLP
impossibility result (Fischer, Lynch, Paterson, 1985) proves that no deterministic algorithm
can guarantee consensus if even a single process may fail by crashing. This result is a
cornerstone of distributed computing theory and has driven the development of alternative
approaches, such as randomized algorithms, unreliable failure detectors, and protocols that
assume partial synchrony.
Significance and Implications
Despite the theoretical limitations, consensus is crucial for practical distributed systems,
including distributed databases, coordination services, and blockchain protocols. To
overcome the impossibility in asynchronous systems, practical algorithms like Paxos and Raft
use randomization, leader election, or additional timing assumptions. These protocols ensure
that distributed systems can provide consistency and reliability, even when some components
fail or messages are delayed. Understanding the consensus problem and its theoretical results
is essential for designing robust, fault-tolerant distributed applications.
Important Points
 Consensus requires: Termination, Agreement, and Validity among all non-faulty
processes.
 FLP impossibility result: In asynchronous systems, deterministic consensus is
impossible with even one crash failure.
 Synchronous systems: Consensus is possible with bounded failures (less than half
for crash, less than one-third for Byzantine).
 Practical systems: Use randomization, failure detectors, or partial synchrony to
achieve consensus despite theoretical barriers.

Agreement in Synchronous and Asynchronous Systems (with and without Failures)

Agreement in Failure-Free Systems


In a failure-free synchronous system, processes can reach agreement easily because the
system operates in rounds, and all messages sent in a round are guaranteed to be delivered
before the next round begins. Each process can wait to receive messages from all other
processes, ensuring that everyone has the same information before making a decision. This
predictable timing allows for straightforward consensus algorithms, where all processes can
propose values, exchange them, and deterministically agree on a single value (often the
minimum, maximum, or a majority value).
In a failure-free asynchronous system, agreement is still possible, but the lack of timing
guarantees makes the process more complex. Since messages can be delayed arbitrarily,
processes cannot distinguish between a slow process and a failed one. However, as long as all
processes are correct and all messages are eventually delivered, consensus can be achieved
using algorithms that repeatedly exchange information until all processes converge on the
same value. The main challenge is handling unpredictable message delays, but the absence of
failures simplifies the overall agreement process.
Agreement in Systems with Failures
In synchronous systems with failures, consensus is still achievable, but the number of
failures that can be tolerated is limited. If processes can crash (crash failures), consensus is
possible as long as fewer than half of the processes fail. If processes can behave arbitrarily
(Byzantine failures), consensus is possible if fewer than one-third of the processes are faulty.
Synchronous rounds allow processes to detect missing messages (indicating failures) and use
majority or quorum-based rules to reach agreement despite some processes being
unresponsive or malicious.

In asynchronous systems with failures, the problem becomes significantly harder. The
famous FLP impossibility result (Fischer, Lynch, Paterson) proves that no deterministic
consensus algorithm can guarantee agreement if even a single process may crash. This is
because, in an asynchronous system, it is impossible to distinguish between a crashed process
and one whose messages are simply delayed. As a result, practical consensus protocols for
asynchronous systems often rely on randomization, unreliable failure detectors, or partial
synchrony assumptions to circumvent the impossibility.
Important Points
 Synchronous, failure-free: Agreement is easy and deterministic due to predictable
message delivery.
 Asynchronous, failure-free: Agreement is possible, but unpredictable message
delays require more complex algorithms.
 Synchronous with failures: Consensus is possible with bounded failures (less than
half for crash, less than one-third for Byzantine).
 Asynchronous with failures: Deterministic consensus is impossible (FLP result);
practical solutions use randomization or failure detectors.
Checkpointing and Rollback Recovery (Types & Algorithms)

Checkpointing and rollback recovery are essential fault-tolerance techniques in distributed
systems that allow processes to recover from failures without losing significant computation.
The main idea is to periodically save the state of each process (a checkpoint), so that in the
event of a failure, the system can roll back to a consistent set of checkpoints and resume
execution. This approach is crucial in long-running applications, scientific computations, and
distributed databases, where restarting from scratch after a failure would be costly or
impractical.
There are three main types of checkpointing: uncoordinated, coordinated, and
communication-induced. In uncoordinated checkpointing, each process independently
decides when to take a checkpoint. This approach is simple but can lead to the domino effect,
where a failure causes a cascade of rollbacks all the way to the beginning of the computation,
resulting in significant loss of work. Coordinated checkpointing requires all processes to
synchronize and take checkpoints together, ensuring a consistent global state and avoiding
the domino effect. However, it introduces synchronization overhead and may temporarily
pause the system. Communication-induced checkpointing is a hybrid approach, where
processes take local checkpoints and are sometimes forced to take additional checkpoints
based on message patterns, ensuring consistency without full coordination.
Rollback recovery algorithms are designed to restore the system to a consistent state after a
failure. In uncoordinated checkpointing, recovery can be complex, as it may be difficult to
find a set of checkpoints that form a consistent global state. In coordinated checkpointing,
recovery is straightforward because the last set of coordinated checkpoints is always
consistent. Communication-induced checkpointing algorithms use dependency tracking to
ensure that, even with mostly independent checkpoints, a consistent recovery line can be
found. Some systems also use log-based rollback recovery, where messages are logged so
that lost messages can be replayed during recovery, further increasing reliability.
The choice of checkpointing and rollback strategy involves trade-offs between performance,
storage overhead, and recovery complexity. Coordinated checkpointing is widely used in
practice due to its simplicity and reliability, especially in systems where the cost of pausing
all processes briefly is acceptable. Asynchronous (uncoordinated) checkpointing is more
flexible but requires sophisticated algorithms to avoid inconsistencies and the domino effect.
Communication-induced checkpointing offers a compromise, reducing synchronization
requirements while still providing reliable recovery.

Checkpointing and Rollback Recovery


Below is a conceptual diagram illustrating checkpoints taken by three processes and how rollback recovery works in a distributed system:
Time →
P1: |---C1---|---C2---|---C3---|
P2: |----C1----|----C2----|----C3----|
P3: |--C1--|--C2--|--C3--|

C1, C2, C3 = Checkpoints


If a failure occurs after C2, the system rolls back to the latest consistent set of checkpoints
(C2 for all processes).
Important Points

1. Checkpointing is the process of saving the state of a process so it can be restored after
a failure.
2. Rollback recovery uses checkpoints to restore the system to a consistent state after a
failure.
3. Uncoordinated checkpointing is simple but can cause the domino effect, leading to
extensive rollback.
4. Coordinated checkpointing synchronizes all processes for checkpointing, ensuring a
consistent global state.
5. Communication-induced checkpointing forces additional checkpoints based on
message patterns to maintain consistency.
6. Domino effect occurs when recovery from a failure causes a cascade of rollbacks to
the initial state.
7. Consistent global checkpoint is a set of checkpoints (one per process) that could have
occurred together in a valid execution.
8. Log-based rollback recovery involves logging messages so they can be replayed
during recovery.
9. Coordinated checkpointing is preferred in practice for its simplicity and reliability.
10. Trade-offs in checkpointing strategies involve balancing performance, storage, and
recovery complexity.
UNIT V - CLOUD COMPUTING

Definition of Cloud Computing – Characteristics of Cloud – Cloud Deployment Models –
Cloud Service Models – Driving Factors and Challenges of Cloud – Virtualization – Load
Balancing – Scalability and Elasticity – Replication – Monitoring – Cloud Services and
Platforms: Compute Services – Storage Services – Application Services

Course Outcome
CO5: Explain the fundamentals of cloud computing (K2)
Course Objective
To explain the cloud computing models and the underlying concepts
Important Topics
1. Cloud Deployment and Service Models
2. Characteristics and Challenges of Cloud Computing
3. Virtualization and Its Role in Cloud Computing
4. Load Balancing, Scalability, and Elasticity in the Cloud
5. Cloud Services and Platforms: Compute, Storage, and Application Services
UNIT - V

Definition of Cloud Computing


Cloud computing is a paradigm that enables on-demand access to a shared pool of
configurable computing resources (such as servers, storage, applications, and services) over
the internet. These resources can be rapidly provisioned and released with minimal
management effort, providing scalability, flexibility, and cost-efficiency. Cloud computing
abstracts the underlying infrastructure, allowing users to focus on their applications rather
than hardware or software maintenance.
Cloud computing models deliver IT services as utilities, similar to electricity or water, and
are typically billed on a pay-as-you-go basis. This approach democratizes access to powerful
computing resources, making them available to individuals, businesses, and organizations of
all sizes.
Important Points:
 On-demand, internet-based access to shared computing resources.
 Abstracts infrastructure, enabling focus on applications.
 Pay-as-you-go billing model.
 Promotes scalability and flexibility.

Characteristics of Cloud
Cloud computing exhibits several defining characteristics. These include on-demand self-
service, broad network access, resource pooling, rapid elasticity, and measured service. On-
demand self-service allows users to provision resources automatically, without human
intervention. Broad network access ensures services are available over the internet from
various devices.
Resource pooling enables providers to serve multiple customers with dynamically assigned
resources, while rapid elasticity allows quick scaling up or down based on demand. Measured
service means resource usage is monitored, controlled, and billed transparently.
Important Points:
 On-demand self-service and broad network access.
 Resource pooling and multi-tenancy.
 Rapid elasticity and scalability.
 Measured, usage-based billing.
Cloud Deployment Models
Cloud deployment models define how cloud services are made available to users. The main
models are public, private, hybrid, and community clouds. Public clouds are owned and
operated by third-party providers and offer services to the general public. Private clouds are
operated exclusively for a single organization, providing greater control and security.
Hybrid clouds combine public and private clouds, enabling data and application portability.
Community clouds are shared by several organizations with common interests or
requirements, such as security or compliance needs.
Important Points:
 Public, private, hybrid, and community models.
 Public: Open to all; Private: Exclusive to one organization.
 Hybrid: Combines public and private benefits.
 Community: Shared by organizations with similar needs.
Cloud Service Models
Cloud service models describe the level of abstraction and control provided to users. The
three main models are Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and
Software as a Service (SaaS). IaaS provides virtualized computing resources over the
internet, such as servers and storage. PaaS offers a platform for developing, testing, and
deploying applications without managing underlying infrastructure.
SaaS delivers fully functional applications accessible via web browsers, eliminating the need
for local installation or maintenance. Each model offers different levels of control, flexibility,
and management.
Important Points:
 IaaS: Infrastructure resources (VMs, storage).
 PaaS: Application development platforms.
 SaaS: End-user applications delivered online.
 Varying levels of user control and provider management.
Driving Factors and Challenges of Cloud
The adoption of cloud computing is driven by factors such as cost savings, scalability,
flexibility, and ease of access. Organizations can avoid upfront hardware investments and
scale resources as needed. Cloud also supports global collaboration and disaster recovery.
However, challenges include data security and privacy, vendor lock-in, compliance with
regulations, and ensuring consistent performance. Addressing these challenges is crucial for
widespread cloud adoption.
Important Points:
 Cost efficiency and scalability drive adoption.
 Supports flexibility and disaster recovery.
 Challenges: Security, privacy, compliance, vendor lock-in.
 Performance and reliability concerns.

Virtualization
Virtualization is the technology that enables multiple virtual instances of computing
resources to run on a single physical machine. It decouples hardware from the software,
allowing better utilization of resources and isolation between different workloads.
Hypervisors or virtual machine monitors (VMMs) manage these virtual environments.
Virtualization is the backbone of cloud computing, enabling resource pooling, scalability, and
flexible provisioning. It also enhances security and fault isolation between tenants.
Important Points:
 Enables multiple VMs on a single physical host.
 Managed by hypervisors or VMMs.
 Improves resource utilization and isolation.
 Foundation for cloud resource pooling.

Load Balancing
Load balancing distributes incoming network traffic or workloads across multiple servers or
resources to ensure no single resource is overwhelmed. This improves system responsiveness,
maximizes resource utilization, and enhances fault tolerance.
Cloud platforms use load balancing to support high availability and reliability. Load
balancers can be hardware-based, software-based, or provided as a cloud service.
Important Points:
 Distributes workloads evenly across resources.
 Enhances performance, availability, and fault tolerance.
 Can be hardware, software, or cloud-based.
 Essential for scalable cloud applications.
Scalability and Elasticity
Scalability refers to a system’s ability to handle increasing workloads by adding resources,
while elasticity is the ability to automatically adjust resources as demand fluctuates. Cloud
platforms provide both vertical (scaling up) and horizontal (scaling out) scalability.
Elasticity ensures cost-effectiveness by allocating resources only when needed, a key
advantage of cloud computing. It allows systems to respond dynamically to workload
changes.
Important Points:
 Scalability: Add resources to meet increased demand.
 Elasticity: Automatic, dynamic resource adjustment.
 Supports cost efficiency and performance.
 Key to handling variable workloads.

Replication
Replication involves maintaining copies of data or services across multiple servers or
locations to ensure availability, reliability, and fault tolerance. In cloud environments,
replication supports disaster recovery and load sharing.
Data replication can be synchronous (real-time updates) or asynchronous (periodic updates).
It is critical for high availability and business continuity.
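To make the distinction concrete, the short Python sketch below (illustrative only; the Replica class and its apply method are hypothetical) contrasts a synchronous write, which returns only after every copy is updated, with an asynchronous write, which acknowledges as soon as the primary commits and propagates the update in the background.

import threading

class Replica:
    """Hypothetical replica holding a copy of the data."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value   # persist the update on this copy
        return True               # acknowledge the write

def synchronous_write(primary, replicas, key, value):
    # Acknowledged only after every replica confirms, so all copies match.
    primary.apply(key, value)
    return all(r.apply(key, value) for r in replicas)

def asynchronous_write(primary, replicas, key, value):
    # Acknowledged once the primary commits; replicas catch up later.
    primary.apply(key, value)
    for r in replicas:
        threading.Thread(target=r.apply, args=(key, value)).start()
    return True

primary = Replica("primary")
backups = [Replica("dr-site-1"), Replica("dr-site-2")]
synchronous_write(primary, backups, "order:42", "confirmed")
asynchronous_write(primary, backups, "order:43", "pending")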
Important Points:
 Maintains multiple copies of data/services.
 Enhances availability and fault tolerance.
 Supports disaster recovery.
 Synchronous and asynchronous methods.

Monitoring
Monitoring in cloud computing involves tracking the performance, availability, and usage of
cloud resources and applications. It helps detect anomalies, optimize resource allocation, and
ensure service level agreements (SLAs) are met.
Cloud providers offer monitoring tools and dashboards that provide real-time insights and
alerts, enabling proactive management and troubleshooting.
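As a minimal illustration (Python; the metric source and the 80% threshold are assumptions, not any provider's API), a monitoring loop can poll a resource metric and raise an alert when it crosses a configured limit:

import random
import time

ALERT_THRESHOLD = 80.0   # hypothetical CPU limit tied to an SLA target
POLL_INTERVAL = 1        # seconds between samples

def get_cpu_utilization():
    # Stand-in for a real metrics API exposed by a cloud monitoring service.
    return random.uniform(0, 100)

def monitor(samples=5):
    for _ in range(samples):
        cpu = get_cpu_utilization()
        if cpu > ALERT_THRESHOLD:
            print(f"ALERT: CPU at {cpu:.1f}% exceeds {ALERT_THRESHOLD}%")
        else:
            print(f"OK: CPU at {cpu:.1f}%")
        time.sleep(POLL_INTERVAL)

monitor()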
Important Points:
 Tracks resource usage and application performance.
 Enables SLA compliance and optimization.
 Provides real-time alerts and analytics.
 Essential for cloud management and troubleshooting.

Compute Services
Compute services in the cloud provide virtualized processing power, such as virtual
machines, containers, or serverless functions. These services allow users to run applications
and workloads without managing physical hardware.
Major cloud providers offer compute services like AWS EC2, Google Compute Engine, and
Azure Virtual Machines, supporting various operating systems and configurations.
Important Points:
 Provides virtualized CPU and memory resources.
 Supports VMs, containers, and serverless computing.
 Flexible configuration and scaling.
 Core component of cloud infrastructure.

Storage Services
Cloud storage services offer scalable, durable, and accessible storage solutions for data, files,
and backups. They support object, block, and file storage models, catering to different
application needs.
Examples include Amazon S3, Google Cloud Storage, and Azure Blob Storage. These
services ensure data durability, redundancy, and global accessibility.
Important Points:
 Scalable and durable storage solutions.
 Supports object, block, and file storage.
 Ensures data redundancy and availability.
 Accessible from anywhere with proper authorization.
Application Services
Application services in the cloud deliver ready-to-use software or platforms for application
development, deployment, and management. These include databases, messaging services,
analytics platforms, and AI/ML services.
Cloud application services reduce development time, simplify maintenance, and offer
integration with other cloud resources, enabling rapid innovation.
Important Points:
 Ready-to-use software and development platforms.
 Includes databases, analytics, AI/ML, messaging.
 Reduces development and maintenance effort.
 Facilitates rapid deployment and integration.
Characteristics and Challenges of Cloud Computing

Introduction
Cloud computing is a transformative paradigm in information technology that enables on-
demand access to shared computing resources over the internet. Its rapid adoption is driven
by its ability to deliver scalable, flexible, and cost-effective IT services. However, while
cloud computing offers numerous benefits, it also introduces a set of unique challenges that
must be addressed to fully realize its potential.
Key Characteristics of Cloud Computing
1. On-Demand Self-Service:
Cloud computing allows users to provision computing capabilities, such as server
time and network storage, automatically without requiring human interaction with
each service provider. This empowers users to manage resources as needed, leading to
increased agility and efficiency.
2. Broad Network Access:
Services are available over the network and accessed through standard mechanisms
(e.g., web browsers, mobile apps), supporting use by heterogeneous client platforms
such as smartphones, tablets, laptops, and desktops.
3. Resource Pooling:
The provider’s computing resources are pooled to serve multiple consumers using a
multi-tenant model. Physical and virtual resources are dynamically assigned and
reassigned according to consumer demand, leading to efficient resource utilization.
4. Rapid Elasticity:
Capabilities can be elastically provisioned and released, often automatically, to scale
rapidly outward and inward commensurate with demand. To the consumer, the
available resources often appear to be unlimited.
5. Measured Service:
Cloud systems automatically control and optimize resource use by leveraging a
metering capability. Resource usage can be monitored, controlled, and reported,
providing transparency for both the provider and consumer.
6. Scalability and Flexibility:
Cloud platforms can scale horizontally (adding more machines) or vertically (adding
resources to existing machines) to accommodate varying workloads, making them
ideal for businesses with fluctuating demands.

Major Challenges of Cloud Computing

1. Security and Privacy:


Since data and applications reside on remote servers, ensuring the confidentiality,
integrity, and availability of data is a major concern. Threats such as data breaches,
unauthorized access, and cyber-attacks must be mitigated through robust security
policies, encryption, and access controls.

2. Compliance and Legal Issues:


Organizations must comply with various regulatory requirements (like GDPR,
HIPAA) regarding data storage and processing. Ensuring that cloud providers adhere
to these regulations and provide necessary audit trails is challenging.

3. Vendor Lock-In:
Moving applications and data between different cloud providers can be complex due
to proprietary technologies, APIs, and data formats. This lack of interoperability can
restrict flexibility and increase long-term costs.

4. Performance and Reliability:


Cloud applications depend on network connectivity and the reliability of the
provider’s infrastructure. Outages, latency, and bandwidth limitations can impact
application performance and availability.

5. Cost Management:
While cloud can reduce capital expenditure, improper management of resources can
lead to unexpected operational expenses. Monitoring usage and optimizing resource
allocation is essential for cost control.

6. Data Transfer and Bandwidth:
Transferring large volumes of data to and from the cloud can be time-consuming and
expensive, especially for organizations with limited bandwidth or in regions with poor
connectivity.

7. Limited Control and Flexibility:


Users often have limited control over the underlying infrastructure and may not be
able to customize the environment to the same extent as with on-premises solutions.

Conclusion
Cloud computing’s characteristics—such as on-demand self-service, broad network access,
resource pooling, rapid elasticity, and measured service—make it an attractive solution for
modern IT needs. However, organizations must address significant challenges, including
security, compliance, vendor lock-in, performance, and cost management, to leverage cloud
computing effectively. A balanced approach, combining careful provider selection, robust
security practices, and strategic planning, is essential for successful cloud adoption.
Key Points to Remember
 Cloud characteristics: On-demand self-service, broad network access, resource
pooling, rapid elasticity, measured service, scalability.
 Security and privacy: Major concerns due to remote data storage and shared
resources.
 Compliance and vendor lock-in: Regulatory and interoperability issues can hinder
adoption.
 Performance, reliability, and cost management: Essential for ensuring value and
business continuity.
 Balanced strategy: Necessary to maximize benefits and minimize risks in cloud
computing.
Cloud Deployment and Service Models

Introduction
Cloud computing has revolutionized the way IT resources and services are delivered, offering
flexibility, scalability, and cost-effectiveness. The core of cloud computing lies in its
deployment and service models, which define how cloud resources are made available and
consumed. Understanding these models is fundamental for designing, deploying, and
managing cloud-based solutions.
Cloud Deployment Models
Cloud deployment models describe the nature of cloud environments and how services are
made available to users. The four primary deployment models are:

1. Public Cloud
The public cloud is owned and operated by third-party cloud service providers (such as AWS,
Microsoft Azure, or Google Cloud) and delivers computing resources over the Internet.
Resources are shared among multiple tenants (multi-tenancy), and users pay only for what
they use. Public clouds offer high scalability and cost efficiency, but may raise concerns
about data privacy and compliance.
2. Private Cloud
A private cloud is dedicated to a single organization. It can be physically located on-premises
or hosted by a third-party provider. Private clouds offer greater control, security, and
customization, making them suitable for organizations with strict regulatory or security
requirements. However, they may involve higher costs and maintenance efforts compared to
public clouds.
3. Hybrid Cloud
Hybrid cloud combines public and private clouds, allowing data and applications to be shared
between them. This model offers the flexibility to run sensitive workloads in the private
cloud while leveraging the public cloud for less critical resources or to handle peak demands.
Hybrid clouds enable organizations to optimize costs, performance, and security.
4. Community Cloud
A community cloud is shared by several organizations with common concerns, such as
security, compliance, or jurisdiction. The infrastructure may be managed by the organizations
or a third party. Community clouds are ideal for collaborative projects, research, or regulatory
environments.
Cloud Service Models

Cloud service models define the level of abstraction and management provided by the cloud
provider. The three main service models are:

1. Infrastructure as a Service (IaaS)


IaaS provides virtualized computing resources over the Internet, such as virtual machines,
storage, and networks. Users have control over operating systems and deployed applications
but do not manage the underlying hardware. IaaS is ideal for organizations seeking flexible,
scalable infrastructure without investing in physical hardware. Examples: Amazon EC2,
Microsoft Azure Virtual Machines.
2. Platform as a Service (PaaS)
PaaS offers a platform for developing, testing, and deploying applications without managing
the underlying infrastructure. It provides development tools, databases, middleware, and
runtime environments. Developers can focus on coding and deploying applications, while the
provider handles scalability, patching, and maintenance. Examples: Google App Engine,
Microsoft Azure App Service.

3. Software as a Service (SaaS)


SaaS delivers fully functional applications over the Internet, accessible via web browsers or
APIs. Users do not manage or control the underlying infrastructure or platform. SaaS is
subscription-based and eliminates the need for installation, maintenance, or upgrades.
Examples: Gmail, Salesforce, Microsoft Office 365.
Advantages and Use Cases
These models enable organizations to select the optimal balance of control, flexibility,
security, and cost. For example, a startup may choose public cloud IaaS for rapid scaling,
while a bank may use private cloud PaaS for secure application development. Hybrid models
are widely adopted for disaster recovery, regulatory compliance, and workload optimization.
Key Points to Remember
 Deployment models: Public, Private, Hybrid, Community—each with unique
benefits and trade-offs.
 Service models: IaaS (infrastructure), PaaS (platform), SaaS (software)—differ in
abstraction and user management.
 Hybrid and community models address specific regulatory, performance, or
collaboration needs.
 Choosing the right model depends on organizational goals, security, compliance,
and scalability requirements.
Virtualization and Its Role in Cloud Computing
Introduction
Virtualization is a foundational technology in cloud computing that enables the abstraction
of physical hardware resources—such as servers, storage, and networks—into multiple
virtual resources. This abstraction allows cloud providers to deliver scalable, flexible, and
efficient services to users, making cloud computing possible and practical.

What is Virtualization?
 Definition: Virtualization is the process of creating a virtual version of a physical
resource, such as a server, storage device, network, or operating system. It allows
multiple virtual machines (VMs) to run on a single physical machine, each operating
independently with its own OS and applications
 Key Component: The software that enables virtualization is called a hypervisor. It
manages the creation, execution, and isolation of VMs on the host hardware.

Types of Virtualization
 Server Virtualization: Multiple virtual servers run on a single physical server.
 Storage Virtualization: Combines multiple physical storage devices into a single virtual storage pool.
 Network Virtualization: Creates multiple virtual networks on a single physical network infrastructure.
 Desktop Virtualization: Provides virtual desktops to users, accessible from any device.

How Virtualization Works


 Hypervisor: The hypervisor sits between the hardware and the operating systems,
allowing multiple OS instances to share the same physical resources securely and
efficiently.
 Resource Pooling: Physical resources (CPU, memory, storage) are pooled and
allocated dynamically to VMs as needed.
 Isolation: Each VM operates in isolation, ensuring that issues in one VM do not
affect others.

Role of Virtualization in Cloud Computing


1. Resource Abstraction and Pooling
 Virtualization abstracts hardware resources, allowing cloud providers to pool and
allocate them dynamically to different users and applications

2. Multi-Tenancy and Isolation


 Multiple users (tenants) can share the same physical infrastructure securely, as each
VM is isolated from others. This is essential for public cloud environments

3. Scalability and Elasticity


 Virtual resources can be rapidly provisioned, scaled up or down, and de-provisioned
based on demand, supporting the elastic nature of cloud services
4. Efficient Resource Utilization
 Virtualization increases hardware utilization rates, reducing the need for excess
physical servers and lowering costs.

5. Disaster Recovery and High Availability


 VMs can be easily backed up, restored, or migrated between physical servers,
enabling robust disaster recovery and high availability solutions

6. Cost Savings
 By consolidating workloads onto fewer physical machines, organizations save on
hardware, power, cooling, and maintenance costs

7. Simplified Management
 Virtual machines can be easily created, cloned, moved, or deleted, simplifying IT
management and operations

Advantages of Virtualization in Cloud Computing


 Resource Optimization: Maximizes the use of physical hardware.
 Flexibility: VMs can be quickly deployed, resized, or migrated.
 Security: Isolates applications and users, reducing risk.
 Rapid Provisioning: New environments can be set up in minutes.
 Testing and Development: Safe, isolated environments for software testing.
 Environmental Benefits: Reduces energy consumption and physical space
requirements

Challenges and Considerations


 Performance Overhead: Some performance loss due to the hypervisor layer.
 Security Risks: Vulnerabilities in the hypervisor can affect all VMs.
 Management Complexity: Large-scale virtual environments require robust
management tools

 Hardware Virtualization: This is the most common type, where a hypervisor (also
known as a Virtual Machine Monitor or VMM) sits between the hardware and the
operating systems. The hypervisor manages the hardware resources and allocates
them to the VMs. Examples include VMware vSphere, Microsoft Hyper-V, and
KVM. Hardware virtualization can be further divided into:
o Full Virtualization: The hypervisor emulates the entire hardware
environment, allowing VMs to run unmodified operating systems.
o Para-virtualization: The guest operating system is modified to cooperate
with the hypervisor, improving performance.
 Operating System Virtualization: Also known as containerization, this type
virtualizes the operating system kernel, allowing multiple isolated user-space
instances (containers) to run on a single OS. Containers share the same OS kernel but
have their own file system, processes, and network interfaces. Examples include
Docker and Kubernetes.
 Application Virtualization: This involves encapsulating an application and its
dependencies into a single package that can be run on different operating systems
without requiring installation. This eliminates compatibility issues and simplifies
application deployment.
 Storage Virtualization: This aggregates multiple physical storage devices into a
single virtual storage pool, providing a unified view of storage resources. This
simplifies storage management, improves storage utilization, and enables features like
data replication and migration.
 Network Virtualization: This creates a virtual network infrastructure on top of a
physical network, allowing for the creation of isolated network segments, virtual
routers, and virtual firewalls. This enhances network security, flexibility, and
scalability.

Benefits of Virtualization
Virtualization offers a wide range of benefits, including:

 Improved Resource Utilization: Virtualization allows organizations to consolidate


multiple workloads onto fewer physical servers, increasing server utilization rates and
reducing hardware costs.
 Reduced Infrastructure Costs: By consolidating servers, virtualization reduces the
need for physical hardware, leading to lower capital expenditures (CAPEX) on
servers, networking equipment, and data centre space. It also lowers operational
expenditures (OPEX) related to power, cooling, and maintenance.
 Increased Agility and Flexibility: Virtualization enables rapid provisioning and
deployment of new VMs, allowing organizations to quickly respond to changing
business needs. It also simplifies disaster recovery and business continuity by
enabling easy replication and migration of VMs.
 Simplified Management: Virtualization management tools provide a centralized
interface for managing VMs, simplifying tasks such as provisioning, monitoring, and
patching.
 Enhanced Security: Virtualization can improve security by isolating VMs from each
other, preventing malware from spreading from one VM to another. Virtual networks
can also be used to create isolated network segments, further enhancing security.
 Improved Test and Development: Virtualization provides a cost-effective way to
create isolated test environments for software development and testing.
Virtualization and Cloud Computing
Virtualization is the cornerstone of cloud computing. Cloud service providers (CSPs) rely
heavily on virtualization to deliver their services. Without virtualization, cloud computing
would be significantly more expensive and less flexible.

Here's how virtualization enables cloud computing:

 Infrastructure as a Service (IaaS): IaaS providers use hardware virtualization to


offer virtualized computing resources, such as VMs, storage, and networking, to their
customers. Customers can provision and manage these resources on demand, paying
only for what they use.
 Platform as a Service (PaaS): PaaS providers use operating system virtualization
(containers) to provide a platform for developers to build, deploy, and manage
applications. Developers can focus on writing code without worrying about the
underlying infrastructure.
 Software as a Service (SaaS): SaaS providers use virtualization to host and deliver
applications to their customers over the internet. Customers can access these
applications from any device with an internet connection, without having to install or
manage them.

Virtualization enables the key characteristics of cloud computing:

 On-demand Self-service: Users can provision resources without requiring human


interaction with the service provider.
 Broad Network Access: Cloud services are accessible over a network from a wide
range of devices.
 Resource Pooling: Computing resources are pooled to serve multiple consumers
using a multi-tenant model.
 Rapid Elasticity: Resources can be rapidly and elastically provisioned and released,
scaling up or down as needed.
 Measured Service: Resource usage is monitored, controlled, and reported, providing
transparency for both the provider and consumer.
Load Balancing, Scalability, and Elasticity in the Cloud
Introduction
Cloud computing provides dynamic, on-demand access to computing resources. Three key
concepts that enable efficient and reliable cloud services are load balancing, scalability, and
elasticity. These features ensure optimal resource utilization, high availability, and the ability
to handle varying workloads.

Load Balancing in Cloud Computing


Definition:
Load balancing is the process of distributing workloads and computing resources across
multiple servers, virtual machines, or containers to optimize performance, maximize resource
utilization, and prevent any single resource from becoming a bottleneck.

How It Works:

 Distributes incoming network traffic or application requests across multiple servers.


 Can be implemented at various levels: network, application, and database.
 Ensures no single server is overwhelmed, improving system reliability and response
time.

Types of Load Balancing:

 Network Load Balancing: Distributes network traffic across servers.


 Application Load Balancing: Balances requests among application instances.
 Database Load Balancing: Spreads database queries across multiple database
servers.

Detailed Explanation of Load Balancing Types


1. Network Load Balancing (NLB)
Definition:
Network Load Balancing is a technique that distributes incoming network traffic across
multiple servers to ensure no single server becomes overwhelmed. It operates primarily at the
transport layer (Layer 4) of the OSI model, handling TCP/UDP traffic.

How It Works:

 NLB combines two or more servers running the same application into a single virtual
cluster.
 Incoming client requests are distributed across the servers (hosts) in the cluster.
 Each server runs a separate copy of the application.
 The cluster is addressed by a single virtual IP address, while each server maintains its
own unique IP.
 When a server fails or goes offline, NLB automatically redistributes the load among
the remaining servers.
 When the offline server is back online, it can rejoin the cluster transparently and
resume handling its share of the load.

Benefits:

 Enhances availability and scalability of mission-critical servers such as web, FTP,


VPN, and proxy servers.
 Provides fault tolerance by redistributing traffic when a server fails.
 Supports dynamic addition or removal of servers to handle changing loads.
 Can be configured to direct all traffic to a default host if needed.

Use Cases:

 Web server farms


 VPN gateways
 Firewall and proxy servers

Example Technologies:

 Microsoft Network Load Balancing (NLB)


 AWS Network Load Balancer (NLB)

2. Application Load Balancing (ALB)


Definition:
Application Load Balancing distributes incoming application-level traffic (HTTP/HTTPS)
across multiple application instances or services. It operates at the application layer (Layer 7)
of the OSI model.

How It Works:

 ALB acts as a single point of contact for clients.


 It inspects incoming requests and routes them to appropriate targets (e.g., EC2
instances, containers) based on configured rules.
 Routing decisions can be made based on URL path, host headers, HTTP methods,
query strings, or source IP.
 Supports advanced features like SSL termination, WebSocket support, and HTTP/2.
 Performs health checks on registered targets and routes traffic only to healthy
instances.
 Supports dynamic scaling by adding or removing targets without disrupting service.

Benefits:

 Enables content-based routing for microservices and containerized applications.
 Improves application availability and fault tolerance.
 Provides detailed metrics and monitoring for traffic and target health.
 Supports complex routing rules and multiple target groups.

Use Cases:

 Microservices architectures
 RESTful APIs
 Containerized applications

Example Technologies:

 AWS Application Load Balancer (ALB)


 Google Cloud Application Load Balancer

3. Database Load Balancing


Definition:
Database Load Balancing is the practice of distributing database queries and transactions
across multiple database servers to improve performance, reliability, and scalability.

How It Works:

 A load balancer sits between the application and multiple database servers.
 Incoming database requests are distributed based on various algorithms considering
server capacity, current load, and network latency.
 Read and write operations can be separated: read requests are distributed among read
replicas, while write requests go to the primary server.
 Load balancing can be implemented via DNS, hardware devices, software proxies, or
database proxies.
 Techniques like sharding (splitting the database into smaller parts) and replication
(creating read replicas) are often used in conjunction.

Common Load Balancing Algorithms (a short illustrative sketch follows this list):

 Round Robin: Requests are distributed evenly in a cyclic order.


 Least Connections: Requests are sent to the server with the fewest active connections.
 Weighted Round Robin: Servers are assigned weights based on capacity; requests are
distributed proportionally.
 IP Hash: Requests from a particular client IP are consistently routed to the same
server.
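The sketch below (plain Python; the server names are hypothetical) shows how Round Robin and Least Connections each pick a target for an incoming request.

import itertools

servers = ["db-node-1", "db-node-2", "db-node-3"]   # hypothetical backends

# Round Robin: hand out servers in a repeating cycle.
rotation = itertools.cycle(servers)
def pick_round_robin():
    return next(rotation)

# Least Connections: pick the server with the fewest active connections.
active = {s: 0 for s in servers}
def pick_least_connections():
    chosen = min(active, key=active.get)
    active[chosen] += 1          # the new request is now served here
    return chosen

for _ in range(4):
    print(pick_round_robin(), pick_least_connections())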

Benefits:

 Increases query throughput and reduces latency.


 Prevents any single database server from becoming a bottleneck.
 Enhances fault tolerance by rerouting traffic from failed servers.
 Supports scaling of database infrastructure as demand grows.
Use Cases:

 High-traffic web applications


 Large-scale data analytics platforms
 Cloud-based database services

Example Technologies:

 Database proxies (e.g., ProxySQL, Pgpool-II)


 Managed cloud database services with built-in load balancing (e.g., TencentDB for
MySQL, AWS RDS)

By implementing these types of load balancing, organizations can ensure high availability,
scalability, and efficient resource utilization across their network, application, and database
layers, thereby improving overall system performance and reliability.

Benefits:

 Improved Performance: Reduces response time by sharing the load.


 High Availability: Prevents single points of failure.
 Scalability: Makes it easier to add or remove resources as needed.
 Efficient Resource Utilization: Minimizes wastage and optimizes costs.

Challenges:

 Complexity: Requires careful configuration and management.


 Potential Single Point of Failure: If not implemented correctly, the load balancer
itself can become a failure point.
 Security Risks: Misconfiguration can expose vulnerabilities

Scalability in Cloud Computing


Definition:
Scalability is the ability of a cloud system to handle increasing workloads by adding
resources, such as servers or storage, without affecting performance.

Types of Scalability:

 Vertical Scalability (Scale Up): Increasing the capacity of existing resources (e.g.,
adding more CPU or memory to a server).
 Horizontal Scalability (Scale Out): Adding more instances or nodes to distribute the
workload.
Vertical Scalability (Scale Up)
Definition
Vertical scalability, also known as "scaling up," refers to the process of increasing the
capacity of an existing single server or resource by adding more power to it. This involves
upgrading its internal components to handle a larger workload or more requests.

When a system experiences increased demand, vertical scaling means enhancing the
specifications of the current machine. This can include:

 Adding more CPU cores: To process more instructions concurrently.


 Increasing RAM (memory): To allow for larger datasets to be held in memory,
speeding up access and enabling more concurrent operations.
 Upgrading to faster or larger storage (SSDs, NVMe drives): To improve I/O
operations and data access times.
 Upgrading to a more powerful processor: To execute tasks more quickly.

Characteristics
 Resource Enhancement: Focuses on improving the performance of a single node.
 Downtime Often Required: Scaling up typically requires taking the server offline to
install new hardware or apply configurations, leading to some service interruption.
 Hardware Limitations: There's an inherent physical limit to how much a single
server can be scaled up. Eventually, you hit a ceiling for CPU, RAM, or I/O capacity.
 Cost Implications: High-end, very powerful single machines can be significantly
more expensive than multiple smaller machines.
 Simplicity (for some applications): For applications not designed for distributed
environments, vertical scaling can be simpler to implement initially, as it doesn't
require changes to the application's architecture to handle multiple instances.

Advantages
 Simplicity of Management: Fewer instances to manage and monitor.
 Reduced Complexity: No need for distributed system complexities like load
balancing or data synchronization across multiple nodes.
 Suitable for Legacy Applications: Ideal for applications that are not designed to run
across multiple servers.

Disadvantages
 Single Point of Failure: If the single, powerful server fails, the entire application
goes down.
 Limited Scalability: There's an upper bound to how much you can scale up a single
machine.
 Inefficient Resource Utilization (potentially): Resources might be over-provisioned
during low-demand periods, leading to wasted capacity.
 Downtime for Upgrades: Most upgrades require a service interruption.

Use Cases
 Databases that are hard to shard (split across multiple servers).
 Monolithic applications not designed for horizontal scaling.
 Workloads with predictable, moderate growth that can be handled by a single, more
powerful server.
Horizontal Scalability (Scale Out)
Definition
Horizontal scalability, also known as "scaling out," involves increasing the capacity of a
system by adding more identical machines or instances that work together to handle the
workload. Instead of making a single server stronger, you add more servers to distribute the
load.

When demand increases, horizontal scaling means adding more servers (nodes) to a pool of
existing servers. These new servers can be virtual machines, containers, or physical
machines. The workload is then distributed among these multiple instances using a load
balancer.

 Adding Instances: New servers are brought online and added to the cluster or pool.
 Load Distribution: A load balancer (as discussed previously) directs incoming
requests to the most appropriate server in the pool.
 Distributed Processing: The application or service is designed to run across multiple
instances, often requiring a stateless architecture or shared state management (e.g.,
distributed databases, message queues).
 Auto-Scaling: In cloud environments, horizontal scaling is often automated through
auto-scaling groups that dynamically add or remove instances based on predefined
metrics (e.g., CPU utilization, request queue length).

Characteristics
 Resource Duplication: Focuses on adding more identical nodes to a system.
 High Availability: If one server fails, the others can continue to operate, ensuring
continuous service.
 No Upper Limit (theoretically): Servers can keep being added almost without bound,
making it suitable for massive growth.
 Cost-Effective (often): It can be more cost-effective to add many smaller, commodity
servers than one extremely powerful high-end server.
 Requires Distributed Architecture: Applications must be designed to work in a
distributed environment, often requiring statelessness, shared storage, and robust load
balancing.

Advantages
 High Availability and Fault Tolerance: Redundancy across multiple nodes ensures
that the failure of one does not bring down the entire system.
 Massive Scalability: Can handle virtually unlimited growth in demand.
 Cost Efficiency: Often more cost-effective to scale out using commodity hardware or
smaller cloud instances.
 No Downtime for Scaling: New instances can be added without interrupting service
to existing instances.
 Better Resource Utilization: Resources are added only when needed, minimizing
idle capacity.

Disadvantages
 Increased Complexity: Requires managing multiple instances, distributed data, and
load balancing.
 Application Redesign: Often necessitates architectural changes to the application
(e.g., making it stateless, using distributed databases).
 Data Consistency Challenges: Maintaining data consistency across multiple nodes
can be complex (e.g., eventual consistency, distributed transactions).
 Network Latency: Communication between distributed nodes can introduce latency.

Use Cases
 Web applications and APIs with variable and high traffic.
 Microservices architectures.
 Big data processing (e.g., Hadoop, Spark).
 E-commerce platforms.

Both vertical and horizontal scalability are vital for modern cloud systems. While vertical
scaling offers simplicity for certain applications, horizontal scaling is the cornerstone of
cloud computing's elasticity, high availability, and ability to handle massive, unpredictable
workloads. Most large-scale cloud applications leverage a combination of both, often scaling
out their application and web tiers horizontally, while potentially scaling up their database
tier (or also scaling it out using distributed database solutions).

Characteristics:

 Time-Independent: Scalability is a static property; it refers to the system’s capacity


to grow as needed, regardless of time.
 Resource Provisioning: Additional resources can be provisioned to meet higher
demand.

Benefits:

 Supports Business Growth: Handles more users or transactions as the business


expands.
 Cost-Effective: Resources are added only when needed.
 Performance Maintenance: Maintains performance levels even as demand increases

Elasticity in Cloud Computing


Definition:
Elasticity is the ability of a cloud system to automatically add or remove resources in real
time to adapt to workload variations.

Key Features:

 Dynamic Resource Adjustment: Resources (CPU, memory, storage, VMs) are


provisioned or de-provisioned automatically based on current demand.
 Automation: Often achieved through auto-scaling and monitoring tools.
 Two Types:
o Horizontal Elasticity: Adding or removing instances (e.g., VMs or
containers).
o Vertical Elasticity: Increasing or decreasing the capacity of existing instances
(e.g., more CPU or RAM).

Benefits:
 Cost Efficiency: Only pay for resources actually used.
 Continuous Availability: Maintains service levels during demand spikes or drops.
 Optimized Performance: Quickly adapts to changing workloads, preventing over-
provisioning or under-provisioning.

Examples:

 Amazon EC2 Auto Scaling: Automatically adjusts the number of instances based on
demand.
 Google App Engine, Microsoft Azure: Provide built-in elasticity for web and multi-
tier applications
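A rough sketch of an elastic scaling rule (illustrative Python; the thresholds, instance limits, and metric values are assumptions, not any provider's auto-scaling API) is shown below.

SCALE_OUT_ABOVE = 75.0        # average CPU % that triggers adding an instance
SCALE_IN_BELOW = 25.0         # average CPU % that triggers removing one
MIN_INSTANCES, MAX_INSTANCES = 2, 10

def autoscale(instances, average_cpu):
    """Return the new instance count for one evaluation cycle."""
    if average_cpu > SCALE_OUT_ABOVE and instances < MAX_INSTANCES:
        return instances + 1      # scale out to absorb the extra load
    if average_cpu < SCALE_IN_BELOW and instances > MIN_INSTANCES:
        return instances - 1      # scale in to stop paying for idle capacity
    return instances              # demand is within the target band

print(autoscale(instances=3, average_cpu=90.0))   # -> 4
print(autoscale(instances=3, average_cpu=10.0))   # -> 2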

Elasticity vs. Scalability:

 Scalability is about the system’s capacity to grow.


 Elasticity is about how quickly and automatically the system can adapt to changes in
demand.
 Elasticity = Auto Scaling + Optimization

Load balancing, scalability, and elasticity are fundamental to the effectiveness of cloud
computing. Load balancing ensures efficient resource use and high availability. Scalability
allows cloud systems to grow with demand, while elasticity provides the flexibility to
automatically adjust resources in real time, optimizing both performance and cost
Compute Services
Definition:
Compute services in the cloud provide virtualized processing power, memory, and
networking capabilities that users can rent and scale on demand. Instead of purchasing and
maintaining physical servers, users can provision virtual machines (VMs), containers, or
serverless functions to run their applications.

Key Characteristics:

 Virtualization: Underlying physical servers are virtualized, allowing multiple


isolated virtual instances to run concurrently.
 On-Demand Provisioning: Resources can be spun up or down rapidly, often in
minutes.
 Scalability: Resources can be easily scaled up (vertical scaling) or out (horizontal
scaling) to match changing workload demands.
 Pay-as-you-go: Users are billed only for the compute resources consumed, typically
by the hour or even by the second.

Types of Compute Services:

a. Virtual Machines (VMs) / Infrastructure as a Service (IaaS)
 Description: This is the most basic and flexible compute service. Users get access to
a virtualized server (VM) that they can configure with their choice of operating
system, applications, and networking settings.
 User Control: High level of control over the operating system, middleware, and
runtime environments.
 Examples: Amazon EC2 (Elastic Compute Cloud), Azure Virtual Machines, Google
Compute Engine.
 Use Cases: Running traditional enterprise applications, hosting websites, developing
and testing environments, custom application deployments.

b. Containers / Container as a Service (CaaS)
 Description: Containers (e.g., Docker) encapsulate an application and its
dependencies into a lightweight, portable unit. They share the host OS kernel, making
them more efficient than VMs. CaaS platforms provide tools for orchestrating,
deploying, and managing containers.
 User Control: Control over application and its dependencies, but less control over the
underlying OS than VMs.
 Examples: Amazon Elastic Container Service (ECS), Azure Kubernetes Service
(AKS), Google Kubernetes Engine (GKE).
 Use Cases: Microservices architectures, CI/CD pipelines, highly portable
applications.

c. Serverless Computing / Functions as a Service (FaaS)


 Description: Users deploy individual functions or code snippets without managing
any underlying infrastructure. The cloud provider automatically provisions and scales
the necessary compute resources to execute the code in response to events. Users pay
only for the actual execution time.
 User Control: Focuses entirely on code; no control over servers or OS.
 Examples: AWS Lambda, Azure Functions, Google Cloud Functions.
 Use Cases: Event-driven applications, real-time data processing, chatbots, IoT
backend.
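For instance, a Python function deployed on AWS Lambda is written as an event handler; the provider runs it on demand and scales it automatically. The event field used here is a made-up example.

import json

def lambda_handler(event, context):
    # Invoked by the platform in response to an event (HTTP request,
    # new object in storage, queue message, ...); no server is managed
    # by the user and billing covers only the execution time.
    name = event.get("name", "world")       # hypothetical event field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"})
    }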

2. Storage Services
Definition:
Cloud storage services provide a scalable, highly available, and durable means to store and
retrieve data over the internet. Users no longer need to manage physical storage hardware or
deal with capacity planning, backup, and disaster recovery.

Key Characteristics:

 Durability: Data is replicated across multiple devices and data centres to ensure high
durability and prevent data loss.
 Availability: Data is accessible anytime, anywhere via standard APIs or web
interfaces.
 Scalability: Storage capacity can be expanded or contracted on demand without
service interruption.
 Cost-Effectiveness: Often cheaper than traditional on-premises storage, with tiered
pricing based on access frequency and storage class.

Types of Storage Services:

a. Object Storage
 Description: Stores data as objects within buckets. Each object is a self-contained
unit with data, metadata, and a unique identifier. Ideal for unstructured data.
 Access Method: Accessed via HTTP/HTTPS APIs.
 Examples: Amazon S3 (Simple Storage Service), Azure Blob Storage, Google Cloud
Storage.
 Use Cases: Website content, backup and archival, data lakes for analytics, media
files.
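As an illustration, an object store such as Amazon S3 is used through an HTTP API or an SDK; the sketch below uses the boto3 SDK with a hypothetical bucket name and assumes credentials are already configured.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"       # hypothetical bucket name

# Store an object: the bucket plus key uniquely identify it.
s3.put_object(Bucket=BUCKET,
              Key="reports/2024/summary.txt",
              Body=b"quarterly summary")

# Retrieve the same object by bucket and key.
obj = s3.get_object(Bucket=BUCKET, Key="reports/2024/summary.txt")
print(obj["Body"].read().decode())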

b. Block Storage
 Description: Provides persistent block-level storage volumes that can be attached to
compute instances (VMs), similar to a traditional hard drive.
 Access Method: Accessed via block protocols (e.g., SCSI) as raw storage devices by
an operating system.
 Examples: Amazon EBS (Elastic Block Store), Azure Disk Storage, Google
Persistent Disk.
 Use Cases: Databases, boot volumes for VMs, high-performance applications
requiring low-latency storage.

c. File Storage
 Description: Offers shared file system storage that can be mounted by multiple
compute instances, similar to network-attached storage (NAS). Supports standard file
protocols like NFS (Network File System) or SMB (Server Message Block).
 Access Method: Accessed via file-level protocols.
 Examples: Amazon EFS (Elastic File System), Azure Files, Google Cloud Filestore.
 Use Cases: Content management systems, shared application data, development
environments, home directories.

d. Database Services (Managed Databases)


 Description: Cloud providers offer fully managed database services, taking care of
patching, backups, scaling, and high availability. Users choose from relational,
NoSQL, data warehouse, or in-memory databases.
 Examples: Amazon RDS (Relational Database Service), Azure SQL Database,
Google Cloud SQL, Amazon DynamoDB, Azure Cosmos DB.
 Use Cases: Any application requiring structured or unstructured data storage with
management overhead offloaded to the provider.

3. Application Services (Platform as a Service - PaaS & Software as a Service - SaaS)
Definition:
Application services in the cloud encompass a wide range of managed services that support
application development, deployment, and operation, abstracting away much of the
underlying infrastructure. They range from platforms for building applications to fully
managed, ready-to-use software applications.

Key Characteristics:
 Higher Abstraction: Users interact with services at a higher level, focusing on
application code or configuration rather than infrastructure.
 Increased Productivity: Developers can build and deploy applications faster without
managing servers, operating systems, or middleware.
 Built-in Scalability and Availability: Cloud providers handle the scaling and
reliability of the underlying platform.
 Managed Services: Routine tasks like patching, updates, and backups are handled by
the provider.

Types of Application Services:

a. Platform as a Service (PaaS)


 Description: Provides a complete development and deployment environment in the
cloud, including operating systems, programming language execution environments,
databases, web servers, and other tools. Developers only need to manage their
application code and data.
 Examples: AWS Elastic Beanstalk, Azure App Service, Google App Engine, Heroku.
 Use Cases: Web applications, API development, rapid application development,
microservices platforms.

b. Message Queuing and Streaming Services


 Description: Services for enabling asynchronous communication and event
processing between different parts of an application or microservices.
 Examples: Amazon SQS (Simple Queue Service), Azure Service Bus, Google Cloud
Pub/Sub, Apache Kafka (managed services).
 Use Cases: Decoupling microservices, real-time data ingestion, event-driven
architectures.
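The decoupling idea behind these services can be sketched with an in-process queue standing in for a managed service such as SQS or Pub/Sub (plain Python; in production the queue would be the managed service itself).

import queue
import threading

events = queue.Queue()          # stand-in for a managed message queue

def producer():
    for order_id in range(3):
        events.put({"order_id": order_id})   # publish and move on
        print(f"published order {order_id}")

def consumer():
    while True:
        message = events.get()               # consume asynchronously
        print(f"processing order {message['order_id']}")
        events.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
events.join()                   # wait until every message is handled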

c. Developer Tools & DevOps Services


 Description: Cloud providers offer services that support the entire software
development lifecycle, including source control, CI/CD pipelines, monitoring, and
logging.
 Examples: AWS CodePipeline, Azure DevOps, Google Cloud Build, GitHub Actions.
 Use Cases: Automating software delivery, continuous integration and deployment.

d. Artificial Intelligence and Machine Learning Services
 Description: Pre-built or configurable AI/ML models and services that allow
developers to integrate intelligence into their applications without deep ML expertise.
 Examples: Amazon Rekognition (image analysis), Azure Cognitive Services (vision,
speech, language), Google AI Platform.

e. Internet of Things (IoT) Services


 Description: Platforms designed to connect, manage, and process data from IoT
devices.
 Examples: AWS IoT Core, Azure IoT Hub, Google Cloud IoT Core.
 Use Cases: Smart home devices, industrial automation, connected vehicles.

f. Software as a Service (SaaS)


 Description: Fully developed and managed software applications hosted by the cloud
provider and delivered over the internet, typically on a subscription basis. Users
simply consume the application.
 Examples: Salesforce (CRM), Google Workspace (Gmail, Docs), Microsoft 365,
Dropbox.
 Use Cases: Enterprise applications, productivity tools, customer relationship
management, email services.

The comprehensive suite of compute, storage, and application services offered by cloud
providers forms the foundation of modern digital infrastructure. By abstracting away the
complexities of underlying hardware and software management, these services empower
businesses and developers to innovate faster, scale globally, enhance reliability, and optimize
costs, truly realizing the potential of cloud computing.