Unit 5
Replication in distributed systems means maintaining copies of data or services across
multiple nodes in the system. This enhances availability, fault tolerance, scalability, and
performance by ensuring that users can access the data or services even if some nodes fail
or are unavailable.
Types of Replication
1. Data Replication:
o Involves duplicating data across multiple nodes.
o Used in distributed databases, file systems, and cloud storage services.
2. Service Replication:
o Involves running multiple instances of a service or application across nodes.
o Used in microservices, containerized applications, and web services.
Benefits of Replication
1. Fault Tolerance:
o If one node fails, other replicas can serve the request.
2. Improved Performance:
o Replicas closer to the user reduce latency.
3. Load Balancing:
o Requests are distributed among replicas to prevent overloading a single
node.
4. High Availability:
o Ensures that services and data are accessible even during failures or
maintenance.
5. Disaster Recovery:
o Helps recover from data loss or corruption by providing backups.
Challenges in Replication
1. Consistency:
o Keeping replicas synchronized despite updates and failures is complex.
o CAP Theorem: A distributed system cannot guarantee Consistency,
Availability, and Partition tolerance all at once; when the network partitions, it
must sacrifice either consistency or availability.
2. Conflict Resolution:
o Concurrent updates to replicas may lead to conflicts.
3. Replication Latency:
o Synchronizing replicas in real-time can introduce delays.
4. Resource Overhead:
o Maintaining replicas consumes storage and computational resources.
5. Fault Propagation:
o Errors in one replica might propagate if not handled properly.
Replication Strategies
1. Synchronous Replication:
o Updates are applied to all replicas before the write is acknowledged to the client.
o Ensures strong consistency but may increase latency.
o Example: Bank transaction systems.
2. Asynchronous Replication:
o The primary node acknowledges the write first and propagates updates to
replicas in the background.
o Reduces write latency but provides only eventual consistency.
o Example: Content delivery networks (CDNs).
3. Quorum-Based Replication:
o Requires a majority of nodes to agree on a read or write operation.
o Balances consistency and availability.
o Example: Distributed databases like Apache Cassandra.
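The quorum idea above can be sketched in a few lines. The following is an illustrative toy model (the `QuorumStore` class and its deterministic choice of which replicas to contact are invented for this sketch, not taken from any real system): tagging each write with a version number and requiring R + W > N guarantees that every read quorum overlaps the latest write quorum, so a read always sees the newest version.

```python
# Toy sketch of quorum-based replication (illustrative, not production code).
# With N replicas, choosing read quorum R and write quorum W such that
# R + W > N guarantees every read quorum overlaps the latest write quorum.

class QuorumStore:
    def __init__(self, n, r, w):
        assert r + w > n, "R + W must exceed N for strong reads"
        self.n, self.r, self.w = n, r, w
        # Each replica holds (version, value); all start empty.
        self.replicas = [(0, None) for _ in range(n)]

    def write(self, value):
        # Bump the version and install it on W replicas (here: the first W).
        version = max(v for v, _ in self.replicas) + 1
        for i in range(self.w):
            self.replicas[i] = (version, value)
        return version

    def read(self):
        # Contact R replicas (here: the last R) and return the value with the
        # highest version; because R + W > N, at least one contacted replica
        # must have seen the latest write.
        contacted = self.replicas[-self.r:]
        return max(contacted)[1]

store = QuorumStore(n=5, r=3, w=3)
store.write("v1")
store.write("v2")
print(store.read())  # "v2": the read quorum overlaps the write quorum
```

A real system would contact replicas over the network and handle timeouts; the point here is only the overlap argument.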
A system model in distributed systems describes the fundamental assumptions and
abstractions used to replicate and coordinate processes, data, and communication across
distributed nodes. Replication is a key feature in distributed systems to ensure availability,
fault tolerance, and scalability by maintaining multiple copies of data or services.
Here are the primary aspects of a system model for replication in distributed systems:
2. Replication Strategies
Replication ensures multiple copies of data exist across distributed nodes. The system model
impacts the replication strategy:
Active Replication:
o All replicas perform the same operations in a coordinated manner.
o Requires strong consistency and deterministic behavior.
o Example: State machine replication.
Passive Replication:
o A primary node processes updates, then propagates changes to backups.
o Simpler to implement but requires failure detection and failover mechanisms.
Hybrid Replication:
o Combines aspects of active and passive replication for optimization in specific
use cases.
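Active replication via state machine replication can be sketched as follows. This is a minimal toy model (the `CounterReplica` class and `replicate` helper are invented for illustration): if every replica applies the same deterministic operations in the same agreed order, all replicas end in identical states.

```python
# Minimal sketch of active replication as state machine replication:
# every replica applies the same deterministic operations in the same
# order, so all replicas converge to identical state.

class CounterReplica:
    def __init__(self):
        self.state = 0

    def apply(self, op, arg):
        # Operations must be deterministic for replicas to stay identical.
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg

def replicate(ops, n_replicas=3):
    # The total order of ops (fixed in practice by a consensus protocol
    # such as Paxos or Raft) is applied to every replica.
    replicas = [CounterReplica() for _ in range(n_replicas)]
    for op, arg in ops:
        for r in replicas:
            r.apply(op, arg)
    return [r.state for r in replicas]

print(replicate([("add", 2), ("mul", 5), ("add", 1)]))  # [11, 11, 11]
```

Passive replication would instead run the operations only on the primary and ship the resulting state to the backups.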
3. Consistency Models
4. Coordination Mechanisms
5. Fault Tolerance
6. Challenges
7. Real-World Examples
Distributed Databases:
o Spanner (strong consistency, global replication).
o Cassandra (eventual consistency, decentralized model).
1. Group: A logical collection of nodes (processes) that communicate with one another.
o Static groups: Membership is fixed at the time of creation.
o Dynamic groups: Membership can change over time (e.g., nodes joining or leaving).
2. Message Types:
o Unicast: Message sent to a single node.
o Multicast: Message sent to a subset of nodes (group members).
o Broadcast: Message sent to all nodes in the system.
3. Replication:
o In distributed systems, data or services are replicated across multiple nodes for fault
tolerance and availability. Group communication ensures consistent updates to
replicas.
1. Reliability:
o Ensures that messages are delivered to all group members, even in the presence of
failures.
o Types include best-effort (no guarantees), reliable delivery, and atomic delivery (all-
or-nothing).
2. Ordering:
o FIFO Order: Messages from the same sender are delivered in the order they were
sent.
o Causal Order: Messages are delivered respecting causal relationships (if A → B,
deliver A before B).
o Total Order: All members deliver messages in the same global order.
3. Fault Tolerance:
o Supports operation even when some nodes fail (partial failures).
o Achieved through protocols like leader election and membership changes.
4. Membership Management:
o Tracks which nodes are currently in the group.
o Dynamic membership ensures nodes can join or leave during operation without
disrupting consistency.
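The causal-order guarantee above is commonly implemented with vector clocks. The sketch below (the `deliverable` helper is invented for illustration) shows the standard deliverability test: a message from sender j carrying vector timestamp V is deliverable at a process with local clock L when V[j] == L[j] + 1 (it is the next message from j) and V[k] <= L[k] for every other k (all of its causal predecessors have been delivered); otherwise it is buffered.

```python
# Sketch of the causal-order delivery condition using vector clocks.

def deliverable(local, msg_clock, sender):
    # local:     this process's vector clock (list of ints, one per process)
    # msg_clock: vector timestamp attached to the incoming message
    # sender:    index of the process that sent the message
    for k, v in enumerate(msg_clock):
        if k == sender:
            if v != local[k] + 1:   # must be the very next message from sender
                return False
        elif v > local[k]:          # a causal predecessor is still missing
            return False
    return True

local = [1, 0, 0]        # we have delivered one message from process 0
m_ok   = [1, 1, 0]       # depends only on that message: deliverable
m_late = [2, 1, 0]       # depends on a message from 0 we lack: buffer it
print(deliverable(local, m_ok, sender=1))    # True
print(deliverable(local, m_late, sender=1))  # False
```

FIFO order is the special case where only the sender's own entry is checked; total order additionally requires all processes to agree on one global delivery sequence, which this local test alone cannot provide.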
2. Consensus Protocols:
o Facilitate agreement among group members (e.g., on the sequence of updates).
o Examples: Paxos, Raft.
3. Virtual Synchrony:
o Provides a framework for reliable communication and consistent state replication.
o Ensures that messages are delivered consistently to members that see the same
group view.
1. Data Replication:
o Synchronizing updates to replicated databases or distributed file systems.
o Example: Google Spanner, Amazon DynamoDB.
2. Fault Tolerance:
o Replicated state machines use group communication to apply the same operations
on replicas.
o Example: Distributed consensus algorithms like Raft.
3. Distributed Coordination:
o Nodes in distributed systems use group communication to synchronize actions.
o Example: Coordination in distributed schedulers or transaction systems.
1. Scalability:
o Large groups can face bandwidth and latency bottlenecks.
o Solutions: Hierarchical groups, efficient multicast protocols.
2. Network Partitions:
o Ensuring consistency when a network splits into isolated segments.
3. Dynamic Membership:
o Handling joins and leaves without disrupting ongoing communication or replication.
4. Security:
o Ensuring confidentiality, integrity, and authentication in group communication.
1. JGroups:
o A toolkit for reliable multicast communication in Java.
2. ZooKeeper:
o A distributed coordination service using group communication for consistency.
3. Spread:
o A messaging toolkit that supports reliable multicast and group communication.
Group communication forms the backbone of distributed systems, enabling efficient, reliable,
and consistent interaction across nodes.
1. Fault Tolerance
Definition: The ability of a system to continue functioning correctly even when some of its
components fail.
Types of Faults:
o Crash faults: Nodes stop functioning without warning.
o Byzantine faults: Nodes exhibit arbitrary or malicious behavior.
o Network faults: Delays, dropped messages, or partitions in the network.
2. Replication
Definition: Maintaining multiple copies of data or services across nodes to ensure availability
and reliability.
Benefits:
o Improved availability: If one replica fails, others can take over.
o Fault tolerance: Redundant data prevents single points of failure.
o Load balancing: Requests can be distributed among replicas.
Replication Models
1. Active Replication:
o All replicas process every request concurrently.
o Ensures consistency by applying requests in the same order across all replicas.
o Common in systems requiring high fault tolerance.
o Example: Replicated State Machines (RSM).
2. Passive Replication:
o A primary replica processes all requests and propagates state updates to
backup replicas.
o Backups take over through failover if the primary fails.
3. Hybrid Replication:
o Combines aspects of active and passive replication to balance performance and fault
tolerance.
4. Quorum-based Replication:
o Reads and writes are performed on a subset (quorum) of replicas.
o Ensures consistency with rules like:
Read quorum + Write quorum > Total replicas.
Write quorum > Total replicas / 2.
o Examples: DynamoDB, Cassandra.
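The two quorum rules listed above can be checked mechanically. The small helper below (invented for illustration) tests whether a given (N, R, W) configuration satisfies them: R + W > N makes read and write quorums intersect, and W > N/2 prevents two concurrent writes from both reaching a quorum.

```python
# Sketch: validate a quorum configuration against the rules
#   read quorum + write quorum > total replicas   (R + W > N)
#   write quorum > total replicas / 2             (W > N / 2)

def quorum_ok(n, r, w):
    return r + w > n and w > n / 2

print(quorum_ok(5, 3, 3))  # True:  majority read and write quorums
print(quorum_ok(5, 1, 5))  # True:  read-one / write-all
print(quorum_ok(5, 2, 2))  # False: quorums of 2 may not overlap
```

Tunable-consistency systems such as Cassandra let clients pick R and W per operation, trading latency against the strength of this guarantee.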
Consistency Models
1. Strong Consistency:
o All replicas reflect the most recent update immediately.
o Often achieved via consensus protocols like Paxos or Raft.
o Example: Google Spanner.
2. Eventual Consistency:
o Updates propagate to all replicas eventually, but inconsistencies may exist
temporarily.
o Used in systems prioritizing availability over strict consistency (e.g., Amazon
DynamoDB).
3. Causal Consistency:
o Guarantees that causally related updates are seen by all replicas in the same order.
o Example: Collaborative editing tools.
4. CAP Theorem:
o A distributed system cannot guarantee all three of Consistency, Availability,
and Partition Tolerance; when a partition occurs, it must choose between
consistency and availability.
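Eventual consistency is often realized with a last-write-wins (LWW) merge, sketched below. The `merge` function and the replica dictionaries are invented for illustration: replicas accept writes independently and exchange state in the background, and tagging every value with a timestamp lets all replicas converge to the same state once gossip completes. Real systems typically use logical or hybrid clocks rather than raw wall-clock time to avoid clock-skew anomalies.

```python
# Sketch of eventual consistency via last-write-wins state merging.
# Each replica's state maps key -> (timestamp, value); merging keeps
# the entry with the newer timestamp for every key.

def merge(a, b):
    out = dict(a)
    for key, stamped in b.items():
        # Tuples compare by timestamp first, so the newer write wins.
        if key not in out or stamped > out[key]:
            out[key] = stamped
    return out

r1 = {"x": (1, "old"), "y": (2, "hello")}
r2 = {"x": (3, "new")}
print(merge(r1, r2))                    # {'x': (3, 'new'), 'y': (2, 'hello')}
print(merge(r1, r2) == merge(r2, r1))   # True: merge order does not matter
```

Because the merge is commutative and idempotent, replicas converge regardless of the order in which they exchange state, which is exactly the "eventually" in eventual consistency.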
1. Replication Protocols
Consensus Algorithms:
o Ensure agreement among replicas on the order of operations.
o Examples: Paxos, Raft.
Two-Phase Commit (2PC) and Three-Phase Commit (3PC):
o Used in distributed transactions to ensure atomicity.
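The 2PC flow can be sketched in a few lines. This is a simplified model (the `Participant` class and `two_phase_commit` function are invented for illustration, and it omits logging, timeouts, and coordinator failure, which are exactly what 3PC and consensus-based commit address): the coordinator first collects votes, and only a unanimous "yes" leads to commit.

```python
# Sketch of two-phase commit (2PC): phase 1 collects prepare votes;
# phase 2 broadcasts a uniform commit/abort decision, giving atomicity
# (all participants apply the transaction, or none do).

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.outcome = None

    def prepare(self):
        # Vote "yes" only if this node can durably commit the transaction.
        return self.can_commit

    def finish(self, decision):
        self.outcome = decision

def two_phase_commit(participants):
    # Phase 1: any "no" vote (or, in practice, a timeout) forces an abort.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: every participant applies the same decision.
    for p in participants:
        p.finish(decision)
    return decision

print(two_phase_commit([Participant(True), Participant(True)]))   # "commit"
print(two_phase_commit([Participant(True), Participant(False)]))  # "abort"
```

The weakness of plain 2PC is blocking: if the coordinator crashes after participants have voted yes but before the decision arrives, they must wait, which motivates 3PC and consensus-based variants.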
2. Membership Management
3. Leader Election
4. State Transfer
Mechanism to synchronize the state of a new or recovering replica with the group.
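The leader election mechanism listed above can be illustrated with the simplest scheme, electing the highest-numbered live replica (the core idea of the bully algorithm). The `elect_leader` helper is invented for this sketch and ignores the message exchanges a real election protocol needs.

```python
# Sketch of highest-id leader election (the idea behind the bully
# algorithm): among replicas currently believed alive, the one with the
# largest identifier becomes leader; on failure, the election re-runs.

def elect_leader(alive_ids):
    if not alive_ids:
        raise RuntimeError("no live replicas to elect from")
    return max(alive_ids)

nodes = {1, 2, 3, 4, 5}
print(elect_leader(nodes))   # 5
nodes.discard(5)             # the leader crashes
print(elect_leader(nodes))   # 4: the next-highest id takes over
```

Production systems (e.g., ZooKeeper, Raft) add majority voting so that two network partitions cannot each elect their own leader.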
3. Handling Failures:
o Detecting and recovering from node or network failures reliably.
1. Google Spanner:
o A globally distributed database with strong consistency.
2. Amazon DynamoDB:
o A key-value store with eventual consistency.
3. HDFS (Hadoop Distributed File System):
o Uses replication for fault-tolerant storage.
4. ZooKeeper:
o Provides distributed coordination using replication.
5. Etcd:
o A distributed key-value store for configuration management, using the Raft
consensus protocol.
Transactions with replicated data in distributed systems involve operations that ensure
consistency, reliability, and fault tolerance across multiple replicas of data. Managing
replicated data in a transactional context is challenging due to issues like network latency,
failures, and the need for coordination among distributed nodes.
1. Replication:
2. Transactions:
1. Replication Models:
Primary-Backup Model:
o A designated primary replica handles all updates, which are then propagated to
backup replicas.
o Pros: Simplifies conflict resolution.
o Cons: The primary replica is a potential bottleneck and single point of failure.
Quorum-Based Model:
o A quorum (subset of replicas) must agree on updates or reads.
o Commonly used in systems like Cassandra and DynamoDB.
o Guarantees consistency through quorum rules such as read quorum + write
quorum > total replicas.
2. Consistency Models:
Strong Consistency:
o Replicas behave as if there is a single copy of the data.
o Often achieved using consensus algorithms (e.g., Paxos, Raft).
o Suitable for systems requiring high correctness guarantees.
Eventual Consistency:
o Replicas converge to the same state over time.
o Common in highly available systems (e.g., Dynamo, Cassandra).
o Requires applications to handle temporary inconsistencies.
Causal Consistency:
o Ensures operations are seen in the order of causality.
o Suitable for collaborative applications.
3. Coordination Protocols:
Consensus Algorithms:
o Achieve agreement among distributed replicas.
o Examples: Paxos, Raft.
o Used in systems like Google Spanner for global consistency.
1. Google Spanner:
o Provides external (strong) consistency using the TrueTime bounded-uncertainty clock API.
o Uses Paxos for consensus and replication.
2. Amazon DynamoDB:
o Eventual consistency with options for strongly consistent reads.
o Uses quorum-based replication.
3. CockroachDB:
o Strongly consistent and geo-distributed.
o Implements Raft for consensus.
4. Cassandra:
o Tunable consistency model (strong or eventual).
o Uses quorum-based replication.
Transactions with replicated data are vital for modern distributed systems, enabling
scalability, fault tolerance, and high availability while balancing the trade-offs between
consistency, performance, and complexity.