Unit 2: Distribution Models
Definition: Techniques for distributing data across multiple servers or nodes to improve performance, availability, and scalability.
Achieved through:
Sharding: Data partitioned into smaller chunks across multiple nodes.
Replication: Copying data across nodes for fault tolerance and high availability, in two models:
Master-Slave
Peer-to-Peer
Sharding: Horizontal partitioning of data across multiple
servers to scale and distribute the load efficiently.
How:
Split large datasets into shards, each stored on a
different node.
Sharding criteria can be based on user ID, region,
date, or other attributes.
Data locality: Places data closer to where it is needed (e.g., geographical sharding).
Key Aspects:
Shard Key: A specific field used to divide the data
(e.g., customer ID, product ID).
Uniform Distribution: Ideally, shards should be evenly
distributed to avoid hot spots.
Global Indexing: Maintains a global index that helps to
locate data across shards.
Advantages:
Scalability
Performance: Parallel access to different data parts
improves throughput.
Fault Isolation: Failure of one shard doesn’t affect
others.
Challenges:
Data Rebalancing: Adding/removing shards requires
migrating data.
Consistency
Query Complexity: Queries spanning multiple shards are harder to execute and may impact performance.
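A minimal sketch of hash-based sharding on a shard key; the four-node layout and customer IDs below are illustrative assumptions, not a specific system's layout:
```python
import hashlib

# Hypothetical shard layout: four shards, each mapped to a node (names are illustrative).
SHARDS = ["node-a", "node-b", "node-c", "node-d"]

def shard_for(shard_key: str) -> str:
    """Hash the shard key and map it to one shard, spreading keys roughly evenly."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Reads and writes for the same customer ID always land on the same shard.
print(shard_for("customer-42"))
print(shard_for("customer-1001"))
```
Hashing the key spreads load evenly and avoids hot spots, while range- or region-based shard keys preserve data locality at the risk of uneven load.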
Master-Slave Replication
One master node holds authoritative data, while slave
nodes replicate the data for read scalability and fault
tolerance.
How:
Master: Handles write operations (INSERT, UPDATE).
Slaves: Replicas that handle read queries.
Synchronous vs. asynchronous replication, chosen based on latency and consistency needs.
Key Aspects:
Replication Lag: Asynchronous replication can
introduce a delay in data synchronization.
Failover Mechanism: In case of master failure, a slave
can be promoted to master.
Eventual Consistency: Slaves may not have the most
up-to-date data immediately after write operations.
Advantages:
Read Scalability
Fault Tolerance: Automatic failover in case of master
failure (with proper configuration).
Simple Architecture
Challenges:
Write Bottleneck: Master can become a bottleneck for
write-heavy workloads.
Replication Lag: May cause temporary data
inconsistency between master and slaves.
Single Point of Failure: If the master fails and no failover mechanism is in place, writes are blocked.
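A toy sketch of master-slave replication with asynchronous propagation; the class names and the explicit replicate() step are assumptions made to keep replication lag visible:
```python
import random

class Master:
    """Handles all writes and pushes changes to the slaves (asynchronously here)."""
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves
        self.pending = []          # changes not yet replicated (replication lag)

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Flush pending changes to every slave (a real system does this on a delay)."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()

class Slave:
    """Read-only replica; may lag behind the master."""
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)

slaves = [Slave(), Slave()]
master = Master(slaves)
master.write("stock:42", 10)
print(random.choice(slaves).read("stock:42"))  # None: replication lag
master.replicate()
print(random.choice(slaves).read("stock:42"))  # 10 once replicas are in sync
```
A read that hits a slave before replicate() runs returns stale data, which is exactly the eventual-consistency behaviour noted above.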
Peer-to-Peer Replication
Every node (peer) is equal, with no central master. All
nodes handle both reads and writes.
How:
Distributed Sync: Data is synchronized across all
nodes, and each peer can handle both read and write
operations.
Conflict Resolution: Conflicts (e.g., the same data modified on two nodes) are resolved using mechanisms like versioning or last-write-wins (a sketch follows this topic).
Key Aspects:
Decentralized: No central master, and all peers
communicate directly.
Conflict Handling: Must ensure data consistency
through advanced techniques (e.g., vector clocks).
Eventual Consistency: Peers may not have the same
data immediately after updates.
Advantages:
Fault Tolerance: No single point of failure (SPOF).
High Availability
Load Distribution: Equal distribution of read and write
operations across peers.
Challenges:
Data Consistency
Conflict Management: Requires efficient mechanisms to resolve concurrent updates made on different peers.
Network Overhead
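A sketch of peer-to-peer replication with last-write-wins conflict resolution based on local wall-clock timestamps; the Peer class and sync_with() exchange are illustrative assumptions, not a real protocol:
```python
import time

class Peer:
    """Every peer accepts both reads and writes; conflicts resolved by last-write-wins."""
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value):
        self.store[key] = (time.time(), value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_with(self, other):
        """Exchange entries with another peer; the newer timestamp wins on conflict."""
        for key in set(self.store) | set(other.store):
            mine = self.store.get(key, (0, None))
            theirs = other.store.get(key, (0, None))
            winner = max(mine, theirs)       # tuple comparison: timestamp decides
            self.store[key] = winner
            other.store[key] = winner

a, b = Peer("a"), Peer("b")
a.write("cart:7", ["book"])            # concurrent updates on two peers
b.write("cart:7", ["book", "pen"])
a.sync_with(b)                          # peers converge after synchronization
print(a.read("cart:7"), b.read("cart:7"))
```
Last-write-wins silently discards one of the concurrent updates; vector clocks (see Version Stamps below) detect such conflicts instead of hiding them.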
Sharding + Replication
Combining sharding (data partitioning) and replication
(data duplication) for scalable, fault-tolerant distributed
systems.
How:
Sharding: Splits data across multiple nodes based on
a shard key (e.g., customer ID).
Replication: Each shard is replicated on multiple
nodes to provide fault tolerance and high availability.
Partitioning & Replication: Partitioning data (sharding)
ensures scalability, while replication ensures data
availability and fault tolerance.
Key Aspects:
Data Redundancy
Consistency Model: Ensures consistency (often
eventual consistency) across shards and replicas.
Fault Isolation
Advantages:
Scalability
Availability
Improved Performance: Load balancing between
shards and replicas for both read and write
operations.
Challenges:
Complexity in Management: Managing both sharding and replication requires careful planning.
Data Synchronization
Rebalancing: Data may need to be redistributed across shards as the system scales.
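A minimal sketch combining the two techniques: the key is hashed to a shard, and every shard has a replica set; the three-shard, two-replica cluster map below is a made-up example:
```python
import hashlib

# Hypothetical cluster: three shards, each replicated on two nodes (a primary and a replica).
CLUSTER = {
    0: ["shard0-primary", "shard0-replica"],
    1: ["shard1-primary", "shard1-replica"],
    2: ["shard2-primary", "shard2-replica"],
}

def nodes_for(shard_key: str):
    """Pick the shard by hashing the key, then return all replicas of that shard."""
    shard_id = int(hashlib.md5(shard_key.encode()).hexdigest(), 16) % len(CLUSTER)
    return CLUSTER[shard_id]

replicas = nodes_for("customer-42")
print("writes go to:", replicas[0])   # primary of the owning shard
print("reads may hit:", replicas)     # any replica of that shard
```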
Consistency & Relaxing Consistency
Consistency: Ensures all nodes have the same data after
an update operation.
Types:
i. Strong Consistency: Immediate synchronization;
all clients see updated data.
ii. Eventual Consistency: Updates propagate
asynchronously; all replicas converge eventually.
iii. Causal Consistency: Causally related updates are seen in the same order by all nodes.
Relaxing Consistency: Sacrificing immediate
synchronization for scalability/performance.
Techniques:
Eventual Consistency: Used in distributed systems
(e.g., DynamoDB, Cassandra).
Quorum Reads/Writes: Require acknowledgement from only a quorum of replicas per read or write; quorum sizes tune the consistency level (see the sketch below).
Trade-offs:
↑ Availability & performance.
↓ Consistency guarantees.
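The quorum rule itself is simple arithmetic: with N replicas, writes acknowledged by W nodes and reads served by R nodes must overlap on an up-to-date replica whenever R + W > N. A tiny sketch:
```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when any read quorum must intersect any write quorum (R + W > N)."""
    return r + w > n

print(quorum_overlaps(n=3, r=2, w=2))  # True: strongly consistent reads
print(quorum_overlaps(n=3, r=1, w=1))  # False: relaxed consistency, lower latency
```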
CAP Theorem
A distributed system can only achieve two out of three:
Consistency (C): All nodes see the same data
simultaneously.
Availability (A): Every request receives a response, even during failures.
Partition Tolerance (P): System functions despite
network partition.
Scenarios:
CA Systems: No partition tolerance; suited for single-
node systems (e.g., RDBMS).
AP Systems: No strong consistency; common in geo-
distributed systems (e.g., DNS).
CP Systems: Sacrifice availability during partitions (e.g., MongoDB).
Applications: Guides system design by making the consistency/availability trade-off explicit.
Version Stamps
Concept: Identifiers for tracking changes/versioning in
distributed systems.
Purpose:
Detect conflicts in updates.
Enable reconciliation in peer-to-peer or eventually
consistent systems.
Types:
Logical Timestamps: Track the order of events (e.g., Lamport timestamps).
Vector Clocks: Track causal relationships between updates.
Applications:
Conflict detection in distributed databases.
Version control systems (e.g., Git).
Challenges:
↑ Overhead in managing clocks.
Resolving conflicting timestamps.
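A small sketch of vector clocks used as version stamps, assuming hypothetical node names; it shows how concurrent updates are detected rather than silently ordered:
```python
def increment(clock, node):
    """Bump this node's counter before it issues an update."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def happened_before(a, b):
    """True if stamp a is causally before b (a <= b on every node, and a != b)."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes) and a != b

def concurrent(a, b):
    """Neither stamp precedes the other: a real conflict that needs reconciliation."""
    return not happened_before(a, b) and not happened_before(b, a)

v1 = increment({}, "node-a")     # {'node-a': 1}
v2 = increment(v1, "node-b")     # builds on v1, so causally after it
v3 = increment(v1, "node-c")     # also builds on v1, unaware of v2
print(happened_before(v1, v2))   # True
print(concurrent(v2, v3))        # True: conflicting updates to reconcile
```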
Map-Reduce
Concept: Framework for parallel processing of large
datasets using clusters.
Steps:
Map: Input data → key-value pairs.
Shuffle: Group data by keys across nodes.
Reduce: Combine values with the same key to
produce results.
Key Features:
Parallel processing using mappers/reducers.
Fault Tolerance: Re-runs failed tasks.
Data locality minimizes network transfer.
Use Cases: Analytics, indexing, log processing (e.g.,
Hadoop, Spark).
Challenges: Not ideal for iterative algorithms or real-time
processing.
Partitioning & Combining
Partitioning: Splitting data into logical chunks for parallel
processing.
Based on key-value mapping (e.g., hash partitioning).
Enables distributed reduce operations.
Combining: Reduces data at the mapper stage before shuffling.
Minimizes data transfer across nodes.
Example: Pre-aggregating sales data per product on
mapper nodes.
Applications:
Scaling computations across clusters.
Efficient large-scale data processing.
Challenges:
Non-combinable reducers (e.g., counting unique
elements).
Balancing partitions for equal load distribution.
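A sketch of a mapper-side combiner plus hash partitioning, based on the pre-aggregated sales example above; the record values and the two-reducer setup are illustrative assumptions:
```python
from collections import defaultdict

# Mapper-side records: (product, sale_amount) pairs seen on one node.
mapper_output = [("book", 10), ("pen", 2), ("book", 5), ("pen", 3)]

def combine(pairs):
    """Combiner: pre-aggregate per product on the mapper node before shuffling,
    so far fewer pairs cross the network."""
    totals = defaultdict(int)
    for product, amount in pairs:
        totals[product] += amount
    return list(totals.items())

def partition(key, num_reducers):
    """Partitioner: route each key to a reducer (real frameworks use a stable hash)."""
    return hash(key) % num_reducers

combined = combine(mapper_output)          # [('book', 15), ('pen', 5)]
for key, value in combined:
    print(key, value, "-> reducer", partition(key, num_reducers=2))
```
Combining works here because summation is associative and commutative; counting unique elements cannot be pre-combined this way, which is the non-combinable-reducer challenge noted above.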
Composing Map-Reduce Calculations
Concept: Combining multiple Map-Reduce stages for
complex computations.
Process:
Output of one stage → Input for the next (e.g., pipes-
and-filters model).
Example:
Task: Compare monthly sales year-over-year.
Stage 1: Aggregate sales per product per month.
Stage 2: Compare each month's results across consecutive years.
Advantages:
Reusability: Intermediate outputs useful for other
tasks.
Scalability: Breaks down complex logic into smaller,
manageable stages.
Efficiency: Materializing intermediate views avoids recomputation.
Tools: Apache Pig (simplified Map-Reduce), Hive (SQL-like interface).
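A sketch of the two-stage composition described above, run in-process; the sales records are made-up and each function stands in for a full Map-Reduce job:
```python
from collections import defaultdict

# Raw sales records: (product, year, month, amount). Values are illustrative.
sales = [
    ("book", 2023, 1, 100), ("book", 2023, 1, 50),
    ("book", 2024, 1, 200), ("pen", 2024, 1, 20),
]

def stage1(records):
    """Stage 1: aggregate sales per (product, year, month)."""
    totals = defaultdict(int)
    for product, year, month, amount in records:
        totals[(product, year, month)] += amount
    return totals

def stage2(monthly_totals):
    """Stage 2: compare each product's month with the same month a year earlier.
    Its input is the output of stage 1 (pipes-and-filters composition)."""
    comparison = {}
    for (product, year, month), total in monthly_totals.items():
        previous = monthly_totals.get((product, year - 1, month))
        if previous is not None:
            comparison[(product, year, month)] = total - previous
    return comparison

monthly = stage1(sales)        # intermediate output, reusable by other reports
print(stage2(monthly))         # {('book', 2024, 1): 50}
```
The stage 1 output is exactly the kind of intermediate result worth materializing, since other reports can reuse it without recomputation.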