
Distributed System Notes


Here's a concise and clear explanation of the "Introduction – Distributed Data Processing" topic:

📘 INTRODUCTION – Distributed Data Processing


🔹 What is Distributed Data Processing?
Definition: Performing computation and data processing using multiple independent
computers connected via a network.

Goal: Improve performance, scalability, reliability, and resource sharing.

🔹 Key Characteristics
Multiple nodes: Processing tasks are shared across different machines.

Concurrency: Many computations occur simultaneously.

Resource sharing: Hardware, software, and data are shared.

Scalability: System can grow by adding more nodes.

Fault tolerance: If one node fails, others can continue processing.

🔹 Advantages
Improved performance: Tasks executed in parallel reduce processing time.

High availability: Redundant systems ensure minimal downtime.

Cost-effective: Commodity hardware instead of expensive mainframes.

Geographic distribution: Data and processing can occur close to the source.

🔹 Challenges
Data consistency: Keeping data synchronized across nodes.

Security: Ensuring secure data transfer and storage.

Coordination: Managing communication and task distribution.

Fault handling: Detecting and recovering from node failures.

🔹 Examples of Distributed Data Processing Systems


Google Search Engine

Hadoop MapReduce

Banking and Reservation Systems

Cloud Computing Platforms (e.g., AWS, Azure)


Here’s a precise and note-ready explanation of "What is a DDBS (Distributed Database


System)":

📘 What is a DDBS (Distributed Database System)


🔹 Definition

A Distributed Database System (DDBS) is a collection of logically interrelated
databases distributed across multiple locations connected via a network.

Each site (node) is autonomous, but the system appears to users as a single database.

🔹 Key Features
Data Distribution: Data is stored across multiple sites.

Transparency: Users interact as if it's a single database.

Types: Location transparency, replication transparency, and fragmentation transparency.

Autonomy: Each site can function independently.

Concurrency Control: Supports simultaneous access without conflicts.

Fault Tolerance: Can recover from individual site or communication failures.

🔹 Types of DDBS
1. Homogeneous DDBS:

Same DBMS software across all sites.

Easier integration and management.

2. Heterogeneous DDBS:

Different DBMS software at different sites.

Requires translation and integration mechanisms.

🔹 Advantages
Improved reliability & availability

Faster data access (data closer to users)

Modular growth (easy to add new sites)

Reduced communication cost

🔹 Disadvantages
Complex management

Data security issues

Difficult recovery and synchronization


🔹 Problem Areas in Distributed Database Systems


1. Data Distribution & Fragmentation

Challenge: Deciding how to split (fragment) and place data across sites for optimal
performance and availability.

Key Point: Poor fragmentation can lead to excessive cross-site queries and high
network cost.

2. Data Replication & Consistency

Challenge: Keeping multiple copies of data synchronized across nodes.

Key Point: Trade-off between read performance (more replicas) and update overhead (synchronization).

3. Distributed Query Processing & Optimization

Challenge: Generating efficient execution plans that minimize data transfer and
response time.

Key Point: Involves cost estimation of data shipping, local processing, and remote
joins.

4. Transaction Management & Concurrency Control

Challenge: Ensuring atomicity, consistency, isolation, and durability (ACID) across
multiple sites.

Key Point: Two-phase commit (2PC) and distributed locking can incur high latency
and blocking.

5. Failure Detection & Recovery

Challenge: Detecting node or network failures and recovering without data loss or
inconsistency.

Key Point: Requires logging, checkpoints, and coordinated recovery protocols.

6. Global Catalog Management

Challenge: Maintaining a distributed directory of schema definitions, data locations, and metadata.

Key Point: Catalog must be consistent and highly available for query planning.

7. Security & Authorization

Challenge: Protecting data in transit and at rest across untrusted networks.

Key Point: Must handle authentication, encryption, and fine-grained access control
in a heterogeneous environment.

8. Heterogeneity

Challenge: Integrating sites running different DBMS software, hardware, and data
models.

Key Point: Requires schema mapping, protocol translation, and middleware layers.

9. Network Latency & Bandwidth

Challenge: Variable delays and limited throughput impact query response and
transaction times.

Key Point: System design should minimize cross-site communication and exploit
locality.

10. Scalability & Load Balancing

Challenge: Evenly distributing workload as the system grows to prevent hotspots.

Key Point: Dynamic data and request redistribution may be needed to adapt to
changing loads.

These points cover the primary problem areas you'll encounter in designing, implementing, or maintaining a Distributed Database System.

Here’s a concise yet comprehensive overview organized into clear sections for your notes,
with key points and definitions—each statement backed by high-quality citations.

Summary
This overview covers fundamental database and computer network concepts, the
architecture of Distributed Database Management Systems (DDBMS), and the various
transparencies that make a DDBMS appear as a single database to users. You’ll find
definitions, classification of architectures, and the core transparencies (location, replication,
fragmentation, etc.) laid out in bullet-friendly form for easy note-making.

## Overview of Database and Computer Network Concepts


Database Concepts
A database is a structured collection of data organized to support storage, retrieval, and
update operations efficiently Wikipedia .

A Database Management System (DBMS) is software that interacts with users, applications, and the database to capture and analyze data, enforcing data integrity and security Wikipedia .

Computer Network Concepts


A computer network is a set of interconnected computers that share resources and
communicate via a transmission medium (e.g., Ethernet, Wi-Fi) Wikipedia .

Network topologies (bus, star, ring, mesh) determine how nodes are connected and
impact fault tolerance and performance Wikipedia .

The OSI model (7 layers) and TCP/IP stack (4 layers) define protocols and functions for
reliable, end-to-end communication Wikipedia .

## Distributed Database Management System Architecture
Definition & Goals
A Distributed Database System (DDBS) consists of multiple, logically interrelated
databases distributed over a network, appearing as a single database to users
GeeksforGeeks Wikipedia .

Goals include improved performance via parallelism, higher availability through replication, and scalability by adding nodes GeeksforGeeks .

Homogeneous vs. Heterogeneous DDBMS


Homogeneous DDBMS: All sites run the same DBMS software, simplifying query
processing and transaction management Oracle Documentation .

Heterogeneous DDBMS: Sites run different DBMS products, requiring schema translation and protocol conversion Oracle Documentation .

Client/Server Model
Uses a client/server architecture where clients submit queries and servers manage
local data and execute requests Oracle Documentation .

Any node can act as a client, server, or both, depending on the operation
Oracle Documentation .

Architecture Types
Shared-Nothing: Each node has its own CPU, memory, and disk; data is partitioned
across nodes. It offers high scalability and fault isolation Wikipedia .

Shared-Disk: All nodes share common storage but have private memory; coordination is
needed to manage concurrent access Wikipedia .

Shared-Memory: Nodes share memory, requiring tight coupling; rarely used in large-
scale DDBS Wikipedia .

## Transparencies in a Distributed DBMS


A DDBMS provides transparencies to hide the complexity of distribution and make the
system appear as a centralized database.

1. Location Transparency
Users issue queries without knowing on which site data resides GeeksforGeeks .

2. Replication Transparency
Multiple copies of data can exist without user awareness; the system ensures
consistency across replicas GeeksforGeeks .

3. Fragmentation Transparency
Data fragmentation (horizontal, vertical, hybrid) is hidden; queries access fragmented
data seamlessly GeeksforGeeks .

4. Transaction Transparency
Distributed transactions maintain ACID properties (atomicity, consistency, isolation,
durability) across sites, often via two-phase commit TutorialsPoint .

5. Performance Transparency
The system optimizes query execution plans to minimize network transfer and balance
load without user intervention TutorialsPoint .

6. Failure Transparency
Mask node or communication failures, providing automatic recovery mechanisms so
users see uninterrupted service TutorialsPoint .

These concise points give you a structured foundation for your Distributed Systems notes.


Here’s a structured, citation‐backed overview of Distributed DBMS Architectures, covering


both high‐level models and internal components.

A distributed DBMS can be classified along two main dimensions:

1. Distribution Model – how control and data are spread (Client/Server vs. Peer‐to‐Peer).

2. Hardware/Storage Model – how resources are shared (Shared‐Nothing, Shared‐Disk,
Shared‐Memory, Shared‐Everything).

Under the hood, a DDBMS implements components—such as a global directory, query processor, transaction manager, and storage manager—to coordinate across sites.

## 1. Distribution Models
### 1.1 Client/Server
Definition: A central server hosts the database; clients send requests and receive results
Purdue Computer Science .

Characteristics:

Clients handle presentation and some query parsing.

Server executes queries, manages storage, and enforces ACID properties.

Scales moderately but can create server bottlenecks.

### 1.2 Peer‐to‐Peer (Fully Distributed)


Definition: All nodes are equal peers, each capable of processing queries and managing
data Dimi Uniud Users .

Characteristics:

No central coordinator—nodes coordinate in a decentralized fashion.

High autonomy: design, communication, and execution autonomy at each site Dimi Uniud Users .

Better resilience and scalability, but more complex metadata and catalog
management.

## 2. Hardware/Storage Models
### 2.1 Shared‐Nothing

Definition: Each node has its own CPU, memory, and disk; nodes communicate only via
the network phoenixNAP | Global IT Services CMU 15-445/645 .

Advantages:

Excellent scalability—add nodes without interference.

Failure isolation—node crash affects only local data.

Drawbacks:

Data rebalancing and repartitioning can be complex.

Consistency coordination across shards incurs messaging overhead.

### 2.2 Shared‐Disk


Definition: All nodes share common disk storage, but maintain private memory and CPU
GeeksforGeeks .

Advantages:

Easier to redistribute load—any node can access any data page on shared storage.

Simplifies data rebalancing compared to shared‐nothing.

Drawbacks:

Requires coordination (locks) to manage concurrent disk access.

Shared storage can become a performance bottleneck.

### 2.3 Shared‐Memory


Definition: Nodes share both main memory and disk; essentially a tightly coupled
multiprocessor system Wikipedia .

Usage:

Rare in large‐scale distributed DBMSs due to hardware complexity.

Typically found in SMP (symmetric multiprocessing) environments.

### 2.4 Shared‐Everything


Definition: Variant of shared‐disk where nodes also share in‐memory buffers and caches
Wikipedia .

Example: IBM Db2 pureScale, Oracle RAC in shared‐everything mode.

Trade‐Off:

Simplifies data visibility and caching.

Needs sophisticated coherence protocols to avoid contention.

## 3. Core DDBMS Components


A fully functional DDBMS comprises several interacting modules:

| Component | Function |
| --- | --- |
| Global Directory | Stores metadata on fragmentation, replication, and location of data fragments drbtaneja.com . |
| Query Processor | Decomposes global queries into site-specific subqueries; optimizes execution plans drbtaneja.com . |
| Transaction Manager | Coordinates distributed transactions, ensuring ACID via protocols like Two-Phase Commit drbtaneja.com . |
| Storage Manager | Manages local data storage, buffer pools, and index access. |
| Concurrency Control | Enforces isolation across sites using distributed locking or timestamp ordering. |
| Recovery Manager | Provides fault tolerance via logging, checkpointing, and coordinated rollback/roll-forward. |
| Communication Subsystem | Handles messaging, marshalling, and remote procedure calls among sites. |
| Client Interface | Presents a unified view to applications, hiding the distribution and replication details. |

This architecture overview should give you a clear framework for your notes—defining each model,
its trade‐offs, and the internal building blocks of a Distributed DBMS.


🔹 Global Directory Issues
1. Single Point of Failure

If the central directory goes down, the entire distributed DBMS can become
inoperative.

Mitigation: replicate the directory or use distributed directory replicas.

2. Scalability Bottleneck

A centralized directory can become overwhelmed as the number of sites or queries grows.

Mitigation: partition the directory or employ a hierarchical directory structure.

3. Latency and Performance Overhead

Every query may need a lookup in the global directory, adding network round-trips.

Mitigation: cache directory entries locally with appropriate invalidation policies.

4. Consistency Maintenance

Changes to fragmentation, replication, or schema must be propagated to all directory replicas.

Mitigation: use efficient update protocols (e.g., invalidation messages or versioned entries).

5. Stale or Inaccurate Metadata

Cached or replicated directory data can become outdated, leading to misrouted queries or failed lookups.

Mitigation: implement time-to-live (TTL) or version checks on directory entries (a cache sketch appears at the end of this section).

6. Synchronization Overhead

Coordinating updates across directory replicas incurs messaging and coordination cost.

Mitigation: batch updates or use asynchronous propagation where strict real-time consistency isn't critical.

7. Security and Access Control

The directory contains critical metadata; unauthorized access can compromise the
entire system.

Mitigation: enforce strong authentication, encrypt directory communications, and apply role-based access controls.

8. Complex Recovery Procedures

Recovering a directory after failure requires reconstructing accurate metadata across all fragments and replicas.

Mitigation: maintain robust logging and checkpointing mechanisms for the directory
state.

9. Load Imbalance

Some directory partitions or replicas may receive disproportionate lookup requests.

Mitigation: dynamically redistribute directory partitions or employ load-aware routing.

These points highlight the core challenges you'll face in managing a global directory within a distributed DBMS.
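To make the TTL mitigation in issue 5 concrete, here is a minimal Python sketch of a local directory cache; the class name, the lookup_remote callable, and the 30-second TTL are illustrative assumptions, not any real DDBMS API.

```python
import time

class DirectoryCache:
    """Local cache of global-directory entries with a time-to-live (issue 5)."""

    def __init__(self, lookup_remote, ttl_seconds=30.0):
        self.lookup_remote = lookup_remote  # callable: fragment name -> list of sites
        self.ttl = ttl_seconds
        self._entries = {}                  # fragment name -> (sites, expiry time)

    def lookup(self, fragment):
        entry = self._entries.get(fragment)
        if entry is not None:
            sites, expires_at = entry
            if time.monotonic() < expires_at:
                return sites                          # fresh entry: no network round-trip
        sites = self.lookup_remote(fragment)          # stale or missing: ask the directory
        self._entries[fragment] = (sites, time.monotonic() + self.ttl)
        return sites

    def invalidate(self, fragment):
        """Drop an entry when an invalidation message arrives (issue 4)."""
        self._entries.pop(fragment, None)

# Usage with a stubbed remote lookup:
cache = DirectoryCache(lookup_remote=lambda name: ["site_A", "site_B"])
print(cache.lookup("EMP_asia"))   # -> ['site_A', 'site_B']
```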

Here’s a concise set of alternative design strategies for distributed database design,
organized for easy note‐taking and each point backed by citations.

A variety of placement alternatives (centralized, replicated, partitioned, hybrid) define how


data is distributed across sites www.slideshare.net ; while design approaches (top-down, bottom-
up, mixed, backward-forward, activity-analysis) guide the methodology for creating the
global schema and allocating data NIST Publications . Under the hood, data‐allocation strategies
—fragmentation, replication, or a combination—determine performance, reliability, and
scalability trade-offs in the final system TutorialsPoint .

## A. Data-Placement Alternatives
1. Centralized

All data resides on a single site; simple to manage but poor scalability and fault
tolerance www.slideshare.net .

2. Replicated

Entire database is copied to multiple sites; improves read performance and availability but incurs high update-synchronization cost www.slideshare.net .

3. Partitioned (Fragmented)

Database split into disjoint fragments (horizontal, vertical, hybrid) stored at different
sites; balances load and reduces network traffic www.slideshare.net .

4. Hybrid

Combines replication and fragmentation: some fragments are replicated across multiple sites for critical data, others are stored singularly www.slideshare.net .

## B. Design-Methodology Approaches
1. Top-Down

Start from a global conceptual schema, then refine into fragments, allocate to sites
for optimal distribution; yields homogeneous systems ie.utcluj.ro Google Groups .

2. Bottom-Up

Integrate existing local schemas into a global schema; suits systems where
databases already exist at sites ie.utcluj.ro Google Groups .

3. Mixed (Hybrid)

Partition requirements top-down into subproblems, then apply bottom-up to each partition and merge; practical for complex, evolving systems ie.utcluj.ro .

4. Backward-Forward (Output-Driven)

Identify key application outputs, then determine which data sources/fragments must be placed to support those outputs NIST Publications .

5. Activity-Analysis (Process-Driven)

Analyze business processes to derive data allocation needs; fragments and replicas
are allocated based on workflow and transaction patterns NIST Publications .

## C. Data-Allocation Strategies
1. Fragmentation

Horizontal: Rows distributed by predicate (e.g., region)

Vertical: Columns grouped into fragments (e.g., sensitive vs. non-sensitive)

Hybrid: Combines both for fine-grained control TutorialsPoint .

2. Replication

Copies of data maintained at multiple sites for fault tolerance and reduced access
latency

Must manage replica control (sync protocols, consistency levels) TutorialsPoint .

3. Combined Strategies

Most real systems use fragmentation for data locality plus selective replication for
critical fragments, balancing performance and update cost TutorialsPoint .

These strategies provide a toolkit for designing scalable, reliable distributed databases—select and
tailor them based on workload, network topology, and administrative autonomy.



Here’s a concise overview of the key design issues you’ll face when architecting a distributed
database, organized for your notes:

Distributed database design must reconcile heterogeneity and openness in hardware, OS, and software interfaces; strategize data placement via fragmentation and replication; optimize query processing across sites; enforce transactional consistency and deadlock control; manage network communication overhead; and guarantee scalability, fault-tolerance and security throughout the system basiccomputerscience.softecks.in ResearchGate Scribd GeeksforGeeks ds.cs.luc.edu .

A. System Heterogeneity & Openness


Heterogeneity: Different hardware, OS, and DBMS products require middleware for
schema mapping and protocol translation basiccomputerscience.softecks.in .

Openness: Published interfaces and uniform communication mechanisms let new resources be added seamlessly basiccomputerscience.softecks.in .

B. Data Distribution & Fragmentation


Fragmentation: Choosing horizontal, vertical, or hybrid splits impacts locality, query
cost, and rebalancing complexity ResearchGate .

Allocation Strategy: Deciding which fragments live on which sites determines load
balance and response time.

C. Replication & Consistency


Replication Control: Maintaining multiple copies boosts read performance but incurs
update‐synchronization overhead and consistency challenges Medium .

Consistency Models: Trade‐offs between strong (synchronous) and eventual
(asynchronous) consistency affect availability and performance.

D. Query Processing & Optimization


Distributed Query Decomposition: Breaking global queries into site‐specific subqueries
and shipping data judiciously is NP‐hard Scribd .

Cost Estimation: Must account for local CPU, I/O, and inter‐site transfer costs when
choosing join orders and data shipping strategies Scribd .

E. Transaction Management & Concurrency


Two‐Phase Commit (2PC): Ensures atomic distributed transactions but introduces
latency and blocking under coordinator failure Scribd .

Deadlock Detection: Requires global wait-for graphs and coordinated resolution protocols across sites Scribd .

F. Communication & Network Overhead


Latency & Bandwidth: High round‐trip times and limited throughput can dominate
query and transaction response times GeeksforGeeks .

Protocol Choice: Synchronous vs. asynchronous message passing impacts reliability and
performance.

G. Scalability, Reliability & Fault Tolerance


Scalability: Systems must handle growth in data volume, users, and sites without
degradation basiccomputerscience.softecks.in .

Failure Masking: Replication, checkpointing, and recovery protocols hide node or link
failures from users ds.cs.luc.edu .

H. Security & Access Control


Authentication & Encryption: Secure data in transit and at rest across untrusted
networks basiccomputerscience.softecks.in .

Authorization: Fine‐grained access control must span multiple DBMS products and sites.

These distilled points will help you structure your notes on distributed database design challenges.


Here’s a concise yet comprehensive breakdown of Fragmentation and Data Allocation in


distributed database design, organized for quick note‐taking.

Summary:
Fragmentation divides relations into smaller pieces to improve query performance and data
locality, with three main types—horizontal, vertical, and hybrid—each offering trade-offs in
complexity and efficiency GeeksforGeeks BINUS Student Information System . Data allocation then decides
where to place these fragments (or full/partial replicas) across sites, using centralized,
partitioned, replicated, or hybrid strategies to balance performance, availability, and update
costs TutorialsPoint ScienceDirect . Optimal designs consider workload patterns, network costs, and
site capacities, often combining fragmentation with selective replication for critical data
Hevo Data Vaia .

## A. Fragmentation
1. Definition
Fragmentation is the process of breaking a global relation into smaller, disjoint pieces
(fragments) that can be stored at different sites GeeksforGeeks .

2. Types of Fragmentation
1. Horizontal Fragmentation

Divides a relation into subsets of rows based on a predicate (e.g., region = “Asia”)
GeeksforGeeks .

Use-case: Localize queries that access only a subset of tuples.

2. Vertical Fragmentation

Splits a relation into subsets of columns, ensuring each fragment retains the
primary key to allow reconstruction via join BINUS Student Information System .

Use-case: Isolate frequently accessed attributes or sensitive data.

3. Hybrid (Mixed) Fragmentation

Combines horizontal and vertical fragmentation for fine-grained control (e.g., first
horizontal by region, then vertical on each) GeeksforGeeks .
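To make the three types concrete, here is a minimal Python sketch over an in-memory EMPLOYEE relation; the schema and data are illustrative assumptions.

```python
# Illustrative EMPLOYEE relation; eno is the primary key.
employees = [
    {"eno": 1, "name": "Asha",  "region": "Asia",   "salary": 50000},
    {"eno": 2, "name": "Bruno", "region": "Europe", "salary": 62000},
    {"eno": 3, "name": "Chen",  "region": "Asia",   "salary": 58000},
]

# Horizontal fragmentation: rows selected by a predicate (here, region).
emp_asia   = [t for t in employees if t["region"] == "Asia"]
emp_europe = [t for t in employees if t["region"] == "Europe"]

# Vertical fragmentation: column groups, each keeping the primary key (eno)
# so the relation can be rebuilt by a join on eno.
emp_public  = [{"eno": t["eno"], "name": t["name"], "region": t["region"]} for t in employees]
emp_payroll = [{"eno": t["eno"], "salary": t["salary"]} for t in employees]

# Reconstruction: union for horizontal fragments, join on eno for vertical ones.
assert sorted(emp_asia + emp_europe, key=lambda t: t["eno"]) == employees
rebuilt = [{**p, **s} for p in emp_public for s in emp_payroll if p["eno"] == s["eno"]]
assert rebuilt == employees
```

Hybrid fragmentation would simply apply the vertical split to each regional fragment in turn.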

3. Advantages & Disadvantages


Advantages:

Improves data locality and reduces query response time by accessing only relevant
fragments GeeksforGeeks .

Enables parallel query execution across sites.

Disadvantages:

May incur complex reconstruction overhead when queries span multiple fragments
GeeksforGeeks .

Recursive fragmentation can lead to high management cost and network traffic
GeeksforGeeks .

## B. Data Allocation
1. Definition
Data Allocation is the process of deciding which fragments (or full/partial replicas) are
stored at which sites in the network TutorialsPoint .

2. Allocation Strategies
1. Centralized Allocation

All data stored at a single site; simplest but poor scalability and fault tolerance
TutorialsPoint .

2. Partitioned (Fragmented) Allocation

Each fragment stored at exactly one site, leveraging fragmentation for load
distribution and reduced inter-site traffic TutorialsPoint .

3. Replicated Allocation

Copies of (full or partial) data stored at multiple sites to improve availability and
read performance, at the cost of update synchronization overhead TutorialsPoint .

4. Hybrid Allocation

Combine partitioning and selective replication: critical fragments are replicated, while others remain singular to balance cost and performance Hevo Data .

3. Factors Influencing Allocation


Workload Characteristics: Read vs. update frequency determines where to replicate vs.
partition Vaia .

Site Capabilities: CPU, storage, and network bandwidth of each site Vaia .

Network Topology & Latency: Place data close to the dominant user group to minimize
access delays ScienceDirect .

Reliability Requirements: Critical data may need more replicas for fault tolerance
ScienceDirect .

4. Combined Approaches
Fragment-then-Replicate: First apply fragmentation to localize data, then replicate hot
fragments to multiple sites for load balancing Hevo Data .

Dynamic Allocation: Adjust allocation at runtime based on monitoring (e.g., via mobile
agents or learning-based models) to adapt to changing workloads www2.math.ethz.ch .

These bullet-points should serve as a solid foundation for your notes on fragmentation and data
allocation in distributed databases.


Semantic data control in a distributed DBMS encompasses the mechanisms that enforce the
meaning, integrity, and access rights of data across multiple sites. It includes functions
such as view management, security control, and semantic integrity control, all aimed at
ensuring that authorized users perform correct operations on consistent data despite
distribution SpringerLink .

## 1. Semantics (Semantic Data Models)
A Semantic Data Model (SDM) is a high-level, meaning-oriented database description
formalism that captures entities, their classifications, and interconnections to reflect
real-world application semantics Wikipedia .

In a distributed environment, semantic models guide how global schemas relate to local
schemas, ensuring that data fragments at different sites maintain consistent
interpretations of entities and relationships GoodData .

## 2. Data Control
Data control refers to enforcing regulations—such as integrity constraints,
authorization, and privacy rules—so that only correct and permitted data is available to
authenticated users and applications TutorialsPoint .

Integrity constraints (e.g., key, domain, and referential constraints) must be checked
across sites, requiring coordination protocols to prevent anomalies during distributed
updates Purdue Computer Science .

Authentication and authorization ensure that subjects (users or processes) are identified, verified, and granted rights according to an authorization matrix, often implemented via distributed access control lists or role-based schemes Purdue Computer Science .

Privacy and security controls screen sensitive data from unauthorized sites and encrypt
communications to protect against eavesdropping in untrusted networks TutorialsPoint .

## 3. View Management
A view is a virtual relation defined by a query over base relations, providing logical data
independence and tailored representations for different users prezi.com .

In a Single-Site DBMS, views are implemented by rewriting user queries or materializing view contents; in a Distributed DBMS, view management extends to handling shorthand views (simplified query interfaces) and protection views (access-restricted projections) across multiple sites VLDB .

Distributed view processing involves two main tasks: (1) query decomposition, where a global view query is broken into subqueries for relevant fragments, and (2) reconstruction, where results from local subqueries are integrated to form the final view answer ResearchGate .

Challenges include ensuring up-to-date view contents under data replication, optimizing
view query plans to minimize cross-site data transfer, and enforcing view-specific
security policies consistently across replicas Medium .

Protection views add an extra layer of security by exposing only a filtered subset of
attributes or tuples to certain users, enforcing both read and write restrictions
transparently VLDB .

These concise points cover the semantic foundations, data control mechanisms, and view
management strategies essential for maintaining correct, secure, and meaningful data access in
distributed database systems.


Data security in distributed database systems encompasses a range of techniques and controls designed to protect data across multiple sites while preserving the core principles of
confidentiality, integrity, and availability. Effective security must cover user and process
authentication, robust authorization mechanisms, encryption of data both at rest and in
transit, fine-grained access control models, continuous auditing and monitoring, intrusion
detection, reliable backup and recovery strategies, adherence to legal and regulatory
requirements, and network-level protections. Emerging technologies—such as queryable
encryption—promise to further enhance security by enabling operations on encrypted data
without exposing plaintext, addressing new threats in distributed environments.

## 1. CIA Triad (Confidentiality, Integrity, Availability)


Confidentiality ensures that data is not disclosed to unauthorized parties; mechanisms
include encryption, access controls, and secure communication protocols Informa TechTarget .

Integrity guarantees that data cannot be altered undetectably, enforced via checksums, digital signatures, and transactional controls Informa TechTarget (see the sketch below).

Availability ensures reliable access to data and services despite failures or attacks,
achieved through replication, fault-tolerant architectures, and rapid recovery
mechanisms Informa TechTarget .
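As a concrete illustration of the integrity leg, here is a minimal Python sketch using the standard-library hmac module: a keyed checksum stands in for the checksums and digital signatures mentioned above (the key and record format are illustrative assumptions).

```python
import hmac
import hashlib

SECRET_KEY = b"shared-site-key"   # assumed to be distributed to sites out of band

def tag(payload: bytes) -> str:
    """Keyed checksum (HMAC-SHA256) attached to data shipped between sites."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, received_tag: str) -> bool:
    """Constant-time comparison: any alteration in transit changes the tag."""
    return hmac.compare_digest(tag(payload), received_tag)

record = b"eno=3,salary=58000"
t = tag(record)
assert verify(record, t)                      # intact record passes
assert not verify(b"eno=3,salary=99000", t)   # tampered record is detected
```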

## 2. Authentication & Authorization


Authentication verifies the identity of users and services before granting access;
common methods include passwords, multi-factor authentication (MFA), and certificate-
based schemes GeeksforGeeks .

Authorization determines what authenticated entities may do, using models such as
Role-Based Access Control (RBAC), Mandatory Access Control (MAC), and Discretionary
Access Control (DAC) Delinea .

## 3. Encryption Techniques
Data-in-Transit Encryption employs protocols like TLS/SSL or IPSec to prevent
eavesdropping on network links GeeksforGeeks .

Data-at-Rest Encryption uses Transparent Data Encryption (TDE) or disk-level encryption to protect stored data; algorithms include AES for symmetric encryption and RSA/ECC for asymmetric encryption Splunk .

Queryable Encryption enables search and query operations over encrypted data
without decryption, reducing exposure risks during processing WIRED .

## 4. Access Control Models


Role-Based Access Control (RBAC) assigns permissions based on user roles, simplifying
management in large systems Delinea .

Mandatory Access Control (MAC) enforces strict policies determined by a central authority, often used for classified or sensitive data Delinea .

Discretionary Access Control (DAC) allows data owners to grant permissions, offering
flexibility but requiring careful governance Delinea .

## 5. Auditing & Monitoring


Database Auditing records selected user actions (e.g., SQL statements, object access) to
detect misuse and support forensics Oracle Documentation .

Continuous Monitoring tracks system events and configurations, triggering alerts for
anomalous activity and policy violations Liquid Web .

## 6. Intrusion Detection
Network-Based IDS (NIDS) inspects traffic between sites for known attack signatures or
anomalies GeeksforGeeks .

Host-Based IDS (HIDS) monitors individual database servers for suspicious processes,
file changes, or policy breaches GeeksforGeeks .

## 7. Backup, Recovery & Fault Tolerance


Regular Backups (full, incremental, differential) stored off-site or in the cloud ensure
data can be restored after loss DevOps.com .

Replication across multiple sites provides high availability and quick failover; must be
coordinated to maintain consistency Daily.dev .

Disaster Recovery Plans define RPOs (Recovery Point Objectives) and RTOs (Recovery
Time Objectives) to meet business requirements Daily.dev .

## 8. Compliance & Legal Considerations

Regulatory Frameworks such as GDPR, HIPAA, and PCI DSS mandate specific security
controls (encryption, access logs, breach notification) for distributed data BlueXP .

Data Residency Laws may restrict where certain data can be stored or processed,
influencing data placement and replication strategies TiDB .

## 9. Network Security
Secure Communication Channels (VPNs, dedicated lines) protect data flows between
distributed nodes GeeksforGeeks .

Firewalls & Network Segmentation isolate database servers from untrusted networks
and limit attack surface GeeksforGeeks .

## 10. Emerging Security Techniques


Crypto Anchors embed encryption keys in secure hardware modules (HSMs) to slow
unauthorized data exfiltration and aid breach detection WIRED .

Homomorphic Encryption allows computations on ciphertexts, protecting data in use, but it is still under development for practical performance WIRED .

These concise points cover the critical aspects of data security in Distributed DBMS
environments, balancing technical controls with operational considerations to safeguard
data across dispersed systems.


🔹 Semantic Integrity Control

Semantic integrity control ensures that data not only meets syntactic rules (e.g., type,
format) but also business‐level meaning and inter‐entity relationships, even when data is
distributed across multiple sites.

1. Types of Semantic Integrity Constraints


Domain Constraints

Values of an attribute must fall within a defined set or range.

Key Constraints

Uniqueness of primary keys to identify tuples globally.

Referential Integrity

Foreign-key relationships must link to existing primary-key values, even across fragments.

User‐Defined (Business) Rules

Complex predicates (e.g., “order total must not exceed credit limit”) that reflect
organizational policies.

2. Enforcement Strategies in a Distributed DBMS


Local Enforcement

Check constraints that reference only locally stored data at each site.

Example: domain checks or local foreign keys within a fragment.

Global Enforcement

Enforce cross‐site constraints by coordinating among involved sites.

Often uses two‐phase protocols:

1. Validation Phase – each site confirms local part of the constraint.

2. Commit Phase – only if all sites agree is the transaction committed.

Deferred Constraint Checking

Collect updates and validate semantic rules at transaction end, reducing interim
communication.

Requires rollback if any global rule is violated.

Predicate‐Based Partitioning

Fragment relations such that semantically related tuples reside together, minimizing
cross‐site checks.

3. Coordination Protocols
Two‐Phase Commit (2PC)

Ensures atomicity of constraint checks and updates across sites.

Distributed Locking or Timestamps

Prevent concurrent changes from violating semantic rules.

Eventual Consistency with Compensation

For less critical rules, allow temporary violations and run compensating transactions
later.

4. Design Considerations
Minimize Cross‐Site Dependencies

Fragment and allocate data to reduce the number of global constraint checks.

Balance Performance vs. Integrity

Immediate (eager) checks increase correctness but add latency; deferred checks
improve throughput but risk temporary inconsistencies.

Metadata Management

Store semantic‐constraint definitions in the global directory, accessible to all sites for
enforcement.

These points outline how semantic integrity control is defined and enforced in a distributed
database environment, helping you capture both the types of constraints and the mechanisms for
maintaining them across multiple sites.

Summary
Query processing transforms high-level user queries into efficient execution plans, aiming to
minimize overall resource consumption (CPU, I/O, network) while ensuring correctness and
reusability. The main objectives include (1) translating queries into algebraic operations; (2) decomposing and localizing subqueries; (3) choosing optimal execution strategies globally and locally; (4) minimizing costs via optimization; and (5) generating and caching execution plans for reuse.

Objectives of Query Processing


1. Translate High-Level Queries

Convert declarative, high-level queries into equivalent low-level operations expressed in relational algebra or procedural code Studocu .

2. Query Decomposition

Break global queries into subqueries targeted at specific fragments or sites, facilitating parallelism and data localization www.slideshare.net .

3. Data Localization

Determine which data fragments or replicas each subquery should access to minimize data shipping across the network www.slideshare.net .

4. Global Optimization

Explore and select among alternative global execution strategies (e.g., join ordering,
semijoin usage) to minimize distributed costs DSpace .

5. Local Optimization

For each site, apply cost-based optimization (e.g., index selection, join methods) to
subqueries before execution Purdue Computer Science .

6. Cost Minimization

Use a cost model combining CPU, I/O, and communication costs to guide
optimization decisions, aiming to produce plans with minimal expected execution
time www.slideshare.net Wikipedia .

7. Correctness & Equivalence

Ensure the chosen plan is semantically equivalent to the original query by applying
equivalence-preserving transformation rules (e.g., commutativity, associativity of
joins) SpringerLink .

8. Execution Plan Generation

Construct a detailed query execution plan (QEP) that specifies operators, data flows,
and site assignments Wikipedia .

9. Plan Reuse & Caching

Cache optimized plans or subplans for recurring queries to reduce future optimization overhead Microsoft Learn .

10. Adaptivity

Dynamically adjust plans based on runtime statistics or changing workloads, re-optimizing or re-routing subqueries as needed SpringerLink .

These objectives together ensure that distributed query processing is both efficient and correct,
delivering fast responses while minimizing resource usage and adapting to evolving system
conditions.


Overview
A query processor translates high-level queries into low-level operations and executes them
efficiently. Its architecture is typically modular, separating compilation, optimization, and
execution. Key characterizations include the processing layers, the phases each query
undergoes, and quality metrics (like performance and extensibility).

1. Architectural Layers
1. Query Parser / DDL Compiler

Parses SQL statements and validates schema objects.

Generates an internal representation (parse tree).

2. DML Compiler / Query Translator

Converts the parse tree into a logical algebra expression.

Performs semantic checks (type, name resolution).

3. Query Optimizer

Logical Optimization: Applies algebraic rewrite rules (e.g., join commutativity).

Physical Optimization: Chooses algorithms (e.g., nested-loop vs. hash join) and
access paths (indexes vs. full scans).

Uses a cost model (CPU, I/O, network) to compare plans.

4. Plan Generator & Code Generator

Produces an executable plan or low-level code (often in an internal “plan language”).

5. Execution Engine / Runtime

Interprets and runs the chosen plan operators.

Manages buffer pools, I/O scheduling, and inter-site communication in distributed settings.

2. Processing Phases
1. Parsing & Validation

Syntax checking; look up metadata in the catalog.

2. Normalization & Rewriting

Simplify and canonicalize the query (e.g., predicate push-down).

3. Plan Enumeration & Costing

Generate alternative plans (join orders, access methods).

Estimate cost for each using statistics.

4. Plan Selection

Pick the lowest-cost plan within resource limits.

5. Execution & Feedback

Execute operators; collect runtime statistics for adaptive re-optimization if supported.

3. Classification of Query Processors
Static vs. Adaptive

Static: Plan is fixed before execution.

Adaptive: Adjusts plan at runtime based on actual cardinalities or resource availability.

Centralized vs. Distributed

Centralized: Single-site processing; no network overhead.

Distributed: Handles data localization, subquery shipping, and network costs.

Rule-Based vs. Cost-Based

Rule-Based: Applies fixed heuristics (e.g., join smaller tables first).

Cost-Based: Evaluates multiple plans using a cost model.

4. Quality Metrics
Performance: Throughput and response time under typical workloads.

Scalability: Ability to maintain performance as data size and user concurrency grow.

Extensibility: Ease of adding new operators, data types, or statistics.

Portability: Ability to run on different hardware or integrate with various storage engines.

Robustness: Stability under skewed data or unexpected workloads.

These points give you a clear framework for how query processors are structured, operate, and are
evaluated—ideal for concise note-making.

Query processing in a distributed DBMS involves a sequence of transformations that convert a high-level declarative query into an efficient distributed execution plan GeeksforGeeks Medium . The process is structured into four main layers: Query Decomposition, Data Localization, Global Query Optimization, and Distributed Execution Ques10 www.slideshare.net . Each layer plays a distinct role, from rewriting the query at the global schema level to orchestrating parallel execution across multiple sites sungsoo.github.io www.slideshare.net . This layered approach helps manage complexity and optimize performance in distributed environments Ques10 sungsoo.github.io .

1. Query Decomposition
Decomposes the original query on global (distributed) relations into an equivalent algebraic
query over those global relations, applying transformations such as predicate push-down
and view expansion Ques10 www.slideshare.net .

2. Data Localization
Maps the algebraic query on global relations to algebraic queries on local fragments by
determining which data fragments or replicas each subquery must access, thereby
minimizing unnecessary data transfer Ques10 www.slideshare.net .

3. Global Query Optimization


Generates alternative global execution plans—considering factors like join ordering, semijoin
strategies, and site-to-site shipping—and selects the lowest-cost plan based on a cost model
that accounts for CPU, I/O, and communication overhead Ques10 www.slideshare.net .
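A toy sketch of such a cost model in Python: total cost is a weighted sum of CPU, I/O, and communication components, and the cheapest candidate wins. The weights and the two candidate plans are illustrative assumptions, not values from any real optimizer.

```python
# Illustrative weights; communication usually dominates in a distributed setting.
W_CPU, W_IO, W_COMM = 1.0, 5.0, 20.0

def plan_cost(cpu_units, io_pages, bytes_shipped):
    """Additive cost model: CPU + I/O + network transfer (charged per KiB)."""
    return W_CPU * cpu_units + W_IO * io_pages + W_COMM * (bytes_shipped / 1024)

candidates = {
    "ship EMP to DEPT's site, join there":  plan_cost(400, 120, 2_000_000),
    "semijoin: ship DEPT keys, filter EMP": plan_cost(550, 140, 150_000),
}
best = min(candidates, key=candidates.get)
print(best)   # the semijoin plan wins because it ships far fewer bytes
```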

4. Distributed Execution
Executes the chosen distributed query plan by coordinating subqueries at each site,
managing inter-site communication, buffering, and result assembly to produce the final
answer to the user’s query www.slideshare.net sungsoo.github.io .


Here’s a focused breakdown of Query Decomposition for your notes:

1. Purpose
Transform a high-level global query into an equivalent algebraic form over global
relations.

Prepare the query for localization and optimization by simplifying and normalizing it.

2. Steps in Decomposition
1. View Expansion

Replace any view names in the query with their defining expressions.

Ensures you work with base relations and fragmentation definitions.

2. Predicate Push-Down

Move selection predicates as close to the data sources as possible.

Reduces the volume of data shipped by filtering early.

3. Unnesting Nested Queries

Convert nested subqueries (e.g., IN, EXISTS) into joins or semijoins where feasible.

Simplifies the structure and exposes more optimization opportunities.

4. Query Rewriting

Apply algebraic equivalences (commute/associate joins, distribute selections) to obtain a canonical form.

Aids the optimizer by standardizing join orders and operator placement.

5. Fragment-Aware Rewriting

Incorporate fragmentation metadata: turn global relations into unions of their fragments.

E.g., R = R₁ ∪ R₂ ∪ …, so a query on R becomes queries on each Rᵢ.
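The following minimal Python sketch ties steps 2 and 5 together: a selection on the global relation R becomes the union of the same selection pushed down to each horizontal fragment (fragment contents are illustrative assumptions).

```python
# R is horizontally fragmented as R = R1 ∪ R2 (illustrative fragments).
fragments = {
    "R1": [{"eno": 1, "region": "Asia",   "salary": 50000}],
    "R2": [{"eno": 2, "region": "Europe", "salary": 62000}],
}

def select_from_R(query_pred):
    """SELECT * FROM R WHERE query_pred, rewritten over the fragments."""
    result = []
    for name, rows in fragments.items():
        local = [t for t in rows if query_pred(t)]   # push the selection down
        print(f"{name}: ships {len(local)} tuple(s)")
        result.extend(local)                         # UNION of the local results
    return result

print(select_from_R(lambda t: t["salary"] > 55000))
# R1 ships 0 tuples, R2 ships 1: only qualifying tuples cross the network.
```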

3. Output of Decomposition

An algebraic query expressed over the set of fragments (and replicas) rather than over
monolithic relations.

Ready for the next layer—Data Localization—where each fragment reference is mapped
to its storing site.

This gives you the precise sequence and rationale for decomposing queries in a distributed DBMS.

Summary
Data localization is the process in distributed query processing that maps an algebraic
query over global relations onto site-specific subqueries by using metadata on
fragmentation and replication, with the primary goal of minimizing inter-site data transfer
and network latency Dimi Uniud Users Scribd .

Purpose
Minimize Data Movement: Restrict data access to only those fragments that satisfy
query predicates, thereby reducing network costs GeeksforGeeks .

Exploit Parallelism: Enable simultaneous execution of subqueries at different sites by pinpointing local data sources.

Prepare for Optimization: Produce localized subqueries that can be further optimized
both globally and locally.

Steps in Data Localization


1. Identify Relevant Fragments

Use the global directory to determine which data fragments (horizontal/vertical) are
needed for each relation in the query www.slideshare.net .

2. Determine Fragment-Site Mapping

Look up the location of each fragment (or its replicas) to know which site(s) should
execute the subquery Dimi Uniud Users .

3. Generate Localized Subqueries

Rewrite the global algebraic query into subqueries that reference only the local
fragments, applying selection and projection early Scribd .

4. Assemble Reconstruction Operators

Insert UNION (for horizontal fragments) or JOIN (for vertical fragments) operators to
recompose the global relation from local results Purdue Computer Science .

Techniques to Minimize Data Transfer


Semi-Join Reduction: Ship only projected join attributes from one site to another to filter
tuples before the full join Ques10 .

Join Shipping vs. Data Shipping: Decide between moving data fragments to the site of
the join or moving the join operation to data sites based on cost estimates infolab.usc.edu .

Predicate Pushdown: Apply selection predicates at the fragment level to reduce tuple
counts early.
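Here is a minimal Python sketch of semi-join reduction between two sites; the relations and data are illustrative assumptions.

```python
# EMP(eno, dno) held at site 1; DEPT(dno, dname) held at site 2 (illustrative).
emp_site1  = [{"eno": 1, "dno": 10}, {"eno": 2, "dno": 20}, {"eno": 3, "dno": 99}]
dept_site2 = [{"dno": 10, "dname": "R&D"}, {"dno": 20, "dname": "Sales"}]

# Step 1: site 2 ships only the projected join attribute (a small message).
dno_values = {d["dno"] for d in dept_site2}

# Step 2: site 1 computes the semi-join EMP ⋉ DEPT, dropping non-matching tuples.
emp_reduced = [e for e in emp_site1 if e["dno"] in dno_values]

# Step 3: only the reduced EMP travels to site 2 for the full join.
joined = [{**e, **d} for e in emp_reduced for d in dept_site2 if e["dno"] == d["dno"]]
print(joined)   # the eno=3 tuple never crossed the network: dno 99 matched nothing
```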

Challenges & Optimizations


Stale Metadata: Cached directory information can lead to incorrect fragment locations;
employ TTL or version checks Medium .

Load Imbalance: Skewed fragment distribution may overload certain sites; adaptive
reallocation or replication can help YouTube .

Complexity of Reconstruction: Queries spanning many fragments incur overhead in merging results; query optimizers must balance between fragmentation granularity and recomposition cost.

Adaptive Localization: Monitoring runtime statistics to adjust localization strategies dynamically for evolving workloads.

These concise points outline how distributed data localization transforms global queries into
efficient, site-specific operations, forming a critical foundation for subsequent optimization and
execution.


Concurrency Control in Centralized Database Systems


1. Objective
Maintain Correctness: Ensure interleaved transactions yield the same result as some
serial execution (serializability).

Preserve ACID: Particularly isolation and consistency in a multi-user environment.

2. Key Concepts
Transaction: A sequence of read/write operations treated as a single logical unit.

Conflict: Occurs when two transactions access the same data item and at least one
writes it.

3. Approaches to Concurrency Control

A. Lock-Based Protocols

Shared (S) and Exclusive (X) Locks

S-lock: multiple transactions may read.

X-lock: exclusive write access.

Two-Phase Locking (2PL)

Growing phase: acquire all needed locks.

Shrinking phase: release locks; no new locks acquired.

Guarantees serializability but can lead to deadlocks.
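A minimal Python sketch of a shared/exclusive lock table in the spirit of 2PL; it is illustrative: a conflicting request returns False instead of blocking, and release_all plays the role of the shrinking phase (under strict 2PL it would run only at commit or abort).

```python
class LockTable:
    def __init__(self):
        self.locks = {}   # item -> (mode, set of holder txn ids), mode in {"S", "X"}

    def acquire(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":   # S is compatible with S
            holders.add(txn)
            return True
        if holders == {txn}:                   # sole holder may re-lock or upgrade to X
            self.locks[item] = ("X" if mode == "X" else held_mode, holders)
            return True
        return False                           # any other S/X combination conflicts

    def release_all(self, txn):
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lt = LockTable()
assert lt.acquire("T1", "x", "S") and lt.acquire("T2", "x", "S")  # shared readers
assert not lt.acquire("T2", "x", "X")   # upgrade refused while T1 also holds S
lt.release_all("T1")
assert lt.acquire("T2", "x", "X")       # T2 can now write
```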

B. Time-Stamp Ordering

Assign each transaction a unique timestamp.

Read/write rules:

A transaction’s read/write is allowed only if its timestamp ordering does not violate
serial order.

Avoids blocking but may lead to many rollbacks.

C. Optimistic Concurrency Control

Phases:

1. Read Phase: execute using local copies without locks.

2. Validation Phase: check for conflicts at commit time.

3. Write Phase: if valid, apply updates; otherwise, abort and retry.

Best when conflicts are rare.
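A minimal Python sketch of the validation phase (a backward-validation scheme: the committing transaction's read set is checked against the write sets of transactions that committed while it ran; the bookkeeping is an illustrative assumption).

```python
committed = []   # (commit_seq, write_set) of already-finished transactions

def validate_and_commit(start_seq, read_set, write_set):
    """start_seq = number of commits that had happened when the txn began."""
    for commit_seq, other_writes in committed:
        if commit_seq > start_seq and read_set & other_writes:
            return False                       # it read data someone overwrote: abort
    committed.append((len(committed) + 1, write_set))
    return True                                # write phase would apply updates here

assert validate_and_commit(start_seq=0, read_set={"x"}, write_set={"x"})   # T1 commits
# T2 also started at seq 0 and read x, but T1 (commit_seq 1 > 0) wrote x: T2 aborts.
assert not validate_and_commit(start_seq=0, read_set={"x", "y"}, write_set={"y"})
```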

D. Multiversion Concurrency Control (MVCC)

Maintain multiple versions of each data item, each annotated with a timestamp.

Readers see the version matching their start time, writers create new versions.

Supports high read throughput; avoids read–write blocking.
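A minimal Python sketch of a multiversion store (illustrative): writers append timestamped versions, and a reader sees the newest version no later than its snapshot, so reads never block writes.

```python
versions = {}   # item -> list of (write_ts, value), kept sorted by write_ts

def write(item, ts, value):
    versions.setdefault(item, []).append((ts, value))
    versions[item].sort()

def read(item, snapshot_ts):
    """Return the newest version with write_ts <= snapshot_ts."""
    visible = [v for ts, v in versions.get(item, []) if ts <= snapshot_ts]
    return visible[-1] if visible else None

write("x", ts=1, value="v1")
write("x", ts=5, value="v2")
assert read("x", snapshot_ts=3) == "v1"   # an older snapshot still sees v1
assert read("x", snapshot_ts=9) == "v2"   # a later snapshot sees the new version
```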

4. Deadlock Handling (for Lock-Based)


Detection: Build a waits-for graph; if a cycle exists, a deadlock is present.

Resolution: Abort one or more transactions to break the cycle (victim selection by
cost/age).
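A minimal Python sketch of the detection step: depth-first search for a cycle in the waits-for graph (transaction names are illustrative).

```python
def has_deadlock(waits_for):
    """waits_for maps each transaction to the set of transactions it waits on."""
    WHITE, GRAY, BLACK = 0, 1, 2            # unvisited / on current path / done
    color = {t: WHITE for t in waits_for}

    def dfs(t):
        color[t] = GRAY
        for u in waits_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:  # back edge: a cycle, hence a deadlock
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in waits_for)

assert has_deadlock({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}})      # T1→T2→T3→T1
assert not has_deadlock({"T1": {"T2"}, "T2": {"T3"}, "T3": set()})   # a simple chain
```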

5. Performance Considerations
Lock Granularity: Table-, page-, or row-level locks balance concurrency vs. overhead.

Isolation Levels:

Serializable: strictest, full isolation.

Repeatable Read, Read Committed, Read Uncommitted: progressively weaker, trading isolation for performance.

These points cover the essentials of designing and understanding concurrency control in a
centralized DBMS, suitable for concise exam notes.

Summary
Concurrency control in Distributed Database Management Systems (DDBMSs) extends
centralized techniques to ensure global serializability and ACID properties across multiple
sites. It must coordinate sub-transactions, manage distributed locks or timestamps, and
integrate with atomic commitment protocols (e.g., Two-Phase Commit) to handle failures.
Key goals are to prevent conflicts, avoid deadlocks, and minimize communication
overhead, while ensuring correctness through global serialization.

1. Goals and Challenges


Global Serializability: Ensure that the interleaved execution of sub-transactions across
sites is equivalent to some serial order of the global transactions Wikipedia .

Conflict Detection & Resolution: Coordinate conflicting operations on the same data
item held at different replicas or fragments GeeksforGeeks .

Deadlock Management: Detect and resolve deadlocks that may involve locks spanning
multiple sites TutorialsPoint .

Fault Tolerance: Integrate with atomic commitment (e.g., 2PC) so that in the event of
node or network failure, either all sub-transactions commit or none do Wikipedia .

Performance Overhead: Balance strict coordination against network latency and message costs.

2. Lock-Based Protocols
2.1 Strong Strict Two-Phase Locking (SS2PL)
Acquire all needed S/X locks before any release (growing phase), release locks only after
transaction end (shrinking phase).

Ensures both serializability and strictness (recoverability) Wikipedia .

Distributed Lock Manager (DLM): A centralized or hierarchical lock manager tracks locks across sites.

2.2 Distributed Two-Phase Locking Variants


Centralized DLM: Single lock server; simpler but a bottleneck and single point of failure.

Distributed DLM: Lock managers at each site coordinate via message passing; reduces
bottlenecks but increases protocol complexity and latency TutorialsPoint .

3. Timestamp-Based Protocols
Global Timestamps: Assign each transaction a unique timestamp at start; ensure
operations follow timestamp order.

Read/Write Rules: A read or write is allowed only if it does not violate the timestamp
order; otherwise, the transaction is aborted and restarted TutorialsPoint .

Advantages: Non-blocking, avoids deadlocks.

Drawbacks: High abort rate under contention; requires synchronized clocks or logical
clock algorithms (e.g., Lamport clocks).
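A minimal Python sketch of a Lamport logical clock, one instance per site (illustrative): tick before each local event or send, and on receive jump past the sender's timestamp.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # local event or message send
        self.time += 1
        return self.time

    def receive(self, sender_ts):    # message receipt
        self.time = max(self.time, sender_ts) + 1
        return self.time

site_a, site_b = LamportClock(), LamportClock()
ts_send = site_a.tick()              # A sends a message stamped 1
site_b.tick(); site_b.tick()         # B has already logged two local events
ts_recv = site_b.receive(ts_send)    # B jumps to max(2, 1) + 1 = 3
assert ts_recv > ts_send             # the receive is ordered after the send
```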

4. Optimistic Concurrency Control (OCC)


Read Phase: Transactions execute without locks, reading local snapshots.

Validation Phase: At commit, check for conflicts with overlapping transactions across
sites.

Write Phase: If validation succeeds, changes are applied; otherwise, abort and retry
GeeksforGeeks .

Use Case: Suitable when conflicts are rare and transaction loads are light.

5. Integration with Atomic Commitment


Two-Phase Commit (2PC)

1. Prepare Phase: Coordinator asks all sites if they can commit.

2. Commit Phase: If all vote “yes,” coordinator instructs commit; otherwise, abort
everywhere.

Role in Concurrency Control: Ensures that lock releases (2PL) or timestamp updates
occur only after a unanimous commit decision, preserving global consistency Wikipedia .
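A minimal Python sketch of the coordinator side of 2PC (illustrative: a real implementation adds logging, timeouts, and recovery; the Participant class here is a stub).

```python
def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit/abort): unanimous "yes" commits everywhere, else abort.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
    def prepare(self):
        return self.can_commit          # "yes" vote iff the site can commit
    def commit(self):
        print(f"{self.name}: commit")
    def abort(self):
        print(f"{self.name}: abort")

print(two_phase_commit([Participant("site1"), Participant("site2")]))         # committed
print(two_phase_commit([Participant("site1"), Participant("site2", False)]))  # aborted
```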

6. Deadlock Detection & Resolution


Global Wait-For Graph: Constructed by aggregating local wait-for graphs to detect
cycles that span sites.

Detection Algorithms:

Centralized: One site periodically gathers wait-for edges.

Distributed: Sites circulate probe messages along wait-for chains to detect cycles
TutorialsPoint .

Resolution: Abort one or more transactions (victim selection based on cost or age) to
break the cycle.

7. Multi-Version Concurrency Control (MVCC) in DDBMS


Versioning: Each update creates a new version; readers access the version matching
their snapshot timestamp.

Benefits: Readers never block writers and vice versa.

Complexity: Requires garbage collection of old versions and careful coordination of version visibility across sites Medium .
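
A minimal sketch of MVCC snapshot reads, assuming a per-item version list tagged with commit timestamps; `VersionedItem` is an illustrative name, and distributed visibility coordination is elided.

```python
class VersionedItem:
    """Each committed write appends (commit_ts, value); a reader picks
    the newest version no later than its snapshot timestamp."""

    def __init__(self):
        self.versions = []          # kept sorted by commit_ts

    def write(self, commit_ts, value):
        self.versions.append((commit_ts, value))
        self.versions.sort()

    def read(self, snapshot_ts):
        visible = [v for ts, v in self.versions if ts <= snapshot_ts]
        return visible[-1] if visible else None

x = VersionedItem()
x.write(10, "v1")
x.write(20, "v2")
print(x.read(snapshot_ts=15))   # v1: the ts=20 version is invisible at 15
print(x.read(snapshot_ts=25))   # v2
```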

8. Summary of Trade-offs

| Protocol Type | Deadlock Risk | Blocking | Abort Rate | Coordination Cost |
|---|---|---|---|---|
| SS2PL (Lock-Based) | Yes | Yes | Low | High |
| Timestamp Ordering | No | No | High | Medium |
| OCC | No (deferred) | No | High under contention | Low |
| MVCC | No | No | Medium | High (versioning) |

These key points cover the spectrum of concurrency control mechanisms in distributed DBMSs,
highlighting how each balances correctness, performance, and complexity in a networked
environment.


Below is a structured overview of the major distributed concurrency control algorithms, summarizing their mechanics, trade-offs, and deployment considerations. Each point is backed by a citation.

In distributed DBMSs, achieving global serializability requires algorithms that coordinate concurrent access across multiple sites while managing communication latency and failures Wikipedia .

1. Two-Phase Locking (2PL) Variants


Strong Strict 2PL (SS2PL) extends centralized 2PL by holding all locks until transaction
commit, thereby guaranteeing both serializability and strictness for recovery Wikipedia .

A Distributed Lock Manager (DLM) can be implemented centrally (single lock server) or
hierarchically/distributed (lock managers at each site coordinate via messages) to track
S/X locks across nodes GeeksforGeeks .

Centralized DLM simplifies deadlock detection but is a bottleneck and single point of
failure; distributed DLM avoids that but incurs higher protocol complexity and latency
VLDB .

2. Timestamp-Ordering Protocols
Each transaction receives a global timestamp (e.g., via synchronized clocks or logical
Lamport clocks) and all reads/writes are ordered to respect timestamp order, aborting
transactions that would violate it Wikipedia .

Wound–Wait and Wait–Die are deadlock-free timestamp variants that use timestamp
comparisons to decide whether an older transaction preempts a younger or vice versa
VLDB .

Timestamp protocols are non-blocking (no locks), but under high contention they incur
high abort rates and require clock synchronization or reliable logical clocks VLDB .
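
The Wound-Wait and Wait-Die rules above reduce to a simple timestamp comparison; a minimal sketch, assuming a smaller timestamp means an older transaction:

```python
def wait_die(requester_ts, holder_ts):
    """Older (smaller ts) requesters wait; younger ones die and restart."""
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    """Older requesters wound (abort) the holder; younger ones wait."""
    return "abort holder" if requester_ts < holder_ts else "wait"

# T_old (ts=1) requests a lock held by T_young (ts=5), and vice versa:
print(wait_die(1, 5), "/", wait_die(5, 1))      # wait / abort requester
print(wound_wait(1, 5), "/", wound_wait(5, 1))  # abort holder / wait
```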

3. Optimistic Concurrency Control (OCC)


OCC assumes conflicts are rare: transactions execute without locks, then undergo a
validation phase at commit time to detect conflicts, rolling back if necessary Medium .

In distributed OCC, validation must coordinate among all sites holding touched data,
typically via a prepare-and-validate sub-protocol before the two-phase commit
People at EECS Berkeley .

OCC excels when workloads are read-dominated with few conflicts, reducing lock-
management overhead but suffering high aborts under contention Medium .

4. Multiversion Concurrency Control (MVCC)


MVCC maintains multiple versions of each data item, tagged with commit-timestamps;
readers access the version valid at their start time, avoiding read–write blocking
GeeksforGeeks .

In a distributed setting, version visibility must be coordinated so that replicas converge on a consistent version graph, and garbage-collecting old versions requires distributed consensus on safe timestamps People at EECS Berkeley .

5. Hybrid and Advanced Protocols


Mixed Protocols combine locking and timestamps (e.g., time-stamped two-phase
locking) to balance blocking vs. aborts, choosing the best mechanism per operation class
VLDB .

Deterministic Concurrency Control (e.g., CALVIN) orders transactions upfront using a global sequencer, then executes them without further coordination, trading off latency for predictability arXiv .

RDMA-Enabled Protocols leverage one-sided remote memory access to implement lightweight locks or version checks, drastically reducing coordination overhead in high-speed networks arXiv .
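
A toy sketch of the deterministic (CALVIN-style) idea, assuming a single in-process sequencer; real systems batch, replicate, and durably log the sequence before execution.

```python
from itertools import count

sequencer = count(1)     # the global sequencer fixes a total order up front
log = []

def sequence(txn_name):
    """Append the transaction to the global log with its sequence number."""
    seq = next(sequencer)
    log.append((seq, txn_name))
    return seq

for t in ["T_a", "T_b", "T_c"]:
    sequence(t)

# Every replica replays the identical log, so sites need no runtime lock
# coordination to agree on the execution order.
for seq, txn in log:
    print(seq, txn)
```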

6. Deadlock Detection & Resolution


Centralized Deadlock Detection gathers local wait-for graphs at one site to find cycles,
while distributed detection circulates probes along distributed wait chains to detect
global cycles VLDB .

Victim Selection for aborts typically favors younger or low-cost transactions to break
deadlocks with minimal work loss VLDB .

Key Trade-offs:

2PL yields low abort rates but high blocking and coordination cost.

Timestamp methods avoid blocking but suffer abort storms and require clock
management.

OCC/MVCC reduce blocking at the cost of complex validation/version management and potentially high aborts under contention.

Choosing the right protocol depends on your workload’s read/write mix, network
characteristics, and tolerable abort vs. latency profile.


Distributed deadlock management encompasses techniques to prevent, detect, and resolve
cycles of waiting transactions across multiple sites, ensuring global serializability and system
liveness. Below is a structured overview:

Summary
In distributed DBMSs, deadlocks arise when transactions on different nodes form a circular
wait for locks on each other’s resources. Unlike centralized systems, distributed deadlock
management must contend with no global clock, partial information, and network latency.
The primary approaches are:

1. Prevention/Avoidance: Restrict locking patterns or resource requests to eliminate cycles.

2. Detection: Use waits-for graphs, either centralized or via probe messages, to uncover global cycles.

3. Resolution: Break deadlocks by aborting (“victimizing”) transactions based on predefined policies (e.g., age, cost), or rely on atomic commitment timeouts.

4. Advanced: Leverage Commitment Ordering to use atomic commit messages for automatic deadlock resolution without explicit lock coordination.

1. Deadlock Definition and Conditions


A deadlock is a state where transactions wait indefinitely for resources held by each
other, forming a cycle in the global waits-for graph TutorialsPoint .

Necessary conditions (Coffman’s conditions):

1. Mutual Exclusion: Resources are non-shareable.

2. Hold and Wait: Transactions hold at least one resource while waiting for others.

3. No Preemption: Resources cannot be forcibly taken.

4. Circular Wait: A cycle of transactions exists, each waiting on the next GeeksforGeeks .

2. Deadlock Prevention and Avoidance
Prevention: Impose ordering on resource requests; e.g., require transactions to lock
resources in a globally agreed sequence GeeksforGeeks .

Avoidance: Use resource-allocation graphs or timestamp-based schemes (e.g., Wound–Wait, Wait–Die) to decide at runtime whether granting a lock would risk a cycle ACM Digital Library .

Trade-off: These methods reduce deadlock risk but constrain concurrency and can
increase abort rates.
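
A tiny sketch of the prevention idea, assuming lock requests are sorted into one globally agreed (here lexicographic) order, which rules out circular waits:

```python
def canonical_lock_order(items):
    """Acquire locks in a globally agreed (here: lexicographic) order."""
    return sorted(items)

# Both transactions want {x, y} but request in the same canonical order,
# so neither can hold y while waiting for x: no circular wait can form.
print(canonical_lock_order({"y", "x"}))  # ['x', 'y'] for T1
print(canonical_lock_order({"x", "y"}))  # ['x', 'y'] for T2
```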

3. Deadlock Detection
3.1 Centralized Detection
A single site gathers local wait-for edges from all nodes to build a global waits-for
graph; cycles indicate deadlocks SpringerLink .

Pros: Simpler graph analysis.

Cons: Communication overhead and single point of failure.

3.2 Distributed Detection


Probe Messages: A transaction T1 that is blocked waiting on T2 sends a probe along the wait
chain; if the probe returns to its originator, a deadlock exists GeeksforGeeks .

Chandy-Misra-Haas Algorithm: Efficiently propagates probes without centralized coordination CMI .

Pros: No central coordinator.

Cons: Probe message overhead and increased complexity.
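
A simplified in-process sketch of edge-chasing in the spirit of Chandy-Misra-Haas, assuming the distributed wait-for edges have been gathered into one dictionary; the real algorithm forwards (initiator, sender, receiver) probe messages between sites rather than walking a local structure.

```python
def chm_detect(initiator, wait_for):
    """Edge-chasing: forward a probe along wait-for edges; if it reaches
    the initiator again, the initiator is in a deadlock cycle."""
    frontier = list(wait_for.get(initiator, ()))
    seen = set()
    while frontier:
        target = frontier.pop()
        if target == initiator:
            return True                    # probe came back: deadlock
        if target in seen:
            continue
        seen.add(target)
        frontier.extend(wait_for.get(target, ()))
    return False

# T1 -> T2 -> T3 -> T1 spans three sites; T1's probe returns to T1.
print(chm_detect("T1", {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))  # True
print(chm_detect("T3", {"T1": ["T2"], "T2": ["T3"]}))                # False
```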

4. Deadlock Resolution
Victim Selection: Abort one or more transactions in the cycle, chosen based on criteria
such as:

Transaction Age (younger = victim in Wait–Die)

Rollback Cost (least work lost)

Resource Utilization CMI .

Timeouts: If a participant in the two-phase commit does not respond, the coordinator aborts the timed-out transactions, breaking voting deadlocks DEV Community .

Automatic via Commitment Ordering: By enforcing that transactions vote on commit in the order of their global commit sequence, lock-induced deadlocks resolve through the atomic commit protocol without explicit detection Wikipedia .
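
A one-function sketch of cost-based victim selection using the criteria above, assuming an estimated rollback cost per transaction (the cost model is illustrative):

```python
def pick_victim(cycle, rollback_cost):
    """Abort the transaction whose rollback wastes the least work."""
    return min(cycle, key=lambda txn: rollback_cost[txn])

cycle = ["T1", "T2", "T3"]
cost = {"T1": 40, "T2": 5, "T3": 12}     # e.g., log records written so far
print(pick_victim(cycle, cost))          # T2 is aborted to break the cycle
```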

5. Practical Considerations
Partial Knowledge: Sites may lack complete waits-for information, necessitating
conservative detection that risks false positives (unnecessary aborts).

Communication Cost: Frequent detection incurs network messages; balance detection frequency with performance impact.

Integration with 2PC: Coordinate deadlock resolution with the two-phase commit
protocol to ensure atomicity and consistency Citus Data .

Hybrid Approaches: Combine prevention, detection, and automatic resolution (e.g., using MVCC to reduce blocking plus periodic distributed detection).

References:

TutorialsPoint: deadlock basics

GeeksforGeeks: deadlocks in DBMS

GeeksforGeeks: distributed deadlock strategies

DEV Community: deadlock handling in distributed databases

Citus Data: distributed deadlocks in PostgreSQL

SpringerLink: deadlock management

ACM Digital Library: deadlock prevention in distributed databases

OpenReview: distributed deadlocks (PDF)

CMI / ACM Computing Surveys: Chandy-Misra-Haas algorithm

Wikipedia: Commitment Ordering
