DDBMS
UNIT - 1
Distributed Data Processing in Distributed Database Management Systems (DDBMS) is the practice of processing
and managing data that is distributed across multiple nodes or sites in a network. DDBMS is used to handle large
volumes of data efficiently by distributing it across multiple servers or nodes, which can be geographically dispersed.
Here are key aspects of distributed data processing in DDBMS:
1. Data Distribution: Data in a DDBMS is distributed across multiple locations for load balancing, fault tolerance, or reduced latency. The distribution can be horizontal (sharding) or vertical (column-wise partitioning), depending on the database's architecture.
2. Data Replication: Some DDBMS architectures replicate data copies across multiple nodes or sites for
redundancy and fault tolerance. This ensures that data is available even if some nodes or sites experience
failures.
3. Data Fragmentation: Data fragmentation involves dividing a database into smaller fragments or subsets that
are distributed across nodes. Each fragment can be managed independently.
4. Query Optimization: Distributed data processing requires efficient query optimization algorithms to distribute
and parallelize queries across multiple nodes. This ensures that queries are executed as efficiently as possible.
5. Transaction Management: Distributed transactions can span multiple nodes. Coordinating distributed
transactions, ensuring atomicity, consistency, isolation, and durability (ACID properties), and handling failures
are essential challenges in DDBMS.
6. Data Distribution Strategies: DDBMS can use various strategies to distribute data (a partitioning sketch follows this list), such as:
i. Hash-based distribution: Data is distributed based on a hash function applied to a key attribute.
ii. Range-based distribution: Data is distributed based on a predefined range of values in a key attribute.
iii. Round-robin distribution: Data is distributed sequentially to nodes in a round-robin fashion.
7. Data Synchronization: DDBMS must manage data synchronization between distributed copies of data.
Techniques like replication consistency protocols ensure that data remains synchronized.
8. Data Security and Access Control: Access control mechanisms and encryption are crucial in DDBMS to ensure
that only authorized users can access and modify data, especially in distributed environments.
9. Load Balancing: Load balancing algorithms distribute query and transaction workloads evenly across nodes to
prevent overloading some nodes while others are underutilized.
10. Monitoring and Management: Effective monitoring and management tools are essential for diagnosing issues,
optimizing performance, and maintaining the health of a distributed database system.
11. Scalability: DDBMS should be designed to scale horizontally, allowing for the addition of more nodes or sites to
accommodate growing data and user loads.
12. Global Query Optimization: When querying data distributed across multiple sites or nodes, global query
optimization algorithms aim to optimize the execution plan by considering the distribution and location of data.
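As a rough sketch of how hash-based and range-based distribution rules can be expressed, the statements below use MySQL-style table partitioning syntax; the ORDERS tables and their columns are hypothetical, the exact syntax differs between DBMS products, and in a single-server DBMS this only creates local partitions, whereas a DDBMS applies the same assignment rule to decide which site stores each row.

-- Hash-based distribution: each row is assigned to one of 4 partitions
-- by hashing the key attribute customer_id.
CREATE TABLE ORDERS (
    order_id INT NOT NULL,
    customer_id INT NOT NULL,
    order_date DATE,
    PRIMARY KEY (order_id, customer_id)
)
PARTITION BY HASH (customer_id)
PARTITIONS 4;

-- Range-based distribution: each row is placed according to a
-- predefined range of values of the key attribute (the order year).
CREATE TABLE ORDERS_BY_YEAR (
    order_id INT NOT NULL,
    order_date DATE NOT NULL,
    PRIMARY KEY (order_id, order_date)
)
PARTITION BY RANGE (YEAR(order_date)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);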
What is DDBS?
A Distributed Database System (DDBS) is a database management system that manages a database that is
distributed across multiple interconnected sites or nodes in a network. In a DDBS, data is stored in multiple locations
or servers, and the database management system is responsible for providing a unified view of the data to users and
applications. Here are some advantages and disadvantages of DDBS:
Advantages of DDBS:
1. Improved Data Availability: Data redundancy and replication across multiple sites ensure that data remains available even if some nodes or sites experience failures.
2. Enhanced Performance: DDBS can improve query and transaction performance by distributing data closer to
the users or applications that need it, reducing network latency.
3. Scalability: DDBS can be easily scaled by adding more nodes or sites to accommodate growing data volumes or
user loads, providing flexibility for expansion.
4. Geographical Distribution: DDBS enables data to be geographically distributed, which is beneficial for
organizations with a global presence or distributed operations.
5. Load Balancing: DDBS can distribute query workloads across nodes, preventing overload on a single server and
ensuring efficient resource utilization.
6. Reduced Data Transfer: Data locality optimization minimizes data transfer over the network by processing data
on nodes where it resides, reducing network overhead.
7. Disaster Recovery: Data distributed across geographically diverse sites can provide disaster recovery
capabilities in case of regional disasters or data center outages.
8. Security: DDBS can implement data security and access control mechanisms at different levels, ensuring that
only authorized users can access specific data.
9. Data Consistency: DDBS can maintain data consistency across distributed copies through mechanisms like two-
phase commit (2PC) and distributed locking.
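As a rough sketch of the two-phase commit idea mentioned in the last point, the statements below use PostgreSQL's prepared-transaction commands; the ACCOUNTS table and the transaction identifier are assumptions, the coordinator logic that collects votes from all sites is not shown, and PostgreSQL additionally requires max_prepared_transactions to be set above zero.

-- Phase 1 (prepare): each participating site performs its work and
-- "votes yes" by preparing the transaction, which makes it durable
-- but not yet visible to other transactions.
BEGIN;
UPDATE ACCOUNTS SET balance = balance - 100 WHERE acc_no = 'A1';
PREPARE TRANSACTION 'txn_42';

-- Phase 2 (commit/abort): once every site has prepared, the
-- coordinator tells each site to commit; if any site voted no,
-- it tells every site to roll back instead.
COMMIT PREPARED 'txn_42';
-- ROLLBACK PREPARED 'txn_42';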
Disadvantages of DDBS:
1. Complexity: A DDBS is inherently more complex to design, implement, and manage than a centralized database.
2. Data Integrity: Ensuring data integrity and consistency across distributed nodes can be challenging and may
require sophisticated synchronization mechanisms.
3. Cost: Setting up and maintaining a DDBS can be expensive due to the need for additional hardware, software,
and network infrastructure.
4. Network Dependency: DDBS relies heavily on network connectivity, and network failures or latency can impact
system performance.
5. Data Security: Managing data security and access control across distributed nodes can be more complex and
challenging than in a centralized database.
6. Data Recovery: In case of node or site failures, data recovery and restoration can be more time-consuming and
complex.
7. Data Fragmentation: Data fragmentation and distribution can lead to data access inefficiencies if not managed
properly.
8. Consistency Challenges: Achieving and maintaining strong data consistency in distributed transactions can be
complex and may involve trade-offs.
9. Global Query Optimization: Optimizing queries that involve data distributed across multiple sites can be
challenging and may require advanced query optimization techniques.
10. Data Migration: Moving or replicating data across distributed nodes can be complex, especially when schema
changes are involved.
Problem Areas:
1. Data Security and Privacy: Ensuring data security and privacy is a significant challenge. Data breaches and
unauthorized access can lead to sensitive information exposure.
2. Data Integrity: Maintaining data integrity is crucial. Data may become corrupted due to hardware failures,
software bugs, or human errors.
3. Scalability: As data volumes grow, databases must scale to handle increased loads efficiently. Scaling can be
challenging and may require architecture changes.
4. Performance Tuning: Optimizing database performance is an ongoing task. Inefficient queries, indexing issues,
and resource limitations can impact performance.
5. Data Backup and Recovery: Developing robust backup and recovery strategies is essential to prevent data loss in
case of system failures or disasters.
6. Concurrency Control: Managing concurrent access to the database to prevent conflicts, such as data
inconsistency or deadlocks, can be complex.
7. Data Migration: Moving data between different database systems or versions can be challenging, requiring
careful planning and testing.
8. Compliance and Regulation: Meeting regulatory requirements (e.g., GDPR, HIPAA) regarding data storage,
access, and protection is a complex task.
9. Vendor Lock-In: Organizations can face vendor lock-in issues when they rely heavily on proprietary database
systems, making it difficult to migrate to other solutions.
10. Complex Queries and Reporting: Complex queries and reporting needs may lead to performance bottlenecks
and require optimization.
Problem areas in the underlying network include:
2. Network Reliability: Ensuring high network availability and reliability is a challenge. Downtime can result from hardware failures, software bugs, or external factors.
3. Bandwidth and Congestion: Network congestion can degrade performance, leading to slow data transfer and
reduced user experience.
4. Scalability: As organizations grow, their networks must scale to accommodate increased traffic and devices,
which requires careful planning and infrastructure upgrades.
5. Quality of Service (QoS): Maintaining QoS for applications with different requirements (e.g., video streaming,
voice communication) can be challenging, particularly in shared networks.
6. Network Management and Monitoring: Efficiently managing and monitoring network devices and traffic is
essential for identifying and addressing issues promptly.
7. IPv6 Transition: Transitioning from IPv4 to IPv6 to address the depletion of IPv4 addresses poses challenges,
including compatibility issues.
8. Cloud Integration: Integrating on-premises networks with cloud environments introduces complexities in terms
of security, connectivity, and data transfer.
9. Network Virtualization: Implementing virtualized networks and software-defined networking (SDN) can
introduce new challenges in terms of configuration and security.
10. Wireless Network Management: Managing and securing wireless networks, including Wi-Fi and cellular
networks, is essential with the growing reliance on mobile devices.
Distribution Transparency
Distribution transparency is a property of distributed databases that hides the internal distribution details from users. This allows users to interact with the distributed database in a similar way to a centralized database, without needing to know about the distribution of tables or data replication.
1. Location transparency is a concept in distributed database systems where the user can query tables or
fragments of a table as if they were stored locally, without being aware that the data is actually stored at a
remote site. To achieve location transparency, the distributed database management system (DDBMS) needs
access to an updated data dictionary and directory containing the details of data locations.
2. Fragmentation transparency allows users to query fragmented tables as if they were unfragmented, hiding the fact that the table is actually a fragment or a union of fragments. It also conceals the location of the fragments. This concept is similar to SQL views, where users may not be aware that they are using a view instead of the actual table (a view-based sketch follows this list).
3. Replication transparency ensures that users are unaware of database replication, allowing them to query a
table as if only one copy exists. It is related to concurrency transparency, where updates to data items are
reflected in all copies without the user's knowledge. Additionally, replication transparency provides failure
transparency, allowing users to continue querying replicated copies even if a site fails.
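As a small sketch of how fragmentation (and location) transparency can be simulated with a view, the statement below unions two hypothetical horizontal fragments of an EMP table stored at different sites; all table and column names are assumptions, and in a real DDBMS this mapping is maintained through the data dictionary rather than a hand-written view.

-- Users query EMP_ALL as if it were one unfragmented, local table;
-- the union over the site fragments is hidden from them.
CREATE VIEW EMP_ALL AS
SELECT Eno, Ename, Dept FROM EMP_SITE1    -- fragment held at site 1
UNION ALL
SELECT Eno, Ename, Dept FROM EMP_SITE2;   -- fragment held at site 2

-- The query needs no knowledge of where the rows actually reside.
SELECT Ename FROM EMP_ALL WHERE Dept = 'Sales';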
Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments, each with further sub-divisions, as described below.
1. Homogeneous Distributed Databases: In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are −
1) Autonomous − Each database is independent and functions on its own. They are integrated by a controlling application and use message passing to share data updates.
2) Non-autonomous − Data is distributed across the homogeneous nodes and a central or master DBMS co-ordinates data updates across the sites.
2. Heterogeneous Distributed Databases: In a heterogeneous distributed database, different sites may use different DBMS products, operating systems, and data models. Its types are −
1) Federated − The heterogeneous database systems are independent in nature and integrated together so that they function as a single database system.
2) Un-federated − The database systems employ a central coordinating module through which the databases
are accessed.
Client-Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and clients. The server functions primarily encompass data management, query processing, optimization and transaction management. Client functions mainly include the user interface; however, clients also have some functions like consistency checking and transaction management.
The two different client-server architectures are −
1) Single Server Multiple Client
2) Multiple Server Multiple Client
Peer-to-Peer Architecture for DDBMS
In these systems, each peer acts both as a client and a server for imparting database services. The peers share their resources with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
1) Global Conceptual Schema − Depicts the global logical view of data.
2) Local Conceptual Schema − Depicts logical data organization at each site.
3) Local Internal Schema − Depicts physical data organization at each site.
4) External Schema − Depicts the user view of data.
Multi-DBMS Architecture
This is an integrated database system formed by a collection of two or more autonomous database systems. A multi-DBMS can be expressed through six levels of schemas −
1. Multi-database View Level − Depicts multiple user views comprising subsets of the integrated distributed database.
2. Multi-database Conceptual Level − Depicts the integrated multi-database that comprises global logical multi-database structure definitions.
3. Multi-database Internal Level − Depicts the data distribution across different sites and multi-database to local
data mapping.
4. Local database View Level − Depicts public view of local data.
5. Local database Conceptual Level − Depicts local data organization at each site.
6. Local database Internal Level − Depicts physical data organization at each site.
Global Directory Issues
1. Directory Structure: Designing an efficient and scalable directory structure for managing distributed data locations.
2. Data Location Tracking: Ensuring accurate tracking of data locations across distributed nodes for efficient data
access.
3. Data Distribution Changes: Managing updates to the directory when data distribution strategies change.
4. Concurrency Control: Handling concurrent access to the global directory to prevent conflicts and ensure data
consistency.
5. Scalability: Ensuring that the directory can scale as the database and the number of nodes grow.
6. Query Optimization: Providing accurate information in the directory to support efficient query optimization.
7. Data Migration: Managing directory updates when data is migrated between nodes.
9. Consistency and Synchronization: Maintaining consistency and timely synchronization of directory updates.
10. Failure Handling: Dealing with directory node failures to maintain uninterrupted services.
UNIT - 2
Design Strategies
The strategies can be broadly divided into replication and fragmentation.
Data Replication: Data replication is the process of storing data at more than one site or node. It is useful in improving the availability of data. It simply means copying data from a database on one server to other servers so that all users can share the same data without any inconsistency.
Advantages of Data Replication:
1. Reliability − In case of failure of any site, the database system continues to work since a copy is available at another site(s).
2. Reduction in Network Load − Since local copies of data are available, query processing can be done with
reduced network usage, particularly during prime hours. Data updating can be done at non-prime hours.
3. Quicker Response − Availability of local copies of data ensures quick query processing and consequently quick
response time.
4. Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.
Disadvantages of Data Replication:
1. Increased Storage Requirements − Maintaining multiple copies of data is associated with increased storage costs. The storage space required is in multiples of the storage required for a centralized system.
2. Increased Cost and Complexity of Data Updating − Each time a data item is updated, the update needs to be reflected in all the copies of the data at the different sites. This requires complex synchronization techniques and protocols.
3. Undesirable Application–Database Coupling − If complex update mechanisms are not used, removing data inconsistency requires complex co-ordination at the application level. This results in undesirable application–database coupling.
Replication Schemes
1. Full Replication: The most extreme case is replication of the whole database at every site in the distributed system. This improves the availability of the system because it can continue to operate as long as at least one site is up.
Advantages of full replication:
1) High availability − a copy of every relation is present at every site, so the system keeps working as long as one site is up.
2) Faster query processing − queries can be answered locally without transferring data over the network.
Disadvantages of full replication:
1) Updates are slow and expensive, since every copy at every site must be changed.
2) Concurrency control and recovery become more complex, and storage requirements are very high.
2. No Replication: Each fragment is stored at exactly one site; no copies are maintained.
Advantages of No replication:
1) Concurrency control and recovery are simple, since every data item has a single copy.
2) Storage requirements are minimal.
Disadvantages of No replication:
1) Poor availability − if the site holding a fragment fails, that data becomes inaccessible.
2) Queries that need data from several sites run slowly because of network transfers.
3. Partial Replication: Partial replication means that some fragments are replicated whereas others are not. Only a subset of the database is replicated at each site. This reduces storage costs but requires careful planning to ensure data consistency.
1) The number of replicas created for a fragment depends directly upon the importance of the data in that fragment.
2) An optimized architecture gives the advantages of both the full replication and no replication schemes.
Fragmentation
Fragmentation is the process of dividing a table into smaller subsets called fragments. There are three types of
fragmentation: horizontal, vertical, and hybrid. Horizontal fragmentation can be divided into primary and derived
techniques. The purpose of fragmentation is to ensure that the original table can be reconstructed from the
fragments when needed, which is known as "reconstructiveness."
Advantages of Fragmentation
1) Since data is stored close to the site of usage, efficiency of the database system is increased.
2) Local query optimization techniques are sufficient for most queries since data is locally available.
3) Since irrelevant data is not available at the sites, security and privacy of the database system can be maintained.
Disadvantages of Fragmentation
1) When data from different fragments are required, the access speeds may be very low.
2) In case of recursive fragmentations, the job of reconstruction will need expensive techniques.
3) Lack of back-up copies of data in different sites may render the database ineffective in case of failure of a site.
Types of Fragmentation:
1. Vertical Fragmentation: In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to maintain reconstructiveness, each fragment should contain the primary key field(s) of the table. Vertical fragmentation can be used to enforce privacy of data.
For example, let us consider that a University database keeps records of all registered students in a Student table
having the following schema.
STUDENT (Regd_No, Name, Course, Address, Semester, Fees, Marks)
Now, the fees details are maintained in the accounts section. In this case, the designer will fragment the database as
follows −
CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
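To preserve reconstructiveness, the remaining attributes would typically go into a second vertical fragment that also carries the primary key; the fragment name STD_INFO below is an assumption for illustration, and the join shows how the original table can be rebuilt.

CREATE TABLE STD_INFO AS
SELECT Regd_No, Name, Course, Address, Semester, Marks
FROM STUDENT;

-- Reconstruction: the original STUDENT rows are recovered by joining
-- the two vertical fragments on the shared primary key Regd_No.
SELECT i.Regd_No, i.Name, i.Course, i.Address, i.Semester, f.Fees, i.Marks
FROM STD_INFO i JOIN STD_FEES f ON i.Regd_No = f.Regd_No;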
2. Horizontal Fragmentation: Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each horizontal fragment must have all columns of the original base table.
For example, in the student schema, if the details of all students of the Computer Science course need to be maintained at the School of Computer Science, then the designer will horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE Course = 'Computer Science';
3. Hybrid Fragmentation: In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. It can be done in two ways (a sketch follows this list):
1) At first, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
2) At first, generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical fragments.
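As a small sketch of the first approach, the statements below reuse the horizontal fragment COMP_STD created earlier and split it vertically; the fragment names COMP_STD_FEES and COMP_STD_INFO are assumptions for illustration.

-- Vertical fragments of the horizontal fragment COMP_STD; each keeps
-- the primary key Regd_No so the fragment can be reconstructed.
CREATE TABLE COMP_STD_FEES AS
SELECT Regd_No, Fees
FROM COMP_STD;

CREATE TABLE COMP_STD_INFO AS
SELECT Regd_No, Name, Course, Address, Semester, Marks
FROM COMP_STD;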
Data Allocation: Data distribution, or data allocation, involves assigning each fragment or copy of a fragment to a
specific site in a distributed system. The choice of sites and replication depends on the system's performance and
availability goals, as well as the types and frequencies of transactions. For high availability and retrieval-only
transactions, a fully replicated database is suitable. However, if certain transactions are mostly submitted at a
specific site, the corresponding fragments can be allocated there. Replication can be done for data accessed at
multiple sites. Finding an optimal solution for distributed data allocation is a complex optimization problem.
View Management
A view in SQL is a virtual table. Like a table, a view contains rows and columns.
To create a view, we can select fields from one or more tables present in the database.
A view can either have specific rows based on certain conditions or all the rows of a table.
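For example, using the STUDENT table from the fragmentation examples above, a view that exposes only selected rows and columns might be created and queried as sketched below; the view name and the choice of semester are assumptions for illustration.

-- The view hides the sensitive Fees and Marks columns and shows only
-- third-semester students; it is queried like an ordinary table.
CREATE VIEW STUDENT_CONTACTS AS
SELECT Regd_No, Name, Course, Address
FROM STUDENT
WHERE Semester = 3;

SELECT Name, Address FROM STUDENT_CONTACTS WHERE Course = 'Computer Science';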
Advantages of View:
1) Complexity: Views help to reduce the complexity. Different views can be created on the same base table for
different users.
2) Security: It increases the security by excluding the sensitive information from the view.
3) Query Simplicity: It helps to simplify commands from the user. A view can draw data from several different
tables and present it as a single table.
4) Consistency: A view can present a consistent, unchanged image of the structure of the database. Views can be
used to rename the columns without affecting the base table.
5) Data Integrity: If data is accessed and entered through a view, the DBMS can automatically check the data to
ensure that it meets the specified integrity constraints.
6) Storage Capacity: Views require very little storage space, since only the view definition (not the data) is stored.
7) Logical Data Independence: Views can make applications independent of the underlying base tables to a certain extent.
Disadvantages of View:
1) The DML statements that can be performed on a view created from a single base table are subject to the following restrictions:
2) You cannot INSERT if the base table has any NOT NULL column that does not appear in the view.
3) You cannot INSERT or UPDATE if any of the columns referenced in the INSERT or UPDATE contains group functions or columns defined by expressions.
4) You cannot execute INSERT, UPDATE, or DELETE statements on a view if the WITH READ ONLY option is enabled.
5) You cannot create a view on temporary tables.
6) You cannot INSERT, UPDATE, or DELETE if the view contains group functions, GROUP BY, DISTINCT, or a reference to the pseudocolumn ROWNUM.
7) You cannot pass parameters to SQL Server views.
8) You cannot associate rules and defaults with views.
Data Security
In distributed systems, it is imperative to adopt measures to secure data apart from communications. The data security measures are −
1. Authentication and authorization − These are the access control measures adopted to ensure that only authentic users can use the database. Digital certificates are used to provide authentication, and login is further restricted through username/password combinations.
2. Data encryption − The two approaches for data encryption in distributed systems (a column-encryption sketch follows this list) are −
1) Internal to distributed database approach: The user applications encrypt the data and then store the
encrypted data in the database. For using the stored data, the applications fetch the encrypted data from
the database and then decrypt it.
2) External to distributed database: The distributed database system has its own encryption capabilities. The
user applications store data and retrieve them without realizing that the data is stored in an encrypted
form in the database.
3. Validated input − In this security measure, the user application checks each input before it can be used for updating the database. An unvalidated input can cause a wide range of exploits such as buffer overruns, command injection, cross-site scripting, and data corruption.
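As one possible sketch of the encryption approaches above, the statements below store a column in encrypted form using PostgreSQL's pgcrypto extension with a symmetric key; the CARD table, its columns, and the in-line key are assumptions for illustration only, and a real deployment would need proper key management.

-- Enable the pgcrypto extension (PostgreSQL).
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE CARD (
    cust_id INT PRIMARY KEY,
    card_no BYTEA            -- stored only in encrypted form
);

-- Encrypt when writing ...
INSERT INTO CARD VALUES (1, pgp_sym_encrypt('4111-1111-1111-1111', 'secret-key'));

-- ... and decrypt when reading, only in applications that hold the key.
SELECT cust_id, pgp_sym_decrypt(card_no, 'secret-key') AS card_no FROM CARD;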
2. Entity Integrity Control: Entity integrity control enforces the rules so that each tuple can be uniquely identified from other tuples. For this, a primary key is defined. A primary key is a minimal set of fields that can uniquely identify a tuple. The entity integrity constraint states that no two tuples in a table can have identical values for the primary key and that no field which is a part of the primary key can have a NULL value.
3. Referential Integrity Constraint: The referential integrity constraint lays down the rules for foreign keys. A foreign key is a field in a table that references the primary key of a related table. The referential integrity constraint states that the value of the foreign key field should either be among the values of the primary key of the referenced table or be entirely NULL.
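As a brief sketch of how these two constraints are declared in SQL, the DEPARTMENT and EMPLOYEE tables below are hypothetical; the PRIMARY KEY clause enforces entity integrity and the FOREIGN KEY clause enforces referential integrity.

CREATE TABLE DEPARTMENT (
    dept_id INT PRIMARY KEY,          -- entity integrity: unique and not NULL
    dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE EMPLOYEE (
    emp_id INT PRIMARY KEY,
    emp_name VARCHAR(50),
    dept_id INT,                      -- must match a DEPARTMENT row or be NULL
    FOREIGN KEY (dept_id) REFERENCES DEPARTMENT (dept_id)
);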
Query Processing
Query processing is the set of all activities starting from the submission of a query to displaying its results.
Query processing includes translation of high-level queries into low-level expressions that can be used at the physical level of the file system, query optimization, and actual execution of the query to obtain the result.
1. Parsing and translation: Translate the query into its internal form. This is then translated into relational algebra.
Parser checks syntax, verifies relations.
2. Evaluation: The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query. A relational algebra expression may have many equivalent expressions. Each relational algebra operation can be evaluated using one of several different algorithms. Correspondingly, a relational-algebra expression can be evaluated in many ways. An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
3. Query Optimization: Amongst all equivalent evaluation plans, the one with the lowest cost is chosen. Cost is estimated using statistical information from the database catalog.
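As a small illustration of evaluation plans, most DBMS products expose the plan chosen by the optimizer through an EXPLAIN-style command (the command and its output format are product-specific, not part of the SQL standard); the query below reuses the STUDENT table from the earlier examples.

-- Ask the optimizer to show its chosen evaluation plan, e.g. whether it
-- scans the whole table or uses an index on Course (MySQL/PostgreSQL syntax).
EXPLAIN
SELECT Name, Course
FROM STUDENT
WHERE Course = 'Computer Science';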