Unit I Distributed Databases
Distributed Databases
NoSQL Database
XML Databases
D. Rajashree
MCA
UNIT I DISTRIBUTED DATABASES
Distributed Systems – Introduction – Architecture – Distributed
Database Concepts – Distributed Data Storage – Distributed
Transactions – Commit Protocols – Concurrency Control – Distributed
Query Processing
Distributed Systems
More recently, many new technologies have emerged that combine
distributed and database technologies.
These technologies and systems are being developed for dealing with
the storage, analysis, and mining of the vast amounts of data that are
being produced and collected, and they are referred to generally as big
data technologies.
The origins of big data technologies come from distributed systems and
database systems, as well as data mining and machine learning
algorithms that can process these vast amounts of data to extract
needed knowledge.
Database management systems developed using the above types of
architectures are termed parallel database management systems rather
than DDBMSs, since they utilize parallel processor technology. Another
type of multiprocessor architecture is called shared-nothing architecture.
In this architecture, every processor has its own primary and secondary
(disk) memory, no common memory exists, and the processors
communicate over a high speed interconnection network (bus or switch).
Although the shared-nothing architecture resembles a distributed
database computing environment, major differences exist in the mode of
operation.
In shared-nothing multiprocessor systems, there is symmetry and
homogeneity of nodes; this is not true of the distributed database
environment, where heterogeneity of hardware and operating system at
each node is very common.
Shared-nothing architecture is also considered an environment for
parallel databases.
One figure illustrates a parallel database (shared-nothing), another
illustrates a centralized database with distributed access, and a third
shows a pure distributed database. We will not expand on parallel
architectures and related data management issues here.
General Architecture of Pure Distributed Databases
Each local DBMS would have its local query optimizer, transaction
manager, and execution engines as well as the local system catalog,
which houses the local schemas.
The global transaction manager is responsible for coordinating the
execution across multiple sites in conjunction with the local transaction
manager at those sites.
All the problems related to query processing, transaction processing,
and directory and metadata management and recovery apply to FDBSs
with additional considerations.
Three-Tier Client/Server Architecture
Full-scale DDBMSs supporting all of these types of functionality have
not been developed.
Instead, distributed database applications are being developed in the
context of client/server architectures. It is now more common to
use a three-tier architecture rather than a two-tier architecture,
particularly in Web applications.
In the three-tier client/server architecture, the following three layers
exist:
1. Presentation layer (client).
This provides the user interface and interacts with the user. The
programs at this layer present Web interfaces or forms to the client
in order to interface with the application.
Web browsers are often utilized, and the languages and
specifications used include HTML, XHTML, CSS, Flash, MathML,
Scalable Vector Graphics (SVG), Java, JavaScript, Adobe Flex, and
others.
This layer handles user input, output, and navigation by accepting
user commands and displaying the needed information, usually in
the form of static or dynamic Web pages.
The latter are employed when the interaction involves database
access. When a Web interface is used, this layer typically
communicates with the application layer via the HTTP protocol.
2. Application layer (business logic).
This layer programs the application logic. For example, queries can
be formulated based on user input from the client, or query results
can be formatted and sent to the client for presentation.
Additional application functionality can be handled at this layer, such
as security checks, identity verification, and other functions.
The application layer can interact with one or more databases or
data sources as needed by connecting to the database using ODBC,
JDBC, SQL/CLI, or other database access techniques.
3. Database server.
This layer handles query and update requests from the application
layer, processes the requests, and sends the results.
Usually SQL is used to access the database if it is relational or
object-relational, and stored database procedures may also be
invoked.
Query results (and queries) may be formatted into XML when
transmitted between the application server and the database server.
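As a minimal sketch of the application layer connecting to a database server, the following Python snippet uses the standard `sqlite3` module as a stand-in for an ODBC/JDBC/SQL-CLI connection; the table, column names, and sample rows are invented for illustration:

```python
import sqlite3

# sqlite3 stands in here for an ODBC/JDBC/SQL-CLI connection to a
# database server; schema and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ssn TEXT, name TEXT, dno INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [("111", "Smith", 5), ("222", "Wong", 4)])

def fetch_employees(conn, dno):
    """Application-layer helper: formulate an SQL query from user input
    and send it to the database server."""
    cur = conn.execute("SELECT ssn, name FROM employee WHERE dno = ?", (dno,))
    return cur.fetchall()

rows = fetch_employees(conn, 5)
# The result could then be formatted (e.g., into XML) before being
# passed back toward the presentation layer.
xml = "".join(f"<employee ssn='{s}' name='{n}'/>" for s, n in rows)
```

The parameterized query (`?` placeholder) is the usual way the application layer injects user input safely, regardless of which access API is used.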
Exactly how to divide the DBMS functionality among the client,
application server, and database server may vary.
The common approach is to include the functionality of a centralized
DBMS at the database server level.
A number of relational DBMS products have taken this approach, in
which an SQL server is provided. The application server must then
formulate the appropriate SQL queries and connect to the database
server when needed.
The client provides the processing for user interface interactions. Since
SQL is a relational standard, various SQL servers, possibly provided by
different vendors, can accept SQL commands through standards such
as ODBC, JDBC, and SQL/CLI.
In this architecture, the application server may also refer to a data
dictionary that includes information on the distribution of data among
the various SQL servers, as well as modules for decomposing a global
query into a number of local queries that can be executed at the
various sites.
Interaction between an application server and database server might
proceed as follows during the processing of an SQL query:
1. The application server formulates a user query based on input
from the client layer and decomposes it into a number of
independent site queries. Each site query is sent to the
appropriate database server site.
2. Each database server processes the local query and sends the
results to the application server site. Increasingly, XML is being
touted as the standard for data exchange so the database server
may format the query result into XML before sending it to the
application server.
3. The application server combines the results of the subqueries to
produce the result of the originally required query, formats it into
HTML or some other form accepted by the client, and sends it to
the client site for display.
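The three steps above can be sketched in Python as follows; the site names, the per-site data, and the way each "database server" is modelled as a plain function are all invented for illustration:

```python
# Step 2 stand-in: each database server processes its local query.
SITE_DATA = {
    "site1": [("111", "Smith"), ("222", "Wong")],
    "site2": [("333", "Zelaya")],
}

def execute_site_query(site):
    """Models a database server returning its local result."""
    return SITE_DATA[site]

def run_global_query(sites):
    # Step 1: decompose the user query into independent site queries
    # (here simply one query per site) and send each to its site.
    partials = [execute_site_query(s) for s in sites]
    # Step 3: combine the subquery results into the global result
    # and format it for the client (HTML here, as in the text).
    combined = [row for part in partials for row in part]
    html = "<ul>" + "".join(f"<li>{name}</li>" for _, name in combined) + "</ul>"
    return combined, html

rows, html = run_global_query(["site1", "site2"])
```

A real application server would run the site queries concurrently and may receive each partial result as XML, but the decompose/execute/combine structure is the same.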
The application server is responsible for generating a distributed
execution plan for a multisite query or transaction and for supervising
distributed execution by sending commands to servers.
These commands include local queries and transactions to be
executed, as well as commands to transmit data to other clients or
servers.
Another function controlled by the application server (or coordinator) is
that of ensuring consistency of replicated copies of a data item by
employing distributed (or global) concurrency control techniques.
The application server must also ensure the atomicity of global
transactions by performing global recovery when certain sites fail.
If the DDBMS has the capability to hide the details of data distribution
from the application server, then it enables the application server to
execute global queries and transactions as though the database were
centralized, without having to specify the sites at which the data
referenced in the query or transaction resides.
This property is called distribution transparency. Some DDBMSs do
not provide distribution transparency, instead requiring that applications
are aware of the details of data distribution.
Distributed Database Concepts
A distributed database (DDB) is a collection of multiple logically interrelated
databases distributed over a computer network, and a distributed database
management system (DDBMS) is a software system that manages a distributed
database while making the distribution transparent to the user.
Transparency
The concept of transparency extends the general idea of hiding implementation
details from end users.
A highly transparent system offers a lot of flexibility to the end user/application
developer since it requires little or no awareness of underlying details on their
part.
In the case of a traditional centralized database, transparency simply pertains to
logical and physical data independence for application developers.
In a DDB scenario, the data and software are distributed over multiple nodes
connected by a computer network, so additional types of transparencies are
introduced.
Consider the company database in Figure that we have been discussing
throughout the book. The EMPLOYEE, PROJECT, and WORKS_ON tables
may be fragmented horizontally (that is, into sets of rows, as we will
discuss in Section 23.2) and stored with possible replication, as shown in
Figure 23.1. The following types of transparencies are possible:
Replication transparency. As shown in the figure, copies of the same
data objects may be stored at multiple sites for better availability,
performance, and reliability. Replication transparency makes the user
unaware of the existence of these copies.
Availability and Reliability
Reliability and availability are two of the most common potential
advantages cited for distributed databases.
Reliability is broadly defined as the probability that a system is running
(not down) at a certain time point, whereas availability is the probability
that the system is continuously available during a time interval.
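As a quick numeric illustration (not from the text) of the interval view of availability, the fraction of an interval during which the system is up can be computed directly; the downtime figure below is invented:

```python
def availability(total_hours, downtime_hours):
    """Fraction of an interval during which the system was available."""
    return (total_hours - downtime_hours) / total_hours

# A system that is down for 8.76 hours in a year is ~99.9% available
# (popularly called "three nines").
a = availability(365 * 24, 8.76)
```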
We can directly relate reliability and availability of the database to the
faults, errors, and failures associated with it.
A failure can be described as a deviation of a system’s behavior from
that which is specified in order to ensure correct execution of
operations.
Errors constitute the subset of system states that causes the failure.
A fault is the cause of an error. To construct a system that is reliable, we
can adopt several approaches.
One common approach stresses fault tolerance; it recognizes that
faults will occur, and it designs mechanisms that can detect and
remove faults before they can result in a system failure.
Another more stringent approach attempts to ensure that the final
system does not contain any faults.
This is done through an exhaustive design process followed by
extensive quality control and testing.
A reliable DDBMS tolerates failures of underlying components, and it
processes user requests as long as database consistency is not
violated.
A DDBMS recovery manager has to deal with failures arising from
transactions, hardware, and communication networks.
Hardware failures can result in either loss of main memory contents or
loss of secondary storage contents.
Network failures occur due to errors associated with messages and line
failures. Message errors can include their loss, corruption, or out-of-
order arrival at the destination.
The previous definitions are used in computer systems in general,
where there is a technical distinction between reliability and availability.
In most discussions related to DDB, the term availability is used
generally as an umbrella term to cover both concepts.
As the system expands its number of nodes, it is possible that the
network, which connects the nodes, may have faults that cause the
nodes to be partitioned into groups of nodes.
The nodes within each partition are still connected by a subnetwork, but
communication among the partitions is lost.
The concept of partition tolerance states that the system should have
the capacity to continue operating while the network is partitioned.
Autonomy
Autonomy determines the extent to which individual nodes or DBs in a
connected DDB can operate independently.
A high degree of autonomy is desirable for increased flexibility and
customized maintenance of an individual node.
Autonomy can be applied to design, communication, and execution.
Design autonomy refers to independence of data model usage and
transaction management techniques among nodes.
Communication autonomy determines the extent to which each node
can decide on sharing of information with other nodes. Execution
autonomy refers to independence of users to act as they please.
Advantages
Scalability: Can scale out by adding more nodes, which is more cost-
effective than scaling a single server
Fault tolerance: Can continue to function even if one or more nodes fail
Disadvantages
Distributed Data Storage
Consider a relation r that is to be stored in the database. There
are two approaches to storing this relation in the distributed database:
Replication:
The system maintains several identical replicas (copies) of the
relation, and stores each replica at a different site. The alternative to
replication is to store only one copy of relation r.
Fragmentation:
The system partitions the relation into several fragments and stores
each fragment at a different site.
Data Replication
Data replication is the process of storing separate copies of the database
at two or more sites. It is a popular fault tolerance technique of distributed
databases.
Advantages of Data Replication
Reliability
In case of failure of any site, the database system continues to work
since a copy is available at other sites.
Reduction in network load
Since local copies of data are available, query processing can be done
with reduced network usage, particularly during prime hours. Data
updating can be done at non-prime hours.
Quicker response
Availability of local copies of data ensures quick query processing and
consequently quick response time.
Simpler transactions
Transactions require fewer joins of tables located at different
sites and minimal coordination across the network. Thus, they become
simpler in nature.
Undesirable application-database coupling
If complex update mechanisms are not used, removing data
inconsistency requires complex coordination at the application level. This
results in undesirable coupling between the application and the database.
Some commonly used replication techniques are –
Snapshot replication
Near real time replication
Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller
tables. The subsets of the table are called fragments. Fragmentation
can be of three types:
Horizontal, vertical, and hybrid (a combination of horizontal
and vertical).
Horizontal fragmentation can further be classified into two
techniques:
primary horizontal fragmentation and derived horizontal
fragmentation.
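Horizontal and vertical fragmentation can be sketched with plain Python rows; the table, columns, and fragmentation predicate below are invented for illustration:

```python
employees = [
    {"ssn": "111", "name": "Smith", "salary": 30000, "dno": 5},
    {"ssn": "222", "name": "Wong", "salary": 40000, "dno": 4},
    {"ssn": "333", "name": "Zelaya", "salary": 25000, "dno": 5},
]

def horizontal_fragment(rows, predicate):
    """Primary horizontal fragmentation: a subset of the rows,
    selected by a predicate."""
    return [r for r in rows if predicate(r)]

def vertical_fragment(rows, columns):
    """Vertical fragmentation: a projection onto a subset of columns.
    The key (ssn) is kept in every fragment so the original table
    can be reconstructed by a join."""
    return [{c: r[c] for c in columns} for r in rows]

dept5 = horizontal_fragment(employees, lambda r: r["dno"] == 5)  # rows for dept 5
pay = vertical_fragment(employees, ["ssn", "salary"])            # key + salary only
```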
Advantages of fragmentation
Since data is stored close to the site of usage, efficiency of the
database system is increased.
Local query optimization techniques are sufficient for most queries
since data is locally available.
Since irrelevant data is not available at the sites, security and
privacy of the database system can be maintained.
Disadvantages of fragmentation
When data from different fragments are required, the access speeds
may be very low.
In case of recursive fragmentations, the job of reconstruction will
need expensive techniques.
Lack of back-up copies of data in different sites may render the
database ineffective in case of failure of a site.
1. Full replication
In the full replication scheme, a copy of the whole database is
available at almost every location or to every user in the
communication network.
2. No replication
In the no replication scheme, each fragment is stored at exactly one site.
Disadvantages of no replication
Poor availability of data.
Slows down query execution, as multiple clients access the same server.
3. Partial replication
Partial replication means that only some fragments of the database are
replicated.
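The three schemes can be contrasted with toy placement maps showing which sites hold copies of which fragments; the fragment and site names below are invented:

```python
# Placement maps: fragment -> set of sites holding a copy.
full_replication = {"F1": {"s1", "s2", "s3"}, "F2": {"s1", "s2", "s3"}}
no_replication = {"F1": {"s1"}, "F2": {"s2"}}
partial_replication = {"F1": {"s1", "s2"}, "F2": {"s3"}}

def is_fully_replicated(placement, sites):
    """True when every fragment has a copy at every site."""
    return all(copies == sites for copies in placement.values())

def copy_count(placement):
    """Total number of stored copies (a rough storage/update cost)."""
    return sum(len(copies) for copies in placement.values())

fully = is_fully_replicated(full_replication, {"s1", "s2", "s3"})
```

More copies improve availability and read locality but raise the cost of keeping the copies consistent on update.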
(OR)
Distributed data storage refers to a solution where data is stored at
multiple locations, often geographically distant, in a collection of
interconnected systems.
Each site can individually manage the data, making it ideal for domains
with large volumes of distributed data processed by simultaneous users.
This architecture offers advantages such as modular development,
better response, improved reliability, lower communication costs,
increased availability, easy expansion, and enhanced performance.
Scalability:
The system can easily scale up or down by adding or removing
nodes, accommodating growing data volumes and user demands.
Improved Performance:
Data can be accessed from the node closest to the user, reducing
latency and improving query performance.
Increased Reliability:
Data is distributed across multiple locations, mitigating the risk of
data loss due to a single point of failure.
Enhanced Security:
Data can be encrypted both in transit and at rest, and access
control mechanisms can be implemented to protect sensitive
information.
Modular Development:
Each site can manage its data independently, allowing for easier
development and maintenance.
Better Response Time:
Data can be accessed from the location closest to the user, leading
to faster response times.
Lower Communication Costs:
Data can be stored and processed locally, reducing the need for
extensive data transfer between locations.
Considerations for Distributed Data Storage:
Data Consistency:
Maintaining data consistency across multiple nodes can be
complex, requiring careful consideration of data replication and
synchronization mechanisms.
Network Latency:
Network latency can impact performance, especially when data
needs to be accessed from geographically distant nodes.
Complexity:
Managing a distributed system can be more complex than
managing a centralized system, requiring specialized expertise and
tools.
Security:
Ensuring the security of data across multiple nodes and networks
requires robust security measures.
Distributed Transactions
A transaction provides a mechanism for describing logical units of
database processing. Transaction processing systems are systems
with large databases and hundreds of concurrent users executing
database transactions.
If a distributed transaction fails, any work that was completed prior to
the failure is reversed to ensure that no net work was done.
The protocol known as "two-phase commit" (2PC) is commonly used
to coordinate a distributed transaction.
If any part of the transaction fails, the coordinator instructs the
resources that successfully committed their portion of the
transaction to undo, or "roll back", that work, and once
again, the data is back to its original state.
It is then up to the application to retry the transaction to make
sure it gets completed.
When Distributed Transactions Are Not Needed
In some environments, distributed transactions are not necessary,
and instead, extra auditing activities are put in place to ensure
data integrity when the speed of the transaction is not an issue.
The transfer of money across banks is a good example. Each
bank that participates in the money transfer tracks the status of
the transaction, and when a failure is detected, the partial state is
corrected.
This process works well without distributed transactions because
the transfer does not have to happen in (near) real time.
Distributed transactions are typically critical in situations where
the complete update must be done immediately.
Commit Protocols
In a local database system, to commit a transaction the transaction
manager only has to convey the decision to commit to the recovery
manager.
In a distributed system, the transaction manager should convey the
decision to commit to all the servers at the various sites where the
transaction is being executed, and uniformly enforce the decision.
When processing is complete at a site, the transaction at that site
reaches the partially committed state and waits for the transaction at all
other sites to reach their partially committed states.
When the controlling site receives the message that all the sites are
ready to commit, it starts to commit.
In a distributed system, either all sites commit or none of them does.
The different distributed commit protocols are −
One-phase commit
Two-phase commit
Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us
consider that there is a controlling site and a number of slave sites
where the transaction is being executed. The steps in distributed
commit are −
After each slave has locally completed its transaction, it sends a
DONE message to the controlling site.
The slaves wait for Commit or Abort message from the
controlling site. This waiting time is called window of vulnerability.
When the controlling site receives DONE message from each
slave, it makes a decision to commit or abort. This is called the
commit point. Then, it sends this message to all the slaves.
On receiving this message, a slave either commits or aborts and
then sends an acknowledgement message to the controlling site.
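The one-phase message flow above can be sketched as a single Python function; the site names and the way messages are modelled as dictionary entries are simplifications for illustration:

```python
def one_phase_commit(slaves):
    """Sketch of distributed one-phase commit from the controlling
    site's point of view."""
    # Each slave completes locally and sends DONE to the controlling site.
    done = {s: "DONE" for s in slaves}
    # Commit point: once all DONE messages arrive, the controlling site
    # decides. Note the slaves have no vote; they simply wait for the
    # decision (the "window of vulnerability").
    decision = "COMMIT" if all(v == "DONE" for v in done.values()) else "ABORT"
    # Each slave applies the decision and acknowledges it.
    acks = {s: f"ACK-{decision}" for s in slaves}
    return decision, acks

decision, acks = one_phase_commit(["s1", "s2"])
```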
Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of one-phase
commit protocols. The steps performed in the two phases are as
follows −
Phase 1: Prepare Phase
After each slave has locally completed its transaction, it sends a
DONE message to the controlling site. When the controlling site has
received DONE message from all slaves, it sends a Prepare
message to the slaves.
The slaves vote on whether they still want to commit or not. If a
slave wants to commit, it sends a Ready message.
A slave that does not want to commit sends a Not Ready message.
This may happen when the slave has conflicting concurrent
transactions or there is a timeout.
Phase 2: Commit/Abort Phase
After the controlling site has received Ready message from all the
slaves −
The controlling site sends a Global Commit message to the
slaves.
The slaves apply the transaction and send a Commit ACK
message to the controlling site.
When the controlling site receives Commit ACK message from all
the slaves, it considers the transaction as committed.
After the controlling site has received the first Not Ready
message from any slave −
The controlling site sends a Global Abort message to the slaves.
The slaves abort the transaction and send an Abort ACK message
to the controlling site.
When the controlling site receives Abort ACK message from all
the slaves, it considers the transaction as aborted.
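The decision logic of the two phases can be sketched from the controlling site's point of view; slave names and the vote-collection helper are invented for illustration:

```python
def prepare_phase(slaves, can_commit):
    """Phase 1: after all DONE messages, the controlling site sends
    Prepare, and each slave votes Ready or Not Ready."""
    return {s: ("READY" if can_commit(s) else "NOT READY") for s in slaves}

def two_phase_commit(votes):
    """Phase 2: if every slave voted Ready, send Global Commit;
    the first Not Ready vote triggers Global Abort instead."""
    if all(v == "READY" for v in votes.values()):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"

# s2 has a conflicting transaction (or timed out), so it votes Not Ready.
votes = prepare_phase(["s1", "s2"], can_commit=lambda s: s != "s2")
outcome = two_phase_commit(votes)
```

In a real implementation, both phases are logged to stable storage at each site so the decision survives crashes; the sketch above only captures the voting rule.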
Distributed Three-phase Commit
The first two phases are similar to those of two-phase commit, with an
added Prepare-to-commit step in which the controlling site asks the
slaves to enter a prepared state before the final decision.
Phase 3: Commit / Abort Phase
The steps are the same as in two-phase commit, except that the Commit
ACK / Abort ACK message is not required.
Concurrency Control
Concurrency Control in Distributed Databases ensures that multiple
transactions can occur simultaneously without leading to data
inconsistency.
In a distributed environment, data is stored across multiple sites,
making concurrency control more complex than in centralized
databases.
For concurrency control and recovery purposes, numerous problems
arise in a distributed DBMS environment that are not encountered in
a centralized DBMS environment.
The system must be able to deal with the failure of one or more of
the communication links that connect the sites.
This breaks up the sites into two or more partitions, where the sites
within each partition can communicate only with one another and
not with sites in other partitions.
1. Distributed Concurrency Control Based on a Distinguished
Copy of a Data Item
To deal with replicated data items in a distributed database, a number
of concurrency control methods have been proposed that extend the
concurrency control techniques for centralized databases. The idea is to
designate a particular copy of each data item as its distinguished copy.
The locks for a data item are associated with its distinguished
copy, and all locking and unlocking requests are sent to the site that
contains that copy.
A number of different methods are based on this idea, but they differ
in their method of choosing the distinguished copies. In the primary
site technique, all distinguished copies are kept at the same site.
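The primary site technique can be sketched as a single lock table held at the distinguished site; the class, item, and transaction names below are invented for illustration:

```python
class PrimarySiteLockManager:
    """Sketch of the primary site technique: every lock and unlock
    request for every data item is sent to this one site."""

    def __init__(self):
        self.locks = {}  # data item -> transaction holding its lock

    def lock(self, item, txn):
        """Grant the lock only if no other transaction holds it."""
        holder = self.locks.get(item)
        if holder is None or holder == txn:
            self.locks[item] = txn
            return True
        return False

    def unlock(self, item, txn):
        """Release the lock, but only for its current holder."""
        if self.locks.get(item) == txn:
            del self.locks[item]

primary = PrimarySiteLockManager()
ok1 = primary.lock("X", "T1")  # granted
ok2 = primary.lock("X", "T2")  # denied: T1 already holds X's lock
```

The simplicity comes at a price the text implies: the primary site is a bottleneck and a single point of failure for lock management.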
2. Distributed Concurrency Control Based on Voting
In the voting method, there is no distinguished copy; each copy maintains
its own lock and can grant or deny a request for it. If a transaction that
requests a lock is granted that lock by a majority of the copies, it holds
the lock and informs all copies that it has been granted the lock.
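The majority rule can be sketched as follows; the per-site lock state and names are invented, and a real protocol would also release the partial grants when the majority is not reached:

```python
def request_lock_by_voting(copies, txn):
    """copies maps each site to the transaction holding that copy's
    lock (or None). Each copy votes independently; the lock is held
    only if a majority of copies grant it."""
    grants = 0
    for site, holder in copies.items():
        if holder is None or holder == txn:
            copies[site] = txn  # this copy grants its local lock
            grants += 1
    # Majority rule: strictly more than half of the copies must grant.
    return grants > len(copies) // 2

# Two of three copies are free, so T1 obtains a majority.
copies = {"s1": None, "s2": None, "s3": "T2"}
granted = request_lock_by_voting(copies, "T1")
```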
Distributed Recovery
Distributed Query Processing
A distributed database query is processed in stages as follows:
1. Query Mapping.
2. Localization.
3. Global Query Optimization. Optimization consists of selecting a
strategy from a list of candidates that is closest to optimal.
A list of candidate queries can be obtained by permuting the
ordering of operations within a fragment query generated by the
previous stage.
Time is the preferred unit for measuring cost. The total cost is a
weighted combination of costs such as CPU cost, I/O costs, and
communication costs.
Since DDBs are connected by a network, often the communication
costs over the network are the most significant.
This is especially true when the sites are connected through a wide
area network (WAN).
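A toy weighted cost model makes the comparison concrete; the weights, the component costs, and the two candidate strategy names below are invented for illustration:

```python
def total_cost(cpu, io, comm, w_cpu=1.0, w_io=5.0, w_comm=100.0):
    """Weighted combination of CPU, I/O, and communication costs.
    Over a WAN, communication typically dominates, hence its large weight."""
    return w_cpu * cpu + w_io * io + w_comm * comm

# Two candidate strategies for the same query: shipping a whole table
# versus reducing it first (e.g., with a semijoin) before shipping.
candidates = {
    "ship_whole_table": total_cost(cpu=10, io=20, comm=50),
    "semijoin_first": total_cost(cpu=15, io=25, comm=5),
}
best = min(candidates, key=candidates.get)
```

Even though the semijoin strategy spends more CPU and I/O locally, its much lower communication cost makes it the cheaper plan under these weights.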
This stage is common to all sites in the DDB. The techniques are
similar to those used in centralized systems.