
[MC4202]

 Distributed Databases

 Spatial and Temporal Databases

 NoSQL Databases

 XML Databases

 Information Retrieval and Web Search

D. Rajashree
MCA

UNIT I DISTRIBUTED DATABASES
Distributed Systems – Introduction – Architecture – Distributed
Database Concepts – Distributed Data Storage – Distributed
Transactions – Commit Protocols – Concurrency Control – Distributed
Query Processing

Distributed Systems

 A distributed computing system consists of a number of processing sites or nodes that are interconnected by a computer network and that cooperate in performing certain assigned tasks.
 As a general goal, distributed computing systems partition a big,
unmanageable problem into smaller pieces and solve it efficiently in a
coordinated manner.
 Thus, more computing power is harnessed to solve a complex task, and
the autonomous processing nodes can be managed independently while
they cooperate to provide the needed functionalities to solve the problem.
 DDB technology resulted from a merger of two technologies: database
technology and distributed systems technology.
 Several distributed database prototype systems were developed in the
1980s and 1990s to address the issues of data distribution, data
replication, distributed query and transaction processing, distributed
database metadata management, and other topics.

 More recently, many new technologies have emerged that combine
distributed and database technologies.
 These technologies and systems are being developed for dealing with
the storage, analysis, and mining of the vast amounts of data that are
being produced and collected, and they are referred to generally as big
data technologies.
 The origins of big data technologies come from distributed systems and
database systems, as well as data mining and machine learning
algorithms that can process these vast amounts of data to extract
needed knowledge.

Distributed Database Architectures

Parallel versus Distributed Architectures


 There are two main types of multiprocessor system architectures that are
commonplace:
■ Shared memory (tightly coupled) architecture. Multiple
processors share secondary (disk) storage and also share primary
memory.
■ Shared disk (loosely coupled) architecture. Multiple
processors share secondary (disk) storage but each has its own
primary memory.
 These architectures enable processors to communicate without the
overhead of exchanging messages over a network.

 Database management systems developed using the above types of architectures are termed parallel database management systems rather than DDBMSs, since they utilize parallel processor technology.
 Another type of multiprocessor architecture is called shared-nothing architecture.
 In this architecture, every processor has its own primary and secondary (disk) memory, no common memory exists, and the processors communicate over a high-speed interconnection network (bus or switch).
 Although the shared-nothing architecture resembles a distributed
database computing environment, major differences exist in the mode of
operation.
 In shared-nothing multiprocessor systems, there is symmetry and
homogeneity of nodes; this is not true of the distributed database
environment, where heterogeneity of hardware and operating system at
each node is very common.
 Shared-nothing architecture is also considered an environment for parallel databases.
 The accompanying figures illustrate a parallel database (shared-nothing), a centralized database with distributed access, and a pure distributed database. We will not expand on parallel architectures and related data management issues here.

General Architecture of Pure Distributed Databases

 In the generic schema architecture of a DDB, the enterprise is presented with a consistent, unified view showing the logical structure of the underlying data across all nodes.
 This view is represented by the global conceptual schema (GCS), which
provides network transparency.
 To accommodate potential heterogeneity in the DDB, each node is
shown as having its own local internal schema (LIS) based on physical
organization details at that particular site.
 The logical organization of data at each site is specified by the local
conceptual schema (LCS).
 The global query compiler references the global conceptual schema
from the global system catalog to verify and impose defined constraints.
 The global query optimizer references both global and local conceptual
schemas and generates optimized local queries from global queries.
 It evaluates all candidate strategies using a cost function that estimates
cost based on response time (CPU, I/O, and network latencies) and
estimated sizes of intermediate results.
 The latter is particularly important in queries involving joins. Having
computed the cost for each candidate, the optimizer selects the
candidate with the minimum cost for execution.

 Each local DBMS would have its local query optimizer, transaction
manager, and execution engines as well as the local system catalog,
which houses the local schemas.
 The global transaction manager is responsible for coordinating the
execution across multiple sites in conjunction with the local transaction
manager at those sites.

Federated Database Schema Architecture


 A typical five-level schema architecture supports global applications in the FDBS environment.
 In this architecture, the local schema is the conceptual schema (full
database definition) of a component database, and the component
schema is derived by translating the local schema into a canonical data
model or common data model (CDM) for the FDBS.
 Schema translation from the local schema to the component schema is
accompanied by generating mappings to transform commands on a
component schema into commands on the corresponding local schema.
 The export schema represents the subset of a component schema that
is available to the FDBS.
 The federated schema is the global schema or view, which is the result
of integrating all the shareable export schemas.
 The external schemas define the schema for a user group or an
application, as in the three-level schema architecture.

 All the problems related to query processing, transaction processing,
and directory and metadata management and recovery apply to FDBSs
with additional considerations.

Three-Tier Client/Server Architecture
 Full-scale DDBMSs that support all of these types of functionality have not been developed.
 Instead, distributed database applications are being developed in the
context of the client/server architectures. It is now more common to
use a three-tier architecture rather than a two-tier architecture,
particularly in Web applications.
 In the three-tier client/server architecture, the following three layers
exist:
1. Presentation layer (client).
 This provides the user interface and interacts with the user. The
programs at this layer present Web interfaces or forms to the client
in order to interface with the application.
 Web browsers are often utilized, and the languages and
specifications used include HTML, XHTML, CSS, Flash, MathML,
Scalable Vector Graphics (SVG), Java, JavaScript, Adobe Flex, and
others.
 This layer handles user input, output, and navigation by accepting
user commands and displaying the needed information, usually in
the form of static or dynamic Web pages.
 The latter are employed when the interaction involves database
access. When a Web interface is used, this layer typically
communicates with the application layer via the HTTP protocol.
2. Application layer (business logic).
 This layer programs the application logic. For example, queries can
be formulated based on user input from the client, or query results
can be formatted and sent to the client for presentation.
 Additional application functionality can be handled at this layer, such
as security checks, identity verification, and other functions.
 The application layer can interact with one or more databases or
data sources as needed by connecting to the database using ODBC,
JDBC, SQL/CLI, or other database access techniques.
3. Database server.
 This layer handles query and update requests from the application
layer, processes the requests, and sends the results.
 Usually SQL is used to access the database if it is relational or
object-relational, and stored database procedures may also be
invoked.
 Query results (and queries) may be formatted into XML when
transmitted between the application server and the database server.
 Exactly how to divide the DBMS functionality among the client,
application server, and database server may vary.
 The common approach is to include the functionality of a centralized
DBMS at the database server level.
 A number of relational DBMS products have taken this approach, in
which an SQL server is provided. The application server must then
formulate the appropriate SQL queries and connect to the database
server when needed.

 The client provides the processing for user interface interactions. Since
SQL is a relational standard, various SQL servers, possibly provided by
different vendors, can accept SQL commands through standards such
as ODBC, JDBC, and SQL/CLI.
 In this architecture, the application server may also refer to a data
dictionary that includes information on the distribution of data among
the various SQL servers, as well as modules for decomposing a global
query into a number of local queries that can be executed at the
various sites.
 Interaction between an application server and database server might
proceed as follows during the processing of an SQL query:
1. The application server formulates a user query based on input
from the client layer and decomposes it into a number of
independent site queries. Each site query is sent to the
appropriate database server site.
2. Each database server processes the local query and sends the
results to the application server site. Increasingly, XML is being
touted as the standard for data exchange so the database server
may format the query result into XML before sending it to the
application server.
3. The application server combines the results of the subqueries to
produce the result of the originally required query, formats it into
HTML or some other form accepted by the client, and sends it to
the client site for display.
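The three-step interaction above can be made concrete with a small sketch. The following Python fragment is illustrative only: the in-memory site catalog, the decomposition helper, and the EMPLOYEE rows are assumptions standing in for a real application server and real database servers, not a DDBMS API.

# Minimal sketch of the three-step flow described above. The site catalog,
# query decomposition, and result combination are hypothetical stand-ins.
SITES = {
    "site1": [{"name": "Smith", "dno": 5}, {"name": "Wong", "dno": 5}],
    "site2": [{"name": "Zelaya", "dno": 4}],
}

def decompose_query(dno, sites):
    # Step 1: one independent site query (here, just a predicate) per site.
    return {site: dno for site in sites}

def execute_local(site, dno):
    # Step 2: each database server processes its local query and returns rows
    # (a real server might format these results as XML).
    return [row for row in SITES[site] if row["dno"] == dno]

def process_global_query(dno):
    site_queries = decompose_query(dno, SITES)
    partial = [execute_local(site, q) for site, q in site_queries.items()]
    # Step 3: the application server combines the subquery results before
    # formatting them (e.g., as HTML) for the client.
    return [row for rows in partial for row in rows]

print(process_global_query(5))   # rows gathered from both sites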

 The application server is responsible for generating a distributed
execution plan for a multisite query or transaction and for supervising
distributed execution by sending commands to servers.
 These commands include local queries and transactions to be
executed, as well as commands to transmit data to other clients or
servers.
 Another function controlled by the application server (or coordinator) is
that of ensuring consistency of replicated copies of a data item by
employing distributed (or global) concurrency control techniques.
 The application server must also ensure the atomicity of global
transactions by performing global recovery when certain sites fail.
 If the DDBMS has the capability to hide the details of data distribution
from the application server, then it enables the application server to
execute global queries and transactions as though the database were
centralized, without having to specify the sites at which the data
referenced in the query or transaction resides.
 This property is called distribution transparency. Some DDBMSs do
not provide distribution transparency, instead requiring that applications
are aware of the details of data distribution.

Distributed Database Concepts
Distributed database (DDB) as a collection of multiple logically interrelated
databases distributed over a computer network, and a distributed database
management system (DDBMS) as a software system that manages a distributed
database while making the distribution transparent to the user.

What Constitutes a DDB?

For a database to be called distributed, the following minimum conditions should be satisfied:

■ Connection of database nodes over a computer network.


There are multiple computers, called sites or nodes. These sites
must be connected by an underlying network to transmit data and
commands among sites.

■ Logical interrelation of the connected databases. It is essential that the information in the various database nodes be logically related.

■ Possible absence of homogeneity among connected nodes. It is not necessary that all nodes be identical in terms of data, hardware, and software.

 The sites may all be located in physical proximity—say, within the same building or a group of adjacent buildings—and connected via a local area network, or they may be geographically distributed over large distances and connected via a long-haul or wide area network.
 Local area networks typically use wireless hubs or cables, whereas
long-haul networks use telephone lines, cables, wireless
communication infrastructures, or satellites.
 It is common to have a combination of various types of networks. Networks may have different topologies that define the direct communication paths among sites.
 The type and topology of the network used may have a significant
impact on the performance and hence on the strategies for distributed
query processing and distributed database design.
 For high-level architectural issues, however, it does not matter what
type of network is used; what matters is that each site be able to
communicate, directly or indirectly, with every other site.
 Some type of network exists among nodes, regardless of any particular topology. We will not address network-specific issues here, although it is important to understand that for efficient operation of a distributed database system (DDBS), network design and performance issues are critical and are an integral part of the overall solution. The details of the underlying network are invisible to the end user.

Transparency
 The concept of transparency extends the general idea of hiding implementation
details from end users.
 A highly transparent system offers a lot of flexibility to the end user/application
developer since it requires little or no awareness of underlying details on their
part.
 In the case of a traditional centralized database, transparency simply pertains to
logical and physical data independence for application developers.
 In a DDB scenario, the data and software are distributed over multiple nodes
connected by a computer network, so additional types of transparencies are
introduced.
Consider a company database in which the EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally (that is, into sets of rows) and stored with possible replication. The following types of transparencies are possible:

 Data organization transparency (also known as distribution or network transparency). This refers to freedom for the user from the operational details of the network and the placement of the data in the distributed system. It may be divided into location transparency and naming transparency.

 Location transparency refers to the fact that the command used to perform a task is independent of the location of the data and the location of the node where the command was issued. Naming transparency implies that once a name is associated with an object, the named objects can be accessed unambiguously without additional specification as to where the data is located.

 Replication transparency. Copies of the same data objects may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of these copies.

 Fragmentation transparency. Two types of fragmentation are possible. Horizontal fragmentation distributes a relation (table) into subrelations that are subsets of the tuples (rows) in the original relation; this is also known as sharding in the newer big data and cloud computing systems.
 Vertical fragmentation distributes a relation into subrelations
where each subrelation is defined by a subset of the columns of
the original relation. Fragmentation transparency makes the user
unaware of the existence of fragments.

 Other transparencies include design transparency and execution transparency—which refer, respectively, to freedom from knowing how the distributed database is designed and where a transaction executes.

Availability and Reliability
 Reliability and availability are two of the most common potential
advantages cited for distributed databases.
 Reliability is broadly defined as the probability that a system is running
(not down) at a certain time point, whereas availability is the probability
that the system is continuously available during a time interval.
 We can directly relate reliability and availability of the database to the
faults, errors, and failures associated with it.
 A failure can be described as a deviation of a system’s behavior from
that which is specified in order to ensure correct execution of
operations.
 Errors constitute the subset of system states that causes the failure. A fault is the cause of an error. To construct a system that is reliable, we can adopt several approaches.
 One common approach stresses fault tolerance; it recognizes that
faults will occur, and it designs mechanisms that can detect and
remove faults before they can result in a system failure.
 Another more stringent approach attempts to ensure that the final
system does not contain any faults.
 This is done through an exhaustive design process followed by
extensive quality control and testing.
 A reliable DDBMS tolerates failures of underlying components, and it
processes user requests as long as database consistency is not
violated.
 A DDBMS recovery manager has to deal with failures arising from
transactions, hardware, and communication networks.
 Hardware failures can either be those that result in loss of main
memory contents or loss of secondary storage contents.
 Network failures occur due to errors associated with messages and line
failures. Message errors can include their loss, corruption, or out-of-
order arrival at destination.
 The previous definitions are used in computer systems in general,
where there is a technical distinction between reliability and availability.
 In most discussions related to DDB, the term availability is used
generally as an umbrella term to cover both concepts.
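To make the distinction concrete, here is a small illustrative calculation (not from the text above): reliability asks whether the system is up at a time point, while availability measures the fraction of an interval during which it is up. The uptime log is invented for the example.

# Illustrative only: estimating both metrics from an assumed uptime log.
# Each entry is (start, end) in hours during which the system was up.
up_intervals = [(0, 40), (42, 95), (96, 168)]   # assumed one-week log

def was_up_at(t):
    # Reliability question: is the system running (not down) at time point t?
    return any(s <= t < e for s, e in up_intervals)

def availability(t0, t1):
    # Availability: fraction of [t0, t1] during which the system was up.
    up = sum(max(0, min(e, t1) - max(s, t0)) for s, e in up_intervals)
    return up / (t1 - t0)

print(was_up_at(41))          # False: the system was down at hour 41
print(availability(0, 168))   # about 0.98 over the week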

Scalability and Partition Tolerance


Scalability determines the extent to which the system can expand its
capacity while continuing to operate without interruption. There are two
types of scalability:
1. Horizontal scalability: This refers to expanding the number of
nodes in the distributed system. As nodes are added to the system,
it should be possible to distribute some of the data and processing
loads from existing nodes to the new nodes.
2. Vertical scalability: This refers to expanding the capacity of the
individual nodes in the system, such as expanding the storage
capacity or the processing power of a node.

 As the system expands its number of nodes, it is possible that the
network, which connects the nodes, may have faults that cause the
nodes to be partitioned into groups of nodes.
 The nodes within each partition are still connected by a subnetwork, but
communication among the partitions is lost.
 The concept of partition tolerance states that the system should have
the capacity to continue operating while the network is partitioned.

Autonomy
 Autonomy determines the extent to which individual nodes or DBs in a
connected DDB can operate independently.
 A high degree of autonomy is desirable for increased flexibility and
customized maintenance of an individual node.
 Autonomy can be applied to design, communication, and execution.
Design autonomy refers to independence of data model usage and
transaction management techniques among nodes.
 Communication autonomy determines the extent to which each node
can decide on sharing of information with other nodes. Execution
autonomy refers to independence of users to act as they please.

Advantages

 Scalability: Can scale out by adding more nodes, which is more cost-effective than scaling up a single server.
 Fault tolerance: Can continue to function even if one or more nodes fail.
 Performance: Can improve performance and user experience.
 Security: Can be more secure than centralized databases by implementing security measures at multiple levels.
 Reliability: Can improve reliability and availability.

Disadvantages

 Complexity: Can be more complex than centralized systems because of the need for coordination and communication between multiple nodes.
 Cost: Can be expensive to purchase and maintain, especially for large or complex systems.
 Security issues: Data can be prone to interception while being communicated between sites.
 Data integrity: Maintaining data integrity can be complex, and different data formats might be used at different sites.
 Concurrency: Can be difficult to control concurrency when multiple users are accessing the same piece of data.

Distributed Data Storage
Consider a relation r that is to be stored in the database. There are two approaches to storing this relation in the distributed database:
Replication:
The system maintains several identical replicas (copies) of the relation and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.
Fragmentation:
The system partitions the relation into several fragments and stores each fragment at a different site.

Data Replication
Data replication is the process of storing separate copies of the database
at two or more sites. It is a popular fault tolerance technique of distributed
databases.
Advantages of Data Replication
Reliability
In case of failure of any site, the database system continues to work since a copy is available at another site.
Reduction in network load
Since local copies of data are available, query processing can be done with reduced network usage, particularly during prime hours. Data updating can be done at non-prime hours.

Quicker response
Availability of local copies of data ensures quick query processing and
consequently quick response time.
Simpler transactions
Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.

Disadvantages of Data Replication


Increased storage requirements
Maintaining multiple copies of data increases storage costs. The storage space required is a multiple of the storage required for a centralized system.
Increased cost and complexity of data updating
Each time a data item is updated, the update needs to be reflected
in all the copies of the data at the different sites. This requires
complex synchronization techniques and protocols.

Undesirable application-database coupling
If complex update mechanisms are not used, removing data inconsistency requires complex coordination at the application level. This results in undesirable coupling between the application and the database.
Some commonly used replication techniques are –
 Snapshot replication
 Near real time replication
 Pull replication

Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are called fragments. Fragmentation can be of three types:
 Horizontal, vertical, and hybrid (a combination of horizontal and vertical).
 Horizontal fragmentation can further be classified into two techniques: primary horizontal fragmentation and derived horizontal fragmentation.
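The fragmentation types can be illustrated on a toy relation. The sketch below is plain Python, not DDBMS syntax; the EMPLOYEE rows and the fragmentation predicates are assumed for the example.

# Toy illustration of horizontal and vertical fragmentation (assumed data).
EMPLOYEE = [
    {"ssn": "111", "name": "Smith",  "salary": 30000, "dno": 5},
    {"ssn": "222", "name": "Wong",   "salary": 40000, "dno": 5},
    {"ssn": "333", "name": "Zelaya", "salary": 25000, "dno": 4},
]

# Horizontal fragmentation: subsets of rows, selected by a predicate.
emp_d5 = [r for r in EMPLOYEE if r["dno"] == 5]
emp_d4 = [r for r in EMPLOYEE if r["dno"] == 4]

# Vertical fragmentation: subsets of columns; the key (ssn) is repeated in
# every fragment so the original relation can be rebuilt by a join.
emp_pay  = [{k: r[k] for k in ("ssn", "salary")} for r in EMPLOYEE]
emp_info = [{k: r[k] for k in ("ssn", "name", "dno")} for r in EMPLOYEE]

# Reconstruction: union of horizontal fragments, join of vertical fragments.
assert sorted(emp_d5 + emp_d4, key=lambda r: r["ssn"]) == EMPLOYEE
joined = [{**a, **b} for a in emp_info for b in emp_pay if a["ssn"] == b["ssn"]]
assert sorted(joined, key=lambda r: r["ssn"]) == EMPLOYEE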

Advantages of fragmentation
 Since data is stored close to the site of usage, efficiency of the
database system is increased.

 Local query optimization techniques are sufficient for most queries
since data is locally available.
 Since irrelevant data is not available at the sites, security and
privacy of the database system can be maintained.
Disadvantages of fragmentation
 When data from different fragments are required, the access speeds
may be very low.
 In case of recursive fragmentations, the job of reconstruction will
need expensive techniques.
 Lack of back-up copies of data in different sites may render the
database ineffective in case of failure of a site.

Types of data replication

There are two types of data replication:
1. Synchronous replication:
In synchronous replication, the replica is modified immediately after changes are made to the original relation, so there is no difference between the original data and the replica.
2. Asynchronous replication:
In asynchronous replication, the replica is updated only after the transaction commits on the original database, so the replica may briefly lag behind the original.
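The contrast between the two styles can be sketched as follows. This is a minimal in-memory illustration; the primary and replica classes are assumptions, and a real system would ship the updates over the network.

# Illustrative contrast between synchronous and asynchronous replication.
class Replica:
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value

class SyncPrimary:
    # Synchronous: every replica is updated inside the write itself,
    # so replicas never differ from the original.
    def __init__(self, replicas):
        self.data, self.replicas = {}, replicas
    def write(self, key, value):
        self.data[key] = value
        for r in self.replicas:
            r.apply(key, value)

class AsyncPrimary:
    # Asynchronous: the write commits locally first; replicas are brought
    # up to date later, so they may briefly lag behind the original.
    def __init__(self, replicas):
        self.data, self.replicas, self.log = {}, replicas, []
    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))   # shipped to replicas after commit
    def propagate(self):
        for key, value in self.log:
            for r in self.replicas:
                r.apply(key, value)
        self.log.clear()

replica = Replica()
SyncPrimary([replica]).write("x", 1)
print(replica.data)                     # {'x': 1}: visible immediately
a = AsyncPrimary([replica])
a.write("y", 2)
print(replica.data)                     # still {'x': 1}: the replica lags
a.propagate()
print(replica.data)                     # {'x': 1, 'y': 2}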
Replication schemes
There are three replication schemes:

1. Full replication
In the full replication scheme, the database is available at almost every location or to every user in the communication network.

Advantages of full replication
 High availability of data, as the database is available at almost every location.
 Faster execution of queries.

Disadvantages of full replication
 Concurrency control is difficult to achieve in full replication.
 Update operations are slower.
2. No replication
No replication means each fragment is stored at exactly one location.
Advantages of no replication
 Concurrency conflicts are minimized, since only one copy of each item exists.
 Easy recovery of data.

Disadvantages of no replication
 Poor availability of data.
 Slower query execution, as multiple clients may be accessing the same server.

3. Partial replication
Partial replication means only some fragments are replicated from the
database.

(OR)
 Distributed data storage refers to a solution where data is stored at
multiple locations, often geographically distant, in a collection of
interconnected systems.
 Each site can individually manage the data, making it ideal for domains
with large volumes of distributed data processed by simultaneous users.
 This architecture offers advantages such as modular development,
better response, improved reliability, lower communication costs,
increased availability, easy expansion, and enhanced performance.

Advantages of Distributed Data Storage:


High Availability:
Data is replicated across multiple nodes, ensuring the system
remains operational even if one or more nodes fail.

Scalability:
The system can easily scale up or down by adding or removing
nodes, accommodating growing data volumes and user demands.

Improved Performance:

Data can be accessed from the node closest to the user, reducing
latency and improving query performance.

Increased Reliability:
Data is distributed across multiple locations, mitigating the risk of
data loss due to a single point of failure.

Enhanced Security:
Data can be encrypted both in transit and at rest, and access
control mechanisms can be implemented to protect sensitive
information.
Modular Development:
Each site can manage its data independently, allowing for easier
development and maintenance.
Better Response Time:
Data can be accessed from the location closest to the user, leading
to faster response times.
Lower Communication Costs:
Data can be stored and processed locally, reducing the need for
extensive data transfer between locations.

Examples of Distributed Storage Systems:


Hadoop Distributed File System (HDFS):
A distributed file system designed for storing and processing large
datasets.
Cloud Storage Services:
Services like Amazon S3, Google Cloud Storage, and Microsoft
Azure Blob Storage, which offer scalable and distributed storage
solutions.
Distributed Databases:
Systems like MongoDB, Cassandra, and CockroachDB, which are
designed to store and manage data across multiple nodes.

Considerations for Distributed Data Storage:
Data Consistency:
Maintaining data consistency across multiple nodes can be
complex, requiring careful consideration of data replication and
synchronization mechanisms.
Network Latency:
Network latency can impact performance, especially when data
needs to be accessed from geographically distant nodes.
Complexity:
Managing a distributed system can be more complex than
managing a centralized system, requiring specialized expertise and
tools.
Security:
Ensuring the security of data across multiple nodes and networks
requires robust security measures.

Distributed Transactions
 A transaction provides a mechanism for describing logical units of database processing. Transaction processing systems are systems with large databases and hundreds of concurrent users executing database transactions.

 A distributed transaction is a set of operations on data that is performed across two or more data repositories (especially databases).

 It is typically coordinated across separate nodes connected by a network, but may also span multiple databases on a single server.

 There are two possible outcomes: 1) all operations successfully complete, or 2) none of the operations are performed at all due to a failure somewhere in the system.

 In the latter case, if some work was completed prior to the failure,
that work will be reversed to ensure no net work was done.

 This type of operation is in compliance with the “ACID” (atomicity-consistency-isolation-durability) principles of databases that ensure data integrity.

 ACID is most commonly associated with transactions on a single database server, but distributed transactions extend that guarantee across multiple databases.

 The operation known as a “two-phase commit” (2PC) is a form of a distributed transaction.
 “XA transactions” are transactions using the XA protocol, which is one implementation of a two-phase commit operation.

How Do Distributed Transactions Work?


 Distributed transactions have the same processing completion
requirements as regular database transactions, but they must be
managed across multiple resources, making them more
challenging to implement for database developers.
 The multiple resources add more points of failure, such as the
separate software systems that run the resources (e.g., the
database software), the extra hardware servers, and network
failures.
 This makes distributed transactions susceptible to failures,
which is why safeguards must be put in place to retain data
integrity.
 For a distributed transaction to occur, transaction managers
coordinate the resources (either multiple databases or multiple
nodes of a single database).
 The transaction manager can be one of the data repositories
that will be updated as part of the transaction, or it can be a
completely independent separate resource that is only
responsible for coordination.
 The transaction manager decides whether to commit a successful transaction or roll back an unsuccessful transaction, the latter of which leaves the database unchanged.
 First, an application requests the distributed transaction to the
transaction manager.
 The transaction manager then branches to each resource, which
will have its own “resource manager” to help it participate in
distributed transactions.
 Distributed transactions are often done in two phases to
safeguard against partial updates that might occur when a
failure is encountered.
 The first phase involves acknowledging an intent to commit, or a
“prepare-to-commit” phase.
 After all resources acknowledge, they are then asked to run a
final commit, and then the transaction is completed.
 We can examine a basic example of what happens when a
failure occurs during a distributed transaction.
 Let’s say one or more of the resources become unavailable
during the prepare-to-commit phase.
 When the request times out, the transaction manager tells each
resource to delete the prepare-to-commit status, and all data will
be reset to its original state.
 If instead, any of the resources become unavailable during the
commit phase, then the transaction manager will tell the other

resources that successfully committed their portion of the
transaction to undo or “rollback” that transaction, and once
again, the data is back to its original state.
 It is then up to the application to retry the transaction to make
sure it gets completed.
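The prepare/commit flow with its failure handling, as described above, can be sketched in a few lines. The ResourceManager class and the TimeoutError signaling are simplifications assumed for illustration; this is not a real XA implementation.

# Simplified sketch of the two-phase flow described above (not a real XA API).
class ResourceManager:
    def __init__(self, name, fail_on=None):
        self.name, self.fail_on = name, fail_on
        self.state = "idle"
    def prepare(self):
        if self.fail_on == "prepare":
            raise TimeoutError(self.name)   # models an unavailable resource
        self.state = "prepared"
    def commit(self):
        if self.fail_on == "commit":
            raise TimeoutError(self.name)
        self.state = "committed"
    def rollback(self):
        self.state = "idle"                 # data back to its original state

def run_distributed_txn(resources):
    try:
        for r in resources:                 # phase 1: prepare-to-commit
            r.prepare()
        for r in resources:                 # phase 2: final commit
            r.commit()
        return "committed"
    except TimeoutError:
        for r in resources:                 # undo any partial work, including
            r.rollback()                    # portions that already committed
        return "rolled back"                # the application may retry

print(run_distributed_txn([ResourceManager("db1"),
                           ResourceManager("db2", fail_on="commit")]))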

Why Do You Need Distributed Transactions?


 Distributed transactions are necessary when you need to quickly
update related data that is spread across multiple databases.
 For example, if you have multiple systems that track customer
information and you need to make a universal update (like
updating the mailing address) across all records, a distributed
transaction will ensure that all records get updated.
 And if a failure occurs, the data is reset to its original state, and
it is up to the originating application to resubmit the transaction.

Distributed Transactions for Streaming Data


 Distributed transactions are especially critical today in data
streaming environments because of the volume of incoming data.
 Even a short-term failure in one of the resources can represent
a potentially large amount of lost data.
 Sophisticated stream processing engines support “exactly-once”
processing in which a distributed transaction covers the reading
of data from a data source, the processing, and the writing of
data to a target destination (the “data sink”).
 The “exactly-once” term refers to the fact that every data point is
processed, and there is no loss and no duplication. (Contrast
this to “at-most-once” which allows data loss, and “at-least-once”
which allows duplication.)
 In an exactly-once streaming architecture, the repositories for
the data source and the data sink must have capabilities to
support the exactly-once guarantee.
 In other words, there must be functionality in those repositories
that lets the stream processing engine fully recover from failure,
which does not necessarily have to be a true transaction
manager, but delivers a similar end result.
 Hazelcast Platform is an example of a stream processing engine
that will enable exactly-once processing with sources and sinks
that do not inherently have the capability to support distributed
transactions.
 This is done by managing the entire state of each data point and re-reading, re-processing, and/or re-writing it if the data point in question encounters a failure.
 This built-in logic allows more types of data repositories (such as
Apache Kafka and JMS) to be used as either sources or sinks in
business-critical streaming applications.
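One common way to approximate this is to combine a replayable source with an idempotent sink, so that an at-least-once replay produces an exactly-once effect. The sketch below illustrates that general idea only; it is not Hazelcast's actual mechanism, and the offsets and dedup table are assumptions.

# Illustrative sketch: replayable source + idempotent sink = exactly-once effect.
source = ["a", "b", "c", "d"]          # a replayable stream (e.g. a log)
sink, processed_ids = [], set()        # the sink remembers what it has written

def write_idempotent(record_id, value):
    # Re-writing the same record after a failure/replay has no extra effect.
    if record_id not in processed_ids:
        processed_ids.add(record_id)
        sink.append(value)

def run(from_offset=0, fail_mid_stream=False):
    for offset in range(from_offset, len(source)):
        write_idempotent(offset, source[offset].upper())
        if fail_mid_stream and offset == 1:
            raise RuntimeError("simulated failure mid-stream")

try:
    run(fail_mid_stream=True)          # fails after processing offsets 0 and 1
except RuntimeError:
    run()                              # replay from the start (at-least-once)

print(sink)                            # ['A', 'B', 'C', 'D']: no loss, no dupes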

When Distributed Transactions Are Not Needed
 In some environments, distributed transactions are not necessary,
and instead, extra auditing activities are put in place to ensure
data integrity when the speed of the transaction is not an issue.
 The transfer of money across banks is a good example. Each
bank that participates in the money transfer tracks the status of
the transaction, and when a failure is detected, the partial state is
corrected.
 This process works well without distributed transactions because
the transfer does not have to happen in (near) real time.
 Distributed transactions are typically critical in situations where
the complete update must be done immediately.

Commit Protocols
 In a local database system, for committing a transaction, the
transaction manager has to only convey the decision to commit to the
recovery manager.
 In a distributed system, the transaction manager should convey the
decision to commit to all the servers in the various sites where the
transaction is being executed and uniformly enforce the decision.
 When processing is complete at each site, it reaches the partially
committed transaction state and waits for all other transactions to reach
their partially committed states.
 When it receives the message that all the sites are ready to commit, it
starts to commit.
 In a distributed system, either all sites commit or none of them does.
The different distributed commit protocols are −
 One-phase commit
 Two-phase commit
 Three-phase commit
Distributed One-phase Commit
 Distributed one-phase commit is the simplest commit protocol. Let us
consider that there is a controlling site and a number of slave sites
where the transaction is being executed. The steps in distributed
commit are −

 After each slave has locally completed its transaction, it sends a
DONE message to the controlling site.
 The slaves wait for Commit or Abort message from the
controlling site. This waiting time is called window of vulnerability.
 When the controlling site receives DONE message from each
slave, it makes a decision to commit or abort. This is called the
commit point. Then, it sends this message to all the slaves.
 On receiving this message, a slave either commits or aborts and
then sends an acknowledgement message to the controlling site.
Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of one-phase
commit protocols. The steps performed in the two phases are as
follows −
Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a
DONE message to the controlling site. When the controlling site has
received DONE message from all slaves, it sends a Prepare
message to the slaves.
 The slaves vote on whether they still want to commit or not. If a
slave wants to commit, it sends a Ready message.
 A slave that does not want to commit sends a Not Ready message.
This may happen when the slave has conflicting concurrent
transactions or there is a timeout.

Phase 2: Commit/Abort Phase
After the controlling site has received a Ready message from all the slaves −
 The controlling site sends a Global Commit message to the slaves.
 The slaves apply the transaction and send a Commit ACK message to the controlling site.
 When the controlling site receives a Commit ACK message from all the slaves, it considers the transaction as committed.
After the controlling site has received the first Not Ready message from any slave −
 The controlling site sends a Global Abort message to the slaves.
 The slaves abort the transaction and send an Abort ACK message to the controlling site.
 When the controlling site receives an Abort ACK message from all the slaves, it considers the transaction as aborted.
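In the controlling-site/slave vocabulary used above, the two phases can be simulated in a few lines. This is a minimal sketch with the network messaging replaced by direct method calls; the class names are assumptions for illustration.

# Minimal simulation of distributed 2PC using the terms from the text.
class Slave:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "done"             # DONE already sent to the controlling site
    def vote(self):                     # reply to the Prepare message
        return "Ready" if self.can_commit else "Not Ready"
    def global_decision(self, decision):
        self.state = "committed" if decision == "Global Commit" else "aborted"
        return "ACK"

def controlling_site(slaves):
    # Phase 1: after DONE from every slave, send Prepare and collect votes.
    votes = [s.vote() for s in slaves]
    # Phase 2: commit only if every slave voted Ready; otherwise abort.
    decision = "Global Commit" if all(v == "Ready" for v in votes) else "Global Abort"
    acks = [s.global_decision(decision) for s in slaves]
    assert all(a == "ACK" for a in acks)
    return decision

print(controlling_site([Slave(), Slave()]))                  # Global Commit
print(controlling_site([Slave(), Slave(can_commit=False)]))  # Global Abort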

Distributed Three-phase Commit


The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are the same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
The controlling site issues an Enter Prepared State broadcast
message. The slave sites vote OK in response.

Phase 3: Commit/Abort Phase
The steps are the same as in two-phase commit, except that the Commit ACK/Abort ACK message is not required.

Concurrency Control
 Concurrency Control in Distributed Databases ensures that multiple
transactions can occur simultaneously without leading to data
inconsistency.
 In a distributed environment, data is stored across multiple sites,
making concurrency control more complex than in centralized
databases.
 For concurrency control and recovery purposes, numerous problems
arise in a distributed DBMS environment that are not encountered in
a centralized DBMS environment.

These include the following:

Dealing with multiple copies of the data items.

 The concurrency control method is responsible for maintaining consistency among these copies.
 The recovery method is responsible for making a copy consistent with other copies if the site on which the copy is stored fails and recovers later.

Failure of individual sites.


 The DDBMS should continue to operate with its running sites, if
possible, when one or more individual sites fail.
 When a site recovers, its local database must be brought up-to-date
with the rest of the sites before it rejoins the system.
Failure of communication links.

 The system must be able to deal with the failure of one or more of
the communication links that connect the sites.

 An extreme case of this problem is that network partitioning may occur.

 This breaks up the sites into two or more partitions, where the sites
within each partition can communicate only with one another and
not with sites in other partitions.

 Distributed commit. Problems can arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during the commit process. The two-phase commit protocol is often used to deal with this problem.
 Distributed deadlock. Deadlock may occur among several sites,
so techniques for dealing with deadlocks must be extended to take
this into account.
 Distributed concurrency control and recovery techniques must deal with these and other problems. In the following subsections, we review some of the techniques that have been suggested to deal with recovery and concurrency control in DDBMSs.

1. Distributed Concurrency Control Based on a Distinguished
Copy of a Data Item
 To deal with replicated data items in a distributed database, a number
of concurrency control methods have been proposed that extend the
concurrency control techniques for centralized databases.

 We discuss these techniques in the context of extending centralized locking. Similar extensions apply to other concurrency control techniques.

 The idea is to designate a particular copy of each data item as a distinguished copy.

 The locks for this data item are associated with the distinguished
copy, and all locking and unlocking requests are sent to the site that
contains that copy.

 A number of different methods are based on this idea, but they differ
in their method of choosing the distinguished copies. In the primary
site technique, all distinguished copies are kept at the same site.

 A modification of this approach is the primary site with a backup site. Another approach is the primary copy method, where the distinguished copies of the various data items can be stored in different sites.

 A site that includes a distinguished copy of a data item basically acts as the coordinator site for concurrency control on that item.
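The distinguished-copy idea can be sketched as a lock table held at the coordinator site of each item. The item-to-site catalog and the lock-table layout below are assumptions for illustration.

# Sketch of primary copy locking: each item's lock lives at one designated site.
PRIMARY_COPY_SITE = {"X": "site1", "Y": "site2"}   # distinguished copies
lock_tables = {"site1": {}, "site2": {}}           # per-site lock tables

def lock(item, txn):
    # Every lock request for an item goes to the site holding its
    # distinguished copy, which acts as coordinator for that item.
    table = lock_tables[PRIMARY_COPY_SITE[item]]
    if item in table and table[item] != txn:
        return False                    # lock held by another transaction
    table[item] = txn
    return True

def unlock(item, txn):
    table = lock_tables[PRIMARY_COPY_SITE[item]]
    if table.get(item) == txn:
        del table[item]

print(lock("X", "T1"))   # True: site1 grants X to T1
print(lock("X", "T2"))   # False: site1 already granted X to T1
unlock("X", "T1")
print(lock("X", "T2"))   # True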

2. Distributed Concurrency Control Based on Voting

 The concurrency control methods for replicated items discussed earlier all use the idea of a distinguished copy that maintains the locks for that item.

 In the voting method, there is no distinguished copy; rather, a lock request is sent to all sites that include a copy of the data item.

 Each copy maintains its own lock and can grant or deny the request for
it. If a transaction that requests a lock is granted that lock
by a majority of the copies, it holds the lock and informs all copies that
it has been granted the lock.

 If a transaction does not receive a majority of votes granting it a lock within a certain time-out period, it cancels its request and informs all sites of the cancellation.
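A minimal sketch of the voting method follows, with the copy sites modeled as local objects and the time-out reduced to an immediate cancellation; both simplifications are assumptions for illustration.

# Sketch of majority voting: every copy keeps its own lock; a transaction
# holds the item's lock only if a majority of the copies grant the request.
class CopySite:
    def __init__(self):
        self.holder = None
    def request_lock(self, txn):
        if self.holder is None or self.holder == txn:
            self.holder = txn
            return True                  # this copy grants the lock
        return False                     # this copy denies the lock

def vote_lock(copies, txn):
    granted = [c for c in copies if c.request_lock(txn)]
    if len(granted) > len(copies) // 2:  # a majority of copies granted
        return True                      # inform all copies (omitted here)
    for c in granted:                    # no majority: cancel the request and
        c.holder = None                  # release the partial grants
    return False

copies = [CopySite(), CopySite(), CopySite()]
print(vote_lock(copies, "T1"))   # True: 3 of 3 copies grant the lock
print(vote_lock(copies, "T2"))   # False: T1 still holds all copies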

Distributed Recovery

 The recovery process in distributed databases is quite involved. We give only a very brief idea of some of the issues here.

 In some cases it is quite difficult even to determine whether a site is down without exchanging numerous messages with other sites.

 For example, suppose that site X sends a message to site Y and expects a response from Y but does not receive it. There are several possible explanations:
a) The message was not delivered to Y because of communication
failure.
b) Site Y is down and could not respond.
c) Site Y is running and sent a response, but the response was not
delivered.
Without additional information or the sending of additional messages,
it is difficult to determine what actually happened.

Distributed Query Processing
A distributed database query is processed in stages as follows:

1. Query Mapping.

 The input query on distributed data is specified formally using a query language.
 It is then translated into an algebraic query on global relations. This translation is done by referring to the global conceptual schema and does not take into account the actual distribution and replication of data.
 Hence, this translation is largely identical to the one performed in a
centralized DBMS.
 It is first normalized, analyzed for semantic errors, simplified, and
finally restructured into an algebraic query.

2. Localization.

 In a distributed database, fragmentation results in relations being stored in separate sites, with some fragments possibly being replicated.
 This stage maps the distributed query on the global schema to separate queries on individual fragments using data distribution and replication information.
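For example, if relation EMPLOYEE is horizontally fragmented into EMP1 and EMP2 stored at different sites, localization rewrites a global selection on EMPLOYEE as the union of the same selection on each fragment. A toy sketch, with the fragments and predicate assumed:

# Toy localization: a global selection on EMPLOYEE becomes the union of
# selections on its horizontal fragments EMP1 and EMP2 (assumed data).
EMP1 = [{"name": "Smith", "dno": 5}, {"name": "Wong", "dno": 5}]   # at site 1
EMP2 = [{"name": "Zelaya", "dno": 4}]                              # at site 2

def select(fragment, dno):
    return [r for r in fragment if r["dno"] == dno]

# sigma_{dno=5}(EMPLOYEE) == sigma_{dno=5}(EMP1) UNION sigma_{dno=5}(EMP2)
localized = select(EMP1, 5) + select(EMP2, 5)
print(localized)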

3. Global Query Optimization.
 Optimization consists of selecting a strategy from a list of candidates that is closest to optimal.
 A list of candidate queries can be obtained by permuting the ordering of operations within a fragment query generated by the previous stage.
 Time is the preferred unit for measuring cost. The total cost is a weighted combination of costs such as CPU cost, I/O costs, and communication costs.
 Since DDBs are connected by a network, often the communication costs over the network are the most significant.
 This is especially true when the sites are connected through a wide area network (WAN).
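A weighted cost function of the kind described can be sketched as follows; the weights and the candidate measurements are invented for illustration, with communication weighted most heavily as it would be over a WAN.

# Illustrative weighted cost model; all numbers are invented.
WEIGHTS = {"cpu": 1.0, "io": 5.0, "comm": 20.0}   # comm dominates over a WAN

def total_cost(c):
    return sum(WEIGHTS[k] * c[k] for k in WEIGHTS)

candidates = {
    "ship whole relation to site2, join there": {"cpu": 10, "io": 40, "comm": 300},
    "semijoin: ship the join column only":      {"cpu": 15, "io": 50, "comm": 60},
}

# The optimizer selects the candidate strategy with minimum estimated cost.
best = min(candidates, key=lambda name: total_cost(candidates[name]))
print(best, total_cost(candidates[best]))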

4. Local Query Optimization.

 This stage is common to all sites in the DDB. The techniques are
similar to those used in centralized systems.
