8 Distributed Databases
8 Distributed Databases
What is a distributed system? It's one of those things that's hard to define without first defining many other things.
Here is a "cascading" definition of a distributed system:
A program
is the code you write.
A process
is what you get when you run it.
A message
is used to communicate between processes.
A packet
is a fragment of a message that might travel on a wire.
A protocol
is a formal description of message formats and the rules that two processes must follow in order to
exchange those messages.
A network
is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which
are connected by communication links.
A component
can be a process or any piece of hardware required to run a process, support communications between
processes, store data, etc.
A distributed system
is an application that executes a collection of protocols to coordinate the actions of multiple processes
on a network, such that all components cooperate together to perform a single or small set of related tasks.
Fault-Tolerant: It can recover from component failures without performing incorrect actions.
Highly Available: It can restore operations, permitting it to resume providing services even when
some components have failed.
Recoverable: Failed components can restart themselves and rejoin the system, after the cause of
failure has been repaired.
Consistent: The system can coordinate actions by multiple components often in the presence of
concurrency and failure. This underlies the ability of a distributed system to act like a non-
distributed system.
Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For
example, we might increase the size of the network on which the system is running. This increases
the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might
increase the number of users or servers, or overall load on the system. In a scalable system, this
should not have a significant effect.
Predictable Performance: The ability to provide desired responsiveness in a timely manner.
Secure: The system authenticates access to data and services
DISTRIBUTED DATABASES
Distributed database is a system in which storage devices are not connected to a common
processing unit.
Database is controlled by Distributed Database Management System and data may be
stored at the same location or spread over the interconnected network. It is a loosely
coupled system.
Shared nothing architecture is used in distributed databases.
The above diagram is a typical example of distributed database system, in which
communication channel is used to communicate with the different locations and every
system has its own memory and database.
Reliability: In distributed database system, if one system fails down or stops working for
some time another system can complete the task.
Availability: In distributed database system reliability can be achieved even if sever fails
down. Another system is available to serve the client request.
Performance: Performance can be achieved by distributing database over different
locations. So the databases are available to every location which is easy to maintain.
A client server architecture has a number of clients and a few servers connected in a
network.
A client sends a query to one of the servers. The earliest available server solves it and
replies.
A Client-server architecture is simple to implement and execute due to centralized server
system.
3. Middleware architecture.
Middleware architectures are designed in such a way that single query is executed on
multiple servers.
This system needs only one server which is capable of managing queries and transactions
from multiple servers.
Middleware architecture uses local servers to handle local queries and transactions.
The software are used for execution of queries and transactions across one or more
independent database servers, this type of software is called as middleware.
What is fragmentation?
The process of dividing the database into a smaller multiple parts is called
as fragmentation.
These fragments may be stored at different locations.
The data fragmentation process should be carried out in such a way that the
reconstruction of original database from the fragments is possible.
Example:
Account (Acc_No, Balance, Branch_Name, Type).
In this example if values are inserted in table Branch_Name as Pune, Baroda, Delhi.
Example:
For the above table we can define any simple condition like, Branch_Name= 'Pune',
Branch_Name= 'Delhi', Balance < 50,000
Fragmentation1:
SELECT * FROM Account WHERE Branch_Name= 'Pune' AND Balance < 50,000
Fragmentation2:
SELECT * FROM Account WHERE Branch_Name= 'Delhi' AND Balance < 50,000
Fragmentation1:
SELECT * FROM Account WHERE Branch_Name= 'Baroda' AND Balance < 50,000
Fragmentation2:
SELECT * FROM Account WHERE Branch_Name= 'Delhi' AND Balance < 50,000
2. Vertical Fragmentation
Example:
Fragmentation1:
SELECT * FROM Acc_NO
Fragmentation2:
SELECT * FROM Balance
3) Hybrid Fragmentation
Fragmentation1:
SELECT * FROM Emp_Name WHERE Emp_Age < 40
Fragmentation2:
SELECT * FROM Emp_Id WHERE Emp_Address= 'Pune' AND Salary < 14000
Data replication is the process in which the data is copied at multiple locations (Different
computers or servers) to improve the availability of data.
1. Synchronous Replication:
In synchronous replication, the replica will be modified immediately after some changes are
made in the relation table. So there is no difference between original data and replica.
2. Asynchronous replication:
In asynchronous replication, the replica will be modified after commit is fired on to the
database.
Replication Schemes
1. Full Replication
In full replication scheme, the database is available to almost every location or user in
communication network.
2. No Replication
Advantages of no replication
Concurrency can be minimized.
Easy recovery of data.
Disadvantages of no replication
Poor availability of data.
Slows down the query execution process, as multiple clients are accessing the same server.
3. Partial replication
Partial replication means only some fragments are replicated from the database.
Various factors which are considered while processing a query are as follows:
This is a very important factor while processing queries. The intermediate data is
transferred to other location for data processing and the final result will be sent to the
location where the actual query is processing.
The cost of data increases if the locations are connected via high performance
communicating channel.
The DDBMS query optimization algorithms are used to minimize the cost of data
transfer.
Semi-join based query optimization
Distributed Transactions
For example:
Consider that, location A sends message to location B and expects response from B but B is
unable to receive it. There are several problems for this situation which are as follows.
Commit request:
In commit phase the coordinator attempts to prepare all cohorts and take necessary steps
to commit or terminate the transactions.
Commit phase:
The commit phase is based on voting of cohorts and the coordinator decides to commit or
terminate the transaction.
Some problems which occur while accessing the database are as follows:
4. Distributed commit
While committing a transaction which is accessing databases stored on multiple locations, if
failure occurs on some location during the commit process then this problem is called as
distributed commit.
5. Distributed deadlock
Deadlock can occur at several locations due to recovery problem and concurrency problem
(multiple locations are accessing same system in the communication network).
There are three different ways of making distinguish copy of data by applying:
3) Exclusive lock
The transaction can activate exclusive lock on a data to read and write operation. In this
system, no other transaction can activate any kind of lock on that same data.