Distributed Databases
and Concurrency Control
Chapter Outline
1. Distributed Database Concepts
2. Data Fragmentation, Replication and Allocation
3. Types of Distributed Database Systems
4. Concurrency Control and Recovery
5. 3-Tier Client-Server Architecture
Distributed Database Concepts
A transaction can be executed by multiple
networked computers in a unified manner.
A distributed database (DDB) processes Unit of
execution (a transaction) in a distributed manner.
A distributed database (DDB) can be defined as
A distributed database (DDB) is a collection of
multiple logically related database distributed over
a computer network, and a distributed database
management system as a software system that
manages a distributed database while making the
distribution transparent to the user.
London New York
Payroll app Payroll app
EMP
London
New York
Internet
Hong Kong
Payroll app Problem:
NY and HK payroll
apps run very slowly!
Hong Kong
London New York
Payroll app Payroll app
London
Emp NY
London
New York Emp
Internet
Hong Kong
Payroll app
Much better!!
Hong Kong
HK
Emp
London New York
Payroll app Payroll app
Annual
Bonus app
London
Emp NY
London
New York Emp
Internet
Hong Kong Distribution provides
Payroll app
opportunities for
parallel execution
Hong Kong
HK
Emp
Distributed Database System
Advantages
Management of distributed data with different
levels of transparency:
This refers to the physical placement of data (files,
relations, etc.) which is not known to the user
(distribution transparency).
Distributed Database System
Advantages (transparency, contd.)
The EMPLOYEE, PROJECT, and WORKS_ON
tables may be fragmented horizontally and stored
with possible replication as shown below.
Distributed Database System
Advantages (transparency, contd.)
Distribution and Network transparency:
Users do not have to worry about operational details
of the network.
There is Location transparency, which refers to freedom of
issuing command from any location without affecting its
working.
Then there is Naming transparency, which allows access
to any names object (files, relations, etc.) from any
location.
Distributed Database System
Advantages (transparency, contd.)
Replication transparency:
It allows to store copies of a data at multiple sites as
shown in the previous diagram.
This is done to minimize access time to the required
data.
Fragmentation transparency:
Allows to fragment a relation horizontally (create a
subset of tuples of a relation) or vertically (create a
subset of columns of a relation).
Distributed Database System
Other Advantages
Increased reliability and availability:
Reliability refers to system live time, that is, system
is running efficiently most of the time. Availability is
the probability that the system is continuously
available (usable or accessible) during a time
interval.
A distributed database system has multiple nodes
(computers) and if one fails then others are
available to do the job.
Distributed Database System
Other Advantages (contd.)
Improved performance:
A distributed DBMS fragments the database to keep
data closer to where it is needed most.
This reduces data management (access and
modification) time significantly.
Easier expansion (scalability):
Allows new nodes (computers) to be added anytime
without changing the entire configuration.
Distributed Data Storage
Assume relational data model
Replication
System maintains multiple copies of data, stored in different
sites, for faster retrieval and fault tolerance.
Fragmentation
Relation is partitioned into several fragments stored in distinct
sites
Replication and fragmentation can be combined
Relation is partitioned into several fragments: system
maintains several identical replicas of each such fragment.
Data Fragmentation, Replication and
Allocation
Data Replication
Database is replicated to all sites.
In full replication the entire database is replicated and in
partial replication some selected part is replicated to some
of the sites.
Data replication is achieved through a replication schema.
Data Distribution (Data Allocation)
This is relevant only in the case of partial replication or
partition.
The selected portion of the database is distributed to the
database sites.
Data Replication
A relation or fragment of a relation is
replicated if it is stored redundantly in two or
more sites.
Full replication of a relation is the case where
the relation is stored at all sites.
Fully redundant databases are those in which
every site contains a copy of the entire
database.
Data Replication (Cont.)
Advantages of Replication
Availability: failure of site containing relation r does not result in
unavailability of r is replicas exist.
Parallelism: queries on r may be processed by several nodes in
parallel.
Reduced data transfer: relation r is available locally at each site
containing a replica of r.
Disadvantages of Replication
Increased cost of updates: each replica of relation r must be
updated.
Increased complexity of concurrency control: concurrent updates to
distinct replicas may lead to inconsistent data unless special
concurrency control mechanisms are implemented.
One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy
Types of Distributed Database Systems
Homogeneous
All sites of the database
system have identical Window
setup, i.e., same database Site 5 Unix
system software. Oracle Site 1
Oracle
The underlying operating Window
system may be different. Site 4 Communications
For example, all sites run network
Oracle or DB2, or Sybase
or some other database Oracle
system.
Site 3 Site 2
The underlying operating Linux Oracle Linux Oracle
systems can be a mixture
of Linux, Window, Unix,
etc.
Types of Distributed Database Systems
Heterogeneous
Federated: Each site may run different database system but the
data access is managed through a single conceptual schema.
This implies that the degree of local autonomy is minimum. Each site
must adhere to a centralized access policy. There may be a global
schema.
Multidatabase: There is no one conceptual global schema. For
data access a schema is constructed dynamically as needed by
the application software. Unix Relational
Object
Oriented Site 5 Unix
Site 1
Hierarchical
Window
Site 4 Communications
network
Network
Object DBMS
Oriented Site 3 Site 2 Relational
Linux Linux
Homogeneous Distributed Databases
In a homogeneous distributed database
All sites have identical software
Are aware of each other and agree to cooperate in processing user
requests.
Each site surrenders part of its autonomy in terms of right to change
schemas or software
Appears to user as a single system
In a heterogeneous distributed database
Different sites may use different schemas and software
Difference in schema is a major problem for query processing
Difference in software is a major problem for transaction processing
Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing
Types of Distributed Database Systems
Federated Database Management Systems
Issues
Differences in data models:
Relational, Objected oriented, hierarchical, network,
etc.
Differences in constraints:
Each site may have their own data accessing and
processing constraints.
Differences in query language:
Some site may use SQL, some may use SQL-89,
some may use SQL-99, and so on.
Client-Server Database Architecture
It consists of clients running client software, a set
of servers which provide all database
functionalities and a reliable communication
infrastructure.
Server 1 Client 1
Client 2
Server 2 Client 3
Server n Client n
Client-Server Database Architecture
Clients reach server for desired service, but
server does reach clients.
The server software is responsible for local data
management at a site, much like centralized
DBMS software.
The client software is responsible for most of the
distribution function.
The communication software manages
communication among clients and servers.
Concurrency Control Techniques
What is Concurrency Control?
Concurrency Control in Database Management
System is a procedure of managing simultaneous
operations without conflicting with each other.
It ensures that Database transactions are performed
concurrently and accurately to produce correct
results without violating data integrity of the
respective Database.
Concurrent access is quite easy if all users are just
reading data. There is no way they can interfere with
one another. Though for any practical Database, it
would have a mix of READ and WRITE operations
and hence the concurrency is a challenge.
Potential problems of Concurrency
Lost Updates occur when multiple transactions select
the same row and update the row based on the value
selected
Uncommitted dependency issues occur when the
second transaction selects a row which is updated by
another transaction (dirty read)
Non-Repeatable Read occurs when a second
transaction is trying to access the same row several
times and reads different data each time.
Temporary Update Problem:
Temporary update or dirty read
problem occurs when one
transaction updates an item and
fails. But the updated item is
used by another transaction
before the item is changed or
reverted back to its last value.
In this example, if transaction 1
fails for some reason then X will
revert back to its previous value.
But transaction 2 has already
read the incorrect value of X.
Lost Update Problem:
In the lost update problem,
update done to a data item
by a transaction is lost as it is
overwritten by the update
done by another transaction.
In this example, transaction 1
changes the value of X but it
gets overwritten by the
update done by transaction 2
on X. Therefore, the update
done by transaction 1 is lost.
Unrepeatable Read Problem:
The unrepeatable problem occurs
when two or more read operations of
the same transaction read different
values of the same variable.
In this example, once transaction 2
reads the variable X, a write
operation in transaction 1 changes
the value of the variable X. Thus,
when another read operation is
performed by transaction 2, it reads
the new value of X which was
updated by transaction 1.
Concurrency Control and Recovery
Distributed Databases encounter a number of
concurrency control and recovery problems which
are not present in centralized databases. Some
of them are listed below.
Dealing with multiple copies of data items
Failure of individual sites
Communication link failure
Distributed commit
Distributed deadlock
Concurrency Control and Recovery
Details
Dealing with multiple copies of data items:
The concurrency control must maintain global
consistency. Likewise the recovery mechanism
must recover all copies and maintain consistency
after recovery.
Failure of individual sites:
Database availability must not be affected due to
the failure of one or two sites and the recovery
scheme must recover them before they are
available for use.
Concurrency Control and Recovery
Details (contd.)
Communication link failure:
This failure may create network partition which would affect
database availability even though all database sites may be
running.
Distributed commit:
A transaction may be fragmented and they may be executed
by a number of sites. This require a two or three-phase
commit approach for transaction commit.
Distributed deadlock:
Since transactions are processed at multiple sites, two or
more sites may get involved in deadlock. This must be
resolved in a distributed manner.
System Failure Modes
Failures unique to distributed systems:
Failure of a site.
Loss of massages
Handled by network transmission control protocols such as
TCP-IP
Failure of a communication link
Handled by network protocols, by routing messages via
alternative links
Network partition
A network is said to be partitioned when it has been split into
two or more subsystems that lack any connection between them
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally indistinguishable.
What is a Transaction?
A set of steps completed by a DBMS to accomplish a
single user task.
Must be either entirely completed or aborted
No intermediate states are acceptable
Distributed Transactions
Transaction may access data at several sites.
Each site has a local transaction manager responsible for:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent execution of the transactions
executing at that site.
Each site has a transaction coordinator, which is responsible for:
Starting the execution of transactions that originate at the site.
Distributing subtransactions at appropriate sites for execution.
Coordinating the termination of each transaction that originates at the
site, which may result in the transaction being committed at all sites or
aborted at all sites.
Transaction System Architecture
Database Concurrency Control
1 Purpose of Concurrency Control
To enforce Isolation (through mutual exclusion) among
conflicting transactions.
To preserve database consistency through consistency
preserving execution of transactions.
To resolve read-write and write-write conflicts.
Example:
In concurrent execution environment if T1 conflicts with T2
over a data item A, then the existing concurrency control
decides if T1 or T2 should get the A and if the other
transaction is rolled-back or waits.
Concurrency Control Protocols
Different concurrency control protocols offer
different benefits between the amount of
concurrency they allow and the amount of overhead
that they impose.
Following are the Concurrency Control techniques
in DBMS:
Lock-Based Protocols
Two Phase Locking Protocol
Timestamp-Based Protocols
Validation-Based Protocols
Lock-based Protocols
Lock Based Protocols in DBMS is a mechanism in which
a transaction cannot Read or Write the data until it
acquires an appropriate lock. Lock based protocols help
to eliminate the concurrency problem in DBMS for
simultaneous transactions by locking or isolating a
particular transaction to a single user.
A lock is a data variable which is associated with a data
item. This lock signifies that operations that can be
performed on the data item. Locks in DBMS help
synchronize access to the database items by concurrent
transactions.
All lock requests are made to the concurrency-control
manager. Transactions proceed only once the lock
request is granted.
1. Shared Lock (S):
A shared lock is also called a Read-only lock. With
the shared lock, the data item can be shared
between transactions. This is because you will never
have permission to update data on the data item.
For example, consider a case where two
transactions are reading the account balance of a
person. The database will let them read by placing a
shared lock. However, if another transaction wants
to update that account’s balance, shared lock
prevent it until the reading process is over.
2. Exclusive Lock (X):
With the Exclusive Lock, a data item can be read as well as
written. This is exclusive and can’t be held concurrently on the
same data item. X-lock is requested using lock-x instruction.
Transactions may unlock the data item after finishing the ‘write’
operation.
For example, when a transaction needs to update the account
balance of a person. You can allows this transaction by placing X
lock on it.
Therefore, when the second transaction wants to read or write,
exclusive lock prevent this operation.
Database Concurrency Control
Two-Phase Locking Techniques
Locking is an operation which secures
(a) permission to Read
(b) permission to Write a data item for a transaction.
Example:
Lock (X). Data item X is locked in behalf of the requesting
transaction.
Unlocking is an operation which removes these permissions
from the data item.
Example:
Unlock (X): Data item X is made available to all other
transactions.
Lock and Unlock are Atomic operations.
Database Concurrency Control
Two-Phase Locking Techniques: Essential components
Two locks modes:
(a) shared (read) (b) exclusive (write).
Shared mode: shared lock (X)
More than one transaction can apply share lock on X for
reading its value but no write lock can be applied on X by any
other transaction.
Exclusive mode: Write lock (X)
Only one write lock on X can exist at any time and no shared
lock can be applied by any other transaction on X.
Conflict matrix Read Write
Read
Y N
Write
N N
Database Concurrency Control
Two-Phase Locking Techniques: Essential
components
Lock Manager:
Managing locks on data items.
Lock table:
Lock manager uses it to store the identify of
transaction locking a data item, the data item, lock
mode and pointer to the next data item locked. One
simple way to implement a lock table is through
linked list.
Transaction ID Data item id lock mode Ptr to next data item
T1 X1 Read Next
Database Concurrency Control
Two-Phase Locking Techniques: Essential
components
Database requires that all transactions should be
well-formed. A transaction is well-formed if:
It must lock the data item before it reads or writes to
it.
It must not lock an already locked data items and it
must not try to unlock a free data item.
Database Concurrency Control
Dealing with Deadlock and Starvation
Deadlock
T’1 T’2
read_lock (Y); T1 and T2 did follow two-phase
read_item (Y); policy but they are deadlock
read_lock (X);
read_item (Y);
write_lock (X);
(waits for X) write_lock (Y);
(waits for Y)
Deadlock (T’1 and T’2)
Database Concurrency Control
Dealing with Deadlock and Starvation
Deadlock prevention
A transaction locks all data items it refers to before
it begins execution.
This way of locking prevents deadlock since a
transaction never waits for a data item.
The conservative two-phase locking uses this
approach.
Dealing with Deadlock and Starvation
Starvation is the situation when a transaction needs to wait
for an indefinite period to acquire a lock.
Following are the reasons for Starvation:
When waiting scheme for locked items is not properly
managed
In the case of resource leak
The same transaction is selected as a victim repeatedly
Deadlock
Deadlock refers to a specific situation where two or more
processes are waiting for each other to release a resource or
more than two processes are waiting for the resource in a
circular chain.
Database Concurrency Control
Dealing with Deadlock and Starvation
Deadlock avoidance
There are many variations of two-phase locking algorithm.
Some avoid deadlock by not letting the cycle to complete.
That is as soon as the algorithm discovers that blocking a
transaction is likely to create a cycle, it rolls back the
transaction.
Wound-Wait and Wait-Die algorithms use timestamps to
avoid deadlocks by rolling-back victim.
Database Concurrency Control
Dealing with Deadlock and Starvation
Starvation
Starvation occurs when a particular transaction consistently
waits or restarted and never gets a chance to proceed
further.
In a deadlock resolution it is possible that the same
transaction may consistently be selected as victim and
rolled-back.
This limitation is inherent in all priority based scheduling
mechanisms.
In Wound-Wait scheme a younger transaction may always
be wounded (aborted) by a long running older transaction
which may create starvation.
Commit Protocols
Commit protocols are used to ensure atomicity across
sites
a transaction which executes at multiple sites must either
be committed at all the sites, or aborted at all the sites.
not acceptable to have a transaction committed at one
site and aborted at another
The two-phase commit (2PC) protocol is widely used
The three-phase commit (3PC) protocol is more
complicated and more expensive, but avoids some
drawbacks of two-phase commit protocol.
Distributed commit problem
.
Transaction T
Action: Action: Action:
a1,a2 a3 a4,a5
Commit must be atomic
Distributed commit problem
Commit must be atomic
Solution: Two-phase commit (2PC)
Centralized 2PC
Distributed 2PC
Linear 2PC
Many other variants…
Terminology
Resource Managers (RMs)
Usually databases
Participants
RMs that did work on behalf of transaction
Coordinator
Component that runs two-phase commit on behalf
of transaction
Two – Phase Commit
REQUEST-TO-PREPARE
PREPARED*
Coordinator
Participant
COMMIT*
DONE
REQUEST-TO-PREPARE
NO
Coordinator
Participant
ABORT
DONE
TWO-PHASE COMMIT (2PC) - OK
TWO-PHASE COMMIT (2PC) - ABORT
‘G
l ob
al
A bo
rt’