What is a Distributed Database System?
Distributed Database
A logically interrelated collection of shared data (and a description of this data), physically distributed over a computer network.
Distributed DBMS
Software system that permits the management of the distributed database and makes the distribution transparent to users.
What is not a DDBS?
A timesharing computer system
A loosely or tightly coupled multiprocessor system
A database system which resides at one of the nodes of a network of computers - this is a centralized database on a network node
The Fundamental Principle of Distributed Database
To the user, a distributed system should look exactly like a nondistributed system.
A typical distributed database system:
[Figure: four sites (New York, Shanghai, London, San Francisco) connected by a communication network]
What are the 12 objectives?
1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction management
9. Hardware independence
10. Operating system independence
11. Network independence
12. DBMS independence
Types Of Distributed Databases
In a homogeneous distributed database
All sites have identical software
Sites are aware of each other and agree to cooperate in processing user requests
Each site surrenders part of its autonomy in terms of the right to change schemas or software
The system appears to the user as a single system
In a heterogeneous distributed database
Different sites may use different schemas and software
- Difference in schema is a major problem for query processing
- Difference in software is a major problem for transaction processing
Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing
Why use a DDBMS?
Advantages:
Reflects organizational structure
Improved shareability and local autonomy
Improved availability
Improved reliability
Improved performance
Economics
Modular growth
Disadvantages:
Complexity
Cost
Security
Integrity control more difficult
Lack of standards
Lack of experience
Database design more complex
Distributed Database Design
DATA FRAGMENTATION, REPLICATION, AND ALLOCATION TECHNIQUES FOR DISTRIBUTED DATABASE DESIGN
Fragmentation: Breaking up the database into logical units called fragments, which may be assigned for storage at various sites.
Data replication: The process of storing fragments at more than one site.
Data allocation: The process of assigning a particular fragment to a particular site in a distributed system.
The information concerning the data fragmentation, allocation and
replication is stored in a global directory.
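As a rough illustration of what such a directory records, the sketch below maps each fragment to the set of sites that hold a copy. The class name `GlobalDirectory`, its methods, and the fragment names are hypothetical and chosen only for this example; a real DDBMS catalog stores considerably more (fragment definitions, schemas, statistics).

```python
from dataclasses import dataclass, field


@dataclass
class GlobalDirectory:
    """Hypothetical catalog: fragment name -> set of sites holding a copy."""
    placement: dict = field(default_factory=dict)

    def allocate(self, fragment: str, site: str) -> None:
        # Allocation assigns a fragment to a site. Allocating the same
        # fragment to a second site is what makes it replicated.
        self.placement.setdefault(fragment, set()).add(site)

    def sites_for(self, fragment: str) -> set:
        # Return a copy so callers cannot mutate the directory by accident.
        return set(self.placement.get(fragment, set()))

    def is_replicated(self, fragment: str) -> bool:
        return len(self.placement.get(fragment, set())) > 1


directory = GlobalDirectory()
directory.allocate("EMP_london", "London")
directory.allocate("EMP_ny", "New York")
directory.allocate("EMP_ny", "Shanghai")  # second copy: replication
```

Here `directory.sites_for("EMP_ny")` tells a query processor where copies of that fragment can be read, which is the lookup that location and replication transparency rely on.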
12.5 Distributed Relational Database Design
Fragmentation
Four types of fragmentation:
1.
Horizontal:
Consists of a subset of the tuples of a relation.
- Defined using Selection operation
- Determined by looking at predicates used by transactions.
- Involves finding set of minimal (complete and relevant) predicates.
- Set of predicates is complete if, and only if, any two tuples in the same fragment are referenced with the same probability by any application.
- Predicate is relevant if there is at least one application that accesses fragments differently.
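Horizontal fragmentation by selection can be sketched in a few lines. The `EMPLOYEE` relation, its attributes, and the three predicates below are made up for illustration. The final check only confirms that the fragments are disjoint and that their union rebuilds the relation; it does not test the access-probability notion of completeness described above, which needs workload information.

```python
# Hypothetical EMPLOYEE relation: each tuple is a dict.
EMPLOYEE = [
    {"id": 1, "name": "Ann", "city": "London", "salary": 50000},
    {"id": 2, "name": "Bo", "city": "New York", "salary": 62000},
    {"id": 3, "name": "Chen", "city": "Shanghai", "salary": 58000},
    {"id": 4, "name": "Dee", "city": "London", "salary": 71000},
]


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# One selection predicate per fragment (assumed, not derived from a workload).
predicates = {
    "EMP_london": lambda t: t["city"] == "London",
    "EMP_ny": lambda t: t["city"] == "New York",
    "EMP_other": lambda t: t["city"] not in ("London", "New York"),
}

fragments = {name: select(EMPLOYEE, p) for name, p in predicates.items()}


def reconstruct(frags):
    """Union of the horizontal fragments, ordered by id for comparison."""
    rows = [t for frag in frags.values() for t in frag]
    return sorted(rows, key=lambda t: t["id"])
```

Because every tuple satisfies exactly one predicate, `reconstruct(fragments)` returns the original relation and no tuple is stored twice.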
2.
Vertical:
Consists of a subset of the attributes of a relation.
- Defined using Projection operation
- Determined by establishing affinity of one attribute to another.
3.
Mixed: horizontal fragment that is vertically fragmented, or a vertical fragment that is horizontally fragmented.
- Defined using Selection and Projection operations
4.
Derived: horizontal fragment that is based on horizontal fragmentation of a parent relation.
- Ensures fragments frequently joined together are at same site.
- Defined using Semijoin operation
Other possibility is no fragmentation:
- If relation is small and not updated frequently, may be better not to fragment.
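The projection and semijoin definitions can be sketched as follows. The relations `EMPLOYEE` and `DEPT_LONDON` and all helper names are hypothetical. The sketch assumes each vertical fragment repeats the key attribute `id` so the relation can be rebuilt by a join; that is a common convention, not something stated on the slide.

```python
EMPLOYEE = [
    {"id": 1, "name": "Ann", "dept": "D1", "salary": 50000},
    {"id": 2, "name": "Bo", "dept": "D2", "salary": 62000},
    {"id": 3, "name": "Chen", "dept": "D1", "salary": 58000},
]
# Assumed to be one horizontal fragment of a parent DEPT relation.
DEPT_LONDON = [{"dept": "D1", "city": "London"}]


def project(relation, attrs):
    """Projection onto attrs (no duplicate removal; the key keeps rows unique)."""
    return [{a: t[a] for a in attrs} for t in relation]


# Vertical fragmentation: both fragments carry the key "id".
emp_public = project(EMPLOYEE, ["id", "name", "dept"])
emp_pay = project(EMPLOYEE, ["id", "salary"])


def join_on_key(left, right, key):
    """Rebuild tuples by matching the shared key attribute."""
    index = {t[key]: t for t in right}
    return [{**t, **index[t[key]]} for t in left if t[key] in index]


def semijoin(child, parent, attr):
    """Keep child tuples whose attr value appears in the parent fragment."""
    keys = {p[attr] for p in parent}
    return [t for t in child if t[attr] in keys]


# Derived horizontal fragment: employees of departments in DEPT_LONDON.
emp_london = semijoin(EMPLOYEE, DEPT_LONDON, "dept")
```

Storing `emp_london` at the same site as `DEPT_LONDON` lets the join between the two run locally, which is the point of derived fragmentation.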
Data Allocation
Four alternative strategies regarding placement of data:
Centralized: single database and DBMS stored at one site with users distributed across the network.
Partitioned: database partitioned into disjoint fragments, each fragment assigned to one site.
Complete Replication: consists of maintaining a complete copy of the database at each site.
Selective Replication: combination of partitioning, replication, and centralization.
Data Allocation
DATA REPLICATION
Fully replicated database:
* Stores multiple copies of each database fragment at multiple sites
* Can be impractical due to amount of overhead
Partially replicated database:
* Stores multiple copies of some database fragments at multiple sites
* Most DDBMSs are able to handle the partially replicated database well
Unreplicated database:
* Stores each database fragment at a single site
* No duplicate database fragments
Advantages of Replication
Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
Parallelism: queries on r may be processed by several nodes in parallel.
Reduced data transfer: relation r is available locally at each site containing a replica of r.
Disadvantages of Replication
Increased cost of updates: each replica of relation r must be updated.
Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.
One solution: choose one copy as primary copy and apply concurrency control operations on primary copy.
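A minimal sketch of the primary-copy idea follows, heavily simplified. The class `PrimaryCopyItem` is hypothetical, all replicas live in one process, a `threading.Lock` stands in for the lock held at the primary site, and updates are propagated synchronously. Real systems must also handle message delays and failure of the primary, which this sketch ignores.

```python
import threading


class PrimaryCopyItem:
    """Hypothetical replicated data item using a primary-copy scheme."""

    def __init__(self, value, sites, primary):
        if primary not in sites:
            raise ValueError("primary must be one of the replica sites")
        self.primary = primary
        self.replicas = {site: value for site in sites}
        # Stand-in for the lock managed at the primary site.
        self._primary_lock = threading.Lock()

    def write(self, value):
        # Every writer serializes on the primary copy's lock, so two
        # concurrent updates can never be applied to different replicas
        # in different orders.
        with self._primary_lock:
            self.replicas[self.primary] = value
            # Assumption: synchronous propagation to the other replicas.
            for site in self.replicas:
                self.replicas[site] = value

    def read(self, site):
        # Reads are served by the local replica.
        return self.replicas[site]


item = PrimaryCopyItem(0, ["London", "New York", "Shanghai"], primary="London")
item.write(42)
```

After `item.write(42)`, a read at any of the three sites returns the same value, because concurrency control was applied at one place only.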
Transparency in a DDBMS
Transparency hides implementation details from users.
Overall objective: to the user, a DDBMS should be equivalent to a centralised DBMS
- Full transparency is not a universally accepted objective
Transparency types:
1. Distribution/Network Transparency
a. Location Transparency
b. Naming Transparency
2. Replication Transparency
3. Fragmentation Transparency
4. Design Transparency
5. Execution Transparency
Distributed DBMS Issues
Query Processing
- convert user transactions to data manipulation instructions
- optimization problem
- min{cost = data transmission + local processing}
- general formulation is NP-hard
Concurrency Control
- synchronization of concurrent accesses
- consistency and isolation of transactions' effects
- deadlock management
Reliability
- how to make the system resilient to failures
- atomicity and durability
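The query-processing objective above, min{cost = data transmission + local processing}, can be illustrated with a toy comparison of three candidate plans for a join between relations stored at different sites. The plan names, sizes, and cost weights are invented for this example. Enumerating plans like this is only feasible for tiny plan spaces, which is why the general problem is hard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    bytes_shipped: int     # data sent over the network
    tuples_processed: int  # local work at the executing sites


def cost(plan, per_byte=10, per_tuple=1):
    """cost = data transmission + local processing (weights are made up)."""
    return plan.bytes_shipped * per_byte + plan.tuples_processed * per_tuple


def cheapest(plans, **weights):
    """Pick the plan with the smallest total cost under the given weights."""
    return min(plans, key=lambda p: cost(p, **weights))


plans = [
    Plan("ship EMP to DEPT site", bytes_shipped=1_000_000, tuples_processed=10_100),
    Plan("ship DEPT to EMP site", bytes_shipped=5_000, tuples_processed=10_100),
    Plan("semijoin, then ship matches", bytes_shipped=52_000, tuples_processed=20_700),
]
best = cheapest(plans)
```

With these assumed numbers the network term dominates, so shipping the small relation wins. Changing the weights changes the winner, which is why the cost model matters as much as the search.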
Relationship Between Issues
[Figure: diagram relating Directory Management, Query Processing, Distribution Design, Reliability, Concurrency Control, and Deadlock Management]
Concurrency Control and Recovery
Distributed databases encounter a number of concurrency control and recovery problems which are not present in centralized databases. Some of them are listed below.
Dealing with multiple copies of data items
Failure of individual sites
Communication link failure
Distributed commit
Distributed deadlock
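To make the distributed commit problem concrete, here is a bare-bones sketch of two-phase commit, one well-known protocol for it. The slides do not name a protocol, so treat this choice as an assumption. The `Participant` class and `two_phase_commit` function are hypothetical, and the sketch leaves out logging, timeouts, and coordinator or site failure, which are exactly the parts that make the real protocol hard.

```python
class Participant:
    """Hypothetical site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. A real site would force a log record to stable
        # storage before voting yes.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed at every site."""
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: the same decision is sent to everyone, so the outcome is
    # all-or-nothing across sites.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

A single "no" vote is enough to abort the transaction everywhere, which is how atomicity is preserved across sites.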
System Failure Modes
Failures unique to distributed systems:
Failure of a site
Loss of messages
- Handled by network transmission control protocols such as TCP/IP
Failure of a communication link
- Handled by network protocols, by routing messages via alternative links
Network partition
- A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them
- Note: a subsystem may consist of a single node
Network partitioning and site failures are generally indistinguishable.
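The definition of a network partition can be expressed as a connected-components computation over the links that are still working. The function `partitions` and the site names are illustrative. The sketch assumes a global view of all links, which no single site has in practice; that missing view is why a site cannot tell a partition from a remote site failure.

```python
from collections import deque


def partitions(sites, links):
    """Group sites into connected subsystems given the working links."""
    neighbours = {s: set() for s in sites}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in sites:
        if start in seen:
            continue
        # Breadth-first search collects every site reachable from start.
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            site = queue.popleft()
            group.add(site)
            for nxt in neighbours[site] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(group)
    return groups


SITES = ["New York", "Shanghai", "London", "San Francisco"]
```

The network is partitioned whenever `partitions(SITES, links)` returns more than one group, and a group may contain a single site.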
Client-Server Database Architecture
It consists of clients running client software, a set of servers which provide all database functionalities, and a reliable communication infrastructure.
[Figure: clients (Client 1 ... Client n) and servers (Server 1 ... Server n) connected by a communication infrastructure]
Conclusion
Today's business environment has an increasing need for distributed database and client/server applications as the desire for reliable, scalable, and accessible information steadily rises. Distributed database systems improve communication and data processing by distributing data across different network sites. Not only is data access faster, but a single point of failure is less likely to occur, and users retain local control of their data. A distributed database allows faster local queries and can reduce network traffic. These benefits come with added complexity in managing and controlling the system, and with the problem of maintaining data integrity.
A single large server can hardly meet the requirements of high availability, data warehousing, and fast data storage simultaneously. A distributed database satisfies them by separating these functions at low cost. Grid computing is becoming mainstream in information technology, and we expect the database grid, not only the computational grid, to be a key technology in the future.
THANK YOU