1
Introduction
to Distributed
Database
Indra Maryati, S.Kom.
maryati@stts.edu
Agenda
» Distributed data processing
» What is a distributed database system
» Promises of distributed database
systems
2
DDBS = Database + Network
» The technology of computer networks,
promotes a mode of work that goes against all
centralization efforts and facilitates distributed
computing
» Distributed database system technology is the
union of what appear to be diametrically
opposed approaches to data processing:
Database System, Computer Network
technologies
» A database system aims at integrating the
operational data of an enterprise, and to provide
a centralized and controlled access to that data 3
Distributed Computing System
» A distributed computing system consists
of
˃ a number of autonomous processing
elements (not necessarily homogeneous)
˃ interconnected by a computer network
˃ cooperate in performing their assigned tasks
» What is distributed?
All these are necessary
˃ Processing Logic and important for
˃ Function distributed database
technology
˃ Data 4
˃ Control
What is a True Distributed System?
» “A system that runs on a collection of
machines that do not have shared
memory, yet looks to its users like a
single computer”
5
Computer Networks
» Computers (hosts or sites) which are capable
of performing autonomous work are
connected by a communication network
» The communication network constitutes of
communication links of various kinds
(telephone lines, coaxial cables, satellite links,
etc.) and computers dedicated to
communication function (not hosts)
» A communication network facilitates a
process running at one site to send a message
to a process running at any other site of the 6
network (point to point)
Computer Networks
» A communication network facilitates any two sites
of the network to communicate independent of
its structure
» Parameters of communication network that are
relevant for distributed DBMSs
˃ Delay : the amount of time taken for a message to
reach its destination from the time it was sent from its
source
˃ Cost : consists of a fixed cost plus a variable cost
depending on the length of the message
˃ Reliability : of the network; i.e., the probability that a
message is correctly delivered at its destination
7
» Broadcast is the ability of a site to send the same
message to processes at all other sites.
Network Topology - Mesh
» Fully connected
» High cost, direct communication
between sites and high reliability
8
Network Topology - Irregular
» Partially connected
» Lower cost, not all sites directly
connected, less reliable than fully
connected
9
Network Topology - Tree
» Hierarchical
» Sites organized as a tree with
each node having an unique
parent and some children
» Lower cost than partially
connected topology
» Inter-node communication
through ancestor nodes
» If a parent fails, further
communication among
children is not possible
» Failure of any node or link
partitions the network into 10
disjoint subtrees
Network Topology - Star
» One site (central site)
connected to all the sites
» Cost is linear in the
number of sites
» All communication
through the central site
» Reliability depends on
the central site, since if it
fails, the network is
completely partitioned
11
Network Technology - Ring
» Each site is connected to
exactly two sites (neighbours)
» The ring can be unidirectional
or bidirectional
» Cost is linear in the number
of sites
» Communication cost can be
high
» In a unidirectional ring,
failure of one site or a link
partitions the network
» A bidirectional ring can 12
withstand two link failures
Network Topology – Multiaccess Bus
» A single shared link (the
bus) accessed by all the
sites
» Sites communicate with
each other through the
link
» Cost is linear in the
number of sites
» Communication cost is
low
» If the link fails, the
network is completely 13
partitioned
Distributed System Communications
» A distributed system relies on
communication, generally involving:
˃ Message Passing
+ When a client want to communicate with a server it
sends a message. The server replies with a response
message passing mechanism may be:
˃ Remote Procedure Calls
+ A reliable send followed by a get is very like a
procedure call A remote procedure call can be
provided by including a stub with the server and the
client The stub is responsible for dealing with the
communication between systems The use of the
stubs makes the client call just a local procedure call 14
and the server routine is called locally
Local Area Networks
» LANs usually cover a small geographical
area like a building or a few adjacent
buildings
» The communication links tend to be
high-speed (Gigabytes/sec) and a low
error rate
» Links are usually twisted pair, coaxial
cable or fiber optic
» Messages are usually sent in packets
15
Wide Area Networks
» WAN connects sites
that are physically
distributed over a wide
geographical area
» Typical communication
links are telephone
lines, microwave links
and satellite links
» WANs are generally
slower than LANs and 16
are more error prone
DS: Design Issues
» Must be transparent
» Provide flexibility
» Be reliable
˃ Design should not require the simultaneous functioning of a
substantial number of critical components
˃ More redundancy greater availability and greater inconsistency
˃ Fault tolerance, the ability to mask failures from the user
» Good performance
˃ The rest are useless without this
˃ Hard to measure
˃ Balance number of messages and grain size of distributed
computations
» Scalable
˃ A maxim for developing distributed systems
˃ Avoid centralized components, tables and algorithms 17
˃ Only decentralized algorithms should be used
Agenda
» Distributed data processing
» What is a distributed database system
» Promises of distributed database
systems
18
Distributed Database Systems
» A distributed database is a collection of
multiple, logically interrelated databases
distributed over a computer network;
stores data on multiple computers
(nodes) over the network and permits
access from any node to the joint data
» A distributed database management
system (DDBMS) is a software system
that permits the management of the
distributed databases and makes the 19
distribution transparent to the users.
Reasons for Data Distribution
» Several factors have led to the
development of DDBS:
˃ Distributed nature of some database
applications
˃ Increased reliability and availability
˃ Allowing data sharing while maintaining
some measure of local control
˃ Improved performance
20
Distributed DBMS Environment
21
What is not a Distributed Database
System?
» A DDBS is not a “collection of files'' that
can be individually stored at each node
of a computer network
˃ files are not logically related
˃ no access via common interface
22
Centralized DBMS on a Network
» data resides only at one node
» the database management is no
different from centralized DBMS
» remote processing, single server-
multiple clients
23
Distributed Database System
Technology
» The key is integration, not centralization
» Distributed database technology
attempts to achieve integration without
centralization
24
Example
» Multinational manufacturing company:
˃ head quarters in New York
˃ manufacturing plants in Chicago and
Montreal
˃ warehouses in Phoenix and Edmonton
˃ R&D facilities in San Francisco
» Data and Information:
˃ employee records (working location)
˃ projects (R&D)
˃ engineering data (manufacturing plants, R&D)
25
˃ inventory (manufacturing, warehouse)
Agenda
» Distributed data processing
» What is a distributed database system
» Promises of distributed database
systems
26
Promises of Distributed DBMS
» transparent management of
distributed, fragmented, and replicated
data
» improved reliability and availability
through distributed transactions
» improved performance
» higher system extendibility
27
Transparency
» Transparency refers to separation of the
higher-level semantics of a system from
lower-level implementation details.
» From data independence in centralized
DBMS to fragmentation transparency in
DDBMS.
» Issues
˃ Who should provide transparency?
˃ What is the state of the art in the industry?
28
Improved Reliability
» Distributed DBMS can use replicated
components to eliminate single point
failure.
» The users can still access part of the
distributed database with “proper care”
even though some of the data is
unreachable.
» Distributed transactions facilitate
maintenance of consistent database
state even when failures occur. 29
Improved Performance
» Since each site handles only a portion of
a database, the contention for CPU and
I/O resources is not that severe. Data
localization reduces communication
overheads.
30
Easier System Expansion
» Ability to add new sites, data, and users
over time without major restructuring.
» New applications (such as, supply chain)
are naturally distributed - centralized
systems will just not work.
31
Disadvantages of DDBSs
» Lack of Experience
˃ No operating true distributed database systems in existence
» Complexity
˃ DDBS problems are inherently more complex than
centralized DBMS ones
» Cost
˃ More hardware, software and people costs
» Distribution of control
˃ Problems of synchronization and coordination to maintain
data consistency
» Security
˃ Database security + network security
» Difficult to convert 32
˃ No tools to convert centralized DBMSs to DDBSs