0% found this document useful (0 votes)
37 views

Distributed Database Design Methodologies

This document discusses different methodologies for designing distributed databases, including top-down and bottom-up approaches. It also describes the design phases involved, including conceptual design, logical design, physical design, and distribution design, which maps the global schema to subschemas at each site.

Uploaded by

sami khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
37 views

Distributed Database Design Methodologies

This document discusses different methodologies for designing distributed databases, including top-down and bottom-up approaches. It also describes the design phases involved, including conceptual design, logical design, physical design, and distribution design, which maps the global schema to subschemas at each site.

Uploaded by

sami khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

Distributed Database Design Methodologies

STEFAN0CERI, BARBARA PERNICI, AND G I 0 WIEDERHOLD, MEMBER, IEEE

Invited Paper

This paper surveys methodological approaches for distributed of datadistribution,and the system takesthe responsibility
database design. The design of distribution can be performed top- of distributing access operations to the databases at dif-
down or bottom-up; the first approach is typical of a distributed
database developed from scratch, while the second approach is ferent sites. However, the actual distribution ofdata affects
typical of the development of a multidatabase as the aggregation the overall performance ofthe system: time and cost
of existing databases. We review the design problems and metho- required for accessing multiple data objects differ greatly
dologies along both directions, and we describe DATAID-D, a top- depending on whether alldataobjects are stored at the same
down methodology for distribution design. We indicate how the
site or are spread over multiple sites. Replication of data
methodology is part of a global approachto database design; how
to collect the requirements about the distribution of data and affects the overall reliabilityand availability of thesystem,
applications; and how toprogressively build the distributionof a because several copies of the same information become
schema. Our approachis exemplified through onecase study. availablewith independent failure modes. Atthe same time,
data replication affects performance, due to the require-
I. INTRODUCTION ment of maintaining the consistency of copies. Thus the
Due to demand for system availabilityand autonomy, and database designer must carefully considerthe distribution
enabled by advances in database and communication tech- of data, even when the system supports a high degree of
nology, distributed database systems are becoming wide- transparency.
spread. Many database management systems now support One key principle in distribution design is to achieve
extensions for distribution, and at the same time structur- maximum locality of data and applications. While distrib-
ally distributed transactionmanagement systems are avail- uted databases enable more sophisticated communication
able on the market [Ill, [19].The designers of distributed between sites, the major motivation for developing a dis-
database applications are now facing a new and relevant tributeddatabaseistoreducecommunication byallocating
problem: how to distribute the data and programs on dif- dataascloseaspossibletotheapplicationswhichusethem.
ferent computersto obtain the intended performance, reli- Thus in a well-designed distributed database "90 percent
ability, and availability. Distribution design is thus emerg- of the data should be found at the local site, and only 10
ing as a newproblem area of databasedesign, which percent of the data should be accessed on a remote site"
requires its own theory, design methodologies, and sup [41]. However, it rarelyoccursthat dataand applications can
port tools. be cleanly partitioned and assignedto a particular site; more
The degree ofsophisticationofdistributed database often, the designer is faced by tradeoffs because several
management systems(DBMS) is often measured by the applications need to access the same data from different
degree of distributiontransparency provided to theusers. locations. In this case, the most effective designis the one
In an ideal situation, the user does not need to be aware which ensures locality to the largest number of applica-
tions. A basic assumption of this paper is that thedatabase
designer is able to predict both the logical properties which
Manuscript received November 26, 1985; revised October 30, characterize the "locality" of data with respect to appli-
1986. This research was partially conducted within theKBMS Proj- cations and the quantitative information measuring the
ect at Stanford University, which is supported by DARPA under
Contract N39-84-C-11. Itdraws heavily on research performed "load" of applications, in terms of frequency of execution
under the DATAID Project in Italy, supported by the CNR within requests at each site. The difficulty of designing the dis-
the ProgettoFinalizzatolnformatica. Theconclusionsdrawn tribution of a database is, in fact, due to the interference
remain the responsibility of the authors. of logical and quantitative considerations.
S. Ceri i s with the Dipartimento di Matematica, Universiti di
Modena, Modena, Italy.
B. Pernici is with CSISEI-CNR, DipartimentodiElettronica, A. Top-Down and Bottom-Up Approaches to Distribution
Politecnico di Milano, 20133 Milano, Italy. Design
G. Wiederhold i s with the Computer Science Department, Stan-
ford University, Stanford, CA 94305, USA. Distribution design can be performed top-down or bot-
I E E E Log Number 8714302. tom-up. The former approach is typical of distributeddata-

001&9219/87/05~533501.00@ 1987 IEEE

PROCEEDINGS OF THEIEEE, VOL. 75, NO. 5, MAY 1987 533


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
bases developed from scratch, while thelatter approachis Local Designtransforms the integratedconceptual
typical ofthe development of multi-database systems as an schema into adatabase schemaof a given DBMS type (Rela-
aggregation of existing databases. tional, Network,or Hierarchical). The choice ofDBMS type
The top-downapproach assumes that the designer will be affected by the requirements of the conceptual
understands the requirements of a database application model as well as by pragmatic considerations.
from the user, and transforms them into formal specifi- Physical Design is performed according to the capabili-
cations. During this process, the designer performs con- ties and features of the particular DBMS chosen, and pro-
ceptual, logical, and physical design phases, which pro- duces the definition of physical access structures which
gressively refinehigh-level,system-independent spec- implement the database.
ifications ofthe database into low-level, system-dependent The design of distribution adds to the above phases an
specifications. During conceptual design, the designer is additional one, called Distribution Design, which assumes
expected to ignore any detail concerning thephysical as input a global, site-independent schema and produces
implementation (in particular, data distribution). The result as a resultthe subschemas for each site of the distributed
is aglobal databaseschema which incorporates, at an database. In principle, distribution design can be applied
abstract level, all the data elements of thedatabase and the to any of the global conceptual, logical, or physical sche-
patterns of their use. A design phase specific to distributed mas. This choice is subject to the following tradeoff:
databases, calleddistribution design,maps the global
Details aboutimplementation should be decided only
schema to several, possibly overlappingsubschemas, each
when thedatabase distribution is given, to allow con-
one representing the subset of information whichis asso-
centrating on the physical design of each local data-
ciated with one site. Then the design of each individual
base independently. Independent physical design is
database is completed.
mandatory if the site DBMSs are heterogeneous.
The bottom-up approach assumes, instead, that a spec-
On the otherhand, a precise description of data and
ification of thedatabases at each site exists already, either
operations helpsin estimating the performance of var-
because there are existing databases that have to be inter-
ious distributions.
connected to form amulti-database (or federated)system
([la], [32]), or because the conceptual specification of the This tradeoff suggests that the distribution
design should
databases has been done for each site independently. In be performed at the beginningof the logical designphase.
either case, the site specifications have to be integrated in At that time, data and operations are described precisely
order to generate a global specification. and the first implementation problems are considered.
While top-down and bottom-up approaches appear to
represent two extremes, in many practical cases the A. Problem Classification
designer proceeds partially bottom-up and partially t o p
down. We will present the twoapproaches in separate sec- Database distribution requires determiningthe frag-
tions and then discuss the interaction between them. mentation and allocationof data.
Fragmentation is the process of subdividing a global
B. Structure of the Paper object (entity or relation) into several pieces, called frag-
ments.
In Section I I , we illustrate thetopdown approach to dis- Allocation is the process of mapping each fragment to
tribution design. We indicate how distribution is incor- one or more sites.
porated into centralized database design methodologies, Fragments must be appropriate units of allocation.Thus
then we classify the design problems; we also review pre- the database designers should define fragments as homo-
vious work on thetop-down approach. geneous collections of information from the viewpointof
In Section I l l we describe DATAID-D, a methodologyfor transaction access [IO], i.e., such that all instances of frag-
the topdown design of data distribution. ments are uniformly accessed by the transactions.
In Section lVwe illustrate the bottom-upapproach to dis- Two types of fragmentations are possible:
tribution design. We classify the problems, present some Horizontal fragments consist of subsets of the instances
solutions, and review previous work on the bottom-up (or tuples) of a global object. Each fragment is associated
approach. We also discuss how topdown and bottom-up with a predicate (called qualification) which indicates the
approaches relate to each other. distinguishingproperty possessed by the instances or
tuples of that fragment. It is always possible to reduce an
11. THE TOP-DOWN
APPROACH overlapping horizontal fragmentationto a nonoverlapping
A general method for designing centralized databases fragmentation by redefining horizontal fragments in an
includes four phases: requirements analysis, conceptual appropriate way; therefore, we can assumewithout loss of
design, logical design, and physical design[8], [MI,
[43], [47]. generality that horizontal fragmentsare disjoint.
RequirementsAnalysis dealswith the collection of users' Vertical fragments are derived by projectingglobal
unstructured specificationsof thedatabaseapplication, and objects over subsets of their attributes. In order forthe pro-
produces an unambiguous definition and classification of jection to be lossless [MI, it is required that each fragment
theelements to beconsidered in thedesign thedatabase.
of include a key attribute of the global object, orat least an
The information is collected in a design data dictionary. internal tuple identifier.
Conceptual Design, sometimes further divided in View Mixed fragmentations can be built by alternating hori-
Design and View Integration, produces a conceptual spec- zontal and vertical fragmentations.
ification ofa global, integrateddatabase schemaand ofthe The rationale of horizontal fragmentation is to produce
applications that are performed on it. fragmentswith themaximum potential localitywith respect

534 PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5, MAY 1987


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
to operations, i.e., such that each fragment is located where design” has centered on both thefragmentation and allo-
it is mostly used. cation problems, and their interactions.
The possibility of not partitioningan entity at all should 7) File Allocation Problem: The file allocation problem
beconsidered. In particular, ifthe benefit observed byfrag- was originally investigated by Chu [16], who presented an
menting an entity is weak, the overhead due to horizontal integer programming formulation of the problema fixed for
partitioning might not be compensated by the advantage number of copies of the file.Casey [61 addressed similarly
of local processing. the problem witha variable degree of replication of each
The rationale ofvertical fragmentation is to cluster attri- file. Eswaran [23] provedthat Casey’s formulation is
butes frequently used together. An ideal vertical fragmen- NP-complete.
tation exists when each application uses just onesubset ,of The file allocation problem was generalized by Mah-
attributes; otherwise, some applications will be harmed, moud and Riordon [31] to incorporate the determination of
since they will need to access both fragments. In this gen- channel capacity, and by Morgan and Levin [35] to incor-
eral situation, one has to balance potential benefits(due to porateprogram allocation. Fisher and Hochbaum [24]
the possibility of placing each fragment close to theappli- improved Morgan and Levin‘s optimizationalgorithm.
cations which mostly use it) against potential disadvantages Finally, lrani and Khabbaz [281 studied the file allocation
(due to the same applications accessing two fragments). problem for distributed supercomputer systems.
Inthe fragmentation process, the designer applies Reference [20] provides an excellent survey of several file
repeatedly horizontal and vertical fragmentation to each allocation methods.
object until appropriate fragments are obtained. 2) Distributed Database Design: Research in distributed
The allocation offragments can be either nonredundant database design typically assumes the relational modelof
or redundant: data; this choiceis appropriate because links among remote
files are not too useful: remote links are not as beneficial
in anonredundantallocation, each fragment is mapped aslocalIinks;itisaheavytasktoutiIizeandmaintainremote
to exactly one site; links which violate site autonomy.
in a redundant allocation, each fragment is mapped to Chang and Cheng[I41 indicated theway in whicha rela-
one or more sites. tion can be fragmented by giving a methodology for rela-
tion decomposition intofragments. Note that they used the
With a redundant allocation, the designer must decide
terms horizontal and vertical for describing fragmentation
the degree of replication of each fragment. The benefit of
exactly in theopposite way as they are used in all the sub-
replication grows withthe ratiobetween retrieval and
sequent literature on fragmentation (and in this paper).
updates because maintaining theconsistency of the data-
The theory of horizontal fragmentation was studied by
base requires distributing the updates to all the copies.
Maier and Ullman [33], Paredaens and De Bra [39], and Ceri
However, the system may permit temporary inconsisten- and Pelagatti [V. Dependency-preserving horizontal de-
cies, and in this case replication becomes moreconvenient. compositions are discussed in [25].
The benefit of replication decreases with the increase of
The problem of determiningan optimal horizontalfrag-
storage costs, since replicated copies require morespace.
mentation for a distributed database has been investigated
Moreover, replication increases resiliency from failures,
by Ceri, Navathe, and Wiederhold [9]: the formulation of
since
the nonredundant allocation problemis given by a linear
the independent loss of several copies of the same integer program, and heuristics are presented for decom-
information is unlikely; posing theprobleminto smaller subproblems and for
applications can access alternative copies when some determining a redundant allocation starting from the opti-
failure affects the copies usually accessed. mal nonredundant allocation.
Vertical fragmentation was addressed forcentralized
Given the above classification of problems, the top-down databases by Hofferand Severance[27l and by Hammerand
approach to distributiondesign consists of solving for each Niamir [26]; an application of vertical fragmentationto sys-
global object the followingdesign problems: tems with memory hierarchies (i.e., with a “fast” and a
“slow” memory) is presented by Eisner and Severance[21].
horizontal partitioning (H)
March and Scudder have discussed how vertical fragmen-
vertical partitioning (V)
tation can be useful for the recovery of databases in [MI.
nonredundant allocation(A)
Navathe,Ceri, Wiederhold, and Dou have studied the
redundant allocation (R).
application of vertical fragmentation to distributed data-
Whiletheseproblemsareclearlyrelated,mostofthework bases in [37l.
intheliterature has attacked these problemsindepen- In [42],the problem of distributing
thedatabaseon aclus-
dently. ter of processors is considered. The motivation ofsuch an
architecture is high performance, availability, and reliabil-
ity. The problem deviates from the standard distribution
B. Previous Work on Top-Down Distribution Design
design because applications also need to be allocated to
The early work performed on the problem of data dis- processors; thus an object to be allocated can be eitheran
tributionwasaddressedtothefileallocationproblem;afile application or a fragment. The model considers capacity
was considered to be the “unit of allocation,” thus disre- constraints for the CPUs, processors’ input-output, and
garding the fragmentation problem. A knowledgeable user network communications.
could, ofcourse, present fragments as files to this problem Apers addressed in his thesis the design and allocation
definition. More recently, work in “distributed database of both horizontal and vertical fragments of relations [I],

CERl et a/.: DISTRIBUTEDDATABASEDESIGNMETHODOLOGIES 535


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
[2]. In this work, design and allocationare based on query ER model [8], which also includes generalization hierar-
processing strategies, treating nonredundant and redun- chies andn-aryrelationships. The firstpartof Logical
dant allocation of fragments in a uniform way. Design, Global Logical Design, transforms EER schemata
into simplified ER schemata and is performed before dis-
Ill. THE DATAID-D METHODOLOGY tribution design. The final part of Logical Design, which
transforms an ER schema into eithera relationalor a Coda-
This section reviews the DATAID-D methodology for dis- syl schema, is deferred until after distribution design, and
tributed database design, developed at the Politecnico di is performed independently for each site.
Milano [12]. The emphasis in DATAID-D is on providing a In this paper, we concentrate on distribution; thus we
methodological framework to thedesigner, indicating the deliberately omit all discussion of requirements analysis,
relevant design problems, which parameters are required conceptual design, and global logical design (which are
to solveeach ofthem,and how each problem can be performed before distribution design), or of local logical
approached. designandphysicaldesign (which are performed sepa-
Several tools for distribution design have been devel- rately for each site after distribution design). In ourexam-
oped, providing facilities for documentation and forauto- ple, we will assume that the results of phases preceding
matic selection among design alternatives in some sub- distribution design are all available. Refer to [8] for those
problems. Design tools for deciding about the horizontal aspects of the DATAID-I methodologywhich are not related
and vertical fragmentation of relations are presented in [q, to distribution.
a design support envi-
[9], [37, [42]; their integration within Obviously, our clean separation of the design process
ronment is discussed in [13]. All these tools can be used into phases is an idealization introduced here for simpli-
consistentlywith the DATAID-D approach. In paper, this we fying the presentation and also for abstractingthe decision
stress the methodological approach of DATAID-D and we process. The actual course of the design of a distributed
deliberately omit any discussion about tools; in this way, database will require several feedbacks between phases.
the results presented here can be immediately applied to
design problems,even if thedesign team has no toolsavail-
able. Decision processes for selecting among design alter-
natives are basedon simple heuristics rather than complex Requirements
I
mathematical formulations; and thereforecan be handled
manually. We refer to [I31 for a discussion about
port to distributiondesign.
tool s u p
REWIREMENTS
ANALVSIS

I
I
DATAID-D is built as an extension ofthe DATAID-1 meth- Design Data Dictionary
odology for centralized database design, which is divided

r-l
into 4 phases [8]:
CONCEPTUAL
requirements analysis DESIGN
conceptualdesign I I
logicaldesign Global
Operatim
Global
Data
physical design. Schemata Schema

DATAID-D requires the addition of two phases to the


I t
I - 4
I

original DATAID-I architecture (see Fig. 1):


Analysis of Distribution Requirements: This phase is
required in order to collect informationabout distribution,
such as partitioning predicates for horizontal fragmenta- I
GLOBAL
LOGICAL
DESIGN

1
I 77-
ANALVSIS (
DISTRIWTIOI
REQUIREMENT!

Logical
Simplified
tions and the frequency of activation of each application Global S c h m Access
Tables
from each site. Since the structure ofdata and applications I I

must be known inorder to collect information about their


DISTRIBUTION
distribution, distribution requirements are collected start- i

ing fromsome of theresults of theconceptual design phase.


Distribution Design: This phase starts from the speci-
Logical Logical
fication of the global database schema and from the col- Schernata Access
Tables
at each site at each site
lected distribution requirements, and produces several
databaseschemata,oneforeachsiteofthedistributeddata-
base, each one describing the portion of data that will be
allocated to that site.
M LOCAL
DESIOI
LOGICAL
LOGICAL
OESING
The input for DistributionDesign isa specification dataof 1

and applications produced by the previous design phase, Local Logical Schemata
(Relational or Codasyl)
in the formof global database schema and logical access
tables. Global database schemas are described using sim-
a LOCALPHYSICAL
ple version of the ER model [E], which includes just binary DESIGN
relationships; logicalaccess tables indicate the type ofdata-
base access operations performed by each application on
each entity. In DATAID-I,conceptual modeling is per-
formed on a more sophisticated model, called the Extended

536 PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5, M A Y 1987


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
A. Analysis of Distribution Requirements Distribution design in DATAID-D is subdivided into four
phases:
The goal of analyzingdistribution requirements is to col-
lect all the information that will later be used to drive dis- fragmentationdesign
tribution designs. Collecting requirements is adifficult pro- nonredundantallocation
cess which entails communicatingwith users to understand redundantallocation
their needs; however, in this paperwe just stress the results reconstruction of local schemata.
which should be produced, Le., thetypeof informationthat Fragmentation design applieshorizontalandvertical
should be collected. fragmentation to entities in order to determine possible
Inputs to this phase are users’ requirements for distri- units of allocation forthe subsequent designphases. Each
bution and the global data and operation schemata. Three fragment, to be a good unit of allocation, must contain
types of tables are built as output of this phase: a frequency instances which are accessedroughly in the same way (i.e.,
table for the applications, partitioning tables for entities, with the same frequency) by each application executed at
and polarization tables which relate data and applications. each site. This uniform behavior allows considering frag-
The frequency table gives the number of activations of ments as uniform units of allocation, during the subse-
each application at each site. We assume that all applica-
quent allocation phases. Obviously, a too strict application
tions are potentially executable at all sites; clearly, when of this criterion would result in fragmentation down to the
one application is never executed at one site, then thefre- level of an individual attribute oran individual tuple; there
quency entry in the corresponding positionis zero. exists a threshold beyond which further fragmentation is
The partitioning tables indicate potential horizontal par-
not feasible. Fragmentation designis mostly a logical deci-
titioning criteria whichapply to each entity in theschema.
sion, which is performed by selecting some of the predi-
In practice, each partitioning criterionindicates a potential
cates from polarization tables and composing them into
reason for introducing horizontal fragmentation, and is
logically defined fragments.
induced by one or more applications which access a given
Nonredundant allocationis performed by mapping each
subset of data at a given location.
fragment to the site where it is mostly used. The potential
Two types of paritioning are defined:
frequency of use of each fragment at each site is obtained
Primaryparritioning, wherethe partitioning ofan entity
as the summation over all transactionsissued from that site
E is expressed using several disjoint predicateson the attri- of the product between polarizations and frequencies of
butes of E. Let p be a generic predicateof some attribute
use of the fragment. It is thus possible to identify the site
of€;then,abinarypartitioningofEisgiven bythefragments
which accesses the fragment most frequently, and hence
F1 and F2, such thatp holdsfortuplesofF1 and (notp) holds
allocate the fragment to that site.
for tuples of F 2 .
A quantitative measure of the number of accesses to a
Derivedpartitioning, wherethe partitioning ofan entity
given fragment from a given site is trivially obtained from
E2 is determined by a binary relationship (of the ER model)
frequency and polarization tables. Let
between €2 and another entity E l which is already parti-
tioned. Let €1 be an entityfragmented intoF11 and F12 and Cj frequency of application i from site j
let E2 be a second entity connectedto El by the binary rela- pj,k polarization of fragmentk for application ifrom site
tionship R. Then, the derived partitioning of €2 is defined i.
as follows:
Then the number of accesses to fragment k from site j is
F21 is the set of tuples related to tuples of F11 by R;
given by
F22 is the set of tuples related to tuples of F12 by R.
We also constrain F21 and F22 to be disjoint: hence, not nkj = ,, ,Z fij pilk.
1.ruresk
all binary relations R can be used to derive a partitioning;
in order for this constraint to hold, it is sufficient (but not Therefore, fragment k is allocated to the site such that 7
necessary)that R be a 1: 1or 1:n relationship from E l to €2.
nkj = maxalli nkj.
The polarization table indicates, on a quantitative basis,
how the partitions influence the locality of processing of Redundantallocation is performed by selectingaddi-
applications. A polarizationvalue indicates the probability tional sites for each fragment with respect to the initial non-
that a given fragment is accessed by a given application redundant allocation, using “greedy” heuristics. At each
issued by a givensite; we indicate only those values which iteration, the most beneficial site for storing a copy of the
deviate from a uniform distribution. fragment (ifany) is added to the set of allocations.However,
the benefit in retrieval accesses must outbalancethe major
costs and complexitydue to updating of redundantcopies.
B. Distribution Design
Theabove benefitisdifficulttoquantify, sincemeasuring
The goal of Distribution Design is to allocate data at sites, the complexity of managing redundant copies is very dif-
starting from theglobal data schema, logical access tables, ficult. However, it is possible to evaluate the difference
and the distribution requirements. between the numberof retrievalaccesses that become local
The output of the Distribution Design phase is a logical due to additionof one copy versus the numberof addtional
schema and logical access tables for each site. These are remote updateaccesses required to maintain thatcopy. This
used during the following local logical design phase and number should be largely positive to justify theincrease of
during physical designphases performed independentlyat redundancy. The difference is evaluated for an object kand
each site. a potential new copy stored at site j as follows:

CERl et a/.: DISTRIBUTED


Authorized DATABASE
licensed use limited DESIGNFederal
to: Universidad METHODOLOGIES
de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply. 537
Examples of operation schemata for 3 applications are
shown in Fig. 3.
The Make-Reservation application (Fig. 3(a)) is activated
The reconstruction of local schemata builds local sche- every time that a new passenger (i.e., one not holding
mata from the final fragment allocation; thisphase is also another reservation)wants to get a reservation on one flight.
responsible for the allocation of relationships of the ER Inthiscase,theaccesstothedatabaseisthroughDeparture
global schema.Assumingthatmost relationships are imple- and Arrival airport, Departure and Arrival time, and the
mented as associations among the identifiers of the cor- flight’s Date. These attributes are marked in the diagram
responding entities,the DATAID-Dmethodology suggests with a “K,” which indicates that they are used as keys in
placing relationshipsat the site of the entities or fragments accessing data. Arrows indicate that the access proceeds
with thelargest cardinality, so that few entity identifiers will from the Airport entities to theFlight entityvia the two rela-
have to be transmitted. tions From and To. Numbers in theleft and right lower cor-
ners of entities indicate, respectively,the total number of
instances and the average number of instances selected by
C.Case Study: Airline Reservation System the application. Once a flight has been identified, a new
1) Case Study Description:In this section, we present an instance of the Passenger entity is created, as well as an
example of the application of DATAID-D methodologyto instance of the relationship Reservation; data about the pas-
an airline reservation system. The airline maintains adata- senger’s Name, Telephone, and Class (corresponding to
base distributed over three sites, i.e., airports l, 2, and 3, the ticket’s fare) are also written ( “ W ’ ) in the database.
eachofwhichisatthecenterofageographica1area;allother NoticethattheAvaiIableSeatsattributeisfirstreadandthen
airports which are serviced by the company are in one of written (“0,W ” ; 0 indicates output).
these areas. To help in visualizing the system, think of a A similarapplication,called Continue-Reservations,
company operating in the U.S., with 1 = Denver, CO, allows taking reservations for passengers whose data have
locatedintheWest,2= NewYork,NY,locatedintheNorth, already been collected and stored into thePassenger entity;
and 3 = Atlanta, GA, located in theSouth. A database stores we do not show this application.
dataabout airport regulations, flight schedules, flight avail- The Check-ln application is executed whenever a pas-
abilities, and passenger reservations. senger actually boards a plane: based on the passenger’s
We do notpresent here the conceptual and logical design Name and on the Number and Date of the flight,the
phases; instead, we present their results: the Global Data involved Passenger and Flight instances are detected ( “ K ’
Schema and the Global Operation Schemata. A Global Data attributes). Then, the Class information is retrieved (“O”),
Schema of the database using the ER model [I51 is shown and based on this informationand on theflight’s SeatMap,
inFig.2.Weassumethateachflightisdirectfromthedepar- a SeatNumber is assigned to thepassenger. Both SeatMap
ture airport to the arrival airport, without intermediate and SeatNumber attributes are written, together with the
stops. All entities, attributes, and relationships are self- number of CheckedBags by the passenger.
explanatory; we have kept ourexample as small and simple The Airport-Departures applicationbuilds report
a
as possible, though itis sufficiently complexto show some describing the departure information for the next 30flights
of the problems which arise in distributiondesign. departing from theairport, to be displayedon TV monitors.
Distribution design also requires the collection knowl-of The Airport symbol and present Date and Time are usedto
edge about the most relevant applications that are per- identify the involved Airport and Flight entities. For each
formed on data. The 80120 heuristic assures usthat we will flight, the Number, Depa-rture-Time,Gate, Delay, and des-
obtain 80 percent of theaccesses by analyzing the most fre- tination airport Symbol and City are extracted from the
quent20percentoftheappIications.DATAID-I collectsthis database. Information about the destinationairport is
information intoGlobal Operation Schemata, which show determined using the To relationship.
the use of entities and relationships for each application. Notice that the last operation schema is linearized, fol-

p GATE TELEPHONE
NAME

SECURITY RULES

Fig. 2. Global Data Schema of the airline reservation database.

538 PROCEEDINGS OF THEIEEE, VOt. 75, NO. 5, MAY 1987


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
NUMBER
( K)
-
FLIGHT
- CLASS ( a )
NAME (K)

-PASSENGER
20000 1 1000000 1
DATE
(K)
d
SEATMAP
(0.w)
SEAT NUMBER CHECKED
RAGS
(w) (w)

(b)

GATE [ a )
(0) NUMBER
FLIGHT DELAY ( a ) AIRPORT SYMBOL ( a )
(k) DATE
CITY [a)
__f

€3 AIRPORT

SYMsOL (K)

(C)

Fig. 3. Global Operation Schemata of the airline reservation database. (a) Make-Reser-
vation. (b) Check-In. (c) Airport Departures.

lowingthe application’s access path through data. As a more relationship was used for accessing the entity, while the
general consideration, notice thatoperation schemata are row denotedby A N (accesses number) gives the total num-
“procedural,” i.e., they indicate precisely the access path ber of instances involved in the operation. For a more
of an application through thedatabase; however, they are detailed description, see [8].
built at a logical level, without postulating theexistence of 2) Analysis of Distribution Requirements in the Airline
access methods. Subsequent design phases (i.e., local log- Reservation System: In this section we examine how dis-
ical design and physical design) will decide which access tribution requirements are represented by the DATAID-D
methods haveto beprovided, based on theseschemataand methodology after the analysis ofdistributionrequire-
on quantitative informationabouttheapplications’load [8]. ments phase.
Finally, quantitative data about applications aresum- The frequency fable illustrated in Fig. 4(b) showsthe fre-
marized for each entity, building the logical access tables. quencyof applications a, b, cdescribed in theGlobal Oper-
Fig. 4(a) shows the logicalaccess fable for the entityFlight. ation Schemata in Fig. 3 at sites 1 (Denver), 2 (New York),
Columns correspondto operations, rows correspond to the 3 (Atlanta).
entity’s attributes; positions in thematrix indicate the type Partitioning fables for the Airline Reservation Database
of action (“0,”“W,” “ K ” ) performed on objects. The row are shown in Fig. 5(a) and (b). Fig. 5(b) shows the primary
denoted by RA (relationship access) indicates whether a partitioning table for entities Airport and Passenger. The

CERl et a/.: DISTRIBUTEDDATABASEDESIGNMETHODOLOGIES 539


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
E n t i t y : FLIGHT (ABC).The above seven cases arerequired because Reser-
vation is a many-to-many relationship (each passenger can
T Operations
hold multiplereservations) andyet in the definition of the

y
Attributes
fragmentation we want to map each passenger in the real
worldto exactlyone instance of the Passenger entity. In the
NUMBER first case, each Passenger instance is statically assigned to
a partition when the first reservation is made, while in the
DATE second case the mapping of Passenger instances to parti-
tions is dynamic: whenever a passenger changes reserva-
SEAT
MRP
tion, the corresponding entityinstance might move from
GATE one partition to another.
Fig. 6 shows the polarization table. Columns are associ-
DELAY ated with activations of applicationsat each site, rows with
partitioning predicates. Each entry indicates the percent-
AVAILABLE
SEATS
age of accesses to a fragment ifthe particular partitioning
criterion is selected. In practice, only afewentries are spec-
R4 FROn

P r i m apr ya r t i t i o n i nt agb l e
AN 30

Entity Name PredicateSelectivityPredicatesPartitio

AIRPORT AREA AREA = 'W' 30


AREA = 'N' 45
Frequency Sites
.AREA = 'S' 25
table
1 2 3

Operations PASSENGER TELEPHONE TELEPHONE = '415 *'


OR '408"' OR"' 35
(FIRST 3 DIGITS)
a 10000 20000 10000 I TELEPHONE = ' W * '
OR ' 7 1 3 * ' O R . . . 35

TELEPHONE = '212 *'


OR ' 6 1 7 * ' O R . . . 30

(b) (a)
intity access table (a) and frequency table (b).
r D e r i pv ae rdt i t i o nt ianbgl e

A s s o c i a t i o nf o r Predicate Partition
aeslgner chooses the Area attribute as a partitioning cri- Entity thederivation
Base Entity Narfe Selectivity
terion for the Airport entity and the first three digits of tele-
phonenumber(areacode)asapartitioningattributeforthe FLIGHT FROM AIRPORT AREA SAME
Passenger entity. For each possiblevalue of the partitioning
attribute, predicate selectivitygives the percentage of the
tuples of the entity withthat value.
The designer then considers potential derived partition-
ing induced by the primary partitioning of Airports into
Areas. Four derived partitioning choices are considered in
Fig. 5(b):
The Flight entitycan be partitioned in two ways, based
on the relationshipFrom (with thedeparture airport)or To
(with thearrival airport) and the partitioningofairports into
areas. Notice that the same primary partitioning gives rise
to two different derived partitionings based on the use of
two different relationships. WTE:FLIGHT DEPARTURE P a r t i t i o n i n g i s basedon7 predicates: Predicate
Selectivity
The last two rows show two alternative ways of parti-
20
tioning the Passenger entity based on therelationships from P1 : A l f l i g h t sr e s e r v e db yt h ep a s s e n g e rd e p a r tf r a na r e a A

Reservationto Flight, partitioned by departure areas. Notice P2 : Al f l i g h t sr e s e r v e d bythepassengerdepartfromarea B 20

that in this case the derivation mechanism i s in two steps, P3 : A l f l i g h t sr e s e r v e d bythepassengerdepartfromarea C 20

since it is applied from Airport Flight


to andthen fromFlight P4 : A11 f l i g h t s r e s e r v e d bythe passenger depart from areas A and B 10

to Passenger. In the former case, passengers are partitioned P5 : A l f l i g h t sr e s e r v e d bythepassengerdepartfromareas A and C 10

based on the flight on which they hold the first


reservation; P6 : Al f l i g h t sr e s e r v e d bythepassengerdepart fm areas B and C 10

in the latter case, passengers are partitioned based on all P7 : A ll f l i g h t sr e s e r v e db y the passengerdepartfrom a l l areas 10

the reservations that they hold. Seven cases are possible -


(illustrated in theNOTE table): passengers canhold reser- (b)
vations on flights that leave only from one area (A, B, C), Fig. 5. (a) Primary partitioning table. (b) Derived partition-
orfromtwoareas(AB,BC,AC),orfinallyfromallthreeareas ing table.

540 PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5, MAY 1987


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
a b C
\
PASSENGER.l U
1 2 3 1 2 3 1 2 3 PASSENGER.4U
PASSENGER.5U
PASSENGER. 7
AIRPORT
by AREA

PASSENGER
by PHONE Site 1
1
AC
PASSMGER
by FIRST
RESERVATION
LOCATION

PASSENGER. 2 U
PASSENGER.4 U
PASSENGER. 6 U
PASSENGER. 7
PASSENGER
by
DEPARTURE

Site 2
\

FLIGHT by I

DEPARTURE

PASSENGER. 3 U
FLIGHT by PASSENGER. 5 U
ARRIVAL PASSENGER. 6 U
PASSENGER. 7

Fig. 6. Polarization table.

ified; the remaining entries can be computed by assuming


uniform distributionof the remaininginstances. Some sub-
(rl - - ---__ AIRPORT. 3 Site 3

1
matrices are not relevant because the applicationdoes not
use the entity; these are crossed out in the table. Fig. 7. Result of distribution design: local schemataforsites
1-3.
To illustrate, let us discussthe firstsubmatrix associated
with theMake-Reservation application (Fig. 3(a))and using
the partitioningof the Airport entity by areas (respectively, "larger" than the second class. In ourcase study:
Pl for area 1, P2 for area 2, and P3 for area 3). The matrix 1) vertical partitioning is not useful to determine units
indicates that if we partition airportsareas,by then a query of allocation, in fact, no application exists that can be
about reservations issued at area 1 has 80 percent proba- clearly eased by a vertical partitioning;
bilityof beingrelatedtoairportsinarea1.Fortheothertwo
2) conversely, all entities have a horizontal fragmenta-
positions incolumn 1 we assume uniformity over the
tion:
remaining 20 percent of access. the Airport entity has a primary horizontal frag-
3) Distribution Design in the Airline Reservation System:
mentation based on Area (fragments Airport.1,
Distribution design consists of four steps: the selection of
Airport.2, Airport.3);
fragmentation criteria foreach entity, the determination of
the Flight entity has a derived horizontal frag-
nonredundant allocation, the introduction ofredundancy
mentation based on the Departure airport (frag-
over the nonredundant allocation, and, finally, the recon-
ments Flight.1, Flight.2, Flight.3);
struction of local schemata at each site (see Fig. 7).
the Passengerentity has a derived horizontal frag-
a) Fragmentation design:Given the potential partition-
mentation based on Departure of all Flights on
ingcriteriacontained in the polarization tables, the designer
which the passenger i s booked (fragments Pas-
must here select the most appropriate one foreach entity, senger.1 to Passenger.7).
and validate that the partitioningitself is convenient. This
requires the quantitative analysis of the relevant applica- b) Nonredundant allocation: In somecases, nonre-
tions, which can be classified in three classes: those which dundant allocation followseasily from theselection of the
are made easier by the partitioning, those whichare made partitioning criterion.For instance, Airport.1, Flight.1, and
more difficult, and, finally, those which are not affected. Passenger.1 areimmediatelyallocated at site l , and likewise
Then, the partitioning is convenient if the first class is Airport.2, Flight.2, and Passenger.2 are allocated at site 2

CERl et a/.: DISTRIBUTEDDATABASEDESIGNMETHODOLOGIES 541


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
and Airport.3, Flight.3, and Passenger.3 are allocated at site A. Design Problems in Building a Global Schema
3. For other fragments of the entityPassenger, we needto
select the site which mostly uses the fragment according Problems in the bottom-up design of a multi-database
to the polarization tables and frequency tables; thus we system are due to the need for building a global schema
allocate Passenger.4, Passenger.6, and Passenger.7 to site (also called Superview in [18], [36]). Theintegration process
2, Passenger.5 to site 3. recognizes matching entities and their attributes.
c) Redundant allocation: Redundancy is then consid- To integrate databases we need to select a suitable type
ered; in many cases, costs associatedwith redundancy out- of data model for the global schema. Previouswork onview
balance benefits for all fragments of the same entity. This integration has demonstrated the power of generalization
is the case in the Airport and Flight entities. For the frag- hierarchies to support view integration. A generalization
ments of the Passenger entity we introduce instead a lim- hierarchy allows defining a type-subtyperelationship
ited degree of redundancy: Passenger.4, Passenger.5, and between two entities; this can be useful when two views
Passenger.6, which correspond to passengers holding give a partially overlappingdescription of thesame entity.
reservations on flights departing from two areas, are stored Then, the classical solution of theview integration problem
atthetwocorrespondingsites. Likewise, Passenger.7,which consists in generating three entities, one with the common
includes passengerswith flights leaving from all threeareas, attributes and the other two with thenonoverlapping attri-
is stored at all sites. butes [22]. An example is shown in Fig. 8, where two views
d) Reconstructionoflocalschemata: This stepis mostly of the Flight entity have some common attributes; in the
concerned with the fragmentation and allocation of rela- global view, the common attributes are associatedwith the
tionships of the ER global data model. According to the supertype, and one subtype is generated for each view
above allocation of fragments, the relationships Reserva- including the nonoverlapping attributes. Generalization
tion and Check-In happen to have a natural allocationsuch hierarchies are represented by double-line arrows.
that all linksare local (because each passenger is associated The need for generalization hierarchies indicates that
to one Passenger instance at each site where he holds a conceptual models such as the ER model (extended with
reservation or checks in).The Fromrelationship is also nat- generalization hierarchies), the structural model, or the
urally allocated, since Airport and Flight entities are par- functional model are good candidates for the view inte-
titioned horizontally according to the From relationship; gration process. In this paper,we use the ER model extended
hence, all links are local. Instead,the distribution of theTo with generalization hierarchies, as shown in Fig. 8; gen-
relationship needs to be discussed,since it links entity eralization hierarchies are represented bydouble-line
instances which may be storedat two differentsites. In our arrows.
solution, the instances of theTo relationship are stored at In the example, two views of theFlight entity have some
the same site as the corresponding instances of the Flight common attributes; in theglobal view, we generate a gen-
entity. eralization hierarchy, with one supertype and two sub-
types; the supertype has the common attributes, while the
subtypes correspond each to one view; "difference" attri-
The most relevant property of this solution is that all
butes are related to the subtypes.
requests about flights can be answered by looking onlyat
Another general question is the order o f integration of
the data which are allocated at the departure site of the
flight; no remote information is required forpreparing for views. When several views are present, integration is typ
ically performed by merging one view at a time with the
the flight's departurewhen thedatabase is heavily used. As
global schema, which is thus builtprogressively. Thus the
a drawback,passenger information is replicated, and a
general problem thatweconsideris howto build the super-
careful management of passenger information mustbe
view of two views. It is, in general, better to integrate first
done when reservations are taken.
the largest or most important views, followed by the small-
est or least important ones.
7) Recognizing Similarities: The first step in the integra-
IV. B O ~ O M - UDISTRIBUTION
P DESIGN tion of two schemata is to recognize their similarities; these
In the bottom-up approach, schemata representing the provide the starting'point for integration. Matches can be
portion ofdata stored at eachindividual site constitute the recognized because of the naming or structure similarities
starting point in the design, and distribution design con- of overlapping portions of the schemata. Matches can also
sists of identifying the data which are common to them, as bededuced from similarityof data in pre-existingdatabases
well as their differences. or files. Similar valuesets indicateoverlap. Domainsof attri-
During operation, most multi-database systems provide butes can be matched by comparing their projections.
only global query capability and local update capability, so Observing thatfiles or domains have the same or nearly the
that each local system mayonly beupdated bytransactions same cardinality can be an initial hint.
issued at that site. If thedesigner cannot modify the local If different applications on differentdatabase locations
databases of a multi-database system, then conflict reso- use copiesof the same source data, then thereis surely some
lution has to be incorporated into the query processing overlap between the databases.
capability of thesystem. 2) Recognizing Conflicts: Integration should also iden-
Multidatabase support provides an automatic mapping tify conflicts, i.e., different representations or domain def-
of queries issued according to the global view into queries initions of similiardata in different schemata [3], [22], [38].
applicable to the local schemata, and coordinatesthe exe- Conflicts can be resolved by incorporating differences in
cution of queries and the collection ofresults. theglobal model or by inducingcompromises in thesource

542 PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5 , M A Y 1987


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
-
VIEW 1 VIEW 2
I

DATE
AIRPLANE
AVAILABLE
AVAILABLE MEAL SERVED
SEATS

GLOBAL VIEW

NUMBER
DATE
1-
MAP
FLIGHT io AVAILABLE SEATS

SEAT

Fig. 8. Merge of two views using generalization hierarchies.

models [471. Conflicts which cannot be resolved at design the same real-world objectcan be modeled as an attribute
time requireselection of policies for answering queries with in one view and as an entity in another view. Only a few
inconsistent data. of these differences can be dealt with by generalization in
Schema differences include naming conflicts, scale dif- the global schema. In view design, structural differences
ferences, and structural differences. are typically solved by changingone or bothviews. When
Naming conflicts: There are two types of naming con- dealing with autonomous databases, structural differences
flicts: synonyms occur when data objects whichrepresent typically require writingcomplex query modification pro-
the same world objects have been given different names cedures, to be stored in the globalschema, if the distrib-
in thetwo views; homonyms occurwhen data objects with uted DBMS can support them [18].
the same name in twoviews represent different real-world Many of theabove conflicts shouldbe reported and fixed
objects. Once naming conflicts are detected, they are easily before integration, and local systems should be modified
handled by storing name correspondence tables in the to reflect integration whenever possible. Otherwise, the
global schema. global schema should include information about conflicts,
9 Domain differences: The most insidious problems are and policies for resolution. Clearly, the management of
due to domain differences. A relation Personnel at a site conversion formulas , query modification procedures, or
handlingthe payrollmay not includecontractorpersonnel, conflictresolution policies, mentionedinthis section,
who are paid indirectly. At the site where projectsare man- requirean extension of thecapabilities of traditional DBMS
aged, contractor personnelare included, butauditors, who in the direction of more sophisticated multi-database sys-
are not assignable to any project, are not. Detectionof such tems; in absence of such extensions, the above features
problems is best achieved by comparing the source data- must be carefully supported by applicationprograms.
bases or files, and noting inconsistencies. Generalization 3) Dealing with Inconsistent Data During Operation: In
hierarchiescan beused for representing the solution to this practice, operational multi-databases have errors; anerror
problem. In the above case, permanent personnel, con- rate of up to1 percent per stored records is not unusual.
tractor personnel, and auditors could all besubtypes of the They maybe dueto inputtranscription, omission or failure
same entity Personnel. Applications, such as payment, are in synchronization of updates, and improper recoveryfrom
then performed in different ways according to thesubtype system errors. The databasedesigner must decidethe pol-
that they use. icies for dealing with inconsistencies that arise duringoper-
Scaledifferences: Scale differences may be found in dif- ation of the globaldatabase.
ferent views of the same numeric values. The data should Letus consider, for instance, the situation where two
be retrievedusing, if possible, the moreprecise scale, and instances of the Employee entities storedat different sites
joined or output using conversion formulas. Conversion happen to have the same identifier, but different values for
formulas should be stored as a part of the globalschema, the salary attribute. This situation can be due toseveral rea-
if the distributed DBMS can support them. sons:
Structural differences: Structural differences may be It is possible that the same identifier corresponds to
due to differentdesign choices in each view; for instance, twodifferentemp1oyees;thissituationcan besolved bysys-

CERl et a/.: DISTRIBUTEDDATABASEDESIGNMETHODOLOGIES 543


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
tematically modifying values as they are extracted from the ceptual schema of Fig. 9 representthe data of airlineB. Let
database. For instance, it is possible to introduce system- us look at the features of theschema of airline Band com-
wide unique identifiers, concatenating all local identifiers pare the schema with theschema of airlineA. Clearly, the
with site identifiers. new schema is simpler; in particular, there is no informa-
Another possibility is that the same employee has two tion about airport sand check-ins. Information aboutdepar-
different jobsat the twosites, and the salary attributes just ture and arrival of flightsis represented as attributes of the
refer to the two differentjobs. This may be considered at entity Flight. The basic information about passengers and
the level of schema integration, treatingthetwo salaryattri- flights is, however, very similar.
butes as homonyms. Otherwise, it is possible to manage Fig. 10 represents the global schema built after integra-
this data inconsistency by indicatingin the global schema tion. A generalization hierarchy is used to represent two
that all queries interested in theglobal salary should com- subtypes F1ight.A and Flight.B, while the supertype Flight
pute it as the summation of the two local salaries. contains all common attributes. They include theFlightld,
A third reason for discrepancy may be obsolescence. called flight Number inschema A; here, a synonym is rec-
This again can be solved at the schema level, treating the ognized. Common attributes also include the flight Date,
two salaries as homonyms (i.e,, old salary and new salary). which was encoded in the first positions of the attribute
Otherwise, it is possible to operate at the query modifi- Departure Time; here, a moredifficult analysis is required.
cation level, indicating that the most recent value should The entity F1ight.A is related to the entity
Passengerthrough
be used (in this case, the salary attributes must be time- the relationship Checkln, showing that thisinformation is
stamped, but obviously this is an additional major cost). available only for entities of schema A. Finally, the infor-
A fourth reason for the inconsistency presentedabove mation about the departure and arrival airports is repre-
is actual incorrectness, due to spurious errors. sented in different ways for entities F1ight.A and Flight.B,
The four situations above have shown thatthe database reflecting the different representations. Noticea query that
designer has several options on how to manage inconsis- requesting information about flights departing from agiven
tencies. While inconsistencieswill bedetectedat execution airport (for instance, the Airport-Departureapplication of
time, the determination of policies for solving inconsis- Section Ill-C)should betranslatedinverydifferentwaysfor
tencies is a design problem. Policies include: the two local schemata.
Presenting any one of the inconsistent values without
notifying theuser: the most straightforwardbut at the same C. Earlier Work in Bottom-Up Distribution Design
time most dangerous solution.
Many of the considerations of this section are derived
Presenting all inconsistent values, and showing to the from the paper by Dayal and Hwang [IB]. The integration
user the sources of the information. In thiscase, the user
of multi-databases is also discussed by Breitbart and Paolini
should be able to evaluate the causes of inconsistencies.
[5], Litwin [29], Wong and Bazex [MI, and Dayal[17l. These
Evaluate some aggregate function on theinconsistent papers describe how queries should be decomposed into
values and present the result of the function to the user. subqueries to each local system; we have not discussed this
Possible aggregate functions include average, minimum, topic, since it is more related to the development of a multi-
maximum. This technique was used where observations database systemthan to thedesign of applications forsuch
were expected to differ since they occurred at different systems.
times [4]. View integrationis mainly discussed in the frameworkof
Present the most recent value. This policy requiresthe database design for centralized systems by ElMasri and
time-stampingofupdateoperations (which is certainly Wiederhold [Dl, Batini, Lenzerini, and Moscarini [3], and
rather costly). It is based on the assumption that incon- Navathe, Sashidar,and ElMasri [MI. Though generalization
sistencies are due to deferred updates, and thus the latest hierarchies are modeled in slightly different ways, the use
value is also the most likely one. of generalizationhierarchies in viewintegration is sug-
Present the value from the most reliable system. This gested by most references in view integration, including
policy is based on the assumption thatthe designer is able
[22] (using the Structural Model), [3] (using the EER model),
to evaluate the reliability of sites in the distributed data- and [I81 (usingthe functional data model).
base.
D. Interaction Between Top-Down and Bottom-Up
B. A n Example of Bottom-Up Integration Approaches
Assume that two operational airlines decide to have a In Section I I , wehave presented the pure topdown
multidatabase system in order to allow queries about flight approach, in whichthedatabasedesigner ignoresanyphys-
availabilities from any office ofthe twocompanies. Let the ical detail (including distribution) when performing con-
conceptual schema illustrated in Fig. 2 and discussed in ceptual design. While this approach is theoretically valu-
Section 1li-C be associated with airline A, and let the con- able, there can be practicalsituationsfor which it is

SEATMAP

FLIGHT PASSENGER

ARRIVAL
ARRIVAL
TIME
AVAILABLE
SEATS
d
NAME TELEPMNE

Fig. 9. Conceptual schema of airline B.

544 PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5, M A Y 1987


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
NAME TELEPHONE

Fig. 10. Global schemata built after integration.


I AIRPORT

inappropriate. For instance, assume that the database REFERENCES


describes an enterprise which is organized by functions, P. M. C. Apers, “Query processing anddata allocation i n dis-
with each function immediately mapped to one database tributed database systems,”Ph.D. dissertation, Vrije Uni-
location. versiteit, Amsterdam, The Netherlands, Sept. 1982.
Even if thedesignis started from scratch, abstractingfrom
-, “Data allocation in distributed database systems,” to
appear in ACM Trans. Database Syst.
distribution information might be unnatural in this case. C. Batini, M. Lenzerini, and M. Moscarini, ”Views integra-
One possible approachto thedesign is to proceed partially tion,” in Methodologyand Tools forDatabase Design, S. Ceri,
topdown and partially bottom-up, by collecting require- Ed. Amsterdam, The Netherlands: North-Holland, 1983.
ments and performing conceptual design for each function R. L. Blum, Discovery and Representation of Causal Relation-
ships from a Large Timeoriented Clinical Database: The RX
independently, then integrating all conceptual schemata Project (Lecture Notes in MedicalInformatics, no. 19). New
intoaglobal one, and finally redistributingthe information. York, NY: Springer-Verlag, 1982.
Clearly, with this approachwe retain the notionof the view Y. J. Breitbart andP. Paolini, “Chairmen’s report, Session on
where each object was originally defined, because in turn multidatabases,” in 3rd Int. Seminar on Distributed Data
this notion willsuggest the allocation of the object. How- Sharing Systems, F. A. Schreiber and W. Litwin,
Eds. Amsterdam, The Netherlands: North Holland, 1985.
ever, redistribution is useful forintroducing replication and R. C. Casey, “Allocation of copies of a filei n an information
for“moving” some objects from onelocal schema to network,” in Proc. 7972 Springloint Computer Conf., AFIPS,
another. 1972.
In summary,a”pure”top-down approach todistribution S. Ceri, M. Negri, and C. Pelagatti, “Horizontal data parti-
is advisable for views that have no correspondence with tioning in database design,” presented at ACM-SICMOD,
June 1982.
distribution, whilea bottom-upapproach isadvisable when S. Ceri, Ed., Methodology and Tools for Database
views have an immediate correspondenceto database loca- Design. Amsterdam, The Netherlands: North-Holland,1983.
tions; all intermediate situation are possible and should be S. Ceri, S. B. Navathe, and C. Wiederhold,”Distribution
dealt with by intermediate approaches. design of logical database schemas,” / E € € Trans. Software
Eng., vol. SE-9, no. 4, pp. 487-504,July 1983.
S. Ceri, B. Pernici, and C. Wiederhold, “An overview in the
V. CONCLUSIONS design of distributeddatabases,” / E € € Database Eng. (Special
Issue on Database Design), IEEE Computer Society, 1984.
As distributed applications are becoming a reality, dis- S. Ceri andC. Pelagatti, Distributed Databases: Principles and
tribution design is becominga new and relevantareawithin Systems. New York, NY: McGraw-Hill, 1984.
S. Ceri and B. Pernici, ”DATAID-D: Methodology for distrib-
database design. Distribution requires i t s own theory,
uted database design,” i n Computer AidedDatabase Design,
problem definitions, solution methods, and methodolo- A. Albano,V. de Antonellis, andA. di Leva, Eds. Amsterdam,
gies. The Netherlands: North-Holland, 1985.
This paper has presented asurvey of top-down and bot- 5. Ceri, B. Pernici, and C. Wiederhold, “Optimization prob-
tom-up approaches to distributiondesign and has focused lems and solution methodsi n the designof distributed data-
bases,” Politecnico di Milano, Department of Electronics,
on the DATAID-D topdown methodology; approaches have Computer Laboratory, Internal Rep. 85-7,revised version.
been exemplified and compared. S. K. Chang andW. H. Cheng, “A methodologyfor structured
database decomposition,” / € € E Trans. Software Eng., vol. SE-
6, no. 2, pp. 205-218,Mar. 1980.
ACKNOWLEDGMENT P. P. Chen, “The entity-relationship model: Toward a unified
view of data,”ACM Trans. DatabaseSyst., vol. 1, no. 1, pp. 9-
The authors would like thank
to the anonymousreferees 36, 1976.
for providing very useful and detailed comments for revi- W. W. Chu, “Optimal file allocation in a multiple computer
sion of this paper. system,” / € E € Trans. Cornput., vol. C-18, no. 10,pp. 885-888,

CERl et a/.: DISTRIBUTED DATABASE DESIGN METHODOLOGIES 545


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.
Oct. 1%9. bility. An approacht o physical database design,” I€€€Trans.
[ I 7 U.Dayal, “Processing queriesovergeneralizationhierar- Comput., vol. C-33,no. 3, pp. 209-222,Mar. 1984.
chies in a multidatabase system,” presented at the 9th Int. [46] G. Wiederhold, “A method for the design of multi-objective
Conf. on Very Large Data Bases, Florence, Italy, 1983. databases,” i n Computers in Engineering 7982, vol. 4, R. Ra-
[I81 U. Dayal andH. Y. Hwang, “Viewdefinitionandgeneral- ghavan, Ed. New York, NY: Amer. SOC. Mech. Eng., 1982.
ization for database integration in a multibase system,” /€€€ [47l -, Database Design, 2nd ed. New York, NY: McGraw-Hill,
Trans. Software Eng., vol. SE-10, no. 6,pp. 628-645,Nov. 1984. 1983.
[I91 / E € € Database Eng. (Special Issue on Highly Available Sys- [a] K. K. Wong and P. Bazex, “MRDSM: A relational multidata-
tems), IEEE Computer Society, vol. 6, no. 2, 1983. bases management system,” i n Third International Seminar
[20] L. W. Dowdy and D. V. Foster, “Comparative models of file on Distributed Data Sharing Systems, F. A. Schreiber and W.
assignment problem,”ACMComput. Surv.,vol. 14,no. 2,1982. Litwin, Eds. Amsterdam, The Netherlands: North Holland,
[21] M. J. Eisner and D. G. Severance, “Mathematical techniques 1985.
for efficient record segmentation in large shared databases,”
1. ACM, vol. 23, no. 4,Oct. 1976.
[22] R. ElMasri and G. Wiederhold, “Data model integration using
the structural model,” inACM-SIGMOD (May 1979),pp. 191-
202.
[23] K. P. Eswaran, “Placement of records in a file and file allo-
cation in a computer network,” Informat. frocessing(Amster-
dam, The Netherlands: North-Holland), 1974.
[24] M. L. Fisher and D. S. Hochbaum, “Database location in com-
puter networks,” ]. ACM, vol. 27,Oct. 1980.
[25] J. Grant, “Constraint preserving and lossless database trans-
formations,” Inform. Syst., vol. 9,no. 2, pp. 147-156, 1984.
[26] M. Hammer and6. Niamir, “A heuristic approacht o attribute
partitioning,” presented at ACM-SIGMOD, Boston, MA, 1979.
[27 J.A. Hoffer and D. G. Severance, “The use of cluster analysis
i n physical database design,” in froc. 7 s t VLDB Conf., 1975.
[28] K. 6. lrani andN. G. Khabbaz, “A methodology for the design
of communication networks and the distributionof data i n
distributed supercomputer systems,” /€E€ Trans. Comput.,
vol. C-31, no. 5, 1982. Barbara Pernici is a Research Associate at
[29] W. Litwin, ‘“ALPHA, Amultidatabasemanipulation lan- the CSlSEl Research Center of the Italian
guage,” i n Proc. I€€€Data Engineering Conf. 7 (LosAngeles, National Research Council, Department of
CA, Apr. 1984). Electronics, Politecnico di Milano, Milan,
[30] V. Lum et a/., “1978 New Orleans Data Base Design Work- Italy. She has been with the Department of
shop Report,” IBM Rep. Rj2554 (33154),IBM Res.Lab.,San Electronics since 1981. Her research inter-
Jose, CA; and i n froc. Int. Conf. on Very Large Data Bases, ests include distributed database design,
1979. office information systems modeling and
[31] S. Mahmoudand J. S. Riordon,“Optimalallocationof design.
resources in distributed information networks,” ACM Trans. Dr. Pernici is an affiliate member of the
Database Syst., vol. 1, no. 1, 1976. IEEE Computer Societv and a member of
[32] D. McLeod, “A federated architecture for database systems: ACM.
NCC, pp. 283-289,1980.
[33] D. Maier and J. Ullman, “Fragments of relations: First hack,”
in froc. Xf2 Workshop, 1981.
[34] S. T. March andG. D. Scudder, “On the selection of efficient
record segmentations and backup strategies for largeshared
databases,” ACM Trans. Database Syst., vol. 9, no. 3, 1984.
[35] H. L. Morgan and K. D. Levin, “Optimal program and data
location in computer networks,” Commun. ACM, VOI. 20,no.
5, pp. 315-322,May 1977.
[36] A. Motro and P. Buneman, “Constructing superviews,” i n
froc. SICMOD ’87,pp. 56-64, 1981.
[37l S. 6. Navathe, S. Ceri, G. Wiederhold, and J. Dou, “Vertical
partitioning algorithms for database design,” ACM Trans.
Database syst., vol. 9, no. 4, 1984.
[38] S. B. Navathe,T. Sashidar,and R. EIMasri,”Relationship merg-
ing inschema integration,” in froc.70th Conf. on Very Large
Data Bases (Singapore), 1984.
[39] J. Paredaensand P. De Bra,“On horizontal decompositions,”
in froc. X f 2 Workshop, 1981.
[a]D. Reiner et a/., “The database design and evaluation work-
bench (DDEW) project at CCA,” /€€€ Database Eng. (IEEE
Computer Society), vol. 7, no. 4, Dec. 1984.
[41] J. 6. Rothnie and N. Goodman, “A survey of research and
development in distributed database management systems,”
in froc. 3rdInt. Conf. on Very Large Data Bases, 1977.
[42] D. SacciandG. Wiederhold,”Database partitioning i n aclus-
ter of processors,” ACM Trans. Database Syst., vol. 10,no. 1,
Mar. 1985.
[43] T. J. Teorey and J. P. Fry, Design o f Database Struc-
tures. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[44] J. Ullman, Principles ofDatabase Systems, 2nd ed. Rockville,
MD: Comput. Sci. Press, 1982.
[45] K.-Y. Whang, G. Wiederhold, and D. Sagalowicz, “Separa-

546 PROCEEDINGS OF THEIEEE, VOL. 75, NO. 5 , M A Y 1987


Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on August 10, 2009 at 15:09 from IEEE Xplore. Restrictions apply.

You might also like