Distributed Database Design Methodologies
Distributed Database Design Methodologies
Invited Paper
This paper surveys methodological approaches for distributed of datadistribution,and the system takesthe responsibility
database design. The design of distribution can be performed top- of distributing access operations to the databases at dif-
down or bottom-up; the first approach is typical of a distributed
database developed from scratch, while the second approach is ferent sites. However, the actual distribution ofdata affects
typical of the development of a multidatabase as the aggregation the overall performance ofthe system: time and cost
of existing databases. We review the design problems and metho- required for accessing multiple data objects differ greatly
dologies along both directions, and we describe DATAID-D, a top- depending on whether alldataobjects are stored at the same
down methodology for distribution design. We indicate how the
site or are spread over multiple sites. Replication of data
methodology is part of a global approachto database design; how
to collect the requirements about the distribution of data and affects the overall reliabilityand availability of thesystem,
applications; and how toprogressively build the distributionof a because several copies of the same information become
schema. Our approachis exemplified through onecase study. availablewith independent failure modes. Atthe same time,
data replication affects performance, due to the require-
I. INTRODUCTION ment of maintaining the consistency of copies. Thus the
Due to demand for system availabilityand autonomy, and database designer must carefully considerthe distribution
enabled by advances in database and communication tech- of data, even when the system supports a high degree of
nology, distributed database systems are becoming wide- transparency.
spread. Many database management systems now support One key principle in distribution design is to achieve
extensions for distribution, and at the same time structur- maximum locality of data and applications. While distrib-
ally distributed transactionmanagement systems are avail- uted databases enable more sophisticated communication
able on the market [Ill, [19].The designers of distributed between sites, the major motivation for developing a dis-
database applications are now facing a new and relevant tributeddatabaseistoreducecommunication byallocating
problem: how to distribute the data and programs on dif- dataascloseaspossibletotheapplicationswhichusethem.
ferent computersto obtain the intended performance, reli- Thus in a well-designed distributed database "90 percent
ability, and availability. Distribution design is thus emerg- of the data should be found at the local site, and only 10
ing as a newproblem area of databasedesign, which percent of the data should be accessed on a remote site"
requires its own theory, design methodologies, and sup [41]. However, it rarelyoccursthat dataand applications can
port tools. be cleanly partitioned and assignedto a particular site; more
The degree ofsophisticationofdistributed database often, the designer is faced by tradeoffs because several
management systems(DBMS) is often measured by the applications need to access the same data from different
degree of distributiontransparency provided to theusers. locations. In this case, the most effective designis the one
In an ideal situation, the user does not need to be aware which ensures locality to the largest number of applica-
tions. A basic assumption of this paper is that thedatabase
designer is able to predict both the logical properties which
Manuscript received November 26, 1985; revised October 30, characterize the "locality" of data with respect to appli-
1986. This research was partially conducted within theKBMS Proj- cations and the quantitative information measuring the
ect at Stanford University, which is supported by DARPA under
Contract N39-84-C-11. Itdraws heavily on research performed "load" of applications, in terms of frequency of execution
under the DATAID Project in Italy, supported by the CNR within requests at each site. The difficulty of designing the dis-
the ProgettoFinalizzatolnformatica. Theconclusionsdrawn tribution of a database is, in fact, due to the interference
remain the responsibility of the authors. of logical and quantitative considerations.
S. Ceri i s with the Dipartimento di Matematica, Universiti di
Modena, Modena, Italy.
B. Pernici is with CSISEI-CNR, DipartimentodiElettronica, A. Top-Down and Bottom-Up Approaches to Distribution
Politecnico di Milano, 20133 Milano, Italy. Design
G. Wiederhold i s with the Computer Science Department, Stan-
ford University, Stanford, CA 94305, USA. Distribution design can be performed top-down or bot-
I E E E Log Number 8714302. tom-up. The former approach is typical of distributeddata-
I
I
DATAID-D is built as an extension ofthe DATAID-1 meth- Design Data Dictionary
odology for centralized database design, which is divided
r-l
into 4 phases [8]:
CONCEPTUAL
requirements analysis DESIGN
conceptualdesign I I
logicaldesign Global
Operatim
Global
Data
physical design. Schemata Schema
1
I 77-
ANALVSIS (
DISTRIWTIOI
REQUIREMENT!
Logical
Simplified
tions and the frequency of activation of each application Global S c h m Access
Tables
from each site. Since the structure ofdata and applications I I
and applications produced by the previous design phase, Local Logical Schemata
(Relational or Codasyl)
in the formof global database schema and logical access
tables. Global database schemas are described using sim-
a LOCALPHYSICAL
ple version of the ER model [E], which includes just binary DESIGN
relationships; logicalaccess tables indicate the type ofdata-
base access operations performed by each application on
each entity. In DATAID-I,conceptual modeling is per-
formed on a more sophisticated model, called the Extended
p GATE TELEPHONE
NAME
SECURITY RULES
-PASSENGER
20000 1 1000000 1
DATE
(K)
d
SEATMAP
(0.w)
SEAT NUMBER CHECKED
RAGS
(w) (w)
(b)
GATE [ a )
(0) NUMBER
FLIGHT DELAY ( a ) AIRPORT SYMBOL ( a )
(k) DATE
CITY [a)
__f
€3 AIRPORT
SYMsOL (K)
(C)
Fig. 3. Global Operation Schemata of the airline reservation database. (a) Make-Reser-
vation. (b) Check-In. (c) Airport Departures.
lowingthe application’s access path through data. As a more relationship was used for accessing the entity, while the
general consideration, notice thatoperation schemata are row denotedby A N (accesses number) gives the total num-
“procedural,” i.e., they indicate precisely the access path ber of instances involved in the operation. For a more
of an application through thedatabase; however, they are detailed description, see [8].
built at a logical level, without postulating theexistence of 2) Analysis of Distribution Requirements in the Airline
access methods. Subsequent design phases (i.e., local log- Reservation System: In this section we examine how dis-
ical design and physical design) will decide which access tribution requirements are represented by the DATAID-D
methods haveto beprovided, based on theseschemataand methodology after the analysis ofdistributionrequire-
on quantitative informationabouttheapplications’load [8]. ments phase.
Finally, quantitative data about applications aresum- The frequency fable illustrated in Fig. 4(b) showsthe fre-
marized for each entity, building the logical access tables. quencyof applications a, b, cdescribed in theGlobal Oper-
Fig. 4(a) shows the logicalaccess fable for the entityFlight. ation Schemata in Fig. 3 at sites 1 (Denver), 2 (New York),
Columns correspondto operations, rows correspond to the 3 (Atlanta).
entity’s attributes; positions in thematrix indicate the type Partitioning fables for the Airline Reservation Database
of action (“0,”“W,” “ K ” ) performed on objects. The row are shown in Fig. 5(a) and (b). Fig. 5(b) shows the primary
denoted by RA (relationship access) indicates whether a partitioning table for entities Airport and Passenger. The
y
Attributes
fragmentation we want to map each passenger in the real
worldto exactlyone instance of the Passenger entity. In the
NUMBER first case, each Passenger instance is statically assigned to
a partition when the first reservation is made, while in the
DATE second case the mapping of Passenger instances to parti-
tions is dynamic: whenever a passenger changes reserva-
SEAT
MRP
tion, the corresponding entityinstance might move from
GATE one partition to another.
Fig. 6 shows the polarization table. Columns are associ-
DELAY ated with activations of applicationsat each site, rows with
partitioning predicates. Each entry indicates the percent-
AVAILABLE
SEATS
age of accesses to a fragment ifthe particular partitioning
criterion is selected. In practice, only afewentries are spec-
R4 FROn
P r i m apr ya r t i t i o n i nt agb l e
AN 30
(b) (a)
intity access table (a) and frequency table (b).
r D e r i pv ae rdt i t i o nt ianbgl e
A s s o c i a t i o nf o r Predicate Partition
aeslgner chooses the Area attribute as a partitioning cri- Entity thederivation
Base Entity Narfe Selectivity
terion for the Airport entity and the first three digits of tele-
phonenumber(areacode)asapartitioningattributeforthe FLIGHT FROM AIRPORT AREA SAME
Passenger entity. For each possiblevalue of the partitioning
attribute, predicate selectivitygives the percentage of the
tuples of the entity withthat value.
The designer then considers potential derived partition-
ing induced by the primary partitioning of Airports into
Areas. Four derived partitioning choices are considered in
Fig. 5(b):
The Flight entitycan be partitioned in two ways, based
on the relationshipFrom (with thedeparture airport)or To
(with thearrival airport) and the partitioningofairports into
areas. Notice that the same primary partitioning gives rise
to two different derived partitionings based on the use of
two different relationships. WTE:FLIGHT DEPARTURE P a r t i t i o n i n g i s basedon7 predicates: Predicate
Selectivity
The last two rows show two alternative ways of parti-
20
tioning the Passenger entity based on therelationships from P1 : A l f l i g h t sr e s e r v e db yt h ep a s s e n g e rd e p a r tf r a na r e a A
in the latter case, passengers are partitioned based on all P7 : A ll f l i g h t sr e s e r v e db y the passengerdepartfrom a l l areas 10
PASSENGER
by PHONE Site 1
1
AC
PASSMGER
by FIRST
RESERVATION
LOCATION
PASSENGER. 2 U
PASSENGER.4 U
PASSENGER. 6 U
PASSENGER. 7
PASSENGER
by
DEPARTURE
Site 2
\
FLIGHT by I
DEPARTURE
PASSENGER. 3 U
FLIGHT by PASSENGER. 5 U
ARRIVAL PASSENGER. 6 U
PASSENGER. 7
1
matrices are not relevant because the applicationdoes not
use the entity; these are crossed out in the table. Fig. 7. Result of distribution design: local schemataforsites
1-3.
To illustrate, let us discussthe firstsubmatrix associated
with theMake-Reservation application (Fig. 3(a))and using
the partitioningof the Airport entity by areas (respectively, "larger" than the second class. In ourcase study:
Pl for area 1, P2 for area 2, and P3 for area 3). The matrix 1) vertical partitioning is not useful to determine units
indicates that if we partition airportsareas,by then a query of allocation, in fact, no application exists that can be
about reservations issued at area 1 has 80 percent proba- clearly eased by a vertical partitioning;
bilityof beingrelatedtoairportsinarea1.Fortheothertwo
2) conversely, all entities have a horizontal fragmenta-
positions incolumn 1 we assume uniformity over the
tion:
remaining 20 percent of access. the Airport entity has a primary horizontal frag-
3) Distribution Design in the Airline Reservation System:
mentation based on Area (fragments Airport.1,
Distribution design consists of four steps: the selection of
Airport.2, Airport.3);
fragmentation criteria foreach entity, the determination of
the Flight entity has a derived horizontal frag-
nonredundant allocation, the introduction ofredundancy
mentation based on the Departure airport (frag-
over the nonredundant allocation, and, finally, the recon-
ments Flight.1, Flight.2, Flight.3);
struction of local schemata at each site (see Fig. 7).
the Passengerentity has a derived horizontal frag-
a) Fragmentation design:Given the potential partition-
mentation based on Departure of all Flights on
ingcriteriacontained in the polarization tables, the designer
which the passenger i s booked (fragments Pas-
must here select the most appropriate one foreach entity, senger.1 to Passenger.7).
and validate that the partitioningitself is convenient. This
requires the quantitative analysis of the relevant applica- b) Nonredundant allocation: In somecases, nonre-
tions, which can be classified in three classes: those which dundant allocation followseasily from theselection of the
are made easier by the partitioning, those whichare made partitioning criterion.For instance, Airport.1, Flight.1, and
more difficult, and, finally, those which are not affected. Passenger.1 areimmediatelyallocated at site l , and likewise
Then, the partitioning is convenient if the first class is Airport.2, Flight.2, and Passenger.2 are allocated at site 2
DATE
AIRPLANE
AVAILABLE
AVAILABLE MEAL SERVED
SEATS
GLOBAL VIEW
NUMBER
DATE
1-
MAP
FLIGHT io AVAILABLE SEATS
SEAT
models [471. Conflicts which cannot be resolved at design the same real-world objectcan be modeled as an attribute
time requireselection of policies for answering queries with in one view and as an entity in another view. Only a few
inconsistent data. of these differences can be dealt with by generalization in
Schema differences include naming conflicts, scale dif- the global schema. In view design, structural differences
ferences, and structural differences. are typically solved by changingone or bothviews. When
Naming conflicts: There are two types of naming con- dealing with autonomous databases, structural differences
flicts: synonyms occur when data objects whichrepresent typically require writingcomplex query modification pro-
the same world objects have been given different names cedures, to be stored in the globalschema, if the distrib-
in thetwo views; homonyms occurwhen data objects with uted DBMS can support them [18].
the same name in twoviews represent different real-world Many of theabove conflicts shouldbe reported and fixed
objects. Once naming conflicts are detected, they are easily before integration, and local systems should be modified
handled by storing name correspondence tables in the to reflect integration whenever possible. Otherwise, the
global schema. global schema should include information about conflicts,
9 Domain differences: The most insidious problems are and policies for resolution. Clearly, the management of
due to domain differences. A relation Personnel at a site conversion formulas , query modification procedures, or
handlingthe payrollmay not includecontractorpersonnel, conflictresolution policies, mentionedinthis section,
who are paid indirectly. At the site where projectsare man- requirean extension of thecapabilities of traditional DBMS
aged, contractor personnelare included, butauditors, who in the direction of more sophisticated multi-database sys-
are not assignable to any project, are not. Detectionof such tems; in absence of such extensions, the above features
problems is best achieved by comparing the source data- must be carefully supported by applicationprograms.
bases or files, and noting inconsistencies. Generalization 3) Dealing with Inconsistent Data During Operation: In
hierarchiescan beused for representing the solution to this practice, operational multi-databases have errors; anerror
problem. In the above case, permanent personnel, con- rate of up to1 percent per stored records is not unusual.
tractor personnel, and auditors could all besubtypes of the They maybe dueto inputtranscription, omission or failure
same entity Personnel. Applications, such as payment, are in synchronization of updates, and improper recoveryfrom
then performed in different ways according to thesubtype system errors. The databasedesigner must decidethe pol-
that they use. icies for dealing with inconsistencies that arise duringoper-
Scaledifferences: Scale differences may be found in dif- ation of the globaldatabase.
ferent views of the same numeric values. The data should Letus consider, for instance, the situation where two
be retrievedusing, if possible, the moreprecise scale, and instances of the Employee entities storedat different sites
joined or output using conversion formulas. Conversion happen to have the same identifier, but different values for
formulas should be stored as a part of the globalschema, the salary attribute. This situation can be due toseveral rea-
if the distributed DBMS can support them. sons:
Structural differences: Structural differences may be It is possible that the same identifier corresponds to
due to differentdesign choices in each view; for instance, twodifferentemp1oyees;thissituationcan besolved bysys-
SEATMAP
FLIGHT PASSENGER
ARRIVAL
ARRIVAL
TIME
AVAILABLE
SEATS
d
NAME TELEPMNE