Data Warehouse Schema Design
Jens Lechtenbörger
Dept. of Information Systems
University of Münster
Leonardo-Campus 3
D-48149 Münster, Germany
lechten@wi.uni-muenster.de
1 Introduction
A data warehouse, in contrast to the underlying operational data sources, is a separate database consisting of a set of materialized views that integrate
data from the sources. For our purposes, we view a data warehouse as a set of UPSJR
views, i.e., relational views defined by union, projection, selection, join, and renaming.
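For illustration (the relation and attribute names are hypothetical), a single warehouse view over two operational sources might be defined as
V = π_{CustID, Amount}(σ_{Year=2002}(Order ⋈ OrderLine)) ∪ π_{CustID, Amount}(ρ_{Value→Amount}(LegacyOrder)),
combining selection, join, projection, renaming, and union in one UPSJR expression.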
We restrict our attention to the design process up to and including logical design, with the
relational data model as the target model of logical design; we emphasize
that we do not address design issues related to, e.g., cleansing, maintenance, or
meta-data management. Our aim is to define a data warehouse design process based on
the extension of independently developed data warehouse design approaches and their in-
tegration into the traditional database design process. For this purpose, we strive (1) to
identify and formalize desirable warehouse schema properties as design goals and (2) to
set up conceptual and logical design phases in such a way that the identified design goals
are guaranteed to be fulfilled.
The remainder of this paper is organized as follows. In Section 2, we present the design
goals of multidimensional normal forms and independence, which are addressed by a de-
sign process outlined in Section 3. In Sections 4, 5, and 6, we summarize requirement
analysis, conceptual design, and logical design as phases of this design process, and in
Section 7 we sketch a method to enforce update independence in the course of logical
design. We conclude the paper in Section 8.
2 Design Goals
While normal forms have a long tradition in the area of relational databases as guidelines
for good schema design, research on quality factors for data warehouses or multidimen-
sional databases has started only recently.
2.1 Summarizability and Multidimensional Normal Forms
The work presented by Lenz and Shoshani [LS97], which provides necessary conditions
for summarizability, can be regarded as a first step to define quality factors for concep-
tual data warehouse schemata. The notion of summarizability refers to the possibility of
computing aggregate values at a lower level of detail from aggregate values at a higher
level of detail. Roughly speaking, summarizability is given if an analyst is guaranteed to
obtain consistent results when performing sequences of roll-up operations along a path in a
dimension lattice. Lenz and Shoshani [LS97] argue that summarizability is of utmost
importance for queries concerning multidimensional data. Hence, any multidimensional schema
should be set up in such a way that summarizability is obtained to the highest possible de-
gree, and violations of summarizability should be expressed in the schema. Following and
extending the summarizability results presented in [LS97], Lehner et al. [LAW98] define
the generalized multidimensional normal form (GMNF), which ensures summarizability
in the presence of optional dimension levels in a context-sensitive manner and supports an
efficient physical database design.
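To make the roll-up intuition concrete, the following sketch (an illustration, not part of [LS97] or [LAW98]; all table, level, and attribute names are made up) shows that re-aggregating pre-aggregated values yields consistent results as long as every fact maps to exactly one instance of the coarser level, and how an optional level, such as a branch that is assigned to only some accounts, yields merely partial sums:

from collections import defaultdict

# Number of transactions per account and day (fact data); day -> month is a complete roll-up.
facts = [
    {"day": "2002-01-15", "month": "2002-01", "account": "A1", "txns": 4},
    {"day": "2002-01-31", "month": "2002-01", "account": "A2", "txns": 2},
    {"day": "2002-02-10", "month": "2002-02", "account": "A1", "txns": 3},
]

def roll_up(rows, level, measure="txns"):
    """Aggregate the measure (SUM) along one dimension level."""
    agg = defaultdict(int)
    for row in rows:
        agg[row[level]] += row[measure]
    return dict(agg)

by_month = roll_up(facts, "month")
# Summarizability: rolling the monthly aggregates up again equals aggregating the base facts.
assert sum(by_month.values()) == sum(row["txns"] for row in facts)

# An *optional* level violates completeness: account "A2" has no branch assigned,
# so a roll-up to the branch level silently covers only part of the facts.
branch_of = {"A1": "Downtown"}  # partial mapping: A2 is unassigned
by_branch = roll_up(
    [dict(row, branch=branch_of[row["account"]])
     for row in facts if row["account"] in branch_of],
    "branch",
)
assert sum(by_branch.values()) < sum(row["txns"] for row in facts)  # partial information only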
We extend the framework of [LAW98] by showing that there is a natural correspondence
between optional dimension levels and attributes occurring in sub-classes of generaliza-
tion hierarchies in the sense of Smith and Smith [SS77], and we define three increas-
ingly restrictive multidimensional normal forms to address this correspondence (see also
[LV02b]).
We briefly summarize the impact of those three multidimensional normal forms. Let F be
a fact schema with universe U_F, and let FD_U be a set of functional dependencies over U_F
that can be observed in the application domain. Associated with F there is a second set
of functional dependencies over U_F, namely the functional dependencies implied by F,
denoted FD_F. We argue that F can only be regarded as a reasonable schema for the application
domain if certain relationships between FD_U and FD_F hold, and these relationships are made
formally precise in terms of the first multidimensional normal form (1MNF). Roughly, 1MNF
demands that (1) the functional dependencies in FD_F do occur in the application domain,
(2) the potential for roll-up and drill-down operations available in the application domain
is preserved in the fact schema, and (3) the fact schema embodies the multidimensional
contexts of measures.
While the first of these requirements is a natural semantic faithfulness condition that should
be satisfied in any database, the following one is specific to the context of multidimensional
data, where it formalizes a completeness aspect with respect to coverage of the underlying
application domain. Finally, it can be shown that the last of these requirements avoids
certain kinds of redundancies that arise due to transitive functional dependencies; hence,
this last requirement sets up a loose connection between traditional normal forms and
multidimensional ones.
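As a schematic illustration (the attribute names are hypothetical), consider a fact schema F over U_F = {Day, Month, Year, AccountID, BranchID, Balance} for which FD_U contains Day → Month, Month → Year, AccountID → BranchID, and {Day, AccountID} → Balance. Requirement (1) then demands that every dependency implied by F, such as Day → Month if F declares Month as a roll-up level of Day, indeed holds in the application domain; requirement (2) demands that roll-up potential present in FD_U, such as AccountID → BranchID, is not lost in F; and requirement (3) ties the measure Balance to its complete multidimensional context {Day, AccountID}.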
Next, 2MNF strengthens 1MNF by taking care of optional dimension levels in order to
guarantee summarizability in a context-sensitive manner, just as GMNF does. Thus, if
a schema is in 2MNF then an analyst (or an analysis tool for that matter) can identify
optional dimension levels based on schema information, which has two consequences.
(1) If an analyst considers measures in conjunction with optional dimension levels, then
she is aware of the fact that she is only looking at partial information. (2) Contradictory
queries (e.g., group company customers according to their jobs) can be avoided.
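For example (the schema details are hypothetical), let Job be an optional dimension level of a Customer dimension whose context of validity is the sub-class of private customers. If the schema is in 2MNF, this context is visible in the schema: an analyst who aggregates a measure by Job is aware that only private customers contribute to the result, and a request to group company customers by their jobs can be rejected as contradictory.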
Finally, 3MNF places further restrictions on schemata in 2MNF, and these restrictions
are sufficient to construct a class hierarchy concerning dimension levels, which is im-
plicitly contained in a fact schema, in terms of relation schemata that avoid null values.
This normal form will be our formal guideline to measure the quality of conceptual data
warehouse schemata. In the next subsection, we turn towards properties of logical data
warehouse schemata.

2.2 Independence
From a technical point of view we regard a data warehouse as a relational database that
stores a set of materialized relational views. These materialized views have to be main-
tained in order to keep track of updates in the operational databases, and self-maintainability
has been identified as a desirable property that can be exploited during view maintenance
to avoid maintenance queries and update anomalies.
Roughly, a set of materialized views is self-maintainable if it is possible to determine the
new view instance after every database update only based on the old view instance and
the update information itself [GJM96]. Since a logical data warehouse schema is defined
in terms of a set of materialized views, the property of self-maintainability can immediately
be perceived as a property of logical data warehouse schemata. In the data warehousing
context this property appears particularly attractive for the following reasons: A
data warehouse typically integrates information from a variety of heterogeneous informa-
tion sources. In such a scenario, it might already be difficult to extract deltas representing
source changes, let alone to query the sources [GJM96]. Indeed, infor-
mation sources may not provide query interfaces or may not permit ad-hoc queries. Even
if querying of sources is possible then the associated maintenance algorithms can incur
processing delays with undesirable loads on operational systems, and such queries can
cause maintenance anomalies. Clearly, these problems do not occur if the data warehouse
is self-maintainable.
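As a simple illustration (the relation names are hypothetical), a selection view such as σ_{Year=2002}(Order) is self-maintainable: whether an inserted or deleted order tuple belongs to the view can be decided from the update alone. A join view such as Order ⋈ Customer is, in general, not self-maintainable: an order inserted at the source can only be propagated into the view after the matching Customer tuple has been looked up, which requires either a maintenance query against the source or auxiliary information kept in the warehouse.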
So far, we have argued that a data warehouse should be self-maintainable. According to
the terminology introduced by Laurent et al. [LLSV01], self-maintainability is also called
update independence, because a self-maintainable data warehouse is independent of the
information sources as far as the consistent integration of updates is concerned. More-
over, the concept of update independence can be extended to queries as well. Intuitively,
a data warehouse is query-independent if every query to the sources can be answered
using the data warehouse views only. The motivation for enabling data warehouses to
answer queries that could also be posed directly to the sources is similar to the motiva-
tion for update independence, i.e., direct querying of sources may be impossible or too
expensive. Therefore, a data warehouse should not only be update-independent but also
query-independent with respect to the queries that are important to the warehouse users.
3 Data Warehouse Design
In the first phase of data warehouse design the data requirements to be met by the forth-
coming data warehouse have to be analyzed and specified. As opposed to the requirement
analysis performed during traditional database design, data warehouse design aims at the
integration of a number of pre-existing operational data sources. Hence, the schemata
describing these sources form a major input to the requirement analysis.

4 Requirement Analysis
In the course of the requirement analysis, data warehouse designers, warehouse end users,
and business domain experts select relevant data warehouse attributes and define initial
OLAP queries based on the information found in operational database schemata. We ne-
glect a detailed description of how the requirement analysis can be accomplished; instead,
we focus on what the requirement specification should deliver in order to support the
schema design process. To this end, we propose to structure the requirement specification
in terms of a set of initial multidimensional or OLAP queries and a derivation and usage
table, which describes the identified relevant data warehouse attributes.
The multidimensional queries can be seen as a yardstick for the functionality of the data
warehouse under design, while the derivation and usage table contains an informal de-
scription for each identified warehouse attribute, specifies how its values are derived from
the operational databases, and indicates whether the attribute may be used as a measure or as a
dimensional attribute. Afterwards, each identified warehouse attribute has to be classified
as a measure, a dimension level, or a property attribute. Then, to prepare the design
goal of multidimensional normal forms, optional dimension levels must be distinguished
from mandatory ones, and a context of validity has to be identified for each optional level.
In this respect we propose to determine a basis for the functional dependencies among
warehouse attributes, and we show how such a basis helps to identify (a) measures, (b)
certain problems related to dimension levels, and (c) contexts of validity.
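A fragment of such a derivation and usage table might look as follows (all attribute names, source tables, and derivations are purely illustrative):

Attribute          Description                              Derivation (from operational sources)              Usage
Balance            end-of-day balance of an account         Accounts.balance in the banking database           measure
NoOfTransactions   number of postings per account and day   count of Postings rows per account and day         measure
Day                calendar day of the fact                 Postings.date                                      dimension level (mandatory)
Branch             branch managing an account               Accounts.branch; absent for online-only accounts   dimension level (optional; context: branch-managed accounts)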
5 Conceptual Design
6 Logical Design
The conceptual schema as derived according to the previous section is now going to be
transformed into a logical data warehouse schema. For this purpose, we present a trans-
formation process that starts from a set of fact schemata in 3MNF and produces a set of
view definitions which constitute an update-independent data warehouse.
The overall transformation process is sketched in Figure 3. In a first step, a relational
database schema DW is generated, which expresses the information modeled by the con-
ceptual fact schemata on a logical level in terms of relation schemata and foreign key con-
straints. Taking advantage of input fact schemata in 3MNF, null values are avoided in re-
lation schemata representing dimension levels. Afterwards, the resulting database schema
is linked to the operational databases. To this end, a set of UPSJR views W over the op-
erational databases is derived such that the materialization of W implements the database
schema DW . Finally, the views W thus obtained are rendered update-independent by
adding a suitable amount of auxiliary information to W , which ends the logical schema
design process. Roughly, we use a so-called view complement [BS81] as key tool to
derive a suitable amount of auxiliary information that renders the data warehouse update-
independent. For this purpose, expressions to compute complements for UPSJR views are
given. Importantly, these expressions are applicable to a larger class of views than those
offered by earlier approaches and lead at the same time to uniformly smaller complements;
furthermore, the complexity of constructing these expressions is shown to be polynomial
in the size of the schema information, which is in striking contrast to previous approaches,
for which the corresponding construction problem is NP-complete (see also [LV02a]).
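The underlying idea can be seen in a small example (the schema and view are hypothetical): for a source relation R over attributes {K, A, B} with key K and the warehouse view V = π_{K,A}(R), the view C = π_{K,B}(R) is a complement of V, because the source can be reconstructed as R = V ⋈ C; storing C alongside V thus retains exactly the information about R that V discards.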
Besides, we propose a complement-based method to guarantee independence of a set of
views with respect to an arbitrary set of operations, which can be applied to enforce in-
dependence properties of pre-existing warehouses or to support warehouse evolution (see
also [LLSV01]).
Figure 3: Fact schemata and extracted database schemata are transformed into warehouse views (relational view definitions), which are then rendered into an update-independent data warehouse.
7 Update Independence
We are now going to demonstrate how update independence of the data warehouse under
design can be enforced in the course of the logical design phase based on the notion of
view complement.
While it is well-known that single views involving projections or joins are almost never
update-independent [GJM96], we now state the only known necessary condition concern-
ing update independence:
Proposition 2. Let D be a set of relation schemata without foreign keys, and let V be a
set of views over D.
Based on Proposition 2 we pursue the following approach to achieve the design goal of
update independence by means of (a) modifying view definitions and (b) augmenting the
warehouse with auxiliary views. Let D denote a set of (exported) relation schemata, and
let V_0 denote a set of UPSJR views over D defining the warehouse schema.
In view of statement (1) of Proposition 2, we check for each view in V_0 whether it preserves
the keys of all accessed operational relation schemata or not. If a view preserves
all keys (possibly in renamed form) then it remains unchanged. Otherwise, we augment
the projection with a duplicate counter. Let V_1 denote the set of views obtained from V_0
using this transformation.
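The effect of the duplicate counter can be sketched as follows (an illustrative sketch rather than the formal construction; the relation and attribute names are made up): a projection that drops the key cannot decide, upon a deletion at the source, whether its result tuple has to disappear, whereas a counter per result tuple makes deletions locally computable:

from collections import Counter

# Source relation Customer(CustID, City); the warehouse view projects away the key CustID.
customer = {("c1", "Münster"), ("c2", "Münster"), ("c3", "Berlin")}

# View with duplicate counter: for each city, the number of source tuples projecting onto it.
view = Counter(city for _, city in customer)          # Counter({'Münster': 2, 'Berlin': 1})

def apply_delete(view, deleted_tuple):
    """Maintain the counted projection under a source deletion without querying the source."""
    _, city = deleted_tuple
    view[city] -= 1
    if view[city] == 0:                               # last duplicate gone: drop the result tuple
        del view[city]

apply_delete(view, ("c1", "Münster"))
assert view == Counter({"Münster": 1, "Berlin": 1})   # 'Münster' stays because a duplicate remains
apply_delete(view, ("c3", "Berlin"))
assert "Berlin" not in view                           # this decision would otherwise need a source query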
To address statement (2) of Proposition 2, we add complementary views to the warehouse
to ensure that relation schemata involved in joins can be computed from warehouse views.
To this end, we perform the following transformation on each view U_j ∈ V_1. If U_j involves
the union operation, i.e., if U_j is of the form V_j^1 ∪ ... ∪ V_j^{u_j}, where each V_j^i is a PSJR
view, then we change the definition of U_j into a “source-preserving union”. Let V_2 denote
the set of source-preserving UPSJR views obtained from V_1 using the previous transformation.
We now derive a set A_ui of auxiliary views to render V_2 update-independent. For
this purpose, we compute a complement C of V_2 with respect to D. For each relation
schema R that is involved in a join in some view in V_2 we add the complementary views
for R to A_ui.
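The role of the complementary views can be illustrated as follows (again a simplified sketch with made-up relations, not the formal construction): once every relation involved in a join can be reconstructed inside the warehouse, an insertion into one join operand can be installed in the materialized join view by a purely local computation:

# Materialized join view over Account(AccNo, BranchID) and Branch(BranchID, City);
# Branch is reconstructible from warehouse views, so no source query is needed.
branch  = {("b1", "Münster"), ("b2", "Berlin")}
account = {("a1", "b1")}
join_view = {(a, b, city) for (a, b) in account for (b2, city) in branch if b == b2}

def insert_account(new_tuple):
    """Install an insertion into Account in the materialized join without querying the sources."""
    acc_no, branch_id = new_tuple
    account.add(new_tuple)
    for b_id, city in branch:                         # the local copy replaces a maintenance query
        if b_id == branch_id:
            join_view.add((acc_no, branch_id, city))

insert_account(("a2", "b2"))
assert ("a2", "b2", "Berlin") in join_view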
Theorem 3. Using the above notation, let W = V_2 ∪ A_ui be a set of views over D. Then
W is update-independent.
8 Concluding Remarks
This work advances data warehouse schema design beyond a process based on experience
and rules of thumb towards a formal framework where rigorous methods are available to
achieve clearly defined design goals. However, the present approach could be refined and
extended in a number of aspects. First, we have neglected physical schema design, which
is a mandatory part of any database implementation to optimize the logical schema for
performance reasons. Second, meta-data management is largely ignored, and a systematic
study concerning the co-design of warehouse schemata and meta-data repository still needs
to be performed. Next, we have not paid attention to special issues that arise when temporal
data is stored or analyzed, and the design of data warehouses using temporal data models
does not seem to have been studied yet. Finally, we have shown how a data warehouse
can be rendered update-independent based on rewritings that add duplicate counters to
warehouse views and the additional storage of complementary views to deal with updates
concerning join views. Moreover, Proposition 2 suggests that this additional information
is even necessary if join views do not involve projections. In general,
however, it remains an open problem to determine a minimal amount of information that
is necessary to render a set of views update-independent.
References
[BG01] A. Bauer, H. Günzel, eds., Data Warehouse Systeme — Architektur, Entwicklung, An-
wendung, dpunkt.verlag, 2001.
[BLT86] J.A. Blakeley, P. Larson, F.W. Tompa, “Efficiently Updating Materialized Views,” Proc.
ACM SIGMOD 1986, 61–71.
[GJM96] A. Gupta, H.V. Jagadish, I.S. Mumick, “Data Integration using Self-Maintainable
Views,” Proc. 5th EDBT 1996, LNCS 1057, 140–144.
[LS97] H. Lenz, A. Shoshani, “Summarizability in OLAP and statistical databases,” Proc. 9th
SSDBM 1997, 132–143.
[SS77] J.M. Smith, D.C.P. Smith, “Database Abstractions: Aggregation and Generalization,”
ACM TODS 2, 1977, 105–133.