SchemaSQL—An Extension to SQL for
Multidatabase Interoperability
LAKS V. S. LAKSHMANAN
The University of British Columbia
FEREIDOON SADRI
University of North Carolina, Greensboro
and
SUBBU N. SUBRAMANIAN
Tavant Technologies
We provide a principled extension of SQL, called SchemaSQL, that offers the capability of
uniform manipulation of data and schema in relational multidatabase systems. We develop a
precise syntax and semantics of SchemaSQL in a manner that extends traditional SQL syntax
and semantics, and demonstrate the following. (1) SchemaSQL retains the flavor of SQL while
supporting querying of both data and schema. (2) It can be used to transform data in a database
in a structure substantially different from original database, in which data and schema may be
interchanged. (3) It also permits the creation of views whose schema is dynamically dependent on
the contents of the input instance. (4) While aggregation in SQL is restricted to values occurring
in one column at a time, SchemaSQL permits “horizontal” aggregation and even aggregation over
more general “blocks” of information. (5) SchemaSQL provides a useful facility for interoperability
and data/schema manipulation in relational multidatabase systems. We provide many examples
to illustrate our claims. We clearly spell out the formal semantics of SchemaSQL that accounts for
all these features. We describe an architecture for the implementation of SchemaSQL and develop
implementation algorithms based on available database technology that allows for powerful
integration of SQL based relational DBMS. We also discuss the applicability of SchemaSQL for
handling semantic heterogeneity arising in a multidatabase system.
Categories and Subject Descriptors: H.2.3 [Database Management]: Languages—query languages; H.2.4 [Database Management]: Systems—relational databases; query processing; H.2.5
[Database Management]: Heterogeneous Databases
L. V. S. Lakshmanan’s work was supported by a grant from the Natural Science and Engineering
Research Council of Canada (NSERC).
F. Sadri’s work was supported by a grant from the National Science Foundation (NSF).
Authors’ addresses: L. V. S. Lakshmanan, Department of Computer Science, The University of
British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, Canada, e-mail: laks@cs.ubc.ca;
F. Sadri, Department of Mathematical Sciences, University of North Carolina, Greensboro, NC
27410, e-mail: sadri@uncg.edu; S. N. Subramanian, Tavant Technologies, 542 Lakeside Drive #5,
Sunnyvale, CA 94085, e-mail: subbu@tavant.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or initial screen of a display along with
the full citation. Copyrights for components of this worked owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permission may be requested from Publications Dept., ACM, Inc., 1515
Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or permissions@acm.org.
C 2001 ACM 0362-5915/01/1200-0476 $5.00
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001, Pages 476–519.
SchemaSQL
•
477
General Terms: Algorithms, Languages
Additional Key Words and Phrases: Information integration, multidatabase systems, restructuring
views, SchemaSQL, schematic heterogeneity
1. INTRODUCTION
Significant strides in relational database technology have resulted in the autonomous creation and proliferation of database systems and applications,
tailored to specific needs and user communities. A single organization often
contains a large number of databases designed, created, and maintained by numerous groups independently of each other. These databases can be on different
operating systems, different hardware, different platforms, and use different
data models. The need to share information, within a single organization as
well as across different organizations, has long been identified.
There are several applications that create a need for sharing data and programs across the different databases. An example is a large organization whose
operations may be divided on functional or departmental lines. Therefore, it is
natural to have independent database systems that cater to different functional
units and handle different organizational tasks. Normally, such an organization maintains separate databases for purchase, payroll, patents, and human
resources. Suppose the organization decides to reward employees who have
filed and successfully obtained more than n patents, for some suitable number
n. Given the size of the organization, it would like to have this process automated. If all relevant information were available in one database, this would
amount to implementing some kind of trigger, a functionality readily supported
in most commercial DBMSs. On the other hand, the reality is that information
pertinent to applications is often scattered across many database systems, on
a multitude of platforms, with different DBMSs and even data models. Given
the popularity of relational databases, it is reasonable to expect that in a significant number of cases, the database systems holding information of interest
are relational. A challenge to developing such applications is one of being able
to express and efficiently compute “cross-queries” that relate information in
different databases.
The issue of interoperability among databases has received considerable attention as a result of such a need. Interoperability is the ability to uniformly
share, interpret, query, and manipulate information across many component
databases. A multidatabase system (MDBS) provides such interoperability
across a distributed network encompassing a heterogeneous mix of computers, operating systems, communication links, and local database systems. The
terms, heterogeneous database systems and federated database systems, have
also been used for these systems, sometimes with slight differences in the
intended meaning. We use the term, multidatabase system, in this article.
Some commercial systems (e.g., IBM DataJoiner [IBM], UniSQL/M [Kelley
et al. 1995]) have already appeared on the market and significant research
has been conducted in this area but there are many issues that still remain
unresolved. For surveys on MDBS, see ACM [1990] (in particular, Sheth and
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
478
•
L. V. S. Lakshmanan et al.
Larson [1990] and Litwin et al. [1990]), Hsiao [1992], and Elmagarmid et al.
[1998].
Almost all factors of heterogeneity in a MDBS pose challenges for interoperability. These factors can be classified into semantic issues (e.g., interpreting
and cross-relating information in different local databases), syntactic issues
(e.g., heterogeneity in database schemas, data models, and in query processing, etc.), and systems issues (e.g., operating systems, communication protocols,
consistency management, security management, etc.). We focus on syntactic
and query language issues in this article. We consider the problem of interoperability among a number of component relational databases storing semantically similar information in structurally dissimilar ways. As was pointed out
in Krishnamurthy et al. [1991], even in this case, the requirements imposed by
interoperability are beyond the functionalities provided by conventional query
languages like SQL.
Some of the key features required of a language for interoperability in a
(relational) MDBS are the following:
(1) The language must have an expressive power that is independent of the
schema used by the database. For instance, in most conventional relational
languages, some queries (e.g., “find all department names”), expressible
against the database univ-A in Figure 2, Section 2 are no longer expressible
when the information is reorganized according to the schema of, say, univ-B
there. This is undesirable and should be avoided.
(2) To promote interoperability, the language must permit the restructuring
of one database (e.g., the database univ-A in Figure 2) to conform to the
schema of another (say, that of univ-B in that figure).
(3) The language must be easy to use and yet sufficiently expressive.
(4) The language must provide full data manipulation and view definition
capabilities, and must be downward compatible with SQL, in the sense
that it must be compatible with SQL syntax and semantics. This is
a must given the importance and popularity of SQL in the database
world.
(5) Finally, the language must admit effective and efficient implementation.
In particular, it must be possible to realize a non-intrusive implementation
that would require minimal additions to component RDBMS.
In this article, we propose an extension to SQL that meets all the criteria
above. Specifically, our contributions are the following:
—We propose a language called SchemaSQL, which meets the above criteria.
We review the syntax and semantics of SQL, and develop SchemaSQL as a
principled extension of SQL (Section 2). As a result, for a SQL user, adapting
to SchemaSQL is relatively easy.
—We study the semantics of SchemaSQL in depth (Sections 3 and 4), and
illustrate the following powerful features of SchemaSQL: (i) uniform manipulation of data and meta-data; (ii) creating restructured views and the
ability to dynamically create output schemas (e.g., univ-B in Figure 2); and
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
479
Fig. 1. Syntax of simple SQL queries.
(iii) the ability to express sophisticated aggregate computations far beyond
those expressible in conventional languages like SQL.
—We propose an implementation architecture for SchemaSQL that is designed
to build on existing RDBMS technology, and requires minimal additions to
it, while greatly enhancing its power (Section 5). We provide an implementation algorithm for SchemaSQL, and establish its correctness (Theorem 5.1
and 5.2).
—We sketch novel query optimization opportunities that arise in the context of implementing SchemaSQL. We develop a lightweight extension to
SchemaSQL for addressing semantic heterogeneity in MDBS in an elegant
way (Section 6).
—We provide a comparison with related work. Section 7 discusses related work
and places our work in context. Since its proposal, it has been observed by
researchers that SchemaSQL can provide useful enhancement to the functionality of SQL even in the context of single database systems for such applications as database publishing on the web [Miller 1998], query optimization
in a data warehouse [Subramanian and Venkataraman 1998], and scalable
classification algorithms in data mining [Wang et al. 1998]. We highlight
these in Section 7.
—Finally, Section 8 summarizes the paper and discusses future work.
2. SYNTAX
Our goal is to develop SchemaSQL as a principled extension of SQL. To this
end, in this section, we briefly analyze the syntax of SQL, and then develop the
syntax of SchemaSQL as a natural extension. Later, in Section 3, we review the
semantics of SQL and obtain the semantics of SchemaSQL as a simple extension. Our discussion in these sections is itself a novel way of viewing the syntax
and semantics of SQL, which, in our opinion, helps a better understanding of
SQL subtleties.
In an SQL query, the (tuple) variables are declared in the from clause. A
variable declaration has the form <range> <var>. For example, in the query in
Figure 1(a), the expression emp T declares T as a variable that ranges over the
(tuples of the) relation emp (in the usual SQL jargon, these variables are called
aliases). The select and where clauses refer to (the extension of) attributes,
where an attribute is denoted as <var>.<attName>, <var> being a (tuple) variable declared in the from clause, and <attName> being the name of an attribute
of the relation over which var ranges.
When no ambiguity arises, SQL permits certain abbreviations. Queries of
Figure 1(b,c) are equivalent to the first one, and are the most common ways
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
480
•
L. V. S. Lakshmanan et al.
such queries are written in practice. Note that in Figures 1(b) and 1(c), emp acts
essentially as a tuple variable.
The SchemaSQL syntax extends that of SQL in several directions.
(1) The federation consists of databases, with each database containing relations. The syntax allows to distinguish between (the components of) different databases.
(2) To permit meta-data queries and restructuring views, SchemaSQL permits
the declaration of other types of variables in addition to the (tuple) variables
permitted in SQL.
(3) Aggregate operations are generalized in SchemaSQL to make horizontal
and block aggregations possible, in addition to the usual vertical aggregation in SQL.
In this section, we concentrate on the first two aspects. Restructuring views
and aggregation are discussed in Section 4.
In SchemaSQL, we refer to relation rel in database db as db::rel, thus
distinguishing between relations in different databases.
2.1 Variable Declarations in SchemaSQL
SchemaSQL permits the declaration of variables that can range over any of
the following five sets: (i) names of databases in a federation; (ii) names of the
relations in a database; (iii) names of the attributes in the scheme of a relation; (iv) tuples in a given relation in a database; and (v) values appearing
in a column corresponding to a given attribute in a relation. Variable declarations follow the same syntax as <range> <var> in SQL, where var is any
identifier. However, there are two major differences. (1) The only kind of range
permitted in SQL is a set of tuples in some relation in the database, whereas
in SchemaSQL any of the five kinds of ranges above can be used to declare
variables. (2) More importantly, the range specification in SQL is made using a
constant, that is, an identifier referring to a specific relation in a database. By
contrast, the diversity of ranges possible in SchemaSQL permits range specifications to be nested, in the sense that it is possible to say, for example, that X
is a variable ranging over the relation names in a database D, and that T is a
tuple in the relation denoted by X. These ideas are made precise in the following
definition:
Definition 2.1 [Range Specifications]. The concepts of range specifications, constant, and variable identifiers are simultaneously defined by mutual recursion as follows:
(1) Range specifications are one of the following five types of expressions,
where db, rel, attr are any constant or variable identifiers (defined in (2)
below).
(a) The expression -> denotes a range corresponding to the set of database
names in the federation.
(b) The expression db-> denotes the set of relation names in the database
db.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
481
Fig. 2. Representing similar information using different schemas in multiple databases univ-A,
univ-B, univ-C, and univ-D.
(c) The expression db::rel-> denotes the set of names of attributes in the
scheme of the relation rel in the database db.1
(d) db::rel denotes the set of tuples in the relation rel in the database db.
(e) db::rel.attr denotes the set of values appearing in the column named
attr in the relation rel in the database db.
(2) A variable declaration is of the form <range> <var> where <range> is one
of the range specifications above and <var> is an identifier. An identifier <var> is said to be a variable if it is declared as a variable by an
expression of the form <range> <var> in the from clause. Variables declared over the ranges (a) to (e) are called db-name, rel-name, attr-name,
tuple, and domain variables, respectively. Any identifier not so declared is a
constant.
As an illustration of the idea of nesting variable declarations, consider the
clause from db1-> X, db1::X T. This declares X as a variable ranging over
the set of relation names in the database db1 and T as a variable ranging over
the tuples in each relation X in the database db1.
The following sections provide several examples demonstrating various capabilities of SchemaSQL.
The following federation of databases is used as our running example. Consider the federation consisting of four databases, univ-A, univ-B, univ-C,
and univ-D. Figure 2 shows some sample data in each of these four databases.
Each database has (one or more) relation(s) that record(s) the salary floors for
employees by their categories and their departments as follows:
1 The
intuition for the notation is that we can regard the attributes of a relation as written to the
right of the relation name itself!
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
482
•
L. V. S. Lakshmanan et al.
• univ-A has a relation salInfo (category, dept, salFloor).
• univ-B has a relation salInfo (category, dept1, dept2, . . . ). Note that the
domains of dept1, dept2, . . . are the same as the domain of salFloor in
univ-A::salInfo.
• univ-C has one relation for each department with the scheme depti
(category, salFloor).
• univ-D has a relation salInfo (dept, cat1, cat2, . . . ). Note that the domains of attributes cat1, cat2, . . . are the same as the domain of salFloor
in univ-A::salInfo.
Example 2.1 [Comparing salaries in univ-A and univ-B]. List the departments in univ-A that pay a higher salary floor to their technicians compared
with the same department in univ-B.
(Q1)
select
from
where
A.dept
univ-A::salInfo A,
univ-B::salInfo B,
univ-B::salInfo-> AttB
AttB
<> "category"
A.dept
= AttB
A.category = "technician"
B.category = "technician"
A.salFloor > B.AttB
and
and
and
and
Explanation. Variables A and B are (SQL-like) tuple variables ranging over
the relations univ-A::salInfo and univ-B::salInfo, respectively. The variable
AttB is declared as an attribute name of the relation univ-B::salInfo. It is
intended to be a depti attribute (hence, the condition AttB <> "category" in
the where clause). The rest of the query is self-explanatory.
Example 2.2 [Comparing salaries in univ-C and univ-D]. List the departments in univ-C that pay a higher salary floor to their technicians compared
with the same department in univ-D.
(Q2)
select
from
where
RelC
univ-C-> RelC,
univ-C::RelC C,
univ-D::salInfo D
RelC
= D.dept
and
C.category = "technician" and
C.salFloor > D.technician
Explanation. The variable RelC is declared as a relation name in the
database univ-C. Note that in this database there is one relation per department, and the relation name coincides with department name. Variable C is
then declared as a tuple variable on this (variable) relation RelC. The variable
D is an (SQL-like) tuple variable ranging over the relation univ-D::salInfo.
Note that in univ-D::salInfo categories are represented by attribute names,
whose domains consist of the salary floors of the corresponding category. Hence,
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
483
D.technician is the salary floor for category technician (for the tuple represented by tuple variable D).
3. SEMANTICS I: FIXED OUTPUT SCHEMA
In this section, we discuss SchemaSQL queries with a fixed output schema. The
topics of dynamic output schema and restructuring views are discussed in the
next section.
We first quickly review the semantics of SQL and express it in a manner that
makes it possible to realize the semantics of SchemaSQL by a simple extension.
3.1 SQL Semantics Reviewed
A query in SQL assumes a fixed scheme for the underlying database, and maps
each database to a relation over a fixed scheme, called the output scheme associated with the query. Let D be the set of all database instances over a fixed
scheme. Let a query Q be of the form2
select
from
where
group by
having
attrList, aggList
fromList
whereConditions
groupbyList
havingConditions
Let R be the set of all relations over the output scheme of the query Q. The
query Q induces a function
Q : D→R
from databases to relations over a fixed scheme, defined as follows: Let D ∈ D
be an input database, and T D the set of all tuples appearing in any relation in
D. Let τ be the set of tuple variables occurring in Q. We define an instantiation
as a function ı : τ → T D which instantiates each tuple variable in Q to some
tuple over its appropriate range. The conditions whereConditions in the where
clause induce a Boolean function, denoted satw (ı, Q), on the set of all instantiations, reflecting whether the conditions are satisfied by an instantiation. This
is defined in the obvious manner. Let I Q = {ı | ı is an instantiation for which
satw (ı, Q) = true} denote the set of instantiations satisfying whereConditions.
The query assembles each satisfying instantiation into a tuple for the answer
relation, via a tuple assembly function, defined below. Let TattrList denote the
set of all tuples over the scheme attrList such that each value in each tuple appears in the database D. Then the tuple assembly function is a function
tuple Q : I Q → TattrList defined as follows:
O
tuple Q (ı) =
ı(t)[A].
“t.A”∈attrList
Here, the predicate “t.A” ∈ attrList indicates the condition that the attribute
denotation t.A literally appears in the list of attributes attrList in the select
2 In
this article, we restrict attention to single block SchemaSQL queries. In keeping with this, we
only consider single block SQL queries here.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
484
•
L. V. S. Lakshmanan et al.
N
statement. The symbol
denotes concatenation, and ı(t)[A] denotes the restriction of the tuple ı(t) to the attribute A. For an instantiation ı, tuple Q (ı)
produces a tuple over the attributes attrList listed in the select statement.
Suppose Q is a regular query, that is, a query without aggregation. In this case,
the aggList is empty and the having and group by clauses are absent. So, the
result of an SQL query without aggregation is captured by the function
Q(D) = [tuple Q (ı) | ı ∈ I Q ].
Note that SQL has a multiset semantics. We use [· · ·] instead of {· · ·} to denote
multisets.
To account for aggregation, we need the following extension:
Definition 3.1 [Equivalence Relation Induced by group by Clause]. For ı,
∈ I Q , ı ∼ iff ∀ “t.A” ∈ groupbyList, ı(t)[A] = (t)[A]. It is easy to see that
∼ is an equivalence relation on I Q . In words, two instantiations (satisfying
the conditions in the where clause) are ∼-equivalent provided they agree on all
attributes appearing in the group by clause.
The conditions havingConditions in the having clause are a Boolean combination of atomic conditions of the form agg(t.att) relOp c and of the form
agg1(t.att1) relOp agg2(t.att2). Intuitively, we are only interested in those
equivalence classes of ∼ that satisfy havingConditions. For an equivalence
class e of ∼, let sath (e, Q) = true denote that e satisfies havingConditions. Note
that, in practice, checking whether sath (e, Q) = true for an equivalence class
e requires the calculation of any additional aggregations that may appear in the
having clause but not necessarily in the select clause. We have the following:
Definition 3.2 [Valid Equivalence Classes].
E Q = {e | e is an equivalence class of ∼ and sath (e, Q) = true}.
Let TaggList denote the set of all tuples over the scheme aggList. We define a
function aggregate Q : E Q → TaggList as follows.
O
agg([ (t)[B] | e ∈ E Q and ∈ e]).
aggregate Q (e) =
“agg(t.B)”∈AggList
For a given equivalence class e ∈ E Q , aggregate Q considers all instantiations in
e, and, for each aggregate operation, say agg, indicated on the attribute t.B in
aggList, it performs the operation agg on the multiset of values associated with
this attribute by instantiations in e. Again, we use [· · ·] to denote multisets.
Now, we are ready to describe the tuple assembly associated with aggregate
queries. Let Q be a query involving aggregation. Define a function aggtuple Q :
E Q → TattrList × TaggList as follows:
O
aggregate Q (e),
aggtuple Q (e) = tuple Q (ı)
where ı is any instantiation in e. Note that SQL requires the set of attributes
in attrList to be a subset of those in groupbyList. Hence, tuple Q (ı) is the
same for all instantiations ı ∈ e, and thus aggtuple Q (e) is well defined, for any
equivalence class e in E Q .
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
485
Finally, the result of an aggregate SQL query is captured by the function
Q(D) = [aggtuple Q (e) | e ∈ E Q ].
3.2 Semantics of SchemaSQL Queries
The semantics of SchemaSQL is obtained as a natural extension of that of SQL.
A SchemaSQL query Q is of the form:
select
from
where
group by
having
itemList, aggList
fromList
whereConditions
groupbyList
havingConditions
where itemList is a list of db-name, rel-name, attr-name, and domain variables;
aggList is a list of expressions of the form agg(X) where agg is an aggregate
function, and X is a db-name, rel-name, attr-name, or domain variable; fromList
is a list of variable declarations; groupbyList is a list of db-name, rel-name,
attr-name, and domain variables; and the conditions in the where and having
clauses are analogous to SQL. The main difference with an SQL query is the
availability of additional variable types, in addition to the usual SQL tuple
variables.
Let D be the set of all federation database instances. Let R be the set of
all relations over the output scheme of the query Q. The query Q induces a
function
Q : D→R
from federations to relations, defined as follows. Let D ∈ D be an input federation, and O D the set of all items (database names, relation names, attribute
names, tuples, and values) appearing in D. Let V be the set of variables occurring in Q. We define an instantiation as a function ı : V → O D which instantiates each variable in Q to some item over its appropriate range. Throughout
this article, we assume that any instantiation ı is extended in such a way that for
a literal constant c, ı(c) = c. In defining the semantics of SchemaSQL queries,
we find the following definitions useful. Identifiers in typewrite font (e.g., db)
can be constants or variables.
Definition 3.3 [Admissibility]. An instantiation ı is admissible provided, it
satisfies the following conditions:
—whenever -> D is a declaration in the from clause, ı(D) is the name of a
database in the federation.
—whenever db-> R is a declaration in the from clause, ı(R) is the name of a
relation in the database ı(db).
—whenever db::rel-> A is a declaration in the from clause, ı(A) is an attribute
name in the relation ı(rel) in the database ı(db).
—whenever db::rel T is a declaration in the from clause, ı(T) is a tuple in the
relation ı(rel) in the database ı(db).
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
486
•
L. V. S. Lakshmanan et al.
—whenever db::rel.attr V is a declaration in the from clause, ı(V) is a value
that appears in the column ı(attr) of the relation ı(rel) in the database ı(db).
—whenever T.attr V is a variable declaration in the from clause, ı(V) =
ı(T)[ı(attr)].
Definition 3.3 precisely captures the notion of an appropriate range for a
variable.
Definition 3.4 [Validity]. Let sat(ı, Q) be a Boolean function on the set of all
instantiations, induced by the conditions in the where clause. An instantiation
ı is valid provided (a) it is admissible, and (b) sat(ı, Q) is true.
The conditions in the where clause as well as additional conditions induced
by the presence of certain patterns involving tuple variables is captured in
Definition 3.4. We now define,
I Q = {ı | ı is a valid instantiation}.
(1)
The query assembles each satisfying instantiation into a tuple for the answer
relation, as follows. Let TitemList denote the set of all tuples over the scheme
itemList such that each value in each tuple appears in the federation D. Then
the tuple assembly function is a function tuple Q : I Q → TitemList defined as
follows:
O
tuple Q (ı) =
ı(s),
(2)
s∈itemList
where s is a db-name, rel-name, attr-name, or domain variable. For an instantiation ı, tuple Q (ı) produces a tuple over the list of objects itemList listed in
the select statement.
3.2.1 Queries with Fixed Output Schema and No Aggregation. Suppose Q
is a SchemaSQL query without aggregation. In this case, the result of the query
is captured by the function
Q(D) = [tuple Q (ı) | ı ∈ I Q ].
(3)
Similar to SQL, SchemaSQL’s semantics is based on multisets. Multisets are
distinguished from sets with the use of [. .] instead of {. .}.
It is not hard to see that the formal semantics captured by these definitions
exactly correspond to the intuitive semantics discussed earlier for Queries Q1
and Q2 of Section 2.
3.3 Aggregation with Fixed Output Schema
In SQL, we are restricted to “vertical” (or column-wise) aggregation on a predetermined set of columns, while SchemaSQL allows “horizontal” (or row-wise)
aggregation, and also aggregation over more general “blocks” of information.
Before we illustrate these points with examples, we provide a formal development of the semantics.
3.3.1 Semantics of Aggregation with Fixed Output Schema. Let Q be a
SchemaSQL query involving aggregation. Similar to the development of SQL
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
487
semantics, we define the equivalence relation ∼ on the instantiations I Q , and
the set E Q of equivalence classes of ∼ that satisfy the havingConditions for
SchemaSQL queries.
Definition 3.5 [Equivalence Relation Induced by group by Clause]. For ı,
∈ I Q , ı ∼ iff ∀ “v” ∈ groupbyList, ı(v) = (v). It is straightforward to see
that ∼ is an equivalence relation on I Q . Intuitively, two instantiations are
∼-equivalent provided they agree on all variables appearing in the group by
clause. We denote by E Q the set of equivalence classes of I Q under ∼.
Definition 3.6 [Valid equivalence classes].
E Q = {e | e is an equivalence class of ∼ and sath (e, Q) = true},
(4)
where sath (e, Q) = true means equivalence class s satisfies the havingConditions.
Let TaggList denote the set of tuples over the scheme aggList. We define a
function aggregate Q : E Q →TaggList as follows:
O
agg([ (v) | e ∈ E Q and ∈ e]).
(5)
aggregate Q (e) =
“agg(v)”∈aggList
For a given equivalence class e ∈ E Q , aggregate Q considers all instantiations
in e, and, for each aggregate operation, say agg, indicated on the variable v in
aggList, it performs the operation agg on the multiset of values associated with
this variable by instantiations in e.
Let Q be a query involving aggregation. We define the tuple assembly function aggtuple Q : E Q → TitemList × TaggList as follows:
O
aggregate Q (e),
(6)
aggtuple Q (e) = tuple Q (ı)
where ı is any instantiation in e. In SchemaSQL, we require the set of variables
in itemList to be a subset of those in groupbyList. Hence, tuple Q (ı) is the
same for all instantiations ı ∈ e, and thus aggtuple Q (e) is well defined, for any
equivalence class e in E Q .
Finally, the result of a SchemaSQL query with aggregation is captured by
the function
Q(D) = [aggtuple Q (e) | e ∈ E Q ].
(7)
We now provide some examples illustrating aggregation in SchemaSQL.
Example 3.1 [Horizontal Aggregation and Aggregation through Multiple
Relations]. The query
(Q3)
select
from
where
group by
T.category, avg(T.D)
univ-B::salInfo-> D,
univ-B::salInfo T
D <> "category"
T.category
computes the average salary floor of each category of employees over all departments in univ-B. This captures horizontal aggregation. The condition D <>
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
488
•
L. V. S. Lakshmanan et al.
“category” enforces the variable D to range over department names. Hence, a
knowledge of department names (and even the number of departments) is not
required to express this query. Alternatively, we could enumerate the departments, that is, using the condition (D = “Math” or D = “CS” or . . .).3 By contrast,
the query
(Q4)
select
from
group by
T.category, avg(T.salFloor)
univ-C-> D,
univ-C::D T
T.category
computes a similar information from univ-C. Notice that the aggregation is
computed over a multiset of values obtained from several relations in univ-C. In
a similar way, aggregations over values collected from more than one database
can also be expressed. Block aggregations of a more sophisticated form are
illustrated in Example 4.3.
4. SEMANTICS II: DYNAMIC OUTPUT SCHEMA AND RESTRUCTURING
VIEWS
The result of an SQL query (or view definition) is a single relation. Our discussion in the previous section was limited to the fragment of SchemaSQL queries
that produce one relation, with a fixed schema, as output. In this section, we
provide examples to demonstrate the following capabilities of SchemaSQL: (i)
declaration of dynamic output schema, (ii) restructuring views, and (iii) interaction between dynamic output schema creation and aggregation.
We illustrate the capabilities of SchemaSQL for the generation of an output schema which can dynamically depend on the contents of the input instance (i.e., the databases in the federation). While aggregation in SQL is restricted to vertical aggregation on a predetermined set of columns, we have
so far seen that SchemaSQL can express horizontal aggregation and aggregation over more general “blocks” (see Example 3.1, Q3 and Q4). In this section, we shall see that the combination of dynamic output schema and metadata variables, namely, db-name, rel-name, and attr-name variables, allows
us to express more powerful aggregations such as vertical aggregation on a
variable number of columns and aggregation on a variable number of blocks
as well.
4.1 Restructuring Without Aggregation
We first illustrate the ideas and expressive power of SchemaSQL for performing
restructuring, using examples. Formal development will follow.
Example 4.1 [Restructuring univ-B Database into the Schema of univ-A
Database]. Consider the relation salInfo in the database univ-B. The
3 An
elegant solution would be to specify some kind of “type hierarchy” for the attributes, which
can then be used for saying “D is an attribute of the following kind”, rather than “D is one of the
following attributes”. Our proposed extension to SchemaSQL, discussed in Section 6.2, addresses
this issue.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
489
following SchemaSQL view definition restructures this information into the
format of the schema of univ-A::salInfo.
(Q5)
create view
select
from
where
BtoA::salInfo(category, dept, salFloor) as
T.category, D, T.D
univ-B::salInfo-> D,
univ-B::salInfo T
D <> ‘category’
Explanation. Two variables are declared in the from clause: T is a tuple variable ranging over the tuples of relation univ-B::salInfo, and D is an attributename variable ranging over the attributes of univ-B::salInfo. The condition
in the where clause forces D to be a department name. Finally, each output tuple
(T.category,D,T.D) lists the category, department name, and the corresponding salary floor (which is in the format of univ-A::salInfo).
Note that corresponding to each tuple in the univ-B::salInfo format, Q(5)
generates several tuples in the univ-A::salInfo scheme. The mapping, in this
respect, is one-to-many. But each instantiation of the variables in the query,
actually contributes to one output tuple.
The following example illustrates restructuring involving dynamic creation
of output schema.
Example 4.2 [Restructuring univ-A database into the schema of univ-B
database]. This view definition restructures data in univ-A::salInfo into the
format of the schema univ-B::salInfo.
(Q6)
create view
select
from
AtoB::salInfo(category, D) as
A.category, A.salFloor
univ-A::salInfo A,
A.dept D
Explanation. Each tuple of univ-A::salInfo contains the salary floor for
one category in a single department, while each tuple of univ-B::salInfo contains the salary floors for one category in every department. Intuitively, all
tuples in univ-A::salInfo corresponding to the same category are grouped together and “merged” to produce one output tuple.
Another aspect of this restructuring view is the use of variables in the create
view clause. The variable D in create view AtoB::salInfo(category, D) is declared as a domain variable ranging over the values of the dept attribute in the
relation univ-A::salInfo. Hence, the schema of the view AtoB::salInfo is “dynamically” declared as AtoB::salInfo(category, dept1, . . . , deptn), where
dept1, . . . , deptn are the values occurring in the dept column in the relation
univ-A::salInfo.
The restructuring in this example corresponds to a many-to-one mapping
from instantiations to output tuples.
As demonstrated by the previous examples, the semantics of restructuring
in the context of a dynamically declared output schema has two aspects to
it: (i) the determination of the output schema itself, and (ii) the formatting of
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
490
•
L. V. S. Lakshmanan et al.
data to conform to the schema determined in (i). In this section, we formalize
these concepts. We illustrate our development of the semantics by revisiting
Example 4.2.
Let a SchemaSQL query Q be a view definition of the form
create view
select
from
where
db::rel(attr1,..., attrn) as
obj1,..., objn
fromList
whereConditions
where db, rel are constants or db-name, rel-name, att-name, or domain variables; attr1, . . . , attrn are all constants or one of them is a db-name, rel-name,
att-name, or domain variable and the rest are constants; and obj1, . . . , objn are
db-name, rel-name, att-name, or domain variables.
We first define some useful notions. Recall that I Q is the set of valid instantiations (of the variables declared in the from clause) that satisfy the conditions
in the where clause, as defined in Section 3.2.
Definition 4.1 [Instantiations contributing to the same output relation]. For
two instantiations ı, ∈ I Q , we define ı ≡ , provided ı(db) = (db) and ı(rel) =
(rel). Clearly ≡ is an equivalence relation.
Determination of Output Schema. Each instantiation ı ∈ I Q produces a view
of a database namely, ı(db), containing a relation named ı(rel), whose scheme
consists of the attribute set attrset Q (ı) = { (attr) | attr ∈ {attr1, . . . , attrn}, ∈
I Q , ≡ ı}. Thus, each ≡-equivalence class of instantiations defines one relation
scheme in the output view. For example, in Example 4.2, all instantiations ı ∈
I Q are ≡-equivalent, and this one equivalence class produces a view containing
a database called AtoB, containing one relation named salInfo, with the scheme
{category, CS, Math, . . .}.
Formatting Data to Fit the Schema. There are two aspects to this. First,
the output computed by the SchemaSQL query defining the view has to be
properly allocated to conform to the output schema declared in the create view
statement. This by itself might in general result in null values, which we can
eliminate by identifying maximal subsets of “related” tuples and “merging”
them. These ideas are made precise below.
As seen above, an instantiation ı ∈ I Q contributes to a view of a database
ı(db) containing a relation named ı(rel) with a scheme given by attrset Q (ı).
The instantiations ≡-equivalent to ı contribute to a relation, allocate Q (ı), over
the attribute set attrset Q (ı), as follows. For each instantiation ∈ I Q such that
≡ ı, allocate Q (ı) contains a tuple t, defined as follows. Let A ∈ attrset Q (ı).
Then
½
(objk),
whenever A = (attrk)
t[A] =
null ,
otherwise.
Figure 3(i) shows allocate Q6 (ı) for the view (Q6) defined in Example 4.2, where
ı is any instantiation (recall all of them are ≡-equivalent).
Secondly, merging of tuples in allocate Q (ı) is formalized as follows: Let DOM
denote the union of all domains of all attributes of all relations involved in the
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
491
Fig. 3. (i) The relation allocate Q6 (ı) and (ii) the final result after merging.
federation, together with the null value, null. Define a partial order on DOM, by
setting null ≤ v, ∀v ∈ DOM. In particular, note that any two distinct non-null
values are incomparable. The least upper bound, lub, of two values in DOM is
defined in the obvious way.
if v ≤ u
u,
if u ≤ v
lub(u, v) = v,
undefined, otherwise.
We now have the following:
Definition 4.2 [Merging a Pair of Tuples]. Two tuples t1 , t2 over a relation
scheme R = {A1 , . . . , An } are mergeable provided for each i = 1, . . . , n, either
t1 [Ai ] = t2 [Ai ], or at least one of t1 [Ai ] or t2 [Ai ] isJa null. Suppose t1 and t2
are mergeable. Then their merge, denoted t = t1 t2 , is defined as t[Ai ] =
lub(t1 [Ai ], t2 [Ai ]), i = 1, . . . , n.
J
Clearly, the operator
is commutative and associative, and it can be easily
extended to any
set
of
mergeable
tuples. It will be convenient below to extend
J
the operator
to any relation containing an arbitrary (i.e., not necessarily
mergeable) set of tuples. The idea is to partition the relation into sets of mergeable tuples, and merge the tuples in each partition. Notice that mergeability is
not a transitive relation: for example, tuple t1 = (a, ⊥) is mergeable with each of
the tuples t2 = (a, b) and t3 = (a, b′ ), which themselves are not mergeable. Thus,
in partitioning a relation into sets of mergeable tuples, choice arises in grouping
tuples. From the point of view of the meaning of the tuples, the choices may be
made arbitrarily. So, for instance, we may merge t1 and t2 above (and leave t3
alone) or merge t1 and t3 . While the results look physically different, the meaning is the same. Intuitively, we need any “maximal” partition of a relation that
respects mergeability, formalized below.
Definition 4.3 [Maximal Partitions]. A partition P = {r1 , . . . , rk } of a relation r is valid provided for each block ri , the tuples in ri are pairwise mergeable,
1 ≤ i ≤ k. Define a partial order ≤ on partitions as follows:4 P1 ≤ P2 provided
∀ri ∈ P1 : ∃s j ∈ P2 : ri ⊆ s j . A partition P is a maximal valid partition provided
for every valid partition P ′ : P ≤ P ′ ⇒ P ′ = P.
4 This
is the dual of the standard refinement ordering on partitions.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
492
•
L. V. S. Lakshmanan et al.
The following easily established result reveals the significance of maximal
valid partitions.
LEMMA 4.1 [MAXIMAL PARTITIONS]. Let r be a relation and P be any maximal
valid partition of r. Then the following holds: (i) for each block ri ∈ P, the tuples
in ri are pairwise mergeable; (ii) for any two distinct blocks ri , r j ∈ P, the tuples
ri ∪ r j are not pairwise mergeable, that is, there is a pair of tuples t, t ′ ∈ ri ∪ r j
such that they are not mergeable.
We suppress the obvious proof. Finally, we can define the result of applying
a merge operator to a relation as follows:
Definition 4.4 [Merging a Relation]. Let
and P = {r1 , . . . , rk }
Jr be a relation
J
be any maximal valid partition of r. Then,
r = { (ri ) | ri ∈ P }.
Finally, we can define the semantics of view definitions in SchemaSQL without aggregation as follows:
Definition 4.5 [Semantics of Restructuring Views without Aggregation]. Let
Q be the SchemaSQL query that defines a view V . Then the materialization
of V consists of one relation for each ≡-equivalence class of instantiations in
I Q . For an equivalence
class [ı], for any ı ∈ I Q , the corresponding relation is
J
determined by
allocate Q (ı).
As an example, the final output produced by the view definition (Q6) in
Example 4.2 is a view of a database A2B containing a relation
salInfo(category, CS, Math) as shown in Figure 3(ii).
4.2 Aggregation with Dynamic View Definition
In Section 3, we illustrated the capability of SchemaSQL for computing (i)
horizontal aggregation and (ii) aggregation over blocks of information collected
from one or more relations, or even databases. In this section, we shall see that
when SchemaSQL aggregation is combined with its view definition facility, it
is possible to express vertical aggregation over a variable number of columns
or blocks of data, determined dynamically by the input instance. The following
examples illustrate this point.
Example 4.3 [Block Aggregation: Fixed and Dynamic Output Schemas].
Suppose in the database univ-D in Figure 2, there is an additional relation
faculty(dname, fname) relating each department to its faculty, such as, (math,
arts and sciences), (physics, arts and sciences), (cs, engineering and comp sci),
etc.
The first query uses a fixed output schema. The second query is identical
to the first except for a create view expression that defines a dynamic output
schema:
(Q7)
select
from
U.fname, avg(T.C)
univ-D::salInfo-> C,
univ-D::salInfo T,
univ-D::faculty U
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
where
group by
•
493
C <> "dept" and
T.dept = U.dname
U.fname
Q7 computes, for each faculty, the faculty-wide average floor salary of all employees (over all departments) in the faculty. Notice that the aggregation is
performed over ‘rectangular blocks’ of information.
Consider now the following view definition Q8, which is essentially defined
using the query Q7.
(Q8)
create view
select
from
where
group by
averages::salInfo(faculty, C) as
U.fname, avg(T.C)
univ-D::salInfo-> C,
univ-D::salInfo T,
univ-D::faculty U
C <> "dept" and
T.dept = U.dname
U.fname
The view defined by Q8 actually computes, for each faculty, the average floor
salary in each category of employees (over all departments) in the faculty. This is
achieved by using the variable C, ranging over categories, in the dynamic output
schema declaration through the create view statement. The schema of the
output has the form {faculty, Prof, AssocProf, Technician, . . . }, namely,
the variable C in the view definition represents its value in the set of valid
instantiations, and hence the output schema depends on the input data.
Example 4.4 [Aggregation on a Variable Number of Blocks]. Let us add
yet another relation to the database univ-D of Figure 2. The relation
empType(category, type) lists the categories, and their types, for example,
(prof, teaching faculty), (assoc prof, teaching faculty), (technician, technical
staff), (secretary, administrative staff), etc. Consider the query:
(Q9)
create view
select
from
where
group by
averages::salInfo(faculty, Y) as
U.fname, avg(T.C)
univ-D::salInfo-> C,
univ-D::salInfo T,
univ-D::faculty U,
univ-D::empType E,
E.type Y
C <> "dept" and
T.dept = U.dname and
E.category = C
U.fname
Figure 4 depicts relation univ-D::salInfo, and the “blocks” upon which the
aggregation is carried out. Each block corresponds to a faculty and an employee
type.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
494
•
L. V. S. Lakshmanan et al.
Fig. 4. univ-D::salInfo relation. The schema components appear in bold face font. The “facultyi
& category type j ” rectangles are the blocks upon which the aggregation is carried out.
The schema of the output relation is
averages :: salInfo(faculty, teachingfaculty, technicalstaff, . . .)
and each tuple in the output lists a faculty, such as “arts and sciences”, and the
corresponding average salary floors for the employees in each of the category
types “teaching faculty”, “technical staff”, “administrative staff”, etc.
4.2.1 Semantics of Aggregation with Dynamic View Definition. The semantics of restructuring (via view definition) with aggregation involves putting together the ideas behind each of these operations. Intuitively, as explained in
Section 4.1, the instantiations ≡-equivalent to ı ∈ I Q produce (in the view) one
relation in a database whose scheme consists of the attributes attrset Q (ı), as
defined in that section. The tuples for this relation are obtained by computing
the aggregations listed in the aggList in the select statement with respect to
each valid equivalence class of instantiations that agree on the variables listed
in the group by clause as well as on the variables appearing in the create view
statement, and then performing the necessary merging. The reason for including the variables in the create view statement is that they specify an implicit
group by. This intuition is formalized below. Consider the SchemaSQL view
definition Q, below:
create view
select
from
where
group by
having
db::rel(attr1,..., attrn) as
itemList, aggList
fromList
whereConditions
groupbyList
havingConditions
Definition 4.6 [Equivalence Relation Induced by Group by and Create View
Clauses]. For two instantiations ı, ∈ I Q , we define ı# , provided for each
o ∈ groupbyList, ı(o) = (o), and for each variable X occurring in the create
view statement, ı(X ) = (X ). Clearly, # is an equivalence relation.
Similar to previous cases, we denote by E Q the set of equivalence classes of
I Q under # that satisfy the havingConditions:
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
495
Definition 4.7 [Valid Equivalence Classes].
E Q = {e | e is an equivalence class of # and sath (e, Q) = true}.
The notions of aggregate Q (e), aggtuple Q (e), and sath (e, Q) for an equivalence
class e ∈ E Q , are defined analogously to the way they were defined in Section
3.3. The only difference is that the equivalence relation # is used instead of ∼.
O
agg([ (o) | e ∈ E Q and ∈ e])
aggregate Q (e) =
“agg(o)”∈aggList
aggtuple Q (e) = tuple Q (ı)
O
aggregate Q (e)
where ı is any instantiation in the equivalence class e.
The concept of allocating tuples computed above according to the various
output schemas dynamically created by the instantiations can be formalized in
a way similar to what was done in Section 4.1, and we suppress these obvious
details for brevity. Recall that the schema of each output relation is determined
by one equivalence class of I Q under the ≡ relationship. Let aggallocate Q (ı)
denote the allocated relation determined by the ≡-equivalence class of ı ∈ I Q .
The concept of merging, defined in Definitions 4.2 and 4.4, can now be directly
applied to compute the final output. Thus, for each
J instantiation ı ∈ I Q , the
≡-equivalence class of ı contributes to the relation
aggallocate Q (ı), over the
schema ı(db) :: ı(rel)(attrset Q (ı)).
As an example, it is easy to verify that the view defined by Q8 indeed computes for each faculty, the category-wise floor salary averages. Before closing
this section, we note that the combination of dynamic output schema declaration with SchemaSQL’s aggregation mechanism makes it possible to express
many other novel forms of aggregation as well. Further, note that the examples in this and previous sections were in the context of a single database,
demonstrating some of the applications of SchemaSQL for a single database
environment.
5. IMPLEMENTATION
In this section, we describe the architecture of a system for implementing a multidatabase querying and restructuring facility based on SchemaSQL. A highlight of our architecture is that it builds on existing architecture of SQL in a
nonintrusive way, requiring minimal extensions to prevailing database technology. This makes it possible to build a SchemaSQL system on top of (already
available) SQL systems. We also identify novel query optimization opportunities that arise in a multidatabase setting.
The architecture consists of a SchemaSQL server that communicates with
the local databases in the federation. We assume that the meta-information
comprising component database names, names of the relations in each
database, names of the attributes in each relation, and possibly other useful
information (such as statistical information on the component databases, useful for query optimization) are stored in the SchemaSQL server in the form of
a relation called Federation System Table (FST).
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
496
•
L. V. S. Lakshmanan et al.
Fig. 5. SchemaSQL — implementation architecture.
Figure 5 depicts our architecture for implementing SchemaSQL. At a high
level, global SchemaSQL queries are submitted to the SchemaSQL server,
which determines a series of local SQL queries and submits them to the local databases. The SchemaSQL server then collects the answers from local
databases, and, using its own resident SQL engine, executes a final series of
SQL queries against the answers collected above, to produce the answer to
the global query. Intuitively, the task of the SchemaSQL server is to compile the instantiations for the variables declared in the query, and enforce
the conditions, groupings, aggregations, and mergings to produce the output. Many query optimization opportunities at different stages, and at different levels of abstraction, are possible, and should be employed for efficiency
(see discussions in Section 6.1). Algorithms 5.1 and 5.2 give detailed accounts
of our query processing strategy for the fixed and dynamic output schema,
respectively.
Query processing in a SchemaSQL environment consists of two major phases.
In the first phase, tables called VIT ’s (Variable Instantiation Tables) corresponding to the variable declarations in the from clause of a SchemaSQL query
are generated. The schema of a VIT consists of all the variables in one or more
variable declarations in the from clause, while its contents correspond to instantiations of these variables. VIT’s are materialized by executing appropriate
SQL queries on the FST and/or component databases. In the second phase, the
SchemaSQL query is rewritten into a series of one or more SQL queries against
the VITs with the property that the result of the series of queries against the
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
497
VITs is identical to that of the original SchemaSQL query against the federation. The result so obtained is appropriately restructured for presentation to
the user.
For ease of exposition, we first consider (Section 5.1) SchemaSQL queries
with a fixed output schema and then present the algorithm for handling queries
with a dynamic output schema (Section 5.2).
5.1 Processing SchemaSQL Queries with a Fixed Output Schema
We assume that the FST has the scheme FST(db-name, rel-name, attr-name).
Also, we refer to the db-name, rel-name, and attr-name variables (defined in
Definition 2.1) collectively as meta-variables.
Our algorithm below considers SchemaSQL queries with a fixed output
schema, possibly with aggregation. The algorithm consists of two phases as
explained above. In order to facilitate the reader to follow the development, we
break the presentation of the algorithm after each phase and follow it with an
illustrative example.
ALGORITHM 5.1. SchemaSQL Query Processing—Fixed Output Schema
Input: A SchemaSQL query with aggregation and with a fixed output schema.
Output: The result of the query.
Method:
Phase I: Corresponding to a set of variable declarations in the from clause,
create V I T s using one or more SQL queries against some local databases
and/or the FST.
Phase II: Rewrite the original SchemaSQL query against the federation into
an “equivalent” query against the set of V I T relations and compute it using
the resident SQL server.
Phase I
(0) Rewrite the input SchemaSQL statement into the following form such that the
conditions in the where and having clauses are in conjunctive normal form.
select
from
where
group by
having
itemList, aggList
hrang e1 i V1 , . . . , hrang ek i Vk
hcond 1 i and . . . and hcond m i
groupbyList
havingConditions
We assume an ordering on the variable declarations in the from clause: If a variable V j
appears in the range of another variable Vi , that is, in hrangei i, then we say Vi depends
on V j , denoted V j ≺ Vi . We will assume that whenever V j ≺ Vi , the declaration of V j
comes before that of Vi in the from clause. It is trivial to reorder the declarations in a
given SchemaSQL query to meet this condition.
(1) Consider each variable declaration hrangei i Vi , 1 ≤ i ≤ k. The following cases arise:
(a) Vi is a meta-variable: In this case, all variables in the declaration hrangei i Vi must
be meta-variables. As an optimization issue, notice that it is sufficient to create VITs
for just those meta variables that are maximal (among meta-variables) with respect to
the ordering ≺: As we will discuss below, the VIT for a maximal meta-variable Vi also
contains the instantiations for all meta-variables V j such that V j ≺ Vi . Create a VIT
for Vi , VITVi as follows: the schema of VITVi consists of Vi and all variables V j such that
V j ≺ Vi ; the contents of VITVi are obtained using an appropriate SQL query against the
FST. For example, assume the from clause contains the declarations -> D and D::r->
A declaring db-name and attr-name variables D and A, where r is a constant (relation
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
498
•
L. V. S. Lakshmanan et al.
name). The VIT for D will have the schema {D, A}. Its contents can be obtained using
the SQL query
select
from
where
db-name as D, attr-name as A
FST
rel-name = r
Notice that VITD includes instantiations for the attr-name variable A, and thus we can
save on creating a separate VIT for A. We can further optimize the query processing
by enforcing conditions of the form U relOp c (or c relOp U ) in the where clause of the
original SchemaSQL query, if any, at this time, provided U is a variable in the schema
of VITVi , relOp is a comparison operator, and c is a constant.
(b) Vi is a tuple variable: Let the declaration for T be db::rel T, where db (rel) can be
constant or variable. Determine the schema of VITT as follows.
(i) Include all meta-variables appearing in db::rel, if any, in the schema of VITT .
(ii) If T.a, where a is a constant, appears in the query (in the select, where,5 group
by, or having clauses), include Ta in the schema of VITT . Here, the identifier Ta is
obtained by concatenating the tuple variable T with the attribute name a.
(iii) For each domain variable declaration T.a V, where a is a constant, include V in the
schema of VITT .
For each domain variable declaration T.A V, where A is an attr-name variable,
include both A and V in the schema of VITT .
(iv) If, for an attr-name variable A, T.A appears in the select, where, group by, or having
clauses, then include both A and TA in the schema of VITT .
Next, obtain the contents of VITT by submitting SQL queries to component databases,
and computing their union, as follows.
(i) Identify the meta-variables in the schema of VITT . These include any variables
in db::rel, and possibly some attr-name variables. Obtain the bindings for these
variables by executing an SQL query on their VITs.
(ii) For each binding obtained, generate one SQL query and submit it to the appropriate
component database, in order to retrieve the relevant tuples from that database that
pertain to the binding on the meta-variables. Push each condition of the form T.att
relOp c (c relOp T.att), T.A relOp c (c relOp T.A), and V relOp c (c relOp V),
where att and c are constants, A is an attr-name variable, and V is a domain variable
declared using T, in the where clause of the original SchemaSQL query into the where
clause of the SQL query to be submitted, above.
(iii) Finally, obtain the contents of VITT as the union of the SQL queries submitted to
component databases. If only one component database is involved, i.e. if db in the
declaration db::rel T is a constant, then the union can be expressed within one
SQL query submitted to the component database db.
This completes Phase I of the algorithm.
Notice that no separate VITs are needed for domain variables, as their instantiations are captured by the VITs for their associated tuple variables. Exampe 5.1
illustrates generation of VITs for meta and tuple variables.
Example 5.1 [Illustrating Phase I: Generating VITs]. In this example, we
illustrate the Phase I of the algorithm using a variant of query Q2 of
Example 2.2:
5 If
T.a appears only in the where clause, in the form T.a relOp c where c is a constant, then
there is no need to include Ta in the schema of VITT .
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
(Q2’)
select
from
where
•
499
RelC, C.salFloor
univ-C-> RelC,
univ-C::RelC C,
univ-D::salInfo D
RelC
= D.dept
and
C.salFloor > D.technician and
C.category = "technician"
The query contains declarations for one meta variable RelC, which is a rel-name
variable, and two tuple variables C and D. Here is how Phase I of the algorithm
will proceed to construct the VITs for this query. (Notice that as far as Phase I
is concerned, aggregation is completely orthogonal.)
The schema of VITRelC is {RelC} (one column). Its contents are generated
using the following query to the FST:
select
from
where
rel-name as RelC
FST
db-name = "univ-C"
The schema of VITC is {RelC, CsalFloor}. Note that Ccategory need not
be included: The where clause condition C.category = "technician" will be
enforced within the SQL query to the component database univ-C. VITC is
generated as the union of SQL queries to database univ-C as follows: First,
bindings for the meta-variable RelC are obtained from its VIT via the trivial
query:
select
from
RelC
VITRelC
Let {r1 , . . . , rn } be the answer to the above query. VITC is obtained by submitting
the following SQL query to the database univ-C.
select
from
where
’r1 ’ as RelC, salFloor as CsalFloor
r1
category = "technician"
UNION
...
select
from
where
UNION
’rn ’ as RelC, salFloor as CsalFloor
rn
category = "technician"
Finally, the schema of VITD is {Ddept,Dtechnician}, and its contents are obtained via the following SQL query to the database univ-D:
select
from
dept as Ddept, technician as Dtechnician
salInfo
We now return to Phase II of Algorithm 5.1.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
500
•
L. V. S. Lakshmanan et al.
Algorithm 5.1 SchemaSQL Query Processing (cont’d.)
Phase II
(1) Execution of this phase happens in the SchemaSQL server. The SchemaSQL query
is rewritten into an equivalent conventional SQL query on the VIT’s generated in Phase
I, as follows: To simplify the presentation, we use the concept of the Joined Variable
Instantiation Table (JVIT). As the name suggests, JVIT is the (natural) join of the VITs
generated for the various meta-variables and tuple variables during Phase I. In the
creation of the JVIT, we enforce the remaining conditions, if any, in the where clause
of the original SchemaSQL query. However, these conditions need to be syntactically
rewritten so they make sense against the schema of the JVIT.
Recall that all where clause conditions involving a comparison with a constant were
already enforced in Phase I. The remaining conditions must have the form obj1 relOp
obj2, where obj1 and obj2 are either variables, or qualified attributes of the form T.a or
T.A. We need to rewrite these conditions by renaming the obj’s (variables and/or qualified
attributes) as follows:
(i) Replace each occurrence of variable V by VITU .V, where VITU is the VIT corresponding
to V. It is possible that U is different from V, since VITs are generated only for certain
variables and the instantiations for the remaining variables are obtained from these
(see Phase I).
(ii) Replace each occurrence of a qualified attribute T.a (or T.A) by VITT .Ta (or VITT .TA).
(2) Assume VITV1 , . . . , V I TVn are all the variable instantiation tables generated in
Phase I. Let {U1 , . . . , Um } be the union of their schemas. Properly qualify each Ui with
the VIT whose schema contains Vi . For example, if Ui is Ta, then its qualified version
is VITT .Ta. If Ui appears in the schemas of several VITs, break the tie arbitrarily for
the purpose of qualification. Let {V I TU1 .U1 , . . . , V I TUm .Um } represent the qualified set.
Then obtain the joined variable instantiation table using the following SQL query:
create view
select
from
where
JVIT(U1 , . . . , Um ) as
VITU1 .U1 , . . . , VITUm .Um
VITV1 , . . . , VITVn
(rewritten where clause) and
(natural join conditions)
Here, “rewritten where clause” is obtained by rewriting the remaining where clause
conditions as pointed out above, and “natural join conditions” consist of conditions of
the form VITVi .attr = VITV j .attr for all pairs of VITs VITVi and VITV j and all attributes
attr that are common to the schemas of these VITs.
(3) Finally, generate a SQL query from the original SchemaSQL query to produce the
final output, as follows.
select
from
group by
having
itemList’, aggList’
JVIT
groupbyList’
havingConditions’
where, itemList’ (respectively, aggList’, groupbyList’, havingConditions’) is obtained from itemList (respectively, aggList, groupbyList, havingConditions) in the
original SchemaSQL query by replacing every occurrence of T.a by Ta and of T.A by TA.
Note that the generation of JVIT and the final SQL query can be combined in one step.
We chose to divide it into two steps for clarity of exposition. Further, it also simplifies our
presentation of the query processing algorithm for SchemaSQL queries with dynamic
output schemas in Section 5.2. In fact, since we generated JVIT as a view, a natural
option for the the resident SQL engine’s query optimizer is to use query rewriting (rather
than materializing JVIT) to process the final query in one step.
This completes Phase II of the algorithm.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
501
Fig. 6. Example—Query Processing.
We now present an example that illustrates our algorithm.
Example 5.2 [Illustrating Phase II: Computing the Rest]. We continue
with the processing of query (Q2’) of Example 5.1. There, we demonstrated
how the VITs VITRelC , VITC and VITD are generated. Figure 6 shows these VITs
for the federation of Figure 2.
The following SQL query produces the JVIT.
create view
select
from
where
JVIT(RelC, CsalFloor, Ddept, Dtechnician) as
VITRelC .RelC, VITC .CsalFloor, VITD .Ddept,
VITD .Dtechnician
VITRelC , VITC , VITD
VITRelC .RelC = VITD .Ddept and
VITC .CsalFloor > VITD .Dtechnician and
VITRelC .RelC = VITC .RelC
In this query, the first two conditions in the where clause came from the (rewritten versions of) the first two conditions in the where clause of (Q2’). The last
condition in the SQL query enforces natural join. Here is the final SQL query.
select
from
RelC, CsalFloor
JVIT
Notice that C.salFloor is replaced by CsalFloor.
Next, let us consider the effect of aggregation. As with Phase I, aggregation
remains orthogonal to the processing being done in Phase II. As an illustration,
consider the following query from Example 3.1.
(Q3): “find the average salary floor across all departments for each employee
category in database univ-B.” This query involves a horizontal aggregation.
select
from
where
group by
T.category, avg(T.D)
univ-B::salInfo -> D,
univ-B::salInfo T,
D <> "category"
T.category
Computation of the JVIT for this query is analogous to that for (Q2’): aggregation does not influence this step. Then, the final SQL query would be:
select
from
group by
Tcategory, avg(TD)
JVIT
Tcategory
The following lemma forms the basis for the correctness of our query processing
strategy for both fixed and dynamic output schema.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
502
•
L. V. S. Lakshmanan et al.
LEMMA 5.1 [COMPUTING VALID INSTANTIATIONS]. Let JVIT be the joined variable instantiation table for a given SchemaSQL query Q. Then JVIT contains
exactly the set of valid instantiations I Q for query Q, restricted to the columns
of JVIT.
PROOF. It is straightforward to see that for each variable V , VITV generated
in Phase I only contains admissible instantiation (restricted to V and variables
on which V depends, in case V is a meta-variable, or restricted to all relevant
attribute and domain variables related to V , in case V is a tuple variable). We
only need to verify that the JVIT produced from these in Phase II will ensure
that each instantiation in JVIT will satisfy all conditions in the where clause
of Q. To see why this is true, notice that each condition of the form x relOp y
where one of x, y is a constant, that constrains some column of VITV is enforced
in the formation of VITV in Phase I. The remaining conditions are enforced
while computing the join of the VITs. The natural join ensures that pieces of
the same (global) instantiation are put together in the formation of JVIT. It
follows that JVIT contains exactly the valid instantiations I Q , restricted to the
columns of JVIT.
THEOREM 5.1 [CORRECTNESS FOR FIXED OUTPUT SCHEMA]. Algorithm 5.1 correctly computes answers to SchemaSQL queries with fixed output schemas.
PROOF. By Lemma 5.1, JVIT contains the set of valid instantiations I Q ,
restricted to the columns of JVIT. The final SQL query of Phase II assembles
the result. For a SchemaSQL query with no aggregation, this final query is
simply
select
from
itemList’
JVIT
which corresponds to the semantics given by Eqs. (2) and (3) (Section 3.2),
repeated here for convenience
O
ı(s).
Q(D) = [tuple Q (ı) | ı ∈ I Q ], where, tuple Q (ı) =
s∈itemList
If the SchemaSQL query contains aggregation, then the final SQL query
generated in the algorithm is:
select
from
group by
having
itemList’, aggList’
JVIT
groupbyList’
havingConditions’
Semantics of SchemaSQL queries with aggregation was defined using the set E Q
of valid equivalence classes of I Q with respect to the group by clause groupings
of the original SchemaSQL query, which was characterized by the equivalence
relation ∼. It is easy to verify that the group by clause of the rewritten SQL
query above partitions JVIT, that is, the set of valid instantiations I Q , exactly
into equivalence classes of I Q with respect to ∼. The having clause restricts
these partitions to the valid equivalence classes, namely, E Q . Finally, the select
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
503
clause assembles the result. This corresponds exactly to the semantics which
was defined by Eqs. (5), (6), and (7) (Section 3.3), repeated here for convenience.
Q(D) = [aggtuple Q (e) | e ∈ E Q ],
where
aggtuple Q (e) = tuple Q (ı)
and
aggregate Q (e) =
O
O
aggregate Q (e)
agg([ (t)[B] | e ∈ E Q and ∈ e]).
“agg(t.B)”∈AggList
This completes the proof of Theorem 5.1.
5.2 Processing SchemaSQL Queries with Dynamic Output Schema
Our algorithms for the processing of SchemaSQL queries with dynamic output schema are directly based on the allocate and merge operations discussed
in Section 4.2. Recall that the output schema is dynamic, that is, it depends
on the input data. This happens when variables appear in the create view
statement.
Consider the SchemaSQL view definition
create view
select
from
where
group by
having
db::rel(attr1, ..., attrn) as
itemList, aggList
fromList
whereConditions
groupbyList
havingConditions
If db, rel and attr1, . . ., attrn are all constants, then the query has a fixed output schema and will be processed according to Algorithm 5.1. In the following,
we assume the create view clause has at least one variable. Also, recall that
at most one of attr1, . . ., attrn can be a variable, while both or either of db and
rel can be variable.
ALGORITHM 5.2. SchemaSQL Query Processing—Dynamic Output Schema
Input: A SchemaSQL query with aggregation and with a dynamic output schema
Output: The result of the query.
Method: First, we compute the JVIT as for fixed output schema. Then for each database
d and for each relation r for which an output view must be created, we implement the allocate and merge operations discussed in Section 4.1. These operations are implemented
in cascade, without materializing the result of the allocate.
(1) Compute the VITs and the JVIT exactly as in Algorithm 5.1.
(2) If db and rel in the create view clause are both constants, then create a single
relation (view) rel in the database db. Otherwise, query the JVIT to obtain the
values of db and/or rel, to form distinct (d , r) pairs dictated by the create view
statement. For example, if both db and rel are variables, compute the (d , r) pairs
using the following SQL query:
select
from
distinct db as d , rel as r
JVIT
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
504
•
L. V. S. Lakshmanan et al.
(3) Assume attri in the create view clause is a variable. For each (d , r) pair of the
previous step, obtain the instantiations for the attri by applying the following SQL
query on JVIT:
select
from
where
distinct attri
JVIT
db = d and rel = r
Note that one query is needed for each (d , r) pair, and one relation is constructed (as
the output of the SchemaSQL query) for each (d , r) pair, as described in next steps.
(4) Let (d , r) be an arbitrary, but fixed pair, from above. Let a1 , . . . , am be the output of the previous step for this pair (d , r). The schema of the relation corresponding to this (d , r) pair in the result of the SchemaSQL query is {attr1 ,. . .,
attri−1 ,a1 , . . . , am ,attri+1 ,. . .,attrn }. For each ak , k = 1, . . . , m, we generate a
relation Rk (attr1 ,. . .,attri−1 , ak ,attri+1 ,. . .,attrn ) as follows. To simplify the
presentation, we expand the select clause of the SchemaSQL query, namely
select itemList, aggList, as select obj1 ,. . . ,objn . Note that each of obj1 ,. . . ,
objn can be a simple or an aggregate operation. Compute relation R k (attr1 ,. . .,
attri−1 ,ak ,attri+1 ,. . .,attrn ) using the following SQL query on JVIT.
select
from
where
group by
having
obj1 as attr1 ,. . .,obji−1 as attri−1 ,obji as ak ,
obji+1 as attri+1 ,. . .,objn as attrn
JVIT
db = d and rel = r and attri = ak
groupbyList
havingConditions
We observe here that the outer union of R1 , . . . , Rm computes the result of the
allocate operation discussed in Sections 4.1 and 4.2. More precisely, the outer
union is equal to allocate Q (ı) (for SchemaSQL queries with no aggregation), or
aggallocate Q (ı) (for SchemaSQL queries containing aggregation), as discussed in
Sections 4.1 and 4.2. However, we do not compute this outer union, and instead
compute the result of the cascade of allocate followed by merge directly, in the next
step.
(5) Perform an outer join of the relations Rk , 1 ≤ k ≤ m, of the previous step using the
following SQL query:
select
from
where
R1 .obj1 as attr1 ,. . .,R1 .obji−1 as attri−1 ,
R1 .a1 as a1 ,R2 .a2 as a2 , . . . Rm .am as am ,
R1 .obji+1 as attri+1 ,. . .,R1 .objn as attrn
R1 outer join R2 outer join . . . outer join Rm
R1 .obj[1..i − 1] = R2 .obj[1..i − 1] and . . . and
R1 .obj[1..i − 1] = Rm .obj[1..i − 1] and
R1 .obj[i + 1..n] = R2 .obj[i + 1..n] and . . . and
R1 .obj[i + 1..n] = Rm .obj[i + 1..n] and
R1 .sid = R2 .sid and . . . and R1 .sid = Rm .sid
where R1 .obj[ j..k] = R2 .obj[ j..k] abbreviates R1 .obj j = R2 .obj j and · · · and
R1 .objk = R2 .objk . We have assumed that the relations Rk , k = 1, . . . , m, have a special attribute sid (sequence id), that records a sequential row number within each
group of tuples with the same values for obj1 ,. . .,obji−1 ,obji+1 ,. . ., objn . Generating these sequence id values using SQL applications or user defined functions is
quite simple in most commercial database systems as explained below.
Our technique for sequence id generation assumes the availability of a system
supplied, user query-able unique row identifier column such as the rid column of
ORACLE or the identity column of SYBASE SQL SERVER. Now, consider a tuple t in a
relation Rk whose sequence id (within the group to which it belongs) needs to be
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
505
generated. The following SQL statement returns the sid for t. We use the Oracle
convention to refer to the unique row identifier column as rid.
select
from
where
count(*)
Rk
Rk .obj[1..i − 1] = t.obj[1..i − 1] and
Rk .obj[1 + 1..n] = t.obj[i + 1..n] and
Rk .rid < t.rid
In the above statement, sid for tuple t is obtained by counting the number of tuples in t’s group that have a lesser rid value than the rid value of t. Note that the
join condition involving the sid column in the above outer join SQL query correctly
captures the merge semantics of Section 4.2.
This completes Algorithm 5.2.
Example 5.3 [Processing a SchemaSQL Query with Aggregation and
Dynamic Output Schema]. Consider query Q8 of Example 4.3, repeated here
for convenience.
(Q8)
create view
select
from
where
group by
averages::salInfo(faculty, C) as
U.fname, avg(T.C)
univ-D::salInfo-> C,
univ-D::salInfo T,
univ-D::faculty U
C <> "dept" and
T.dept = U.dname
U.fname
The first step of the algorithm is the generation of VITC (schema: {C}), VITT
(schema: {C, TC, Tdept}), and VITU (schema: {Udname, Ufname}), and then the
generation of JVIT. Since the details of the generation of these tables are identical to that for fixed output schema, we suppress these details and assume
that the JVIT for this query, with schema (C, TC, Tdept, Ufname, Udname) is
created.
Since the db and rel components of the create view clause of the
SchemaSQL query are constant, we proceed to the third step. The following
query is used to enumerate the distinct values of the variable C appearing in
the create view statement.
select
from
distinct C
JVIT
We obtain (Prof, AssocProf, Technician) as the answer to the query above.
(We are using the federation of Figure 2).
Step (4) of the algorithm involves the creation of relations Rk , one per each
value obtained in Step (3). For this example, there are three relations: Relation
R1 has the schema {faculty, Prof} and is obtained as:
select
from
where
group by
Ufname as faculty, avg(TC) as Prof
JVIT
C = "Prof"
Ufname
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
506
•
L. V. S. Lakshmanan et al.
Relations R2 and R3 are similar, with Prof substituted by AssocProf and
Technician, respectively. Note that the schemas of these three relations
are {faculty, Prof}, {faculty, AssocProf}, and {faculty, Technician},
respectively.
The final step of the algorithm performs an outer join of the three relations
generated in Step (4):
select
from
where
R1 .faculty as faculty, R1 .Prof as Prof,
R2 .AssocProf as AssocProf, R3 .Technician as Technician
R1 outer join R2 outer join R3
R1 .faculty = R2 .faculty and R1 .faculty = R3 .faculty
and R 1 .sid = R2 .sid and R1 .sid = R3 .sid
THEOREM 5.2 [CORRECTNESS FOR DYNAMIC OUTPUT SCHEMA]. Algorithm 5.2
correctly computes answers to SchemaSQL queries with dynamic output
schemas.
PROOF. We proceed in two steps: First we show the schema of the output
generated by Algorithm 5.2 is correct, then we show the contents generated
by the algorithm is also correct. This parallels the semantics definitions of
Sections 4.1 and 4.2. Recall that by Lemma 5.1, JVIT contains the set of valid
instantiations I Q , restricted to the columns of JVIT.
(i) Schema of the output. The semantics of dynamic output schema was
discussed in Section 4.1 with the aid of the equivalence relation ≡ (Definition
4.1). Each equivalence class of ≡ corresponds to one value of the (db, rel) pair
in the create view statement. The output schema corresponding to a (d , r)
pair is simply the set of values in the corresponding equivalence class of ≡
for the attributes in the create view statement. It is evident in Steps (4) and
(5) of the algorithm that the schema of the output for a (d , r) pair, namely,
{attr1 ,. . .,attri−1 , a1 , . . . , am ,attri+1 ,. . .,attrn }, indeed corresponds to this
semantics.
(ii) Contents of the Output. We concentrate on queries with aggregation and
dynamic output schema. Queries with no aggregation are simpler and can be
regarded as a special case of the former. The semantics of aggregation with
dynamic view definition was discussed in Section 4.2 with the aid of the equivalence relation # (Definition 4.6). Each equivalence class of # corresponds to the
set of instantiations with the same values for SchemaSQL variables appearing in the create view statement, plus the objects appearing in the group by
list. The variables in the create view statement can include db, rel, and one
attribute such as attri .
In Step (4) of the algorithm, each relation Rk , k = 1, . . . , m, corresponds to
one value ak of attri , and is for a given (d , r) pair. This is evident by examining the where clause of the (regular) SQL query that generates Rk . In fact,
since this SQL query also incorporates the group by and having clauses from
the original SchemaSQL query, we can see that each R k “packs” the output
corresponding to equivalence classes of # that have the same values d , r, and
ak , for db, rel, and attri , respectively. Also, notice that the schema of Rk is
{attr1 ,. . .,attri−1 ,ak ,attri+1 ,. . .,attrn }. We observe that if an outer union is
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
507
performed on R1 , . . . , Rm , tuples of each relation Rk is extended by nulls to the
full schema {attr1 ,. . .,attri−1 ,a1 , . . . , am , attri+1 ,. . .,attrn }, and unioned.
This coincides exactly with the semantics of the allocate operation.
Finally, we need to show that Step (5) of the algorithm generates the merge
(Definition 4.4) of the allocated relation. Note that our algorithm does not
materialize the allocated relation. Rather, Step (5) performs an outer join
of the relations Rk , k = 1, . . . , m. We observe that a set of tuples {t1 , . . . , tm },
t j ∈ R j , are mergeable (Definition 4.2) if and only if, for all j, k ∈ {1, . . . , m},
t j (attr1 ) = tk (attr1 ), . . ., t j (attri−1 ) = tk (attri−1 ), t j (attri+1 ) = tk (attri+1 ), . . .,
t j (attrn ) = tk (attrn ). The SQL query in Step 5 achieves the merging of these
tuples. Further, the conditions in the where clause of this query that involve the
sid attribute make sure each tuple from a relation Rk participates in the formation of exactly one output tuple. Without these conditions each tuple could
possibly participate in the formation of multiple output tuples violating the
multi-set semantics of SchemaSQL.
This completes the proof of Theorem 5.2.
Based on the algorithms presented in this section, prototype implementations of SchemaSQL have been developed [Gingras et al. 1997; Sadri and
Wilson 1997].
6. DISCUSSION
In this section, we discuss a variety of optimization opportunities in the context
of SchemaSQL implementation. We also discuss some novel database applications facilitated by a SchemaSQL-based system.
6.1 Query Optimization Opportunities
There are several opportunities for query optimization that are peculiar to
the MDBS environment. In the following, we identify the major optimization
possibilities and sketch how they can be incorporated in Algorithm 5.1.
(1) The conditions in the where clause of the input SchemaSQL query should
be pushed inside the local spawned SQL queries so that they are as ‘tight’
as possible. Algorithm 5.1 incorporates this optimization to some extent.
(2) Knowledge of the variables in the select and where clauses can be used to
minimize the size of the VIT’s generated in Phase I. For example, if certain
attributes are not required for processing in Phase II, they can ‘dropped’
while generating the local SQL queries.
(3) If more than one tuple variable refers to the same database, and their relevant where conditions do not involve data from another database, the SQL
statements corresponding to these variable declarations should be combined into one. This would have the effect of combining the VIT ’s corresponding to these variable declarations and thus reducing the number
of spawned local SQL queries. This can be incorporated by modifying
Step (1) (b) of our algorithm.
(4) One of the costliest factors for query evaluation in a multidatabase environment is database connectivity. We should minimize the number of times
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
508
•
L. V. S. Lakshmanan et al.
connections are made to a database during query evaluation. Thus, the
spawned SQL statements need to be submitted (in batches) to the component databases in such a way that they are evaluated in minimal number
of connections to the databases.
(5) In view of the sideways information passing (sip) [Bancilhon and
Ramakrishnan 1986] technique inherent in our algorithm, reordering of
variable declarations would result in more efficient query processing. However, the heuristics that meta-variables obtain a significantly fewer number of bindings when compared to other variables in a multidatabase setting, presents novel issues in reordering. For instance, the order db::r.a
R, -> D, D-> R suggested by the conventional reordering strategies could
be worse than -> D, D-> R, db::r.a R because of the lower number of
bindings R obtains for its VIT in the latter.
(6) We should make use of works such as Lipton and Naughton [1990] and
Lipton et al. [1990] to determine which of the VIT’s should be generated
first so that the tightest bindings are passed for generating subsequent
VITs.
(7) If parallelism can be supported, SQL queries to multiple databases can be
submitted in parallel.
6.2 Semantic Heterogeneity
One of the roadblocks to achieving true interoperability is the heterogeneity
that arises due to the difference in the meaning and interpretation of similar
data across the component systems. This semantic heterogeneity problem has
been discussed in detail in Sheth [1991], Kim et al. [1993], and Hammer and
Mcleod [1993]. A promising approach to dealing with semantic heterogeneity is
the proposal of Sciore et al. [1994]. The main idea behind their proposal is the
notion of semantic values, obtained by introducing explicit context information
to each data object in the database. In applying this idea to the relational model,
they develop an extension of SQL called Context-SQL (C-SQL) that allows for
explicitly accessing the data as well as its context information.
In this section, we sketch how SchemaSQL can be extended with the wherewithal to tackle the semantic heterogeneity problem. We extend the proposal
of Sciore et al. [1994], by associating the context information to relation names
as well as attribute names, in addition to the values in a database. Also, in the
SchemaSQL setting, there is a natural need for including the type information of an object as part of its context information. We propose techniques for
intensionally specifying the semantic values as well as for algorithmically deriving the (intensional) semantic value specification of a restructured database,
given the old specification and the SchemaSQL view definition. The following
example illustrates our ideas.
Example 6.1 [Semantic heterogeneity]. Consider the database univInfoA
having a single relation stats with scheme {cat, cs, math, ontario, quebec}.
This database stores information on the floor salary of various employee categories for each department (as in univ-B of the university federation) as well as
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
509
information on the average number of years it takes to get promoted to a category, in each province in the country. The type information of the objects in the
database univInfoA is stored in a relation called isa and is captured using the
following rules6 :
isa(cs, d ept) ←
isa(math, d ept) ←
isa(ontario, prov) ←
isa(quebec, prov) ←
isa(C, cat) ← stats[cat → C]
isa(S, sal ) ← stats[D → S], isa(D, d ept)
isa(Y, year) ← stats[P → Y ], isa(P, prov)
Now, consider restructuring univInfoA into univInfoB, which consists of two relations salstats{dept, prof, assoc-prof} and timestats {prov, prof, assoc-prof}. salstats
has tuples of the form <d , s1 , s2 >, representing the fact that d is a department
that has a floor salary of s1 for category professor, and s2 for associate professor.
A tuple of the form < p, y 1 , y 2 > in timestats says that p is a province in which
the average time it takes to reach the category professor is y 1 and to reach
the category associate professor is y 2 . The following SchemaSQL statements
perform the restructuring that yields univInfoB.
create view
select
from
where
create view
select
from
where
univInfoB::salstats(dept, C) as
D, T.D
univInfoA::stats T,
T.cat C,
univInfoA::stats-> D,
D isa ‘dept’
univInfoB::timestats(prov, C) as
P, T.P
univInfoA::stats T,
T.cat C,
univInfoA::stats-> P,
P isa ‘prov’
Note how the type information is used in the where clause to elegantly specify
the range of the attribute variables. Our algorithm that processes the restructuring view definitions derives the following intensional type specification for
univInfoB:
isa(prof, cat) ←
isa(assoc-prof, cat) ←
isa(D, dept) ← salstats[dept → D]
isa(S, sal) ← salstats[C → S], isa(C, cat)
isa(P, prov) ← timestats[ prov → P ]
isa(Y, year) ← timestats[P → Y ], isa(P, prov)
6 The syntax of the type specification rules is based on the syntax of SchemaLog [Lakshmanan et al.
1997].
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
510
•
L. V. S. Lakshmanan et al.
Query processing in this setting involves the following modification to the processing of comparisons mentioned in the user’s query. The comparison is performed after (a) finding the type information using the specification, (b) finding
the associated context information, and (c) applying the appropriate conversion
functions.
7. COMPARISON WITH RELATED WORK
In this section, we compare and contrast our proposal against some of the related work for meta-data manipulation and multidatabase interoperability.
The features of SchemaSQL that distinguishes it from similar works include
—Uniform treatment of data and metadata.
—No explicit use of object identifiers.
—Downward compatibility with SQL.
—Comprehensive aggregation facility.
—Restructuring views, in which data and meta-data may be interchanged.
—Designed specifically for interoperability in multi-database systems.
In Litwin et al. [1989] and Grant et al. [1993], Litwin et al. propose a multidatabase manipulation language called MSQL that is capable of expressing
queries over multiple databases in a single statement. MSQL extends the traditional functions of SQL to the context of a federation of databases. The salient
features of this language include the ability to retrieve and update relations
in different databases, define multi-database views, and specify compatible
and equivalent domains across different databases. Missier and Rusinkiewicz
[1995] extends MSQL with features for accessing external functions (for resolving semantic heterogeneity) and for specifying a global schema against which
the component databases could be mapped. Though MSQL (and its extension)
has facilities for ranging variables over multiple database names, its treatment of data and meta-data is nonuniform in that relation names and attribute
names are not given the same status as the data values. The issues of schema
independent querying and resolving schematic discrepancies of the kind discussed in this paper, are not addressed in their work.
Many object-oriented query languages, by virtue of treating the schema information as objects, are capable of powerful meta-data querying and manipulation. Some of these languages include XSQL [Kifer et al. 1992], HOSQL
[Ahmed et al. 1991], Noodle [Mumick and Ross 1993], and OSQL [Chomicki
and Litwin 1993].
XSQL [Kifer et al. 1992] has its logical foundations in F-logic [Kifer et al.
1995] and is capable of querying and restructuring object-oriented databases.
However, it is not suitable for the needs addressed in this article as its syntax
was not designed with interoperability as a main goal. Besides, the complex
nature of this query language raises concerns about effective and efficient implementability, a concern not addressed in Kifer et al. [1992]. Indeed, we are
not aware of any implementation of XSQL. The Pegasus Multidatabase system
[Ahmed et al. 1991] uses a language called HOSQL as its data manipulation
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
511
language. HOSQL is a functional object-oriented language that incorporates
nonprocedural statements to manipulate multiple databases. OSQL [Chomicki
and Litwin 1993], an extension of HOSQL, is capable of tackling schematic
discrepancies among heterogeneous object-oriented databases with a common
data model. Both HOSQL and OSQL do not provide for ad-hoc queries that refer
to many local databases in the federation in one shot. While XSQL, HOSQL,
and OSQL have a SQL flavor, unlike SchemaSQL, they do not appear to be
downward compatible with SQL syntax and semantics. In other related work,
Ross [1992] proposes an interesting algebra and calculus that treats relation
names at par with the values in a relation. However, its expressive power is limited in that attribute names, database names, and comprehensive aggregation
capabilities are not supported.
Lefebvre et al. [1992] use F-logic [Kifer et al. 1995] to reconcile schematic discrepancies in a federation of relational databases. Unlike SchemaSQL, which
can provide a ‘dynamic global schema,’ ad hoc queries that refer the data and
schema components of the local databases in a single statement cannot be posed
in their framework.
UniSQL/M [Kelley et al. 1995] is a multidatabase system for managing
a heterogeneous collection of relational database systems. The language of
UniSQL/M, known as SQL/M, provides facilities for defining a global schema
over related entities in different local databases, and to deal with semantic heterogeneity issues such as scaling and unit transformation. However, it does not
have facilities for manipulating metadata. Hence, features such as restructuring views that transform data into metadata and vice-versa, dynamic schema
definitions, and extended aggregation facilities supported in SchemaSQL are
not available in SQL/M. The emerging standard for SQL3 [SQL Standards
Home Page 1996; Beech 1993] supports ADTs and oid’s, and thus shares some
features with higher-order languages. However, even though it is computationally complete, to our knowledge it does not directly support the kind of
higher-order features in SchemaSQL.
Krishnamurthy and Naqvi [1988] and Krishnamurthy et al. [1991] are early
and influential proposals that demonstrated the power of using variables that
uniformly range over data and meta-data, for schema browsing and interoperability. While such ‘higher-order variables’ admitted in SchemaSQL have been
inspired by these proposals, there are major differences that distinguish our
work from the above proposals. (i) These languages have a syntax closer to
that of logic programming languages, and far from that of SQL. (ii) More importantly, these languages do not admit tuple variables of the kind permitted in SchemaSQL (and even SQL). This limits their expressive power. (iii)
Lastly, aggregate computations of the kind discussed in Sections 3.3 and 4.2 are
unique to our framework, and to our knowledge, not addressed elsewhere in the
literature.
Many researchers have addressed the issue of information integration. Recent emphasis has been on data integration from semantically heterogeneous
sources and integration of semistructured and unstructured sources. Some of
the recent work on semantic integration issues include Bergamaschi et al.
[1999] and Castano and Antonellis [1997]. The main idea is to provide a
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
512
•
L. V. S. Lakshmanan et al.
semantic dictionary, or thesaurus, that is used to map data from multiple
sources into a common data model. In Calvanese et al. [1998], the authors
present an architecture for declarative information integration that is based
on the conceptual modeling of the domain and reasoning support over the
conceptual representation. The TSIMMIS project [Garcia-Molina et al. 1997]
uses the wrapper/mediator approach for the integration of semistructured data
sources. It uses an object-oriented language called LOREL (lightweight object repository language) to access information wrapped into a common data
model. A case-based approach is adapted in Panti et al. [2000], which enables
the system to cope with dynamic changes in the schema of sources. Our work on
SchemaSQL provides a facility for the integration of data in syntactically heterogeneous sources. Furthermore, we have laid the groundwork for enhancing
SchemaSQL with semantic reconciliation functionality (Section 6.2). Further
research is needed in this area.
The issues of information capacity and information capacity preserving mapping have been discussed in Miller et al. [1993]. Intuitively, we are interested in
characterizing views (in our case restructuring views) that, can be used in lieu
of the original data (e.g., for answering queries, viewing data, and/or updating
data). First we notice that, if we restrict the query language to SQL, then any
restructuring view in which data becomes metadata is not information capacity
preserving, since some SQL queries on the original data are not expressible in
SQL on the restructured view. This is due to SQL’s inability to query metadata.
So we will assume the full power of SchemaSQL henceforth. Let us consider
the views in Figure 2. It can be shown that the mapping between univ-A and
the view univ-C is one-to-one, and hence the two schemas are equivalent in
terms of information capacity [Miller et al. 1993]. On the other hand, views
univ-B and univ-D are not, in general, equivalent to univ-A. This is due to the
possible creation of null values in the views, which compromises the one-to-one
property. A detailed and general study of information capacity equivalence and
preservation for restructured views is beyond the scope of this article.
Relational tuple calculus is generally regarded as the mathematical basis
for SQL. The important difference is the set semantics of tuple calculus versus multi-set semantics of SQL. Klug [1982] has extended relational algebra
and relational tuple calculus to include aggregate functions, and has proved
their equivalence. While retaining the set semantics of the relational model, he
provides an elegant semantics of aggregation by providing aggregate operators
on columns of relations (as opposed to operators on a set of values). Our treatment of SQL semantics (Section 3.1), based on instantiations of tuple variables,
closely parallels the relational tuple calculus approach. Furthermore, we have
accounted for aggregate constructs group by and having in SQL, and have also
supported a multi-set semantics to be consistent with SQL and commercial
database systems. Semantics of SchemaSQL was obtained by extending this
semantics to (1) additional variable types of SchemaSQL, and (2) accounting
for dynamic outputs, where the output schema is not fixed and is a function of
input data. We have presented an extended relational algebra for the singledatabase version of SchemaSQL in Lakshmanan et al. [1999]. Deriving calculus
and algebra for the full SchemaSQL is an interesting topic for future research.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
513
In the context of multidimensional databases (MDDB) and online analytical processing (OLAP), there is a great need for powerful languages expressing complex forms of aggregation [Codd et al. 1995]. The powerful features of
SchemaSQL for horizontal and block aggregation will be especially useful in
this context (e.g., see Examples 3.3 and 4.3). Interestingly, the Data Cube operator proposed by Gray et al. [1996] can be simulated in SchemaSQL. Unlike the
cube operator, SchemaSQL can express any subset of the data cube to any level
of granularity. Substantial amount of work has been done on efficient computation of the data cube. This is a large body of work and we refer the reader to the
pioneering work by Agarwal et al. [1996] for details. It would be interesting to
investigate how some of these fast algorithms for the cube can be leveraged for
efficient computation of SchemaSQL queries involving complex aggregations.
As pointed out above, the aggregation and restructuring features of
SchemaSQL would be very useful in OLAP style computations. However,
SchemaSQL by itself is not a full-fledged OLAP query language. Gingras [1997]
and Gingras and Lakshmanan [1998] document the limitations of SchemaSQL
for OLAP applications and propose a language they call nD-SQL specifically
designed for OLAP computations. An example of such a limitation is that in
SchemaSQL, one cannot create a column as a function of more than one domain value. SchemaSQL was designed with interoperability in mind, where
there seems to be no natural need for this capability.
Statistical Databases (SDB) applications, in addition to requiring complex
statistical operations, also have natural use for restructuring and for handling data meta-data conflicts [Shoshani 1997]. Languages developed in the
SDB community do possess such features (e.g., see Meo-Evoli et al. [1992] and
Ozsoyoglu et al. [1985; 1989]). However, on comparing SDB languages with
SchemaSQL, we find that they are complementary with respect to what they
have to offer. It might be interesting to investigate how useful features from
these languages can be combined to support sophisticated SDB and OLAP
applications.
While we propose SchemaSQL as a language for multidatabase interoperability and data warehousing applications, researchers have identified several applications where SchemaSQL can enhance the functionality of a single
DBMS significantly. These applications include publishing of relational data
on the web [Miller et al. 1997; Miller 1998; Krishnamurthy and Zloof 1995],
techniques for providing physical data independence [Miller 1998], developing
tightly coupled scalable classification algorithms in data mining [Wang et al.
1998], and query optimization in the context of data warehousing [Subramanian
and Venkataraman 1998]. In a recent work, Agrawal et al. discuss the advantages of storing data in vertical attribute/value pair (akin to XML tag/value
structure), while viewing and querying it in the restructured traditional (horizontal) format [Agrawal et al. 2001]. Motivated by similar observations, in
Lakshmanan et al. [1999], we discuss efficient implementation of SchemaSQL
on a single RDBMS. We develop logical as well as physical algebraic operators for SchemaSQL and using these operators as a vehicle, present several
alternate implementation strategies for SchemaSQL queries/views in a single
database setting. We also test the effectiveness of these strategies by means of
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
514
•
L. V. S. Lakshmanan et al.
a series of tests based on TPC-D benchmark data. In a more recent work, we
discuss the equivalences (i.e., rewrite rules) for this (extended) algebra to be
used in a cost-based optimizer for SchemaSQL [Davis and Sadri 2001].
In other related work, Gyssens et al. [1996] develop a general data model
called the Tabular Data Model, which subsumes relations and spreadsheets as
special cases. They develop an algebra for querying and restructuring tabular
information and show that the algebra is complete for a broad class of natural
transformations. They also demonstrate that the tabular algebra can serve
as a foundation for the restructuring aspects of OLAP. Restructuring views
expressible in SchemaSQL can also be expressed in their algebra but they do
not address aggregate computations.
In Lakshmanan et al. [1993; 1997], we proposed a logic-based query/
restructuring language, SchemaLog, for facilitating interoperability in multidatabase systems. SchemaLog admits a simple syntax and semantics, but
allows for expressing powerful queries and programs in the context of schema
browsing and interoperability. A formal account of SchemaLog’s syntax and semantics can be found in Lakshmanan et al. [1997] SchemaLog can also express
the complex forms of aggregation discussed in this paper. SchemaSQL has been
to a large extent inspired by SchemaLog. Indeed, the logical underpinnings of
SchemaSQL can be found in SchemaLog. However, SchemaSQL is not obtained
by simply “SQL-izing” SchemaLog. There are important differences between the
two languages: (1) SchemaSQL has been designed to be as close as possible to
SQL. In this vein, we have developed the syntax and semantics of SchemaSQL
by extending that of SQL. SchemaLog on the other hand has a syntax based on
logic programming. (2) Answers to SchemaSQL queries come with an associated schema. In SchemaLog, as in other logic programming systems, answers to
queries are simply a set of (tuples of) bindings of variables in the query (unless
explicitly specified using a restructuring rule). (3) The aggregation semantics
of SchemaSQL is based on a ‘merging’ operator. There is no obvious way to simulate merging in SchemaLog. (4) To facilitate an ordinary SQL user to adapt to
SchemaSQL easily, we have designed SchemaSQL without the following features present in SchemaLog—(a) function symbols and (b) explicit access to
tuple-id’s. As demonstrated in this paper, the resulting language is simple, yet
powerful for the interoperability needs in a federation.
8. CONCLUSIONS AND FUTURE WORK
We introduced SchemaSQL, a principled extension of SQL for relational multidatabase systems. In SchemaSQL data and meta-data, that is, database instance and its schema, are treated uniformly, thus making it possible to query
both the contents and the structure of a database. In a multidatabase environment, SchemaSQL provides the means for handling schematic (structural) heterogeneity, that is, similar information represented in different structures. View definition in SchemaSQL makes it possible to define restructuring
views, namely, views that can change the structure of input data in a manner
that exploits the data/meta-data interplay, while preserving the information
content. Data (i.e., attribute values) in one representation can play the role
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
515
of meta-data objects (database name, relation name, and attribute name) in
a restructured view. SchemaSQL also provides novel aggregation capabilities,
akin to some aggregations only available in OLAP systems. In addition to the
usual SQL column aggregations, in SchemaSQL it is possible to aggregate on
a set of columns, individually or collectively, whether the columns are in one
relation or in several. This can be done using a single query. The latter case is
basically aggregation over a “block” of data. The set of columns participating in
the aggregation is determined dynamically, at execution time, and is dependent
on the database instance. Horizontal (i.e., row) aggregation is also possible in
SchemaSQL.
SchemaSQL was introduced specifically in the context of multidatabase interoperability, but it has since found significant usefulness in the context of
single database applications. These applications include database publishing
on the web, query optimization in a data warehouse, and scalable classification
algorithms in data mining, and were briefly reviewed in Section 7.
In this article we presented the syntax of SchemaSQL, developed a formal
semantics for it, and discussed an architecture and algorithms for the implementation of SchemaSQL using existing (SQL) database technology. We also
proved the correctness of the algorithms. Our work opens up many interesting
directions for future work: we just mention a few.
—Implementation of a SchemaSQL system. The architecture for SchemaSQL
implementation described in this paper (Section 5) is a nonintrusive architecture built upon SQL systems. We believe this is the most appropriate
architecture for purposes of multi-database interoperability. Depending on
the intended applications, a wide spectrum of alternative architectures are
possible, similar in spirit to the frameworks studied in Sarawagi et al. [1998].
Further research is needed to design and compare various architectures for
the implementation of SchemaSQL.
The approach in Lakshmanan et al. [1999] for the implementation of
SchemaSQL on a single RDBMS is based upon an extended relational algebra that can be used in a cost-based optimizer. The new operators in this
algebra can be implemented as embedded SQL applications, stored procedures, or user defined functions. When a SchemaSQL query is rewritten into
this algebra, it is possible for some operands to be unknown at compile time.
An investigation into dynamic query optimization for SchemaSQL is thus an
attractive topic for future work.
—Extending SchemaSQL to handle semantic heterogeneity. We laid the
groundwork for enhancing SchemaSQL with semantic heterogeneity reconciliation capability in Section 6.2. Many issues merit further investigation.
For example, how to handle semantic heterogeneity in SchemaSQL in a manner that is transparent to the user? How to optimize queries when semantic
reconciliation is also involved?
—Information capacity. The notions of information capacity and equivalence
for classes of schemas involving data/meta-data interplay need to be rigorously studied. A related question is characterizing restructuring views that
preserve information content.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
516
•
L. V. S. Lakshmanan et al.
—Query containment and answering using views. The issue of query containment, studied extensively for SQL, is the basis for query optimization as well
as for query answering using materialized views. The restructuring view
capability of SchemaSQL opens a whole new spectrum of optimization opportunities with possible applications in conventional relational databases
and data warehouses. What can we say about containment of SchemaSQL
queries?
—XML. The rich capabilities of SchemaSQL for restructuring among various
representations of reations suggest it may be a convenient tool for mapping
between alternative storage structures for XML data (e.g., see Carey et al.
[2000], Florescu and Kossmann [1999], and Shanmugasundaram et al. [1999;
2000]). In addition, it appears SchemaSQL has the potential for use as a
language for mapping between the flat relational representation and the
hierarchical representation of XML data. This is an interesting topic with
substantial applications.
REFERENCES
ACM. 1990. ACM Computing Surveys 22, 3 (Sept.). Special issue on HDBS.
AGARWAL, S., AGRAWAL, R., DESHPANDE, P., GUPTA, A., NAUGHTON, J. F., RAMAKRISHNAN, R., AND SARAWAGI,
S. 1996. On the computation of multidimensional aggregates. In VLDB’96, Proceedings of the
22th International Conference on Very Large Data Bases, T. M. Vijayaraman, A. P. Buchmann,
C. Mohan, and N. L. Sarda Eds. (Mumbai (Bombay), India, Sept. 3–6). Morgan-Kaufmann, San
Mateo, Calif., pp. 506–521.
AGRAWAL, R., SOMANI, A., AND XU, Y. 2001. Storage and querying of e-commerce data. In Proceedings of the 27th International Conference on Very Large Databases, pp. 149–158.
AHMED, R., SMEDT, P., DU, W., KENT, W., KETABCHI, A., AND LITWIN, W. 1991. The pegasus heterogeneous multidatabase system. IEEE Comput. 24, 12 (Dec.), 19–27.
BANCILHON, F. AND RAMAKRISHNAN, R. 1986. An amateur’s introduction to recursive queryprocessing strategies. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 16–52.
BEECH, D. 1993. Collections of objects in SQL3. In Proceedings of the International Conference
on Very Large Database. pp. 244–255.
BERGAMASCHI, S., CASTANO, S., AND VINCINI, M. 1999. Semantic integration of semistructured and
structured data sources. SIGMOD Record 28, 1, 54–59.
CALVANESE, D., GIACOMO, G. D., LENZERINI, M., NARDI, D., AND ROSATI, R. 1998. Information integration: Conceptual modeling and reasoning support. In Proceedings of the 3rd IFCIS International
Conference on Cooperative Information Systems (New York, New York, Aug. 20–22). Sponsored
by IFCIS, The International Foundation on Cooperative Information Systems. IEEE-CS Press,
Los Alamitos, Calif., pp. 280–291.
CAREY, M. J., KIERNAN, J., SHANMUGASUNDARAM, J., SHEKITA, E. J., AND SUBRAMANIAN, S. N. 2000.
Xperanto: Middleware for publishing object-relational data as XML documents. In VLDB 2000,
Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt, Sept.
10–14). A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y.
Whang, Eds. Morgan-Kaufmann, San Mateo, Calif., pp. 646–648.
CASTANO, S. AND ANTONELLIS, V. D. 1997. Semantic dictionary design for database interoperability. In Proceedings of the 13th International
Conference on Data Engineering (Birmingham,
❛
U.K., Apr. 7–11). A. Gray and P.-A. Larson, Eds. IEEE Computer Society, Los Alamitos, Calif.,
pp. 43–54.
CHOMICKI, J. AND LITWIN, W. 1993. Declarative definition of object-oriented multidatabase mappings. In Distributed Object Management. M. T. Ozsu, U. Dayal, and P. Valduriez, Eds. MorganKaufmann, Los Altos, Calif.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
517
CODD, E. F., CODD, S. B., AND SALLEY, C. T. 1995. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. White paper—URL:http://www.arborsoft.com/papers/
coddTOC.html.
DAVIS, K. B. AND SADRI, F. 2001. Optimization of SchemaSQL queries. In Proceedings of International Database Engineering and Applications (IDEAS). pp. 111–116.
ELMAGARMID, A., RUSINKIEWICZ, M., AND SHETH, A. EDS. 1998. Management of Heterogeneous and
Autonomous Database Systems. Morgan-Kaufmann, San Mateo, Calif.
FLORESCU, D. AND KOSSMANN, D. 1999. Storing and querying XML data using an RDMBS. IEEE
Data Eng. Bull. 22, 3, 27–34.
GARCIA-MOLINA, H., PAPAKONSTANTINOU, Y., QUASS, D., RAJARAMAN, A., SAGIV, Y., ULLMAN, J. D., VASSALOS,
V., AND WIDOM, J. 1997. The TSIMMIS approach to mediation: Data models and languages.
J. Int. Inf. Syst. 8, 2, 117–132.
GINGRAS, F. 1997. Extending SchemaSQL towards multidimensional databases and OLAP. Master’s dissertation, Dept. Computer Science. Concordia Univ., Montreal, Que., Canada.
GINGRAS, F. AND LAKSHMANAN, L. V. S. 1998. nD-SQL: A multi-dimensional language for interoperability and OLAP. In VLDB’98, Proceedings of 24rd International Conference on Very Large Data
Bases (New York, New York, Aug. 24–27). A. Gupta, O. Shmueli, and J. Widom, Eds. MorganKaufmann, San Mateo, Calif., pp. 134–145.
GINGRAS, F., LAKSHMANAN, L. V. S., SUBRAMANIAN, I. N., PAPOULIS, D., AND SHIRI, N. 1997. Languages
for multi-database interoperability. In SIGMOD 1997, Proceedings ACM SIGMOD International
Conference on Management of Data (Tucson, Az., May 13–15). J. Peckham, Ed. ACM, New York,
pp. 536–538.
GRANT, J., LITWIN, W., ROUSSOPOULOS, N., AND SELLIS, T. 1993. Query languages for relational multidatabases. VLDB J. 2, 2, 153–171.
GRAY, J., BOSWORTH, A., LAYMAN, A., AND PIRAHESH, H. 1996. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-totals. In Proceedings of the International
Conference on Data Engineering. pp. 152–159.
GYSSENS, M., LAKSHMANAN, L. V. S., AND SUBRAMANIAN, I. N. 1996. Tables as a paradigm for querying
and restructuring. In Proceedings of the ACM Symposium on Principles of Database Systems
(PODS) (June). ACM, New York, pp. 93–103.
HAMMER, J. AND MCLEOD, D. 1993. An approach to resolving semantic heterogeneity in a federation
of autonomous, heterogeneous database systems. Int. J. Intell. Coop. Inf. Syst. 2, 1, 51–83.
HSIAO, D. K. 1992. Federated databases and systems: Part one – A tutorial on their data sharing.
VLDB J. 1, 127–179.
IBM. DB2 datajoiner. http://www.software.ibm.com/data/datajoiner.
KELLEY, W., GALA, S. K., KIM, W., REYES, T. C., AND GRAHAM, B. 1995. Schema architecture of
the UniSQL/M multidatabase system. In Modern Database Systems. Addison-Wesley, Reading,
Mass.
KIFER, M., KIM, W., AND SAGIV, Y. 1992. Querying object-oriented databases. In Proceedings of the
ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 393–402.
KIFER, M., LAUSEN, G., AND WU, J. 1995. Logical foundations for object-oriented and frame-based
languages. J. ACM 42, 4, 741–843.
KIM, W., CHOI, I., GALA, S. K., AND SCHEEVEL, M. 1993. On resolving schematic heterogeneity in
multidatabase systems. Dist. Parall. Datab. 1, 3, 251–279.
KLUG, A. C. 1982. Equivalence of relational algebra and relational calculus query languages
having aggregate functions. J. ACM 29, 3, 699–717.
KRISHNAMURTHY, R., LITWIN, W., AND KENT, W. 1991. Language features for interoperability of
databases with schematic discrepancies. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 40–49.
KRISHNAMURTHY, R. AND NAQVI, S. 1988. Towards a real Horn clause language. In Proceedings of
the 14th VLDB Conference, pp. 252–263.
KRISHNAMURTHY, R. AND ZLOOF, M. M. 1995. RBE: Rendering by example. In Proceedings of the 11th
International Conference on Data Engineering (Taipei, Taiwan, Mar. 6–10). P. S. Yu and A. L. P.
Chen, Eds. IEEE Computer Society Press, Los Alamitos, Calif., pp. 288–297.
LAKSHMANAN, L. V. S., SADRI, F., AND SUBRAMANIAN, I. N. 1993. On the logical foundations of
schema integration and evolution in heterogeneous database systems. In Proceedings of the 3rd
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
518
•
L. V. S. Lakshmanan et al.
International Conference on Deductive and Object-Oriented Databases (DOOD ’93) (Dec.). Lecture
Notes in Computer Science, Vol. 760. Springer-Verlag, New York, pp. 81–100.
LAKSHMANAN, L. V. S., SADRI, F., AND SUBRAMANIAN, I. N. 1997. Logic and algebraic languages for
interoperability in multidatabase systems. J. Logic prog. 33, 2 (Nov.), 101–149.
LAKSHMANAN, L. V. S., SADRI, F., AND SUBRAMANIAN, S. N. 1999. On efficiently implementing
SchemaSQL on a SQL database system. In Proceedings of International Conference on Very
Large Databases. pp. 471–482.
LEFEBVRE, A., BERNUS, P., AND TOPOR, R. 1992. Query transformation for accessing heterogeneous
databases. In Workshop on Deductive Databases in conjunction with JICSLP (Nov.), pp. 31–40.
LIPTON, R. AND NAUGHTON, J. 1990. Query size estimation by adaptive sampling. In Proceedings
of the ACM Symposium on Principles of Database Systems (PODS). ACM, New York.
LIPTON, R., NAUGHTON, J., AND SCHNEIDER, D. 1990. Practical selectivity estimation through adaptive sampling. In Proceedings of the ACM SIGMOD. ACM, New York.
LITWIN, W., ABDELLATIF, A., ZEROUAL, A., AND NICOLAS, B. 1989. MSQL: A multidatabase language.
Inf. Sci. 49, 50–101.
LITWIN, W., MARK, L., AND ROUSSOPOULOS, N. 1990. Interoperability of multiple autonomous
databases. ACM Comput. Surv. 22, 3 (Sept.), 267–293.
MEO-EVOLI, L., RICCI, F. L., AND SHOSHANI, A. 1992. On the semantic completeness of macro-data
operators for statistical aggregation. In Proceedings of the International Conference on Scientific
and Statistical Database Management. pp. 239–258.
MILLER, R. J. 1998. Using schematically heterogeneous structures. In SIGMOD 1998, Proceedings of the ACM SIGMOD International Conference on Management of Data (Seattle, Wash., June
2–4). L. M. Haas and A. Tiwari, Eds. ACM, New York, pp. 189–200.
MILLER, R. J., IOANNIDIS, Y. E., AND RAMAKRISHNAN, R. 1993. The use of information capacity in
schema integration and translation. In Proceedings of the 19th International Conference on Very
Large Data Bases (Dublin, Ireland, Aug. 24–27). R. Agrawal, S. Baker, and D. A. Bell, Eds.
Morgan-Kaufmann, San Mateo, Calif., pp. 120–133.
MILLER, R. J., TSATALOS, O. G., AND WILLIAMS, J. H. 1997. Dataweb: Customizable database publishing for the web. IEEE MultiMed. 4, 4, 14–21.
MISSIER, P. AND RUSINKIEWICZ, M. 1995. Extending a multidatabase manipulation language to
resolve schema and data conflicts. In Proceedings of the 6th IFIP TC-2 Working Conference on
Data Semantics (DS-6) (Atlanta, Ga., May).
MUMICK, I. S. AND ROSS, K. A. 1993. Noodle: A language for declarative querying in object-oriented
database. In Proceedings of the 3rd International Conference on Deductive and Object-Oriented
Databases (DOOD’93) (Dec.). Lecture Notes in Computer Science, Vol. 760, Springer-Verlag, New
York.
OZSOYOGLU, G., MATOS, V., AND OZSOYOGLU, Z. M. 1989. Query processing techniques in the
summary-table-by-example query language. ACM Trans. Datab. Syst. 14, 4, 526–573.
OZSOYOGLU, G., OZSOYOGLU, Z. M., AND MATA, F. 1985. A language and a physical organization
technique for summary tables. In Proceedings of ACM SIGMOD International Conference on
Management of Data. ACM, New York, pp. 3–16.
PANTI, M., SPALAZZI, L., AND GIRETTI, A. 2000. A case-based approach to information integration. In
VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases (Cairo, Egypt,
Sept. 10–14). A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter,
and K.-Y. Whang, Eds. Morgan-Kaufmann, Reading, Mass., pp. 557–565.
ROSS, K. 1992. Relations with relation names as arguments: Algebra and calculus. In Proceedings
of the 11th Annual ACM Symposium on Principles of Database Systems (June). ACM, New York,
pp. 346–353.
SADRI, F. AND WILSON, S. B. 1997. Implementation of SchemaSQL—A language for relational
multi-database systems. Manuscript, www.uncg.edu/∼sadrif/papers.html.
SARAWAGI, S., THOMAS, S., AND AGRAWAL, R. 1998. Integrating mining with relational database
systems: Alternatives and implications. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data (Seattle, Wash., June 2–4). L. M. Haas and A. Tiwari,
Eds. ACM, New York, pp. 343–354.
SCIORE, E., SIEGEL, M., AND ROSENTHAL, A. 1994. Using semantic values to facilitate interoperability among heterogeneous information systems. ACM Trans. Datab. Syst. 19, 2 (June), 254–290.
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.
SchemaSQL
•
519
SHANMUGASUNDARAM, J., SHEKITA, E. J., BARR, R., CAREY, M. J., LINDSAY, B. G., PIRAHESH, H., AND
REINWALD, B. 2000. Efficiently publishing relational data as XML documents. In VLDB 2000,
Proceedings of 26th International Conference on Very Large Data Bases (Cairo, Egypt, Sept. 10–
14). A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y.
Whang, Eds. Morgan-Kaufmann, San Mateo, Calif., pp. 65–76.
SHANMUGASUNDARAM, J., TUFTE, K., ZHANG, C., HE, G., DEWITT, D. J., AND NAUGHTON, J. F. 1999. Relational databases for querying XML documents: Limitations and opportunities. In VLDB’99,
Proceedings of 25th International Conference on Very Large Data Bases (Edinburgh, Scotland,
U.K., Sept. 7–10). M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie,
Eds. Morgan-Kaufmann, San Mateo, Calif., pp. 302–314.
SHETH, A., ED. December, 1991. Semantic Issues in Multidatabase Systems. SIGMOD Record
20, 4.
SHETH, A. P. AND LARSON, J. A. 1990. Federated database system for managing distributed, heterogeneous and autonomous databases. ACM Comput. Surv. 22, 3 (Sept.), 183–236.
SHOSHANI, A. 1997. OLAP and statistical databases: Similarities and differences. In Proceedings
of ACM Symposium on Principles of Database Systems. ACM, New York, pp. 185–196.
SQL STANDARDS HOME PAGE. 1996. SQL 3 articles and publications. URL: www.jcc.com/
sql articles.html.
SUBRAMANIAN, S. N. AND VENKATARAMAN, S. 1998. Query optimization using restructuring views.
IBM Internal Report.
WANG, M., IYER, B., AND VITTER, J. S. 1998. Scalable mining for classification rules in relational
databases. In Proceedings of International Database Engineering and Applications (IDEAS).
Received December 1999; revised September 2001; accepted August 2000
ACM Transactions on Database Systems, Vol. 26, No. 4, December 2001.