Data Integration Through Database Federation

Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/220354241

Data Integration Through Database Federation

Article in IBM Systems Journal · January 2002


DOI: 10.1147/sj.414.0578 · Source: DBLP

CITATIONS READS

158 2,328

3 authors, including:

Laura M. Haas Eileen Lin


IBM IBM
136 PUBLICATIONS 8,149 CITATIONS 8 PUBLICATIONS 320 CITATIONS

SEE PROFILE SEE PROFILE

All content following this page was uploaded by Laura M. Haas on 03 December 2014.

The user has requested enhancement of the downloaded file.


Data integration
through database
federation
by L. M. Haas
E. T. Lin
M. A. Roth

In a large modern enterprise, it is almost In the finance industry, mergers are an almost com-
inevitable that different parts of the monplace occurrence. The company resulting from
organization will use different systems to a merger inherits the data stores of the original in-
produce, store, and search their critical data. stitutions. Many of those stores will be relational da-
Yet, it is only by combining the information tabase management systems, but often from differ-
from these various systems that the ent manufacturers. One company may have used
enterprise can realize the full value of the data primarily Sybase, Inc. products, whereas another may
they contain. Database federation is one have used products from Informix Software, Inc.
approach to data integration in which They may both have had one or more document
middleware, consisting of a relational management systems, such as Documentum 1 or IBM
Content Manager, 2 for storing text documents. Each
database management system, provides
may have had applications that compute important
uniform access to a number of heterogeneous
information (e.g., the risk of granting a loan to a given
data sources. In this paper, we describe the customer) or that mine for information about cus-
basics of database federation, introduce tomers’ buying patterns.
several styles of database federation, and
outline the conditions under which each style After the merger, the new company needs to be able
of federation should be used. We discuss the to access the customer information from both sets
benefits of an information integration solution of data stores, to analyze its new portfolio using ex-
based on database technology, and we isting and new applications, and, in general, to use
demonstrate the utility of the database the combined resources of both institutions through
federation approach through a number of a common interface. The company needs to be able
usage scenarios involving IBM’s DB2 product. to identify common customers and consolidate their
accounts, even though the customer data may be
stored in different databases and in different formats.
In addition, the company must be able to combine
the legacy data with new data available on the Web
In a large modern enterprise, it is inevitable that dif- or from its business partners. These are all aspects
ferent parts of the organization will use different sys- of data integration, 3 and all pose hefty challenges.
tems to produce, store, and search their critical data.
This diversity of data sources is caused by many fac-
娀Copyright 2002 by International Business Machines Corpora-
tors including lack of coordination among company tion. Copying in printed form for private use is permitted with-
units, different rates of adopting new technology, out payment of royalty provided that (1) each reproduction is done
mergers and acquisitions, and geographic separation without alteration and (2) the Journal reference and IBM copy-
of collaborating groups. Yet, it is only by combining right notice are included on the first page. The title and abstract,
but no other portions, of this paper may be copied or distributed
the information from these various systems that the royalty free without further permission by computer-based and
enterprise can realize the full value of the data they other information-service systems. Permission to republish any
contain. other portion of this paper must be obtained from the Editor.

578 HAAS, LIN, AND ROTH 0018-8670/02/$5.00 © 2002 IBM IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
Sometimes the need for data integration may be as ious sources is needed, the application developer
simple as getting a reading from a sensor (say, the must provide that code.
temperature at some stage in a manufacturing pro-
cess), and comparing it to a baseline value derived Similarly, workflow systems 7 provide application de-
from historical data stored in a database. Or per- velopers with a higher-level abstraction against which
haps data from a spreadsheet are to be compared to program, but using a more process-oriented
with data from several databases containing exper- model. Again, these systems provide some protec-
imental results, or perhaps one has to find custom- tion to the application against changes in the envi-
ers with large credit card balances whose income is ronment, and they additionally provide support for
less than some threshold. The need for data inte- routine results from one source to other sources.
gration is ubiquitous in today’s business world. However, they still provide only limited help for com-
paring and manipulating data.
There are many mechanisms for integrating data.
These include application-specific solutions, applica- Digital libraries, which collect results from multiple
tion-integration frameworks, workflow (or business different data sources in response to a user’s request,
process integration) frameworks, digital libraries with represent another style of data integration. For ex-
portal-style or meta-search-engine integration, data ample, Stanford’s InfoBus architecture 8 provides a
warehousing, and database federation. We discuss bibliographic search service that looks through sev-
each briefly. eral reference repositories and returns a single list
of results. Usually the sources are similar in func-
tion (e.g., all store text documents, bibliographies,
Perhaps the most common means of data integra- URLs [uniform resource locators], images, etc.). For
tion is via special-purpose applications that access most of these “meta search engines,” no combina-
sources of interest directly and combine the data re- tion of results is done, except perhaps for normal-
trieved from those sources with the application it- ization of relevance scores or formats—all are re-
self. This approach always works, but it is expensive turned as part of one list, possibly sorted by estimated
(in terms of both time and skills), fragile (changes relevance. Similarly, portals provide a means of col-
to the underlying sources may all too easily break lecting information—perhaps from quite different
the application), and hard to extend (a new data data sources—and putting the results together for
source requires new code to be written). the user to see in conjunction. Results from the dif-
ferent searches are typically displayed in different
Application-integration frameworks 4,5 and compo- areas of the interface, although portal frameworks
nent-based frameworks 6 provide a more promising may allow application code to be written to combine
approach. Such systems typically employ a standard the results in some other way. Portals are often built
data or programming model, such as CORBA** on top of some sort of application-integration frame-
(Common Object Request Broker Architecture), work, providing a simple and amazingly powerful way
J2EE** (Java 2 Platform, Enterprise Edition), and to quickly get access to a variety of information. IBM’s
so on. They provide well-defined interfaces to the Enterprise Information Portal, for example, can re-
application for accessing data and other applications trieve various types of data from a range of repos-
and for adding new data sources, typically by writ- itories, including text search engines, document man-
ing an adaptor for the data source that meets the agement systems, image stores, and relational
framework’s adaptor interface. These frameworks databases. 9 Yet again, however, the application de-
protect the application somewhat from changes in veloper must write code to do any more sophisticated
the data sources (if the source changes, the adaptor analysis, comparison, or integration that is required.
may have to change, but the application may never
see it). Adding a new source may be easier (although Data warehouses and database federation, by con-
a new adaptor may need to be written, the change trast, provide users with a powerful, high-level query
is more isolated, and the adaptor may already exist language that can be used to combine, contrast, an-
and be available for purchase). The application pro- alyze, and otherwise manipulate their data. Tech-
grammer is not required to have detailed systems nology for optimizing queries ensures that they are
knowledge, so applications will typically be easier to answered efficiently, even though the queries are
write. However, such systems do not necessarily ad- posed nonprocedurally, greatly easing application
dress data integration issues; if combination, anal- development. A data warehouse is built by loading
ysis, or comparison of the data received from the var- data from one or more data sources into a newly de-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 579
fined schema in a relational database. The data are timal across both local tables and the federated data
often cleansed and transformed in the load process. sources involved in the query.
Changes in the underlying sources may cause
changes to the load process, but the part of the ap- Note that in some cases, a query issued to a feder-
plication that deals with data analysis is protected. ated server with a sophisticated query processing en-
New data sources may introduce changes to the gine may actually outperform the same query issued
schema, requiring that a new load process for the directly to the data source itself. During query re-
new data be defined. SQL (Structured Query Lan- write phase, a user’s query can be rewritten in such
guage) views can further protect the application from a way that the federated data source can choose a
such evolutions. However, any functions of the data more efficient execution plan than it would if it were
source that are not a standard part of a relational given the user’s original query.
database management system must be re-imple-
mented in the warehouse or as part of the applica-
tion. The goal of this paper is to demonstrate that data-
base federation is a fundamental tool for data in-
tegration. The rest of the paper is organized as fol-
A solution based on warehousing alone may not be lows. In the next section, we discuss database
possible or cost effective for various reasons. For ex- federation in greater detail and describe the archi-
ample, it is not always feasible to move data from tecture for DB2* (Database 2*) database federation.
their original location to a data warehouse, and as There are several styles of database federation, and
described above, warehousing comes with its own these are introduced in the following section, in
maintenance and implementation costs. Database which we also discuss when each style of federation
federation (or mediation as it is sometimes referred
should be used. In the next section, we discuss the
to in the literature) provides users with a virtual data
advantages of building an integration solution on da-
warehouse, without necessarily moving any of the
tabase technology, and in the section that follows we
data. Thus, in addition to the benefits of a warehouse,
demonstrate the utility of this approach through a
it can provide access to “live” data and functions. A
number of usage scenarios. We conclude with a sum-
single arbitrarily complex query can efficiently com-
mary and some thoughts on future work.
bine data from multiple sources of different types,
even if those sources themselves do not possess all
the functionality needed to answer such a query. In The basics of database federation
other words, a federated database system can op-
timize queries and compensate for SQL function that The term “database federation” refers to an archi-
may be lacking in a data source. Additionally, que- tecture in which middleware, consisting of a rela-
ries can exploit the specialized functions of a data tional database management system, provides uni-
source, so no functionality is lost in accessing the form access to a number of heterogeneous data
source through the federated system. sources. The data sources are federated, that is, they
are linked together into a unified system by the da-
Of course, database federation adds another com- tabase management system. The system shields its
ponent between the client application and the data, users from the need to know what the sources are,
and this extra layer introduces performance trade- what hardware and software they run on, how they
offs. Indeed, one should not expect that introducing are accessed (via what programming interface or lan-
a new layer would reduce access time for any single guage), and even how the data stored in these sources
federated data source. The key performance advan- are modeled and managed. Instead, a database fed-
tage offered by database federation is the ability to eration looks to the application developer like a sin-
efficiently combine data from multiple sources in a gle database management system. The user can
single SQL statement. Two components of the fed- search for information and manipulate data using
erated server contribute to this: query rewrite and the full power of the SQL language. A single query
cost-based optimization. 10 –12 The query rewrite com- may access data from multiple sources, joining and
ponent of a federated server can aggressively rewrite restricting, aggregating and analyzing the data at will.
a user’s query into a semantically equivalent form Yet the sources may not be database systems at all,
that can be more efficiently executed. A cost-based but in fact could be anything from sensors to flat files
optimizer can search a large space of access strat- to application programs to XML (Extensible Markup
egies and choose a global execution plan that is op- Language), and so on.

580 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
Many research projects and a few commercial sys- ation that can be exploited for data integration. Dis-
tems have implemented and evolved the concept of coveryLink*, 27 for example, is an IBM solution that
database federation. Pioneering research projects in- applies DB2 technology to integrate data in the life
cluded TSIMMIS 13 and HERMES, 14 which used data- sciences.
base concepts to implement “mediators,” special-
purpose query engines that use nonprocedural IBM’s database federation offerings have pursued sev-
specifications to integrate specific data sources. eral goals. The most basic goal is transparency, which
DISCO 15 and Pegasus 16 were closer in feel to true da- requires masking from the user the differences, id-
tabase federation. DISCO focused on a design scal- iosyncrasies, and implementations of the underlying
ing up to many heterogeneous data sources. Pega- data sources. This allows applications to be written
sus had its own data model and query language, and as if all the data were in a single database, although,
it had a functional object model that allowed great in fact, the data may be stored in a heterogeneous
flexibility for modeling data sources. Each created collection of data sources. A second key goal is to
its own database management system from scratch. support heterogeneity, or the ability to accommodate
Garlic 17 was the first research project to exploit the a broad range of data sources, without restriction of
full power of a standard relational database (DB2 18 ), hardware, software, data model, interface, or pro-
although it extended the language and data model tocols. A high degree of function provides users with
to support some object-oriented features. The wrap- the best of both worlds: not only rich, standard-com-
per architecture 19 and cross-source query optimiza- pliant DB2 SQL capabilities against all the data in the
tion 20 of Garlic are now fundamental components federation, but also the ability to exploit functions
of IBM’s federated database offerings. of the data sources that the federated engine may
lack.
Meanwhile, in the commercial world, the first steps
toward federation involved “gateways.” A gateway The fourth goal is extensibility, the ability to add new
allows a database management system to route a data sources dynamically in order to meet the chang-
query to another database system. Initially, gateway ing needs of the business. Openness is an important
products only allowed access to homogeneous corollary; DB2 products support the appropriate stan-
sources, often from the same manufacturer, and que- dards 28,29 in order to ensure the federation can be
ries could reference data in only one data source. extended with standard-compliant components. Fur-
Over time, more elaborate gateway products ap- ther, joining a federation should not compromise the
peared, such as iWay Software’s EDA/SQL Gate- autonomy of individual data sources. Existing appli-
ways. 21 These products allowed access to heteroge- cations run unchanged; data are neither moved nor
neous relational systems and to nonrelational systems modified; interfaces remain the same. Last but not
that provided an Open Database Connectivity least, DB2 database federation technology optimizes
(ODBC) driver. 22 DB2 DataJoiner* 23 was the first com- queries for performance. Relational queries are non-
mercial system to really use database federation. It procedural. When a relational query spans multiple
provided a full database engine with the ability to data sources, making the wrong decisions about how
combine data from multiple heterogeneous rela- to execute it can be costly in terms of resources.
tional sources in a single query, and to optimize that Hence, the DB2 federated engine extends the capa-
query for good performance. Microsoft’s Access 24 bilities of a traditional optimizer 12 to handle que-
allowed access to a larger set of data sources, but ries involving federated data sources.
was geared toward smaller applications, as opposed
to the mission-critical large enterprise applications DB2 architecture for database federation. In the re-
that DataJoiner supported. mainder of the paper, we focus on the database fed-
eration capabilities of DB2. Figure 1 depicts the over-
DataJoiner and Garlic “grew up” together, with all architecture of a DB2 database federation. Users
DataJoiner focused on robust and efficient queries access the federation via any DB2 interface (CLI [Call
over a limited range of mostly relational sources, Level Interface], ODBC, JDBC** [Java** Database
whereas Garlic focused on extensibility to a much Connectivity], etc.). The key component in this ar-
more heterogeneous set of sources. Today, the best chitecture is the DB2 federated engine itself. This en-
ideas from both projects have been incorporated into gine consists of a Starburst-style query processor 30
DB2, 25 which also enabled the use of user-defined that compiles and optimizes SQL queries, a run-time
functions to “federate” simple data sources. 26 DB2 interpreter for driving the query execution, a data
thus supports a very rich notion of database feder- manager that controls a local store, and several ex-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 581
Figure 1 DB2 architecture for database federation

DB2
QUERY COMPILER

RUN TIME

DATA MANAGER
WRAPPERS
UDFs

EXTERNAL DATA
AND FUNCTIONS

LOCAL TABLES EXTERNAL DATA SOURCES

tension mechanisms that allow access to external DB2 styles of federation


data, including user-defined functions and wrappers.
As shown in Figure 2, the federated architecture of
DB2 offers a range of alternatives for federation, from
A user-defined function (UDF) is a routine written the simplicity of a scalar user-defined function to the
by an application developer and registered with the flexibility of the DB2 wrapper architecture. The fig-
database. User-defined functions may take input pa- ure illustrates that a natural trade-off exists between
rameters and return either a scalar result, or a table the level of federation and the effort involved to
of data. They can be used to combine functions al- achieve federation. In this section we describe points
ready supported by the database, or to provide ac- along this range and provide a decision tree that helps
cess to a specific function provided by an external to illustrate which type of federation is most appro-
source. priate for a given integration scenario.
A wrapper mediates between the DB2 engine and one Scalar UDFs: Federating function. The simplest
or more data sources, mapping the data model of form of a UDF is a scalar function, which can take
the sources to the DB2 data model and transforming data from the surrounding SQL statement as input
DB2 data operations into requests that the sources and produce a single scalar result. Scalar UDFs pro-
can handle. DB2 provides wrappers for many pop- vide a way to federate function with data in DB2, com-
ular data sources, such as Informix, Oracle, and bining data from one source with a function provided
MS SQL server, 25 as well as wrappers for specialized by another in a single statement. UDFs can provide
data sources such as Documentum and BLAST. 31 In two major benefits to client applications: (1) a sim-
addition, there is a toolkit for customers and third- pler programming model and (2) greater perfor-
party vendors to build wrappers for their own mance, achieved by reducing network traffic and path
sources. length between the calling application, data, and
function.
Clearly, there are different styles of federation avail-
able with this architecture. In the next section, we For example, the user-defined function db2mq.mqsend( )
explore each of these styles in greater detail and give ships with DB2 and allows a user to send a message
guidelines for when each style is appropriate. to an MQSeries 32 queue. By invoking this function

582 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
within a SELECT statement, data can be moved from Figure 2 Different styles of federation
a database table to an MQ queue without the client
application accessing the queue directly:

SELECT db2mq.mqsend(a.headline)
FROM Articles a WRAPPER
WHERE a.article_timestamp ⬎⫽ CURRENT TIMESTAMP

LEVEL OF FEDERATION
Scalar functions can also be used as a simple way to
bring data into the engine. For example, the func-
tion db2mq.mqreceive( ) can be used to pull the next TABLE UDF
message from a queue. Because it is implemented
as a UDF, this function can be used in a complex query
to combine the message with other data from the SCALAR
UDF
federation.

Table UDFs: Federating data. A table function is a IMPLEMENTATION EFFORT

more sophisticated UDF that produces a table as out-


put and can appear wherever a table can be refer-
enced in an SQL statement. A row function is a spe-
cial case of a table function that returns 1 row. A
table function can retrieve a set of data and refor- tion could not be passed the predicate on the col-
mat the data into rows and columns, providing a sim- umn last_modified_date:
ple way to federate external data. Because a table
function can appear anywhere a table can, the full SELECT f.filename, f.author, f.last_modified_date
power of SQL can be applied to the resulting data. FROM TABLE (dir(‘⶿laura⶿papers’, ‘.pdf’)) AS f
Furthermore, views can be created on top of the UDF WHERE f.last_modified_date ⬎ ‘07/04/2002’
in order to make the external data appear even more
like a local table to client applications. Fortunately, the DB2 engine can apply any other
predicates to the result returned by the table func-
For example, a table function called addressbook( ) can tion. In this example, the engine would filter the re-
be used to access an address book stored in a Lotus sults of the table function and return only those mod-
Notes* database that contains sales contact infor- ified more recently than July 2002.
mation. This function can be invoked within a query
that joins the result of the function call with a local Wrappers: Federating function and data. The wrap-
table that contains company profile information to per architecture provides the most powerful and flex-
return sales contacts at financial companies whose ible infrastructure for federation. 33 It provides the
gross revenues are greater than $50 million: means to integrate both function and data by rais-
ing the level of federation from a single function or
SELECT a.first, a.last, a.phone, a.email set of data to that of an entire external data source.
FROM TABLE (addressbook( )) AS a, Company_Profiles c Client applications can transparently access and use
WHERE c.industry ⫽ ‘FINANCIAL’ AND c.revenue ⬎ the full power of the query language on data man-
50,000,000 AND c.name ⫽ a.company_name aged by these sources as though DB2 itself manages
the data.
Although table functions can be used just like ta-
bles, they look more like functions. Table functions For example, consider a set of scientists at a univer-
can take arguments that can be used to restrict the sity working in a drug discovery laboratory. The sci-
data to be returned. For example, a function to get entists store chemical compound data and experi-
information about files on the local disk might take mental results in an Oracle database. The university
one argument that determines the directory to also has access to Lotus Extended Search 34 (LES),
search, and a second that determines what file types a Web search engine that can perform searches
are of interest. But the table function can only filter across multiple Web search sites. The scientists use
using those predicates that it was designed to han- this search engine to retrieve research articles from
dle. For example, in the query below, the dir( ) func- other scientists. Both the Oracle database and the

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 583
search engine can be transparently accessed from DB2 ● Multidata-set integration and multioperation integra-
via wrappers, allowing the scientists to ask a single tion. Table UDFs are well suited to integrate a sin-
query that combines data and functions from both gle external operation and set of data into the fed-
sources. eration. The wrapper architecture expands on this
capability by providing the infrastructure to model
The Oracle wrapper maps the compound data and multiple data sets and multiple operations. Data
experimental results tables stored in Oracle to nick- sets are modeled as nicknames, and nicknames can
names, which can appear wherever a table can ap- appear wherever a table can appear in an SQL state-
pear in an SQL query (for example, the FROM clause). ment. Operations include query, insert, update,
The LES wrapper provides a nickname for a list of and delete. The wrapper participates in the plan-
articles that can be searched. Each article has a ti- ning and execution of these operations, and trans-
tle, subject, and URL. The search functionality itself lates them into the corresponding operations sup-
is mapped as an SQL function that takes two argu- ported by the server.
ments, the section of the article to search (title, sub-
ject, or both), and a set of key words, and returns The multidata-set and multioperation features in-
a score that measures the relevance of the article. troduce the notion of common query framework for
For example, a common task the scientists might un- external sources. The common query framework
dertake would be to find other research reports on is the way in which the wrapper architecture re-
chemical compounds that achieved a certain result alizes the goal of high function. That is, the frame-
in their experiments: work enables (1) the full power of operations ex-
pressible in SQL over the external source’s data,
SELECT c.name, a.URL and (2) specialized functions supported by the ex-
FROM Compounds c, Experiments e, Articles a ternal source to be invoked in a query even though
WHERE e.result ⬍ 1.1e-9 and e.id ⫽ c.id and DB2 does not natively support the function.
search(a.subject, c.name) ⬎ 0
Part 1 of the goal is addressed through function
Because Oracle is itself a relational database, the DB2 compensation. If the external source does not sup-
optimizer can choose a plan in which both the pred- port the full power of DB2’s SQL, client applications
icate on the Experiments table (e.result ⬍ 1.1e-9) and will not suffer any loss of query power over those
the join between the Compounds and Experiments ta- data, because DB2 will automatically compensate
bles (e.id ⫽ c.id) are pushed down to the Oracle da- for any differences between DB2’s capabilities and
tabase to execute, and this results in only those com- those of the external source. So, for example, if a
pounds with the right test results being returned to data source does not support ORDER BY, DB2 will
DB2. DB2 then routes the names of the matching com-
retrieve the data from the source and perform the
pounds to LES, which will retrieve relevant articles sort.
and return their URLs back to DB2.
Part 2 of the goal is addressed through function
DB2 supplies wrappers for a variety of relational and mappings. Function mappings can be used to de-
nonrelational sources, 18 and also provides a toolkit claratively expose functions supported by the ex-
for third-party vendors and customers to develop ternal source through the DB2 query language. For
wrappers for their own data sources. 17,19 The wrap- example, a chemical structure store may have the
per toolkit provides an external interface to the wrap- ability to find structurally similar chemical com-
per architecture, allowing a developer to federate a pounds. Although that function is not present in
new type of data source. The wrapper architecture DB2, it can still be exploited in DB2 queries, as long
enables the following federated features: as there is a mapping that tells DB2 which source
implements the function.
● Multiserver integration. Unlike a UDF, a wrapper can
easily provide access to multiple external sources If the external source implements new functions,
of a particular type. The DB2 engine acts as a traf- the wrapper itself may not require modification,
fic cop, maintaining the state information for each depending on the data source and what the wrap-
server, managing connections, decomposing que- per needs to do to map the function. For sources
ries into fragments that each server can handle, that invoke new functions in a predictable way, the
and managing transactions across servers. database administrator (DBA) need only register

584 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
the new function mappings. They can then be im- Figure 3 Determining the style of federation to use
mediately used in DB2 queries.

● Optimization. Operations to be performed outside WRAPPER


of the DB2 engine may have significant cost impli- YES MULTISERVER
cations for the queries that contain them, and it ACCESS TO
MULTIPLE SERVERS INTEGRATION
is critical for DB2 to be aware of these implications
when choosing a strategy for executing the query. NO

The wrapper architecture includes a flexible frame- TRANSACTIONAL YES TRANSACTION


CONSISTENCY? INTEGRATION
work for wrapper writers to provide input to the
DB2 query optimizer about the cost of their oper- NO

ations and the size of the data that will be returned. MULTIPLE, DISTINCT YES MULTITABLE
This framework includes a dialog between the op- DATA SETS? INTEGRATION

timizer and wrapper at query compilation time, al- NO


lowing the wrapper writer to provide information MULTIPLE, DISTINCT YES
QUERY
based on the immediate context of the query. OPERATIONS? PLANNING
NO
This information is particularly crucial when the SOPHISTICATED YES QUERY
optimizer must consider alternatives for cross- COST MODEL? OPTIMIZATION
source operations, as these types of operations ex- NO
plode the set of possible execution strategies. Ref-
erence 11 provides an example of an image server SCALAR VALUE OR
1/MORE ROWS?
that can support two kinds of image search, one
SCALAR ROWS
of which is substantially more expensive than the
other. Because the wrapper can provide this in- SCALAR ROW/
formation to the DB2 optimizer, the optimizer is UDF TABLE UDF

able to choose a plan that applies the more expen-


sive operation after another predicate has filtered
out much of the data.

● Transactional integration. DB2 acts as a transaction ferent federation alternatives. However, it is usually
manager for operations performed on external true that one alternative is more suitable than the
sources, and the wrapper architecture provides the others for a given problem. Figure 3 presents a de-
infrastructure for wrappers to participate as re- cision tree that helps characterize the style of fed-
source managers in a DB2 transaction. DB2 main- eration most appropriate for a particular integration
tains a list of wrappers that have participated in problem.
the transaction, and it ensures that they are no-
tified at transaction commit or rollback time, giv- The decision tree is based on a series of YES/NO ques-
ing a wrapper an opportunity to invoke the appro- tions. Consistently answering NO to the questions in-
priate routine on the external source. dicates that the integration problem is sufficiently
contained that the best solution is likely to be a UDF.
Determining the style of database federation to use. A YES answer indicates that the integration prob-
To determine which level of federation is appropri- lem exhibits a characteristic that is best handled by
ate for a particular integration problem, it is impor- the wrapper architecture. The decision tree points
tant to characterize the purpose and desired prop- to the feature of the wrapper architecture that ad-
erties of the integration effort. UDFs are easy to dresses that characteristic. Next we consider the
implement, but have limited support for data mod- nodes in the decision tree in more detail.
eling and no support for transactional integration.
Wrappers are powerful, but rely on more advanced 1. Does the integration problem reach out to multiple
capabilities of the external source and require a more data sources of the same type? If so, is there a ben-
advanced skill set to implement. efit to modeling such servers?

Every application is different, and each can be solved If the server concept is key to the integration prob-
via a “small matter of programming” using the dif- lem, then the wrapper architecture is likely to be an

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 585
appropriate solution. For example, a function that includes APIs (application programming interfaces)
retrieves the temperature from an on-line thermom- for the wrapper writer to provide optimization in-
eter does not require the notion of a server, and the formation on an operation-by-operation basis. The
wrapper solution may be overkill. On the other hand, wrapper writer is allowed to examine the operation
suppose the goal is to integrate a set of Lotus Notes to be performed to determine which portions of the
databases. It is possible to write a UDF to access mul- operation the data source can execute. The wrap-
tiple databases; however, the burden is on the UDF per reports this information and provides cost and
developer to take the appropriate information to cardinality estimates based on the operations, pa-
identify the databases as arguments, manage con- rameters, persistent cost information stored in the
nections to the databases, and use the scratchpad to system catalogs, and state information stored in the
store any state information. Furthermore, the life- wrapper. With this cost information the DB2 opti-
time of the scratchpad is within a single statement, mizer can determine an optimal execution plan.
so UDF invocations from separate statements will re-
quire their own connections. For this integration 4. Is there a sophisticated cost model associated with
problem, the multiserver support offered by the the federated operations?
wrapper architecture, including connection manage-
ment, will help to nicely manage access to those da-
tabases. The UDF architecture provides some support for pro-
viding static cost information. If the operation is typ-
2. Is transactional consistency important for the in- ically executed as part of a simple, single-table query,
tegration problem? a table function may be the appropriate vehicle for
federation. However, if the operation is likely to be
The UDF architecture does not support transactional invoked as part of a more complex query, and its po-
consistency, whereas the wrapper architecture offers sition in the query plan may greatly impact the per-
both one-phase and (in a future release) two-phase formance of the query, it may be worthwhile to use
commit support. If transactional consistency is cru- the wrapper architecture to federate this operation,
cial for the integration problem, and the source that just to exploit the flexible costing infrastructure.
has the data or functions needed is able to partic-
ipate as a resource in a transaction, the wrapper ar- If the answer to all of the above questions is no, then
chitecture is the only choice that can provide that a UDF is likely to be an appropriate choice for fed-
level of support. eration. In this case, the final question to ask is
whether the integrated operation should return a sca-
3. Are there multiple, distinct data sets to be accessed? lar value, or a multirow set. If the operation returns
Are there multiple, distinct operations to be feder- a scalar value, the right choice is a scalar UDF, and
ated? if the operation can return a result set, a table UDF
is the appropriate choice.
An operation is a specific external action that can
be performed on the data to be integrated. Exam- If the answer to any of the questions is yes, the wrap-
ples of operations include data retrieval, search, in- per architecture provides at least one valuable fea-
sert, update, and delete. Table UDFs are well suited ture that addresses some aspect of the integration
to integrate a single external operation and set of problem, and it is worthwhile to consider writing a
data into DB2. When there are multiple data sets wrapper to exploit that feature. Note that because
and/or multiple operations involved, the wrapper ar- the architecture is designed to be flexible, it is pos-
chitecture may more easily provide the infrastruc- sible to implement important pieces of a wrapper
ture for modeling those data sets and operations. without writing a complete wrapper. For example,
This is particularly true when there are restrictions if the function being integrated is really a simple sca-
on how multiple operations can be combined, or cost lar UDF but transactional integration is key, it is pos-
implications that depend on how they are combined. sible to implement that function within a simple
When using the wrapper architecture, data sets be- wrapper that exploits the transactional architecture,
come separate nicknames, and flexible query plan- and only minimally implements the data modeling
ning and query execution components allow the and query optimization aspects. On the other hand,
wrapper writer to control the set of operations the if the operation on the data source is a highly so-
wrapper will support, and how those operations will phisticated read-only function with significant cost
be executed. In addition, the wrapper architecture implications, it is possible to write a wrapper that

586 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
focuses on the optimization component and does not mit protocols. These add great value, of course, for
address transaction issues at all. integrating data from several sources, allowing in-
tegrity guarantees to be offered and building users’
Why use a database engine for data trust in the system. Sophisticated integrity con-
integration? straints 41 and active mechanisms for enforcing con-
straints of various kinds 42 may also be extended to
Database federation centers on a relational database protect relationships across federated data. Were we
engine. That engine is an essential ingredient for in- to build a data integration engine on some other plat-
tegrating data, because it significantly raises the level form, all of this would need to be built from scratch.
of abstraction for the application developer. Rather
than deal with the details of how to access individual The local store is a further important aid to appli-
sources, and in what order, the application devel- cations over federated data. Any application deal-
oper gets the benefits of a powerful nonprocedural ing with data, and especially distributed data from
language, support for data integrity, a local store to a variety of sources, needs some place to keep in-
use for persisting data and meta-data as needed, and formation about the data that it uses. The local store
the use (or reuse) of the many tools and applications is a convenient place for these meta-data. Without
already written for “normal” relational systems. In that store, and the automatic cataloging of basic facts
this section, we look at each of these benefits in more about the data, the application developer would be
detail. forced to manually track and ensure the persistence
of such information. A second use for the local store
The virtues of SQL and the relational data model have is to materialize data. Data may need to be mate-
often been extolled since their invention in the late rialized temporarily, as part of query processing, for
1960s and early 1970s. 35–37 Foremost among these example, or for longer—from a short-term cache to
are the simplicity of the model and the nonproc- a persistent copy, or even a long-term store of data
edural nature of the language, which together allow unique to the federated applications.
the application developer to specify what data are
needed, and not how to find the data. The system Materialized views 43 are an advanced feature of SQL
automatically finds a path to the individual relations, that exploits the local store. Originally conceived of
and decides how to combine them in order to return as a way to amortize the cost of aggregating large
the desired result. With the invention of query op- volumes of data over multiple queries, a material-
timization in the late 1970s, 12 systems were able to ized view (called an automatic summary table, or AST,
choose wisely among alternative execution strategies, in DB2) creates a stored version of the query results.
greatly enhancing the performance and expanding When new queries arrive that match the AST’s def-
the range of queries the systems could effectively han- inition, they are automatically “routed” to use the
dle. Over the years, the language has been contin- stored results rather than recomputing them. 44 If the
uously enriched with new constructs. It is now pos- view definition can reference foreign data, then a
sible to write recursive queries, perform statistical stored summary of data from federated sources can
analyses, define virtual tables on the fly, set defaults, be created. When summary tables are considered
take different actions based on the data retrieved, during query planning, DB2 will compare the cost of
and so on. 38 User-defined functions and stored pro- fetching data directly from the local summary table
cedures make it possible to execute much of the ap- to the cost of re-evaluating the remote query and
plication logic in the database, as close as possible pick the cheaper strategy. This is easy and natural
to the data for greatest efficiency. Database feder- in a database federation, but would take an enor-
ation extends these advantages to data that are not mous effort to replicate with other integration ap-
stored in the local store. Now applications requiring proaches, as it depends so heavily on multiple fea-
distributed data can be written as easily as single tures of the database engine approach.
source applications, using all the power of SQL.
Last but not least, using the database engine to in-
Another fundamental aspect of relational systems tegrate data means that many if not all of the tools
is that they protect data integrity. Transaction sup- and applications that have been developed over the
port 39 guarantees an all-or-nothing semantics for up- years for relational databases can be used without
dates. The extension of commit protocols to distrib- modification on distributed data. These include
uted environments 40 is well understood and many query builders such as Cognos 45 and Brio, 46 appli-
database systems implement one or two-phase com- cation development environments such as IBM Web-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 587
Figure 4 A DB2 database federation that includes relational databases

CORPORATE DB2 DATABASE

Past_Sales
AST
National_Transactions
FEDERATED VIEW

sf.Transactions ny.Transactions
NICKNAME NICKNAME

ORACLE
WRAPPER

store_id tran_date tran_id item_id store_id tran_date tran_id item_id

SF ORACLE DATABASE NY ORACLE DATABASE

Sphere* Studio, 47 IBM VisualAge*, 48 and Microsoft bles, OLAP (on-line analytical processing) functions,
Visual C⫹⫹** 49 or WebGain VisualCafé**, 50 report DB2 Spatial Extender, replication, and caching.
generators, visualization tools, and so on. Likewise,
the customer may already have applications devel- Federation of distributed data. A nationwide depart-
oped for a single database that can be reused easily ment store chain has stores in several regions of the
in the federated environment by substituting names country. Each of the stores relies on a relational da-
of remote tables for local ones, or modifying view tabase to maintain inventory records and customer
definitions to make use of remote data. transactions. However, because the department store
has gone through several technology migrations, not
In summary, many of the advantages offered by stan- all of the stores use the same database products. Both
dard relational database systems are equally appli- the San Francisco store and the New York store use
cable to database federation. Thus, the database fed- Oracle databases to record business transactions,
eration approach provides a richly supportive while the corporate headquarters has recently mi-
environment for the task of data integration. In the grated to DB2.
following section, we illustrate some of these ben-
efits through some typical usage scenarios. Each store maintains a Transactions table, which con-
tains an entry for each item scanned during a cus-
DB2 federation usage scenarios tomer transaction. It is easy for individual stores to
generate storewide sales reports using the informa-
In this section, we describe a set of scenarios that tion stored in this table. Figure 4 shows how the cor-
illustrate how database federation can be used to in- porate office can use DB2 technology to generate a
tegrate local and remote data from a variety of sales report across all stores. Because the San Fran-
sources, including relational data, Microsoft Excel cisco and New York offices both use Oracle, the cor-
spreadsheets, XML documents, Web services, and porate office can use the Oracle wrapper provided
message queues. Furthermore, these scenarios dem- with DB2 to access both stores’ databases. Likewise,
onstrate that by using the database as the integra- the corporate office can access other stores’ databases
tion engine, business applications are able to exploit using the wrappers appropriate for those databases.
the full power of SQL over federated data, including Note that the schemas for the individual databases
complex queries and views, automatic summary ta- need not be the same, as long as queries can be for-

588 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
mulated to extract the same information from each whether a given query could be executed over the
store. (local) summary table rather than over the (remote)
nicknames. If so, in addition to plans that send re-
Each store’s database is registered as a server in the mote requests to the regional stores to get the in-
corporate headquarters database, and the tables that formation and compute the result, DB2 will consider
the corporate office needs to access are registered plans to extract the information from the Past_Sales
as nicknames. For example, each store’s Transactions summary table directly. This analysis is done trans-
table is registered as nickname. Once the nicknames parently and does not require changes to the orig-
are in place, a federated view that shows company- inal query.
wide transactions can be defined as follows:
Federation of nonrelational structured data. Figure
CREATE VIEW National_Transactions (store_id, tran_date,
5 shows how the corporate office of the department
tran_id, item_id) AS
store chain can use a DB2 database federation to ac-
SELECT store_id, tran_date, tran_id, item_id
cess nonrelational data sources as well. For exam-
FROM sf.Transactions
ple, the procurement office might like to know the
UNION ALL
manufacturer and the supplier of the best-selling
SELECT store_id, tran_date, tran_id, item_id
television in 2001. Item and supplier information are
FROM ny.Transactions
stored in Excel spreadsheets and can be accessed
from DB2 using the Excel wrapper. The items spread-
Note that if more stores are added to the depart- sheet is mapped to an Items nickname, and the sup-
ment chain, the corporate office does not need to
pliers spreadsheet is mapped to a Suppliers nickname.
modify its business application. Rather, the
Data contained in the spreadsheets can be retrieved
National_Transactions view definition can be updated
using SQL as shown by the following query:
to include the information for the new stores. Given
this view, the corporate office can run a single query
SELECT i.mfg, s.id
to generate a national sales report that shows the
FROM Items i, Suppliers s
total of the number of items sold per month by all
WHERE i.id ⫽ s.id AND i.id ⫽ (SELECT g.id
stores in the company:
FROM (SELECT g.id, COUNT(*), ROWNUMBER( )
OVER (ORDER BY COUNT(*) DESC) AS rownum
SELECT MONTH(tran_date), item_id, COUNT(*)
FROM National_Transactions g, Items it
FROM National_Transactions
WHERE it.cat⫽‘television’ AND g.id ⫽ it.id AND
WHERE YEAR(tran_date)⫽2001
YEAR(tran_date)⫽2001
GROUP BY MONTH(tran_date), item_id
GROUP BY g.id) AS tv_total_2001
WHERE rownum ⫽ 1)
In addition, the corporate office can create an au-
tomatic summary table over the federated view to
cache the transaction information for previous years In the above query, the OLAP function ROWNUMBER
locally, since it is not likely to change. The following is used to order the COUNT(*) in descending order in
statement can be used to create this materialized the nationwide total of sales for every model of tele-
view: vision in the year 2001. The first row is then selected
to find the item identifier (ID) of the most frequently
CREATE TABLE Past_Sales AS (
sold television. This example shows that by exploit-
SELECT YEAR(tran_date) AS year, MONTH(tran_date) AS
ing DB2 database federation technology, the corpo-
month, item_id, COUNT(*) AS sales rate office can use a single (complex) SQL statement
FROM National_Transactions to correlate information among the stores and the
WHERE YEAR(tran_date) ⬍⫽ 2001 corporate office, although the data may be stored and
GROUP BY YEAR(tran_date), MONTH(tran_date), item_id) represented differently at each location.
DATA INITIALLY DEFERRED REFRESH DEFERRED
Federation of semi-structured data. The department
The Past_Sales summary table is locally stored, and store chain also provides on-line shopping for cus-
as a result, indexes can be created over the table and tomers. Customer data are stored in a Customers ta-
statistics can be collected to improve query perfor- ble in the corporate office database, and orders are
mance. In addition, a special register can be set to generated as XML documents by a Web application.
indicate that DB2 should automatically check to see These XML documents can also be accessed from the

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 589
Figure 5 A DB2 database that includes relational and nonrelational data sources

DB2 <order>
<trans_id id=“12AV56BG90”/>
id cat mfg <date date=“04/26/2002”/>
EXCEL XML <customer id=“8899”>
WRAPPER WRAPPER <street street=“2041 Russell”/>
id name phone <city name=“Paper City”/>
<state name=“WI”/>
<zip code=“54496”/>
EXCEL
</customer>
<item id=“5795”>
ORACLE <quantity amount=“1” />
WRAPPER
</item>
<item id=“51766”/>
<quantity amount=“2” />
</item>
</order>
sf.Transactions ny.Transactions
XML DOCUMENT
SF ORACLE DATABASE NY ORACLE DATABASE

corporate database using the XML wrapper to be WHERE o.order ⫽ ‘/home/customers/order5795.xml’ AND
shipped with DB2 Life Sciences Data Connect. 31 o.transaction_id⫽‘12AV56BG90’
ORDER BY distance
XML documents map to nicknames, and elements of
the XML document map to columns of the nickname. Federation of Web services. A small furniture com-
A single XML document can be mapped to multiple pany supplies several nationwide retail stores with
nicknames. An option associated with the nickname its products. The retail stores submit orders for fur-
allows the user to specify the location of the XML niture through a Web application, the furniture com-
document in the query, and an XPATH option on the pany fulfills the order and ships the furniture to the
nickname column definition maps the column to its stores. Since freight shipment represents a signifi-
location in the XML document. For example, the or- cant portion of the production cost, the furniture
der document contains both order and item infor- company contracts with several trucking companies
mation, and as a result, the document is mapped to and puts each freight shipment up for bid. Figure 6
an Orders nickname and an Order_Items nickname, and shows a system configuration using DB2 for the or-
queries involving items from a particular order can der processing system. New orders are placed on an
be expressed as joins of these two nicknames. MQ Series queue. A back-end order processing sys-
tem removes orders from the queue and solicits the
As part of order processing, the Web application bids from various trucking companies for shipment.
must determine and dispatch the order to the dis-
tribution center closest to the customer. The corpo- The order processing system uses a Pending_Orders ta-
rate office can store the geo-coded location of each ble to maintain a list of orders currently being pro-
distribution center in a Distribution_Centers table, geo- cessed. Each order has an auto-generated unique or-
code the customer’s location, and use the DB2 Spa- der number, as well as the XML document that
tial Extender function db2gse.st_distance( ) to identify contains the order information. The furniture com-
the distribution center closest to the customer: pany maintains a private UDDI registry for trucking
companies that support a common Web services in-
SELECT s.store_id, db2gse.st_distance(GEOCODE(o.street, terface to request freight shipment bids. This reg-
o.city,o.state,o.zip), s.location, ‘mile’) AS distance istry is available to the order processing system via
FROM Distribution_Centers s, Orders o a wrapper, and a Freight_Shippers nickname supported

590 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
Figure 6 A DB2 database federation that includes Web services and other sources

DB2 UDF
TRUCKING COMPANY 1
New_Orders

UDF
TRUCKING COMPANY 2

WRAPPER
TRUCKING COMPANY N
MQSERIES QUEUE

WEB SERVICES

Pending_Orders Freight_Shippers
Bids

LOCAL TABLES UDDI REGISTRY

by the wrapper supplies the names and URLs of truck- INSERT INTO Bids
ing companies that support the bid Web service. A SELECT Order.ordernum, s.name, bid(s.url, order.orderxml)
UDF called bid( ) takes the URL of a company and an FROM Freight_Shippers S
XML description of an order, sends a Web services
request to retrieve a bid for the order from the com- This SQL statement illustrates the power of a data-
pany specified by the URL, and returns the compa- base federation solution for integrating data. A sim-
ny’s bid. A Bids table contains a list of bids obtained ple insert statement causes a sophisticated trigger to
for a given order, including the order number, the execute over federated data that are transparently com-
name of the trucking company that supplied the bid, bined via user defined functions (db2mq.mqreceive( ) and
and the bid itself. bid( )) and wrapper-based federation (the Freight_Shippers
nickname).
The furniture company can exploit DB2’s federation
capabilities to automate a significant portion of the
Heterogeneous replication using database feder-
order process. For example, orders can be removed
ation. Many businesses choose to keep multiple cop-
from the MQ Series queue (via the db2mq.mqreceive( )
ies of their data for various uses, including data ware-
UDF), assigned a unique order number, and inserted
housing, fault tolerance, and fail-over scenarios. A
into the Pending_Orders table with the following
major retailer with outlets all over the United States
statement:
backs up data from its various locations to regional
INSERT INTO Pending_Orders
data centers. Due to independent purchasing deci-
VALUES(GENERATE_UNIQUE( ), db2mq.mqreceive( ))
sions, the retail outlets use one relational database
management system, while the data center might use
Furthermore, a trigger defined on the Pending_Orders another. The replication process is relatively straight-
table can automatically kick off the bid process as forward, and involves extracting data from the out-
new orders are inserted into the table: lets’ databases, optionally reshaping the data and/or
aggregating the data, and inserting the data into the
CREATE TRIGGER Get_Bids data center database.
AFTER INSERT ON Pending_Orders
REFERENCING NEW AS order Figure 7 shows two approaches that use DB2 tech-
FOR EACH ROW MODE DB2SQL nology as the extract/transform vehicle to transfer

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 591
Figure 7 Heterogeneous replication using database federation

APPROACH 1 APPROACH 2
DB2 DB2

OUTLET TABLES

LOCAL TABLES

OUTLET TABLES DATA CENTER DATABASE


DATA CENTER DATABASE

data from the outlets to the data center. Approach tiered architecture: the Web server tier, the appli-
1 in Figure 7 shows that if the data center uses a DB2 cation server tier, and the data tier. User requests
database, the data transfer can take place with sim- are routed to one of multiple Web servers, which
ple statements of the form: forward the user requests to one of the application
servers for processing. The application servers in turn
INSERT INTO ⬍data center local_table⬎ retrieve product data from, and insert order infor-
SELECT . . . FROM ⬍outlet nickname⬎. ... mation into, a single back-end database.

Approach 2 in Figure 7 shows that even if the data It is easy to see from the figure that as traffic in-
center uses another relational database product, the creases, the back-end database can quickly become
same INSERT statement can be used to populate the the bottleneck. DBCACHE 52,53 is a research prototype
database, with the only difference being that the fed- built with federation technology that provides scal-
erated database will insert into a nickname instead: ability of the data tier. Each application server node
may include a front-end database server, the “cache”
INSERT INTO ⬍data center nickname⬎ of DBCACHE. DBCACHE allows database administra-
SELECT . . . FROM ⬍outlet nickname⬎. ... tors to replicate portions of the back-end database
across multiple front-end databases, allowing client
Note that in either approach the SELECT statement requests to be routed to the front-end databases. This
can be arbitrarily complex and can be used to se- topology is often less expensive than a single large
lectively retrieve, re-shape, and aggregate data ac- parallel system, and also provides a layer of fault tol-
cording to the semantics of the application. erance; a single database crash does not cause the
entire database to become inoperative.
DB2 DataPropagator* 51 is an IBM product that uses
DB2 federation technology to replicate data. Data- Figure 9 shows an e-commerce application using
Propagator automates the copying of data between DBCACHE technology. Application tables are divided
remote systems, providing automated change prop- into two categories: cached and noncached. Cached
agation, flexible scheduling, and other customization tables are mostly read-only, and accessed by mul-
features. DataPropagator exploits the remote tiple users. For these kinds of tables, maintaining
insert/update/delete capability illustrated above to strict consistency is not necessary, and reading stale
transparently apply changes to all relational sources data is acceptable. From an on-line shopper’s per-
including non-DB2 sources. spective, product catalog data are read-only (query
Q1 in Figure 9, for example). Updates occur infre-
Dynamic data caching. As shown in Figure 8, a typ- quently, usually through a scheduled deployment sce-
ical e-commerce Web application consists of a three- nario (never by an on-line shopper), and it is not a

592 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
Figure 8 Three-tier architecture for an e-commerce application

CLIENT CLIENT CLIENT CLIENT CLIENT CLIENT


APPLICATION APPLICATION APPLICATION APPLICATION APPLICATION APPLICATION

WEB SERVER WEB SERVER WEB SERVER

APPLICATION APPLICATION
SERVER SERVER

DATABASE

critical loss if a shopper sees product information The client application sees a single view of the data.
that is a few seconds out of date. Furthermore, since Queries for product information (query Q1 in Fig-
all shoppers browse the product catalog before mak- ure 9) are dynamically routed to one of the cached
ing a purchase, the product catalog is heavily ac- tables on the front-end database, providing a level
cessed. of load balancing and fault tolerance for these heav-
ily accessed data. Insert statements to the Orders ta-
Noncached tables are typically read/write. For these ble (query Q2 in Figure 9) are routed directly to the
tables, maintaining strict consistency is important, back-end table. Statements can span both cached and
and accessing stale data is intolerable. Order infor- noncached tables. For example, a query to check the
mation is an example of data stored in a noncached status of an order (query Q3 in Figure 9) can join
table. From the perspective of the shopper, order information from the Orders nickname with the
information is mostly write-only (query Q2 in Fig- Products table.
ure 9, for example). It is captured as part of on-line
order processing and read only by back-end order
fulfillment applications and by the shopper to check Conclusions
order status. Because orders result in transfer of In this paper, we have shown that database feder-
money, it is critical that they present an accurate view ation is a powerful tool for integrating data. Data-
of the data. base federation employs a database engine to cre-
ate a virtual database from several, possibly many,
As shown in Figure 9, the Products back-end table is heterogeneous and distributed data stores. We iden-
automatically replicated to a cached table across mul- tified three styles of database federation. In all of
tiple front ends using DataPropagator (Data Prop them, the database engine is the key driver, but the
in Figure 9), according to a schedule defined by the method by which data or functions are included in
DBA. Reads to the Products table are transparently the federation differs. We presented guidelines on
routed to one of the front-end databases, whereas when each style of federation should be used: user-
writes (if any) are routed directly to the back-end defined functions are most appropriate for fairly sim-
database. The Orders back-end table is represented ple integration tasks, whereas the wrapper architec-
as a nickname in the front-end databases, and both ture supports a much broader and more complex set
reads and writes are passed to the back-end data- of tasks. We discussed why database technology is
base using DB2 federated technology. so crucial to data integration, and how many of the

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 593
Figure 9 E-commerce application using DBCASHE

QI Q2 Q3
SELECT o.status, p.name
SELECT p* INSERT INTO Orders
FROM Orders o, Products p
FROM Products p VALUES(GENERATE_UNIQUE( ),
8080,50785,2) WHERE o.customerid = 8080 AND
WHERE p.category = ‘SHOES’
o.item_id = p.item_id

FRONT FRONT
END 1 END 2

Products Orders Orders Products

WRAPPER WRAPPER

DATA DATA
PROP PROP

BACKEND DATABASE

Orders

Products

features of relational databases can be applied in a it is overkill, or for which another of the many ap-
distributed environment to ease the development of proaches to data integration might be easier. How-
new applications that span multiple data sources. Fi- ever, we strongly believe that database federation
nally, we demonstrated through a number of use must be a fundamental part of any integration so-
cases how various database features—including lution. Our future efforts will be directed toward im-
views, ASTs, OLAP functions, joins, unions, and ag- proving the ease of use for this technology. Tools
gregations— can be used in conjunction with mul- are needed to develop robust and efficient wrappers
tiple federation styles to integrate data from sources quickly and with minimal effort. We continue to en-
as diverse as MQ queues, XML documents, Excel hance the performance of the system, improving the
spreadsheets, Web services, and relational database optimization of queries and the set of execution strat-
management systems. We showed how federation egies available to the optimizer. Further challenges
could be used as the basis for report gathering, for in this modern age include being able to handle asyn-
warehouse loading and replication, and even for chrony everywhere, better integration of XML capa-
caching. The diversity of applications for this tech- bilities, and including native support for XML query.
nology is reflective of the many ways in which data As customers exercise the technology, we are find-
must be integrated, and the applicability of database ing that richer support for semantic meta-data would
federation to the broad range of challenges demon- be helpful, as well as more automation in the ad-
strates its importance for data integration. ministration of the federated system. With enhance-
ments such as these, we expect that database fed-
Of course, database federation is not a panacea. eration will come to be widely perceived as the
There may be some integration scenarios for which cornerstone of data integration.

594 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002
*Trademark or registered trademark of International Business puter Systems, May 1996, Hong Kong, IEEE, New York
Machines Corporation. (1996), pp. 449 – 457.
16. M.-C. Shan, “Pegasus Architecture and Design Principles,”
**Trademark or registered trademark of the Object Management Proceedings of the 1993 ACM SIGMOD International Confer-
Group, Sun Microsystems, Inc., Microsoft Corporation, or
ence on Management of Data, Washington, D.C., May 1993,
WebGain, Inc.
ACM, New York (1993), pp. 422– 425.
17. M. Tork Roth, P. Schwarz, and L. Haas, “An Architecture
for Transparent Access to Diverse Data Sources,” Compo-
Cited references nent Database Systems, K. R. Dittrich, A. Geppert, Editors,
Morgan-Kaufmann Publishers, San Mateo, CA (2001), pp.
1. Documentum, Inc., Content Management: Documentum 175–206.
Products, http://www.documentum.com/content-management_
18. IBM Corporation, DB2 Product Family, http://www-3.
Products.html.
ibm.com/software/data/db2/.
2. IBM Corporation, Content Manager, http://www.ibm.com/
19. M. Tork Roth and P. Schwarz, “Don’t Scrap It, Wrap It! A
software/data/cm/.
Wrapper Architecture for Legacy Data Sources,” Proceed-
3. M. A. Roth, D. C. Wolfson, J. C. Kleewein, and C. J. Nelin,
“Information Integration: A New Generation of Information ings of the Conference on Very Large Data Bases (VLDB), Ath-
Technology,” IBM Systems Journal 41, No. 4, 563–577 (2002, ens, Greece, August 1997, Morgan Kaufmann Publishers, San
this issue). Mateo, CA (1997), pp. 266 –275.
4. IBM Corporation, WebSphere Application Server, http:// 20. L. Haas, D. Kossmann, E. Wimmers, and J. Yang, “Optimiz-
www.3.ibm.com/software/webservers/appserv/enterprise.html. ing Queries Across Diverse Data Sources,” Proceedings of the
5. LION Bioscience AG, LION DiscoveryCenter, http:// Conference on Very Large Data Bases (VLDB), Athens,
www.lionbioscience.com/solutions/discoverycenter/. Greece, August 1997, Morgan Kaufmann Publishers, San Ma-
6. middleAware.com, Component Focus, http://www.middleware. teo, CA (1997), pp. 276 –285.
net/components/index.html. 21. iWay Software, iWay Adapters and Connectors, http://www.
7. F. Leymann and D. Roller, “Using Flows in Information In- iwaysoftware.com/products/e2e_integration_products.html.
tegration,” IBM Systems Journal 41, No. 4, 732–742 (2002, 22. Microsoft Corporation, Microsoft ODBC, http://www.
this issue). microsoft.com/data/odbc/.
8. M. Roscheisen, M. Baldonado, C. Chang, L. Gravano, 23. IBM Corporation, DataJoiner, http://www.software.ibm.com/
S. Ketchpel, and A. Paepcke, “The Stanford InfoBus and Its data/datajoiner/.
Service Layers: Augmenting the Internet with Higher-Level 24. Microsoft Corporation, Access 2002 Product Guide, http://
Information Management Protocols,” Digital Libraries in www.microsoft.com/office/access/default.asp.
Computer Science: The MeDoc Approach, Lecture Notes in 25. IBM Corporation, DB2 Relational Connect, http://www-3.
Computer Science, No. 1392, Springer, New York (1998), pp. ibm.com/software/data/db2/relconnect/.
213–230. 26. B. Reinwald, H. Pirahesh, G. Krishnamoorthy, G. Lapis,
9. IBM Corporation, Enterprise Information Portal, http:// B. Tran, and S. Vora, “Heterogeneous Query Processing
www-3.ibm.com/software/data/eip. Through SQL Table Functions,” Proceedings of the 15th In-
10. H. Pirahesh, J. Hellerstein, and W. Hasan, “Extensible Rule- ternational Conference on Data Engineering, March 1999, Sid-
Based Query Rewrite Optimization in Starburst,” Proceed- ney, Australia, IEEE, New York (1999), pp. 366 –373.
ings of the 1992 ACM SIGMOD International Conference on 27. L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice,
Management of Data, San Diego, CA, June 2–5, 1992, ACM, and W. C. Swope, “DiscoveryLink: A System for Integrated
New York (1992), pp. 39 – 48. Access to Life Sciences Data Sources,” IBM Systems Journal
11. M. Tork Roth, F. Ozcan, and L. Haas, “Cost Models DO Mat- 40, No. 2, 489 –511 (2001), http://www.research.ibm.com/
ter: Providing Cost Information for Diverse Data Sources in journal/sj/402/haas.html.
a Federated System,” Proceedings of the Conference on Very 28. ISO/IEC 9075-2:2000, Information technology—Database
Large Data Bases (VLDB), Edinburgh, Scotland, September
languages—SQL—Part 2: Foundation (SQL/Foundation), In-
1999, Morgan Kaufmann Publishers, San Mateo, CA (1999),
ternational Organization for Standardization, Geneva, Swit-
pp. 599 – 610.
zerland (2000).
12. P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and
29. ISO/IEC 9075-9:2000, Information technology—Database
T. Price, “Access Path Selection in a Relational Database
Management System,” Proceedings of the 1979 ACM SIGMOD languages—SQL—Part 9: Management of External Data
International Conference on Management of Data, Boston, MA, (SQL/MED), International Organization for Standardization,
1979, ACM, New York (1979), pp. 23–34. Geneva, Switzerland (2000).
13. H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakon- 30. L. Haas, J. Freytag, G. Lohman, and H. Pirahesh, “Exten-
stantinou, J. Ullman, and J. Widom, “Integrating and Access- sible Query Processing in Starburst,” Proceedings of the 1989
ing Heterogeneous Information Sources in TSIMMIS,” Pro- ACM SIGMOD International Conference on Management of
ceedings of the AAAI Symposium on Information Gathering, Data, Portland, OR, May 31–June 2, 1989, ACM, New York
Stanford, CA, March 1995, AAAI Press (1995), pp. 61– 64. (1989), pp. 377–388.
14. S. Adali, K. Candan, Y. Papakonstantinou, and V. S. Sub- 31. IBM Corporation, DB2 Life Sciences Data Connect, http://
rahmanian, “Query Caching and Optimization in Distributed www-3.ibm.com/software/data/db2/lifesciencesdataconnect/.
Mediator Systems,” Proceedings of the ACM SIGMOD Con- 32. IBM Corporation, IBM MQSeries Integrator V2.0: The Next
ference on Management of Data, Montreal, Canada, June 1996, Generation Message Broker, http://www-3.ibm.com/software/
ACM, New York (1996), pp. 137–148. ts/mqseries/library/whitepapers/mqintegrator/msgbrokers.
15. A. Tomasic, L. Raschid, and P. Valduriez, “Scaling Heter- html.
ogeneous Databases and the Design of DISCO,” Proceed- 33. V. Josifovski, P. Schwarz, L. Haas, and E. Lin, “Garlic: A
ings of the 16th International Conference on Distributed Com- New Flavor of Federated Query Processing for DB2,” Pro-

IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002 HAAS, LIN, AND ROTH 595
ceedings of the ACM SIGMOD Conference on Management Accepted for publication July 22, 2002.
of Data, Madison, WI, June 2002, ACM, New York (2002).
34. IBM Corporation, Lotus Extended Search, http://
Laura M. Haas IBM Software Group, Silicon Valley Laboratory,
www.lotus.com/products/des.nsf/wdocs/home.
555 Bailey Avenue, San Jose, California 95141 (electronic mail:
35. E. F. Codd, “A Relational Model of Data for Large Shared lmhaas@us.ibm.com). Dr. Haas is a Distinguished Engineer and
Data Banks,” Communications of the ACM 13, No. 6, 377– senior manager in the IBM Software Group, where she is respon-
387 (June 1970). sible for the DB2 Query Compiler development, including key
36. E. F. Codd, The Relational Model for Database Management, technologies for DiscoveryLinkTM and Xperanto. Dr. Haas, who
Version 2, Addison-Wesley Publishing Co., Reading, MA joined the IBM Almaden Research Center in 1981, has made sig-
(1990). nificant contributions to database research and has led several
37. C. J. Date and H. Darwen, A Guide to SQL Standard, 4th research projects, including R*, Starburst, and Garlic. She has
Edition, Addison-Wesley Publishing Co., Reading, MA received an IBM Outstanding Technical Achievement Award for
(1997). her work on R* and DiscoveryLink, an IBM Outstanding Con-
38. D. Chamberlin, A Complete Guide to DB2 Universal Data- tribution Award for Starburst, a YWCA Tribute to Women in
base, Morgan Kaufmann Publishers, San Mateo, CA (1998). Industry (TWIN) Award, and an ACM SIGMOD Outstanding
39. J. Gray and A. Reuter, Transaction Processing: Concepts and Contribution Award.
Techniques, Morgan Kaufmann Publishers, San Mateo, CA
(1993). Eileen T. Lin IBM Software Group, Silicon Valley Laboratory, 555
40. I. Traiger, J. Gray, C. Galtieli, and B. Lindsay, Transactions Bailey Avenue, San Jose, California 95141 (electronic mail:
and Consistency in Distributed Database Systems, Research Re- etlin@us.ibm.com). Dr. Lin is a Senior Technical Staff Member
port RJ2555, IBM Almaden Research Center, San Jose, CA and lead architect for DB2 database federation. She joined IBM
95120 (1979). in 1990 after receiving her Ph.D. degree from the Georgia In-
41. M. Stonebraker, “Implementation of Integrity Constraints stitute of Technology in Atlanta. She was the lead architect for
and Views by Query Modification,” Proceedings of the 1975 DataJoiner query processing and led the team that merged Data-
ACM SIGMOD Conference on Management of Data, ACM, Joiner and Garlic technology into DB2 for UNIX䉸 and Win-
New York (1975), pp. 65–78. dows䉸. She is currently a lead architect responsible for the de-
42. Rules in Database Systems, T. K. Sellis, Editor, Proceedings livery of database federation technology in DB2 and Xperanto.
of the Second International Workshop (RIDS ’95), Glyfada,
Athens, Greece, September 25–27, 1995, Lecture Notes in Mary A. Roth IBM Software Group, Silicon Valley Laboratory,
Computer Science, No. 985, Springer, New York (1995). 555 Bailey Avenue, San Jose, California 95141 (electronic mail:
43. D. Srivastava, S. Dar, H. V. Jagadish, and A. Levy, “Answer- torkroth@us.ibm.com). Ms. Roth is a senior engineer and man-
ing Queries with Aggregation Using Views,” Proceedings of ager in the Database Technology Institute for e-Business at IBM’s
the 22nd International Conference on Very Large Data Bases Silicon Valley Lab. She has over 12 years of experience in da-
(VLDB ’96), Morgan Kaufmann Publishers, San Mateo, CA tabase research and development. As a researcher and member
(1996), pp. 318 –329. of the Garlic project at IBM’s Almaden Research Center, she
44. M. Zaharioudakis, R. Cochrane, G. Lapis, H. Pirahesh, and contributed key advances in heterogeneous data integration tech-
M. Urata, “Answering Complex SQL Queries Using Auto- niques and federated query optimization and led efforts to trans-
matic Summary Tables,” Proceedings of the 2000 ACM SIG- fer Garlic support to DB2. Ms. Roth is leading a team of devel-
MOD International Conference on Management of Data, ACM, opers to deliver a key set of components for Xperanto, IBM’s
New York (2000), pp. 105–116. information integration initiative for distributed data access and
45. Cognos Incorporated, Business Intelligence Products, http:// integration.
www.cognos.com/products/index.html.
46. Brio Software, Inc., Business Intelligence solution for data
query and analysis, http://www.brio.com/products/overview.
html.
47. IBM Corporation, WebSphere Studio Application Developer,
http://www-3.ibm.com/software/ad/studioappdev/.
48. IBM Corporation, VisualAge C⫹⫹, http://www-3.ibm.com/
software/ad/vacpp/.
49. Microsoft Corporation, Visual C⫹⫹ .NET, http://msdn.
microsoft.com/visualc/.
50. WebGain, Inc., Visual Café, http://www.webgain.com/
products/visual_cafe/.
51. IBM Corporation, DB2 Universal Database, DB2 Data Prop-
agator, http://www-3.ibm.com/software/data/dpropr/.
52. C. Mohan, “Caching Technologies for Web Applications,”
Proceedings of the 27th International Conference on Very Large
Data Bases, September 11–14, 2001, Roma, Italy, Morgan
Kaufmann Publishers, San Mateo, CA (2001).
53. Q. Luo, S. Krishnamurthy, C. Mohan, H. Pirahesh, H. Woo,
B. Lindsay, and J. Naughton, “Middle-Tier Database Cach-
ing for e-Business,” Proceedings of the ACM SIGMOD Con-
ference on Management of Data, Madison, WI, June 2002,
ACM, New York (2002).

596 HAAS, LIN, AND ROTH IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002

View publication stats

You might also like