Design of A Spatial Data Warehouse Based On An Int
Design of A Spatial Data Warehouse Based On An Int
net/publication/228770411
Article
CITATIONS READS
0 226
1 author:
Abdulvahit Torun
Middle East Technical University
8 PUBLICATIONS 26 CITATIONS
SEE PROFILE
All content following this page was uploaded by Abdulvahit Torun on 02 April 2015.
Abdulvahit Torun
Harita Genel Komutanlığı (General Command of Mapping) (GCM),
Kartografya Dairesi, 06100 Cebeci, Ankara, Türkiye
atorun@hgk.mil.tr
1. INTRODUCTION
There are multiple, distributed, heterogeneous and autonomous information sources storing spatial
data within a single company or property of various companies. The information sources vary from
legacy systems to operational spatial databases. The need for gathering or defining associations among
those data sources motivates to integrate them as a spatial data warehouse (DW) or as a database using
integration techniques. Spatial data sets are integrated either through a part of data warehouse or
distributed/federated database (by means of wrappers and mediators).
In warehousing phase, resolving formats, data integration, data cleansing, consistency-checking tasks
are accomplished. Warehouse organization step comprises re-organizing and aggregating the data as
base tables, summary tables and metadata. Querying and visualizing a data spatial warehouse
considering spatial characteristics of data requires essential tools for analytical processes and
visualization of the information both for organizational and ad hoc users.
Materialized views pre-compute and store (materialize) aggregates from the base data. The data is
grouped using categories from the dimensions tables, which corresponds to the subjects of interest
(dimensions) of the organization. Storing all possible aggregates poses storage space problems and
increases maintenance cost, since all stored aggregates need to be refreshed as updates are being made
to the source data sources. .
In most of the DB integration cases, the tendency is management of legacy systems and re-use of data
in recent years. The DBs, which are distributed geographically, in different models, in different
environment and with different semantics are needed to be integrated to use the available data. This
problem is called as ‘DB integration’. DBMSs and applications are designed considering incorporate
data and process sharing to ease autonomy. Main efforts are spend in three directions; ‘schema
integration’, ‘federated DB’ and ‘multi-database language’ [Elmagarmid et.al., 1999] [Bobak, 1996].
For integrating different data sets describing the same phenomena, firstly a correct understanding of
the semantics of the existing data should be developed. Schemas are transformed into a common data
model. For instance SSI DB schema into SQL and DB instance into dbf are converted. Then, an
accurate correlation among structures (schemas) is established to avoid comparing different type of
objects within the same category. The inter-schema correspondence at metalevel and inter-DB
correspondence at data level are identified and described. Finally, integration is described precisely to
prevent merging irrelevant data together. The conflicts are solved semi-automatically. Integrated
schema is generated on top of contributing data sources. Schema integration is done by using modified
5 level integration architecture for the purpose of single DB generation (Figure 1.a) [Ozsu, et.al.,
1999].
1. Local Schema: Local schema of plain table, SSI DB and DCT are made available by the
Component DBs.
2. Component Schema: Representation of Local schema in canonical (standard) data Model is
defined by using SQL. Canonical data Model is employed for unifying divergent local
schemas in a single schema.
3. Export Schema: Export Schemas are defined from component schemas based on integrated
global schema of PPDB_T.
4. Schema Integration: Export Schemas are integrated to form a single schema. Since the desired
final DB is not a federated one, the integrated schema is not a Federated Schema, either.
Firstly, two component schemas are integrated. Then, the third one is added to the integrated
schema. This method is called as ladder technique. Data integration follows schema
integration.
5. External Schema: Upon the integrated schema different conceptual/external schemas are
defined for different purposes and usage [Torun, 2002_2].
2.3 Developing Populated Places Database of Turkey as a base for Spatial Data warehouse
Data warehousing phase is implemented as follows. Population information and information about
administrative hierarchy are extracted from State Statistics Institute (SSI) DB and from Ministry of
Interior ‘Populated Places Book’ respectively. Location information and administrative boundaries are
imported from topo-maps at scale 1:25000 and Digital Chart of Turkey (DCT) at scale 1:1000000 of
General Command of Mapping respectively. Integration of those four data sources is accomplished by
using ladder technique to construct Populated Places DB of Turkey (PPDB_T) as a central DB at
Cartography Department of GCM.
I_SDB
Schema 1 Schema 2 Schema 3
PPDB_T
Declaration of Correspondence
(a) (b)
Population information and relevant statistics are collected and stored by SSI. In order to maintain
population information up-to-date for map production and for monitoring the changes of populated
places, DB integration is necessary to share the legacy data by SSI instead of re-constructing the same
content.
Defining Associations
Turkey has a hierarchical administrative system, which looks like a balanced tree structure. The
sequence of residential entities (populated places) from top to bottom is province, district, sub-district,
village and suburb respectively.
Schema Integration
Source DBs can be integrated by means of schema integration or using standardized data or schema
models. Schema integration is a practical method such that the data is integrated as a logical DB with a
global schema. With a given a set of local DB schemas belonging to an individual DBMS, an
integrated schema which subsumes those local schemas is created by synthesizing the schemas.
First of all, database schema of PPDB_T is designed by means of actual and future needs. DB schema
of PPDB_T is designed considering the administrative binding of populated places. Then, relevant
attributes of SSI DB are added to the PPDB_T schema in addition to a foreign key, which provides
PPDB_T to connect SSI DB. The final step is designing an Integrated spatial DB (I_SDB) with the
three component DBs; PPDB_T, DCT and spatial DB derived from a subset of PPDB_T.
Data Integration
There are three main data integration techniques. Firstly, a very basic approach is providing a global
catalogue of accessible information sources to user to allow him doing integration by himself without
attempting any integration. The user should search, find, decide and find tools to select, extract,
integrate and query the data. Secondly, a step further is integrating the source DBs as one single DB
by putting the data together. The data and applications should be converted into the integrated DB.
Thirdly, subsets of source databases are extracted due to main application needs.
Data integration process is done in four major steps by using the first two techniques. Data integration
is accomplished by applying ladder technique. Firstly, DB schema of PPDB_T is designed considering
the available non-spatial digital information sources and further needs. Secondly, PPDB_T is
populated with the data from plain table. Therefore, the plain table is mapped into PPDB_T. Thirdly,
PPDB_T schema is extended in order to import population information from exported data of SSI DB
(Figure 2.a).
Derived Spatial DB (d_SDB) is based on a common schema for both PPDB_T and spatial data
processor –for the time being ESRI-ArcInfo- that will import the data. Export schema is created by
means of a non-spatial predicate that cuts the DB both vertically –a subset of attributes- and
horizontally –a subset of tuples-. Primary key is repeated in every fragment in vertical fragmentation.
Thus, disjointness is valid only on non-primary key attributes in vertical fragmentation (Devogele,
et.al, 1998). The exported non-spatial data is mapped into spatial format to generate d_SDB. These are
done by developing a tiny software which extracts data from PPDB_T into a common model,
transforms the common model into a spatial DB and visa-versa. Different languages are employed for
spatial and non-spatial definitions and manipulations. DCT contains a set of spatial data classes such
as administrative boundaries, hydrology, transportation, elevation, populated places (only provinces
and sub-provinces), physiography. The relationship among PPDB_T and d_SDB is preserved by
keeping the same primary key –Populated Place ID- in corresponding relations of both DBs.
Problems of Integration
Constructing an integrated DB from existing DBs yield some problems due lack of interoperability
among DBs. GCM and SSI have different non-spatial data to an extent for the same context. The data
differs because of partly semantic but mainly schematic and format (syntax) discrepancy. Updateness
is the main reason for this kind of anomalies. If one of the DBs stores a different name than the current
one for a populated place the corresponding populated places cannot be matched till the mistake is
removed or association is built. This process prevents scalability of the resultant schema and DB
International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY
instance. Each time there happens a change, the integrated schema and integrated DB should be
updated.
Data analysis applications typically aggregate data across many dimensions looking for anomalies or
unusual patterns. They categorize data values and trends, extract statistical information, and contrast
one category with another. Data analysis is done four steps;
• A query that extracts relevant data from a large database is formulated,
• The aggregated data from the database into file or table is extracted
• The results are visualized by using graphic or mapping tools,
• The results are analyzed and new queries are formulated [Gray et al., 1997].
Data integration and consistency checks are accomplished. The coming stage will comprise organizing
DW, defining and developing analytical operations and generating tools to extend integrated global
schema of DW and DW instance.
Query Interface
Query Engine
integration
rules
Extracter (DB API)
Data
Integration
(a) (b)
Figure : 2.a DB intefration for PPDB_T, 2.b. Hybrid model for Spatial DW
The dimension determined in the actual DB are geometry, time, type of populated places. The
geometry dimension comprises geometric boundary of populated places from region, province,
district, sub-district and village. Some other geometric hierarchies can be defined according to a
property like height zones, earthquake danger zones, fruitful soil zones etc…
A cube is a group of data cells arranged by the dimensions of the data. A dimension is defined as a
structural attribute of a cube that is a list of members, all of which are of a similar type in the users
perception of the data. Each dimension has an associated hierarchy of levels of aggregated data. For
instance time can be detailed as year, month, week, day, hour. Measures (variables, metrics, facts)
represent the real world values. A single datapoint that occurs at the intersection is defined by
selecting one member from each dimension in a multi-dimensional array (cube).
Although relational technologies provide some non-procedural tools for analytical operations, routines
should be provided for complex calculations. Operators defined in SQL in relational DBMS are group
by (), count (), sum (), minimum (), maximum (), average (), median (), standard deviation (), variance
() etc... However, an algebra is defined on cube having the operations group by, cube, roll up [Gray,
et.al., 1997].
4. CONCLUSIONS
The concepts about database integration are briefed in the text. Data and schema integration
techniques are given related with developing an integrated DB comprising populated places,
administeral hierarchies, population information which is called as PPDB_T (Populated Places DB of
Turkey). Finally, hybrid model for integration processes is defined within the application phase.
Syntactic conflicts are removed by using common formats for all data sources. Schematic conflicts are
resolved by defining an integrated global schema and removing redundant and repeated information.
Since the semantic meanings of administrative units are unique among governmental bodies, almost
no semantic conflicts are met. Naming differences are removed by defining unique names or
synonyms.
Organizing DW and defining analytical operations phases are designed. A hybrid method is going to
be used to develop spatial DB. The spatial DW will comprise a set of maps, which are generated for
the frequent user queries considering aggregation hierarchies.
REFERENCES
Bobak, A.R. “Distributed and Multi-Database Systems”, Artech House, Boston, pp 121-138 (1996).
Devogele, T., C.parent, S.Spaccapietra, “On Spatial Database Integration”, International Journal of
Geographic Information Science, 12(4), pp. 335-352 (1998).
Elmagarmid, A. et.al., “Management of Heterogeneous And Autonomous Database Systems”, Morgan
Kaufmann Publishers, San Francisco, pp.2-32 (1999).
Ezeife, C.I., “Selecting and materializing horizontally partitioned warehouse views”, Data &
Knowledge Engineering 36, pp 185-210, Elsevier Science (2001).
Gray, J. et.al., “Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross-tab, and
Sub-Totals”, data Mining and Knowledge Discovery 1, pp 29-53, Kluwer Academic Publishers
(1997).
Hepner, P., “Integrating Heterogenous Databases”: An Overview, http://citeseer.nj.nec.com/cs,
(1995).(accessed Dec. 2001).
Niemi, T., “Constructing OLAP Cubes Based on Queries”, http://citeseer.nj.nec.com/cs (accessed
Aug. 2002).
Özsu, M.T., P. Valduriez, “Principles of Distributed Database Systems”, Prentice Hall, New Jersey,
pp. 75-101 (1999).
Torun, A., “Populated Places DB Project”, General Command of Mapping. Internal Report,
Cartography Department, General Command of Mapping, Turkey (2000).
Torun, A., “Designing Populated Places Database of Turkey (PPDB_T) by Using Relational Model,
Harita Dergisi, 128 (2002_1).