0% found this document useful (0 votes)
53 views9 pages

Design of A Spatial Data Warehouse Based On An Int

This document discusses the design of a spatial data warehouse based on integrating non-spatial and geospatial data sources. It describes extracting relevant population, administrative hierarchy, location, and boundary data from multiple sources to construct a central Populated Places Database of Turkey. It then discusses organizing this data in the warehouse, including resolving formats, data integration, and defining aggregation levels and dimensions. The goal is to enable efficient querying and visualization of the integrated spatial data.

Uploaded by

Suchismita Sahu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
53 views9 pages

Design of A Spatial Data Warehouse Based On An Int

This document discusses the design of a spatial data warehouse based on integrating non-spatial and geospatial data sources. It describes extracting relevant population, administrative hierarchy, location, and boundary data from multiple sources to construct a central Populated Places Database of Turkey. It then discusses organizing this data in the warehouse, including resolving formats, data integration, and defining aggregation levels and dimensions. The goal is to enable efficient querying and visualization of the integrated spatial data.

Uploaded by

Suchismita Sahu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/228770411

Design of a Spatial Data Warehouse Based on an Integrated Non-Spatial


Database and Geo-Spatial Information

Article

CITATIONS READS

0 226

1 author:

Abdulvahit Torun
Middle East Technical University
8 PUBLICATIONS   26 CITATIONS   

SEE PROFILE

All content following this page was uploaded by Abdulvahit Torun on 02 April 2015.

The user has requested enhancement of the downloaded file.


DESIGN OF A SPATIAL DATA WAREHOUSE BASED ON AN INTEGRATED NON-
SPATIAL DATABASE AND GEO-SPATIAL INFORMATION

Abdulvahit Torun
Harita Genel Komutanlığı (General Command of Mapping) (GCM),
Kartografya Dairesi, 06100 Cebeci, Ankara, Türkiye

atorun@hgk.mil.tr

Accessing multiple, distributed, heterogeneous and autonomous information sources


storing spatial or non-spatial data and integration of those data sources has been an
important issue as distributed data processing and management became available
with today’s technology. Common solutions to data integration in database area are
the data integration, schema integration and software developing approaches. The
first approach is implemented as a data warehouse (DW) or an integration as a
central database (DB). The data in DW has been cleansed, integrated, and pre-
processed and infrastructures have been built surrounding DW for efficient data
analysis. Main processes to construct a spatial data warehouse are; collecting data
from information sources (data warehousing), organizing data in a geo-spatial data
store (warehouse organization) and querying and visualizing data. This study covers
implementation of first and design of the following two phases of developing spatial
data warehouse. In data warehousing phase, required data about population
information, information about administrative hierarchy, location information and
administrative boundaries are extracted from relevant data sources. After introducing
data warehousing step, model of latter two phases are given in this paper.

1. INTRODUCTION

There are multiple, distributed, heterogeneous and autonomous information sources storing spatial
data within a single company or property of various companies. The information sources vary from
legacy systems to operational spatial databases. The need for gathering or defining associations among
those data sources motivates to integrate them as a spatial data warehouse (DW) or as a database using
integration techniques. Spatial data sets are integrated either through a part of data warehouse or
distributed/federated database (by means of wrappers and mediators).

DW is a subject-oriented, integrated, time-varying, non-volatile collection of data repository, which


has been extracted and integrated from heterogeneous and autonomous distributed data sources. DW is
used primarily in organizational decision-making. The data in DW has been cleansed, integrated, and
pre-processed and infrastructures have been built surrounding DW for efficient data analysis. Main
processes to construct a spatial data warehouse are; collecting data from information sources (data
warehousing), organizing data in a geo-spatial data store (warehouse organization) and querying and
visualizing data [Ezeife, 2001] [Voisard, et.al., 2002].

In warehousing phase, resolving formats, data integration, data cleansing, consistency-checking tasks
are accomplished. Warehouse organization step comprises re-organizing and aggregating the data as
base tables, summary tables and metadata. Querying and visualizing a data spatial warehouse
considering spatial characteristics of data requires essential tools for analytical processes and
visualization of the information both for organizational and ad hoc users.

International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY


Most of DWs store large amount of data, which is often used for summarization-based on-line
analytical processing (OLAP). ROLAP (relational OLAP) and MOLAP (multidimensional OLAP) are
two basic OLAP architectures. DWs are usually based on relational technologies but OLAP uses a
multidimensional view of aggregate data to provide quick access to strategic information for further
analysis. OLAP enables analysts, managers and executive to gain insight into data through fast,
consistent, interactive access to a wide variety of possible views of information. However OLAP and
DWs are complementary. OLAP transforms raw data so that it reflects the real dimensionality of the
enterprise as understood by the user [Niemi, 2002].

Materialized views pre-compute and store (materialize) aggregates from the base data. The data is
grouped using categories from the dimensions tables, which corresponds to the subjects of interest
(dimensions) of the organization. Storing all possible aggregates poses storage space problems and
increases maintenance cost, since all stored aggregates need to be refreshed as updates are being made
to the source data sources. .

2. DATA SOURCES AND DATABASE INTEGRATION

In most of the DB integration cases, the tendency is management of legacy systems and re-use of data
in recent years. The DBs, which are distributed geographically, in different models, in different
environment and with different semantics are needed to be integrated to use the available data. This
problem is called as ‘DB integration’. DBMSs and applications are designed considering incorporate
data and process sharing to ease autonomy. Main efforts are spend in three directions; ‘schema
integration’, ‘federated DB’ and ‘multi-database language’ [Elmagarmid et.al., 1999] [Bobak, 1996].

2.1 Data Warehousing


In warehousing phase resolving formats, data integration, data cleansing, consistency-checking tasks
are done. New data is added to DW due to format and transformation definitions in metadata. Before
adding the data, consistency checks are performed in order to guarantee that data is loaded once and
correctly.

2.2 Database Integration Strategy


A hybrid integration technique is applied in order to develop Populated Places DB of Turkey
(PPDB_T) [Torun, 2000] [Torun, 2002_1]. Data integration method –integrating several DBs into one
unique DB- and schema integration method are used for integration. Data insertion from the
component DBs is done by running the programs written in API of PPDB_T. Schema and Data
integration is accomplished by using ladder technique in which first of all, schema of newly designed
PPDB_T of General Command of Mapping (GCM) is integrated with the data collected for populated
places which are stored in a plain table. Then, schema of PPDB_T is extended to integrate it with State
Statistics Institute (SSI) DB schema (Figure 1.b).

For integrating different data sets describing the same phenomena, firstly a correct understanding of
the semantics of the existing data should be developed. Schemas are transformed into a common data
model. For instance SSI DB schema into SQL and DB instance into dbf are converted. Then, an
accurate correlation among structures (schemas) is established to avoid comparing different type of
objects within the same category. The inter-schema correspondence at metalevel and inter-DB
correspondence at data level are identified and described. Finally, integration is described precisely to
prevent merging irrelevant data together. The conflicts are solved semi-automatically. Integrated
schema is generated on top of contributing data sources. Schema integration is done by using modified
5 level integration architecture for the purpose of single DB generation (Figure 1.a) [Ozsu, et.al.,
1999].

International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY


Modified 5 Level Schema Integration Architecture

1. Local Schema: Local schema of plain table, SSI DB and DCT are made available by the
Component DBs.
2. Component Schema: Representation of Local schema in canonical (standard) data Model is
defined by using SQL. Canonical data Model is employed for unifying divergent local
schemas in a single schema.
3. Export Schema: Export Schemas are defined from component schemas based on integrated
global schema of PPDB_T.
4. Schema Integration: Export Schemas are integrated to form a single schema. Since the desired
final DB is not a federated one, the integrated schema is not a Federated Schema, either.
Firstly, two component schemas are integrated. Then, the third one is added to the integrated
schema. This method is called as ladder technique. Data integration follows schema
integration.
5. External Schema: Upon the integrated schema different conceptual/external schemas are
defined for different purposes and usage [Torun, 2002_2].

2.3 Developing Populated Places Database of Turkey as a base for Spatial Data warehouse
Data warehousing phase is implemented as follows. Population information and information about
administrative hierarchy are extracted from State Statistics Institute (SSI) DB and from Ministry of
Interior ‘Populated Places Book’ respectively. Location information and administrative boundaries are
imported from topo-maps at scale 1:25000 and Digital Chart of Turkey (DCT) at scale 1:1000000 of
General Command of Mapping respectively. Integration of those four data sources is accomplished by
using ladder technique to construct Populated Places DB of Turkey (PPDB_T) as a central DB at
Cartography Department of GCM.
I_SDB
Schema 1 Schema 2 Schema 3

PPDB_T
Declaration of Correspondence

Semantic and Resolution of Integration PPDB_T


Structural Conflicts Rules
Conflicts
Schema Fusion PPDB_T Plain Table SSI DB DCT

(a) (b)

Figure : 1.a Schema Integration, 1.b. DB integration using ladder technique

Population information and relevant statistics are collected and stored by SSI. In order to maintain
population information up-to-date for map production and for monitoring the changes of populated
places, DB integration is necessary to share the legacy data by SSI instead of re-constructing the same
content.

Defining Associations
Turkey has a hierarchical administrative system, which looks like a balanced tree structure. The
sequence of residential entities (populated places) from top to bottom is province, district, sub-district,
village and suburb respectively.

International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY


2.4 Database Integration

Schema Integration
Source DBs can be integrated by means of schema integration or using standardized data or schema
models. Schema integration is a practical method such that the data is integrated as a logical DB with a
global schema. With a given a set of local DB schemas belonging to an individual DBMS, an
integrated schema which subsumes those local schemas is created by synthesizing the schemas.

First of all, database schema of PPDB_T is designed by means of actual and future needs. DB schema
of PPDB_T is designed considering the administrative binding of populated places. Then, relevant
attributes of SSI DB are added to the PPDB_T schema in addition to a foreign key, which provides
PPDB_T to connect SSI DB. The final step is designing an Integrated spatial DB (I_SDB) with the
three component DBs; PPDB_T, DCT and spatial DB derived from a subset of PPDB_T.

Data Integration
There are three main data integration techniques. Firstly, a very basic approach is providing a global
catalogue of accessible information sources to user to allow him doing integration by himself without
attempting any integration. The user should search, find, decide and find tools to select, extract,
integrate and query the data. Secondly, a step further is integrating the source DBs as one single DB
by putting the data together. The data and applications should be converted into the integrated DB.
Thirdly, subsets of source databases are extracted due to main application needs.

Data integration process is done in four major steps by using the first two techniques. Data integration
is accomplished by applying ladder technique. Firstly, DB schema of PPDB_T is designed considering
the available non-spatial digital information sources and further needs. Secondly, PPDB_T is
populated with the data from plain table. Therefore, the plain table is mapped into PPDB_T. Thirdly,
PPDB_T schema is extended in order to import population information from exported data of SSI DB
(Figure 2.a).

Derived Spatial DB (d_SDB) is based on a common schema for both PPDB_T and spatial data
processor –for the time being ESRI-ArcInfo- that will import the data. Export schema is created by
means of a non-spatial predicate that cuts the DB both vertically –a subset of attributes- and
horizontally –a subset of tuples-. Primary key is repeated in every fragment in vertical fragmentation.
Thus, disjointness is valid only on non-primary key attributes in vertical fragmentation (Devogele,
et.al, 1998). The exported non-spatial data is mapped into spatial format to generate d_SDB. These are
done by developing a tiny software which extracts data from PPDB_T into a common model,
transforms the common model into a spatial DB and visa-versa. Different languages are employed for
spatial and non-spatial definitions and manipulations. DCT contains a set of spatial data classes such
as administrative boundaries, hydrology, transportation, elevation, populated places (only provinces
and sub-provinces), physiography. The relationship among PPDB_T and d_SDB is preserved by
keeping the same primary key –Populated Place ID- in corresponding relations of both DBs.

Problems of Integration
Constructing an integrated DB from existing DBs yield some problems due lack of interoperability
among DBs. GCM and SSI have different non-spatial data to an extent for the same context. The data
differs because of partly semantic but mainly schematic and format (syntax) discrepancy. Updateness
is the main reason for this kind of anomalies. If one of the DBs stores a different name than the current
one for a populated place the corresponding populated places cannot be matched till the mistake is
removed or association is built. This process prevents scalability of the resultant schema and DB
International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY
instance. Each time there happens a change, the integrated schema and integrated DB should be
updated.

3. SPATIAL DATA WAREHOUSE MODEL

Data analysis applications typically aggregate data across many dimensions looking for anomalies or
unusual patterns. They categorize data values and trends, extract statistical information, and contrast
one category with another. Data analysis is done four steps;
• A query that extracts relevant data from a large database is formulated,
• The aggregated data from the database into file or table is extracted
• The results are visualized by using graphic or mapping tools,
• The results are analyzed and new queries are formulated [Gray et al., 1997].

Data integration and consistency checks are accomplished. The coming stage will comprise organizing
DW, defining and developing analytical operations and generating tools to extend integrated global
schema of DW and DW instance.

Query Interface

Query Engine

Plain Table for PPDB_T SSI DB


Populated Places Schema Schema SSI DB
of GCM Data
Warehouse
imigration
Integrated (extended) Modified
PPDB_T Schema for PPDB_T SSI DB DB Integration Metadata

integration
rules
Extracter (DB API)
Data
Integration

Source Source Source Source


DB1 DB2 DB... Data warehouse
PPDB_T

(a) (b)
Figure : 2.a DB intefration for PPDB_T, 2.b. Hybrid model for Spatial DW

3.1 Spatial Data Warehouse Organization


Data is organized in a spatial data store in this stage. DW does not allow the user to delete or update
the data in the store. However, the growth of DW is managed either transferring into cheaper media or
generating summary tables of less used data. Row data is re-structured and aggregated in base tables.
Aggregation rules are defined in metadata. Users are sometimes interested in aggregated values such
as population of a region or province, which is calculated by summing the population of populated
places at lower hierarchies. Getting the result of a query on the fly based on operational DB may not
be efficient. On the other hand, the more data is aggregated, the faster queries are responded, but
update and storage are increased. A balance among query response time and update time should be
considered.

International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY


A hybrid approach is employed to provide efficient and rich analytical analysis to the user (Figure
2.b). A part of data is organized in a DW while other part of data is extracted from the integrated DB
during query processing. In the metadata information about ‘where and how to get the relevant data’ is
stored. A warehouse is useful for data related to whole field. This is useful in geomarketing
applications where data, being socio-cultural behaviors or statistical data. For the case of spatial data
warehouse, resultant maps of some queries are stored in the DW for quick response and other queries
are responded on the fly as a graphic result. Therefore dimension of data cube is extended by adding
related maps of each dimension considering aggregate hierarchies.

The dimension determined in the actual DB are geometry, time, type of populated places. The
geometry dimension comprises geometric boundary of populated places from region, province,
district, sub-district and village. Some other geometric hierarchies can be defined according to a
property like height zones, earthquake danger zones, fruitful soil zones etc…

3.2 Analytical Operations on Datawarehouse


Analytical Operations applied to warehouses for are aggregation (consolidation, roll up), roll down
(drill down, drill through), selection (screening, filtering, dicing), slicing, pivoting (rotation).
Aggregation is summarization of data for the higher level of hierarchy. Roll down is navigation among
levels of data ranging from higher-level summary (up) to lower level summary or detailed data
(down). Selection is taking a subset of data by means of a criterion which is evaluated against the data
or members of a dimension in order to restrict the set of retrieved data. Slicing is selection of all data
satisfying a condition along a particular dimension. Pivoting is changing the dimensional orientation
of cube [Vassiliadis, 2000] [Vassiliadis, 2002].

A cube is a group of data cells arranged by the dimensions of the data. A dimension is defined as a
structural attribute of a cube that is a list of members, all of which are of a similar type in the users
perception of the data. Each dimension has an associated hierarchy of levels of aggregated data. For
instance time can be detailed as year, month, week, day, hour. Measures (variables, metrics, facts)
represent the real world values. A single datapoint that occurs at the intersection is defined by
selecting one member from each dimension in a multi-dimensional array (cube).

Although relational technologies provide some non-procedural tools for analytical operations, routines
should be provided for complex calculations. Operators defined in SQL in relational DBMS are group
by (), count (), sum (), minimum (), maximum (), average (), median (), standard deviation (), variance
() etc... However, an algebra is defined on cube having the operations group by, cube, roll up [Gray,
et.al., 1997].

Visualizing the Results by Means of Graphic and cartographic Tools


Visualization tools display trends, clusters and differences by using graphic tools to make the user to
understand the metaphor easily. Visualization tools render the results as 2D or 3D graphs in addition
to tools of cartography, which are used for mapping statistical distributions. In order to visualize
dynamic characteristics dynamic visual variables are used in addition to visual variables. and
Cartographic tools in GIS software provide most of the mapping tools. Results of queries against
spatial data warehouses are visualized as a value, an array (table), graphic (2D and 3D), maps (2D, 3D,
multimedia-animation, sound etc…).

3.3 Extending and Maintaining Spatial DB as a source of Spatial Data Warehouse


After founding a spatial DW, the data is updated, changed or model is extended in two ways. Firstly,
source sends data regularly or after balk changes. Secondly, warehouse asks for data at certain time
periods [Voisard, et.al, 2002].
International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY
Since, mapping from data sources into PPDB_T is initially done automatically, the tools for mapping
are available for further data injection into the DB. Moreover, there are tools to check consistency
considering administrative hierarchy and to compare different versions of PPDB_T. The application
has tools for entering, manipulating and updating data in addition to intelligent query generator based
on administrative hierarchy and standard topographic map indexes.

4. CONCLUSIONS

The concepts about database integration are briefed in the text. Data and schema integration
techniques are given related with developing an integrated DB comprising populated places,
administeral hierarchies, population information which is called as PPDB_T (Populated Places DB of
Turkey). Finally, hybrid model for integration processes is defined within the application phase.
Syntactic conflicts are removed by using common formats for all data sources. Schematic conflicts are
resolved by defining an integrated global schema and removing redundant and repeated information.
Since the semantic meanings of administrative units are unique among governmental bodies, almost
no semantic conflicts are met. Naming differences are removed by defining unique names or
synonyms.

Organizing DW and defining analytical operations phases are designed. A hybrid method is going to
be used to develop spatial DB. The spatial DW will comprise a set of maps, which are generated for
the frequent user queries considering aggregation hierarchies.

Acknowledgements. Populated Places Database of Turkey is designed and developed in Cartography


Department of General Command of Mapping, Turkey.

REFERENCES

Bobak, A.R. “Distributed and Multi-Database Systems”, Artech House, Boston, pp 121-138 (1996).
Devogele, T., C.parent, S.Spaccapietra, “On Spatial Database Integration”, International Journal of
Geographic Information Science, 12(4), pp. 335-352 (1998).
Elmagarmid, A. et.al., “Management of Heterogeneous And Autonomous Database Systems”, Morgan
Kaufmann Publishers, San Francisco, pp.2-32 (1999).
Ezeife, C.I., “Selecting and materializing horizontally partitioned warehouse views”, Data &
Knowledge Engineering 36, pp 185-210, Elsevier Science (2001).
Gray, J. et.al., “Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross-tab, and
Sub-Totals”, data Mining and Knowledge Discovery 1, pp 29-53, Kluwer Academic Publishers
(1997).
Hepner, P., “Integrating Heterogenous Databases”: An Overview, http://citeseer.nj.nec.com/cs,
(1995).(accessed Dec. 2001).
Niemi, T., “Constructing OLAP Cubes Based on Queries”, http://citeseer.nj.nec.com/cs (accessed
Aug. 2002).
Özsu, M.T., P. Valduriez, “Principles of Distributed Database Systems”, Prentice Hall, New Jersey,
pp. 75-101 (1999).
Torun, A., “Populated Places DB Project”, General Command of Mapping. Internal Report,
Cartography Department, General Command of Mapping, Turkey (2000).
Torun, A., “Designing Populated Places Database of Turkey (PPDB_T) by Using Relational Model,
Harita Dergisi, 128 (2002_1).

International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY


Torun, A., Using Schema and Data Integration Techniques to Integrate Spatial and non-Spatial Data:
Developing Populated Places DB of Turkey (PPDB_T), ISPRS Comm. IV, Canada (2002_2).
Vassiliadis, P., “Data Warehouse Modeling and Quality Issues”, Ph.D. Thesis, Nat. Tech. Univ. of
Athens, Greece (2000).
Vassiliadis, P., “Modeling Multidimensional Databases, Cubes and Cube Operations”,
http://citeseer.nj.nec.com/cs (accessed Aug. 2002).
Voisard, A., M. Jürgens, “Geospatial Information Extraction: Querying or Quarrying”,
http://citeseer.nj.nec.com/cs (accessed Aug. 2002).

International Symposium on GIS, September 23-26, 2002, Istanbul-TURKEY

View publication stats

You might also like