UNIT-I
DATA WAREHOUSING
Syllabus
Data Warehouse Components, Building a Data warehouse, Mapping Data Warehouse to a
Multiprocessor Architecture, Data Extraction, Clean up and Transformation Tools, Meta data
Objective
To know about data warehousing components
To know the considerations of building a data warehouse
To know how to map a data warehouse to a multiprocessor architecture
To know about Data Extraction, Clean up and Transformation Tools
To know details about Meta data
Data Warehouse Components
1. Overall Architecture
The data warehousing architecture is based on a relational database management system server that functions as the central repository for informational data. Typically, the source data for the warehouse comes from the operational applications. As the data enters the data warehouse, it is transformed into an integrated structure and format.
2. Data Warehouse Database
The central data warehouse database is a cornerstone of the data warehousing environment. This database is almost always implemented on a relational database management system (RDBMS). Some of the approaches used for this database are
Parallel relational database designs that require a parallel computing platform
An innovative approach to speed up a traditional RDBMS
MDDBs (multidimensional databases) that are based on proprietary database technology or implemented using familiar RDBMSs
3. Sourcing, Acquisition, Cleanup and Transformation Tools
These tools perform all of the conversion, summarization, key changes, structural changes and condensation needed to transform disparate data into information that can be used by decision support tools. This functionality includes
Removing unwanted data from operational databases
Converting to common data names and definitions
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
These tools have to deal with some significant issues, as follows.
Data heterogeneity: the differences in the way data is defined and used in different models.
Database heterogeneity: DBMSs differ widely in data models, data access languages, data navigation, operations, and so on.
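As an illustration of the cleanup and transformation functions listed above, the following is a minimal Python sketch, with entirely hypothetical field names and rules, showing how source fields might be renamed to common warehouse names, defaults supplied for missing data, and a derived summary calculated.
# Minimal sketch of sourcing/cleanup/transformation logic (hypothetical fields and rules).
NAME_MAP = {"cust_nm": "customer_name", "rgn": "region", "sls_amt": "sales_amount"}
DEFAULTS = {"region": "UNKNOWN", "sales_amount": 0.0}

def transform(record: dict) -> dict:
    """Convert one operational record into the integrated warehouse format."""
    clean = {NAME_MAP.get(k, k): v for k, v in record.items()}   # common names and definitions
    for field, default in DEFAULTS.items():                      # defaults for missing data
        clean.setdefault(field, default)
    clean["sales_amount"] = float(clean["sales_amount"])
    return clean

source_rows = [{"cust_nm": "Acme", "sls_amt": "120.50"},
               {"cust_nm": "Zenith", "rgn": "EAST", "sls_amt": "75.00"}]
warehouse_rows = [transform(r) for r in source_rows]
# Calculate a derived summary (total sales) for loading into a summary table.
total_sales = sum(r["sales_amount"] for r in warehouse_rows)
print(warehouse_rows, total_sales)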
4. Meta data
Meta data is data about data that describes the data warehouse. Meta data can be classified into
1. Technical metadata:
This contains information about data warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks. Technical metadata documents include
Information about data sources
Transformation descriptions
Warehouse object and data structure definitions for data targets.
The rules used to perform data clean up and data enhancement.
2. Business Meta data:
This contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse. Business metadata documents information about
Subject areas and information object types, including queries, reports, images, video, and/or audio clips
Internet home pages
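For example, technical and business meta data for one warehouse table might be recorded as simple structured entries. The sketch below is purely illustrative and uses hypothetical names and values.
# Illustrative (hypothetical) meta data entries for one warehouse table.
technical_metadata = {
    "target": "sales_fact",
    "source": "orders (operational OLTP database)",
    "transformation": "sum(order_amount) grouped by day and product",
    "cleanup_rule": "reject rows with null order_date",
    "refresh_schedule": "daily 02:00",
}
business_metadata = {
    "subject_area": "Sales",
    "description": "Daily sales totals by product, for revenue reporting",
    "owner": "Sales operations",
    "related_reports": ["Monthly revenue report"],
}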
5. Access Tools
Users interact with the data warehouse using front-end tools. These tools are divided into five groups
Data query and reporting tools
Application development tools
Executive information system tools
Online analytical processing tools
Data mining tools
1. Query and Reporting Tools
This category is further divided into two groups: Reporting and Managed query tools
Reporting tools can be divided into production reporting tools and desktop report writers
Production reporting tools let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks
Report writers, on the other hand, are inexpensive desktop tools designed for end users.
Managed query tools shield end users from the complexities of SQL and database structures by
inserting a metalayer between users and the database.
Metalayer
It is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
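A metalayer can be pictured as a mapping from business terms to physical tables and columns, from which SQL is generated on the user's behalf. The sketch below is a minimal illustration; the view name, tables, and columns are hypothetical.
# Hypothetical metalayer: business terms mapped to physical schema objects.
SUBJECT_VIEWS = {
    "Sales by region": {
        "table": "sales_fact s JOIN region_dim r ON s.region_id = r.region_id",
        "columns": {"Region": "r.region_name", "Revenue": "SUM(s.sales_amount)"},
        "group_by": "r.region_name",
    }
}

def build_query(view_name: str) -> str:
    """Generate SQL from a point-and-click selection of a subject view."""
    view = SUBJECT_VIEWS[view_name]
    select = ", ".join(f'{expr} AS "{label}"' for label, expr in view["columns"].items())
    return f'SELECT {select} FROM {view["table"]} GROUP BY {view["group_by"]}'

print(build_query("Sales by region"))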
2. Application development tools
These tools may require such a complex set of queries and sophisticated data models that business users can find themselves overwhelmed by the need to become SQL and/or data modeling experts.
3. OLAP: These tools are based on the concept of multidimensional databases and allow a sophisticated user to analyze the data using elaborate, multidimensional, complex views.
4. Data Mining
A critical success factor for any business today is its ability to use information effectively.
Data mining is the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses, using AI as well as statistical and mathematical techniques.
In these areas, data mining can reach beyond the capabilities of OLAP, especially since the major attraction of data mining is its ability to build predictive rather than retrospective models. Most organizations engage in data mining to
Discover Knowledge
Visualize data
Correct data
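As a toy illustration of pattern discovery (not any particular product's algorithm), the sketch below measures how strongly two hypothetical warehouse attributes move together; a strong correlation would be a candidate for a predictive model.
# Requires Python 3.10+ for statistics.correlation.
from statistics import correlation

# Hypothetical detail data pulled from the warehouse: promotion spend vs. units sold.
promo_spend = [100, 250, 300, 80, 500, 420]
units_sold  = [12,  30,  33,  10, 55,  47]

# A correlation close to 1.0 suggests a pattern worth investigating further.
r = correlation(promo_spend, units_sold)
print(f"correlation between promotion spend and units sold: {r:.2f}")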
5. Data Visualization
Data visualization is not a separate class of tools; rather, it is a method of presenting the output of the previously mentioned tools in such a way that the entire problem and/or the solution is clearly visible.
Data visualization goes far beyond simple bar and pie charts. It is a collection of complex
techniques that currently represents an area of intense research.
6. Data Marts
Data marts are presented as an inexpensive alternative to a data warehouse that takes significantly less time and money to build. A data mart is a subsidiary of a data warehouse of integrated data, created for a dedicated set of users.
A data mart is a set of denormalized, summarized, or aggregated data. A data mart is often a separate database server, often on a local area network, serving a dedicated group of users.
1. Dependent data mart: The data content is sourced from a data warehouse. Dependent data marts have a high value because, no matter how many are deployed and no matter how many different enabling technologies are used, the data content remains consistent.
2. Independent data mart: Unfortunately, misleading statements about the simplicity and low cost of data marts sometimes result in organizations or vendors incorrectly positioning them as an alternative to the data warehouse. This viewpoint defines independent data marts.
The concept of an independent data mart is a dangerous one. Each group will start to design its own data marts without integration. The complex many-to-one problem is simply diverted into a many-to-many sourcing and management nightmare, and the scalability of data marts becomes complex.
The approach recommended by Ralph Kimball is as follows: for any two data marts in an enterprise, the common dimensions must conform to the equality and roll-up rule. In summary, data marts present two problems: scalability and integration.
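A dependent data mart can be pictured as a denormalized, summarized subset of warehouse data built for one group of users. The minimal sketch below uses hypothetical rows and an assumed grouping rule purely for illustration.
from collections import defaultdict

# Hypothetical warehouse detail rows: (region, product, sales_amount).
warehouse_rows = [
    ("EAST", "widget", 120.0), ("EAST", "gadget", 80.0),
    ("WEST", "widget", 200.0), ("EAST", "widget", 40.0),
]

# Build a data mart for the Eastern sales group: filter, then summarize by product.
mart = defaultdict(float)
for region, product, amount in warehouse_rows:
    if region == "EAST":
        mart[product] += amount

print(dict(mart))   # {'widget': 160.0, 'gadget': 80.0}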
7. Information Delivery System
The information delivery component is used to enable the process of subscribing for data
warehouse information and having it delivered to one or more destinations of choice according to some
user-specified scheduling algorithm. The information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products.
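The subscription-and-delivery idea can be sketched as a list of subscriptions, each with a destination and a user-specified schedule, checked by a delivery loop. Everything below (report names, destinations, the hourly schedule) is a hypothetical illustration.
from datetime import datetime

# Hypothetical subscriptions: report name, destination, and delivery hour (24h clock).
subscriptions = [
    {"report": "daily_sales_summary", "destination": "sales_mart_server", "hour": 2},
    {"report": "inventory_snapshot",  "destination": "ops@example.com",   "hour": 6},
]

def due_now(subscription, now=None):
    """Return True if the user-specified schedule says this delivery is due."""
    now = now or datetime.now()
    return now.hour == subscription["hour"]

for sub in subscriptions:
    if due_now(sub):
        print(f"deliver {sub['report']} to {sub['destination']}")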
Building a Data Warehouse
1. Approach
The subject-oriented nature of the data warehouse determines the scope of the information in the data warehouse. Organizations embarking on data warehousing development can choose one of two approaches
Top-down approach: meaning that the organization has developed an enterprise data model, collected enterprise-wide business requirements, and decided to build an enterprise data warehouse with subset data marts
Bottom-up approach: implying that business priorities drive the development of individual data marts, which are then integrated into the enterprise data warehouse
2. Organizational Issues
3. Design Considerations
In general, a data warehouse's design point is to consolidate data from multiple, often heterogeneous, sources into a query database. The main factors include
Heterogeneity of data sources, which affects data conversion, quality, and timeliness
Use of historical data, which implies that data may be "old"
Data content: Typically, a data warehouse may contain detailed data, but the data is cleaned up and transformed to fit the warehouse model, and certain transactional attributes of the data are filtered out. The content and structure of the data warehouse are reflected in its data model. The data model is a template for how information will be organized within the integrated data warehouse framework.
Meta data: Defines the contents and location of the data in the warehouse, the relationship between the operational databases and the data warehouse, and the business views of the warehouse data that are accessible by end-user tools. The warehouse design should prevent any direct access to the warehouse data if it does not use meta data definitions to gain that access.
Data distribution: As data volumes continue to grow, the database size may rapidly outgrow a single server. Therefore, it becomes necessary to decide how the data should be divided across multiple servers. The data placement and distribution design should consider several options, including data distribution by subject area, location, or time.
Tools: Data warehouse designers have to be careful not to sacrifice the overall design to fit a specific tool. Selected tools must be compatible with the given data warehousing environment and with each other.
Performance consideration: Rapid query processing is a highly desired feature that should be
designed into the data warehouse.
4. Technical Considerations
The database management system that supports the warehouse database
The communication infrastructure that connects the warehouse, data marts, operational
systems, and end users.
The hardware platform and software to support the meta data repository
The systems management framework that enables the centralized management and
administration of the entire environment.
5. Implementation Considerations
A data warehouse cannot be simply bought and installed; its implementation requires the integration of many products within the data warehouse environment, including
Access tools
Meta data
Mapping the Data Warehouse to a Multiprocessor Architecture
The size of a data warehouse rapidly approaches the point where the search for better performance and scalability becomes a real necessity. This search pursues two goals
Speed-up: the ability to execute the same request on the same amount of data in less time
Scale-up: the ability to obtain the same performance on the same request as the database size increases
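These two goals can be expressed as simple ratios. The sketch below uses made-up timings purely for illustration.
# Speed-up: same data, same request, more processors -> less elapsed time.
# Scale-up: data size and processors grow together -> elapsed time stays flat.

def speed_up(time_on_1_cpu: float, time_on_n_cpus: float) -> float:
    return time_on_1_cpu / time_on_n_cpus          # ideal value is N for N processors

def scale_up(time_small_system: float, time_large_system: float) -> float:
    return time_small_system / time_large_system   # ideal value is 1.0

# Hypothetical measurements: a query takes 400 s on 1 CPU and 110 s on 4 CPUs.
print(speed_up(400, 110))   # ~3.6x
# The same query on 4x the data with 4x the CPUs takes 420 s (vs. 400 s originally).
print(scale_up(400, 420))   # ~0.95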
1. Types of Parallelism
Parallel execution of tasks within SQL statements can be done in either of two ways.
Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
Vertical parallelism: parallelism occurs among different tasks; all component query operations are executed in parallel in a pipelined fashion. In other words, an output from one task becomes an input into another task as soon as records become available.
2. Data Partitioning
Data partitioning is a key requirement for effective parallel execution of database operations. It spreads data from database tables across multiple disks so that I/O operations such as read and write can be performed in parallel.
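The sketch below illustrates two simple placement policies discussed next, round-robin placement and key-based (hash) placement, using hypothetical records and an assumed fixed number of disks.
# Spread records across N disks so parallel I/O can work on each partition.
NUM_DISKS = 4
records = [{"order_id": i, "region": r} for i, r in enumerate(["EAST", "WEST"] * 6)]

# Round-robin partitioning: each new record goes to the next disk in turn.
round_robin = {d: [] for d in range(NUM_DISKS)}
for i, rec in enumerate(records):
    round_robin[i % NUM_DISKS].append(rec)

# Intelligent (hash) partitioning: the DBMS can later locate a record from its key.
hashed = {d: [] for d in range(NUM_DISKS)}
for rec in records:
    hashed[hash(rec["order_id"]) % NUM_DISKS].append(rec)

print({d: len(rows) for d, rows in round_robin.items()})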
Random partitioning includes random data striping across multiple disks on a single server. In round-robin partitioning, each new record is placed on the next disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. This partitioning allows a DBMS to fully exploit parallel architectures and also enables higher availability.
Schema partitioning: each table is placed on a different disk; useful for small reference tables.
User-defined partitioning: tables are partitioned based on a user-defined expression.
3. Database Architectures for Parallel Processing
1. Shared-Memory Architecture
Multiple processors share a single memory and a single copy of the RDBMS, as in traditional symmetric multiprocessing (SMP) systems; scalability is limited by the shared memory and bus.
2. Shared-disk Architecture
It implements the concept of shared ownership of the entire database between RDBMS servers, each of which is running on a node of a distributed memory system. Each RDBMS server can read, write, update, and delete records from the same shared database, which requires the system to implement a form of distributed lock manager (DLM).
Pinging:
In the worst-case scenario, if all nodes are reading and updating the same data, the RDBMS and its DLM will have to spend a lot of resources synchronizing multiple buffer pools. This problem is called pinging.
3. Shared-Nothing Architecture
The data is partitioned across many disks, and the DBMS is "partitioned" across multiple co-servers, each of which resides on an individual node of the parallel system and has ownership of its own disk and, thus, its own database partition. A shared-nothing RDBMS must also address issues such as
Query compilation
Support for the single system image of the database environment
4. Combined Architecture
Interserver parallelism of the distributed memory architecture means that each query is parallelized across multiple servers, while intraserver parallelism of the shared-memory architecture means that a query is parallelized within a server. Important criteria for such implementations include
Optimized implementation
Application transparency
Price/Performance
4. Alternative Technologies
In addition to parallel database technology, a number of vendors are working on other solutions to improve performance in data warehousing environments. Vendors active in this area include
1. Oracle
2. Informix
3. IBM
4. Sybase
5. Microsoft
Data Extraction, Clean Up and Transformation Tools
1. Tool Requirements
The tools that enable sourcing of the proper data contents and formats from operational and external data stores into the data warehouse have to perform a number of important tasks that include
Data transformation from one format to another on the basis of possible differences
between the source and target platforms
Data consolidation and integration, which may include combining several source records
into a single record to be loaded into the warehouse.
Meta data synchronization and management
Calculation of summaries and derived data based on the application of the business rules that force certain transformations
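A minimal sketch of the consolidation and integration task follows, with hypothetical source systems and keys: several source records describing the same customer are combined into a single record for the warehouse load.
# Combine several source records for the same customer into one warehouse record.
billing_system = {"C001": {"customer_id": "C001", "name": "Acme Corp", "balance": 1200.0}}
crm_system     = {"C001": {"customer_id": "C001", "segment": "Enterprise", "owner": "J. Doe"}}

def consolidate(customer_id: str) -> dict:
    merged = {}
    for source in (billing_system, crm_system):      # data consolidation and integration
        merged.update(source.get(customer_id, {}))
    return merged

print(consolidate("C001"))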
2. Vendor Approaches
The tasks of capturing data from a source data system, cleaning and transforming it, and then loading the results into a target data system can be carried out either by separate products or by a single integrated solution.
Code generator
The middleware strategy is the foundation for the tools such as Enterprise/Access from Apertus
Corporation.
The data layer provides data access and transaction services for management of corporate data assets.
The process layer provides services to manage automation and support for current
business processes.
The user layer manages user interaction with process and/or data layer services. It allows
the user interfaces to change independently of the underlying business processes.
Meta data
1. Meta data - Definition
Meta data is one of the most important aspects of data warehousing. It is data about the data stored in the warehouse and about its users, and it includes
The location of and description of the warehouse system and data components
Names, definition, structure and content of the data warehouse and end user views
Integration and transformation rules used to deliver data to end-user analytical tools
The MetaData Coalition was founded by a group of industry-leading vendors with the aim of defining a tactical set of standard specifications for the access and interchange of meta data between different software tools. What follows is an overview of Version 1.0 of the MetaData Interchange Specification (MDIS) initiative taken by the MetaData Coalition.
Goals of the MetaData Interchange Specification Initiative
Situation Analysis
The volatility of our global economy and an increasingly competitive business climate are
driving companies to leverage their existing resources in new, creative, and more effective ways.
Enterprise data, once viewed as merely fodder for the operational systems that ran the day-to-day
mechanics of business, is now being recognized not only as one of these valuable resources but as a
strategic business asset.
However, as the rate of change continues to accelerate (in response to both business pressures and technological advancement), managing this strategic asset and providing timely, accurate, and manageable access to enterprise data becomes increasingly critical. This need to find faster, more comprehensive, and more efficient ways to provide access to and manage enterprise data has given rise to a variety of new architectures and approaches, such as data warehouses, distributed client/server computing, and integrated enterprise-wide applications.
In these new environments, meta data, or the information about the enterprise data, is emerging as a
critical element in effective data management. Vendors as well as users have been quick to appreciate
the value of meta data, but the rapid proliferation of data manipulation and management tools has
resulted in almost as many different "flavors" and treatments of meta data as there are tools.
The Challenge
To enable full-scale enterprise data management, different tools must be able to freely and easily
access, and in some cases manipulate and update, the meta data created by other tools and stored in a
variety of different storage facilities. The only viable mechanism to enable disparate tools from
independent vendors to exchange this meta data is to establish at least a minimum common denominator
of interchange specifications and guidelines with which the different vendors' tools can comply.
Establishing and adhering to a core set of industry meta data interchange specifications will
enable IS managers to select what they perceive as "best of breed" to build the tool infrastructure that
best fits their unique environment needs. In choosing the interchange-compliant tools, they can be
assured of the accurate and efficient exchange of meta data essential to meeting their users' business
information needs.
The MetaData Coalition was established to bring industry vendors and users together to
address a variety of difficult problems and issues with regard to exchanging, sharing, and managing
meta data. This is intended as a coalition of interested parties with a common focus and shared goals,
not a traditional standards body or regulatory group in any way.
The MetaData Interchange Specification (MDIS) draws a distinction between:
The Application Metamodel - the tables, etc., used to "hold" the meta data for schemas, etc., for a particular application; for example, the set of tables used to store meta data in Composer may differ significantly from those used by the Bachman Data Analyst.
The MetaData Metamodel - the set of objects that the MetaData Interchange Specification can be
used to describe. These represent the information that is common (i.e., represented) by one or more
classes of tools, such as data discovery tools, data extraction tools, replication tools, user query tools,
database servers, etc. The meta data metamodel should satisfy the following requirements.
Because users' information needs are growing more complex, the IS organization would ideally
like the interchange specification to support (to the greatest extent possible) the bidirectional
interchange of meta data so that updates can be made in the most natural place. For example,
the user might initially specify the source-to-target mapping between a legacy database and a
RDBMS target in a CASE tool but, after using a data extraction tool to generate and execute
programs to actually move the data, discover that the mapping was somehow incorrect. The
most natural place to test out the "fix" to this problem is in the context of the data extraction
tool. Once the correction is verified, one would want to update the meta data there and have the change propagated back to the CASE tool, rather than having to go to the CASE tool, change the mapping, and trigger the meta data interchange between the CASE tool and the data extraction tool before being able to test the new mapping.
Vendors would like to support the MetaData Interchange Specification with a minimum amount
of additional development. In light of these assumptions, the meta data model must be
sufficiently extensible to allow a vendor to store the entire metamodel for any application. In
other words, MDIS should provide mechanisms for extending the meta data model so that
additional (and possibly encrypted) information can be passed. An example of when a vendor
might want encryption is in the case of a tool that generates parameters for invoking some
internal routine. Because these parameters might provide other vendors with information
regarding what is considered a proprietary part of their tool, the vendor may wish to encrypt
these parameters.
If one assumed that all updates to the model occurred in the context of a single tool, e.g., the CASE
tool in the example above, the MDIS would not benefit from "carrying along" any of the tool-specific
meta data. However, as the above example indicates, this assumption is not the "natural" meta data
interchange flow. Consequently, some type of mechanism for providing extensions to the type of
information exchanged by the interchange specification is necessary if one hopes to achieve
bidirectional interchange between vendor applications.
The MetaData Interchange Framework
For Version 1.0, the MetaData Council is recommending the ASCII-based batch approach so
that vendors can implement support for the specification with minimum overhead and the customer
benefits from the availability of meta data interchange as quickly as possible.
An ASCII Batch approach relies on the ASCII file format that contains the description of the common
meta data components and standardized access requirements that make up the interchange specification
meta data model. In this approach, the entire ASCII file containing the MDIS schema and access
parameters is reloaded whenever a tool accesses the meta data through the specification API.
This approach requires only the addition of a simple import/export function to the tools and would not require updating the tool in the event of meta data model changes, because the most up-to-date schema will always be available through the access framework. This minimizes the amount of retrofitting required to enable tools to remain compliant with the MDIS, because the burden of updates stays primarily within the framework itself.
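The actual MDIS file format and API are not reproduced here; the sketch below only illustrates the general idea of an ASCII-file-based batch interchange with a versioned header, using an entirely hypothetical file layout and function names.
# Hypothetical ASCII-file-based meta data interchange (NOT the actual MDIS format).
def export_metadata(entries: list[dict], path: str) -> None:
    with open(path, "w", encoding="ascii") as f:
        f.write("VERSION 1.0\n")                       # versioning information in the header
        for e in entries:
            f.write(f"OBJECT {e['name']} TYPE {e['type']} SOURCE {e['source']}\n")

def import_metadata(path: str) -> list[dict]:
    entries = []
    with open(path, encoding="ascii") as f:
        assert f.readline().startswith("VERSION")      # the whole file is reloaded on each access
        for line in f:
            _, name, _, type_, _, source = line.split(maxsplit=5)
            entries.append({"name": name, "type": type_, "source": source.strip()})
    return entries

export_metadata([{"name": "sales_fact", "type": "TABLE", "source": "orders"}], "mdis.txt")
print(import_metadata("mdis.txt"))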
The specification consists of two broad classes of items:
Those that pertain to the semantics and syntax used to represent the meta data to be exchanged. These items are those that are typically found in a specifications document.
Those that pertain to the framework in which the specification will be used. This second set of items includes two file-based semaphores that are used by the specification's import and export functions to help the user of the specification control consistency.
Components defining the semantics and syntax that define the specification:
The Metamodel
The MetaData Interchange Specification Metamodel describes the entities and relationships that are
used to directly represent meta data in the MDIS. The goal in designing this metamodel is twofold:
To choose the set of entities and relationships that represents the objects that the majority of
tools require.
To provide some mechanism for extensibility in the case that some tool requires the
representation of some other type of object. Section 5 describes the metamodel for Version 1.0
of the MetaData Interchange Specification. In the rest of this document the entities that are
directly represented by the specification are referred to as objects in the "public view," while
any other meta data stored in the interchange file is referred to as "private meta data" (i.e., tool
specific meta data).
Version 1.0 of the MDIS includes information that supports a bidirectional flow of meta data while maintaining meta data consistency:
Versioning information in the header of the file containing the meta data
A Tool Profile which describes what type of data elements a tool directly represents and/or
updates
A Configuration Profile which describes the "legal flow of meta data." For example, although source-to-target mapping may be specified in the context of some analysis tool, once that meta data has been exported to ETI*EXTRACT and the mapping is changed because of errors found in the expected data, one may want to require that all future changes to the mapping originate in ETI*EXTRACT. If the configuration profile is set properly, the import function for ETI*EXTRACT would raise an error if asked to import a conversion specification from the analysis tool with a version number greater than the version number of the one originally imported from the mapping tool.
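The versioning check described above can be sketched as follows. The profile structure and the function are hypothetical and are not the actual MDIS API; they only show the kind of rule a configuration profile could encode.
# Hypothetical check an import function might perform against the configuration profile.
configuration_profile = {
    # Once the mapping was exported to the extraction tool, later changes must originate there.
    "source_to_target_mapping": {"owner_tool": "extraction_tool", "imported_version": 3}
}

def can_import(object_name: str, from_tool: str, version: int) -> bool:
    rule = configuration_profile[object_name]
    if from_tool != rule["owner_tool"] and version > rule["imported_version"]:
        return False          # the "legal flow of meta data" forbids this update
    return True

print(can_import("source_to_target_mapping", "analysis_tool", version=4))   # False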
Tool profile
The data warehouse architecture framework represents a higher level of abstraction than the meta data interchange standard framework, by design. The warehouse design should prevent any direct access to the warehouse data if it does not use meta data definitions to gain that access.
A meta data repository provides several benefits:
It provides a comprehensive suite of tools for enterprise-wide meta data management.
It leverages investment in legacy systems with the ability to inventory and utilize existing applications.
It enforces CASE development standards and eliminates redundancy with the ability to share and reuse meta data.
A frequently occurring problem in data warehousing is the inability to communicate to the end user what information resides in the data warehouse and how it can be accessed.
The key to providing users and applications with a roadmap to the information stored in the warehouse is the meta data.
It defines all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformation. Meta data needs to be collected as the warehouse is designed and built, and its integrity must be enforced and redundancy controlled.
SUMMARY
This unit covers the basic components of data warehousing, including the architecture of a data warehouse, its components, and the considerations involved in building a data warehouse.
Data marts are presented as an inexpensive alternative to a data warehouse that takes significantly less time and money to build. A data mart is a subsidiary of a data warehouse of integrated data, created for a dedicated set of users.
The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm. The information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products.
KEY TERMS