UNIT-1 Data Warehousing Part-III
UNIT-1 Data Warehousing Part-III
UNIT-1 Data Warehousing Part-III
Data warehousing:
◦ The process of constructing and using data warehouses
01/03/23 2
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product,
sales.
Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process.
01/03/23 3
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data
sources
◦ relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are applied.
◦ Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
◦ When data is moved to the warehouse, it is
converted.
01/03/23 4
Data Warehouse—Time Variant
Thetime horizon for the data warehouse is significantly longer
than that of operational systems.
◦ Operational database: current value data.
◦ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
◦ Contains an element of time, explicitly or implicitly
◦ But the key of operational data may or may not
contain “time element”.
01/03/23 5
Data Warehouse—Non-Volatile
01/03/23 6
Data Warehouse vs. Heterogeneous DBMS
01/03/23 7
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
◦ Major task of traditional relational DBMS
◦ Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
◦ Major task of data warehouse system
◦ Data analysis and decision making
Distinct features (OLTP vs. OLAP):
◦ User and system orientation: customer vs. market
◦ Data contents: current, detailed vs. historical, consolidated
◦ Database design: ER + application vs. star + subject
◦ View: current, local vs. evolutionary, integrated
◦ Access patterns: update vs. read-only but complex queries
01/03/23 8
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
01/03/23 9
Why Separate Data Warehouse?
High performance for both systems
◦ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
◦ Warehouse—tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation.
01/03/23 10
A Multidimensional Data Model
Data warehouses and OLAP tools are based on a
multidimensional data model.
01/03/23 11
Data Cubes
• Dimensions are the perspectives or entities with respect to
which an organization wants to keep records.
– sales data warehouse: time, item, branch, and location.
01/03/23 13
3-D Data Cube
01/03/23 14
Conceptual Representation of
3-D Data Cube
01/03/23 15
4-D Data Cube Representation
01/03/23 16
Base and Apex Cuboids
The important thing to remember is that data cubes are
n- dimensional and do not confine data to 3-D.
01/03/23 17
Cube: A Lattice of Cuboids
01/03/23 18
Conceptual Modeling of Data
Warehouses
The entity-relationship data models are appropriate for on-line
transaction processing.
01/03/23 19
Star Schema
The most common modeling paradigm.
01/03/23 20
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold province_or_street
country
avg_sales
Measures
01/03/23 21
Star Schema
• In the star schema, each dimension is represented by only one
table, and each table contains a set of attributes.
01/03/23 23
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key
branch_key
location
branch location_key
location_key
branch_key
units_sold street
branch_name
city_key city
branch_type
dollars_sold
city_key
avg_sales city
province_or_street
Measures country
01/03/23 24
Fact Constellation
• Sophisticated applications may require multiple fact tables to
share dimension tables.
• For example, the dimensions tables for time, item, and location
are shared between both the sales and shipping fact tables.
– See the figure…
01/03/23 25
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location
01/03/23 27
DMQL: Language Primitives
A data mining query language (DMQL) can be used to specify
data mining tasks.
Data warehouses and data marts can be defined using two
language primitives.
01/03/23 28
Defining a Star Schema in
DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
01/03/23 29
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
01/03/23 30
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
01/03/23 31
Measures: Their Categorization and
Computation
• A data cube measure is a numerical function that can be
evaluated at each point in the data cube space.
01/03/23 32
Measures: Three Categories
• Distributive: if the result derived by applying the function to n
aggregate values is the same as that derived by applying the
function on all the data without partitioning.
• count(), sum(), min(), max().
• Algebraic: if it can be computed by an algebraic function with
M arguments (where M is a bounded integer), each of which is
obtained by applying a distributive aggregate function.
• avg(), min_N(), standard_deviation().
• Holistic: if there does not exist an algebraic function with M
arguments (where M is a constant) that characterizes the
computation.
• median(), mode(), rank().
01/03/23 33
Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set
of low-level concepts to higher-level, more general concepts.
Many concept hierarchies are implicit within the database
schema.
◦ Location: street < city < province or state < country.
◦ Time: day < {month <quarter; week} < year
Concept hierarchies may also be defined by discretizing or
grouping values for a given dimension or attribute, resulting in
a set-grouping hierarchy.
There may be more than one concept hierarchy for a given
attribute or dimension, based on different user viewpoints.
◦ Provided manually by system users, domain experts, or knowledge
engineers, or may be automatically generated based on statistical analysis
of the data distribution.
01/03/23 34
Concept Hierarchy of location
01/03/23 35
Hierarchical and lattice structures of
attributes
01/03/23 36
Concept hierarchy of price
01/03/23 37
OLAP Operations in the
Multidimensional Data Model
In the multidimensional model, data are organized into
multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies.
This organization provides users with the flexibility to view
data from different perspectives.
A number of OLAP data cube operations exist to materialize
these different views, allowing interactive querying and
analysis of the data at hand.
OLAP provides a user-friendly environment for interactive
data analysis.
01/03/23 38
Data cube for sales
01/03/23 39
Roll-up
• The roll-up operation performs aggregation on a data
cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.
01/03/23 40
roll-up on location
(form cities to
countrries)
01/03/23 41
Drill-down
Drill-down is the reverse of roll-up.
01/03/23 42
drill-down on time
(from quarters to months)
01/03/23 43
Slice, Dice and Pivot
The slice operation performs a selection on one dimension of
the given cube, resulting in a subcube.
01/03/23 44
slice for time
01/03/23 45
dice for
(location=“Toronto”or”vencouver”) and
(time =“Q1” or “Q2”) and
(item=“home entertinement” or ” Computer”)
01/03/23 46
Pivot
01/03/23 47
Other OLAP operations
Some OLAP systems offer additional drilling
operations.
◦ drill across: involving (across) more than one
fact table
◦ drill through: through the bottom level of the
cube to its back-end relational tables (using
SQL)
01/03/23 48
Data Warehouse Architecture
How to design and construct a data warehouse.
Three-tier data warehouse architecture.
Back-end tools and utilities for data warehouses.
Metadata repository.
Various types of warehouse servers for OLAP
processing.
01/03/23 49
Design of a Data Warehouse: A
Business Analysis Framework
Four views regarding the design of a data warehouse
◦ Top-down view
allows selection of the relevant information necessary for the
data warehouse
◦ Data source view
exposes the information being captured, stored, and managed
by operational systems
◦ Data warehouse view
consists of fact tables and dimension tables
◦ Business query view
sees the perspectives of data in the warehouse from the view
of end-user
01/03/23 50
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
◦ Top-down: Starts with overall design and planning (mature)
◦ Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
◦ Waterfall: structured and systematic analysis at each step before
proceeding to the next
◦ Spiral: rapid generation of increasingly functional systems, short turn
around time, quick turn around
Typical data warehouse design process
◦ Choose a business process to model, e.g., orders, invoices, etc.
◦ Choose the grain (atomic level of data) of the business process
◦ Choose the dimensions that will apply to each fact table record
◦ Choose the measure that will populate each fact table record
01/03/23 51
Back-end tools and Utilities
Three-tire architecture
01/03/23 52
Three-tiered Architecture
Monitor
Metadata & OLAP Server
other
source Integrator
s Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining
Data Marts
01/03/23 54
Back-End Tools and Utilities
Data warehouse systems use back-end tools and utilities to
populate and refresh their data.
These tools and utilities include the following functions:
◦ Data extraction
◦ Data cleaning
◦ Data transformation
◦ Load
◦ Refresh
Data cleaning and data transformation are important steps in
improving the quality of the data and, subsequently, of the data
mining results
01/03/23 55
Metadata Repository
Metadata are data about data.
In a data warehouse, metadata are the data that define warehouse
objects.
A metadata repository should contain the following:
01/03/23 56
Three-tire architecture
• The middle tier is an OLAP server that is typically implemented
using
– a relational OLAP (ROLAP) model or
– a multidimensional OLAP (MOLAP) model.
• The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and/or data mining tools (e.g., trend
analysis, prediction, and so on).
01/03/23 57
Three Data Warehouse Models
From the architecture point of view, there are three data warehouse models:
enterprise warehouse, data mart, and virtual warehouse.
Enterprise warehouse
◦ Collects all of the information about subjects spanning the entire organization
Data Mart
◦ A subset of corporate-wide data that is of value to a specific groups of users.
Its scope is confined to specific, selected groups, such as marketing data mart
Independent vs. dependent (data sourced directly from
warehouse) data mart
Virtual warehouse
◦ A set of views over operational databases
◦ Only some of the possible summary views may be materialized
01/03/23 58
Data Warehouse Development
Approaches
Top-down approach:
◦ A systematic solution and minimizes integration problems.
◦ It is expensive, takes a long time to develop, and lacks
flexibility due to the difficulty in achieving consistency and
consensus for a common data model for the entire
organization.
Bottom-up approach:
◦ Develop, and deploy independent data marts.
◦ Provides flexibility, low cost, and rapid return of investment.
◦ It can lead to problems when integrating various disparate
data marts into a consistent enterprise data warehouse .
01/03/23 59
Data Warehouse Development: A
Recommended Approach
incremental
and Multi-Tier Data
evolutionary Distributed Warehouse
approach Data Marts
01/03/23 61
OLAP Servers
Relational OLAP (ROLAP)
◦ These are the intermediate servers that stand in between a relational back-
end server and client front-end tools.
◦ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware to support missing pieces.
Greater scalability
Example:
01/03/23 62
OLAP Servers
• Multidimensional OLAP (MOLAP)
– Array-based multidimensional storage engine (sparse matrix techniques)
Faster Computation
– These will be discussed further in detailed in subsequent Sections.
01/03/23 63
From Data Warehousing to Data Mining
Data Warehouse applications
◦ Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
◦ Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
◦ Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
01/03/23 64
From Data Warehousing to Data Mining
01/03/23 65
From Data Warehousing to Data Mining
• On-line analytical processing Vs. Data mining
– OLAP is a data summarization/aggregation tool that helps simplify data
analysis, whereas data mining allows the automated discovery of
implicit patterns and interesting knowledge hidden in large amounts of
data.
– OLAP tools are targeted toward simplifying and supporting interactive
data analysis, whereas the data mining tools aim to automate as much of
the process as possible, while still allowing users to guide the process.
– OLAP systems can present general descriptions of data from data
warehouses (by drilling, pivoting, slicing, dicing, and other operations),
whereas data mining is not confined to the analysis of data stored in
data warehouses.
– Data mining involves more automated and deeper analysis than OLAP.
• Hence data mining is one step beyond traditional
OLAP.
01/03/23 66
From On-Line Analytical Processing to
On Line Analytical Mining (OLAM)
Why online analytical mining?
◦ High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
◦ Available information processing structure surrounding data warehouses
ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
◦ OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
◦ On-line selection of data mining functions
integration and swapping of multiple mining functions,
algorithms, and tasks.
Architecture of OLAM
01/03/23 67
An OLAM Architecture
Mining query Mining result Layer4
User Interface
User GUI API
Layer3
OLAM OLAP
Engine Engine OLAP/OLAM
Layer2
MDDB
MDDB
Meta
Data
Filtering&Integration Database API Filtering
Layer1
Data cleaning Data
Databases Data
Data integration Warehouse Repository68
01/03/23