Concepts and Techniques: - Chapter 4

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 50

Data Mining:

Concepts and Techniques


— Chapter 4 —
 Course teacher: Dr. Helal Uddin Ahmed
 PhD in Computer Engineering (South Korea), MBA (UK),
MSc (DU), BSc.(DU)
 Professor
 Department of MIS

 University of Dhaka

 Mob. 01783368720

 e-mail: helal@du.ac.bd
.

1
Chapter 4: Data Warehousing and On-line Analytical
Processing

 Data Warehouse: Basic Concepts

 Data Warehouse Modeling: Data Cube and OLAP

 Data Warehouse Design and Usage

 Data Warehouse Implementation

2
What is a Data Warehouse?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately from
the organization’s operational database
 Support information processing by providing a solid platform of
consolidated, historical data for analysis.
 “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon

3
Data Warehouse—Subject-Oriented

 Organized around major subjects, such as customer,


product, sales
 Focusing on the modeling and analysis of data for decision
makers,
 not on daily operations or transaction processing
 Provide a simple and concise view around particular
subject issues
 by excluding data that are not useful in the decision
support process

4
Data Warehouse—Integrated
 Constructed by integrating multiple, heterogeneous data
sources
 relational databases, flat files, on-line transaction

records
 Data cleaning and data integration techniques are
applied.
 When data is moved to the warehouse, it is

converted.

5
Data Warehouse—Time Variant

 The time horizon for the data warehouse is significantly


longer than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not
contain “time element”

6
Data Warehouse—Nonvolatile

 A physically separate store of data transformed from the


operational environment
 Operational update of data does not occur in the data
warehouse environment
 Does not require transaction processing, recovery,
and concurrency control mechanisms
 Requires only two operations in data accessing:
 initial loading of data and access of data

7
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

8
September 10, 2021 Data Mining: Concepts and Techniques 9
Why a Separate Data Warehouse?
 High performance for both systems
 DBMS— tuned for OLTP: access methods, indexing, concurrency
control
 Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
 Different functions and different data:
 missing data: Decision support requires historical data which
operational DBs do not typically maintain
 data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
 data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
 Note: There are more and more systems which perform OLAP analysis
directly on relational databases

10
Data Warehouse: A Multi-Tiered Architecture

Monitor
Metadata & OLAP Server
Other
sources Integrator

Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining

Data Marts

Data Sources Data Storage OLAP Engine Front-End Tools


11
Warehouse: 3-Tiered Architecture

Bottom tier is warehouse database server:


usually relational database system
Back end tools and utilities are used to

extract, cleaning and transforming

Middle tier follows one of two method:


Relational OLAP (ROLAP) & Multidimensional
OLAP (MOLAP)

Top tier contains: query, reporting and


analysis tools

September 10, 2021 Data Mining: Concepts and Techniques 12


Three Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects spanning

the entire organization


 Data Mart
 a subset of corporate-wide data that is of value to a

specific groups of users. Its scope is confined to specific,


selected groups, such as marketing data mart
 Independent vs. dependent (directly from warehouse) data mart
 Virtual warehouse
 A set of views over operational databases

 Only some of the possible summary views may be

materialized
13
Metadata Repository
 Meta data is the data defining warehouse objects. It stores:
 Description of the structure of the data warehouse
 schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
 The algorithms used for summarization
 The mapping from operational environment to the data warehouse
 Business data
 business terms and definitions, ownership of data, charging policies

Meta acts like an index to the contents of the data


warehouse.
 It sits above the warehouse and keeps track of what is

where in the warehouse.


14
Meta data
Typically, items the meta data store tracks are as follows:
 Structure of data as known to the programmer

 Structure of data as known to the DSS analyst

 Source data feeding the data warehouse


 Transformation of data as it passes into the data

warehouse
 Data model

 Relationship between the data model and the data

warehouse
 History of extracts

September 10, 2021 Data Mining: Concepts and Techniques 15


Chapter 4: Data Warehousing and On-line Analytical
Processing

 Data Warehouse: Basic Concepts

 Data Warehouse Modeling: Data Cube and OLAP

 Data Warehouse Design and Usage

 Data Warehouse Implementation

 Summary

16
From Tables and Spreadsheets to
Data Cubes
 A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
 In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
17
Cube: A Lattice of Cuboids
Total dollar sold
OR Item sold
all
0-D (apex) cuboid

time item location supplier


1-D cuboids

time,location item,location location,supplier


time,item 2-D cuboids
time,supplier item,supplier

time,location,supplier
3-D cuboids
time,item,location
time,item,supplier item,location,supplier

4-D (base) cuboid


time, item, location, supplier

18
Conceptual Modeling of Data Warehouses
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a
set of dimension tables
 Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
 Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
19
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures

20
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key

branch_key
location
branch location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country

21
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location

branch location_key location to_location


branch_key location_key dollars_cost
branch_name units_sold
street
branch_type dollars_sold city units_shipped
province_or_state
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
shipper_type 22
A Concept Hierarchy:
Dimension (location)

all all

region Europe ... North_America

country Germany ... Spain Canada ... Mexico

city Frankfurt ... Vancouver ... Toronto

office L. Chan ... M. Wind

23
Multidimensional Data

 Sales volume as a function of product, month,


and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
on
gi

Industry Region Year


Re

Category Country Quarter


Product

Product City Month Week

Office Day

Month
24
A Sample Data Cube

Total annual sales


Date of TVs in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
t
uc

TV
od

PC U.S.A
Pr

VCR

Country
sum
Canada

Mexico

sum

25
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or
detailed data, or introducing new dimensions
 Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
26
Fig. 3.10 Typical OLAP
Operations

27
A Star-Net Query Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS

ORDER
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT

REGION
DIVISION
Location Each circle is
called a footprint Promotion Organization
28
Browsing a Data Cube

 Visualization
 OLAP capabilities
 Interactive manipulation
29
Chapter 4: Data Warehousing and On-line Analytical
Processing

 Data Warehouse: Basic Concepts

 Data Warehouse Modeling: Data Cube and OLAP

 Data Warehouse Design and Usage

 Data Warehouse Implementation

 Summary

30
Design of Data Warehouse: A Business
Analysis Framework
 Four views regarding the design of a data warehouse
 Top-down view
 allows selection of the relevant information necessary for the
data warehouse
 Data source view
 exposes the information being captured, stored, and
managed by operational systems
 Data warehouse view
 consists of fact tables and dimension tables
 Business query view
 sees the perspectives of data in the warehouse from the view
of end-user
31
Data Warehouse Design Process
 Top-down, bottom-up approaches or a combination of both
 Top-down: Starts with overall design and planning (mature)
 Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view
 Waterfall: structured and systematic analysis at each step before
proceeding to the next
 Spiral: rapid generation of increasingly functional systems, short
turn around time, quick turn around
 Typical data warehouse design process
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record
32
Data Warehouse Development: A
Recommended Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts

Data Data Enterprise


Mart Mart Data
Warehouse

Model refinement Model refinement

Define a high-level corporate data model


33
Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing
 supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
 Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
 Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools

34
From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
 Why online analytical mining?
 High quality of data in data warehouses

 DW contains integrated, consistent, cleaned data

 Available information processing structure surrounding

data warehouses
 ODBC, OLEDB, Web accessing, service facilities,

reporting and OLAP tools


 OLAP-based exploratory data analysis

 Mining with drilling, dicing, pivoting, etc.

 On-line selection of data mining functions

 Integration and swapping of multiple mining

functions, algorithms, and tasks


35
Chapter 4: Data Warehousing and On-line Analytical
Processing

 Data Warehouse: Basic Concepts

 Data Warehouse Modeling: Data Cube and OLAP

 Data Warehouse Design and Usage

 Data Warehouse Implementation

 Summary

36
Efficient Data Cube Computation
 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids in an n-dimensional cube with L
levels? n
T   ( Li 1)
i 1
 Materialization of data cube
 Materialize every (cuboid) (full materialization), none
(no materialization), or some (partial
materialization)
 Selection of which cuboids to materialize
 Based on size, sharing, access frequency, etc.
37
The “Compute Cube” Operator
 Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
 Transform it into a SQL-like language (with a new operator cube
by, introduced by Gray et al.’96) ()
SELECT item, city, year, SUM (amount)
FROM SALES (city) (item) (year)

CUBE BY item, city, year


 Need compute the following Group-Bys
(city, item) (city, year) (item, year)
(date, product, customer),
(date,product),(date, customer), (product, customer),
(date), (product), (customer) (city, item, year)
()
38
Indexing OLAP Data: Bitmap Index
 Index on a particular column
 Each value in the column has a bit vector: bit-op is fast
 The length of the bit vector: # of records in the base table
 The i-th bit is set if the i-th row of the base table has the value for the
indexed column
 not suitable for high cardinality domains
 A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it
work for high cardinality domain as well [Wu, et al. TODS’06]

Base table Index on Region Index on Type


Cust Region Type RecID Asia Europe Am erica RecID Retail Dealer
C1 Asia Retail 1 1 0 0 1 1 0
C2 Europe Dealer 2 0 1 0 2 0 1
C3 Asia Dealer 3 1 0 0 3 0 1
C4 America Retail 4 0 0 1 4 1 0
C5 Europe Dealer 5 0 1 0 5 0 1
39
Indexing OLAP Data: Join Indices
 Join index: JI(R-id, S-id) where R (R-id, …)  S
(S-id, …)
 Traditional indices map the values to a list of
record ids
 It materializes relational join in JI file and

speeds up relational join


 In data warehouses, join index relates the values
of the dimensions of a start schema to rows in
the fact table.
 E.g. fact table: Sales and two dimensions city

and product
 A join index on city maintains for each

distinct city a list of R-IDs of the tuples


recording the Sales in the city
 Join indices can span multiple dimensions

40
Efficient Processing OLAP Queries
 Determine which operations should be performed on the available cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
 Determine which materialized cuboid(s) should be selected for OLAP op.
 Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
 Explore indexing structures and compressed vs. dense array structs in MOLAP
41
OLAP Server Architectures

 Relational OLAP (ROLAP)


 Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
 Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
 Greater scalability
 Multidimensional OLAP (MOLAP)
 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
 Flexibility, e.g., low level: relational, high-level: array
 Specialized SQL servers (e.g., Redbricks)
 Specialized support for SQL queries over star/snowflake schemas
42
Chapter 4: Data Warehousing and On-line Analytical
Processing

 Data Warehouse: Basic Concepts

 Data Warehouse Modeling: Data Cube and OLAP

 Data Warehouse Design and Usage

 Data Warehouse Implementation

 Summary

43
Summary
 Data warehousing: A multi-dimensional model of a data warehouse
 A data cube consists of dimensions & measures
 Star schema, snowflake schema, fact constellations
 OLAP operations: drilling, rolling, slicing, dicing and pivoting
 Data Warehouse Architecture, Design, and Usage
 Multi-tiered architecture
 Business analysis design framework
 Information processing, analytical processing, data mining, OLAM
(Online Analytical Mining)
 Implementation: Efficient computation of data cubes
 Partial vs. full vs. no materialization

 Indexing OALP data: Bitmap index and join index

 OLAP query processing

 OLAP servers: ROLAP, MOLAP, HOLAP

44
References (I)
 S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan,
and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96
 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
 S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology.
ACM SIGMOD Record, 26:65-74, 1997
 E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27,
July 1993.
 J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by,
cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
 A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
 J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record,
27:97-107, 1998.
 V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96

45
References (II)
 C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design:
Relational and Dimensional Techniques. John Wiley, 2003
 W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
 R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling. 2ed. John Wiley, 2002
 P. O'Neil and D. Quass. Improved query performance with variant indexes.
SIGMOD'97
 Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
 A. Shoshani. OLAP and statistical databases: Similarities and differences.
PODS’00.
 S. Sarawagi and M. Stonebraker. Efficient organization of large
multidimensional arrays. ICDE'94
 P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
 J. Widom. Research problems in data warehousing. CIKM’95.
 K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient
Compression, ACM Trans. on Database Systems (TODS), 31(1), 2006, pp. 1-38.
46
September 10, 2021 Data Mining: Concepts and Techniques 47
Chapter 4: Data Warehousing and On-line Analytical
Processing
 Data Warehouse: Basic Concepts
 (a) What Is a Data Warehouse?
 (b) Data Warehouse: A Multi-Tiered Architecture
 (c) Three Data Warehouse Models: Enterprise Warehouse, Data Mart, ad Virtual Warehouse
 (d) Extraction, Transformation and Loading
 (e) Metadata Repository
 Data Warehouse Modeling: Data Cube and OLAP
 (a) Cube: A Lattice of Cuboids
 (b) Conceptual Modeling of Data Warehouses
 (c) Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
 (d) Dimensions: The Role of Concept Hierarchy
 (e) Measures: Their Categorization and Computation
 (f) Cube Definitions in Database systems
 (g) Typical OLAP Operations
 (h) A Starnet Query Model for Querying Multidimensional Databases
 Data Warehouse Design and Usage
 (a) Design of Data Warehouses: A Business Analysis Framework
 (b) Data Warehouses Design Processes
 (c) Data Warehouse Usage
 (d) From On-Line Analytical Processing to On-Line Analytical Mining
 Data Warehouse Implementation
 (a) Efficient Data Cube Computation: Cube Operation, Materialization of Data Cubes, and Iceberg Cubes
 (b) Indexing OLAP Data: Bitmap Index and Join Index
 (c) Efficient Processing of OLAP Queries
 (d) OLAP Server Architectures: ROLAP vs. MOLAP vs. HOLAP
 Summary

48
Compression of Bitmap Indices
 Bitmap indexes must be compressed to reduce I/O costs
and minimize CPU usage—majority of the bits are 0’s
 Two compression schemes:
 Byte-aligned Bitmap Code (BBC)
 Word-Aligned Hybrid (WAH) code
 Time and space required to operate on compressed
bitmap is proportional to the total size of the bitmap
 Optimal on attributes of low cardinality as well as those of
high cardinality.
 WAH out performs BBC by about a factor of two
49
■■ Structure of data as known to the
programmer
■■ Structure of data as known to the DSS analyst
■■ Source data feeding the data warehouse
■■ Transformation of data as it passes into the
data warehouse
■■ Data model
■■ Relationship between the data model and the
data warehouse
■■ History of extracts

September 10, 2021 Data Mining: Concepts and Techniques 50

You might also like