DW Dimensional Modeling

Download as pdf or txt
Download as pdf or txt
You are on page 1of 27

Data Warehousing and

Multi-dimensional Data Modeling

Prof. Srikumar Krishnamoorthy

IIM Ahmedabad
Learning objectives

Data Warehouse fundamentals

Dimensional modeling

OLAP

DMBI IIMA 2
What is a Data Warehouse

Defined in many different ways, but not rigorously.


I A decision support database that is maintained separately from the
organization’s operational database
I Support information processing by providing a solid platform of
consolidated, historical data for analysis.

”A data warehouse is a subject-oriented, integrated, time-variant,


and nonvolatile collection of data in support of management’s
decision-making process.” − W. H. Inmon

Data warehousing:
The process of constructing and using data warehouses

DMBI IIMA 3
Feature OLTP OLAP

Users Analyst, DBA Managers, Business


Analyst, Executives
Function Day-to-day oper- Decision support
ations
DB design ERD Multi Dim. Model
Data Latest Historical
Query Read/write Mostly read
#Records per query Tens Millions
#Users Thousands Hundreds
Size of DB 100MB to GB 100GB to TB
Metric #TPS Response time

DMBI IIMA 4
Data Warehouse: A Multi-tiered Architecture

DMBI IIMA 5
Use of Metadata

Which territories does region


“SOUTH” include?
Does the data item 04-01-2000 Query: Sales Information
denote April 1, 2000 or January 4, Product = Widget-1 or Widget-2
2000? What is the convention used
for dates in your data warehouse? Region = ’South’

Are the numbers shown as sale units Period = 04-01-2000 to 04-07-2000


given in physical units of the
products, or in some measure such Result
as pounds or kilograms? Qty sold Amount
What about the amounts shown in Widget-1 25,000 250,000
the result set? Are these amounts in
Widget-2 30,000 150,000
dollars or in some other currency?
Metadata gives your user the
meaning of each data element.

DMBI IIMA 6
Multi-dimensional data modeling

A data warehouse is based on a multidimensional data model


which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed


in multiple dimensions
Dimension tables such as item (item name, brand, type), or time
(day, week, month, quarter, year)
Fact table contains measures (such as dollars sold) and keys to each
of the related dimension tables

In data warehousing literature, an n-D base cube is called a base


cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
DMBI IIMA 7
Cube: A Lattice of Cuboids

DMBI IIMA 8
A Sample Data Cube

DMBI IIMA 9
Aggregate, Group By, Cross Tab and Cube: An
Illustrative Example

DMBI IIMA 10
Conceptual Modeling of Data Warehouses

Star schema: A fact table in the middle connected to a set of


dimension tables

Snowflake schema: A refinement of star schema where some


dimensional hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake

Fact constellations: Multiple fact tables share dimension tables,


viewed as a collection of stars, therefore called galaxy schema or
fact constellation

DMBI IIMA 11
Star schema

Time Item
time key (PK) item key (PK)
day item name
day of the week brand
month Sales type
quarter supplier name
year time key (FK) supplier type
item key (FK)
branch key (FK)
loc key (FK)
qtySales
Branch dollarSales Location
branch key (PK) avgSales loc key (PK)
branch name street
branch type city
state

DMBI IIMA 12
Snowflake schema
Supplier
Time Item supplier key (PK)
supplier name
time key (PK) item key
supplier type
day (PK)
day of the week item name
month Sales brand
quarter type
year time key (FK) supplier key
item key (FK)
branch key (FK)
loc key (FK)
qtySales
Branch dollarSales Location
branch key (PK) avgSales loc key (PK)
branch name street
branch type city key

City
city key (PK)
city
state
DMBI IIMA 13
Fact Constellation schema Shipping
Time time key (FK)
item key (FK)
time key (PK)
shipper key (FK)
day Item from loc key (FK)
day of the week
item key (PK) to loc key (FK)
month
Sales item name qtyShipped
quarter
brand dollarCost
year time key (FK) type
item key (FK) supplier name
branch key supplier type
(FK)
loc key (FK)
qtySales
Branch dollarSales Location
branch key avgSales loc key (PK)
(PK) street
branch name city Shipper
branch type state shipper key (PK)
shipper name
shipper type

DMBI IIMA 14
Factless Fact Table

Q: Which products were under promotion but did not sell?

Sales fact table typically captures only products that were sold

Create Promotion Coverage Factless Fact Table


I Factless Fact Table = has no measurement metrics
I Contains date, product, store, and promotion keys

Two-step process to answer Q:


I Query Promotion Coverage table: products under promotion on
given date
I From Sales Fact table: products sold
I Answer is the set difference of above

DMBI IIMA 15
Fact Table: Measure Types

Additive: it makes sense to sum the measures across all


dimensions
I Quantity sold across Region, Store, Salesperson, Date, Product . . .

Semi additive: additive only across certain dimensions


I Quantity on hand is not additive over Date, but it is additive across
Store and Product

Non-additive: cannot be summed across any dimensions


I A ratio, a percentage
I A measure that is non additive on one dimension may be the object
of other data aggregations
I Average, Min, Max of quantities on hand over time

DMBI IIMA 16
Degenerate Dimension (DD)

Dimension keys used in fact table without corresponding


dimension tables
Useful for grouping by transaction
Common DDs: order numbers, invoice numbers, transaction #

DMBI IIMA 17
Surrogate Keys

Surrogate keys are integers assigned sequentially as needed to


populate a dimension. They serve to join dimension tables to the
fact table.
Avoid embedding intelligence in the data warehouse keys.
Benefits:
I Surrogate keys buffer the DW environment from operational
changes. What happens when operations decide to recycle account
numbers after some period of inactivity? Fine for operational
systems, but problematic for DW if it is using account numbers as a
PK.
I Can more easily integrate data from multiple operational systems,
even if they lack consistent source keys.
I Performance advantages because small size of surrogate keys leads
to smaller fact tables
I Surrogate keys are used to support one of the primary techniques
for handling changes in dimension table attributes
DMBI IIMA 18
Slowly Changing Dimensions (SCD)

Attributes in a dimensional table that change over time. Three basic


approaches:
Type 1: overwrite the old value
Type 2: create a new dimensional record
Type 3: create a “previous value” attribute

DMBI IIMA 19
SCD – Examples
Product key Description Category SKU Original

21553 LeadPad Education LP2105

Type 1
Product key Description Category SKU
21553 LeadPad Toy LP2105

Type 2
Product key Description Category SKU
21553 LeadPad Education LP2105
44631 LeadPad Toy LP2105
Type 3
Product key Description Category SKU Old category
21553 LeadPad Toy LP2105 Education
DMBI IIMA 20
Dimension Tables: Concept Hierarchy

DMBI IIMA 21
Concept Hierarchies

Schema hierarchy: total or partial order among attributes in the


database schema, formally expresses existing semantic
relationships between attributes
I Location hierarchy (total order)
• street <city <province or state <country
I Time hierarchy (partial order)
• day <{month <quarter; week} <year
day <month <quarter <year
day <week <year

Set-grouping hierarchy: organizes values for a given attribute or


dimension into groups or constant range values
I young, middle aged, senior subset of all(age)
{20-39} = young
{40-59} = middle aged
{60-89} = senior

DMBI IIMA 22
Concept Hierarchies (2)

Operation-derived hierarchy: based on operations specified by


users, experts, or the data mining system
I email address or a URL contains hierarchy info relating
departments, universities (or companies) and countries
I E-mail address: sri.is@iima.in
I Partial concept hierarchy
login-name <department <university <country

Rule-based hierarchy: either a whole concept hierarchy or a


portion of it is defined by a set of rules and is evaluated
dynamically based on the current data and rule definition
I Following rules used to categorize items as low profit margin,
medium profit margin and high profit margin
Low profit margin: <$50
Medium profit margin: between $50 & $250
High profit margin: >$250
DMBI IIMA 23
Typical OLAP Operations

Roll up (drill-up): summarize data


• by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up


• from higher level summary to lower level summary or detailed data,
or introducing new dimensions

Slice and dice: project and select


Pivot (rotate):
• reorient the cube, visualization, 3D to series of 2D planes

Drill across: involving (across) more than one fact table

Drill through: through the bottom level of the cube to its


back-end relational tables (using SQL)
DMBI IIMA 24
Dimensional Modeling: Extensibility

Dimensional models can handle extensions without invalidating


existing applications:
I New dimension attributes: simply add columns to dimension table.
If new attribute is only available after a point in time, populate old
dimension records with something like “Not Available”
I New dimensions: add foreign field keys to fact table

I New measured facts: add to fact table. If not at the same grain,
then need separate fact table
I Dimension becoming more granular: create new dimension. May
imply more granular fact table, in which case, may have to rebuild
the fact table.
I Addition of a completely new data source involving existing and
new dimensions: usually needs new fact table

DMBI IIMA 25
References

Han & Kamber, Data Mining: Concepts and Techniques

E. Turban, R. Sharda, D. Delen, Decision Support and Business


Intelligence Systems

DMBI IIMA 26
THANK YOU

You might also like