Data Warehousing and Data Mining: Sunil Paudel


Data Warehousing and Data Mining

Sunil Paudel
sunilpaudel@gmail.com

Outline
Review on RDBMS
OLAP Operations

DBMS Overview
 DBMS (Database Management Systems) are designed to
achieve the following four main goals:
1. Increase Data Independence
 Data and programs are independent
 Changes in the data do not affect user programs
2. Reduce Data Redundancy
 Data is stored only once
 Different applications share the same centralized data
3. Increase Data Security
 Authorize access to the database
 Place restrictions on operations that may be performed on data
4. Maintain Data Integrity
 The same data is used by many users

RDBMS
 A relational database is a database that is
perceived by the user as a collection of tables
 This user view is independent of the actual way
the data is stored
 Tables are sets of data made up from rows and
columns
 Structure
 Very flexible -- create views
 Keep the data secure (use views)
 Relation between tables
 Primary & Foreign Keys
 Normalization
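
The points about keys and views can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module; the tables department and employee, the view employee_public and all row values are hypothetical and not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    salary  REAL NOT NULL,
    dept_id INTEGER NOT NULL REFERENCES department(dept_id)
);
-- The view exposes names and departments but hides the salary column.
CREATE VIEW employee_public AS
    SELECT e.emp_id, e.name, d.name AS department
    FROM employee e JOIN department d ON d.dept_id = e.dept_id;
""")

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (10, 'Asha', 52000.0, 1)")

# Users of the view see the data independently of how it is stored.
view_rows = conn.execute("SELECT * FROM employee_public").fetchall()

# A row pointing at a non-existent department violates the foreign key.
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Bikram', 48000.0, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The view gives a restricted window onto the base tables, while the primary and foreign keys keep the relation between the two tables consistent.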

RDBMS- Normalization
 Normalization is the process of streamlining your tables
and their relationships
1. First Normal Form (1NF)
 Action: Eliminate multiple values in a single column (non-atomic values) and repeating
groups
 Rule: Each column must be a fact about ... the key
2. Second Normal Form (2NF)
 Action: Regroup columns that depend on only one part of the
composite key
 Rule: Each column must be a fact about ... the whole key
3. Third Normal Form (3NF)
 Action: Regroup non-key columns that represent a fact about
another non-key column
 Rule: Each column must be a fact about ... nothing but the key
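
As a small illustration of the regrouping steps, the sketch below splits one flat order table into smaller ones. The table layout and all values are invented for this example. The composite key is assumed to be (order_id, product_id), so product_name depends on only part of the key (a 2NF issue), and customer_city is a fact about customer_id, a non-key column (a 3NF issue).

```python
flat_rows = [
    # (order_id, product_id, product_name, customer_id, customer_city, qty)
    (1, "P1", "AA Batteries", "C1", "Kathmandu", 2),
    (1, "P2", "Flashlight",   "C1", "Kathmandu", 1),
    (2, "P1", "AA Batteries", "C2", "Pokhara",   5),
]

products = {}     # product_id -> product_name (fact about part of the key)
customers = {}    # customer_id -> customer_city (fact about a non-key column)
orders = {}       # order_id -> customer_id
order_lines = {}  # (order_id, product_id) -> qty (fact about the whole key)

for order_id, product_id, product_name, customer_id, city, qty in flat_rows:
    products[product_id] = product_name
    customers[customer_id] = city
    orders[order_id] = customer_id
    order_lines[(order_id, product_id)] = qty

# Joining the small tables back together reproduces the flat table,
# so nothing was lost by removing the redundancy.
rebuilt = sorted(
    (o, p, products[p], orders[o], customers[orders[o]], q)
    for (o, p), q in order_lines.items()
)
```

Each product name and each customer city is now stored once instead of once per order line.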

Views and Joins


 Views are ways of looking at data from one or
more tables
 Tables can be related to each other through the data
they hold; combining them on that data is called a join
 Join Strategies:
 Cross Product
 Inner Join
 Outer Join
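
The three join strategies can be compared on a tiny data set. This is a sketch using Python's built-in sqlite3 module; the customer and orders tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Asha'), (2, 'Bikram'), (3, 'Chandra');
INSERT INTO orders VALUES (100, 1, 25.0), (101, 1, 40.0), (102, 2, 15.0);
""")

# Cross product: every customer paired with every order (3 x 3 rows).
cross = conn.execute(
    "SELECT c.name, o.order_id FROM customer c CROSS JOIN orders o"
).fetchall()

# Inner join: only pairs whose cust_id values match.
inner = conn.execute(
    "SELECT c.name, o.order_id FROM customer c "
    "INNER JOIN orders o ON o.cust_id = c.cust_id "
    "ORDER BY o.order_id"
).fetchall()

# Left outer join: customers without orders are kept, padded with NULL.
outer = conn.execute(
    "SELECT c.name, o.order_id FROM customer c "
    "LEFT OUTER JOIN orders o ON o.cust_id = c.cust_id "
    "ORDER BY c.cust_id, o.order_id"
).fetchall()
```

Chandra has no orders, so she appears in the outer join result with an empty order_id but not in the inner join result.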

SQL
SQL is divided into three major categories:
1. DDL: Data Definition Language
 Used to create, modify or drop database objects
2. DML: Data Manipulation Language
 Used to select, insert, update or delete database data
(records)
3. DCL: Data Control Language
 Used to provide data object access control
 E.g. connect to database, grant, revoke
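
A short sketch of which statements fall into which category, run through Python's built-in sqlite3 module. The item table is hypothetical. SQLite does not implement GRANT or REVOKE, so the DCL statement appears only as a comment showing the general form used by server database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create a database object.
conn.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# DML: insert, update, select and delete records.
conn.execute("INSERT INTO item VALUES (1, 'Batteries', 4.5)")
conn.execute("UPDATE item SET price = 5.0 WHERE item_id = 1")
price = conn.execute("SELECT price FROM item WHERE item_id = 1").fetchone()[0]
conn.execute("DELETE FROM item WHERE item_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]

# DDL again: drop the object.
conn.execute("DROP TABLE item")

# DCL (not available in SQLite; general form in other systems):
#   GRANT SELECT ON item TO some_user;
#   REVOKE SELECT ON item FROM some_user;
```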

Multidimensional Data
 A data warehouse is based on a multidimensional data
model which views data in the form of a data cube

Sample Query
 Query:
"What are the net sales, in terms of revenue and quantities of items sold,
 per product,
 per store and sales region,
 per customer and customer sales area,
 per day as well as aggregated over time,
 over the last two weeks?"

 Evaluation entails viewing historical sales figures from
multiple perspectives such as:
 Sales (overall)
 Sales per product
 Sales per store and per sales region
 Sales per customer and customer sales area
 Sales per day and aggregated over time
 Sales and aggregated sales over given time periods

Representation of Query as a Cube

Multi-dimensional data models can be presented using cubes or using a
mathematical notation that represents points in a multi-dimensional
space, for example: QTY_SOLD = F(S,P,C,t), where S, P, C and t presumably
stand for store, product, customer and time.
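
The notation QTY_SOLD = F(S,P,C,t) can be read as a lookup from a point in the multi-dimensional space to a measure value. Below is a minimal sketch assuming the four arguments are store, product, customer and day; the sales records are invented for illustration.

```python
from collections import defaultdict

# Each record is one sale: (store, product, customer, day, quantity).
sales = [
    ("S1", "P1", "C1", "2024-01-01", 3),
    ("S1", "P1", "C1", "2024-01-01", 2),
    ("S1", "P2", "C2", "2024-01-01", 1),
    ("S2", "P1", "C3", "2024-01-02", 4),
]

# The cube: one cell per combination of dimension values.
qty_sold = defaultdict(int)
for store, product, customer, day, qty in sales:
    qty_sold[(store, product, customer, day)] += qty

def F(s, p, c, t):
    """Return QTY_SOLD at the point (s, p, c, t); empty cells count as 0."""
    return qty_sold.get((s, p, c, t), 0)
```

Most combinations of dimension values have no sales at all, which is why the sketch stores only the non-empty cells.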

Query as a Cube : Usage

Hypercube Representation
 If more than three dimensions are present in the
solution, the cube or 3D space representation is no
longer usable.
 The principle of the cube can be extended to a hypercube,
for example by adding a 4th dimension

Sample Multidimensional Representation

Six Basic Concepts of MDDM


 To build an initial multi-dimensional data model, the
following six base elements have to be identified:
1. Measures
2. Dimensions
3. Grains of dimensions and granularities of measures and
facts
4. Facts
5. Dimension hierarchies
6. Aggregation levels

1. Measure
 A measure is a data item which information analysts use
in their queries to measure the performance or behavior
of a business process or a business object
 Sample types of measures
 Quantities
 Sizes
 Amounts
 Durations, delays
 And so forth
 KPI (Key Performance Indicator) is a commonly used term
for the most important measures of a business.

2. Dimensions
 A dimension is an entity or a collection of related
entities, used by information analysts to identify
the context of the measures they work with
 Examples: Product, Customer, Store, Time

 Dimensions are referred to through so-called
dimension keys
 Dimensions contain
 Dimension entities
 Dimension attributes
 Dimension hierarchies
 As an example, the measure sales revenue
only makes sense if we know its value for a specific
item and a specific customer, on a given day and in a certain
store. This gives us the four dimensions: item, store,
customer and time.

3. Granularity
 The grain of a dimension is the lowest level of detail
available within that dimension





 Product grain: Item
 Customer grain: Customer
 Store grain: Store
 Time grain: Day
 The granularity of a measure is determined by the
combination of the grains of all its dimensions
 For example, the granularity of the measure
QTY_SOLD is: (item, customer, store, day).
 Fine granularity enables fine-grained analysis,
but it also has a big impact on the size
of the Data Warehouse.
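
The size impact can be estimated with simple arithmetic: the number of possible cells is the product of the number of members at the grain of each dimension. The member counts below are assumptions chosen for illustration, and the result is an upper bound, since real cubes are sparse.

```python
from math import prod

# Hypothetical number of members at the grain of each dimension.
fine_grain = {"item": 1_000, "customer": 5_000, "store": 20, "day": 365}

# Same dimensions, but time kept only at month level instead of day level.
coarse_grain = {"item": 1_000, "customer": 5_000, "store": 20, "month": 12}

fine_cells = prod(fine_grain.values())      # upper bound at (item, customer, store, day)
coarse_cells = prod(coarse_grain.values())  # upper bound at (item, customer, store, month)
ratio = fine_cells / coarse_cells
```

With these assumed counts, moving from month grain to day grain multiplies the maximum number of cells by roughly 30.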

Granularity

Here we see an example of how fine granularity can reveal hidden
information, such as some stores in a region performing better than others.

4. Facts
 A fact is a collection of related measures and
their associated dimensions, represented by the
dimension keys
 Example: Sales
 A fact can represent a business object, a
business transaction or an event which is used
by the information analyst
 Facts contain
 A Fact Identifier
 Dimension Keys
 Measures
 Supportive Attributes

5. Dimension Hierarchies
 Dimensions consist of one or more dimension hierarchies
 Examples: Hierarchies in the Product Dimension
 Product Classification Hierarchy ("Merchandising Hierarchy")
 Branding Hierarchy

 Each dimension hierarchy can include several aggregation
levels
 Examples: Aggregation Levels in the Product Classification
Hierarchy

6. Aggregation Levels

 Each dimension hierarchy can include several
aggregation levels
 For example:
 Item: 4-pack Duracell AA Alkaline Batteries
 Product: Duracell AA Alkaline Batteries
 Sub-category: AA Alkaline Batteries
 Category: Batteries
 Department: Supplies
 Dimension hierarchies and aggregation levels are
used by users when drilling up or down.
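
Drilling up along a hierarchy amounts to re-grouping a measure by a higher aggregation level. The sketch below uses the battery hierarchy from the example; the second and third items and all quantities are hypothetical additions so that the aggregation has something to combine.

```python
from collections import defaultdict

AA = {
    "product": "Duracell AA Alkaline Batteries",
    "sub_category": "AA Alkaline Batteries",
    "category": "Batteries",
    "department": "Supplies",
}
AAA = {  # hypothetical sibling product
    "product": "Duracell AAA Alkaline Batteries",
    "sub_category": "AAA Alkaline Batteries",
    "category": "Batteries",
    "department": "Supplies",
}

# Item (the grain) -> its ancestors at each aggregation level.
hierarchy = {
    "4-pack Duracell AA Alkaline Batteries": AA,
    "8-pack Duracell AA Alkaline Batteries": AA,    # hypothetical item
    "4-pack Duracell AAA Alkaline Batteries": AAA,  # hypothetical item
}

# Hypothetical quantities sold per item.
qty_by_item = {
    "4-pack Duracell AA Alkaline Batteries": 10,
    "8-pack Duracell AA Alkaline Batteries": 5,
    "4-pack Duracell AAA Alkaline Batteries": 7,
}

def roll_up(qty_by_item, level):
    """Sum item-level quantities up to the given aggregation level."""
    totals = defaultdict(int)
    for item, qty in qty_by_item.items():
        totals[hierarchy[item][level]] += qty
    return dict(totals)

by_product = roll_up(qty_by_item, "product")
by_category = roll_up(qty_by_item, "category")
```

Drilling down is the reverse direction: going from by_category back to by_product or to the item-level figures.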

Summary

Initial MDM- Example

 This shows the six base concepts as they apply to our
Sales Query and the initial model that corresponds with
that query.

Star Schema
 A star schema is a way to represent multidimensional
data in a relational database
 The star schema logical design, unlike the entity-relationship model, is specifically geared towards
decision support applications.
 Fact table stores business data
 Generally several orders of magnitude larger than any dimension
table
 One key column joined to each dimension table
 One or more data columns
 Multidimensional queries can be built by joining fact and
dimension tables
 Some products use this method to make a relational
OLAP (ROLAP) system

Star Schema- Example

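Logical Data Modeling: A Star Schema Example
[Figure: star schema diagram. Each dimension table is linked 1:n to the Sales fact table.]
 Sales (fact): time_key, branch_key, location_key, product_key, num_units, amount_usd
 Time: time_key, day, month, year
 Branch: branch_key, name, type
 Location: location_key, city, state, country
 Product: product_key, name, brand, type
 Supplier (link to the rest of the schema marked "???" in the figure): supplier_key, name, type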





One-to-many relationships between the fact and dimensions.


The fact-dimension relationships are certain.
Dimensions in star models are often tightly coupled.
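
A multidimensional query over a star schema is a join of the fact table with the dimension tables it needs, followed by a GROUP BY. The sketch below uses Python's built-in sqlite3 module and only two dimensions to stay short. The key and measure column names (time_key, product_key, num_units, amount_usd) follow the example; the table names time_dim and product_dim and all rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER
);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT
);
-- Fact table: one key column per dimension plus the measures.
CREATE TABLE sales (
    time_key    INTEGER REFERENCES time_dim(time_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    num_units   INTEGER,
    amount_usd  REAL
);
INSERT INTO time_dim VALUES (1, 1, 1, 2023), (2, 2, 1, 2023), (3, 1, 1, 2024);
INSERT INTO product_dim VALUES
    (1, 'AA 4-pack', 'BrandA', 'battery'),
    (2, 'AAA 4-pack', 'BrandB', 'battery');
INSERT INTO sales VALUES
    (1, 1, 2, 10.0), (2, 1, 1, 5.0), (2, 2, 3, 12.0), (3, 1, 4, 20.0);
""")

# Units and revenue per year and brand: fact joined to two dimensions.
rows = conn.execute("""
    SELECT t.year, p.brand, SUM(s.num_units), SUM(s.amount_usd)
    FROM sales s
    JOIN time_dim t    ON t.time_key = s.time_key
    JOIN product_dim p ON p.product_key = s.product_key
    GROUP BY t.year, p.brand
    ORDER BY t.year, p.brand
""").fetchall()
```

Grouping by different dimension attributes (month instead of year, type instead of brand) changes the perspective without changing the schema.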

Snowflake Schema
 The snowflake model is a further
normalized version of the star
schema.
 When a dimension table contains
data that is not always necessary
for queries, too much data may be
picked up each time a dimension
table is accessed.
 To eliminate access to this data, it
is kept in a separate table off the
dimension, thereby making the star
resemble a snowflake.

Typical OLAP Operations


 Roll up (drill-up): summarize data
 by climbing up a hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher-level summary to lower-level summary or detailed data, or
by introducing new dimensions
 Slice and dice
 project and select
 Pivot (rotate)
 reorient the cube for visualization, e.g. 3D to a series of 2D planes
 Other operations
 drill across: queries involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
 rankings
 time functions: e.g. time average
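
The basic operations can be sketched on a tiny in-memory cube. This is an illustrative model only, not how an OLAP server is implemented: the cube is a dictionary keyed by (product, store, month), all values are invented, and roll-up is shown only in its dimension-reduction form.

```python
from collections import defaultdict

DIMS = ("product", "store", "month")

# Hypothetical cube: (product, store, month) -> quantity sold.
cube = {
    ("P1", "S1", "Jan"): 10,
    ("P1", "S2", "Jan"): 4,
    ("P2", "S1", "Jan"): 6,
    ("P1", "S1", "Feb"): 8,
    ("P2", "S2", "Feb"): 5,
}

def roll_up(cube, drop):
    """Roll up by dimension reduction: sum away the dimension `drop`."""
    i = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)

def slice_(cube, dim, value):
    """Slice: fix one dimension to a single value (keys keep all dimensions)."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: keep a sub-cube given sets of allowed values per dimension."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def pivot(table2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in table2d.items()}

by_product_month = roll_up(cube, "store")
january = slice_(cube, "month", "Jan")
p1_at_s1 = dice(cube, product={"P1"}, store={"S1"})
by_month_product = pivot(by_product_month)
```

Drill down is the reverse of roll_up and needs the more detailed cube as input, since the summed view no longer contains the individual store figures.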
