Data Warehousing and Data Mining: Sunil Paudel
Data Warehousing and Data Mining: Sunil Paudel
Data Warehousing and Data Mining: Sunil Paudel
Sunil Paudel
sunilpaudel@gmail.com
Outline
Review on RDBMS
OLAP Operations
DBMS Overview
DBMS (Database Management Systems) are designed to
achieve the following four main goals:
1. Increase Data Independence
Data & programs are independent
Change in data did not affect user programs
3.
4.
RDBMS
A relational database is a database that is
perceived by the user as a collection of tables
This user view is independent of the actual way
the data is stored
Tables are sets of data made up from rows and
columns
Structure
RDBMS- Normalization
Normalization is the process of streamlining your tables
and their relationships
1. Normal Form (1NF)
Action: Eliminate repeating values in one atom and repeating
groups
Rule: Each column must be a fact about .... the key
3.
SQL
SQL is divided into three major categories:
1. DDL Data Definition Language
Used to create, modify or drop database objects
(records)
3. DCL Data Control Language
Used to provide data object access control
E.g. connect to database, grant, revoke
Multidimesional Data
A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
Sample Quary
Query:
"What are the net sales, in terms of revenue and quantities of items sold,
Per product,
Per store and sales region,
Per customer and customer sales area,
Per day as well as aggregated over time,
Over the last two weeks?
Sales (overall)
Sales per product
Sales per store and per sales region
Sales per customer and customer sales area
Sales per day and aggregated over time
Sales and aggregated sales over given time periods
Hypercube Representation
If more than three dimensions are present in the
solution, the cube or 3D space representation is no
longer usable.
The principle of the cube can be extended to hypercube
4th Dimension
Sample Multidimensional
Representation
1. Measure
A measure is a data item which information analysts use
in their queries to measure the performance or behavior
of a business process or a business object
Sample types of measures
Quantities
Sizes
Amounts
Durations, delay
And so forth
2. Dimensions
A dimension is an entity or a collection of related
entities, used by information analysts to identify
the context of the measures they work with
Examples: Product, Customer, Store, Time
3. Granularity
The grain of a dimension is the lowest level of detail
available within that dimension
Product grain:
Customer grain:
Store grain:
Time grain:
Item
Customer
Store
Day
Granularity
4. Facts
A fact is a collection of related measures and
their associated dimensions, represented by the
dimension keys
Example: Sales
A Fact Identifier
Dimension Keys
Measures
Supportive Attributes
5. Dimension Hierarchies
Dimensions consist of one or more dimension hierarchies
Examples: Hierarchies in the Product Dimension
Product Classification Hierarchy ("Merchandising Hierarchy")
Branding Hierarchy
6. Aggregation Levels
Summary
Star Schema
A star schema is a way to represent multidimensional
data in a relational database
The star schema logical design, unlike the entityrelationship model, is specifically geared towards
decision support applications.
Fact table stores business data
Generally several orders of magnitude larger than any dimension
table
One key column joined to each dimension table
One or more data columns
Branch
1
1
Sales
day
month
time_key
year
name
type
branch_key
n
location_key
product_key
Location
1
location_key
branch_key
num_units
amount_usd
Product
1
product_key
???
Supplier
city
name
supplier_key
state
brand
name
country
type
type
Snowflake Schema
The snowflake model is a further
normalized version of the star
schema.
When a dimension table contains
data that is not always necessary
for queries, too much data may be
picked up each time a dimension
table is accessed.
To eliminate access to this data, it
is kept in a separate table off the
dimension, thereby making the star
resemble a snowflake.
Pivot (rotate)
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end
relational
tables (using SQL)
Rankings
time functions: e.g. time avg.