Data Warehouse
BY
B.M.BRINDA,
AP / IT - VCEW
12/04/2016
OVERVIEW
Introduction to Data
Data Warehouse
OLAP vs. OLTP
Multidimensional Data
Data Mining
Data Preprocessing
INTRODUCTION
Data
Data is information that has been translated into a form that is more convenient to move or process.
Database
A database is an organized collection of information that can easily be accessed, managed, and updated by a set of programs.
A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
DATA WAREHOUSE
"A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process." (W. H. Inmon)
Subject-Oriented
Data is organized around major subjects rather than around applications.
[Figure: application systems (Loans, ATM, Credit Card, Trust, Savings) reorganized around subjects (Customer, Product, Vendor, Activity)]
Integrated
Data on a given subject is defined and stored once.
Data gathered into the data warehouse from a variety of sources is merged into a coherent whole.
[Figure: OLTP applications (Savings, Current accounts, Loans) integrated into a single Customer view in the Data Warehouse]
Time-Variant
Data is stored as a series of snapshots, each representing a period of time.
All data in the data warehouse is identified with a particular time period.
[Figure: monthly data snapshots labeled Jan-97, Feb-97, Mar-97]
Nonvolatile
[Figure: the operational database performs Insert, Update, and Delete operations, while the data warehouse supports only Read access]
Changing Data
[Figure: a first-time load from the Operational Database into the Warehouse Database, followed by periodic Refresh operations]
OLTP
OLTP - Online Transaction Processing, or Operational Database Systems
Performs online transaction and query processing
Covers most of the day-to-day operations
Characterized by a large number of short online transactions (INSERT, UPDATE, DELETE)
Examples: purchasing, inventory, manufacturing, banking, payroll, registration, etc.
OLAP
OLAP - Online Analytical Processing, or Data Warehouse Systems
Serves users or knowledge workers in the role of decision making and data analysis
Organizes and presents data in various formats in order to satisfy various user requests
Characterized by a relatively low volume of transactions
OLAP allows users to analyze database information from multiple database systems at one time
OLAP data is stored in multidimensional databases
OLTP vs. OLAP
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: application vs. subject

Feature               OLTP                          OLAP
users                 clerk, IT professional        knowledge worker
function              day-to-day operations         decision support
DB design             application-oriented          subject-oriented
data                  current, up-to-date,          historical, summarized,
                      detailed, flat relational,    multidimensional,
                      isolated                      integrated, consolidated
usage                 repetitive                    ad-hoc
access                read/write,                   lots of scans
                      index/hash on prim. key
unit of work          short, simple transaction     complex query
# records accessed    tens                          millions
# users               thousands                     hundreds
DB size               100MB-GB                      100GB-TB
metric                transaction throughput        query throughput, response
MULTIDIMENSIONAL DATA MODEL
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
Data Cube - Allows data to be modeled and viewed in multiple dimensions
Supports viewing/modeling of a variable (or a set of variables) of interest
Measures are used to report the values of the particular variable with respect to a given set of dimensions
DATA CUBE
A data cube is defined by dimensions and facts.
Dimensions - Entities about which an organization wants to keep records
Dimension Table - Each dimension may be associated with a table, e.g., Item, Branch, Location
3D Data Cube
[Figure: a three-dimensional data cube with the measure Dollars_Sold]
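As a sketch, a tiny cube can be represented in plain Python as a mapping from dimension coordinates to the measure; the item/branch/location names and values below are illustrative, not from the slides.

```python
# A tiny 3-D data cube as a dict keyed by (item, branch, location),
# with dollars_sold as the measure. All data here is made up.
cube = {
    ("phone", "B1", "Chennai"): 400,
    ("phone", "B2", "Mumbai"):  300,
    ("tv",    "B1", "Chennai"): 500,
}

# Report the measure for one cell of the cube.
dollars = cube[("phone", "B1", "Chennai")]   # -> 400

# Aggregate the measure over the branch and location dimensions.
by_item = {}
for (item, branch, location), v in cube.items():
    by_item[item] = by_item.get(item, 0) + v
# by_item -> {"phone": 700, "tv": 500}
```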
Modeling of Data Warehouse
A data warehouse requires a concise, subject-oriented schema that facilitates online data analysis.
Star Schema
Snowflake Schema
Fact Constellation Schema
STAR SCHEMA
Contains
a
large
central
table
fact
table contains
bulk data and a
set of smaller
attendant
tables
dimension
table, one for
each dimension
12/04/2016
23
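The fact-plus-dimensions layout can be sketched in plain Python; the table and column names below are illustrative assumptions, not from the slides.

```python
# A star schema sketch: small dimension tables keyed by surrogate keys,
# and a fact table holding foreign keys plus the dollars_sold measure.
dim_item = {1: {"name": "phone", "brand": "Acme"},
            2: {"name": "tv",    "brand": "Beta"}}
dim_location = {10: {"city": "Chennai", "country": "India"},
                20: {"city": "Mumbai",  "country": "India"}}

# Fact table: one row per sale, referencing each dimension by key.
fact_sales = [
    {"item_id": 1, "location_id": 10, "dollars_sold": 400},
    {"item_id": 1, "location_id": 20, "dollars_sold": 300},
    {"item_id": 2, "location_id": 10, "dollars_sold": 500},
]

# A query joins the fact table to a dimension through the key.
total_phone = sum(f["dollars_sold"] for f in fact_sales
                  if dim_item[f["item_id"]]["name"] == "phone")
india_sales = sum(f["dollars_sold"] for f in fact_sales
                  if dim_location[f["location_id"]]["country"] == "India")
# total_phone -> 700, india_sales -> 1200
```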
SNOWFLAKE SCHEMA
A variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables.
FACT CONSTELLATION
The schema can be viewed as a collection of stars, and hence it is called a galaxy schema or a fact constellation.
Used for sophisticated applications.
CONCEPT HIERARCHY
Defines a sequence of mappings from a set of
low-level concepts to higher-level, more general
concepts.
OLAP OPERATIONS
Roll Up (Drill Up) - Reduction of dimensions
Drill Down (Roll Down) - Adds a new dimension
Slice and Dice
Pivot (Rotate)
ROLL UP
Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
DRILL DOWN
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
PIVOT
The pivot operation is also known as rotation. It rotates the data axes in view, in order to provide an alternative presentation of the data.
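The OLAP operations above can be sketched on a tiny sales table in plain Python; the dimension names and sample numbers are our own illustrative assumptions.

```python
# Toy fact rows with three dimensions (quarter, city, item) and a measure.
sales = [
    {"quarter": "Q1", "city": "Chennai", "item": "phone", "dollars": 400},
    {"quarter": "Q1", "city": "Mumbai",  "item": "phone", "dollars": 300},
    {"quarter": "Q2", "city": "Chennai", "item": "tv",    "dollars": 500},
]

# Roll up: aggregate away the city dimension (dimension reduction).
rollup = {}
for r in sales:
    key = (r["quarter"], r["item"])
    rollup[key] = rollup.get(key, 0) + r["dollars"]
# rollup -> {("Q1", "phone"): 700, ("Q2", "tv"): 500}

# Slice: fix one dimension (quarter = "Q1") to get a sub-cube.
q1 = [r for r in sales if r["quarter"] == "Q1"]

# Pivot: rotate the axes to present quarter rows against city columns.
pivot = {}
for r in sales:
    pivot.setdefault(r["quarter"], {})[r["city"]] = r["dollars"]
# pivot -> {"Q1": {"Chennai": 400, "Mumbai": 300}, "Q2": {"Chennai": 500}}
```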
DATA WAREHOUSE DESIGN
Top-down, bottom-up approaches, or a combination of both:
Top-down: Starts with overall design and planning (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
Data Warehouse Design Steps
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Virtual Warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized
DATA MINING
Extracting or mining knowledge from large amounts of data
Also called Knowledge Extraction, Knowledge Discovery from Data (KDD), and Data/Pattern Analysis
KDD STEPS
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are
applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined
knowledge to the user)
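The steps above can be sketched as a chain of small functions over toy records; the function names, attributes, and cut-off values are our own illustrative assumptions.

```python
# A minimal KDD pipeline sketch: each stage mirrors one numbered step.
records = [{"age": 25, "buys": "yes"}, {"age": None, "buys": "no"},
           {"age": 40, "buys": "yes"}]

def clean(rows):                       # 1. drop incomplete rows
    return [r for r in rows if r["age"] is not None]

def select(rows):                      # 3. keep task-relevant attributes
    return [{"age": r["age"], "buys": r["buys"]} for r in rows]

def transform(rows):                   # 4. discretize age for mining
    return [{**r, "age_group": "young" if r["age"] < 30 else "middle"}
            for r in rows]

def mine(rows):                        # 5. extract a trivial pattern
    return {r["age_group"]: r["buys"] for r in rows}

patterns = mine(transform(select(clean(records))))
# patterns -> {"young": "yes", "middle": "yes"}
```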
DATA PREPROCESSING
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction, but of particular importance, especially for numerical data
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data Cleaning: How to Handle Missing Data?
Data Cleaning: How to Handle Noisy Data?
Binning
First sort the data and partition it into (equal-frequency) bins
Then smooth by bin means, bin medians, bin boundaries, etc.
Regression
Smooth by fitting the data to regression functions
Clustering
Detect and remove outliers
Data Cleaning: Binning Methods
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
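The worked example above can be reproduced with a short sketch; the helper names are ours, and the rounding of bin means to whole dollars is an assumption matching the slide's values.

```python
# Equal-frequency binning with smoothing by bin means, using the
# price data from the slide.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into n_bins bins of equal size."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

bins = equal_frequency_bins(prices, 3)
# bins -> [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
smoothed = smooth_by_means(bins)
# smoothed -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```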
REGRESSION
Data can be smoothed by fitting the data to a
function, such as with regression.
Linear regression involves finding the best line
to fit two attributes (or variables), so that one
attribute can be used to predict the other.
Multiple linear regression is an extension of
linear regression, where more than two
attributes are involved and the data are fit to a
multidimensional surface.
Regression
[Figure: linear regression line y = x + 1 fit through data points (X1, Y1)]
CLUSTERING
Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.
Outliers - Data objects with characteristics that are considerably different from most of the other data objects in the data set
Data Integration
Data integration: combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts:
For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Integration: Handling Redundancy in Data Integration
Data Integration: Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

    r_{A,B} = Σ(A - mean_A)(B - mean_B) / ((n-1)·σ_A·σ_B) = (Σ(A·B) - n·mean_A·mean_B) / ((n-1)·σ_A·σ_B)

where n is the number of tuples, mean_A and mean_B are the means of A and B, and σ_A, σ_B are their standard deviations.
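A direct sketch of the coefficient from this formula, using only the standard library; the sample data is made up to show a perfect linear relationship.

```python
# Pearson's correlation coefficient, computed from the definition.
import statistics

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample standard deviations, matching the (n - 1) in the denominator.
    std_a, std_b = statistics.stdev(a), statistics.stdev(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / ((n - 1) * std_a * std_b)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]     # perfectly linearly related to a
r = pearson_r(a, b)      # -> 1.0
```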
Data Integration: Correlation Analysis (Categorical Data)
χ² (chi-square) test:

    χ² = Σ (Observed - Expected)² / Expected

The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Correlation does not imply causality:
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
Example (expected counts shown in parentheses):

                            Play chess   Not play chess   Sum (row)
Like science fiction          250 (90)      200 (360)        450
Not like science fiction       50 (210)    1000 (840)       1050
Sum (col.)                    300          1200             1500

    χ² = (250-90)²/90 + (50-210)²/210 + (200-360)²/360 + (1000-840)²/840 = 507.93
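The χ² value from the table above can be checked with a few lines; the observed and expected counts are taken directly from the example.

```python
# Chi-square statistic for the 2x2 contingency table above.
observed = [250, 200, 50, 1000]
expected = [90, 360, 210, 840]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 -> approximately 507.93, matching the slide
```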
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube
construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small,
specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization:

    v' = (v - min_A) / (max_A - min_A) · (new_max_A - new_min_A) + new_min_A

Z-score normalization:

    v' = (v - mean_A) / σ_A

    e.g., for a mean of 54,000 and a standard deviation of 16,000: (73,600 - 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
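The three normalizations can be sketched on a small made-up income sample; the values and the standard deviation of 16,000 used to mirror the slide's worked example are assumptions.

```python
# Min-max, z-score, and decimal-scaling normalization on sample data.
import statistics

values = [12000, 54000, 73600, 98000]

# Min-max normalization into the range [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization using the sample mean and standard deviation.
mean, std = statistics.mean(values), statistics.stdev(values)
z_scores = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j so every value falls below 1 in magnitude.
j = len(str(int(max(abs(v) for v in values))))
decimal_scaled = [v / 10 ** j for v in values]

# The slide's worked z-score example, with an assumed std of 16,000:
example = (73600 - 54000) / 16000   # -> 1.225
```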
Data Reduction
Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Data Reduction: Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc.
Data Reduction: Dimensionality Reduction
Purpose:
Avoid the curse of dimensionality
Reduce the amount of time and memory required by data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
Techniques
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
The goal is to find a projection that captures the largest amount of variation in the data.
Find the eigenvectors of the covariance matrix; the eigenvectors define the new space.
[Figure: data in the (x1, x2) plane with principal eigenvector e along the direction of greatest variance]
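A minimal PCA sketch via eigenvectors of the covariance matrix, using NumPy on synthetic correlated data; the data shape and noise scale are our own assumptions.

```python
# PCA by eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: x2 roughly follows 2 * x1.
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.3, size=200)
data = np.column_stack([x1, x2])

# Center the data, then find eigenvectors of the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# The last eigenvector captures the largest variance; project onto it.
principal = eigvecs[:, -1]
projected = centered @ principal
```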
Data Reduction: Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information contained in
one or more other attributes
Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
contain no information that is useful for the data
mining task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Data Reduction: Feature Subset Selection
Techniques:
Brute-force approach:
Try all possible feature subsets as input to the data mining algorithm
Filter approaches:
Features are selected before data mining
algorithm is run
Wrapper approaches:
Use the data mining algorithm as a black box
to find best subset of attributes
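A simple filter approach can be sketched with a variance threshold: features that barely vary (like the student-ID example above, which is useless for predicting GPA) are dropped before mining. The criterion, threshold, and data are our own illustrative assumptions.

```python
# A filter-style feature selector: keep only high-variance features.
import statistics

def filter_by_variance(features, threshold):
    """features: dict of name -> list of values. Keep high-variance ones."""
    return {name: vals for name, vals in features.items()
            if statistics.pvariance(vals) >= threshold}

data = {
    "gpa":        [3.1, 2.5, 3.9, 3.4],
    "student_id": [1.0, 1.0, 1.0, 1.0],   # constant -> carries no signal
}
kept = filter_by_variance(data, threshold=0.01)
# kept -> only "gpa" survives
```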
Data Reduction: Feature Creation
Create new attributes that can capture the
important information in a data set much
more efficiently than the original attributes
Three general methodologies:
Feature Extraction
domain-specific
Data Reduction: Mapping Data to a New Space
Fourier transform
Wavelet transform
Data Reduction: Discretization Using Class Labels
Data Reduction: Discretization Without Using Class Labels
[Figure: the same data discretized by equal frequency and by K-means]
Data Reduction: Attribute Transformation