DWM
Motivation: "Necessity Is the Mother of Invention"
• Data explosion problem
• Other Applications
– Text mining (news groups, email, documents) and Web mining
– Stream data mining
– DNA and bio-data analysis
– Corporate analysis & risk management
[Figure: the KDD process. Databases are integrated and cleaned into a data warehouse; data selection yields the task-relevant data, on which data mining, the core of knowledge discovery, is performed]
Steps of a KDD
Process
• Learning the application domain
– relevant prior knowledge and goals of
application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may
take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable
reduction, invariant representation.
• Choosing functions of data mining
– summarization, classification, regression,
association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of
interest
• Pattern evaluation and knowledge
presentation
– visualization, transformation, removing
redundant patterns, etc.
• Use of discovered knowledge
Architecture: Typical Data
Mining System
[Figure: databases and a data warehouse feed the data mining engine, whose output passes to pattern evaluation]
Data Mining: On What
Kinds of Data?
• Relational database
• Data warehouse
• Transactional database
• Advanced database and
information repository
– Object-relational database
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Heterogeneous and legacy
database
– Text databases & WWW
Data Mining
Functionalities
• Concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
– Diapers → Beer [0.5%, 75%]
• Classification and Prediction
– Construct models (functions) that describe and
distinguish classes or concepts for future
prediction
• E.g., classify countries based on climate, or
classify cars based on gas mileage
– Presentation: decision-tree, classification rule,
neural network
– Predict some unknown or missing numerical
values
Data Mining
Functionalities (2)
• Cluster analysis
– Class label is unknown: Group data to form new
classes, e.g., cluster houses to find distribution
patterns
– Maximizing intra-class similarity & minimizing
interclass similarity
• Outlier analysis
– Outlier: a data object that does not comply with
the general behavior of the data
– Noise or exception? No: outliers can be useful in fraud
detection and rare-event analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Data Mining: Confluence of
Multiple Disciplines
[Figure: data mining at the confluence of database systems, statistics, machine learning, visualization, algorithms, and other disciplines]
Major Issues in Data
Mining
• Mining methodology
– Mining different kinds of knowledge from diverse
data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and
scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing
one: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of
abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Data Mining
Association Rules
Mining Association
Rules in Large
Databases
• Association rule mining
• Mining single-dimensional
Boolean association rules
from transactional databases
• Summary
What Is
Association
Mining?
• Association rule mining:
– Finding frequent patterns, associations,
correlations, or causal structures
among sets of items or objects in
transaction databases, relational
databases, and other information
repositories.
• Applications:
– Basket data analysis, cross-marketing,
catalog design, loss-leader analysis,
clustering, classification, etc.
• Examples.
– Rule form: "Body → Head [support,
confidence]".
– buys(x, “diapers”) → buys(x, “beers”)
[0.5%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) →
grade(x, “A”) [1%, 75%]
Association Rule:
Basic Concepts
• Given: (1) database of transactions,
(2) each transaction is a list of items
(purchased by a customer in a visit)
• Find: all rules that correlate the
presence of one set of items with
that of another set of items
– E.g., 98% of people who purchase tires
and auto accessories also get
automotive services done
• Applications
– * ⇒ Maintenance Agreement (What
the store should do to boost
Maintenance Agreement sales)
– Home Electronics ⇒ * (What other
products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients,
faulty “collisions”
Rule Measures:
Support and Confidence

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

• Find all the rules X & Y ⇒ Z with minimum
confidence and support
– support, s, probability that a transaction
contains {X, Y, Z}
– confidence, c, conditional probability that a
transaction having {X, Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support 50% and minimum
confidence 50%, we have
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
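To make the two measures concrete, here is a minimal Python sketch (function names and structure are mine, not from the slides) that computes support and confidence over the four example transactions:

```python
# Example transactions from the slide (TID: items bought)
transactions = [
    {"A", "B", "C"},  # 2000
    {"A", "C"},       # 1000
    {"A", "D"},       # 4000
    {"B", "E", "F"},  # 5000
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))       # 0.5, the support of A => C
print(confidence({"A"}, {"C"}, transactions))  # 2/3, the confidence of A => C
print(confidence({"C"}, {"A"}, transactions))  # 1.0, the confidence of C => A
```

This reproduces A ⇒ C (50%, 66.6%) and C ⇒ A (50%, 100%) from the table.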
Association Rule
Mining: A Road Map
• Boolean vs. quantitative associations
(Based on the types of values handled)
– buys(x, “SQLServer”) ^ buys(x, “DMBook”) →
buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ^ income(x, “42..48K”) → buys(x,
“PC”) [1%, 75%]
• Single-dimensional vs. multi-dimensional
associations (see examples above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what
brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation
or causality
– Maxpatterns and closed itemsets
– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum
> 1,000)?
Mining Association
Rules in Large
Databases
• Association rule mining
• Mining single-dimensional
Boolean association rules
from transactional databases
• Summary
Mining Association
Rules—An Example
(Min. support 50%, Min. confidence 50%)

The Apriori Algorithm
• Join Step: Ck is generated by joining
Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is
not frequent cannot be a subset of a
frequent k-itemset
• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
The Apriori Algorithm
— Example
Database D (TID: items):
  100: 1 3 4
  200: 2 3 5
  300: 1 2 3 5
  400: 2 5

Scan D for counts, C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (min. support count 2): {1}:2, {2}:3, {3}:3, {5}:3

C2 (join L1 with itself): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}; scan D: {2 3 5}:2, so L3: {2 3 5}:2
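The whole level-wise loop can be sketched in a few lines of Python (a simplified implementation of my own, not the hash-tree-based one discussed later); run on database D with minimum support count 2, it reproduces L1, L2, and L3:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: return all frequent itemsets with support counts."""
    items = sorted({i for t in transactions for i in t})
    freq, Lk = {}, []
    for i in items:                      # L1: frequent 1-itemsets
        c = sum(i in t for t in transactions)
        if c >= min_count:
            freq[(i,)] = c
            Lk.append((i,))
    k = 1
    while Lk:
        Lk_set = set(Lk)
        candidates = set()
        for p in Lk:
            for q in Lk:
                # Join step: first k-1 items agree, p's last item precedes q's
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    cand = p + (q[-1],)
                    # Prune step: every k-subset must itself be frequent
                    if all(s in Lk_set for s in combinations(cand, k)):
                        candidates.add(cand)
        Lk = []
        for cand in sorted(candidates):  # scan D to count the candidates
            c = sum(set(cand) <= t for t in transactions)
            if c >= min_count:
                freq[cand] = c
                Lk.append(cand)
        k += 1
    return freq

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, 2))  # includes (1,3):2, (2,5):3, ..., (2,3,5):2
```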
How to Generate
Candidates?
• Suppose the items in Lk-1 are listed
in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2,
p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
How to Count
Supports of
Candidates?
• Why is counting supports of
candidates a problem?
– The total number of candidates can
be very huge
– One transaction may contain many
candidates
• Method:
– Candidate itemsets are stored in a
hash-tree
– Leaf node of hash-tree contains a list
of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the
candidates contained in a transaction
Example of Generating
Candidates
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– acde is removed because ade is not in
L3
• C4={abcd}
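The two steps can be sketched as follows (the function name is mine; the sketch also assumes L3 contains bcd, which it must for C4 = {abcd} to survive pruning):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate C_k from the frequent (k-1)-itemsets L_prev (sorted tuples)."""
    L_prev = set(L_prev)
    Ck = set()
    for p in L_prev:
        for q in L_prev:
            # Self-join: first k-2 items agree, p's last item precedes q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune: every (k-1)-subset of the candidate must be frequent
                if all(s in L_prev for s in combinations(cand, k - 1)):
                    Ck.add(cand)
    return Ck

# The slide's example: L3 = {abc, abd, acd, ace, bcd}
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3, 4))  # {('a','b','c','d')}; acde pruned since ade not in L3
```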
Methods to Improve
Apriori’s Efficiency
Classification Process

[Figure: training data are fed to classification algorithms to build a classifier (model); the classifier is then applied to testing data and to unseen data, e.g. (Jeff, Professor, 4)]

Issues Regarding Classification and
Prediction (1): Data Preparation
• Data cleaning
– Preprocess data in order to reduce
noise and handle missing values
• Relevance analysis (feature
selection)
– Remove the irrelevant or redundant
attributes
• Data transformation
– Generalize and/or normalize data
Issues Regarding Classification and
Prediction (2): Evaluating
Classification Methods
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided
by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Chapter 7.
Classification and
Prediction
• What is classification? What is
prediction?
• Issues regarding classification
and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
Classification by
Decision Tree
Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class
distribution
• Decision tree generation consists of two
phases
– Tree construction
• At start, all the training examples are at the
root
• Partition examples recursively based on
selected attributes
– Tree pruning
• Identify and remove branches that reflect
noise or outliers
• Use of decision tree: Classifying an
unknown sample
– Test the attribute values of the sample against
the decision tree
Training
Dataset

[Figure: the buys_computer training dataset and the decision tree induced from it; the root node tests age?, with branches <=30, 30..40, and >40 leading to leaves labeled no/yes]
Algorithm for Decision
Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive
divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued,
they are discretized in advance)
– Examples are partitioned recursively based on
selected attributes
– Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the
same class
– There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
– There are no samples left
Attribute
Selection
Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be
categorical
– Can be modified for continuous-valued
attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-
valued
– Assume there exist several possible split
values for each attribute
– May need other tools, such as clustering,
to get the possible split values
– Can be modified for categorical
attributes
Information Gain
(ID3/C4.5)

If a set contains p examples of class P and n examples of class N, the expected information needed to classify a given sample is

$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
Information Gain in
Decision Tree
Induction
• Assume that using attribute A a set
S will be partitioned into sets {S1,
S2 , …, Sv}
– If Si contains pi examples of P and ni
examples of N, the entropy, or the
expected information needed to
classify objects in all subtrees Si is
$E(A) = \sum_{i=1}^{\nu} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$
• The encoding information that
would be gained by branching on
A
Gain( A) = I ( p, n) − E ( A)
Attribute Selection by
Information Gain
Computation
Class P: buys_computer = "yes" (9 samples)
Class N: buys_computer = "no" (5 samples)

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age      pi  ni  I(pi, ni)
<=30     2   3   0.971
30…40    4   0   0
>40      3   2   0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.69

Hence Gain(age) = I(p, n) − E(age) = 0.940 − 0.69 = 0.25

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
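The computation above can be checked with a short Python sketch (the helper names are mine):

```python
from math import log2

def info(p, n):
    """I(p, n): expected information for p class-P and n class-N examples."""
    total = p + n
    return -sum(x / total * log2(x / total) for x in (p, n) if x)

def entropy(partition):
    """E(A): weighted information over subsets given as [(p_i, n_i), ...]."""
    total = sum(p + n for p, n in partition)
    return sum((p + n) / total * info(p, n) for p, n in partition)

# The age partition from the slide: <=30 -> (2,3), 30..40 -> (4,0), >40 -> (3,2)
age = [(2, 3), (4, 0), (3, 2)]
print(f"{info(9, 5):.3f}")                 # 0.940
print(f"{entropy(age):.3f}")               # 0.694 (the slide rounds to 0.69)
print(f"{info(9, 5) - entropy(age):.3f}")  # Gain(age) = 0.247
```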
Extracting Classification
Rules from Trees
Naive Bayesian Classifier (I)

$P(C_j \mid V) \propto P(C_j) \prod_{i=1}^{n} P(v_i \mid C_j)$

• Greatly reduces the computation cost:
only count the class distribution.
Naive Bayesian
Classifier (II)
Outlook   P    N      Humidity  P    N
sunny     2/9  3/5    high      3/9  4/5
overcast  4/9  0      normal    6/9  1/5
rain      3/9  2/5

Temperature  P    N   Windy   P    N
hot          2/9  2/5 true    3/9  3/5
mild         4/9  2/5 false   6/9  2/5
cool         3/9  1/5
Bayesian classification
• E.g. P(class=N |
outlook=sunny,windy=true,…)
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian
Classification
• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|
p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 =
0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|
n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 =
0.018286
• Since 0.018286 > 0.010582, sample X
is classified as class n
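The same arithmetic as a small Python sketch (the probabilities are transcribed from the slide's tables; the dictionary layout is my own):

```python
from functools import reduce

# Conditional probabilities from the tables above; class p has 9 samples,
# class n has 5, out of 14 total
P = {
    "p": {"prior": 9/14, "rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9},
    "n": {"prior": 5/14, "rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5},
}

def score(cls, evidence):
    """P(X|C)·P(C) under the naive conditional-independence assumption."""
    return P[cls]["prior"] * reduce(lambda acc, v: acc * P[cls][v], evidence, 1.0)

X = ["rain", "hot", "high", "false"]
print(round(score("p", X), 6))                     # 0.010582
print(round(score("n", X), 6))                     # 0.018286
print(max(("p", "n"), key=lambda c: score(c, X)))  # n
```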
[Figure: k-NN illustration; a query point xq among positive (+) and negative (−) training examples]
Discussion on the k-
NN Algorithm
• The k-NN algorithm for continuous-valued target
functions
– Calculate the mean values of the k nearest
neighbors
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k
neighbors according to their distance to the
query point xq
• giving greater weight to closer neighbors
– Similarly, for real-valued target functions
• Robust to noisy data by averaging k-nearest
neighbors
• Curse of dimensionality: distance between
neighbors could be dominated by irrelevant
attributes.
– To overcome it, stretch the axes or eliminate
the least relevant attributes.
$w \equiv \frac{1}{d(x_q, x_i)^2}$
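A compact sketch of distance-weighted k-NN using the weight w = 1/d(xq, xi)² (the toy data and function name are mine, for illustration only):

```python
from math import dist

def weighted_knn_classify(xq, examples, k=3):
    """Distance-weighted k-NN: each neighbor votes with weight 1/d(xq, xi)^2."""
    neighbors = sorted(examples, key=lambda e: dist(xq, e[0]))[:k]
    votes = {}
    for xi, label in neighbors:
        d = dist(xq, xi)
        w = float("inf") if d == 0 else 1.0 / d**2  # an exact match dominates
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# Toy 2-D data: a '+' cluster near the origin, a '-' cluster near (5, 5)
data = [((0, 0), "+"), ((1, 0), "+"), ((0, 1), "+"),
        ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
print(weighted_knn_classify((0.5, 0.5), data))  # +
print(weighted_knn_classify((5.5, 5.5), data))  # -
```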
Remarks on Lazy vs.
Eager Learning
• Instance-based learning: lazy evaluation
• Decision-tree and Bayesian classification:
eager evaluation
• Key differences
– Lazy methods may consider the query instance
xq when deciding how to generalize beyond the
training data D
– Eager methods cannot, since they have already
committed to a global approximation before
seeing the query
• Efficiency: Lazy - less time training but
more time predicting
• Accuracy
– Lazy method effectively uses a richer hypothesis
space since it uses many local linear functions to
form its implicit global approximation to the target
function
– Eager: must commit to a single hypothesis that
covers the entire instance space
Chapter 7.
Classification and
Prediction
• What is classification? What is
prediction?
• Issues regarding classification
and prediction
• Classification by decision tree
induction
• Bayesian Classification
• Other Classification Methods
• Prediction
What Is
Prediction?
• Prediction is similar to
classification
– First, construct a model
– Second, use model to predict
unknown value
• Major method for prediction is
regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from
classification
– Classification predicts categorical
class labels
– Prediction models continuous-valued
functions
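As a minimal illustration of prediction by regression (not from the slides), here is a closed-form least-squares fit of y = a + b·x for a single predictor:

```python
def linear_regression(xs, ys):
    """Least-squares fit of y = a + b*x (closed form, one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Points lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = linear_regression(xs, ys)
print(a, b)        # 1.0 2.0
print(a + b * 10)  # predicted continuous value at x = 10: 21.0
```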