Classification: Basic Concepts
Classification and Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
Numeric prediction
models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
Credit/loan approval
Classification—A Two-Step Process
Model construction
Describing a set of predetermined classes
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage
For classifying future or unknown objects
The known label of a test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Note: If the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
[Figure: training data are fed to a classification algorithm to learn a classifier (the model); the model is then applied to testing data and to unseen data, e.g., (Jeff, Professor, 4) -> Tenured?]

Training data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
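As a hedged illustration of this two-step process (construct a model from the training data, then use it on new tuples), here is a minimal Python sketch; the numeric encoding of RANK and the use of scikit-learn's DecisionTreeClassifier are assumptions made for the example, not something the slides prescribe.

# Minimal sketch of the two-step process, assuming scikit-learn is available.
from sklearn.tree import DecisionTreeClassifier

# Step 1: model construction from the training data above.
# RANK is encoded numerically for illustration: Assistant Prof=0, Associate Prof=1, Professor=2.
X_train = [[0, 2], [1, 7], [2, 5], [0, 7]]   # (RANK, YEARS) for Tom, Merlisa, George, Joseph
y_train = ["no", "no", "yes", "yes"]         # TENURED labels

model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage on an unseen tuple, e.g., (Jeff, Professor, 4).
print(model.predict([[2, 4]]))               # predicted TENURED value for Jeff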
Classification: Basic Concepts
[Figure: example decision tree; leaf nodes labeled no / yes / no / yes]
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
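A minimal Python sketch of this top-down, recursive, divide-and-conquer construction, assuming rows are dicts keyed by attribute name and that a best_split attribute-selection function (e.g., one of the measures compared below) is supplied; the names and data layout are illustrative.

from collections import Counter

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

# Greedy, top-down, recursive divide-and-conquer tree induction.
def induce_tree(rows, labels, attributes, best_split):
    if len(set(labels)) == 1:        # all examples in one class: leaf
        return labels[0]
    if not attributes:               # no attributes left: majority-vote leaf
        return majority_class(labels)
    attr = best_split(rows, labels, attributes)   # greedy choice, never revisited
    tree = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    for value in set(row[attr] for row in rows):  # partition the examples (divide and conquer)
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree[attr][value] = induce_tree(sub_rows, sub_labels, remaining, best_split)
    return tree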
Comparing Attribute Selection Measures
The following three measures, in general, return good results, but each has its own bias:
Information gain
Biased towards multi-valued attributes
Gain ratio
Tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index
Biased towards multi-valued attributes
Has difficulty when the number of classes is large
Tends to favor tests that result in equal-sized partitions and purity in both partitions
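For concreteness, a hedged Python sketch of the three measures for a categorical attribute; rows are assumed to be dicts keyed by attribute name, and the function names are illustrative.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def partitions(rows, labels, attr):
    # Group class labels by the value of attribute attr.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return list(groups.values())

def info_gain(rows, labels, attr):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partitions(rows, labels, attr))
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attr):
    n = len(labels)
    groups = partitions(rows, labels, attr)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups)
    return info_gain(rows, labels, attr) / split_info if split_info > 0 else 0.0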
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples (see the pruning sketch below)
Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
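The slide names tree pruning only in its title, so as a hedged illustration (not taken from the slides) the sketch below contrasts an unrestricted tree with a pre-pruned, depth-limited one using scikit-learn; the dataset and parameter choices are arbitrary assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # may overfit the training data
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)  # pre-pruned (limited depth)

print("full tree:   train", full.score(X_tr, y_tr), "test", full.score(X_te, y_te))
print("pruned tree: train", pruned.score(X_tr, y_tr), "test", pruned.score(X_te, y_te))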
Bayes Classification Methods
Bayes’ Theorem: Basics
Total probability theorem: $P(B) = \sum_{i=1}^{M} P(B \mid A_i)\, P(A_i)$
Prediction Based on Bayes’ Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$
Since P(X) is constant for all classes, only $P(X \mid H)\, P(H)$ needs to be maximized
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., there is no dependence relation between attributes) given the class:
$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$
If $A_k$ is categorical, $P(x_k \mid C_i)$ is the number of tuples in $C_i$ having value $x_k$ for $A_k$, divided by $|C_{i,D}|$ (the number of tuples of $C_i$ in $D$)
If $A_k$ is continuous-valued, $P(x_k \mid C_i)$ is usually computed from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$:
$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
and $P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
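A small Python sketch of the Gaussian case, estimating μ and σ from the A_k values observed in the tuples of class C_i; the helper names are illustrative.

import math
import statistics

# Gaussian density g(x, mu, sigma) as in the formula above.
def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# P(x_k | Ci) for a continuous attribute: fit mu and sigma on the values of
# A_k seen in the tuples of class Ci, then evaluate the density at x_k.
def continuous_conditional(x_k, values_in_class):
    mu = statistics.mean(values_in_class)
    sigma = statistics.stdev(values_in_class)
    return gaussian(x_k, mu, sigma)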
Naïve Bayes Classifier: Training Dataset
Classes:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Naïve Bayes Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
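The same computation as a hedged Python sketch over the training table above, using the counting-based estimates of P(Ci) and P(x_k|Ci); variable names are illustrative.

from collections import Counter

data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),        (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),   ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),       (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")          # tuple to classify

labels = [row[-1] for row in data]
priors = {c: n / len(data) for c, n in Counter(labels).items()}   # P(Ci): yes = 9/14, no = 5/14

scores = {}
for c in priors:
    rows_c = [row for row in data if row[-1] == c]
    likelihood = 1.0
    for k, value in enumerate(X):
        # P(x_k | Ci) = (# tuples of Ci with value x_k for A_k) / |Ci,D|
        likelihood *= sum(1 for row in rows_c if row[k] == value) / len(rows_c)
    scores[c] = likelihood * priors[c]          # P(X|Ci) * P(Ci)

print(scores)                                   # yes ≈ 0.028, no ≈ 0.007
print(max(scores, key=scores.get))              # 'yes'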
Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the product P(X|Ci), and hence the predicted probability, becomes zero
A common remedy is the Laplacian correction (add 1 to each count), which gives probability estimates close to their “uncorrected” counterparts
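A minimal sketch of the add-one (Laplacian) correction for a single conditional estimate; the attribute counts shown are hypothetical.

# Laplacian correction: (count + 1) / (|Ci,D| + number of distinct values of A_k),
# so no conditional estimate is ever exactly zero.
def laplace_corrected(counts, value, num_values):
    total = sum(counts.values())
    return (counts.get(value, 0) + 1) / (total + num_values)

# Hypothetical counts of an 'income' attribute within one class:
counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_corrected(counts, "low", 3))      # 1/1003 instead of 0
print(laplace_corrected(counts, "medium", 3))   # 991/1003, close to the uncorrected 990/1000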
Naïve Bayes Classifier
Advantages
Easy to implement
Disadvantages
Assumption of class-conditional independence, hence a potential loss of accuracy
Practically, dependencies exist among variables; such dependencies cannot be modeled by a naïve Bayes classifier
How to deal with these dependencies? Bayesian Belief Networks
Chapter 8. Classification: Basic Concepts
Ideally, a rule should cover many tuples of Ci but none (or few) of the tuples of other classes
Rule Induction: Sequential Covering Method
Steps:
Rules are learned one at a time
Each time a rule is learned, the tuples covered by the rule are removed
The process is repeated on the remaining tuples until a termination condition is met (see the sketch below)
e.g., when there are no more training examples, or when the quality of a rule returned is below a user-specified threshold
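A hedged Python sketch of this loop; learn_one_rule, rule_quality, and the rule's covers method are placeholders standing in for the steps described on this and the following slides.

# Sequential covering: learn rules one at a time for a target class, removing
# the tuples each new rule covers before learning the next rule.
def sequential_covering(examples, target_class, learn_one_rule, rule_quality, min_quality):
    rules = []
    remaining = list(examples)
    while remaining:                                     # stop when no training examples remain
        rule = learn_one_rule(remaining, target_class)
        if rule is None or rule_quality(rule, remaining) < min_quality:
            break                                        # stop when rule quality falls below the threshold
        rules.append(rule)
        remaining = [ex for ex in remaining if not rule.covers(ex)]   # remove covered tuples
    return rules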
Sequential Covering Algorithm
[Figure: the set of positive examples; each learned rule (e.g., Rule 2) covers part of them, and the covered examples are removed before the next rule is learned]
Rule Generation
To generate a rule:
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold
            then add p to current rule
            else break
[Figure: a rule grown general-to-specific over the positive and negative examples: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5]
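The pseudocode above ranks candidate predicates by foil-gain; the slides do not spell the formula out, so the sketch below uses the usual FOIL-gain definition as an assumption (pos/neg are the positive/negative tuples covered before adding p, pos2/neg2 those covered after).

import math

# FOIL-gain of extending the current rule with a candidate predicate p.
def foil_gain(pos, neg, pos2, neg2):
    if pos == 0 or pos2 == 0:
        return 0.0
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))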
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Add new attribute tests by adopting a greedy depth-first strategy
Pick the one that most improves the rule quality
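A hedged sketch of this general-to-specific, greedy search; candidate_predicates and the score function (for which the foil_gain above could be used) are illustrative placeholders.

# Learn-one-rule: start with the most general rule (empty condition) and
# greedily add the predicate that most improves the rule quality.
def learn_one_rule(examples, candidate_predicates, score, threshold=0.0):
    rule = []                                   # empty condition = most general rule
    covered = list(examples)
    remaining_preds = list(candidate_predicates)
    while True:
        best_p, best_gain = None, threshold
        for p in remaining_preds:
            gain = score(rule, p, covered)      # e.g., FOIL-gain of adding p
            if gain > best_gain:
                best_p, best_gain = p, gain
        if best_p is None:                      # no predicate improves the rule enough
            break
        rule.append(best_p)
        remaining_preds.remove(best_p)
        covered = [ex for ex in covered if best_p(ex)]   # tuples still covered by the grown rule
    return rule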