
Classification: Basic Concepts

1
Classification: Basic Concepts

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
2
Supervised vs. Unsupervised Learning

 Supervised learning (classification)
 Supervision: the training data (observations, measurements, etc.)
   are accompanied by labels indicating the class of the observations
 New data is classified based on the training set

 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
   establish the existence of classes or clusters in the data

3
Classification and Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the
   values (class labels) of a classifying attribute, then uses the model
   to classify new data

 Numeric Prediction
 models continuous-valued functions, i.e., predicts unknown or missing
   values

 Applications
 Credit/loan approval
 Medical diagnosis: whether a tumor is cancerous or benign
 Fraud detection: whether a transaction is fraudulent
 Web page categorization: which category a page belongs to
4
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
   determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, or
   mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate the accuracy of the model
 The known label of a test sample is compared with the classified
   result from the model
 The accuracy rate is the percentage of test set samples that are
   correctly classified by the model
 The test set is independent of the training set
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called a validation (test) set
5
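
The two-step process as a minimal sketch, assuming scikit-learn (the slides
are library-agnostic); the tiny dataset mirrors the tenure example on the
next slide:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Features: [is_professor, is_associate_prof, years]; label: tenured (1 = yes)
    X = [[0, 0, 3], [0, 0, 7], [1, 0, 2], [0, 1, 7], [0, 0, 6], [0, 1, 3]]
    y = [0, 1, 1, 1, 0, 0]

    # Step 1: model construction on the training set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 2: model usage -- compare known test labels with the model's predictions
    print("Accuracy rate:", accuracy_score(y_test, model.predict(X_test)))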
Process (1): Model Construction

Classification algorithms are applied to the training data to produce a
classifier (model).

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
6
Process (2): Using the Model in Prediction

The classifier built in step 1 is first applied to the testing data, then
used to classify unseen data.

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -- Tenured? The model predicts ‘yes’,
since rank = ‘professor’.
7
Classification: Basic Concepts

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
8
Decision Tree Induction: An Example
 Training data set: Buys_computer

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

 Resulting tree:

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
9
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer manner
 At the start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are discretized
   in advance)
 Samples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
   measure (e.g., information gain, gain ratio, Gini index)

 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning
 There are no samples left
10
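
A compact sketch of the greedy, top-down, divide-and-conquer algorithm,
using information gain as the selection measure (one of the three measures
the slides name); rows are plain dicts, and this is an illustration rather
than an optimized implementation:

    import math
    from collections import Counter

    def entropy(rows, target):
        counts = Counter(r[target] for r in rows)
        return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

    def info_gain(rows, attr, target):
        total, remainder = len(rows), 0.0
        for v in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == v]
            remainder += len(subset) / total * entropy(subset, target)
        return entropy(rows, target) - remainder

    def build_tree(rows, attrs, target):
        classes = {r[target] for r in rows}
        if len(classes) == 1:          # stop: all samples in the same class
            return classes.pop()
        if not attrs:                  # stop: no attributes left -> majority class
            return Counter(r[target] for r in rows).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, a, target))
        return {best: {                # partition recursively on the best attribute
            v: build_tree([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best], target)
            for v in {r[best] for r in rows}
        }}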
Comparing Attribute Selection Measures
 The following three measures, in general, return good results, but:

 Information gain
 Biased towards multi-valued attributes

 Gain ratio
 Tends to prefer unbalanced splits in which one partition is much
   smaller than the others

 Gini index
 Biased towards multi-valued attributes
 Has difficulty when the number of classes is large
 Tends to favor tests that result in equal-sized partitions and purity
   in both partitions
11
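
To make the comparison concrete, a small sketch computing all three measures
for a candidate categorical split from class-count lists (one list per
partition); the example numbers are the [yes, no] counts for splitting the
14-tuple buys_computer data on age:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def gini(counts):
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def split_measures(parent, partitions):
        # parent: class counts before the split; partitions: class counts per branch
        n = sum(parent)
        expected = sum(sum(p) / n * entropy(p) for p in partitions)
        gain = entropy(parent) - expected
        split_info = entropy([sum(p) for p in partitions])  # penalizes many-valued splits
        gini_after = sum(sum(p) / n * gini(p) for p in partitions)
        return gain, gain / split_info, gini(parent) - gini_after

    # age split: [2 yes, 3 no] for <=30; [4, 0] for 31..40; [3, 2] for >40
    print(split_measures([9, 5], [[2, 3], [4, 0], [3, 2]]))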
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some of which may reflect anomalies due to noise
   or outliers
 Poor accuracy for unseen samples

 Two approaches to avoid overfitting
 Prepruning: halt tree construction early
 Do not split a node if this would result in the goodness measure
   falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: remove branches from a fully grown tree
 Get a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which
   is the best pruned tree
12
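
A sketch of both approaches, assuming scikit-learn (the slides do not
prescribe a library): prepruning via thresholds such as max_depth and
min_impurity_decrease; postpruning via cost-complexity pruning, which yields
a sequence of progressively pruned trees scored on data held out of training:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    # Prepruning: halt construction early when a split gains too little
    pre = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01).fit(X_tr, y_tr)

    # Postpruning: grow fully, then pick the best tree from the pruning sequence
    path = DecisionTreeClassifier().cost_complexity_pruning_path(X_tr, y_tr)
    trees = [DecisionTreeClassifier(ccp_alpha=a).fit(X_tr, y_tr) for a in path.ccp_alphas]
    best = max(trees, key=lambda t: t.score(X_val, y_val))  # chosen on non-training data
    print(pre.score(X_val, y_val), best.score(X_val, y_val))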
Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that partition the
   continuous attribute values into a discrete set of intervals

 Handle missing attribute values
 Assign the most common value of the attribute
 Assign a probability to each of the possible values

 Attribute construction
 Create new attributes based on existing ones that are sparsely
   represented
 This reduces fragmentation, repetition, and replication
13
Classification

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
14
Bayesian Classification
 A statistical classifier: predicts class membership probabilities
 Based on Bayes’ theorem
 A simple Bayesian classifier, the naïve Bayesian classifier, has
   performance comparable with decision tree and selected neural
   network classifiers
 Incremental: each training example can increase/decrease the
   probability that a hypothesis is correct, and prior knowledge can be
   combined with observed data
 Even when Bayesian methods are computationally intractable, they can
   provide a standard of optimal decision making against which other
   methods can be measured

15
Bayes’ Theorem: Basics
 Total probability theorem: P(B) = \sum_{i=1}^{M} P(B | A_i) P(A_i)

 Bayes’ theorem: P(H | X) = \frac{P(X | H)\, P(H)}{P(X)}

 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posteriori probability):
   the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy a computer, regardless of age, income, …
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given
   that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is aged
   31..40 with medium income
16
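
As a numeric check of “posterior = likelihood x prior / evidence”, using the
buys_computer figures computed later in these slides:

    p_h = 9 / 14                          # prior P(buys_computer = yes)
    p_x_given_h = 0.044                   # likelihood P(X | yes), from the worked example
    p_x = 0.044 * 9/14 + 0.019 * 5/14     # evidence, by the total probability theorem
    print(p_x_given_h * p_h / p_x)        # posterior P(yes | X), about 0.81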
Prediction Based on Bayes’ Theorem
 Given training data X, the posteriori probability of a hypothesis H,
   P(H|X), follows Bayes’ theorem:

P(H | X) = \frac{P(X | H)\, P(H)}{P(X)}

 Informally, this can be viewed as
   posteriori = likelihood x prior / evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
   among all the P(Ck|X) for all the k classes
 Practical difficulty
 It requires initial knowledge of many probabilities
 It involves significant computational cost
17
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels
 Each tuple is represented by an n-D attribute vector X = (x_1, x_2, …, x_n)
 Suppose there are m classes C_1, C_2, …, C_m
 Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
 This can be derived from Bayes’ theorem:

P(C_i | X) = \frac{P(X | C_i)\, P(C_i)}{P(X)}

 Since P(X) is constant for all classes, only

P(X | C_i)\, P(C_i)

needs to be maximized
18
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent (i.e.,
   no dependence relation between attributes):

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) \times P(x_2 | C_i) \times \cdots \times P(x_n | C_i)

 This greatly reduces the computation cost: only counts the class distribution
 If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having
   value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a
   Gaussian distribution with mean μ and standard deviation σ:

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}

   and P(x_k | C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
19
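
A minimal naïve Bayes sketch for categorical attributes, following the
counting rule above (Gaussian handling of continuous attributes and
zero-count smoothing are omitted; rows are plain dicts):

    from collections import Counter, defaultdict

    def train(rows, target):
        # Priors from class counts; conditionals from per-class value counts
        class_counts = Counter(r[target] for r in rows)
        cond = defaultdict(Counter)          # (class, attribute) -> value counts
        for r in rows:
            for attr, value in r.items():
                if attr != target:
                    cond[(r[target], attr)][value] += 1
        return class_counts, cond

    def predict(x, class_counts, cond):
        n = sum(class_counts.values())
        scores = {}
        for ci, count in class_counts.items():
            score = count / n                # prior P(C_i)
            for attr, value in x.items():    # product of P(x_k | C_i)
                score *= cond[(ci, attr)][value] / count
            scores[ci] = score
        return max(scores, key=scores.get)   # maximum posteriori class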
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
20
Naïve Bayes Classifier: An Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
21
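
The slide’s arithmetic, reproduced in a few lines (all values taken from the
table above):

    p_yes, p_no = 9/14, 5/14
    like_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age, income, student, credit_rating
    like_no  = (3/5) * (2/5) * (1/5) * (2/5)
    print(like_yes * p_yes, like_no * p_no)    # ~0.028 vs ~0.007 -> class 'yes'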
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability to be
   non-zero; otherwise, the predicted probability will be zero:

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)

 Ex. Suppose a dataset with 1000 tuples, with income = low (0),
   income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator)
 Adding 1 to each case:
   Prob(income = low) = 1/1003
   Prob(income = medium) = 991/1003
   Prob(income = high) = 11/1003
 The “corrected” probability estimates are close to their “uncorrected”
   counterparts
22
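
The correction in code form, a small sketch; note the denominator grows by
the number of distinct values (1000 + 3 = 1003 here):

    def smoothed_probs(counts):
        # counts: raw value counts for one attribute within one class
        total = sum(counts.values()) + len(counts)   # add 1 per distinct value
        return {v: (c + 1) / total for v, c in counts.items()}

    print(smoothed_probs({'low': 0, 'medium': 990, 'high': 10}))
    # i.e., 1/1003, 991/1003, 11/1003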
Naïve Bayes Classifier
 Advantages
 Easy to implement
 Good results obtained in most of the cases

 Disadvantages
 Assumption of class conditional independence, which causes a loss of
   accuracy, because in practice dependencies exist among variables
 E.g., hospital patient data: profile (age, family history, etc.),
   symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by a naïve Bayes classifier

 How to deal with these dependencies? Bayesian belief networks
23
Chapter 8. Classification: Basic Concepts

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
24
Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
   R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 n_covers = # of tuples covered by R
 n_correct = # of tuples correctly classified by R
   coverage(R) = n_covers / |D|   where D is the training data set
   accuracy(R) = n_correct / n_covers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that
   has the “toughest” requirement (i.e., the most attribute tests)
 Class-based ordering: decreasing order of prevalence or misclassification
   cost per class
 Rule-based ordering (decision list): rules are organized into one long
   priority list, according to some measure of rule quality or by experts
25
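
Coverage and accuracy in code, a small sketch over dicts shaped like the
buys_computer tuples (the three-row dataset D is a made-up placeholder):

    D = [
        {'age': 'youth',  'student': 'yes', 'buys_computer': 'yes'},
        {'age': 'youth',  'student': 'no',  'buys_computer': 'no'},
        {'age': 'senior', 'student': 'yes', 'buys_computer': 'yes'},
    ]

    def rule_stats(D, antecedent, consequent):
        covered = [t for t in D if antecedent(t)]
        correct = [t for t in covered if consequent(t)]
        return len(covered) / len(D), len(correct) / len(covered)

    # R: IF age = youth AND student = yes THEN buys_computer = yes
    coverage, accuracy = rule_stats(
        D,
        lambda t: t['age'] == 'youth' and t['student'] == 'yes',
        lambda t: t['buys_computer'] == 'yes',
    )
    print(coverage, accuracy)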
Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction; the leaf
   holds the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree

age?
  <=30   -> student?  (no -> no, yes -> yes)
  31..40 -> yes
  >40    -> credit_rating?  (excellent -> no, fair -> yes)

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
26
Rule Induction: Sequential Covering Method
 Sequential covering algorithms
 Extract rules directly from training data
 Examples: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially
 Each rule for a given class Ci will cover
 Many tuples of Ci
 But none (or few) of the tuples of other classes
27
Rule Induction: Sequential Covering Method
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rule are removed
 The process is repeated on the remaining tuples until a termination
   condition is met
 Example termination conditions
 When there are no more training examples
 When the quality of a rule returned is below a user-specified threshold
28
Sequential Covering Algorithm

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

[Figure: successive rules (Rule 1, Rule 2, Rule 3) each cover a different
region of the positive examples]
29
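
A runnable sketch of the covering loop above; learn_one_rule is stubbed here
(an assumption, since the slides defer it to the Learn-One-Rule slide) as a
greedy pick of the single attribute-value test separating positives from
negatives best:

    def learn_one_rule(pos, neg):
        # Score each candidate test by positives covered minus negatives covered
        candidates = {(a, v) for t in pos for a, v in t.items()}
        return max(candidates,
                   key=lambda av: sum(t.get(av[0]) == av[1] for t in pos)
                                - sum(t.get(av[0]) == av[1] for t in neg))

    def sequential_covering(pos, neg, min_remaining=1):
        rules = []
        while len(pos) >= min_remaining:           # enough target tuples left
            attr, val = learn_one_rule(pos, neg)   # generate a rule
            rules.append((attr, val))
            pos = [t for t in pos if t.get(attr) != val]   # remove covered positives
        return rules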
Rule Generation
 To generate a rule:

while (true)
    find the best predicate p
    if foil-gain(p) > threshold
        then add p to current rule
        else break

[Figure: the rule grows one conjunct at a time (A3=1, then A3=1 AND A1=2,
then A3=1 AND A1=2 AND A8=5), excluding more negative examples at each step]
30
How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Add new attributes by adopting a greedy depth-first strategy
 Pick the one that most improves the rule quality
 Rule-quality measures: consider both coverage and accuracy
 FOIL gain (in FOIL & RIPPER): assesses the information gained by
   extending the condition:

FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)

 It favors rules that have high accuracy and cover many positive tuples
 Rule pruning is based on an independent set of test tuples:

FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}

   where pos/neg are the numbers of positive/negative tuples covered by R.
   If FOIL_Prune is higher for the pruned version of R, prune R
31
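
Direct translations of the two measures above into code; pos/neg count
tuples covered before extending the condition, pos2/neg2 after:

    import math

    def foil_gain(pos, neg, pos2, neg2):
        if pos2 == 0:
            return 0.0
        return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

    def foil_prune(pos, neg):
        return (pos - neg) / (pos + neg)

    # e.g. extending a rule shrinks its coverage from (20 pos, 15 neg) to (12, 3)
    print(foil_gain(20, 15, 12, 3))   # positive gain, so the extension helps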
