
Classification: Basic Concepts

1
Classification: Basic Concepts

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
2
Supervised vs. Unsupervised Learning

 Supervised learning (classification)
 Supervision: the training data (observations, measurements, etc.)
   are accompanied by labels indicating the class of the observations
 New data is classified based on the training set

 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
   establish the existence of classes or clusters in the data

3
Classification and Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the
   values (class labels) of a classifying attribute, then uses the model
   to classify new data

 Numeric Prediction
 models continuous-valued functions, i.e., predicts unknown or missing
   values

 Applications
 Credit/loan approval
 Medical diagnosis: whether a tumor is cancerous or benign
 Fraud detection: whether a transaction is fraudulent
 Web page categorization: which category a page belongs to
4
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
   determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, or
   mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate the accuracy of the model
 The known label of a test sample is compared with the classified
   result from the model
 The accuracy rate is the percentage of test set samples that are
   correctly classified by the model
 The test set is independent of the training set
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called a validation (test) set
5
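
The two-step process as a minimal sketch, assuming scikit-learn (the slides
are library-agnostic); the tiny dataset mirrors the tenure example on the
next slide:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Features: [is_professor, is_associate_prof, years]; label: tenured (1 = yes)
    X = [[0, 0, 3], [0, 0, 7], [1, 0, 2], [0, 1, 7], [0, 0, 6], [0, 1, 3]]
    y = [0, 1, 1, 1, 0, 0]

    # Step 1: model construction on the training set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 2: model usage -- compare known test labels with the model's predictions
    print("Accuracy rate:", accuracy_score(y_test, model.predict(X_test)))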
Process (1): Model Construction

Classification algorithms are applied to the training data to produce a
classifier (model).

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
6
Process (2): Using the Model in Prediction

The classifier built in step 1 is first applied to the testing data, then
used to classify unseen data.

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -- Tenured? The model predicts ‘yes’,
since rank = ‘professor’.
7
Classification: Basic Concepts

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
8
Decision Tree Induction: An Example
 Training data set: Buys_computer

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

 Resulting tree:

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
9
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer manner
 At the start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are discretized
   in advance)
 Samples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
   measure (e.g., information gain, gain ratio, Gini index)

 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning
 There are no samples left
10
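
A compact sketch of the greedy, top-down, divide-and-conquer algorithm,
using information gain as the selection measure (one of the three measures
the slides name); rows are plain dicts, and this is an illustration rather
than an optimized implementation:

    import math
    from collections import Counter

    def entropy(rows, target):
        counts = Counter(r[target] for r in rows)
        return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

    def info_gain(rows, attr, target):
        total, remainder = len(rows), 0.0
        for v in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == v]
            remainder += len(subset) / total * entropy(subset, target)
        return entropy(rows, target) - remainder

    def build_tree(rows, attrs, target):
        classes = {r[target] for r in rows}
        if len(classes) == 1:          # stop: all samples in the same class
            return classes.pop()
        if not attrs:                  # stop: no attributes left -> majority class
            return Counter(r[target] for r in rows).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, a, target))
        return {best: {                # partition recursively on the best attribute
            v: build_tree([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best], target)
            for v in {r[best] for r in rows}
        }}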
Comparing Attribute Selection Measures
 The following three measures, in general, return good results, but:

 Information gain
 Biased towards multi-valued attributes

 Gain ratio
 Tends to prefer unbalanced splits in which one partition is much
   smaller than the others

 Gini index
 Biased towards multi-valued attributes
 Has difficulty when the number of classes is large
 Tends to favor tests that result in equal-sized partitions and purity
   in both partitions
11
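
To make the comparison concrete, a small sketch computing all three measures
for a candidate categorical split from class-count lists (one list per
partition); the example numbers are the [yes, no] counts for splitting the
14-tuple buys_computer data on age:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def gini(counts):
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def split_measures(parent, partitions):
        # parent: class counts before the split; partitions: class counts per branch
        n = sum(parent)
        expected = sum(sum(p) / n * entropy(p) for p in partitions)
        gain = entropy(parent) - expected
        split_info = entropy([sum(p) for p in partitions])  # penalizes many-valued splits
        gini_after = sum(sum(p) / n * gini(p) for p in partitions)
        return gain, gain / split_info, gini(parent) - gini_after

    # age split: [2 yes, 3 no] for <=30; [4, 0] for 31..40; [3, 2] for >40
    print(split_measures([9, 5], [[2, 3], [4, 0], [3, 2]]))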
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some of which may reflect anomalies due to noise
   or outliers
 Poor accuracy for unseen samples

 Two approaches to avoid overfitting
 Prepruning: halt tree construction early
 Do not split a node if this would result in the goodness measure
   falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: remove branches from a fully grown tree
 Get a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which
   is the best pruned tree
12
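
A sketch of both approaches, assuming scikit-learn (the slides do not
prescribe a library): prepruning via thresholds such as max_depth and
min_impurity_decrease; postpruning via cost-complexity pruning, which yields
a sequence of progressively pruned trees scored on data held out of training:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    # Prepruning: halt construction early when a split gains too little
    pre = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01).fit(X_tr, y_tr)

    # Postpruning: grow fully, then pick the best tree from the pruning sequence
    path = DecisionTreeClassifier().cost_complexity_pruning_path(X_tr, y_tr)
    trees = [DecisionTreeClassifier(ccp_alpha=a).fit(X_tr, y_tr) for a in path.ccp_alphas]
    best = max(trees, key=lambda t: t.score(X_val, y_val))  # chosen on non-training data
    print(pre.score(X_val, y_val), best.score(X_val, y_val))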
Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that partition the
   continuous attribute values into a discrete set of intervals

 Handle missing attribute values
 Assign the most common value of the attribute
 Assign a probability to each of the possible values

 Attribute construction
 Create new attributes based on existing ones that are sparsely
   represented
 This reduces fragmentation, repetition, and replication
13
Classification

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
14
Bayesian Classification
 A statistical classifier: predicts class membership probabilities
 Based on Bayes’ theorem
 A simple Bayesian classifier, the naïve Bayesian classifier, has
   performance comparable with decision tree and selected neural
   network classifiers
 Incremental: each training example can increase/decrease the
   probability that a hypothesis is correct, and prior knowledge can be
   combined with observed data
 Even when Bayesian methods are computationally intractable, they can
   provide a standard of optimal decision making against which other
   methods can be measured

15
Bayes’ Theorem: Basics
 Total probability theorem: P(B) = \sum_{i=1}^{M} P(B | A_i) P(A_i)

 Bayes’ theorem: P(H | X) = \frac{P(X | H)\, P(H)}{P(X)}

 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posteriori probability):
   the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy a computer, regardless of age, income, …
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given
   that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is aged
   31..40 with medium income
16
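
As a numeric check of “posterior = likelihood x prior / evidence”, using the
buys_computer figures computed later in these slides:

    p_h = 9 / 14                          # prior P(buys_computer = yes)
    p_x_given_h = 0.044                   # likelihood P(X | yes), from the worked example
    p_x = 0.044 * 9/14 + 0.019 * 5/14     # evidence, by the total probability theorem
    print(p_x_given_h * p_h / p_x)        # posterior P(yes | X), about 0.81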
Prediction Based on Bayes’ Theorem
 Given training data X, the posteriori probability of a hypothesis H,
   P(H|X), follows Bayes’ theorem:

P(H | X) = \frac{P(X | H)\, P(H)}{P(X)}

 Informally, this can be viewed as
   posteriori = likelihood x prior / evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
   among all the P(Ck|X) for all the k classes
 Practical difficulty
 It requires initial knowledge of many probabilities
 It involves significant computational cost
17
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels
 Each tuple is represented by an n-D attribute vector X = (x_1, x_2, …, x_n)
 Suppose there are m classes C_1, C_2, …, C_m
 Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
 This can be derived from Bayes’ theorem:

P(C_i | X) = \frac{P(X | C_i)\, P(C_i)}{P(X)}

 Since P(X) is constant for all classes, only

P(X | C_i)\, P(C_i)

needs to be maximized
18
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent (i.e.,
   no dependence relation between attributes):

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) \times P(x_2 | C_i) \times \cdots \times P(x_n | C_i)

 This greatly reduces the computation cost: only counts the class distribution
 If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having
   value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a
   Gaussian distribution with mean μ and standard deviation σ:

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}

   and P(x_k | C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
19
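
A minimal naïve Bayes sketch for categorical attributes, following the
counting rule above (Gaussian handling of continuous attributes and
zero-count smoothing are omitted; rows are plain dicts):

    from collections import Counter, defaultdict

    def train(rows, target):
        # Priors from class counts; conditionals from per-class value counts
        class_counts = Counter(r[target] for r in rows)
        cond = defaultdict(Counter)          # (class, attribute) -> value counts
        for r in rows:
            for attr, value in r.items():
                if attr != target:
                    cond[(r[target], attr)][value] += 1
        return class_counts, cond

    def predict(x, class_counts, cond):
        n = sum(class_counts.values())
        scores = {}
        for ci, count in class_counts.items():
            score = count / n                # prior P(C_i)
            for attr, value in x.items():    # product of P(x_k | C_i)
                score *= cond[(ci, attr)][value] / count
            scores[ci] = score
        return max(scores, key=scores.get)   # maximum posteriori class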
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
20
Naïve Bayes Classifier: An Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
21
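
The slide’s arithmetic, reproduced in a few lines (all values taken from the
table above):

    p_yes, p_no = 9/14, 5/14
    like_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age, income, student, credit_rating
    like_no  = (3/5) * (2/5) * (1/5) * (2/5)
    print(like_yes * p_yes, like_no * p_no)    # ~0.028 vs ~0.007 -> class 'yes'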
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability to be
   non-zero; otherwise, the predicted probability will be zero:

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)

 Ex. Suppose a dataset with 1000 tuples, with income = low (0),
   income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator)
 Adding 1 to each case:
   Prob(income = low) = 1/1003
   Prob(income = medium) = 991/1003
   Prob(income = high) = 11/1003
 The “corrected” probability estimates are close to their “uncorrected”
   counterparts
22
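
The correction in code form, a small sketch; note the denominator grows by
the number of distinct values (1000 + 3 = 1003 here):

    def smoothed_probs(counts):
        # counts: raw value counts for one attribute within one class
        total = sum(counts.values()) + len(counts)   # add 1 per distinct value
        return {v: (c + 1) / total for v, c in counts.items()}

    print(smoothed_probs({'low': 0, 'medium': 990, 'high': 10}))
    # i.e., 1/1003, 991/1003, 11/1003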
Naïve Bayes Classifier
 Advantages
 Easy to implement
 Good results obtained in most of the cases

 Disadvantages
 Assumption of class conditional independence, which causes a loss of
   accuracy, because in practice dependencies exist among variables
 E.g., hospital patient data: profile (age, family history, etc.),
   symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by a naïve Bayes classifier

 How to deal with these dependencies? Bayesian belief networks
23
Chapter 8. Classification: Basic Concepts

 Classification: Basic Concepts


 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
24
Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
   R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 n_covers = # of tuples covered by R
 n_correct = # of tuples correctly classified by R
   coverage(R) = n_covers / |D|   where D is the training data set
   accuracy(R) = n_correct / n_covers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that
   has the “toughest” requirement (i.e., the most attribute tests)
 Class-based ordering: decreasing order of prevalence or misclassification
   cost per class
 Rule-based ordering (decision list): rules are organized into one long
   priority list, according to some measure of rule quality or by experts
25
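
Coverage and accuracy in code, a small sketch over dicts shaped like the
buys_computer tuples (the three-row dataset D is a made-up placeholder):

    D = [
        {'age': 'youth',  'student': 'yes', 'buys_computer': 'yes'},
        {'age': 'youth',  'student': 'no',  'buys_computer': 'no'},
        {'age': 'senior', 'student': 'yes', 'buys_computer': 'yes'},
    ]

    def rule_stats(D, antecedent, consequent):
        covered = [t for t in D if antecedent(t)]
        correct = [t for t in covered if consequent(t)]
        return len(covered) / len(D), len(correct) / len(covered)

    # R: IF age = youth AND student = yes THEN buys_computer = yes
    coverage, accuracy = rule_stats(
        D,
        lambda t: t['age'] == 'youth' and t['student'] == 'yes',
        lambda t: t['buys_computer'] == 'yes',
    )
    print(coverage, accuracy)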
Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction; the leaf
   holds the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree

age?
  <=30   -> student?  (no -> no, yes -> yes)
  31..40 -> yes
  >40    -> credit_rating?  (excellent -> no, fair -> yes)

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
26
Rule Induction: Sequential Covering Method
 Sequential covering algorithms
 Extract rules directly from training data
 Examples: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially
 Each rule for a given class Ci will cover
 Many tuples of Ci
 But none (or few) of the tuples of other classes
27
Rule Induction: Sequential Covering Method
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rule are removed
 The process is repeated on the remaining tuples until a termination
   condition is met
 Example termination conditions
 When there are no more training examples
 When the quality of a rule returned is below a user-specified threshold
28
Sequential Covering Algorithm

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

[Figure: successive rules (Rule 1, Rule 2, Rule 3) each cover a different
region of the positive examples]
29
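
A runnable sketch of the covering loop above; learn_one_rule is stubbed here
(an assumption, since the slides defer it to the Learn-One-Rule slide) as a
greedy pick of the single attribute-value test separating positives from
negatives best:

    def learn_one_rule(pos, neg):
        # Score each candidate test by positives covered minus negatives covered
        candidates = {(a, v) for t in pos for a, v in t.items()}
        return max(candidates,
                   key=lambda av: sum(t.get(av[0]) == av[1] for t in pos)
                                - sum(t.get(av[0]) == av[1] for t in neg))

    def sequential_covering(pos, neg, min_remaining=1):
        rules = []
        while len(pos) >= min_remaining:           # enough target tuples left
            attr, val = learn_one_rule(pos, neg)   # generate a rule
            rules.append((attr, val))
            pos = [t for t in pos if t.get(attr) != val]   # remove covered positives
        return rules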
Rule Generation
 To generate a rule:

while (true)
    find the best predicate p
    if foil-gain(p) > threshold
        then add p to current rule
        else break

[Figure: the rule grows one conjunct at a time (A3=1, then A3=1 AND A1=2,
then A3=1 AND A1=2 AND A8=5), excluding more negative examples at each step]
30
How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Add new attributes by adopting a greedy depth-first strategy
 Pick the one that most improves the rule quality
 Rule-quality measures: consider both coverage and accuracy
 FOIL gain (in FOIL & RIPPER): assesses the information gained by
   extending the condition:

FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)

 It favors rules that have high accuracy and cover many positive tuples
 Rule pruning is based on an independent set of test tuples:

FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}

   where pos/neg are the numbers of positive/negative tuples covered by R.
   If FOIL_Prune is higher for the pruned version of R, prune R
31
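
Direct translations of the two measures above into code; pos/neg count
tuples covered before extending the condition, pos2/neg2 after:

    import math

    def foil_gain(pos, neg, pos2, neg2):
        if pos2 == 0:
            return 0.0
        return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

    def foil_prune(pos, neg):
        return (pos - neg) / (pos + neg)

    # e.g. extending a rule shrinks its coverage from (20 pos, 15 neg) to (12, 3)
    print(foil_gain(20, 15, 12, 3))   # positive gain, so the extension helps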
