Lecture 6 Classification-Decision Tree Rule Based K-NN
1
Outline of ppt
2
Classification vs. Prediction
◼ Classification:
◼ predicts categorical class labels
◼ typical applications: target marketing, medical diagnosis
◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model on a test set that is independent of the training set, otherwise over-fitting will occur
4
Using the Model on Testing and Unseen Data
[Figure: the trained classifier is applied to the testing data and then to unseen data, e.g. (Jeff, Professor, 4) → Tenured?]
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
7
Supervised vs. Unsupervised
Learning
◼ Supervised learning (classification)
◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of training data are unknown
◼ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
8
Issues regarding classification and
prediction (1): Data Preparation
◼ Data cleaning
◼ Preprocess data in order to reduce noise and handle
missing values
◼ Relevance analysis (feature selection)
◼ Remove the irrelevant or redundant attributes
◼ Data transformation
◼ Generalize and/or normalize data
9
Issues regarding classification and prediction
(2): Evaluating Classification Methods
◼ Predictive accuracy
◼ Speed
◼ time to construct the model
◼ Robustness
◼ handling noise and missing values
◼ Scalability
◼ efficiency in disk-resident databases
◼ Interpretability:
◼ Level of understanding and insight that is provided
by the model
◼ Goodness of rules
◼ decision tree size
12
Classification by Decision Tree
Induction
◼ Decision tree
◼ A flow-chart-like tree structure: internal nodes test attributes, branches correspond to test outcomes, and leaf nodes hold class labels
◼ Decision tree generation consists of two phases: tree construction and tree pruning
13
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer
manner
◼ At start, all the training examples are at the root
15
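As a concrete illustration of the top-down, divide-and-conquer construction just described, here is a minimal Python sketch (not the exact algorithm on the slides): it assumes categorical attributes stored as dictionaries, the names build_tree and info_gain are illustrative, and it uses information gain (covered in the following slides) as the selection measure.

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning on a categorical attribute."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer induction (greedy)."""
    if len(set(labels)) == 1:                  # all tuples in one class -> leaf
        return labels[0]
    if not attrs:                              # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {}
    for value in {row[best] for row in rows}:  # one branch per value of the best attribute
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[value] = build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 [a for a in attrs if a != best])
    return {best: node}
```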
How Does the Decision
Tree Algorithm Work?
Different splitting criteria
17
Attribute Selection Measure
◼ Information gain
◼ Gini index
18
Entropy
[Figure: an evenly mixed partition has entropy 1; a pure partition has entropy 0]
19
Entropy
◼ The entropy may be calculated using the formula below:
E = −Σ_i p_i log2(p_i)
◼ For a pure partition containing a single class (p = 1): E = −(1 · log2 1) = 0
22
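As a quick sanity check of the formula, a small Python helper (the name entropy is illustrative) reproduces both cases: an evenly mixed partition and a pure one.

```python
import math

def entropy(class_counts):
    """Entropy of a partition, given the number of tuples in each class."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)   # 0 * log2(0) is treated as 0

print(entropy([5, 5]))   # evenly mixed -> 1.0
print(entropy([10, 0]))  # pure (single class, p = 1) -> -(1 * log2(1)) = 0.0
```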
Expected information needed to classify a tuple in D:
Info(D) = −Σ_{i=1..m} p_i log2(p_i), where p_i is the probability that a tuple belongs to class C_i
(Note: log2 0.5 = −1, so two equally likely classes give Info(D) = 1 bit.)
23
Information Gain in Decision
Tree Induction
• We would like each partition to be pure.
• It is quite likely that the partitions will be impure (e.g., a partition may contain a collection of tuples from different classes rather than from a single class).
• How much more information would we still need (after the partitioning) in order to arrive at an exact classification?
• This amount is measured by Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j), the expected information required to classify a tuple in D after partitioning on attribute A.
25
Information Gain in Decision Tree Induction
Gain(A) = Info(D) − Info_A(D)
26
Expected information needed to classify a tuple in D (9 "yes" and 5 "no" tuples):
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits

age           yes   no
youth          2     3
middle_aged    4     0
senior         3     2
27
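The figures on this slide can be reproduced with a short Python sketch (helper names are illustrative): it computes Info(D) for the 9 "yes" / 5 "no" tuples, the expected information after partitioning on age, and the resulting information gain.

```python
import math

def info(counts):
    """Expected information (entropy) from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                                    # ~0.940 bits

# (yes, no) counts for each value of age, as in the table above
partitions = {"youth": (2, 3), "middle_aged": (4, 0), "senior": (3, 2)}
n = 14
info_age = sum(sum(c) / n * info(c) for c in partitions.values())   # ~0.694 bits

gain_age = info_D - info_age                             # ~0.246 bits
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```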
Information Gain in Decision Tree
Induction
28
29
Final tree after Classification by
Decision Tree Induction
(using Information gain)
30
Information gain of an attribute that
is continuous-valued
(For example, suppose we have the raw values for age.)
• we must determine the “best” split-point for A, where the split-point is a
threshold on A.
• We first sort the values of A in increasing order.
• Typically, the midpoint between each pair of adjacent values is considered
as a possible split-point.
• Therefore, given v values of A, v − 1 possible splits are evaluated.
• For example, the midpoint between the adjacent values a_i and a_{i+1} of A is (a_i + a_{i+1}) / 2.
31
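A rough Python sketch of this procedure (the name best_split_point is illustrative; it assumes a binary A <= split-point / A > split-point test and scores each candidate midpoint by information gain):

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Sort the attribute values, evaluate the v-1 midpoints between adjacent
    values, and return the split-point with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base, n = entropy([lab for _, lab in pairs]), len(pairs)
    best_gain, best_point = -1.0, None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                      # identical values: no midpoint
        point = (pairs[i][0] + pairs[i + 1][0]) / 2       # (a_i + a_{i+1}) / 2
        left = [lab for v, lab in pairs if v <= point]
        right = [lab for v, lab in pairs if v > point]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_point = gain, point
    return best_point, best_gain
```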
Drawback of Information Gain
• The information gain measure is biased toward tests
with many outcomes.
• It prefers to select attributes having a large number of
values.
• For example, consider an attribute that acts as a unique
identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there
are values), each one containing just one tuple.
Because each partition is pure, the information gained by
partitioning on this attribute is maximal.
Clearly, such a partitioning is useless for classification.
32
Information Gain vs. Gini Index
◼ The Gini index is used by the CART algorithm.
◼ Information gain is used by the ID3 and C4.5 algorithms.
34
35
Gini index
36
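The formulas on the Gini index slides did not survive extraction. For reference, here is a minimal Python sketch assuming the standard CART definitions: Gini(D) = 1 − Σ p_i² and the weighted Gini index of a binary split.

```python
def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini index of a binary (CART-style) split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)

# Example: 9 "yes" and 5 "no" tuples -> Gini(D) = 1 - (9/14)^2 - (5/14)^2 ~= 0.459
print(round(gini([9, 5]), 3))
```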
Tree Pruning
39
Tree Pruning-
prepruning and postpruning
41
Choosing an appropriate threshold -
prepruning
42
Postpruning
➢ It removes subtrees from a “fully grown” tree.
➢ A subtree at a given node is pruned by removing its branches
and replacing it with a leaf.
➢ The leaf is labeled with the most frequent class among the
subtree being replaced.
43
Postpruning - example
➢ This approach considers the cost complexity of a tree to be a
function of the number of leaves in the tree and the error rate of the
tree (where the error rate is the percentage of tuples misclassified
by the tree).
➢ It starts from the bottom of the tree. For each internal node, N, it
computes the cost complexity of the subtree at N, and the cost
complexity of the subtree at N if it were to be pruned (i.e., replaced
by a leaf node).
➢ The two values are compared.
➢ If pruning the subtree at node N would result in a smaller cost
complexity, then the subtree is pruned. Otherwise, it is kept.
➢ (A pruning set of class labelled tuples is used to estimate cost
complexity.)
44
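A schematic Python version of the comparison described above. The slide does not give the exact functional form of cost complexity, so this sketch assumes the common form "error rate + alpha × number of leaves", with both error counts estimated on the pruning set; all names are illustrative.

```python
def cost_complexity(n_misclassified, n_leaves, alpha, n_pruning_tuples):
    """Assumed form: error rate on the pruning set plus a penalty per leaf."""
    return n_misclassified / n_pruning_tuples + alpha * n_leaves

def should_prune(subtree_errors, subtree_leaves, leaf_errors, alpha, n_pruning_tuples):
    """Prune the subtree at node N if replacing it with a leaf (labeled with the
    most frequent class) does not increase the cost complexity."""
    keep = cost_complexity(subtree_errors, subtree_leaves, alpha, n_pruning_tuples)
    prune = cost_complexity(leaf_errors, 1, alpha, n_pruning_tuples)
    return prune <= keep
```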
Prepruning and Postpruning -
interleaving
45
Repetition and Replication
in Decision tree
48
Rule-Based Classification- Using IF-
THEN Rules for Classification
49
Conflict resolution strategies: size ordering and rule ordering
51
Rule-Based Classification- Extracting
Classification Rules from Trees
◼ Represent the knowledge in the form of IF-THEN rules
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction
◼ The leaf node holds the class prediction
◼ Rules are easier for humans to understand
◼ Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN
buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer =
“no”
52
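The five rules above translate directly into code. A literal Python transcription (the function name buys_computer is illustrative; the rules are checked in the order listed):

```python
def buys_computer(age, student, credit_rating):
    """IF-THEN rules extracted from the decision tree, checked in order.
    The category strings (e.g. "31...40") stand in for the slide's labels."""
    if age == "<=30" and student == "no":
        return "no"
    if age == "<=30" and student == "yes":
        return "yes"
    if age == "31...40":
        return "yes"
    if age == ">40" and credit_rating == "excellent":
        return "yes"
    if age == ">40" and credit_rating == "fair":
        return "no"
    return None  # no rule fires

print(buys_computer("<=30", "yes", "fair"))  # -> "yes"
```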
Extracting Classification Rules
from Trees
◼ Pruning the resulting rule set is required: any condition that does not improve the estimated accuracy of the rule can be pruned (removed)
54
Rule-Based Classification-
Rule Induction Using a Sequential
Covering Algorithm
55
Rule Induction Using a Sequential
Covering Algorithm
Rules are grown in a general-to-specific manner. Each time the algorithm is faced with adding a new attribute test to the current rule, it picks the test that most improves rule quality, based on the training samples.
57
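A rough Python sketch of sequential covering with a greedy, general-to-specific Learn_One_Rule. The helper names and the quality callback are assumptions: the slides do not fix a particular quality measure, so accuracy, FOIL gain, or another measure can be plugged in.

```python
def covers(rule, example):
    """A rule is a list of (attribute, value) tests; all must hold."""
    return all(example[attr] == val for attr, val in rule)

def learn_one_rule(examples, target_class, candidate_tests, quality):
    """Grow one rule general-to-specific, greedily appending the attribute test
    that most improves rule quality on the training samples."""
    rule = []                                    # empty condition: covers everything
    best_q = quality(rule, examples, target_class)
    improved = True
    while improved:
        improved, best_test = False, None
        for test in candidate_tests:
            if test in rule:
                continue
            q = quality(rule + [test], examples, target_class)
            if q > best_q:
                best_q, best_test, improved = q, test, True
        if improved:
            rule.append(best_test)
    return rule or None

def sequential_covering(examples, target_class, candidate_tests, quality):
    """Learn rules one at a time, removing the tuples each new rule covers
    before learning the next one (examples are dicts with a "label" key)."""
    rules, remaining = [], list(examples)
    while any(ex["label"] == target_class for ex in remaining):
        rule = learn_one_rule(remaining, target_class, candidate_tests, quality)
        if rule is None:
            break
        still_uncovered = [ex for ex in remaining if not covers(rule, ex)]
        if len(still_uncovered) == len(remaining):   # rule covers nothing new: stop
            break
        rules.append(rule)
        remaining = still_uncovered
    return rules
```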
Rule Quality Measures for Rule Induction
Using a Sequential Covering Algorithm
Learn One Rule needs a measure of rule quality. Every time it
considers an attribute test, it must check to see if appending
such a test to the current rule’s condition will result in an
improved rule.
Rule R1 correctly classifies 38 of the 40 tuples it covers. Rule R2 covers only two tuples, which it correctly classifies.
Their respective accuracies are 95% and 100%.
Thus, R2 has greater accuracy than R1, but it is not the better rule because of its small coverage.
[Figure: rules for the class loan_decision = accept, showing accept (a) and reject (r) tuples]
58
Rule Quality Measures
The tuples of the class for which we are learning rules are called positive tuples, while the remaining tuples are negative.
pos (neg) is the number of positive (negative) tuples covered by R.
pos′ (neg′) is the number of positive (negative) tuples covered by R′.
60
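The formula these quantities feed into is not shown above; assuming the standard FOIL_Gain measure (which favours rules that are accurate and also cover many positive tuples), a minimal Python version is:

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain of extending rule R (covering pos/neg tuples) to R'
    (covering pos_new/neg_new tuples). Assumes pos > 0; returns 0 if R'
    covers no positive tuples."""
    if pos_new == 0:
        return 0.0
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))
```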
Rule Quality Measures
The higher the likelihood ratio, the more likely it is that there is a significant difference in the number of correct predictions made by our rule compared with a "random guesser" (that is, the performance of our rule is not due to chance).
61
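The statistic itself did not survive extraction; here is a sketch assuming its usual definition, 2 · Σ f_i · ln(f_i / e_i), where f_i is the observed frequency of class i among the tuples the rule covers and e_i is the frequency a random guesser would be expected to produce.

```python
import math

def likelihood_ratio(observed, expected):
    """Likelihood ratio statistic comparing observed class frequencies among the
    covered tuples with those expected from random guessing (terms with f = 0
    contribute nothing)."""
    return 2.0 * sum(f * math.log(f / e)
                     for f, e in zip(observed, expected) if f > 0)
```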
Rule Pruning
Rule pruning is done to avoid overfitting.
A rule is pruned by removing a conjunct (attribute test).
62
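The slides do not name a specific pruning criterion. One common choice (the FOIL_Prune measure, evaluated on a pruning set independent of the training data) would look like this; a conjunct is dropped if the pruned rule scores at least as high.

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos and neg are the
    positive and negative tuples of the pruning set covered by rule R."""
    return (pos - neg) / (pos + neg)
```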
Eager vs. Lazy Learner
Eager learner:
◼ Builds a classification model (learns) as soon as it receives the training data
◼ It does not wait for test data before learning
◼ So it spends more time learning and less time classifying data
(Explicit description of the target function over the whole training set)
◼ Decision Tree, Naive Bayes, Artificial Neural Networks
Lazy learner:
◼ Simply stores the training data without learning from it
◼ Starts classifying only when it receives test data
◼ So it spends less time learning and more time classifying data
◼ k-Nearest Neighbour, Case-Based Reasoning: both are instance-based learning
63
Different Learning Methods
◼ Eager Learning
[Figure: illustration of eager learning ("I saw a mouse!") omitted]
66
1-Nearest Neighbor
A distance metric
◼ Euclidean
◼ When different units are used for each dimension
(numeric data)
→ normalize each dimension
◼ For discrete data, can use Hamming distance
→ D(x1, x2) = number of features on which x1 and x2 differ
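A small Python sketch of the two distance measures just listed (function names are illustrative; the min/max bounds used for normalization are assumed to come from the training data):

```python
import math

def euclidean(x1, x2, mins, maxs):
    """Euclidean distance after min-max normalizing each numeric dimension,
    so attributes measured in different units contribute comparably."""
    def norm(v, lo, hi):
        return (v - lo) / (hi - lo) if hi > lo else 0.0
    return math.sqrt(sum((norm(a, lo, hi) - norm(b, lo, hi)) ** 2
                         for a, b, lo, hi in zip(x1, x2, mins, maxs)))

def hamming(x1, x2):
    """Hamming distance for discrete data: number of features on which x1 and x2 differ."""
    return sum(a != b for a, b in zip(x1, x2))
```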
k – Nearest Neighbor
70
k-Nearest-Neighbor Classifiers
71
k-Nearest-Neighbor Classifiers –
classification and prediction
72
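A minimal Python sketch of k-NN classification (majority vote among the k nearest training tuples) and k-NN numeric prediction (average of the neighbours' values); all names are illustrative, and any distance function such as the ones above can be passed in.

```python
from collections import Counter

def k_nearest(query, train_X, k, dist):
    """Indices of the k training tuples closest to the query."""
    return sorted(range(len(train_X)), key=lambda i: dist(query, train_X[i]))[:k]

def knn_classify(query, train_X, train_y, k, dist):
    """Classification: return the majority class among the k nearest neighbours."""
    idx = k_nearest(query, train_X, k, dist)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]

def knn_predict(query, train_X, train_y, k, dist):
    """Numeric prediction: return the average value of the k nearest neighbours."""
    idx = k_nearest(query, train_X, k, dist)
    return sum(train_y[i] for i in idx) / k
```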
Thank you !!!
73