Lecture 6 Classification-Decision Tree Rule Based K-NN

This document discusses classification and decision trees. It defines classification as predicting categorical class labels by constructing a model based on training data, while prediction models continuous functions. The classification process involves model construction using a training set and model usage to classify new data. Decision trees are a popular classification method that create rules to partition data into branches and leaves representing class labels. They have advantages like interpretability but disadvantages like potential overfitting. The document outlines the basic decision tree induction algorithm, which recursively partitions data based on attribute tests, and discusses measures like information gain and Gini index used for attribute selection.


Classification

Dr. Kavita Pabreja


Associate Professor
Dept. of Computer Applications, MSI

1
Outline

◼ What is classification? What is prediction?


◼ Issues regarding classification and prediction
◼ Advantages and disadvantages of Decision tree
◼ Classification by decision tree induction
◼ Attribute Selection measures
◼ Rule-based classification
◼ Lazy learners: k-Nearest Neighbor

2
Classification vs. Prediction

◼ Classification:
◼ predicts categorical class labels

◼ classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute, and uses
it in classifying new data
◼ Prediction:
◼ models continuous-valued functions, i.e., predicts
unknown or missing values
◼ Typical Applications
◼ credit approval

◼ target marketing

◼ medical diagnosis

◼ treatment effectiveness analysis


3
Classification—A Two-Step Process

◼ Model construction: describing a set of predetermined classes


◼ Each tuple/sample is assumed to belong to a predefined class,

as determined by the class label attribute


◼ The set of tuples used for model construction: training set

◼ The model is represented as classification rules, decision trees,

or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model

◼ The known label of test sample is compared with the

classified result from the model


◼ Accuracy rate is the percentage of test set samples that are

correctly classified by the model


◼ Test set is independent of training set, otherwise over-fitting

will occur
4
Overfitting

Overfitting means the model has matched the training data well,
but it fails to perform on future points (the test data).
5
Classification Process (1): Model
Construction
Training Data → Classification Algorithm → Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
6
Classification Process (2): Use the
Model in Prediction

Classifier (Model) applied to Testing Data and Unseen Data

Unseen data: (Jeff, Professor, 4) → Tenured?

Testing data:
NAME      RANK            YEARS   TENURED
Tom       Assistant Prof  2       no
Merlisa   Associate Prof  7       no
George    Professor       5       yes
Joseph    Assistant Prof  7       yes
7
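To make these two steps concrete, here is a minimal scikit-learn sketch using the toy tenure data above; the library calls are standard, but the variable names and the label-encoding choices are illustrative additions, not part of the lecture.

from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Training data from the slide: (RANK, YEARS) -> TENURED
ranks   = ["Assistant Prof", "Assistant Prof", "Professor",
           "Associate Prof", "Assistant Prof", "Associate Prof"]
years   = [3, 7, 2, 7, 6, 3]
tenured = ["no", "yes", "yes", "yes", "no", "no"]

rank_enc = LabelEncoder().fit(ranks)        # categorical attribute -> integer codes
X = [[code, y] for code, y in zip(rank_enc.transform(ranks), years)]

# Step 1: model construction on the training set
model = DecisionTreeClassifier().fit(X, tenured)

# Step 2: model usage on unseen data, e.g. (Jeff, Professor, 4)
jeff = [[rank_enc.transform(["Professor"])[0], 4]]
print("Tenured?", model.predict(jeff)[0])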
Supervised vs. Unsupervised
Learning
◼ Supervised learning (classification)
◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown
◼ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
8
Issues regarding classification and
prediction (1): Data Preparation

◼ Data cleaning
◼ Preprocess data in order to reduce noise and handle
missing values
◼ Relevance analysis (feature selection)
◼ Remove the irrelevant or redundant attributes
◼ Data transformation
◼ Generalize and/or normalize data

9
Issues regarding classification and prediction
(2): Evaluating Classification Methods

◼ Predictive accuracy
◼ Speed
◼ time to construct the model

◼ time to use the model

◼ Robustness
◼ handling noise and missing values

◼ Scalability
◼ efficiency in disk-resident databases

◼ Interpretability:
◼ Level of understanding and insight that is provided
by the model
◼ Goodness of rules
◼ decision tree size

◼ compactness of classification rules


10
Advantages of Decision tree
◼ The decision tree model can be used for both classification and
regression problems, and it is easy to interpret, understand, and
visualize.
◼ The output of a decision tree can also be easily understood.
◼ Compared with other algorithms, data preparation during pre-
processing in a decision tree requires less effort and does not
require normalization of data.
◼ The implementation can also be done without scaling the data.
◼ A decision tree is one of the quickest ways to identify relationships
between variables and the most significant variable.
◼ New features can also be created for better target variable
prediction.
◼ Decision trees are not largely influenced by outliers or missing
values, and they can handle both numerical and categorical variables.
11
Disadvantages of Decision tree
◼ Overfitting is one of the practical difficulties for decision tree
models. It happens when the learning algorithm continues
developing hypotheses that reduce the training set error but at the
cost of increasing test set error. But this issue can be resolved by
pruning and setting constraints on the model parameters.
◼ Decision trees handle continuous numerical variables less naturally,
since such variables must be discretized into split thresholds.
◼ A small change in the data tends to cause a big difference in the
tree structure, which causes instability.
◼ Calculations involved can also become complex compared to other
algorithms, and it takes a longer time to train the model.
◼ It is also relatively expensive as the amount of time taken and the
complexity levels are greater.

12
Classification by Decision Tree
Induction
◼ Decision tree
◼ A flow-chart-like tree structure

◼ Internal node denotes a test on an attribute

◼ Branch represents an outcome of the test

◼ Leaf nodes represent class labels or class distribution

◼ Decision tree generation consists of two phases


◼ Tree construction

◼ At start, all the training examples are at the root

◼ Partition examples recursively based on selected attributes

◼ Tree pruning

◼ Identify and remove branches that reflect noise or outliers

◼ Use of decision tree: Classifying an unknown sample


◼ Test the attribute values of the sample against the decision tree

13
Classification by Decision Tree
Induction

14
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer

manner
◼ At start, all the training examples are at the root

◼ Attributes are categorical (if continuous-valued, they are
discretized in advance); in Python, categorical attributes are
typically label encoded


◼ Examples are partitioned recursively based on selected attributes

◼ Test attributes are selected on the basis of a heuristic or

statistical measure (e.g., information gain)


◼ Conditions for stopping partitioning
◼ All samples for a given node belong to the same class

◼ There are no remaining attributes for further partitioning

◼ There are no samples left

15
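A minimal Python sketch of this recursive partitioning loop, with the stopping conditions listed above; the attribute-selection heuristic is passed in as a function because the measures themselves (information gain, Gini index) are introduced later. The data layout and names are illustrative assumptions, not the algorithm from any particular library.

from collections import Counter

def build_tree(rows, attributes, target, select_attribute):
    # rows: list of dicts mapping attribute name -> value; target: class-label key
    # select_attribute: heuristic (e.g. information gain) that picks the test attribute
    labels = [row[target] for row in rows]

    # Stopping condition 1: all samples at this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no remaining attributes -> leaf labelled with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    best = select_attribute(rows, attributes, target)        # greedy choice
    majority = Counter(labels).most_common(1)[0][0]
    node = {"attribute": best, "branches": {}, "majority": majority}
    remaining = [a for a in attributes if a != best]

    # Partition the examples on each observed value of the chosen (categorical) attribute.
    # Stopping condition 3 ("no samples left") is handled implicitly here: only values
    # that actually occur get a branch, and unseen values fall back to the majority class.
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        node["branches"][value] = build_tree(subset, remaining, target, select_attribute)
    return node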
How Does the Decision
Tree Algorithm Work?
Different splitting criteria

17
Attribute Selection Measure

◼ Information gain
◼ Gini index

18
Entropy

◼ Entropy is an information theory metric that


measures the impurity or uncertainty in a group
of observations. It determines how a decision
tree chooses to split data. As an illustration of purity: a
perfectly mixed group of observations has entropy 1, while a
pure group (all one class) has entropy 0.
19
Entropy
◼ The entropy may be calculated using the
formula below:
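Entropy(D) = − Σ pi log2(pi)   (summed over the m classes, where pi is the
proportion of observations in D that belong to class Ci)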

what happens when all observations belong to the same


class? In such a case, the entropy will always be zero.

E = −(1 × log2(1)) = 0

Such a dataset has no impurity. This implies that such a


dataset would not be useful for learning. However, if we
have a dataset with say, two classes, half made up of
yellow and the other half being purple, the entropy will be
one.
20
Entropy

◼ A dataset with a 50/50 split of samples for the


two classes would have a maximum entropy
(maximum surprise) of 1 bit, whereas an
imbalanced dataset with a split of 10/90 would
have a smaller entropy as there would be less
surprise for a randomly drawn example from
the dataset.

◼ Skewed Probability Distribution (unsurprising): Low entropy.


◼ Balanced Probability Distribution (surprising): High entropy.
21
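A small Python sketch illustrating these two cases numerically (the function name is illustrative):

import math

def entropy(probabilities):
    # Entropy in bits: -sum(p * log2(p)); zero-probability classes contribute nothing.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # balanced 50/50 split -> 1.0 bit (maximum surprise)
print(entropy([0.1, 0.9]))   # skewed 10/90 split   -> about 0.47 bits (less surprise)
print(entropy([1.0]))        # a single class       -> 0.0 bits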
Information Gain (ID3/C4.5)

◼ Select the attribute with the highest information gain


◼ The expected information needed to classify a tuple in D is
given by
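Info(D) = − Σ (i = 1..m) pi log2(pi)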

where pi is the probability that an arbitrary tuple in D


belongs to class Ci and is estimated
by |Ci,D | / |D|

22
Expected Information needed to classify a tuple in D is →

If all records belong to a


single class then Info(D)=0
If each class has 7 tuples,
then Info(D)=1.0

Log2 0.5 = -1
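Worked out for 7 tuples in each of the two classes:
Info(D) = −(7/14) log2(7/14) − (7/14) log2(7/14) = −log2(0.5) = 1.0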

23
Information Gain in Decision
Tree Induction
• we would like each partition to be pure.
• It is quite likely that the partitions will be impure (e.g.,
where a partition may contain a collection of tuples from
different classes rather than from a single class).
• How much more information would we still need (after
the partitioning) in order to arrive at an exact classification?
•This amount is measured by
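InfoA(D) = Σ (j = 1..v) (|Dj| / |D|) × Info(Dj)
(where attribute A splits D into the v partitions D1, …, Dv)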

InfoA (D) is the expected information required to classify a tuple


from D based on the partitioning by A.
The smaller the expected information (still) required, the greater the
purity of the partitions.
24
Information Gain in Decision Tree
Induction
Information gain is defined as the difference between
the original information requirement (i.e., based on just
the proportion of classes) and the new requirement
(i.e., obtained after partitioning on A).
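Gain(A) = Info(D) − InfoA(D)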

25
Information Gain in Decision Tree
Induction

Compute the expected information requirement for each attribute.


For attribute age:-
For the age category youth, there are two yes tuples and three no tuples.
For the category middle aged, there are four yes tuples and zero no tuples.
For the category senior, there are three yes tuples and two no tuples.
The expected information needed to classify a tuple in D if
the tuples are partitioned according to age is

26
Expected Information needed to classify a tuple in D is →

age yes no
youth 2 3
middleaged 4 0
senior 3 2

27
Information Gain in Decision Tree
Induction

28
29
Final tree after Classification by
Decision Tree Induction
(using Information gain)

30
Information gain of an attribute that
is continuous-valued
(For example, suppose we have the raw values for age.)
• we must determine the “best” split-point for A, where the split-point is a
threshold on A.
• We first sort the values of A in increasing order.
• Typically, the midpoint between each pair of adjacent values is considered
as a possible split-point.
•Therefore, given v values of A, then v-1 possible splits are evaluated.
• For example, the midpoint between the values ai and ai+1 of A is
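(ai + ai+1) / 2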

31
Drawback of Information Gain
• The information gain measure is biased toward tests
with many outcomes.
• It prefers to select attributes having a large number of
values.
• For example, consider an attribute that acts as a unique
identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there
are values), each one containing just one tuple.
Because each partition is pure, the information gained by
partitioning on this attribute is maximal.
Clearly, such a partitioning is useless for classification.

32
Information Gain vs Gini Index

• The Gini Index is used by the CART algorithm.

• Information Gain is used by the ID3 and C4.5 algorithms.

• The Gini Index is more efficient than entropy/Information Gain in
terms of computing power.
33
Gini index
The Gini index measures the impurity of D, a data
partition or set of training tuples, as
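Gini(D) = 1 − Σ (i = 1..m) pi²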

where pi is the probability that a tuple in D belongs to


class Ci and is estimated by |Ci,D|/|D|. The sum is
computed over m classes.

34
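A small Python sketch of the computation, using the 9-yes / 5-no class distribution from the earlier example; the data and function name are illustrative.

def gini(counts):
    # Gini impurity: 1 - sum(pi^2) over the class counts in the partition.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))    # 9 "yes" / 5 "no" tuples  -> about 0.459
print(gini([7, 7]))    # perfectly balanced split -> 0.5 (the maximum for two classes)
print(gini([14, 0]))   # pure partition           -> 0.0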
35
Gini index

36
Gini index

37
Gini index

38
Tree Pruning

➢ When a decision tree is built, many of the branches


will reflect anomalies in the training data due to
noise or outliers.
➢ Tree pruning methods address this problem of
overfitting the data.
➢ Such methods typically use statistical measures to
remove the least reliable branches.
➢ Pruned trees tend to be smaller and less complex
and, thus, easier to comprehend.
➢ They are usually faster and better at correctly
classifying independent test data (i.e., of previously
unseen tuples) than unpruned trees.

39
Tree Pruning

40
Tree Pruning-
prepruning and postpruning

•Prepruning : a tree is “pruned” by halting its


construction early (e.g., by deciding not to further split or
partition the subset of training tuples at a given node.)
•Upon halting, the node becomes a leaf. The leaf may
hold the most frequent class among the subset tuples or
the probability distribution of those tuples.
•When constructing a tree, measures such as statistical
significance, information gain, Gini index, can be used to
assess the goodness of a split.
•If partitioning the tuples at a node would result in a split
that falls below a prespecified threshold, then further
partitioning of the given subset is halted.

41
Choosing an appropriate threshold -
prepruning

❖ High thresholds could result in oversimplified trees,

❖ low thresholds could result in very little simplification.

42
Postpruning
➢ It removes subtrees from a “fully grown” tree.
➢ A subtree at a given node is pruned by removing its branches
and replacing it with a leaf.
➢ The leaf is labeled with the most frequent class among the
subtree being replaced.

43
Postpruning - example
➢ This approach considers the cost complexity of a tree to be a
function of the number of leaves in the tree and the error rate of the
tree (where the error rate is the percentage of tuples misclassified
by the tree).
➢ It starts from the bottom of the tree. For each internal node, N, it
computes the cost complexity of the subtree at N, and the cost
complexity of the subtree at N if it were to be pruned (i.e., replaced
by a leaf node).
➢ The two values are compared.
➢ If pruning the subtree at node N would result in a smaller cost
complexity, then the subtree is pruned. Otherwise, it is kept.
➢ (A pruning set of class labelled tuples is used to estimate cost
complexity.)
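This idea is available in scikit-learn as minimal cost-complexity pruning; a
sketch follows, where the example data set and the use of a held-out split as
the pruning set are illustrative choices, not part of the lecture.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path lists candidate cost-complexity thresholds (alphas);
# larger alphas prune more aggressively, giving smaller subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best = None
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_test, y_test)      # the held-out split stands in for the pruning set
    if best is None or score >= best[0]:    # ties go to the larger alpha (simpler tree)
        best = (score, alpha, tree.get_n_leaves())

print("best accuracy %.3f at alpha = %.5f with %d leaves" % best)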

44
Prepruning and Postpruning -
interleaving

❖ Prepruning and postpruning may be


interleaved for a combined approach.
❖ Postpruning requires more computation than
prepruning, yet generally leads to a more
reliable tree.
❖ No single pruning method has been found to
be superior over all others.

45
Repetition and Replication
in Decision tree

Decision trees can suffer from repetition and replication,


making them overwhelming to interpret.
Repetition occurs when an attribute is repeatedly tested
along a given branch of the tree (such as “age < 60?”,
followed by “age < 45?”, and so on).

In replication, duplicate subtrees exist within the tree.

➢ These situations can impede the accuracy and


comprehensibility of a decision tree.
➢ The use of multivariate splits (splits based on a
combination of attributes) can prevent these problems.
46
47
Scalability and
Decision Tree Induction

➢ ID3, C4.5 -well established for relatively small data sets.


➢ Not Efficient for the mining of very large real-world databases.
➢ Restriction - the training tuples should reside in memory.
➢ Very large training sets of millions of tuples – do not fit in
memory!
➢ Result : Decision tree construction therefore becomes inefficient
due to swapping of the training tuples in and out of main and
cache memories.

48
Rule-Based Classification- Using IF-
THEN Rules for Classification

IF part: rule antecedent (precondition). THEN part: rule consequent.

49
Using IF-THEN Rules for Classification

➢ Let X satisfy R1, which triggers the rule.


➢ If R1 is the only rule satisfied, then the rule fires by returning
the class prediction for X.
➢ triggering does not always mean firing because there may be
more than one rule that is satisfied!
➢ What if they each specify a different class? Or what if no rule is
satisfied by X?
➢ If more than one rule is triggered, we need a conflict resolution
strategy : size ordering and rule ordering.
➢ If no rule is satisfied by X: a fallback or default rule can be set up
to specify a default class

50
Conflict resolution strategy :
size ordering and rule ordering

➢The size ordering scheme assigns the highest priority to the


triggering rule that has the “toughest” requirements, where toughness is
measured by the rule antecedent size.
➢ The rule ordering scheme prioritizes the rules beforehand. The
ordering may be class-based or rule-based.
➢ class-based ordering: the classes are sorted in order of decreasing
“importance,” such as by decreasing order of prevalence (frequency) OR
they may be sorted based on the misclassification cost per class.
➢ rule-based ordering : the rules are organized into one long priority list,
according to some measure of rule quality such as accuracy, coverage, or
size (number of attribute tests in the rule antecedent), or based on advice
from domain experts.

51
Rule-Based Classification- Extracting
Classification Rules from Trees
◼ Represent the knowledge in the form of IF-THEN rules
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction
◼ The leaf node holds the class prediction
◼ Rules are easier for humans to understand
◼ Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”

52
Extracting Classification Rules
from Trees

◼ A disjunction (logical OR) is implied between each of


the extracted rules.
◼ Rules are extracted directly from the tree, they are
mutually exclusive and exhaustive.
◼ Mutually exclusive :we cannot have rule conflicts
here because no two rules will be triggered for the
same tuple. (We have one rule per leaf, and any tuple
can map to only one leaf.)
◼ Exhaustive: there is one rule for each possible
attribute-value combination, so that this set of rules
does not require a default rule.
Rules are unordered.
53
Extracting Classification Rules
from Trees

◼ Subtree repetition and replication: the resulting set of rules


extracted can be large and difficult to follow, because some of the
attribute tests may be irrelevant or redundant

◼ Pruning the resulting rule set is required: any condition that does

not improve the estimated accuracy of the rule can be pruned.

◼ After pruning, the rules will no longer be mutually exclusive and


exhaustive.

54
Rule-Based Classification-
Rule Induction Using a Sequential
Covering Algorithm

◼ Classification rules can be generated using associative


classification algorithms, which search for attribute-
value pairs that occur frequently in the data.
◼ These pairs may form association rules, which can be
analyzed and used in classification.- deferred for
Association Rule mining topic.
◼ Using a Sequential Covering Algorithm :Rules are
learned one at a time. Each time a rule is learned, the
tuples covered by the rule are removed, and the
process repeats on the remaining tuples.

55
Rule Induction Using a Sequential
Covering Algorithm

56
Rule Induction Using a Sequential
Covering Algorithm
Rules are grown in a general-to-specific manner - Each time it is faced
with adding a new attribute test to the current rule, it picks the one
that most improves the rule quality, based on the training samples.

57
Rule Quality Measures for Rule Induction
Using a Sequential Covering Algorithm
Learn One Rule needs a measure of rule quality. Every time it
considers an attribute test, it must check to see if appending
such a test to the current rule’s condition will result in an
improved rule.
Rule R1 correctly classifies 38 of
the 40 tuples it covers. Rule R2
covers only two tuples, which it
correctly classifies.
Their respective accuracies are
95% and 100%.
Thus, R2 has greater accuracy than
R1, but it is not the better rule
because of its small coverage

(Figure: rules for the class loan decision = accept, showing accept (a)
and reject (r) tuples.)
58
Rule Quality Measures

1. Information / Entropy : expected information needed to


classify a tuple in data set, D. (Lower the entropy, better
the rule)

59
Rule Quality Measures

2. Another measure is based on information gain and was


proposed in FOIL (First Order Inductive Learner), a
sequential covering algorithm. It favors rules that have
high accuracy and cover many positive tuples

the tuples of the class for which we are learning rules are called
positive tuples, while the remaining tuples are negative.
pos (neg) are the number of positive(negative) tuples covered
by R
pos’ (neg’) are the number of positive(negative) tuples covered
by R’
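FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )
where R' is the rule obtained from R by adding the candidate attribute test.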
60
Rule Quality Measures

3. Likelihood ratio statistic
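Likelihood_Ratio = 2 × Σ (i = 1..m) fi log( fi / ei )
where fi is the observed frequency of class i among the tuples covered by the
rule and ei is the frequency expected if the rule made random predictions.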

The higher the likelihood ratio is, the more likely that
there is a significant difference in the number of correct
predictions made by our rule in comparison with a “random
guessor.” (That is the performance of our rule is not due to
chance.)
61
Rule Pruning
Rule pruning is done to avoid overfitting.
A rule is pruned by removing a conjunct (attribute test).

62
Eager vs. Lazy Learner
Eager learner:
◼ When it receives the training data set, it starts learning (building the model)
◼ It does not wait for the test data before learning
◼ So it takes a long time learning and less time classifying data
(an explicit description of the target function is built over the whole training set)
◼ Decision Tree, Naive Bayes, Artificial Neural Networks

Lazy learner:
◼ Just stores the data set without learning from it
◼ Starts classifying only when it receives test data
◼ So, it takes less time learning and more time classifying data
◼ k-Nearest Neighbour, Case-Based Reasoning: both are
instance-based learning

63
Different Learning Methods

◼ Eager Learning

Any random movement => It’s a mouse! (“I saw a mouse!”)
Different Learning Methods

• Lazy Learning-Instance-based Learning

It’s very similar to a desktop!
Lazy Learners

◼ k-nearest neighbor classifier


◼ case-based reasoning

66
1-Nearest Neighbor

◼ One of the simplest of all machine learning


classifiers
◼ Simple idea: label a new point the same as the
closest known point (also used in filling missing
data/class)
Label it red.

Also known as “memory-based” learning


1-NN’s Aspects as an
Instance-Based Learner:

A distance metric
◼ Euclidean
◼ When different units are used for each dimension
(numeric data)
→ normalize each dimension
◼ For discrete data, the Hamming distance can be used
→ D(x1, x2) = number of features on which x1 and x2 differ
k – Nearest Neighbor

◼ A new point is now assigned the most frequent


label of its k nearest neighbors

Label it red, when k = 3

Label it green, when k = 7


k-Nearest-Neighbor Classifiers
◼ It compares a given test tuple with training tuples that
are similar to it.
◼ The training tuples are described by n attributes. Each
tuple represents a point in an n-dimensional space.
◼ All of the training tuples are stored in an n-dimensional
pattern space.
◼ When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k
training tuples that are closest to the unknown tuple.
These k training tuples are the k “nearest neighbors” of
the unknown tuple.

70
k-Nearest-Neighbor Classifiers

The values of each attribute are normalized to prevent


attributes with initially large ranges (such as income)
from outweighing attributes with initially smaller ranges
(such as binary attributes).
Min-max normalization --- range [0, 1]
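v' = (v − minA) / (maxA − minA)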

71
k-Nearest-Neighbor Classifiers –
classification and prediction

◼ Classification: the unknown tuple is assigned


the most common class among its k nearest
neighbors. When k = 1, the unknown tuple is
assigned the class of the training tuple that is
closest to it in pattern space.
◼ Prediction : to return a real-valued prediction
for a given unknown tuple. In this case, the
classifier returns the average value of the real-
valued labels associated with the k nearest
neighbors of the unknown tuple.

72
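A compact Python sketch tying these pieces together: min-max normalization,
Euclidean distance, majority vote for classification, and the average of
neighbour labels for prediction. The helper names and toy data are illustrative
assumptions, not part of the lecture.

import math
from collections import Counter

def min_max_normalize(rows):
    # Scale every attribute to [0, 1] so attributes with large ranges (e.g. income)
    # do not outweigh attributes with small ranges.
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def k_nearest(train_X, query, k):
    # Indices of the k training points closest to the query (Euclidean distance).
    dists = [math.dist(x, query) for x in train_X]
    return sorted(range(len(train_X)), key=lambda i: dists[i])[:k]

def knn_classify(train_X, train_y, query, k=3):
    votes = [train_y[i] for i in k_nearest(train_X, query, k)]
    return Counter(votes).most_common(1)[0][0]          # most common class label

def knn_predict(train_X, train_y, query, k=3):
    neighbours = [train_y[i] for i in k_nearest(train_X, query, k)]
    return sum(neighbours) / len(neighbours)            # average of real-valued labels

# Toy usage (income in thousands, age in years; labels are made up).
# The query is normalized together with the training data for brevity;
# in practice the min/max would come from the training tuples only.
raw    = [[30, 25], [85, 40], [90, 45], [28, 30], [60, 35]]
labels = ["no", "yes", "yes", "no", "yes"]
X = min_max_normalize(raw + [[50, 33]])
train_X, query = X[:-1], X[-1]
print(knn_classify(train_X, labels, query, k=3))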
Thank you !!!
73
