Classification: Decision Tree
Arvind Deshpande
Classification
• Supervised learning – the training data are accompanied by labels indicating the class of each observation.
• A classifier is learned from the training data set and then used to assign class labels to new data.
• e.g., a classification model that categorizes bank loan applications as either safe or risky (a minimal sketch follows below).
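A minimal sketch of this two-step use (train on labeled examples, then classify new ones), using scikit-learn's DecisionTreeClassifier; the loan features and toy values below are hypothetical, purely for illustration:

# Hedged sketch: learn from labeled loan applications, then classify a new one.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [income_in_thousands, years_employed, debt_ratio]
X_train = [[60, 5, 0.2], [25, 1, 0.8], [90, 10, 0.1], [30, 2, 0.7]]
y_train = ["safe", "risky", "safe", "risky"]       # class labels of past applications

model = DecisionTreeClassifier()
model.fit(X_train, y_train)                        # build the classifier from training data

print(model.predict([[45, 3, 0.4]]))               # classify a new, unlabeled application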
Model Construction
• A classification model is built from a training set in which the class label of each tuple is known.
Model Usage
• The model is then used to classify future or unseen data, after its accuracy has been estimated on an independent test set.
Problem
Decision tree
• A flowchart-like tree structure, where
• each internal node (nonleaf node) denotes a test on an attribute,
• each branch represents an outcome of the test, and
• each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• The tree is built by a greedy, non-backtracking algorithm in a top-down, recursive, divide-and-conquer manner, and the resulting model is highly interpretable (a minimal node structure is sketched below).
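The structure described in the bullets above maps onto a very small data type. This is a hedged Python sketch (not from the slides): internal nodes hold an attribute test and one branch per test outcome, leaves hold a class label.

# Minimal sketch of the tree structure: internal nodes test an attribute,
# branches map test outcomes to subtrees, and leaf nodes hold a class label.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    attribute: Optional[str] = None                 # test attribute (internal node)
    branches: dict = field(default_factory=dict)    # test outcome -> child Node
    label: Optional[str] = None                     # class label (leaf node)

    def is_leaf(self) -> bool:
        return self.label is not None

def classify(node: Node, tuple_: dict) -> str:
    # Start at the root and follow the branch matching each test outcome.
    while not node.is_leaf():
        node = node.branches[tuple_[node.attribute]]
    return node.label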
Decision tree
• Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition, D.
• Input:
• Data partition, D, which is a set of training tuples and their associated class labels;
• attribute_list, the set of candidate attributes;
• Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or a splitting subset.
• Output: A decision tree.
Method
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the “best” splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting attribute
Method
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
(15) return N;
Method
• The algorithm uses the same process recursively to form a decision
tree for the tuples at each resulting partition, Dj , of D (step 14).
• The recursive partitioning stops only when any one of the following
terminating conditions is true:
1. All the tuples in partition D (represented at node N) belong to the
same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may be
further partitioned (step 4). In this case, majority voting is employed
(step 5). This involves converting node N into a leaf and labeling it
with the most common class in D. Alternatively, the class
distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is
empty (step 12). In this case, a leaf is created with the majority
class in D (step 13).
• The resulting decision tree is returned (step 15).
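The pseudocode and the three terminating conditions translate almost line for line into a short recursive function. Below is a hedged Python sketch (not the slides' code), assuming discrete-valued attributes with multiway splits, tuples represented as dicts with a "class" key, a caller-supplied attribute selection method, and a `domains` map giving every possible value of each attribute (so that an empty partition Dj can actually occur).

# Hedged sketch of Generate_decision_tree (steps 1-15), under the assumptions above.
from collections import Counter

def majority_class(D):
    # Most common class label in partition D (used for majority voting).
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method, domains):
    node = {}                                                   # step 1: create a node N
    classes = {t["class"] for t in D}
    if len(classes) == 1:                                       # steps 2-3: all tuples in one class
        return {"label": classes.pop()}
    if not attribute_list:                                      # steps 4-5: no attributes left
        return {"label": majority_class(D)}                     #            -> majority voting
    A = attribute_selection_method(D, attribute_list)           # step 6: choose the "best" attribute
    node["attribute"] = A                                       # step 7: label N with the criterion
    remaining = [a for a in attribute_list if a != A]           # steps 8-9: remove A (multiway split)
    node["branches"] = {}
    for outcome in domains[A]:                                  # step 10: one branch per outcome of A
        Dj = [t for t in D if t[A] == outcome]                  # step 11: the partition for outcome j
        if not Dj:                                              # steps 12-13: empty partition ->
            node["branches"][outcome] = {"label": majority_class(D)}  # leaf with majority class of D
        else:                                                   # step 14: grow the subtree
            node["branches"][outcome] = generate_decision_tree(
                Dj, remaining, attribute_selection_method, domains)
    return node                                                 # step 15: return N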
Information Gain
• Let node N represent or hold the tuples of partition D.
• The attribute with the highest information gain is chosen
as the splitting attribute for node N.
• This attribute minimizes the information needed to classify
the tuples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions.
• Such an approach minimizes the expected number of
tests needed to classify a given tuple and guarantees that
a simple (but not necessarily the simplest) tree is found.
Information Gain
• The expected information needed to classify a tuple in D is given by
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
• where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}| / |D|.
• A log function to base 2 is used because the information is encoded in bits.
• Info(D), also known as the entropy of D, is just the average amount of information needed to identify the class label of a tuple in D.
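A minimal Python sketch of Info(D), computed from a list of class labels (the 9/5 class split below is just an illustrative example):

# Minimal sketch: Info(D), the entropy of partition D, from its class labels.
from collections import Counter
from math import log2

def info(labels):
    # Expected information (in bits) needed to classify a tuple in D.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(round(info(["yes"] * 9 + ["no"] * 5), 3))    # 9 yes / 5 no tuples -> 0.940 bits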
Information Gain
• It is quite likely that the partitions will be impure (e.g., a partition may contain a collection of tuples from different classes rather than from a single class).
• The amount of information still needed (after partitioning on attribute A) to arrive at an exact classification is measured by
Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) \times Info(D_j)
• The term |D_j| / |D| acts as the weight of the jth partition.
• Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
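A hedged sketch of Info_A(D), reusing the info() helper from the previous sketch; tuples are again assumed to be dicts holding attribute values and a "class" key:

# Minimal sketch: Info_A(D), the weighted entropy after partitioning D on attribute A.
def info_A(D, A, class_key="class"):
    n = len(D)
    total = 0.0
    for value in {t[A] for t in D}:                  # one partition Dj per observed outcome of A
        Dj = [t for t in D if t[A] == value]
        total += (len(Dj) / n) * info([t[class_key] for t in Dj])   # weight |Dj| / |D|
    return total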
Information Gain
• The smaller the expected information (still) required, the greater the purity of the partitions.
• Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) - Info_A(D)
• The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N (see the sketch below).
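Combining the two helpers above, the gain of an attribute is simply their difference (a sketch, reusing info() and info_A()):

# Minimal sketch: information gain of attribute A over partition D.
def gain(D, A, class_key="class"):
    return info([t[class_key] for t in D]) - info_A(D, A, class_key)

# The splitting attribute at node N is the one maximizing the gain, e.g.:
# best_attribute = max(attribute_list, key=lambda A: gain(D, A))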
Information Gain
• For information gain of an attribute that is continuous valued,
determine the “best” split-point for A, where the split-point is a
threshold on A.
• Sort the values of A in increasing order.
• Typically, the midpoint between each pair of adjacent values is
considered as a possible split-point. Therefore, given v values
of A, then v - 1 possible splits are evaluated.
• For each possible split-point for A, evaluate Info_A(D), where the number of partitions is two, that is, v = 2 (i.e., j = 1, 2).
• The point with the minimum expected information requirement for A is selected as the split-point for A.
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point (a sketch of this search follows below).
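A hedged sketch of that search, reusing info() from above: sort the values, try the midpoint of each adjacent pair, and keep the split-point with the minimum expected information requirement.

# Minimal sketch: best split-point for a continuous-valued attribute A.
def best_split_point(D, A, class_key="class"):
    values = sorted({t[A] for t in D})
    best_info, best_sp = float("inf"), None
    for lo, hi in zip(values, values[1:]):           # v values -> v - 1 candidate midpoints
        sp = (lo + hi) / 2.0
        D1 = [t for t in D if t[A] <= sp]            # A <= split-point
        D2 = [t for t in D if t[A] > sp]             # A >  split-point
        n = len(D)
        expected = (len(D1) / n) * info([t[class_key] for t in D1]) \
                 + (len(D2) / n) * info([t[class_key] for t in D2])
        if expected < best_info:
            best_info, best_sp = expected, sp
    return best_sp, best_info                        # split-point with minimum Info_A(D)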
Gain Ratio
• The information gain measure is biased toward tests with
many outcomes. That is, it prefers to select attributes
having a large number of values.
• Gain ratio attempts to overcome this bias. It applies a kind of normalization to information gain using a “split information” value, defined analogously with Info(D) as
SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \times \log_2(|D_j| / |D|)
• This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
• Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D.
Gain Ratio
GainRatio(A) = Gain(A) / SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting attribute (see the sketch below).
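A hedged sketch of the gain ratio, reusing log2, gain(), and the dict-based tuples assumed in the earlier sketches:

# Minimal sketch: split information and gain ratio for attribute A.
def split_info(D, A):
    n = len(D)
    total = 0.0
    for value in {t[A] for t in D}:
        w = sum(1 for t in D if t[A] == value) / n   # |Dj| / |D|
        total -= w * log2(w)
    return total

def gain_ratio(D, A, class_key="class"):
    si = split_info(D, A)
    return gain(D, A, class_key) / si if si > 0 else 0.0   # guard: SplitInfo can be 0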
Gini index
• The Gini index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}| / |D|. The sum is computed over m classes.
• The Gini index considers a binary split for each attribute (a sketch follows below).
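A minimal sketch of Gini(D) from a list of class labels, reusing Counter from the entropy sketch (the 9/5 split is again just an illustration):

# Minimal sketch: Gini impurity of a partition D from its class labels.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(round(gini(["yes"] * 9 + ["no"] * 5), 3))    # 9 yes / 5 no tuples -> 0.459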
Gini index
• When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is
Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)
• For each attribute, each of the possible binary splits is
considered.
• For a discrete-valued attribute, the subset that gives the
minimum Gini index for that attribute is selected as its
splitting subset.
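A hedged sketch of this subset search for a discrete-valued attribute, reusing gini() above; it enumerates the candidate value subsets (excluding the empty and full sets) and keeps the one with the minimum weighted Gini index:

# Minimal sketch: weighted Gini index of a binary split of D on discrete attribute A,
# where tuples with t[A] in `subset` go to D1 and all others to D2.
from itertools import combinations

def gini_A(D, A, subset, class_key="class"):
    D1 = [t for t in D if t[A] in subset]
    D2 = [t for t in D if t[A] not in subset]
    n = len(D)
    return (len(D1) / n) * gini([t[class_key] for t in D1]) \
         + (len(D2) / n) * gini([t[class_key] for t in D2])

def best_gini_subset(D, A, class_key="class"):
    values = sorted({t[A] for t in D})
    best_score, best_subset = float("inf"), None
    for r in range(1, len(values)):                  # proper, non-empty subsets of A's values
        for subset in combinations(values, r):
            score = gini_A(D, A, set(subset), class_key)
            if score < best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score                   # splitting subset with minimum Gini_A(D)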
Gini index
• For continuous-valued attributes, each possible split-point must
be considered.
• The midpoint between each pair of (sorted) adjacent values is
taken as a possible split-point.
• The point giving the minimum Gini index for a given
(continuous-valued) attribute is taken as the split-point of that
attribute.
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point.
• The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is
\Delta Gini(A) = Gini(D) - Gini_A(D)
• The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute.
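The impurity reduction itself is a one-line helper on top of the gini() and gini_A() sketches above:

# Minimal sketch: reduction in impurity from a binary split of D on attribute A.
def delta_gini(D, A, subset, class_key="class"):
    return gini([t[class_key] for t in D]) - gini_A(D, A, subset, class_key)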
Tree Pruning
• When a decision tree is built, many of the branches will
reflect anomalies in the training data due to noise or
outliers.
• Tree pruning methods address this problem of overfitting
the data. Such methods typically use statistical measures
to remove the least-reliable branches.
• Pruned trees tend to be smaller and less complex and,
thus, easier to comprehend.
• They are usually faster and more accurate at classifying independent test data (i.e., previously unseen tuples) than unpruned trees.
Tree Pruning
• In the prepruning approach, a tree is “pruned” by halting
its construction early (e.g., by deciding not to further split
or partition the subset of training tuples at a given node).
• Upon halting, the node becomes a leaf. The leaf may hold
the most frequent class among the subset tuples or the
probability distribution of those tuples.
• When constructing a tree, measures such as statistical
significance, information gain, Gini index, and so on, can
be used to assess the goodness of a split.
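In practice, prepruning is often applied through growth-limiting hyperparameters. The sketch below uses scikit-learn's DecisionTreeClassifier on the Iris data purely as stand-in data; the threshold values are illustrative assumptions, not recommendations.

# Illustrative sketch: prepruning by halting tree construction early.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_iris(return_X_y=True)        # stand-in training data

prepruned = DecisionTreeClassifier(
    max_depth=4,                 # stop splitting beyond this depth
    min_samples_split=20,        # do not split nodes holding fewer tuples than this
    min_impurity_decrease=0.01,  # require at least this much impurity reduction per split
)
prepruned.fit(X_train, y_train)
print(prepruned.get_depth(), prepruned.get_n_leaves())   # the halted tree stays small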
Tree Pruning
• The second and more common approach is postpruning,
which removes subtrees from a “fully grown” tree.
• A subtree at a given node is pruned by removing its
branches and replacing it with a leaf.
• The leaf is labeled with the most frequent class among the tuples of the subtree being replaced.
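One widely used postpruning method is cost-complexity pruning (as in CART). The sketch below shows how scikit-learn exposes it via ccp_alpha; the Iris data and the particular choice of alpha are illustrative only, and in practice alpha would be selected on held-out validation data.

# Illustrative sketch: postpruning a fully grown tree by cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_iris(return_X_y=True)        # stand-in training data

full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)   # candidate alphas

# Larger ccp_alpha removes more subtrees (replacing each with a leaf).
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
pruned.fit(X_train, y_train)
print(pruned.get_n_leaves())                          # fewer leaves than the full tree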