
Decision Tree

Decision Tree - Introduction

+non-parametric supervised learning approach
(it does not assume the data follows a normal distribution)
+can be applied to both regression and classification problems
+uses a tree analogy to model a sequential decision process
Decision Tree - Introduction
+Start from the root node
+a feature is evaluated, and one of the two branches is selected
+each node in the tree is basically a decision rule
+the procedure is repeated until a final leaf is reached, which gives the target (prediction)
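As a minimal sketch of this root-to-leaf procedure (assuming scikit-learn, which later slides also reference; the data below is synthetic and only for illustration):

```python
# Sketch: fit a small decision tree and print its decision rules.
# A prediction starts at the root node and follows one branch per
# decision rule until a leaf (the target) is reached.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Each printed line is one decision rule on a feature
print(export_text(tree, feature_names=[f"f{i}" for i in range(4)]))
print(tree.predict(X[:5]))  # class predictions for the first five rows
```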
Types of Decision Tree Algorithms
+ID3 (Iterative Dichotomiser 3)
+C4.5
+C5.0
+CART (Classification and Regression trees)
Types of Decision Tree Algorithms
ID3 (Iterative Dichotomiser 3)
• developed in 1986 by Ross Quinlan
• creates a multiway tree, finding for each node (i.e., in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets
• trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalize to unseen data
C4.5
• developed in 1993 by Ross Quinlan as the successor to ID3
• removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals
• converts the trained trees (i.e., the output of the ID3 algorithm) into sets of if-then rules; the accuracy of each rule is then evaluated to determine the order in which they should be applied
• pruning is done by removing a rule's precondition if the accuracy of the rule improves without it
C5.0
• Quinlan's latest version, released under a proprietary license
• uses less memory and builds smaller rulesets than C4.5 while being more accurate
CART (Classification and Regression Trees)
• very similar to C4.5, but differs in that it supports numerical target variables (regression) and does not compute rule sets
• constructs binary trees using the feature and threshold that yield the largest information gain at each node
Terminology

Source: https://medium.datadriveninvestor.com/the-basics-of-decision-trees-e5837cc2aba7
Terminology
+Root node : the first node of the decision tree
+Splitting : the process of dividing a node into two or more sub-nodes, starting from the root node
+Node : the result of splitting the root node into sub-nodes and splitting sub-nodes into further sub-nodes
+Leaf or terminal node : a node that cannot be split any further; it marks the end of a branch
+Pruning : a technique to reduce the size of the decision tree by removing sub-nodes. The aim is to reduce complexity, improve predictive accuracy, and avoid overfitting
+Branch / Sub-tree : a subsection of the entire tree is called a branch or sub-tree
Intuition

+Case: predicting the salary of a baseball player (the Hitters data set)
+The root node splits on the feature Years < 4.5; the right branch is then split on Hits < 117.5, and each terminal node (leaf) predicts the mean log salary of its region (5.11, 6.00, and 6.74)
+The 3 regions can be written as:
• R1 = {X | Years < 4.5}
• R2 = {X | Years >= 4.5, Hits < 117.5}
• R3 = {X | Years >= 4.5, Hits >= 117.5}
+Note:
• Salary is measured as the mean log salary within each region
• Hits is the number of hits the player made in the previous year
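A small sketch of the same idea with a depth-2 regression tree (the data below is synthetic and only mimics the Hitters structure; the real data set is not loaded here):

```python
# Sketch: a depth-2 regression tree carves the (Years, Hits) plane into
# rectangular regions like R1-R3 above. Data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
years = rng.integers(1, 20, size=300)
hits = rng.integers(0, 240, size=300)
# Synthetic log-salary that roughly mimics the three-region structure
log_salary = np.where(years < 4.5, 5.1,
                      np.where(hits < 117.5, 6.0, 6.7)) + rng.normal(0, 0.2, 300)

X = np.column_stack([years, hits])
reg = DecisionTreeRegressor(max_depth=2).fit(X, log_salary)

# The printed splits correspond to regions like R1, R2, R3
print(export_text(reg, feature_names=["Years", "Hits"]))
```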
Splitting in Decision Trees
+split the nodes at the most informative features
using the decision algorithm
+start at the tree root and split the data on the
feature that results in the largest information
gain (IG)
+the objective function is to maximize the
information gain (IG) at each split
Splitting in Decision Trees
• Select a feature and split the data into a binary tree
• Continue splitting with the available features
• Until:
  • Leaf node(s) are pure (only one class remains)
  • A maximum depth is reached
  • A performance metric is achieved
Splitting in Decision Trees
+IG(Dp, f) = I(Dp) − Σj (Nj / Np) · I(Dj)    (1)
+f : the feature used to perform the split
+Dp : the data set of the parent node
+Dj : the data set of the j-th child node
+I : the impurity measure
+Np : the total number of samples at the parent node
+Nj : the number of samples in the j-th child node
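A minimal sketch of equation (1), assuming the impurity measure is passed in as a callable (the function names here are illustrative; Gini, entropy, and misclassification error are sketched a little further below):

```python
# Sketch of equation (1): information gain = parent impurity minus the
# weighted impurities of the child nodes.
def information_gain(parent, children, impurity):
    """parent: 1-D array of class labels at the parent node.
    children: list of 1-D label arrays, one per child node.
    impurity: a node impurity function (Gini, entropy, ...)."""
    n_parent = len(parent)
    weighted_child = sum(len(c) / n_parent * impurity(c) for c in children)
    return impurity(parent) - weighted_child
```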
Splitting in Decision Trees
+information gain is simply the difference between the impurity of the parent node and the weighted sum of the child node impurities
+the lower the impurity of the child nodes, the larger the information gain
+for simplicity and to reduce the combinatorial search space, most libraries (including scikit-learn) implement binary decision trees - each parent node is split into two child nodes, Dleft and Dright:
+IG(Dp, f) = I(Dp) − (Nleft / Np) · I(Dleft) − (Nright / Np) · I(Dright)    (2)
+there are 3 common impurity measures or splitting criteria used in binary decision trees:
+Gini impurity (IG), entropy (IH), and misclassification error (IE)
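Hedged sketches of the three impurity measures named above, each computed from the class labels that fall into a node (the function names are my own):

```python
# Sketch of the three impurity measures for a node, given its class labels.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def misclassification_error(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - p.max()
```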
Sample Dataset
+A simple simulation with the Heart Disease data set, which has 303 rows and 13 attributes. The target consists of 138 rows with value 0 and 165 rows with value 1
+We use the sex, fbs (fasting blood sugar), exang (exercise induced angina), and target attributes
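A loading sketch, assuming the data sits in a local CSV file named heart.csv with the usual UCI column names (the file name and column names are assumptions, not given in the slides):

```python
# Sketch: load the heart disease data and keep the attributes used here.
import pandas as pd

df = pd.read_csv("heart.csv")            # assumed local file
data = df[["sex", "fbs", "exang", "target"]]

print(data.shape)                        # expected: (303, 4)
print(data["target"].value_counts())     # expected: 165 ones, 138 zeros
```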
Gini Impurity
+used by the CART (classification and regression tree)
algorithm for classification trees
+a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset
+IG(t) = 1 − Σj p(j|t)²
+where j indexes the classes present in the node and p(j|t) is the proportion of class j in node t
Gini Impurity
+determine which separation is best by measuring and comparing the Gini Impurity of each attribute
+the attribute with the lowest Gini Impurity on the first iteration becomes the root node
Gini Impurity – Sex attribute

Calculate the total Gini Impurity as a weighted average: the left node represents 138 patients while the right node represents 165 patients.
Gini Impurity – Fbs (fasting blood sugar) attribute

Calculate the total Gini Impurity as a weighted average: the left node represents 138 patients while the right node represents 165 patients.
Gini Impurity – Exang (exercise induced angina) attribute

Calculate the total Gini Impurity as a weighted average: the left node represents 138 patients while the right node represents 165 patients.

Fbs (fasting blood sugar) has the lowest total Gini Impurity, so we will use it as the Root Node.
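A sketch of the weighted-average ("total") Gini computation used in these comparisons; the per-node class counts below are placeholders, since the slide's contingency tables are not reproduced in the text:

```python
# Sketch: total (weighted) Gini impurity for one candidate binary split,
# computed from per-node class counts. The counts shown are placeholders.
def gini_from_counts(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(left_counts, right_counts):
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini_from_counts(left_counts) \
         + (n_right / n) * gini_from_counts(right_counts)

# e.g. left node: [count of target=0, count of target=1]; right node likewise
print(weighted_gini([40, 60], [98, 105]))  # placeholder counts
```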
Entropy
+used by the ID3, C4.5 and C5.0 tree-generation
algorithms
+information gain is based on the concept of entropy; the entropy measure is
+IH(t) = − Σj p(j|t) · log2 p(j|t)
+where j indexes the classes present in the node and p(j|t) is the proportion of class j in node t
Entropy
+compute the entropy of the Target attribute first
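Using the class counts given earlier (138 zeros and 165 ones), the target entropy can be checked as follows:

```python
# Entropy of the Target attribute, from the counts stated earlier
# (138 patients with target = 0, 165 with target = 1).
import math

n0, n1 = 138, 165
n = n0 + n1
p0, p1 = n0 / n, n1 / n
h_target = -(p0 * math.log2(p0) + p1 * math.log2(p1))
print(round(h_target, 4))  # approximately 0.994
```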
Entropy – Sex attribute
Entropy – Fbs attribute
Entropy – Exang attribute

Fbs (fasting blood sugar) yields the highest information gain (equivalently, the lowest weighted entropy of its child nodes), so we will use it as the Root Node. This matches the result we obtained from Gini Impurity.
Misclassification Impurity
+also called classification error: IE(t) = 1 − max_i p(i|t)
+this index is usually not the best choice for splitting because it is not particularly sensitive to differences between probability distributions
Classification Error vs Entropy
Classification error:
+ a flat function with its maximum at the center
+ the center represents ambiguity, a 50/50 split
+ splitting metrics favor results that are furthest away from the center
Entropy:
+ has the same maximum, but is curved
+ the curvature allows splitting to continue until the nodes are pure
Classification Error vs Entropy
+Classification error
+With classification error, the function is flat
+The final weighted-average classification error of the children can be identical to that of the parent
+Resulting in zero apparent gain and premature stopping
Classification Error vs Entropy
+Entropy gain
+The function has a "bulge" (it is strictly concave)
+This allows the weighted-average entropy of the children to be less than that of the parent
+Resulting in positive information gain and continued splitting
Gini Index
+In practice, the Gini index is often used for splitting
+The function is similar to entropy: it also has a bulge
+It does not contain a logarithm
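A small sketch that compares the three criteria for a binary node as a function of the positive-class proportion p, which makes the "bulge" versus "flat" distinction above concrete:

```python
# Sketch: compare the three splitting criteria for a binary node as a
# function of p, the proportion of the positive class in the node.
import numpy as np

p = np.linspace(0.01, 0.99, 99)
entropy_vals = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini_vals = 1 - (p ** 2 + (1 - p) ** 2)
error_vals = 1 - np.maximum(p, 1 - p)

# All three peak at p = 0.5; entropy and Gini are strictly concave
# ("bulged"), while the misclassification error is piecewise linear (flat).
for name, vals in [("entropy", entropy_vals), ("gini", gini_vals),
                   ("misclassification", error_vals)]:
    print(name, round(vals.max(), 3))
```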
How to split on continuous data?
1. Order the data by the continuous feature, ascending
2. Calculate the average of each pair of adjacent values; these averages are the candidate thresholds
3. Calculate the information gain for each candidate threshold and split on the best one (see the sketch below)
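A sketch of these three steps, assuming a node impurity function such as the gini() sketched earlier is passed in:

```python
# Sketch: choosing a split threshold for a continuous feature.
import numpy as np

def best_threshold(values, labels, impurity):
    """values: 1-D continuous feature; labels: class labels;
    impurity: a node impurity function such as gini() above."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    # Candidate thresholds: midpoints between adjacent sorted values
    candidates = (values[:-1] + values[1:]) / 2.0
    parent_imp, n = impurity(labels), len(labels)
    best_gain, best_t = -np.inf, None
    for t in np.unique(candidates):
        left, right = labels[values <= t], labels[values > t]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = parent_imp - (len(left) / n) * impurity(left) \
                          - (len(right) / n) * impurity(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```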
Decision Trees are High Variance

• Problem: decision trees tend to overfit
• Small changes in the data can greatly change the prediction - high variance
• Solution: prune the trees

• Variance is the variability of a model's prediction for a given data point; it tells us how spread out the predictions are
• A model with high variance pays too much attention to the training data and does not generalize to data it has not seen before
• Such models perform very well on the training data but have high error rates on test data
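A quick illustration of the overfitting problem on synthetic data: an unrestricted tree fits the training set almost perfectly but loses accuracy on held-out data, while limiting the depth narrows the gap:

```python
# Sketch: an unpruned (deep) tree overfits - near-perfect training accuracy
# but lower test accuracy - while a depth-limited tree narrows the gap.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = grow until leaves are pure (high variance)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```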
Pruning Decision Trees

• How to decide which leaves to prune?
• Solution: prune based on a classification error threshold

E(t) = 1 − max_i p(i|t)
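In scikit-learn, pruning is exposed as minimal cost-complexity pruning (the ccp_alpha parameter) rather than the per-leaf error threshold above; a hedged sketch of that related mechanism:

```python
# Sketch: pruning in scikit-learn via cost-complexity pruning (ccp_alpha),
# a related but different criterion from the error threshold E(t) above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate alphas along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test acc={tree.score(X_te, y_te):.2f}")
```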
Strengths of Decision Trees

• Easy to interpret and implement: "if ... then ... else" logic
• Handle any data category: binary, ordinal, continuous
• No preprocessing or scaling required
