CSCE 478/878 Machine Learning
Lecture 3: Learning Decision Trees
Stephen Scott

Introduction

Decision trees form a simple, easily interpretable hypothesis
Quick to evaluate new instances
Effective “off-the-shelf” learning method
Can be combined with boosting, including using “stumps”

[Figure: example decision tree for PlayTennis, branching on Outlook and then testing Humidity (High/Normal) and Wind (Strong/Weak), with No/Yes leaves]
Tree Representation

[Figure: a decision tree testing x1 > w10 and then x2 > w20, partitioning the (x1, x2) plane into regions labeled C1 and C2]

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

How would we represent:
∧, ∨, XOR
(A ∧ B) ∨ (C ∧ ¬D ∧ E)
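As a side illustration (not from the slides), the sketch below encodes a small tree as nested Python dicts and evaluates it; the node format and the boolean attributes A and B are assumptions made only for this example. XOR shows why every root-to-leaf path may need to test several attributes.

```python
# Minimal sketch of a decision tree as nested dicts:
# internal node = {"attr": name, "branches": {value: subtree}}, leaf = class label.

def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree

# XOR requires testing both attributes on every path:
xor_tree = {
    "attr": "A",
    "branches": {
        False: {"attr": "B", "branches": {False: False, True: True}},
        True:  {"attr": "B", "branches": {False: True, True: False}},
    },
}

print(classify(xor_tree, {"A": True, "B": False}))  # True
print(classify(xor_tree, {"A": True, "B": True}))   # False
```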
High-Level Algorithm

Main loop:
1. A ← the “best” decision attribute for the next node m
2. Assign A as decision attribute for m
3. For each value of A, create a new descendant of m
4. Sort (partition) training examples over the children based on A’s value
5. If training examples perfectly classified, Then STOP, Else recursively iterate over new child nodes

Which attribute is best?

[Figure: two candidate splits of a node holding [29+,35−] examples, one on A1 = t/f and one on A2 = t/f]

Entropy

[Figure: plot of Entropy(S) against the proportion p+ of positive examples, from 0 at p+ = 0 up to 1 at p+ = 0.5 and back to 0 at p+ = 1]

Xm is a sample of training examples reaching node m
p_m^+ is the proportion of positive examples in Xm
p_m^− is the proportion of negative examples in Xm
Entropy Im measures the impurity of Xm:

    I_m ≡ −p_m^+ log2 p_m^+ − p_m^− log2 p_m^−

or, for K classes,

    I_m ≡ −∑_{i=1}^{K} p_m^i log2 p_m^i
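A minimal sketch of the entropy computation above (my own illustration, assuming the labels reaching a node are given as a plain list):

```python
import math
from collections import Counter

def entropy(labels):
    """I_m = -sum_i p_m^i * log2(p_m^i), from the class proportions at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The [29+,35-] node from the slide:
print(round(entropy(["+"] * 29 + ["-"] * 35), 3))  # ~0.994
```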
Learning Algorithm

Now can look for an attribute A that, when used to partition Xm by value, produces the most pure (lowest-entropy) subsets
Weight each subset by relative size
E.g., size-3 subsets should carry less influence than size-300 ones
Let N_mj = number of instances in Xm taking value v_j of attribute A, j ∈ {1, ..., n}
Let N^i_mj = number of these instances with label i ∈ {1, ..., K}
Let p^i_mj = N^i_mj / N_mj
Then the total impurity of the split is

    I′_m(A) ≡ −∑_{j=1}^{n} (N_mj / N_m) ∑_{i=1}^{K} p^i_mj log2 p^i_mj
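A sketch of this size-weighted impurity, under the same assumptions as the earlier snippets (examples as dicts keyed by attribute name, entropy() as defined above):

```python
from collections import defaultdict

def split_impurity(examples, labels, attr):
    """I'_m(A): entropy of each subset produced by splitting on `attr`,
    weighted by the subset's share of the node's examples."""
    groups = defaultdict(list)
    for x, y in zip(examples, labels):
        groups[x[attr]].append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())
```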
Example Run

Training examples:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Which attribute is the best classifier? Comparing Humidity to Wind:

Splitting S: [9+,5−] (E = 0.940) on Humidity gives [3+,4−] for High (E = 0.985) and [6+,1−] for Normal (E = 0.592)
Splitting S: [9+,5−] (E = 0.940) on Wind gives [6+,2−] for Weak (E = 0.811) and [3+,3−] for Strong (E = 1.00)
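The comparison above can be checked numerically; a small sketch using the entropy() helper from the earlier snippet (not code from the slides), with the subsets written out directly as label lists:

```python
S = ["+"] * 9 + ["-"] * 5                                   # [9+,5-], E = 0.940

def gain(parent, subsets):
    """Information gain: parent entropy minus size-weighted subset entropy."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

humidity = [["+"] * 3 + ["-"] * 4, ["+"] * 6 + ["-"] * 1]   # High, Normal
wind     = [["+"] * 6 + ["-"] * 2, ["+"] * 3 + ["-"] * 3]   # Weak, Strong

print(round(gain(S, humidity), 3))  # ~0.151 -> Humidity is the better split
print(round(gain(S, wind), 3))      # ~0.048
```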
Example Run (cont’d)

[Figure: partially learned tree with Outlook at the root; the Overcast branch is a Yes leaf, while the Sunny and Rain branches are still undecided]

Which attribute should be tested here?

S_sunny = {D1, D2, D8, D9, D11}

Gain(S_sunny, Humidity) = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(S_sunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(S_sunny, Wind) = .970 − (2/5) 1.0 − (3/5) .918 = .019

Equivalently, in terms of impurity after the split, with Xm = {D1, D2, D8, D9, D11}:

I′_m(Humidity) = (3/5) 0.0 + (2/5) 0.0 = 0.0
I′_m(Temp) = (2/5) 0.0 + (2/5) 1.0 + (1/5) 0.0 = 0.400
I′_m(Wind) = (2/5) 1.0 + (3/5) 0.918 = 0.951

Regression Trees

For regression, measure the impurity of node m by the variance of its labels:

    E_m ≡ (1/N_m) ∑_{(x^t, r^t) ∈ X_m} (r^t − g_m)^2,

where g_m is the mean (or median) label in X_m
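A one-function sketch of this variance impurity (illustrative; it takes g_m to be the node's mean label):

```python
def node_variance(targets):
    """E_m = (1/N_m) * sum_t (r^t - g_m)^2, with g_m the node's mean label."""
    g_m = sum(targets) / len(targets)
    return sum((r - g_m) ** 2 for r in targets) / len(targets)

print(node_variance([2.0, 4.0, 6.0]))  # 8/3 ~ 2.67
```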
Regression Trees (cont’d)

Now can adapt Eq. (9.8) from classification to regression:

    E′_m(A) ≡ ∑_{j=1}^{n} (N_mj / N_m) [ (1/N_mj) ∑_{(x^t, r^t) ∈ X_mj} (r^t − g_mj)^2 ]
            = (1/N_m) ∑_{j=1}^{n} ∑_{(x^t, r^t) ∈ X_mj} (r^t − g_mj)^2,        (9.14)

where j iterates over the values of attribute A
When the variance of a subset is sufficiently low, insert a leaf with the mean or median label as its constant value

Continuous-Valued Attributes

Use a threshold to map continuous to boolean, e.g. (Temperature > 72.3) ∈ {t, f}

Temperature:  40  48  60  72  80  90
PlayTennis:   No  No  Yes Yes Yes No

Can show that the threshold minimizing impurity must lie between two adjacent attribute values in X at which the label changed, so try all such values, e.g., (48 + 60)/2 = 54 and (80 + 90)/2 = 85
Now (dynamically) replace the continuous attribute with boolean attributes Temperature>54 and Temperature>85 and run the algorithm normally
Other options: Split into multiple intervals rather than just two
Attributes with Many Values

Problem: if attribute A has many values, it might artificially minimize I′_m(A)
E.g., if Date is an attribute, I′_m(A) will be low because several very small subsets will be created
One approach: penalize A with a measure of split information, which measures how broadly and uniformly attribute A splits the data:

    −∑_{j=1}^{n} (N_mj / N_m) log2 (N_mj / N_m)

Missing Attribute Values

What if a training example is missing a value of A?
Use it anyway (sift it through the tree):
If node m tests A, assign the most common value of A among the other training examples sifted to m
Or assign the most common value of A among other examples with the same target value (either overall or at m)
Or assign a probability p_j to each possible value v_j of A, and assign fraction p_j of the example to each descendant in the tree
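A sketch of the split-information penalty (illustrative; one common use, which the slides do not spell out here, is to divide the information gain by this quantity):

```python
import math
from collections import Counter

def split_information(values):
    """-sum_j (N_mj/N_m) * log2(N_mj/N_m) over the attribute-value counts at a
    node; large when A splits the data broadly and uniformly."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# A Date-like attribute with a distinct value per example scores high:
print(split_information(["d1", "d2", "d3", "d4"]))              # 2.0
print(split_information(["High", "High", "Normal", "Normal"]))  # 1.0
```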
Inductive Bias

Hypothesis space H is complete, in that any function can be represented
Thus inductive bias does not come from restricting H, but from the greedy search’s preference for small trees with high-gain attributes near the root

Overfitting

Consider adding noisy training example #15:
    Sunny, Hot, Normal, Strong, PlayTennis = No
What effect on the earlier tree?

[Figure: the earlier decision tree for PlayTennis, rooted at Outlook]

Consider the error of hypothesis h over
    training data (empirical error): error_train(h)
    the entire distribution D of data (generalization error): error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that

    error_train(h) < error_train(h′)   and   error_D(h) > error_D(h′)

[Figure: accuracy on training data and on test data versus size of tree (number of nodes)]
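As an optional illustration of the accuracy-versus-size effect (assuming scikit-learn is available; the slides themselves use no library):

```python
# Grow trees of increasing size on noisy data and compare training vs. held-out
# accuracy; larger trees tend to fit the training set better while test accuracy
# eventually flattens or drops.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for leaves in (2, 5, 10, 20, 50, 100):
    h = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0).fit(X_tr, y_tr)
    print(leaves, round(h.score(X_tr, y_tr), 3), round(h.score(X_te, y_te), 3))
```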
Tree Pruning

To prevent trees from growing too much and overfitting the data, we can prune them
In the spirit of Occam’s Razor, minimum description length
In prepruning, we allow skipping a recursive call on set Xm and instead insert a leaf, even if Xm is not pure
    Can do this when entropy (or variance) is below a threshold (θ_I in pseudocode)
    Can do this when |Xm| is below a threshold, e.g., 5
In postpruning, we grow the tree until it has zero error on the training set and then prune it back afterwards
    First, set aside a pruning set not used in initial training
    Then repeat until pruning is harmful:
    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
    2. Greedily remove the one that most improves validation set accuracy

[Figure: accuracy on training data, on test data, and on test data during pruning, versus size of tree (number of nodes)]
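A sketch of prepruning grafted onto a simple recursive tree grower (illustrative; theta_I and min_size correspond to the thresholds described above, and entropy()/split_impurity() are the helpers from the earlier sketches):

```python
from collections import Counter

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def grow(examples, labels, attrs, theta_I=0.1, min_size=5):
    """Prepruning: stop recursing and insert a majority-label leaf when the node's
    entropy falls below theta_I or it holds fewer than min_size examples."""
    if not attrs or len(labels) < min_size or entropy(labels) < theta_I:
        return majority_label(labels)
    best = min(attrs, key=lambda a: split_impurity(examples, labels, a))
    branches = {}
    for v in {x[best] for x in examples}:
        idx = [i for i, x in enumerate(examples) if x[best] == v]
        branches[v] = grow([examples[i] for i in idx], [labels[i] for i in idx],
                           [a for a in attrs if a != best], theta_I, min_size)
    return {"attr": best, "branches": branches}
```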
Rule Postpruning

Convert the tree to an equivalent set of rules
Prune each rule independently of the others by removing selected preconditions (the ones whose removal improves accuracy the most)
Sort the final rules into the desired sequence for use

[Figure: final PlayTennis tree: Outlook at the root with branches Sunny, Overcast, Rain; Sunny leads to Humidity (High → No, Normal → Yes), Overcast leads to Yes, Rain leads to Wind (Strong → No, Weak → Yes)]
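A sketch of the tree-to-rules conversion step (my own illustration, continuing the nested-dict tree format used in the earlier sketches):

```python
def tree_to_rules(tree, conditions=()):
    """One rule per root-to-leaf path: a list of (attribute, value) tests plus
    the leaf's label."""
    if not isinstance(tree, dict):                      # leaf
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attr"], value),))
    return rules

# e.g., tree_to_rules(xor_tree) yields rules like ([("A", True), ("B", False)], True);
# each rule's preconditions can then be pruned independently on a validation set.
```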