CSCE 478/878 Machine Learning
Lecture 3: Learning Decision Trees
Stephen Scott

Introduction

Decision trees form a simple, easily interpretable hypothesis
Quick to evaluate new instances
Effective “off-the-shelf” learning method
Can be combined with boosting, including using “stumps”

[Figure: example decision tree for PlayTennis, branching on Outlook and then testing Humidity (High/Normal) and Wind (Strong/Weak), with No/Yes leaves]
Tree Representation

[Figure: a decision tree testing x1 > w10 and then x2 > w20, partitioning the (x1, x2) plane into regions labeled C1 and C2]

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

How would we represent:
∧, ∨, XOR
(A ∧ B) ∨ (C ∧ ¬D ∧ E)
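As a side illustration (not from the slides), the sketch below encodes a small tree as nested Python dicts and evaluates it; the node format and the boolean attributes A and B are assumptions made only for this example. XOR shows why every root-to-leaf path may need to test several attributes.

```python
# Minimal sketch of a decision tree as nested dicts:
# internal node = {"attr": name, "branches": {value: subtree}}, leaf = class label.

def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree

# XOR requires testing both attributes on every path:
xor_tree = {
    "attr": "A",
    "branches": {
        False: {"attr": "B", "branches": {False: False, True: True}},
        True:  {"attr": "B", "branches": {False: True, True: False}},
    },
}

print(classify(xor_tree, {"A": True, "B": False}))  # True
print(classify(xor_tree, {"A": True, "B": True}))   # False
```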
High-Level Algorithm

Main loop:
1. A ← the “best” decision attribute for the next node m
2. Assign A as decision attribute for m
3. For each value of A, create a new descendant of m
4. Sort (partition) training examples over the children based on A’s value
5. If training examples perfectly classified, Then STOP, Else recursively iterate over new child nodes

Which attribute is best?

[Figure: two candidate splits of a node holding [29+,35−] examples, one on A1 = t/f and one on A2 = t/f]

Entropy

[Figure: plot of Entropy(S) against the proportion p+ of positive examples, from 0 at p+ = 0 up to 1 at p+ = 0.5 and back to 0 at p+ = 1]

Xm is a sample of training examples reaching node m
p_m^+ is the proportion of positive examples in Xm
p_m^− is the proportion of negative examples in Xm
Entropy Im measures the impurity of Xm:

    I_m ≡ −p_m^+ log2 p_m^+ − p_m^− log2 p_m^−

or, for K classes,

    I_m ≡ −∑_{i=1}^{K} p_m^i log2 p_m^i
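A minimal sketch of the entropy computation above (my own illustration, assuming the labels reaching a node are given as a plain list):

```python
import math
from collections import Counter

def entropy(labels):
    """I_m = -sum_i p_m^i * log2(p_m^i), from the class proportions at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The [29+,35-] node from the slide:
print(round(entropy(["+"] * 29 + ["-"] * 35), 3))  # ~0.994
```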
Learning Algorithm

Now can look for an attribute A that, when used to partition Xm by value, produces the most pure (lowest-entropy) subsets
Weight each subset by relative size
E.g., size-3 subsets should carry less influence than size-300 ones
Let N_mj = number of instances in Xm taking value v_j of attribute A, j ∈ {1, ..., n}
Let N^i_mj = number of these instances with label i ∈ {1, ..., K}
Let p^i_mj = N^i_mj / N_mj
Then the total impurity of the split is

    I′_m(A) ≡ −∑_{j=1}^{n} (N_mj / N_m) ∑_{i=1}^{K} p^i_mj log2 p^i_mj
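A sketch of this size-weighted impurity, under the same assumptions as the earlier snippets (examples as dicts keyed by attribute name, entropy() as defined above):

```python
from collections import defaultdict

def split_impurity(examples, labels, attr):
    """I'_m(A): entropy of each subset produced by splitting on `attr`,
    weighted by the subset's share of the node's examples."""
    groups = defaultdict(list)
    for x, y in zip(examples, labels):
        groups[x[attr]].append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())
```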
Example Run

Training examples:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Which attribute is the best classifier? Comparing Humidity to Wind:

Splitting S: [9+,5−] (E = 0.940) on Humidity gives [3+,4−] for High (E = 0.985) and [6+,1−] for Normal (E = 0.592)
Splitting S: [9+,5−] (E = 0.940) on Wind gives [6+,2−] for Weak (E = 0.811) and [3+,3−] for Strong (E = 1.00)
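The comparison above can be checked numerically; a small sketch using the entropy() helper from the earlier snippet (not code from the slides), with the subsets written out directly as label lists:

```python
S = ["+"] * 9 + ["-"] * 5                                   # [9+,5-], E = 0.940

def gain(parent, subsets):
    """Information gain: parent entropy minus size-weighted subset entropy."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

humidity = [["+"] * 3 + ["-"] * 4, ["+"] * 6 + ["-"] * 1]   # High, Normal
wind     = [["+"] * 6 + ["-"] * 2, ["+"] * 3 + ["-"] * 3]   # Weak, Strong

print(round(gain(S, humidity), 3))  # ~0.151 -> Humidity is the better split
print(round(gain(S, wind), 3))      # ~0.048
```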
Example Run (cont’d)

[Figure: partially learned tree with Outlook at the root; the Overcast branch is a Yes leaf, while the Sunny and Rain branches are still undecided]

Which attribute should be tested here?

S_sunny = {D1, D2, D8, D9, D11}

Gain(S_sunny, Humidity) = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(S_sunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(S_sunny, Wind) = .970 − (2/5) 1.0 − (3/5) .918 = .019

Equivalently, in terms of impurity after the split, with Xm = {D1, D2, D8, D9, D11}:

I′_m(Humidity) = (3/5) 0.0 + (2/5) 0.0 = 0.0
I′_m(Temp) = (2/5) 0.0 + (2/5) 1.0 + (1/5) 0.0 = 0.400
I′_m(Wind) = (2/5) 1.0 + (3/5) 0.918 = 0.951

Regression Trees

For regression, measure the impurity of node m by the variance of its labels:

    E_m ≡ (1/N_m) ∑_{(x^t, r^t) ∈ X_m} (r^t − g_m)^2,

where g_m is the mean (or median) label in X_m
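A one-function sketch of this variance impurity (illustrative; it takes g_m to be the node's mean label):

```python
def node_variance(targets):
    """E_m = (1/N_m) * sum_t (r^t - g_m)^2, with g_m the node's mean label."""
    g_m = sum(targets) / len(targets)
    return sum((r - g_m) ** 2 for r in targets) / len(targets)

print(node_variance([2.0, 4.0, 6.0]))  # 8/3 ~ 2.67
```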
Regression Trees (cont’d)

Now can adapt Eq. (9.8) from classification to regression:

    E′_m(A) ≡ ∑_{j=1}^{n} (N_mj / N_m) [ (1/N_mj) ∑_{(x^t, r^t) ∈ X_mj} (r^t − g_mj)^2 ]
            = (1/N_m) ∑_{j=1}^{n} ∑_{(x^t, r^t) ∈ X_mj} (r^t − g_mj)^2,        (9.14)

where j iterates over the values of attribute A
When the variance of a subset is sufficiently low, insert a leaf with the mean or median label as its constant value

Continuous-Valued Attributes

Use a threshold to map continuous to boolean, e.g. (Temperature > 72.3) ∈ {t, f}

Temperature:  40  48  60  72  80  90
PlayTennis:   No  No  Yes Yes Yes No

Can show that the threshold minimizing impurity must lie between two adjacent attribute values in X at which the label changed, so try all such values, e.g., (48 + 60)/2 = 54 and (80 + 90)/2 = 85
Now (dynamically) replace the continuous attribute with boolean attributes Temperature>54 and Temperature>85 and run the algorithm normally
Other options: Split into multiple intervals rather than just two
Attributes with Many Values

Problem: if attribute A has many values, it might artificially minimize I′_m(A)
E.g., if Date is an attribute, I′_m(A) will be low because several very small subsets will be created
One approach: penalize A with a measure of split information, which measures how broadly and uniformly attribute A splits the data:

    −∑_{j=1}^{n} (N_mj / N_m) log2 (N_mj / N_m)

Missing Attribute Values

What if a training example is missing a value of A?
Use it anyway (sift it through the tree):
If node m tests A, assign the most common value of A among the other training examples sifted to m
Or assign the most common value of A among other examples with the same target value (either overall or at m)
Or assign a probability p_j to each possible value v_j of A, and assign fraction p_j of the example to each descendant in the tree
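A sketch of the split-information penalty (illustrative; one common use, which the slides do not spell out here, is to divide the information gain by this quantity):

```python
import math
from collections import Counter

def split_information(values):
    """-sum_j (N_mj/N_m) * log2(N_mj/N_m) over the attribute-value counts at a
    node; large when A splits the data broadly and uniformly."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# A Date-like attribute with a distinct value per example scores high:
print(split_information(["d1", "d2", "d3", "d4"]))              # 2.0
print(split_information(["High", "High", "Normal", "Normal"]))  # 1.0
```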
Inductive Bias

Hypothesis space H is complete, in that any function can be represented
Thus inductive bias does not come from restricting H, but from the greedy search’s preference for small trees with high-gain attributes near the root

Overfitting

Consider adding noisy training example #15:
    Sunny, Hot, Normal, Strong, PlayTennis = No
What effect on the earlier tree?

[Figure: the earlier decision tree for PlayTennis, rooted at Outlook]

Consider the error of hypothesis h over
    training data (empirical error): error_train(h)
    the entire distribution D of data (generalization error): error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that

    error_train(h) < error_train(h′)   and   error_D(h) > error_D(h′)

[Figure: accuracy on training data and on test data versus size of tree (number of nodes)]
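As an optional illustration of the accuracy-versus-size effect (assuming scikit-learn is available; the slides themselves use no library):

```python
# Grow trees of increasing size on noisy data and compare training vs. held-out
# accuracy; larger trees tend to fit the training set better while test accuracy
# eventually flattens or drops.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for leaves in (2, 5, 10, 20, 50, 100):
    h = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0).fit(X_tr, y_tr)
    print(leaves, round(h.score(X_tr, y_tr), 3), round(h.score(X_te, y_te), 3))
```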
Tree Pruning

To prevent trees from growing too much and overfitting the data, we can prune them
In the spirit of Occam’s Razor, minimum description length
In prepruning, we allow skipping a recursive call on set Xm and instead insert a leaf, even if Xm is not pure
    Can do this when entropy (or variance) is below a threshold (θ_I in pseudocode)
    Can do this when |Xm| is below a threshold, e.g., 5
In postpruning, we grow the tree until it has zero error on the training set and then prune it back afterwards
    First, set aside a pruning set not used in initial training
    Then repeat until pruning is harmful:
    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
    2. Greedily remove the one that most improves validation set accuracy

[Figure: accuracy on training data, on test data, and on test data during pruning, versus size of tree (number of nodes)]
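A sketch of prepruning grafted onto a simple recursive tree grower (illustrative; theta_I and min_size correspond to the thresholds described above, and entropy()/split_impurity() are the helpers from the earlier sketches):

```python
from collections import Counter

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def grow(examples, labels, attrs, theta_I=0.1, min_size=5):
    """Prepruning: stop recursing and insert a majority-label leaf when the node's
    entropy falls below theta_I or it holds fewer than min_size examples."""
    if not attrs or len(labels) < min_size or entropy(labels) < theta_I:
        return majority_label(labels)
    best = min(attrs, key=lambda a: split_impurity(examples, labels, a))
    branches = {}
    for v in {x[best] for x in examples}:
        idx = [i for i, x in enumerate(examples) if x[best] == v]
        branches[v] = grow([examples[i] for i in idx], [labels[i] for i in idx],
                           [a for a in attrs if a != best], theta_I, min_size)
    return {"attr": best, "branches": branches}
```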
Rule Postpruning

Convert the tree to an equivalent set of rules
Prune each rule independently of the others by removing selected preconditions (the ones whose removal improves accuracy the most)
Sort the final rules into the desired sequence for use

[Figure: final PlayTennis tree: Outlook at the root with branches Sunny, Overcast, Rain; Sunny leads to Humidity (High → No, Normal → Yes), Overcast leads to Yes, Rain leads to Wind (Strong → No, Weak → Yes)]
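A sketch of the tree-to-rules conversion step (my own illustration, continuing the nested-dict tree format used in the earlier sketches):

```python
def tree_to_rules(tree, conditions=()):
    """One rule per root-to-leaf path: a list of (attribute, value) tests plus
    the leaf's label."""
    if not isinstance(tree, dict):                      # leaf
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attr"], value),))
    return rules

# e.g., tree_to_rules(xor_tree) yields rules like ([("A", True), ("B", False)], True);
# each rule's preconditions can then be pruned independently on a validation set.
```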