07 - ML - Decision Tree
Machine Learning
Decision Tree
Decision Tree Representation
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
[Figure: the PlayTennis decision tree: Outlook at the root; Sunny → Humidity test (High: No, Normal: Yes), Overcast → Yes, Rain → Wind test (Strong: No, Weak: Yes)]
Alternative Decision Tree for PlayTennis
[Figure: an alternative decision tree for PlayTennis that tests Temperature at the root, with Humidity, Wind, and Outlook tested further down]
What is different? The sequence of attributes influences the size and shape of the tree.
Occam’s Principle
“If two theories explain the facts equally well, then the simpler theory is preferred.”
Decision Trees
Decision tree representation:
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification
[Figure: a small example tree; internal nodes test an attribute (B), each branch corresponds to a yes/no value, and the leaves assign the classifications NO and YES]
When to Consider Decision Trees
Examples:
Medical diagnosis
Text classification
Credit risk analysis
Top-Down Induction of Decision Trees, ID3
Example ID3
• Look at current training set S
Example ID3
• Tree
Example – Resulting Tree
[Figure: the resulting tree: Outlook at the root; Sunny → Humidity test (High: No, Normal: Yes), Overcast → Yes, Rain → Wind test (Strong: No, Weak: Yes)]
ID3 – Intermediate Summary
Which attribute is best?
Attribute   Subset for value 1    Subset for value 2
x1          {+, +, −, −, +}       {−, +, +, −, −}
x2          {+}                   {+, −, −, +, −, +, +, −, −}
x3          {+, +, +, +, −}       {−, −, −, −, +}
• No attribute is perfect
• Which one to choose?
Entropy
[Figure: entropy of a two-class set S as a function of the proportion of positive examples; it is 0 for a pure set and peaks at 1 when the proportion is 0.5]
• Entropy measures the impurity of S
• $Entropy(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$, where $p_{+}$ and $p_{-}$ are the proportions of positive and negative examples in S
Entropy
S = {+++++++++, −−−−−} = {9+, 5−}. Entropy(S) = ?
$Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$
S = {+++++++, −−−−−−−} = {7+, 7−}. Entropy(S) = ?
$Entropy(S) = 1$
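These values can be checked with a short helper; the following is a minimal sketch in Python, and the function name is illustrative rather than taken from the slides:

import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))    # ~0.940, the {9+, 5-} set above
print(entropy(7, 7))    # 1.0, the {7+, 7-} set above
print(entropy(14, 0))   # 0.0, a completely pure set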
Information Gain
• Measuring attribute x creates subsets S₁ and S₂ with different entropies
• Taking the weighted average of Entropy(S₁) and Entropy(S₂) gives the conditional entropy Entropy(S|x), i.e. in general:

$Entropy(S|x) = \sum_{v \in Values(x)} \frac{|S_v|}{|S|} Entropy(S_v)$

$Gain(S, x) = Entropy(S) - \sum_{v \in Values(x)} \frac{|S_v|}{|S|} Entropy(S_v)$

Values(x): the set of all possible values for the attribute x
S_v: the subset of S for which x has value v
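As a minimal sketch of this formula in Python, assume each training example is a (features, label) pair where features is a dict mapping attribute names to values; all names here are illustrative:

import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels (works for any number of classes)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(examples, attribute):
    """Information gain of splitting `examples` on `attribute`."""
    labels = [label for _, label in examples]
    base = entropy_of(labels)                    # Entropy(S)
    remainder = 0.0                              # the weighted sum over the subsets S_v
    for value in {features[attribute] for features, _ in examples}:
        subset = [label for features, label in examples
                  if features[attribute] == value]
        remainder += len(subset) / len(examples) * entropy_of(subset)
    return base - remainder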
Example - Training Set
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Example
$Gain(S, x) = Entropy(S) - \sum_{v \in Values(x)} \frac{|S_v|}{|S|} Entropy(S_v)$
Selecting the Next Attribute
• For whole training set:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
→ Outlook should be used to split the training set!
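Running the gain sketch from above on the PlayTennis table reproduces this ranking; the encoding of the table as (features, label) pairs is only for illustration:

playtennis = [
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong"}, "No"),
]

for attribute in ("Outlook", "Humidity", "Wind", "Temperature"):
    print(attribute, round(gain(playtennis, attribute), 3))
# Outlook has the largest gain (about 0.25), so it is chosen as the root.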
Next step in growing the decision tree
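To make this step concrete, here is a rough sketch of the recursive top-down induction, reusing the gain function and the playtennis data from the earlier snippets; the Node class and all names are illustrative assumptions, not the slides' notation:

from collections import Counter

class Node:
    """A tree node: a leaf if `attribute` is None, otherwise an internal test."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute      # attribute tested at this node (None for a leaf)
        self.label = label              # majority class of the examples reaching this node
        self.children = {}              # attribute value -> child Node

def id3(examples, attributes):
    """Grow the tree top-down: choose the best attribute, split, and recurse."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return Node(label=majority)                 # pure subset, or no attributes left
    best = max(attributes, key=lambda a: gain(examples, a))   # e.g. Outlook at the root
    node = Node(attribute=best, label=majority)
    remaining = [a for a in attributes if a != best]
    for value in {features[best] for features, _ in examples}:
        subset = [(f, y) for f, y in examples if f[best] == value]
        node.children[value] = id3(subset, remaining)
    return node

# tree = id3(playtennis, ["Outlook", "Temperature", "Humidity", "Wind"])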
The Resulting Decision Tree & Its Rules
Some issues: Real-Valued Attributes
Temperature = 82.5
Create a discrete attribute to test the continuous one:
(Temperature > 54) = true or false
Sort the attribute values that occur in the training set and place candidate thresholds where the class label changes:
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
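A minimal sketch of how such candidate thresholds can be found, placing them at the midpoints between adjacent sorted values whose class labels differ (the function name is illustrative):

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
            if l1 != l2]

temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))
# [54.0, 85.0]  -> e.g. the boolean test (Temperature > 54)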
Some issues: Noise
Consider adding a noisy (= wrongly labeled) training example #15:
⟨Sunny, Mild, Normal, Weak⟩, PlayTennis = No
i.e. Outlook = Sunny and Humidity = Normal, yet the label is No.
What effect does this have on the earlier tree?
[Figure: the earlier PlayTennis tree with Outlook at the root]
Some issues: Overfitting
[Figure: the PlayTennis tree, illustrating overfitting]
[Mitchell, 1997]
Overfitting: solutions
Some solutions:
• Stop learning early: stop growing the tree before it fits the training data perfectly.
• Prune the full tree: grow the tree to its full size, and then post-prune it.
It is hard to decide when to stop learning early. Post-pruning empirically results in better performance, but:
• How do we decide the right size of the tree?
• When do we stop pruning?
We can use a validation set to guide pruning, e.g. reduced-error pruning and rule post-pruning (see the sketch below).
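As a concrete illustration of pruning with a validation set, here is a rough sketch of reduced-error pruning; it assumes the Node type from the id3 sketch above and a validation set of (features, label) pairs, and all helper names are illustrative:

def classify(node, features):
    """Follow the tree until a leaf (or an unseen value) and return its label."""
    while node.attribute is not None:
        child = node.children.get(features[node.attribute])
        if child is None:
            break                        # unseen value: fall back to the majority label
        node = child
    return node.label

def accuracy(root, validation):
    return sum(classify(root, f) == y for f, y in validation) / len(validation)

def reduced_error_prune(root, node, validation):
    """Bottom-up: replace a subtree by a leaf whenever validation accuracy does not drop."""
    for child in node.children.values():
        reduced_error_prune(root, child, validation)
    if node.attribute is None:
        return                           # already a leaf
    before = accuracy(root, validation)
    saved = (node.attribute, node.children)
    node.attribute, node.children = None, {}              # tentatively prune to a leaf
    if accuracy(root, validation) < before:
        node.attribute, node.children = saved             # pruning hurt: restore the subtree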
Summary
Advantages & Disadvantages
Advantages
• It is simple to understand, since it mirrors the process a human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think through all the possible outcomes of a problem.
• It requires less data cleaning than many other algorithms.
Advantages & Disadvantages
Disadvantages
• A decision tree can contain many layers, which makes it complex.
• It may suffer from overfitting, which can be mitigated by the Random Forest algorithm.
• With more class labels, the computational complexity of the decision tree may increase.
Random forests
Random forests (RF) is a method by Leo Breiman (2001) for both classification and regression.
Main idea: the prediction is based on a combination of many decision trees, by taking the average of all the individual predictions.
Each tree in RF is simple but random.
Each tree is grown differently, depending on the choices of attributes and of training data.
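A rough sketch of this idea, assuming some learn_tree(examples, attributes) function (for instance the id3 sketch above) and a per-tree prediction function; note that Breiman's method re-draws the random attribute subset at every split, whereas this simplified version draws it once per tree:

import random
from collections import Counter

def grow_forest(examples, attributes, learn_tree, n_trees=100, n_attrs=None):
    """Grow n_trees trees, each from a bootstrap sample and a random attribute subset."""
    if n_attrs is None:
        n_attrs = max(1, int(len(attributes) ** 0.5))      # a common default choice
    forest = []
    for _ in range(n_trees):
        bootstrap = [random.choice(examples) for _ in examples]   # sample with replacement
        subset = random.sample(attributes, n_attrs)               # random attributes for this tree
        forest.append(learn_tree(bootstrap, subset))
    return forest

def forest_predict(forest, features, predict_one):
    """Combine the trees: majority vote for classification (average for regression)."""
    votes = Counter(predict_one(tree, features) for tree in forest)
    return votes.most_common(1)[0][0]

# forest = grow_forest(playtennis, ["Outlook", "Temperature", "Humidity", "Wind"], id3)
# forest_predict(forest, {"Outlook": "Sunny", "Temperature": "Cool",
#                         "Humidity": "High", "Wind": "Strong"}, classify)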
Random forests
RF is currently one of the most popular and accurate methods [Fernández-Delgado et al., 2014].
It is also very general.
RF can be implemented easily and efficiently.
It can work on problems of very high dimension without overfitting.
How Random Forests Work
RF: three basic ingredients
Exercise