
Ho Chi Minh University of Banking

Department of Economic Mathematics

Machine Learning
Decision Tree

Vuong Trong Nhân (nhanvt@hub.edu.vn)


Outline
 Decision tree representation
 ID3 learning algorithm
 Which attribute is best?
 C4.5: real valued attributes
 Which hypothesis is best?
 Noise
 From Trees to Rules
 Miscellaneous

2
Decision Tree Representation
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

 Outlook, Temperature, etc.: attributes


 PlayTennis: class
 Shall I play tennis today?
3
Decision Tree for PlayTennis

Outlook
├─ Sunny → Humidity
│     ├─ High → No
│     └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
      ├─ Strong → No
      └─ Weak → Yes

4
Alternative Decision Tree for PlayTennis
Temperature
├─ Hot → {1, 2, 3, 13}: Humidity
│     ├─ Normal → {13}: YES
│     └─ High → {1, 2, 3}: Wind
│           ├─ Weak → {1, 3}: Outlook
│           │     ├─ Sunny → {1}: NO
│           │     └─ Overcast → {3}: YES
│           └─ Strong → {2}: NO
├─ Mild → {4, 8, 10, 11, 12, 14}: ...
└─ Cool → {5, 6, 7, 9}: ...

• What is different?
• The sequence of attributes influences the size and shape of the tree
5
Occam’s Principle

 Occam’s Principle:
“If two theories explain the
facts equally well, then the
simpler theory is preferred”

• Prefer the smallest tree that correctly classifies all training examples

6
Decision Trees
Decision tree representation:
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification

How would we represent ∧, ∨, XOR?

Example XOR (attributes A and B):

A
├─ yes → B
│     ├─ yes → NO
│     └─ no → YES
└─ no → B
      ├─ yes → YES
      └─ no → NO

7
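As an illustration (not part of the original slides), the XOR tree above can be written directly as nested attribute tests; A and B are taken here as boolean inputs:

```python
def xor_tree(a: bool, b: bool) -> bool:
    """The XOR decision tree above, written as nested attribute tests."""
    if a:                  # test attribute A first
        if b:              # on the A = yes branch, test B
            return False   # NO
        return True        # YES
    if b:                  # on the A = no branch, test B
        return True        # YES
    return False           # NO

# Matches the XOR truth table: exactly one of A, B being "yes" gives YES.
assert [xor_tree(a, b) for a in (True, False) for b in (True, False)] == [False, True, True, False]
```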
When to Consider Decision Trees

 Instances describable by attribute–value pairs


 Target function is discrete valued
 Disjunctive hypothesis may be required
 Possibly noisy training data
 Interpretable result of learning is required

 Examples:
 Medical diagnosis
 Text classification
 Credit risk analysis
8
Top-Down Induction of Decision Trees, ID3

 ID3 (Quinlan, 1986) operates on whole training set S


 Algorithm:
1. create a new node
2. If current training set is sufficiently pure:
• Label node with respective class
• We’re done
3. Else:
• x → the “best” decision attribute for current training set
• Assign x as decision attribute for node
• For each value of x, create new descendant of node
• Sort training examples to leaf nodes
• Iterate over new leaf nodes and apply algorithm recursively

9
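A minimal Python sketch of this recursive procedure, assuming the training set is a list of dicts with a "label" key and categorical attributes; the "best" attribute is chosen by information gain, which the following slides introduce. This is an illustration, not Quinlan's original implementation.

```python
from collections import Counter
import math

def entropy(examples):
    """Entropy of the class labels in a list of examples (dicts with a 'label' key)."""
    counts = Counter(e["label"] for e in examples)
    return -sum(c / len(examples) * math.log2(c / len(examples)) for c in counts.values())

def best_attribute(examples, attributes):
    """Attribute with the highest information gain on the current training set."""
    def gain(x):
        remainder = 0.0
        for v in {e[x] for e in examples}:
            subset = [e for e in examples if e[x] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder
    return max(attributes, key=gain)

def id3(examples, attributes):
    labels = {e["label"] for e in examples}
    if len(labels) == 1:                       # sufficiently pure: only one class left
        return labels.pop()
    if not attributes:                         # no attributes left: majority class
        return Counter(e["label"] for e in examples).most_common(1)[0][0]
    x = best_attribute(examples, attributes)   # the "best" decision attribute
    children = {}
    for v in {e[x] for e in examples}:         # one descendant per value of x
        subset = [e for e in examples if e[x] == v]
        children[v] = id3(subset, [a for a in attributes if a != x])
    return (x, children)                       # internal node: tested attribute + branches
```

Called as, for example, id3(rows, ["Outlook", "Temperature", "Humidity", "Wind"]) on the PlayTennis rows written as dicts, it returns a nested (attribute, branches) structure.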
Example ID3
• Look at current training set S

• Determine best attribute

• Split training set according to different values

10
Example ID3

• Tree

• Apply algorithm recursively

11
Example – Resulting Tree

Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes

12
ID3 – Intermediate Summary

• Recursive splitting of the training set


• Stop, if current training set is sufficiently pure
• ... What does "pure" mean? Can we allow for errors?
• What is the best attribute?
• How can we tell that the tree is really good?
• How shall we deal with continuous values?

13
Which attribute is best?

• Assume a training set {+, +, −, −, +, −, +, +, −, −} (only the class labels are given)
• Assume binary attributes x1, x2, and x3
• Produced splits:

Value 1 Value 2
x1 {+, +, −, −, + } {−, +, +, −, −}
x2 {+} {+, −, −, +, −, +, +, −, −}
x3 {+, +, +, +, −} {−, −, −, −, + }

• No attribute is perfect
• Which one to choose?

14
Entropy

• p⊕ is the proportion of positive examples
• p⊖ is the proportion of negative examples
• Entropy measures the impurity of S:

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

[Figure: Entropy(S) plotted against p⊕; it rises from 0 at p⊕ = 0 to 1.0 at p⊕ = 0.5 and falls back to 0 at p⊕ = 1]

• Information can be seen as the negative of entropy
15
Entropy

S = {+ + + + + + + + +, − − − − −} = {9+, 5−}. Entropy(S) = ?
Entropy(S) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.94

S = {+ + + + + + + +, − − − − − −} = {8+, 6−}. Entropy(S) = ?
Entropy(S) = −8/14 log2(8/14) − 6/14 log2(6/14) = 0.98

S = {+ + + + + + + + + + + + + +} = {14+}. Entropy(S) = ?
Entropy(S) = 0

S = {+ + + + + + +, − − − − − − −} = {7+, 7−}. Entropy(S) = ?
Entropy(S) = 1
16
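The four values above can be reproduced with a few lines of Python (an illustrative check, not part of the slides):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))    # 0.94
print(round(entropy(8, 6), 3))    # 0.985 (the slide rounds this to 0.98)
print(round(entropy(14, 0), 3))   # 0.0
print(round(entropy(7, 7), 3))    # 1.0
```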
Entropy

• All members of S belong to the same class
  → Entropy(S) = 0 (the purest set)

• The numbers of positive and negative examples are equal (p⊕ = p⊖ = 0.5)
  → Entropy(S) = 1 (maximum impurity)

• The numbers of positive and negative examples are unequal
  → Entropy is between 0 and 1
17
Information Gain
• Measuring attribute x creates subsets S1 and S2 with different entropies
• Taking the weighted mean of Entropy(S1) and Entropy(S2) gives the conditional entropy Entropy(S|x), i.e. in general:

  Entropy(S|x) = Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)

• → Choose the attribute that maximizes the difference:

  Gain(S, x) ≡ Entropy(S) − Entropy(S|x)

• Gain(S, x) = expected reduction in entropy due to partitioning on x:

  Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)
18
Information Gain: Definition

Gain(S, x) ≡ Entropy(S) − Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)

• Values(x): the set of all possible values for the attribute x
• S_v: the subset of S for which attribute x has value v

• Information Gain is a measure of the effectiveness of an attribute in classifying data.
• It is the expected reduction in entropy caused by partitioning the objects according to this attribute.

19
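As an illustrative check (not from the slides), the definition can be applied to the three candidate splits x1, x2, x3 from the "Which attribute is best?" slide, with the sets written simply as lists of '+' and '-' labels:

```python
from collections import Counter
import math

def entropy(labels):
    return -sum(c / len(labels) * math.log2(c / len(labels))
                for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Gain = Entropy(S) minus the size-weighted entropies of the subsets S_v."""
    remainder = sum(len(s) / len(parent) * entropy(s) for s in subsets)
    return entropy(parent) - remainder

S = list("++--+-++--")                          # the 10-example set
splits = {
    "x1": [list("++--+"), list("-++--")],
    "x2": [list("+"),     list("+--+-++--")],
    "x3": [list("++++-"), list("----+")],
}
for name, subsets in splits.items():
    print(name, round(information_gain(S, subsets), 3))
# x1 0.029, x2 0.108, x3 0.278 -> x3 reduces entropy the most
```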
Example - Training Set
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

20
Example
Gain(S, x) ≡ Entropy(S) − Σ_{v ∈ Values(x)} (|S_v| / |S|) · Entropy(S_v)

For the top node: S = {9+, 5−}, Entropy(S) = 0.94

Attribute Wind:
  S_weak = {6+, 2−}, |S_weak| = 8   (Weak days: D1, D8 → No; D3, D4, D5, D9, D10, D13 → Yes)
  S_strong = {3+, 3−}, |S_strong| = 6   (Strong days: D2, D6, D14 → No; D7, D11, D12 → Yes)

  Entropy(S_weak) = −6/8 log2(6/8) − 2/8 log2(2/8) = 0.81
  Entropy(S_strong) = 1

Expected entropy when splitting on attribute 'Wind':
  Entropy(S|Wind) = 8/14 · Entropy(S_weak) + 6/14 · Entropy(S_strong) = 0.89

Gain(S, Wind) = 0.94 − 0.89 ≈ 0.05
21
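The arithmetic above can be verified in a few lines (an illustrative sketch, not part of the slides):

```python
import math

def H(pos, neg):                              # binary entropy
    return -sum(c / (pos + neg) * math.log2(c / (pos + neg)) for c in (pos, neg) if c)

e_weak, e_strong = H(6, 2), H(3, 3)           # 0.811 and 1.0
e_cond = 8/14 * e_weak + 6/14 * e_strong      # Entropy(S | Wind) ≈ 0.892
print(round(H(9, 5) - e_cond, 3))             # Gain(S, Wind) ≈ 0.048
```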
Selecting the Next Attribute
• For the whole training set:
   Gain(S, Outlook) = 0.246
   Gain(S, Humidity) = 0.151
   Gain(S, Wind) = 0.048
   Gain(S, Temperature) = 0.029
→ Outlook should be used to split the training set!

• Further down in the tree, Entropy(S) is computed locally
• Usually, the tree does not have to be minimized
• This is one reason for the good performance of ID3!

22
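These four gains can be reproduced directly from the PlayTennis table; a sketch, assuming the data is typed in as plain tuples:

```python
from collections import Counter
import math

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),         ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),     ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),   ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),   ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    i = ATTRS[attr]
    remainder = 0.0
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for a in sorted(ATTRS, key=lambda a: gain(DATA, a), reverse=True):
    print(a, round(gain(DATA, a), 3))
# Outlook 0.247, Humidity 0.152, Wind 0.048, Temperature 0.029
# (the slide truncates the first two to 0.246 and 0.151)
```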
Next step in growing the decision tree

23
The Resulting Decision Tree & Its Rules

Read off the resulting tree, one rule per root-to-leaf path:

IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No
IF Outlook = Sunny AND Humidity = Normal THEN PlayTennis = Yes
IF Outlook = Overcast THEN PlayTennis = Yes
IF Outlook = Rain AND Wind = Strong THEN PlayTennis = No
IF Outlook = Rain AND Wind = Weak THEN PlayTennis = Yes

24
Some issues: Real-Valued Attributes
 Temperature = 82.5
 Create discrete attributes to test continuous:
 (Temperature > 54) = true or = false
 Sort attribute values that occur in training set:

Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No

 Determine points where the class changes


 Candidates are (48 + 60) / 2 and (80 + 90) / 2
 Select best one using info gain
 Implemented in the system C4.5 (successor of ID3)

25
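A sketch of the threshold-selection step on the six-example sequence above (illustrative only, not C4.5's actual code): candidate thresholds are the midpoints where the class changes, and each candidate is scored by information gain.

```python
import math

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

def entropy(ls):
    result = 0.0
    for c in set(ls):
        p = ls.count(c) / len(ls)
        result -= p * math.log2(p)
    return result

# Candidate thresholds: midpoints between adjacent values where the class changes.
candidates = [(a + b) / 2 for a, b, la, lb in
              zip(temps, temps[1:], labels, labels[1:]) if la != lb]   # [54.0, 85.0]

for t in candidates:
    below = [l for v, l in zip(temps, labels) if v <= t]
    above = [l for v, l in zip(temps, labels) if v > t]
    gain = entropy(labels) - (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
    print(t, round(gain, 3))
# 54.0 -> 0.459, 85.0 -> 0.191: the boolean test (Temperature > 54) is selected
```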
Some issues: Noise
 Consider adding noisy (=wrongly labeled) training example #15:
  Sunny, Mild, Normal, Weak, PlayTennis = No (i.e. Outlook = Sunny, Humidity = Normal)
  What effect on the earlier tree?

Outlook
├─ Sunny → Humidity
│     ├─ High → No
│     └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
      ├─ Strong → No
      └─ Weak → Yes

26
Some issues: Overfitting

Outlook
├─ Sunny → Humidity
│     ├─ High → No
│     └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
      ├─ Strong → No
      └─ Weak → Yes

• The algorithm will introduce a new test
• This is unnecessary, because the new example was erroneous due to the presence of noise
→ Overfitting corresponds to learning coincidental regularities
• Unfortunately, we generally don't know which examples are noisy
• ... and also not the amount, e.g. the percentage, of noisy examples
27
Some issues: Overfitting
 An example: continuing to grow the tree can improve accuracy on the training data while accuracy on the test data gets worse.

28
[Mitchell, 1997]
Overfitting: solutions
 Some solutions:
  Stop learning early: stop growing the tree before it fits the training data perfectly.
  Prune the full tree: grow the tree to its full size, and then post-prune it.
 It is hard to decide when to stop learning early.
 Post-pruning empirically results in better performance. But:
  How do we decide the right size of a tree?
  When do we stop pruning?
  We can use a validation set for pruning, e.g. reduced-error pruning or rule post-pruning (see the sketch below).
29
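As one possible illustration of post-pruning guided by a validation set (this sketch uses scikit-learn's cost-complexity pruning, not reduced-error pruning itself; the dataset is just an example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow the full tree, then evaluate a range of pruned versions on a held-out
# validation set and keep the one that generalises best.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best.get_n_leaves(), round(best.score(X_val, y_val), 3))
```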
Summary

 A decision tree is a Machine Learning algorithm that can perform both classification and regression tasks.
 A decision tree represents a function as a tree.
 Each decision tree can be interpreted as a set of IF-THEN rules.
 Decision trees have been used in many practical applications.

30
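For completeness, a minimal scikit-learn example (an illustration, not from the slides) that fits a small classification tree and prints it as nested IF-THEN style rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text prints the tree as nested threshold tests, i.e. IF-THEN style rules.
print(export_text(tree, feature_names=list(iris.feature_names)))
print(tree.predict(iris.data[:1]))   # class prediction for the first example
```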
Advantages & Disadvantages

 Advantages
 It is simple to understand, as it follows the same process a human follows when making a decision in real life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 It requires less data cleaning than many other algorithms.

31
Advantages & Disadvantages

 Disadvantages
 A decision tree can contain many layers, which makes it complex.
 It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
 With more class labels, the computational complexity of the decision tree may increase.

32
Random forests
 Random forests (RF) is a method by Leo Breiman
(2001) for both classification and regression.
 Main idea: the prediction is based on a combination of many decision trees, obtained by averaging all individual predictions.
 Each tree in RF is simple but random.
 Each tree is grown differently, depending on the choices of attributes and training data.

33
Random forests
 RF currently is one of the most popular and accurate
methods [Fernández-Delgado et al., 2014]
 It is also very general.
 RF can be implemented easily and efficiently.
 It can work with problems of very high dimensions,
without overfitting

34
How Random Forests Work

35
RF: three basic ingredients

 Randomization and no pruning:


 For each tree and at each node, we select randomly a subset of
attributes.
 Find the best split, and then grow appropriate subtrees.
 Every tree will be grown to its largest size without pruning.

 Combination: each final prediction is made by taking the average of all predictions of the individual trees.

 Bagging: the training set for each tree is generated by


sampling (with replacement) from the original data.

36
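A minimal scikit-learn sketch of these three ingredients (illustrative only; the per-node attribute subsampling and the bagging described above correspond to the max_features and bootstrap parameters, and the dataset is just an example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,      # combine many unpruned trees
    max_features="sqrt",   # random subset of attributes considered at each node
    bootstrap=True,        # bagging: each tree sees a bootstrap sample of the data
    random_state=0,
)
print(round(cross_val_score(rf, X, y, cv=5).mean(), 3))   # accuracy averaged over 5 folds
```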
Exercise

1. Build a decision tree.

2. Predict:
Customer (15, youth, medium, no, fair)
Customer (16, senior, low, yes, excellent)

37
