Decision Tree based Learning
Example
Decision tree representation (PlayTennis)
(Outlook = Sunny, Temp = Hot, Humidity = High, Wind = Strong) → No
Decision trees expressivity
Decision trees represent a disjunction of conjunctions of constraints on the attribute values:
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
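As a sketch of how such a tree reads as code, the PlayTennis tree above can be written as nested conditionals (the function name and string values are just illustrative, assuming the attribute names used in these slides):

def play_tennis(outlook, humidity, wind):
    """PlayTennis decision tree as nested conditionals.

    Predicts Yes exactly on the disjunction above:
    (Sunny and Normal humidity) or Overcast or (Rain and Weak wind).
    """
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    return None  # unseen attribute value

# The slide's example: Outlook=Sunny, Humidity=High, Wind=Strong -> No
print(play_tennis("Sunny", "High", "Strong"))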
Top-down induction of Decision Trees
ID3 (Quinlan, 1986) is a basic algorithm used to create decision trees.
Given a training set of examples, the algorithm for building the tree performs a search in the space of decision trees.
The construction of the tree is top-down, and the algorithm is greedy.
The fundamental question is: "which attribute should be tested next? Which attribute gives us the most information?"
Select the best attribute.
A descendant node is then created for each possible value of this attribute, and the data set is partitioned according to that value.
The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left.
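A minimal sketch of this top-down, greedy procedure (the dictionary-based tree representation and helper names are illustrative, not ID3's original implementation):

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from splitting on `attribute`."""
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes, target):
    """Greedy top-down induction of a decision tree (nested dicts)."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:        # all examples have the same class
        return labels[0]
    if not attributes:               # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # choose the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:   # one successor per value
        subset = [ex for ex in examples if ex[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, rest, target)
    return tree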
Which attribute is the best classifier?
A statistical property called information gain measures how
well a given attribute separates the training examples
Information gain uses the notion of entropy, commonly used in
information theory
Information gain = expected reduction of entropy
Entropy in binary classification
Entropy measures the impurity of a collection of examples. It
depends on the distribution of the random variable p.
S is a collection of training examples
p+ the proportion of positive examples in S
p– the proportion of negative examples in S
Entropy(S) = – p+ log2 p+ – p– log2 p–    [with the convention 0 log2 0 = 0]
Entropy ([14+, 0–]) = – 14/14 log2 (14/14) – 0 log2 (0) = 0
Entropy ([9+, 5–]) = – 9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2(7/14) – 7/14 log2(7/14) = 1/2 + 1/2 = 1    [log2(1/2) = –1]
Note: the log of a number < 1 is negative; for 0 ≤ p ≤ 1, 0 ≤ entropy ≤ 1.
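A quick check of these values (a small helper, just to reproduce the numbers above):

from math import log2

def binary_entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for n in (pos, neg):
        p = n / total
        if p > 0:                      # convention: 0 * log2(0) = 0
            result -= p * log2(p)
    return result

print(binary_entropy(14, 0))            # 0.0
print(round(binary_entropy(9, 5), 2))   # 0.94
print(binary_entropy(7, 7))             # 1.0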
Entropy in general
Entropy measures the amount of information in a random
variable
H(X) = – p+ log2 p+ – p– log2 p– X = {+, –}
for binary classification [two-valued random variable]
H(X) = – Σi=1..c pi log2 pi = Σi=1..c pi log2(1/pi),    X = {1, …, c}
for classification into c classes
Example: rolling a die with 8 equally probable sides
H(X) = – Σi=1..8 (1/8) log2(1/8) = – log2(1/8) = log2 8 = 3
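The same formula for c classes, checked on the 8-sided die example (the helper name is illustrative):

from math import log2

def entropy_general(probabilities):
    """H(X) = -sum_i p_i log2 p_i for a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Fair 8-sided die: eight outcomes with probability 1/8 each -> 3 bits
print(entropy_general([1 / 8] * 8))   # 3.0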
Information gain as entropy reduction
Information gain is the expected reduction in entropy caused by
partitioning the examples on an attribute.
The higher the information gain, the more effective the attribute is
in classifying the training data.
Expected reduction in entropy knowing A
Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
Values(A): the set of possible values for A
Sv: the subset of S for which A has value v
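The formula expressed directly in terms of the subset sizes |Sv| (a sketch; the count-based interface is an assumption made here for illustration):

from math import log2

def entropy_counts(pos, neg):
    """Entropy of a collection given its positive/negative counts."""
    total = pos + neg
    return -sum((n / total) * log2(n / total) for n in (pos, neg) if n > 0)

def gain(s_counts, subsets):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv).

    `s_counts` is (pos, neg) for S; `subsets` is a list of (pos, neg)
    pairs, one per value v of the attribute A.
    """
    s_size = sum(s_counts)
    remainder = sum((pos + neg) / s_size * entropy_counts(pos, neg)
                    for pos, neg in subsets)
    return entropy_counts(*s_counts) - remainder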
Example: expected information gain
Let
Values(Wind) = {Weak, Strong}
S = [9+, 5−]
SWeak = [6+, 2−]
SStrong = [3+, 3−]
Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − (8/14) Entropy(SWeak) − (6/14) Entropy(SStrong)
= 0.94 − (8/14) × 0.811 − (6/14) × 1.00
= 0.048
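Plugging in the counts above gives the same number (a direct arithmetic check, self-contained so it can be run on its own):

from math import log2

def h2(p, n):  # binary entropy from counts, as in the earlier sketch
    t = p + n
    return -sum((k / t) * log2(k / t) for k in (p, n) if k > 0)

# S = [9+, 5-], SWeak = [6+, 2-], SStrong = [3+, 3-]
print(round(h2(9, 5) - 8/14 * h2(6, 2) - 6/14 * h2(3, 3), 3))   # 0.048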
Which attribute is the best classifier?
First step: which attribute should be tested at the root?
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Outlook provides the best prediction for the target
Let's grow the tree:
add to the tree a successor for each possible value of Outlook
partition the training samples according to the value of Outlook
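To see where the four gains above come from, here is a sketch that recomputes them. The full 14-example training table is not reproduced in these slides, so the standard PlayTennis data set (Quinlan / Mitchell) is assumed; it matches the example IDs D1–D14 and, up to rounding, the gains quoted above.

from math import log2

DATA = [
    # (Day, Outlook, Temperature, Humidity, Wind, PlayTennis) -- assumed standard table
    ("D1", "Sunny", "Hot", "High", "Weak", "No"),
    ("D2", "Sunny", "Hot", "High", "Strong", "No"),
    ("D3", "Overcast", "Hot", "High", "Weak", "Yes"),
    ("D4", "Rain", "Mild", "High", "Weak", "Yes"),
    ("D5", "Rain", "Cool", "Normal", "Weak", "Yes"),
    ("D6", "Rain", "Cool", "Normal", "Strong", "No"),
    ("D7", "Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("D8", "Sunny", "Mild", "High", "Weak", "No"),
    ("D9", "Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("D10", "Rain", "Mild", "Normal", "Weak", "Yes"),
    ("D11", "Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("D12", "Overcast", "Mild", "High", "Strong", "Yes"),
    ("D13", "Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("D14", "Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 1, "Temperature": 2, "Humidity": 3, "Wind": 4}

def entropy(rows):
    labels = [r[5] for r in rows]
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(rows, attr):
    col = ATTRS[attr]
    rem = sum(len(sub) / len(rows) * entropy(sub)
              for v in {r[col] for r in rows}
              for sub in [[r for r in rows if r[col] == v]])
    return entropy(rows) - rem

# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the slides quote 0.246 and 0.151, i.e. the same values truncated)
for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
print(max(ATTRS, key=lambda a: gain(DATA, a)))   # -> Outlook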
After the first step
Second step
Working on Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 − (3/5) × 0.0 − (2/5) × 0.0 = 0.970
Gain(SSunny, Wind) = 0.970 − (2/5) × 1.0 − (3/5) × 0.918 = 0.019
Gain(SSunny, Temp.) = 0.970 − (2/5) × 0.0 − (2/5) × 1.0 − (1/5) × 0.0 = 0.570
Humidity provides the best prediction for the target
Let's grow the tree:
add to the tree a successor for each possible value of Humidity
partition the training samples according to the value of Humidity
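A quick arithmetic check of the Sunny-branch values (the per-branch positive/negative counts are assumptions taken from the standard PlayTennis table, since the slide only quotes the fractions):

from math import log2

def h(p, n):  # binary entropy from counts
    t = p + n
    return -sum((k / t) * log2(k / t) for k in (p, n) if k > 0)

# SSunny = [2+, 3-]; assumed splits: Humidity High=[0+,3-], Normal=[2+,0-];
# Wind Weak=[1+,2-], Strong=[1+,1-]
print(round(h(2, 3), 3))                                     # 0.971
print(round(h(2, 3) - 3/5 * h(0, 3) - 2/5 * h(2, 0), 3))     # 0.971 (Humidity)
print(round(h(2, 3) - 3/5 * h(1, 2) - 2/5 * h(1, 1), 3))     # 0.02 (Wind; the slide's 0.019 uses rounded intermediates)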
Second and third steps
Leaves of the final tree:
Humidity = High: {D1, D2, D8} → No
Humidity = Normal: {D9, D11} → Yes
Wind = Weak: {D4, D5, D10} → Yes
Wind = Strong: {D6, D14} → No
Thanks