Chapter ML:III
III. Decision Trees
❑ Decision Trees Basics
❑ Impurity Functions
❑ Decision Tree Algorithms
❑ Decision Tree Pruning
Decision Tree Algorithms
ID3 Algorithm
[Quinlan 1986] [CART Algorithm]
Characterization of the model (model world)
[ML Introduction]:
X is a set of feature vectors, also called feature space.
C is a set of classes.
c : X → C is the ideal classifier for X.
D = {(x_1, c(x_1)), ..., (x_n, c(x_n))} ⊆ X × C is a set of examples.
Task: Based on D, construction of a decision tree T to approximate c.
Characteristics of the ID3 algorithm:
1. Each splitting is based on one nominal feature and considers its complete
domain. Splitting based on feature A with domain {a_1, ..., a_k}:
X = {x ∈ X : x|A = a_1} ∪ ... ∪ {x ∈ X : x|A = a_k}
2. Splitting criterion is the information gain.
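As an illustration of characteristic 1 (not part of the original slides; the data layout — a list of (x, c) pairs with x a dict of feature values — is an assumption used throughout the sketches in this section), a complete-domain split can be written as follows. The splitting criterion, the information gain, is sketched together with the full algorithm after the pseudo code below.

from collections import defaultdict

def split_by_feature(D, A):
    """Partition the example set D over the complete domain of the nominal feature A.
    Each block corresponds to one set {x in X : x|A = a}."""
    parts = defaultdict(list)
    for x, c in D:
        parts[x[A]].append((x, c))
    return dict(parts)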
Decision Tree Algorithms
ID3 Algorithm
[Mitchell 1997] [algorithm template]
ID3(D, Attributes, Target)
❑ Create a node t for the tree.
❑ Label t with the most common value of Target in D.
❑ If all examples in D are positive, return the single-node tree t, with label +.
❑ If all examples in D are negative, return the single-node tree t, with label −.
❑ If Attributes is empty, return the single-node tree t.
❑ Otherwise:
  ❑ Let A* be the attribute from Attributes that best classifies examples in D.
  ❑ Assign t the decision attribute A*.
  ❑ For each possible value a in A* do:
    ❑ Add a new tree branch below t, corresponding to the test A* = a.
    ❑ Let D_a be the subset of D that has value a for A*.
    ❑ If D_a is empty:
      Then add a leaf node with label of the most common value of Target in D.
      Else add the subtree ID3(D_a, Attributes \ {A*}, Target).
❑ Return t.
Decision Tree Algorithms
ID3 Algorithm (pseudo code)
[algorithm template]
ID3(D, Attributes, Target)
1. t = createNode()
2. label(t) = mostCommonClass(D, Target)
3. IF ∀⟨x, c(x)⟩ ∈ D : c(x) = c THEN return(t) ENDIF
4. IF Attributes = ∅ THEN return(t) ENDIF
5. A* = argmax_{A ∈ Attributes} (informationGain(D, A))
6. FOREACH a ∈ A* DO
     D_a = {(x, c(x)) ∈ D : x|A* = a}
     IF D_a = ∅ THEN
       t′ = createNode()
       label(t′) = mostCommonClass(D, Target)
       createEdge(t, a, t′)
     ELSE
       createEdge(t, a, ID3(D_a, Attributes \ {A*}, Target))
     ENDIF
   ENDDO
7. return(t)
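The pseudo code translates almost line by line into Python. The following sketch is an illustration only (the function names and the (x, c)-pair data layout are assumptions, not part of the slides); domains maps each attribute to its complete set of possible values, as required by Step 6.

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(D, A):
    """Information gain of splitting the example set D on attribute A."""
    gain = entropy([c for _, c in D])
    for a in {x[A] for x, _ in D}:
        D_a = [c for x, c in D if x[A] == a]
        gain -= len(D_a) / len(D) * entropy(D_a)
    return gain

def id3(D, attributes, domains):
    """D: list of (x, c) pairs, x a dict of attribute values; attributes: set of
    attribute names still available; domains: attribute name -> set of values."""
    label = Counter(c for _, c in D).most_common(1)[0][0]          # Step 2
    if len({c for _, c in D}) == 1 or not attributes:              # Steps 3 and 4
        return label                                               # leaf node
    A = max(attributes, key=lambda B: information_gain(D, B))      # Step 5
    branches = {}
    for a in domains[A]:                                           # Step 6
        D_a = [(x, c) for x, c in D if x[A] == a]
        branches[a] = id3(D_a, attributes - {A}, domains) if D_a else label
    return {"attribute": A, "branches": branches}                  # Step 7

A leaf is represented here by its class label, an inner node by a dict holding the tested attribute and one branch per value of its domain.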
Remarks:
❑ Target designates the feature (= attribute) that is comprised of the labels according to
  which an example can be classified. Within Mitchell's algorithm the respective class labels
  are + and −, modeling the binary classification situation. In the pseudo code version,
  Target may be comprised of multiple (more than two) classes.
❑ Step 3 of the ID3 algorithm checks the purity of D and, in this case, assigns the
  unique class c, c ∈ dom(Target), as label to the respective node.
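For illustration (data layout as in the sketches above, not prescribed by the slides), the majority labeling of Step 2 and the purity check of Step 3 reduce to two short helpers:

from collections import Counter

def most_common_class(D):
    """Majority class label in the example set D (Step 2); D is a list of (x, c) pairs."""
    return Counter(c for _, c in D).most_common(1)[0][0]

def is_pure(D):
    """True iff all examples in D carry the same class label (purity check of Step 3)."""
    return len({c for _, c in D}) == 1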
Decision Tree Algorithms
ID3 Algorithm: Example
Example set D for mushrooms, implicitly defining a feature space X over the three
dimensions color, size, and points:
      Color   Size    Points   Eatability
  1   red     small   yes      toxic
  2   brown   small   no       eatable
  3   brown   large   yes      eatable
  4   green   small   no       eatable
  5   red     large   no       eatable
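Written out in the data layout assumed by the sketches in this section (a choice for illustration, not prescribed by the slides), the example set reads:

D = [
    ({"color": "red",   "size": "small", "points": "yes"}, "toxic"),    # 1
    ({"color": "brown", "size": "small", "points": "no"},  "eatable"),  # 2
    ({"color": "brown", "size": "large", "points": "yes"}, "eatable"),  # 3
    ({"color": "green", "size": "small", "points": "no"},  "eatable"),  # 4
    ({"color": "red",   "size": "large", "points": "no"},  "eatable"),  # 5
]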
Decision Tree Algorithms
ID3 Algorithm: Example
(continued)
Top-level call of ID3. Analyze a splitting with regard to the feature color:

  D|color   toxic   eatable
  red         1        1
  brown       0        2
  green       0        1

|D_red| = 2, |D_brown| = 2, |D_green| = 1

Estimated a-priori probabilities:

  p_red = 2/5 = 0.4,   p_brown = 2/5 = 0.4,   p_green = 1/5 = 0.2
Conditional entropy values for all attributes (with the convention 0 · log2(0) := 0):

  H(C | color)  = −(0.4 · (1/2 · log2(1/2) + 1/2 · log2(1/2)) +
                    0.4 · (0/2 · log2(0/2) + 2/2 · log2(2/2)) +
                    0.2 · (0/1 · log2(0/1) + 1/1 · log2(1/1)))  = 0.4

  H(C | size)   ≈ 0.55

  H(C | points) = 0.4
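These values can be checked numerically; a small sketch, assuming the example set D as written out after the table above:

import math
from collections import Counter

def conditional_entropy(D, A):
    """H(C | A) for an example set D of (x, c) pairs, x a dict of feature values."""
    h = 0.0
    for a in {x[A] for x, _ in D}:
        D_a = [c for x, c in D if x[A] == a]
        counts = Counter(D_a).values()
        h -= len(D_a) / len(D) * sum(k / len(D_a) * math.log2(k / len(D_a)) for k in counts)
    return h

# conditional_entropy(D, "color")   -> 0.4
# conditional_entropy(D, "size")    -> 0.5509...  (≈ 0.55)
# conditional_entropy(D, "points")  -> 0.4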
Remarks:
❑ The smaller H(C | feature) is, the larger the information gain becomes. Hence, the
  difference H(C) − H(C | feature) need not be computed, since H(C) is constant within
  each recursion step.
❑ In the example, the information gain in the first recursion step is maximal for the two
  features color and points.
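A quick side calculation (added here for illustration, using the values from the slides above): H(C) = −(1/5 · log2(1/5) + 4/5 · log2(4/5)) ≈ 0.72, so the information gain is about 0.72 − 0.40 = 0.32 for color and points, but only about 0.72 − 0.55 = 0.17 for size.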
Decision Tree Algorithms
ID3 Algorithm: Example
(continued)
Decision tree before the first recursion step:
[Figure: the root of the tree tests the attribute points. The branch points = yes holds the
examples {(red, small, yes, toxic), (brown, large, yes, eatable)}; the branch points = no holds
the examples {(brown, small, no, eatable), (green, small, no, eatable), (red, large, no, eatable)}.]
The feature points was chosen in Step 5 of the ID3 algorithm.
Decision Tree Algorithms
ID3 Algorithm: Example
(continued)
Decision tree before the second recursion step:
[Figure: the root tests the attribute points. Under points = yes, the attribute color is tested:
color = red leads to the example (red, small, yes, toxic), color = brown to the example
(brown, large, yes, eatable), and color = green to an empty example set (-/-). Under
points = no, the examples (brown, small, no, eatable), (green, small, no, eatable), and
(red, large, no, eatable) remain.]
The feature color was chosen in Step 5 of the ID3 algorithm.
Decision Tree Algorithms
ID3 Algorithm: Example
(continued)
Final decision tree after the second recursion step:
[Figure: the root tests the attribute points. points = no leads to a leaf labeled eatable;
points = yes leads to a test of the attribute color with leaves labeled toxic (red),
toxic (green), and eatable (brown).]
Breaking of a tie: the class toxic is chosen for D_green in Step 6 of the ID3 algorithm.
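Written in the nested-dict representation used by the earlier ID3 sketch (an illustration, not part of the slides), the final tree can be applied to unseen feature vectors as follows:

tree = {"attribute": "points",
        "branches": {"no":  "eatable",
                     "yes": {"attribute": "color",
                             "branches": {"red":   "toxic",
                                          "green": "toxic",
                                          "brown": "eatable"}}}}

def classify(tree, x):
    """Follow the branches of the decision tree for the feature vector x (a dict)."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attribute"]]]
    return tree

# classify(tree, {"color": "red", "size": "small", "points": "yes"})  -> "toxic"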
Decision Tree Algorithms
ID3 Algorithm: Hypothesis Space
[Figure: illustration of the hypothesis space searched by ID3 — a sequence of increasingly
refined partial decision trees over attributes A1, A2, A3, A4 with class labels +, −, o;
further alternatives are indicated by ellipses.]
Decision Tree Algorithms
ID3 Algorithm: Inductive Bias
Inductive bias is the rigidity in applying the (little bit of) knowledge learned from a
training set to the classification of unseen feature vectors.
Observations:
❑ Decision tree search happens in the space of all hypotheses.
  The target concept is a member of the hypothesis space.
❑ To generate a decision tree, the ID3 algorithm needs at most as many decisions per
  branch as there are features.
  → no backtracking takes place
  → local optimization of decision trees
Where the inductive bias of the ID3 algorithm becomes manifest:
❑ Small decision trees are preferred.
❑ Highly discriminative features tend to be closer to the root.

Is this justified?
Remarks:
❑ Let dom(A_j) denote the finite domain (the possible values) of feature A_j, j = 1, ..., p, and
  let C be a set of classes. Then a hypothesis space H that is comprised of all decision trees
  corresponds to the set of all functions h, h : dom(A_1) × ... × dom(A_p) → C. Typically, C = {0, 1}.
❑ The inductive bias of the ID3 algorithm is of a different kind than the inductive bias of the
candidate elimination algorithm (version space algorithm):
1. The underlying hypothesis space H of the candidate elimination algorithm is
incomplete. H corresponds to a coarsened view onto the space of all hypotheses since
H contains only conjunctions of attribute-value pairs as hypotheses. However, this
restricted hypothesis space is searched completely by the candidate elimination
algorithm. Keyword: restriction bias
2. The underlying hypothesis space H of the ID3 algorithm is complete. H corresponds to
the set of all discrete functions (from the Cartesian product of the feature domains onto
the set of classes) that can be represented in the form of a decision tree. However, this
complete hypothesis space is searched incompletely (following a preference).
Keyword: preference bias or search bias
❑ The inductive bias of the ID3 algorithm renders the algorithm robust with respect to noise.
Decision Tree Algorithms
CART Algorithm
[Breiman 1984] [ID3 Algorithm]
Characterization of the model (model world)
[ML Introduction]:
X is a set of feature vectors, also called feature space. No restrictions are
presumed for the measurement scales of the features.
C is a set of classes.
c : X → C is the ideal classifier for X.
D = {(x_1, c(x_1)), ..., (x_n, c(x_n))} ⊆ X × C is a set of examples.
Task: Based on D, construction of a decision tree T to approximate c.
Characteristics of the CART algorithm:
1. Each splitting is binary and considers one feature at a time.
2. Splitting criterion is the information gain or the Gini index.
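As a reminder of one common formulation of the Gini index, ι_Gini = 1 − Σ_c p_c², a minimal Python sketch (data layout as in the earlier sketches; this is an illustration, not the slides' definition):

from collections import Counter

def gini(labels):
    """Gini index 1 - sum_c p_c^2 of the class distribution in a list of labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

# gini(["toxic", "eatable", "eatable", "eatable", "eatable"])  -> 0.32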
Decision Tree Algorithms
CART Algorithm
(continued)
1. Let A be a feature with domain dom(A). Ensure a finite number of binary splittings
   for X by applying the following domain partitioning rules:
   – If A is nominal, choose A′ ⊂ dom(A) such that 0 < |A′| ≤ |dom(A) \ A′|.
   – If A is ordinal, choose a ∈ dom(A) such that x_min < a < x_max, where x_min, x_max
     are the minimum and maximum values of feature A in D.
   – If A is numeric, choose a ∈ dom(A) such that a = (x_k + x_l)/2, where x_k, x_l are
     consecutive elements in the ordered value list of feature A in D.
2. For a node t of a decision tree generate all splittings of the above type.

3. Choose a splitting from the set of splittings that maximizes the impurity
   reduction Δι:

   Δι(D(t), {D(t_L), D(t_R)}) = ι(t) − |D_L|/|D| · ι(t_L) − |D_R|/|D| · ι(t_R),

   where t_L and t_R denote the left and right successor of t.
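A sketch of steps 2 and 3 for a single numeric feature, combining the midpoint rule from step 1 with the impurity reduction Δι and the Gini index as impurity measure ι (function names and data layout are assumptions for illustration):

from collections import Counter

def gini(labels):
    """Gini index of the class distribution in a list of labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def best_numeric_split(D, A):
    """Return (impurity reduction, threshold) of the best binary split of D on the
    numeric feature A, or None if A takes only one value in D."""
    values = sorted({x[A] for x, _ in D})
    candidates = [(u + v) / 2 for u, v in zip(values, values[1:])]  # midpoints of consecutive values
    labels = [c for _, c in D]
    best = None
    for a in candidates:
        left  = [c for x, c in D if x[A] <= a]                      # D(t_L)
        right = [c for x, c in D if x[A] > a]                       # D(t_R)
        delta = (gini(labels) - len(left) / len(D) * gini(left)
                              - len(right) / len(D) * gini(right))  # impurity reduction
        if best is None or delta > best[0]:
            best = (delta, a)
    return best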
Decision Tree Algorithms
CART Algorithm
(continued)
Illustration for two numeric features; i.e., the feature space X corresponds to a
two-dimensional plane:
[Figure: a binary decision tree with nodes t_1, ..., t_9 on the left and the corresponding
recursive partitioning of the two-dimensional feature space X = X(t_1) into regions
X(t_2), ..., X(t_9), labeled with classes c_1, c_2, c_3, on the right.]
By a sequence of splittings the feature space X is partitioned into rectangles whose sides
are parallel to the two axes.