Decision Trees
[Example decision trees: one for loan evaluation and one that classifies animals by skin covering (feathers / scales / fur), beak shape (hooked / straight) and a further attribute (sharp / blunt).]
What is a decision tree?
A tree in which:
Each terminal node (leaf) is associated with a class.
Each non-terminal node is associated with one of the attributes that examples possess.
Each branch is associated with a particular value that the attribute of its parent node can take.
What is decision tree induction?
A procedure that, given a training set, attempts to build a
decision tree that will correctly predict the class of any
unclassified example.
What is a training set?
A set of classified examples, drawn from some population of
possible examples. The training set is almost always a very
small fraction of the population.
What is an example?
Typically decision trees operate on examples that take the form of feature vectors.
A feature vector is simply a vector whose elements are the values taken by the example's attributes.
For example, a heron might be represented by the vector of its attribute values.
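A hypothetical Python encoding of such a feature vector (the attribute names and the heron's values are illustrative assumptions, not taken from the original slide):

# One example as a feature vector: each field holds the value taken by
# one attribute (names and values here are illustrative only).
heron = {
    "skin_covering": "feathers",
    "beak": "straight",
}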
FUNCTION build_dec_tree(examples, atts)
// Takes a set of classified examples and a list of
// attributes, atts. Returns the root node of a decision tree.
  Create node N;
  IF examples are all in the same class
    THEN RETURN N labelled with that class;
  IF atts is empty
    THEN RETURN N labelled with the modal example class;
  best_att = choose_best_att(examples, atts);
  Label N with best_att;
  FOR each value ai of best_att
    si = subset of examples with best_att = ai;
    IF si is not empty
      THEN
        new_atts = atts - {best_att};
        subtree = build_dec_tree(si, new_atts);
        Attach subtree as a child of N;
      ELSE
        Create leaf node L;
        Label L with the modal example class;
        Attach L as a child of N;
  RETURN N;
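As a concrete illustration, here is a minimal runnable Python sketch of the same procedure. It assumes examples are (attribute-dict, class-label) pairs and takes the attribute-selection heuristic as a parameter, since choose_best_att (normally information gain, discussed below) is a separate concern; all names are illustrative rather than part of the original pseudocode.

from collections import Counter

def modal_class(labels):
    # Most frequent class among the given labels.
    return Counter(labels).most_common(1)[0][0]

def build_dec_tree(examples, atts, choose_best_att):
    # examples: list of (attribute_dict, class_label) pairs.
    # Returns a class label (leaf) or {attribute: {value: subtree}}.
    labels = [cls for _, cls in examples]
    if len(set(labels)) == 1:        # all examples in the same class
        return labels[0]
    if not atts:                     # no attributes left: use the modal class
        return modal_class(labels)
    best = choose_best_att(examples, atts)
    remaining = [a for a in atts if a != best]
    node = {best: {}}
    # Values are taken from the examples themselves, so no subset is empty;
    # the pseudocode's ELSE branch covers values absent from this subset.
    for value in set(ex[best] for ex, _ in examples):
        subset = [(ex, cls) for ex, cls in examples if ex[best] == value]
        node[best][value] = build_dec_tree(subset, remaining, choose_best_att)
    return node

# A deliberately naive heuristic just to keep the sketch self-contained;
# information gain is what would normally be plugged in here.
first_att = lambda examples, atts: atts[0]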
Shannon's Function
[Plot: information (in bits) against probability, for probabilities from 0 to 1.]
Suppose
You have a set of 100 examples, E
These examples fall in two classes, c1 and c2
70 are in c1
30 are in c2
How uncertain are you about the class an example belongs
to?
Information = -p(c1) log2(p(c1)) - p(c2) log2(p(c2))
            = -0.7 log2(0.7) - 0.3 log2(0.3)
            = -(0.7 x -0.51 + 0.3 x -1.74) = 0.88 bits
Now suppose
A is one of the example attributes with values v1 and v2
The 100 examples are distributed as follows:

        v1    v2
  c1    63     7
  c2     6    24
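A short Python sketch that reproduces the 0.88-bit figure above and then computes the information gain for attribute A from the table (the helper name is mine; the final figure is simply what the formula gives):

from math import log2

def information(counts):
    # Shannon information (in bits) of a class distribution given as counts.
    total = sum(counts)
    return -sum(n / total * log2(n / total) for n in counts if n > 0)

print(round(information([70, 30]), 2))    # 0.88 bits, as above

# Class counts (c1, c2) within each value of A, taken from the table
v1, v2 = [63, 6], [7, 24]
average = sum(v1) / 100 * information(v1) + sum(v2) / 100 * information(v2)
print(round(information([70, 30]) - average, 2))   # roughly 0.35 bits gained by testing A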
AN EXAMPLE

Initial Uncertainty:
The training set contains 8 examples, of which 3 are rain and 5 are dry, so the initial information is
-(3/8 log2(3/8) + 5/8 log2(5/8)) = 0.954

Temperature:
Cool Examples:
There are 5 of these; 2 rain and 3 dry.
So p(rain) = 2/5 and p(dry) = 3/5
Hence Inf(cool) = -(2/5 log2(2/5) + 3/5 log2(3/5)) = 0.971
Warm Examples:
There are 3 of these; 1 rain and 2 dry.
So p(rain) = 1/3 and p(dry) = 2/3
Hence Inf(warm) = -(1/3 log2(1/3) + 2/3 log2(2/3)) = 0.918
Average Information:
5/8 Inf(cool) + 3/8 Inf(warm) = 0.625 x 0.971 + 0.375 x 0.918 = 0.951
Hence the Information Gain for Temperature is
Initial Information - Average Information for Temperature
= 0.954 - 0.951 = 0.003. (Very small)
[Partial tree after splitting on Cloud Cover: the Clear branch leads to a Dry leaf, the Cloudy branch (e.g. cool,cloudy,windy: rain) leads to a Rain leaf, and the Overcast branch still contains four examples of mixed class:
  warm,overcast,windy: rain
  cool,overcast,calm: dry
  cool,overcast,windy: rain
  warm,overcast,calm: dry]
Splitting the Overcast Examples:
There are 4 of these, 2 rain and 2 dry, so the information at this node is 1 bit.

Temperature:
Cool Examples:
There are 2 of these; 1 rain and 1 dry.
So p(rain) = 1/2 and p(dry) = 1/2
Hence Inf(cool) = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Warm Examples:
There are also 2 of these; 1 rain and 1 dry.
So again Inf(warm) = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Average Information:
1/2 Inf(cool) + 1/2 Inf(warm) = 0.5 x 1.0 + 0.5 x 1.0 = 1
Hence the Information Gain for Temperature is zero!
Wind:
Windy Examples:
There are 2 of these; 2 rain and 0 dry.
So p(rain) = 1 and p(dry) = 0
Hence Inf(windy) = -(1 x log2(1) + 0 x log2(0)) = 0   (taking 0 x log2(0) to be 0)
Calm Examples:
There are also 2 of these; 0 rain and 2 dry.
So p(rain) = 0 and p(dry) = 1
Hence Inf(calm) = -(0 x log2(0) + 1 x log2(1)) = 0
Average Information:
1/2 Inf(windy) + 1/2 Inf(calm) = 0.5 x 0.0 + 0.5 x 0.0 = 0
Hence the Information Gain for Wind is 1 - 0 = 1.
Note: this reflects the fact that wind is a perfect predictor of precipitation for this subset of examples.
The Best Attribute:
Obviously wind is the best attribute so we can now extend
the tree.
[Extended tree:
  Cloud Cover
    Clear -> Dry
    Cloudy -> Rain
    Overcast -> Wind
      Windy -> Rain (warm,overcast,windy: rain; cool,overcast,windy: rain)
      Calm -> Dry (cool,overcast,calm: dry; warm,overcast,calm: dry)]
[Final tree:
  Cloud Cover
    Clear -> Dry
    Cloudy -> Rain
    Overcast -> Wind
      Windy -> Rain
      Calm -> Dry]
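The finished tree can be written down directly as a nested structure. A Python sketch (the dict encoding and the classify helper are illustrative, not part of the notes):

# Internal nodes map an attribute name to {value: subtree}; leaves are classes.
tree = {
    "cloud_cover": {
        "clear": "dry",
        "cloudy": "rain",
        "overcast": {"wind": {"windy": "rain", "calm": "dry"}},
    }
}

def classify(tree, example):
    # Walk the tree until a leaf (a class label) is reached.
    while isinstance(tree, dict):
        attribute = next(iter(tree))           # attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

print(classify(tree, {"temperature": "warm", "cloud_cover": "overcast", "wind": "calm"}))
# -> dry, matching the training example warm,overcast,calm: dry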
The basic algorithm must be extended to deal with a number of practical issues. These include:
Dealing with Missing Values
Inconsistent Data
Incrementality
Handling Numerical Attributes
Attribute Value Grouping
Alternative Attribute Selection Criteria
The Problem of Overfitting
Note.
Many of these problems also arise in other learning
procedures and statistical methods.
Hence many of the solutions developed for use with
decision trees may be useful in conjunction with other
techniques.
MISSING VALUES
INCONSISTENT DATA
No More Attributes
If the attributes are exhausted but the examples at a node are not all of the same class, a decision tree program can do one of two things.
Predict the most probable (modal) class.
This is what the pseudocode given earlier does.
Make a set of predictions with associated probabilities.
This is better.
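A sketch of the second option, assuming each leaf keeps the class counts of the training examples that reach it (the names and the 7/3 split below are illustrative):

from collections import Counter

def leaf_distribution(examples):
    # Store the class proportions at the leaf rather than a single label.
    counts = Counter(cls for _, cls in examples)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# e.g. a leaf reached by 7 'dry' and 3 'rain' training examples
leaf = leaf_distribution([({}, "dry")] * 7 + [({}, "rain")] * 3)
print(leaf)    # {'dry': 0.7, 'rain': 0.3}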
INCREMENTALITY
The Problem
The Solution
Partition the value set into a small number of contiguous
subranges and then treat membership of each subrange as a
categorical variable.
The result is in effect a new ordinal attribute.
This gives a reasonable branching factor, and some of the ordering information has been used.
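For instance, a numeric temperature could be mapped onto a small ordinal attribute like this (the cut points are illustrative assumptions; choosing them well is the topic of the Dougherty, Kohavi and Sahami paper in the readings):

def discretise(value, boundaries):
    # Map a numeric value to an ordinal bin label given ordered cut points.
    # boundaries = [15, 25] gives three contiguous subranges:
    # bin0 (< 15), bin1 (15 to 25) and bin2 (> 25).
    for i, b in enumerate(boundaries):
        if value < b:
            return "bin%d" % i
    return "bin%d" % len(boundaries)

print([discretise(t, [15, 25]) for t in [8, 17, 31]])   # ['bin0', 'bin1', 'bin2']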
ATTRIBUTE VALUE GROUPING
[Diagram: a tree over attributes A5, A2 and A8 in which several values share a branch, e.g. V2,1 on one branch and the group V2,2, V2,3, V2,4 on another, and the group V8,1, V8,2, V8,4 separated from V8,3, leading to N and Y leaves.]
Hill Climbing
The basic decision tree induction algorithm proceeds
using a hill climbing approach.
At every step, a new branch is created for each value of
the best attribute.
There is no backtracking.
Information Gain
Gain(X, A) = I(X) - Σ (|Xv| / |X|) I(Xv)
where the sum is over all values v of attribute A, and Xv is the subset of examples for which A = v.
This criterion has been (and is) widely and successfully used
But it is known to be biased towards attributes with many
values.
Why is it biased?
A many-valued attribute will partition the examples into
many subsets.
The average size of these subsets will be small.
Some of these are likely to contain a high percentage of
one class by chance alone.
Hence the true information gain will be over-estimated.
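The bias is easy to demonstrate on purely random data. In this sketch (all names and the data are illustrative) an 'id' attribute with a distinct value for every example gets the maximum possible gain even though it carries no information about the class:

import random
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * log2(n / len(labels)) for n in counts.values())

def gain(examples, att):
    labels = [cls for _, cls in examples]
    remainder = 0.0
    for v in set(ex[att] for ex, _ in examples):
        subset = [cls for ex, cls in examples if ex[att] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

random.seed(0)
examples = [({"coin": random.choice("ht"), "id": i}, random.choice(["c1", "c2"]))
            for i in range(100)]

print(round(gain(examples, "coin"), 3))  # near zero: the coin flip tells us nothing
print(round(gain(examples, "id"), 3))    # the full class entropy: every singleton subset is pure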
Gain Ratio
GainRatio(X, A) = Gain(X, A) / SplitInf(X, A)
where SplitInf(X, A) = -Σ (|Xv| / |X|) log2(|Xv| / |X|), summed over all values v of A, is the information content of the distribution of the examples over the values of A. SplitInf is largest when the examples are spread over many values in a distribution not far from equiprobable, so many-valued attributes are penalised.
If most of the examples are of the same value then
SplitInf(X,A) will be close to zero, and hence GainRatio(X,A)
may be very large.
The usual solution is to use Gain rather than GainRatio
whenever SplitInf is small.
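A sketch of the correction, including the fallback just described (the 0.1 threshold and the 0.35 gain figure are illustrative assumptions):

from math import log2

def split_info(subset_sizes):
    # Information content of the way the examples are split over the values of A.
    total = sum(subset_sizes)
    return -sum(n / total * log2(n / total) for n in subset_sizes if n > 0)

def gain_ratio(gain, subset_sizes, min_split_info=0.1):
    # Fall back to plain gain whenever SplitInf is very small.
    s = split_info(subset_sizes)
    return gain if s < min_split_info else gain / s

# Attribute A from the earlier 100-example illustration splits them 69 / 31:
print(round(split_info([69, 31]), 3))        # about 0.893
print(round(gain_ratio(0.35, [69, 31]), 3))  # gain divided by SplitInf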
Information Distance
G = Σ p(ci) p(cj), summed over all pairs of classes with i ≠ j
The Evidence
Suppose a decision tree is grown from a training set whose classes have been assigned entirely at random: the induction procedure will still build a tree that fits those examples.
Question
Would the tree be any good as a classifier?
Would it, for example, do better than the simple strategy of
always picking the modal class?
Answer
No.
Note also that if the experiment were repeated with a new set
of random data we would get an entirely different tree.
Questions
Isn't this rather worrying?
Could the same sort of thing be happening with non-random
data?
Answers
Yes and yes.
Eliminating Overfitting
There are two basic ways of preventing overfitting:
Chi-square tests, applied while the tree is grown, to check that a proposed split is statistically significant (a minimal sketch follows below).
Post-pruning of the completed tree.
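A minimal sketch of the chi-square idea (the notes do not spell it out, so the statistic, the table layout and the threshold here are assumptions): compare the class-by-value contingency table produced by a candidate split with what would be expected if attribute and class were independent, and only make the split if the statistic is significant.

def chi_square(table):
    # Chi-square statistic for a class-by-value contingency table (list of rows).
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# The class-by-value table for attribute A in the earlier 100-example illustration
print(round(chi_square([[63, 7], [6, 24]]), 1))
# about 48: far above the 5% critical value of 3.84 for 1 degree of freedom,
# so this split would be accepted.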
Refinements
Substituting Branches
Rather than replacing a subtree with a leaf, it can be replaced
by its most frequently used branch.
More Drastic Transformations
More substantial changes, possibly leading to a structure that
is no longer a tree, have also been used.
One example is the transformation into rule sets in C4.5:
Generate a set of production rules equivalent to the tree
by creating one rule for each path from the root to a leaf.
Generalize each rule by removing any precondition
whose loss does not reduce the accuracy.
This step corresponds to pruning, but note that the
structure may no longer be equivalent to a tree.
An example might match the LHS of more than one
rule.
Sort the rules by their estimated accuracy.
When using the rules for classification, this accuracy is
used for conflict resolution.
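A sketch of the first step (one production rule per root-to-leaf path), reusing the nested-dict tree encoding from earlier; the generalisation and sorting steps are not shown:

def tree_to_rules(tree, conditions=()):
    # Return a list of (preconditions, predicted_class) pairs.
    if not isinstance(tree, dict):                     # a leaf: emit the path so far
        return [(list(conditions), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

tree = {"cloud_cover": {"clear": "dry",
                        "cloudy": "rain",
                        "overcast": {"wind": {"windy": "rain", "calm": "dry"}}}}
for preconditions, cls in tree_to_rules(tree):
    print(preconditions, "->", cls)
# e.g. [('cloud_cover', 'overcast'), ('wind', 'calm')] -> dry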
Suggested Readings
Mitchell, T. M. (1997), Machine Learning, McGraw-Hill. Chapter 3.
Tan, Steinbach & Kumar (2006), Introduction to Data Mining. Chapter 4.
Han & Kamber (2006), Data Mining: Concepts and Techniques. Section 6.3.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984), Classification and Regression Trees. Wadsworth, Pacific Grove, CA. (A thorough treatment of the subject from a more statistical perspective; an essential reference if you are doing research in the area. Usually known as "the CART book".)
Quinlan, J. R. (1986), Induction of Decision Trees. Machine Learning, 1(1), pp 81-106. (A full account of ID3.)
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA. (A complete account of C4.5, the successor to ID3 and the yardstick against which other decision tree induction procedures are usually compared.)
Dougherty, J., Kohavi, R. and Sahami, M. (1995), Supervised and Unsupervised Discretisation of Continuous Features, in Proc. 12th Int. Conf. on Machine Learning, Morgan Kaufmann, Los Altos, CA, pp 194-202. (A good comparative study of different methods for discretising numeric attributes.)
Ho, K. M. and Scott, P. D. (2000), Reducing Decision Tree Fragmentation Through Attribute Value Grouping: A Comparative Study. Intelligent Data Analysis, 6, pp 255-274.
Implementations
An implementation of a decision tree procedure is available as part of
the WEKA suite of data mining programs. It is called J4.8 and closely
resembles C4.5.