Decision Trees
[Example decision trees: one for loan evaluation and one that classifies animals by skin covering (feathers / scales / fur), beak shape (hooked / straight) and a further attribute (sharp / blunt).]
What is a decision tree?
A tree in which:
Each terminal node (leaf) is associated with a class.
Each non-terminal node is associated with one of the attributes that examples possess.
Each branch is associated with a particular value that the attribute of its parent node can take.
What is decision tree induction?
A procedure that, given a training set, attempts to build a
decision tree that will correctly predict the class of any
unclassified example.
What is a training set?
A set of classified examples, drawn from some population of
possible examples. The training set is almost always a very
small fraction of the population.
What is an example?
Typically decision trees operate on examples that take the form of feature vectors.
A feature vector is simply a vector whose elements are the values taken by the example's attributes.
For example, a heron might be represented by the vector of its attribute values.
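A hypothetical Python encoding of such a feature vector (the attribute names and the heron's values are illustrative assumptions, not taken from the original slide):

# One example as a feature vector: each field holds the value taken by
# one attribute (names and values here are illustrative only).
heron = {
    "skin_covering": "feathers",
    "beak": "straight",
}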
FUNCTION build_dec_tree(examples, atts)
// Takes a set of classified examples and a list of
// attributes, atts. Returns the root node of a decision tree.
  Create node N;
  IF examples are all in the same class
    THEN RETURN N labelled with that class;
  IF atts is empty
    THEN RETURN N labelled with the modal example class;
  best_att = choose_best_att(examples, atts);
  Label N with best_att;
  FOR each value ai of best_att
    si = subset of examples with best_att = ai;
    IF si is not empty
      THEN
        new_atts = atts - {best_att};
        subtree = build_dec_tree(si, new_atts);
        Attach subtree as a child of N;
      ELSE
        Create leaf node L;
        Label L with the modal example class;
        Attach L as a child of N;
  RETURN N;
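As a concrete illustration, here is a minimal runnable Python sketch of the same procedure. It assumes examples are (attribute-dict, class-label) pairs and takes the attribute-selection heuristic as a parameter, since choose_best_att (normally information gain, discussed below) is a separate concern; all names are illustrative rather than part of the original pseudocode.

from collections import Counter

def modal_class(labels):
    # Most frequent class among the given labels.
    return Counter(labels).most_common(1)[0][0]

def build_dec_tree(examples, atts, choose_best_att):
    # examples: list of (attribute_dict, class_label) pairs.
    # Returns a class label (leaf) or {attribute: {value: subtree}}.
    labels = [cls for _, cls in examples]
    if len(set(labels)) == 1:        # all examples in the same class
        return labels[0]
    if not atts:                     # no attributes left: use the modal class
        return modal_class(labels)
    best = choose_best_att(examples, atts)
    remaining = [a for a in atts if a != best]
    node = {best: {}}
    # Values are taken from the examples themselves, so no subset is empty;
    # the pseudocode's ELSE branch covers values absent from this subset.
    for value in set(ex[best] for ex, _ in examples):
        subset = [(ex, cls) for ex, cls in examples if ex[best] == value]
        node[best][value] = build_dec_tree(subset, remaining, choose_best_att)
    return node

# A deliberately naive heuristic just to keep the sketch self-contained;
# information gain is what would normally be plugged in here.
first_att = lambda examples, atts: atts[0]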
Shannon's Function
[Plot: information (in bits) against probability, for probabilities from 0 to 1.]
Suppose
You have a set of 100 examples, E
These examples fall in two classes, c1 and c2
70 are in c1
30 are in c2
How uncertain are you about the class an example belongs
to?
Information = -p(c1) log2(p(c1)) - p(c2) log2(p(c2))
            = -0.7 log2(0.7) - 0.3 log2(0.3)
            = -(0.7 x -0.51 + 0.3 x -1.74) = 0.88 bits
Now suppose
A is one of the example attributes with values v1 and v2
The 100 examples are distributed as follows:

        v1    v2
  c1    63     7
  c2     6    24
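A short Python sketch that reproduces the 0.88-bit figure above and then computes the information gain for attribute A from the table (the helper name is mine; the final figure is simply what the formula gives):

from math import log2

def information(counts):
    # Shannon information (in bits) of a class distribution given as counts.
    total = sum(counts)
    return -sum(n / total * log2(n / total) for n in counts if n > 0)

print(round(information([70, 30]), 2))    # 0.88 bits, as above

# Class counts (c1, c2) within each value of A, taken from the table
v1, v2 = [63, 6], [7, 24]
average = sum(v1) / 100 * information(v1) + sum(v2) / 100 * information(v2)
print(round(information([70, 30]) - average, 2))   # roughly 0.35 bits gained by testing A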
AN EXAMPLE

Initial Uncertainty:
The training set contains 8 examples, of which 3 are rain and 5 are dry, so the initial information is
-(3/8 log2(3/8) + 5/8 log2(5/8)) = 0.954

Temperature:
Cool Examples:
There are 5 of these; 2 rain and 3 dry.
So p(rain) = 2/5 and p(dry) = 3/5
Hence Inf(cool) = -(2/5 log2(2/5) + 3/5 log2(3/5)) = 0.971
Warm Examples:
There are 3 of these; 1 rain and 2 dry.
So p(rain) = 1/3 and p(dry) = 2/3
Hence Inf(warm) = -(1/3 log2(1/3) + 2/3 log2(2/3)) = 0.918
Average Information:
5/8 Inf(cool) + 3/8 Inf(warm) = 0.625 x 0.971 + 0.375 x 0.918 = 0.951
Hence the Information Gain for Temperature is
Initial Information - Average Information for Temperature
= 0.954 - 0.951 = 0.003. (Very small)
[Partial tree after splitting on Cloud Cover: the Clear branch leads to a Dry leaf, the Cloudy branch (e.g. cool,cloudy,windy: rain) leads to a Rain leaf, and the Overcast branch still contains four examples of mixed class:
  warm,overcast,windy: rain
  cool,overcast,calm: dry
  cool,overcast,windy: rain
  warm,overcast,calm: dry]
Splitting the Overcast Examples:
There are 4 of these, 2 rain and 2 dry, so the information at this node is 1 bit.

Temperature:
Cool Examples:
There are 2 of these; 1 rain and 1 dry.
So p(rain) = 1/2 and p(dry) = 1/2
Hence Inf(cool) = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Warm Examples:
There are also 2 of these; 1 rain and 1 dry.
So again Inf(warm) = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Average Information:
1/2 Inf(cool) + 1/2 Inf(warm) = 0.5 x 1.0 + 0.5 x 1.0 = 1
Hence the Information Gain for Temperature is zero!
Wind:
Windy Examples:
There are 2 of these; 2 rain and 0 dry.
So p(rain) = 1 and p(dry) = 0
Hence Inf(windy) = -(1 x log2(1) + 0 x log2(0)) = 0   (taking 0 x log2(0) to be 0)
Calm Examples:
There are also 2 of these; 0 rain and 2 dry.
So p(rain) = 0 and p(dry) = 1
Hence Inf(calm) = -(0 x log2(0) + 1 x log2(1)) = 0
Average Information:
1/2 Inf(windy) + 1/2 Inf(calm) = 0.5 x 0.0 + 0.5 x 0.0 = 0
Hence the Information Gain for Wind is 1 - 0 = 1.
Note: this reflects the fact that wind is a perfect predictor of precipitation for this subset of examples.
The Best Attribute:
Obviously wind is the best attribute so we can now extend
the tree.
[Extended tree:
  Cloud Cover
    Clear -> Dry
    Cloudy -> Rain
    Overcast -> Wind
      Windy -> Rain (warm,overcast,windy: rain; cool,overcast,windy: rain)
      Calm -> Dry (cool,overcast,calm: dry; warm,overcast,calm: dry)]
[Final tree:
  Cloud Cover
    Clear -> Dry
    Cloudy -> Rain
    Overcast -> Wind
      Windy -> Rain
      Calm -> Dry]
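The finished tree can be written down directly as a nested structure. A Python sketch (the dict encoding and the classify helper are illustrative, not part of the notes):

# Internal nodes map an attribute name to {value: subtree}; leaves are classes.
tree = {
    "cloud_cover": {
        "clear": "dry",
        "cloudy": "rain",
        "overcast": {"wind": {"windy": "rain", "calm": "dry"}},
    }
}

def classify(tree, example):
    # Walk the tree until a leaf (a class label) is reached.
    while isinstance(tree, dict):
        attribute = next(iter(tree))           # attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

print(classify(tree, {"temperature": "warm", "cloud_cover": "overcast", "wind": "calm"}))
# -> dry, matching the training example warm,overcast,calm: dry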
The basic algorithm must be extended to deal with a number of practical issues. These include:
Dealing with Missing Values
Inconsistent Data
Incrementality
Handling Numerical Attributes
Attribute Value Grouping
Alternative Attribute Selection Criteria
The Problem of Overfitting
Note.
Many of these problems also arise in other learning
procedures and statistical methods.
Hence many of the solutions developed for use with
decision trees may be useful in conjunction with other
techniques.
MISSING VALUES
INCONSISTENT DATA
No More Attributes
If the attributes are exhausted but the examples at a node are not all of the same class, a decision tree program can do one of two things.
Predict the most probable (modal) class.
This is what the pseudocode given earlier does.
Make a set of predictions with associated probabilities.
This is better.
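A sketch of the second option, assuming each leaf keeps the class counts of the training examples that reach it (the names and the 7/3 split below are illustrative):

from collections import Counter

def leaf_distribution(examples):
    # Store the class proportions at the leaf rather than a single label.
    counts = Counter(cls for _, cls in examples)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# e.g. a leaf reached by 7 'dry' and 3 'rain' training examples
leaf = leaf_distribution([({}, "dry")] * 7 + [({}, "rain")] * 3)
print(leaf)    # {'dry': 0.7, 'rain': 0.3}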
INCREMENTALITY
The Problem
The Solution
Partition the value set into a small number of contiguous
subranges and then treat membership of each subrange as a
categorical variable.
The result is in effect a new ordinal attribute.
This gives a reasonable branching factor, and some of the ordering information has been used.
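For instance, a numeric temperature could be mapped onto a small ordinal attribute like this (the cut points are illustrative assumptions; choosing them well is the topic of the Dougherty, Kohavi and Sahami paper in the readings):

def discretise(value, boundaries):
    # Map a numeric value to an ordinal bin label given ordered cut points.
    # boundaries = [15, 25] gives three contiguous subranges:
    # bin0 (< 15), bin1 (15 to 25) and bin2 (> 25).
    for i, b in enumerate(boundaries):
        if value < b:
            return "bin%d" % i
    return "bin%d" % len(boundaries)

print([discretise(t, [15, 25]) for t in [8, 17, 31]])   # ['bin0', 'bin1', 'bin2']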
ATTRIBUTE VALUE GROUPING
[Diagram: a tree over attributes A5, A2 and A8 in which several values share a branch, e.g. V2,1 on one branch and the group V2,2, V2,3, V2,4 on another, and the group V8,1, V8,2, V8,4 separated from V8,3, leading to N and Y leaves.]
Hill Climbing
The basic decision tree induction algorithm proceeds
using a hill climbing approach.
At every step, a new branch is created for each value of
the best attribute.
There is no backtracking.
Information Gain
Gain(X, A) = I(X) - Σ (|Xv| / |X|) I(Xv)
where the sum is over all values v of attribute A, and Xv is the subset of examples for which A = v.
This criterion has been (and is) widely and successfully used
But it is known to be biased towards attributes with many
values.
Why is it biased?
A many-valued attribute will partition the examples into
many subsets.
The average size of these subsets will be small.
Some of these are likely to contain a high percentage of
one class by chance alone.
Hence the true information gain will be over-estimated.
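The bias is easy to demonstrate on purely random data. In this sketch (all names and the data are illustrative) an 'id' attribute with a distinct value for every example gets the maximum possible gain even though it carries no information about the class:

import random
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * log2(n / len(labels)) for n in counts.values())

def gain(examples, att):
    labels = [cls for _, cls in examples]
    remainder = 0.0
    for v in set(ex[att] for ex, _ in examples):
        subset = [cls for ex, cls in examples if ex[att] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

random.seed(0)
examples = [({"coin": random.choice("ht"), "id": i}, random.choice(["c1", "c2"]))
            for i in range(100)]

print(round(gain(examples, "coin"), 3))  # near zero: the coin flip tells us nothing
print(round(gain(examples, "id"), 3))    # the full class entropy: every singleton subset is pure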
Gain Ratio
GainRatio(X, A) = Gain(X, A) / SplitInf(X, A)
where SplitInf(X, A) = -Σ (|Xv| / |X|) log2(|Xv| / |X|), summed over all values v of A, is the information content of the distribution of the examples over the values of A. SplitInf is largest when the examples are spread over many values in a distribution not far from equiprobable, so many-valued attributes are penalised.
If most of the examples are of the same value then
SplitInf(X,A) will be close to zero, and hence GainRatio(X,A)
may be very large.
The usual solution is to use Gain rather than GainRatio
whenever SplitInf is small.
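A sketch of the correction, including the fallback just described (the 0.1 threshold and the 0.35 gain figure are illustrative assumptions):

from math import log2

def split_info(subset_sizes):
    # Information content of the way the examples are split over the values of A.
    total = sum(subset_sizes)
    return -sum(n / total * log2(n / total) for n in subset_sizes if n > 0)

def gain_ratio(gain, subset_sizes, min_split_info=0.1):
    # Fall back to plain gain whenever SplitInf is very small.
    s = split_info(subset_sizes)
    return gain if s < min_split_info else gain / s

# Attribute A from the earlier 100-example illustration splits them 69 / 31:
print(round(split_info([69, 31]), 3))        # about 0.893
print(round(gain_ratio(0.35, [69, 31]), 3))  # gain divided by SplitInf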
Information Distance
G = Σ p(ci) p(cj), summed over all pairs of classes with i ≠ j
The Evidence
Suppose a decision tree is grown from a training set whose classes have been assigned entirely at random: the induction procedure will still build a tree that fits those examples.
Question
Would the tree be any good as a classifier?
Would it, for example, do better than the simple strategy of
always picking the modal class?
Answer
No.
Note also that if the experiment were repeated with a new set
of random data we would get an entirely different tree.
Questions
Isn't this rather worrying?
Could the same sort of thing be happening with non-random
data?
Answers
Yes and yes.
Eliminating Overfitting
There are two basic ways of preventing overfitting:
Chi-square tests, applied while the tree is grown, to check that a proposed split is statistically significant (a minimal sketch follows below).
Post-pruning of the completed tree.
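A minimal sketch of the chi-square idea (the notes do not spell it out, so the statistic, the table layout and the threshold here are assumptions): compare the class-by-value contingency table produced by a candidate split with what would be expected if attribute and class were independent, and only make the split if the statistic is significant.

def chi_square(table):
    # Chi-square statistic for a class-by-value contingency table (list of rows).
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# The class-by-value table for attribute A in the earlier 100-example illustration
print(round(chi_square([[63, 7], [6, 24]]), 1))
# about 48: far above the 5% critical value of 3.84 for 1 degree of freedom,
# so this split would be accepted.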
Refinements
Substituting Branches
Rather than replacing a subtree with a leaf, it can be replaced
by its most frequently used branch.
More Drastic Transformations
More substantial changes, possibly leading to a structure that
is no longer a tree, have also been used.
One example is the transformation into rule sets in C4.5:
Generate a set of production rules equivalent to the tree
by creating one rule for each path from the root to a leaf.
Generalize each rule by removing any precondition
whose loss does not reduce the accuracy.
This step corresponds to pruning, but note that the
structure may no longer be equivalent to a tree.
An example might match the LHS of more than one
rule.
Sort the rules by their estimated accuracy.
When using the rules for classification, this accuracy is
used for conflict resolution.
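A sketch of the first step (one production rule per root-to-leaf path), reusing the nested-dict tree encoding from earlier; the generalisation and sorting steps are not shown:

def tree_to_rules(tree, conditions=()):
    # Return a list of (preconditions, predicted_class) pairs.
    if not isinstance(tree, dict):                     # a leaf: emit the path so far
        return [(list(conditions), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

tree = {"cloud_cover": {"clear": "dry",
                        "cloudy": "rain",
                        "overcast": {"wind": {"windy": "rain", "calm": "dry"}}}}
for preconditions, cls in tree_to_rules(tree):
    print(preconditions, "->", cls)
# e.g. [('cloud_cover', 'overcast'), ('wind', 'calm')] -> dry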
Suggested Readings
Mitchell, T. M. (1997), Machine Learning, McGraw-Hill. Chapter 3.
Tan, Steinbach & Kumar (2006), Introduction to Data Mining. Chapter 4.
Han & Kamber (2006), Data Mining: Concepts and Techniques. Section 6.3.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984), Classification and Regression Trees. Wadsworth, Pacific Grove, CA. (A thorough treatment of the subject from a more statistical perspective; an essential reference if you are doing research in the area. Usually known as "the CART book".)
Quinlan, J. R. (1986), Induction of Decision Trees. Machine Learning, 1(1), pp 81-106. (A full account of ID3.)
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA. (A complete account of C4.5, the successor to ID3 and the yardstick against which other decision tree induction procedures are usually compared.)
Dougherty, J., Kohavi, R. and Sahami, M. (1995), Supervised and Unsupervised Discretisation of Continuous Features, in Proc. 12th Int. Conf. on Machine Learning, Morgan Kaufmann, Los Altos, CA, pp 194-202. (A good comparative study of different methods for discretising numeric attributes.)
Ho, K. M. and Scott, P. D. (2000), Reducing Decision Tree Fragmentation Through Attribute Value Grouping: A Comparative Study. Intelligent Data Analysis, 6, pp 255-274.
Implementations
An implementation of a decision tree procedure is available as part of
the WEKA suite of data mining programs. It is called J4.8 and closely
resembles C4.5.