Machine Learning: Decision Trees: CS540 Jerry Zhu University of Wisconsin-Madison
Decision Trees
CS540
Jerry Zhu
University of Wisconsin-Madison
[ Some slides from Andrew Moore http://www.cs.cmu.edu/~awm/tutorials and Chuck Dyer, with permission.]
x: The input
These names are the same: example, point, instance, item, input
Usually represented by a feature vector
These names are the same: attribute, feature
For decision trees, we will especially focus on discrete features (though continuous features are possible; see end of slides)
Example: mushrooms
http://www.usask.ca/biology/fungi/
Mushroom features
1. cap-shape: bell=b,conical=c,convex=x,flat=f,
knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,
pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7.
y: The output
These names are the same: label, target, goal
It can be:
Continuous, as in our population prediction → Regression
Discrete, e.g., is this mushroom x edible or poisonous? → Classification
Two mushrooms
x1=x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
y1=p
x2=x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
y2=e
Supervised Learning
Training set: n pairs of example, label: (x1,y1), …, (xn,yn)
A predictor (i.e., hypothesis: classifier, regression function) f: x → y
Hypothesis space: the space of predictors, e.g., the set of d-th order polynomials
Find the best function in the hypothesis space that generalizes well
Performance measure: MSE for regression, accuracy or error rate for classification
Evaluating classifiers
During training
Train a classifier from a training set (x1,y1), (x2,y2), …, (xn,yn).
During testing
For new test data xn+1, …, xn+m, your classifier generates predicted labels y'n+1, …, y'n+m
You need to know the true test labels yn+1, …, yn+m
Test set accuracy: acc = (1/m) Σ_{i=n+1}^{n+m} 1[y_i = y'_i]
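The accuracy formula can be sketched in a few lines of Python; the labels here are made-up placeholders (using the slides' p/e mushroom labels), not real data:

```python
# Test-set accuracy: acc = (1/m) * sum_i 1[y_i == y'_i]
# Hypothetical true and predicted labels for m = 4 test items.
y_true = ['p', 'e', 'e', 'p']
y_pred = ['p', 'e', 'p', 'p']

acc = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp) / len(y_true)
print(acc)  # 0.75  (3 of 4 predictions correct)
```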
Leaves: classify by
majority vote
A bigger decision tree
question: what is the value of horsepower?
question: what is the value of maker?
The full decision tree
1. Do not split when all examples have the same label
[Figure: entropy of a biased coin vs. Jerry's coin]
H(Y | X) = Σ_{v: values of X} Pr(X = v) * H(Y | X = v)
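A direct transcription of this conditional-entropy definition, estimated from a list of (feature value, label) observations. This is a sketch; the `entropy` and `cond_entropy` names are my own, and the data matches the color example on the following slides:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), estimated from a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def cond_entropy(xs, ys):
    """H(Y | X) = sum over values v of X of Pr(X = v) * H(Y | X = v)."""
    n, h = len(xs), 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]
        h += (len(ys_v) / n) * entropy(ys_v)
    return h

# Color/class data as in the slides: 3 red (2 +, 1 -), 1 blue (+), 2 green (-).
colors  = ['red', 'red', 'red', 'blue', 'green', 'green']
classes = ['+',   '+',   '-',   '+',    '-',     '-']
print(round(entropy(classes) - cond_entropy(colors, classes), 2))  # 0.54
```

The printed value is the information gain I(class; color), matching the 0.54 bits computed on the slides.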
Example Color Shape Size Class
H(class)= H(3/6,3/6) = 1
H(class | color)= 3/6 * H(2/3,1/3) + 1/6 * H(1,0) + 2/6 * H(0,1)
3 out of 6 are red; 2 of the 3 red are +.
1 out of 6 is blue; blue is +.
2 out of 6 are green; green is -.
I(class; color) = H(class) - H(class | color) = 0.54 bits
H(class | shape) = 4/6 * H(1/2,1/2) + 2/6 * H(1/2,1/2) = 1
I(class; shape) = H(class) - H(class | shape) = 0 bits
Shape tells us
nothing about
the class!
H(class | size) = 4/6 * H(3/4,1/4) + 2/6 * H(0,1)
I(class; size) = H(class) - H(class | size) = 0.46 bits
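The three worked gains can be checked numerically. Here `H` takes probabilities directly, mirroring the slides' H(3/6, 3/6) notation (the function name is my own):

```python
from math import log2

def H(*ps):
    """Entropy of a distribution given as probabilities, e.g. H(3/6, 3/6)."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Reproduce the slides' three information gains (rounded to 2 decimals):
gain_color = H(3/6, 3/6) - (3/6 * H(2/3, 1/3) + 1/6 * H(1, 0) + 2/6 * H(0, 1))
gain_shape = H(3/6, 3/6) - (4/6 * H(1/2, 1/2) + 2/6 * H(1/2, 1/2))
gain_size  = H(3/6, 3/6) - (4/6 * H(3/4, 1/4) + 2/6 * H(0, 1))

print(round(gain_color, 2))  # 0.54
print(round(gain_shape, 2))  # 0.0
print(round(gain_size, 2))   # 0.46
```

Color has the largest gain, so the top-down algorithm splits on color first.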
a b c d e y
0 0 0 0 0 0
0 0 0 0 1 0
0 0 0 1 0 0
0 0 0 1 1 1
0 0 1 0 0 1
: : : : : :
1 1 1 1 1 1
(32 records)
Overfit a decision tree
y = e, but 25% of the time we corrupt it by setting y = ¬e
The test set is constructed similarly
The corruptions in training and test sets are independent
The training and test sets are the same, except:
Some y's are corrupted in training, but not in test
Some y's are corrupted in test, but not in training
Overfit a decision tree
We build a full tree on the training set
Root
e=0 e=1
On average:
3/4 of the training data is uncorrupted
3/4 of these are uncorrupted in test → correct labels
1/4 of these are corrupted in test → wrong
1/4 of the training data is corrupted
3/4 of these are uncorrupted in test → wrong
1/4 of these are also corrupted in test → correct labels
Test accuracy = 3/4 * 3/4 + 1/4 * 1/4 = 10/16 = 5/8 = 62.5%
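The 62.5% figure follows from the two independent 25% corruption processes: a fully grown tree memorizes the (possibly corrupted) training label, so a test prediction is correct exactly when the training and test copies of that label are both clean or both flipped. A quick arithmetic check:

```python
# 25% independent label noise in training and test, as on the slides.
p_corrupt = 0.25
p_clean = 1 - p_corrupt

# Correct iff both copies are clean, or both are flipped.
acc = p_clean * p_clean + p_corrupt * p_corrupt
print(acc)  # 0.625
```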
Overfit a decision tree
But if we knew a, b, c, d are irrelevant features and don't use them in the tree
Overfit a decision tree
The tree would be
Root
e=0 e=1
H(y | x_i ≥ t?) = p(x_i ≥ t) * H(y | x_i ≥ t) + p(x_i < t) * H(y | x_i < t)
I(y; x_i ≥ t?) = H(y) - H(y | x_i ≥ t?)
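One way to use these formulas for a real-valued feature: try each midpoint between consecutive sorted values as a threshold t and keep the most informative one. A sketch with made-up data (the function names, the small entropy helper, and the data are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) estimated from a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_at_threshold(xs, ys, t):
    """I(y; x >= t?) = H(y) - [p(x>=t) H(y|x>=t) + p(x<t) H(y|x<t)]."""
    hi = [y for x, y in zip(xs, ys) if x >= t]
    lo = [y for x, y in zip(xs, ys) if x < t]
    n = len(ys)
    return entropy(ys) - (len(hi) / n * entropy(hi) + len(lo) / n * entropy(lo))

# Hypothetical real-valued feature; candidate thresholds are the midpoints
# between consecutive sorted values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = ['-', '-', '-', '+', '+', '+']
candidates = [(a + b) / 2 for a, b in zip(sorted(xs), sorted(xs)[1:])]
best = max((gain_at_threshold(xs, ys, t), t) for t in candidates)
print(best)  # (1.0, 3.5): threshold 3.5 separates the classes perfectly
```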
What does the feature space look like?
Axis-parallel cuts
Tree → Rules
Each path from the root to a leaf corresponds to a rule: the decisions along the path form the rule's antecedent, and the classification at the leaf node is its consequent.
For example, from the tree in the color/shape/size
example, we could generate the rule:
if color = red and size = big then +
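The path-to-rule conversion can be sketched with a toy tree. The nested-tuple encoding here is my own, chosen to mirror the color/shape/size example; it is not the slides' data structure:

```python
# A hypothetical learned tree: internal nodes are (feature, branches),
# leaves are class labels.
tree = ('color', {
    'red':   ('size', {'big': '+', 'small': '-'}),
    'blue':  '+',
    'green': '-',
})

def rules(node, conditions=()):
    """Print one if-then rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the path's decisions form the antecedent
        antecedent = ' and '.join(f'{f} = {v}' for f, v in conditions)
        print(f'if {antecedent} then {node}')
        return
    feature, branches = node
    for value, child in branches.items():
        rules(child, conditions + ((feature, value),))

rules(tree)
# first rule printed: if color = red and size = big then +
```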
Conclusions
Decision trees are popular tools for data mining
Easy to understand
Easy to implement
Easy to use
Computationally cheap
Overfitting might happen
We used decision trees for classification (predicting a
categorical output from categorical or real inputs)
What you should know
Trees for classification
Top-down tree construction algorithm
Information gain
Overfitting
Pruning
Real-valued features