2.3 Decision-Tree-Algorithm


Decision Tree

Definition
• A decision tree is a classifier in the form of a tree structure with two types of nodes:
  – Decision node: specifies a choice or test of some attribute, with one branch for each outcome
  – Leaf node: indicates the classification of an example
• Decision tree learning is a recursive partitioning algorithm.
Decision Tree Example 1
Whether to approve a loan

Employed?
  No  → Credit Score?
          High → Approve
          Low  → Reject
  Yes → Income?
          High → Approve
          Low  → Reject
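Applying such a tree is just a cascade of attribute tests from the root down to a leaf. Below is a minimal sketch of the loan tree above; the field names (employed, credit_score, income) and the "High"/"Low" string encoding are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of applying the loan-approval tree above.
# Field names and value encodings are illustrative assumptions.
def approve_loan(employed: bool, credit_score: str, income: str) -> str:
    if not employed:
        # "No" branch: fall back to the applicant's credit score.
        return "Approve" if credit_score == "High" else "Reject"
    else:
        # "Yes" branch: test income instead.
        return "Approve" if income == "High" else "Reject"

print(approve_loan(employed=True, credit_score="Low", income="High"))  # Approve
```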


Issues
• Given some training examples, what decision tree
should be generated?
• One proposal: prefer the smallest tree that is
consistent with the data (Bias)
– the tree with the least depth?
– the tree with the fewest nodes?
• Possible method:
– search the space of decision trees for the smallest decision
tree that fits the data
Example Data
Training Examples:

       Action  Author   Thread  Length  Where
  e1   skips   known    new     long    home
  e2   reads   unknown  new     short   work
  e3   skips   unknown  old     long    work
  e4   skips   known    old     long    home
  e5   reads   known    new     short   home
  e6   skips   known    old     long    work

New Examples:
  e7   ???     known    new     short   work
  e8   ???     unknown  new     short   work
Possible splits
Whole data set: 9 reads, 9 skips.

Split on length:
  long  → 0 reads, 7 skips
  short → 9 reads, 2 skips

Split on thread:
  new → 7 reads, 3 skips
  old → 2 reads, 6 skips
Two Example DTs
Decision Tree for PlayTennis
• Attributes and their values:
– Outlook: Sunny, Overcast, Rain
– Humidity: High, Normal
– Wind: Strong, Weak
– Temperature: Hot, Mild, Cool

• Target concept - Play Tennis: Yes, No


Decision Tree for PlayTennis

Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes
Decision Tree for PlayTennis

Outlook
  Sunny    → Humidity (High → No, Normal → Yes)
  Overcast → Yes
  Rain     → Wind (Strong → No, Weak → Yes)

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification


Decision Tree for PlayTennis

Classifying the example (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak):

Outlook
  Sunny    → Humidity (High → No, Normal → Yes)
  Overcast → Yes
  Rain     → Wind (Strong → No, Weak → Yes)

Outlook = Sunny → test Humidity; Humidity = High → PlayTennis = No
Decision Tree
Decision trees represent disjunctions of conjunctions:

Outlook
  Sunny    → Humidity (High → No, Normal → Yes)
  Overcast → Yes
  Rain     → Wind (Strong → No, Weak → Yes)

PlayTennis = Yes iff
  (Outlook = Sunny ∧ Humidity = Normal)
  ∨ (Outlook = Overcast)
  ∨ (Outlook = Rain ∧ Wind = Weak)
Searching for a good tree
• How should you go about building a decision tree?
• The space of decision trees is too big for systematic search, so build the tree greedily, top-down. At each node, either:

• Stop and
  – return a value for the target feature, or
  – return a distribution over target feature values

• Or choose a test (e.g. an input feature) to split on.
  – For each value of the test, build a subtree for those examples with this value for the test.
Top-Down Induction of Decision Trees (ID3)

1. A ← the “best” decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.

Two questions: 1. Which attribute to test at each node?  2. When to stop?
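A compact sketch of this top-down loop in Python follows. It is a sketch only: the helper names (entropy, info_gain), the list-of-dicts data representation, and the nested-dict tree encoding are my own assumptions, not part of the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from splitting on attribute attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def id3(rows, labels, attrs):
    """rows: list of dicts mapping attribute -> value.
    Returns a nested dict {attr: {value: subtree-or-label}} or a class label."""
    if len(set(labels)) == 1:                 # all examples classified the same
        return labels[0]
    if not attrs:                             # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))      # step 1
    tree = {best: {}}                                                # step 2
    for value in set(r[best] for r in rows):                         # step 3
        keep = [i for i, r in enumerate(rows) if r[best] == value]   # step 4
        tree[best][value] = id3([rows[i] for i in keep],
                                [labels[i] for i in keep],
                                [a for a in attrs if a != best])     # step 5
    return tree
```

Run on the PlayTennis table given later with attrs = ["Outlook", "Temperature", "Humidity", "Wind"], this loop should reproduce the tree derived on the following slides.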
Choices
• When to stop
  – no more input features
  – all examples are classified the same
  – too few examples to make an informative split

• Which test to split on
  – the split that gives the smallest error
  – With multi-valued features:
    • split on all values, or
    • split the values into two subsets (a binary split)
Which Attribute is “best”?

S = [29+,35-]
  A1 = ?   True → [21+, 5-]    False → [8+, 30-]
  A2 = ?   True → [18+, 33-]   False → [11+, 2-]
Principled Criterion
• Selection of an attribute to test at each node -
choosing the most useful attribute for classifying
examples.
• information gain
– measures how well a given attribute separates the training
examples according to their target classification
– This measure is used to select among the candidate
attributes at each step while growing the tree
– Gain is a measure of how much we can reduce
uncertainty (its value lies between 0 and 1)
Entropy
• A measure for
– uncertainty
– purity
– information content
• Information theory: an optimal-length code assigns −log2(p) bits
to a message having probability p
• S is a sample of training examples
  – p+ is the proportion of positive examples in S
  – p− is the proportion of negative examples in S
• Entropy of S: the average optimal number of bits needed to encode
the class of an example drawn from S

  Entropy(S) = p+·(−log2 p+) + p−·(−log2 p−) = −p+ log2 p+ − p− log2 p−
Entropy

• The entropy is 0 if the outcome is “certain”.
• The entropy is maximum (1 bit for two classes) if we have no knowledge
of the system, i.e. every outcome is equally likely.

• S is a sample of training examples
  – p+ is the proportion of positive examples
  – p− is the proportion of negative examples
• Entropy measures the impurity of S:

  Entropy(S) = −p+ log2 p+ − p− log2 p−
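A short sketch of this definition; the function name entropy2 is my own, and the convention 0·log2(0) = 0 is the one noted later on the slides.

```python
import math

def entropy2(p_pos, p_neg):
    """Two-class entropy from the class proportions; 0*log2(0) is taken as 0."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy2(0.5, 0.5))    # 1.0   : maximum uncertainty, both outcomes equally likely
print(entropy2(1.0, 0.0))    # 0     : the outcome is certain
print(entropy2(9/14, 5/14))  # ~0.940: the PlayTennis sample used later
```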
Information Gain
Gain(S,A): expected reduction in entropy due to partitioning S on attribute A

  Gain(S,A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

  Entropy([29+,35-]) = −29/64 log2(29/64) − 35/64 log2(35/64) = 0.99

S = [29+,35-]
  A1 = ?   True → [21+, 5-]    False → [8+, 30-]
  A2 = ?   True → [18+, 33-]   False → [11+, 2-]


Information Gain
  Entropy([21+,5-]) = 0.71     Entropy([18+,33-]) = 0.94
  Entropy([8+,30-]) = 0.74     Entropy([11+,2-]) = 0.62

  Gain(S,A1) = Entropy(S) − (26/64)·Entropy([21+,5-]) − (38/64)·Entropy([8+,30-]) = 0.27
  Gain(S,A2) = Entropy(S) − (51/64)·Entropy([18+,33-]) − (13/64)·Entropy([11+,2-]) = 0.12

S = [29+,35-]
  A1 = ?   True → [21+, 5-]    False → [8+, 30-]
  A2 = ?   True → [18+, 33-]   False → [11+, 2-]

So A1 is the better split.
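These numbers can be checked with a small sketch working directly from the positive/negative counts; the helper names H and gain are my own assumptions.

```python
import math

def H(pos, neg):
    """Two-class entropy from raw counts, with 0*log2(0) = 0."""
    n = pos + neg
    return sum(-c / n * math.log2(c / n) for c in (pos, neg) if c)

def gain(parent, children):
    """Gain(S,A) = Entropy(S) - sum over child nodes of |Sv|/|S| * Entropy(Sv)."""
    n = sum(parent)
    return H(*parent) - sum((p + q) / n * H(p, q) for p, q in children)

print(gain((29, 35), [(21, 5), (8, 30)]))   # Gain(S,A1) ~ 0.27
print(gain((29, 35), [(18, 33), (11, 2)]))  # Gain(S,A2) ~ 0.12
```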
Training Example 1
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Selecting the Next Attribute
S = [9+,5-], Entropy(S) = 0.940

Split on Humidity:
  High   → [3+, 4-], E = 0.985
  Normal → [6+, 1-], E = 0.592
Split on Wind:
  Weak   → [6+, 2-], E = 0.811
  Strong → [3+, 3-], E = 1.0

  Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
  Gain(S, Wind)     = 0.940 − (8/14)·0.811 − (6/14)·1.0   = 0.048
Humidity provides greater info. gain than Wind, w.r.t target classification.
Selecting the Next Attribute
S = [9+,5-], Entropy(S) = 0.940

Split on Outlook:
  Sunny    → [2+, 3-], E = 0.971
  Overcast → [4+, 0-], E = 0.0
  Rain     → [3+, 2-], E = 0.971

  Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S,Outlook) =0.247
• Gain(S,Humidity) =0.151
• Gain(S,Wind) =0.048
• Gain(S,Temperature) =0.029

where S denotes the collection of training examples

Note: 0·log2 0 = 0
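These four values can be reproduced directly from the table; the tuple encoding of the table and the helper names below are my own, not part of the slides.

```python
import math
from collections import Counter

# PlayTennis table D1..D14: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]

def entropy(labels):
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    g = entropy(labels)
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for name, col in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
    print(f"Gain(S,{name}) = {gain(DATA, col):.3f}")
# -> 0.247, 0.029, 0.152 (the slide rounds Humidity to 0.151), 0.048
```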
ID3 Algorithm
[D1,D2,…,D14]  [9+,5-]

Outlook
  Sunny    → Ssunny = [D1,D2,D8,D9,D11]   [2+,3-]  → ?  (needs a further test)
  Overcast → [D3,D7,D12,D13]              [4+,0-]  → Yes
  Rain     → [D4,D5,D6,D10,D14]           [3+,2-]  → ?

Which test for the Sunny node?
  Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0             = 0.970
  Gain(Ssunny, Temp.)    = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
  Gain(Ssunny, Wind)     = 0.970 − (2/5)·1.0 − (3/5)·0.918           = 0.019
ID3 Algorithm
Outlook
  Sunny    → Humidity
               High   → No   [D1,D2,D8]
               Normal → Yes  [D9,D11]
  Overcast → Yes  [D3,D7,D12,D13]
  Rain     → Wind
               Strong → No   [D6,D14]
               Weak   → Yes  [D4,D5,D10]
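Once the tree is built, classifying a new example is a walk from the root to a leaf. A sketch using the nested-dict encoding from the id3() sketch earlier (the encoding and names are my assumptions):

```python
# The finished PlayTennis tree, written as the structure id3() above would
# produce: attribute -> {value: subtree or class label}.
TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No",   "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind":     {"Strong": "No", "Weak":   "Yes"}},
}}

def classify(tree, example):
    """Follow the example's attribute values from the root down to a leaf."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][example[attr]]
    return tree

print(classify(TREE, {"Outlook": "Sunny", "Temperature": "Hot",
                      "Humidity": "High", "Wind": "Weak"}))   # -> No
```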


Triangles and Squares: Example 2

   #   Color    Outline  Dot   Shape
   1   green    dashed   no    triangle
   2   green    dashed   yes   triangle
   3   yellow   dashed   no    square
   4   red      dashed   no    square
   5   red      solid    no    square
   6   red      solid    yes   triangle
   7   green    solid    no    square
   8   green    dashed   no    triangle
   9   yellow   solid    yes   square
  10   red      solid    no    square
  11   green    solid    yes   square
  12   yellow   dashed   yes   square
  13   yellow   solid    no    square
  14   red      dashed   yes   triangle
Triangles and Squares
Data set: a set of classified objects (the 14 examples above, each labelled
triangle or square).
Entropy
• 5 triangles, 9 squares
• class probabilities:  p(triangle) = 5/14,  p(square) = 9/14
• entropy:  E = −(5/14)·log2(5/14) − (9/14)·log2(9/14) ≈ 0.940 bits
Entropy reduction by data set partitioning

Splitting on Color partitions the data set into three subsets:
  red    → 2 triangles, 3 squares
  green  → 3 triangles, 2 squares
  yellow → 0 triangles, 4 squares
Each subset is purer (has lower entropy) than the whole data set.
Information Gain

Gain(S, Color) = Entropy(S) − Σ_{v ∈ {red, green, yellow}} (|Sv| / |S|) · Entropy(Sv)
(the weighted average entropy of the three Color subsets, subtracted from the
entropy of the whole data set)
Information Gain of the Attributes
• Attributes
  – Gain(Color)   = 0.246
  – Gain(Outline) = 0.151
  – Gain(Dot)     = 0.048
• Heuristic: the attribute with the highest gain is chosen (here, Color)
• This heuristic is local (local minimization of impurity)
After splitting on Color, the algorithm recurses on the impure subsets:

Gains within the red subset (2 triangles, 3 squares; entropy 0.971):
  Gain(Outline) = 0.971 − 0.951 = 0.020 bits
  Gain(Dot)     = 0.971 − 0     = 0.971 bits

Gains within the green subset (3 triangles, 2 squares; entropy 0.971):
  Gain(Outline) = 0.971 − 0     = 0.971 bits
  Gain(Dot)     = 0.971 − 0.951 = 0.020 bits

So the red branch is split on Dot, the green branch on Outline, and the yellow
branch is already pure (all squares).
Decision Tree

Color
  red    → Dot?
             yes → triangle
             no  → square
  green  → Outline?
             dashed → triangle
             solid  → square
  yellow → square


Splitting Rule: GINI Index
• GINI index: a measure of node impurity

  GINI(Node) = 1 − Σ_{c ∈ classes} [p(c)]²

  GINIsplit(A) = Σ_{v ∈ Values(A)} (|Sv| / |S|) · GINI(Nv)
Splitting Based on Continuous Attributes

(i) Binary split:     Taxable Income > 80K?  → Yes / No
(ii) Multi-way split: Taxable Income?        → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K


Continuous Attribute – Binary Split
• For a continuous attribute A:
  – Partition the continuous values of A into a discrete set of intervals
  – Create a new boolean attribute Ac by looking for a threshold c:

      Ac = true   if A ≥ c
      Ac = false  otherwise

• How to choose c?
  – Consider all possible splits and pick the best cut (e.g. the one that
    minimizes the weighted entropy, as sketched below).
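A brute-force sketch of this threshold search. The slide only says "best cut"; the weighted-entropy criterion, the function names, and the toy numbers are my own assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try every candidate cut c (midpoints between consecutive distinct values)
    and keep the one whose binary split A >= c gives the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best_c, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2
        left  = [lab for v, lab in pairs if v <  c]
        right = [lab for v, lab in pairs if v >= c]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_c, best_score = c, score
    return best_c

# Toy, made-up incomes (in K) and labels, purely for illustration:
print(best_threshold([60, 75, 85, 90, 95, 110, 120],
                     ["No", "No", "Yes", "Yes", "Yes", "No", "No"]))  # -> 80.0
```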
Strengths of decision tree methods
• Ability to generate understandable rules
• Ease of calculation at classification time
• Ability to handle both continuous and categorical variables
• Ability to clearly indicate the best attributes


The weaknesses of decision tree methods
• Greedy algorithm: no global optimization

• Error-prone with too many classes: the number of training examples per node
  shrinks quickly in a tree with many levels/branches.

• Expensive to train: sorting, combinations of attributes, calculating quality
  measures, etc.

• Trouble with non-rectangular regions: the rectangular classification boxes
  may not correspond well with the actual distribution of records in the
  decision space.
Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification
