09 - ML - Decision Tree
Decision Tree Induction: An Example

Training data set (Buys_computer):

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31..40 | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31..40 | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31..40 | medium | no      | excellent     | yes
31..40 | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no

Resulting tree: the root tests age?, with branches <=30, 31..40, and >40. The
31..40 branch is a pure "yes" leaf; the <=30 and >40 branches are split
further before reaching their "no"/"yes" leaves.
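The class counts behind the age split above can be checked in code. A minimal sketch in pure Python, with the data set transcribed from the training table (variable names are illustrative); it shows why age is a good root split: the 31..40 partition is already pure.

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer), transcribed from the table
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

# Class distribution of buys_computer within each age partition
by_age = {}
for age, _, _, _, label in data:
    by_age.setdefault(age, Counter())[label] += 1

for age, counts in sorted(by_age.items()):
    print(age, dict(counts))
```

The 31..40 group comes out all "yes", while <=30 and >40 remain mixed and need further splits.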
Example of a Decision Tree

[Figure: a labelled training set (columns Tid, Refund, Marital Status, Taxable
Income, Cheat; e.g. Tid 6: No, Medium, 60K, No) is used to induce a model; the
splitting attributes become the internal nodes of the tree. The model is then
applied to a test set whose Cheat labels are unknown (e.g. Tid 11: No, Small,
55K, ?; Tid 15: No, Large, 67K, ?).]
Apply Model to Test Data

Start from the root of the tree. The test record is:

Refund | Marital Status | Taxable Income | Cheat
No     | Married        | 80K            | ?

The tree:
Refund? — Yes -> leaf NO; No -> MarSt?
MarSt? — Married -> leaf NO; Single or Divorced -> TaxInc?
TaxInc? — < 80K -> leaf NO; >= 80K -> leaf YES

Routing the record: Refund = No, so take the MarSt branch; Marital Status =
Married, so we reach the NO leaf. Assign Cheat to "No".
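The traversal above can be sketched in code. A minimal sketch in pure Python; the dict keys and the >= 80K convention at the income test are assumptions chosen here for illustration.

```python
def classify(record):
    """Walk the Refund -> MarSt -> TaxInc tree for one record.

    `record` is a dict with keys 'refund', 'marital', 'income'
    (field names chosen for this sketch).
    """
    if record["refund"] == "Yes":
        return "No"                    # Refund = Yes -> leaf NO
    if record["marital"] == "Married":
        return "No"                    # MarSt = Married -> leaf NO
    # MarSt = Single or Divorced: test Taxable Income
    return "Yes" if record["income"] >= 80_000 else "No"

# The test record from the slide: Refund = No, Married, Taxable Income = 80K
print(classify({"refund": "No", "marital": "Married", "income": 80_000}))  # -> No
```

The record never reaches the income test: it is routed to NO at the MarSt node, matching the slide's conclusion.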
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm): the tree is constructed in a top-down,
recursive, divide-and-conquer manner.
Decision Tree Induction

Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
SLIQ, SPRINT

General Structure (Hunt's Algorithm)

Let Dt be the set of training records that reach a node t; at the root, Dt is
the full training set (Tid, Refund, Marital Status, Taxable Income, Cheat;
e.g. Tid 1: Yes, Single, 125K, No; Tid 2: No, Married, 100K, No).

General procedure:
If all records in Dt belong to the same class yt, then t is a leaf node
labelled yt (e.g. "Cheat" or "Don't Cheat").
Otherwise, use an attribute test to split Dt into smaller subsets and apply
the procedure recursively to each subset.
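The general procedure above can be sketched as a recursive function. A minimal sketch in pure Python, assuming categorical attributes and using the Gini index (introduced later in this deck) as the impurity measure for the greedy attribute choice; the record format and function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, attributes):
    """records: list of (features_dict, label) pairs; returns a nested-dict tree."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1:              # Dt is pure: leaf labelled yt
        return labels[0]
    if not attributes:                     # no tests left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def weighted_impurity(attr):           # impurity of the split induced by attr
        groups = {}
        for x, y in records:
            groups.setdefault(x[attr], []).append(y)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    best = min(attributes, key=weighted_impurity)   # greedy attribute choice
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({x[best] for x, _ in records}):
        subset = [(x, y) for x, y in records if x[best] == value]
        tree[best][value] = hunt(subset, rest)      # recurse on each subset
    return tree

# Toy run on two records in the style of the Cheat example
recs = [({"Refund": "Yes"}, "Don't Cheat"), ({"Refund": "No"}, "Cheat")]
print(hunt(recs, ["Refund"]))
```

The nested dict mirrors the slide's picture: internal nodes are attribute tests, leaves are class labels.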
Tree Induction

Greedy strategy: split the records based on an attribute test that optimizes a
chosen criterion.

Issues: determine how to split the records. This depends on the attribute type
(nominal, ordinal, continuous) and on the number of ways to split (binary or
multi-way split).
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as the attribute has distinct values.
Binary split: divide the values into two subsets. What about a split such as
{Small, Large} vs. {Medium} for Size? For an ordinal attribute it violates the
natural order Small < Medium < Large, so it is usually not allowed.
Splitting Based on Continuous Attributes

Two ways to handle a continuous attribute (such as age or Taxable Income):

Multi-way split: discretize into ranges, e.g. Taxable Income? with branches
< 10K, ..., > 80K (many branches).
Binary decision: (A < v) or (A >= v), e.g. Taxable Income > 80K? with branches
Yes / No. Consider all possible splits and find the best cut; this can be more
compute-intensive.
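The "consider all possible splits and find the best cut" step can be sketched by brute force: sort the distinct values, take the midpoints between neighbours as candidate thresholds v, and keep the (A < v) split with the lowest weighted Gini index. Pure Python; the toy income/label lists are illustrative, not from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_cut(values, labels):
    """Try every midpoint between consecutive distinct sorted values."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best_v, best_score = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2                       # candidate threshold
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]               # in K, toy data
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_cut(incomes, cheat))  # -> (97.5, 0.3)
```

With n distinct values there are only n - 1 candidate cuts to test, but doing this at every node for every continuous attribute is what makes the binary-decision approach compute-intensive.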
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:

    Info(D) = - Σ_{i=1}^{m} pi log2(pi)

Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
Examples for computing Entropy

    Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

Node with C1 = 1, C2 = 5: p(C1) = 1/6, p(C2) = 5/6,
    Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

Node with C1 = 2, C2 = 4: p(C1) = 2/6, p(C2) = 4/6,
    Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Information Gain: the buys_computer Example

The training table has 9 "yes" and 5 "no" tuples, so

    Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Computing Info_A(D) for each attribute and subtracting it from Info(D):

    Gain(age) = 0.246
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

age has the highest information gain, so it is selected as the splitting
attribute at the root.
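These gains can be reproduced numerically. A minimal sketch in pure Python, with the data set transcribed from the training table; the printed values agree with the slide's gains up to rounding in the last digit.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer) -- the training table
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def entropy(labels):
    """Info(D) = -sum p_i log2 p_i over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(col):
    """Gain(A) = Info(D) - Info_A(D) for the attribute in column `col`."""
    groups = {}
    for row in data:
        groups.setdefault(row[col], []).append(row[-1])
    info_a = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in data]) - info_a

for name, col in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
    print(name, round(gain(col), 3))
```

age comes out on top, which is why the example tree earlier in the deck splits on age? at the root.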
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index gini(D) is
defined as

    gini(D) = 1 - Σ_{j=1}^{n} pj²

where pj is the relative frequency of class j in D.

If D is split on attribute A into two subsets D1 and D2:

    gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
Computation of Gini Index

Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":

    gini(D) = 1 - (9/14)² - (5/14)² = 0.459
Examples for computing GINI

    GINI(t) = 1 - Σ_j [p(j|t)]²

Node with C1 = 1, C2 = 5: p(C1) = 1/6, p(C2) = 5/6,
    GINI = 1 - (1/6)² - (5/6)² = 0.278

Node with C1 = 2, C2 = 4: p(C1) = 2/6, p(C2) = 4/6,
    GINI = 1 - (2/6)² - (4/6)² = 0.444
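The Gini values above can be checked in a few lines of pure Python:

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the labels at node t."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(round(gini(["C1"] * 1 + ["C2"] * 5), 3))   # node with C1=1, C2=5 -> 0.278
print(round(gini(["C1"] * 2 + ["C2"] * 4), 3))   # node with C1=2, C2=4 -> 0.444
print(round(gini(["yes"] * 9 + ["no"] * 5), 3))  # buys_computer data  -> 0.459
```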
Three in One

[Figure: the impurity measures plotted together for a two-class node as a
function of the fraction of records in one class.]
Tree Induction

Greedy strategy: split the records based on an attribute test that optimizes a
chosen criterion.

Issues: determine how to split the records.

Limitations: if the training data changes, the tree has to be computed again,
and accuracy may not be good for large data sets.

Further reading: https://www.cs.waikato.ac.nz/ml/weka/book.html
Examples

Search "Data Mining Lecture -- Decision Tree | Solved Example (Eng-Hindi)"
URL: https://www.youtube.com/watch?v=cKl7WV_EKDU
More Examples

Additional material for learning:
http://www.otnira.com/2013/03/17/decision-tree-induction/
DT Classification Task (optional)

[Figure: the decision-tree classification workflow. A training set (Tid,
Attrib1, Attrib2, Attrib3, Class — e.g. 1: Yes, Large, 125K, No; 2: No,
Medium, 100K, No; 3: No, Small, 70K, No; 6: No, Medium, 60K, No) is fed to a
tree induction algorithm to learn a model (the decision tree). The model is
then applied to a test set with unknown class labels (e.g. 11: No, Small,
55K, ?; 15: No, Large, 67K, ?).]

Everyone can have a decision tree in mind for every task.