Decision Tree


Decision Tree Classification


• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Example
Decision Tree for Classification

• Let’s consider a tuple X for which the associated class label is unknown

• The attribute values of that tuple are tested against the decision tree

• A path is traced from the root to a leaf node, which holds the class prediction for the given tuple
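
As an illustration of this traversal, here is a minimal Python sketch, assuming a simple Node class with an attribute test at internal nodes and a class label at leaves; the class layout and function names are illustrative assumptions, not taken from the slides.

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at an internal node
        self.branches = branches or {}  # outcome value -> child Node
        self.label = label              # class label if this node is a leaf

def classify(node, tuple_x):
    """Trace a path from the root to a leaf; the leaf holds the class prediction."""
    while node.label is None:              # stop when a leaf node is reached
        outcome = tuple_x[node.attribute]  # test the attribute value of the tuple
        node = node.branches[outcome]      # follow the branch for that outcome
    return node.label

# Example: a one-level tree that tests the attribute 'student'
leaf_yes, leaf_no = Node(label="yes"), Node(label="no")
root = Node(attribute="student", branches={"yes": leaf_yes, "no": leaf_no})
print(classify(root, {"student": "yes"}))  # -> "yes"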
Decision Tree Induction
• During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser)

• Quinlan later presented C4.5 (a successor of ID3)

• ID3 and C4.5 adopt a greedy approach in which decision trees are constructed in a top-down
recursive divide-and-conquer manner

• These approaches start with a training set of tuples and their associated class labels

• The training set is recursively partitioned into smaller subsets as the tree is being built
Decision Tree Algorithm
Strategy Behind the Algorithm
• The algorithm is called with three parameters: D, attribute list, and Attribute selection method.

• The tree starts as a single node, N, representing the training tuples in D

• The algorithm calls Attribute selection method to determine the splitting criterion.

• The splitting criterion tells us which attribute to test at node N by determining the “best” way to
separate or partition the tuples in D into individual classes

• The splitting criterion indicates the splitting attribute and may also indicate either a split-point or a
splitting subset.

• The node N is labeled with the splitting criterion, which serves as a test at the node

• A branch is grown from node N for each of the outcomes of the splitting criterion.

• If A is the splitting attribute and A has v distinct values, {a1, a2, ..., av}, based on the training data,
then there are three possibilities (a sketch of the recursive construction follows the list below):
1. A is discrete-valued
2. A is continuous-valued
3. A is discrete-valued and a binary tree must be produced
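
To make the strategy concrete, the following is a minimal Python sketch of the top-down, recursive divide-and-conquer construction for the discrete-valued case (possibility 1). The dictionary-based node representation, the function names, and the majority-vote stopping rule are illustrative assumptions rather than the exact algorithm wording.

from collections import Counter

def generate_tree(D, attribute_list, select_attribute):
    """D: list of (tuple_dict, class_label) pairs.
    select_attribute: the attribute selection method (e.g. information gain),
    assumed to return the 'best' splitting attribute for D."""
    labels = [label for _, label in D]
    # If all tuples in D are of the same class, return a leaf with that class
    if len(set(labels)) == 1:
        return {"label": labels[0]}
    # If no attributes remain, return a leaf labeled with the majority class
    if not attribute_list:
        return {"label": Counter(labels).most_common(1)[0][0]}
    # Otherwise choose the splitting attribute and grow one branch per outcome
    A = select_attribute(D, attribute_list)
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]
    for value in {x[A] for x, _ in D}:
        Dj = [(x, y) for x, y in D if x[A] == value]   # partition of D with outcome value
        node["branches"][value] = generate_tree(Dj, remaining, select_attribute)
    return node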
How to determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity. For example, comparing two candidate nodes:
– Node with C0: 5, C1: 5 → non-homogeneous, high degree of impurity
– Node with C0: 9, C1: 1 → homogeneous, low degree of impurity
Measures of Node Impurity
• Entropy
• Gini Index
• Misclassification error
Examples for computing Entropy
Entropy(t) = − Σj p(j | t) log2 p(j | t)

• Node with C1 = 0, C2 = 6:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

• Node with C1 = 1, C2 = 5:
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

• Node with C1 = 2, C2 = 4:
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
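
A minimal Python sketch that reproduces the three entropy values above from the class counts; the function name is an assumption made for illustration.

import math

def entropy(counts):
    """Entropy of a node from its class counts: -sum p_j * log2(p_j),
    treating 0 * log2(0) as 0."""
    total = sum(counts)
    probs = [c / total for c in counts]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([0, 6]), 2))  # 0.0  -> pure node
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92 -> close to the two-class maximum of 1.0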
Attribute Selection Measure
• An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a
given data partition, D, of class-labeled training tuples into individual classes

• If we were to split D into smaller partitions according to the outcomes of the splitting criterion,
ideally each partition would be pure, i.e. all of the tuples that fall into a given partition would belong
to the same class

• There are three popular attribute selection measures—information gain, gain ratio, and gini index.

• Let D, the data partition, be a training set of class-labeled tuples, and suppose that the class label
attribute has m distinct values defining m distinct classes, Ci (for i = 1, ..., m)

• Let Ci,D be the set of tuples of class Ci in D; then |D| and |Ci,D| denote the number of tuples in D and
Ci,D, respectively
Information Gain as a Splitting Criteria
• ID3 uses information gain as its attribute selection measure

• The attribute with the highest information gain is chosen as the splitting attribute for a node N

• The expected information needed to classify a tuple in D is given by

Info(D) = − Σ(i=1 to m) pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated
by |Ci,D| / |D|

• Info(D) is also known as the entropy of D


Information Gain as a Splitting Criteria
• Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1,
a2, ..., av}, as observed from the training data

• Attribute A can be used to split D into v partitions or subsets, {D1, D2, ..., Dv}, where Dj contains
those tuples in D that have outcome aj of A

• Ideally, we would like this partitioning to produce an exact classification of the tuples

• How much more information would we still need (after the partitioning) in order to arrive at an exact
classification?

• This amount is measured by

InfoA(D) = Σ(j=1 to v) (|Dj| / |D|) × Info(Dj)


Information Gain as a Splitting Criteria
• Information gain is defined as the difference between the original information requirement (i.e.,
based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on
A)

• Gain(A) = Info(D) - InfoA(D)

• Gain(A) tells us how much would be gained by branching on A

• The attribute A with the highest information gain, (Gain(A)), is chosen as the splitting attribute at
node N

• This is equivalent to saying that we want to partition on the attribute A that would do the “best
classification,” so that the amount of information still required to finish classifying the tuples is
minimal (i.e., minimum InfoA(D))

• The attribute that gives the highest information gain is picked as the splitting attribute; this
increases the purity of the resulting partitions (a short sketch of the computation follows)
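
A minimal Python sketch of this criterion, assuming tuples are represented as dictionaries paired with class labels; the helper names info and gain are illustrative, not from the slides.

import math
from collections import Counter, defaultdict

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(D, A):
    """Gain(A) = Info(D) - InfoA(D) for a discrete-valued attribute A.
    D is a list of (tuple_dict, class_label) pairs."""
    labels = [y for _, y in D]
    partitions = defaultdict(list)            # value of A -> labels of subset Dj
    for x, y in D:
        partitions[x[A]].append(y)
    info_A = sum(len(Dj) / len(D) * info(Dj) for Dj in partitions.values())
    return info(labels) - info_A

# Tiny usage example with a hypothetical attribute 'student'
D = [({"student": "yes"}, "yes"), ({"student": "no"}, "no"),
     ({"student": "yes"}, "yes"), ({"student": "no"}, "yes")]
print(gain(D, "student"))  # ~0.31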
Splitting Based on INFO...

• Information Gain:

GAIN_split = Entropy(p) − Σ(i=1 to k) (ni / n) Entropy(i)

where the parent node p is split into k partitions and ni is the number of records in partition i
– Measures Reduction in Entropy achieved because of the
split. Choose the split that achieves most reduction
(maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large
number of partitions, each being small but pure.
Example
Induction of a decision tree using
Information Gain

• The class label attribute, buys_computer, has two distinct values (namely, {yes, no})

• therefore, there are two distinct classes (that is, m = 2)

• There are nine tuples of class yes and five tuples of class no

• The expected information needed to classify a tuple in D is computed as follows:

Info(D) = − (9/14) log2 (9/14) − (5/14) log2 (5/14) = 0.940 bits


Example contd..
• Let’s find the expected information requirement for the attribute ‘age’

• For the age category youth, there are two yes tuples and three no tuples

• For the category middle aged, there are four yes tuples and zero no tuples

• For the category senior, there are three yes tuples and two no tuples

• Therefore, Infoage(D) = (5/14) × (−(2/5) log2 (2/5) − (3/5) log2 (3/5)) + (4/14) × 0
+ (5/14) × (−(3/5) log2 (3/5) − (2/5) log2 (2/5)) = 0.694 bits, and
Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits
Example contd..

• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit rating) = 0.048 bits

• Since age has the highest information gain among the attributes, it is selected as the
splitting attribute
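
The figures above can be reproduced with a short sketch that works directly from the stated class counts (9 yes / 5 no overall; youth 2/3, middle-aged 4/0, senior 3/2); the helper name info is an assumption made for illustration.

import math

def info(counts):
    """Expected information (entropy) from class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Class counts from the example: 9 'yes' and 5 'no' tuples overall
info_D = info([9, 5])

# (yes, no) counts per 'age' category: youth, middle-aged, senior
age_partitions = [(2, 3), (4, 0), (3, 2)]
info_age = sum((y + n) / 14 * info([y, n]) for y, n in age_partitions)

gain_age = info_D - info_age
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
# Info(D) ≈ 0.940, Infoage(D) ≈ 0.694, Gain(age) ≈ 0.247
# (the slides' 0.246 comes from rounding Info(D) and Infoage(D) before subtracting)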
Information Gain of Continuous valued
attribute
• For such a scenario, we must determine the “best” split-point for A, where the split-point is a
threshold on A

• Sort the values of A in increasing order

• Typically, the midpoint between each pair of adjacent values is considered as a possible split-point

• Therefore, given v values of A, then v – 1 possible splits are evaluated

• For example, the midpoint between the values ai and ai+1 of A is (ai+ai+1 ) / 2

• For each possible split-point for A, we evaluate InfoA(D), where the number of partitions is two, i.e.,
v = 2 (or j = 1, 2)

• The point with the minimum expected information requirement for A is selected as the split point
for A

• D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A >
split_point
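
A minimal sketch of this procedure, assuming the attribute values and class labels are given as parallel lists; it evaluates the v − 1 midpoints between adjacent sorted values and returns the one with the minimum expected information requirement. Function names are illustrative.

import math

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum(labels.count(c) / total * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    """Evaluate the v-1 midpoints between adjacent sorted values of a
    continuous attribute A and return the one minimizing InfoA(D)."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2     # midpoint (ai + ai+1) / 2
        D1 = [y for v, y in pairs if v <= split]        # tuples with A <= split_point
        D2 = [y for v, y in pairs if v > split]         # tuples with A >  split_point
        info_A = (len(D1) * info(D1) + len(D2) * info(D2)) / len(pairs)
        if best is None or info_A < best[1]:
            best = (split, info_A)
    return best  # (split_point, expected information requirement)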
Continuous Attributes: Computing Gini Index

• Use Binary Decisions based on one value
• Several Choices for the splitting value
– Number of possible splitting values = Number of distinct values
• Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally Inefficient! Repetition of work.

Example data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1, Yes, Single, 125K, No
2, No, Married, 100K, No
3, No, Single, 70K, No
4, Yes, Married, 120K, No
5, No, Divorced, 95K, Yes
6, No, Married, 60K, No
7, Yes, Divorced, 220K, No
8, No, Single, 85K, Yes
9, No, Married, 75K, No
10, No, Single, 90K, Yes

One candidate binary test on this attribute: Taxable Income > 80K? (Yes / No)
Continuous Attributes: Computing Gini Index...

• For efficient computation: for each attribute,


– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and
computing gini index
– Choose the split position that has the least gini index

Sorted tuples (Taxable Income with Cheat label):

Cheat           No   No   No   Yes  Yes  Yes  No   No   No   No
Taxable Income  60   70   75   85   90   95   100  120  125  220

Candidate split positions, class counts (<=, >), and Gini index at each position:

Split  55     65     72     80     87     92     97     110    122    172    230
Yes    0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
No     0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
Gini   0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The split position 97 has the least Gini index (0.300) and is therefore chosen.
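
A minimal Python sketch of this linear scan, assuming the example data above: the tuples are sorted once by Taxable Income and the two count matrices are updated incrementally as the scan passes each candidate split position. Function and variable names are illustrative.

def gini(counts):
    """Gini index from class counts: 1 - sum(p_i^2); 0 for an empty partition."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def best_gini_split(values, labels, positions):
    """Single pass over the value-sorted tuples: for each candidate split
    position, move the tuples at or below it from the '>' side to the '<='
    side, updating the two count matrices incrementally."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    le = {"Yes": 0, "No": 0}                          # counts for A <= split
    gt = {"Yes": labels.count("Yes"),
          "No": labels.count("No")}                   # counts for A >  split
    i, best = 0, None
    for split in sorted(positions):
        while i < n and pairs[i][0] <= split:         # shift tuples across the boundary
            le[pairs[i][1]] += 1
            gt[pairs[i][1]] -= 1
            i += 1
        n_le, n_gt = sum(le.values()), sum(gt.values())
        weighted = (n_le * gini(list(le.values())) +
                    n_gt * gini(list(gt.values()))) / n
        if best is None or weighted < best[1]:
            best = (split, round(weighted, 3))
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
splits = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]
print(best_gini_split(income, cheat, splits))         # -> (97, 0.3)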
