Decision Tree
• Let’s consider a tuple X for which the associated class label is unknown
• The attribute values of that tuple are tested against the decision tree
• A path is traced from the root to a leaf node, which holds the class prediction for the given tuple
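As a minimal sketch of this classification step (not code from the source; the Node structure and the attribute names are hypothetical), a tuple can be routed from the root to a leaf like this:

class Node:
    """A decision-tree node: internal nodes test an attribute, leaves hold a class label."""
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute       # attribute tested at this node (None for a leaf)
        self.branches = branches or {}   # outcome value -> child Node
        self.label = label               # class prediction (meaningful only at a leaf)

def classify(node, tuple_x):
    """Trace a path from the root to a leaf; the leaf holds the class prediction."""
    while node.branches:                  # stop when a leaf is reached
        outcome = tuple_x[node.attribute] # test the tuple's value for this attribute
        node = node.branches[outcome]     # follow the branch matching that outcome
    return node.label

# Hypothetical two-level tree for the buys_computer example used later in these notes
root = Node(attribute="student",
            branches={"yes": Node(label="yes"), "no": Node(label="no")})
print(classify(root, {"student": "no", "age": "youth"}))   # -> "no"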
Decision Tree Induction
• During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser)
• ID3 and C4.5 adopt a greedy approach in which decision trees are constructed in a top-down
recursive divide-and-conquer manner
• These approaches start with a training set of tuples and their associated class labels
• The training set is recursively partitioned into smaller subsets as the tree is being built
Decision Tree Algorithm
Strategy Behind the Algorithm
• The algorithm is called with three parameters: D, the attribute list, and the attribute selection method.
• The algorithm calls the attribute selection method to determine the splitting criterion.
• The splitting criterion tells us which attribute to test at node N by determining the “best” way to
separate or partition the tuples in D into individual classes
• The splitting criterion indicates the splitting attribute and may also indicate either a split-point or a
splitting subset.
• The node N is labeled with the splitting criterion, which serves as a test at the node
• A branch is grown from node N for each of the outcomes of the splitting criterion.
• Let A be the splitting attribute. If A has v distinct values, {a1, a2, …, av}, based on the training data,
then there are three possibilities (a sketch of the overall induction procedure follows this list):
1. A is discrete-valued
2. A is continuous-valued
3. A is discrete-valued and a binary tree must be produced
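The following is a minimal sketch of this top-down, recursive, divide-and-conquer strategy; it is not the source's pseudocode, the helper names (majority_class, select_attribute) are my own, and for simplicity it handles only discrete-valued attributes (possibility 1 above). It reuses the Node class from the earlier classification sketch.

from collections import Counter

def majority_class(D):
    # D is a list of (attribute-dict, class-label) pairs; return the most common label
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, select_attribute):
    labels = {label for _, label in D}
    if len(labels) == 1:                      # all tuples in D belong to the same class
        return Node(label=labels.pop())
    if not attribute_list:                    # no attributes left to test: majority voting
        return Node(label=majority_class(D))
    A = select_attribute(D, attribute_list)   # attribute selection method picks the "best" split
    node = Node(attribute=A)
    remaining = [a for a in attribute_list if a != A]
    for value in {x[A] for x, _ in D}:        # grow one branch per outcome of the test on A
        Dj = [(x, label) for x, label in D if x[A] == value]
        node.branches[value] = generate_decision_tree(Dj, remaining, select_attribute)
    return node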
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution are
preferred
• Need a measure of node impurity:
[Figure: two example nodes. A node with class counts C0: 5 and C1: 5 is non-homogeneous (high degree of impurity); a node with class counts C0: 9 and C1: 1 is homogeneous (low degree of impurity).]
Measures of Node Impurity
• Entropy
• Gini Index
• Misclassification error
Examples for computing Entropy
Entropy(t) = − Σj p( j | t) log2 p( j | t)
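As a small worked example (my own, not from the source), the following computes the entropy of the two node class distributions shown earlier, alongside the Gini index and misclassification error listed among the impurity measures:

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def misclassification_error(counts):
    return 1.0 - max(counts) / sum(counts)

for counts in [(5, 5), (9, 1)]:   # (C0, C1) counts of the two example nodes
    print(counts, round(entropy(counts), 3), round(gini(counts), 3),
          round(misclassification_error(counts), 3))
# (5, 5) -> entropy 1.0,   Gini 0.5,  error 0.5   (non-homogeneous, high impurity)
# (9, 1) -> entropy 0.469, Gini 0.18, error 0.1   (homogeneous, low impurity)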
• If we were to split D into smaller partitions according to the outcomes of the splitting criterion,
ideally each partition would be pure, i.e. all of the tuples that fall into a given partition would belong
to the same class
• There are three popular attribute selection measures: information gain, gain ratio, and the Gini index.
• Let D, the data partition, be a training set of class-labeled tuples, and suppose that the class label
attribute has m distinct values defining m distinct classes, Ci (for i = 1, …, m)
• Let Ci,D be the set of tuples of class Ci in D; then |D| and |Ci,D| denote the number of tuples in D and
Ci,D, respectively
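As a tiny illustration of this notation (the toy data below is mine, not from the source), |D| and the class counts |Ci,D| can be read directly off a list of class-labeled tuples:

from collections import Counter

# D: a toy class-labeled training partition; the class label attribute has m = 2 values
D = [({"age": "youth"}, "yes"), ({"age": "senior"}, "no"), ({"age": "youth"}, "yes")]

class_counts = Counter(label for _, label in D)   # |Ci,D| for each class Ci
print(len(D), class_counts)                       # |D| = 3, Counter({'yes': 2, 'no': 1})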
Information Gain as a Splitting Criterion
• ID3 uses information gain as its attribute selection measure
• The attribute with the highest information gain is chosen as the splitting attribute for a node N
• The expected information needed to classify a tuple in D, Info(D), is defined in terms of pi, the
probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D| (the formulas are
collected below)
• Attribute A can be used to split D into v partitions or subsets, {D1, D2, …, Dv}, where Dj contains
those tuples in D that have outcome aj of A
• Ideally, we would like this partitioning to produce an exact classification of the tuples
• How much more information would we still need (after the partitioning) in order to arrive at an exact
classification?
• The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at
node N
• This is equivalent to saying that we want to partition on the attribute A that would do the “best
classification,” so that the amount of information still required to finish classifying the tuples is
minimal (i.e., minimum InfoA(D))
• Picking the attribute with the highest information gain thus increases the purity of the resulting
partitions
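The Info and Gain quantities referenced above are not reproduced in the extracted text; the LaTeX below restates the standard definitions used by ID3, in the same notation as these notes:

\[
\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j), \qquad
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)
\]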
Splitting Based on INFO...
• Information Gain:
GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)
(the parent node p is split into k partitions; n_i is the number of records in partition i, and n the total)
• The class label attribute, buys_computer, has two distinct values (namely, {yes, no})
• There are nine tuples of class yes and five tuples of class no
• For the age category youth, there are two yes tuples and three no tuples
• For the category middle aged, there are four yes tuples and zero no tuples
• For the category senior, there are three yes tuples and two no tuples
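From these counts, Gain(age) can be computed directly; the sketch below (my own code) derives the values Info(D) ≈ 0.940 bits, Info_age(D) ≈ 0.694 bits, and Gain(age) ≈ 0.246 bits, which can then be compared with the gains listed on the next slide.

from math import log2

def info(counts):
    """Expected information (entropy) of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# buys_computer: 9 "yes" and 5 "no" tuples overall
info_D = info([9, 5])                                         # ~0.940 bits

# Partitioning on age: youth (2 yes, 3 no), middle aged (4 yes, 0 no), senior (3 yes, 2 no)
partitions = [(2, 3), (4, 0), (3, 2)]
info_age_D = sum(sum(p) / 14 * info(p) for p in partitions)   # ~0.694 bits

gain_age = info_D - info_age_D                                # ~0.246 bits
print(round(info_D, 3), round(info_age_D, 3), round(gain_age, 3))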
Example contd..
• Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit rating) = 0.048 bits
• Since age has the highest information gain among the attributes, it is selected as the
splitting attribute
Information Gain of a Continuous-Valued Attribute
• For such a scenario, we must determine the “best” split-point for A, where the split-point is a
threshold on A
• Typically, the midpoint between each pair of adjacent values is considered as a possible split-point
• For example, the midpoint between the values ai and ai+1 of A is (ai+ai+1 ) / 2
• For each possible split-point for A, we evaluate InfoA(D), where the number of partitions is two, i.e.,
v = 2 (or j = 1, 2)
• The point with the minimum expected information requirement for A is selected as the split point
for A
• D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A >
split_point
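A minimal sketch of this split-point search (my own code on toy data; the attribute values and labels below are hypothetical) could look like this:

from math import log2

def info(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_of_split(pairs, split_point):
    """Expected information Info_A(D) for the binary split A <= split_point vs. A > split_point."""
    left = [label for value, label in pairs if value <= split_point]
    right = [label for value, label in pairs if value > split_point]
    total, result = len(pairs), 0.0
    for side in (left, right):
        if side:
            counts = [side.count(c) for c in set(side)]
            result += len(side) / total * info(counts)
    return result

def best_split_point(pairs):
    """Try the midpoint between each pair of adjacent sorted values; keep the one with minimum Info_A(D)."""
    values = sorted({value for value, _ in pairs})
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(midpoints, key=lambda sp: info_of_split(pairs, sp))

# Toy data: (continuous attribute value, class label)
D = [(25, "no"), (32, "no"), (41, "yes"), (47, "yes"), (52, "yes"), (60, "no")]
print(best_split_point(D))   # 36.5 for this toy data: it isolates the two low "no" values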
Continuous Attributes: Computing Gini Index
– Computationally inefficient! Repetition of work.
[Figure: a binary test on the continuous attribute Taxable Income, e.g. "Taxable Income > 80K?", with Yes and No branches.]
Continuous Attributes: Computing Gini Index...
[Table: for each candidate split position on the continuous attribute, the class counts in the two partitions and the resulting Gini index. The "No" counts (<= split, > split) across the candidate splits are (0,7), (1,6), (2,5), (3,4), (3,4), (3,4), (3,4), (4,3), (5,2), (6,1), (7,0), with corresponding Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420; the split with the minimum Gini index (0.300) is selected.]
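As a hedged sketch of how each of those Gini values is obtained (my own code; the assumption of 7 "No" and 3 "Yes" tuples in total is inferred from the boundary value of 0.420, not stated in the source), the Gini index of a candidate binary split follows from the class counts on each side:

def gini(counts):
    """Gini index of a class distribution given as a list of counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_of_split(left_counts, right_counts):
    """Weighted Gini index of a binary split; counts are per-class (No, Yes) tallies on each side."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)

# Assumed totals: 7 "No" and 3 "Yes" tuples (hypothetical, matching the boundary Gini of 0.420)
print(round(gini_of_split([0, 0], [7, 3]), 3))   # 0.420: all tuples fall on one side of the split
print(round(gini_of_split([3, 3], [4, 0]), 3))   # 0.300: the table's best split ("Yes" counts inferred)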