CHAPTER#03
SUPERVISED LEARNING & ITS ALGORITHMS
PART # 03
COURSE INSTRUCTORS:
DR. MUHAMMAD NASEEM
ENGR. FARHEEN QAZI
MS. RUQQIYA AZIZ
DEPARTMENT OF SOFTWARE ENGINEERING
SIR SYED UNIVERSITY OF ENGINEERING & TECHNOLOGY
AGENDA
Decision Tree Algorithm
Algorithms used in Decision Tree
ID3
History
Process of ID3
Entropy and Information Gain
Summary
DECISION TREE ALGORITHM
The Decision Tree algorithm belongs to the family of supervised learning algorithms.
Unlike many other supervised learning algorithms, the decision tree algorithm can be
used for solving both regression and classification problems.
The goal of using a Decision Tree is to create a training model that can predict
the class or value of the target variable by learning simple decision rules inferred
from prior data (training data).
In Decision Trees, to predict a class label for a record we start from the root of
the tree. We compare the value of the root attribute with the record's attribute.
On the basis of this comparison, we follow the branch corresponding to that value and
jump to the next node.
ALGORITHMS USED IN DECISION TREE
Let us look at some algorithms used in Decision Trees:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (multivariate adaptive regression splines)
ID3
(ITERATIVE DICHOTOMISER 3)
WHAT IS THE ID3 ALGORITHM?
ID3 is one of the most common decision tree algorithms.
ID3 stands for Iterative Dichotomiser 3 (dichotomisation means
dividing into two completely opposite things).
The algorithm iteratively divides the attributes into two groups, the
most dominant attribute and the others, to construct the tree.
It is used to generate a decision tree from a dataset.
ID3 is a precursor to the C4.5 algorithm.
HISTORY
The ID3 algorithm was invented by Ross Quinlan.
Quinlan was a computer science researcher in data mining and decision
theory.
He received his doctorate in computer science from the University of
Washington in 1968.
DECISION TREE
Classifies data using its attributes.
The tree consists of decision nodes and leaf nodes.
A decision node can have two or more branches, each representing a value
of the attribute being tested.
Leaf nodes produce a homogeneous (single-class) result.
THE PROCESS (BUILDING THE TREE)
At each node, we need to find the attribute that best divides the
remaining data into the target classes (e.g. Yes and No):
Take all unused attributes and calculate their entropies.
Choose the attribute whose entropy is minimum; equivalently, the
attribute whose information gain is maximum.
The attribute with the highest information gain is selected as the test at each node.
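The steps above can be sketched as a small recursive procedure. This is a minimal illustration, not the original ID3 implementation: the dataset is the worked example from later in this chapter, and the function and variable names are choices made here.

```python
from collections import Counter
from math import log2

# Training rows from the chapter's worked example.
data = [
    {"Age": "Old", "Competition": "Y", "Type": "Software", "Profit": "Down"},
    {"Age": "Old", "Competition": "N", "Type": "Software", "Profit": "Down"},
    {"Age": "Old", "Competition": "N", "Type": "Hardware", "Profit": "Down"},
    {"Age": "Mid", "Competition": "Y", "Type": "Software", "Profit": "Down"},
    {"Age": "Mid", "Competition": "Y", "Type": "Hardware", "Profit": "Down"},
    {"Age": "Mid", "Competition": "N", "Type": "Hardware", "Profit": "Up"},
    {"Age": "Mid", "Competition": "N", "Type": "Software", "Profit": "Up"},
    {"Age": "New", "Competition": "Y", "Type": "Software", "Profit": "Up"},
    {"Age": "New", "Competition": "N", "Type": "Hardware", "Profit": "Up"},
    {"Age": "New", "Competition": "N", "Type": "Software", "Profit": "Up"},
]

def entropy(rows):
    # Entropy of the class label over a set of rows.
    counts = Counter(r["Profit"] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, attr):
    # Information gain = entropy before the split - weighted entropy after.
    after = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

def id3(rows, attrs):
    labels = {r["Profit"] for r in rows}
    if len(labels) == 1:
        return labels.pop()                     # pure subset -> leaf
    if not attrs:                               # no attributes left -> majority vote
        return Counter(r["Profit"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in set(r[best] for r in rows)}}

tree = id3(data, ["Age", "Competition", "Type"])
print(tree)
```

Running this selects Age at the root and Competition under the Mid branch, matching the worked example later in the chapter.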
ENTROPY & INFORMATION GAIN
In ID3, entropy is used to measure how informative a node is.
Splitting on any attribute has the property that the average entropy of the resulting
training subsets is less than or equal to that of the previous training set.
The ID3 algorithm defines a measure of a split, called Information Gain, to determine the goodness
of the split.
The attribute with the largest information gain is chosen as the splitting attribute, and
it partitions the training set into smaller subsets based on the distinct values of the attribute under split.
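The entropy measure above can be sketched in Python. This is a minimal sketch: the function name, the two-class (positive/negative count) signature, and the base-2 logarithm convention are choices made here to match the formulas used in the example that follows.

```python
from math import log2

def entropy(p, n):
    # Entropy of a set with p positive and n negative examples.
    # A pure set (p == 0 or n == 0) has zero entropy by convention.
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

print(entropy(5, 5))  # 1.0 (maximally mixed set)
print(entropy(1, 3))  # ≈ 0.8113
```

Information gain is then the entropy of the full set minus the weighted average entropy of the subsets produced by a split.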
Example-1
Question#01: Using the Decision Tree algorithm and the given table, classify the given tuple (T);
also draw the tree displaying the parent node and child nodes.
ID AGE COMPETITION TYPE PROFIT
1 Old Y Software Down
2 Old N Software Down
3 Old N Hardware Down
4 Mid Y Software Down
5 Mid Y Hardware Down
6 Mid N Hardware Up
7 Mid N Software Up
8 New Y Software Up
9 New N Hardware Up
10 New N Software Up
CONTD…..
Step-1: Determine the entropy for the positive class (P) and the negative class (N). Note: there can be
more than two classes, depending on the class attribute.

$Entropy(Class) = -\frac{P}{P+N}\log_2\frac{P}{P+N} - \frac{N}{P+N}\log_2\frac{N}{P+N}$ --- (1)

According to the table, P = 5 and N = 5, because there are 5 Up and 5 Down rows.

$Entropy(Class) = -\frac{5}{5+5}\log_2\frac{5}{5+5} - \frac{5}{5+5}\log_2\frac{5}{5+5}$

$Entropy(Class) = 1$
CONTD…..
Step-2: Determine $I(P_i, N_i)$, the entropy of the subset for each value of a feature attribute
(Age, Competition and Type); the information gain of each attribute is computed from these in Step-4.

$I(P_i, N_i) = -\frac{P_i}{P_i+N_i}\log_2\frac{P_i}{P_i+N_i} - \frac{N_i}{P_i+N_i}\log_2\frac{N_i}{P_i+N_i}$ --- (2)
Table-1: (Age)

        P_i    N_i    I(P_i, N_i)
Old     0      3      0
Mid     2      2      1
New     3      0      0

$I(P_i, N_i)_{Old} = -\frac{0}{0+3}\log_2\frac{0}{0+3} - \frac{3}{0+3}\log_2\frac{3}{0+3} = 0$

$I(P_i, N_i)_{Mid} = -\frac{2}{2+2}\log_2\frac{2}{2+2} - \frac{2}{2+2}\log_2\frac{2}{2+2} = 1$

$I(P_i, N_i)_{New} = -\frac{3}{3+0}\log_2\frac{3}{3+0} - \frac{0}{3+0}\log_2\frac{0}{3+0} = 0$

(Here we use the convention $0 \log_2 0 = 0$.)
CONTD…..
Step-3: Now determine the entropy of the feature Age:

$Entropy(Age) = \sum_i \frac{P_i+N_i}{P+N} \, I(P_i, N_i)$ --- (3)

From Table-1,

$Entropy(Age) = \frac{0+3}{5+5}(0) + \frac{2+2}{5+5}(1) + \frac{3+0}{5+5}(0)$

$Entropy(Age) = 0.4$
CONTD….
Step-4: Now determine the gain:

$Gain(Age) = Entropy(Class) - Entropy(Age) = 1 - 0.4 = 0.6$
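Steps 2–4 for Age can be checked numerically. The per-value counts below come from Table-1; the helper name `entropy` is an illustrative choice, not part of the original slides.

```python
from math import log2

def entropy(p, n):
    # Entropy of a subset with p Up and n Down rows (0 for a pure subset).
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

# (P_i, N_i) for Age = Old, Mid, New, taken from Table-1.
subsets = [(0, 3), (2, 2), (3, 0)]
total = 10  # P + N for the whole training set

# Eq. (3): weighted average of the subset entropies.
entropy_age = sum((p + n) / total * entropy(p, n) for p, n in subsets)
gain_age = entropy(5, 5) - entropy_age
print(entropy_age, gain_age)  # 0.4 0.6
```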
CONTD….
Applying steps 2, 3 and 4 to Competition:

Table-2: (Competition)

     P_i    N_i    I(P_i, N_i)
Y    1      3      0.81127
N    4      2      0.918295

$I(P_i, N_i)_{Y} = -\frac{1}{1+3}\log_2\frac{1}{1+3} - \frac{3}{1+3}\log_2\frac{3}{1+3} = 0.81127$

$I(P_i, N_i)_{N} = -\frac{4}{4+2}\log_2\frac{4}{4+2} - \frac{2}{4+2}\log_2\frac{2}{4+2} = 0.918295$
CONTD….
From Table-2,

$Entropy(Competition) = \frac{1+3}{5+5}(0.81127) + \frac{4+2}{5+5}(0.918295)$

$Entropy(Competition) = 0.8754$

Now determine the gain:

$Gain(Competition) = Entropy(Class) - Entropy(Competition) = 1 - 0.8754 = 0.1245$
CONTD….
Applying steps 2, 3 and 4 to Type:

Table-3: (Type)

           P_i    N_i    I(P_i, N_i)
Software   3      3      1
Hardware   2      2      1

$I(P_i, N_i)_{Software} = -\frac{3}{3+3}\log_2\frac{3}{3+3} - \frac{3}{3+3}\log_2\frac{3}{3+3} = 1$

$I(P_i, N_i)_{Hardware} = -\frac{2}{2+2}\log_2\frac{2}{2+2} - \frac{2}{2+2}\log_2\frac{2}{2+2} = 1$
CONTD….
From Table-3,

$Entropy(Type) = \frac{3+3}{5+5}(1) + \frac{2+2}{5+5}(1) = 1$

Now determine the gain:

$Gain(Type) = Entropy(Class) - Entropy(Type) = 1 - 1 = 0$
Table-4: Information Gain
Age 0.6
Competition 0.1245
Type 0
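The three gains in Table-4 can be reproduced from the raw example table in one pass. This is a sketch: the tuple layout and the 4-decimal rounding are choices made here.

```python
from math import log2

# Rows of the example table: (Age, Competition, Type, Profit)
data = [
    ("Old", "Y", "Software", "Down"), ("Old", "N", "Software", "Down"),
    ("Old", "N", "Hardware", "Down"), ("Mid", "Y", "Software", "Down"),
    ("Mid", "Y", "Hardware", "Down"), ("Mid", "N", "Hardware", "Up"),
    ("Mid", "N", "Software", "Up"),   ("New", "Y", "Software", "Up"),
    ("New", "N", "Hardware", "Up"),   ("New", "N", "Software", "Up"),
]

def entropy(labels):
    # Entropy of a list of class labels.
    total = len(labels)
    return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                for c in set(labels))

def gain(rows, idx):
    # Information gain of the attribute in column idx.
    base = entropy([r[-1] for r in rows])
    weighted = 0.0
    for value in set(r[idx] for r in rows):
        subset = [r[-1] for r in rows if r[idx] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return base - weighted

for name, idx in [("Age", 0), ("Competition", 1), ("Type", 2)]:
    print(name, round(gain(data, idx), 4))
# Age 0.6, Competition 0.1245, Type 0.0
```

The printed values match Table-4, confirming that Age is the best first split.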
CONTD….
So we can see that the Information Gain of Age is greater than that of the other two attributes,
which means Age becomes the root (parent) node.

[Tree so far: Age is the root; its Old branch leads to the leaf Down, its New branch leads to the leaf Up, and its Mid branch is still to be decided.]

Note
In the table, all records with Age = Old have Profit = Down, so we place Down as the leaf of the Old branch;
likewise, all records with Age = New have Profit = Up, so we place Up as the leaf of the New branch.
CONTD….
Now we need to resolve the Mid branch of the root (Age), so we build another table containing only the Mid records.
Table-5:
ID AGE COMPETITION TYPE PROFIT
1 Mid Y Software Down
2 Mid Y Hardware Down
3 Mid N Hardware Up
4 Mid N Software Up
Here, P = 2 and N = 2 because there are 2 Up and 2 Down rows.
CONTD….
Since both classes are again equally represented,
Entropy(Class) = 1
Table-6: Competition reduced
𝑷𝒊 𝑵𝒊 𝑰(𝑷𝒊 , 𝑵𝒊 )
Y 0 2 0
N 2 0 0
CONTD….
From Table-6,

$Entropy(Competition\ Reduced) = \frac{0+2}{2+2}(0) + \frac{2+0}{2+2}(0) = 0$

$Gain = Entropy(Class) - Entropy(Competition\ Reduced) = 1 - 0 = 1$
CONTD….
Table-7: Type Reduced

           P_i    N_i    I(P_i, N_i)
Software   1      1      1
Hardware   1      1      1

From Table-7,

$Entropy(Type\ Reduced) = \frac{1+1}{2+2}(1) + \frac{1+1}{2+2}(1) = 1$

$Gain = Entropy(Class) - Entropy(Type\ Reduced) = 1 - 1 = 0$
Since Gain(Competition Reduced) is greater than Gain(Type Reduced), Competition becomes the child node on the Mid branch of Age.
FINAL DECISION TREE

[Final tree: Age is the root. The Old branch leads to the leaf Down, the New branch leads to the leaf Up, and the Mid branch leads to a Competition node, whose Y branch leads to Down and whose N branch leads to Up.]

Test Dataset:

AGE COMPETITION TYPE PROFIT
Old N Hardware ?
Mid N Software ?

Predict the Profit class using the decision tree:

AGE COMPETITION TYPE PROFIT
Old N Hardware Down
Mid N Software Up
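The finished tree can be encoded as nested dictionaries and used to classify the two test tuples. The dict representation and the `classify` helper are illustrative choices, not part of the original slides.

```python
# Final tree from the worked example, encoded as nested dicts:
# an inner dict maps an attribute to its branches; a string is a leaf.
tree = {"Age": {"Old": "Down",
                "New": "Up",
                "Mid": {"Competition": {"Y": "Down", "N": "Up"}}}}

def classify(node, record):
    # Walk from the root, following the branch that matches each
    # attribute value, until a leaf (a plain string) is reached.
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][record[attr]]
    return node

print(classify(tree, {"Age": "Old", "Competition": "N", "Type": "Hardware"}))  # Down
print(classify(tree, {"Age": "Mid", "Competition": "N", "Type": "Software"}))  # Up
```

Note that Type never appears in the tree: its information gain was zero at every split, so it is simply ignored at prediction time.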
EXAMPLE-II
Question#02: Using the Decision Tree algorithm and the given table, classify the given tuple
(T); also draw the tree displaying the parent node and child nodes.
ADVANTAGES OF ID3
Understandable prediction rules are created from the training data.
Builds a tree relatively quickly.
Builds a short tree.
Only needs to test enough attributes until all data is classified.
Finding leaf nodes enables test data to be pruned, reducing the number of
tests.
DISADVANTAGES OF ID3
Data may be over-fitted or over-classified if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Classifying continuous data may be computationally expensive, as many
trees must be generated to see where to break the continuum.
SUMMARY
Decision trees can be used to predict outcomes for unseen data.
The trees are easy to understand.
Decision trees work more efficiently with discrete attributes.
The trees may suffer from error propagation: a wrong decision near the root is carried into every decision below it.