ML Unit 3
• The decision tree algorithm falls under the category of supervised learning. It can be used to
solve both regression and classification problems.
• A decision tree uses a tree representation to solve the problem: each leaf node corresponds to a
class label, and attributes are tested at the internal nodes of the tree.
• Any boolean function on discrete attributes can be represented using a decision tree.
A decision tree expresses the target concept in Sum of Products form, also known as Disjunctive
Normal Form (DNF): each root-to-leaf path is a conjunction of attribute tests, and a class is the
disjunction of all the paths that lead to its leaves. A typical illustrative example is a tree that
predicts whether a person uses a computer in daily life.
In a decision tree, the major challenge is identifying the attribute to place at the root node at each
level. This process is known as attribute selection. We have two popular attribute selection
measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the
entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and
Values(A) is the set of all possible values of A. Then
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an
arbitrary collection of examples. The higher the entropy, the higher the information content.
Definition: Suppose S is a set of instances and p_i is the proportion of instances in S that belong
to class i. Then
Entropy(S) = − Σ_i p_i · log2(p_i)
Example: For a collection S containing 3 examples of one class and 5 of the other,
Entropy(S) = −(3/8) log2(3/8) − (5/8) log2(5/8)
= −(−0.53 − 0.424)
= 0.954
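To double-check this value, here is a tiny Python helper of my own (not part of the original notes):

```python
import math

def entropy(counts):
    """Entropy in bits of a collection, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([3, 5]), 3))  # 0.954, matching the example above
```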
Building Decision Tree using Information Gain
The essentials:
• Start with all training instances associated with the root node
• Use info gain to choose which attribute to label each node with
• Note: No root-to-leaf path should contain the same discrete attribute twice
• Recursively construct each subtree on the subset of training instances that would be classified
down that path in the tree.
• If all positive or all negative training instances remain, label that node “yes” or “no”
accordingly
• If no attributes remain, label with a majority vote of training instances left at that node
• If no instances remain, label with a majority vote of the parent’s training instances
Example:
Now, let's draw a decision tree for the following data using information gain.
Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Here, we have 3 features and 2 output classes.
To build a decision tree using information gain, we take each feature in turn and calculate the
information gain obtained by splitting on it.
Split on feature X: X = 1 gives {I, I, II} (entropy ≈ 0.918) and X = 0 gives {II} (entropy 0), so
Gain = 1 − (3/4)(0.918) ≈ 0.31.
Split on feature Y: Y = 1 gives {I, I} and Y = 0 gives {II, II}; both subsets are pure, so Gain = 1.
Split on feature Z: Z = 1 gives {I, II} and Z = 0 gives {I, II}; both subsets have entropy 1, so
Gain = 1 − 1 = 0.
From these calculations we can see that the information gain is maximum when we split on
feature Y, so the best-suited feature for the root node is Y. Moreover, when the dataset is split by
feature Y, each child contains a pure subset of the target variable, so we don't need to split any
further.
The final tree for this dataset therefore has Y at the root, with the branch Y = 1 leading to a leaf
labelled Class I and the branch Y = 0 leading to a leaf labelled Class II.
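The gains above can be reproduced with a short Python sketch (my own helper functions, not from the notes):

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def info_gain(rows, labels, feature_index):
    """Information gain from splitting `rows` on the feature at `feature_index`."""
    gain = entropy(labels)
    for value in set(r[feature_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[feature_index] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# The 4-row training set above (features X, Y, Z and class C).
rows   = [(1, 1, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]
labels = ["I", "I", "II", "II"]

for name, idx in zip("XYZ", range(3)):
    print(name, round(info_gain(rows, labels, idx), 3))
# X 0.311, Y 1.0, Z 0.0 -> Y has the highest gain
```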
2. Gini Index
• Gini Index is a metric that measures how often a randomly chosen element would be incorrectly
classified.
• An attribute with a lower Gini index should therefore be preferred.
• Sklearn supports the "gini" criterion for the Gini Index, and it is the default value of the
criterion parameter in DecisionTreeClassifier.
• The formula for the calculation of the Gini Index is given below:
Gini Index = 1 − Σ_i (p_i)², where p_i is the proportion of examples in the node belonging to
class i.
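As a small illustration (my own sketch, not from the notes), the formula can be computed directly from the class counts of a node:

```python
def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([2, 2]))  # 0.5 -> maximally impure two-class node
print(gini([4, 0]))  # 0.0 -> pure node
```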
ID3
The ID3 algorithm begins with the original set S as the root node. On each iteration of the
algorithm, it iterates through every unused attribute of the set S and calculates the entropy H(S)
or the information gain IG(S) of that attribute. It then selects the attribute which has the smallest
entropy (or largest information gain) value. The set S is then split or partitioned by the selected
attribute to produce subsets of the data. (For example, a node can be split into child nodes based
upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater
than 100.) The algorithm continues to recurse on each subset, considering only attributes never
selected before.
Recursion on a subset may stop in one of these cases:
• every element in the subset belongs to the same class; in which case the node is turned into
a leaf node and labelled with the class of the examples.
• there are no more attributes to be selected, but the examples still do not belong to the same
class. In this case, the node is made a leaf node and labelled with the most common class of
the examples in the subset.
• there are no examples in the subset, which happens when no example in the parent set was
found to match a specific value of the selected attribute. An example could be the absence of
a person among the population with age over 100 years. Then a leaf node is created and
labelled with the most common class of the examples in the parent node's set.
Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal
node) representing the selected attribute on which the data was split, and terminal nodes (leaf
nodes) representing the class label of the final subset of this branch.
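The recursive procedure above can be sketched in Python for categorical attributes (a simplified illustration of my own, not a reference implementation; the dict-based tree format is an arbitrary choice):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def id3(rows, labels, attributes):
    """Build an ID3 tree.

    rows       : list of dicts mapping attribute name -> value
    labels     : class labels, parallel to rows
    attributes : attributes still available for splitting
    Returns a class label (leaf) or a dict {attribute: {value: subtree}}.
    """
    if len(set(labels)) == 1:          # all examples in the same class -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]

    # Choose the attribute with the largest information gain.
    def gain(a):
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            g -= (len(sub) / len(rows)) * entropy(sub)
        return g
    best = max(attributes, key=gain)

    # The "empty subset" stopping case never occurs here, because branch
    # values are enumerated from the data itself.
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[best][v] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return tree

# The 4-row example from the information-gain section:
rows = [{"X": 1, "Y": 1, "Z": 1}, {"X": 1, "Y": 1, "Z": 0},
        {"X": 0, "Y": 0, "Z": 1}, {"X": 1, "Y": 0, "Z": 0}]
print(id3(rows, ["I", "I", "II", "II"], ["X", "Y", "Z"]))  # e.g. {'Y': {0: 'II', 1: 'I'}}
```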
AdaBoost
AdaBoost, short for Adaptive Boosting, is a statistical classification meta-algorithm formulated
by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used
in conjunction with many other types of learning algorithms to improve performance. The output
of the other learning algorithms ('weak learners') is combined into a weighted sum that represents
the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak
learners are tweaked in favor of those instances misclassified by previous classifiers. In some
problems it can be less susceptible to the overfitting problem than other learning algorithms. The
individual learners can be weak, but as long as the performance of each one is slightly better than
random guessing, the final model can be proven to converge to a strong learner.
Every learning algorithm tends to suit some problem types better than others, and typically has
many different parameters and configurations to adjust before it achieves optimal performance on
a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-
of-the-box classifier. When used with decision tree learning, information gathered at each stage of
the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree
growing algorithm such that later trees tend to focus on harder-to-classify examples.
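In scikit-learn, this tree-based flavour of AdaBoost can be sketched with depth-1 stumps as the weak learners (synthetic data purely for illustration, not taken from the notes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data, just for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each weak learner is a stump: a tree with one decision node and two leaves.
stump = DecisionTreeClassifier(max_depth=1)
model = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=0)  # `base_estimator` in older scikit-learn
model.fit(X, y)
print(model.score(X, y))
```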
Boosting works as follows: a first model is built, and the errors made by that model are noted by
the algorithm. The records that are incorrectly classified are used as input for the next model.
This process is repeated until a specified stopping condition is met, so 'n' models are built, each
one taking the errors of the previous model into account. This is how boosting works. The models
1, 2, 3, …, N are individual models, which here can be thought of as decision trees. All types of
boosting models work on the same principle.
Since we now know the boosting principle, it is easy to understand the AdaBoost algorithm. Let's
dive into how AdaBoost works. When a random forest is used, the algorithm makes 'n' full trees,
each consisting of a root node with several leaf nodes; some trees might be bigger than others,
and there is no fixed depth in a random forest. With AdaBoost, however, the algorithm only
makes trees with a single decision node and two leaves, known as stumps. These stumps are weak
learners, which is exactly what boosting techniques prefer. The order of the stumps is very
important in AdaBoost: the error of the first stump influences how the subsequent stumps are
made. Let's understand this with an example.
Here's a sample dataset consisting of only three features, where the output is in categorical form.
As the output is binary/categorical, this is a classification problem. In real life, the dataset can
have any number of records and features; let us consider 5 records for explanation purposes. The
output is categorical, here in the form of Yes or No. Every record is assigned a sample weight
using the formula W = 1/N, where N is the number of records. Since this dataset has only 5
records, each record initially gets the same weight of 1/5.
The total error (TE) of a stump is the sum of the sample weights of the records it misclassifies,
and the performance of the stump (its 'amount of say') is calculated as
Performance = (1/2) · ln((1 − TE) / TE)
In our case, TE is 1/5. Substituting this value of the total error into the formula and solving, we
get a performance of 0.693 for the stump. Why is it necessary to calculate the TE and the
performance of a stump? Because we must update the sample weights before proceeding to the
next model or stage; if the same weights were kept, the next stage would simply reproduce the
output of the first model. In boosting, incorrectly classified records get more preference than
correctly classified ones: with a plain cascade of decision trees/stumps, only the wrong records
would be passed on to the next stump, whereas in AdaBoost all records are passed on but the
wrong records are effectively repeated more often than the correct ones. We must therefore
increase the weight of the wrongly classified records and decrease the weight of the correctly
classified records. In the next step, we update the weights based on the performance of the stump.
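The update itself follows the standard AdaBoost rule: multiply a misclassified record's weight by e^performance, a correctly classified record's weight by e^(−performance), and then normalise. A small numeric sketch (which record is misclassified is an assumption, just for illustration):

```python
import math

n_records = 5
weights = [1 / n_records] * n_records                 # initial sample weights: 0.2 each
misclassified = [False, False, True, False, False]    # assume the stump gets one record wrong

total_error = sum(w for w, m in zip(weights, misclassified) if m)   # TE = 0.2
performance = 0.5 * math.log((1 - total_error) / total_error)       # 0.5 * ln(4) ≈ 0.693

# Increase weights of wrong records, decrease weights of correct ones, then normalise.
updated = [w * math.exp(performance if m else -performance)
           for w, m in zip(weights, misclassified)]
total = sum(updated)
normalised = [w / total for w in updated]
print(round(performance, 3), [round(w, 3) for w in normalised])
# ≈ 0.693 and weights [0.125, 0.125, 0.5, 0.125, 0.125]
```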
K-Nearest Neighbour (K-NN)
Suppose there are two categories, Category A and Category B, and we have a new data point x1.
Which of these categories will the data point fall into? To solve this type of problem, we need a
K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a new
data point.
The working of K-NN can be explained on the basis of the following algorithm. Suppose we have
a new data point and we need to put it in the required category:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the new data point and the existing
data points. The Euclidean distance is the distance between two points, which we have already
studied in geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance, we get the nearest neighbors: in this example, three
nearest neighbors in Category A and two nearest neighbors in Category B.
o As the majority of the 5 nearest neighbors (3 of them) are from Category A, this new data
point must belong to Category A. A small code sketch of these steps follows this list.
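A minimal sketch of these steps in Python (the labelled points and the new point are made-up values, chosen only to illustrate the distance-and-vote procedure):

```python
import math
from collections import Counter

# Hypothetical labelled points: (x, y, category)
points = [(1, 2, "A"), (2, 3, "A"), (3, 3, "A"), (6, 5, "B"), (7, 6, "B"), (8, 5, "B")]
new_point = (3, 4)
k = 5

# Step 2: Euclidean distance from the new point to every labelled point.
def distance(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

neighbours = sorted(points, key=lambda p: distance(p, new_point))[:k]

# Steps 3-4: count categories among the k nearest neighbours and take the majority.
votes = Counter(p[2] for p in neighbours)
print(votes.most_common(1)[0][0])  # "A" for these made-up points
```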
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K reduce the effect of noise, but they can blur the boundary between
categories and make the prediction more expensive to compute.
For the Python implementation of the K-NN algorithm, we will use the same problem and dataset
that we used for Logistic Regression, but here we will improve the performance of the model.
Below is the problem description:
Problem for K-NN Algorithm: A car manufacturer has produced a new SUV and wants to show
its ads only to users who are likely to be interested in buying it. For this problem, we have a
dataset that contains information about multiple users of a social network. The dataset contains
many columns, but we will take Estimated Salary and Age as the independent variables and
Purchased as the dependent variable.
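A sketch of the K-NN implementation for this problem with scikit-learn (the file name "User_Data.csv" and the column names Age, EstimatedSalary and Purchased are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the data (file name and column names are assumed for this sketch).
data = pd.read_csv("User_Data.csv")
X = data[["Age", "EstimatedSalary"]].values
y = data["Purchased"].values

# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling matters for K-NN because it is distance-based.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# K-NN with k = 5 neighbours and Euclidean distance (minkowski metric with p = 2).
classifier = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```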