Decision Trees in Machine Learning
Sabber Ahamed
sabbers@gmail.com
linkedin.com/in/sabber-ahamed
github.com/msahamed
July 31, 2023
Contents

1 Introduction to Trees in Machine Learning
  1.1 Basics of a Decision Tree
  1.2 How does a Decision Tree work?
    1.2.1 Gini Impurity
    1.2.2 Entropy
    1.2.3 Information Gain
2 From a Single Tree to Forests
  2.1 Benefits of a Forest
  2.2 Building a Forest: Bagging and Boosting
    2.2.1 Bagging
    2.2.2 Boosting
3 Conclusion
1 Introduction to Trees in Machine Learning
Decision trees are one of the most intuitive and versatile algorithms in ma-
chine learning. They can be visualized, which aids in understanding the
decision-making process. They can easily handle both numerical and cate-
gorical data and can be used for both classification and regression problems.
1.1 Basics of a Decision Tree
The tree can be explained by two entities: decision nodes and leaves. The
decision nodes are where the data is split based on a certain attribute/feature,
and the leaves are the final outcomes (decisions). The diagram below shows a
simple decision tree, with decision nodes drawn as squares and leaves drawn
as circles.
Figure 1: A basic decision tree: The decision nodes are represented by
squares, and the leaves are represented by circles. The decision nodes are
where the data is split based on a certain attribute/feature. The leaves are
the final outcomes.
1.2 How does a Decision Tree work?
A decision tree is like a flowchart: you ask a series of questions about your
data, and based on the answers, you make a decision. The aim is to make
these decisions as accurately as possible.
Here’s a simple way to understand how a decision tree works:
1. Start with your entire dataset. This is like having a mixed bowl of
fruits and wanting to separate them.
2. Ask a question about one of the features of your data. For example,
”Is this fruit yellow?”
3. Based on the answer, split your dataset into two groups. Maybe now
you’ve separated bananas from the rest.
4. Repeat this process for each group, asking new questions and making
further splits. Continue this until you’ve separated out all the different
fruits or met some stopping criteria, like having a very small group.
The goal is to ask the right questions in the right order so that our fruits
(or any data) are separated as quickly and cleanly as possible. In technical
terms, we’re trying to find features that produce the ”purest” groups.
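To make this flowchart concrete, here is a minimal sketch using scikit-learn's
DecisionTreeClassifier; the tiny fruit dataset and its feature names are made
up purely for illustration.

# A minimal sketch of the "ask questions, split, repeat" process using
# scikit-learn's DecisionTreeClassifier. The fruit data below is made up
# purely for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [is_yellow (0 or 1), weight_in_grams]
X = [[1, 120], [1, 130], [0, 150], [0, 160], [1, 40], [0, 10]]
y = ["banana", "banana", "apple", "apple", "lemon", "grape"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned sequence of questions (splits) and the final leaves.
print(export_text(tree, feature_names=["is_yellow", "weight"]))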
Several popular splitting criteria help decide which questions to ask and in
what order. They include:
1.2.1 Gini Impurity:
Gini impurity measures the disorder of a set of items. The higher the Gini
impurity, the more mixed (disordered) the set is. When choosing a split, we
prefer the feature whose split produces the largest drop in impurity, i.e.,
the purest child groups.
It is calculated by subtracting the sum of the squared probabilities p_i of each
class i from one. The formula is given by:

Gini(t) = 1 - \sum_{i=1}^{C} p_i^2
Where:
• C is the number of classes.
• p_i is the proportion of instances that belong to class i in the set.
Example: For a binary feature (e.g., number of yellow and red balls) where
the probabilities of both classes are equal, p_1 = p_2 = 0.5, the Gini impurity
will be:

Gini(t) = 1 - (0.5^2 + 0.5^2) = 0.5
Benefits:
• Computationally faster as it doesn’t involve logarithmic functions like
entropy.
• Tends to produce slightly larger trees than entropy, but the trees are
more balanced.
Application: Used widely in the CART (Classification and Regression
Trees) algorithm.
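As a quick check of the formula and the example above, here is a small helper
(not from the original text) that computes Gini impurity from class counts:

# Gini(t) = 1 - sum_i p_i^2, where p_i are the class proportions.
def gini_impurity(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_impurity([5, 5]))   # equal yellow/red balls -> 0.5
print(gini_impurity([10, 0]))  # a pure set -> 0.0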
1.2.2 Entropy:
Entropy is a measure of the randomness, impurity, or unpredictability of a
variable. The higher the entropy, the more mixed the set is; when splitting,
we prefer the feature whose split reduces entropy the most.
It is calculated by summing the product of the probability of each class p_i
and the logarithm of that probability:
Entropy(t) = -\sum_{i=1}^{C} p_i \log_2(p_i)
Where:
• C is the number of classes.
• p_i is the proportion of instances that belong to class i in the set.
Example: For a binary feature where the probabilities of both classes are
equal, p_1 = p_2 = 0.5, the entropy will be:

Entropy(t) = -[0.5 \log_2(0.5) + 0.5 \log_2(0.5)] = 1
For a variable with three classes, p_1 = 0.5, p_2 = 0.25, p_3 = 0.25, the
entropy will be:

Entropy(t) = -[0.5 \log_2(0.5) + 0.25 \log_2(0.25) + 0.25 \log_2(0.25)] = 1.5
For continuous variables, entropy is computed either by splitting on a
particular threshold value and taking the weighted average of the entropies of
the resulting subsets, or by first discretizing the variable into several
categories.
Benefits:
• Offers a clear measure of impurity or disorder.
• When used as a criterion, it tends to produce slightly smaller trees than
Gini impurity.
Application: Used widely in the ID3, C4.5 and C5.0 tree generation algo-
rithms.
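A similar sketch for entropy, checked against the two worked examples above:

# Entropy(t) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))         # binary, equal classes -> 1.0
print(entropy([0.5, 0.25, 0.25]))  # three classes -> 1.5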
1.2.3 Information Gain:
Information Gain is a measure of the effectiveness of an attribute in classi-
fying the training data. It’s calculated as the difference between the entropy
of the original set and the weighted average entropy of the subsets.
IG(A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
Where:
• Values(A) is the set of all possible values of attribute A.
• S is the total dataset.
• S_v is the subset of S for which attribute A has value v.
Example: If splitting a dataset on an attribute results in two subsets with
entropies Entropy(S_1) = 0.8 and Entropy(S_2) = 0.6, and these subsets
represent 30% and 70% of the data, respectively, then the information gain
from the split is:

IG(A) = Entropy(S) - [0.3 × 0.8 + 0.7 × 0.6] = Entropy(S) - 0.66
Benefits:
• Directly aims to reduce the unpredictability in the dataset.
• Often results in smaller, more coherent trees.
• One caveat: it favors attributes with many distinct values; this bias can be
mitigated by using the Gain Ratio, which normalizes the Information Gain.
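The worked example above can be checked with a small helper; the parent
entropy is not given there, so a value of 1.0 (a balanced binary set) is
assumed here purely for illustration:

# IG = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
def information_gain(parent_entropy, subset_weights, subset_entropies):
    weighted = sum(w * e for w, e in zip(subset_weights, subset_entropies))
    return parent_entropy - weighted

# Assuming a parent entropy of 1.0 (balanced binary set):
print(information_gain(1.0, [0.3, 0.7], [0.8, 0.6]))  # 1.0 - 0.66 ≈ 0.34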
2 From a Single Tree to Forests
Think of a decision tree as a lone decision-maker. While they might make
good decisions most of the time, they have their biases and might not always
get it right. Now, imagine if you had a team of such decision-makers, each
bringing their unique perspective and strengths. They all debate and vote
on the best decision. The chances are that this group, or ’forest’, will make
better overall decisions than a lone tree. This is the fundamental idea behind
the Random Forest and other ensemble techniques.
2.1 Benefits of a Forest
• Diversity in Opinion: Just as a team with diverse expertise can
tackle complex problems better, a forest has multiple trees that look
at the data differently, leading to better overall predictions.
• Reduces Overfitting: While a single tree might memorize (or overfit
to) the training data, averaging predictions from many trees can reduce
this overfitting.
• Improved Accuracy: Multiple trees often lead to predictions that
are more accurate than those from a single tree.
2.2 Building a Forest: Bagging and Boosting
Forests are formed using methods like bagging and boosting. These aren’t
just random collections of trees, but carefully constructed groups where each
tree brings something unique.
2.2.1 Bagging
Idea: Bagging (short for bootstrap aggregating) lets each tree in the forest
focus on a different part of the data. By training each tree on a random
subset, we ensure diversity.
Random Forest This is a popular algorithm that uses bagging. Each tree
in a Random Forest is built on a random bootstrap sample of the data.
Additionally, at each split during tree building, only a random subset of the
features is considered. This double randomness ensures a variety of trees.
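A minimal sketch of this double randomness using scikit-learn's
RandomForestClassifier; the synthetic dataset is only for illustration.

# Each tree sees a bootstrap sample of the rows, and each split considers
# only a random subset of the features (max_features).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    bootstrap=True,       # sample rows with replacement for each tree
    max_features="sqrt",  # random subset of features at each split
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))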
Figure 2: Simplified random forest: Each tree is trained on a random subset
of the data. The final prediction is obtained by averaging (for regression) or
majority voting (for classification) over all the trees. The image is taken
from Wikipedia.

Extra Trees Extra Trees (Extremely Randomized Trees) is another popular
tree ensemble. It's similar to Random Forests, but with two key differences
(a short sketch follows the list below):
• Extra Trees doesn’t bootstrap samples (meaning it doesn’t sample with
replacement), so each tree in Extra Trees is trained on the whole train-
ing set.
• Extra Trees uses random thresholds for each feature rather than search-
ing for the best possible thresholds (like a normal decision tree does).
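A short sketch of the scikit-learn counterpart, ExtraTreesClassifier,
highlighting these two differences; it assumes the same X, y as the Random
Forest sketch above.

# bootstrap=False (the default) keeps the full training set for each tree,
# and split thresholds are drawn at random rather than optimized.
from sklearn.ensemble import ExtraTreesClassifier

extra = ExtraTreesClassifier(
    n_estimators=100,
    bootstrap=False,  # each tree sees the whole training set
    random_state=0,
)
extra.fit(X, y)
print(extra.score(X, y))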
Isolation Forests Isolation Forests are a special type of tree-based ensemble
designed to detect outliers. They are based on the idea that outliers are few
and lie far from the rest of the data, so a randomly built tree needs fewer
splits to isolate an outlier than to isolate a typical point. This is the
basic idea behind Isolation Forests.
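A minimal outlier-detection sketch with scikit-learn's IsolationForest; the
two-dimensional data below are made up, a tight cluster plus one far-away
point.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 100 points in a tight cluster, plus one obvious outlier at (8, 8).
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = iso.fit_predict(X)  # -1 marks points that are isolated quickly (outliers)
print(labels[-1])            # the far-away point is expected to be labeled -1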
2.2.2 Boosting
Idea: Instead of letting each tree form its own opinion independently, make
trees learn from the mistakes of previous trees. Each new tree tries to correct
the errors made by the ensemble of existing trees. This way, the trees are
built sequentially, and each tree depends on the previous ones. This is called
boosting.
Boosting is a powerful technique that can lead to better predictions than
bagging. However, it’s also more prone to overfitting. Some popular boosting
algorithms are AdaBoost, Gradient Boosting, and XGBoost.
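A toy sketch of this sequential idea for regression with squared error,
assuming scikit-learn's DecisionTreeRegressor: each shallow tree is fit to the
residuals of the current ensemble, and its prediction is added with a small
learning rate. This illustrates the principle rather than any particular
library's implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

prediction = np.zeros_like(y)
learning_rate = 0.1
trees = []
for _ in range(100):
    residual = y - prediction              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                  # new tree learns from the mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(np.mean((y - prediction) ** 2))      # training error shrinks over rounds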
Figure 3: Schematic diagram showing gradient boosting: (a) the sequential
boosting process, (b) the LightGBM architecture, and (c) the XGBoost architecture.
Example: LightGBM LightGBM is a gradient boosting framework. In
LightGBM, trees are grown sequentially: each tree fits the residual errors (or
mistakes) of the ensemble of prior trees, so each new tree focuses on the
examples that previous trees got wrong. Unlike Random Forests, LightGBM
doesn't use bootstrap sampling. Instead, it uses a technique called
Gradient-based One-Side Sampling (GOSS) to select which samples
to use for training. It also differs from XGBoost in that it grows trees
leaf-wise instead of level-wise, which often leads to faster training and
comparable or better accuracy.
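A minimal usage sketch, assuming the lightgbm Python package and its
scikit-learn style wrapper LGBMClassifier are available; the dataset and
parameter values are illustrative.

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = LGBMClassifier(
    n_estimators=200,    # number of boosting rounds (trees built sequentially)
    learning_rate=0.05,  # shrinks each tree's contribution
    num_leaves=31,       # cap on leaves per tree under leaf-wise growth
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))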
3 Conclusion
Decision tree-based algorithms, from simple trees to complex structures like
Random Forests and LightGBM, offer a wide array of tools for machine learn-
ing practitioners. Their versatility and interpretability make them invaluable
in many real-world applications.