ML Mod-4


MODULE – 4

Decision Trees
Decision Trees are machine learning algorithms that can be used for both classification and
regression tasks. They are versatile and powerful, capable of fitting complex datasets.

In simpler terms, Decision Trees are like flowcharts that help us make decisions based on certain
features or attributes. These trees have nodes and branches. At each node, a decision is made based
on a specific feature, and the outcome leads to a new node or branch until a final decision or
prediction is reached.

For example, let's say we want to classify different types of flowers based on their petal length and
width. A Decision Tree would ask questions like "Is the petal length greater than 2.5 cm?" or "Is the
petal width less than 1.5 cm?" at different nodes, leading to different classifications.

Decision Trees can fit complex datasets very well, sometimes even too well. This means they can
become too specific to the training data and overfit, making it difficult to generalize to new, unseen
data.

Random Forests, which we'll discuss later, are a powerful extension of Decision Trees that combine
multiple trees to make more accurate predictions.

In this chapter, we'll learn how to train Decision Trees, visualize them to understand their decision-
making process, and use them to make predictions. We'll also explore the CART (Classification and
Regression Tree) algorithm used by Scikit-Learn to train Decision Trees. Additionally, we'll cover
regularization techniques to avoid overfitting and how Decision Trees can be used for regression
tasks.

Lastly, we'll discuss the limitations of Decision Trees, including their tendency to overfit, sensitivity to
small changes in the data, and the difficulty of capturing complex relationships between features.

Training and Visualizing a Decision Tree


To understand Decision Trees, let’s just build one and take a look at how it makes predictions. The
following code trains a DecisionTreeClassifier on the iris dataset (see Chapter 4):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

These lines import the necessary modules for the code: load_iris from sklearn.datasets to
load the iris dataset, and DecisionTreeClassifier from sklearn.tree to create a Decision Tree
classifier.
iris = load_iris()
This line loads the iris dataset using the load_iris() function and assigns it to the variable iris.
X = iris.data[:, 2:]
y = iris.target

These lines extract the features and target labels from the iris dataset. X represents the
features, which include the petal length and width, and y represents the target labels,
which indicate the iris species. (In the context of the code, the slicing operation is applied to the iris.data
array, which contains the features of the iris dataset. Each row in iris.data represents a
sample, and each column represents a specific feature.

The slicing operation [:, 2:] means:

• The : before the comma indicates that we want to select all rows of the array.
• The 2: after the comma indicates that we want to select columns starting from
index 2 (inclusive) to the end of the array.

So, X = iris.data[:, 2:] creates a new array X that contains only the petal length and
width features from the iris dataset, discarding the first two columns (sepal length and
width).

By doing this, we are extracting a subset of features that will be used as input (X) for training the Decision Tree classifier, focusing on the petal-related measurements.)
tree_clf = DecisionTreeClassifier(max_depth=2)

This line creates an instance of the DecisionTreeClassifier class and assigns it to the variable
tree_clf. The max_depth=2 argument sets the maximum depth of the Decision Tree to 2 levels,
which means the resulting tree will have a simple structure.
tree_clf.fit(X, y)

This line trains the Decision Tree classifier using the fit() method. It takes the features X
and target labels y as input and adjusts the model's parameters to fit the training data.(
By calling tree_clf.fit(X, y), the Decision Tree classifier analyzes the relationship between the
features (X) and the target labels (y). It learns the patterns and structure in the training data and
adjusts its internal parameters to make accurate predictions based on the input features.

After the fit() method is called, the tree_clf object will contain the trained Decision Tree classifier,
and it can be used to make predictions on new, unseen data.)
from sklearn.tree import export_graphviz
This line imports the export_graphviz() function from the sklearn.tree module. This
function is used to export the trained Decision Tree to a format that can be visualized.

export_graphviz(
        tree_clf,
        out_file=image_path("iris_tree.dot"),
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

These lines call the export_graphviz() function to export the trained Decision Tree. Here's
what each argument represents:

• tree_clf: The trained Decision Tree classifier.
• out_file: The output file name and path for the exported graph definition file. In this case, it uses a function called image_path() (a small path-building helper, not part of Scikit-Learn) to specify the file name as
"iris_tree.dot".
• feature_names: The names of the features in the dataset. Here, it takes the feature
names from the iris dataset, starting from index 2, which corresponds to petal
length and width.
• class_names: The names of the target classes. It takes the class names from the iris
dataset.
• rounded=True: It rounds the edges of the nodes in the visualization.
• filled=True: It fills the nodes with colors to indicate the class distribution.

$ dot -Tpng iris_tree.dot -o iris_tree.png

This line is not part of the Python code but represents a command to be executed in the
command-line interface. It uses the dot command-line tool from the Graphviz package
to convert the exported .dot file ("iris_tree.dot") to a .png image file named
"iris_tree.png". The -Tpng option specifies the output format as PNG.

The resulting image file, "iris_tree.png", will visually represent the trained Decision Tree,
showing the structure and decision-making process of the model based on the petal
length and width of iris flowers.
Making Predictions

Let’s see how the tree represented in Figure 6-1 makes predictions. Suppose you find an iris
flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks
whether the flower’s petal length is smaller than 2.45 cm. If it is, then you move down to the
root’s left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any
child nodes), so it does not ask any questions: you can simply look at the predicted class for
that node and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa). Now
suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must
move down to the root’s right child node (depth 1, right), which is not a leaf node, so it asks
another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely
an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It’s really
that simple.
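As a quick, hypothetical check of this traversal (reusing the tree_clf trained above), a flower with a 5 cm petal length and a 1.5 cm petal width should follow the root's right branch and then the depth-2 left branch:

tree_clf.predict([[5.0, 1.5]])   # likely array([1]), i.e. Iris-Versicolor, for a tree like the one described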
1. Samples Attribute: The "samples" attribute of a node tells us how many training
instances (data points) are considered or reach that specific node during the
construction of the Decision Tree. For example, if a node has 100 samples, it
means that 100 training instances satisfy the conditions to reach that node.
2. Value Attribute: The "value" attribute of a node provides information about the
distribution of classes among the training instances that reach that node. It tells
us the count of instances belonging to each class. For example, if a node has a
value attribute of [0, 1, 45], it means that among the instances reaching that
node, there are 0 instances of class Iris-Setosa, 1 instance of class Iris-Versicolor,
and 45 instances of class Iris-Virginica.
3. Gini Impurity: The "gini" attribute is a measure of impurity or the level of mixing
of different classes at a node. A node with a gini score of 0 is considered "pure"
because all the training instances reaching that node belong to the same class.
On the other hand, a higher gini score indicates a higher impurity or mixing of
classes at that node. The gini score is computed using a formula that takes into
account the class ratios at the node.

To illustrate this, let's take an example:

• At the depth-1 right node, there are 100 training instances with a petal length
greater than 2.45 cm.
• At the depth-2 left node (a child of the depth-1 right node), there are 54
instances with a petal length greater than 2.45 cm and a petal width smaller than
1.75 cm.
• The bottom-right node has a value of [0, 1, 45], indicating that it applies to 0 Iris-
Setosa instances, 1 Iris-Versicolor instance, and 45 Iris-Virginica instances.
• The depth-1 left node is considered pure (gini=0) because it only applies to
instances of the Iris-Setosa class.

In summary, the "samples" attribute counts the number of instances reaching a node,
the "value" attribute provides the class distribution at that node, and the "gini" attribute
measures the impurity or mixing of classes at a node. These attributes help understand
the characteristics and behavior of the Decision Tree during the training process.
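As a concrete illustration, the gini score of the bottom-right node above can be computed from its value attribute using G = 1 - sum of (class ratio)^2 over all classes (a small sketch, not Scikit-Learn code):

value = [0, 1, 45]               # class counts at the node (46 samples in total)
total = sum(value)
gini = 1 - sum((count / total) ** 2 for count in value)
print(round(gini, 4))            # roughly 0.0425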
Estimating Class Probabilities

…..

The CART Training Algorithm…….note this both.

…..

Computational Complexity
There are two aspects to consider here: the computational complexity of making predictions with a trained Decision Tree, and the complexity of training one:

1. Prediction Complexity: When making predictions with a Decision Tree, you start
at the root node and traverse the tree until you reach a leaf node. Decision Trees
are usually balanced, which means the number of nodes you need to traverse is
roughly O(log2(m)), where 'm' is the number of instances in the dataset. Since
each node only requires checking the value of one feature, the prediction
complexity is approximately O(log2(m)), regardless of the number of features.
This makes predictions very fast, even for large datasets.
2. Training Complexity: The training algorithm for Decision Trees compares all
features (or a subset if specified) on all training samples at each node during the
construction of the tree. This results in a training complexity of approximately O(n
× m log(m)), where 'n' is the number of features and 'm' is the number of training
instances. In simple terms, the training complexity grows linearly with the number
of features and logarithmically with the number of instances.

In summary, making predictions with Decision Trees is fast and efficient, with a
complexity of O(log2(m)). However, training Decision Trees involves comparing features
on all training samples at each node, resulting in a training complexity of approximately
O(n × m log(m)). It's important to consider the size of the dataset and the
computational resources available when training Decision Trees.

Gini Impurity or Entropy?

Gini Impurity: Gini impurity is a measure of impurity or disorder in a set of instances. It measures how likely it is to randomly pick two instances from the set and have them belong to different classes. A Gini impurity of 0 means the set is pure, i.e., it contains instances of only one class. Gini impurity is the default impurity measure used in Decision Trees. It is fast to compute and tends to isolate the most frequent class in its own branch of the tree. (Suppose we have a dataset with 100 instances, where 60
instances belong to Class A and 40 instances belong to Class B. Now, let's imagine we
randomly pick two instances from this dataset.

Scenario 1: Randomly picking two instances from the same class If we randomly pick
two instances from the dataset and both instances belong to the same class, let's say
both are from Class A, then the Gini impurity is 0 because there is no mixing or impurity
in this scenario.

Scenario 2: Randomly picking two instances from different classes If we randomly pick
two instances and they belong to different classes, for example, one instance is from
Class A and the other is from Class B, then there is some level of impurity or mixing of
classes in this scenario.)

Entropy: Entropy is another measure of impurity borrowed from information theory. It quantifies the average information content of a message or the disorder
in a set. In the context of Decision Trees, entropy measures how mixed or diverse
the classes are in a set. Entropy is 0 when all instances in the set belong to the
same class. Using entropy as the impurity measure tends to produce slightly
more balanced trees.
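For comparison, here is a small sketch computing the entropy of the same [0, 1, 45] node from the classification example, using H = -sum of p_k * log2(p_k) over the classes that are present:

from math import log2

value = [0, 1, 45]
total = sum(value)
entropy = -sum((c / total) * log2(c / total) for c in value if c > 0)
print(round(entropy, 3))         # roughly 0.151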

So, which one should you use? In practice, the choice between Gini impurity and
entropy does not often make a significant difference. They generally lead to similar
decision trees. Gini impurity is usually a good default choice as it is slightly faster to
compute. However, if you prefer slightly more balanced trees, you can use entropy as
the impurity measure.

It's important to note that both Gini impurity and entropy are valid impurity measures,
and the choice depends on your specific use case and preference.

Regularization Hyperparameters
In order to prevent Decision Trees from overfitting the training data, regularization
techniques are used. Regularization involves imposing certain restrictions on the
Decision Tree's structure during training.

Decision Trees are considered nonparametric models because they don't make strong
assumptions about the data's underlying distribution. They have the flexibility to adapt
and fit the training data very closely, which can lead to overfitting. Overfitting occurs
when the model becomes too complex and captures the noise or idiosyncrasies of the
training data, resulting in poor generalization to unseen data.

To address overfitting, we can apply regularization techniques to limit the Decision Tree's freedom. One common regularization hyperparameter is the maximum depth of
the tree, controlled by the max_depth parameter. By setting a maximum depth, we restrict
the tree's growth, which helps prevent it from becoming too complex and overfitting
the data. The default value for max_depth is None, indicating unlimited depth.

Additionally, there are other hyperparameters in the DecisionTreeClassifier class that can shape the tree and further regularize the model. These include:
• min_samples_split: It specifies the minimum number of samples required for a
node to be split further. Increasing this hyperparameter can prevent the tree from
splitting nodes that have a small number of instances, reducing overfitting.
• min_samples_leaf: It sets the minimum number of samples required to be in a leaf
node. Similarly, increasing this hyperparameter ensures that leaf nodes contain a
minimum number of instances, which can regularize the model.
• min_weight_fraction_leaf: It is similar to min_samples_leaf, but it is expressed as a
fraction of the total number of weighted instances rather than a fixed number.
• max_leaf_nodes: It limits the maximum number of leaf nodes in the tree. Reducing
this hyperparameter can prevent the tree from growing too large and complex.
• max_features: It controls the maximum number of features considered for splitting
at each node. By limiting the number of features, we can regularize the model
and reduce overfitting.

By adjusting these hyperparameters, we can control the complexity and size of the
Decision Tree, ultimately reducing the risk of overfitting and improving its generalization
to unseen data.
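To make this concrete, here is a minimal sketch that combines a few of these hyperparameters on the iris features used earlier (the specific values are illustrative assumptions, not recommendations):

from sklearn.tree import DecisionTreeClassifier

regularized_tree_clf = DecisionTreeClassifier(
    max_depth=4,            # cap the depth of the tree
    min_samples_leaf=5,     # every leaf must contain at least 5 instances
    max_leaf_nodes=16)      # limit the total number of leaves
regularized_tree_clf.fit(X, y)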
Regression

Decision Trees are also capable of performing regression tasks. Let’s build a regression tree
using Scikit-Learn’s DecisionTreeRegressor class, training it on a noisy quadratic dataset with
max_depth=2:
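The noisy quadratic dataset itself is not listed here; a minimal sketch of how such a dataset could be generated (the sample size and noise level are assumptions):

import numpy as np

np.random.seed(42)
m = 200
X = np.random.rand(m, 1)                                  # one input feature, x1 in [0, 1)
y = 4 * (X[:, 0] - 0.5) ** 2 + np.random.randn(m) / 10    # quadratic signal plus Gaussian noise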

from sklearn.tree import DecisionTreeRegressor

This line imports the DecisionTreeRegressor class from the sklearn.tree module. The
DecisionTreeRegressor is a class that represents a decision tree model for regression
tasks. It is specifically designed to predict continuous numeric values.

tree_reg = DecisionTreeRegressor(max_depth=2)

This line creates an instance of the DecisionTreeRegressor class and assigns it to the
variable tree_reg. The max_depth parameter is set to 2, which specifies the maximum
depth of the decision tree. It limits the number of levels the tree can grow to, restricting
its complexity and preventing overfitting.

tree_reg.fit(X, y)

This line trains the decision tree regression model using the training data X and the
corresponding target values y. The fit method is called on the tree_reg object to initiate
the training process. During training, the decision tree learns from the input features X
and their corresponding target values y to capture the underlying patterns and
relationships in the data.

In summary, the code imports the DecisionTreeRegressor class from scikit-learn, creates
an instance of the class with a specified maximum depth, and trains the decision tree
regression model using the provided training data.
This tree looks very similar to the classification tree you built earlier. The main difference is that instead of predicting a class in each node, it predicts a value. For example, suppose you want to make a prediction for a new instance with x1 = 0.6. You traverse the tree starting at the root, and you eventually reach the leaf node that predicts value=0.1106. This prediction is simply the average target value of the 110 training instances associated with this leaf node. This prediction results in a Mean Squared Error (MSE) equal to 0.0151 over these 110 instances. This model’s predictions are represented on the left of Figure 6-5. If you set max_depth=3, you get the predictions represented on the right. Notice how the predicted value for each region is always the average target value of the instances in that region. The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.
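The x1 = 0.6 example above could be checked like this (the exact number depends on the training data, so treat the value as illustrative):

tree_reg.predict([[0.6]])   # roughly array([0.1106]) for the dataset described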
Instability
Decision Trees have some limitations, and one of them is their sensitivity to the
orientation of the training data.

Decision Trees work best when the decision boundaries between classes are orthogonal,
meaning they are aligned with the feature axes (it means that the splits made by the
Decision Tree are vertical or horizontal lines). This is because Decision Trees make splits
along the feature axes to create decision boundaries. In simpler terms, they divide the
feature space by drawing straight lines or planes.

However, when the training data is rotated, the decision boundaries of the Decision Tree
may become unnecessarily complex. In the example shown in Figure 6-7, the dataset on
the left is linearly separable, and a Decision Tree can easily split it with a straight line.
But when the dataset is rotated by 45 degrees (as shown on the right), the decision
boundary becomes more convoluted and less intuitive.

Although both Decision Trees fit the training data perfectly in these cases, the model on
the right, with the rotated dataset, is more likely to overfit and not generalize well to
new, unseen data. It means that the model may not perform well in making predictions
for new instances outside of the training data.

One way to mitigate this issue is to use Principal Component Analysis (PCA), which is a
technique for dimensionality reduction (you can find more details in Chapter 8 of the
source material). PCA can transform the training data to a new coordinate system that
aligns better with the principal components, reducing the impact of dataset rotations. By
doing so, the Decision Tree may be able to create more meaningful and simpler
decision boundaries, improving its generalization capabilities.

In summary, Decision Trees prefer orthogonal decision boundaries, which can lead to
challenges when the training data is rotated. This sensitivity to data orientation can be
mitigated by applying techniques like PCA to better align the data with the decision
boundaries.
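A minimal sketch of this idea, rotating the data onto its principal components before fitting the tree (a hypothetical pipeline, not code from the source):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Project the data onto its principal components, then fit the tree on the rotated
# features so that the axis-aligned splits line up better with the data.
pca_tree_clf = make_pipeline(PCA(n_components=2), DecisionTreeClassifier(max_depth=2))
pca_tree_clf.fit(X, y)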
One of the main drawbacks of Decision Trees is that they are highly sensitive to small changes or
variations in the training data. Even a slight modification, such as removing or adding a single
instance, can lead to a completely different decision tree.

For example, let's say we have a dataset of iris flowers, and we train a Decision Tree on this data. If
we remove one specific iris flower with certain petal dimensions and train a new Decision Tree, the
resulting model may look very different from the previous one. It could have different splits, different
decision boundaries, and different predictions for new instances.

In fact, even when using the same training data and the same algorithm, due to the stochastic nature
of the training process (randomness involved), you may end up with different decision trees each
time you train the model, unless you set a specific value for the random_state hyperparameter to
ensure reproducibility.

This sensitivity to small variations in the training data is a limitation of Decision Trees and can make
them less reliable in certain situations.
CHAPTER 7

Ensemble Learning and Random Forests

Ensemble learning is a technique where multiple predictors, such as classifiers or regressors, are combined to make better predictions compared to individual predictors.
It is similar to the concept of "wisdom of the crowd," where aggregating the opinions or
predictions of a large group often leads to more accurate results than relying on a single
expert.

One popular ensemble method is called Random Forest. In a Random Forest, a group of
Decision Tree classifiers is trained, each on a different random subset of the training
data. When making predictions, the predictions of all individual trees are collected, and
the final prediction is based on majority voting, where the class that receives the most
votes is chosen.

Ensemble methods are powerful because they can improve prediction accuracy by
combining the strengths of multiple predictors. They are commonly used in Machine
Learning projects, especially towards the end, to further enhance the performance of the
models already built. In fact, many winning solutions in Machine Learning competitions
involve the use of ensemble methods, such as in the famous Netflix Prize competition.

In this chapter, we will explore various popular ensemble methods, including bagging,
boosting, stacking, and others, and specifically delve into Random Forests, which have
proven to be highly effective in practice.

Voting Classifiers

Imagine you have trained several classifiers, like a Logistic Regression classifier, an SVM
classifier, a Random Forest classifier, and a K-Nearest Neighbors classifier. Each of these
classifiers individually achieves around 80% accuracy, which means they are fairly good
but not perfect.

To create an even better classifier, you can combine the predictions of these classifiers.
One simple way to do this is by using majority voting. In other words, you let each
classifier make a prediction, and the final prediction is based on the class that receives
the most votes from the classifiers. This type of classifier is called a "hard voting"
classifier.
Surprisingly, this majority voting approach often results in a higher accuracy than the
best individual classifier in the ensemble. Even if each classifier is only slightly better
than random guessing (weak learners), the ensemble can still become a strong learner
and achieve high accuracy, as long as there are enough diverse weak learners.

To understand how this is possible, let's consider an analogy. Imagine you have a
slightly biased coin that has a 51% chance of landing on heads and a 49% chance of
landing on tails. If you toss this coin 1,000 times, you would generally get around 510
heads and 490 tails, which means a majority of heads. Mathematically, the probability of
obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the
coin, the higher the probability of getting a majority of heads.

This phenomenon is explained by the "law of large numbers." As you keep tossing the
coin more times, the ratio of heads to total tosses gets closer and closer to the
underlying probability of heads (51%). With a larger number of tosses (e.g., 10,000), the
probability of obtaining a majority of heads climbs over 97%. This is because the
randomness evens out over a large number of trials, revealing the true bias of the coin.
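These figures can be sanity-checked with the binomial distribution; a small sketch, assuming SciPy is available:

from scipy.stats import binom

# Probability that a 51%-heads coin produces a strict majority of heads in n tosses.
for n in (1000, 10000):
    p_majority = 1 - binom.cdf(n // 2, n, 0.51)
    print(n, round(p_majority, 3))   # roughly 0.73 and 0.98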

Similarly, suppose you have 1,000 classifiers in an ensemble, and each classifier is only
slightly better than random guessing, correctly predicting the outcome 51% of the time.
Individually, these classifiers are not very accurate.

However, when you combine their predictions and choose the majority voted class as
the final prediction, something interesting happens. In theory, if the classifiers are
perfectly independent and make different types of errors, you could achieve up to 75%
accuracy by relying on the majority vote.

But in reality, the classifiers in the ensemble are trained on the same data, so they are
likely to make similar mistakes. This means that even though the majority vote approach
is still used, there will be many cases where the majority vote is incorrect because
multiple classifiers in the ensemble made the same error.

As a result, the accuracy of the ensemble will be lower than the ideal scenario of 75%.
The extent to which the accuracy is reduced depends on the level of similarity in the
errors made by the classifiers. If they tend to make similar mistakes, the ensemble's
accuracy will be compromised.

In summary, by combining the predictions of multiple classifiers through majority voting, we can often achieve higher accuracy than the best individual classifier. However,
the effectiveness of this approach depends on having a diverse set of classifiers that
make different types of errors.
The following code creates and trains a voting classifier in Scikit-Learn, composed of three
diverse classifiers (the training set is the moons dataset, introduced in Chapter 5):
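The moons dataset and the train/test split are not shown in the snippet below; a minimal setup sketch (the sample size and noise level are assumptions):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)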
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

This code imports the required classes from the scikit-learn library. We import the
RandomForestClassifier class, the VotingClassifier class, the LogisticRegression class, and
the SVC class (support vector classifier).

Create individual classifier instances:


log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()

svm_clf = SVC()

Here, we create instances of three different classifiers: LogisticRegression, RandomForestClassifier, and SVC (support vector classifier). Each instance is assigned to a
variable (log_clf, rnd_clf, and svm_clf).
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

We create a VotingClassifier instance called voting_clf. The VotingClassifier combines the individual classifiers specified in the estimators parameter. In this case, we pass a list
of tuples where each tuple contains a string identifier for the classifier ( 'lr', 'rf', 'svc')
and the corresponding classifier instance (log_clf, rnd_clf, svm_clf). The voting parameter
is set to 'hard', indicating that majority voting will be used to make predictions.

voting_clf.fit(X_train, y_train)

We train the VotingClassifier using the fit method by providing the training data
(X_train) and corresponding labels (y_train).

Evaluate each classifier's accuracy on the test set:


from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))


1. Here, we import the accuracy_score function from sklearn.metrics. Then, we iterate
over each classifier (log_clf, rnd_clf, svm_clf, voting_clf).
• For each classifier, we fit it on the training data using fit(X_train, y_train).
• We then make predictions on the test data ( X_test) using predict(X_test)
and store the predictions in y_pred.
• Finally, we print the classifier's name (retrieved using __class__.__name__)
and its accuracy score ( accuracy_score(y_test, y_pred)).

The output will display the accuracy scores of each classifier on the test set, allowing us
to compare their performances. The VotingClassifier might achieve higher accuracy than
any individual classifier because it combines their predictions using majority voting.
1. Estimating Class Probabilities: If all the classifiers in the ensemble have the ability
to estimate class probabilities, meaning they have a predict_proba() method, we
can leverage this information to improve our predictions.
2. Soft Voting: Soft voting is an alternative to hard voting, where instead of simply
taking the majority vote, we consider the averaged class probabilities predicted
by each classifier. This approach gives more weight to highly confident votes.
3. Modifying the Voting Classifier: To enable soft voting in the VotingClassifier, we
need to make a few adjustments:
• Replace voting="hard" with voting="soft" when creating the
VotingClassifier instance.
• Ensure that all classifiers in the ensemble can estimate class probabilities.
4. Enabling Class Probability Estimation for SVC: By default, the SVC class in scikit-
learn does not estimate class probabilities. However, we can enable this
functionality by setting the probability hyperparameter of SVC to True. This
change triggers cross-validation to estimate class probabilities during training
and adds a predict_proba() method to SVC.
5. Improved Accuracy: When you modify the previous code to use soft voting and
make the necessary changes for enabling class probability estimation for SVC, you
will find that the voting classifier achieves an accuracy of over 91% on the test
set.

In summary, by using soft voting and leveraging class probabilities, the voting classifier
can make more informed decisions by considering the confidence of each classifier's
predictions. This often leads to improved performance compared to hard voting, where
only the majority vote is considered.
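A sketch of those adjustments applied to the earlier voting classifier (reusing the imports and training data from above; this is not the book's exact listing):

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)    # enables predict_proba() via cross-validation

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)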
Bagging and Pasting

…..

Out-of-Bag Evaluation
In bagging (Bootstrap Aggregating), each predictor in the ensemble is trained on a
random subset of the training instances, which means that some instances may be
sampled multiple times and some may not be sampled at all. The instances that are not
sampled in the training of a particular predictor are called out-of-bag (oob) instances.

Since a predictor never sees the oob instances during training, they can be used for
evaluation without the need for a separate validation set or cross-validation. This means
that we can evaluate the ensemble itself by averaging out the oob evaluations of each
predictor.

In Scikit-Learn, when creating a BaggingClassifier, you can set the parameter oob_score=True to request an automatic oob evaluation after training. This evaluates the
ensemble's performance using the oob instances. The resulting evaluation score is
available through the oob_score_ variable.

For example, if the oob score is 0.901, it means that the BaggingClassifier is expected to
achieve about 90.1% accuracy on unseen data.

To verify the performance, you can use the trained BaggingClassifier to make
predictions on a separate test set and compare the predictions with the true labels. The
accuracy of the predictions can be calculated using the accuracy_score function from
Scikit-Learn.

The oob decision function, available through the oob_decision_function_ variable, provides additional information. It returns the class probabilities for each oob training
instance. For example, for the first training instance, the oob decision function might
estimate a 68.25% probability of belonging to the positive class and a 31.75%
probability of belonging to the negative class.
(put more simply: When using bagging, some data points are not used to train each
individual predictor. These unused data points are called out-of-bag (oob) instances.
Since these instances are not seen by a predictor during training, we can use them to
evaluate the performance of the entire ensemble without needing a separate validation
set.
In Scikit-Learn, you can enable oob evaluation by setting oob_score=True when creating a
BaggingClassifier. This will automatically evaluate the ensemble using the oob instances.
The resulting oob score tells us how well the ensemble is likely to perform on unseen
data.

For example, if the oob score is 0.901, it means that the ensemble is expected to have an
accuracy of about 90.1% on new, unseen data.

To verify this estimate, you can use the trained BaggingClassifier to make predictions on
a separate test set and calculate the accuracy of the predictions using the accuracy_score
function.

The oob decision function provides additional information by giving the class
probabilities for each oob training instance. It tells us the estimated probabilities of an
instance belonging to different classes based on the ensemble's predictions.)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)


• In this line, we create a BaggingClassifier object called bag_clf. We pass
DecisionTreeClassifier() as the base estimator, which will be used to create each
predictor in the ensemble.
• n_estimators=500 specifies that we want to create an ensemble of 500 predictors.
• bootstrap=True indicates that bootstrapping will be used (sampling instances with
replacement) to create different subsets of the training data for each predictor.
• n_jobs=-1 tells the BaggingClassifier to use all available CPU cores for parallel
processing.
• oob_score=True enables the calculation of the out-of-bag (oob) score, which
estimates the accuracy of the ensemble using the oob instances.

bag_clf.fit(X_train, y_train)

This line fits (trains) the BaggingClassifier on the training data X_train and the
corresponding labels y_train.

bag_clf.oob_score_

This line retrieves the oob score of the BaggingClassifier. It returns the estimated
accuracy of the ensemble on the oob instances. In this case, the oob score is
approximately 90.1%

y_pred = bag_clf.predict(X_test)
This line makes predictions on the test data X_test using the trained bag_clf model. The
predicted labels are stored in y_pred

accuracy_score(y_test, y_pred)

This line calculates the accuracy score by comparing the predicted labels y_pred with the
true labels y_test. The accuracy score represents the proportion of correctly predicted
instances in the test set.
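The per-instance oob class probabilities mentioned earlier can be inspected as well (the printed numbers will vary from run to run):

bag_clf.oob_decision_function_[:3]   # e.g. the first row might look like [0.3175, 0.6825]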

Boosting
Boosting is an ensemble method used in machine learning to combine multiple weak
models into a strong model. The idea behind boosting is to train these weak models
sequentially, where each model tries to improve upon the mistakes made by its
predecessor.

In boosting, a weak model is one that performs only slightly better than random
guessing. These weak models can be simple decision trees or other basic algorithms.
The key is that they are not individually very accurate.

In simple terms, boosting is like assembling a team of weak players, training them one by one,
and having each player focus on improving the areas where the previous players struggled. By
combining the strengths of each player, the team becomes much stronger overall. Similarly,
boosting combines the predictions of weak models to create a more accurate and robust model.

AdaBoost
AdaBoost is a boosting algorithm that improves the accuracy of a model by focusing on the difficult
examples that the previous model struggled with. Here's a simplified explanation of how AdaBoost
works:

Imagine you have a training dataset with labeled examples. The first step is to train a simple model,
like a decision tree, on this dataset. This model will make predictions, but it might not be very
accurate and will likely get some examples wrong.

To correct the mistakes made by the first model, AdaBoost increases the importance or weight of the
misclassified examples. This means that in the next iteration, these misclassified examples will carry
more weight and be given more attention.

In the second iteration, another model is trained on the modified dataset, where the previously
misclassified examples are now more important. This new model will try to focus on these difficult
examples and improve its predictions. The process of adjusting the weights and training new models
continues for several iterations.
The final model, called the ensemble, is created by combining all the individual models. However, not
all models have the same influence. Models that performed well overall on the weighted dataset will
have more say in the final predictions, while models that struggled with the difficult examples will
have less influence.

Think of it as a team of players practicing together. The first player makes mistakes, so the coach
pays more attention to the areas where they struggled during the next practice. In the end, the coach
combines the strengths of all the players, but the players who improved the most get more playing
time.

Overall, AdaBoost gradually improves the ensemble's ability to handle difficult examples by giving
them more importance during training. It combines the predictions of multiple models, with each
model focusing on the areas where the previous models struggled, resulting in a stronger and more
accurate model.
Let's break down the AdaBoost algorithm and its steps in simpler words:

1. Initialize the instance weights: Each instance in the training set is initially given
the same weight, which is equal to 1 divided by the total number of instances.
2. Train a predictor: The first predictor (e.g., a decision tree) is trained on the
training set using the initial instance weights. This predictor makes predictions on
the training set.
3. Compute the weighted error rate: The weighted error rate of the predictor is
calculated by summing up the weights of the instances that it misclassified and
dividing it by the sum of all the instance weights.
4. Compute the predictor weight: The weight of the predictor is determined based
on its accuracy. More accurate predictors are assigned higher weights, and less
accurate predictors are assigned lower weights.
5. Update the instance weights: The weights of the instances are updated. Instances
that were misclassified by the predictor are given higher weights, while correctly
classified instances receive lower weights.
6. Normalize the instance weights: The instance weights are normalized by dividing
each weight by the sum of all the weights, ensuring they add up to 1.
7. Repeat steps 2-6: The process is repeated for a predetermined number of
iterations or until a perfect predictor is found. In each iteration, a new predictor is
trained using the updated instance weights.
8. Make predictions: To make predictions, all the predictors are combined, and their
predictions are weighted based on their predictor weights. The final prediction is
the one that receives the majority of the weighted votes.
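Here is a rough sketch of steps 3 to 6 for a single boosting round (a simplified illustration, not Scikit-Learn's implementation; the label arrays y and y_pred and the weight array w are assumed to exist):

import numpy as np

learning_rate = 1.0
misclassified = (y_pred != y)

# Step 3: weighted error rate of the current predictor
r = np.sum(w[misclassified]) / np.sum(w)

# Step 4: predictor weight -- more accurate predictors receive larger weights
alpha = learning_rate * np.log((1 - r) / r)

# Step 5: boost the weights of the misclassified instances
w = np.where(misclassified, w * np.exp(alpha), w)

# Step 6: normalize so the weights sum to 1
w = w / np.sum(w)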

The code provided demonstrates how to train an AdaBoost classifier using Scikit-Learn.
In this example, 200 decision stumps (simple decision trees with a single decision node
and two leaf nodes) are used as base estimators for AdaBoost. The algorithm is
configured to use the SAMME.R algorithm and a learning rate of 0.5.
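Reassembled from the line-by-line walkthrough below, the snippet being described is:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)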

Let's go through the code line by line to explain what each line does:

1. from sklearn.ensemble import AdaBoostClassifier: This line imports the AdaBoostClassifier class from the sklearn.ensemble module. AdaBoostClassifier is
the implementation of the AdaBoost algorithm for classification tasks.
2. ada_clf = AdaBoostClassifier(: This line creates an instance of the
AdaBoostClassifier class and assigns it to the variable ada_clf. The parentheses
indicate that we're calling the constructor of the class to initialize the object.
3. DecisionTreeClassifier(max_depth=1): Here, we create an instance of the
DecisionTreeClassifier class, which represents a decision tree. We set the max_depth
parameter to 1, meaning that the decision tree will have only one level (a decision
node and two leaf nodes). This decision tree will serve as the base estimator for
AdaBoost.
4. n_estimators=200: This parameter specifies the number of estimators (base
models) that AdaBoost will train. In this case, we set it to 200, meaning that
AdaBoost will create an ensemble of 200 decision trees.
5. algorithm="SAMME.R": This parameter specifies the boosting algorithm to be used
by AdaBoost. "SAMME.R" stands for Stagewise Additive Modeling using a Multiclass
Exponential loss function. It is a variant of AdaBoost that relies on class
probabilities and generally performs better.
6. learning_rate=0.5: The learning_rate parameter controls the contribution of each
base estimator in the ensemble. A smaller learning rate means each estimator has
less influence, and the ensemble's learning is more conservative. In this case, we
set it to 0.5.
7. ): The closing parenthesis indicates the end of the constructor call.
8. ada_clf.fit(X_train, y_train): This line fits the AdaBoostClassifier to the training
data. The fit() method is called on the ada_clf object, passing the training
features X_train and the corresponding target labels y_train.

In summary, the code sets up an AdaBoost classifier using decision stumps (decision
trees with a single level) as base estimators. It configures the number of estimators, the
boosting algorithm (SAMME.R), and the learning rate. Finally, it fits the AdaBoost model to
the training data, allowing it to learn from the provided features and labels.
Gradient Boosting

Gradient Boosting is another popular boosting algorithm that builds an ensemble of predictors by sequentially adding them to correct the mistakes made by the previous
predictors. However, it differs from AdaBoost in how it addresses the mistakes.

Instead of adjusting the instance weights like AdaBoost does, Gradient Boosting focuses
on fitting the new predictor to the residual errors made by the previous predictor.
Residual errors are the differences between the actual target values and the predictions
made by the ensemble so far.

Here's a step-by-step explanation of how Gradient Boosting works:

1. Train the first predictor: The first predictor, usually a simple model like a decision
tree, is trained on the training set.
2. Make predictions: This first predictor makes predictions on the training set.
3. Calculate residuals: The differences between the actual target values and the
predictions made by the first predictor are computed. These residuals represent
the errors made by the first predictor.
4. Train the second predictor: A new predictor is trained to predict these residuals.
The goal is to find a model that can learn from the errors of the previous
predictor and make better predictions.
5. Update the ensemble: The predictions of the second predictor are added to the
predictions of the first predictor, correcting some of the errors made by the first
predictor.
6. Repeat steps 3-5: The process is repeated, with each new predictor focusing on
the residuals and errors made by the ensemble so far. The new predictors are
added to the ensemble, gradually improving the overall predictions.
7. Stop criterion: The process continues until a predetermined number of predictors
are trained, or until a certain level of performance is reached.

In simpler terms, Gradient Boosting works by training multiple predictors, with each new
predictor trying to improve the errors made by the previous predictors. It does this by
focusing on the differences between the actual values and the predictions made by the
ensemble so far. By iteratively adding new predictors and correcting the errors, Gradient
Boosting creates a strong ensemble model that can make more accurate predictions
than any of the individual predictors alone.

Let’s go through a simple regression example using Decision Trees as the base predictors (of course, Gradient Boosting also works great with regression tasks). This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let’s fit a DecisionTreeRegressor to the training set (for example, a noisy quadratic training set).
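Reassembled from the line-by-line walkthrough below, the snippet being described is (X, y, and a new instance X_new are assumed to be defined):

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))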

Let's go through the code line by line:

1. from sklearn.tree import DecisionTreeRegressor: This line imports the DecisionTreeRegressor class from Scikit-Learn, which is used to create decision
tree-based regression models.
2. tree_reg1 = DecisionTreeRegressor(max_depth=2): This line creates the first decision tree regressor (tree_reg1) with a maximum depth of 2. A decision tree with a maximum depth of 2 can have at most three decision nodes (the root plus its two children) and four leaf nodes.
3. tree_reg1.fit(X, y): This line fits the first decision tree regressor to the training
data X and target values y.
4. y2 = y - tree_reg1.predict(X): This line calculates the residual errors made by the
first predictor (tree_reg1) on the training set. It subtracts the predicted values
from the actual target values (y) to obtain the residuals (y2).
5. tree_reg2 = DecisionTreeRegressor(max_depth=2): This line creates the second
decision tree regressor (tree_reg2) with a maximum depth of 2.
6. tree_reg2.fit(X, y2): This line fits the second decision tree regressor to the
training data X and the residual errors y2 from the first predictor.
7. y3 = y2 - tree_reg2.predict(X): This line calculates the residual errors made by the
second predictor (tree_reg2) on the training set. It subtracts the predicted values
from tree_reg2 from the residuals y2 to obtain new residuals y3.
8. tree_reg3 = DecisionTreeRegressor(max_depth=2): This line creates the third decision
tree regressor (tree_reg3) with a maximum depth of 2.
9. tree_reg3.fit(X, y3): This line fits the third decision tree regressor to the training
data X and the residual errors y3 from the second predictor.
10. y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3)):
This line makes predictions on a new instance X_new using the ensemble
containing all three trees. It sums up the predictions of all three trees ( tree_reg1,
tree_reg2, and tree_reg3) to obtain the final prediction y_pred.

In summary, this code creates an ensemble of three decision tree regressors, each
correcting the errors of the previous predictor. The final prediction is obtained by
summing the predictions of all three trees. This is the basic idea behind Gradient Tree
Boosting, where each tree in the ensemble tries to improve upon the mistakes of the
previous tree to make more accurate predictions.

Stacking

Imagine you have a bunch of friends who are really good at guessing things, like
guessing the weight of a watermelon. Each friend has a different way of making their
guesses, and they often come up with slightly different answers.

Now, instead of just taking the average of all their guesses (like hard voting), we want to
do something smarter. So, we have a new friend, let's call them the "Blender." The
Blender's job is to take the guesses of all your other friends and use those guesses as
inputs to make a final and more accurate guess about the weight of the watermelon.

For example, if your friends guessed 3.1 kg, 2.7 kg, and 2.9 kg, the Blender might take
these three guesses as inputs and decide that the final prediction is 3.0 kg. The Blender
learns from the previous guesses made by your friends and tries to come up with the
best possible guess based on that knowledge.

So, in summary, stacking is an ensemble method where instead of using simple methods like taking the majority vote (in classification tasks) or the average (in
regression tasks), we use a more sophisticated model (the Blender) to combine the
predictions of other models and make a better final prediction. This can often lead
to more accurate and robust predictions.
To train the blender, we need to follow a specific approach. First, we take our training
dataset and split it into two subsets.

The first subset is used to train the predictors of the first layer. Think of these predictors as your friends who are good at guessing the
weight of a watermelon. Each predictor learns from a portion of the training data and
develops its own way of making guesses. These predictors are trained independently
and aim to make accurate predictions on their own.

Now, the second subset, which we can call the "hold-out set," is not used for training
the predictors. Instead, it is set aside for evaluating the performance of the blender later
on. It acts as a test set for the blender's predictions.

So, the first layer of predictors learns from one part of the training data, and we evaluate
their performance using the hold-out set.

After training the first layer predictors and evaluating their performance, we move on to
the next step, which involves using the predictions made by these predictors as inputs
for training the blender (the meta learner). This is where the hold-out set comes into
play.

We take the predictions made by each predictor on the hold-out set, and these
predictions become the new training data for the blender. The blender learns from these
predictions and tries to find patterns or relationships among them to make a more
accurate final prediction.

In simpler terms, we split the training data into two parts. The first part is used to train
individual predictors, while the second part is kept aside to evaluate the blender's
performance. The predictions made by the individual predictors on the second part are
used to train the blender, enabling it to make better predictions based on the patterns it
discovers.

By using this approach, we can create a more powerful ensemble model that takes into
account the strengths of individual predictors and leverages their collective knowledge
to make more accurate predictions.
Once we have trained the individual predictors in the first layer, we move on to the next
step. In this step, we want to evaluate the performance of the predictors on a set of data
they have never seen before. This set is called the "held-out set."

We use the trained predictors to make predictions on the instances in the held-out set.
These predictions are considered "clean" because the predictors never encountered
these instances during their training phase. So, it's like testing the predictors on
completely new and unseen data.

Now, for each instance in the held-out set, we have three predictions made by the three
individual predictors. Let's say we have an instance A, and the three predictors make
predictions of 3.1 kg, 2.7 kg, and 2.9 kg for this instance.

To train the blender, we create a new training set using these predicted values as input
features. So, we take the predicted values (3.1 kg, 2.7 kg, and 2.9 kg) along with the
actual target values (the true weight of the watermelon) for each instance in the held-
out set.

This new training set is three-dimensional because it has one input feature per first-layer predictor (three predicted values per instance). The target values are kept as they are.

Finally, we train the blender using this new training set. The blender's job is to learn the
relationship between the predicted values from the first layer and the actual target
values. It learns to predict the target value given the predictions made by the first layer
predictors.

In simpler terms, we use the trained predictors to make predictions on a separate set of
unseen data. For each instance in this held-out set, we have multiple predictions from
the predictors. We then create a new training set using these predictions and the actual
target values. The blender is trained on this new training set to learn how to make
accurate predictions based on the predictions made by the first layer predictors.

By doing this, we ensure that the blender can take advantage of the collective
knowledge of the first layer predictors and make more accurate predictions based on
their insights.

In a multilayer stacking ensemble, we can train several different blenders, each using a
different algorithm or technique. For example, we can have one blender using Linear
Regression and another using Random Forest Regression. This creates a whole layer of
blenders.

To train these blenders, we need to split the training set into three subsets. The first
subset is used to train the first layer of predictors. The second subset is used to create
the training set for the second layer of blenders, using the predictions made by the
predictors of the first layer. Similarly, the third subset is used to create the training set
for the third layer of blenders, using the predictions made by the predictors of the
second layer.

Once we have trained all the layers of blenders, we can make predictions for a new
instance by going through each layer sequentially. Each layer takes the predictions from
the previous layer as inputs and generates its own predictions.

Now, the limitation is that Scikit-Learn, a popular machine learning library, does not
directly support stacking. This means that it does not have built-in functions or classes
specifically for implementing stacking ensembles.
However, it is not too difficult to implement stacking on your own using Scikit-Learn or
other programming tools. You can create your own custom implementation of the
stacking ensemble by carefully following the steps and logic we discussed earlier.
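A minimal hand-rolled sketch of the two-layer scheme described above (the structure and hyperparameter values are assumptions, not code from the source):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Split the training data: one part for the first layer, one hold-out part for the blender.
X_layer1, X_holdout, y_layer1, y_holdout = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42)

# First layer: train the individual predictors independently.
first_layer = [RandomForestClassifier(random_state=42),
               SVC(probability=True, random_state=42)]
for clf in first_layer:
    clf.fit(X_layer1, y_layer1)

# Build the blender's training set from the first layer's hold-out predictions.
holdout_predictions = np.column_stack(
    [clf.predict(X_holdout) for clf in first_layer])

# Blender (meta learner): learns how to combine the first-layer predictions.
blender = LogisticRegression()
blender.fit(holdout_predictions, y_holdout)

# To predict new instances: run the first layer, then feed its outputs to the blender.
new_predictions = np.column_stack(
    [clf.predict(X_test) for clf in first_layer])
y_pred = blender.predict(new_predictions)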

In summary, a multilayer stacking ensemble involves training multiple blenders using different algorithms or techniques. The training set is split into subsets for training each
layer, and predictions from each layer are used as inputs for the next layer.
Unfortunately, Scikit-Learn does not have direct support for stacking, but you can create
your own implementation if needed.
