
Introduction To Boosting - 2


INTRODUCTION TO

BOOSTING
https://cse.iitk.ac.in/users/piyush/courses/ml_autumn16/771A_lec21_slides.pdf
https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec10.pdf
https://www.cs.toronto.edu/~hinton/csc2515/notes/lec11boo.htm
http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/
DEFINITION
• The term ‘Boosting’ refers to a family of algorithms that convert weak learners into strong learners.
• Let’s understand this definition in detail by solving a problem of spam email identification:
• How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using the following criteria:
• The email has only one image file (a promotional image): it’s SPAM
• The email has only link(s): it’s SPAM
• The email body consists of a sentence like “You won a prize money of $ xxxxxx”: it’s SPAM
• The email is from our official domain “metu.edu.tr”: not SPAM
• The email is from a known source: not SPAM
• Above, we’ve defined multiple rules to classify an email as ‘spam’ or ‘not spam’. But do you think these rules individually are strong enough to successfully classify an email? No.
• Individually, these rules are not powerful enough to classify an email as ‘spam’ or ‘not spam’. Therefore, these rules are called weak learners.
DEFINITION
• To convert weak learners into a strong learner, we combine the predictions of the weak learners using methods such as:
•   Using an average / weighted average
•   Considering the prediction with the higher vote (majority voting)
• For example: above, we have defined 5 weak learners. Out of these 5, 3 vote ‘SPAM’ and 2 vote ‘Not SPAM’. In this case, by default, we’ll consider the email as SPAM because ‘SPAM’ has the higher vote (3).
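As a minimal sketch of such a majority vote in R (the vote labels below are made up for illustration, not taken from the slides):

votes <- c("SPAM", "SPAM", "SPAM", "NOT SPAM", "NOT SPAM")  # predictions of the 5 weak rules
names(which.max(table(votes)))                              # majority vote -> "SPAM"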
How Do Boosting Algorithms Work?
• To find weak rules, we apply a base learning algorithm with a different distribution over the training data each time. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
• For choosing the right distribution, these are the steps:
Step 1: The base learner takes the initial distribution and assigns equal weight (attention) to each observation.
Step 2: If the first base learning algorithm makes prediction errors, we pay higher attention to the observations with prediction errors. Then we apply the next base learning algorithm.
Step 3: Iterate Step 2 until the limit on the number of base learners is reached or the desired accuracy is achieved.
• Finally, the algorithm combines the outputs of the weak learners to create a strong learner, which improves the prediction power of the model. Boosting pays higher attention to examples that are misclassified or have higher errors under the preceding weak rules.
Types of Boosting Algorithms
• The underlying engine used for boosting can be almost anything: a decision stump, a margin-maximizing classification algorithm, etc. Popular boosting algorithms built on such engines include:
• AdaBoost (Adaptive Boosting)
• Gradient Tree Boosting
• GentleBoost
• LPBoost
• BrownBoost
• XGBoost
• CatBoost
• LightGBM
AdaBoost

• Box 1: You can see that we have assigned equal weights to each data point and applied a decision stump to classify them as + (plus) or – (minus). The decision stump (D1) has generated a vertical line on the left side to classify the data points. We see that this vertical line has incorrectly predicted three + (plus) as – (minus). In such a case, we assign higher weights to these three + (plus) and apply another decision stump.
AdaBoost
• Box 2: Here, you can see that the size of the three incorrectly predicted + (plus) is bigger compared to the rest of the data points. In this case, the second decision stump (D2) will try to predict them correctly. Now, a vertical line (D2) on the right side of this box has classified the three misclassified + (plus) correctly. But it has again caused misclassification errors, this time with three – (minus). Again, we assign higher weights to the three – (minus) and apply another decision stump.
AdaBoost
• Box 3: Here, the three – (minus) are given higher weights. A decision stump (D3) is applied to predict these misclassified observations correctly. This time a horizontal line is generated to classify + (plus) and – (minus), based on the higher weight of the misclassified observations.
AdaBoost
• Box 4: Here, we have combined D1, D2 and D3 to form a strong prediction with a more complex rule than any individual weak learner. You can see that this combination has classified the observations quite well compared to any individual weak learner.
AdaBoost (Adaptive Boosting)
• It works in a similar way to the method discussed above. It fits a sequence of weak learners on differently weighted training data. It starts by predicting on the original data set, giving equal weight to each observation. If the prediction of the first learner is incorrect, it gives higher weight to the observations that have been predicted incorrectly. Being an iterative process, it continues to add learners until a limit on the number of models or on accuracy is reached.
• Mostly, we use decision stumps with AdaBoost. But we can use any machine learning algorithm as the base learner if it accepts weights on the training data set. We can use AdaBoost for both classification and regression problems.
Let’s try to visualize one Classification
Problem
• Look at the below diagram :

We start with the first box. We see one vertical line, which becomes our first weak learner. In total we have 3/10 misclassified observations. We now give higher weights to the 3 misclassified plus (+) observations, so it becomes very important to classify them correctly. Hence the vertical line towards the right edge in the next box. We repeat this process and then combine each of the learners with appropriate weights.
Explaining underlying mathematics
• How do we assign weights to observations?
• We always start with a uniform distribution assumption. Let’s call it D1, which is 1/n for all n observations.
• Step 1: Assume a weight alpha(t) for the current iteration t
• Step 2: Get a weak classifier h(t)
• Step 3: Update the population distribution for the next step:
      D(t+1)(i) = D(t)(i) * exp( -alpha(t) * y(i) * h(t)(x(i)) ) / Z(t)
where Z(t) is a normalization factor chosen so that D(t+1) sums to 1.
Explaining underlying mathematics
• Simply look at the argument in the exponent. Alpha is a kind of learning rate, y is the actual response (+1 or -1), and h(x) is the class predicted by the learner. Essentially, when the learner is wrong the exponent becomes +1*alpha, and otherwise -1*alpha. So the weight of an observation increases if its prediction went wrong the last time.
• Step 4: Use the new population distribution to find the next learner.
• Step 5: Iterate Steps 1 – 4 until no hypothesis is found which can improve further.
• Step 6: Take a weighted combination of all the learners used so far. But what are the weights? The weights are simply the alpha values. Alpha is calculated from the weighted error of each learner:
      alpha(t) = (1/2) * ln( (1 - error(t)) / error(t) )
Explaining underlying mathematics
• Output the final hypothesis:
      H(x) = sign( sum over t of alpha(t) * h(t)(x) )
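A minimal sketch of these steps in R (not the lecture’s own code; the function names below are made up, and rpart stumps stand in for the decision stumps). It assumes a numeric feature data frame X and binary labels y coded as -1/+1:

library(rpart)

adaboost_fit <- function(X, y, n_rounds = 20) {
  n <- nrow(X)
  D <- rep(1 / n, n)                          # D1: uniform distribution over observations
  stumps <- list(); alphas <- numeric(0)
  df <- data.frame(X, y = factor(y))
  for (t in seq_len(n_rounds)) {
    stump <- rpart(y ~ ., data = df, weights = D,      # Step 2: weak classifier h(t)
                   control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- ifelse(predict(stump, df, type = "class") == "1", 1, -1)
    err <- max(sum(D * (pred != y)), 1e-10)            # weighted error under D(t)
    if (err >= 0.5) break                              # no better than chance: stop
    alpha <- 0.5 * log((1 - err) / err)                # alpha(t)
    D <- D * exp(-alpha * y * pred)                    # Step 3: reweight observations
    D <- D / sum(D)                                    # normalize (the Z(t) term)
    stumps[[t]] <- stump; alphas[t] <- alpha
  }
  list(stumps = stumps, alphas = alphas)
}

adaboost_predict <- function(model, X) {
  df <- data.frame(X)
  score <- rep(0, nrow(df))
  for (t in seq_along(model$stumps)) {
    pred <- ifelse(predict(model$stumps[[t]], df, type = "class") == "1", 1, -1)
    score <- score + model$alphas[t] * pred            # weighted vote
  }
  sign(score)                                          # final hypothesis H(x)
}

With stump-depth trees, each round focuses on the region of the input space that previous rounds got wrong, which is exactly the reweighting story of Boxes 1–4 above.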
Gradient Boosting
• Gradient boosting trains many models sequentially. Each new model gradually reduces the loss function of the whole system (think of y = ax + b + e, where the error term e needs special attention) using the gradient descent method. The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable.
• The principal idea behind this algorithm is to construct new base learners that are maximally correlated with the negative gradient of the loss function of the whole ensemble.
Gradient Boosting
• Type of problem – You have a set of predictor variables x1, x2 and x3, and you need to predict y, which is a continuous variable.
• Steps of the gradient boosting algorithm (a minimal R sketch follows this list):
Step 1: Take the mean of y as the initial prediction for all observations.
Step 2: Calculate the error (residual) of each observation from the latest prediction (initially the mean).
Step 3: Find the variable and split value that best separate the errors; the resulting fit becomes the latest prediction.
Step 4: Calculate the error of each observation from the latest prediction (the mean of each side of the split).
Step 5: Repeat Steps 3 and 4 until the objective function is minimized.
Step 6: Take a weighted sum of all the learners to come up with the final model.
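A minimal sketch of these steps in R for square loss (illustrative only: the data, the learning rate nu and the number of rounds below are made-up choices, not from the slides):

library(rpart)

set.seed(1)
df <- data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
df$y <- 3 * df$x1 + sin(6 * df$x2) + rnorm(100, sd = 0.2)    # toy continuous target

nu <- 0.1                                    # shrinkage / learning rate
F_pred <- rep(mean(df$y), nrow(df))          # Step 1: start from the mean
trees <- list()

for (m in 1:50) {
  df$resid <- df$y - F_pred                  # Step 2: residuals from the latest prediction
  fit <- rpart(resid ~ x1 + x2 + x3, data = df,
               control = rpart.control(maxdepth = 2))   # Steps 3-4: small tree on the residuals
  trees[[m]] <- fit
  F_pred <- F_pred + nu * predict(fit, df)   # Step 5: update the running prediction
}

mean((df$y - F_pred)^2)                      # training MSE of the final additive model

The shrinkage factor nu plays the role of the weights in Step 6: each tree contributes only a small, weighted correction to the running prediction.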
Example
• Assume you are given a previous model M to improve on. Currently, you observe that the model has an accuracy of 80% (by some metric). How do you go further?
• One simple way is to build an entirely different model using a new set of input variables and try better ensemble learners. On the contrary, here is a much simpler way. It goes like this:
Y = M(x) + error
• What if I am able to see that the error is not white noise but has some correlation with the outcome (Y)? What if we can develop a model on this error term? Like:
error = G(x) + error2
Example
• Probably, you’ll see the accuracy improve to a higher number, say 84%. Let’s take another step and regress against error2:
error2 = H(x) + error3
• Now we combine all of these together:
Y = M(x) + G(x) + H(x) + error3
• This will probably have an accuracy of even more than 84%. What if we can find optimal weights for each of the three learners:
Y = alpha * M(x) + beta * G(x) + gamma * H(x) + error4
Example
• If we find good weights, we have probably made an even better model. This is the underlying principle of a boosting learner.
• Boosting is generally done on weak learners, which do not have the capacity to fit the white noise.
• Boosting can lead to overfitting, so we need to stop at the right point.
How to improve regression results
• You are given (x1, y1), (x2, y2), …, (xn, yn), and the task is to fit a model F(x) to minimize square loss.
• Suppose your friend wants to help you and gives you a model F.
• You check the model and find that it is good but not perfect. There are some mistakes: F(x1) = 0.8 while y1 = 0.9, and F(x2) = 1.4 while y2 = 1.3. How can you improve this model?
• Rule of the game:
You are not allowed to remove anything from F or change any parameter in F.
You can add an additional model (regression tree) h to F, so the new prediction will be F(x) + h(x).
• Simple solution:
• You wish to improve the model such that
F(x1) + h(x1) = y1
F(x2) + h(x2) = y2
...
F(xn) + h(xn) = yn
Or, equivalently, you wish
h(x1) = y1 - F(x1)
h(x2) = y2 - F(x2)
...
h(xn) = yn - F(xn)
• Can any regression tree h achieve this goal perfectly?
• Maybe not...
• But some regression tree might be able to do this approximately.
• How?
• Just fit a regression tree h to the data
(x1, y1 - F(x1)), (x2, y2 - F(x2)), …, (xn, yn - F(xn))
• Congratulations, you get a better model!
• The yi - F(xi) are residuals. These are the parts that the existing model F cannot do well.
• The role of h is to compensate for the shortcomings of the existing model F. If the new model F + h is still not satisfactory, we can add another regression tree...
• We are improving the predictions on the training data, but is the procedure also useful for test data?
• Yes! Because we are building a model, and the model can be applied to test data as well.
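This single correction step can be written as a short R sketch (hypothetical objects: a data frame df with predictor columns and a column y, and an existing fitted model F_hat that predict() accepts):

library(rpart)
current <- predict(F_hat, df)                 # predictions of the existing model F
df$res <- df$y - current                      # residuals yi - F(xi)
h <- rpart(res ~ . - y, data = df)            # fit a regression tree h to the residuals
new_prediction <- current + predict(h, df)    # improved model F(x) + h(x)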
• How is this related to gradient descent?
Gradient Descent
How is this related to gradient descent?
•For regression with square loss,
residual ⇔ negative gradient
fit h to residual ⇔ fit h to negative gradient
update F based on residual ⇔ update F based on negative gradient

•So we are actually updating our model using gradient descent! It turns
out that the concept of gradients is more general and useful than the
concept of residuals. So from now on, let's stick with gradients.
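To spell out the equivalence for square loss (a standard one-line derivation, not an equation taken from the slide):

L(y, F(x)) = (1/2) * (y - F(x))^2
∂L/∂F(x) = F(x) - y
negative gradient = -∂L/∂F(x) = y - F(x) = residual

So fitting h to the residuals is the same as fitting h to the negative gradient, i.e. taking a gradient descent step in function space.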
Problem
• Recognize a given handwritten capital letter.
• Multi-class classification
• 26 classes. A,B,C,...,Z

• Data Set
• http://archive.ics.uci.edu/ml/datasets/Letter+Recognition
• 20000 data points, 16 features
• Feature Extraction

• Feature Vector= (2, 1, 3, 1, 1, 8, 6, 6, 6, 6, 5, 9, 1, 7, 5, 10)


• Label = G
• Model
• 26 score functions (our models): FA, FB, FC, …, FZ
• FA(x) assigns a score for class A
• the scores are used to calculate class probabilities, e.g. via the softmax transform:
      Pc(x) = exp(Fc(x)) / sum over classes k of exp(Fk(x))
• predicted label = the class that has the highest probability


• Loss function for each data point:
Step 1: Turn the label yi into a (true) probability distribution Yc(xi). For example: y5 = G, so YA(x5) = 0, YB(x5) = 0, …, YG(x5) = 1, …, YZ(x5) = 0.
• Step 2: Calculate the predicted probability distribution Pc(xi) based on the current models FA, FB, …, FZ. For example: PA(x5) = 0.03, PB(x5) = 0.05, …, PG(x5) = 0.3, …, PZ(x5) = 0.05.
Step 3: Calculate the difference between the true probability distribution and the predicted probability distribution. Here we use the KL-divergence:
      KL(Y || P) = sum over classes c of Yc(xi) * log( Yc(xi) / Pc(xi) )

•Goal
• minimize the total loss (KL-divergence)
• for each data point, we wish the predicted probability distribution to match the true probability distribution as closely as possible
We achieve this goal by adjusting our models FA, FB, …, FZ.
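A small illustrative R snippet (with made-up scores, not the lecture’s data) showing how 26 class scores become probabilities via softmax and how the KL-divergence against the true one-hot distribution for label “G” is computed:

set.seed(5)
scores <- rnorm(26); names(scores) <- LETTERS         # hypothetical FA(x), ..., FZ(x)
P <- exp(scores) / sum(exp(scores))                   # predicted probability distribution
Y <- as.numeric(LETTERS == "G"); names(Y) <- LETTERS  # true distribution: all mass on "G"
KL <- sum(ifelse(Y > 0, Y * log(Y / P), 0))           # KL(Y || P), with 0*log(0) taken as 0
KL                                                    # the loss for this data point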
Differences
•FA, FB, …, FZ vs F
•a matrix of parameters to optimize vs a column of parameters to
optimize

•a matrix of gradients vs a column of gradients


Bagging vs Boosting
• No clear winner; usually depends on the data
• Bagging is computationally more efficient than boosting (note that
bagging can train the M models in parallel, boosting can’t)
• Both reduce variance (and overfitting) by combining different models
• The resulting model has higher stability as compared to the individual
ones
• Bagging usually can’t reduce the bias, boosting can (note that in
boosting, the training error steadily decreases)
• Bagging usually performs better than boosting if we don’t have a high
bias and only want to reduce variance (i.e., if we are overfitting)
XGBoost (Extreme Gradient Boosting)
• Ever since its introduction in 2014, XGBoost has been lauded as the holy
grail of machine learning hackathons and competitions. From predicting
ad click-through rates to classifying high energy physics events, XGBoost
has proved its mettle in terms of performance – and speed.
• Execution speed: Generally, XGBoost is fast. Really fast when compared to other implementations of gradient boosting. However, the more recently introduced LightGBM is faster than XGBoost.
• Model Performance: XGBoost dominates structured or tabular datasets on
classification and regression predictive modeling problems. The evidence
is that it is the go-to algorithm for competition winners on the Kaggle
competitive data science platform.
What Algorithm Does XGBoost Use?
• The XGBoost library implements the gradient boosting decision tree algorithm.
• This algorithm goes by lots of different names such as gradient boosting, multiple
additive regression trees, stochastic gradient boosting or gradient boosting
machines.
• Boosting is an ensemble technique where new models are added to correct the
errors made by existing models. Models are added sequentially until no further
improvements can be made. A popular example is the AdaBoost algorithm that
weights data points that are hard to predict.
• Gradient boosting is an approach where new models are created that predict the
residuals or errors of prior models and then added together to make the final
prediction. It is called gradient boosting because it uses a gradient descent
algorithm to minimize the loss when adding new models.
• This approach supports both regression and classification predictive modeling
problems.
Math Behind the Boosting Algorithms
• In boosting, the trees are built sequentially such that each subsequent
tree aims to reduce the errors of the previous tree. Each tree learns from
its predecessors and updates the residual errors. Hence, the tree that
grows next in the sequence will learn from an updated version of the
residuals.
• The base learners in boosting are weak learners in which the bias is high,
and the predictive power is just a tad better than random guessing. Each
of these weak learners contributes some vital information for prediction,
enabling the boosting technique to produce a strong learner by
effectively combining these weak learners. The final strong learner brings
down both the bias and the variance.
Math Behind the Boosting Algorithms
• In contrast to bagging techniques like Random Forest, in which trees
are grown to their maximum extent, boosting makes use of trees
with fewer splits.
• Such small trees, which are not very deep, are highly interpretable.
Parameters like the number of trees or iterations, the rate at which
the gradient boosting learns, and the depth of the tree, could be
optimally selected through validation techniques like k-fold cross
validation. 
• Having a large number of trees might lead to overfitting. So, it is
necessary to carefully choose the stopping criteria for boosting.
Math Behind the Boosting Algorithms
• Boosting consists of three simple steps:
• An initial model F0 is defined to predict the target variable y. This model will be associated with a residual (y – F0)
• A new model h1 is fit to the residuals from the previous step
• Now, F0 and h1 are combined to give F1, the boosted version of F0. The mean squared error from F1 will be lower than that from F0:
      F1(x) = F0(x) + h1(x)
• To improve the performance of F1, we could model the residuals of F1 and create a new model F2:
      F2(x) = F1(x) + h2(x)
• This can be done for ‘m’ iterations, until the residuals have been minimized as much as possible:
      Fm(x) = Fm-1(x) + hm(x)
• Here, the additive learners do not disturb the functions created in the previous steps. Instead, they impart information of their own to bring down the errors.
Demonstrating the Potential of Boosting
• Consider the following data, where years of experience is the predictor variable and salary (in thousand dollars) is the target. Using regression trees as base learners, we can create a model to predict the salary. For the sake of simplicity, we choose square loss as our loss function, and our objective is to minimize the squared error.
• As the first step, the model should be initialized with a function F0(x). F0(x) should be a function which minimizes the loss function, i.e. the MSE (mean squared error):
      F0(x) = argmin over gamma of  sum over i of (yi - gamma)^2
• Taking the first derivative of the above expression with respect to gamma and setting it to zero, we see that it is minimized at the mean. So, the boosting model can be initialized with F0(x) = mean(y).
• F0(x) gives the predictions from the first stage of our model. Now, the residual error for each instance is (yi – F0(xi)).
• We can use the residuals from F0(x) to create h1(x). h1(x) will be a regression tree which tries to reduce the residuals from the previous step. The output of h1(x) won’t be a prediction of y; instead, it will help in building the successive function F1(x), which will bring down the residuals.
• The additive model h1(x) computes the mean of the residuals (y – F0) at each leaf of the tree. The boosted function F1(x) is obtained by summing F0(x) and h1(x). In this way, h1(x) learns from the residuals of F0(x) and suppresses them in F1(x).
• The MSEs for F0(x), F1(x) and F2(x) are 875, 692 and 540, respectively. It’s amazing how these simple weak learners can bring about such a large reduction in error!
• Note that each learner, hm(x), is trained on the residuals. All the additive learners in boosting are modeled on the residual errors at each step. Intuitively, the boosting learners make use of the patterns in the residual errors. At the stage where maximum accuracy is reached by boosting, the residuals appear to be randomly distributed without any pattern.
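The staged view above can be re-created with a few lines of R. The experience/salary values below are made up for illustration (the slide’s own data table is not reproduced here), so the MSEs will not match the 875/692/540 figures, but they shrink in the same way:

library(rpart)

d <- data.frame(exp_yrs = 1:10,
                salary  = c(30, 35, 42, 48, 55, 63, 70, 80, 92, 105))  # in thousand dollars
stump <- rpart.control(maxdepth = 1, minsplit = 2, cp = 0)

F0 <- rep(mean(d$salary), nrow(d))              # F0(x): the mean of y
d$r1 <- d$salary - F0                           # residuals of F0
h1 <- rpart(r1 ~ exp_yrs, data = d, control = stump)
F1 <- F0 + predict(h1, d)                       # F1(x) = F0(x) + h1(x)
d$r2 <- d$salary - F1                           # residuals of F1
h2 <- rpart(r2 ~ exp_yrs, data = d, control = stump)
F2 <- F1 + predict(h2, d)                       # F2(x) = F1(x) + h2(x)

c(MSE_F0 = mean((d$salary - F0)^2),
  MSE_F1 = mean((d$salary - F1)^2),
  MSE_F2 = mean((d$salary - F2)^2))             # the MSE drops at every stage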
XGBoost (Extreme Gradient Boosting)
• What is the difference between the R gbm (gradient boosting machine)
and XGBoost (extreme gradient boosting)?
• Both XGBoost and gbm follow the principle of gradient boosting. There are, however, differences in the modeling details. Specifically, XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance.
• Objective Function : Training Loss + Regularization
• The regularization term controls the complexity of the model, which helps us avoid overfitting. This sounds a bit abstract, so let us consider the problem in the following picture: you are asked to visually fit a step function given the input data points in the upper left corner of the image. Which solution among the three do you think is the best fit?
XGBoost
• In the XGBoost package, at the t-th step we are tasked with finding the tree Ft that will minimize the following objective function:
      Obj(Ft) = L(Ft) + Ω(Ft)
where L(Ft) is our loss function and Ω(Ft) is our regularization function.
• Regularization is essential to prevent overfitting to the training set. Without any regularization, the tree will split until it can predict the training set perfectly. This usually means that the tree has lost generality and will not do well on new test data. In XGBoost, the regularization function reflects the model complexity:
      Ω(Ft) = γ*T + (1/2) * λ * sum over leaves j of wj^2
where T is the number of leaves in the tree, wj is the score of leaf j, λ is the leaf weight penalty parameter, and γ is the tree size penalty parameter.
• It is not immediately clear how to find the tree that optimizes the above objective. Writing the tree as Ft(x) = w_q(x) (every input falls into one leaf and receives that leaf’s score), the objective at step t can be written as
      Obj(Ft) = sum over i of l(yi, ŷi + w_q(xi)) + γ*T + (1/2) * λ * sum over j of wj^2
where q(x) maps input features to a leaf node in the tree and l(yi, ŷ) is our loss function. This objective is much easier to work with because it gives a score that we can use to determine how good a tree structure is.
Model Complexity
Unique features of XGBoost
• XGBoost is a popular implementation of gradient boosting. Let’s discuss some
features of XGBoost that make it so interesting.
• Regularization: XGBoost has an option to penalize complex models through both
L1 and L2 regularization. Regularization helps in preventing overfitting
• Handling sparse data: Missing values or data processing steps like one-hot
encoding make data sparse. XGBoost incorporates a sparsity-aware split finding
algorithm to handle different types of sparsity patterns in the data
• Weighted quantile sketch: Most existing tree based algorithms can find the split
points when the data points are of equal weights (using quantile sketch algorithm).
However, they are not equipped to handle weighted data. XGBoost has a
distributed weighted quantile sketch algorithm to effectively handle weighted data
Unique features of XGBoost
• Block structure for parallel learning: For faster computing, XGBoost can make
use of multiple cores on the CPU. This is possible because of a block structure in
its system design. Data is sorted and stored in in-memory units called blocks.
Unlike in other algorithms, this enables the data layout to be reused by subsequent iterations instead of being computed again. This feature is also useful for steps like split finding and column sub-sampling
• Cache awareness: In XGBoost, non-continuous memory access is required to get
the gradient statistics by row index. Hence, XGBoost has been designed to make
optimal use of hardware. This is done by allocating internal buffers in each
thread, where the gradient statistics can be stored
• Out-of-core computing: This feature optimizes the available disk space and
maximizes its usage when handling huge datasets that do not fit into memory
Parameters
• XGBoost requires a number of parameters to be selected. The following is a list of the parameters that can be specified (an example call follows this list):
• eta: Shrinkage term. Each new tree that is added has its weight shrunk by this parameter, preventing overfitting, but at the cost of increasing the number of rounds needed for convergence.
• gamma: Tree size penalty.
• max_depth: The maximum depth of each tree.
• min_child_weight: The minimum weight that a node can have. If this minimum is not met, that particular split will not occur.
• subsample: Gives us the opportunity to perform "bagging": each round is trained on a randomly sampled proportion (specified by this parameter, between 0 and 1) of the training examples. This helps prevent overfitting.
• colsample_bytree: Allows us to perform "feature bagging": a proportion of the features is picked to build each tree with. This is another way of preventing overfitting.
• lambda: The L2 leaf node weight penalty.
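A sketch (not from the original slides) of how these parameters might be passed directly to the xgboost R package; the data and the specific parameter values below are arbitrary choices for illustration:

library(xgboost)

set.seed(123)
X <- matrix(rnorm(200), ncol = 2)            # toy feature matrix
y <- 2 * X[, 1] + rnorm(100)                 # toy continuous target
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(
  objective        = "reg:squarederror",     # squared-loss regression
  eta              = 0.1,                    # shrinkage
  gamma            = 0,                      # tree size penalty
  max_depth        = 3,
  min_child_weight = 1,
  subsample        = 0.8,                    # row subsampling per round
  colsample_bytree = 0.8,                    # feature subsampling per tree
  lambda           = 1                       # L2 leaf weight penalty
)

model <- xgb.train(params = params, data = dtrain, nrounds = 100)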
R Example (http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/139-gradient-boosting-essentials-in-r-using-xgboost/)

• Boosting has several tuning parameters, including:
• The number of trees B
• The shrinkage parameter lambda
• The number of splits in each tree
• There are different variants of boosting, including AdaBoost, gradient boosting and stochastic gradient boosting.
• Stochastic gradient boosting, implemented in the R package xgboost, is the most commonly used boosting technique; it involves resampling of observations and columns in each round and generally offers very good performance. xgboost stands for extreme gradient boosting.
• Boosting can be used for both classification and regression problems.
• We’ll see how to compute boosting in R
• Loading required R packages
• tidyverse for easy data manipulation and visualization
• caret for easy machine learning workflow
• xgboost for computing boosting algorithm
Classification
• Data set: PimaIndiansDiabetes2 [in mlbench package], for predicting the
probability of being diabetes positive based on multiple clinical variables.
• Randomly split the data into training set (80% for building a predictive model)
and test set (20% for evaluating the model). Make sure to set seed for
reproducibility.
library(tidyverse)
library(caret)
library(xgboost)
library(mlbench)
library(dplyr)
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
# Split the data into training and test set
training.samples <- PimaIndiansDiabetes2$diabetes %>% createDataPartition(p = 0.8, list =
FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
summary(PimaIndiansDiabetes2)
pregnant glucose pressure triceps
Min. : 0.000 Min. : 56.0 Min. : 24.00 Min. : 7.00
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.:21.00
Median : 2.000 Median :119.0 Median : 70.00 Median :29.00
Mean : 3.301 Mean :122.6 Mean : 70.66 Mean :29.15
3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00 3rd Qu.:37.00
Max. :17.000 Max. :198.0 Max. :110.00 Max. :63.00
insulin mass pedigree age diabetes
Min. : 14.00 Min. :18.20 Min. :0.0850 Min. :21.00 neg:262
1st Qu.: 76.75 1st Qu.:28.40 1st Qu.:0.2697 1st Qu.:23.00 pos:130
Median :125.50 Median :33.20 Median :0.4495 Median :27.00
Mean :156.06 Mean :33.09 Mean :0.5230 Mean :30.86
3rd Qu.:190.00 3rd Qu.:37.10 3rd Qu.:0.6870 3rd Qu.:36.00
Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
pregnant glucose pressure triceps insulin mass pedigree age diabetes
232 6 134 80 37 370 46.2 0.238 46 pos
600 1 109 38 18 120 23.1 0.407 26 neg
319 3 115 66 39 140 38.1 0.150 28 neg
• Boosted classification trees
• We’ll use the caret workflow, which invokes the xgboost package, to automatically adjust the model parameter values and to fit the final best boosted tree that best explains our data.
• We’ll use the following argument in the function train():
• trControl, to set up 10-fold cross-validation
• # Fit the model on the training set
set.seed(123)
model <- train(
diabetes ~., data = train.data, method = "xgbTree",
trControl = trainControl("cv", number = 10)
)
# Best tuning parameter
model$bestTune
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
22 50 2 0.3 0 0.6 1 0.75
# Make predictions on the test data
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
[1] neg pos neg pos pos neg
Levels: neg pos
# Compute model prediction accuracy rate
mean(predicted.classes == test.data$diabetes)
[1] 0.7820513
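As an optional extra check (not part of the original example), caret’s confusionMatrix() gives a more detailed breakdown than the plain accuracy computed above:

# Cross-tabulate predicted vs. observed classes, with sensitivity, specificity, etc.
confusionMatrix(predicted.classes, test.data$diabetes)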
• Variable importance
• The function varImp() [in caret] displays the importance of variables in
percentage:
varImp(model)
xgbTree variable importance

Overall
glucose 100.000
age 22.280
insulin 15.511
mass 15.066
pedigree 6.566
pregnant 5.176
pressure 1.439
triceps 0.000
Regression
• Similarly, you can build boosted regression trees to perform regression, that is, to predict a continuous variable.
• Example data set
• We’ll use the Boston data set [in MASS package], for predicting the median house value (medv) in Boston suburbs using different predictor variables.
• Randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model).
library(MASS)
# Load the data
data("Boston", package = "MASS")
# Inspect the data
sample_n(Boston, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
• Boosted regression trees
• Here the prediction error is measured by the RMSE, which corresponds to the average difference between the observed outcome values and the values predicted by the model.
# Fit the model on the training set
set.seed(123)
model <- train(
medv ~., data = train.data, method = "xgbTree",
trControl = trainControl("cv", number = 10) )
# Best tuning parameters
model$bestTune
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
24 150 2 0.3 0 0.6 1 0.75
# Make predictions on the test data
predictions <- model %>% predict(test.data)
head(predictions)
[1] 34.57916 34.52773 19.75562 21.84306 20.38543 18.97059
# Compute the average prediction error RMSE
RMSE(predictions, test.data$medv)
[1] 2.838189 #MSE=8.055315
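To complement the RMSE (an optional addition, not in the original example), caret also provides R2() for the squared correlation between the predictions and the observed values:

# Proportion of variance in medv explained by the predictions
R2(predictions, test.data$medv)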
For other examples, see:
https://www.kaggle.com/camnugent/gradient-boosting-and-parameter-tuning-in-r/code
https://rstudio-pubs-static.s3.amazonaws.com/64455_df98186f15a64e0ba37177de8b4191fa.html
Some References
• Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, pages 337-374.
• Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232.
• Schapire, R. E. and Freund, Y. (2012). Boosting: Foundations and Algorithms. MIT Press.
• James, G., Witten, D., Hastie, T., and Tibshirani, R. (2014). An Introduction to Statistical Learning: With Applications in R. Springer.
