Ensemble V2 1
• What are ensemble methods?
• The main ensemble methods
–Random Forest
Ensemble Methods
• Ensemble methods combine the results from multiple
models with the goal of improving prediction accuracy
• Example: bagging
[Figure: bagging workflow. Starting from the original training data, Step 1: create multiple datasets; Step 2: build a model on each dataset; Step 3: combine the outcomes into a single prediction.]
Understanding Ensemble
Motivating Example
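• Suppose I have 5 classifiers which each classify a point correctly 70% of the time.
If these 5 classifiers are completely independent and I take the
majority vote, how often is the majority vote correct for a new record?
• P(getting it right) =
P(all 5 get it right) + P(exactly 4 classifiers get it right) + P(exactly 3 classifiers get it right)
• P(getting it right) =
$$\binom{5}{5}(0.7)^5 + \binom{5}{4}(0.7)^4(1-0.7)^1 + \binom{5}{3}(0.7)^3(1-0.7)^2$$
$$= 1\cdot(0.7)^5 + 5\cdot(0.7)^4(1-0.7)^1 + 10\cdot(0.7)^3(1-0.7)^2 = 0.83692$$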
“n choose k” – in how many different
ways can you select k items from n
overall items (In Excel you can use
the function COMBIN to calculate
these values)
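The majority-vote probability for the 5 classifiers can also be checked in R. This is a minimal sketch: `choose()` plays the role of Excel's COMBIN, and the variable names are illustrative only.

```r
p <- 0.7   # accuracy of each classifier
n <- 5     # number of independent classifiers

# The majority vote is correct when 3, 4 or 5 classifiers are correct
k <- 3:5
terms <- choose(n, k) * p^k * (1 - p)^(n - k)
p_majority <- sum(terms)
print(p_majority)

# dbinom() returns the same binomial probabilities directly
p_check <- sum(dbinom(k, size = n, prob = p))
print(p_check)
```

Both lines should print the 0.83692 obtained by hand.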
Motivating Example
Suppose I have 101 classifiers which each classify a point correctly
70% of the time. If these 101 classifiers are completely independent
and I take the majority vote, how often is the majority vote correct
for a new record?
We can view the number of correct classifiers as a binomial random
variable with 101 trials, and we need at least 51 of them to be
correct in order for the overall prediction to be correct
P(of getting it right) = P(at least 51 get it right)
= 1 - P(at most 50 get it right)
P(of getting it right) = 1-BINOM.DIST(50,101,.7,1)
> .9999
≈ 100%
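The Excel formula has a direct counterpart in R. In this sketch, `pbinom()` is the cumulative binomial distribution, matching BINOM.DIST with its last argument set to 1. The variable names are illustrative.

```r
n <- 101   # number of independent classifiers
p <- 0.7   # accuracy of each classifier

# P(at least 51 correct) = 1 - P(at most 50 correct)
p_majority <- 1 - pbinom(50, size = n, prob = p)

# Equivalent: ask for the upper tail directly
p_upper <- pbinom(50, size = n, prob = p, lower.tail = FALSE)

print(p_majority)
print(p_upper)
```

Going from 5 to 101 independent classifiers raises the majority-vote accuracy from about 84% to nearly 100%, although each classifier is still only 70% accurate.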
Types of Ensemble Algorithms
• Ensemble algorithms/methods include
–Bagging
• builds different classifiers by training on repeated samples (with replacement) from the data
–Boosting
• combines simple base classifiers by up-weighting data points which are classified incorrectly
–Random Forests
• averages many trees which are constructed with some amount of randomness
• The basics:
– Step 1: Create B datasets, using sampling with replacement
– Step 2: Create one classifier for each dataset
– Step 3: Combine the classifiers by averaging over the predictions in case of a
continuous outcome, or by simple majority vote in case of a categorical outcome
• Bagging is simple to implement
• Bagging using very weak classifiers may not result in an
improvement; bagging good prediction models will in most cases
help improve the prediction accuracy
• Bootstrap aggregating
• Multiple training datasets are created by resampling of the
observed dataset (and of equal size to the observed dataset)
• Obtained by random sampling with replacement from the
original dataset
• Train the statistical learning method on each of the training datasets,
and obtain the prediction
• For prediction:
• Regression: average all predictions from all trees
• Classification: majority vote among all trees
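The three bagging steps can be sketched by hand for a regression problem. This is an illustration under several assumptions that are not from the slides: it uses the built-in `mtcars` data, regression trees from the `rpart` package (assumed to be installed), an arbitrary B = 25, and an arbitrary 8-row test set.

```r
library(rpart)  # regression trees; assumed to be available

set.seed(1)
B <- 25                              # number of bootstrap datasets (illustrative)
test_idx <- sample(nrow(mtcars), 8)  # hold out 8 rows for testing
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Steps 1 and 2: draw B bootstrap samples (same size as the training set,
# sampled with replacement) and fit one tree to each sample
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(train), replace = TRUE)
  rpart(mpg ~ ., data = train[idx, ])
})

# Step 3: combine by averaging the B predictions (regression case)
preds <- sapply(trees, predict, newdata = test)  # one column per tree
yhat_bag <- rowMeans(preds)

# Test-set mean squared error of the bagged prediction
mse_bag <- mean((yhat_bag - test$mpg)^2)
print(mse_bag)
```

For a categorical outcome, the averaging in Step 3 would be replaced by a majority vote across the trees.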
How Bagging Works
Bagging Output
Random Forests
• Random Forest (RF) borrows ideas from Bagging
• RF averages many classification trees, where each is constructed
using a random subset of the variables for each split in the tree
• The key parameters that need to be determined are
a) the number of trees, and b) the variable subset size
– Value of these parameters will vary from one application to the next
• RF can be used with both regression trees and classification trees
• RF has good predictive performance, even when the data is very
noisy, and generally does not overfit
How Random Forests Work
• It is a very efficient statistical learning method
• It builds on the idea of bagging, but it provides an improvement
because it de-correlates the trees
• How does it work?
• Build a number of decision trees on bootstrapped training samples, but when
building these trees, each time a split in a tree is considered, a random
sample of m predictors is chosen as split candidates from the full set of p
predictors (usually m ≈ √p)
• RF with m = p is just bagging
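The split-candidate sampling can be illustrated in a few lines of R. This is only a sketch of the sampling step, not of a full random forest. The predictor names are hypothetical, and m is set by the common rule of thumb of about sqrt(p), which is an assumption here.

```r
set.seed(1)
p <- 13                                # number of predictors (e.g. 13 in the Boston data)
predictors <- paste0("x", seq_len(p))  # hypothetical predictor names
m <- floor(sqrt(p))                    # rule-of-thumb subset size, about sqrt(p)

# At each split, only a fresh random subset of m predictors is considered
split_candidates <- sample(predictors, m)
print(split_candidates)

# With m = p, every predictor is a candidate at every split: this is bagging
bagging_candidates <- sample(predictors, p)
```

A new subset is drawn at every split, so different trees (and different splits within one tree) see different candidate predictors.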
Why consider a random sample of m predictors
instead of all p predictors for splitting?
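One way to see why is to make a simplifying assumption that is not stated in the slides. Suppose the B tree predictions T_1, ..., T_B each have variance σ² and share a common pairwise correlation ρ. The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding more trees shrinks only the second term. The first term remains as long as the trees are correlated. Restricting each split to m < p candidates keeps a single strong predictor from dominating every tree, which lowers ρ and therefore the variance of the averaged prediction.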
Bagging in R
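# Bagging and Random Forests
library(MASS)          # assumed source of the Boston housing data
library(randomForest)
# train is assumed to hold the row indices of the training set (its definition is not shown)
# mtry = 13 makes all 13 predictors candidates at each split, so this forest is bagging
bag.boston = randomForest(medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
bag.boston
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
## Mean of squared residuals: 11.02509
## % Var explained: 86.65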
Bagging in R
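# Predict on the observations not used for training
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
# boston.test is assumed to hold the test-set responses (medv for the rows not in train)
plot(yhat.bag, boston.test)
# Test-set mean squared error
mean((yhat.bag - boston.test)^2)
## [1] 13.47349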
Predictor Importance
##           %IncMSE IncNodePurity
## crim    15.396510     950.03191
## zn       1.100738      21.42389
## indus   12.225351     183.14933
## chas     2.726681      13.25062
## nox     10.606485     302.78478
## rm      45.090272    7325.33947
## age     10.400796     309.19654
## dis     17.315918     892.19354
## rad      3.208664      64.56585
## tax      9.296886     296.22083
## ptratio 15.325244     279.25118
## black    5.944955     243.04952
## lstat   39.324555    9837.83280
## [1] 11.48022