Random Forest
• Ted's Thesis
ROADMAP
SECTIONS
1. CART
2. Bagging
3. Random Forest
4. References
1. CART (CLASSIFICATION AND REGRESSION TREE)
CART
SECTION OVERVIEW
1. History
2. Quick Example
3. CART Overview
6. Basis Function
7. Node Splitting
8. Tree Pruning
9. Regression Tree
CLASSIFICATION TREE (BINARY TREE DATA STRUCTURE)
CLASSIFICATION TREE - QUICK EXAMPLE
• We have several predictors (x's) and one response (y, the thing we're trying to predict)
• 21 vs 22-24
• 21-22 vs 23-24
• 21-23 vs 24
VISUAL EXAMPLE
• y (0 = no cancer, 1 = cancer)
• our predictors:
R - CODE
library(rpart)     # fits the classification tree
library(partykit)  # as.party() converts an rpart tree for plotting
tree.model <- rpart(Species ~ ., data = iris)  # example fit; tree.model was undefined in the slide
tree.model.party <- as.party(tree.model)
plot(tree.model.party)
BASIS FUNCTIONS
• X is a predictor
• M transformations of X
• β_m is the weight given to the m-th transformation (the coefficient)
• h_m is the m-th transformation of X
• f(x) is the linear combination of transformed values of X
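Collecting the bullets above into one formula (standard basis-expansion notation; the equation itself did not survive extraction and is reconstructed here):

```latex
f(X) = \sum_{m=1}^{M} \beta_m \, h_m(X)
```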
BASIS FUNCTION EXAMPLE
THE BASIS EXAMPLE WE CARE ABOUT
THE CART BASIS FUNCTION
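For CART, the usual basis functions are indicators of the leaf regions. Assuming regions R_1, ..., R_M (this is the standard formulation, reconstructed rather than taken from the slide):

```latex
f(x) = \sum_{m=1}^{M} \beta_m \, I\{x \in R_m\}
```

Here β_m is the constant value predicted inside region R_m.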
• p(y = 1 | A)
• Others
GINI INDEX
• Reduce overfitting.
• = (1/6)*(0+0+0+0+1+1) = 2/6 = 1/3
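The arithmetic above averages six 0/1 terms. For comparison, here is a minimal Gini impurity function in R; this is a sketch, not from the slide (the function name `gini` and the 4-vs-2 label vector are assumptions):

```r
# Gini impurity of a node: 1 - sum over classes of p_k^2
gini <- function(y) {
  p <- table(y) / length(y)  # class proportions in the node
  1 - sum(p^2)
}

# Hypothetical node with four 0s (no cancer) and two 1s (cancer),
# mirroring the six 0/1 terms in the sum above
gini(c(0, 0, 0, 0, 1, 1))  # 1 - ((4/6)^2 + (2/6)^2) = 4/9
```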
2. BAGGING
SECTION OVERVIEW
1. Benefit
2. Bootstrap Algorithm
3. Bootstrap example
4. Bagging Algorithm
5. Flaws
• Reduce overfitting
• Reduce variance (bagging averages high-variance learners; it does not reduce bias)
• Bootstrap is a statistical resampling method.
3. What we do is sample from our one data set with replacement (a random sample), drawing as many observations as the original data contains.
4. We repeat step 3 a large number of times, B times. Once done we have B bootstrap random samples.
5. We then take the statistic of each bootstrap random sample and average them.
BOOTSTRAP EXAMPLE
• {1,2,3,1,2,1,1,1} (n = 8)
• The estimated mean for our original data is the average of the statistic from each bootstrap sample: (1.375 + 1.375 + 1.25)/3 ≈ 1.3333
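The example above can be sketched in R. The seed and B = 3 are assumptions; with a different seed the three sample means will differ from the 1.375, 1.375, 1.25 shown on the slide:

```r
set.seed(1)  # assumed seed, for reproducibility only
x <- c(1, 2, 3, 1, 2, 1, 1, 1)  # the slide's data, n = 8

# Draw B bootstrap samples: sample n observations with replacement each time,
# and record the statistic of interest (here, the mean) for each sample
B <- 3
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# The bootstrap estimate is the average of the B sample means
mean(boot_means)
```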
BAGGING (BOOTSTRAP AGGREGATION) ALGORITHM
1. Take a random sample of size N with replacement from the data
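Only step 1 of the algorithm survives above. The steps can be sketched as a minimal bagging loop under the standard algorithm; steps 2 onward, the helper names `bagged_trees`/`bag_predict`, and the use of `rpart` on `iris` are all assumptions, not from the slide:

```r
library(rpart)

# Step 1, repeated B times: bootstrap sample of size N, then fit a tree to it
bagged_trees <- function(data, B = 25) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(Species ~ ., data = boot)
  })
}

# Aggregate: each tree votes on every row, and the majority class wins
bag_predict <- function(models, newdata) {
  votes <- sapply(models, function(m)
    as.character(predict(m, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

set.seed(1)
models <- bagged_trees(iris, B = 25)
head(bag_predict(models, iris))
```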
3. RANDOM FOREST
SECTION OVERVIEW
1. Problem RF Solves
5. R Code
1. PROBLEM RANDOM FOREST IS TRYING TO SOLVE
BAGGING PROBLEM
RANDOM FOREST SOLUTION
• At each split, only a random subset of the predictors is considered, which decorrelates the trees.
5. REPEAT STEPS 1–4 A LARGE NUMBER OF TIMES (E.G., 500).
• You take all of those predictions (aka votes) and take the majority.
• This is why I suggest an odd number of trees, to break ties for binary responses.
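A tiny illustration of why an odd tree count helps for a binary response (the vote vectors are made up for the example; note that R's `which.max` silently breaks ties by picking the first class alphabetically):

```r
votes_even <- c("yes", "yes", "no", "no")  # 4 trees: a 2-2 tie is possible
votes_odd  <- c("yes", "yes", "no")        # 3 trees: a tie is impossible

table(votes_even)                   # 2 vs 2: no true majority
names(which.max(table(votes_odd)))  # "yes": a clear majority winner
```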
5. R CODE
set.seed(415)
library(randomForest)
# rf.model and iris_test were undefined in the slide; one possible setup:
idx <- sample(nrow(iris), 100)
iris_train <- iris[idx, ]; iris_test <- iris[-idx, ]
rf.model <- randomForest(Species ~ ., data = iris_train, ntree = 501)
predict(rf.model, iris_test)
REFERENCES