Trees, Boosting, and Random Forest
Kevin Song
Pros of decision trees:
- High interpretability; not a black box (e.g., "Exactly why was my loan application declined?").
- To arrive at the model's conclusion, just trace down the branches of the tree (see the sketch after this list).
- Fast to train and not computationally intensive on large datasets (unlike neural networks, for instance).
- Accepts both categorical and quantitative inputs (unlike neural networks, which accept only numerical data).
- Can be used with sparse data containing missing values.
- Feature selection is built into the model.
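To make the interpretability point concrete, here is a minimal sketch using scikit-learn (the library choice, the iris dataset, and the max_depth=3 setting are illustrative assumptions, not from the slides):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    # A shallow tree keeps the printed rule set short and readable.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(iris.data, iris.target)

    # export_text renders the fitted tree as nested if/else splits;
    # any individual prediction can be explained by tracing one
    # root-to-leaf path through these rules.
    print(export_text(tree, feature_names=iris.feature_names))

Every prediction corresponds to exactly one printed root-to-leaf path, which is the "trace down the branches" explanation described above.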
Figure: Decision boundaries. Top row: true linear boundary; bottom row: true non-linear boundary. Left column: linear model; right column: tree-based model. (Adapted from Profs. Rob Tibshirani and Trevor Hastie.)
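As a rough numerical analogue of the figure (the synthetic datasets, model choices, and cross-validated scoring below are assumptions for illustration): a linear model should win when the true boundary is linear, while a tree should win on an XOR-style non-linear boundary.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # True linear boundary: class is the sign of x1 + x2.
    X_lin = rng.normal(size=(500, 2))
    y_lin = (X_lin[:, 0] + X_lin[:, 1] > 0).astype(int)

    # True non-linear (XOR-style) boundary: class is the sign of x1 * x2.
    X_xor = rng.normal(size=(500, 2))
    y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

    for name, X, y in [("linear boundary", X_lin, y_lin),
                       ("XOR boundary", X_xor, y_xor)]:
        for model in [LogisticRegression(),
                      DecisionTreeClassifier(max_depth=5, random_state=0)]:
            acc = cross_val_score(model, X, y, cv=5).mean()
            print(f"{name}: {type(model).__name__} accuracy ~ {acc:.2f}")

On the linear problem the logistic regression should score near perfectly and the tree slightly worse (it approximates the diagonal boundary with axis-aligned steps); on the XOR problem the logistic regression should score near chance while the tree does well, mirroring the two rows of the figure.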
How can we improve decision tree algorithms?
By using ensembles of trees.
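A minimal sketch of that idea (scikit-learn and the breast-cancer dataset are assumptions chosen purely for illustration): bagging (random forest) averages many de-correlated trees, while boosting fits trees sequentially to the errors of the ensemble so far; both typically outperform a single tree.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "single tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(n_estimators=200,
                                                random_state=0),
        "gradient boosting": GradientBoostingClassifier(random_state=0),
    }

    # Cross-validated accuracy: the two ensembles usually beat the
    # lone tree on this task.
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name}: accuracy ~ {acc:.3f}")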