0% found this document useful (0 votes)

104 views

Learn The Boosting - Method - Implementation - in - R

The document discusses an R package called ada that implements various boosting algorithms for classification problems. It provides an overview of boosting and different boosting algorithms like AdaBoost, Gentle AdaBoost, and Real AdaBoost. It also discusses stochastic gradient boosting and compares the ada package to other existing R packages for boosting.

Uploaded by

Prashant Jindal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

104 views

Learn The Boosting - Method - Implementation - in - R

Uploaded by

Prashant Jindal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

JSS Journal of Statistical Software

October 2006, Volume 17, Issue 2. http://www.jstatsoft.org/

ada: An R Package for Stochastic Boosting

Mark Culp Kjell Johnson George Michailidis
University of Michigan Pfizer Global Research University of Michigan
and Development

Abstract
Boosting is an iterative algorithm that combines simple classification rules with ‘mediocre’
performance in terms of misclassification error rate to produce a highly accurate classifi-
cation rule. Stochastic gradient boosting provides an enhancement which incorporates a
random mechanism at each boosting step showing an improvement in performance and
speed in generating the ensemble. ada is an R package that implements three popular
variants of boosting, together with a version of stochastic gradient boosting. In addi-
tion, useful plots for data analytic purposes are provided along with an extension to the
multi-class case. The algorithms are illustrated with synthetic and real data sets.

Keywords: boosting algorithms, R, machine learning, classification, implementation of statis-

tical algorithms.

1. Introduction
Boosting has proved to be an effective method to improve the performance of base classifiers,
both theoretically and empirically. The underlying idea is to combine simple classification
rules (the base classifiers) to form an ensemble, whose performance is significantly improved.
The origins of boosting lie in PAC learning theory (Valiant 1984), which established that
learners that exhibit a performance slightly better than random guessing when appropriately
combined can perform very well.
A provably polynomial complexity boosting algorithm was derived in (Schapire 1990), whereas
the Adaptive Boosting (AdaBoost) algorithm in various varieties (Freund and Schapire 1996,
1997) proved to be a practical implementation of the boosting ensemble method. Since its
introduction, many authors have sought to explain and improve upon the AdaBoost algorithm.
In (Friedman, Hastie, and Tibshirani 2000), it was shown that the AdaBoost algorithm can
be thought of as a stage-wise gradient descent procedure that minimizes an exponential loss
function. In addition, three modifications of the original algorithm were proposed: Gentle-,
2 ada: An R Package for Stochastic Boosting

Logit-, and Real AdaBoost.

Recently, several authors have explored regularizing the boosting algorithm (Friedman 2001;
Rosset, Zhu, and Hastie 2004). For example, if classification trees constitute the base clas-
sifiers, regularization is accomplished through the learning rate parameter. This parameter
controls the algorithm’s ability to search the collection of all possible trees for the given data
set. However, the improved performance in terms of predictive accuracy comes at a heavy
computational cost. Stochastic gradient boosting (Friedman 2002; Ridgeway 2006) utilizes
a random mechanism for regularization purposes and achieves significant computational sav-
ings.
The ada package implements the original AdaBoost algorithm, along with the Gentle and
Real AdaBoost variants, using both exponential and logistic loss functions for classification
problems. In addition, it allows the user to implement regularized versions of these methods
by using the learning rate as a tuning parameter, which lead to improved computational
performance. The base classifiers employed are classification/regression trees and therefore
the underlying engine is the rpart package. The ada package uses rpart’s functionality for
handling missing data and surrogate splits (Therneau and Atkinson 2005). Some important
features incorporated in the ada package are: (i) the use of both regression and classification
trees for boosting, (ii) various useful plots that aid in assessing variable importance and
relationships between subsets of variables and (iii) tweaking the controls of the rpart function
to become applicable for boosting purposes.
The boosting framework typically accomplishes the difficult task of providing both strong
predictive performance and useful model diagnostics, which has made it desirable for many
classification problems (Freund and Schapire 1996, 1997; Hastie, Tibshirani, and Friedman
2001). In this paper, we provide analysis of boosting on pharmacology data (Sugata and Abe
2001), where the goal is to predict whether a compound is soluble, and to identify variables
that are important for predicting this property. In addition to pharmacology, boosting al-
gorithms have encompassed a wide range of applications including tumor identification and
gene expression data (Dettling 2004), proteomics data (Ulintz, Zhu, Qin, and Andrews 2006),
financial and marketing data (Boonyanunta and Zeephongsekul 2003; Lemmens and Croux
2005), fisheries data (Kawakita, Minami, Eguchi, and Lennert-Cody 2005), and microscope
imaging data (Huang and Murphy 2004). For many of these applications, ada will be par-
ticularly useful since it implements well documented tools for assessing variable importance,
evaluating training and testing error rates, and viewing pairwise plots of the data.
Currently, free R (R Development Core Team 2006) packages exist for boosting that efficiently
build regression trees, smoothing splines, and additive models such as gbm (Ridgeway 2006)
and mboost (Hothorn and Bühlmann 2006a,b). The gbm package offers two versions of
boosting for classification (gentle boost under logistic and exponential loss). In addition, it
includes squared error, absolute error, Poisson and Cox type loss functions. However, the
latter loss functions are not natural or recommended for classification purposes (Hastie et al.
2001). The mboost package has to a large extent similar functionality as the gbm package
and in addition implements the general gradient boosting framework using regression-based
learners. In our experience, these packages are more suited for users in need of using boosting
in models with a continuous or count type outcome. On the other hand, the ada package
provides a straightforward, well-documented, and broad boosting routine for classification,
ideally suited for small to moderate-sized data sets. As an enhancement over current boosting
packages, ada incorporates two popular loss functions for three boosting variants, variance
Journal of Statistical Software 3

reduction components (similar to bagging), regularization, and performance diagnostics. In

addition, at each iteration ada incorporates cross-validation to determine tree depth, which
resolves the selection of tree depth associated with gbm. ada’s extensive documentation
and ease of use, provides a natural package for individuals interested in quickly familiarizing
themselves with the boosting methodology and assessing how boosting would perform on their
data. The ada package is freely available from http://CRAN.R-project.org/.
The paper is organized as follows: in section 2 a brief introduction to the various boosting
algorithms implemented in ada is provided; section 3 discusses various implementation issues,
while in section 4 a description of the available functions is given. Finally, section 5 discusses
several practical issues in the context of real data examples.

2. A brief account of boosting algorithms

In this section a brief synopsis of the history of AdaBoost (now referred to as Discrete
AdaBoost) is given, which discusses the origins of this algorithm, as well as the relationship
between boosting and additive models. The next section focuses on implementation details
for each boosting variant implemented in the package. This is accomplished by presented a
general boosting algorithm and then discussing the various variants as special cases.
In a classification problem, a training data set consisting of n objects is available. Each
object is characterized by a p-dimensional attribute (feature/variable) vector x, belonging to
a suitable space (e.g. Rp ), and a class label (response) y ∈ {+1, −1}. The objective is to
construct a decision (classification) rule F (x) that would accurately predict the class labels
of objects for which only the attribute vector is observed.

2.1. Historical perspective

In 1996, Freund and Schapire (Freund and Schapire 1996) produced the well-known
AdaBoost.M1 (also known as Discrete AdaBoost) algorithm (given below). In short,
AdaBoost.M1 generates a sequentially weighted set of weak base classifiers that are com-
bined to form an overall strong classifier. In each step of the sequence, AdaBoost attempts
to find an optimal classifier according to the current distribution of weights on the observa-
tions. If an observation is incorrectly classified using the current distribution of weights, then
the observation will receive more weight in the next iteration. On the other hand, correctly
classified observations under the current distribution of weights will receive less weight in
the next iteration. In the final overall model, classifiers that are accurate predictors of the
training data receive more weight, whereas, classifiers that are poor predictors receive less
weight. Thus, AdaBoost uses a sequence of simple weighted classifiers, each forced to learn
a different aspect of the data, to generate a final, comprehensive classifier, which with high
probability outperforms in terms of misclassification error rate any individual classifier. The
basic steps of the algorithm as described next (Algorithm 1):
In 2000, Friedman et al. (2000) established connections of the AdaBoost.M1 algorithm to
statistical concepts such as loss functions, additive modeling, and logistic regression. In
particular, they showed that AdaBoost.M1 fits a forward stagewise additive logistic regression
model that minimizes the expectation of the exponential loss function, e−yF (x) , with F (x)
denoting the boosted classifier.
While the AdaBoost algorithm has been shown empirically to improve classification accuracy,
4 ada: An R Package for Stochastic Boosting

Algorithm 1 AdaBoost
1
1: Initialize weights wi = n
2: for m = 1 to M do
3: fit y = hm (x) as the base weighted classifier using wi and
d
let W− (hm ) = i=1 wi I{yi hm (xi ) = −1} and αm = log 1−W − (h)
PN
4: W− (h)
5: wi = wi exp{αm I{yi 6= hm (xi )}} scaled to sum to one ∀i ∈ {1, . . . , N }
6: end for

it produces at each stage as output the object’s predicted label. This coarse information may
hinder the efficiency of the algorithm in finding an optimal classification model. To overcome
this deficiency, several proposals have been put forth in the literature; for example, the al-
gorithm outputs a real-valued prediction rather than class labels at each stage of boosting.
The latter variant of boosting corresponds to the Real AdaBoost algorithm, where the class
probability estimate is converted using the half-log ratio to a real valued scale. This value is
then used to represent an observation’s contribution to the final overall model. Furthermore,
observation weights for subsequent iterations are updated according to the exponential loss
function of AdaBoost. Like Discrete AdaBoost, the Real AdaBoost algorithm attempts to
minimize the expectation of e−yF (x) . In general, these modifications allow Real AdaBoost to
more efficiently find an optimal classification model relative to AdaBoost.M1. In addition to
the Real AdaBoost modification, Friedman et al. (Friedman et al. 2000) proposed a further
extension called the Gentle AdaBoost algorithm which minimizes the exponential loss func-
tion of AdaBoost through a sequence of Newton steps. Although Real AdaBoost and Gentle
AdaBoost optimize the same loss function and perform similarly on identical data sets, Gentle
AdaBoost is numerically superior because it does not rely on the half-log ratio.

2.2. Stochastic boosting

Boosting inherently relies on a gradient descent search for optimizing the underlying loss
function to determine both the weights and the learner at each iteration (Friedman 2001). In
Stochastic Gradient Boosting (SGB) a random permutation sampling strategy is employed
at each iteration to obtain a refined training set. The full SGB algorithm with the gradient
boosting modification relies on the regularization parameter ν ∈ [0, 1], the so-called learning
rate.

Algorithm 2 Stochastic Gradient Boosting Algorithm

1: Initialize F (x) := 0
2: for m = 1 to M do
3: Set wi = − δL(y,g)
δg |g=F (x)
4: Fit y = η(hm (x)) as the base weighted classifier
P using | wi |, with training sample πm
5: Compute line search step αm = arg minα i∈πm L(yi , F (x) + αη(hm (xi )))
(in some cases this step may be omitted or αm = 1)
6: Update F (x) = F (x) + ναm η(hm (x))
7: end for

The algorithm in its general form can operate under an arbitrary loss function; ada implements
both the exponential (L(y, f ) = e−yf ) and logistic (L(y, f ) = log(1+e−yf )) loss functions. The
Journal of Statistical Software 5

x
η function specifies the type of boosting: discrete (η(x) =sign(x)), real η(x) = 0.5 log 1−x ,
and gentle (η(x) = x).
In the case of exponential loss, the line search step solution (Step 3, Algorithm 2) can be
written as:

αk = αk−1 − (η T P (αk−1 )η)−1 (y T P (αk−1 )η), where P (αk−1 ) = diag(pi (αk−1 )), (1)

pi (αk−1 ) = wi e−αyi ηi , and wi = e−yi F (xi ) . The final stageweight αm = α∞ is used for
Algorithm 2. For the logistic loss (L2 Boost) version we have the line search corresponding
to:

αk = αk−1 − (η T P (αk−1 )(1 − P (αk−1 ))η)−1 (y T P (αk−1 )η), where P (αk−1 ) = diag(pi (αk−1 )), (2)
wi e−αyi ηi
pi (αk−1 ) = 1+wi e−αyi ηi
, and wi = e−yi F (xi ) .
Next, we provide the details for adapting Algorithm 2 to perform each variant of boosting.
Remark: The ada package provides the flexibility to fit all the stageweights with a value of
1. This is known as -boosting where one fits the ensemble with arbitrarily small ν (Rosset
et al. 2004).

Discrete AdaBoost
For Discrete AdaBoost, set L(y, g) = e−yg ⇒ wi = −yi e−yi Fi (exponential loss) and η(x) =sign(x).
Using expression (1) with this value
of η, the optimization problem has a closed form solu-
1−errm
tion given by αm = 0.5 log errm . Therefore, the algorithm is fitting the original Discrete
AdaBoost algorithm with a random sampling strategy at each iteration.
−yi F (xi )
For Discrete L2 Boost, one optimizes L(y, g) = log(1 + e−yi gi ) ⇒ wi = −y ie
1+e−yi F (xi )
. In this
case, the stageweight does not have a closed form solution and the software solves (2) directly.

Real AdaBoost

p
For Real AdaBoost and Real L2 Boost, set η(p) = log 1−p , where p ∈ [0, 1] (i.e. a probabil-
ity class estimate) and use the same weight as in Discrete AdaBoost. If αm = 1 is set for all
m and exponential loss is used, then Real AdaBoost coincides with the algorithm presented in
(Friedman et al. 2000). However, ada has the flexibility of optimizing (1) and (2) to determine
the stageweights.

Gentle AdaBoost
For Gentle and Gentle L2 Boost, set η(x) = x. This algorithm requires fitting a regressor at
each iteration and result in the original GentleBoost algorithm whenever αm = 1. As with
Real boosting, the algorithm can solve the line search directly.

2.3. Connection to bagging

The SGB algorithm has been noted to have a strong connection to bagging (Breiman 1996),
and is often referred to as a hybrid bagging and boosting algorithm (Friedman 2002). The
ada package takes the connection one step further by allowing the stageweights to be adjusted
towards bagging.
6 ada: An R Package for Stochastic Boosting

A close inspection reveals that if one executes Algorithm 2 with ν := 0, then the ensemble
of trees will be generated by random subsamples of the P data with identical case weights
(i.e. letting Fν (x) be the ada output, then F0 (x) = 0 ∗ M m=1 αm hm (x)). This is the exact
tree fitting process used to bag trees (Breiman 1996), with the exception that in bagging
1 PM
the ensemble results as an average of the trees (i.e. B(x) = M m=1 h m (x) is a bagged
ensemble but under the same random settings one would obtain the same trees as with
F0 (x)). In light of this, we add a shift=TRUE argument which supplies a post processing
shift of the ensemble towards bagging after constructing the ensemble (i.e. the final ensemble
is Feν (x) = (1 − ν)B(x) + Fν (x), where Fe0 (x) = B(x) equates to bagging). However, unlike
bagging (with the exception of ν = 0) the individual h’s are obtained via a greedy weighting
process. In the case of -boosting (i.e. αm := 1) then the resulting ensemble is an average
over h.

3. Implementation issues
In this section we discuss implementation issues for the ada package.

3.1. Functional structure

The functional layout for the ada package is shown in Figure 1. An object of class ada can

Figure 1: The functional flow of ada. The top section consists of the functions used to create
the ada object, while the bottom section are the functions invoked using an initialized ada
object. The yellow functions are called only by ada, the red functions are called by the user
and the green functions are called by both.
Journal of Statistical Software 7

be created by invoking the functions ada, which calls either ada.formula or ada.default,
depending on the input to ada. The ada.formula function is a generic formula wrapper for
ada.default that allows the flexibility of a formula object as input. The ada.default func-
tion, in turn, calls the four functions that correspond to the boosting algorithms: discrete.ada,
real.ada, logit.ada and gentle.ada. Once an object of class ada is created, the user can
invoke the standard R commands: predict, summary, print, pairs, update, or plot. In
addition, ada includes a varplot and addtest function, which we discuss below.

3.2. Construction of base learners using rpart

The most popular base (weak) learners employed by both boosting and stochastic boosting
algorithms are classification or regression trees. Both of these algorithms are implemented in
the rpart package, and can be tweaked for either boosting or stochastic gradient boosting.
Because the ada package uses rpart as its engine, ada inherits the flexibility and advantages
of rpart (Therneau and Atkinson 2005). Because of rpart, ada can handle missing data, can
implement either classification or regression trees, and can use cross-validation to automat-
ically determine individual tree depth. In addition, because ada uses rpart, ada will inherit
any improvements made to rpart.
Remark: Notice, that the gbm package, for instance, uses its own internal recursive partitioner
for only computing regression trees with no means for estimating tree depth. However, we
feel that the flexibility and enhancements provided by rpart yields a worthwhile tree engine
for ada.

Setting rpart.control
The rpart function selects tree depth by using an internal complexity measure together with
cross-validation. In the case of SGB we have empirically noticed strong performance with
this automatic choice of tree depth (especially for larger data sets).
However, in deterministic gradient boosting, the tree size is usually selected a priori. For
example, stumps (2-split) or 4-split trees (e.g. split the data into a maximum of 4 groups)
are commonly selected as the weak learners. In rpart one should set cp=-1, which forces the
tree to split until the depth of the tree achieves the maxdepth setting. Thus, by specifying
the maxdepth argument, the number of splits can be controlled. It is important to note that
the maxdepth argument works on a log2 scale, hence the number of splits is a power of 2.
In small data sets, it is useful to appropriately specify the minsplit argument, in order to
ensure that at least one split will be obtained. Finally, it is worth noting that a theoretically
rigorous approach for setting the tree depth in any form of boosting is still an open problem
(Segal 2004).
The following code illustrates how the control parameters for generating stumps and 4-split
trees:
> library("ada")

Loading required package: rpart

> default <- rpart.control()

> stump <- rpart.control(cp = -1 , maxdepth = 1 , minsplit = 0)
> four <- rpart.control(cp = -1 , maxdepth = 2 , minsplit = 0)
8 ada: An R Package for Stochastic Boosting

4. Description of the functions available in the ada package

To illustrate the functions available in ada, we use a ten dimensional synthetic data set,
comprised of two interspersed classes exhibiting a sinusoid pattern (Figure 2). In this data,
two variables that determine the class shapes were corrupted by eight dimensions of standard
Gaussian noise, creating a difficult to trace boundary between the classes. Five hundred
observations equally divided among the two classes were generated, and 20% were assigned
to the training set. The following R code was used to generate the data.

> n <- 500

> p <- 10
> f <- function(x, a, b, d) a * (x - b)^2 + d
> set.seed(100)
> x1 <- runif(n/2, 0, 4)
> y1 <- f(x1, -1, 2, 1.7)+runif(n/2, -1, 1)
> x2 <- runif(n/2, 2, 6)
> y2 <- f(x2, 1, 4, -1.7)+runif(n/2, -1, 1)
> y <- c(rep(1, n/2), rep(2, n/2))
> mat <- matrix(rnorm(n * 8), ncol = 8)
> dat <- data.frame(y = y, x1 = c(x1, x2), x2 = c(y1, y2), mat)
> names(dat) <- c("y", paste("x", 1:10, sep = ""))
> plot(dat$x1, dat$x2, pch = c(1:2)[y], col = c(1, 8)[y],
+ xlab=names(dat)[2], ylab=names(dat)[3])
> indtrain <- sample(1:n, 100, FALSE)
3

●
● ●● ●
●● ● ● ●●●
● ●●●●
●
● ● ●●●●● ●●
2

● ●
●
● ●
● ● ●●●●● ● ● ●● ● ●
●
● ● ● ●
●
● ● ●●●
●● ●● ●●
● ● ●●● ● ●● ● ●
●●
●
●● ● ●● ●
●● ● ● ●● ●●●●● ●●●
● ● ● ●●
1

● ● ● ●
● ●● ● ● ●
● ● ● ● ● ● ●●
● ● ● ●● ● ●● ● ● ●●
●
●
●●●● ●● ● ●● ● ●
● ● ●●
● ● ● ● ● ●●
● ● ● ●●
● ●
●● ●● ● ●●● ●
x2

● ●
0

● ●
● ●● ●●● ●●
● ● ● ●●
● ●●
● ● ● ●●
●● ●● ●
● ●
●●
−1

● ● ●
● ●●
●● ● ●
● ● ●●
●
●● ●
●●● ●● ●
●
●●
●
−2

●● ●●
●●
●
●
●
●
● ●
−3

●
●

0 1 2 3 4 5 6

Figure 2: Plot of the two informative variables, colored by the response.

Journal of Statistical Software 9

> train <- dat[indtrain,]

> test <- dat[-indtrain,]

4.1. Creating an ada object

Objects of class ada are created by a call to the appropriate function, specifying the type of
boosting (Discrete, Real, etc).

Discrete AdaBoost
To model the synthetic data sets with discrete AdaBoost under exponential loss (using stumps
with 50 iterations) call:

> default <- rpart.control()

> gdis <- ada(y~., data = train, iter = 50, loss = "e", type = "discrete",
+ control = default)
> gdis

Call: ada(y ~ ., data = train, iter = 50, loss = "e", type =

"discrete", control=default)

Loss: exponential Method: discrete Iteration: 50

Final Confusion Matrix for Data:

Final Prediction
True value 1 2
1 49 3
2 8 40

Train Error: 0.11

Out-Of-Bag Error: 0.14 iteration= 9

Additional Estimates of number of iterations:

train.err1 train.kap1
47 50

Notice that the output gives the training error, confusion matrix and three estimates of the
number of iterations. In this example, one could use the Out-Of-Bag (OOB) estimate for 9
iterations, training error estimate of 47, or the kappa error estimate of 50 iterations.
To add the testing data set to the model, simply use the addtest function. This function
allows us to evaluate the testing set without refitting the model.

> gdis <- addtest(gdis, test[,-1], test[,1]))

> gdis
10 ada: An R Package for Stochastic Boosting

...
Estimates of number of iterations:

train.err1 train.kap1 test.errs2 test.kaps2

47 50 40 40

Real AdaBoost
Next we provide the code to create a Real AdaBoost ensemble with the -boosting modifica-
tion, 4-split trees, and 1000 iterations. For additional convenience, the test set can be passed
to the function.

> control <- rpart.control(maxdepth = 2,cp = -1, minsplit = 0)

> greal <- ada(y~., data = train, iter = 1000, type = "real", nu = 0.001,
+ bag.frac = 1, model.coef = FALSE, control = control)
> greal

Call: ada(y ~ ., data = train, test.x = test[, -1], test.y = test[,

1], iter = 1000, type = "real", nu = 0.001, bag.frac = 1,
model.coef = FALSE, control = rpart.control(maxdepth = 2,
cp = -1, minsplit = 0))

Loss: exponential Method: real Iteration: 1000

Final Confusion Matrix for Data:

Final Prediction
True value 1 2
1 52 0
2 11 37

Train Error: 0.11

Out-Of-Bag Error: 0 iteration= 6

Additional Estimates of number of iterations:

train.err1 train.kap1 test.err2 test.kap2

999 999 996 996

Notice that the out-of-bag error rate here is meaningless since there are no subsamples in
pure -boosting (i.e. bag.frac = 1).To perform Stochastic Gradient -boosting simply set
the bag.frac argument less than 1 for the previous call (default is bag.frac=0.5).

Gentle AdaBoost
The following call provides a Gentle AdaBoost ensemble with 100 iterations, tree depth of 8,
ν = 0.1 (regularization), and the ensemble is shifted towards bagging using the bag.shift=TRUE
argument (Section 2.3).
Journal of Statistical Software 11

> ggen <- ada(y~., data = train, test.x = test[,-1], test.y = test[,1],
+ iter = 100, type = "gentle", nu = 0.1, bag.shift = TRUE,
+ control = rpart.control(cp = -1, maxdepth = 8))
> ggen

Call: ada(y ~ ., data = train, test.x = test[, -1], test.y = test[,

1], iter = 100, type = "gentle", nu = 0.1, bag.shift = TRUE,
control = rpart.control(cp = -1, maxdepth = 8))

Loss: exponential Method: gentle Iteration: 100

Final Confusion Matrix for Data:

Final Prediction
True value 1 2
1 52 0
2 6 42

Train Error: 0.06

Out-Of-Bag Error: 0.06 iteration= 92

Additional Estimates of number of iterations:

train.err1 train.kap1 test.err2 test.kap2

96 96 9 9

L2 Boost
To call L2 Boost (boosting with logistic loss (Friedman 2001)) with gentle boost, invoke ada
in the following way.

> glog <- ada(y~., data = train, test.x = test[,-1], test.y = test[,1],
+ iter = 50, loss = "l", type = "gentle")
> glog

Call: ada(y ~ ., data = train, test.x = test[, -1], test.y = test[,

1], iter = 50, loss = "l", type = "gentle")

Loss: logistic Method: gentle Iteration: 50

Final Confusion Matrix for Data:

Final Prediction
True value 1 2
1 52 0
2 6 42
12 ada: An R Package for Stochastic Boosting

Train Error: 0.06

ut-Of-Bag Error: 0.1 iteration= 50

Additional Estimates of number of iterations:

train.err1 train.kap1 test.err2 test.kap2

49 49 2 2

In addition to performing gentle boost the logistic loss function can be invoked with the
type="discrete" or type="real".

Stageweight convergence
For many of these methods a Newton Step is necessary for the convergence of the stageweight
at each iteration. To see the convergence use the verbose=TRUE flag:

> greal <- ada(y~., data = train, test.x = test[,-1], test.y = test[,1],
+ iter = 50, type = "real", verbose = TRUE)

FINAL: iter= 4 rate= 9.240109e-19

...
FINAL: iter= 3 rate= 5.890882e-15

For this example the stageweights converged quickly. In some situations the convergence may
not be as fast; to increase the number of iterations for convergence purposes use the max.iter
argument.

4.2. Using an ada object

Upon building an ada object, one can explore the model performance and characteristics
through several different tools.

The plot function

The ada plot function overrides the generic plot function in R by plotting the training
error versus iteration number for a given boosting ensemble. If the testing data sets have
been passed into the ada object then the testing errors can easily be plotted by using the
test=TRUE argument.
The following call shows how to create a plot using discrete adaboost on the training data
(Figure 3 (right)).

> plot(gdis)

Notice that the training error steadily decreases across iterations. This shows that boosting
can effectively learn the features in a data set. However, a more valuable diagnostic plot would
involve the corresponding test set error plot. The following call creates both the training and
testing error plots side-by-side, as seen in Figure 3 (right).
Journal of Statistical Software 13

Training Error Training And Testing Error

0.24

1 Train 1 Train
2 Test1

0.35
0.22

0.30
0.20
0.18

0.25
Error

Error
2 2 2 2
1
0.16

0.20
1
0.14

1 1
1

0.15
1
1
0.12

1
1

0.10
0 10 20 30 40 50 0 10 20 30 40 50

Iteration 1 to 50 Iteration 1 to 50

Figure 3: Training error by iteration number for the example data (right). The training and
testing error by iteration number for the example data (left).

> plot(gdis, FALSE, TRUE)

Remark: In many situations the class priors (proportion of true responses) are unbalanced;
in these situations, learning algorithms often focus on learning the larger set. Hence, errors
tend to appear low, but only because the algorithm classifies objects into the larger class. An
alternative measure to absolute classification error is the Kappa statistic (Cohen 1960), which
adjusts for class imbalances.

The summary function

Another utility often used with objects in R is summary. To summarize an ada object the
summary function returns the kappa value and training accuracy for the final iteration in a
given ensemble. If testing data sets are included, then the accuracy and kappa value will
also be reported. The argument n.iter allows the user to specify which iteration to report
(default is the model iter specification). The following code shows how to call this function
using Gentle AdaBoost with the OOB error minimum.

> summary(ggen, n.iter = 64)

Call: ada(y ~ ., data = train, test.x = test[, -1], test.y = test[,

1], iter = 100, type = "gentle", nu = 0.1, bag.shift = TRUE,
control = rpart.control(cp = -1, maxdepth = 8))

Loss: exponential Method: gentle Iteration: 92

Training Results
14 ada: An R Package for Stochastic Boosting

Accuracy: 0.96 Kappa: 0.92

Testing Results

Accuracy: 0.757 Kappa: 0.517

The predict function

The predict function is written to match the arguments of the predict.rpart generic func-
tion. Hence, the arguments consist of object, newdata, and type, which give a specific
prediction result. The default type for this function is to return the predicted vector of class
labels for the training data. The options type="prob", type="both" and type="F" either
give the probability class estimates, both the portability class estimates and the vector of
labels, or the weighted sum over the ensemble, respectively. The following code shows how to
predict with Real AdaBoost.

> pred <- predict(greal, train[,-1])

> table(pred)

1 2
56 44

To get the probability class estimates for the training data input:

> pred <- predict(greal, train[,-1], type = "prob")

> pred

[,1] [,2]
451 0.317362113 0.682637887
86 0.851147564 0.148852436
...
257 0.007765103 0.992234897
74 0.851265307 0.148734693

The first column provides the number of the actual observation in the original data set.
Remark: The probability class estimate for any boosting algorithm is defined as P̂ (Y = 1 |
e2F (x) x is considered infinite by R for large x it is
x) = 1+e2F (x) . However, since the function e

necessary to compute this value on the logarithmic scale and force it to 1 if e2F (x) = ∞. As
a result, one can not get the original F from this transformation, which is needed for the
multi-class case. This is the rationale for having the option of setting type="F". The usage
of this argument setting is shown in the multi-class example in the next section.
Remark: The newdata option requires a data.frame of observations with the exact same
variable names as the training data. If a matrix format is used to represent the training data,
then the variable names will most likely be the default V1 , . . . , Vp . The following code shows
how to change the names of the columns, if the training data are in a matrix or a data.frame
format, respectively.
Journal of Statistical Software 15

> test <- as.data.frame(test)

> names(test) <- c("y", paste("V", 1:p, sep=""))
> names(test) <- names(train)

The update function

Invoke the following command to add more trees to the glog ensemble, which was constructed
above with 20 trees.

> glog <- update(glog, train[,-1], train[,1], test[,-1], test[,1], n.iter = 50)
> glog

Loss: logistic Method: gentle Iteration: 100

...
train.err1 train.kap1 test.err2 test.kap2
75 75 96 96

The varplot function

Gentle AdaBoost will be used to illustrate the variable importance function. The following
code shows how to plot the variables ordered by the importance score defined in (Hastie et al.
2001).

> varplot(ggen)

Figure 4 gives us the scores for the variable assessment for Gentle AdaBoost.
To obtain the variable scores directly (without a plot) use the following code.

> vip <- varplot(gdis, plot.it = FALSE, type = "scores")

> round(vip, 4)

x1 x6 x5 x10 x2 x4 x9 x3 x8 x7
0.0070 0.0036 0.0035 0.0034 0.0028 0.0027 0.0025 0.0024 0.0023 0.0017

The pairs function

The pairs tool produces a visualization of the pairwise relationships between a subset of
variables in the data set. The upper panel plots represent the true class labels as colors for
each pairwise relationship, while the lower panel gives the predicted class for each observation.
Also the observations in the plots on the lower panel are scaled by the class probability
estimate, where the size of the point represents the probability estimate. Hence this plot can
help identify observations that are difficult for boosting to classify.
To generate the pairwise plots for the three top variables determined by the varplot function,
issue the command:

> pairs(gdis, train[,-1], maxvar = 3)

16 ada: An R Package for Stochastic Boosting

Variable Importance Plot

x1 ●

x5 ●

x4 ●

x2 ●

x10 ●

x6 ●

x7 ●

x8 ●

x3 ●

x9 ●

0.014 0.016 0.018 0.020 0.022 0.024 0.026

Score

Figure 4: Variable importance scores produced by the varplot command.

Pairs can also be used to explore the relationships among variables in the test set. If the
response for the test set is unknown, the observations will be plotted in black.

> pairs(gdis, train[,-1], var = 1:2,test.x = test[,-1],

+ test.y = test[,1],test.only = TRUE)

As a final note, to view both the testing and training data on the same plot, leave test.only
as FALSE and issue the above command with either var=· · · for a specific set of variables or
maxvar=k for the top k variables.

5. Examples
The examples below will illustrate various aspects and uses of the ada package.

5.1. Diagnostics and model selection

To illustrate ada and validate its performance, we will use an example from Hastie et al.
(2001), pp. 300-308. A ten dimensional data set comprised of 12,000 observations was
generated, where each variable is an independent and identically distributed mean zero,
variance one, normal variate. The response was computed using each variable equally as,
Y = 2 × 1{P X 2 >χ210 (0.5)=9.34} − 1, using the following code.
j
Journal of Statistical Software 17

−2 −1 0 1 2

6
5
4
● ● ● ● ●● ● ● ● ●● ●
● ● ● ●● ●
● ● ● ● ●
● ●
● ● ● ● ●
x1 ● ● ● ● ● ●● ●

3
● ●
● ●
● ● ● ● ● ●
● ● ● ● ● ●
●● ● ●
● ●

2
● ● ●●●
● ● ●
● ●
●
● ● ● ● ●

1
● ● ● ●
● ● ●●●
● ● ●● ● ●●
● ●● ●
● ● ● ● ●● ●●

0
●
2

● ●
● ●
● ● ● ●
● ● ● ●
● ●
●● ● ●● ● ●● ● ● ● ●
1

●● ●
●
● ● ●●●
● ● ● ●
●
● ●● ● ● ● ●●● ● ●
●
●●

x6 ●●
● ●
0

●
●
● ●
●
● ●
● ● ●●
● ●
● ● ● ● ● ●
−1

● ● ●● ● ● ●● ●●
● ●● ●●
● ● ● ●
● ●
● ●
−2

●● ● ●
● ●
● ●

● ●
● ●
● ● ● ● ● ●
● ●

1
● ●
● ● ● ●

● ●
● ● ● ● ● ●
● ●
●● ● ● ● ●
● ● ● ●
● ● ● ●
● ● ● ● ●
● ● ●●

0
● ● ● ● ●
●● ● ●
● ●
● ● ●●

●
●●
● ●
●
●

●
●
●
●
●●
●
●
●

●
●
● ●
x5

−1
● ●
●
●
●
● ●
●
● ● ● ●
● ● ● ● ●
● ●
●
● ● ● ●

● ●

−2
● ●

0 1 2 3 4 5 6 −2 −1 0 1

Figure 5: Pairs plot of first three descriptors.

−3 −2 −1 0 1 2 3
6
5
4

●
●● ●
● ●●
● ●●●● ●●● ● ●
● ● ●● ●
● ● ●●
●● ● ●● ●●
x1 ● ● ●● ●●●●
●
3

● ●●●
●●●●● ●
●●● ●●●
● ●● ●● ●
● ●● ●● ● ●●
●
●
●●
●●● ●
● ● ●●● ●
● ●
●
●●●●● ● ●
2

● ● ● ●
●● ● ●●
●● ● ●●
● ●
●●●● ●●
●●
● ●●
●●●
●●● ●● ● ● ●
●●● ● ● ●● ●●●
● ●●● ●
1

●
● ●●● ●●●●●●● ●●
●
●●●●
●● ●●
● ● ● ●●
● ●●●●●●
●●
0
3

●
● ●●
●●● ● ●
●
●● ● ● ●●●
● ●●● ● ●● ●●
2

●
● ●
●
●●●
●●
● ●
●●
●●●●●●●● ●●●● ●
●●● ● ●● ●● ●●● ●
●
● ● ● ● ● ●
● ●●● ●
●● ● ● ● ● ● ● ●● ●●
●●
1

● ● ● ● ●●
●●●
● ●●● ● ● ● ● ●● ● ●●● ●● ●
● ● ●
●●
●
●●● ●●● ●● ●● ●
●
●●●●
● ●
● ●●
●● ●● ● ●
● ● ●
●●●

x2
0

● ● ●
●● ●●
●●●
● ● ●● ●
●● ●
●
●●
● ● ● ●●
●
●●
● ●●●
● ● ● ●
−1

●●
●● ●
●●●●
●
●● ● ● ●
●
● ●
● ●
●
●
● ●
● ●
●
●● ●
●● ●
●
●●
● ●● ● ●
−2

● ●●
● ● ●
● ●
● ●
● ●
−3

0 1 2 3 4 5 6

Figure 6: Pairs plot of the two informative variables.

18 ada: An R Package for Stochastic Boosting

> n <- 12000

> p <- 10
> set.seed(100)
> x <- matrix(rnorm(n*p), ncol=p)
> y <- as.factor(c(-1, 1)[as.numeric(apply(x^2, 1, sum) > 9.34) + 1])

A 400-iteration Discrete AdaBoost ensemble using stumps was fitted to a training set of 2000
observations using the following code:

> indtrain <- sample(1:n, 2000, FALSE)

> train <- data.frame(y=y[indtrain], x[indtrain,])
> test <- data.frame(y=y[-indtrain], x[-indtrain,])
> control <- rpart.control(cp = -1,minsplit = 0,xval = 0,maxdepth = 1)
> gdis <- ada(y~., data = train, iter = 400, bag.frac = 1, nu = 1,
+ control = control, test.x = test[,-1], test.y = test[,1])
> gdis

Call:
ada(y ~ ., data = train, iter = 400, bag.frac = 1, nu = 1, control = control,
test.x = test[, -1], test.y = test[, 1])

Loss: exponential Method: discrete Iteration: 400

Final Confusion Matrix for Data:

Final Prediction
True value -1 1
-1 954 36
1 85 925

Train Error: 0.06

Out-Of-Bag Error: 0 iteration= 6

Additional Estimates of number of iterations:

train.err1 train.kap1 test.err2 test.kap2

398 398 398 398

> plot(gdis, TRUE, TRUE)

The summary command is used below to present the test set performance of the boosted
model.

> summary(gdis, n.iter = 398)

Call:
ada(y ~ ., data = train, iter = 400, bag.frac = 1, nu = 1, control = control,
Journal of Statistical Software 19

Training And Testing Error Training And Testing Kappas

0.5
1 Train 1 Train
1 1
2 Test1 2 Test1 1

0.8
1 2 2
2
0.4

1 2
2

0.6
Kappa Accuracy
0.3
Error

0.4
0.2

2
1 2
2
0.2
1 2 2
0.1

1 1 1
0 100 200 300 400 0 100 200 300 400

Iteration 1 to 400 Iteration 1 to 400

Figure 7: Training and testing error and kappa accuracy by iteration.

test.x = test[, -1], test.y = test[, 1])

Loss: exponential Method: discrete Iteration: 398

Training Results

Accuracy: 0.94 Kappa: 0.879

Testing Results

Accuracy: 0.889 Kappa: 0.777

Notice that the testing error is 11.1%, which agrees with the results found in (Hastie et al.
2001).
Remark: The variables in this example are all equally important by construction, and therefore
the diagnostics for variable selection and pairwise plots are not shown.

5.2. Solubility data

The ada package is used to analyze a data set that contains information about compounds
used in drug discovery. Specifically, this data set consists of 5631 compounds on which an
in-house solubility screen (ability of a compound to dissolve in a water/solvent mixture) was
20 ada: An R Package for Stochastic Boosting

performed. Based on this screen, compounds were categorized as either insoluble (n=3493) or
soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were
computed. Of these descriptors, one contained missing values for approximately 14% (n=787)
of the observations. The objective of the analysis is to model the relationship between the
structural descriptors and the solubility class.
For modeling purposes, the original data set was randomly partitioned into training (50%),
test (30%), and validation (20%) sets. The data will be called soldat and the compound labels
and variable names have been blinded for this illustration.

> data("soldat")
> n <- nrow(soldat)
> set.seed(100)
> ind <- sample(1:n)
> trainval <- ceiling(n * .5)
> testval <- ceiling(n * .3)
> train <- soldat[ind[1:trainval],]
> test <- soldat[ind[(trainval + 1):(trainval + testval)],]
> valid <- soldat[ind[(trainval + testval + 1):n],]

Gentle AdaBoost with default settings was used on the training set. This data set contained a
descriptor with missing values and recall that the default setting is given by na.action=na.rpart.
This option allows rpart to search all descriptors, including those with missing values using
surrogate splits (Breiman, Friedman, Olshen, and Stone 1984), to find the best descriptor for
splitting purposes.

> control <- rpart.control(cp = -1, maxdepth = 14,maxcompete = 1,xval = 0)

> gen1 <- ada(y~., data = train, test.x = test[,-73], test.y = test[,73],
+ type = "gentle", control = control, iter = 70)
> gen1 <- addtest(gen1, valid[,-73], valid[,73])

The summary function can then be used to evaluate the performance of the model on the test
data:

> summary(gen1)

Call: ada(y ~ ., data = train, test.x = test[, -73], test.y = test[,

73], type = "gentle", control = control, iter = 70)

Loss: exponential Method: gentle Iteration: 70

Training Results

Accuracy: 0.987 Kappa: 0.972

Testing Results

Accuracy: 0.765 Kappa: 0.487

Journal of Statistical Software 21

Training And Testing Error Training And Testing Kappas

0.6
1 Train 1 Train 1
2 Test1 2 Test1 1
1

0.9
3 Test2 3 Test2
1
0.5

0.8
1
0.4

Kappa Accuracy

0.7
Error

0.3

2
3 2 2 2 2

0.6
3 3 3 3
0.2

0.5
3 3 3 3
2
0.1

1 3 2 2 2
1 2
1 1
0.4
1
0.0

0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70

Iteration 1 to 70 Iteration 1 to 70

Figure 8: Training, testing, and kappa error values by iteration for the solubility data.

Accuracy: 0.775 Kappa: 0.5

Testing accuracy rates are printed in the order they are entered so the accuracy on the testing
set is 0.765 and on the validation set 0.781.
For this type of early drug discovery data, the Gentle AdaBoost algorithm performs adequately
with test set accuracy of 76.5% (kappa≈ 0.5). Figure 8 also illustrates the model’s performance
for the training and test sets across iterations.

> plot(gen1, TRUE, TRUE)

In order to enhance our understanding regarding the relationship between descriptors and the
response, the varplot function was employed. It can be seen from Figure 9 that variable 5
is the most important one.

> varplot(gen1)

The features are quite noisy and the variables are correlated, which makes variable importance
difficult to ascertain from this one run. Next we provide a small run to obtain the average
variable importance over 20 iterations.

> vars <- rep(0,72)

> t1 <- proc.time()
22 ada: An R Package for Stochastic Boosting

Variable Importance Plot Average Variable Imp.

x50 ● x38 ●
x9 ● x39 ●
x46 ● x63 ●
x58 ● x20 ●
x60 ● x19 ●
x63 ● x57 ●
x49 ● x52 ●
x20 ● x40 ●
x57 ● x50 ●
x36 ● x55 ●
x13 ● x54 ●
x22 ● x36 ●
x51 ● x22 ●
x37 ● x37 ●
x55 ● x21 ●
x8 ● x43 ●
x33 ● x32 ●
x56 ● x51 ●
x42 ● x58 ●
x15 ● x3 ●
x19 ● x34 ●
x2 ● x41 ●
x35 ● x56 ●
x47 ● x49 ●
x52 ● x42 ●
x65 ● x31 ●
x41 ● x18 ●
x14 ● x13 ●
x5 ● x29 ●
x72 ● x64 ●

0.028 0.030 0.032 0.034 0.0265 0.0270 0.0275 0.0280 0.0285 0.0290

Score

Figure 9: Variable importance for solubility data one run (left), and average (right).

> for(i in 1:20){

+ rm(gen1)
+ gen1 <- ada(y~., data = train, test.x = test[,-73], test.y = test[,73],
+ type = "gentle", control = control, iter = 70)
+ vec1 <- varplot(gen1, plot.it = FALSE, type = "scores", max.var.show = 72)
+ vars <- vars + as.numeric(vec1[order(names(vec1))]) / 20
+ cat("i=", i, " time=", (proc.time()-t1)/60, "\n")
+ }

i= 1 time= 0.5468333 0.007166667 0.5546667 NA NA

...
i= 20 time= 13.85133 0.0405 13.97417 NA NA
> a1 <- sort(names(vec1))
> a2 <- order(vars, decreasing = TRUE)
> dotchart(vars[a2][30:1], a1[a2][30:1], main="Average Variable Imp.")

According to the average variable importance, eight of the top eleven variables measure various
aspects of a compound’s hydrophilicity and hydrophobicity, which makes scientific sense in this
context. These various hydrophilicity and hydrophobicity measures account for a molecule’s
tendency to dissolve in water – key measures for determining a molecule’s solubility.
Remark: In practice, we have found the average performance for variable importance infor-
mative, and well worth the time it takes to compute it. Also, given the size of this data set,
it may be convenient to invoke R with -max-mem-size=2Gb.
Journal of Statistical Software 23

5.3. Stochastic boosting in a multi-class context

Currently, several boosting algorithms can not directly handle a K-class response. This
topic constitutes an active area of research. However, several methods have been proposed
in the literature to address this issue. One popular strategy is the one-versus-all tech-
nique, where each individual class (typically coded as 1), is modeled against all the remain-
ing classes
PM(each coded as zero), and K different ensembles are constructed. The values
Fk (x) = m=1 αm fm (x), k = 1, ..., K returned by each ensemble are compared and the value
corresponding to the maximum Fk is given as the class label (Friedman et al. 2000).
To illustrate the use of ada for multi-class data sets, a ten dimensional feature vector (X1 , . . . , X10 )
was simulated from a standard normal distribution. The response variables is constructed as:
Pp
Xr2 ≤ χ210 (.33)

 1 : Pr=1
p
Y = 2 : χ10 (.33) < Pr=1 Xr2 ≤ χ210 (.66)
2
p 2 2
3 : r=1 Xr > χ10 (.66)


This produces a three class analog to the example given above. For this example, a training
and a testing data set which comprised of 200 and 1000 observations, respectively, were
generated.

> n <- 1200

> p <- 10
> K <- 3
> set.seed(100)
> x <- matrix(rnorm(n * p), ncol = p)
> indtrain <- sample(1:n, 200, FALSE)
> indtest <- setdiff(1:n, indtrain)
> val <- qchisq(c(.33, .66), 10)
> su <- apply(x^2, 1, sum)
> Iy <- cbind(as.numeric(su <= val[1]), as.numeric( val[1] < su & su <= val[2]),
+ as.numeric(su > val[2]))
> y <- apply(Iy, 1, which.max)
> test <- data.frame(y = y[indtest], x[indtest,])

A 250-iteration stochastic version of Discrete AdaBoost ensemble was constructed using de-
fault trees as the base learner. The implementation of the one-versus-all strategy is given
next:

> Fs <- list()

> for(i in 1:K)
+ Fs[[i]] <- ada(y~., data = data.frame(y = Iy[indtrain, i], x[indtrain,]),
+ iter = 250,test.x = test, test.y = Iy[indtest, i])$model$F[[2]]

The in class test error rate and total test error rate will be obtained below using the sapply and
table commands in R. For more information on these commands refer to (Becker, Chambers,
and Wilks 1988).

> wmx <- function(i)which.max(c(Fs[[1]][i],Fs[[2]][i],Fs[[3]][i]))

> preds <- sapply(1:1000,wmx)
24 ada: An R Package for Stochastic Boosting

> tab <- table(y[indtest],preds)

> for(i in 1:K){
+ cat("In class error rate for class ", i, ": ",
+ round(1 - tab[i,i] / sum(tab[i,]), 3), "\n")
+ }

In class error rate for class 1 : 0.281

In class error rate for class 2 : 0.644
In class error rate for class 3 : 0.324

> 1 - sum(diag(tab)) / length(indtest)

[1] 0.416

Notice that although the test error rate appears high, it is still substantially lower than
random guessing (0.66). Also the in class error rate for class two seems to be rather high.
In order to assess the magnitude of the error rate, a random forest classifier (Breiman 2001)
was considered. The choice of a random forest as an ensemble method is due to its ability
to handle multi-class problems and its overall competitive performance. The following code
using the R package randomForest constructs such an ensemble for the data at hand (Liaw
and Wiener 2002).

> library("randomForest")
> train <- data.frame(y=as.factor(y[indtrain]), x[indtrain,])
> set.seed(100)
> grf <- randomForest(y~., train)
> tab3 <- table(y[indtest], predict(grf, test))
> for(i in 1:K){
+ cat("In class error rate for class ", i,": ",
+ round(1 - tab3[i,i] / sum(tab[i,]), 3), "\n")
+ }

In class error rate for class 1 : 0.272

In class error rate for class 2 : 0.623
In class error rate for class 3 : 0.352

> 1 - sum(diag(tab3)) / length(indtest)

[1] 0.415

It can be seen that the overall test error rate (0.415) is about the same as that produced
by Discrete AdaBoost, which strongly indicates that the combination of the ada and the
one-versus-all strategy can easily handle multi-class classification problems.
Journal of Statistical Software 25

6. Summary and concluding remarks

In this paper, the R package ada that implements several boosting algorithms is described. Its
key features are its functional modularity, the adjustment of class priors for tree classifiers,
as well as the incorporation of several useful in practice plots, such as the pairs and the
importance of variables plot.
Overall, boosting has had a significant impact both in theoretical and applied research on
classification problems, as can be seen by the size of the existing literature on the topic.
However, the choice of base classifiers plays a crucial role in the performance of the ensemble.
Typically for 4 or 8-split trees, after a large number of iterations the training error rate is
driven to zero, while the test error decreases up to a certain level and then oscillates around
this level. In that respect, the performance seen in our first data example is somewhat
atypical, where the training error rate has not reached zero even after 400 iterations. This is
most likely due to the use of stumps as a base classifier.

Acknowledgements
The authors would like to thank the Editor Jan de Leeuw, an Associate Editor and two
anonymous referees for useful comments and suggestions. The work of George Michailidis
was supported in part by NIH grant P41 RR18627-01 and by NSF grant DMS 0204247.

References

Becker R, Chambers J, Wilks A (1988). The New S Language: A Programming Environment

for Data Analysis and Graphics. Wadsworth and Brooks/Cole Advanced Books & Software,
Monterey, CA. ISBN 0-534-09192-X.

Boonyanunta N, Zeephongsekul P (2003). “Improving the Predictive Power of AdaBoost: A

Case Study in Classifying Borrowers.” In “Proceedings of the 16th International Conference
on Developments in Applied Artificial Intelligence,” pp. 674–685. Springer Verlag Inc. ISBN
3-540-40455-4.

Breiman L (1996). “Bagging Predictors.” Machine Learning, 24(2), 123–140.

doi:10.1023/A:1018054314350.

Breiman L (2001). “Random Forests.” Machine Learning, 45(1), 5–32.

doi:10.1023/A:1010933404324.

Breiman L, Friedman J, Olshen R, Stone C (1984). Classification and Regression Trees.

Chapman & Hall, New York.

Cohen J (1960). “A Coefficient of Agreement for Nominal Data.” Education and Psychological
Measurement, 20, 37–46.

Dettling M (2004). “BagBoosting for Tumor Classification with Gene Expression Data.”
Bioinformatics, 20(18), 3583–3593. doi:10.1093/bioinformatics/bth447.
26 ada: An R Package for Stochastic Boosting

Freund Y, Schapire R (1996). “Experiments with a New Boosting Algorithm.” In “Interna-

tional Conference on Machine Learning,” pp. 148–156.

Freund Y, Schapire R (1997). “A Decision-Theoretic Generalization of On-Line Learning

and an Application to Boosting.” Journal Computer and System Sciences, 55(1), 119–139.
doi:10.1006/jcss.1997.1504.

Friedman J (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” The

Annals of Statistics, 29(5), 1189–1232.

Friedman J (2002). “Stochastic Gradient Boosting.” Computational Statistics & Data Anal-
ysis, 38(4), 367–378. doi:10.1016/S0167-9473(01)00065-2.

Friedman J, Hastie T, Tibshirani R (2000). “Additive Logistic Regression: A Statistical View

of Boosting.” The Annals of Statistics, 28(2), 337–407.

Hastie T, Tibshirani R, Friedman J (2001). The Elements of Statistical Learning. Springer

Verlag. ISBN 0-387-95284-5.

Hothorn T, Bühlmann P (2006a). “Model-based Boosting in High Dimensions.” Bioinformat-

ics. doi:10.1093/bioinformatics/btl462. Forthcoming.

Hothorn T, Bühlmann P (2006b). mboost: Model-based Boosting. R package version 0.4-13.

Huang K, Murphy R (2004). “Boosting Accuracy of Automated Classification of Flu-

orescence Microscope Images for Location Proteomics.” BMC Bioinformatics, 5, 78.
doi:10.1186/1471-2105-5-78.

Kawakita M, Minami M, Eguchi S, Lennert-Cody C (2005). “An Introduction to the Predic-

tive Technique AdaBoost with a Comparison to Generalized Additive Models.” Fisheries
Research, 76(6), 323–343.

Lemmens A, Croux C (2005). “Bagging and Boosting Classification Trees to Predict Churn.”
Journal of Marketing Research, 43(2), 276–268.

Liaw A, Wiener M (2002). “Classification and Regression by randomForest.” R News, 2(3),

18–22. URL http://CRAN.R-project.org/doc/Rnews/.

R Development Core Team (2006). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http:
//www.R-project.org/.

Ridgeway G (2006). gbm: Generalized Boosted Regression Models. R package version 1.5-7,
URL http://www.i-pensieri.com/gregr/gbm.shtml.

Rosset S, Zhu J, Hastie T (2004). “Boosting as a Regularized Path to a Maximum Margin

Classifier.” Journal of Machine Learning Research, 5, 941–973.

Schapire R (1990). “The Strength of Weak Learnability.” Machine Learning, 5(2), 197–227.
doi:10.1023/A:1022648800760.
Journal of Statistical Software 27

Segal M (2004). “Machine Learning Benchmarks and Random Forest Regression.” Technical
report, Center for Bioinformatics & Molecular Biostatistics, University of California, San
Francisco, CA. URL http://repositories.cdlib.org/cbmb/bench_rf_regn/.

Sugata S, Abe Y (2001). “Computer Simulation of Hydrodynamic Models for

Chemical/Pharmaco-Kinetics.” Journal of Chemical Software, 7(2).

Therneau T, Atkinson B (2005). rpart: Recursive Partitioning Software. R pack-

age version 3.1-32, URL http://mayoresearch.mayo.edu/mayo/research/biostat/
splusfunctions.cfm.

Ulintz P, Zhu J, Qin Z, Andrews P (2006). “Improved Classification of Mass Spectrome-

try Database Search Results Using Newer Machine Learning Approaches.” Molecular and
Cellular Proteomics, 5(3), 497–509.

Valiant L (1984). “A Theory of The Learnable.” In “Proceedings of the 16th Annual ACM
Symposium on Theory of Computing,” pp. 436–445. ACM Press, New York, NY. ISBN
0-89791-133-4. doi:10.1145/800057.808710.

Affiliation:
Mark Culp
Department of Statistics
University of Michigan
436 West Hall, 550 East University
Ann Arbor, MI 48109, United States of America
E-mail: culpm@umich.edu
URL: http://www.stat.lsa.umich.edu/~culpm/

Kjell Johnson
E-mail: Kjell.Johnson@pfizer.com

George Michailidis
E-mail: gmichail@umich.edu
URL: http://www.stat.lsa.umich.edu/~gmichail/

Journal of Statistical Software http://www.jstatsoft.org/

published by the American Statistical Association http://www.amstat.org/
Volume 17, Issue 2 Submitted: 2005-07-13
October 2006 Accepted: 2006-09-26

The Easy Way To Get in Shape and Stay in Shape For The Rest of Your Life
85% (13)
The Easy Way To Get in Shape and Stay in Shape For The Rest of Your Life
73 pages
189 Cheat Sheet Minicards
No ratings yet
189 Cheat Sheet Minicards
2 pages
AdaBoostExample PDF
No ratings yet
AdaBoostExample PDF
2 pages
Boosting Buehlmann
No ratings yet
Boosting Buehlmann
52 pages
Boosting Algorithms: Regularization, Prediction and Model Fitting
No ratings yet
Boosting Algorithms: Regularization, Prediction and Model Fitting
29 pages
The Evolution of Boosting Algorithms: From Machine Learning To Statistical Modelling
No ratings yet
The Evolution of Boosting Algorithms: From Machine Learning To Statistical Modelling
32 pages
Overview of Adaboost: Reconciling Its Views To Better Understand Its Dynamics
No ratings yet
Overview of Adaboost: Reconciling Its Views To Better Understand Its Dynamics
39 pages
A Brief Introduction To Adaboost: Hongbo Deng 6 Feb, 2007
No ratings yet
A Brief Introduction To Adaboost: Hongbo Deng 6 Feb, 2007
35 pages
ADABOOST
No ratings yet
ADABOOST
9 pages
adaboost
No ratings yet
adaboost
5 pages
Experimenting XGBoost Algorithmfor Predictionand Classificationof Different Datasets
No ratings yet
Experimenting XGBoost Algorithmfor Predictionand Classificationof Different Datasets
12 pages
Pradipta Kumar Pattanayak - Ada Boosting
No ratings yet
Pradipta Kumar Pattanayak - Ada Boosting
44 pages
A Short Introduction To Boosting
No ratings yet
A Short Introduction To Boosting
14 pages
Boosting Approach To Machine Learn
No ratings yet
Boosting Approach To Machine Learn
23 pages
Types of Boosting
No ratings yet
Types of Boosting
4 pages
AdaBoost Is Consistent
No ratings yet
AdaBoost Is Consistent
22 pages
A Short Introduction To Boosting
No ratings yet
A Short Introduction To Boosting
14 pages
Adaboost With Totally Corrective Updates For Fast Face Detection
No ratings yet
Adaboost With Totally Corrective Updates For Fast Face Detection
6 pages
DS535 Note 6 (Page28-30)
No ratings yet
DS535 Note 6 (Page28-30)
4 pages
Survey - Gradient Boosting Machine
No ratings yet
Survey - Gradient Boosting Machine
9 pages
Ada Boost
No ratings yet
Ada Boost
7 pages
Adaboost
No ratings yet
Adaboost
4 pages
_LECTURE+NOTES_Boosting
No ratings yet
_LECTURE+NOTES_Boosting
8 pages
The Success of Adaboost and Its Application in Portfolio Management
No ratings yet
The Success of Adaboost and Its Application in Portfolio Management
39 pages
Zhu - Multiclass Adaboost2009 PDF
No ratings yet
Zhu - Multiclass Adaboost2009 PDF
12 pages
addaboost
No ratings yet
addaboost
12 pages
A Short Introduction To Boosting
No ratings yet
A Short Introduction To Boosting
14 pages
AdaBoost M1
No ratings yet
AdaBoost M1
16 pages
Trees, Boosting, and Random Forest
No ratings yet
Trees, Boosting, and Random Forest
14 pages
Boosting: 1. What Is The Difference Between Adaboost and Gradient Boosting?
No ratings yet
Boosting: 1. What Is The Difference Between Adaboost and Gradient Boosting?
2 pages
Experiments With A New Boosting Algorithm: Yoav Freund Robert E. Schapire
No ratings yet
Experiments With A New Boosting Algorithm: Yoav Freund Robert E. Schapire
16 pages
Adaptive Boosting For Classification and Regression
No ratings yet
Adaptive Boosting For Classification and Regression
4 pages
TM Adaboost
No ratings yet
TM Adaboost
12 pages
IMPROVE_boost_1999
No ratings yet
IMPROVE_boost_1999
40 pages
Handout9 Trees Bagging Boosting
100% (1)
Handout9 Trees Bagging Boosting
23 pages
Power System Security Assessment Using Adaboost Algorithm
No ratings yet
Power System Security Assessment Using Adaboost Algorithm
6 pages
Ensemble - Part 1
No ratings yet
Ensemble - Part 1
33 pages
107 Boostong Models
No ratings yet
107 Boostong Models
27 pages
An Introduction To Boosting and Leveraging: 1 A Brief History of Boosting
No ratings yet
An Introduction To Boosting and Leveraging: 1 A Brief History of Boosting
66 pages
Statistics Project
No ratings yet
Statistics Project
5 pages
AdaBoost-From Theoretical Perspective
No ratings yet
AdaBoost-From Theoretical Perspective
7 pages
R: Adabag
No ratings yet
R: Adabag
34 pages
Computational Data Analysis: Machine Learning
No ratings yet
Computational Data Analysis: Machine Learning
26 pages
FAQ - Boosting - Ensemble Techniques - Great Learning
No ratings yet
FAQ - Boosting - Ensemble Techniques - Great Learning
2 pages
Lecture 16: Boosting — Applied ML
No ratings yet
Lecture 16: Boosting — Applied ML
20 pages
Boosting With The L - Loss: Regression and Classification
No ratings yet
Boosting With The L - Loss: Regression and Classification
32 pages
5 - EnsembleModeling
No ratings yet
5 - EnsembleModeling
80 pages
AdaBoost Classifier in Python (Article) - DataCamp
100% (1)
AdaBoost Classifier in Python (Article) - DataCamp
9 pages
Hybrid Credit Scoring
No ratings yet
Hybrid Credit Scoring
13 pages
Statistical Classification: Fundamentals and Applications
From Everand
Statistical Classification: Fundamentals and Applications
Fouad Sabry
No ratings yet
Boosting (Machine Learning)
No ratings yet
Boosting (Machine Learning)
6 pages
Gradient Boosting in ML
No ratings yet
Gradient Boosting in ML
5 pages
DM(Boosting)
No ratings yet
DM(Boosting)
15 pages
Ada Boost
No ratings yet
Ada Boost
2 pages
Adaboost: Derek Hoiem March 31, 2004
No ratings yet
Adaboost: Derek Hoiem March 31, 2004
46 pages
Data Mining - Ensemble Methods
No ratings yet
Data Mining - Ensemble Methods
12 pages
Class Adv Classification V
No ratings yet
Class Adv Classification V
50 pages
Boosting With Structural Sparsity
No ratings yet
Boosting With Structural Sparsity
41 pages
Boosting
No ratings yet
Boosting
6 pages
Introduction To Boosting - 2
No ratings yet
Introduction To Boosting - 2
79 pages
Adaboost Matas
No ratings yet
Adaboost Matas
136 pages
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
From Everand
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
Fouad Sabry
No ratings yet
Random Optimization: Fundamentals and Applications
From Everand
Random Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet
Resume NSIT
No ratings yet
Resume NSIT
2 pages
Kanchan Malhotra: TH TH
No ratings yet
Kanchan Malhotra: TH TH
2 pages
Continuous Optimization
No ratings yet
Continuous Optimization
23 pages
IAS Daily Quiz
No ratings yet
IAS Daily Quiz
58 pages
Calculating A - Linear Regression
No ratings yet
Calculating A - Linear Regression
28 pages
IEEE Format
No ratings yet
IEEE Format
2 pages
Data Science Dream Job Skills Roadmap PDF
No ratings yet
Data Science Dream Job Skills Roadmap PDF
6 pages
Meet Our Leadership Keplers
No ratings yet
Meet Our Leadership Keplers
7 pages
EXCEL Formulas
No ratings yet
EXCEL Formulas
207 pages
Wi Fi Technology
No ratings yet
Wi Fi Technology
28 pages
JP Morgan)
No ratings yet
JP Morgan)
7 pages
Analytics in Banking:: Time To Realize The Value
100% (1)
Analytics in Banking:: Time To Realize The Value
11 pages
Abstract For 3G Vs Wifi
No ratings yet
Abstract For 3G Vs Wifi
1 page
Machine Learning Algorithmsfor Predictionofmobilephone Price
No ratings yet
Machine Learning Algorithmsfor Predictionofmobilephone Price
9 pages
Predicting Customer Churn On OTT Platforms
No ratings yet
Predicting Customer Churn On OTT Platforms
19 pages
IJNRD2404218
No ratings yet
IJNRD2404218
5 pages
Numbers of Classifier
No ratings yet
Numbers of Classifier
49 pages
Boosting
No ratings yet
Boosting
13 pages
Paper Satu
No ratings yet
Paper Satu
10 pages
Sales-Forecasting of Retail Stores Using Machine Learning Techniques
No ratings yet
Sales-Forecasting of Retail Stores Using Machine Learning Techniques
7 pages
Arduino Based Face Detection and Tracking System
No ratings yet
Arduino Based Face Detection and Tracking System
45 pages
Sse 11 24 549 2
No ratings yet
Sse 11 24 549 2
1 page
Awad K 2018 Report PDF
No ratings yet
Awad K 2018 Report PDF
27 pages
SSRN Id3890338
No ratings yet
SSRN Id3890338
20 pages
Aor 46 1741
No ratings yet
Aor 46 1741
13 pages
Assignment 8
No ratings yet
Assignment 8
4 pages
Opencv-0 9 5 Doc Full
No ratings yet
Opencv-0 9 5 Doc Full
285 pages
Cyber Data Analytics - Assignment 1: Mateusz Garbacz
No ratings yet
Cyber Data Analytics - Assignment 1: Mateusz Garbacz
8 pages
A Guide To Face Detection in Python - Towards Data Science
No ratings yet
A Guide To Face Detection in Python - Towards Data Science
26 pages
ENsemble, Random Forest
No ratings yet
ENsemble, Random Forest
28 pages
A Comprehensive Guide To Ensemble Learning (With Python Codes)
No ratings yet
A Comprehensive Guide To Ensemble Learning (With Python Codes)
22 pages
Water Body Extraction From Landsat ETM+ Imagery
No ratings yet
Water Body Extraction From Landsat ETM+ Imagery
4 pages
DOC-20241024-WA0008. (1)
No ratings yet
DOC-20241024-WA0008. (1)
21 pages
My Resume
No ratings yet
My Resume
1 page
IE506 Bagging Boosting April5 6
No ratings yet
IE506 Bagging Boosting April5 6
14 pages
Analysis of ensemble machine learning classification comparison on the skin cancer MNIST dataset
No ratings yet
Analysis of ensemble machine learning classification comparison on the skin cancer MNIST dataset
8 pages
IEEE_Format_Paper
No ratings yet
IEEE_Format_Paper
20 pages
Lecture5 FGV
No ratings yet
Lecture5 FGV
25 pages
Face Detection
No ratings yet
Face Detection
21 pages
COMP4702 Notes 2019: Week 2 - Supervised Learning
No ratings yet
COMP4702 Notes 2019: Week 2 - Supervised Learning
23 pages