Learn the Boosting Method Implementation in R
Abstract
Boosting is an iterative algorithm that combines simple classification rules with 'mediocre' performance in terms of misclassification error rate to produce a highly accurate classification rule. Stochastic gradient boosting provides an enhancement which incorporates a random mechanism at each boosting step, showing an improvement in performance and speed in generating the ensemble. ada is an R package that implements three popular variants of boosting, together with a version of stochastic gradient boosting. In addition, useful plots for data analytic purposes are provided, along with an extension to the multi-class case. The algorithms are illustrated with synthetic and real data sets.
1. Introduction
Boosting has proved to be an effective method to improve the performance of base classifiers, both theoretically and empirically. The underlying idea is to combine simple classification rules (the base classifiers) to form an ensemble whose performance is significantly improved. The origins of boosting lie in PAC learning theory (Valiant 1984), which established that learners exhibiting performance only slightly better than random guessing can, when appropriately combined, perform very well.
A provably polynomial complexity boosting algorithm was derived in Schapire (1990), whereas the Adaptive Boosting (AdaBoost) algorithm in its various varieties (Freund and Schapire 1996, 1997) proved to be a practical implementation of the boosting ensemble method. Since its introduction, many authors have sought to explain and improve upon the AdaBoost algorithm. In Friedman, Hastie, and Tibshirani (2000), it was shown that the AdaBoost algorithm can be thought of as a stage-wise gradient descent procedure that minimizes an exponential loss function. In addition, three modifications of the original algorithm were proposed: Gentle, Real, and Logit boosting.
Algorithm 1: AdaBoost
1: Initialize weights $w_i = \frac{1}{n}$.
2: for $m = 1$ to $M$ do
3:   Fit $y = h_m(x)$ as the base weighted classifier using the $w_i$.
4:   Let $W_-(h_m) = \sum_{i=1}^{N} w_i I\{y_i h_m(x_i) = -1\}$ and $\alpha_m = \log\frac{1 - W_-(h_m)}{W_-(h_m)}$.
5:   Set $w_i \leftarrow w_i \exp\{\alpha_m I\{y_i \neq h_m(x_i)\}\}$, scaled to sum to one, for all $i \in \{1, \ldots, N\}$.
6: end for
A drawback of the Discrete AdaBoost algorithm (Algorithm 1) is that it produces at each stage as output only the object's predicted label. This coarse information may hinder the efficiency of the algorithm in finding an optimal classification model. To overcome this deficiency, several proposals have been put forth in the literature; for example, the algorithm can output a real-valued prediction rather than a class label at each stage of boosting.
The latter variant of boosting corresponds to the Real AdaBoost algorithm, where the class
probability estimate is converted using the half-log ratio to a real valued scale. This value is
then used to represent an observation’s contribution to the final overall model. Furthermore,
observation weights for subsequent iterations are updated according to the exponential loss
function of AdaBoost. Like Discrete AdaBoost, the Real AdaBoost algorithm attempts to minimize the expectation of $e^{-yF(x)}$. In general, these modifications allow Real AdaBoost to find an optimal classification model more efficiently than AdaBoost.M1. In addition to the Real AdaBoost modification, Friedman et al. (2000) proposed a further extension called the Gentle AdaBoost algorithm, which minimizes the exponential loss function of AdaBoost through a sequence of Newton steps. Although Real AdaBoost and Gentle AdaBoost optimize the same loss function and perform similarly on identical data sets, Gentle AdaBoost is numerically superior because it does not rely on the half-log ratio.
The algorithm in its general form can operate under an arbitrary loss function; ada implements both the exponential ($L(y, f) = e^{-yf}$) and logistic ($L(y, f) = \log(1 + e^{-yf})$) loss functions. The $\eta$ function specifies the type of boosting: discrete ($\eta(x) = \mathrm{sign}(x)$), real ($\eta(x) = 0.5 \log\frac{x}{1-x}$), and gentle ($\eta(x) = x$).
In the case of exponential loss, the line search step solution (Step 3, Algorithm 2) can be written as
$$\alpha_k = \alpha_{k-1} - (\eta^\top P(\alpha_{k-1})\eta)^{-1}(y^\top P(\alpha_{k-1})\eta), \quad \text{where } P(\alpha_{k-1}) = \mathrm{diag}(p_i(\alpha_{k-1})), \qquad (1)$$
$p_i(\alpha_{k-1}) = w_i e^{-\alpha y_i \eta_i}$, and $w_i = e^{-y_i F(x_i)}$. The final stageweight $\alpha_m = \alpha_\infty$ is used for Algorithm 2. For the logistic loss (L2 Boost) version, the line search corresponds to
$$\alpha_k = \alpha_{k-1} - (\eta^\top P(\alpha_{k-1})(1 - P(\alpha_{k-1}))\eta)^{-1}(y^\top P(\alpha_{k-1})\eta), \quad \text{where } P(\alpha_{k-1}) = \mathrm{diag}(p_i(\alpha_{k-1})), \qquad (2)$$
$p_i(\alpha_{k-1}) = \frac{w_i e^{-\alpha y_i \eta_i}}{1 + w_i e^{-\alpha y_i \eta_i}}$, and $w_i = e^{-y_i F(x_i)}$.
Next, we provide the details for adapting Algorithm 2 to perform each variant of boosting.
Remark: The ada package provides the flexibility to fit all the stageweights with a value of 1. This is known as ε-boosting, where one fits the ensemble with an arbitrarily small ν (Rosset et al. 2004).
Discrete AdaBoost
For Discrete AdaBoost, set $L(y, g) = e^{-yg} \Rightarrow w_i = -y_i e^{-y_i F_i}$ (exponential loss) and $\eta(x) = \mathrm{sign}(x)$. Using expression (1) with this value of $\eta$, the optimization problem has a closed form solution given by $\alpha_m = 0.5 \log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$. Therefore, the algorithm is fitting the original Discrete AdaBoost algorithm with a random sampling strategy at each iteration.
For Discrete L2 Boost, one optimizes $L(y, g) = \log(1 + e^{-y_i g_i}) \Rightarrow w_i = \frac{-y_i e^{-y_i F(x_i)}}{1 + e^{-y_i F(x_i)}}$. In this case, the stageweight does not have a closed form solution and the software solves (2) directly.
Real AdaBoost
For Real AdaBoost and Real L2 Boost, set $\eta(p) = \frac{1}{2}\log\frac{p}{1-p}$, where $p \in [0, 1]$ (i.e. a probability class estimate), and use the same weight as in Discrete AdaBoost. If $\alpha_m = 1$ is set for all $m$ and exponential loss is used, then Real AdaBoost coincides with the algorithm presented in Friedman et al. (2000). However, ada has the flexibility of optimizing (1) and (2) to determine the stageweights.
Gentle AdaBoost
For Gentle and Gentle L2 Boost, set $\eta(x) = x$. This algorithm requires fitting a regressor at each iteration and results in the original GentleBoost algorithm whenever $\alpha_m = 1$. As with Real boosting, the algorithm can solve the line search directly.
A close inspection reveals that if one executes Algorithm 2 with $\nu := 0$, then the ensemble of trees will be generated by random subsamples of the data with identical case weights (i.e. letting $F_\nu(x)$ be the ada output, then $F_0(x) = 0 \cdot \sum_{m=1}^{M} \alpha_m h_m(x)$). This is the exact tree fitting process used to bag trees (Breiman 1996), with the exception that in bagging the ensemble results as an average of the trees (i.e. $B(x) = \frac{1}{M}\sum_{m=1}^{M} h_m(x)$ is a bagged ensemble, but under the same random settings one would obtain the same trees as with $F_0(x)$). In light of this, we add a bag.shift=TRUE argument which supplies a post-processing shift of the ensemble towards bagging after constructing the ensemble (i.e. the final ensemble is $\tilde{F}_\nu(x) = (1 - \nu)B(x) + F_\nu(x)$, where $\tilde{F}_0(x) = B(x)$ equates to bagging). However, unlike bagging (with the exception of $\nu = 0$), the individual $h$'s are obtained via a greedy weighting process. In the case of ε-boosting (i.e. $\alpha_m := 1$), the resulting ensemble is an average over the $h$'s.
3. Implementation issues
In this section we discuss implementation issues for the ada package.
Figure 1: The functional flow of ada. The top section consists of the functions used to create the ada object, while the bottom section consists of the functions invoked on an initialized ada object. The yellow functions are called only by ada, the red functions are called by the user, and the green functions are called by both.
An object of class ada can be created by invoking the function ada, which calls either ada.formula or ada.default, depending on the input to ada. The ada.formula function is a generic formula wrapper for ada.default that allows the flexibility of a formula object as input. The ada.default function, in turn, calls the four functions that correspond to the boosting algorithms: discrete.ada, real.ada, logit.ada and gentle.ada. Once an object of class ada is created, the user can invoke the standard R commands predict, summary, print, pairs, update, or plot. In addition, ada includes the varplot and addtest functions, which we discuss below.
Setting rpart.control
The rpart function selects tree depth by using an internal complexity measure together with
cross-validation. In the case of SGB we have empirically noticed strong performance with
this automatic choice of tree depth (especially for larger data sets).
However, in deterministic gradient boosting, the tree size is usually selected a priori. For
example, stumps (2-split) or 4-split trees (e.g. split the data into a maximum of 4 groups)
are commonly selected as the weak learners. In rpart one should set cp=-1, which forces the
tree to split until the depth of the tree achieves the maxdepth setting. Thus, by specifying
the maxdepth argument, the number of splits can be controlled. It is important to note that
the maxdepth argument works on a log2 scale, hence the number of splits is a power of 2.
In small data sets, it is useful to appropriately specify the minsplit argument, in order to
ensure that at least one split will be obtained. Finally, it is worth noting that a theoretically
rigorous approach for setting the tree depth in any form of boosting is still an open problem
(Segal 2004).
The following code illustrates how to set the control parameters for generating stumps and 4-split trees:
> library("ada")
[Figure: scatter plot of the synthetic data, x2 versus x1.]
Discrete AdaBoost
To model the synthetic data sets with discrete AdaBoost under exponential loss (using stumps
with 50 iterations) call:
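A sketch of this call is given below. The object name gdis is taken from its later use with plot and predict; the stump control object is the hypothetical one defined earlier, and the synthetic train data frame (with response y in the first column) is assumed from the data-generation step, which is not shown.

> gdis <- ada(y ~ ., data = train, iter = 50, loss = "exponential",
+             type = "discrete", control = stump)
> gdis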
train.err1 train.kap1
47 50
Notice that the output gives the training error, the confusion matrix, and three estimates of the number of iterations. In this example, one could stop at 9 iterations (the out-of-bag, OOB, estimate), 47 iterations (the training error estimate), or 50 iterations (the kappa error estimate).
To add the testing data set to the model, simply use the addtest function. This function
allows us to evaluate the testing set without refitting the model.
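A sketch of this step, assuming (as in the calls shown later) that test holds the response in its first column:

> gdis <- addtest(gdis, test[, -1], test[, 1])
> gdis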
...
Estimates of number of iterations:
Real AdaBoost
Next we provide the code to create a Real AdaBoost ensemble with the ε-boosting modification, 4-split trees, and 1000 iterations. For additional convenience, the test set can be passed to the function.
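A sketch of such a call is given below. The object name greps is hypothetical, four is the 4-split control object sketched earlier, and model.coef = FALSE together with a small nu is assumed to be the way to request ε-boosting (fixed stageweights of 1).

> greps <- ada(y ~ ., data = train, test.x = test[, -1], test.y = test[, 1],
+              iter = 1000, type = "real", nu = 0.01, bag.frac = 1,
+              model.coef = FALSE, control = four)
> greps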
Notice that the out-of-bag error rate here is meaningless, since there are no subsamples in pure ε-boosting (i.e. bag.frac = 1). To perform stochastic gradient ε-boosting, simply set the bag.frac argument to a value less than 1 in the previous call (the default is bag.frac = 0.5).
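For instance (again a sketch, reusing the hypothetical names from above):

> greps2 <- ada(y ~ ., data = train, test.x = test[, -1], test.y = test[, 1],
+               iter = 1000, type = "real", nu = 0.01, bag.frac = 0.5,
+               model.coef = FALSE, control = four)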
Gentle AdaBoost
The following call provides a Gentle AdaBoost ensemble with 100 iterations, tree depth of 8,
ν = 0.1 (regularization), and the ensemble is shifted towards bagging using the bag.shift=TRUE
argument (Section 2.3).
> ggen <- ada(y~., data = train, test.x = test[,-1], test.y = test[,1],
+ iter = 100, type = "gentle", nu = 0.1, bag.shift = TRUE,
+ control = rpart.control(cp = -1, maxdepth = 8))
> ggen
L2 Boost
To call L2 Boost (boosting with logistic loss (Friedman 2001)) with gentle boost, invoke ada
in the following way.
> glog <- ada(y~., data = train, test.x = test[,-1], test.y = test[,1],
+ iter = 50, loss = "l", type = "gentle")
> glog
In addition to Gentle AdaBoost, the logistic loss function can be combined with type = "discrete" or type = "real".
Stageweight convergence
For many of these methods a Newton Step is necessary for the convergence of the stageweight
at each iteration. To see the convergence use the verbose=TRUE flag:
> greal <- ada(y~., data = train, test.x = test[,-1], test.y = test[,1],
+ iter = 50, type = "real", verbose = TRUE)
For this example the stageweights converged quickly. In some situations the convergence may
not be as fast; to increase the number of iterations for convergence purposes use the max.iter
argument.
> plot(gdis)
Notice that the training error steadily decreases across iterations. This shows that boosting
can effectively learn the features in a data set. However, a more valuable diagnostic plot would
involve the corresponding test set error plot. The following call creates both the training and
testing error plots side-by-side, as seen in Figure 3 (right).
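A sketch of such a call is shown below; the test argument of the plot method is assumed to toggle the test-set curve (the test data must have been attached via addtest or supplied at fitting time).

> plot(gdis, kappa = FALSE, test = TRUE)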
Figure 3: Training error by iteration number for the example data (left). The training and testing error by iteration number for the example data (right).
Remark: In many situations the class priors (proportion of true responses) are unbalanced;
in these situations, learning algorithms often focus on learning the larger set. Hence, errors
tend to appear low, but only because the algorithm classifies objects into the larger class. An
alternative measure to absolute classification error is the Kappa statistic (Cohen 1960), which
adjusts for class imbalances.
Training Results
...
Testing Results
 1  2
56 44
To get the probability class estimates for the training data input:
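A sketch of the call (the model object gdis and the convention that the response is in the first column of train are assumed from earlier):

> predict(gdis, train[, -1], type = "prob")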
[,1] [,2]
451 0.317362113 0.682637887
86 0.851147564 0.148852436
...
257 0.007765103 0.992234897
74 0.851265307 0.148734693
The first column provides the number of the actual observation in the original data set.
Remark: The probability class estimate for any boosting algorithm is defined as $\hat{P}(Y = 1 \mid x) = \frac{e^{2F(x)}}{1 + e^{2F(x)}}$. However, since the function $e^x$ is treated as infinite by R for large $x$, it is necessary to compute this value on the logarithmic scale and force it to 1 if $e^{2F(x)} = \infty$. As a result, one cannot recover the original $F$ from this transformation, which is needed for the multi-class case. This is the rationale for having the option of setting type = "F". The usage of this argument setting is shown in the multi-class example in the next section.
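As a small illustration of this computation (not the package's internal code), the probability can be obtained on a numerically stable scale from the ensemble scores returned by type = "F":

> Fscore <- predict(gdis, train[, -1], type = "F")
> plogis(2 * Fscore)    # exp(2F) / (1 + exp(2F)), computed without overflow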
Remark: The newdata option requires a data.frame of observations with exactly the same variable names as the training data. If a matrix is used to represent the training data, then the variable names will most likely be the defaults $V_1, \ldots, V_p$. The following code shows how to change the names of the columns when the training data are in a matrix or a data.frame format, respectively.
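For illustration (the object names newX and newdat are hypothetical):

> colnames(newX) <- paste("V", 1:ncol(newX), sep = "")     # training data stored as a matrix
> names(newdat) <- paste("V", 1:ncol(newdat), sep = "")    # training data stored as a data.frame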
> glog <- update(glog, train[,-1], train[,1], test[,-1], test[,1], n.iter = 50)
> glog
> varplot(ggen)
Figure 4 shows the variable importance scores for Gentle AdaBoost.
To obtain the variable scores directly (without a plot) use the following code.
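A sketch of the call, assuming varplot accepts plot.it = FALSE with type = "scores" to return the scores without plotting:

> varplot(ggen, plot.it = FALSE, type = "scores")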
x1 x6 x5 x10 x2 x4 x9 x3 x8 x7
0.0070 0.0036 0.0035 0.0034 0.0028 0.0027 0.0025 0.0024 0.0023 0.0017
[Figure 4: variable importance plot for Gentle AdaBoost, variables ordered by score.]
The pairs function can also be used to explore the relationships among variables in the test set. If the response for the test set is unknown, the observations will be plotted in black.
As a final note, to view both the testing and training data on the same plot, leave test.only as FALSE and issue the above command with either var=· · · for a specific set of variables or maxvar=k for the top k variables.
5. Examples
The examples below will illustrate various aspects and uses of the ada package.
[Figure: pairs plot of variables x1, x6, and x5.]
[Figure: pairs plot of variables x1 and x2.]
A 400-iteration Discrete AdaBoost ensemble using stumps was fitted to a training set of 2000
observations using the following code:
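A sketch consistent with the printed Call below; the stump control settings and the object name gdis400 are assumptions, while the remaining arguments are confirmed by the Call output.

> control <- rpart.control(cp = -1, maxdepth = 1, minsplit = 0, xval = 0)
> gdis400 <- ada(y ~ ., data = train, iter = 400, bag.frac = 1, nu = 1,
+                control = control, test.x = test[, -1], test.y = test[, 1])
> gdis400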
Call:
ada(y ~ ., data = train, iter = 400, bag.frac = 1, nu = 1, control = control,
test.x = test[, -1], test.y = test[, 1])
The summary command is used below to present the test set performance of the boosted
model.
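(Again a sketch, using the hypothetical object name from above.)

> summary(gdis400)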
Call:
ada(y ~ ., data = train, iter = 400, bag.frac = 1, nu = 1, control = control,
    test.x = test[, -1], test.y = test[, 1])
[Figure: training and testing error and kappa accuracy by iteration, 1 to 400.]
Notice that the testing error is 11.1%, which agrees with the results found in (Hastie et al.
2001).
Remark: The variables in this example are all equally important by construction, and therefore
the diagnostics for variable selection and pairwise plots are not shown.
For this example, a solubility screen on a collection of compounds was performed. Based on this screen, compounds were categorized as either insoluble (n = 3493) or soluble (n = 2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. Of these descriptors, one contained missing values for approximately 14% (n = 787) of the observations. The objective of the analysis is to model the relationship between the structural descriptors and the solubility class.
For modeling purposes, the original data set was randomly partitioned into training (50%),
test (30%), and validation (20%) sets. The data will be called soldat and the compound labels
and variable names have been blinded for this illustration.
> data("soldat")
> n <- nrow(soldat)
> set.seed(100)
> ind <- sample(1:n)
> trainval <- ceiling(n * .5)
> testval <- ceiling(n * .3)
> train <- soldat[ind[1:trainval],]
> test <- soldat[ind[(trainval + 1):(trainval + testval)],]
> valid <- soldat[ind[(trainval + testval + 1):n],]
Gentle AdaBoost with default settings was used on the training set. This data set contains a descriptor with missing values; recall that the default setting is na.action = na.rpart. This option allows rpart to search all descriptors, including those with missing values, using surrogate splits (Breiman, Friedman, Olshen, and Stone 1984) to find the best descriptor for splitting purposes.
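A sketch of such a fit is given below; the response column name y, the 70 iterations suggested by Figure 8, and the use of addtest to attach the test (Test1) and validation (Test2) sets are assumptions.

> gen1 <- ada(y ~ ., data = train, iter = 70, type = "gentle",
+             na.action = na.rpart)
> gen1 <- addtest(gen1, subset(test, select = -y), test$y)
> gen1 <- addtest(gen1, subset(valid, select = -y), valid$y)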
The summary function can then be used to evaluate the performance of the model on the test
data:
> summary(gen1)
Figure 8: Training, testing, and kappa error values by iteration for the solubility data.
Testing accuracy rates are printed in the order in which the test sets were entered, so the accuracy on the testing set is 0.765 and on the validation set 0.781.
For this type of early drug discovery data, the Gentle AdaBoost algorithm performs adequately
with test set accuracy of 76.5% (kappa≈ 0.5). Figure 8 also illustrates the model’s performance
for the training and test sets across iterations.
In order to enhance our understanding regarding the relationship between descriptors and the
response, the varplot function was employed. It can be seen from Figure 9 that variable 5
is the most important one.
> varplot(gen1)
The features are quite noisy and the variables are correlated, which makes variable importance difficult to ascertain from a single run. Next we provide a small run to obtain the average variable importance over 20 model fits.
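The authors' exact code is not shown here; the sketch below conveys the idea under stated assumptions (response column named y, 20 refits of the Gentle AdaBoost model, and scores extracted via varplot with type = "scores").

> vars <- setdiff(names(train), "y")
> scores <- matrix(0, 20, length(vars), dimnames = list(NULL, vars))
> for (i in 1:20) {
+   fit <- ada(y ~ ., data = train, iter = 70, type = "gentle")
+   s <- varplot(fit, plot.it = FALSE, type = "scores",
+                max.var.show = length(vars))
+   scores[i, names(s)] <- s
+ }
> sort(colMeans(scores), decreasing = TRUE)[1:11]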
Figure 9: Variable importance for solubility data one run (left), and average (right).
According to the average variable importance, eight of the top eleven variables measure various
aspects of a compound’s hydrophilicity and hydrophobicity, which makes scientific sense in this
context. These various hydrophilicity and hydrophobicity measures account for a molecule’s
tendency to dissolve in water – key measures for determining a molecule’s solubility.
Remark: In practice, we have found the average variable importance informative and well worth the time it takes to compute. Also, given the size of this data set, it may be convenient to invoke R with --max-mem-size=2Gb.
This produces a three class analog to the example given above. For this example, a training
and a testing data set which comprised of 200 and 1000 observations, respectively, were
generated.
A 250-iteration stochastic version of Discrete AdaBoost ensemble was constructed using de-
fault trees as the base learner. The implementation of the one-versus-all strategy is given
next:
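A sketch of this strategy is given below; the objects x, y, indtrain, and indtest come from the (not shown) data-generation step, while the names train.x, test, gall, probs, and pred are illustrative.

> K <- 3
> train.x <- data.frame(x[indtrain, ])
> test <- data.frame(x[indtest, ])
> gall <- lapply(1:K, function(k)
+   ada(train.x, factor(ifelse(y[indtrain] == k, 1, -1)), iter = 250))
> probs <- sapply(gall, function(g) predict(g, test, type = "prob")[, 2])
> pred <- apply(probs, 1, which.max)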
The in-class test error rate and the total test error rate are obtained below using the sapply and table commands in R. For more information on these commands refer to Becker, Chambers, and Wilks (1988).
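A sketch of the computation, using the pred vector from the sketch above (tab is a hypothetical name for the resulting confusion table):

> tab <- table(y[indtest], pred)
> sapply(1:K, function(i) round(1 - tab[i, i] / sum(tab[i, ]), 3))  # in-class error rates
> round(1 - sum(diag(tab)) / sum(tab), 3)                           # total test error rate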
[1] 0.416
Notice that although the test error rate appears high, it is still substantially lower than random guessing (0.66). Also, the in-class error rate for class two seems to be rather high.
In order to assess the magnitude of the error rate, a random forest classifier (Breiman 2001)
was considered. The choice of a random forest as an ensemble method is due to its ability
to handle multi-class problems and its overall competitive performance. The following code
using the R package randomForest constructs such an ensemble for the data at hand (Liaw
and Wiener 2002).
> library("randomForest")
> train <- data.frame(y=as.factor(y[indtrain]), x[indtrain,])
> set.seed(100)
> grf <- randomForest(y~., train)
> tab3 <- table(y[indtest], predict(grf, test))
> for(i in 1:K){
+   cat("In class error rate for class ", i, ": ",
+       round(1 - tab3[i, i] / sum(tab3[i, ]), 3), "\n")
+ }
[1] 0.415
It can be seen that the overall test error rate (0.415) is about the same as that produced by Discrete AdaBoost, which strongly indicates that the combination of ada and the one-versus-all strategy can easily handle multi-class classification problems.
Acknowledgements
The authors would like to thank the Editor Jan de Leeuw, an Associate Editor and two
anonymous referees for useful comments and suggestions. The work of George Michailidis
was supported in part by NIH grant P41 RR18627-01 and by NSF grant DMS 0204247.
References
Cohen J (1960). "A Coefficient of Agreement for Nominal Scales." Educational and Psychological Measurement, 20, 37–46.
Dettling M (2004). "BagBoosting for Tumor Classification with Gene Expression Data." Bioinformatics, 20(18), 3583–3593. doi:10.1093/bioinformatics/bth447.
Friedman J (2002). "Stochastic Gradient Boosting." Computational Statistics & Data Analysis, 38(4), 367–378. doi:10.1016/S0167-9473(01)00065-2.
Lemmens A, Croux C (2005). "Bagging and Boosting Classification Trees to Predict Churn." Journal of Marketing Research, 43(2), 276–286.
R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
Ridgeway G (2006). gbm: Generalized Boosted Regression Models. R package version 1.5-7, URL http://www.i-pensieri.com/gregr/gbm.shtml.
Schapire R (1990). "The Strength of Weak Learnability." Machine Learning, 5(2), 197–227. doi:10.1023/A:1022648800760.
Segal M (2004). "Machine Learning Benchmarks and Random Forest Regression." Technical report, Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco, CA. URL http://repositories.cdlib.org/cbmb/bench_rf_regn/.
Valiant L (1984). "A Theory of the Learnable." In "Proceedings of the 16th Annual ACM Symposium on Theory of Computing," pp. 436–445. ACM Press, New York, NY. ISBN 0-89791-133-4. doi:10.1145/800057.808710.
Affiliation:
Mark Culp
Department of Statistics
University of Michigan
436 West Hall, 550 East University
Ann Arbor, MI 48109, United States of America
E-mail: culpm@umich.edu
URL: http://www.stat.lsa.umich.edu/~culpm/
Kjell Johnson
E-mail: Kjell.Johnson@pfizer.com
George Michailidis
E-mail: gmichail@umich.edu
URL: http://www.stat.lsa.umich.edu/~gmichail/