Prediction Games and Arcing Algorithms
Leo Breiman
leo@stat.berkeley.edu
Abstract
Recent empirical work has shown that combining predictors can lead to
significant reductions in generalization error (Breiman [1996, 1997], Drucker
and Cortes [1995], Quinlan [1996], Freund and Schapire [1996a], Kong and
Dietterich [1996], Bauer and Kohavi [1998]). Interestingly, the individual
predictors can be very simple, e.g. single hyperplanes in two-class classification
(Ji and Ma [1997]).
In regression, the combined predictor is
∑m cm hm (x) .
In classification, the combined classifier predicts that class having the plurality of the
weighted votes. That is, the predicted class is
arg maxy ∑m cm I(hm (x) = y)
where I(true)=1, I(false)=0, and y ranges over the set of class labels.
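As a concrete illustration, the weighted plurality vote can be sketched as follows. This is a minimal sketch; the helper name `combined_predict` and the integer label encoding are our own assumptions, not the paper's:

```python
import numpy as np

def combined_predict(h_preds, c, labels):
    """Predict the class having the plurality of the weighted votes.

    h_preds: (M, N) array, h_preds[m, n] = class predicted by h_m on x_n.
    c: length-M non-negative weights summing to one.
    labels: the possible class labels (hypothetical encoding: integers).
    """
    votes = np.array([
        [sum(c[m] for m in range(len(c)) if h_preds[m, n] == y)
         for y in labels]
        for n in range(h_preds.shape[1])
    ])
    # for each instance, return the label with the largest weighted vote
    return np.array(labels)[votes.argmax(axis=1)]
```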
In regression, the error at z is
rs(z,c) = (y − ∑m cm hm (x))² ,
and in classification the margin error is
mg(z,c) = maxy'≠y ∑m cm I(hm (x) = y') − ∑m cm I(hm (x) = y) .
If the margin error at z is negative, the right class is predicted; the more
negative it is, the larger "the margin of correctness".
The prediction error of the combined predictor is
E(rs(Z,c)) regression
P(mg(Z,c) > 0) classification.
Let er(z,c) equal rs(z,c) in regression and mg(z,c) in classification. One way to
formulate the idea that er(z,c) is generally small for zεT is by requiring
uniform smallness.
Definition 2.2 The prediction game is a two player zero-sum game. Player I
chooses zn εT . Player II chooses {cm }. Player I wins the amount er(zn ,c) .
In both regression and classification er(zn ,c) is continuous and convex in c for
each zn εT fixed. By standard game theory results (Blackwell and Girshick
[1954]), Player II has a good pure strategy, Player I has a good mixed strategy (a
probability measure on the instances in T) and the value of the game φ * is
given by the minimax theorem:
φ * = minc top(c) = maxQ minc EQ er(z,c) , where top(c) = supQ EQ er(z,c)
and Q ranges over probability measures on the instances in T.
erM (z,c) = ∑m cm l(y,hm (x))
Regression: l(y,hm (x)) = (y − hm (x))²
Classification: l(y,hm (x)) = I(y ≠ hm (x))
where l(y,hm (x)) is a non-negative loss function and the subscript M refers
to the matrix game. Choosing c to make erM (z,c) small on T will force
er(z,c) to be small. In particular: for regression
top(c)≤topM (c)
and in classification
top(c) ≤ 2 topM (c) − 1 ,
with equality if there are only two classes. To simplify notation, denote
lm (z)=l(y,hm (x)).
Definition 3.1 The dominating matrix game is a two person zero-sum game.
Player I chooses zn εT . Player II chooses m. Player I wins the amount lm (zn ) .
In this simpler game, there is a good mixed strategy {cm } for Player II and a
good mixed strategy for Player I consisting of a probability measure over the
instances in T. The value of the game φ * is given by:
φ * = minc maxn erM (zn ,c) = maxQ minm EQ lm (z) . (3.3)
The relations (2.2) and (3.3) will be the keys in our construction of effective
algorithms to find optimum c values. The definition of the matrix game for
classification appeared in Freund and Schapire[1996b].
The relation (3.3) also follows from putting the game into a linear
programming context. Define φ * to be the minimum value of top(c) .
Consider minimizing the function ∑n (erM (zn ,c) − φ )+ where x+ is the positive
part of x. If φ > φ * then the minimum is zero. Denoting pn = Q(zn ), the
minimization can also be cast as the linear programming problem: minimize
∑n un pn under the constraints
λn ≥ 0, λn ≤ pn , z ≤ ∑n λn lm (zn ) .
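Because the dominating game is a finite matrix game, φ * can also be approximated without solving the linear program, for instance by fictitious play (Robinson [1951] in the references). The sketch below is ours: the payoff matrix L[m,n] = lm (zn ) and all names are illustrative choices, not the paper's notation:

```python
import numpy as np

def fictitious_play_value(L, iters=50000):
    """Approximate the matrix game value by fictitious play (Robinson [1951]).

    L[m, n] is the amount Player I wins when Player II picks predictor m
    and Player I picks instance n.  Returns (c, top) where c is Player II's
    empirical mixture and top = max_n sum_m c[m] L[m, n] upper-bounds phi*.
    """
    M, N = L.shape
    b = np.zeros(M)          # Player II's cumulative row choices
    q = np.zeros(N)          # Player I's cumulative column choices
    b[0] = q[0] = 1.0        # arbitrary starting plays
    for _ in range(iters):
        q[int((L.T @ b).argmax())] += 1   # Player I best-responds
        b[int((L @ q).argmin())] += 1     # Player II best-responds
    c = b / b.sum()
    return c, float((L.T @ c).max())
```

On the 2x2 identity-payoff game the value is 1/2, with Player II mixing equally; fictitious play approaches this, though far more slowly than a linear programming solve would.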
Let Z = (Y,X ) be a random vector having the same distribution that the
instances in T were drawn from but independent of T and denote
er M (Z,c)= ∑ m cmlm (Z) . Denote by P̃ the probability on the set of all N-instance
training sets such that each one is drawn from the distribution P. Assume
there is a constant B such that for all m, 0 ≤ lm (Z) ≤ B. Set δ > 0 and let
R = 8B² log(2M)/(N∆²) .
Except for a set of training sets with P̃ probability ≤ δ , for every ∆ and c
The proof of this theorem is given in the Appendix. The bound in Schapire
et al. [1997] is about the square root of the bound in (4.1). The additional
sharpness comes from using the uniform bound given by topM (c). The
bound gives non-trivial results only if R is small. We give this theorem and
its proof mainly as a comfort factor, hopefully pointing in the right direction.
Generalization to infinite sets of predictors can be given in terms of their VC-
dimension (see Schapire et al. [1997]).
For the matrix game, the main problem is how to determine c so that
er M (z,c) is generally small for z in T. This can be formulated in different
ways as a minimization problem to which standard optimization methods
can be applied. For instance, a linear programming method can be used to
find a c such that φ *=topM (c).
But such methods are not feasible in practice. Typically, the set of classifiers is
large and complex. It would be extremely difficult to work with more than
one at a time as required by a standard optimization method. The essence of
feasible algorithms is that it is possible to solve, in practice, problems of the
following type: given any probability measure Q on the instances in T, find
the predictor hm that minimizes EQ l(y,hm (x)).
This means, in regression, find that predictor having the lowest Q-weighted
least squares error. In classification, minimize the Q-weighted
misclassification rate. This is not always exactly possible. For example, the
CART algorithm does not find that J-terminal node tree having minimum
Q-weighted error. Instead, it uses a greedy algorithm to approximate the
minimizing tree. In the theory below, we will assume that it is possible to
find the minimizing h m , keeping in mind that this may be only
approximately true in practice.
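As an illustration of this weak-learning step, a single-split "stump" over one input coordinate can be searched exhaustively for the Q-weighted minimizer. This is an illustrative stand-in for CART, not the paper's procedure; the helper name and the y ∈ {1, 2} encoding are our assumptions:

```python
import numpy as np

def best_stump(x, y, Q):
    """Find the threshold classifier minimizing Q-weighted misclassification.

    x: 1-D input values; y: labels in {1, 2}; Q: probability vector over
    the instances.  Returns (error, threshold, sign).
    """
    best = (np.inf, None, None)
    for thr in np.unique(x):
        for sign in (1, -1):
            # classify as 1 on one side of the threshold, 2 on the other
            pred = np.where(sign * (x - thr) > 0, 1, 2)
            err = (Q * (pred != y)).sum()
            if err < best[0]:
                best = (err, thr, sign)
    return best
```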
Definition 5.1 An arcing algorithm for the matrix game works on a vector b
of non-negative weights such that bm is the weight assigned to predictor hm
and the c vector is given by b/|b|where b = ∑ bm . The algorithm updates b
in these steps:
i) Given Qk , select mk to minimize the Qk -weighted error; for m = mk let
εk = ∑n lm (zn )Qk (zn ) .
ii) Update the instance weights, either (Adaboost-type) with βk = (1 − εk )/εk by
Qk+1 (zn ) = Qk (zn )(βk )lm (zn ) /S
or (arc-x4-type) by
Qk+1 (zn ) = (1 + (ms(k) (zn ))4 )/S ,
where ms(k) (zn ) is the number of misclassifications of zn by the first k
classifiers and S normalizes Qk+1 to sum to one.
iii) If hm is selected at the kth step, increase bm by one.
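The Adaboost-type update can be sketched for the matrix game with 0-1 losses as follows. This is a minimal sketch under our own choices (the stopping rule at ε ≥ 1/2 and the function name are ours):

```python
import numpy as np

def adaboost_arc(losses, K):
    """Adaboost-type arcing for the matrix game with 0-1 losses (a sketch).

    losses[m, n] = l_m(z_n).  Returns the normalized weight vector c = b/|b|.
    """
    M, N = losses.shape
    Q = np.full(N, 1.0 / N)      # Q_1(z_n) = 1/N
    b = np.zeros(M)
    for _ in range(K):
        errs = losses @ Q        # eps for every candidate predictor
        m = int(errs.argmin())   # select the Q_k-minimizing predictor
        eps = errs[m]
        if eps <= 0.0 or eps >= 0.5:
            break                # weak-learning condition fails; stop
        beta = (1.0 - eps) / eps # beta_k > 1: upweight the errors
        Q = Q * beta ** losses[m]
        Q = Q / Q.sum()          # S normalizes Q_{k+1}
        b[m] += 1.0              # increase b_m by one
    return b / b.sum() if b.sum() > 0 else b
```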
This method applies only to two class problems. The set of classifiers H is
defined this way: To each direction in input space and point x n in the
training set corresponds two classifiers. Form the hyperplane passing
through x n perpendicular to the given direction. The first classifier classifies
all the points on one side as class 1 and on the other as class 2. The second
classifier switches the class assignment.
Set two parameters α >0, η >0 such that .5- η < α <.5. After the kth classifier is
selected, increase its b weight by one. Let Qk+1 (zn ) be proportional to
I(er(zn ,c(k) ) − α > 0), where I(·) is the indicator function. Select a hyperplane direction and training
set instance at random. Compute the classification error for each of the two
associated classifiers using the probabilities Qk +1 (z n ). If the smaller of the two
errors is less than .5 − η then keep the corresponding classifier. Otherwise,
reject both and select another random hyperplane.
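The Q-weighted errors of the two classifiers attached to one hyperplane can be computed as below. This is an illustrative sketch; the names and the {1, 2} label encoding are our assumptions:

```python
import numpy as np

def hyperplane_pair_errors(X, y, Q, direction, x_anchor):
    """Q-weighted errors of the two classifiers defined by one hyperplane.

    The hyperplane passes through x_anchor, perpendicular to `direction`;
    the second classifier swaps the class assignment of the first.
    """
    side = (X - x_anchor) @ direction > 0     # which side of the plane
    pred1 = np.where(side, 1, 2)              # first classifier
    pred2 = np.where(side, 2, 1)              # its class-swapped twin
    return (Q * (pred1 != y)).sum(), (Q * (pred2 != y)).sum()
```

The Ji-Ma step would then keep the classifier of the pair whose error is below .5 − η, if either is.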
The existing arcing algorithms (see also Leisch and Hornik[1997]) fall into two
different types which will be defined and discussed in the next two sections.
In these sections, we let em = {zn ; yn ≠ hm (xn )}, i.e. em is the set of instances for
which lm (zn ) = 1.
so the first partial derivative of the target function g(b) is smallest in the
direction of bm* . This value is negative because, by the minimax theorem,
minm EQ (em ) ≤ φ *. Furthermore, the 2nd derivative of g(b + ∆um* ) with respect
to ∆ is positive, ensuring a unique minimum in the line search over ∆ > 0.
proof. Clearly, g(b(k ) ) is decreasing in k. It suffices to show that |b(k ) |→∞ since
writing
shows that if there is a subsequence k' such that along this subsequence
top(c(k' ) )→φ1 >φ , then g(b(k' ) )→∞ . If |b(k ) | does not go to infinity, then there is
at least one finite limit point b*. But every time that b (k) is in the vicinity of
b*, g(b (k) ) decreases in the next step by at least a fixed amount δ >0 . Since
g(b(k)) is non-negative, this is not possible.
Comments: i) From this argument, it is clear that cruder algorithms would
also give convergence, since all that is needed is to generate a sequence
|b(k) | → ∞ such that g(b(k) ) stays bounded. In particular, the line search can be
avoided (see Section 8). ii) If we take (w.l.o.g.) g(0) = 1, then at the kth stage
#{n; er(zn ,c(k) ) > φ} ≤ Ng(b(k) ).
Proposition 7.3 Adaboost is an unnormalized arcing algorithm using f(x) = ex
and φ = 1/2.
Proof: For f(x) = ex ,
f(er(zn ,b) − φ|b|) = e−φ|b| ∏m ebm lm (zn ) .
The normalized algorithm minimizes g(c)= ∑ n f (er(z n ,c)) where f'(x) is non-
negative and f''(x) is continuous and non-negative for all x ∈ [0,1]. Let
c=b/|b|.
Take Q(zn ) proportional to f '(er(zn ,c)) and m* = argminm Q(em ). If Q(em* ) ≥ EQ (er(z,c)) then stop. Otherwise let
bm* =bm* +1 and repeat.
Comments: Since
∂g(c)/∂bm = (1/|b|) ∑n (lm (zn ) − er(zn ,c)) f '(er(zn ,c)) ,
the smallest partial derivative is at m = m*.
Theorem 8.2 Let c be any stopping or limit point of the normalized arc
algorithm. Then c is a global minimum of g(c).
proof: Suppose there is a φ ,0<φ <1 such that f'(x)>0 for x> φ and zero for x ≤ φ .
If φ < φ * or if f'(x)>0 for all x then ∑ n f '(er(z n ,c))=0 is not possible. We treat this
case first. Suppose the algorithm stops after a finite number of steps because
Q(em* ) ≥ EQ (er(z,c)). Then
Consider the problem of minimizing g(c) under non-negativity and sum one
constraints on c. The Kuhn-Tucker necessary conditions are that there exist
numbers λ and µ m ≥0 such that if cm >0, then ∂ g(c)/ ∂ cm = λ . If cm =0, then
∂ g(c)/ ∂ cm = λ + µ m . These conditions follow from (8.1) and (8.2). Because g(c)
is convex in c these conditions are also sufficient.
Now suppose that the algorithm does not stop after a finite number of steps.
After the kth step, let c (k+1) be the updated c (k) and mk the index of the
minimizing classifier at the kth step. Then
er(zn ,c(k+1) ) − er(zn ,c(k) ) = (lmk (zn ) − er(zn ,c(k) ))/(k+1)   (8.3)
Denote the right hand side of (8.3) by δk (zn )/(k+1). Using a partial Taylor
expansion gives
g(c(k+1) ) − g(c(k) ) = (1/(k+1)) ∑n δk (zn ) f '(er(zn ,c(k) )) + γ /(k+1)²   (8.4)
The first term on the right in (8.4) is negative for all k. Since g(c) is bounded
below for all c,
∑k (1/(k+1)) |∑n δk (zn ) f '(er(zn ,c(k) ))| < ∞ ,   (8.5)
so there is a subsequence of k along which
∑n δk (zn ) f '(er(zn ,c(k) )) → 0 .   (8.6)
Take a subsequence of the k for which (8.6) holds such that mk → m*, c(k) → c.
Then the situation of (8.2) is in force and c is a global minimum.
Furthermore, since the first term on the right of (8.4) is negative (non-
stopping), (8.4) implies that the entire sequence g(c(k) ) converges. Thus,
all limits or stopping points of the c(k) sequence are global minimum points
of g(c).
∑ n (lm* (z n )−er(z n ,c)) f ' (er(z n ,c))≤( φ *− φ ) ∑ n f ' (er(z n ,c) ) (8.7)
If φ > φ * the right side of (8.7) is strictly negative and the algorithm never stops.
Then (8.4) gives a subsequence satisfying (8.6). For any limit point c and m*,
∑ n (lm* (z n )−er(z n ,c)) f '(er(z n ,c))=0 which implies top(c)≤ φ and g(c)=0. If φ = φ * and
the algorithm stops, ∑ n (lm* (z n )−er(z n ,c)) f '(er(z n ,c))=0, implying top(c)≤ φ . If it
does not stop, the same conclusion is reached. In either case, we get
g(c)=Nf(0).
One version of the normalized algorithm starts with a function q(x) defined
on [-1,1] such that q'(x) is zero for x ≤ 0, positive for x > 0, and q'' continuous,
bounded and non-negative. For φ > φ * define f(x) = q(x − φ ). Applying the
algorithm to this f(x) gives the following result:
Corollary 8.3 For c any limit or stopping point of the normalized algorithm,
top(c)≤ φ
Proof. Take φ > φ * and consider trying to minimize ∑ n (er(z n ,c)− φ )+ using the
normalized algorithm. At each stage, the current probability Q( z n ) is
proportional to I(er(z n ,c)− φ >0) where I ( • ) is the indicator function, and this
is the Ji-Ma reweighting. In the standard form of the normalized algorithm,
em* minimizes Q(em ) and the corresponding b value is increased by one.
Because x+ does not have a bounded 2nd derivative and because the Ji-Ma
algorithm does only a restricted search for the minimizing e m , the
normalized arc algorithm has to be modified a bit to make the convergence
proof work.
Take ε >0 small, and q(x) on [-1,1] such that q'(x)=0 for x ≤ 0, q'(x)=1 for x> ε , and
q'(x) in [0, ε ] rising smoothly from 0 to 1 so that q"(x) is continuous, bounded
and non-negative on [-1,1]. Now let φ > φ * and consider minimizing
∑ n q(er(z n ,c)− φ ) . Take δ >0 and at each stage, search randomly to find a
classifier hm such that Q(em )≤ φ *+ δ . Then as long as φ *+ δ − φ <0 , the result of
Theorem 8.2 holds.
The original Ji-Ma algorithm sets the values of two parameters α > 0, η > 0. In
our notation φ = α , φ * + δ = .5 − η . Ji and Ma set the values of α , η by an
experimental search. This is not surprising since the value of φ * is unknown.
This section describes an arcing algorithm we call arc-gv (gv=game value) and
proves that top(c(k ) )→φ *. Let B be such that for all m,n 0≤lm (z n )≤B and set ∆
to be a number much larger than 1/B. The algorithm generates a sequence of
weight vectors b(k) , and the normed weights are c(k) = b(k) /|b(k) |. Denote
tk = top(c(k) ). Initialize by taking b(1) = 0 and Q1 (zn ) = 1/N, all n.
i) Let
Qk (zn ) = exp(er(zn ,b(k) ) − tk |b(k) |)/gk , where gk = ∑n exp(er(zn ,b(k) ) − tk |b(k) |).
ii) Select mk+1 to minimize EQk lm . If minm EQk lm ≥ tk , stop.
iii) Otherwise, let ∆k minimize Θ(mk+1 ,b(k) ,∆) over [0,∆] and increase the
mk+1 st coordinate of b(k) by ∆k to get b(k+1) .
Theorem 9.2 If arc-gv stops at the kth step, then top(c(k ) )= φ *. If it does not
stop at any finite step, then lim k top(c(k ) )= φ *.
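A simplified arc-gv step can be sketched as follows, replacing the line search by unit steps (which, by the comments in Section 7, should not destroy convergence); the simplification and all names are ours:

```python
import numpy as np

def arc_gv_sketch(losses, K):
    """Unit-step sketch of arc-gv for the matrix game.

    losses[m, n] = l_m(z_n) in [0, 1].  Instance weights follow
    Q_b(z_n) ~ exp(er(z_n, b) - t|b|); stop when min_m E_Q l_m >= t.
    """
    M, N = losses.shape
    b = np.zeros(M)
    b[int(losses.mean(axis=1).argmin())] = 1.0   # arbitrary first pick
    for _ in range(K):
        c = b / b.sum()
        er = c @ losses                  # er(z_n, c)
        t = er.max()                     # t_k = top(c^(k))
        Q = np.exp(b.sum() * (er - t))   # exp(er(z_n, b) - t|b|)
        Q = Q / Q.sum()
        m = int((losses @ Q).argmin())
        if (losses @ Q)[m] >= t:
            break                        # stopping condition of arc-gv
        b[m] += 1.0                      # unit step instead of line search
    return b / b.sum()
```

On the 2x2 identity-loss game this reaches c = (1/2, 1/2) with top(c) = φ * = 1/2.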
Qb (z n )≈exp(er(z n ,b)−t|b|)
Consider increasing the mth coordinate of b by the amount ∆ to get b'. Let
where t'=top(c').
min ∆ Θ(m,b,∆)≤1−.5( µ b / B) 2
Now,
Θ''(α∆) = EQb [(lm − t)² exp(α∆(lm − t))] ≤ B² Θ(α∆) ,
so
Θ(∆*) ≤ 1 − µb² /2B² .   (9.3)
To analyze the behavior of arc-gv we use the notation that b(k) is the
vector of weights after the kth step, Ek the expectation with respect to Qb(k) ,
tk = top(c(k) ), µk = µb(k) , Θk the minimum of Θ(mk+1 ,b(k) ,∆) over the interval
[0,∆], ∆k the minimizing value of ∆, and gk = g(b(k) ). By (9.1)
For any b, since minm EQb lm ≤ φ *, minm EQb lm ≤ top(c), with equality only if
top(c) = φ *. Now µk = tk − minm Ek lm , so µk ≤ 0 only if tk = φ *. But this is just the
stopping condition. If there is no stopping then all µj > 0 and
Since log(gk +1 )≥0 the sum on the right of (9.6) must be bounded below.
Take a subsequence {k'} such that tk' +1 →limsuptk =t and look at (9.6) along this
subsequence assuming t = φ *+ δ where δ >0 . Let Nk' be the number of terms
in the sum in (9.6) that are positive. We claim that sup k' Nk' <∞. To show this,
suppose the jth term is positive, i.e.
If t j ≥ φ *+ τ , τ >0 then µ j ≥ τ . This implies that for k' sufficiently large, there is
a fixed ε >0 such that if (9.7) is satisfied, then t j ≥t + ε . But this can happen at
most a finite number of times.
Let the sum of the positive terms in (9.6) plus log(N) be bounded by S. Fix
ε >0 . In the negative terms in the sum, let j' index those for which |t j −tk' +1|≤ ε .
Then
Take ε ≤ δ /2. For k' large enough and all j', tj' > φ * + δ /2 and µj'² /2B² ≥ δ² /8B².
Taking ε so that ε∆ < δ² /16B² shows that the number of terms in the (9.6)
sum such that |tj − tk'+1 | ≤ ε is uniformly bounded. This contradicts the fact that
the tk'+1 sequence converges to a limit point unless limsup tk = φ *.
Setting the derivative of Θ with respect to ∆ equal to zero and solving gives
∆ = log[ t(1 − q) / ((1 − t)q) ] .
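For 0-1 losses, Θ(∆) = EQ exp(∆(lm − t)) = q e∆(1−t) + (1 − q)e−∆t with q = Q(em ), and the display above is its minimizer. A quick numeric check, with illustrative values of q and t chosen by us:

```python
import math

def theta(delta, q, t):
    """Theta(delta) = E_Q exp(delta * (l_m - t)) for 0-1 losses l_m,
    where q = Q(e_m) is the Q-weighted error of classifier m."""
    return q * math.exp(delta * (1 - t)) + (1 - q) * math.exp(-delta * t)

q, t = 0.3, 0.45                                  # illustrative values, q < t
d_star = math.log(t * (1 - q) / ((1 - t) * q))    # closed-form minimizer
# Theta at the closed-form step is below Theta at nearby steps:
assert theta(d_star, q, t) < theta(d_star - 0.1, q, t)
assert theta(d_star, q, t) < theta(d_star + 0.1, q, t)
```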
For optimal c and Q,
minm EQ lm = top(c) = φ * .
If p(j|x), j = 1,2 are the probabilities of classes 1,2 at the point x then the
ε-boundary between the classes is defined by BDε = {x; |p(1|x) − p(2|x)| ≤ ε}.
If N is large, in the instances (yn ,x n )εT such that x n εBDε there are roughly
equal numbers such that yn =1 and yn =2. Also, these two groups of instances
are dispersed almost at random over BDε . For Q concentrated on the
instances such that xn ε BDε , if the set {hm } of classifiers does not contain
classifiers that can separate instances randomly dispersed on BDε , then
minm EQ lm is close to one-half for some small ε > 0, and this is about as large
as minm EQ lm can get.
Intuitively, it is clear why Q would tend to concentrate on the points near the
boundary. These are the hardest points to classify correctly and Player I will
do well to focus his strategy on these points.
10.3 Does c(k) converge?
If the non-decreasing |b(k ) | sequence has a finite limit, then it is not difficult to
show that the sequence of vectors c(k ) converges. But if the |b(k ) | go to
infinity, then the situation is unclear. The set of optimal c forms a convex,
closed set of vectors in M-space and it is possible that the c(k) sequence
circulates endlessly through this convex set.
11. An Optimizing Arcing Algorithm for the Convex Regression Game
The matrix game gives upper bounds for the convex classification and
regression games. A natural question is whether there exist optimizing
arcing algorithms for the original convex games. There is such an algorithm
for regression. Recall that
rs(zn ,c) = (yn − ∑m cm hm (xn ))²
 = (∑m cm (yn − hm (xn )))²
 = (∑m cm rmn )²   (11.1)
where rmn = yn − hm (xn ) and, for a weight vector v, (v,r)n = ∑m vm rmn .
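Because the weights cm sum to one, the identity (11.1) can be checked numerically; the synthetic residuals below are illustrative values of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)            # targets y_n
H = rng.normal(size=(3, 5))       # H[m, n] = h_m(x_n)
c = np.array([0.2, 0.5, 0.3])     # weights summing to one

r = y - H                         # r[m, n] = y_n - h_m(x_n)
lhs = (y - c @ H) ** 2            # rs(z_n, c)
rhs = (c @ r) ** 2                # (sum_m c_m r_mn)^2
assert np.allclose(lhs, rhs)      # identity (11.1) holds
```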
Definition 11.1 arc-gv updates b(k) as follows:
1) Let
Qk (zn ) ~ exp((b(k) ,r)n ² − tk |b(k) |²) .
2) Select m = mk+1 to minimize the c-dependent error criterion (11.2).
The difference between this arcing algorithm and the arcing algorithms for
the matrix game is in (11.2). Instead of minimizing the Q-expectation of the
fixed loss function l(yn ,hm (xn )), we are minimizing an error function that
depends on the current value of c.
Theorem 11.2 If arc-gv stops at the kth step, then top(c(k) ) = φ *. Otherwise
limk top(c(k) ) = φ *.
Qb (z n )= w(zn ,b)/g(b)
Increase the mth coordinate of b by the amount x/|b| getting b'. Let
Taking logs
where
Θ(m,b,x) = EQb λ(m,b,x,z)
Proposition 11.3 If
Proposition 11.4 If
there is a constant D>0 depending only on B and ∆ such that for x in the
interval [0, ∆ /|b|]
For x ≤ ∆ /|b|
Let [0,s] be the largest interval such that on this interval Θ(x) ≤ 1. In this range
Θ(x) ≤ 1 − 2µb x + Dx²   (11.7)
The right hand side of (11.7) has a minimum at x* = µb /D. Clearly, x* < s and
x*/|b| ≤ 2B/D ≤ 1/B ≤ ∆.
At the kth step of arc-gv let xk = ∆k /|b(k) |. From identity (11.4) and using the
From (11.9) the rest of the proof goes exactly as in the matrix game proof.
In their 1997 paper Schapire et al. attribute the success of Adaboost in lowering
generalization error to its success in lowering margins, and showed that it
produced generally lower margins than bagging. In this section we give
empirical results showing that this explanation is not valid and that the
explanation must lie deeper. Furthermore, the results cast doubt on the ability
of VC-type bounds to give an adequate qualitative picture of what factors
influence generalization error.
In the three synthetic data sets used, training sets of size 300 and test sets of
size 3000 were generated. After the algorithms were run 100 iterations, the
test sets were used to estimate the generalization error. With the real data
sets, a random 10% of the instances were set aside and used as a test set. In
both cases, the procedure was repeated 10 times and the test set results
averaged. In each run, we kept track of top(c), and these values were also
averaged over the 10 runs for each algorithm.
The synthetic data sets are called twonorm, threenorm, and ringnorm and
are described in Breiman[1997]. The real data sets are all in the UCI repository
except for the heart data (also described in Breiman[1997]). The real data sets
have the following number of input variables and instances--heart 6-1395:
breast cancer 9-699: ionosphere 34-351: diabetes 8-768. Two values of k were
used for each data set. One value was set low, and the other higher. Larger
values of k were used for larger data sets so that tree sizes would be
appropriate to the data set.
Table 1 Test Set Error(%) and Top(c) (x100)
Although the test set errors for arc-gv and adaboost are generally close, the
pattern is that adaboost has a test set error slightly less than that of arc-gv.
The diabetes data set is a bit different, but is known to contain errors and
outliers. On the other hand, top(c) is often significantly less for arc-gv than
for adaboost. But this does not translate into a lower test set error for arc-gv.
Often, quite the contrary.
A last issue is whether lower values of top(c) translate into generally lower
values of the margin. That is, it might happen that although top(c) was
lower, the average, say, of mg(zn ,c) increased. We looked at this by graphing
er(zn ,c) for the first run of twonorm (k=16) using arc-gv vs. the values of
er(zn ,c) for the similar run using adaboost. The same data was used in both
runs.
The results are shown in Figure 1. In only 6 instances out of 300 are the
er(zn,c) values resulting from adaboost larger than those from arc-gv. In the
remaining instances, they are significantly smaller. Figure 2 is a similar graph
for the first run of ionosphere with k=16. Here, every value of er(zn ,c) for
arc-gv is less than its value for adaboost.
The conclusion is that these drastically different margin distributions for the
same VC-dimension had little effect on the generalization error. In fact, the
smaller margins usually led to higher generalization error. But this is
the opposite of what is indicated by the VC-type bounds.
Table 2  IE(Q) and top(c) (x100)

                       IE(Q)             top(c)
data set          arc-gv  adaboost   arc-gv  adaboost
twonorm
  k=8               43       23       24.0     31.0
  k=16              62       21        9.8     25.5
threenorm
  k=8               53       43       35.5     37.2
  k=16              67       38       20.0     28.4
ringnorm
  k=8               46       32       26.6     32.1
  k=16              60       26       11.5     23.0
heart
  k=32              47       31       23.8     26.9
  k=64              50       19        8.4     15.8
breast cancer
  k=16              12        6       16.7     21.1
  k=32               7        3        1.9      8.1
ionosphere
  k=8               26       15       21.3     21.4
  k=16              28        8        4.1     14.1
diabetes
  k=32              47       36       32.3     33.3
  k=64              54       28       17.5     22.6
Thus, the Q for adaboost is concentrated on far fewer instances than is the Q
for arc-gv. Further, in most data sets, as the complexity is increased by
increasing k, arc-gv concentrates on more instances while adaboost
concentrates on fewer. The original intuitive idea behind adaboost was that it
concentrated weight on instances most likely to be misclassified. Work in
Breiman[1997] showed that there were other algorithms sharing the property
that they concentrated weight on hard-to-classify instances, with resulting
generalization errors comparable to those of adaboost.
But the evidence in Table 2 indicates that arc-gv, in its effort to produce lower
values of top(c), concentrates on too many instances--more than just the
hard-to-classify. Still, given the wide disparity in the IE(Q) values, it is
curious that the estimated generalization errors of the two algorithms are so
close.
Other arcing algorithms, such as adaboost and arc-x4, take much larger steps,
circulate around the inner rim of the valley, and do not reach the bottom.
These algorithms are surprisingly resistant to overfitting. For instance, the
estimated generalization error of adaboost either stays the same or gets
smaller as k increases. On the other hand, the error of arc-gv is sometimes
significantly increased when a larger k is used.
This indicates that skating around the inner rim instead of slogging to the
bottom usually gives lower errors. But it certainly raises the question of why
this is so and how far down the rim one ought to be for optimal performance.
                  test set error(%)   IE(Q)   top(c) (x100)
waveform
  adaboost            16.5              31       24.2
  arc-gv              17.2              68       17.0
twonorm
  adaboost             4.0              20       20.7
  arc-gv               4.4              64       13.1
threenorm
  adaboost            17.1              34       24.5
  arc-gv              17.5              76       17.8
ringnorm
  adaboost             5.2              30       22.5
  arc-gv               5.7              68       15.0
heart
  adaboost             0.9              18       18.3
  arc-gv               0.6              44        9.6
breast cancer
  adaboost             3.4               5       15.9
  arc-gv               3.7              16        6.7
ionosphere
  adaboost             6.6              13       19.6
  arc-gv               5.4              34       10.7
diabetes
  adaboost            22.1              31       25.6
  arc-gv              22.6              56       20.2
glass
  adaboost            18.6              34       26.9
  arc-gv              18.6              56       21.1
soybean
  adaboost             7.5               6       12.1
  arc-gv               8.2               7        9.8
The curious aspect of the randomized algorithms is that no regularization is
needed. The trees combined were grown as large as possible and no pruning
was used. In past experiments, I found that this worked better than using any
pruning. In contrast, some limits have to be placed on the size of the trees in
the deterministic algorithm to prevent overfitting. The table above shows
that the indications of overfitting by the deterministic version of arc-gv do
not appear in the randomized version.
The test set performance of the two algorithms is very similar. Adaboost does
slightly better on the synthetic data. On the six real data sets they have exactly
the same total test set error. The surprising aspect is that the test set results
are so close, while the differences in top(c) and IE(Q) are so marked. I
conjecture that randomizing the path gets arc-gv away from the bottom of the
top(c) valley so it also skates around the inner rim, but at a lower level than
Adaboost. But the small effect of the top(c) and IE(Q) differences is mystifying.
14. Remarks
15. Acknowledgments
This work has important seeds in the Schapire et al. 1997 paper and thought-
provoking talks with Yoav Freund at the Newton Institute, Cambridge
University, during the summer of 1997.
References
Breiman, L. [1996b] Bagging predictors, Machine Learning 26, No. 2, pp. 123-140.
Robinson, J. [1951] An iterative method of solving a game, Ann. Math. 54, pp. 296-301.
Schapire, R., Freund, Y., Bartlett, P., and Lee, W. [1997] Boosting the margin
(available at http://www.research.att.com/~yoav, look under "publications").
Szep, J. and Forgo, F. [1985] Introduction to the Theory of Games, D. Reidel
Publishing Co.
Our proof is patterned after the proof given in Schapire et al. [1997] of a similar
result based on the margin distribution. Denote l(m,z) = lm (z). Let K be a
positive integer and, fixing c, take Jk* , k=1, ... ,K to be independent random
variables such that P(Jk* = m) = cm . Denote by J* the random K-vector whose
kth component is Jk* . Conditional on Z,
L(Z,J*) = (1/K) ∑k=1..K l(Jk* ,Z)
is an average of iid random variables, each one having expectation er(Z,c) .
Similarly,
L(zn ,J*) = (1/K) ∑k=1..K l(Jk* ,zn )
is an average of iid variables, each having expectation er(zn ,c). For any µ < λ ,
exp(−K(µ − λ )² /2B² ) .   (AI.3)
To bound the 1st term in (AI.1), for some ε >0 consider the probability
By the independence of the {zn} under P̃ , the jth term in (AI.5) is bounded by
P{P( L (Z, j)≥ µ )≤ ε +1,L (Z, j)≥ µ}+ P{P( L (Z, j)≥ µ )≤ ε ,L (Z, j)< µ}=
For any λ , ν take µ to be the lowest value in the grid of M µ-values that is
≥ (λ + ν )/2. So
µ = (λ + ν )/2 + αB/M
where 0 ≤ α < 1. Assuming that 0 < λ − ν ≤ B the sum of the bounds in (AI.3) and
(AI.8) is less than
SK = max(2 N,exp(K / 2 M)) exp(− K( λ − ν )2 /8B2 ) .
Let the ε in (AI.4) depend on K, and define δK = MK+1 exp(−εK N). Then, except
for a fixed set of training sets with P̃ probability ≤ δK , for all λ , ν , c, and for
fixed K,
Take δK = 2−K δ . Then (AI.9) also holds for all K except on a fixed set of
training sets with probability ≤ δ . Now let ν = top(c), λ = ∆ + top(c), σ = 8(B/∆)²,
and take K = σ log(2N² /σ log(2M)). Take ∆ so that (∆/B) ≥ 8/M ; then
2N ≥ exp(K/2M) and if R = σ log(2M)/N then
[Figure 1. Comparison of er(z) for arc-gv (vertical axis) vs. er(z) for adaboost
(horizontal axis), twonorm data, k=16.]

[Figure 2. Comparison of er(z) for arc-gv (vertical axis) vs. er(z) for adaboost
(horizontal axis), ionosphere data, k=16.]