Assessing Error Rate Estimators: The Leave-One-Out Method Reconsidered
W.J. Krzanowski and D.J. Hand
Summary
1. Introduction
The error rate of a statistical classification rule is the probability that the
rule will misclassify future cases. Much work has been undertaken on developing
ways of estimating error rates. This is partly because it is a problem of intrinsic
importance, and partly because it raises issues, in a clearly defined and well-
understood context, which occur throughout statistics. Toussaint (1974), Hand
(1986) and McLachlan (1987) have surveyed work on error rate estimation.
Many methods of error rate estimation have been proposed, including some
specific to various distributions, but in practical terms the most important gen-
eral classes are resubstitution estimators, rotation and hold-out estimators (in
particular, leave-one-out estimators), average conditional error rate estimators
(also called additive estimators or posterior probability estimators), and boot-
strap estimators. The review papers above and McLachlan (1992) give details of
these estimators, their modifications, extensions and generalizations. All these
estimators are purely data-based; they assume the existence of a design set com-
prising data on individuals whose class membership is known and from which a
classification rule can be constructed, but the absence of any separate test set
of similarly identified individuals from which the performance of the rule can be
evaluated.
The resubstitution estimator is an obvious and attractive rule, and is defined
as the proportion of the design set which is misclassified by the classification
rule. The major weakness of this estimator is that it is optimistically biased: it
underestimates the error rate because it uses the same set both for design (i.e. the
rule is ‘optimised’ in some sense on this data set) and for evaluation. The leave-
one-out rule seeks to overcome this drawback by a process of cross-validation.
Each design set case is withdrawn from the design set, a classification rule is
constructed on the remaining n - 1 cases, and the withdrawn case is classified
using that rule. This is repeated for each design set case in turn, each withdrawn
case being replaced before the next is withdrawn.
turn. The result is a set of n classifications, each of which has been obtained
from a design set almost identical to the original (size n ) set, and each of which
is based on a single point test set independent of its corresponding design set.
The proportion of these n classifications which are incorrect is the leave-one-out
estimate. The result is an estimator which is unbiased for design sets of size
n - 1 and, hence, almost unbiased for the original design set (of size n ) if that is
moderately large, thus overcoming the main objection to the resubstitution rule.
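To make the procedure concrete, the sketch below computes a leave-one-out estimate for a simple linear discriminant rule. It is our own illustration: the use of Python/numpy and the helper names `fit_ldf` and `loo_error_rate` are assumptions of this sketch, not part of the original study.

```python
import numpy as np

def fit_ldf(x1, x2):
    """Fit a two-class sample LDF: allocate z to class 1 when c'z > c'(m1 + m2)/2."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    # Pooled within-class covariance matrix W.
    w = ((n1 - 1) * np.cov(x1, rowvar=False) +
         (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    c = np.linalg.solve(w, m1 - m2)
    return c, c @ (m1 + m2) / 2.0

def loo_error_rate(x1, x2):
    """Leave-one-out (omit-one) estimate of the error rate of the LDF rule."""
    errors, n = 0, len(x1) + len(x2)
    for data, label, other in [(x1, 1, x2), (x2, 2, x1)]:
        for i in range(len(data)):
            # Withdraw case i and rebuild the rule on the remaining n - 1 cases.
            reduced = np.delete(data, i, axis=0)
            c, cut = fit_ldf(reduced, other) if label == 1 else fit_ldf(other, reduced)
            # Classify the withdrawn case with the rebuilt rule.
            predicted = 1 if c @ data[i] > cut else 2
            errors += (predicted != label)
    return errors / n
```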
Unfortunately, this advantage is not without its apparent disadvantages.
Glick (1978, p.211) says of the leave-one-out method that
the virtue of being ‘nearly unbiased’ is compromised by fickle fluctuation (the
opposite of smoothing) with consequently larger variance than any other well
known estimator of success rate [the complement of error rate].
Glick’s work has been a springboard for many simulation studies of data-based
error rate estimators (McLachlan, 1992). We contend, however, that many of
the studies, commencing with Glick's, have a fundamental shortcoming: they
assess performance using the wrong measure of variance. Section 2 examines the
motivation underlying this assertion, Section 3 focuses on variance and identifies
a more appropriate measure of performance, and Section 4 presents the results of
a simulation study comparing Glick’s measure and ours. In view of the criticism
that the leave-one-out method has a high variance, Section 5 defines a straight-
forward extension aimed at reducing this variance, and compares it with one of
the more popular bootstrap estimators, the 632 estimator.
1.1. Notation
As we make some subtle distinctions, we here clearly define our notation.
The Bayes error rate $e_B$ is the smallest error rate which could ever be achieved
using the given variables. It is the error rate one would get if one knew the
distributions involved and used the best classification rule for these distributions.
The optimal error rate $e_O$ is the error rate of a classification rule of the kind
being studied, when applied to the true population distributions. It is the error
rate in the limiting case of a rule constructed from design sets of infinite size,
so is the minimum error rate which can be achieved by rules of the specified
type on populations of the specified form. (It may not be the minimum that can
be achieved if a broader class of rules is adopted. In general $e_O \ge e_B$.) Any
classification rule built on a finite design set has an error rate which is not less
than the optimal error rate.
The true error rate of a classifier, of the type being studied, constructed using a
given design set is denoted $e$ (so $e \ge e_O$). It is the error rate one would obtain
by integrating, over the entire measurement space, the classifier's probability of
making a misclassification at each point of the measurement space. We use $\hat{e}$ to
denote a generic estimator of this error rate, and $e_R$ to denote the resubstitution
error rate estimate.
$E_T$ denotes expectation over design sets drawn from specified populations. Thus
the Bayes error rates are fixed but the true error rates associated with classifiers
built using these design sets vary.
$E_{X|T}$ denotes expectation over design sets yielding classifiers with the same true
error rate; the positions of the cases in the design sets vary, but the classifiers
built using them have the same true error rate. This means that, even though
the true error rates of the classifiers are the same, estimates of those rates vary
amongst classifiers.
with $e_B$ instead of with $e$ (e.g. Page (1985), who used $|\hat{e} - e_B|$). This, of course,
does not address questions relating to performance as estimators of the true error
rate $e$.
This, in itself, is hardly surprising, but Glick also criticised the leave-one-out
estimator on the grounds that its variance is large, a criticism that has been taken
up elsewhere (e.g. Efron, 1983). However, from Section 2, we suggest that the
variance that Glick studied is of limited interest. It compares $\hat{e}$ with an expected
value of $\hat{e}$, but where the expectation is over a range of $e$ values (those arising
from design sets generated from the specified distributions, with fixed Bayes error
rates), i.e. it compares $\hat{e}$ with $E_T(E_{X|T}\hat{e})$. In our view, the expectation $E_{X|T}\hat{e}$ is
of more relevance than the expectation $E_T(E_{X|T}\hat{e})$. This is the expected value of
the error rate estimate for data coming from situations which all have the same
(unknown) true error rate $e$. To us it seems more natural to study the variance
for fixed true error rates, rather than include variability of the true error rate
arising from design set sampling fluctuations. That means that, for a particular
unknown true error rate, we want to look at $E_{X|T}(\hat{e} - E_{X|T}\hat{e})^2$. Of course, if we
still want to summarise these measures over the range of true error rates (for
a given Bayes error rate), we can aggregate by taking the expectation, yielding
$E_T[E_{X|T}(\hat{e} - E_{X|T}\hat{e})^2]$.
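The gap between the two measures can be expressed through the law of total variance. The identity below is our addition, not taken from the paper, but it shows exactly what the unconditional measure adds to the conditional one:

```latex
% Our addition: decomposition of the unconditional variance of \hat{e}.
\[
\underbrace{E_T\!\left[E_{X|T}\{\hat{e} - E_T(E_{X|T}\hat{e})\}^2\right]}_{\text{variance about the unconditional mean}}
=
\underbrace{E_T\!\left[E_{X|T}(\hat{e} - E_{X|T}\hat{e})^2\right]}_{\text{aggregated conditional variance}}
+
\underbrace{E_T\!\left[\{E_{X|T}\hat{e} - E_T(E_{X|T}\hat{e})\}^2\right]}_{\text{variation of the conditional means}}
\]
% so the unconditional measure exceeds the conditional one whenever the
% conditional means E_{X|T}\hat{e} vary with the true error rate.
```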
Starting from the mean squared error, $E_T[E_{X|T}(\hat{e} - e)^2]$, Glick implicitly
decomposes it about the unconditional mean $E_T(E_{X|T}\hat{e})$.
4. A Simulation Study
As we have argued above, in principle, simulation studies should
(a) generate data according to fixed positions of the decision surface (where the
range of positions spans the range of possible positions and true error rates
in some reasonable way), and
(b) generate multiple design sets leading to the decision surfaces in each of these
positions.
In fact, however, this is difficult. Moreover, for our purposes we want the posi-
tions of the decision surfaces to be a random sample from the set of possible po-
sitions (given the underlying distributions) so that we can compare our measure
of variance with those used by Glick and others. Therefore we have undertaken
a two-stage simulation process in which:
(i) an initial design set is generated, by random sampling, from the specified
distributions;
(ii) this set is used to define a decision surface and then multiple design sets are
generated, each of which has this same decision surface, as detailed below.
In common with many comparative simulation studies, we investigate only
the case of normally distributed variables. Although many specially-tailored
error rate estimators have been proposed for such variables (see, e.g. McLachlan,
1992, Section 10.5), we here examine the behaviour of the leave-one-out procedure
(which we shorten to the 'omit-one' or $O_1$ procedure) as an estimator for general
situations and so do not consider those special estimators any further. The
main advantage of using just normal variables is that we can specify populations
concisely and calculate the true error rates easily.
As parameters to be varied in our study, we chose separation of the two
populations, and number of variables observed. The former can be specified
most conveniently in terms of the Bayes error rate $e_B$: we chose $e_B = 0.1$ (well
separated populations), $e_B = 0.2$ (moderately separated populations), and $e_B =
0.3$ (considerably overlapping populations). For the number of variables $p$, we
chose $p = 5, 10$. The normal populations are completely specified by their mean
vectors, $\mu_1$ and $\mu_2$, and dispersion matrices $\Sigma_1$ and $\Sigma_2$. Moreover, if the latter
are assumed equal ($\Sigma_1 = \Sigma_2 = \Sigma$, say) then the linear discriminant function
(LDF) is the optimal classification rule and $e_B = e_O = \Phi(-\tfrac{1}{2}\Delta)$, where $\Delta^2 =
(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$ and $\Phi(u) = (1/\sqrt{2\pi})\int_{-\infty}^{u} e^{-x^2/2}\,dx$. Knowledge of $e_O$
thus determines $\Delta$, and the populations can be specified without loss of generality
in the canonical form $\Sigma = I$, $\mu_1 = (\Delta, 0, \ldots, 0)^T$ and $\mu_2 = (0, \ldots, 0)^T$ for a given
value of $p$.
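As an illustration of this specification (our own sketch; `scipy.stats.norm.ppf` supplies $\Phi^{-1}$ and the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def canonical_populations(e_bayes, p):
    """Canonical-form mean vectors (Sigma = I) for a target Bayes error rate.

    Since e_B = Phi(-Delta/2), the Mahalanobis distance is Delta = -2 Phi^{-1}(e_B)."""
    delta = -2.0 * norm.ppf(e_bayes)
    mu1 = np.zeros(p)
    mu1[0] = delta
    mu2 = np.zeros(p)
    return mu1, mu2

# For example, e_B = 0.2 with p = 5 gives Delta approximately 1.683.
mu1, mu2 = canonical_populations(0.2, 5)
```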
The third parameter to be varied was the sample size for each design set;
we used equal sample sizes from the two populations $(n_1 = n_2 = n)$, and used
$n = 25, 50, 100$ for small, medium, and large design sets. For a given setting of
each of the three parameters, we generated 1000 pairs of samples of size $n$ from
$N(\mu_1, I)$ and $N(\mu_2, I)$ populations using subroutine G05DDF of the NAG suite
of programs. These 1000 pairs were structured into 20 sets of 50, so that all
pairs within a set gave rise to the same decision surface (i.e. sample LDF) and
the same error rates. This requirement is satisfied very simply, as follows.
Suppose one pair of training samples has means $\bar{x}_1, \bar{x}_2$ and pooled covariance
matrix W. Coefficients of the sample LDF are then given by elements of the
vector $c = W^{-1}(\bar{x}_1 - \bar{x}_2)$, and the overall error rate for this pair is
$$e = \tfrac{1}{2}\,\Phi\!\left(\frac{c'\{\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) - \mu_1\}}{(c'\Sigma c)^{1/2}}\right)
    + \tfrac{1}{2}\,\Phi\!\left(\frac{c'\{\mu_2 - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\}}{(c'\Sigma c)^{1/2}}\right).$$
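A sketch of this error rate calculation (our own illustration, using the expression above with equal priors, consistent with the equal sample sizes used in the study; the function name and the use of numpy/scipy are ours):

```python
import numpy as np
from scipy.stats import norm

def true_error_rate(xbar1, xbar2, w, mu1, mu2, sigma):
    """True error rate of the sample LDF built from (xbar1, xbar2, W), equal priors.

    The rule allocates z to population 1 when c'z > c'(xbar1 + xbar2)/2 with
    c = W^{-1}(xbar1 - xbar2); the error rate integrates each normal class density
    over the wrong side of that hyperplane."""
    c = np.linalg.solve(w, xbar1 - xbar2)
    cut = c @ (xbar1 + xbar2) / 2.0
    s = np.sqrt(c @ sigma @ c)
    return 0.5 * (norm.cdf((cut - c @ mu1) / s) + norm.cdf((c @ mu2 - cut) / s))
```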
If a second pair of training samples has means $\bar{y}_1, \bar{y}_2$ and pooled covariance
matrix $U$, then the decision surface and the true error rate are the same as those
of the first pair provided that $U^{-1}(\bar{y}_1 - \bar{y}_2) = W^{-1}(\bar{x}_1 - \bar{x}_2)$ and $\bar{y}_1 + \bar{y}_2 =
\bar{x}_1 + \bar{x}_2$. But the first equation implies that $\bar{y}_1 - \bar{y}_2 = UW^{-1}(\bar{x}_1 - \bar{x}_2)$, which,
taken with the second, means that we must have
$$\bar{y}_1 = \tfrac{1}{2}\{(\bar{x}_1 + \bar{x}_2) + UW^{-1}(\bar{x}_1 - \bar{x}_2)\} \qquad (1)$$
and
$$\bar{y}_2 = \tfrac{1}{2}\{(\bar{x}_1 + \bar{x}_2) - UW^{-1}(\bar{x}_1 - \bar{x}_2)\}. \qquad (2)$$
Thus to ensure that successive replicates within the set of 50 have the same
decision surfaces and true error rates, we can simply generate each pair of samples
in the usual way but adjust each pair so that the sample means satisfy (1) and
(2) (where $\bar{x}_1$, $\bar{x}_2$, and $W$ are the sample values of the previous pair). This
was done in our simulation study. The ERU variance is then just the variance
of all 1000 estimates for each parameter combination (without regard to any
grouping), while the ERC variance is the pooled within-set variance for each
parameter combination.
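The sketch below (our own illustration, not the original NAG-based program) shows how a freshly generated pair of samples can be shifted so that equations (1) and (2) hold, and how the ERU and ERC variances are then obtained from the grouped estimates:

```python
import numpy as np

def adjust_to_shared_surface(y1, y2, xbar1, xbar2, w):
    """Shift a new pair of samples so its LDF matches that of the previous pair.

    Shifting changes the sample means but not the pooled covariance U, so moving
    the means to the targets of equations (1) and (2) forces
    U^{-1}(ybar1 - ybar2) = W^{-1}(xbar1 - xbar2) and ybar1 + ybar2 = xbar1 + xbar2."""
    n1, n2 = len(y1), len(y2)
    u = ((n1 - 1) * np.cov(y1, rowvar=False) +
         (n2 - 1) * np.cov(y2, rowvar=False)) / (n1 + n2 - 2)
    shift = u @ np.linalg.solve(w, xbar1 - xbar2)      # U W^{-1} (xbar1 - xbar2)
    target1 = 0.5 * ((xbar1 + xbar2) + shift)          # equation (1)
    target2 = 0.5 * ((xbar1 + xbar2) - shift)          # equation (2)
    return y1 - y1.mean(axis=0) + target1, y2 - y2.mean(axis=0) + target2

def eru_erc_variances(estimates):
    """ERU and ERC variances from a (sets x replicates) array of error rate estimates.

    ERU ignores the grouping into sets; ERC pools the within-set variances."""
    estimates = np.asarray(estimates)
    eru = estimates.var(ddof=1)
    erc = estimates.var(axis=1, ddof=1).mean()
    return eru, erc
```

With 20 sets of 50 replicates, `estimates` would be a 20 x 50 array for each parameter combination.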
Table 1 summarises the results, showing, in order, the ERU and ERC vari-
ances, and their ratio ERU/ERC. The ERU variance is markedly inflated because
it includes the differences between the true error rates. Other patterns evident
in these results are:
• a tendency for the two variances to decrease with increasing sample size, as expected;
TABLE 1
Simulation comparison of the ERU and ERC variance measures for the $O_1$ method
$e_B$    p    n      ERU variance    ERC variance    Ratio
0.3 5 25 0.0105 0.0020 5.1650
0.3 5 50 0.0027 0.0010 2.5865
0.3 5 100 0.0012 0.0005 2.4614
0.2 5 25 0.0092 0.0018 5.1019
0.2 5 50 0.0037 0.0009 4.2893
0.2 5 100 0.0010 0.0004 2.4614
0.1 5 25 0.0044 0.0015 2.9907
0.1 5 50 0.0023 0.0006 3.7049
0.1 5 100 0.0007 0.0003 2.2758
0.3 10 25 0.0133 0.0025 5.3036
0.3 10 50 0.0030 0.0011 2.6544
0.3 10 100 0.0021 0.0005 4.2091
0.2 10 25 0.0080 0.0022 3.7170
0.2 10 50 0.0025 0.0009 2.7381
0.2 10 100 0.0010 0.0004 2.2163
0.1 10 25 0.0073 0.0014 5.0959
0.1 10 50 0.0022 0.0006 3.4010
0.1 10 100 0.0010 0.0003 3.0557
• a tendency for the inflation (as measured by the ratio) to decrease as sample
size increases, as expected because larger samples are associated with a less
variable decision surface (for samples taken from given distributions).
Figure 1 plots the ERU/ERC ratios of standard deviations against the standard
deviations of the true error rates for each of the specified parameter combinations.
The positive correlation is quite clear.
This property of the ERU variance is not restricted to the $O_1$ method, so we
extended the simulation study to compare this method with one of the bootstrap
methods of error rate estimation. Table 2 shows corresponding results for the
632 estimator (Efron, 1983). The estimator is defined as $e_{632} = 0.368\,e_R +
0.632\,e_N$, where $e_R$ is the resubstitution estimate and $e_N$ is the bootstrap rate
for points not contained in the bootstrap sample. It has been found to perform
well (Efron, 1983; Fitzmaurice et al., 1991). Our study used 200 bootstrap
samples for each replicate. Table 2 also shows the ratio of ratios of variances
for the two estimators, i.e. (ERU variance $O_1$/ERU variance 632)/(ERC variance
$O_1$/ERC variance 632). All of these ratios are less than 1, showing that Glick's
unconditional approach underestimates the poor performance of the 01 method
relative to the 632 estimator in terms of variance, at least for the situations we
have studied. Note that the extent of this underestimation is approximately
constant across the different parameter settings.
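For completeness, a sketch of the 632 estimator as defined above (our own illustration: `fit` and `predict` are placeholder callables for whatever rule is being assessed, `x` and `labels` are numpy arrays, and the omitted-point errors are pooled across bootstrap samples, one common way of computing $e_N$):

```python
import numpy as np

def e632_estimate(x, labels, fit, predict, n_boot=200, rng=None):
    """Efron's 632 estimator: e_632 = 0.368 e_R + 0.632 e_N."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    rule = fit(x, labels)
    e_r = np.mean(predict(rule, x) != labels)        # resubstitution estimate e_R
    errors, count = 0.0, 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # bootstrap sample, with replacement
        out = np.setdiff1d(np.arange(n), idx)        # cases not in the bootstrap sample
        if out.size == 0:
            continue
        boot_rule = fit(x[idx], labels[idx])
        errors += np.sum(predict(boot_rule, x[out]) != labels[out])
        count += out.size
    e_n = errors / count                             # bootstrap rate e_N on omitted cases
    return 0.368 * e_r + 0.632 * e_n
```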
Fig. 1. Ratios ERU s.d./ERC s.d. for the omit-one ($O_1$) method plotted against the standard deviations of the true error rates.
TABLE 2
Simulation comparison of the ERU and ERC variance measures for the 632 method
6. Discussion
The problems of estimating the future misclassification rate of a classifi-
cation rule, using only the data which are available to design the rule, have
provided the field on which many advanced statistical techniques were first ex-
plored in depth; for example, the leave-one-out method (Lachenbruch, 1965) and
the bootstrap method (Efron, 1983). Because of the complexity of the estimators
involved, simulation studies frequently supplement theoretical analysis, typically
generating data from specified distributions. Since the underlying distributions
TABLE 3
Simulated ERC variance measures for the $O_1$, $O_2$ and 632 methods
$e_B$    p    n      $O_1$      $O_2$      632
0.3 5 25 0.0020 0.0018 0.0008
0.3 5 50 0.0010 0.0010 0.0004
0.3 5 100 0.0005 0.0005 0.0003
0.2 5 25 0.0018 0.0016 0.0009
0.2 5 50 0.0009 0.0008 0.0005
0.2 5 100 0.0004 0.0004 0.0003
0.1 5 25 0.0015 0.0014 0.0008
0.1 5 50 0.0006 0.0006 0.0004
0.1 5 100 0.0003 0.0003 0.0002
0.3 10 25 0.0025 0.0022 0.0008
0.3 10 50 0.0011 0.0011 0.0004
0.3 10 100 0.0005 0.0005 0.0002
0.2 10 25 0.0022 0.0020 0.0008
0.2 10 50 0.0009 0.0009 0.0004
0.2 10 100 0.0004 0.0004 0.0002
0.1 10 25 0.0014 0.0013 0.0006
0.1 10 50 0.0006 0.0006 0.0003
0.1 10 100 0.0003 0.0003 0.0002
are known in simulation studies, the Bayes error rate is also known and is some-
times used to index the performance of the resulting rule. However, there is a
danger that this can give a misleading impression of the performance of the error
rate estimator. Different data sets generated from fixed distributions give rise
to different decision surfaces, and these have different true error rates. Hence,
a summary of the results according to the Bayes error rate is an average over
different true error rates.
Different realizations of the allocation rule, and hence different true error
rates, are irrelevant in any practical application: one particular realization has
been obtained, with an (unknown) true error rate, which is the error rate for
which the practitioner seeks the most accurate estimate. There is a subset of the
sample space which gives rise to that particular allocation rule, so only elements
of that subset are relevant when examining the accuracy of any estimator of that
rule’s true error rate. By extension, in any simulation study a proper comparison
among error rate estimators should be based on indexing the results according
to the true error rates of the classification rules.
Above we have explored the effect of using the incorrect summary by study-
ing its impact on the $O_1$ and the 632 bootstrap methods of error rate estimation.
The ratio of the incorrect measure of variance (including the variance attributable
to differences between the true error rates generated from different design sets)
to the correct measure of variance (conditioning on fixed true error rate) can
be substantial, involving factors of up to 6 for the situations in our simulations.
This means that conclusions drawn from such studies are at risk of being quite
wrong: they will be substantially influenced by the variation in true error rate.
We have also compared the performance of the $O_1$ method with that of the
632 method, as measured by the incorrect and correct methods. Our results
suggest that the $O_1$ method is even worse than had been expected, in terms of
its 'fickle fluctuation', to use Glick's words.
Motivated by this, we have explored a simple extension of the $O_1$ method
aimed at reducing its (correct measure of) variance. While this extension, the
$O_2$ method, leads to a performance superior to that of the $O_1$ method, it remains
inferior to the 632 estimator.
References
EFRON, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc. 78, 316-331.
FITZMAURICE, G.M. & HAND, D.J. (1987). A comparison of two average conditional error rate estimators. Pattern Recognition Lett. 6, 221-224.
FITZMAURICE, G.M., KRZANOWSKI, W.J. & HAND, D.J. (1991). A Monte Carlo study of the 632 bootstrap estimator of error rate. J. Classification 8, 239-250.
GLICK, N. (1978). Additive estimators for probabilities of correct classification. Pattern Recognition 10, 211-222.
HAND, D.J. (1986). Recent advances in error rate estimation. Pattern Recognition Lett. 4, 335-346.
HAND, D.J. (1987). A shrunken leaving-one-out estimator of error rate. Comput. Math. Appl. 14, 161-167.
HORA, S.C. & WILCOX, J.B. (1982). Estimation of error rates in several-population discriminant analysis. J. Marketing Res. 19, 57-61.
KITTLER, J. & DEVIJVER, P.A. (1982). Statistical properties of error estimators in performance assessment of recognition systems. IEEE Trans. Pattern Anal. Machine Intelligence 4, 215-220.
LACHENBRUCH, P.A. (1965). Estimation of Error Rates in Discriminant Analysis. Unpublished PhD Thesis, University of California, Los Angeles.
McLACHLAN, G.J. (1987). Error rate estimation in discriminant analysis: recent advances. In Advances in Multivariate Statistical Analysis, ed. A.K. Gupta, pp. 233-252. Dordrecht: D. Reidel.
McLACHLAN, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley.
PAGE, J.T. (1985). Error-rate estimation in discriminant analysis. Technometrics 27, 189-198.
PAWLAK, M. (1988). On the asymptotic properties of smoothed estimators of the classification error rate. Pattern Recognition 21, 515-524.
SNAPINN, S.M. & KNOKE, J.D. (1984). Classification error rate estimators evaluated by unconditional mean squared error. Technometrics 26, 371-378.
SNAPINN, S.M. & KNOKE, J.D. (1985). An evaluation of smoothed classification error rate estimators. Technometrics 27, 199-206.
SNAPINN, S.M. & KNOKE, J.D. (1989). Estimation of error rates in discriminant analysis with selection of variables. Biometrics 45, 289-299.
TOUSSAINT, G.T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Inform. Theory 20, 472-479.
WEISS, S.M. (1991). Small sample error rate estimation for k-NN classifiers. IEEE Trans. Pattern Anal. Machine Intelligence 13, 285-289.