
Austral. J. Statist. 39(1), 1997, 35-46

ASSESSING ERROR RATE ESTIMATORS:
THE LEAVE-ONE-OUT METHOD RECONSIDERED

W.J. KRZANOWSKI¹ AND D.J. HAND²

University of Exeter and The Open University

Summary

Many comparative studies of the estimators of error rates of supervised classification rules are based on inappropriate criteria. In particular, although they fix the Bayes error rate, their summary statistics aggregate a range of true error rates. This means that their conclusions about the performance of classification rules cannot be trusted. This paper discusses the general issues involved, and then focuses attention specifically on the leave-one-out estimator. The estimator is investigated in a simulation study, both in absolute terms and in comparison with a popular bootstrap estimator. An improvement to the leave-one-out estimator is suggested, but the bootstrap estimator appears to maintain superiority even when the criteria are adjusted.

Key words: Classification rule; error rate; leave-one-out method; 632 estimator.

1. Introduction
The error rate of a statistical classification rule is the probability that the
rule will misclassify future cases. Much work has been undertaken on developing
ways of estimating error rates. This is partly because it is a problem of intrinsic
importance, and partly because it raises issues, in a clearly defined and well-
understood context, which occur throughout statistics. Toussaint (1974), Hand
(1986) and McLachlan (1987) have surveyed work on error rate estimation.
Many methods of error rate estimation have been proposed, including some specific to various distributions, but in practical terms the most important general classes are resubstitution estimators, rotation and hold-out estimators (in particular, leave-one-out estimators), average conditional error rate estimators (also called additive estimators or posterior probability estimators), and bootstrap estimators. The review papers above and McLachlan (1992) give details of these estimators, their modifications, extensions and generalizations. All these estimators are purely data-based; they assume the existence of a design set comprising data on individuals whose class membership is known and from which a classification rule can be constructed, but the absence of any separate test set

Received February 1996; accepted November 1996.


¹ Dept of Mathematical Statistics & Operational Research, University of Exeter, Laver Building, North Park Rd, Exeter EX4 4QE, UK.
² Dept of Statistics, The Open University, Milton Keynes, MK7 6AA, UK.

of similarly identified individuals from which the performance of the rule can be evaluated.

The resubstitution estimator is an obvious and attractive rule, and is defined as the proportion of the design set which is misclassified by the classification rule. The major weakness of this estimator is that it is optimistically biased: it underestimates the error rate because it uses the same set both for design (i.e. the rule is 'optimised' in some sense on this data set) and for evaluation. The leave-one-out rule seeks to overcome this drawback by a process of cross-validation. Each design set case is withdrawn from the design set, a classification rule is constructed on the remaining n - 1 cases, and the withdrawn case is classified using that rule. This is repeated for each design set case in turn, each case being returned to the design set before the next is withdrawn. The result is a set of n classifications, each of which has been obtained from a design set almost identical to the original (size n) set, and each of which is based on a single-point test set independent of its corresponding design set. The proportion of these n classifications which are incorrect is the leave-one-out estimate. The result is an estimator which is unbiased for design sets of size n - 1 and, hence, almost unbiased for the original design set (of size n) if that is moderately large, thus overcoming the main objection to the resubstitution rule.
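As a concrete illustration (not part of the original paper), the following Python sketch computes the resubstitution and leave-one-out estimates for an arbitrary classification rule; the use of scikit-learn's LinearDiscriminantAnalysis, and all function and variable names, are our own choices.

```python
# Minimal sketch (not the authors' code) of the resubstitution and
# leave-one-out error rate estimates for an arbitrary classifier.
# scikit-learn's LinearDiscriminantAnalysis is used purely as a stand-in.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def resubstitution_error(X, y, make_clf):
    """Proportion of the design set misclassified by a rule built on all of it."""
    clf = make_clf().fit(X, y)
    return np.mean(clf.predict(X) != y)

def leave_one_out_error(X, y, make_clf):
    """Withdraw each case in turn, build a rule on the remaining n - 1 cases,
    and classify the withdrawn case with that rule."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i
        clf = make_clf().fit(X[keep], y[keep])
        errors += clf.predict(X[i:i + 1])[0] != y[i]
    return errors / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, delta = 25, 5, 1.683            # separation giving a Bayes error near 0.2
    X = np.vstack([rng.normal(0, 1, (n, p)) + np.r_[delta, np.zeros(p - 1)],
                   rng.normal(0, 1, (n, p))])
    y = np.r_[np.zeros(n), np.ones(n)]
    make_lda = LinearDiscriminantAnalysis
    print(resubstitution_error(X, y, make_lda),
          leave_one_out_error(X, y, make_lda))
```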
Unfortunately, this advantage is not without its apparent disadvantages. Glick (1978 p.211) says of the leave-one-out method, that

    the virtue of being 'nearly unbiased' is compromised by fickle fluctuation (the opposite of smoothing) with consequently larger variance than any other well known estimator of success rate [the complement of error rate].
Glick's work has been a springboard for many simulation studies of data-based error rate estimators (McLachlan, 1992). We contend, however, that many of the studies, commencing with Glick's, have a fundamental shortcoming: they assess performance using the wrong measure of variance. Section 2 examines the motivation underlying this assertion, Section 3 focuses on variance and identifies a more appropriate measure of performance, and Section 4 presents the results of a simulation study comparing Glick's measure and ours. In view of the criticism that the leave-one-out method has a high variance, Section 5 defines a straightforward extension aimed at reducing this variance, and compares it with one of the more popular bootstrap estimators, the 632 estimator.
1.1. Notation
As we make some subtle distinctions, we here clearly define our notation.
The Bayes error rate $e_B$ is the smallest error rate which could ever be achieved using the given variables. It is the error rate one would get if one knew the distributions involved and used the best classification rule for these distributions.

The optimal error rate $e_O$ is the error rate of a classification rule of the kind being studied, when applied to the true population distributions. It is the error rate in the limiting case of a rule constructed from design sets of infinite size, so is the minimum error rate which can be achieved by rules of the specified type on populations of the specified form. (It may not be the minimum that can be achieved if a broader class of rules is adopted. In general $e_O \geq e_B$.) Any classification rule built on a finite design set has an error rate which is not less than the optimal error rate.

The true error rate of a classifier, of the type being studied, constructed using a given design set is denoted $e$ (so $e \geq e_O$). It is the error rate one would obtain by integrating, over the entire measurement space, the classifier's probability of making a misclassification at each point of the measurement space. We use $\hat{e}$ to denote a generic estimator of this error rate, and $e_R$ to denote the resubstitution error rate estimate.

$E_T$ denotes expectation over design sets drawn from specified populations. Thus the Bayes error rates are fixed but the true error rates associated with classifiers built using these design sets vary.

$E_{X|T}$ denotes expectation over design sets yielding classifiers with the same true error rate; the positions of the cases in the design sets vary, but the classifiers built using them have the same true error rate. This means that, even though the true error rates of the classifiers are the same, estimates of those rates vary amongst classifiers.

2. Measuring Performance of an Error Rate Estimator in Simulation Studies
Suppose that we start with a particular design set, from which we derive a classification rule that has a fixed (unknown) true error rate, $e$. A typical simulation study compares a number, $M$ say, of different estimators of $e$, yielding $M$ estimates $\hat{e}^{(i)}$. We want to know which of the estimators is the most accurate, where accuracy is usually measured in terms of the mean squared error $E(\hat{e}^{(i)} - e)^2$. For classification rules with true error rate $e$, this tells us which estimator leads to estimates closest to it in the mean squared error sense.

Many simulation studies of error rate estimation use design factors (such as Bayes error rate, sample size, and distribution shape) to span a range of conditions and summarise the comparisons between estimators using what is often called the unconditional mean squared error $E_T[E_{X|T}(\hat{e} - e)^2]$ (see Hora & Wilcox, 1982; Efron, 1983; Snapinn & Knoke, 1984, 1985, 1989; Fitzmaurice & Hand, 1987; Hand, 1987; Pawlak, 1988). Estimated values of $E_T[E_{X|T}(\hat{e} - e)^2]$ are then tabulated (or plotted) against the known Bayes error rates, i.e. the distributions (and consequently the Bayes error rate) are fixed, and the expectation $E_T$ is taken over true error rates obtained from design sets generated from those distributions. This gives a measure of performance conditional on Bayes error rate. However, as Hand (1987) and Fitzmaurice et al. (1991) point out, it is not clear that this is the appropriate thing to do. We claim that the measure of real interest is $E_{X|T}(\hat{e} - e)^2$, conditional on fixed values of $e$, i.e. for a given classification rule applied to a design set, with unknown true misclassification rate $e$, we want to know how accurately that unknown rate is estimated. Averaging over design sets which have different true error rates is irrelevant to our problem. In general, some of those design sets, though sampled from the same populations, may lead to very different true error rates from that of the given design set and so distort the measure of accuracy. It is thus necessary
(i) to compute the measures conditional on fixed values of $e$ (rather than $e_B$), and
(ii) to present the results for fixed values of $e$ (perhaps as well as, in cross-tabulation with, values of $e_B$).
However, this does not completely determine the strategy for a simulation study. In particular, we must be more specific about the situations which give rise to $e$. There are many possibilities, lying as it were on a continuum of conditioning, some of which are as below, proceeding from greatest to least generality.
(a) All possible distributions and design sets which have the true (unknown) error rate $e$.
(b) All design sets (leading to classification rules with true error rate $e$) arising from specified distributions, requiring the researcher to make further inferences to other distributions and sets of distributions. This is the implicit framework when simulation studies are undertaken, the simulations being carried out for particular forms of distribution with the idea that the results generalise, without explicit statistical evidence for that generalisation.
(c) Specified distribution form and the position of the decision surface obtained from the given design sets.
(d) Specified distribution form and the sample means and covariance matrices of the design sets.
(e) The particular design set to hand. This gives only one error rate estimate for each method, so that the mean squared error reduces to $(\hat{e}^{(i)} - e)^2$.
Since we want to be able to make statistical (as opposed to non-statistical) generalisations which are as broad as possible, we want to use a method from near the start of this list. Method (a) is clearly impractical. Method (b) could be used, but does not recognise the role of the decision surface. Specifically, future objects would be classified using the particular decision surface that the researcher has obtained, so it seems sensible to condition on that decision surface. Going any further down the list than (c) seems unnecessarily restrictive. The decision surface is the sufficient statistic for classification purposes.
Note that the essence of the distinction between those approaches which condition on the Bayes error rate and others that condition on the true error rate also holds if other performance criteria (such as bias or absolute deviation) are used. Whatever the criterion, most studies (at least, those based on simulation results) seem to summarise their results according to the Bayes error rate. An exception is Weiss (1991), who plots the root mean squared error rate against the mean true error rate. Some authors sidestep the problem by comparing $\hat{e}$ with $e_B$ instead of with $e$ (e.g. Page (1985), who used $|\hat{e} - e_B|$). This, of course, does not address questions relating to performance as estimators of the true error rate $e$.

3. Variance of the Leave-one-out Error Estimate


Glick's results (1978 p.218, and his Fig. 4) show 'that the estimators have standard deviations at least as great as the standard deviation of the actual success rate which they attempt to track', i.e. in terms of our notation,
$$E_T[E_{X|T}(\hat{e} - E_T(E_{X|T}\hat{e}))^2] \geq E_T[E_{X|T}(e - E_T(E_{X|T}e))^2],$$
or, equivalently, that the unconditional variance of the error rate estimate is at least as great as the unconditional variance of the true error rate itself.

This, in itself, is hardly surprising, but Glick also criticised the leave-one-out estimator on the grounds that its variance is large, a criticism that has been taken up elsewhere (e.g. Efron, 1983). However, from Section 2, we suggest that the variance that Glick studied is of limited interest. It compares $\hat{e}$ with an expected value of $\hat{e}$, but where the expectation is over a range of $e$ values (those arising from design sets generated from the specified distributions, with fixed Bayes error rates), i.e. it compares $\hat{e}$ with $E_T(E_{X|T}\hat{e})$. In our view, the expectation $E_{X|T}\hat{e}$ is of more relevance than the expectation $E_T(E_{X|T}\hat{e})$. This is the expected value of the error rate estimate for data coming from situations which all have the same (unknown) true error rate $e$. To us it seems more natural to study the variance for fixed true error rates, rather than include variability of the true error rate arising from design set sampling fluctuations. That means that, for a particular unknown true error rate, we want to look at $E_{X|T}(\hat{e} - E_{X|T}\hat{e})^2$. Of course, if we still want to summarise these measures over the range of true error rates (for a given Bayes error rate), we can aggregate by taking the expectation, yielding $E_T[E_{X|T}(\hat{e} - E_{X|T}\hat{e})^2]$.
Starting from the mean squared error $E_T[E_{X|T}(\hat{e} - e)^2]$, Glick implicitly decomposes it as
$$E_T[E_{X|T}(\hat{e} - e)^2] = E_T[E_{X|T}(\hat{e} - E_T(E_{X|T}\hat{e}))^2] - E_T[E_{X|T}(E_{X|T}\hat{e} - E_T(E_{X|T}\hat{e}))^2] + E_T[E_{X|T}(E_{X|T}\hat{e} - e)^2],$$
where the first term on the right hand side is his variance. We decompose it as
$$E_T[E_{X|T}(\hat{e} - e)^2] = E_T[E_{X|T}(\hat{e} - E_{X|T}\hat{e})^2] + E_T[E_{X|T}(E_{X|T}\hat{e} - e)^2].$$
Now, Glick's variance can be decomposed as
$$E_T[E_{X|T}(\hat{e} - E_T(E_{X|T}\hat{e}))^2] = E_T[E_{X|T}(\hat{e} - E_{X|T}\hat{e})^2] + E_T[E_{X|T}(E_{X|T}\hat{e} - E_T(E_{X|T}\hat{e}))^2],$$
i.e. into our variance plus a positive term. Hence, Glick's variance overstates the variability of the estimator.
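The last decomposition is simply the law of total variance applied to the error rate estimate, conditioning on the true error rate. The following Python check (an illustration with made-up numbers, not the paper's simulation output) confirms the identity numerically: the overall variance equals the pooled within-group variance plus the variance of the group means.

```python
# Numerical check (illustration only, with made-up numbers rather than the
# paper's simulation output) of the identity behind the decomposition:
# overall ("ERU") variance = pooled within-group ("ERC") variance
#                            + variance of the group means.
import numpy as np

rng = np.random.default_rng(1)
n_sets, per_set = 20, 50                      # 20 groups of design sets, 50 estimates each
true_e = rng.uniform(0.15, 0.30, n_sets)      # hypothetical true error rates, one per group
est = true_e[:, None] + rng.normal(0, 0.02, (n_sets, per_set))  # estimates scattered about them

eru = est.var()                               # variance ignoring the grouping (Glick's measure)
erc = est.var(axis=1).mean()                  # pooled within-group variance (our measure)
between = est.mean(axis=1).var()              # variance of the conditional means

assert np.isclose(eru, erc + between)         # Glick's variance = ours + a positive term
print(eru, erc, between)
```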

This raises the question of whether estimators criticised on the grounds of their variance, in particular the leave-one-out estimator, are in fact not as bad
as was thought. Of course, the above argument also applies to other estimators,
so comparative studies are needed in order to investigate this question. In what
follows we use the term error rate unconditional (ERU) variance to describe
the measure Glick used, and the term error rate conditional (ERC) variance to
describe the measure we propose.

4. A Simulation Study
As we have argued above, in principle, simulation studies should
(a) generate data according to fixed positions of the decision surface (where the
range of positions spans the range of possible positions and true error rates
in some reasonable way), and
(b) generate multiple design sets leading to the decision surfaces in each of these
positions.
In fact, however, this is difficult. Moreover, for our purposes we want the posi-
tions of the decision surfaces to be a random sample from the set of possible po-
sitions (given the underlying distributions) so that we can compare our measure
of variance with those used by Glick and others. Therefore we have undertaken
a two-stage simulation process in which:
(i) an initial design set is generated, by random sampling, from the specified
distributions;
(ii) this set is used to define a decision surface and then multiple design sets are
generated, each of which has this same decision surface, as detailed below.
In common with many comparative simulation studies, we investigate only the case of normally distributed variables. Although many specially-tailored error rate estimators have been proposed for such variables (see, e.g. McLachlan, 1992 Section 10.5), we here examine the behaviour of the leave-one-out procedure (which we shorten to the 'omit-one' or $O_1$ procedure) as an estimator for general situations and so do not consider those special estimators any further. The main advantage of using just normal variables is that we can specify populations concisely and calculate the true error rates easily.
As parameters to be varied in our study, we chose separation of the two populations, and number of variables observed. The former can be specified most conveniently in terms of Bayes error rate $e_B$: we chose $e_B = 0.1$ (well separated populations), $e_B = 0.2$ (moderately separated populations), and $e_B = 0.3$ (considerably overlapping populations). For the number of variables $p$, we chose $p = 5, 10$. The normal populations are completely specified by their mean vectors, $\mu_1$ and $\mu_2$, and dispersion matrices $\Sigma_1$ and $\Sigma_2$. Moreover, if the latter are assumed equal ($\Sigma_1 = \Sigma_2 = \Sigma$, say) then the linear discriminant function (LDF) is the optimal classification rule and $e_B = e_O = \Phi(-\tfrac{1}{2}\Delta)$, where $\Delta^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$ and $\Phi(u) = (1/\sqrt{2\pi})\int_{-\infty}^{u} e^{-x^2/2}\,dx$. Knowledge of $e_O$ thus determines $\Delta$, and the populations can be specified without loss of generality in the canonical form $\Sigma = I$, $\mu_1 = (\Delta, 0, \ldots, 0)^T$ and $\mu_2 = (0, \ldots, 0)^T$ for a given value of $p$.
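A minimal sketch of this canonical specification (our code, not the authors'; it assumes SciPy's norm.ppf as the inverse of $\Phi$):

```python
# Minimal sketch (ours) of the canonical parameterisation: given e_B, solve
# e_B = Phi(-Delta/2) for Delta and place the population means accordingly.
import numpy as np
from scipy.stats import norm

def canonical_populations(e_bayes, p):
    delta = -2.0 * norm.ppf(e_bayes)     # since e_B = Phi(-Delta/2)
    mu1 = np.r_[delta, np.zeros(p - 1)]  # (Delta, 0, ..., 0)
    mu2 = np.zeros(p)                    # (0, ..., 0); common covariance is the identity
    return mu1, mu2

mu1, mu2 = canonical_populations(0.2, 5)  # Delta is about 1.683 for e_B = 0.2
```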
The third parameter to be varied was the sample size for each design set; we used equal sample sizes from the two populations ($n_1 = n_2 = n$), and used $n = 25, 50, 100$ for small, medium, and large design sets. For a given setting of each of the three parameters, we generated 1000 pairs of samples of size $n$ from $N(\mu_1, I)$ and $N(\mu_2, I)$ populations using subroutine G05DDF of the NAG suite of programs. These 1000 pairs were structured into 20 sets of 50, so that all pairs within a set gave rise to the same decision surface (i.e. sample LDF) and the same error rates. This requirement is satisfied very simply, as follows.
Suppose one pair of training samples has means $\bar{x}_1, \bar{x}_2$ and pooled covariance matrix $W$. Coefficients of the sample LDF are then given by elements of the vector $c = W^{-1}(\bar{x}_1 - \bar{x}_2)$, and, with equal prior probabilities, the overall error rate for this pair is
$$e = \tfrac{1}{2}\Phi\!\left(\frac{c'\{\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) - \mu_1\}}{(c'\Sigma c)^{1/2}}\right) + \tfrac{1}{2}\Phi\!\left(\frac{c'\{\mu_2 - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\}}{(c'\Sigma c)^{1/2}}\right).$$
If a second pair of training samples has means $\bar{y}_1, \bar{y}_2$ and pooled covariance matrix $U$, then the decision surface and the true error rate are the same as those of the first pair provided that $U^{-1}(\bar{y}_1 - \bar{y}_2) = W^{-1}(\bar{x}_1 - \bar{x}_2)$ and $\bar{y}_1 + \bar{y}_2 = \bar{x}_1 + \bar{x}_2$. But the first equation implies that $\bar{y}_1 - \bar{y}_2 = UW^{-1}(\bar{x}_1 - \bar{x}_2)$, which, taken with the second, means that we must have
$$\bar{y}_1 = \tfrac{1}{2}\{(\bar{x}_1 + \bar{x}_2) + UW^{-1}(\bar{x}_1 - \bar{x}_2)\}, \qquad (1)$$
and
$$\bar{y}_2 = \tfrac{1}{2}\{(\bar{x}_1 + \bar{x}_2) - UW^{-1}(\bar{x}_1 - \bar{x}_2)\}. \qquad (2)$$
Thus to ensure that successive replicates within the set of 50 have the same decision surfaces and true error rates, we can simply generate each pair of samples in the usual way but adjust each pair so that the sample means satisfy (1) and (2) (where $\bar{x}_1, \bar{x}_2$, and $W$ are the sample values of the previous pair). This was done in our simulation study. The ERU variance is then just the variance of all 1000 estimates for each parameter combination (without regard to any grouping), while the ERC variance is the pooled within-set variance for each parameter combination.
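The following Python sketch (ours; the paper's own simulations used NAG routines) illustrates the adjustment implied by (1) and (2) and the two variance measures; all function and variable names are our own.

```python
# Sketch (ours, not the authors' NAG-based code) of the adjustment (1)-(2):
# each new pair of samples is shifted so that its sample means satisfy (1) and
# (2), and hence it yields the same sample LDF, and the same true error rate,
# as the previous pair.  ERU and ERC variances are then the overall and the
# pooled within-set variances of the resulting error rate estimates.
import numpy as np

def adjust_means(Y1, Y2, x1_bar, x2_bar, W):
    """Shift samples Y1, Y2 so that their means become (1) and (2), where U is
    the pooled covariance of (Y1, Y2) and (x1_bar, x2_bar, W) come from the
    previous pair."""
    n1, n2 = len(Y1), len(Y2)
    U = ((n1 - 1) * np.cov(Y1, rowvar=False) +
         (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)
    d = U @ np.linalg.solve(W, x1_bar - x2_bar)   # U W^{-1} (x1_bar - x2_bar)
    y1_target = 0.5 * ((x1_bar + x2_bar) + d)     # equation (1)
    y2_target = 0.5 * ((x1_bar + x2_bar) - d)     # equation (2)
    Y1 = Y1 - Y1.mean(axis=0) + y1_target         # shifting leaves the covariance unchanged
    Y2 = Y2 - Y2.mean(axis=0) + y2_target
    return Y1, Y2

def eru_erc(estimates):
    """estimates has shape (n_sets, replicates_per_set), one row per decision surface."""
    eru = estimates.var()                 # variance of all estimates, ignoring the grouping
    erc = estimates.var(axis=1).mean()    # pooled within-set variance
    return eru, erc
```

Because shifting a sample does not change its covariance matrix, the shifted pair still has pooled covariance U, so its sample means satisfy (1) and (2) exactly and its decision surface coincides with that of the previous pair.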
Table 1 summarises the results, showing, in order, the ERU and ERC variances, and their ratio ERU/ERC. The ERU variance is markedly inflated because it includes the differences between the true error rates. Other patterns evident in these results are:
• a tendency for the two variances to decrease with increasing sample size, as is to be expected, since larger samples mean more accurate estimates, and
• a tendency for the inflation (as measured by the ratio) to decrease as sample size increases, as expected because larger samples are associated with a less variable decision surface (for samples taken from given distributions).

TABLE 1
Simulation comparison of the ERU and ERC variance measures for the $O_1$ method

e_B   p    n     ERU variance   ERC variance   Ratio
0.3   5    25    0.0105         0.0020         5.1650
0.3   5    50    0.0027         0.0010         2.5865
0.3   5    100   0.0012         0.0005         2.4614
0.2   5    25    0.0092         0.0018         5.1019
0.2   5    50    0.0037         0.0009         4.2893
0.2   5    100   0.0010         0.0004         2.4614
0.1   5    25    0.0044         0.0015         2.9907
0.1   5    50    0.0023         0.0006         3.7049
0.1   5    100   0.0007         0.0003         2.2758
0.3   10   25    0.0133         0.0025         5.3036
0.3   10   50    0.0030         0.0011         2.6544
0.3   10   100   0.0021         0.0005         4.2091
0.2   10   25    0.0080         0.0022         3.7170
0.2   10   50    0.0025         0.0009         2.7381
0.2   10   100   0.0010         0.0004         2.2163
0.1   10   25    0.0073         0.0014         5.0959
0.1   10   50    0.0022         0.0006         3.4010
0.1   10   100   0.0010         0.0003         3.0557
Figure 1 plots the ERU/ERC ratios of standard deviations against the standard deviations of the true error rates for each of the specified parameter combinations. The positive correlation is quite clear.

[Figure 1: Ratios ERU s.d./ERC s.d. for the omit-one ($O_1$) method plotted against the s.d.s of the true error rates.]
This property of the ERU variance is not restricted to the $O_1$ method, so we extended the simulation study to compare this method with one of the bootstrap methods of error rate estimation. Table 2 shows corresponding results for the 632 estimator (Efron, 1983). The estimator is defined as $e_{632} = 0.368 e_R + 0.632 e_N$, where $e_R$ is the resubstitution estimate and $e_N$ is the bootstrap error rate for points not contained in the bootstrap sample. It has been found to perform well (Efron, 1983; Fitzmaurice et al., 1991). Our study used 200 bootstrap samples for each replicate. Table 2 also shows the ratio of ratios of variances for the two estimators, i.e. (ERU variance of $O_1$ / ERU variance of 632) / (ERC variance of $O_1$ / ERC variance of 632). All of these ratios are less than 1, showing that Glick's unconditional approach underestimates the poor performance of the $O_1$ method relative to the 632 estimator in terms of variance, at least for the situations we have studied. Note that the extent of this underestimation is approximately constant across the different parameter settings.
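As an illustration (ours, not the authors' implementation), a sketch of the 632 estimator follows; it reuses the make_clf convention of the earlier leave-one-out sketch, and the way the omitted cases are pooled across bootstrap samples is one common reading of the definition rather than a detail taken from the paper.

```python
# Sketch (ours) of the 632 estimator as defined above, reusing the make_clf
# convention from the earlier leave-one-out sketch.  The handling of the cases
# omitted from each bootstrap sample can be formulated in slightly different
# ways; pooling errors over all bootstrap samples, as here, is one common reading.
import numpy as np

def estimator_632(X, y, make_clf, B=200, seed=None):
    rng = np.random.default_rng(seed)
    n = len(y)
    e_r = np.mean(make_clf().fit(X, y).predict(X) != y)  # resubstitution estimate e_R
    errors, count = 0, 0
    for _ in range(B):
        idx = rng.integers(0, n, n)                      # bootstrap sample, drawn with replacement
        out = np.setdiff1d(np.arange(n), idx)            # cases not in the bootstrap sample
        if len(out) == 0 or len(np.unique(y[idx])) < 2:
            continue                                     # skip degenerate resamples
        clf = make_clf().fit(X[idx], y[idx])
        errors += np.sum(clf.predict(X[out]) != y[out])
        count += len(out)
    e_n = errors / count                                 # bootstrap rate e_N on omitted cases
    return 0.368 * e_r + 0.632 * e_n
```

The default of B = 200 matches the number of bootstrap samples used per replicate in the study reported here.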

5. Leaving-two-out as a Way to Reduce Variance of Leave-one-out


Although we have argued that previous studies have unfairly criticised the absolute variance of the $O_1$ estimator, Section 4 has produced evidence suggesting that in terms of variance this estimator is relatively even worse than was previously supposed. Here we consider a simple and obvious extension aimed at easing this problem.
The leave-one-out ($O_1$) estimate may be regarded as an average of n simple estimates, each consisting of a simple binary estimate of misclassification rate, i.e. each is an estimate of the misclassification rate of a rule almost identical to the one based on n points, but with each estimate taking only the values 0 or 1. Such estimates, with values 0 or 1, are unbiased, but they are also maximum variance estimates. Any alternative would reduce the variance. Approaches made in this direction include the average conditional error rate estimators (Glick, 1978; Kittler & Devijver, 1982; Snapinn & Knoke, 1984, 1985; Fitzmaurice & Hand, 1987) mentioned in the Introduction, and global shrinkage estimators (Hand, 1987). Such methods deliberately introduce bias, hoping that this will be compensated (in MSE terms) by a reduction in variance.

An alternative approach, which yields unbiased local estimators, is to omit more than one case at a time; for example, the leave-two-out approach (omit-two, $O_2$) estimates the error rate as the proportion of the two omitted cases which are misclassified by a classification rule built using the remaining n - 2 cases. This can be done for all $\binom{n}{2}$ pairs, so that the overall error rate estimator is based on an average of $\binom{n}{2}$ estimates, for each of which the design and test sets are independent. (Note that this differs from conventional rotation or holdout methods, which split the data into n/2 sets of size 2 and then use each of these as a test set for a classifier built using the remaining n - 2 cases.)
TABLE 2
Simulation comparison of the ERU and ERC variance measures for the 632 method

e_B   p    n     ERU variance   ERC variance   Ratio (ERU/ERC)   Ratio*
0.3   5    25    0.0072         0.0008         8.4433            0.6117
0.3   5    50    0.0019         0.0004         4.2634            0.6067
0.3   5    100   0.0009         0.0003         3.6037            0.6830
0.2   5    25    0.0073         0.0009         8.4073            0.6068
0.2   5    50    0.0031         0.0005         6.2892            0.6820
0.2   5    100   0.0008         0.0003         3.1019            0.7935
0.1   5    25    0.0033         0.0008         4.2145            0.7096
0.1   5    50    0.0019         0.0004         4.9031            3.7556
0.1   5    100   0.0009         0.0002         4.2509            0.7188
0.3   10   25    0.0074         0.0008         8.7942            0.6031
0.3   10   50    0.0018         0.0004         4.3384            0.6118
0.3   10   100   0.0016         0.0002         7.1632            0.5876
0.2   10   25    0.0047         0.0008         5.7324            0.6484
0.2   10   50    0.0017         0.0004         3.9704            0.6896
0.2   10   100   0.0007         0.0002         2.9659            0.7472
0.1   10   25    0.0046         0.0006         7.7738            0.6555
0.1   10   50    0.0017         0.0003         5.2498            0.6478
0.1   10   100   0.0006         0.0002         2.7334            0.8326

* (ERU variance of O_1 / ERU variance of 632) / (ERC variance of O_1 / ERC variance of 632)
In terms of variance, such an approach can be expected to gain on two counts: (i) instead of being binary estimates, its components can take values 0, ½, or 1; (ii) instead of averaging over only n components, the average is over $\binom{n}{2}$. In terms of bias, however, we should expect a deterioration: each component is now based on a rule which is less 'almost identical' to the rule based on all n points. This strategy can be extended to leave-i-out, with matching reduction in variance but increase in bias.
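The following sketch (ours) shows the omit-two estimator just described, again using the make_clf convention of the earlier sketches.

```python
# Sketch (ours) of the omit-two (O_2) estimator described above: every one of
# the n-choose-2 pairs is withdrawn in turn, a rule is built on the remaining
# n - 2 cases, and the pair contributes 0, 1/2 or 1 to the average.
import numpy as np
from itertools import combinations

def leave_two_out_error(X, y, make_clf):
    n = len(y)
    total_errors, total_cases = 0, 0
    for i, j in combinations(range(n), 2):
        keep = np.ones(n, dtype=bool)
        keep[[i, j]] = False
        clf = make_clf().fit(X[keep], y[keep])
        total_errors += np.sum(clf.predict(X[[i, j]]) != y[[i, j]])
        total_cases += 2
    return total_errors / total_cases    # equals the average of the per-pair proportions
```

Note that this requires fitting the rule n(n-1)/2 times, so it is considerably more expensive computationally than the omit-one method.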
Table 3 shows the performance of the leave-two-out ($O_2$) approach relative to the $O_1$ and 632 methods. As expected, the $O_2$ method yields a slight variance reduction relative to the $O_1$ method, but not enough to make it a good competitor, in terms of variance, for the 632 method.

TABLE 3
Simulated ERC variance measures for the $O_1$, $O_2$ and 632 methods

e_B   p    n     O_1      O_2      632
0.3   5    25    0.0020   0.0018   0.0008
0.3   5    50    0.0010   0.0010   0.0004
0.3   5    100   0.0005   0.0005   0.0003
0.2   5    25    0.0018   0.0016   0.0009
0.2   5    50    0.0009   0.0008   0.0005
0.2   5    100   0.0004   0.0004   0.0003
0.1   5    25    0.0015   0.0014   0.0008
0.1   5    50    0.0006   0.0006   0.0004
0.1   5    100   0.0003   0.0003   0.0002
0.3   10   25    0.0025   0.0022   0.0008
0.3   10   50    0.0011   0.0011   0.0004
0.3   10   100   0.0005   0.0005   0.0002
0.2   10   25    0.0022   0.0020   0.0008
0.2   10   50    0.0009   0.0009   0.0004
0.2   10   100   0.0004   0.0004   0.0002
0.1   10   25    0.0014   0.0013   0.0006
0.1   10   50    0.0006   0.0006   0.0003
0.1   10   100   0.0003   0.0003   0.0002

6. Discussion
The problems of estimating the future misclassification rate of a classifi-
cation rule, using only the data which are available to design the rule, have
provided the field on which many advanced statistical techniques were first ex-
plored in depth; for example, the leave-one-out method (Lachenbruch, 1965) and
the bootstrap method (Efron, 1983). Because of the complexity of the estimators
involved, simulation studies frequently supplement theoretical analysis, typically
generating data from specified distributions. Since the underlying distributions
are known in simulation studies, the Bayes error rate is also known and is some-
times used to index the performance of the resulting rule. However, there is a
danger that this can give a misleading impression of the performance of the error
rate estimator. Different data sets generated from fixed distributions give rise
to different decision surfaces, and these have different true error rates. Hence,
a summary of the results according to the Bayes error rate is an average over
different true error rates.
Different realizations of the allocation rule, and hence different true error
rates, are irrelevant in any practical application: one particular realization has
been obtained, with an (unknown) true error rate, which is the error rate for
which the practitioner seeks the most accurate estimate. There is a subset of the
sample space which gives rise to that particular allocation rule, so only elements
of that subset are relevant when examining the accuracy of any estimator of that
rule’s true error rate. By extension, in any simulation study a proper comparison
among error rate estimators should be based on indexing the results according
to the true error rates of the classification rules.
Above we have explored the effect of using the incorrect summary by studying its impact on the $O_1$ and the 632 bootstrap methods of error rate estimation.
The ratio of the incorrect measure of variance (including the variance attributable
to differences between the true error rates generated from different design sets)
to the correct measure of variance (conditioning on fixed true error rate) can
be substantial, involving factors of up to 6 for the situations in our simulations.
This means that conclusions drawn from such studies are at risk of being quite
wrong: they will be substantially influenced by the variation in true error rate.
We have also compared the performance of the $O_1$ method with that of the 632 method, as measured by the incorrect and correct methods. Our results suggest that the $O_1$ method is even worse than had been expected, in terms of its 'fickle fluctuation', to use Glick's words.
Motivated by this, we have explored a simple extension of the $O_1$ method aimed at reducing its (correct measure of) variance. While this extension, the $O_2$ method, leads to a performance superior to that of the $O_1$ method, it remains inferior to the 632 estimator.

References
EFRON, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc. 78, 316-331.
FITZMAURICE, G.M. & HAND, D.J. (1987). A comparison of two average conditional error rate estimators. Pattern Recognition Lett. 6, 221-224.
FITZMAURICE, G.M., KRZANOWSKI, W.J. & HAND, D.J. (1991). A Monte Carlo study of the 632 bootstrap estimator of error rate. J. Classification 8, 239-250.
GLICK, N. (1978). Additive estimators for probabilities of correct classification. Pattern Recognition 10, 211-222.
HAND, D.J. (1986). Recent advances in error rate estimation. Pattern Recognition Lett. 4, 335-346.
HAND, D.J. (1987). A shrunken leaving-one-out estimator of error rate. Comput. Math. Appl. 14, 161-167.
HORA, S.C. & WILCOX, J.B. (1982). Estimation of error rates in several-population discriminant analysis. J. Marketing Res. 19, 57-61.
KITTLER, J. & DEVIJVER, P.A. (1982). Statistical properties of error estimators in performance assessment of recognition systems. IEEE Trans. Pattern Anal. Machine Intelligence 4, 215-220.
LACHENBRUCH, P.A. (1965). Estimation of Error Rates in Discriminant Analysis. Unpublished PhD Thesis, University of California, Los Angeles.
McLACHLAN, G.J. (1987). Error rate estimation in discriminant analysis: recent advances. In Advances in Multivariate Statistical Analysis, ed. A.K. Gupta, pp. 233-252. Dordrecht: D. Reidel.
McLACHLAN, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley.
PAGE, J.T. (1985). Error-rate estimation in discriminant analysis. Technometrics 27, 189-198.
PAWLAK, M. (1988). On the asymptotic properties of smoothed estimators of the classification error rate. Pattern Recognition 21, 515-524.
SNAPINN, S.M. & KNOKE, J.D. (1984). Classification error rate estimators evaluated by unconditional mean squared error. Technometrics 26, 371-378.
SNAPINN, S.M. & KNOKE, J.D. (1985). An evaluation of smoothed classification error rate estimators. Technometrics 27, 199-206.
SNAPINN, S.M. & KNOKE, J.D. (1989). Estimation of error rates in discriminant analysis with selection of variables. Biometrics 45, 289-299.
TOUSSAINT, G.T. (1974). Bibliography on estimation of misclassification. IEEE Trans. Inform. Theory 20, 472-479.
WEISS, S.M. (1991). Small sample error rate estimation for k-NN classifiers. IEEE Trans. Pattern Anal. Machine Intelligence 13, 285-289.
