Formulation and comparison of multi-class ROC surfaces
Jonathan E. Fieldsend
J.E.Fieldsend@exeter.ac.uk
Richard M. Everson
R.M.Everson@exeter.ac.uk
Department of Computer Science, University of Exeter, Exeter, EX4 4QF, UK.
Abstract

The Receiver Operating Characteristic (ROC) has become a standard tool for the analysis and comparison of classifiers when the costs of misclassification are unknown. There has been relatively little work, however, examining ROC for more than two classes. Here we define the ROC surface for the Q-class problem in terms of a multi-objective optimisation problem in which the goal is to simultaneously minimise the Q(Q − 1) misclassification rates, when the misclassification costs and parameters governing the classifier's behaviour are unknown. We present an evolutionary algorithm to locate the optimal trade-off surface between misclassifications of different types. The performance of the evolutionary algorithm is illustrated on a synthetic three class problem. In addition the use of the Pareto optimal surface to compare classifiers is discussed, and we present a straightforward multi-class analogue of the Gini coefficient. This is illustrated on synthetic and standard machine learning data.
1. Introduction
Classification or discrimination of unknown exemplars into two or more classes, based on a 'training' dataset of examples whose classification is known, is one of the fundamental problems in supervised pattern recognition. Given a classifier that yields estimates of the exemplar's probability of belonging to each of the classes, and when the relative costs of misclassification are known, it is straightforward to determine the decision rule that minimises the average cost of misclassification. If the costs of misclassification are equal
and there is no penalty for a correct classification then
the optimal rule becomes: assign to the class with the
highest posterior probability. In practical situations,
however, the true costs of misclassification are unequal
and frequently unknown or difficult to determine (e.g.
(Adams & Hand, 1999; Bradley, 1997)). In such cases
the practitioner must either guess the misclassification
costs or explore the trade-off in classification rates as
the decision rule is varied.
Receiver Operating Characteristic (ROC) analysis
provides a convenient graphical display of the trade-off
between true and false positive classification rates for
two class problems (Provost & Fawcett, 1997). Since
its introduction in the medical and signal processing
literatures (Hanley & McNeil, 1982) ROC analysis has
become a prominent method for selecting an operating
point; see (Flach et al., 2003) and (Hernández-Orallo et al., 2004) for recent overviews of methodologies and applications.
In this paper we extend the spirit of ROC analysis to multi-class problems by considering the trade-offs between the misclassification rates from one class into each of the other classes. Rather than considering the true and false positive rates, we consider
the multi-class ROC surface to be the solution of the
multi-objective optimisation problem in which these
misclassification rates are simultaneously optimised.
Srinivasan (1999) has discussed a similar formulation
of multi-class ROC, showing that if classifiers for Q
classes are considered to be points with coordinates
given by their Q(Q − 1) misclassification rates, then
optimal classifiers lie on the convex hull of these points.
Here we describe the surface in terms of Pareto optimality and in section 3 we give an evolutionary algorithm for locating the optimal ROC surface when the
classifier’s parameters may be adjusted as part of the
optimisation.
ROC analysis is frequently used for evaluating and
comparing classifiers in terms of the area under the
ROC curve (AUC) or, equivalently, the Gini coefficient. Although the straightforward analogue of the
AUC is unsuitable for more than two classes, in section 5 we develop a straightforward generalisation of
the Gini coefficient which quantifies the superiority of
a classifier’s performance to random allocation.
2. ROC Analysis
Here we describe the straightforward extension of ROC
analysis to more than two classes (multi-class ROC)
and draw some comparisons with the two class case.
In general a classifier seeks to allocate an exemplar or measurement x to one of a number of classes. Allocation of x, whose true class is Ck, to the incorrect class, say Cj, usually incurs some, often unknown, cost denoted by λkj; we count the cost of a correct classification as zero: λkk = 0 (see (Elkan, 2001) for a nice discussion of the general case).
Denoting the probability of assigning an exemplar to Cj when its true class is, in fact, Ck as p(Cj | Ck), the overall risk or expected cost is

    R = \sum_{k,j} \lambda_{kj} \, p(C_j \,|\, C_k) \, \pi_k    (1)
where πk is the prior probability of Ck. The performance of some particular classifier may be conveniently summarised by a confusion matrix or contingency table, Ĉ, which summarises the results of classifying a set of examples. Each entry Ĉkj of the confusion matrix gives the number of examples, whose true class was Ck, that were actually assigned to Cj. Normalising the confusion matrix so that each row sums to unity gives the confusion rate matrix, C, whose entries are estimates of the misclassification probabilities: p(Cj | Ck) ≈ Ckj. Thus the expected risk is estimated as
    R = \sum_{k,j} \lambda_{kj} \, C_{kj} \, \pi_k.    (2)

A slightly different perspective is gained by writing the expected risk in terms of the posterior probabilities of classification to each class. The conditional risk or average cost of assigning x to Cj is

    R(C_j \,|\, x) = \sum_k \lambda_{kj} \, p(C_k \,|\, x)    (3)

where p(Ck | x) is the posterior probability that x belongs to Ck. The expected overall risk is

    R = \int R(C_j \,|\, x) \, p(x) \, dx.    (4)

The expected risk is then minimised, being equal to the Bayes risk, by assigning x to the class with the minimum conditional risk (e.g. (Duda & Hart, 1973)). Choosing 'zero-one costs', λkj = 1 − δkj, means that all misclassifications are equally costly and the conditional risk of assigning to Cj is one minus the posterior probability of Cj; one thus assigns to the class with the greatest posterior probability, which minimises the overall error rate.

If costs are known, it is straightforward to make classifications that achieve the Bayes risk (provided, of course, that the classifier yields accurate assessments of the posterior probabilities p(Ck | x)). However, costs are frequently unknown and difficult to estimate, particularly when there are many classes; in this case it is useful to be able to compare the classification rates as the costs vary. For binary classification the conditional risk may be simply rewritten in terms of the posterior probability of assigning to C1, resulting in the rule: assign x to C1 if p(C1 | x) > t = λ21/(λ21 + λ12). This classification rule reveals that there is, in fact, only one degree of freedom in the binary cost matrix and, as might be expected, the entire range of classification rates for each class can be swept out as the classification threshold t varies from 0 to 1. It is this variation of rates that the ROC curve exposes for binary classifiers. ROC analysis focuses on the classification of one particular class, say C1, and plots the true positive classification rate for C1 versus the false positive rate as the threshold t or, equivalently, the ratio of misclassification costs is varied.

If more than one classifier is available (often produced by altering the parameters, w, of a particular classifier) then it can be shown that the convex hull of the ROC curves for the individual classifiers is the locus of optimum performance for that set of classifiers.

Frequently in two class problems the focus is on a single class, for example, whether a set of medical symptoms is to be classified as benign or dangerous, so the ROC analysis practice of plotting true and false positive rates for a single class is helpful. Also, since there are only three degrees of freedom in the binary confusion matrix, classification rates for the other class are easily inferred. Indeed, the confusion rate matrix, C, has only two degrees of freedom for binary problems. Focusing on one particular class is likely to be misleading when more than two classes are available for assignment. We therefore concentrate on the misclassification rates of each class to the others. In terms of the confusion rate matrix C we consider the off-diagonal elements, the diagonal elements (i.e., the true positives) being determined by the off-diagonal elements since each row sums to unity.

With Q classes there are D = Q(Q − 1) degrees of freedom in the confusion rate matrix and it is desirable to simultaneously minimise all the misclassification rates represented by these. For most problems, as for the binary problem, simultaneous optimisation will clearly be impossible and some compromise between the various misclassification rates will have to be found. Knowledge of the costs makes this determination simple, but if the costs are unknown we propose to use multi-objective optimisation to discover the optimal trade-offs between the misclassification rates.
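To make the quantities of this section concrete, the following minimal sketch (Python with NumPy; our illustration, not code from the paper) forms the confusion rate matrix and evaluates the risk estimate of equation (2):

    import numpy as np

    def confusion_rate_matrix(y_true, y_pred, Q):
        """Row-normalised confusion matrix: C[k, j] estimates p(C_j | C_k)."""
        counts = np.zeros((Q, Q))
        for k, j in zip(y_true, y_pred):
            counts[k, j] += 1
        # Normalise each row (true class) to sum to unity.
        return counts / counts.sum(axis=1, keepdims=True)

    def expected_risk(C, costs, priors):
        """Estimated risk of equation (2): sum_kj lambda_kj C_kj pi_k."""
        return np.sum(costs * C * priors[:, None])

    # Toy usage: three classes, zero-one costs.
    y_true = np.array([0, 0, 1, 1, 2, 2])
    y_pred = np.array([0, 1, 1, 1, 2, 0])
    C = confusion_rate_matrix(y_true, y_pred, Q=3)
    costs = 1.0 - np.eye(3)              # lambda_kj = 1 - delta_kj
    priors = np.bincount(y_true) / len(y_true)
    print(expected_risk(C, costs, priors))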
In general we will consider locating the optimal ROC
surface as a function of the classifier parameters, w,
as well as the costs. For notational convenience and
because they are treated as a single entity, we write
the cost matrix λ and parameters as a single vector
of generalised parameters, θ = {λ, w}; to distinguish
θ from the classifier parameters w we use the optimisation terminology decision vectors to refer to θ.
The D misclassification rates are functions (depending on the particular classifier) of the decision vectors, thus Ckj = Ckj(θ). The optimal trade-off between the misclassification rates is thus defined by the minimisation problem: minimise Ckj(θ) ∀ k, j, k ≠ j.

If all the misclassification rates for one classifier with decision vector θ are no worse than the classification rates for another classifier φ, and at least one rate is better, then the classifier parameterised by θ is said to strictly dominate that parameterised by φ. Thus θ strictly dominates φ (denoted θ ≺ φ) iff Ckj(θ) ≤ Ckj(φ) ∀ k, j, k ≠ j, and Ckj(θ) < Ckj(φ) for some k, j, k ≠ j. Less stringently, θ weakly dominates φ (denoted θ ⪯ φ) iff Ckj(θ) ≤ Ckj(φ) ∀ k, j, k ≠ j.

A set E of decision vectors is said to be non-dominated if no member of the set is dominated by any other member: θ ⊀ φ ∀ θ, φ ∈ E.
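These definitions translate directly into code. A minimal sketch (helper names are ours), operating on vectors holding the D off-diagonal misclassification rates:

    import numpy as np

    def strictly_dominates(c_theta, c_phi):
        """True if theta strictly dominates phi: every rate no worse,
        at least one strictly better."""
        return np.all(c_theta <= c_phi) and np.any(c_theta < c_phi)

    def weakly_dominates(c_theta, c_phi):
        """True if theta weakly dominates phi: every rate no worse."""
        return np.all(c_theta <= c_phi)

    # c_theta and c_phi hold the D = Q(Q-1) off-diagonal rates.
    c_theta = np.array([0.1, 0.2, 0.0, 0.3, 0.1, 0.2])
    c_phi   = np.array([0.1, 0.3, 0.1, 0.3, 0.2, 0.2])
    print(strictly_dominates(c_theta, c_phi))   # True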
A solution to the minimisation problem is thus Pareto optimal if it is not dominated by any other feasible solution, and the set of all Pareto optimal solutions is known as the Pareto front. Recent
years have seen the development of a number of evolutionary techniques based on dominance measures for
locating the Pareto front; see (Deb, 2001) for a recent
review. Kupinski and Anastasio (1999) and Anastasio
et al. (1998) introduced the use of multi-objective
evolutionary algorithms (MOEAs) to optimise ROC
curves for binary problems, illustrating the method on
a synthetic data set and for medical imaging problems;
and we have used a similar methodology for locating
optimal ROC curves for safety-related systems (Fieldsend & Everson, 2004; Everson & Fieldsend, 2006).
In the following section we describe a straightforward
evolutionary algorithm for locating the Pareto front
for multi-class problems. We illustrate the method
Algorithm 1 Multi-objective evolution scheme for ROC surfaces.

Inputs:
  T     Number of generations
  Nλ    Number of costs to sample

 1: E := initialise()
 2: for t := 1 : T
 3:     {w, λ} = θ := select(E)
 4:     w′ := perturb(w)
 5:     for i := 1 : Nλ
 6:         λ′ := sample()
 7:         C := classify(w′, λ′)
 8:         θ′ := {w′, λ′}
 9:         if φ ⋠ θ′ ∀φ ∈ E
10:             E := {φ ∈ E | θ′ ⊀ φ}
11:             E := E ∪ {θ′}
12:         end
13:     end
14: end
on a synthetic problem for two different classification
models in section 4.
3. Locating multi-class ROC surfaces
Here we describe a straightforward algorithm for locating the Pareto front for multi-class ROC problems using an analogue of mutation-based evolution.
The procedure is based on the Pareto Archive Evolutionary Strategy (PAES) introduced by Knowles and
Corne (2000). In outline, the algorithm maintains a
set or archive E, whose members are mutually nondominating, which forms the current approximation
to the Pareto front. As the computation progresses
members of E are selected, copied and their decision
vectors perturbed, and the objectives corresponding
to the perturbed decision vector evaluated; if the perturbed solution is not dominated by any element of E,
it is inserted into E and any members of E which are
dominated by the new entrant are removed. It is clear,
therefore, that the archive can only move towards the
Pareto front: it is in essence a greedy search where the
archive E is the current point of the search and perturbations to E that are not dominated by the current
E are always accepted.
Algorithm 1 describes the procedure in more detail.
The archive E is initialised by evaluating the misclassification rates for a number (here 100) of randomly chosen parameter values and costs, and discarding those
which are dominated by another element of the initial
set. Then at each generation a single element, θ, is selected from E (line 3 of Algorithm 1); selection may
be uniformly random, but partitioned quasi-random
selection (PQRS) (Fieldsend et al., 2003) was used
here to promote exploration of the front. PQRS increases the efficiency and range of the search by preventing clustering of solutions in a particular region
of the front which would otherwise bias the search because they would be selected more frequently.
The selected parent decision vector is copied, after
which the costs λ and classifier parameters w are
treated separately. The parameters w of the classifier
are perturbed or, in the nomenclature of evolutionary algorithms, mutated to form a child, w′ (line 4).
Here we seek to encourage wide exploration of parameter space by perturbing each of the parameters with a random number δ drawn from a heavy-tailed distribution (such as the Laplacian density, p(δ) ∝ e^{−|δ|}). The Laplacian distribution has tails that decay relatively slowly, ensuring that there is a high probability of exploring regions distant from the current solutions, facilitating escape from local minima (Yao et al., 1999).
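For instance, the mutation of line 4 might be sketched as follows; the Laplacian scale is a hypothetical choice, as the paper does not specify one:

    import numpy as np

    rng = np.random.default_rng(0)

    def perturb(w, scale=0.1):
        """Line 4 of Algorithm 1: Laplacian mutation of every parameter.
        The scale is our assumption, not a value from the paper."""
        return w + rng.laplace(loc=0.0, scale=scale, size=w.shape)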
With a proposed parameter set w′ on hand the procedure then investigates the misclassification rates as
the costs are varied with fixed parameters. In order
to do this we generate Nλ sample costs λ′ and evaluate the misclassification rates for each of them. Since the misclassification costs are non-negative and sum to unity, a straightforward way of producing samples is to draw them from a Dirichlet distribution:

    p(\lambda) = \mathrm{Dir}(\lambda \,|\, \alpha_1, \ldots, \alpha_i, \ldots, \alpha_D)    (5)

where the index i labels the D = Q(Q − 1) off-diagonal entries in the cost matrix. Samples from a Dirichlet density lie on the simplex \sum_{kj} \lambda_{kj} = 1. The αkj ≥ 0 determine the density of the samples; since we have no preference for particular costs here, we set all the αkj = 1 so that the simplex (that is, cost space) is sampled uniformly with respect to Lebesgue measure.
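In NumPy this sampling is a single call; a minimal sketch that also rebuilds the Q × Q cost matrix with zero diagonal:

    import numpy as np

    Q = 3
    D = Q * (Q - 1)
    rng = np.random.default_rng(0)

    # alpha = 1 everywhere samples the cost simplex uniformly;
    # unequal alphas would focus the search on preferred cost ratios.
    lam = rng.dirichlet(np.ones(D))       # D off-diagonal costs, summing to one

    costs = np.zeros((Q, Q))              # cost matrix with lambda_kk = 0
    costs[~np.eye(Q, dtype=bool)] = lam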
Each cost sample λ′, together with the classifier parameters w′, is used to make class assignments for each example in the given dataset (line 7). Usually this step consists of merely modifying the posterior probabilities p(Ck | x) to find the assignment with the minimum expected cost, and it is therefore computationally inexpensive as the probabilities need only be computed once for each w′. The misclassification rates Ckj(θ′) (k ≠ j) comprise the objective values for the decision vector θ′ = {w′, λ′}; decision vectors that are not dominated by members of the archive E are inserted into E (line 11) and any decision vectors in E that are dominated by the new entrant are removed (line 10). We remark that this algorithm, unlike the original PAES algorithm, uses an archive whose size is unconstrained, permitting better convergence (Fieldsend et al., 2003).
4. Illustrations
In this section we illustrate the performance of the
evolutionary algorithm on synthetic data, which is
readily understood. Subsequently we give results for
a number of standard multi-class problems. We use two relatively simple classifiers, the multinomial logistic regression classifier and the probabilistic k-nearest neighbour classifier.
4.1. Synthetic data
In order to gain an understanding of the Pareto optimal ROC surface for multi-class classification, we extend a two-dimensional, two-class synthetic data set devised by Ripley (1994) by adding Gaussian components corresponding to a third class. The resulting data set comprises 3 classes, the conditional density for each being a mixture of two Gaussians.¹
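For reference, the data set described in footnote 1 can be regenerated with a few lines of NumPy (a sketch; the random seed and sampling order are our choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Component means from footnote 1: two components per class,
    # covariance 0.3 I, equal mixing weights 1/6, 300 samples in total.
    means = {
        0: [(0.7, 0.3), (0.3, 0.3)],
        1: [(-0.7, 0.7), (0.4, 0.7)],
        2: [(1.0, 1.0), (0.0, 1.0)],
    }

    X, y = [], []
    for _ in range(300):
        cls = rng.integers(3)                   # classes equally likely
        mu = means[cls][rng.integers(2)]        # one of the two components
        X.append(rng.multivariate_normal(mu, 0.3 * np.eye(2)))
        y.append(cls)
    X, y = np.array(X), np.array(y)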
4.2. Multinomial logistic regression
The functional form of the multinomial logistic regression classifier is:

    p(C_j \,|\, x, \alpha, \beta) = \frac{e^{\alpha_j + x^T \beta_j}}{\sum_{i=1}^{Q} e^{\alpha_i + x^T \beta_i}}    (6)

where βj is a vector of feature coefficients for class j and αj is a single bias (for each class). Therefore w consists of these Q sets of βj and αj.
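In code, equation (6) is a softmax over affine scores; a minimal NumPy sketch (array shapes are our assumption):

    import numpy as np

    def mlr_posteriors(x, alpha, beta):
        """Equation (6): softmax posteriors. alpha has shape (Q,),
        beta has shape (Q, d) for d input features."""
        logits = alpha + beta @ x          # alpha_j + x^T beta_j
        logits -= logits.max()             # subtract max for stability
        e = np.exp(logits)
        return e / e.sum()

    # Toy usage: Q = 3 classes, d = 2 features.
    rng = np.random.default_rng(0)
    p = mlr_posteriors(np.array([0.5, 1.0]),
                       rng.normal(size=3), rng.normal(size=(3, 2)))
    print(p, p.sum())                      # posteriors sum to one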
To discover the Pareto optimal ROC surface, the optimisation algorithm was run for T = 5000 proposed
parameter values, with Nλ = 100, resulting in an estimated Pareto front comprising approximately 9000
mutually non-dominating parameter and cost combinations; we judge that the algorithm is very well converged and obtain very similar results by permitting
the algorithm to run for only T = 2000 generations.
The left panel of Figure 1 shows the decision regions that yield the smallest total misclassification error, 40/300.

¹ Covariance matrices for all the components were isotropic: Σj = 0.3I. Denoting by μji, i = 1, 2, the means of the two Gaussian components generating samples for class j, the centres were located at: μ11 = (0.7, 0.3)ᵀ, μ12 = (0.3, 0.3)ᵀ, μ21 = (−0.7, 0.7)ᵀ, μ22 = (0.4, 0.7)ᵀ, μ31 = (1.0, 1.0)ᵀ and μ32 = (0.0, 1.0)ᵀ. Each component had equal mixing weight 1/6. The 300 samples used here, together with the equal cost Bayes optimal decision boundaries, are shown in Figure 1.
Figure 1. Decision regions for various multinomial logistic regression classifiers on multi-class ROC surface. Grey scale
background shows the class to which a point would be assigned. Black lines show the ideal equal-cost decision boundary.
Symbols show actual training data. Left: Parameters corresponding to minimum total misclassification error on the
training data. Middle: Decision regions corresponding to the minimum C21 and C23 and conditioned on this, minimum
C31 and C13 . Right: Decision regions corresponding to minimising C12 and C32 .
Decision regions for this parameterisation are not tightly fitted to the Bayes optimal ones, which reflects the relative inflexibility of the particular classifier, rather than a problem with the training process.
By contrast with the decision regions which are optimal for roughly equal costs, the middle and right panels of Figure 1 show decision regions for imbalanced
costs. The middle panel shows decision regions corresponding to minimising C21 and C23 : this, of course,
can be achieved by setting λ21 and λ23 to be large,
so that every C2 example (triangle) is correctly classified, no matter what the cost. For these data there
are many decision regions correctly classifying every C2
and we display the decision regions that also minimise
C31 and C13 . For these data, it is possible to make
C31 = C13 = 0 because C1 and C3 are adjacent only
along a boundary distant from C2 points; such complete minimisation will in general not be possible. Of
course, the penalty to be paid for minimising the C2
rates together with C31 and C13 is that C32 and C12
are large.
The right panel of Figure 1 shows the reverse situation:
here the costs for misclassifying either C1 or C3 as C2
are high. With these data, although not in general, of
course, it is possible to reduce C12 and C32 to zero,
as shown by the decision regions which ensure that C2
examples are only classified correctly when it does not
result in incorrect assignment of the other two classes
to C2 . In this case the greatest misclassification rate is
C23 (triangles as crosses).
It should be emphasised that the evolutionary algorithm has explored a wide range of cost and parameter combinations on the Pareto optimal ROC surface.
Values of each λkj on the front range from below 10^{−4} to above 0.79, all having means of approximately 1/D = 1/6, providing assurance that a complete range of costs is being explored by the algorithm.
One way to view misclassification costs when Q = 3 is
to look at the trade-off surface for minimising all misclassifications into each class, that is the false positive
rate for each class. We thus minimise the Q objectives:

    F_k(w, \lambda) = \sum_{j \neq k} C_{kj}, \qquad k = 1, \ldots, Q.    (7)
We call this front the ‘false positive rate front’.
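Given the row-stochastic confusion rate matrix C of section 2, these Q objectives are a one-line projection of the full D-dimensional objective vector; a minimal NumPy sketch (the function name is ours):

    import numpy as np

    def false_positive_rates(C):
        """Equation (7): F_k = sum over j != k of C_kj, for k = 1..Q."""
        return C.sum(axis=1) - np.diag(C)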
The false positive rate Pareto front is easily visualised
(at least for three class problems), but clearly information on exactly how misclassifications are made is
lost. However, the full D-dimensional Pareto surface
may usefully be viewed in ‘false positive space’. Figure
2 shows the solutions on the estimated Pareto front obtained using the full Q(Q − 1) objectives for the multinomial logistic regression classifier, but each solution
is plotted at the coordinate given by the Q = 3 false
positive rates (7), with the greyscale denoting the class
into which the greatest number of misclassifications are
made. Although the solutions obtained by directly optimising the false positive rates clearly lie on the full
Pareto surface (in Q(Q − 1) dimensions) the converse
is not true and the projections into false positive space
do not form a surface. Nonetheless, at least for these
data, they lie close to a surface, which aids visualisation and navigation of the full Pareto front. The
relation between the solutions on the full Pareto front
and the false positive rate front is made more precise
as follows. If E is a set of Q(Q − 1)-dimensional solutions lying in the full Pareto front, let EQ be the set
of Q-dimensional vectors representing the false positive coordinates of elements of E. The extremal set of
non-dominated elements of EQ is
    \tilde{E}_Q = \{ f \in E_Q \mid f' \nprec f \;\; \forall f' \in E_Q \}.    (8)
Figure 3. Illustration of the G and δ measures where Q =
2. Shaded area denotes G(A), horizontally hatched area
denotes δ(A, B), vertically hatched area denotes δ(B, A).
Figure 2. The estimated Pareto front for synthetic data
classified with a multinomial logistic regression classifier
viewed in false positive space. Axes show the false positive
rates for each class and different greyscales represent the
class into which the greatest number of misclassifications
are made. (Points better than random shown.)
Then solutions in ẼQ also lie in the false positive rate
front. Other more sophisticated methods for visualising Pareto fronts in the Q > 2 situation are described
in (Everson & Fieldsend, 2005).
5. Comparing classifiers
In two class problems the area under the ROC curve is
often used to compare classifiers. As clearly explained
by Hand and Till (2001), the AUC measures a classifier’s ability to separate two classes over the range of
possible costs and is linearly related to the Gini coefficient. In this section we compare the multinomial
logistic regression and k-nn classifiers using a measure
based on the volume dominated by the Pareto optimal
ROC surface. We draw attention to Ferri et al. (2003)
who give another view of the volume under multi-class
ROC surfaces.
By analogy with the AUC, we might use the volume of the Q(Q − 1)-dimensional hypercube that is
dominated by elements of the ROC surface for classifier A as a measure of A’s performance. In binary
and multi-class problems alike its maximum value is
1 when A classifies perfectly. If the classifier allocates at random, the ROC surface is the simplex in
Q(Q − 1)-dimensional space with vertices at distance
Q − 1 along each coordinate vector. The volume of the unit hypercube not dominated by this simplex is

    \frac{1}{[Q(Q-1)]!} \left[ (Q-1)^{Q(Q-1)} - Q(Q-1)(Q-2)^{Q(Q-1)} \right];

a full derivation is provided in Everson and Fieldsend (2005). It corresponds to the portion of the pyramidal region below the simplex in the Q(Q − 1)-dimensional hypercube with sides of length (Q − 1) that also lies in the unit hypercube; we denote this truncated pyramidal region by P. When Q = 2 the volume (area) is just 1/2, corresponding to the area under the diagonal in a conventional ROC plot.² However, when
Q > 2, the volume not dominated by the random allocation simplex is very small; even when Q = 3, the volume not dominated is ≈ 0.0806. We therefore define G(A) to be the analogue of the Gini coefficient in two dimensions, namely the proportion of the volume of P, the region of the unit hypercube not dominated by the simplex defined by random allocation, that is dominated by elements of the ROC surface (as illustrated by the shaded area in Figure 3 for the Q = 2 case). In binary classification problems this corresponds to twice the area between the ROC curve and the diagonal. In multi-class problems G(A) quantifies how much better A is than random allocation. It can be simply estimated by Monte Carlo sampling of this volume in the unit hypercube.
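A Monte Carlo estimate of G can be sketched as follows, using the fact that a point of the unit hypercube is not dominated by the random-allocation simplex exactly when its coordinates sum to less than Q − 1 (names and sample size are ours):

    import numpy as np

    def gini_mc(archive, Q, n=50_000, seed=1):
        """Monte Carlo estimate of G(A). archive is an (m, D) array of
        off-diagonal misclassification rates on the ROC surface of A."""
        rng = np.random.default_rng(seed)
        D = Q * (Q - 1)
        u = rng.uniform(size=(n, D))
        # Samples in P: not dominated by the random-allocation simplex.
        in_P = u.sum(axis=1) < Q - 1
        # A sample is dominated by A if some archive point is <= it everywhere.
        dominated = np.array([np.any(np.all(archive <= x, axis=1)) for x in u])
        return (dominated & in_P).sum() / in_P.sum()

δ(A, B) may be estimated in the same way, by counting the samples in P dominated by A's surface but not by B's.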
² Although binary ROC plots are usually made in terms of true positive rates versus false positive rates for one class, the false positive rate for the other class is just 1 minus the true positive rate for the other class.

If every point on the optimal ROC surface for classifier A is dominated by a point on the ROC surface for classifier B, then classifier B has a superior performance to classifier A. In general, however, neither
ROC surface will completely dominate the other: regions of A’s surface will be dominated by B and vice
versa; in binary problems this corresponds to ROC
curves that cross. To quantify the classifiers’ relative
performance we therefore define δ(A, B) to be the volume of P that is dominated by elements of A and not
by elements of B (marked in Figure 3 with horizontal
lines). Note that δ(A, B) is not a metric; although it
is non-negative, it is not symmetric. Also, if A and B are subsets of the same non-dominated set W (i.e., A ⊆ W and B ⊆ W), then δ(A, B) and δ(B, A) may
have a range of values depending on their precise composition (Fieldsend et al., 2003). Situations like this
are rare in practice, however, and measures like δ have
proved useful for comparing Pareto fronts.
5.1. Probabilistic k-nn classifiers
One of the most popular methods of statistical classification is the k-nearest neighbour model (k-nn). The method is essentially geometrical, assigning the class of an unknown exemplar to the class of the majority of its k nearest neighbours in some training data. More precisely, in order to assign a datum x, given known classes and examples in the form of training data D = \{y_n, x_n\}_{n=1}^{N}, the k-nn method first calculates the distances d_i = \|x - x_i\|. If the Q classes are a priori equally likely, the probability that x belongs to the j-th class is then evaluated as p(Cj | x, k, D) = kj/k, where kj is the number of the k data points with the smallest dn belonging to Cj.
Holmes and Adams (2002) have extended the traditional k-nn classifier by adding a parameter β which controls the 'strength of association' between neighbours. The posterior probability of x belonging to each class Cj is given by the predictive likelihood:

    p(C_j \,|\, x, k, \beta, D) = \frac{\exp\!\big(\frac{\beta}{k} \sum_{x_n \sim x}^{k} d(x, x_n)\,\delta_{j y_n}\big)}{\sum_{q=1}^{Q} \exp\!\big(\frac{\beta}{k} \sum_{x_n \sim x}^{k} d(x, x_n)\,\delta_{q y_n}\big)}.    (9)

Here δmn is the Kronecker delta and \sum_{x_n \sim x}^{k} means the sum over the k nearest neighbours of x (excluding x itself). The term \frac{1}{k} \sum_{x_n \sim x}^{k} d(x_n, x)\,\delta_{j y_n} counts the fraction of the k nearest neighbours of x belonging to class j. Here we regard w = {k, β} as parameters to be adjusted as part of Algorithm 1 as the Pareto optimal ROC surface is sought.
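A minimal sketch of this predictive likelihood, taking d(·, ·) ≡ 1 so that the exponent is β times the fraction of the k nearest neighbours in each class (our simplifying assumption):

    import numpy as np

    def pknn_posteriors(x, X, y, Q, k, beta):
        """Probabilistic k-nn posteriors in the spirit of equation (9),
        with d(., .) = 1; X is (N, d) training data, y integer labels."""
        dists = np.linalg.norm(X - x, axis=1)
        nearest = np.argsort(dists)[:k]      # indices of the k nearest points
        frac = np.bincount(y[nearest], minlength=Q) / k
        e = np.exp(beta * frac)
        return e / e.sum()

    # Toy usage.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 2)); y = rng.integers(0, 3, size=30)
    print(pknn_posteriors(np.zeros(2), X, y, Q=3, k=5, beta=2.0))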
Table 1 shows the results of comparing the probabilistic k-nn classifier with the multinomial logistic regression classifier on the Synthetic data, and also on the UCI Image and Vehicle data sets.³

³ http://www.ics.uci.edu/~mlearn/MLRepository
Table 1. Generalised Gini coefficients and exclusively dominated volume comparisons of the multinomial logistic regression (MLR) and k-nn classifiers.

Measure          Synth.   Vehicle   Image
G(MLR)           0.840    ≈0        ≈0
G(k-nn)          0.920    0.168     0.076
δ(MLR, k-nn)     0.001    0.000     0.000
δ(k-nn, MLR)     0.081    0.168     0.076
By using the δ measure we can see that the k-nn model is wholly better than the multinomial logistic regression for both the UCI data sets (where Q = 4 and Q = 7), and almost entirely better on the synthetic data.
6. Conclusion
In this paper we have considered multi-class generalisations of ROC analysis from a multi-objective optimisation perspective. Consideration of the role of
costs in classification leads to a multi-objective optimisation problem in which misclassification rates are
simultaneously optimised. The resulting trade-off surface generalises the binary classification ROC curve
because on it one misclassification rate cannot be improved without degrading at least one other. We have
presented a straightforward general evolutionary algorithm which efficiently locates approximations to the
Pareto optimal ROC surface. Although the algorithm
clearly takes longer than training a single classifier, on
the data presented here run times are on the order of
a few minutes.
The Pareto optimal ROC surface yields a natural way
of comparing classifiers in terms of the volume that
the classifiers’ ROC surfaces dominate. We defined
and illustrated a generalisation of the Gini coefficient
for multi-class problems that quantifies the superiority
of a classifier to random allocation. This measure in
turn has proved itself useful in practice.
Finally, we remark that some imprecise information
about the costs of misclassification may often be
available. Lachiche and Flach (2003) have considered multi-class ROC optimisation when the costs are
known. However, if partial information about the
costs, such as approximate bounds on the ratios of
the λkj , is known, the evolutionary algorithm is easily
focused on the relevant region by setting the Dirichlet
parameters αkj appearing in (5) to be in the ratio of
the expected costs, with their magnitudes setting the
variance in the cost ratios.
Acknowledgements
This work was supported in part by the EPSRC, grant
GR/R24357/01. We thank Trevor Bailey, Adolfo Hernandez, Wojtek Krzanowski, Derek Partridge, Vitaly Schetinin, Jufen Zhang and two anonymous referees for their helpful comments.
References

Adams, N., & Hand, D. (1999). Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32, 1139–1147.

Anastasio, M., Kupinski, M., & Nishikawa, R. (1998). Optimization and FROC analysis of rule-based detection schemes using a multiobjective approach. IEEE Transactions on Medical Imaging, 17, 1089–1093.

Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145–1159.

Deb, K. (2001). Multi-objective optimization using evolutionary algorithms. Chichester: Wiley.

Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.

Elkan, C. (2001). The foundations of cost-sensitive learning. IJCAI (pp. 973–978).

Everson, R., & Fieldsend, J. (2005). Multi-class ROC analysis from a multi-objective optimisation perspective (Technical Report 421). Department of Computer Science, University of Exeter.

Everson, R., & Fieldsend, J. (2006). Multi-objective optimisation of safety related systems: An application to short term conflict alert. IEEE Transactions on Evolutionary Computation. (In press).

Ferri, C., Hernández-Orallo, J., & Salido, M. (2003). Volume under the ROC surface for multi-class problems. ECML 2003 (pp. 108–120).

Fieldsend, J., & Everson, R. (2004). ROC optimisation of safety related systems. Proceedings of ROCAI 2004, part of the 16th European Conference on Artificial Intelligence (ECAI) (pp. 37–44).

Fieldsend, J., Everson, R., & Singh, S. (2003). Using unconstrained elite archives for multi-objective optimisation. IEEE Transactions on Evolutionary Computation, 7, 305–323.

Flach, P., Blockeel, H., Ferri, C., Hernández-Orallo, J., & Struyf, J. (2003). Decision support for data mining: Introduction to ROC analysis and its applications. In D. Mladenic, N. Lavrac, M. Bohanec and S. Moyle (Eds.), Data mining and decision support: Integration and collaboration, 81–90. Kluwer.

Hand, D., & Till, R. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45, 171–186.

Hanley, J., & McNeil, B. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.

Hernández-Orallo, J., Ferri, C., Lachiche, N., & Flach, P. (Eds.). (2004). ROC Analysis in Artificial Intelligence, 1st International Workshop, ROCAI-2004, Valencia, Spain.

Holmes, C., & Adams, N. (2002). A probabilistic nearest neighbour method for statistical pattern recognition. Journal of the Royal Statistical Society B, 64, 1–12.

Knowles, J., & Corne, D. (2000). Approximating the nondominated front using the Pareto archived evolution strategy. Evolutionary Computation, 8, 149–172.

Kupinski, M., & Anastasio, M. (1999). Multiobjective genetic optimization of diagnostic classifiers with implications for generating receiver operating characteristic curves. IEEE Transactions on Medical Imaging, 18, 675–685.

Lachiche, N., & Flach, P. (2003). Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. ICML 2003 (pp. 416–423).

Provost, F., & Fawcett, T. (1997). Analysis and visualisation of classifier performance: Comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 43–48). Menlo Park, CA: AAAI Press.

Ripley, B. (1994). Neural networks and related methods for classification (with discussion). Journal of the Royal Statistical Society Series B, 56, 409–456.

Srinivasan, A. (1999). Note on the location of optimal classifiers in n-dimensional ROC space (Technical Report PRG-TR-2-99). Oxford University Computing Laboratory, Oxford.

Yao, X., Liu, Y., & Lin, G. (1999). Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation, 3, 82–102.