Formulation and comparison of multi-class ROC surfaces
Jonathan E. Fieldsend
J.E.Fieldsend@exeter.ac.uk
Richard M. Everson
R.M.Everson@exeter.ac.uk
Department of Computer Science, University of Exeter, Exeter, EX4 4QF, UK.
Abstract

The Receiver Operating Characteristic (ROC) has become a standard tool for the analysis and comparison of classifiers when the costs of misclassification are unknown. There has been relatively little work, however, examining ROC for more than two classes. Here we define the ROC surface for the Q-class problem in terms of a multi-objective optimisation problem in which the goal is to simultaneously minimise the Q(Q − 1) misclassification rates, when the misclassification costs and parameters governing the classifier's behaviour are unknown. We present an evolutionary algorithm to locate the optimal trade-off surface between misclassifications of different types. The performance of the evolutionary algorithm is illustrated on a synthetic three class problem. In addition the use of the Pareto optimal surface to compare classifiers is discussed, and we present a straightforward multi-class analogue of the Gini coefficient. This is illustrated on synthetic and standard machine learning data.
1. Introduction
Classification or discrimination of unknown exemplars into two or more classes, based on a 'training' dataset of examples whose classification is known, is one of the fundamental problems in supervised pattern recognition. Given a classifier that yields estimates of the exemplar's probability of belonging to each of the classes, and when the relative costs of misclassification are known, it is straightforward to determine the decision rule that minimises the average cost of misclassification. If the costs of misclassification are equal
and there is no penalty for a correct classification then
the optimal rule becomes: assign to the class with the
highest posterior probability. In practical situations,
however, the true costs of misclassification are unequal
and frequently unknown or difficult to determine (e.g.
(Adams & Hand, 1999; Bradley, 1997)). In such cases
the practitioner must either guess the misclassification
costs or explore the trade-off in classification rates as
the decision rule is varied.
Receiver Operating Characteristic (ROC) analysis
provides a convenient graphical display of the trade-off
between true and false positive classification rates for
two class problems (Provost & Fawcett, 1997). Since
its introduction in the medical and signal processing
literatures (Hanley & McNeil, 1982) ROC analysis has
become a prominent method for selecting an operating
point; see (Flach et al., 2003) and (Hernández-Orallo et al., 2004) for recent overviews of methodologies and applications.
In this paper we extend the spirit of ROC analysis to multi-class problems by considering the trade-offs between the misclassification rates from one class into each of the other classes. Rather than considering the true and false positive rates, we consider
the multi-class ROC surface to be the solution of the
multi-objective optimisation problem in which these
misclassification rates are simultaneously optimised.
Srinivasan (1999) has discussed a similar formulation
of multi-class ROC, showing that if classifiers for Q
classes are considered to be points with coordinates
given by their Q(Q − 1) misclassification rates, then
optimal classifiers lie on the convex hull of these points.
Here we describe the surface in terms of Pareto optimality and in section 3 we give an evolutionary algorithm for locating the optimal ROC surface when the
classifier’s parameters may be adjusted as part of the
optimisation.
ROC analysis is frequently used for evaluating and
comparing classifiers in terms of the area under the
ROC curve (AUC) or, equivalently, the Gini coefficient. Although the straightforward analogue of the
AUC is unsuitable for more than two classes, in section 5 we develop a straightforward generalisation of
the Gini coefficient which quantifies the superiority of
a classifier’s performance to random allocation.
2. ROC Analysis
Here we describe the straightforward extension of ROC
analysis to more than two classes (multi-class ROC)
and draw some comparisons with the two class case.
In general a classifier seeks to allocate an exemplar or measurement x to one of a number of classes. Allocation of x, whose true class is Ck, to the incorrect class, say Cj, usually incurs some, often unknown, cost denoted by λkj; we count the cost of a correct classification as zero: λkk = 0 (see (Elkan, 2001) for a nice discussion of the general case).
Denoting the probability of assigning an exemplar to Cj when its true class is, in fact, Ck as p(Cj | Ck), the overall risk or expected cost is

    R = \sum_{k,j} \lambda_{kj} \, p(C_j \,|\, C_k) \, \pi_k    (1)
where πk is the prior probability of Ck. The performance of some particular classifier may be conveniently summarised by a confusion matrix or contingency table, Ĉ, which summarises the results of classifying a set of examples. Each entry Ĉkj of the confusion matrix gives the number of examples, whose true class was Ck, that were actually assigned to Cj. Normalising the confusion matrix so that each row sums to unity gives the confusion rate matrix, C, whose entries are estimates of the misclassification probabilities: p(Cj | Ck) ≈ Ckj. Thus the expected risk is estimated as
    R = \sum_{k,j} \lambda_{kj} \, C_{kj} \, \pi_k.    (2)

A slightly different perspective is gained by writing the expected risk in terms of the posterior probabilities of classification to each class. The conditional risk or average cost of assigning x to Cj is

    R(C_j \,|\, x) = \sum_k \lambda_{kj} \, p(C_k \,|\, x)    (3)

where p(Ck | x) is the posterior probability that x belongs to Ck. The expected overall risk is

    R = \int R(C_j \,|\, x) \, p(x) \, dx.    (4)

The expected risk is then minimised, being equal to the Bayes risk, by assigning x to the class with the minimum conditional risk (e.g. (Duda & Hart, 1973)). Choosing 'zero-one costs', λkj = 1 − δkj, means that all misclassifications are equally costly and the conditional risk of assigning to Cj is one minus the posterior probability of Cj; one thus assigns to the class with the greatest posterior probability, which minimises the overall error rate.

If costs are known, it is straightforward to make classifications that achieve the Bayes risk (provided, of course, that the classifier yields accurate assessments of the posterior probabilities p(Ck | x)). However, costs are frequently unknown and difficult to estimate, particularly when there are many classes; in this case it is useful to be able to compare the classification rates as the costs vary. For binary classification the conditional risk may be simply rewritten in terms of the posterior probability of assigning to C1, resulting in the rule: assign x to C1 if p(C1 | x) > t = λ21/(λ21 + λ12). This classification rule reveals that there is, in fact, only one degree of freedom in the binary cost matrix and, as might be expected, the entire range of classification rates for each class can be swept out as the classification threshold t varies from 0 to 1. It is this variation of rates that the ROC curve exposes for binary classifiers. ROC analysis focuses on the classification of one particular class, say C1, and plots the true positive classification rate for C1 versus the false positive rate as the threshold t or, equivalently, the ratio of misclassification costs is varied.

If more than one classifier is available (often produced by altering the parameters, w, of a particular classifier) then it can be shown that the convex hull of the ROC curves for the individual classifiers is the locus of optimum performance for that set of classifiers.

Frequently in two class problems the focus is on a single class, for example, whether a set of medical symptoms is to be classified as benign or dangerous, so the ROC analysis practice of plotting true and false positive rates for a single class is helpful. Also, since there are only three degrees of freedom in the binary confusion matrix, classification rates for the other class are easily inferred. Indeed, the confusion rate matrix, C, has only two degrees of freedom for binary problems. Focusing on one particular class is likely to be misleading when more than two classes are available for assignment. We therefore concentrate on the misclassification rates of each class to the others. In terms of the confusion rate matrix C we consider the off-diagonal elements, the diagonal elements (i.e., the true positives) being determined by the off-diagonal elements since each row sums to unity.

With Q classes there are D = Q(Q − 1) degrees of freedom in the confusion rate matrix and it is desirable to simultaneously minimise all the misclassification rates represented by these. For most problems, as for the binary problem, simultaneous optimisation will clearly be impossible and some compromise between the various misclassification rates will have to be found. Knowledge of the costs makes this determination simple, but if the costs are unknown we propose to use multi-objective optimisation to discover the optimal trade-offs between the misclassification rates.
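To make the quantities of this section concrete, the following minimal sketch (Python with NumPy; our illustration, not code from the paper) forms the confusion rate matrix and evaluates the risk estimate of equation (2):

    import numpy as np

    def confusion_rate_matrix(y_true, y_pred, Q):
        """Row-normalised confusion matrix: C[k, j] estimates p(C_j | C_k)."""
        counts = np.zeros((Q, Q))
        for k, j in zip(y_true, y_pred):
            counts[k, j] += 1
        # Normalise each row (true class) to sum to unity.
        return counts / counts.sum(axis=1, keepdims=True)

    def expected_risk(C, costs, priors):
        """Estimated risk of equation (2): sum_kj lambda_kj C_kj pi_k."""
        return np.sum(costs * C * priors[:, None])

    # Toy usage: three classes, zero-one costs.
    y_true = np.array([0, 0, 1, 1, 2, 2])
    y_pred = np.array([0, 1, 1, 1, 2, 0])
    C = confusion_rate_matrix(y_true, y_pred, Q=3)
    costs = 1.0 - np.eye(3)              # lambda_kj = 1 - delta_kj
    priors = np.bincount(y_true) / len(y_true)
    print(expected_risk(C, costs, priors))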
In general we will consider locating the optimal ROC
surface as a function of the classifier parameters, w,
as well as the costs. For notational convenience and
because they are treated as a single entity, we write
the cost matrix λ and parameters as a single vector
of generalised parameters, θ = {λ, w}; to distinguish
θ from the classifier parameters w we use the optimisation terminology decision vectors to refer to θ.
The D misclassification rates are functions (depending on the particular classifier) of the decision vectors, thus Ckj = Ckj(θ). The optimal trade-off between the misclassification rates is thus defined by the minimisation problem: minimise Ckj(θ) ∀ k, j, k ≠ j.

If all the misclassification rates for one classifier with decision vector θ are no worse than the classification rates for another classifier φ, and at least one rate is better, then the classifier parameterised by θ is said to strictly dominate that parameterised by φ. Thus θ strictly dominates φ (denoted θ ≺ φ) iff Ckj(θ) ≤ Ckj(φ) ∀ k, j, k ≠ j, and Ckj(θ) < Ckj(φ) for some k, j, k ≠ j. Less stringently, θ weakly dominates φ (denoted θ ⪯ φ) iff Ckj(θ) ≤ Ckj(φ) ∀ k, j, k ≠ j.

A set E of decision vectors is said to be non-dominated if no member of the set is dominated by any other member: θ ⊀ φ ∀ θ, φ ∈ E.
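These definitions translate directly into code. A minimal sketch (helper names are ours), operating on vectors holding the D off-diagonal misclassification rates:

    import numpy as np

    def strictly_dominates(c_theta, c_phi):
        """True if theta strictly dominates phi: every rate no worse,
        at least one strictly better."""
        return np.all(c_theta <= c_phi) and np.any(c_theta < c_phi)

    def weakly_dominates(c_theta, c_phi):
        """True if theta weakly dominates phi: every rate no worse."""
        return np.all(c_theta <= c_phi)

    # c_theta and c_phi hold the D = Q(Q-1) off-diagonal rates.
    c_theta = np.array([0.1, 0.2, 0.0, 0.3, 0.1, 0.2])
    c_phi   = np.array([0.1, 0.3, 0.1, 0.3, 0.2, 0.2])
    print(strictly_dominates(c_theta, c_phi))   # True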
A solution to the minimisation problem is thus Pareto optimal if it is not dominated by any other feasible solution, and the set of all Pareto optimal solutions is known as the Pareto front. Recent
years have seen the development of a number of evolutionary techniques based on dominance measures for
locating the Pareto front; see (Deb, 2001) for a recent
review. Kupinski and Anastasio (1999) and Anastasio
et al. (1998) introduced the use of multi-objective
evolutionary algorithms (MOEAs) to optimise ROC
curves for binary problems, illustrating the method on
a synthetic data set and for medical imaging problems;
and we have used a similar methodology for locating
optimal ROC curves for safety-related systems (Fieldsend & Everson, 2004; Everson & Fieldsend, 2006).
In the following section we describe a straightforward
evolutionary algorithm for locating the Pareto front
for multi-class problems. We illustrate the method
Algorithm 1 Multi-objective evolution scheme for ROC surfaces.

Inputs:
  T     Number of generations
  Nλ    Number of costs to sample

 1: E := initialise()
 2: for t := 1 : T
 3:     {w, λ} = θ := select(E)
 4:     w′ := perturb(w)
 5:     for i := 1 : Nλ
 6:         λ′ := sample()
 7:         C := classify(w′, λ′)
 8:         θ′ := {w′, λ′}
 9:         if φ ⋠ θ′ ∀φ ∈ E
10:             E := {φ ∈ E | θ′ ⊀ φ}
11:             E := E ∪ {θ′}
12:         end
13:     end
14: end
on a synthetic problem for two different classification
models in section 4.
3. Locating multi-class ROC surfaces
Here we describe a straightforward algorithm for locating the Pareto front for multi-class ROC problems using an analogue of mutation-based evolution.
The procedure is based on the Pareto Archive Evolutionary Strategy (PAES) introduced by Knowles and
Corne (2000). In outline, the algorithm maintains a
set or archive E, whose members are mutually nondominating, which forms the current approximation
to the Pareto front. As the computation progresses
members of E are selected, copied and their decision
vectors perturbed, and the objectives corresponding
to the perturbed decision vector evaluated; if the perturbed solution is not dominated by any element of E,
it is inserted into E and any members of E which are
dominated by the new entrant are removed. It is clear,
therefore, that the archive can only move towards the
Pareto front: it is in essence a greedy search where the
archive E is the current point of the search and perturbations to E that are not dominated by the current
E are always accepted.
Algorithm 1 describes the procedure in more detail.
The archive E is initialised by evaluating the misclassification rates for a number (here 100) of randomly chosen parameter values and costs, and discarding those
which are dominated by another element of the initial
set. Then at each generation a single element, θ, is selected from E (line 3 of Algorithm 1); selection may
be uniformly random, but partitioned quasi-random
selection (PQRS) (Fieldsend et al., 2003) was used
here to promote exploration of the front. PQRS increases the efficiency and range of the search by preventing clustering of solutions in a particular region
of the front which would otherwise bias the search because they would be selected more frequently.
The selected parent decision vector is copied, after
which the costs λ and classifier parameters w are
treated separately. The parameters w of the classifier
are perturbed or, in the nomenclature of evolutionary algorithms, mutated to form a child, w′ (line 4).
Here we seek to encourage wide exploration of parameter space by perturbing each of the parameters with a random number δ drawn from a heavy-tailed distribution (such as the Laplacian density, p(δ) ∝ e^{−|δ|}). The Laplacian distribution has tails that decay relatively slowly, ensuring that there is a high probability of exploring regions distant from the current solutions, facilitating escape from local minima (Yao et al., 1999).
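For instance, the mutation of line 4 might be sketched as follows; the Laplacian scale is a hypothetical choice, as the paper does not specify one:

    import numpy as np

    rng = np.random.default_rng(0)

    def perturb(w, scale=0.1):
        """Line 4 of Algorithm 1: Laplacian mutation of every parameter.
        The scale is our assumption, not a value from the paper."""
        return w + rng.laplace(loc=0.0, scale=scale, size=w.shape)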
With a proposed parameter set w′ on hand the procedure then investigates the misclassification rates as
the costs are varied with fixed parameters. In order
to do this we generate Nλ sample costs λ′ and evaluate the misclassification rates for each of them. Since the misclassification costs are non-negative and sum to unity, a straightforward way of producing samples is to draw them from a Dirichlet distribution:

    p(\lambda) = \mathrm{Dir}(\lambda \,|\, \alpha_1, \ldots, \alpha_i, \ldots, \alpha_D)    (5)

where the index i labels the D = Q(Q − 1) off-diagonal entries in the cost matrix. Samples from a Dirichlet density lie on the simplex \sum_{kj} \lambda_{kj} = 1. The αkj ≥ 0 determine the density of the samples; since we have no preference for particular costs here, we set all the αkj = 1 so that the simplex (that is, cost space) is sampled uniformly with respect to Lebesgue measure.
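In NumPy this sampling is a single call; a minimal sketch that also rebuilds the Q × Q cost matrix with zero diagonal:

    import numpy as np

    Q = 3
    D = Q * (Q - 1)
    rng = np.random.default_rng(0)

    # alpha = 1 everywhere samples the cost simplex uniformly;
    # unequal alphas would focus the search on preferred cost ratios.
    lam = rng.dirichlet(np.ones(D))       # D off-diagonal costs, summing to one

    costs = np.zeros((Q, Q))              # cost matrix with lambda_kk = 0
    costs[~np.eye(Q, dtype=bool)] = lam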
Each cost sample λ′, together with the classifier parameters w′, is used to make class assignments for each example in the given dataset (line 7). Usually this step consists of merely modifying the posterior probabilities p(Ck | x) to find the assignment with the minimum expected cost, and it is therefore computationally inexpensive as the probabilities need only be computed once for each w′. The misclassification rates Ckj(θ′) (k ≠ j) comprise the objective values for the decision vector θ′ = {w′, λ′}; decision vectors that are not dominated by members of the archive E are inserted into E (line 11) and any decision vectors in E that are dominated by the new entrant are removed (line 10). We remark that this algorithm, unlike the original PAES algorithm, uses an archive whose size is unconstrained, permitting better convergence (Fieldsend et al., 2003).
4. Illustrations
In this section we illustrate the performance of the
evolutionary algorithm on synthetic data, which is
readily understood. Subsequently we give results for
a number of standard multi-class problems. We use two relatively simple classifiers, the multinomial logistic regression classifier and the probabilistic k-nearest neighbour classifier.
4.1. Synthetic data
In order to gain an understanding of the Pareto optimal ROC surface for multi-class classification, we extend a two-dimensional, two-class synthetic data set devised by Ripley (1994) by adding Gaussian components corresponding to a third class. The resulting data set comprises 3 classes, the conditional density for each being a mixture of two Gaussians.¹
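For reference, the data set described in footnote 1 can be regenerated with a few lines of NumPy (a sketch; the random seed and sampling order are our choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Component means from footnote 1: two components per class,
    # covariance 0.3 I, equal mixing weights 1/6, 300 samples in total.
    means = {
        0: [(0.7, 0.3), (0.3, 0.3)],
        1: [(-0.7, 0.7), (0.4, 0.7)],
        2: [(1.0, 1.0), (0.0, 1.0)],
    }

    X, y = [], []
    for _ in range(300):
        cls = rng.integers(3)                   # classes equally likely
        mu = means[cls][rng.integers(2)]        # one of the two components
        X.append(rng.multivariate_normal(mu, 0.3 * np.eye(2)))
        y.append(cls)
    X, y = np.array(X), np.array(y)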
4.2. Multinomial logistic regression
The functional form of the multinomial logistic regression classifier is:

    p(C_j \,|\, x, \alpha, \beta) = \frac{e^{\alpha_j + x^T \beta_j}}{\sum_{i=1}^{Q} e^{\alpha_i + x^T \beta_i}}    (6)

where βj is a vector of feature coefficients for class j and αj is a single bias (for each class). Therefore w consists of these Q sets of βj and αj.
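In code, equation (6) is a softmax over affine scores; a minimal NumPy sketch (array shapes are our assumption):

    import numpy as np

    def mlr_posteriors(x, alpha, beta):
        """Equation (6): softmax posteriors. alpha has shape (Q,),
        beta has shape (Q, d) for d input features."""
        logits = alpha + beta @ x          # alpha_j + x^T beta_j
        logits -= logits.max()             # subtract max for stability
        e = np.exp(logits)
        return e / e.sum()

    # Toy usage: Q = 3 classes, d = 2 features.
    rng = np.random.default_rng(0)
    p = mlr_posteriors(np.array([0.5, 1.0]),
                       rng.normal(size=3), rng.normal(size=(3, 2)))
    print(p, p.sum())                      # posteriors sum to one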
To discover the Pareto optimal ROC surface, the optimisation algorithm was run for T = 5000 proposed
parameter values, with Nλ = 100, resulting in an estimated Pareto front comprising approximately 9000
mutually non-dominating parameter and cost combinations; we judge that the algorithm is very well converged and obtain very similar results by permitting
the algorithm to run for only T = 2000 generations.
The left panel of Figure 1 shows the decision regions that yield the smallest total misclassification error, 40/300.

¹ Covariance matrices for all the components were isotropic: Σj = 0.3I. Denoting by μji, i = 1, 2, the means of the two Gaussian components generating samples for class j, the centres were located at: μ11 = (0.7, 0.3)ᵀ, μ12 = (0.3, 0.3)ᵀ, μ21 = (−0.7, 0.7)ᵀ, μ22 = (0.4, 0.7)ᵀ, μ31 = (1.0, 1.0)ᵀ and μ32 = (0.0, 1.0)ᵀ. Each component had equal mixing weight 1/6. The 300 samples used here, together with the equal cost Bayes optimal decision boundaries, are shown in Figure 1.
Figure 1. Decision regions for various multinomial logistic regression classifiers on multi-class ROC surface. Grey scale
background shows the class to which a point would be assigned. Black lines show the ideal equal-cost decision boundary.
Symbols show actual training data. Left: Parameters corresponding to minimum total misclassification error on the
training data. Middle: Decision regions corresponding to the minimum C21 and C23 and conditioned on this, minimum
C31 and C13 . Right: Decision regions corresponding to minimising C12 and C32 .
Decision regions for this parameterisation are not tightly fitted to the Bayes optimal ones, which reflects the relative inflexibility of the particular classifier, rather than a problem with the training process.
By contrast with the decision regions which are optimal for roughly equal costs, the middle and right panels of Figure 1 show decision regions for imbalanced
costs. The middle panel shows decision regions corresponding to minimising C21 and C23 : this, of course,
can be achieved by setting λ21 and λ23 to be large,
so that every C2 example (triangle) is correctly classified, no matter what the cost. For these data there
are many decision regions correctly classifying every C2
and we display the decision regions that also minimise
C31 and C13 . For these data, it is possible to make
C31 = C13 = 0 because C1 and C3 are adjacent only
along a boundary distant from C2 points; such complete minimisation will in general not be possible. Of
course, the penalty to be paid for minimising the C2
rates together with C31 and C13 is that C32 and C12
are large.
The right panel of Figure 1 shows the reverse situation:
here the costs for misclassifying either C1 or C3 as C2
are high. With these data, although not in general, of
course, it is possible to reduce C12 and C32 to zero,
as shown by the decision regions which ensure that C2
examples are only classified correctly when it does not
result in incorrect assignment of the other two classes
to C2 . In this case the greatest misclassification rate is
C23 (triangles as crosses).
It should be emphasised that the evolutionary algorithm has explored a wide range of cost and parameter combinations on the Pareto optimal ROC surface.
Values of each λkj on the front range from below 10^{−4} to above 0.79, all having means of approximately 1/D = 1/6, providing assurance that a complete range of costs is being explored by the algorithm.
One way to view misclassification costs when Q = 3 is
to look at the trade-off surface for minimising all misclassifications into each class, that is the false positive
rate for each class. We thus minimise the Q objectives:

    F_k(w, \lambda) = \sum_{j \neq k} C_{kj}, \qquad k = 1, \ldots, Q.    (7)
We call this front the ‘false positive rate front’.
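Given the row-stochastic confusion rate matrix C of section 2, these Q objectives are a one-line projection of the full D-dimensional objective vector; a minimal NumPy sketch (the function name is ours):

    import numpy as np

    def false_positive_rates(C):
        """Equation (7): F_k = sum over j != k of C_kj, for k = 1..Q."""
        return C.sum(axis=1) - np.diag(C)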
The false positive rate Pareto front is easily visualised
(at least for three class problems), but clearly information on exactly how misclassifications are made is
lost. However, the full D-dimensional Pareto surface
may usefully be viewed in ‘false positive space’. Figure
2 shows the solutions on the estimated Pareto front obtained using the full Q(Q − 1) objectives for the multinomial logistic regression classifier, but each solution
is plotted at the coordinate given by the Q = 3 false
positive rates (7), with the greyscale denoting the class
into which the greatest number of misclassifications are
made. Although the solutions obtained by directly optimising the false positive rates clearly lie on the full
Pareto surface (in Q(Q − 1) dimensions) the converse
is not true and the projections into false positive space
do not form a surface. Nonetheless, at least for these
data, they lie close to a surface, which aids visualisation and navigation of the full Pareto front. The
relation between the solutions on the full Pareto front
and the false positive rate front is made more precise
as follows. If E is a set of Q(Q − 1)-dimensional solutions lying in the full Pareto front, let EQ be the set
of Q-dimensional vectors representing the false positive coordinates of elements of E. The extremal set of
non-dominated elements of EQ is
    \tilde{E}_Q = \{ f \in E_Q \mid f' \nprec f \;\; \forall f' \in E_Q \}.    (8)
Figure 3. Illustration of the G and δ measures where Q =
2. Shaded area denotes G(A), horizontally hatched area
denotes δ(A, B), vertically hatched area denotes δ(B, A).
Figure 2. The estimated Pareto front for synthetic data
classified with a multinomial logistic regression classifier
viewed in false positive space. Axes show the false positive
rates for each class and different greyscales represent the
class into which the greatest number of misclassifications
are made. (Points better than random shown.)
Then solutions in ẼQ also lie in the false positive rate
front. Other more sophisticated methods for visualising Pareto fronts in the Q > 2 situation are described
in (Everson & Fieldsend, 2005).
5. Comparing classifiers
In two class problems the area under the ROC curve is
often used to compare classifiers. As clearly explained
by Hand and Till (2001), the AUC measures a classifier’s ability to separate two classes over the range of
possible costs and is linearly related to the Gini coefficient. In this section we compare the multinomial
logistic regression and k-nn classifiers using a measure
based on the volume dominated by the Pareto optimal
ROC surface. We draw attention to Ferri et al. (2003)
who give another view of the volume under multi-class
ROC surfaces.
By analogy with the AUC, we might use the volume of the Q(Q − 1)-dimensional hypercube that is
dominated by elements of the ROC surface for classifier A as a measure of A’s performance. In binary
and multi-class problems alike its maximum value is
1 when A classifies perfectly. If the classifier allocates at random, the ROC surface is the simplex in
Q(Q − 1)-dimensional space with vertices at distance
Q − 1 along each coordinate vector. The volume of the unit hypercube not dominated by this simplex is

    \frac{1}{[Q(Q-1)]!} \left[ (Q-1)^{Q(Q-1)} - Q(Q-1)(Q-2)^{Q(Q-1)} \right];

a full derivation is provided in Everson and Fieldsend (2005). It corresponds to the portion of the pyramidal region below the simplex in the Q(Q − 1)-dimensional hypercube with sides of length (Q − 1) that also lies in the unit hypercube; we denote this truncated pyramidal region by P. When Q = 2 the volume (area) is just 1/2, corresponding to the area under the diagonal in a conventional ROC plot.² However, when
Q > 2, the volume not dominated by the random allocation simplex is very small; even when Q = 3, the volume not dominated is ≈ 0.0806. We therefore define G(A) to be the analogue of the Gini coefficient in two dimensions, namely the proportion of the volume of P, the region of the unit hypercube not dominated by the simplex defined by random allocation, that is dominated by elements of the ROC surface (as illustrated by the shaded area in Figure 3 for the Q = 2 case). In binary classification problems this corresponds to twice the area between the ROC curve and the diagonal. In multi-class problems G(A) quantifies how much better A is than random allocation. It can be simply estimated by Monte Carlo sampling of this volume in the unit hypercube.
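A Monte Carlo estimate of G can be sketched as follows, using the fact that a point of the unit hypercube is not dominated by the random-allocation simplex exactly when its coordinates sum to less than Q − 1 (names and sample size are ours):

    import numpy as np

    def gini_mc(archive, Q, n=50_000, seed=1):
        """Monte Carlo estimate of G(A). archive is an (m, D) array of
        off-diagonal misclassification rates on the ROC surface of A."""
        rng = np.random.default_rng(seed)
        D = Q * (Q - 1)
        u = rng.uniform(size=(n, D))
        # Samples in P: not dominated by the random-allocation simplex.
        in_P = u.sum(axis=1) < Q - 1
        # A sample is dominated by A if some archive point is <= it everywhere.
        dominated = np.array([np.any(np.all(archive <= x, axis=1)) for x in u])
        return (dominated & in_P).sum() / in_P.sum()

δ(A, B) may be estimated in the same way, by counting the samples in P dominated by A's surface but not by B's.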
² Although binary ROC plots are usually made in terms of true positive rates versus false positive rates for one class, the false positive rate for the other class is just 1 minus the true positive rate for the other class.

If every point on the optimal ROC surface for classifier A is dominated by a point on the ROC surface for classifier B, then classifier B has a superior performance to classifier A. In general, however, neither
ROC surface will completely dominate the other: regions of A’s surface will be dominated by B and vice
versa; in binary problems this corresponds to ROC
curves that cross. To quantify the classifiers’ relative
performance we therefore define δ(A, B) to be the volume of P that is dominated by elements of A and not
by elements of B (marked in Figure 3 with horizontal
lines). Note that δ(A, B) is not a metric; although it
is non-negative, it is not symmetric. Also, if A and B are subsets of the same non-dominated set W (i.e., A ⊆ W and B ⊆ W), then δ(A, B) and δ(B, A) may
have a range of values depending on their precise composition (Fieldsend et al., 2003). Situations like this
are rare in practice, however, and measures like δ have
proved useful for comparing Pareto fronts.
5.1. Probabilistic k-nn classifiers
One of the most popular methods of statistical classification is the k-nearest neighbour model (k-nn). The method is essentially geometrical, assigning the class of an unknown exemplar to the class of the majority of its k nearest neighbours in some training data. More precisely, in order to assign a datum x, given known classes and examples in the form of training data D = \{y_n, x_n\}_{n=1}^{N}, the k-nn method first calculates the distances d_i = \|x - x_i\|. If the Q classes are a priori equally likely, the probability that x belongs to the j-th class is then evaluated as p(Cj | x, k, D) = kj/k, where kj is the number of the k data points with the smallest dn belonging to Cj.
Holmes and Adams (2002) have extended the traditional k-nn classifier by adding a parameter β which controls the 'strength of association' between neighbours. The posterior probability of x belonging to each class Cj is given by the predictive likelihood:

    p(C_j \,|\, x, k, \beta, D) = \frac{\exp\!\big(\frac{\beta}{k} \sum_{x_n \sim x}^{k} d(x, x_n)\,\delta_{j y_n}\big)}{\sum_{q=1}^{Q} \exp\!\big(\frac{\beta}{k} \sum_{x_n \sim x}^{k} d(x, x_n)\,\delta_{q y_n}\big)}.    (9)

Here δmn is the Kronecker delta and \sum_{x_n \sim x}^{k} means the sum over the k nearest neighbours of x (excluding x itself). The term \frac{1}{k} \sum_{x_n \sim x}^{k} d(x_n, x)\,\delta_{j y_n} counts the fraction of the k nearest neighbours of x belonging to class j. Here we regard w = {k, β} as parameters to be adjusted as part of Algorithm 1 as the Pareto optimal ROC surface is sought.
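A minimal sketch of this predictive likelihood, taking d(·, ·) ≡ 1 so that the exponent is β times the fraction of the k nearest neighbours in each class (our simplifying assumption):

    import numpy as np

    def pknn_posteriors(x, X, y, Q, k, beta):
        """Probabilistic k-nn posteriors in the spirit of equation (9),
        with d(., .) = 1; X is (N, d) training data, y integer labels."""
        dists = np.linalg.norm(X - x, axis=1)
        nearest = np.argsort(dists)[:k]      # indices of the k nearest points
        frac = np.bincount(y[nearest], minlength=Q) / k
        e = np.exp(beta * frac)
        return e / e.sum()

    # Toy usage.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 2)); y = rng.integers(0, 3, size=30)
    print(pknn_posteriors(np.zeros(2), X, y, Q=3, k=5, beta=2.0))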
Table 1 shows the results of comparing the probabilistic k-nn classifier with the multinomial logistic regression classifier on the Synthetic data, and also on the UCI Image and Vehicle data sets.³

³ http://www.ics.uci.edu/~mlearn/MLRepository
Table 1. Generalised Gini coefficients and exclusively dominated volume comparisons of the multinomial logistic regression (MLR) and k-nn classifiers.

Measure          Synth.   Vehicle   Image
G(MLR)           0.840    ≈0        ≈0
G(k-nn)          0.920    0.168     0.076
δ(MLR, k-nn)     0.001    0.000     0.000
δ(k-nn, MLR)     0.081    0.168     0.076
By using the δ measure we can see that the k-nn model is wholly better than the multinomial logistic regression for both the UCI data sets (where Q = 4 and Q = 7), and almost entirely better on the synthetic data.
6. Conclusion
In this paper we have considered multi-class generalisations of ROC analysis from a multi-objective optimisation perspective. Consideration of the role of
costs in classification leads to a multi-objective optimisation problem in which misclassification rates are
simultaneously optimised. The resulting trade-off surface generalises the binary classification ROC curve
because on it one misclassification rate cannot be improved without degrading at least one other. We have
presented a straightforward general evolutionary algorithm which efficiently locates approximations to the
Pareto optimal ROC surface. Although the algorithm
clearly takes longer than training a single classifier, on
the data presented here run times are on the order of
a few minutes.
The Pareto optimal ROC surface yields a natural way
of comparing classifiers in terms of the volume that
the classifiers’ ROC surfaces dominate. We defined
and illustrated a generalisation of the Gini coefficient
for multi-class problems that quantifies the superiority
of a classifier to random allocation. This measure in
turn has proved itself useful in practice.
Finally, we remark that some imprecise information
about the costs of misclassification may often be
available. Lachiche and Flach (2003) have considered multi-class ROC optimisation when the costs are
known. However, if partial information about the
costs, such as approximate bounds on the ratios of
the λkj , is known, the evolutionary algorithm is easily
focused on the relevant region by setting the Dirichlet
parameters αkj appearing in (5) to be in the ratio of
the expected costs, with their magnitudes setting the
variance in the cost ratios.
Acknowledgements
This work was supported in part by the EPSRC, grant
GR/R24357/01. We thank Trevor Bailey, Adolfo Hernandez, Wojtek Krzanowski, Derek Partridge, Vitaly Schetinin, Jufen Zhang and two anonymous referees for their helpful comments.
References

Adams, N., & Hand, D. (1999). Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32, 1139–1147.

Anastasio, M., Kupinski, M., & Nishikawa, R. (1998). Optimization and FROC analysis of rule-based detection schemes using a multiobjective approach. IEEE Transactions on Medical Imaging, 17, 1089–1093.

Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145–1159.

Deb, K. (2001). Multi-objective optimization using evolutionary algorithms. Chichester: Wiley.

Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.

Elkan, C. (2001). The foundations of cost-sensitive learning. IJCAI (pp. 973–978).

Everson, R., & Fieldsend, J. (2005). Multi-class ROC analysis from a multi-objective optimisation perspective (Technical Report 421). Department of Computer Science, University of Exeter.

Everson, R., & Fieldsend, J. (2006). Multi-objective optimisation of safety related systems: An application to short term conflict alert. IEEE Transactions on Evolutionary Computation. (In press).

Ferri, C., Hernández-Orallo, J., & Salido, M. (2003). Volume under the ROC surface for multi-class problems. ECML 2003 (pp. 108–120).

Fieldsend, J., & Everson, R. (2004). ROC optimisation of safety related systems. Proceedings of ROCAI 2004, part of the 16th European Conference on Artificial Intelligence (ECAI) (pp. 37–44).

Fieldsend, J., Everson, R., & Singh, S. (2003). Using unconstrained elite archives for multi-objective optimisation. IEEE Transactions on Evolutionary Computation, 7, 305–323.

Flach, P., Blockeel, H., Ferri, C., Hernández-Orallo, J., & Struyf, J. (2003). Decision support for data mining: Introduction to ROC analysis and its applications. In D. Mladenic, N. Lavrac, M. Bohanec and S. Moyle (Eds.), Data mining and decision support: Integration and collaboration, 81–90. Kluwer.

Hand, D., & Till, R. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45, 171–186.

Hanley, J., & McNeil, B. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.

Hernández-Orallo, J., Ferri, C., Lachiche, N., & Flach, P. (Eds.). (2004). ROC Analysis in Artificial Intelligence, 1st International Workshop, ROCAI-2004, Valencia, Spain.

Holmes, C., & Adams, N. (2002). A probabilistic nearest neighbour method for statistical pattern recognition. Journal of the Royal Statistical Society B, 64, 1–12.

Knowles, J., & Corne, D. (2000). Approximating the nondominated front using the Pareto archived evolution strategy. Evolutionary Computation, 8, 149–172.

Kupinski, M., & Anastasio, M. (1999). Multiobjective genetic optimization of diagnostic classifiers with implications for generating receiver operating characteristic curves. IEEE Transactions on Medical Imaging, 18, 675–685.

Lachiche, N., & Flach, P. (2003). Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. ICML 2003 (pp. 416–423).

Provost, F., & Fawcett, T. (1997). Analysis and visualisation of classifier performance: Comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 43–48). Menlo Park, CA: AAAI Press.

Ripley, B. (1994). Neural networks and related methods for classification (with discussion). Journal of the Royal Statistical Society Series B, 56, 409–456.

Srinivasan, A. (1999). Note on the location of optimal classifiers in n-dimensional ROC space (Technical Report PRG-TR-2-99). Oxford University Computing Laboratory, Oxford.

Yao, X., Liu, Y., & Lin, G. (1999). Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation, 3, 82–102.