International Biometric Society


Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach


Author(s): Elizabeth R. DeLong, David M. DeLong and Daniel L. Clarke-Pearson
Reviewed work(s):
Source: Biometrics, Vol. 44, No. 3 (Sep., 1988), pp. 837-845
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2531595 .
Accessed: 18/02/2013 14:49


This content downloaded on Mon, 18 Feb 2013 14:49:38 PM


All use subject to JSTOR Terms and Conditions
BIOMETRICS 44, 837-845
September 1988

Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach

Elizabeth R. DeLong
Quintiles, Inc., 1829 East Franklin Street,
Chapel Hill, North Carolina 27514, U.S.A.

David M. DeLong
SAS Institute, Cary, North Carolina 27511, U.S.A.

and

Daniel L. Clarke-Pearson
Division of Oncology, Department of OBGYN, Duke University Medical Center,
Durham, North Carolina 27710, U.S.A.

SUMMARY
Methods of evaluating and comparing the performance of diagnostic tests are of increasing importance
as new tests are developed and marketed. When a test is based on an observed variable that lies on a
continuous or graded scale, an assessment of the overall value of the test can be made through the
use of a receiver operating characteristic (ROC) curve. The curve is constructed by varying the
cutpoint used to determine which values of the observed variable will be considered abnormal and
then plotting the resulting sensitivities against the corresponding false positive rates. When two or
more empirical curves are constructed based on tests performed on the same individuals, statistical
analysis on differences between curves must take into account the correlated nature of the data. This
paper presents a nonparametric approach to the analysis of areas under correlated ROC curves, by
using the theory on generalized U-statistics to generate an estimated covariance matrix.

Key words: Jackknifing; Mann-Whitney test; Receiver operating characteristic (ROC) curve;
Structural components; U-statistics.

1. Introduction
Methods of evaluating and comparing the performance of diagnostic tests or indices are of
increasing importance as new tests or indices are developed or measured. When a test is
based on an observed variable that lies on a continuous or graded scale, an assessment of
the overall value of the test can be made through the use of a receiver operating characteristic
(ROC) curve (Hanley and McNeil, 1982; Metz, 1978). The underlying population curve is
theoretically given by varying the cutpoint used to determine the values of the observed
variable to be considered abnormal and then plotting the resulting sensitivities against the
corresponding false positive rates. If a test could perfectly discriminate, it would have a
value above which the entire abnormal population would fall and below which all normal
values would fall (or vice versa). The curve would then pass through the point (0, 1) on the
unit grid. The closer an ROC curve comes to this ideal point, the better its discriminating
ability. A test with no discriminating ability will produce a curve that follows the diagonal
of the grid.

For statistical analysis, a recommended index of accuracy associated with an ROC curve
is the area under the curve (Swets and Pickett, 1982). The area under the population ROC
curve represents the probability that, when the variable is observed for a randomly selected
individual from the abnormal population and a randomly selected individual from the
normal population, the resulting values will be in the correct order (e.g., abnormal value
higher than the normal value). Generally, parametric assumptions are applied on the
distributions of the observed variable in the normal and the abnormal populations.
Maximum likelihood programs for estimating the area under the curve and relevant
parameters under a binormal model assumption have been widely employed (Dorfman
and Alf, 1969; Metz, 1978; Swets and Pickett, 1982) in order to estimate this area, although
these distributions cannot be uniquely determined from the ROC curve. The methodology
has been extended (Metz, Wang, and Kronman, 1984) to a "bivariate binormal" model for
testing differences between correlated sample ROC curves that arise, for example, when
different diagnostic tests are performed on the same individuals.

This paper addresses the nonparametric comparison of areas under correlated ROC
curves. When calculated by the trapezoidal rule, the area falling under the points comprising
an empirical ROC curve has been shown to be equal to the Mann-Whitney U-statistic for
comparing distributions of values from the two samples (Bamber, 1975). Although the
trapezoidal rule systematically underestimates the true area (Hanley and McNeil, 1982;
Swets and Pickett, 1982) when the number of distinct values taken on by a discrete-valued
diagnostic variable is small (say, 5 or 6), it nonetheless produces a meaningful statistic that
can be used with confidence when the variable takes on a larger number of values. Hanley
and McNeil (1983) use some properties of this nonparametric statistic to compare areas
under ROC curves arising from two measures applied to the same individuals. Their
approach involves calculating for both the normal and the abnormal sample the correlation
between the values of the original measures. The average of the two correlations is used
along with the average of the areas under the two curves to arrive at an estimated correlation
between the two areas. A table that applies when the average area is at least .70 is given.
However, for measures that are not continuous or nearly so, their method relies on Gaussian
modeling assumptions for estimating the variances of the two areas. In Section 2 we present
an alternative methodology using a more completely nonparametric approach which
exploits the properties of the Mann-Whitney statistic. Section 3 presents an example of
three correlated ROC curves derived from data on ovarian cancer patients undergoing
surgery for bowel obstruction. Three different prognostic indices are evaluated and
compared.

2. Analysis of Areas Under Correlated ROC Curves

Suppose a sample of N individuals undergo a test for predicting an event of interest or
determining presence or absence of a medical condition and that the test is based on a
continuous-valued diagnostic variable. We will follow the convention that higher values of
the test variable are assumed to be associated with the event of interest, e.g., positive disease
status. Also suppose it can be determined by means independent of the test that m of these
individuals truly undergo the event or have the condition. Let this group be denoted by $C_1$
and let the group of n (= N - m) individuals who do not have the condition be denoted by
$C_2$. Let $X_i$, i = 1, 2, ..., m and $Y_j$, j = 1, 2, ..., n be the values of the variable on which
the diagnostic test is based for members of $C_1$ and $C_2$, respectively. These outcome values
can be used to construct an empirical ROC curve for assessing the diagnostic performance
of the test. For any real number z, let

\[
\mathrm{sens}(z) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \geq z),
\]

where I(A) = 1 if A is true and 0 otherwise. Also let

\[
\mathrm{spec}(z) = \frac{1}{n} \sum_{j=1}^{n} I(Y_j < z).
\]

Then sens(z) is the empirical sensitivity of a test that is derived by dichotomizing the
variable into positive or negative results on the basis of the cutpoint z and spec(z) is the
corresponding empirical specificity. Now, as z varies over the possible values of the variable,
the empirical ROC curve is a plot of sens(z) versus [1 - spec(z)]. Clearly, when z is larger
than the largest possible value, the curve passes through (0, 0) and it monotonically
increases to the point (1, 1) as z decreases to the smallest possible value. To be informative,
the entire curve should lie above the 45° line where sens(z) = 1 - spec(z). Selection of an
optimal cutpoint depends on a cost function of sensitivity and specificity.
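
As a minimal sketch of the construction just described, the following Python function evaluates sens(z) and 1 - spec(z) at every observed value, plus one cutpoint above the maximum so that the curve starts at (0, 0). The function name `empirical_roc` and the toy data are illustrative assumptions and do not come from the paper; a result is treated as positive when the value is at least z.

```python
def empirical_roc(x, y):
    """Return (false positive rate, sensitivity) pairs of the empirical ROC curve.

    x: values for the group with the condition (C1)
    y: values for the group without the condition (C2)
    A result is called positive when the value is >= the cutpoint z.
    """
    m, n = len(x), len(y)
    # Candidate cutpoints: every observed value, from largest to smallest,
    # preceded by one value larger than anything observed.
    cutpoints = sorted(set(x) | set(y), reverse=True)
    cutpoints.insert(0, cutpoints[0] + 1)
    points = []
    for z in cutpoints:
        sens = sum(xi >= z for xi in x) / m
        spec = sum(yj < z for yj in y) / n
        points.append((1 - spec, sens))
    return points


# Hypothetical toy data, for illustration only
x = [3, 5, 6]
y = [1, 3, 4, 2]
roc = empirical_roc(x, y)
```

The first point corresponds to a cutpoint above every observation and the last to the smallest observed value, so the list runs from (0, 0) to (1, 1).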
It has been shown that the area under an empirical ROC curve, when calculated by the
trapezoidal rule, is equal to the Mann-Whitney two-sample statistic applied to the two
samples {Xi} and {Yj}. Because the Mann-Whitney statistic is a generalized U-statistic,
statistical analysis regarding the performance of diagnostic tests can be performed by
exploiting the general theory for U-statistics.
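
The equality between the trapezoidal area and the Mann-Whitney statistic can be checked numerically. The sketch below uses exact rational arithmetic so that the two quantities can be compared without rounding error; the function names and the toy data are assumptions made for illustration, and ties between an X and a Y are counted as one half.

```python
from fractions import Fraction


def trapezoid_auc(x, y):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    cuts = sorted(set(x) | set(y), reverse=True)
    cuts.insert(0, cuts[0] + 1)  # cutpoint above every observation -> (0, 0)
    pts = [
        (Fraction(sum(v >= z for v in y), len(y)),   # false positive rate
         Fraction(sum(v >= z for v in x), len(x)))   # sensitivity
        for z in cuts
    ]
    area = Fraction(0)
    for (f0, s0), (f1, s1) in zip(pts, pts[1:]):
        area += (f1 - f0) * (s0 + s1) / 2
    return area


def mann_whitney(x, y):
    """Proportion of (X, Y) pairs with Y < X, counting ties as one half."""
    total = Fraction(0)
    for xi in x:
        for yj in y:
            if yj < xi:
                total += 1
            elif yj == xi:
                total += Fraction(1, 2)
    return total / (len(x) * len(y))


# Hypothetical toy data, for illustration only
x = [3, 5, 6]
y = [1, 3, 4, 2]
auc_trapezoid = trapezoid_auc(x, y)
auc_mann_whitney = mann_whitney(x, y)
```

For these data both functions return the same fraction, 7/8.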
The Mann-Whitney statistic estimates the probability, $\theta$, that a randomly selected
observation from the population represented by $C_2$ will be less than or equal to a randomly
selected observation from the population represented by $C_1$. It can be computed as the
average over a kernel, $\psi$, as

\[
\hat\theta = \frac{1}{mn} \sum_{j=1}^{n} \sum_{i=1}^{m} \psi(X_i, Y_j),
\]

where

\[
\psi(X, Y) =
\begin{cases}
1 & Y < X, \\
\tfrac{1}{2} & Y = X, \\
0 & Y > X.
\end{cases}
\]

In terms of probabilities, $E(\hat\theta) = \theta = \Pr(Y < X) + \tfrac{1}{2}\Pr(X = Y)$. For continuous
distributions, $\Pr(Y = X) = 0$.
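
As a small worked illustration (the data are hypothetical and not from the paper), take $X = \{3, 5, 6\}$ from $C_1$ and $Y = \{1, 2, 3, 4\}$ from $C_2$, so that m = 3 and n = 4. Scoring each of the 12 pairs as 1 when $Y < X$, one half when $Y = X$, and 0 when $Y > X$ gives

\[
\hat\theta = \frac{1}{12}\Bigl[\underbrace{(1 + 1 + \tfrac{1}{2} + 0)}_{X = 3}
  + \underbrace{(1 + 1 + 1 + 1)}_{X = 5}
  + \underbrace{(1 + 1 + 1 + 1)}_{X = 6}\Bigr]
  = \frac{10.5}{12} = 0.875 .
\]

The single tie (X = 3, Y = 3) contributes one half, which is the sample counterpart of the $\tfrac{1}{2}\Pr(X = Y)$ term.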
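Asymptotic normality and an expression for the variance of the Mann-Whitney statistic
can be derived from theory developed for generalized U-statistics by Hoeffding (1948).
Define

\[
\begin{aligned}
\xi_{10} &= E[\psi(X_i, Y_j)\,\psi(X_i, Y_k)] - \theta^2, & j \neq k; \\
\xi_{01} &= E[\psi(X_i, Y_j)\,\psi(X_k, Y_j)] - \theta^2, & i \neq k; \\
\xi_{11} &= E[\psi(X_i, Y_j)\,\psi(X_i, Y_j)] - \theta^2.
\end{aligned}
\tag{1}
\]

Then

\[
\operatorname{var}(\hat\theta) = \frac{(n - 1)\,\xi_{10} + (m - 1)\,\xi_{01}}{mn} + \frac{\xi_{11}}{mn}.
\tag{2}
\]
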
Bamber (1975) provides a method of estimating the variance in the context of testing the
significance of a single ROC curve. Bamber introduces a quantity $B_{XXY}$, which is the
probability that two randomly chosen elements of the population $C_1$ will both be greater
than or less than a randomly chosen element of $C_2$, minus the complementary probability
that the observation from $C_2$ will be between the two from $C_1$. A similar quantity $B_{YYX}$ is
also defined and the variance of $\hat\theta$ is given in terms of $B_{XXY}$ and $B_{YYX}$. $\operatorname{var}(\hat\theta)$ is then
estimated by empirically estimating $B_{YYX}$ and $B_{XXY}$. Formula (2) can be shown to be
equivalent to Bamber's formula (4), which derives from work of Noether (1967) and applies
when X and Y are not necessarily continuous.
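Hoeffding's theory extends to a vector of U-statistics. Let $\hat\theta = (\hat\theta^1, \hat\theta^2, \ldots, \hat\theta^k)$ be a vector
of statistics, representing the areas under the ROC curves derived from the readings
$\{X_i^r\}$, $\{Y_j^r\}$ (i = 1, ..., m; j = 1, ..., n; $1 \leq r \leq k$) of k different diagnostic measures.
Then, similar to (1) above, define

\[
\begin{aligned}
\xi_{10}^{rs} &= E[\psi(X_i^r, Y_j^r)\,\psi(X_i^s, Y_k^s)] - \theta^r\theta^s, & j \neq k; \\
\xi_{01}^{rs} &= E[\psi(X_i^r, Y_j^r)\,\psi(X_k^s, Y_j^s)] - \theta^r\theta^s, & i \neq k; \\
\xi_{11}^{rs} &= E[\psi(X_i^r, Y_j^r)\,\psi(X_i^s, Y_j^s)] - \theta^r\theta^s.
\end{aligned}
\tag{3}
\]

The covariance of the rth and sth statistic can then be written as

\[
\operatorname{cov}(\hat\theta^r, \hat\theta^s) = \frac{(n - 1)\,\xi_{10}^{rs} + (m - 1)\,\xi_{01}^{rs}}{mn} + \frac{\xi_{11}^{rs}}{mn}.
\]
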
Sen (1960) has provided a method of structural components to provide consistent estimates
of the elements of the variance-covariance matrix of a vector of U-statistics. This approach
turns out to be equivalent to jackknifing, but is conceptually simpler when dealing with
U-statistics. We will exploit this methodology to compare the areas under two or more
ROC curves. For the rth statistic, $\hat\theta^r$, the X-components and Y-components are defined,
respectively, as

\[
V_{10}^r(X_i) = \frac{1}{n} \sum_{j=1}^{n} \psi(X_i^r, Y_j^r) \qquad (i = 1, 2, \ldots, m)
\]

and

\[
V_{01}^r(Y_j) = \frac{1}{m} \sum_{i=1}^{m} \psi(X_i^r, Y_j^r) \qquad (j = 1, 2, \ldots, n).
\]

Also define the k x k matrix $S_{10}$ such that the (r, s)th element is

\[
s_{10}^{rs} = \frac{1}{m - 1} \sum_{i=1}^{m} [V_{10}^r(X_i) - \hat\theta^r][V_{10}^s(X_i) - \hat\theta^s]
\]

and similarly $S_{01}$, which has (r, s)th element

\[
s_{01}^{rs} = \frac{1}{n - 1} \sum_{j=1}^{n} [V_{01}^r(Y_j) - \hat\theta^r][V_{01}^s(Y_j) - \hat\theta^s].
\]

The estimated covariance matrix for the vector of parameter estimates,
$\hat\theta = (\hat\theta^1, \hat\theta^2, \ldots, \hat\theta^k)$, is thus

\[
S = \frac{1}{m} S_{10} + \frac{1}{n} S_{01}.
\]

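
The structural-components calculation can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' SAS program: the function names, the list-of-lists data layout (one inner list per diagnostic measure), and the toy data are all assumptions made here.

```python
def psi(x, y):
    """Mann-Whitney kernel: 1 if y < x, 1/2 if y == x, 0 if y > x."""
    if y < x:
        return 1.0
    if y == x:
        return 0.5
    return 0.0


def delong_covariance(X, Y):
    """Return (theta, S) for k correlated measures.

    X[r][i]: value of measure r for the i-th subject with the condition (m subjects)
    Y[r][j]: value of measure r for the j-th subject without it (n subjects)
    """
    k, m, n = len(X), len(X[0]), len(Y[0])
    # X-components V10 (one per subject in C1) and Y-components V01 (one per subject in C2)
    V10 = [[sum(psi(X[r][i], Y[r][j]) for j in range(n)) / n for i in range(m)]
           for r in range(k)]
    V01 = [[sum(psi(X[r][i], Y[r][j]) for i in range(m)) / m for j in range(n)]
           for r in range(k)]
    # The area estimate for each measure is the mean of its components
    theta = [sum(V10[r]) / m for r in range(k)]

    def cross(V, r, s, size):
        # (r, s) element of the sample covariance matrix of the components
        return sum((V[r][t] - theta[r]) * (V[s][t] - theta[s])
                   for t in range(size)) / (size - 1)

    S = [[cross(V10, r, s, m) / m + cross(V01, r, s, n) / n for s in range(k)]
         for r in range(k)]
    return theta, S


# Hypothetical toy data: k = 2 measures, m = 3 and n = 4 subjects
X = [[3, 5, 6], [2, 4, 4]]
Y = [[1, 3, 4, 2], [1, 2, 5, 3]]
theta, S = delong_covariance(X, Y)
```

The diagonal of `S` holds the estimated variances of the individual areas and the off-diagonal entries their estimated covariances; with real data the same `theta` and `S` feed directly into the contrasts discussed next.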
Let g be a real-valued function of $\hat\theta$ that has bounded second derivatives in a neighborhood
of $\theta$. Combining results from Sen (1960) and Arveson (1969, Theorem 16), it follows that
if $\lim_{N \to \infty} m/n$ is bounded and nonzero, then $N^{1/2}[g(\hat\theta) - g(\theta)]$ is asymptotically normally
distributed with mean 0 and variance $\sigma_g^2$, where

\[
\sigma_g^2 = \lim_{N \to \infty} \sum_{j=1}^{k} \sum_{i=1}^{k}
  \frac{\partial g}{\partial \theta^i} \frac{\partial g}{\partial \theta^j}\,
  N \left( \frac{1}{m}\,\xi_{10}^{ij} + \frac{1}{n}\,\xi_{01}^{ij} \right).
\]

Further,

\[
s_g^2 = N \sum_{j=1}^{k} \sum_{i=1}^{k}
  \frac{\partial g}{\partial \hat\theta^i} \frac{\partial g}{\partial \hat\theta^j}
  \left( \frac{1}{m}\,s_{10}^{ij} + \frac{1}{n}\,s_{01}^{ij} \right)
\]

is a consistent estimate of $\sigma_g^2$.
When g is simply a linear function, the theory reduces considerably, because the partial
derivatives are the constants that comprise the linear function. Thus, for any contrast
$L\theta'$, where L is a row vector of coefficients,

\[
\frac{L\hat\theta' - L\theta'}{\left[ L \left( \frac{1}{m} S_{10} + \frac{1}{n} S_{01} \right) L' \right]^{1/2}}
\]

has a standard normal distribution. A confidence interval for $L\theta'$ naturally follows.
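
A single contrast can be evaluated with a few lines of code once the areas and their estimated covariance matrix are available. The sketch below is illustrative only: the function name `contrast_summary`, the default critical value 1.96, and the numbers in the usage example are assumptions, not values from the paper.

```python
import math


def contrast_summary(theta, S, L, z_crit=1.96):
    """Estimate, standard error, and confidence interval for the contrast L theta'.

    theta: list of k estimated areas
    S:     k x k estimated covariance matrix of the areas (list of lists)
    L:     list of k contrast coefficients
    """
    k = len(theta)
    estimate = sum(L[r] * theta[r] for r in range(k))
    variance = sum(L[r] * S[r][s] * L[s] for r in range(k) for s in range(k))
    se = math.sqrt(variance)
    return estimate, se, (estimate - z_crit * se, estimate + z_crit * se)


# Hypothetical inputs: two areas and their covariance matrix
theta = [0.80, 0.70]
S = [[0.004, 0.001],
     [0.001, 0.003]]
estimate, se, interval = contrast_summary(theta, S, [1, -1])
z = estimate / se  # statistic for testing that the contrast equals zero
```

The positive covariance between the two areas reduces the variance of their difference relative to treating the curves as independent, which is the practical reason for accounting for the correlation.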
By a modest generalization of these results, we can also apply any set of linear contrasts
to a vector of areas under correlated ROC curves and perform a test of significance on
$L\theta'$. The test then takes the form

\[
(\hat\theta - \theta)\, L' \left[ L \left( \frac{1}{m} S_{10} + \frac{1}{n} S_{01} \right) L' \right]^{-1} L\, (\hat\theta - \theta)'
\tag{5}
\]

which has a chi-square distribution with degrees of freedom equal to the rank of $LSL'$. A
confidence region can also be constructed.
A computer program written in the SAS language is available from the authors for
computing components, covariance matrices, and contrasts. However, as indicated in the
next section, the components can be computed easily by hand or by a simple computer
program. The components can then be input to any program which computes sums of
squares and cross-products in order to obtain the covariance matrix S.

3. Example
When to perform surgical correction of intestinal obstruction in patients known to have
ovarian carcinoma is an unresolved problem. The dilemma centers around determining
those patients for whom surgery presents a benefit. Castelado et al. (1981) and other
authors have proposed that patients who survive longer than 2 months postoperatively can
be declared to have "benefited" from the surgery. Using this criterion, Krebs and Goplerud
(1983) devised a preoperative scoring system for use as a screening test in determining
a patient's risk for failing to benefit from surgery. The scoring algorithm is presented in
Table 1. According to this scoring system, patients with low scores should be good candidates
for surgery and those with higher scores should be considered at risk for failing to benefit
from surgery.

Table 1
Krebs-Goplerud scoring system for prognostic parameters in ovarian carcinoma
complicated by bowel obstruction

Parameter                                                      Assigned risk score
Age (yr)
  <45                                                          0
  45-65                                                        1
  >65                                                          2
Nutritional status (deprivation)
  None or minimal                                              0
  Moderate                                                     1
  Severe                                                       2
Tumor status
  No palpable intra-abdominal masses                           0
  Palpable intra-abdominal masses                              1
  Liver involvement or distant metastases                      2
Ascites
  None or mild (asymptomatic, abdomen not distended)           0
  Moderate (abdomen distended)                                 1
  Severe (symptomatic, requires frequent paracentesis)         2
Previous chemotherapy
  None, or no adequate trial                                   0
  Failed single-drug therapy                                   1
  Failed combination-drug therapy                              2
Previous radiation therapy
  None                                                         0
  Radiation therapy to pelvis                                  1
  Radiation therapy to whole abdomen                           2

The following example evaluates the discriminating ability of the proposed screening
algorithm on 49 consecutive ovarian cancer patients undergoing correction of intestinal
obstruction at Duke University Medical Center. Of the 49 patients, 12 survived more than
2 months postoperatively and could be considered surgical successes; the remaining 37 are
considered failures. The Krebs-Goplerud score (K-G) is compared against two other
preoperatively measured indices: total protein (TP) and albumin (ALB), both of which are
positively associated with the patient's nutritional status. Because ALB is one component
of TP, these two measures are highly correlated, with a Kendall's tau-b value of .65.
Increasing levels of ALB and TP are associated with better nutritional status, whereas
increasing levels of K-G are associated with poorer prognosis. Thus, to simplify
computations, we transformed K-G by subtracting it from 12, the maximum possible value,
so that all indices would prognosticate in the same direction.

Figure 1 displays the empirical ROC curves for the three indices. From this figure, it
appears that K-G offers little improvement over either ALB or TP. The estimated areas
under the curves for K-G, ALB, and TP are .69, .72, and .65, respectively. To analyze and
compare these areas, the covariance matrix for the vector of areas is needed. The method
of structural components easily produces this matrix. For each of the variables of interest
(K-G, ALB, TP), we can denote by $X^r$ (r = 1, 2, 3) the values associated with success and
by $Y^r$ (r = 1, 2, 3) the values associated with surgical failures. Then
$\theta^r = \Pr(Y^r < X^r) + \tfrac{1}{2}\Pr(Y^r = X^r)$ and we compute the components individually for
each of the three variables. If the data are first sorted by the variable of interest, it is a
simple matter to calculate for each X the number of Y's less than X ($\mathrm{NYL}_X$) and the number
of Y's equal to X ($\mathrm{NYEQ}_X$). The component for X is then $\mathrm{NYL}_X + \tfrac{1}{2}\mathrm{NYEQ}_X$, divided by n
to match the definition of $V_{10}^r$ in Section 2. Likewise, for each Y we calculate the
number of X's greater than Y ($\mathrm{NXG}_Y$) and the number of X's equal to Y ($\mathrm{NXEQ}_Y$). The
component for Y is $\mathrm{NXG}_Y + \tfrac{1}{2}\mathrm{NXEQ}_Y$, divided by m.
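
The counting procedure described above can be sketched with sorted data and binary search. The function names and toy data are assumptions for illustration; the counts are scaled by 1/n and 1/m here so that the results match the X-components and Y-components defined in Section 2.

```python
from bisect import bisect_left, bisect_right


def x_components(x, y):
    """For each X: (number of Y's less than X + half the number of Y's equal to X) / n."""
    ys = sorted(y)
    n = len(ys)
    comps = []
    for xi in x:
        n_less = bisect_left(ys, xi)             # count of Y's strictly below xi
        n_equal = bisect_right(ys, xi) - n_less  # count of Y's tied with xi
        comps.append((n_less + 0.5 * n_equal) / n)
    return comps


def y_components(x, y):
    """For each Y: (number of X's greater than Y + half the number of X's equal to Y) / m."""
    xs = sorted(x)
    m = len(xs)
    comps = []
    for yj in y:
        n_greater = m - bisect_right(xs, yj)                   # count of X's strictly above yj
        n_equal = bisect_right(xs, yj) - bisect_left(xs, yj)   # count of X's tied with yj
        comps.append((n_greater + 0.5 * n_equal) / m)
    return comps


# Hypothetical toy data, for illustration only
x = [3, 5, 6]
y = [1, 3, 4, 2]
vx = x_components(x, y)
vy = y_components(x, y)
```

Both sets of components average to the same estimated area, which gives a quick consistency check on a hand calculation.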
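
For this example, there are 12 X's and three variables of interest, so the X-components
form a 12 x 3 matrix, $V_{10}$. The 37 Y's yield a component matrix of dimension 37 x 3,
$V_{01}$. The 3 x 3 matrices $S_{10}$ and $S_{01}$ are then computed as

\[
S_{10} = \frac{1}{11} \left( V_{10}' V_{10} - 12\,\hat\theta'\hat\theta \right)
\]

and

\[
S_{01} = \frac{1}{36} \left( V_{01}' V_{01} - 37\,\hat\theta'\hat\theta \right).
\]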

[Figure 1. Receiver operating characteristic curves for Krebs-Goplerud score, total protein, and albumin. Horizontal axis: false positive rate.]

It is clear that $S_{10}$ and $S_{01}$ are the covariance matrices of $V_{10}$ and $V_{01}$, respectively. They
can readily be obtained from any computer program that computes covariance matrices.
The covariance matrix for the vector of areas is then

\[
S = \frac{1}{12} S_{10} + \frac{1}{37} S_{01}.
\]

This matrix is displayed in Table 2. In Table 3, the resulting correlation coefficients are
presented, along with Kendall's tau-b values for the group that benefited from surgery and
for the remaining group, and finally the estimated correlations derived from the table in
the paper by Hanley and McNeil (1983). For this set of data, our estimates tend to be
larger.

Table 2
Estimated covariance matrix between areas under the three ROC curves

                 Covariance
                 K-G score   Albumin   Total protein
K-G score        .0110       .0033     .0028
Albumin                      .0086     .0076
Total protein                          .0100

Table 3
Correlation coefficients of pairs of areas calculated from estimated covariance matrix (ECM) and
also from method of Hanley and McNeil (HM)

            Correlation   Kendall's tau-b   Kendall's tau-b   Correlation
            (ECM)         Survivors         Nonsurvivors      (HM)
K-G, ALB    .34           .20               .18               .17
K-G, TP     .27           -.01              .21               .10
ALB, TP     .82           .61               .66               .61
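
The ECM correlations in Table 3 follow from the entries of Table 2 by dividing each covariance by the square root of the product of the two variances. A short check, using the rounded values printed in Table 2 (the variable and function names are illustrative assumptions):

```python
import math

# Table 2, filled in symmetrically; order of rows and columns: K-G, ALB, TP
S = [[0.0110, 0.0033, 0.0028],
     [0.0033, 0.0086, 0.0076],
     [0.0028, 0.0076, 0.0100]]
names = ["K-G", "ALB", "TP"]


def correlation(S, r, s):
    """Correlation between the r-th and s-th areas implied by covariance matrix S."""
    return S[r][s] / math.sqrt(S[r][r] * S[s][s])


corr = {(names[r], names[s]): round(correlation(S, r, s), 2)
        for r in range(3) for s in range(r + 1, 3)}
```

Rounded to two decimals, the three values agree with the ECM column of Table 3 (.34, .27, and .82).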

Now, to compare K-G to the average of ALB and TP, we use the contrast
L = (1, -.5, -.5). Evaluated at $\hat\theta$, the value of the contrast is .004. The standard deviation
of this estimate is

\[
(LSL')^{1/2} = .116 .
\]

A two-sided 95% confidence interval for this contrast is thus (-.223, .231), indicating
negligible improvement by K-G over ALB and TP.
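
The standard deviation and interval can be checked by hand from the rounded entries of Table 2. Writing $s_{rs}$ for the (r, s) entry of S with the indices ordered K-G, ALB, TP (a notation introduced only for this check), and using L = (1, -.5, -.5),

\[
\begin{aligned}
LSL' &= s_{11} + \tfrac{1}{4} s_{22} + \tfrac{1}{4} s_{33} - s_{12} - s_{13} + \tfrac{1}{2} s_{23} \\
     &= .0110 + .00215 + .0025 - .0033 - .0028 + .0038 = .01335,
\end{aligned}
\]

so that $(LSL')^{1/2} \approx .116$ and the interval is $.004 \pm 1.96(.116) = .004 \pm .227$, that is, (-.223, .231).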
To determine whether the Krebs-Goplerud score is better than at least one of the other
indices, ALB and TP, we use the contrast

\[
L = \begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \end{pmatrix}.
\]

Then based on (5), the $\chi^2$ statistic with 2 degrees of freedom can be computed as 1.51 with
a P-value of .47. Based on this sample of 49 patients, there appears to be no advantage in
using the Krebs-Goplerud score over other routinely collected nutritional parameters,
although power in this situation is likely to be very small because of the small sample size.
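
The chi-square statistic can be approximately reproduced from the published summaries. The sketch below is an illustration under stated assumptions: it takes the contrast rows to be (1, -1, 0) and (1, 0, -1), and it uses the rounded areas (.69, .72, .65) and the rounded Table 2 covariances, so it will not match the reported 1.51 exactly. For 2 degrees of freedom the chi-square upper tail probability is exp(-x/2), which avoids any need for a statistics library.

```python
import math

theta = [0.69, 0.72, 0.65]            # rounded areas: K-G, ALB, TP
S = [[0.0110, 0.0033, 0.0028],        # rounded Table 2, filled in symmetrically
     [0.0033, 0.0086, 0.0076],
     [0.0028, 0.0076, 0.0100]]
L = [[1, -1, 0],                      # assumed contrast: K-G vs ALB
     [1, 0, -1]]                      #                   K-G vs TP

# Contrast estimates L theta' (a vector of length 2)
d = [sum(L[a][r] * theta[r] for r in range(3)) for a in range(2)]

# 2 x 2 matrix L S L'
M = [[sum(L[a][r] * S[r][s] * L[b][s] for r in range(3) for s in range(3))
      for b in range(2)] for a in range(2)]

# Explicit inverse of the 2 x 2 matrix
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
M_inv = [[M[1][1] / det, -M[0][1] / det],
         [-M[1][0] / det, M[0][0] / det]]

# Quadratic form (5), evaluated under the hypothesis that L theta' = 0
stat = sum(d[a] * M_inv[a][b] * d[b] for a in range(2) for b in range(2))
p_value = math.exp(-stat / 2)         # upper tail of chi-square with 2 df
```

With these rounded inputs the statistic comes out near 1.47 with a P-value near .48, close to the reported 1.51 and .47; the small gap is consistent with rounding in the published areas and covariances.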

4. Discussion
ROC curves are frequently being applied to the evaluation of diagnostic or prognostic tests
and indices. In order to make comparisons between two or more such indices derived from
the same test units or subjects, the implicit correlation between the curves should be taken
into account. This paper has presented a totally nonparametric approach to the comparison
of the areas under two or more ROC curves by using the theory developed for generalized
U-statistics. A covariance matrix can be estimated using the method of structural components
and the resulting test statistic has asymptotically a chi-square distribution. The
covariance matrix may also be used to construct confidence regions.

ACKNOWLEDGEMENTS

This work was supported in part by the Veterans' Administration Region 2 Health Services
Research and Development Field Program.

RÉSUMÉ
The importance of methods for evaluating and comparing the performance of diagnostic tests
grows as new tests are developed and brought to market. When a test is based on an observed
variable that is continuous or that takes its values on a graded scale, an overall assessment of
the value of the test can be made using the receiver operating characteristic (ROC) curve. The
curve is constructed by varying the cutpoint used to determine which values of the observed
variable are to be considered abnormal, and then plotting the resulting sensitivities against
the corresponding false positive rates. The correlated nature of the data must be taken into
account in the statistical analysis of differences between curves when two or more empirical
curves are constructed from tests based on the same individuals. This paper presents a
nonparametric approach to the analysis of areas under correlated ROC curves, using the theory
of generalized U-statistics to generate an estimated covariance matrix.

REFERENCES
Arveson, J. N. (1969). Jackknifing U-statistics. Annals of Mathematical Statistics 40, 2076-2100.
Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver
operating characteristic graph. Journal of Mathematical Psychology 12, 387-415.
Castelado, T. W., Petrilli, E. S., Ballon, S. C., and Lagasse, L. D. (1981). Intestinal operations in
patients with ovarian carcinoma. American Journal of Obstetrics and Gynecology 139, 80-84.
Dorfman, D. D. and Alf, E. (1969). Maximum likelihood estimation of parameters of signal detection
theory and determination of confidence intervals - rating-method data. Journal of Mathematical
Psychology 6, 487-496.
Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating
characteristic (ROC) curve. Radiology 143, 29-36.
Hanley, J. A. and McNeil, B. J. (1983). A method of comparing the area under two ROC curves
derived from the same cases. Radiology 148, 839-843.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of
Mathematical Statistics 19, 293-325.
Krebs, H. B. and Goplerud, D. R. (1983). Surgical management of bowel obstruction in advanced
ovarian carcinoma. Obstetrics and Gynecology 61, 327-330.
Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine 8, 283-298.
Metz, C. E., Wang, P.-L., and Kronman, H. B. (1984). A new approach for testing the significance
of differences between ROC curves measured from correlated data. In Information Processing in
Medical Imaging VIII, F. Deconick (ed.), 432-445. The Hague: Martinus Nijhof.
Noether, G. E. (1967). Elements of Nonparametric Statistics. New York: Wiley.
Sen, P. K. (1960). On some convergence properties of U-statistics. Calcutta Statistical Association
Bulletin 10, 1-18.
Swets, J. A. and Pickett, R. M. (1982). Evaluation of Diagnostic Systems: Methods from Signal
Detection Theory. New York: Academic Press.

Received April 1987; revised October 1987 and January 1988.
