Colin Cameron 1997

JOURNAL OF ECONOMETRICS
ELSEVIER Journal of Econometrics 77 (1997) 329-342
An R-squared measure of goodness of fit for some common nonlinear regression models
A. Colin Cameron (a), Frank A.G. Windmeijer (b, *)
(a) Department of Economics, University of California, Davis, CA 95616-8578, USA
(b) Department of Economics, University College London, London WC1E 6BT, UK
(Received September 1994; final version received January 1996)
Abstract
For regression models other than the linear model, R-squared type goodness-of-fit
summary statistics have been constructed for particular models using a variety of
methods. We propose an R-squared measure of goodness of fit for the class of exponential
family regression models, which includes logit, probit, Poisson, geometric, gamma,
and exponential. This R-squared is defined as the proportionate reduction in uncertainty,
measured by Kullback-Leibler divergence, due to the inclusion of regressors. Under
further conditions concerning the conditional mean function it can also be interpreted as
the fraction of uncertainty explained by the fitted model.
Key words: R-squared; Exponential family regression; Kullback-Leibler divergence; Entropy; Information theory; Deviance; Maximum likelihood
JEL classification: C52; C29
1. Introduction
For the standard linear regression model the familiar coefficient of determina-
tion, R-squared (R2), is a widely used goodness-of-fit measure whose usefulness
and limitations are more or less known to the applied researcher. Application of
this measure to nonlinear models generally leads to a measure that can lie
*Corresponding author.
The authors are grateful to Richard Blundell, Shiferaw Gurmu, and two anonymous referees for
their helpful comments.
0304-4076/97/$15.00 © 1997 Elsevier Science S.A. All rights reserved
PII S0304-4076(96)01818-0
outside the [0, 1] interval and decrease as regressors are added. Alternative
R2-type goodness-of-fit summary statistics have been constructed for particular
nonlinear models using a variety of methods. For binary choice models, such as
logit and probit, there is an abundance of measures; see Maddala (1983) and
Windmeijer (1995). For censored latent models, such as the binary choice and
tobit models, it is possible to avoid nonlinearity by obtaining an approximation
of the usual $R^2$ for the linear latent variable model; see McKelvey and Zavoina
(1976), Laitila (1993), and Veall and Zimmermann (1992, 1995). For other
nonlinear regression models $R^2$ measures are very rarely used.
Desirable properties of an R-squared include interpretation in terms of the
information content of the data, and sufficient generality to cover a reasonably
broad class of models. We propose an R-squared measure based on the
Kullback-Leibler divergence for regression models in the exponential family.
This measure can be applied to a range of commonly used nonlinear regression
models: the normal for continuous dependent variable $y \in (-\infty, \infty)$; exponential,
gamma, and inverse-Gaussian for continuous $y \in (0, \infty)$; logit, probit, and
other Bernoulli regression models for discrete $y = 0, 1$; binomial ($m$ trials) for
discrete $y = 0, 1, \ldots, m$; Poisson and geometric for discrete $y = 0, 1, 2, \ldots$
The exponential family regression model is described in Section 2. In Section
3, the $R^2$ measure based on the Kullback-Leibler divergence is presented. This
measures the proportionate reduction in uncertainty due to the inclusion of
regressors. Interpretation of the measure in terms of the fraction of uncertainty
explained by the fitted model is given in Section 4. Examples are presented in
Section 5. Extensions and other goodness-of-fit statistics are discussed in
Section 6. Section 7 contains an application to a gamma model for accident
claims data. Section 8 concludes.
2. Exponential family regression models
Following Hastie (1987), assume that the dependent variable $Y$ has distribution
in the one-parameter exponential family with density
$$f_\theta(y) = \exp[\theta y - b(\theta)]\, h(y), \qquad (1)$$
where $\theta$ is the natural or canonical parameter, $b(\theta)$ is the normalizing function,
and $h(\cdot)$ is a known function. Different $b(\theta)$ correspond to different distributions.
The mean of $Y$, denoted $\mu$, can be shown to equal the derivative $b'(\theta)$, and is
monotone in $\theta$. Therefore, the density can equivalently be indexed by $\mu$, and
expressed as
$$f_\mu(y) = \exp[c(\mu) y - d(\mu)]\, h(y).$$
General statistical theory for regression models based on the exponential
family is given in Wedderburn and Nelder (1972), Gourieroux et al. (1984), and
White (1993). The standard reference for applications is McCullagh and Nelder
(1989). Regressors are introduced by specifying $\mu$ to be a function of the linear
predictor $\eta = x'\beta$, where $x$ is a vector of regressors and $\beta$ is an unknown
parameter vector. Models obtained by various choices of $b(\theta)$ and functions of
$\eta$ are called generalized linear models. More specialized results are obtained by
choice of the canonical link function, for which $\eta = \theta$, i.e., $\theta$ in (1) is set equal to
$x'\beta$.
Binary choice models are an example of exponential family regression models.
Then $Y$ is Bernoulli distributed with parameter $\mu$ and density
$f_\mu(y) = \mu^y (1 - \mu)^{1-y}$, $y \in \{0, 1\}$. This can be expressed as (1) with
$\theta = \log(\mu/(1 - \mu))$ and $b(\theta) = \log(1 + \exp(\theta))$. The logit regression model specifies
$\mu = \exp(x'\beta)/(1 + \exp(x'\beta))$, while the probit regression model specifies
$\mu = \Phi(x'\beta)$, where $\Phi$ is the standard normal cumulative distribution function.
The logit model corresponds to use of the canonical link function.
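As a small numerical illustration of the Bernoulli case, the sketch below checks that the mean-indexed density and the canonical form (1) agree, and that the mean equals $b'(\theta)$. The function names and the value $\mu = 0.3$ are illustrative assumptions, not part of the paper.

```python
import math

def bernoulli_density(y, mu):
    """Bernoulli density indexed by the mean: mu^y * (1 - mu)^(1 - y)."""
    return mu ** y * (1.0 - mu) ** (1 - y)

def canonical_density(y, theta):
    """Same density in the form exp[theta*y - b(theta)], with h(y) = 1."""
    b = math.log(1.0 + math.exp(theta))
    return math.exp(theta * y - b)

def b_prime(theta, eps=1e-6):
    """Central finite-difference derivative of b(theta) = log(1 + exp(theta))."""
    b = lambda t: math.log(1.0 + math.exp(t))
    return (b(theta + eps) - b(theta - eps)) / (2.0 * eps)

mu = 0.3                               # hypothetical mean
theta = math.log(mu / (1.0 - mu))      # canonical parameter (log-odds)

for y in (0, 1):
    print(y, bernoulli_density(y, mu), canonical_density(y, theta))
print("b'(theta) =", b_prime(theta), "vs mu =", mu)
```

The two density columns should coincide, and the finite-difference derivative should reproduce the mean up to numerical error.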
The parameter vector $\beta$ is estimated by the maximum likelihood (ML)
estimator $\hat\beta$, based on the independent sample $\{(y_i, x_i),\ i = 1, \ldots, n\}$, with
$f_{\mu_i}(y_i) = f_{\mu_j}(y_j)$ for $\mu_i = \mu_j$. The estimated mean for an observation with regressor
$x$ is denoted $\hat\mu = \mu(x'\hat\beta)$. Throughout we assume that the model includes a constant
term. The estimated mean from ML estimation of the constant-only model
is denoted $\hat\mu_0$.
3. R-squared based on the Kullback-Leibler divergence
A standard measure of the information content from observations in a density
$f(y)$ is the expected information, or Shannon's entropy, $E[\log(f(y))]$. This is the
basis for the standard measure of discrepancy between two densities, the
Kullback-Leibler divergence (Kullback, 1959). Recent surveys are given by
Maasoumi (1993) and Ullah (1993).
Consider two densities, denoted $f_{\mu_1}(y)$ and $f_{\mu_2}(y)$, that are parameterized only
by the mean. In this case the general formula for the Kullback-Leibler (KL)
divergence is
$$K(\mu_1, \mu_2) \equiv 2 E_{\mu_1} \log[f_{\mu_1}(y)/f_{\mu_2}(y)], \qquad (2)$$
where a factor two is added for convenience, and $E_{\mu_1}$ denotes expectation taken
with respect to the density $f_{\mu_1}(y)$. $K(\mu_1, \mu_2)$ is the information of $\mu_1$ with respect
to $\mu_2$ and is a measure of how close $\mu_1$ and $\mu_2$ are. The term divergence rather
than distance is used because it does not in general satisfy the symmetry and
triangular properties of a distance measure. However, $K(\mu_1, \mu_2) \ge 0$ with equality
iff $f_{\mu_1} = f_{\mu_2}$.
In addition to $f_{\mu_1}(y)$ and $f_{\mu_2}(y)$ we also consider the density $f_y(y)$, for which the
mean is set equal to the realized $y$. Then the KL divergence $K(y, \mu)$ can be
defined in a manner analogous to (2) as
$$K(y, \mu) = 2 E_y \log[f_y(y)/f_\mu(y)] = 2 \int f_y(y) \log[f_y(y)/f_\mu(y)]\, dy. \qquad (3)$$
The random variable $K(y, \mu)$ is a measure of the deviation of $y$ from the mean $\mu$.
For the exponential family, Hastie (1987) and Vos (1991) show that the expectation
in (3) drops out and
$$K(y, \mu) = 2 \log[f_y(y)/f_\mu(y)].$$
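To make the last expression concrete, the following sketch evaluates $K(y, \mu) = 2\log[f_y(y)/f_\mu(y)]$ for a single Poisson observation in two ways: directly from the log-densities, and from the closed form $2\{y\log(y/\mu) - (y - \mu)\}$. The Poisson choice, the helper names, and the test values are assumptions made for illustration; the convention $0 \log 0 = 0$ is used.

```python
import math

def poisson_logpmf(y, mu):
    """log of exp(-mu) * mu**y / y!, taking y*log(mu) as 0 when y == 0."""
    term = y * math.log(mu) if y > 0 else 0.0
    return -mu + term - math.lgamma(y + 1)

def kl_from_densities(y, mu):
    """K(y, mu) = 2 * [log f_y(y) - log f_mu(y)] for one observation."""
    return 2.0 * (poisson_logpmf(y, y) - poisson_logpmf(y, mu))

def kl_closed_form(y, mu):
    """Poisson unit divergence 2 * {y*log(y/mu) - (y - mu)}."""
    ylog = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (ylog - (y - mu))

for y in (0, 1, 5):                     # hypothetical counts
    print(y, kl_from_densities(y, 2.5), kl_closed_form(y, 2.5))
```

The two columns should agree, and both are zero only when $\mu = y$.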

In the estimated model, with $n$ individual estimated means $\hat\mu_i = \mu(x_i'\hat\beta)$, the
estimated KL divergence between the $n$-vectors $y$ and $\hat\mu$ is equal to twice the
difference between the maximum log-likelihood achievable, i.e., the log-likelihood
in a full model with as many parameters as observations, $l(y; y)$, and the
log-likelihood achieved by the model under investigation, $l(\hat\mu; y)$:
$$K(y, \hat\mu) = 2 \sum_{i=1}^{n} [\log f_{y_i}(y_i) - \log f_{\hat\mu_i}(y_i)] = 2[l(y; y) - l(\hat\mu; y)]. \qquad (4)$$
Let $\hat\mu_0$ denote the $n$-vector with entries $\hat\mu_0$, the fitted mean from ML estimation
of the constant-only model. We interpret $K(y, \hat\mu_0)$ as the estimate of the
information in the sample data on $y$ potentially recoverable by inclusion of
regressors. It is the difference between the information in the sample data on $y$,
and the estimated information using $\hat\mu_0$, the best point estimate when data on
regressors are not utilized, where information is measured by taking expectation
with respect to the observed value $y$. By choosing $\hat\mu_0$ to be the MLE, $K(y, \hat\mu_0)$
is minimized. The R-squared we propose is the proportionate reduction in
this potentially recoverable information achieved by the fitted regression
model:
$$R^2_{KL} = 1 - \frac{K(y, \hat\mu)}{K(y, \hat\mu_0)}. \qquad (5)$$
This measure can be used for fitted means obtained by any estimation method.
In the following proposition we restrict attention to ML estimation (which
minimizes $K(y, \hat\mu)$):
Proposition 1. For ML estimates of exponential family regression models based
on the density (1), $R^2_{KL}$ defined in (5) has the following properties.
1. $R^2_{KL}$ is nondecreasing as regressors are added.
2. $0 \le R^2_{KL} \le 1$.
3. $R^2_{KL}$ is a scalar multiple of the likelihood ratio test for the joint significance of
the explanatory variables.
4. $R^2_{KL}$ equals the likelihood ratio index $1 - l(\hat\mu; y)/l(\hat\mu_0; y)$ if and only if $l(y; y) = 0$.
5. $R^2_{KL}$ measures the proportionate reduction in recoverable information due to
the inclusion of regressors, where information is measured by the estimated
Kullback-Leibler divergence (4).
Proof
1. The MLE minimizes $K(y, \hat\mu)$, which will therefore not increase as regressors
are added.
2. The lower bound of 0 occurs if inclusion of regressors leads to no change in
the fitted mean, i.e., $\hat\mu = \hat\mu_0$, and the upper bound occurs when the model fit is
perfect.
3. Follows directly from re-expressing $R^2_{KL}$ as $2[l(\hat\mu; y) - l(\hat\mu_0; y)]/K(y, \hat\mu_0)$.
4. Follows directly from re-expressing $R^2_{KL}$ as
$[1 - l(\hat\mu; y)/l(\hat\mu_0; y)]\,[l(\hat\mu_0; y)/(l(\hat\mu_0; y) - l(y; y))]$.
5. See the discussion leading up to (5).
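The re-expressions used in steps 3 and 4 can be written out from (4) and (5). The lines below are a sketch of that algebra; the symbol $\mathrm{LRT}$ is taken here to mean the usual likelihood ratio statistic $2[l(\hat\mu; y) - l(\hat\mu_0; y)]$, which is an assumption about notation at this point in the text.

```latex
\begin{align*}
R^2_{KL} &= 1 - \frac{K(y,\hat\mu)}{K(y,\hat\mu_0)}
          = \frac{K(y,\hat\mu_0) - K(y,\hat\mu)}{K(y,\hat\mu_0)}
          = \frac{2[l(\hat\mu;y) - l(\hat\mu_0;y)]}{K(y,\hat\mu_0)}
          = \frac{\mathrm{LRT}}{K(y,\hat\mu_0)}, \\[4pt]
R^2_{KL} &= \frac{l(\hat\mu;y) - l(\hat\mu_0;y)}{l(y;y) - l(\hat\mu_0;y)}
          = \left[1 - \frac{l(\hat\mu;y)}{l(\hat\mu_0;y)}\right]
            \frac{l(\hat\mu_0;y)}{l(\hat\mu_0;y) - l(y;y)} .
\end{align*}
```

In the first line the divisor $K(y, \hat\mu_0)$ depends only on the constant-only fit, so it is a fixed scalar across models with different regressors. In the second line the last factor equals one exactly when $l(y; y) = 0$.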
Properties 1 and 2 are standard properties often desired for R-squared
measures. Property 3 generalizes a similar result for the linear regression model
under normality. The relationship between likelihood ratio tests and the
Kullback-Leibler divergence is fully developed in Vuong (1989). Property 4 is of
interest as the likelihood ratio index, which measures the proportionate reduction
in the log-likelihood due to inclusion of regressors, is sometimes proposed
as a general pseudo R-squared measure. Equality occurs for the Bernoulli
model, but in general the likelihood ratio index differs and, for other discrete
dependent variable models, is more pessimistic regarding the contribution of
regressors, as $l(y; y) \le 0$. In the continuous case, large values (positive or negative)
of the likelihood ratio index can arise if $l(\hat\mu_0; y)$ is close to zero (positive or
negative). By contrast, $R^2_{KL}$ will always be bounded by zero and one. The final
property establishes an information-theoretic basis for $R^2_{KL}$.
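The contrast between $R^2_{KL}$ and the likelihood ratio index can be seen in a small Poisson example. This is a sketch under stated assumptions: the counts and fitted means are made up (the fitted means are not ML estimates), and the helper names are hypothetical. $R^2_{KL}$ is computed as $[l(\hat\mu; y) - l(\hat\mu_0; y)]/[l(y; y) - l(\hat\mu_0; y)]$, which follows from (4) and (5).

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood of the means mu, with 0*log(0) taken as 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        ylog = yi * math.log(mi) if yi > 0 else 0.0
        total += -mi + ylog - math.lgamma(yi + 1)
    return total

def r2_kl(l_fit, l_0, l_max):
    """Proportionate reduction in KL divergence, written via log-likelihoods."""
    return (l_fit - l_0) / (l_max - l_0)

def lr_index(l_fit, l_0):
    """Likelihood ratio index 1 - l_fit / l_0."""
    return 1.0 - l_fit / l_0

y = [0, 1, 2, 4, 7]                      # hypothetical counts
mu_hat = [0.5, 1.0, 2.0, 4.0, 6.5]       # hypothetical fitted means
ybar = sum(y) / len(y)

l_fit = poisson_loglik(y, mu_hat)
l_0 = poisson_loglik(y, [ybar] * len(y))  # constant-only model
l_max = poisson_loglik(y, y)              # saturated model, l(y; y) <= 0

print("R2_KL    =", r2_kl(l_fit, l_0, l_max))
print("LR index =", lr_index(l_fit, l_0))
```

Because $l(y; y)$ is strictly negative for these counts, the likelihood ratio index comes out below $R^2_{KL}$, in line with the remark that it is more pessimistic for discrete models other than the Bernoulli.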
An interesting aspect is that the expression for $K(y, \hat\mu)$ in (4) equals the
definition of the deviance, given in, for example, McCullagh and Nelder
(1989, p. 33). Therefore $R^2_{KL}$ can be interpreted as being based on deviance
residuals, defined as the signed square root of individual contributions to the
deviance. Deviance residuals have been found very useful for diagnostic checking
in generalized linear models, see, e.g., Pregibon (1981), Landwehr et al.
(1984), and Williams (1987); and $R^2_{KL}$ is related to the analysis of deviance the
same way as $R^2$ in the standard linear model is related to the analysis of
variance.
4. Pythagorean decomposition for $R^2_{KL}$
In the linear regression model, the usual R-squared can be interpreted not
only as the proportionate reduction in the total sum of squares due to inclusion
of regressors, but also as the fraction of the total sum of squares explained by the
regression model. This result rests on the decomposition of the total sum of
squares into explained sum of squares and residual sum of squares. Such
a decomposition of the sum of squares does not generally hold for exponential
family regression models, which is one reason for not applying the linear
regression model R-squared to other models.
For a widely used subclass of exponential family regression models that use
the canonical link, $R^2_{KL}$ has the desirable property of interpretation in terms of
explained KL divergence between the fitted model and the constant-only model.
Proposition 2. For the exponential family models that use the canonical link, i.e.,
$\theta = x'\beta$ in (1), $R^2_{KL}$ defined in (5) can be equivalently expressed as
$$R^2_{KL} = K(\hat\mu, \hat\mu_0)/K(y, \hat\mu_0),$$
where $K(\hat\mu, \hat\mu_0)$ is the estimated KL divergence defined in (2) between models with
fitted means $\hat\mu$ and $\hat\mu_0$, and so $R^2_{KL}$ measures the fraction of uncertainty explained by
the fitted model.
Proof. Let the vector $\hat\mu_1 = \mu(x_1'\hat\beta_1)$, and $\hat\mu_2 = \mu(x_2'\hat\beta_2)$, with $x_2$ nested in $x_1$. For
models that use the canonical link, the KL divergence exhibits the Pythagorean
property (see Hastie, 1987, pp. 19-20; Simon, 1973):
$$K(y, \hat\mu_2) = K(\hat\mu_1, \hat\mu_2) + K(y, \hat\mu_1).$$
Proposition 2 follows, using the particular decomposition
$K(y, \hat\mu_0) = K(\hat\mu, \hat\mu_0) + K(y, \hat\mu)$. []
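The decomposition can be checked numerically for a Poisson model with the canonical (log) link. The sketch below is illustrative only: the data are made up, the model has a constant and one regressor, and the fit uses a plain Newton-Raphson iteration written out by hand rather than any particular software routine.

```python
import math

def poisson_kl(a, b):
    """K(a, b) = 2 * sum{a*log(a/b) - (a - b)}, with 0*log(0) taken as 0."""
    total = 0.0
    for ai, bi in zip(a, b):
        alog = ai * math.log(ai / bi) if ai > 0 else 0.0
        total += alog - (ai - bi)
    return 2.0 * total

def fit_poisson_log_link(x, y, iters=50):
    """Newton-Raphson ML fit of mu_i = exp(b0 + b1*x_i); returns fitted means."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return [math.exp(b0 + b1 * xi) for xi in x]

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical regressor
y = [1, 0, 3, 2, 6, 9]                   # hypothetical counts
mu_hat = fit_poisson_log_link(x, y)
mu_0 = [sum(y) / len(y)] * len(y)        # constant-only ML fit

total = poisson_kl(y, mu_0)
explained = poisson_kl(mu_hat, mu_0)
residual = poisson_kl(y, mu_hat)
print(total, explained + residual)
print(1.0 - residual / total, explained / total)
```

Both printed pairs should agree up to rounding: the first pair is the Pythagorean identity, the second shows the two equivalent forms of $R^2_{KL}$. With a non-canonical link the identity would not be expected to hold.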
For models that do not use the canonical link, $R^2_{KL}$ still satisfies all the
properties in Proposition 1, in particular property 5. A decomposition is trivially
obtained by use of the so-called likelihood displacement defined as (Cook, 1986;
Vos, 1991)
$$LD(\hat\mu, \hat\mu_0) \equiv 2\{l(\hat\mu; y) - l(\hat\mu_0; y)\} = K(y, \hat\mu_0) - K(y, \hat\mu).$$
Then
$$R^2_{KL} = LD(\hat\mu, \hat\mu_0)/K(y, \hat\mu_0),$$
and can be interpreted as measuring the fraction of empirical uncertainty
explained by the model.¹
¹See Hauser (1978), who analyzed the likelihood ratio index for Bernoulli and multinomial models.
5. Examples
The formulae for $R^2_{KL}$ for a range of exponential family regression models are
given in Table 1. The models are defined in, for example, McCullagh and Nelder
(1989, p. 30). The column $R^2_{KL}$ is the measure defined in (5). The final column
gives the conditional mean, as a function of $\eta = x'\beta$, corresponding to the
canonical link, in which case Proposition 2 also holds and $R^2_{KL}$ can be simplified
in certain cases.
For the normal distribution, with $\sigma^2$ known (or using the same estimator for
the two models), $R^2_{KL}$ given in Table 1 equals the usual coefficient of determination
in the linear model. Proposition 2 applies to the linear regression model, but
not to nonlinear models under normality since these do not use the canonical
link.
For the linear model with nonspherical disturbances ($\mathrm{var}(y) = \sigma^2 V$, with
$V$ known), the KL divergence can be shown to be given by
$$K(y, \hat\mu) = (y - \hat\mu)' V^{-1} (y - \hat\mu),$$
and $R^2_{KL}$ is
$$R^2_{KL,GLS} = 1 - \frac{(y - \hat\mu)' V^{-1} (y - \hat\mu)}{(y - \iota\hat\mu_0)' V^{-1} (y - \iota\hat\mu_0)}, \qquad (6)$$
where $\iota$ is the vector of ones, and $\hat\mu_0 = (\iota' V^{-1} \iota)^{-1} \iota' V^{-1} y$. So in this case, $R^2_{KL}$ is
equal to the definition as given by Buse (1973).
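A minimal sketch of (6) follows, restricted for simplicity to a diagonal $V$ (heteroskedastic but uncorrelated disturbances), so that $V^{-1}$ reduces to a vector of weights. The restriction, the function name, and the data are assumptions made for illustration.

```python
def r2_kl_gls_diag(y, mu_hat, v_diag):
    """R^2_KL,GLS of (6) when V is diagonal with entries v_diag."""
    w = [1.0 / vi for vi in v_diag]                    # diagonal of V^{-1}
    # mu_0 = (iota' V^-1 iota)^-1 iota' V^-1 y is the weighted mean of y.
    mu_0 = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    num = sum(wi * (yi - mi) ** 2 for wi, yi, mi in zip(w, y, mu_hat))
    den = sum(wi * (yi - mu_0) ** 2 for wi, yi in zip(w, y))
    return 1.0 - num / den

y = [1.0, 2.5, 3.0, 5.5]                 # hypothetical observations
mu_hat = [1.2, 2.2, 3.4, 5.0]            # hypothetical fitted means
v_diag = [1.0, 2.0, 0.5, 4.0]            # hypothetical known variances (up to scale)

print(r2_kl_gls_diag(y, mu_hat, v_diag))
```

With all entries of `v_diag` equal, the weights cancel and the expression reduces to the ordinary $1 - \sum (y - \hat\mu)^2 / \sum (y - \bar y)^2$.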
For Bernoulli regression models, where $y$ takes only the values 0 or 1, many
$R^2$ measures have been proposed. See, for example, Maddala (1983, pp. 37-41)
and Windmeijer (1995), or the output from the econometrics package
SHAZAM. For these models, $l(y; y) = 0$, so that by property 4 in Proposition 1,
$R^2_{KL}$ given in Table 1 is equal to the likelihood ratio index proposed by
McFadden (1974), Efron (1978) for one-way ANOVA, Pregibon (1984) who
explicitly derives his measure based on deviances, and Christensen (1990).
Proposition 2 applies to the logit model, but not to the probit model.
An $R^2$ measure is rarely reported for the Poisson model. $R^2_{KL}$ given in Table
1 equals one of the $R^2$ measures proposed for this model by Cameron and
Windmeijer (1996). The standard Poisson regression model specifies
$\mu = \exp(x'\beta)$, which is the canonical link, so that Proposition 2 applies.
Table 1 also lists $R^2_{KL}$ for the binomial, geometric, exponential, gamma, and
inverse-Gaussian regression models. For these models we have been unable to
find specific $R^2$ measures in the literature.
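The analysis can be extended to a p-dimensional dependent variable with
density in the p-parameter exponential family. Necessary results for such
generalization are given in Simon (1973). Of particular interest is the multinomial
distribution, used for example in multi-choice regression models such as
multinomial and nested logit. In this case $l(y; y) = 0$, so that $R^2_{KL}$ equals the likelihood
ratio index analyzed by Hauser (1978).

Table 1
$R^2_{KL}$ for exponential family regression models
(Sums run over observations $i = 1, \ldots, n$; $\bar y$ is the sample mean of $y$.)

Normal
  KL divergence:  $\sum (y - \hat\mu)^2 / \sigma^2$
  $R^2_{KL}$:     $1 - \sum (y - \hat\mu)^2 / \sum (y - \bar y)^2$
  Canonical link (a):  $\mu = \eta$

Bernoulli
  KL divergence:  $-2 \sum \{y \log\hat\mu + (1 - y) \log(1 - \hat\mu)\}$
  $R^2_{KL}$:     $1 - \sum \{y \log\hat\mu + (1 - y) \log(1 - \hat\mu)\} / (n \{\bar y \log\bar y + (1 - \bar y) \log(1 - \bar y)\})$
  Canonical link (a):  $\mu = \exp(\eta)/(1 + \exp(\eta))$

Binomial ($m$) (b)
  KL divergence:  $2 \sum \{y \log(y/\hat\mu) + (m - y) \log((m - y)/(m - \hat\mu))\}$
  $R^2_{KL}$:     $1 - \sum \{y \log(y/\hat\mu) + (m - y) \log((m - y)/(m - \hat\mu))\} / \sum \{y \log(y/\bar y) + (m - y) \log((m - y)/(m - \bar y))\}$
  Canonical link (a):  $\mu = m \exp(\eta)/(1 + \exp(\eta))$

Poisson (b)
  KL divergence:  $2 \sum \{y \log(y/\hat\mu) - (y - \hat\mu)\}$
  $R^2_{KL}$:     $1 - \sum \{y \log(y/\hat\mu) - (y - \hat\mu)\} / \sum y \log(y/\bar y)$
  Canonical link (a):  $\mu = \exp(\eta)$

Geometric (b)
  KL divergence:  $2 \sum \{y \log(y/\hat\mu) - (y + 1) \log((y + 1)/(\hat\mu + 1))\}$
  $R^2_{KL}$:     $1 - \sum \{y \log(y/\hat\mu) - (y + 1) \log((y + 1)/(\hat\mu + 1))\} / \sum \{y \log(y/\bar y) - (y + 1) \log((y + 1)/(\bar y + 1))\}$
  Canonical link (a):  $\mu = \exp(\eta)/(1 - \exp(\eta))$

Exponential
  KL divergence:  $-2 \sum \{\log(y/\hat\mu) - (y - \hat\mu)/\hat\mu\}$
  $R^2_{KL}$:     $1 - \sum \{\log(y/\hat\mu) - (y - \hat\mu)/\hat\mu\} / \sum \log(y/\bar y)$
  Canonical link (a):  $\mu = \eta^{-1}$

Gamma (c)
  KL divergence:  $-2\nu \sum \{\log(y/\hat\mu) - (y - \hat\mu)/\hat\mu\}$
  $R^2_{KL}$:     $1 - \sum \{\log(y/\hat\mu) - (y - \hat\mu)/\hat\mu\} / \sum \log(y/\bar y)$
  Canonical link (a):  $\mu = \eta^{-1}$

Inverse Gaussian
  KL divergence:  $\sum (y - \hat\mu)^2 / (\hat\mu^2 y)$
  $R^2_{KL}$:     $1 - \sum \{(y - \hat\mu)^2 / (\hat\mu^2 y)\} / \sum \{(y - \bar y)^2 / (\bar y^2 y)\}$
  Canonical link (a):  $\mu = \eta^{-1/2}$

(a) $\eta = x'\beta$; (b) $y \log(y) = 0$ for $y = 0$; (c) $\nu$ is the scale parameter.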
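For readers who want to compute the less familiar entries, the sketch below writes per-observation KL divergences for the gamma (scale parameter set to one, i.e. the exponential), geometric, and inverse-Gaussian cases, each derived from $K(y, \mu) = 2\log[f_y(y)/f_\mu(y)]$, together with a generic $R^2_{KL}$ helper based on (5). Function names and data are hypothetical; the constant-only fit is taken to be the sample mean.

```python
import math

def gamma_unit(y, mu):
    """Gamma/exponential unit divergence: -2{log(y/mu) - (y - mu)/mu}."""
    return -2.0 * (math.log(y / mu) - (y - mu) / mu)

def geometric_unit(y, mu):
    """Geometric unit divergence, with y*log(y) taken as 0 at y = 0."""
    ylog = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (ylog - (y + 1) * math.log((y + 1) / (mu + 1)))

def inverse_gaussian_unit(y, mu):
    """Inverse-Gaussian unit divergence: (y - mu)^2 / (mu^2 * y)."""
    return (y - mu) ** 2 / (mu ** 2 * y)

def r2_kl(y, mu_hat, unit):
    """1 - K(y, mu_hat)/K(y, mu_0), with mu_0 equal to the sample mean."""
    ybar = sum(y) / len(y)
    k_fit = sum(unit(yi, mi) for yi, mi in zip(y, mu_hat))
    k_0 = sum(unit(yi, ybar) for yi in y)
    return 1.0 - k_fit / k_0

y = [1.0, 2.0, 4.0, 9.0]                 # hypothetical positive outcomes
mu_hat = [1.5, 2.0, 3.5, 8.0]            # hypothetical fitted means
for unit in (gamma_unit, geometric_unit, inverse_gaussian_unit):
    print(unit.__name__, r2_kl(y, mu_hat, unit))
```

Each unit divergence is zero at $\mu = y$ and positive otherwise, which is a quick sanity check on the signs.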
6. Discussion
Different interpretations of the coefficient of determination in the linear
regression model, $R^2_{OLS}$, lead to different $R^2$ measures for nonlinear models, each
with some, but not all, of the properties possessed by $R^2_{OLS}$. A number of the
possible general approaches are given in, for example, Magee (1990) and Veall
and Zimmermann (1992, 1995). The most easily interpretable measures are
based on residual sums of squares, $1 - \sum_i (y_i - \hat\mu_i)^2 / \sum_i (y_i - \bar y)^2$, or explained
sums of squares, $\sum_i (\hat\mu_i - \bar y)^2 / \sum_i (y_i - \bar y)^2$. But in nonlinear models these two
measures may fall outside the unit interval, decrease as regressors are added, and
differ from each other.²
A number of proposed measures, including $R^2_{KL}$, are related to LRT, the
likelihood ratio test statistic for the joint significance of the slope parameters. In
particular, a general measure proposed by Kent (1983), Maddala (1983), and
Magee (1990) is
$$R^2_{LRT} = 1 - \exp(-LRT/n). \qquad (7)$$
Kent argued that $LRT/n$ is an estimate of the expected Kullback-Leibler
information gain, the expectation being with respect to regressors $x$. Kent chose
this particular transformation of $LRT/n$ as it is guaranteed to lie within the unit
interval, and it equals the usual multiple correlation coefficient in the regression
model under normality. Maddala and Magee proposed $R^2_{LRT}$ on grounds that in
the linear model it equals $R^2_{OLS}$. All treat the variance $\sigma^2$ as an unknown
parameter that is estimated by $n^{-1} \sum_i (y_i - \hat\mu_i)^2$ in the fitted model and by
$n^{-1} \sum_i (y_i - \bar y)^2$ in the constant-only model.
Magee (1990) also proposed a measure based on a Wald test rather than
likelihood ratio test:
$$R^2_W = W/(n + W),$$
where $W$ is the Wald test statistic for joint significance of the slope parameters.
As Magee (1990) notes, $R^2_W$ does not necessarily increase when regressors are
added, and another drawback of the measure is the lack of invariance of $W$ to
the parameterization of the model. In the linear model $R^2_W$ equals $R^2_{OLS}$, where
the variance $\sigma^2$ is again treated as an unknown parameter.
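The claim that both measures reduce to $R^2_{OLS}$ in the linear model can be checked with a few lines of arithmetic. The sketch assumes the normal linear model with $\sigma^2$ estimated by ML in each model, in which case $LRT = n \log(RSS_0/RSS_1)$ and $W = n (RSS_0 - RSS_1)/RSS_1$, where $RSS_0$ and $RSS_1$ are the residual sums of squares of the constant-only and fitted models. These expressions and the numbers used are assumptions introduced for the illustration.

```python
import math

def r2_lrt(lrt, n):
    """Measure (7): 1 - exp(-LRT/n)."""
    return 1.0 - math.exp(-lrt / n)

def r2_wald(w, n):
    """Wald-based measure W / (n + W)."""
    return w / (n + w)

n = 50                                   # hypothetical sample size
rss0, rss1 = 120.0, 45.0                 # hypothetical residual sums of squares

lrt = n * math.log(rss0 / rss1)          # LR statistic, sigma^2 estimated by ML
wald = n * (rss0 - rss1) / rss1          # Wald statistic, sigma^2 from fitted model
r2_ols = 1.0 - rss1 / rss0

print(r2_lrt(lrt, n), r2_wald(wald, n), r2_ols)
```

All three printed values should coincide up to rounding.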
²In the special case that the nonlinear model is based on a linear latent variable model, Veall and
Zimmermann (1992, 1995) advocate estimating the latter measure for the underlying latent variable.
This approach cannot be applied to most of the models considered here.
This different treatment of the scale parameter needs to be emphasized. The
discussion of $R^2_{KL}$ was restricted to exponential family models where the scale
parameter is known. This includes Bernoulli, Poisson, geometric, and exponential,
which have no scale parameter, and binomial for which the scale parameter
(the number of trials $m$) is usually known. For models with unknown scale
parameter, $R^2_{KL}$ is easily extended if the KL divergence is multiplicative in the
scale parameter. Then the scale parameter cancels out from numerator and
denominator in $R^2_{KL}$, leaving the same formulae for $R^2_{KL}$ as Table 1 with no need
to estimate the scale parameter. This is the case for the normal ($\sigma^2$ unknown)
and gamma ($\nu$ unknown) distributions. By contrast, the motivation of $R^2_{LRT}$ assumes
estimation of any scale parameter, since if $\sigma^2$ is known in the linear
regression model under normality, (7) yields
$$R^2_{LRT} = 1 - \exp\Big(\sum_i \{(y_i - \hat\mu_i)^2 - (y_i - \bar y)^2\} / \{n\sigma^2\}\Big)$$
rather than $R^2_{OLS}$.
For exponential family models with known (or no) scale parameter,
$$R^2_{LRT} = 1 - \exp(-R^2_{KL}\, K(y, \hat\mu_0)/n),$$
and therefore $R^2_{LRT}$ takes maximum value of $1 - \exp(-K(y, \hat\mu_0)/n)$ when
$R^2_{KL} = 1$. So a measure with upper bound of one is
$$R^2_{LRTu} = \frac{1 - \exp(-LRT/n)}{1 - \exp(-K(y, \hat\mu_0)/n)}.$$
This equals the $R^2$ measure for the multinomial logit model given by Cragg and
Uhler (1970) and discussed in Maddala (1983, pp. 39-40). Note that $R^2_{LRTu} > R^2_{KL}$
for $0 < R^2_{KL} < 1$. There are clearly many ways to generate an $R^2$ measure based
on the likelihood ratio test that lies between 0 and 1 and increases as regressors
are added. $R^2_{KL}$ has the additional advantage of interpretation in terms of
proportionate reduction in recoverable information.
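The mapping between these measures is easy to tabulate. The sketch below takes $R^2_{KL}$ and the per-observation null divergence $K(y, \hat\mu_0)/n$ as inputs and returns $R^2_{LRT}$ and its rescaled version; the function names and the value chosen for $K(y, \hat\mu_0)/n$ are hypothetical.

```python
import math

def r2_lrt_from_kl(r2_kl, k0_over_n):
    """R^2_LRT = 1 - exp(-R^2_KL * K(y, mu_0)/n), known or no scale parameter."""
    return 1.0 - math.exp(-r2_kl * k0_over_n)

def r2_lrtu_from_kl(r2_kl, k0_over_n):
    """R^2_LRT divided by its maximum attainable value 1 - exp(-K(y, mu_0)/n)."""
    return r2_lrt_from_kl(r2_kl, k0_over_n) / (1.0 - math.exp(-k0_over_n))

k0_over_n = 1.5                          # hypothetical value of K(y, mu_0)/n
for r2 in (0.1, 0.5, 0.9):
    print(r2, r2_lrt_from_kl(r2, k0_over_n), r2_lrtu_from_kl(r2, k0_over_n))
```

Since the rescaled measure is a concave function of $R^2_{KL}$ that passes through 0 and 1, it lies above $R^2_{KL}$ strictly inside the unit interval.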
An interesting question is generalization of $R^2_{KL}$ to any model specification
estimated by maximum likelihood. By (4),
$$R^2_{KL} = 1 - \frac{l_{max} - l_{fit}}{l_{max} - l_0} = \frac{l_{fit} - l_0}{l_{max} - l_0}, \qquad (8)$$
where $l_{fit}$, $l_0$, and $l_{max}$ denote, respectively, the log-likelihood in the fitted model,
the log-likelihood in the intercept-only model, and the maximum log-likelihood
achievable. Thus $R^2_{KL}$ equals the fraction of the maximum potential likelihood
gain (starting with a constant-only model) achieved by the fitted model. This
definition works well in cases such as exponential family models with known
scale parameter where $l_{max}$ is well-defined.³ But in other cases, such as the
³See also Merkle and Zimmermann (1992), who proposed use of $R^2_{KL}$ as defined in (8) for the Poisson
model.
normal with $\sigma^2$ unknown, $l_{max}$ is not defined.⁴ Even where $l_{max}$ is defined, it
should be noted that it does not necessarily equal the log-likelihood evaluated at
$\hat\mu = y$.⁵
7. Application
To illustrate the behaviour of $R^2_{KL}$ and other $R^2$ measures we perform an
analysis of the cost of claims for damage to an owner's car for privately owned
vehicles with comprehensive cover. The data used is the same as in Baxter et al.
(1980) (see also McCullagh and Nelder, 1989, p. 298). The data set consists of cell
average cost of claims for each of 123 cells, where the cells are determined by
eight categories of policy-holder's age, four categories of vehicle age, and four
categories of car group (cells with no claims are excluded). A gamma distribution
is assumed, with log-likelihood
$$\sum_i \{\nu_i(-y_i/\mu_i - \log\mu_i + \log y_i + \log\nu_i) - \log\Gamma(\nu_i)\};$$
conditional mean $\mu_i = (x_i'\beta)^{-1}$, corresponding to the canonical link function for
gamma; and scale parameter $\nu_i = \nu \cdot w_i$, where $\nu$ is a scalar and the weight
$w_i$ equals the number of claims within each cell $i$. In constructing $R^2_{KL}$ the
parameter $\nu$ is assumed known and factors out. By contrast, in computing
$R^2_{LRT}$ and $R^2_W$, $\nu$ is treated as unknown and needs to be separately estimated, and
LRT is computed as $2\{l(\hat\mu, \hat\nu; y) - l(\hat\mu_0, \hat\nu_0; y)\}$. For comparative purposes we
additionally calculate $R^2_W$, $R^2_{LRT}$, and $R^2_{LRTu}$ for $\nu$ known and equal to 1, which
corresponds to the exponential. The estimation results for the mean $\mu$ are
independent of the value of $\nu$.
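That the scalar $\nu$ factors out of $R^2_{KL}$ can be illustrated directly from the weighted gamma log-likelihood above. The sketch uses made-up cell averages, claim counts, and fitted means (not the Baxter et al. data), and takes the constant-only fit to be the claim-weighted mean of $y$, which maximizes this log-likelihood over a common mean.

```python
import math

def gamma_loglik(y, mu, w, nu):
    """Weighted gamma log-likelihood with nu_i = nu * w_i, as in the text."""
    total = 0.0
    for yi, mi, wi in zip(y, mu, w):
        nui = nu * wi
        total += nui * (-yi / mi - math.log(mi) + math.log(yi) + math.log(nui))
        total -= math.lgamma(nui)
    return total

def deviance(y, mu, w, nu):
    """K(y, mu) = 2 * [l(y; y) - l(mu; y)] for a given scalar nu."""
    return 2.0 * (gamma_loglik(y, y, w, nu) - gamma_loglik(y, mu, w, nu))

def r2_kl(y, mu_hat, w, nu):
    """R^2_KL of (5); mu_0 is the weighted mean of y."""
    mu_0 = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return 1.0 - deviance(y, mu_hat, w, nu) / deviance(y, [mu_0] * len(y), w, nu)

y = [250.0, 310.0, 180.0, 420.0]         # hypothetical cell average claims
w = [12, 7, 20, 5]                       # hypothetical numbers of claims per cell
mu_hat = [260.0, 300.0, 190.0, 400.0]    # hypothetical fitted means

for nu in (1.0, 3.7):
    print(nu, deviance(y, mu_hat, w, nu), r2_kl(y, mu_hat, w, nu))
```

The deviance scales in proportion to $\nu$ while $R^2_{KL}$ is unchanged, so no estimate of $\nu$ is needed to compute it.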
The results as presented in Table 2 are given for three different models: PA has
seven dummies for categories of policy-holder's age; PA + CG additionally
includes dummies for three categories of car group; PA + CG + VA additionally
includes dummies for three categories of vehicle age. The values of the three
measures are very similar for the gamma model with $\nu$ estimated, but they differ
quite substantially when $\nu$ is set equal to 1 (exponential), in which case $R^2_W$ (for
the first two models) and $R^2_{LRT}$ (for all models) are much higher than $R^2_{KL}$. For
the measure based on the Wald statistic these higher values occur due to the
⁴Assume that each $y_i$ is drawn from $N(\mu_i, \sigma^2)$, where $\mu_i = y_i$ and $\sigma^2 \to 0$. Then the density of each $y_i$,
and hence the log-likelihood for the sample, becomes infinite. For the negative binomial model,
where a similar problem arises, Cameron and Windmeijer (1996) propose setting the scale parameter
to its estimate in the fitted model.
⁵For example, consider the log-normal, $\log y_i \sim N(\theta_i, 1)$, in which case $\mu_i = \exp(\theta_i + 0.5)$. The
log-density of $y_i$ is maximized w.r.t. $\theta_i$ at $\theta_i = \log y_i$, and hence is maximized w.r.t. $\mu_i$ at
$\mu_i = y_i \exp(0.5) \ne y_i$.
Table 2
Results: $R^2$'s for car insurance data

                   ν estimated                  ν = 1
VAR                R2_KL   R2_W    R2_LRT       R2_W    R2_LRT  R2_LRTu
PA                 0.127   0.138   0.134        0.410   0.487   0.490
PA + CG            0.478   0.514   0.496        0.737   0.920   0.925
PA + CG + VA       0.808   0.800   0.820        0.799   0.986   0.991

PA: policy holder's age; CG: car group; VA: vehicle age.
$\hat\nu_0 = 0.203$; $\hat\nu = 0.231$ for PA; $\hat\nu = 0.378$ for PA + CG; $\hat\nu = 1.004$ for PA + CG + VA.
fact that the variances in the exponential model are smaller than the estimated
variances in the gamma model ($\hat\nu < 1$) for the first two models. For the measure
based on the likelihood ratio statistic the reason is the smaller value of $l(\hat\mu_0, 1; y)$
as compared to $l(\hat\mu_0, \hat\nu_0; y)$.⁶ The differences between $R^2_{LRT}$ and $R^2_{LRTu}$ are small
due to the fact that the term $1 - \exp(-K(y, \hat\mu_0)/n)$ is close to 1.
The $R^2$ measures clearly convey the message that the full model provides
a very good fit for this data. While the $R^2$'s may appear high to those familiar
with cross-section data, the full model does actually fit the data well, as even
standard weighted nonlinear least squares, i.e., minimize $\sum_i w_i (y_i - 1/(x_i'\beta))^2$,
gives an $R^2_{KL,GLS}$, as defined in (6), equal to 0.79 in the model with all categories
included.
8. Conclusions
For exponential family regression models, the Kullback-Leibler divergence
can be used to construct an $R^2$ measure of goodness of fit, denoted $R^2_{KL}$, that
measures the proportionate reduction in uncertainty due to the inclusion of
regressors, lies between 0 and 1, and is nondecreasing as regressors are added.
$R^2_{KL}$ corresponds to the usual coefficient of determination in the linear regression
under normality. In Bernoulli models, such as probit and logit, $R^2_{KL}$ coincides
with the likelihood ratio index, supporting use of this index rather than the
many other competing $R^2$ measures. $R^2_{KL}$ can also be used for other regression
models in the exponential family, such as Poisson, geometric, binomial, exponential,
and gamma, for which $R^2$ measures do not generally appear to be
available. For models with canonical link function, $R^2_{KL}$ can additionally be
interpreted as the fraction of uncertainty explained by the fitted model.
⁶Equivalently, $R^2_{LRT}$ takes on very high values when LRT is computed as $2\{l(\hat\mu, \hat\nu; y) - l(\hat\mu_0, \hat\nu; y)\}$.
References
Baxter, L.A., S.M. Coutts, and G.A.F. Ross, 1980, Applications of linear models in motor insurance,
Proceedings of the 21st International Congress of Actuaries, Zurich, 11-29.
Buse, A., 1973, Goodness of fit in generalized least squares estimation, The American Statistician 27,
106-108.
Cameron, A.C. and F.A.G. Windmeijer, 1996, R-squared measures for count data regression models
with applications to health care utilization, Journal of Business and Economic Statistics 14,
209-220.
Christensen, R., 1990, Log-linear models (Springer-Verlag, New York, NY).
Cragg, J. and R. Uhler, 1970, The demand for automobiles, Canadian Journal of Economics 3,
386-406.
Cook, R.D., 1986, Assessment of local influence, Journal of the Royal Statistical Society B 48,
133-169.
Efron, B., 1978, Regression and ANOVA with zero-one data: Measures of residual variation, Journal
of the American Statistical Association 73, 113-121.
Gourieroux, C., A. Monfort, and A. Trognon, 1984, Pseudo maximum likelihood methods: Theory,
Econometrica 52, 681-700.
Hastie, T., 1987, A closer look at the deviance, The American Statistician 41, 16-20.
Hauser, J.A., 1978, Testing the accuracy, usefulness, and significance of probabilistic choice models:
An information-theoretic approach, Operations Research 26, 406-421.
Kent, J.T., 1983, Information gain and a general measure of correlation, Biometrika 70, 163-173.
Kullback, S., 1959, Information theory and statistics (Wiley, New York, NY).
Laitila, T., 1993, A pseudo-$R^2$ measure for limited and qualitative dependent variable models,
Journal of Econometrics 56, 341-356.
Landwehr, J.M., D. Pregibon, and A.C. Shoemaker, 1984, Graphical methods for assessing logistic
regression models, Journal of the American Statistical Association 79, 61-83.
Maasoumi, E., 1993, A compendium to information theory in economics and econometrics, Econo-
metric Reviews, 137-181.
Maddala, G.S., 1983, Limited dependent and qualitative variables in econometrics (Cambridge
University Press, Cambridge).
Magee, L., 1990, $R^2$ measures based on Wald and likelihood ratio joint significance tests, The
American Statistician 44, 250-253.
McCullagh, P. and J.A. Nelder, 1989, Generalized linear models, 2nd ed. (Chapman and Hall,
London).
McFadden, D., 1974, Conditional logit analysis of qualitative choice behaviour, in: P. Zarembka,
ed., Frontiers in econometrics (Academic Press, New York, NY) 105-142.
McKelvey, R.D. and W. Zavoina, 1976, A statistical model for the analysis of ordinal level
dependent variables, Journal of Mathematical Sociology 4, 103-120.
Merkle, L. and K.F. Zimmermann, 1992, The demographics of labor turnover: A comparison of
ordinal probit and censored count data models, Recherches Economiques de Louvain 58, 283-307.
Nelder, J.A. and R.W.M. Wedderburn, 1972, Generalized linear models, Journal of the Royal
Statistical Society A 135, 370-384.
Pregibon, D., 1981, Logistic regression diagnostics, Annals of Statistics 9, 705-724.
Pregibon, D., 1984, Data analytic methods for matched case-control studies, Biometrics 40, 639-651.
Simon, G., 1973, Additivity of information in exponential family probability laws, Journal of the
American Statistical Association 68, 478-482.
Ullah, A., 1993, Entropy, divergence and distance measures with econometric applications (Department
of Economics, University of California, Riverside, CA).
Veall, M.R. and K.F. Zimmermann, 1992, Pseudo-$R^2$'s in the ordinal probit model, Journal of
Mathematical Sociology 4, 103-120.
Veall, M.R. and K.F. Zimmermann, 1995, Pseudo-$R^2$ measures for some common limited dependent
variable models, Journal of Economic Surveys, forthcoming.
Vos, P.W., 1991, A geometric approach to detecting influential cases, Annals of Statistics 19,
1570-1581.
Vuong, Q.H., 1989, Likelihood ratio tests for model selection and non-nested hypothesis, Econo-
metrica 57, 307-333.
White, H., 1993, Estimation, inference and specification analysis (Cambridge University Press,
Cambridge).
Williams, D.A., 1987, Generalized linear model diagnostics using the deviance and single-case
deletions, Applied Statistics 36, 181-191.
Windmeijer, F.A.G., 1995, Goodness-of-fit measures in binary choice models, Econometric Reviews
14, 101-116.
