Colin Cameron 1997
ELSEVIER Journal of Econometrics 77 (1997) 329-342
Abstract
For regression models other than the linear model, R-squared type goodness-of-fit
summary statistics have been constructed for particular models using a variety of
methods. We propose an R-squared measure of goodness of fit for the class of exponen-
tial family regression models, which includes logit, probit, Poisson, geometric, gamma,
and exponential. This R-squared is defined as the proportionate reduction in uncertainty,
measured by Kullback-Leibler divergence, due to the inclusion of regressors. Under
further conditions concerning the conditional mean function it can also be interpreted as
the fraction of uncertainty explained by the fitted model.
1. Introduction
For the standard linear regression model the familiar coefficient of determination,
R-squared ($R^2$), is a widely used goodness-of-fit measure whose usefulness
and limitations are more or less known to the applied researcher. Application of
this measure to nonlinear models generally leads to a measure that can lie
outside the [0, 1] interval and decrease as regressors are added. Alternative
$R^2$-type goodness-of-fit summary statistics have been constructed for particular
nonlinear models using a variety of methods. For binary choice models, such as
logit and probit, there is an abundance of measures; see Maddala (1983) and
Windmeijer (1995). For censored latent models, such as the binary choice and
tobit models, it is possible to avoid nonlinearity by obtaining an approximation
of the usual $R^2$ for the linear latent variable model; see McKelvey and Zavoina
(1976), Laitila (1993), and Veall and Zimmermann (1992, 1995). For other
nonlinear regression models $R^2$ measures are very rarely used.
*Corresponding author.
The authors are grateful to Richard Blundell, Shiferaw Gurmu, and two anonymous referees for
their helpful comments.
Desirable properties of an R-squared include interpretation in terms of the
information content of the data, and sufficient generality to cover a reasonably
broad class of models. We propose an R-squared measure based on the
Kullback-Leibler divergence for regression models in the exponential family.
This measure can be applied to a range of commonly-used nonlinear regression
models: the normal for continuous dependent variable $y \in (-\infty, \infty)$; exponential,
gamma, and inverse-Gaussian for continuous $y \in (0, \infty)$; logit, probit, and
other Bernoulli regression models for discrete $y = 0, 1$; binomial ($m$ trials) for
discrete $y = 0, 1, \ldots, m$; Poisson and geometric for discrete $y = 0, 1, 2, \ldots$.
The exponential family regression model is described in Section 2. In Section
3, the $R^2$ measure based on the Kullback-Leibler divergence is presented. This
measures the proportionate reduction in uncertainty due to the inclusion of
regressors. Interpretation of the measure in terms of the fraction of uncertainty
explained by the fitted model is given in Section 4. Examples are presented in
Section 5. Extensions and other goodness-of-fit statistics are discussed in
Section 6. Section 7 contains an application to a gamma model for accident
claims data. Section 8 concludes.
Following Hastie (1987), assume that the dependent variable $Y$ has distribution
in the one-parameter exponential family with density
$$f_\theta(y) = \exp[\theta y - b(\theta)]\, h(y), \tag{1}$$
where $\theta$ is the natural or canonical parameter, $b(\theta)$ is the normalizing function,
and $h(\cdot)$ is a known function. Different $b(\theta)$ correspond to different distributions.
The mean of $Y$, denoted $\mu$, can be shown to equal the derivative $b'(\theta)$, and is
monotone in $\theta$. Therefore, the density can equivalently be indexed by $\mu$, and
expressed as
$$f_\mu(y) = \exp[c(\mu) y - d(\mu)]\, h(y).$$
A.C. Cameron, F.A.G. Windmeijer / Journal of Econometrics 77 (1997) 329-342
General statistical theory for regression models based on the exponential
family is given in Nelder and Wedderburn (1972), Gourieroux et al. (1984), and
White (1993). The standard reference for applications is McCullagh and Nelder
(1989). Regressors are introduced by specifying $\mu$ to be a function of the linear
predictor $\eta = x'\beta$, where $x$ is a vector of regressors and $\beta$ is an unknown
parameter vector. Models obtained by various choices of $b(\theta)$ and functions of
$\eta$ are called generalized linear models. More specialized results are obtained by
choice of the canonical link function, for which $\eta = \theta$, i.e., $\theta$ in (1) is set equal to
$x'\beta$.
Binary choice models are an example of exponential family regression models.
Then $Y$ is Bernoulli distributed with parameter $\mu$ and density
$f_\mu(y) = \mu^y (1 - \mu)^{1-y}$, $y \in \{0, 1\}$. This can be expressed as (1) with
$\theta = \log(\mu/(1 - \mu))$ and $b(\theta) = \log(1 + \exp(\theta))$. The logit regression model specifies
$\mu = \exp(x'\beta)/(1 + \exp(x'\beta))$, while the probit regression model specifies
$\mu = \Phi(x'\beta)$, where $\Phi$ is the standard normal cumulative distribution function.
The logit model corresponds to use of the canonical link function.
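As a quick numerical check of the Bernoulli case, the following Python sketch evaluates the density both directly and in the exponential-family form (1) with $h(y) = 1$, and confirms that the logit mean function inverts the canonical parameter. The function names and the value $\mu = 0.3$ are illustrative choices, not taken from the paper.

```python
import math

def bernoulli_density(y, mu):
    """Direct Bernoulli density mu^y (1 - mu)^(1 - y) for y in {0, 1}."""
    return mu ** y * (1.0 - mu) ** (1 - y)

def canonical_theta(mu):
    """Canonical parameter theta = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def b(theta):
    """Normalizing function b(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

def expfam_density(y, theta):
    """Form (1) with h(y) = 1: exp[theta * y - b(theta)]."""
    return math.exp(theta * y - b(theta))

def logit_mean(eta):
    """Logit mean function mu = exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))

mu = 0.3                      # illustrative value
theta = canonical_theta(mu)
for y in (0, 1):
    # The two columns should agree up to floating-point rounding.
    print(y, bernoulli_density(y, mu), expfam_density(y, theta))
print(logit_mean(theta))      # recovers mu when eta equals theta
```

With the canonical link the linear predictor is the canonical parameter itself, which is why `logit_mean(theta)` returns the original mean.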
The parameter vector $\beta$ is estimated by the maximum likelihood (ML)
estimator $\hat\beta$, based on the independent sample $\{(y_i, x_i), i = 1, \ldots, n\}$, with
$f_{\mu_i}(y_i) = f_{\mu_j}(y_j)$ for $\mu_i = \mu_j$. The estimated mean for an observation with regressor
$x$ is denoted $\hat\mu = \mu(x'\hat\beta)$. Throughout we assume that the model includes a constant
term. The estimated mean from ML estimation of the constant-only model
is denoted $\hat\mu_0$.
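The Kullback-Leibler (KL) divergence between the densities $f_{\mu_1}(y)$ and $f_{\mu_2}(y)$ is
defined as
$$K(\mu_1, \mu_2) \equiv 2 E_{\mu_1} \log[f_{\mu_1}(y)/f_{\mu_2}(y)], \tag{2}$$
where a factor two is added for convenience, and $E_{\mu_1}$ denotes expectation taken
with respect to the density $f_{\mu_1}(y)$. $K(\mu_1, \mu_2)$ is the information of $\mu_1$ with respect
to $\mu_2$ and is a measure of how close $\mu_1$ and $\mu_2$ are. The term divergence rather
than distance is used because it does not in general satisfy the symmetry and
triangular properties of a distance measure. However, $K(\mu_1, \mu_2) \geq 0$ with equality
iff $f_{\mu_1} = f_{\mu_2}$.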
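To make the lack of symmetry concrete, the sketch below works with the Poisson density, for which twice the expected log density ratio has the closed form $2[\mu_1 \log(\mu_1/\mu_2) - (\mu_1 - \mu_2)]$. The choice of the Poisson case, the function names, and the truncation point `y_max` are assumptions made for illustration; the brute-force sum is included only as a cross-check of the closed form.

```python
import math

def poisson_logpmf(y, mu):
    """Log of the Poisson probability mass function at integer y >= 0."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def kl_poisson_closed_form(mu1, mu2):
    """2 * E_{mu1} log[f_{mu1}(y) / f_{mu2}(y)] evaluated analytically."""
    return 2.0 * (mu1 * math.log(mu1 / mu2) - (mu1 - mu2))

def kl_poisson_by_summation(mu1, mu2, y_max=200):
    """Same quantity, summing the definition over y = 0, ..., y_max."""
    total = 0.0
    for y in range(y_max + 1):
        log_f1 = poisson_logpmf(y, mu1)
        total += math.exp(log_f1) * (log_f1 - poisson_logpmf(y, mu2))
    return 2.0 * total

print(kl_poisson_closed_form(1.0, 2.0), kl_poisson_by_summation(1.0, 2.0))
print(kl_poisson_closed_form(2.0, 1.0))  # differs from K(1, 2): no symmetry
print(kl_poisson_closed_form(1.5, 1.5))  # zero when the two means coincide
```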
In addition to $f_{\mu_1}(y)$ and $f_{\mu_2}(y)$ we also consider the density $f_y(y)$, for which the
mean is set equal to the realized $y$. Then the KL divergence $K(y, \mu)$ can be
defined as
$$K(y, \mu) \equiv 2 E_y \log[f_y(y)/f_\mu(y)]. \tag{3}$$
The random variable $K(y, \mu)$ is a measure of the deviation of $y$ from the mean $\mu$.
For the exponential family, Hastie (1987) and Vos (1991) show that the expectation
in (3) drops out and
$$K(y, \mu) = 2 \log[f_y(y)/f_\mu(y)]. \tag{4}$$
Let $\hat\mu_0$ denote the $n$-vector with entries $\hat\mu_0$, the fitted mean from ML estimation
of the constant-only model. We interpret $K(y, \hat\mu_0)$ as the estimate of the
information in the sample data on $y$ potentially recoverable by inclusion of
regressors. It is the difference between the information in the sample data on $y$,
and the estimated information using $\hat\mu_0$, the best point estimate when data on
regressors are not utilized, where information is measured by taking expectation
with respect to the observed value $y$. By choosing $\hat\mu_0$ to be the MLE, $K(y, \hat\mu_0)$
is minimized. The R-squared we propose is the proportionate reduction in
this potentially recoverable information achieved by the fitted regression
model:
$$R^2_{KL} = 1 - \frac{K(y, \hat\mu)}{K(y, \hat\mu_0)}. \tag{5}$$
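A minimal Python sketch of definition (5) follows, using the Poisson case, where $K(y, \mu) = 2\sum[y \log(y/\mu) - (y - \mu)]$. It assumes that the constant-only fitted mean is the sample mean of $y$ (which holds for ML estimation of a constant-only Poisson model), and the data and fitted means are made-up values for illustration, not results from the paper.

```python
import math

def poisson_deviance(y, mu):
    """K(y, mu) = 2 * sum[y log(y/mu) - (y - mu)], with y log y = 0 at y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

def r2_kl(y, mu_hat, deviance=poisson_deviance):
    """Proportionate reduction in KL divergence relative to the constant-only fit."""
    n = len(y)
    y_bar = sum(y) / n  # assumed constant-only ML fitted mean
    return 1.0 - deviance(y, mu_hat) / deviance(y, [y_bar] * n)

y = [0, 1, 2, 4, 3]                  # hypothetical counts
mu_hat = [0.5, 1.0, 2.0, 3.5, 3.0]   # hypothetical fitted means
print(r2_kl(y, mu_hat))
```

Passing a different `deviance` function adapts the same ratio to another member of the family, provided the constant-only fitted mean is still the sample mean.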
This measure can be used for fitted means obtained by any estimation method.
In the following proposition we restrict attention to ML estimation (which
minimizes $K(y, \hat\mu)$):
3. $R^2_{KL}$ is a scalar multiple of the likelihood ratio test for the joint significance of
the explanatory variables.
4. $R^2_{KL}$ equals the likelihood ratio index $1 - l(\hat\mu; y)/l(\hat\mu_0; y)$ if and only if $l(y; y) = 0$.
Proof
1. The MLE minimizes $K(y, \hat\mu)$, which will therefore not increase as regressors
are added.
In the linear regression model, the usual R-squared can be interpreted not
only as the proportionate reduction in the total sum of squares due to inclusion
of regressors, but also as the fraction of the total sum of squares explained by the
regression model. This result rests on the decomposition of the total sum of
squares into explained sum of squares and residual sum of squares. Such
a decomposition of the sum of squares does not generally hold for exponential
family regression models, which is one reason for not applying the linear
regression model R-squared to other models.
For a widely used subclass of exponential family regression models that use
the canonical link, $R^2_{KL}$ has the desirable property of interpretation in terms of
explained KL divergence between the fitted model and the constant-only model.
Proposition 2. For the exponential family models that use the canonical link, i.e.,
$\theta = x'\beta$ in (1), $R^2_{KL}$ defined in (5) can be equivalently expressed as
$$R^2_{KL} = K(\hat\mu, \hat\mu_0)/K(y, \hat\mu_0),$$
where $K(\hat\mu, \hat\mu_0)$ is the estimated KL divergence defined in (2) between models with
fitted means $\hat\mu$ and $\hat\mu_0$, and so $R^2_{KL}$ measures the fraction of uncertainty explained by
the fitted model.
Proof. Let the vector $\hat\mu_1 = \mu(x_1'\hat\beta_1)$, and $\hat\mu_2 = \mu(x_2'\hat\beta_2)$, with $x_2$ nested in $x_1$. For
models that use the canonical link, the KL divergence exhibits the Pythagorean
property (see Hastie, 1987, pp. 19-20; Simon, 1973):
$$K(y, \hat\mu_2) = K(\hat\mu_1, \hat\mu_2) + K(y, \hat\mu_1).$$
Proposition 2 follows, using the particular decomposition
$K(y, \hat\mu_0) = K(\hat\mu, \hat\mu_0) + K(y, \hat\mu)$. □
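The decomposition used in the proof can be checked numerically. The sketch below fits a Poisson regression with the canonical log link by Newton-Raphson (one regressor plus a constant) and compares $K(y, \hat\mu_0)$ with $K(\hat\mu, \hat\mu_0) + K(y, \hat\mu)$. The data, the function names, and the fixed iteration count are assumptions for illustration only; for the Poisson density $K(\mu_1, \mu_2)$ and $K(y, \mu)$ share the same functional form, so one helper serves both.

```python
import math

def fit_poisson_loglink(x, y, iterations=50):
    """Newton-Raphson ML fit of mu_i = exp(b0 + b1 * x_i); returns fitted means."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2), inverted by hand.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return [math.exp(b0 + b1 * xi) for xi in x]

def kl_poisson(a, b):
    """2 * sum[a log(a/b) - (a - b)]: K(y, mu) for a = y, K(mu1, mu2) for a = mu1."""
    total = 0.0
    for ai, bi in zip(a, b):
        log_term = ai * math.log(ai / bi) if ai > 0 else 0.0
        total += log_term - (ai - bi)
    return 2.0 * total

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # made-up regressor
y = [1, 0, 2, 3, 5, 9]               # made-up counts
mu_hat = fit_poisson_loglink(x, y)
mu_0 = [sum(y) / len(y)] * len(y)    # constant-only fitted mean

total = kl_poisson(y, mu_0)
explained = kl_poisson(mu_hat, mu_0)
residual = kl_poisson(y, mu_hat)
print(total, explained + residual)                  # the two should agree
print(explained / total, 1.0 - residual / total)    # two expressions for R2_KL
```

The agreement relies on the ML first-order conditions under the canonical link; with fitted means from another link or estimator the two printed pairs need not match.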
For models that do not use the canonical link, $R^2_{KL}$ still satisfies all the
properties in Proposition 1, in particular property 5. A decomposition is trivially
obtained by use of the so-called likelihood displacement defined as (Cook, 1986;
Vos, 1991)
$$LD(\hat\mu, \hat\mu_0) \equiv 2\{l(\hat\mu; y) - l(\hat\mu_0; y)\} = K(y, \hat\mu_0) - K(y, \hat\mu).$$
Then
$$R^2_{KL} = LD(\hat\mu, \hat\mu_0)/K(y, \hat\mu_0),$$
and can be interpreted as measuring the fraction of empirical uncertainty
explained by the model.1
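The second equality in the likelihood displacement can be traced in one line, assuming (as for the exponential family densities considered here) that $K(y, \mu)$ equals twice the gap between the saturated log-likelihood $l(y; y)$ and the log-likelihood $l(\mu; y)$:

```latex
\begin{aligned}
K(y,\hat\mu_0) - K(y,\hat\mu)
  &= 2\{l(y;y) - l(\hat\mu_0;y)\} - 2\{l(y;y) - l(\hat\mu;y)\} \\
  &= 2\{l(\hat\mu;y) - l(\hat\mu_0;y)\} = LD(\hat\mu,\hat\mu_0).
\end{aligned}
```

The saturated term cancels, so this decomposition does not require the canonical link.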
1 See Hauser (1978), who analyzed the likelihood ratio index for Bernoulli and multinomial models.
5. Examples
The formulae for $R^2_{KL}$ for a range of exponential family regression models are
given in Table 1. The models are defined in, for example, McCullagh and Nelder
(1989, p. 30). The column $R^2_{KL}$ is the measure defined in (5). The final column
gives the conditional mean, as a function of $\eta = x'\beta$ corresponding to the
canonical link, in which case Proposition 2 also holds and $R^2_{KL}$ can be simplified
in certain cases.
For the normal distribution, with $\sigma^2$ known (or using the same estimator for
the two models), $R^2_{KL}$ given in Table 1 equals the usual coefficient of determination
in the linear model. Proposition 2 applies to the linear regression model, but
not to nonlinear models under normality since these do not use the canonical
link.
For the linear model with nonspherical disturbances ($\mathrm{var}(y) = \sigma^2 V$, with
$V$ known), the KL divergence can be shown to be given by
$$K(y, \mu) = (y - \mu)' V^{-1} (y - \mu)/\sigma^2,$$
and $R^2_{KL}$ is
$$R^2_{KL,GLS} = 1 - \frac{(y - \hat\mu)' V^{-1} (y - \hat\mu)}{(y - \iota\hat\mu_0)' V^{-1} (y - \iota\hat\mu_0)}, \tag{6}$$
where $\iota$ is the vector of ones, and $\hat\mu_0 = (\iota' V^{-1} \iota)^{-1} \iota' V^{-1} y$. So in this case, $R^2_{KL}$ is
equal to the definition as given by Buse (1973).
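The measure in (6) can be sketched in a few lines of Python for the special case of a diagonal $V$, an assumption made here only to avoid matrix inversion. In that case $V^{-1}$ reduces to a vector of weights and the constant-only fit is a weighted mean. The function name and the numbers are hypothetical.

```python
def r2_kl_gls_diagonal(y, mu_hat, v_diag):
    """R2 of (6) when V is diagonal with entries v_diag (an illustrative assumption)."""
    weights = [1.0 / v for v in v_diag]
    # Constant-only GLS fit: weighted mean of y with weights 1 / v_i.
    mu_0 = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    resid = sum(w * (yi - mi) ** 2 for w, yi, mi in zip(weights, y, mu_hat))
    total = sum(w * (yi - mu_0) ** 2 for w, yi in zip(weights, y))
    return 1.0 - resid / total

y = [1.0, 2.0, 3.0, 4.0]         # hypothetical data
mu_hat = [1.5, 1.5, 3.5, 3.5]    # hypothetical fitted means
print(r2_kl_gls_diagonal(y, mu_hat, [1.0, 1.0, 1.0, 1.0]))  # V = I: the usual R-squared
print(r2_kl_gls_diagonal(y, mu_hat, [1.0, 2.0, 1.0, 2.0]))  # heteroskedastic weights
```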
For Bernoulli regression models, where $y$ takes only the values 0 or 1, many
$R^2$ measures have been proposed. See, for example, Maddala (1983, pp. 37-41)
and Windmeijer (1995), or the output from the econometrics package
SHAZAM. For these models, $l(y; y) = 0$, so that by property 4 in Proposition 1,
$R^2_{KL}$ given in Table 1 is equal to the likelihood ratio index proposed by
McFadden (1974), Efron (1978) for one-way ANOVA, Pregibon (1984) who
explicitly derives his measure based on deviances, and Christensen (1990).
Proposition 2 applies to the logit model, but not to the probit model.
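The equality with the likelihood ratio index can be illustrated directly: because the saturated Bernoulli log-likelihood is zero, the KL divergence is minus twice the log-likelihood, and the ratio of divergences equals the ratio of log-likelihoods. In the Python sketch below the outcomes and fitted probabilities are made-up values, and the function names are illustrative.

```python
import math

def bernoulli_loglik(y, mu):
    """Log-likelihood sum[y log(mu) + (1 - y) log(1 - mu)] for y in {0, 1}."""
    return sum(yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
               for yi, mi in zip(y, mu))

def bernoulli_kl(y, mu):
    """K(y, mu) = -2 * log-likelihood, since the saturated log-likelihood is 0."""
    return -2.0 * bernoulli_loglik(y, mu)

def likelihood_ratio_index(y, mu_hat):
    """1 - l(mu_hat; y) / l(mu_0; y), with mu_0 the sample proportion of ones."""
    y_bar = sum(y) / len(y)
    return 1.0 - bernoulli_loglik(y, mu_hat) / bernoulli_loglik(y, [y_bar] * len(y))

def r2_kl_bernoulli(y, mu_hat):
    """1 - K(y, mu_hat) / K(y, mu_0), the measure in (5)."""
    y_bar = sum(y) / len(y)
    return 1.0 - bernoulli_kl(y, mu_hat) / bernoulli_kl(y, [y_bar] * len(y))

y = [0, 0, 1, 1, 1]                  # hypothetical outcomes
mu_hat = [0.2, 0.4, 0.6, 0.7, 0.9]   # hypothetical fitted probabilities
print(likelihood_ratio_index(y, mu_hat), r2_kl_bernoulli(y, mu_hat))
```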
An $R^2$ measure is rarely reported for the Poisson model. $R^2_{KL}$ given in Table
1 equals one of the $R^2$ measures proposed for this model by Cameron and
Windmeijer (1996). The standard Poisson regression model specifies
$\mu = \exp(x'\beta)$, which is the canonical link, so that Proposition 2 applies.
Table 1 also lists $R^2_{KL}$ for the binomial, geometric, exponential, gamma, and
inverse-Gaussian regression models. For these models we have been unable to
find specific $R^2$ measures in the literature.
The analysis can be extended to a p-dimensional dependent variable with
density in the p-parameter exponential family. Necessary results for such generalization
are given in Simon (1973). Of particular interest is the multinomial
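Table 1
$R^2_{KL}$ for exponential family regression models

Distribution | KL divergence $K(y, \mu)$ | $R^2_{KL}$ | Canonical link $\mu(\eta)$
Bernoulli | $-2\sum[y\log\mu + (1-y)\log(1-\mu)]$ | $1 - \dfrac{\sum[y\log\hat\mu + (1-y)\log(1-\hat\mu)]}{n[\bar y\log\bar y + (1-\bar y)\log(1-\bar y)]}$ | $\mu = \dfrac{\exp(\eta)}{1+\exp(\eta)}$
Geometric | $2\sum\left[y\log\dfrac{y}{\mu} - (y+1)\log\dfrac{y+1}{\mu+1}\right]$ | $1 - \dfrac{\sum[y\log(y/\hat\mu) - (y+1)\log((y+1)/(\hat\mu+1))]}{\sum[y\log(y/\bar y) - (y+1)\log((y+1)/(\bar y+1))]}$ | $\mu = \dfrac{\exp(\eta)}{1-\exp(\eta)}$
Exponential | $-2\sum[\log(y/\mu) - (y-\mu)/\mu]$ | $1 - \dfrac{\sum[\log(y/\hat\mu) - (y-\hat\mu)/\hat\mu]}{\sum\log(y/\bar y)}$ | $\mu = \eta^{-1}$
Gamma | $-2\nu\sum[\log(y/\mu) - (y-\mu)/\mu]$ | $1 - \dfrac{\sum[\log(y/\hat\mu) - (y-\hat\mu)/\hat\mu]}{\sum\log(y/\bar y)}$ | $\mu = \eta^{-1}$
Inverse Gaussian | $\sum (y-\mu)^2/(\mu^2 y)$ | $1 - \dfrac{\sum (y-\hat\mu)^2/(\hat\mu^2 y)}{\sum (y-\bar y)^2/(\bar y^2 y)}$ | $\mu = \eta^{-1/2}$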
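Two of the Table 1 entries, for the gamma and inverse-Gaussian models, translate into short Python functions as sketched below. The sketch assumes the standard deviance terms $\log(y/\mu) - (y - \mu)/\mu$ (gamma, where the shape parameter cancels in the ratio) and $(y - \mu)^2/(\mu^2 y)$ (inverse Gaussian), takes the constant-only fitted mean to be the sample mean, and uses made-up data; the function names are illustrative.

```python
import math

def r2_kl_gamma(y, mu_hat):
    """1 - sum[log(y/mu) - (y - mu)/mu] / sum[log(y/y_bar)] for positive y."""
    y_bar = sum(y) / len(y)
    num = sum(math.log(yi / mi) - (yi - mi) / mi for yi, mi in zip(y, mu_hat))
    # The (y - y_bar)/y_bar terms sum to zero, so only the log terms remain.
    den = sum(math.log(yi / y_bar) for yi in y)
    return 1.0 - num / den

def r2_kl_inverse_gaussian(y, mu_hat):
    """1 - sum[(y - mu)^2 / (mu^2 y)] / sum[(y - y_bar)^2 / (y_bar^2 y)]."""
    y_bar = sum(y) / len(y)
    num = sum((yi - mi) ** 2 / (mi ** 2 * yi) for yi, mi in zip(y, mu_hat))
    den = sum((yi - y_bar) ** 2 / (y_bar ** 2 * yi) for yi in y)
    return 1.0 - num / den

y = [1.0, 2.0, 4.0, 5.0]        # hypothetical positive responses
mu_hat = [1.2, 2.1, 3.6, 5.1]   # hypothetical fitted means
print(r2_kl_gamma(y, mu_hat), r2_kl_inverse_gaussian(y, mu_hat))
```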
6. Discussion
2 In the special case that the nonlinear model is based on a linear latent variable model, Veall and
Zimmermann (1992, 1995) advocate estimating the latter measure for the underlying latent variable.
This approach cannot be applied to most of the models considered here.
3 See also Merkle and Zimmermann (1992), who proposed use of $R^2_{KL}$ as defined in (8) for the Poisson
model.
7. Application
4 Assume that each $y_i$ is drawn from $N(\mu_i, \sigma^2)$, where $\mu_i = y_i$ and $\sigma^2 \to 0$. Then the density of each $y_i$,
and hence the log-likelihood for the sample, becomes infinite. For the negative binomial model,
where a similar problem arises, Cameron and Windmeijer (1996) propose setting the scale parameter
to its estimate in the fitted model.
5 For example, consider the log-normal, $\log y_i \sim N(\theta_i, 1)$, in which case $\mu_i = \exp(\theta_i + 0.5)$. The
log-density of $y_i$ is maximized w.r.t. $\theta_i$ at $\theta_i = \log y_i$, and hence is maximized w.r.t. $\mu_i$ at
$\mu_i = y_i \exp(0.5) \neq y_i$.
Table 2
Results: $R^2$'s for car insurance data
$\nu$ estimated | $\nu = 1$
PA: policy holder's age; CG: car group; VA: vehicle age.
$\hat\nu_0 = 0.203$; $\hat\nu = 0.231$ for PA; $\hat\nu = 0.378$ for PA CG; $\hat\nu = 1.004$ for PA CG VA.
fact that the variances in the exponential model are smaller than the estimated
variances in the gamma model ($\hat\nu < 1$) for the first two models. For the measure
based on the likelihood ratio statistic the reason is the smaller value of $l(\hat\mu_0, \hat\nu; y)$
as compared to $l(\hat\mu_0, \hat\nu_0; y)$.6 The differences between $R^2_{LRT}$ and $R^2_{LRTU}$ are small
due to the fact that the term $1 - \exp(-K(y, \hat\mu_0)/n)$ is close to 1.
The $R^2$ measures clearly convey the message that the full model provides
a very good fit for this data. While the $R^2$'s may appear high to those familiar
with cross-section data, the full model does actually fit the data well, as even
standard weighted nonlinear least squares, i.e., minimize $\sum_i w_i (y_i - 1/(x_i'\beta))^2$,
gives an $R^2_{KL,GLS}$, as defined in (6), equal to 0.79 in the model with all categories
included.
8. Conclusions
6 Equivalently, $R^2_{LRT}$ takes on very high values when LRT is computed as $2\{l(\hat\mu, \hat\nu; y) - l(\hat\mu_0, \hat\nu; y)\}$.
References
Baxter, L.A., S.M. Coutts, and G.A.F. Ross, 1980, Applications of linear models in motor insurance,
Proceedings of the 21st International Congress of Actuaries, Zurich, 11-29.
Buse, A., 1973, Goodness of fit in generalized least squares estimation, The American Statistician 27,
106-108.
Cameron, A.C. and F.A.G. Windmeijer, 1996, R-squared measures for count data regression models
with applications to health care utilization, Journal of Business and Economic Statistics 14,
209-220.
Christensen, R., 1990, Log-linear models (Springer-Verlag, New York, NY).
Cragg, J. and R. Uhler, 1970, The demand for automobiles, Canadian Journal of Economics 3,
386-406.
Cook, R.D., 1986, Assessment of local influence, Journal of the Royal Statistical Society B 48,
133-169.
Efron, B., 1978, Regression and ANOVA with zero-one data: Measures of residual variation, Journal
of the American Statistical Association 73, 113-121.
Gourieroux, C., A. Monfort, and A. Trognon, 1984, Pseudo maximum likelihood methods: Theory,
Econometrica 52, 681-700.
Hastie, T., 1987, A closer look at the deviance, The American Statistician 41, 16-20.
Hauser, J.A., 1978, Testing the accuracy, usefulness, and significance of probabilistic choice models:
An information-theoretic approach, Operations Research 26, 406-421.
Kent, J.T., 1983, Information gain and a general measure of correlation, Biometrika 70, 163-173.
Kullback, S., 1959, Information theory and statistics (Wiley, New York, NY).
Laitila, T., 1993, A pseudo-$R^2$ measure for limited and qualitative dependent variable models,
Journal of Econometrics 56, 341-356.
Landwehr, J.M., D. Pregibon, and A.C. Shoemaker, 1984, Graphical methods for assessing logistic
regression models, Journal of the American Statistical Association 79, 61-83.
Maasoumi, E., 1993, A compendium to information theory in economics and econometrics, Econo-
metric Reviews, 137-181.
Maddala, G.S., 1983, Limited dependent and qualitative variables in econometrics (Cambridge
University Press, Cambridge).
Magee, L., 1990, $R^2$ measures based on Wald and likelihood ratio joint significance tests, The
American Statistician 44, 250-253.
McCullagh, P. and J.A. Nelder, 1989, Generalized linear models, 2nd ed. (Chapman and Hall,
London).
McFadden, D., 1974, Conditional logit analysis of qualitative choice behaviour, in: P. Zarembka,
ed., Frontiers in econometrics (Academic Press, New York, NY) 105-142.
McKelvey, R.D. and W. Zavoina, 1976, A statistical model for the analysis of ordinal level
dependent variables, Journal of Mathematical Sociology 4, 103-120.
Merkle, L. and K.F. Zimmermann, 1992, The demographics of labor turnover: A comparison of
ordinal probit and censored count data models, Recherches Economiques de Louvain 58, 283-307.
Nelder, J.A. and R.W.M. Wedderburn, 1972, Generalized linear models, Journal of the Royal
Statistical Society A 135, 370-384.
Pregibon, D., 1981, Logistic regression diagnostics, Annals of Statistics 9, 705-724.
Pregibon, D., 1984, Data analytic methods for matched case-control studies, Biometrics 40, 639-651.
Simon, G., 1973, Additivity of information in exponential family probability laws, Journal of the
American Statistical Association 68, 478-482.
Ullah, A., 1993, Entropy, divergence and distance measures with econometric applications (Department
of Economics, University of California, Riverside, CA).
Veall, M.R. and K.F. Zimmermann, 1992, Pseudo-$R^2$'s in the ordinal probit model, Journal of
Mathematical Sociology 4, 103-120.
Veall, M.R. and K.F. Zimmermann, 1995, Pseudo-$R^2$ measures for some common limited dependent
variable models, Journal of Economic Surveys, forthcoming.
Vos, P.W., 1991, A geometric approach to detecting influential cases, Annals of Statistics 19,
1570-1581.
Vuong, Q.H., 1989, Likelihood ratio tests for model selection and non-nested hypothesis, Econo-
metrica 57, 307-333.
White, H., 1993, Estimation, inference and specification analysis (Cambridge University Press,
Cambridge).
Williams, D.A., 1987, Generalized linear model diagnostics using the deviance and single-case
deletions, Applied Statistics 36, 181-191.
Windmeijer, F.A.G., 1995, Goodness-of-fit measures in binary choice models, Econometric Reviews
14, 101-116.