
Biometrika (1974), 61, 3, 439

Printed in Great Britain

Quasi-likelihood functions, generalized linear models,
and the Gauss-Newton method



BY R. W. M. WEDDERBURN
Rothamsted Experimental Station, Harpenden, Herts.

SUMMARY
To define a likelihood we have to specify the form of distribution of the observations, but
to define a quasi-likelihood function we need only specify a relation between the mean and
variance of the observations and the quasi-likelihood can then be used for estimation. For
a one-parameter exponential family the log likelihood is the same as the quasi-likelihood
and it follows that assuming a one-parameter exponential family is the weakest sort of
distributional assumption that can be made. The Gauss-Newton method for calculating
nonlinear least squares estimates generalizes easily to deal with maximum quasi-likelihood
estimates, and a rearrangement of this produces a generalization of the method described
by Nelder & Wedderburn (1972).

Some key words: Estimation; Exponential families; Gauss-Newton method; Generalized linear models;
Maximum likelihood; Quasi-likelihood.

1. INTRODUCTION
This paper is mainly concerned with fitting regression models, linear or nonlinear, in
which the variance of each observation is specified to be either equal to, or proportional to,
some function of its expectation. If the form of distribution of the observations were
specified, the method of maximum likelihood would give estimates of the parameters in
the model. For instance, if it is specified that the observations have normally distributed
errors with constant variance, then the method of least squares provides expressions for the
variances and covariances of the estimates, exact for linear models and approximate for
nonlinear ones, and these estimates and the expressions for their errors remain valid even
if the observations are not normally distributed but merely have a fixed variance; thus, with
linear models and a given error variance, the variance of least squares estimates is not
affected by the distribution of the errors, and the same holds approximately for nonlinear
ones.
A more general situation is considered in this paper, namely the situation when there is a
given relation between the variance and mean of the observations, possibly with an unknown
constant of proportionality. A similar problem was considered from a Bayesian viewpoint by
Hartigan (1969). We define a quasi-likelihood function, which can be used for estimation in
the same way as a likelihood function. With constant variance this again leads to least
squares estimation. When other mean-variance relationships are specified, the quasi-
likelihood sometimes turns out to be a recognizable likelihood function; for instance, for a
constant coefficient of variation the quasi-likelihood function is the same as the likelihood
obtained by treating the observations as if they had a gamma distribution.
Then computational methods are discussed. The well-known Gauss-Newton method for
calculation of nonlinear least squares estimates is generalized to provide a method of
calculating maximum quasi-likelihood estimates.
When there exists a function of the means that is linear in the parameters, a rearrangement
of the calculations in the generalized Gauss-Newton method gives a procedure identical to
that described by Nelder & Wedderburn (1972). This method produces maximum likelihood

estimates by iterative weighted least squares when the distribution of observations has a
certain form and there is a transformation of the mean which makes it linear in the para-
meters. The distributions that can be treated in this way are those for which the likelihoods
are identical to the quasi-likelihoods; thus the present result generalizes that of Nelder &
Wedderburn.
The approach described in this paper sheds new light on some existing data-analytic
techniques, and also suggests new ones. An example is given to illustrate the method.

2. DEFINITION OF THE QUASI-LIKELIHOOD FUNCTION


Suppose we have independent observations z_i (i = 1, …, n) with expectations μ_i and
variances V(μ_i), where V is some known function. Later we shall relax this specification and
say var(z_i) ∝ V(μ_i). We suppose that for each observation μ_i is some known function of a set
of parameters β_1, …, β_r. Then for each observation we define the quasi-likelihood function
K(z_i, μ_i) by the relation

∂K(z_i, μ_i)/∂μ_i = (z_i − μ_i)/V(μ_i),    (1)

or equivalently

K(z_i, μ_i) = ∫^{μ_i} {(z_i − μ′_i)/V(μ′_i)} dμ′_i + function of z_i.    (2)
From now on, when convenient, the subscript i will be dropped so that z and μ will refer
to an observation and its expectation, respectively. Also, the notation S(·) will be used to
denote summation over the observations, so that S(z) is the sum of the observations. We shall
find that K has many properties in common with a log likelihood function. In fact, we find
that K is the log likelihood function of the distribution if z comes from a one-parameter
exponential family, as will be shown in § 4.
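The defining relation can be made concrete with a small numerical sketch (ours, not the paper's). For three common choices of V the integral defining K has a closed form, and a finite-difference check confirms that ∂K/∂μ = (z − μ)/V(μ) in each case:

```python
import math

# Closed-form quasi-likelihoods K(z, mu), obtained by integrating
# (z - mu')/V(mu') and dropping terms that depend only on z.
# The three variance functions are illustrative choices.
QUASI = {
    "constant": (lambda mu: 1.0,     lambda z, mu: -0.5 * (z - mu) ** 2),
    "poisson":  (lambda mu: mu,      lambda z, mu: z * math.log(mu) - mu),
    "gamma":    (lambda mu: mu ** 2, lambda z, mu: -z / mu - math.log(mu)),
}

def quasi_score(name, z, mu, h=1e-6):
    """Central finite difference of K with respect to mu."""
    _, K = QUASI[name]
    return (K(z, mu + h) - K(z, mu - h)) / (2 * h)

# Check dK/dmu == (z - mu)/V(mu) for each variance function.
z, mu = 3.0, 2.0
for name, (V, _) in QUASI.items():
    assert abs(quasi_score(name, z, mu) - (z - mu) / V(mu)) < 1e-5
```

With V(μ) = μ the quasi-likelihood z log μ − μ has the shape of a Poisson log likelihood, anticipating § 4; with V(μ) = μ² (constant coefficient of variation) it matches the gamma case mentioned in the introduction.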

3. PROPERTIES OF QUASI-LIKELIHOODS
It is now shown that the function K has properties similar to those of log likelihoods.

THEOREM 1. Let z and K be defined as in § 2, and suppose that μ is expressed as a function
of parameters β_1, …, β_m. Then K has the following properties:

(i) E(∂K/∂μ) = 0;
(ii) E(∂K/∂β_i) = 0;
(iii) E{(∂K/∂μ)²} = −E(∂²K/∂μ²) = 1/V(μ);
(iv) E{(∂K/∂β_i)(∂K/∂β_j)} = −E(∂²K/∂β_i∂β_j) = {1/V(μ)}(∂μ/∂β_i)(∂μ/∂β_j).
Proof. First, (i) follows immediately from the definition of K. Then (ii) follows on noting
that ∂K/∂β_i = (∂K/∂μ)(∂μ/∂β_i), and (iii) is a special case of (iv).

To prove (iv), we note that

E{(∂K/∂β_i)(∂K/∂β_j)} = E{(∂K/∂μ)²}(∂μ/∂β_i)(∂μ/∂β_j)
                      = E[{(z − μ)/V(μ)}²](∂μ/∂β_i)(∂μ/∂β_j)
                      = {1/V(μ)}(∂μ/∂β_i)(∂μ/∂β_j),

since V(μ) = var(z). Also we have

−E(∂²K/∂β_i∂β_j) = {1/V(μ)}(∂μ/∂β_i)(∂μ/∂β_j),

which completes the proof.
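Properties (i) and (iii) can be illustrated by simulation (our sketch, not part of the paper): for Poisson-distributed z we have V(μ) = μ, and the quasi-score (z − μ)/V(μ) should have mean 0 and mean square 1/var(z) = 1/μ.

```python
import math
import random

random.seed(1)
mu, n = 4.0, 200_000

def poisson(lam):
    """Knuth's Poisson sampler (adequate for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

draws = [poisson(mu) for _ in range(n)]
# Quasi-score dK/dmu = (z - mu)/V(mu) with V(mu) = mu.
scores = [(z - mu) / mu for z in draws]
mean_score = sum(scores) / n                 # property (i): about 0
mean_sq = sum(s * s for s in scores) / n     # property (iii): about 1/mu

assert abs(mean_score) < 0.01
assert abs(mean_sq - 1 / mu) < 0.01
```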


A result which will be discussed further in § 4 is as follows.

COROLLARY. If the distribution of z is specified in terms of μ, so that the log likelihood L can be
defined, then

−E(∂²L/∂μ²) ≥ −E(∂²K/∂μ²).

Proof. From the theorem just proved, the above statement is equivalent to

var(z) ≥ 1/{−E(∂²L/∂μ²)},

a result which follows immediately from the Cramér-Rao inequality (Kendall & Stuart,
1973, § 7.14).

4. LIKELIHOODS OF EXPONENTIAL FAMILIES

It is possible to define a log likelihood if a one-parameter family of distributions with μ
as parameter is specified for z. The following theorem shows that the log likelihood function
is identical to the quasi-likelihood if and only if this family is an exponential family.

THEOREM 2. For one observation of z, the log likelihood function L has the property

∂L/∂μ = (z − μ)/V(μ),    (3)

where μ = E(z) and V(μ) = var(z), if and only if the density of z with respect to some measure
can be written in the form exp{zθ − g(θ)}, where θ is some function of μ.

Proof. If ∂L/∂μ has the form (3), then integrating with respect to μ and defining
θ = ∫ dμ/V(μ), we have the required result. Conversely, suppose for some measure m on the
real line the distribution of z is given by exp{zθ − g(θ)} dm(z). Then ∫ e^{zθ} dm(z) = e^{g(θ)},
and so the moment generating function M(t) of z is

M(t) = ∫ e^{z(θ+t) − g(θ)} dm(z) = exp{g(θ + t) − g(θ)}.

It follows that g(θ + t) − g(θ), regarded as a function of t, is the cumulant generating function
of z. Hence g′(θ) = μ and g″(θ) = V(μ); also dμ/dθ = g″(θ) = V(μ). Then we have

∂L/∂μ = (∂L/∂θ)(dθ/dμ) = {z − g′(θ)}/V(μ) = (z − μ)/V(μ).

This completes the proof.


If K really is a log likelihood, then the theorem shows that, given V(μ), we can construct θ
and g(θ) by integration. Theorem 9 of Lehmann (1959) shows that V and g must be analytic
functions and that the characteristic function φ(t) of z is analytic on the whole real line and
given by φ(t) = exp{g(θ + it) − g(θ)}. Thus, in principle, the problem of determining whether
or not K is a log likelihood is reduced to the problem of determining whether a given complex
function φ(t), analytic over the whole real line, is a characteristic function; this point
will not be pursued further here.
In the corollary to Theorem 1 in § 3 it was shown that

−E(∂²L/∂μ²) ≥ −E(∂²K/∂μ²).

Then Theorem 2 shows that this inequality becomes an equality for a one-parameter
exponential family. Thus, for a given mean-variance relationship, a one-parameter
exponential family minimizes the information −E(∂²L/∂μ²), provided that an exponential
family exists for that relationship.

It seems reasonable to regard −E(∂²K/∂μ²), which is equal to 1/var(z), as a measure of the
information z gives concerning μ when only the mean-variance relationship is known, and to
regard E{∂²(K − L)/∂μ²}, which is always nonnegative, as the additional information
provided by knowing the distribution of z. From this point of view, assuming a one-parameter
exponential family for z is equivalent to making no assumption other than the mean-
variance relationship.
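The construction of θ from V by integration can be sketched numerically (our illustration). For V(μ) = μ, the integral θ = ∫ dμ/V(μ) reproduces log μ, the canonical parameter of the Poisson family:

```python
import math

# Numeric sketch of the construction in Theorem 2 for V(mu) = mu:
# theta(mu) = integral of d mu' / V(mu') should come out as log(mu).
def theta_of_mu(mu, lo=1.0, steps=100_000):
    """Trapezoidal integral of 1/V(m) dm from lo to mu, with V(m) = m."""
    h = (mu - lo) / steps
    total = 0.5 * (1 / lo + 1 / mu)
    for k in range(1, steps):
        total += 1 / (lo + k * h)
    return total * h

mu = 3.0
# theta is defined only up to a constant of integration; taking the
# lower limit lo = 1 makes that constant 0, so theta(mu) = log(mu).
assert abs(theta_of_mu(mu) - math.log(mu)) < 1e-6
```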

5. ESTIMATION USING QUASI-LIKELIHOODS


This section discusses maximum quasi-likelihood estimates and shows that their precision
may be estimated from the expected second derivatives of K in the same way as the precision
of maximum likelihood estimates may be estimated from the expected second derivatives of
the log likelihood.
For each observation, let u be the vector whose components are ∂K/∂β_i. Then, from
Theorem 1, u has mean 0 and dispersion matrix with elements

{1/V(μ)}(∂μ/∂β_i)(∂μ/∂β_j).

Let H = ∂²S(K)/∂β_i∂β_j; then, if the observations are independent, S(u) has mean 0 and
dispersion D = −E(H). Now let β̂ be the maximum quasi-likelihood estimate of β, obtained
by setting S(u) equal to its expectation, 0. To first order in β̂ − β we have S(u) ≈ H(β − β̂),
whence β̂ − β ≈ −H⁻¹S(u).
Approximating to H by its expectation, −D, we have

β̂ − β ≈ D⁻¹S(u).

Now D⁻¹S(u) has dispersion D⁻¹; hence we have deduced, rather informally, the following
result.

THEOREM 3. Maximum quasi-likelihood estimates have approximate dispersion matrix
D⁻¹ = {−E(H)}⁻¹, where H is the matrix of second derivatives of S(K).
Next, we consider the case where the mean-variance relation is not known completely, but
the variance is known to be proportional to a given function of the mean, i.e. var(z) = γV(μ),
where V is a known function but γ is unknown. Clearly the maximum quasi-likelihood
estimate of β is not affected by the value of γ, so that we can calculate β̂ as if γ were known
to be 1. To obtain error estimates we need some estimate of γ. Assuming that μ̂ is approximately
linear in β̂ and that V(μ̂) differs negligibly from V(μ), we have the approximation

E[S{(z − μ̂)²/V(μ̂)}] ≈ (n − m)γ,

which leads to an estimate of γ given by

γ̂ = S{(z − μ̂)²/V(μ̂)}/(n − m).

For normal linear models, this gives the usual estimate of variance.
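A simulation sketch of γ̂ (the data-generating scheme is ours, not the paper's): if each observation is c times a Poisson variate with mean μ/c, then E(z) = μ and var(z) = cμ, so with V(μ) = μ the dispersion constant is γ = c. Fitting a single common mean (m = 1), the maximum quasi-likelihood estimate of μ is the sample mean, and γ̂ should recover c:

```python
import math
import random

random.seed(2)

def poisson(lam):
    """Knuth's Poisson sampler (adequate for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Scaled-Poisson data: mean mu0, variance c * mu0, so gamma = c.
c, mu0, n = 2.0, 5.0, 20_000
z = [c * poisson(mu0 / c) for _ in range(n)]

# With V(mu) = mu and one common mean, the quasi-score S{(z - mu)/mu} = 0
# is solved by the sample mean (m = 1 fitted parameter).
mu_hat = sum(z) / n
gamma_hat = sum((zi - mu_hat) ** 2 for zi in z) / ((n - 1) * mu_hat)

assert abs(gamma_hat - c) < 0.1
```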
6. A GENERALIZATION OF THE GAUSS-NEWTON METHOD
When V(μ) = 1, maximum quasi-likelihood estimation reduces to least squares. One
method of calculating the estimates is then the Gauss-Newton method. This is an iterative
process in which one calculates a regression of the residuals on the quantities ∂μ/∂β_i by
linear least squares, the residuals and ∂μ/∂β_i being calculated from the current estimate of β.
The resulting regression coefficients are then used as corrections to β. It will now be shown
that, to calculate maximum quasi-likelihood estimates with a general V, the Gauss-Newton
method can be modified simply by using the current estimate of 1/V(μ) as a weight variate in
the least squares calculation.
Writing v_i for ∂μ/∂β_i and r for z − μ, we have

∂S(K)/∂β_i = S{rv_i/V(μ)},

and, using Theorem 1,

−E{∂²S(K)/∂β_i∂β_j} = S{v_iv_j/V(μ)}.

Then if we obtain successive approximations to β using the Newton-Raphson method with
the second derivatives of K replaced by their expectations, we obtain corrections δβ to the
estimates given, for i = 1, …, m, by

Σ_j S{v_iv_j/V(μ)} δβ_j = S{rv_i/V(μ)}.    (4)
Hence we have proved the following:



THEOREM 4. Using the Newton-Raphson method with expected second derivatives of K to
calculate β̂ is equivalent to calculating iteratively a weighted linear regression of the residuals, r,
on the derivatives of μ with respect to the β's, with weight 1/V(μ), and using the regression
coefficients as corrections to β̂.

Here V(μ) and the derivatives of μ are calculated at the current estimate of β̂.
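Theorem 4 can be sketched in code. The model μ = exp(β₀ + β₁x) with V(μ) = μ, the data, and the starting values are all made up for illustration; each iteration solves the weighted normal equations of § 6 with weight variate 1/V(μ):

```python
import math

# Generalized Gauss-Newton sketch for the illustrative model
# mu = exp(b0 + b1*x) with variance function V(mu) = mu.
# Each iteration solves  sum_j S{v_i v_j / V(mu)} db_j = S{r v_i / V(mu)},
# with v_i = dmu/db_i and r = z - mu.
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
z = [1.2, 1.0, 3.1, 4.4, 6.3, 12.0, 17.5]   # made-up data
n = len(x)

# Crude start: ordinary regression of log z on x (all z > 0 here).
ly = [math.log(zi) for zi in z]
sx, sy = sum(x), sum(ly)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, ly))
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n

for _ in range(50):
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    r = [zi - mi for zi, mi in zip(z, mu)]
    # For this model v0 = mu and v1 = x*mu, so with weight 1/mu the
    # weighted cross-products simplify as commented below.
    a00 = sum(mu)                                      # S(v0*v0/mu)
    a01 = sum(xi * mi for xi, mi in zip(x, mu))        # S(v0*v1/mu)
    a11 = sum(xi * xi * mi for xi, mi in zip(x, mu))   # S(v1*v1/mu)
    g0 = sum(r)                                        # S(r*v0/mu)
    g1 = sum(xi * ri for xi, ri in zip(x, r))          # S(r*v1/mu)
    det = a00 * a11 - a01 * a01
    d0 = (a11 * g0 - a01 * g1) / det
    d1 = (a00 * g1 - a01 * g0) / det
    b0, b1 = b0 + d0, b1 + d1
    if abs(d0) + abs(d1) < 1e-12:
        break

# At the maximum quasi-likelihood estimate the quasi-score vanishes.
mu = [math.exp(b0 + b1 * xi) for xi in x]
assert abs(sum(zi - mi for zi, mi in zip(z, mu))) < 1e-6
assert abs(sum(xi * (zi - mi) for xi, zi, mi in zip(x, z, mu))) < 1e-6
```

Starting from a crude fit of the linearized model keeps the first step from overshooting; § 7 makes the analogous suggestion of starting from μ = z.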



7. GENERALIZED LINEAR MODELS
We now derive a result which includes the result of Nelder & Wedderburn (1972) as a
special case. Suppose that some function of the mean, f(μ), can be expressed in the form

f(μ) = Σ_i β_ix_i = Y,

say, where the x's are known variables. Then, in the notation of the previous section,
v_i = x_i dμ/dY. Hence (4) may be rewritten

Σ_j S{x_ix_j(dμ/dY)²/V(μ)} δβ_j = S{rx_i(dμ/dY)/V(μ)}.

Then if β denotes the current estimates and β* = β + δβ the corrected ones, and if
Y = Σ_j β_jx_j, we have

Σ_j S{x_ix_j(dμ/dY)²/V(μ)} β*_j = S[{Y + r(dY/dμ)}x_i(dμ/dY)²/V(μ)],

which proves the next theorem.


THEOREM 5. When Y = f(μ) = Σ_i β_ix_i, a method equivalent to the generalized Gauss-Newton
method already described is to calculate repeatedly a weighted linear regression of

y = Y + (z − μ) dY/dμ

on x_1, …, x_m, using as weighting variate

w = (dμ/dY)²/V(μ).    (5)
Nelder & Wedderburn showed that this technique could be used to obtain maximum
likelihood estimates when there was a linearizing transformation of the mean, f(μ), and the
distribution of z could be expressed in the form

π(z; θ, φ) = exp[α(φ){zθ − g(θ) + h(z)} + β(φ, z)],

where θ is a function of μ and φ is a nuisance parameter. For fixed φ this gives a one-parameter
exponential family, so that the likelihood is the same as the quasi-likelihood. Also, by a
simple extension of the argument used in Theorem 2, we have var(z) = g″(θ)/α(φ). Hence
the mean-variance relationship is of the form given in (3), and the result of Nelder &
Wedderburn is a special case of Theorem 5.

A good starting approximation in this process is usually given by setting μ = z and
calculating w from (5) and y as f(z), but some modification may be needed when f has
singularities at the ends of the range of possible z.
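Theorem 5 can be sketched for a logit link with the binomial-type variance V(μ) = μ(1 − μ): then dμ/dY = μ(1 − μ), the weighting variate is w = (dμ/dY)²/V(μ) = μ(1 − μ), and the working variate is y = Y + (z − μ)/{μ(1 − μ)}. The dose-response proportions below are made up for illustration.

```python
import math

# Iterative weighted least squares sketch for a logit link with
# V(mu) = mu(1 - mu): regress the working variate on (1, x) with
# weight w = mu(1 - mu), repeatedly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
z = [0.05, 0.20, 0.45, 0.80, 0.95]   # made-up proportions

b0, b1 = 0.0, 0.0
for _ in range(50):
    eta = [b0 + b1 * xi for xi in x]
    mu = [1 / (1 + math.exp(-e)) for e in eta]
    w = [mi * (1 - mi) for mi in mu]                  # (dmu/dY)^2 / V(mu)
    y = [e + (zi - mi) / wi                           # working variate
         for e, zi, mi, wi in zip(eta, z, mu, w)]
    # Weighted least squares of y on (1, x).
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    nb0 = (swxx * swy - swx * swxy) / det
    nb1 = (sw * swxy - swx * swy) / det
    done = abs(nb0 - b0) + abs(nb1 - b1) < 1e-12
    b0, b1 = nb0, nb1
    if done:
        break

# For the logit link the quasi-score reduces to S{(z - mu) x_i},
# which vanishes at convergence.
mu = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
assert abs(sum(zi - mi for zi, mi in zip(z, mu))) < 1e-8
assert abs(sum((zi - mi) * xi for zi, mi, xi in zip(z, mu, x))) < 1e-8
```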

8. EXAMPLE
J. F. Jenkyn, in an unpublished Aberystwyth Ph.D. thesis, discussed the data of Table 1,
which gives estimates of the percentage leaf area of barley infected with Rhynchosporium
secalis, or leaf blotch, for 10 different varieties grown at 9 different sites in a variety trial in
1965.



Jenkyn applied the angular transformation to the data, and then applied the method of
Finlay & Wilkinson (1963), calculating, for each variety, regressions of the transformed
percentages on the site means of the transformed percentages. He found a marked relation-
ship between the variety means and the slopes of the regressions and also between the
variety means and the residual variances from the regression. Thus the angular transforma-
tion failed to produce additivity or to stabilize the variance; in fact, it appeared that a
transformation with a more extreme effect at the ends of the range, or at least at the lower
end, was needed; Jenkyn suggested a logarithmic transformation. Two others suggest
themselves: the logistic transformation, log(p/q), and the complementary log log transformation,
log(−log q).

Table 1. Incidence of R. secalis on leaves of 10 varieties grown
at nine sites; percentage leaf area affected

                                    Variety
Site     1      2      3      4      5      6      7      8      9     10   Mean
 1     0.05   0.00   0.00   0.10   0.25   0.05   0.50   1.30   1.50   1.50   0.52
 2     0.00   0.05   0.05   0.30   0.75   0.30   3.00   7.50   1.00  12.70   2.56
 3     1.25   1.25   2.50  16.60   2.50   2.50   0.00  20.00  37.50  26.25  11.03
 4     2.50   0.50   0.01   3.00   2.50   0.01  25.00  55.00   5.00  40.00  13.35
 5     5.50   1.00   6.00   1.10   2.50   8.00  16.50  29.50  20.00  43.50  13.36
 6     1.00   5.00   5.00   5.00   5.00   5.00  10.00   5.00  50.00  75.00  16.60
 7     5.00   0.10   5.00   5.00  50.00  10.00  50.00  25.00  50.00  75.00  27.51
 8     5.00  10.00   5.00   5.00  25.00  75.00  50.00  75.00  75.00  75.00  40.00
 9    17.50  25.00  42.50  50.00  37.50  95.00  62.50  95.00  95.00  95.00  61.50

Mean   4.20   4.77   7.34   9.57  14.00  21.76  24.17  34.81  37.22  49.33

An attempt was made to analyze the data using a logistic transformation. To do this, zero
had to be replaced by some suitably small value; since the value 0.01% occurs in the data
we could hardly replace zero by something greater than this, unless 0.01% were to be
increased too. It was found that some of these small values in the data gave large negative
residuals which had a serious distorting effect on the analysis, and only when these values
were ignored was it possible to obtain a satisfactory analysis.
The logistic transformation appeared to be about right for stabilizing variance and
producing additivity except for the undesirable effect of the small observations. This led to
a different formulation of the model which was the same to a first order of approximation,
but which avoided the problems caused by applying a logistic transformation to small
observations.
Let p_ij denote the proportion of leaf area infected in the ith variety at the jth site. Let
P_ij = E(p_ij) and Q_ij = 1 − P_ij. The model is stated as

logit P_ij = η_ij = m + a_i + b_j,  var(p_ij) = γP_ij²Q_ij².

When the method of § 6 is applied, the weighting variate is equal to 1. This is useful
because it means that the iterative analysis remains orthogonal. The modified y variate
takes the form of a 'working logit', namely

y_ij = η_ij + (p_ij − P_ij)/(P_ijQ_ij).

The quantities r_ij = (p_ij − P_ij)/(P_ijQ_ij) can be regarded as weighted residuals. They are
proportional to residuals divided by their estimated standard error. Evidently the calculations
are like the usual ones for fitting a logistic model to quantal data, but simpler, because
the weights are equal to 1.
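The constant weight can be checked directly (a small sketch of ours): with a logit link dμ/dη = PQ, and with variance function V = P²Q² the weighting variate (dμ/dη)²/V equals 1 at every point.

```python
import math

# With logit link, dmu/deta = mu(1 - mu); with V(mu) = mu^2 (1 - mu)^2
# the weight w = (dmu/deta)^2 / V(mu) is exactly 1 for every mu.
def logistic(e):
    return 1 / (1 + math.exp(-e))

for eta in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    mu = logistic(eta)
    dmu_deta = mu * (1 - mu)
    V = mu ** 2 * (1 - mu) ** 2
    w = dmu_deta ** 2 / V
    assert abs(w - 1.0) < 1e-12

    # The corresponding working logit for a made-up observation p:
    p = min(max(mu + 0.05, 1e-6), 1 - 1e-6)
    y = eta + (p - mu) / (mu * (1 - mu))
```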
When the model was fitted by the method described, there were, of course, clear differences
between sites and between varieties. The estimate of γ obtained from the residual mean
square came to 0.995, a value which indicates high variability in the data. It implies, for
instance, that if P_ij is 0.20, the standard deviation of p_ij is about 0.16. Clearly, for this to
happen, p_ij would have to have a rather skew distribution for P_ij not near 0.5; examination
of the weighted residuals showed this skewness.

Table 2. Means over sites of fitted values of logit P_ij

Variety               1      2      3      4      5      6      7      8      9     10
Mean of logit P̂_ij  −4.05  −4.51  −3.96  −3.09  −2.69  −2.71  −1.71  −0.78  −0.91  −0.16

(Standard error ± 0.331.)

The mean values of η̂_ij for each variety are shown, with their standard errors, in Table 2.
Clearly there are differences between varieties; there seem to be 3 highly resistant varieties
and 3 less so, while the remaining 4 are much more susceptible.

Starting with the final set of working logits, the technique of Finlay & Wilkinson (1963)
was applied noniteratively, but there was no sign of any interaction; nor was there any when
a single degree of freedom for nonadditivity (Tukey, 1949) was isolated.

It seems, then, that a simple summary of the data has been achieved which makes it easy
to see what conclusions can be drawn. The simpler method of working with logit p_ij might
have worked better if the variance had not been so large; part of the trouble is that with
such a large variance the approximations
E(logit p_ij) ≈ logit P_ij,  var(logit p_ij) ≈ var(p_ij)/(P_ij²Q_ij²)
break down.

9. CONCLUSIONS
It may be difficult to decide what distribution one's observations follow, but the form of
the mean-variance relationship is often much easier to postulate; this is what makes quasi-
likelihoods useful. It has been seen how maximum quasi-likelihood estimation produced a
satisfactory analysis of rather difficult data, and how these estimates can be computed.
Some procedures used in the past are best understood in terms of quasi-likelihoods. For
instance, in probit analysis, when the variance of the observations is found to be greater
than that predicted by the binomial distribution, it is common to accept the maximum
likelihood estimates regardless, while estimating the degree of heterogeneity as in Chapter 4
of Finney (1971). If the variance is still proportional to the binomial variance then this
procedure can be justified in terms of quasi-likelihoods. Also Fisher (1949), finding that in some
data the variance was proportional to the mean, treated them effectively as if they had a
Poisson distribution, even though the measurement involved was a continuous one. Thus
quasi-likelihoods improve understanding of some past procedures, as well as providing new
ones.

The author wishes to thank the director of the National Institute of Agricultural Botany
and Dr J. F. Jenkyn for their permission to use the data, Mr M. J. R. Healy whose comments
on an earlier version of the paper improved the presentation and Mr R. W. Payne for
running the calculations on the GENSTAT statistical program developed at Rothamsted.

REFERENCES
FINLAY, K. W. & WILKINSON, G. N. (1963). The analysis of adaptation in a plant breeding programme.
Aust. J. Agric. Res. 14, 742-54.
FINNEY, D. J. (1971). Probit Analysis, 3rd edition. Cambridge University Press.
FISHER, R. A. (1949). A biological assay of tuberculins. Biometrics 5, 300-16.
HARTIGAN, J. A. (1969). Linear Bayesian models. J.R. Statist. Soc. B 31, 446-54.
KENDALL, M. G. & STUART, A. (1973). The Advanced Theory of Statistics, Vol. II, 3rd edition. London:
Griffin.
LEHMANN, E. L. (1959). Testing Statistical Hypotheses. New York: Wiley.
NELDER, J. A. & WEDDERBURN, R. W. M. (1972). Generalized linear models. J.R. Statist. Soc. A 135,
370-84.
TUKEY, J. W. (1949). One degree of freedom for non-additivity. Biometrics 5, 232-42.

[Received November 1973. Revised June 1974]
