Lecture Notes Statistics II

Contents

1. Statistical models
2. Point estimation
   2.1. Stochastic Models
   2.2. Estimators and their properties
      2.2.1. Finite-Sample Properties
      2.2.2. Asymptotic Properties
   2.3. Sufficient Statistics
   2.4. Minimal Sufficient Statistics
   2.5. Minimum Variance Unbiased Estimation
      2.5.1. Cramér-Rao Lower Bound (CRLB)
      2.5.2. Sufficiency and Completeness
4. Hypothesis testing
   4.1. Fundamental Notations and Terminology of Hypothesis Testing
   4.2. Parametric Tests and Test Properties
   4.3. Construction of UMP Tests
   4.4. Hypothesis-Testing Methods
      4.4.1. Likelihood Ratio Tests
      4.4.2. Lagrange Multiplier (LM) Tests
      4.4.3. Wald Tests

Appendix
A. Tables
1. Statistical models
In the course Advanced Statistics I we discussed fundamental ideas of probability theory and
the theory of distributions. There we considered the probability space of a random experiment,
given by the 3-tuple {S, Y, P}, where S denotes the sample space, Y the event space, and P the
probability set function.
Example 1.1 Consider the experiment of tossing a fair coin 50 times. Assume that we are
interested in the number of heads, say X. We know that this rv has a binomial distribution
with parameters n = 50 and p = 0.5. Hence, we have a completely specified probability space
with a sample space S = {0, 1, ..., 50}, an event space Y consisting of all subsets of S, and a
probability set function P characterized by the pdf of a binomial distribution. From this we can
deduce the characteristics of the outcomes of X like the expected number of heads (np = 25) or
the shape of the pdf of X.
In this course we now turn the question of the probability theory and the theory of distributions
around:
Given the observed characteristics and properties of outcomes of an experiment,
what can we say (infer) about the probability space?
Example 1.2 Assume that we have a sample of daily returns observed for the German stock
index DAX, which we interpret as the outcomes of a random process/experiment. As a financial
analyst we might be interested in finding a probability distribution (i.e. the probability space)
which can be used to describe or approximate the observed behavior of the returns.
Problems associated with this kind of question are addressed by the methods of statistical
inference.
In general, the term statistical inference refers to the inductive process of generating information
about characteristics of a population or process by analyzing a sample of objects or outcomes
from the population or process. A typical problem of statistical inference is as follows.
Let X be a rv that represents the population under investigation, and let f(x; θ) denote
the parametric family of pdfs of X. The set of possible parameter values is denoted by Ω.
Then the job of the statistician is to decide on the basis of a sample randomly drawn from
the population, say {X_i, i = 1, ..., n}, which member of the family of pdfs {f(x; θ), θ ∈ Ω}
can represent the pdf of X.

Example 1.3 Consider a random sample of daily DAX returns, say {X_1, X_2, ..., X_n}. Assume
that the returns represent a random sample from a normal distribution, i.e.,

X_i ~ iid N(μ, σ²),   i = 1, ..., n,

where μ and σ² are unknown parameters. Our task is to generate statistical inferences based
upon the random sample about the population values for μ and σ².
Statistics
In statistical inference, we use functions of the random sample X1 , ..., Xn to map/transform
sample information into inferences regarding the population characteristics of interest. The
functions used for this mapping are called statistics, defined as follows.
Definition (Statistic): Let X_1, ..., X_n be a random sample from a population and let
T(x_1, ..., x_n) be a real-valued function which does not depend on unobservable quantities.
Then the random variable Y = T(X_1, ..., X_n) is called a (sample) statistic.
Often used statistics are

- the sample mean X̄_n = (1/n) ∑_{i=1}^n X_i,
- the r-th sample moment (1/n) ∑_{i=1}^n X_i^r,
- the sample variance (1/n) ∑_{i=1}^n (X_i - X̄_n)².
If statistics have certain qualifying statistical properties, they can be used for the estimation of
population parameters or hypothesis testing. Then they are called estimators or test statistics,
respectively.
Example 1.4 The most popular statistic is the sample mean X̄_n, which has a lot of useful
statistical properties, some of which are summarized in the following. Let X_1, ..., X_n be a random
sample from a population with expectation EX = μ and variance var(X) = σ². Then the sample
mean has the following properties:

- E X̄_n = μ,
- var(X̄_n) = σ²/n,
- plim X̄_n = μ, which follows from the WLLN,
- X̄_n ~a N(μ, σ²/n), which follows from the CLT of Lindeberg-Lévy.
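The following small simulation illustrates these properties numerically (a sketch only; the exponential population, the sample size, the seed, and the number of replications are arbitrary choices, not part of the notes): the empirical mean and variance of X̄_n match μ and σ²/n, and the standardized sample mean behaves approximately like a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 2.0, 4.0, 200, 10_000   # population mean/variance, sample size, replications

# Repeated samples from a non-normal population with mean 2 and variance 4
# (an exponential distribution with scale 2, chosen purely for illustration).
samples = rng.exponential(scale=2.0, size=(reps, n))
means = samples.mean(axis=1)                  # one sample mean per replication

print("E[sample mean]   =", means.mean(), "(theory:", mu, ")")
print("var(sample mean) =", means.var(), "(theory:", sigma2 / n, ")")

# CLT: the standardized sample mean is approximately N(0, 1).
z = (means - mu) / np.sqrt(sigma2 / n)
print("P(|Z| <= 1.96)   =", (np.abs(z) <= 1.96).mean(), "(theory: about 0.95)")
```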
2. Point estimation
The rationale behind point estimation is loosely described as follows. Assume that we have a
realization of a random sample X_1, ..., X_n from a joint pdf f(x_1, ..., x_n; θ), where the form of
the pdf f is assumed to be known except that it contains a parameter θ with an unknown
value, say θ_0. Then the objective of point estimation is to utilize the random sample outcome
x_1, ..., x_n to generate good (in some sense) estimates of the unknown value of θ_0, or the value
of some function, say q(θ_0).
This estimation can be made in two ways.

- The first is called point estimation: There the outcome of some statistic, say t(X_1, ..., X_n),
  represents the estimate of the unknown θ_0 or q(θ_0).
- The second is called interval estimation: There we define two statistics, say t_1(X_1, ..., X_n)
  and t_2(X_1, ..., X_n), so that [t_1(X_1, ..., X_n), t_2(X_1, ..., X_n)] is an interval for which the
  probability can be determined that it contains the unknown θ_0 or q(θ_0).

In the following we will focus on point estimation. Point estimation admits two problems:

- The first is to devise some means of obtaining a statistic to use as an estimator;
- the second is to select criteria and techniques to define and find the best estimator among
  many possible estimators.

Here we will be concerned with the second problem, and in the following chapter with the first
one.
2.1. Stochastic Models

Let X = (X_1, ..., X_n) denote a random sample drawn from the population of interest; its
outcome x = (x_1, ..., x_n) constitutes the observed data to be analyzed.
Definition (Statistical model): A statistical model for a random sample x consists of a
parametric functional form, f(x; θ), for the joint pdf of x indexed by the parameters θ,
together with a parameter space, Ω, that defines the set of potential candidates for the true
joint pdf of x as {f(x; θ), θ ∈ Ω}.
Remark: The true model is the joint pdf with θ = θ_0, i.e., f(x; θ_0), and the estimation
problem consists in approximating θ_0 or some function q(θ_0). Examples of such functions
q(θ_0) are e.g. the population mean and variance

EX = ∫ x f(x; θ_0) dx,    var(X) = ∫ (x - EX)² f(x; θ_0) dx.

These are, at the same time, functionals¹ of the density f (or correspondingly of the cdf F).
The method used to estimate θ will generally depend on the degree of specificity with which
one can define the stochastic model for the random sample X. Accordingly, one can distinguish
between distribution-specific estimation methods and distribution-free estimation methods.

Distribution-specific estimation

In this case, the estimation of θ_0 or q(θ_0) is associated with a fully specified stochastic
model assuming a specific parametric family of pdfs for the random sample x represented by
{f(x; θ), θ ∈ Ω}.

Example 2.1 A fully specified model for a random sample of n daily returns of the DAX index,
say {X_i, i = 1, ..., n}, might be defined as

f(x_1, ..., x_n; θ) = ∏_{i=1}^n N(x_i; μ, σ²),   θ = (μ, σ) ∈ Ω,

where Ω = (-∞, ∞) × (0, ∞).

¹ Loosely speaking, functionals are real functions taking functions as arguments; the integral is one of the most
common functionals.
These methods require that the parametric form of the joint pdf of the random sample is fully
algebraically specified.

Distribution-free estimation

In this case, a specific functional form for the joint pdf f(x; θ) is not assumed and may or
may not be fully specified.

Example 2.2 A partially specified model for the sample of the DAX returns would be

f(x_1, ..., x_n; θ) = ∏_{i=1}^n f(x_i; μ, σ²),   θ = (μ, σ) ∈ Ω,   with EX_i = μ, var(X_i) = σ²,

where Ω = (-∞, ∞) × (0, ∞).
This specification is very general, since it is the collection of all continuous joint pdfs for which
f(x_1, ..., x_n; θ) = ∏_{i=1}^n f(x_i; θ) with EX_i = μ and var(X_i) = σ².
Estimation methods for partially specified models are the least-squares and the method of
moments estimation procedures. Should one be interested in the density f itself, the Advanced
Statistics III course discusses non-parametric estimation.
The advantage of distribution-free estimation: It is based upon a general specification
for the joint distribution of the random sample. Hence, we can have great confidence that the
actual distribution is contained within that specification. This implies that the validity and
reliability of the estimation result is robust w.r.t. distributional assumptions.
The disadvantage of distribution-free estimation: In a context without specific distributional
assumptions, the interpretation of the properties of point estimates is not as specific or as
detailed as when the family of distributions is defined with greater specificity. Moreover,
estimators may have different statistical properties.
In the context of a point estimation problem, two important assumptions regarding the parameter
space Ω are made.

Assumption 1: Ω contains the true parameter value, so that θ_0 ∈ Ω.

This implies that the stochastic model {f(x; θ), θ ∈ Ω} can be assumed to contain the true
distribution for the random sample under consideration.
Since Ω represents the entire set of possible values for θ_0, the relevance of this assumption is
perhaps obvious: if our aim is to estimate θ_0, we do not want to preclude θ_0 from the set
of potential estimates. In practice, this assumption may be a tentative assumption that needs
to be verified by statistical tests. More important is the assumption that the data actually are
informative about the parameters.

Assumption 2: Ω is such that the parameter vector θ is identified.

The notion of the identifiability of θ is defined as follows.

Definition (Parameter identifiability): Let {f(x; θ), θ ∈ Ω} be a statistical model for the
random sample x. The parameter vector θ is said to be identified iff ∀ θ_1 ∈ Ω and θ_2 ∈ Ω,
f(x; θ_1) and f(x; θ_2) are distinct if θ_1 ≠ θ_2.

The definition states that if the parameter vector is not identified, then two or more different
θ-values, say θ_1 and θ_2, are associated with exactly the same sampling distribution for x.
In this event, random-sample outcomes x cannot be used to discriminate between θ_1 and θ_2,
since the stochastic behavior of X under either possibility is indistinguishable.
The identifiability assumption ensures that different θ-values are associated with different
stochastic behavior of the random-sample outcomes. By this we make sure that the sample
outcomes are able to provide discriminatory information regarding the choice of θ to be used
in estimating θ_0.
Example 2.3 A random sample X_1, ..., X_n is assumed to be generated by the process

X_i = μ_x + V_i,   V_i ~ iid N(μ_v, σ_v²),   μ_x, μ_v, σ_v² > 0,

such that X_i ~ iid N(μ_x + μ_v, σ_v²). Hence, the stochastic model for that random sample can be
represented as

f(x; θ) = ∏_{i=1}^n N(x_i; μ_x + μ_v, σ_v²),   θ = (μ_x, μ_v, σ_v²) ∈ Ω,

or, equivalently,

f(x; θ*) = ∏_{i=1}^n N(x_i; μ*, σ_v²),   θ* = (μ*, σ_v²) ∈ Ω*,

where μ* = μ_x + μ_v and Ω* = (0, ∞) × (0, ∞).
Note that any choice of positive values for μ_x and μ_v that results in a given positive value for
μ* results in exactly the same sampling distribution for the X_i's. (Also note that there is an
infinite number of such choices.)
In order to identify the model, one can impose identifying restrictions on θ = (μ_x, μ_v, σ_v²)′ such
as, e.g., μ_v = 0, if they are plausible.
2.2. Estimators and their properties

2.2.1. Finite-Sample Properties

The mean squared error (MSE) of an estimator T = t(X) for q(θ) is defined as
MSE_θ(T) = E_θ[(T - q(θ))²].

Remark: The notation E_θ(·) is used to emphasize that the expectation depends upon the value
of θ. In the continuous case we have

MSE_θ(T) = ∫_{R^n} [t(x) - q(θ)]² f(x; θ) dx.
The MSE measures the expected squared distance between the estimator T and the quantity
to be estimated q(θ). The MSE can be decomposed into the variance and the bias of the
estimator, as

E_θ[T - q(θ)]² = E_θ[T - E_θT + E_θT - q(θ)]² = var_θ(T) + [E_θT - q(θ)]²,

where E_θT - q(θ) is the bias. Hence, the MSE-criterion penalizes an estimator for having a large
variance, a large bias, or both. It also allows a trade-off between variance and bias in ranking
estimators.
The definition of the MSE given above for scalar-valued estimators T can be generalized to the
mean square error matrix for multivariate estimators T; see Mittelhammer (1996, Def. 7.7).
Estimators with smaller MSEs are preferred. Note, however, that since θ is unknown, we must
consider the MSE-performance for all possible true values of θ, i.e., for all θ ∈ Ω.
It is quite often the case that an estimator will have lower MSEs than another estimator for
some θ-values but not for others. A comparison of two estimators using the MSE-criterion
leads to the concept of relative efficiency.
Definition (Relative Efficiency (scalar case)): Let T and T* be two estimators of a scalar
q(θ). The relative efficiency of T w.r.t. T* is given by

RE_θ(T, T*) = MSE_θ(T*) / MSE_θ(T),   θ ∈ Ω.

T is said to be relatively more efficient than T* iff RE_θ(T, T*) ≥ 1 ∀ θ ∈ Ω, with strict
inequality for at least one θ ∈ Ω.

The definition says that if T is relatively more efficient than T*, then there is no θ-value for
which T* is preferred to T on the basis of MSE, and for one or more θ-values, T is preferred
to T*. In this case T* can be discarded as an estimator; T* is called inadmissible for estimating
q(θ) and T admissible.
Example 2.4 Suppose (X_1, ..., X_n) is a random sample from a Bernoulli distribution with
P(x_i = 1) = p and n = 25. Consider the following two estimators for p ∈ [0, 1]:

T = (1/n) ∑_{i=1}^n X_i   and   T* = (1/(n+1)) ∑_{i=1}^n X_i.

Their means and variances are

ET = EX_i = p,   var(T) = p(1-p)/n,
ET* = (n/(n+1)) EX_i = np/(n+1),   var(T*) = (n/(n+1)²) p(1-p).

Note that T is unbiased and T* is biased. However, T has a larger variance than T*. The
MSEs of the two estimators are

MSE(T) = var(T) + (ET - p)² = p(1-p)/25,
MSE(T*) = (n/(n+1)²) p(1-p) + (np/(n+1) - p)² = p(1-p)/27.04 + p²/676.

The relative efficiency of T w.r.t. T* is

RE(T, T*) = MSE(T*)/MSE(T) = .9246 + .037 p/(1-p).

Since this ratio depends on the unknown value of p, we must consider all the possible contingencies
for p ∈ [0, 1]. Note that for

p → 0:   RE(T, T*) → .9246 < 1,

while for

p → 1:   RE(T, T*) → ∞ > 1.

Hence, neither estimator is preferred to the other on the basis of MSE, and thus neither estimator
is inadmissible relative to the other.
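The following sketch (purely illustrative; the grid of p-values is an arbitrary choice) evaluates the two MSE formulas from the example over p ∈ [0, 1] and confirms numerically that neither estimator dominates the other.

```python
import numpy as np

n = 25
p = np.linspace(0.01, 0.99, 99)

mse_T = p * (1 - p) / n                                                  # unbiased estimator T
mse_Tstar = n * p * (1 - p) / (n + 1) ** 2 + (n * p / (n + 1) - p) ** 2  # biased estimator T*

re = mse_Tstar / mse_T            # relative efficiency RE(T, T*)
print("RE at p = 0.01:", re[0])   # close to .9246, so T* is preferred for small p
print("RE at p = 0.99:", re[-1])  # well above 1,    so T  is preferred for large p
print("T* has the smaller MSE for p up to about", p[re < 1].max())
```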
A natural question to ask is whether or not an MSE-optimal estimator exists that has for all
θ ∈ Ω the smallest MSE among all estimators for q(θ). In general, no such MSE-optimal
estimator exists. This can be shown as follows.
Assume that we want to estimate the scalar θ. Consider the degenerate estimator T_1 = t_1(X) =
θ_1 (fixed value, ignoring the sample information) with

MSE_θ(T_1) = (θ - θ_1)²,   so that   MSE_θ(T_1) = 0 for θ = θ_1.

Now we can define for each value of θ ∈ Ω such a degenerate estimator. Then for an estimator,
say T, to have minimum MSE for every potential value of θ, it would be necessary that

MSE_θ(T) = 0   ∀ θ ∈ Ω.

(Otherwise, we would find a θ-value where the corresponding degenerate estimator has a smaller
MSE than T.) However, note that MSE_θ(T) = 0 implies that

E_θT = θ   and   var_θ(T) = 0   ∀ θ ∈ Ω,

which would require an error-free estimator for every θ and is not attainable in general.
Definition (Unbiasedness): An estimator T is said to be an unbiased estimator of q(θ) iff
E_θT = q(θ) ∀ θ ∈ Ω.

Unbiasedness means that the mean of the estimator's distribution is equal to the parameter to
be estimated. Hence, an unbiased estimator has the appealing property that its outcomes are
equal to q(θ) on the average.
Example 2.5 Let (X_1, ..., X_n) be a random sample from an exponential distribution with pdf

f(x; θ) = (1/θ) e^{-x/θ} I_{(0,∞)}(x),   with   EX = θ.

The estimator T = (1/n) ∑_{i=1}^n X_i is unbiased for θ, since

ET = (1/n) ∑_{i=1}^n EX_i = θ.
Example 2.6 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². Consider the sample variance

T = (1/n) ∑_{i=1}^n (X_i - X̄)²,   with   X̄ = (1/n) ∑_{i=1}^n X_i.

Using (1/n) ∑_{i=1}^n (X_i - X̄)² = (1/n) ∑_{i=1}^n (X_i - μ)² - (X̄ - μ)², the expectation of T is

ET = (1/n) E ∑_{i=1}^n (X_i - μ)² - E(X̄ - μ)² = var(X) - var(X̄) = σ² - σ²/n = (1 - 1/n) σ² ≠ σ².

Hence, the sample variance T is a biased estimator for the population variance σ². An unbiased
estimator obtains as

S² = (1/(1 - 1/n)) T = (1/(n-1)) ∑_{i=1}^n (X_i - X̄)².
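A small simulation (a sketch with arbitrary choices of population, sample size, and replication count) makes the bias visible: the divisor-n estimator is centered at (1 - 1/n)σ², while the divisor-(n-1) estimator is centered at σ².

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 0.0, 1.0, 10, 100_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
t = x.var(axis=1, ddof=0)    # T  = (1/n)     * sum (X_i - X̄)²
s2 = x.var(axis=1, ddof=1)   # S² = (1/(n-1)) * sum (X_i - X̄)²

print("E[T]  =", t.mean(), "(theory:", (1 - 1 / n) * sigma2, ")")
print("E[S²] =", s2.mean(), "(theory:", sigma2, ")")
```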
In addition to this desirable property, we also want an estimator with a distribution that is not
too spread out, which could otherwise generate an estimate far away from q(θ). This motivates
the objective that an estimator has minimum variance among all unbiased estimators, which
is formally defined as follows.

Definition (Minimum variance unbiased estimator (MVUE) (scalar case)): An estimator T
is said to be a minimum-variance unbiased estimator of q(θ) iff

1. E_θT = q(θ) ∀ θ ∈ Ω,
2. var_θ(T) ≤ var_θ(T*) ∀ θ ∈ Ω and for every other unbiased estimator T*.

The definition states that an estimator is a MVUE if the estimator is unbiased and if there is no
other unbiased estimator that has a smaller variance for any θ ∈ Ω. Note that the MVUE has
the smallest MSE within the class of unbiased estimators. (Remember that MSE = Variance
+ Bias².)
Unfortunately, without the aid of theorems that facilitate the discovery of MVUEs, finding a
MVUE for q(θ) is, if such an estimator exists at all, typically quite challenging.²
Hence, one sometimes restricts the attention to estimators that are unbiased and that have the
minimum variance among all unbiased estimators that are linear. Those estimators, which are
called BLUE, are defined as follows.

² For an example showing how to find the MVUE without the aid of theorems see Mittelhammer (1996,
Example 7.4).
Definition (Best linear unbiased estimator (BLUE) (scalar case)): An estimator T is said to
be a BLUE of q(θ) iff

1. T is a linear function of the random sample X = (X_1, ..., X_n)′, i.e.,
   T = a′X = a_1 X_1 + ... + a_n X_n,
2. E_θT = q(θ) ∀ θ ∈ Ω,
3. var_θ(T) ≤ var_θ(T*) ∀ θ ∈ Ω and for every other linear and unbiased estimator T*.
Note that the BLUE has the smallest MSE within the class of linear and unbiased estimators.
Example 2.7 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². The BLUE for μ is obtained as follows.
As a linear estimator the BLUE must have the form

T = a_0 + a_1 X_1 + ... + a_n X_n.

The expectation of T is

ET = a_0 + a_1 EX_1 + ... + a_n EX_n = a_0 + μ ∑_{i=1}^n a_i,

so that unbiasedness for every μ requires

∑_{i=1}^n a_i = 1   and   a_0 = 0.

The variance of T is

var(T) = σ² ∑_{i=1}^n a_i² = σ² [ ∑_{i=1}^{n-1} a_i² + (1 - ∑_{i=1}^{n-1} a_i)² ],

where the unbiasedness constraint was used to substitute a_n = 1 - ∑_{i=1}^{n-1} a_i. Minimizing
the variance w.r.t. the a_i's yields the first-order conditions

∂var(T)/∂a_i = σ² (2a_i - 2a_n) = 0,   i = 1, ..., n-1,

so that a_i = a_n for all i. Together with ∑_{i=1}^n a_i = 1 this implies that

a_i = 1/n,   i = 1, ..., n.

Hence, the BLUE for μ is the sample mean T = (1/n) ∑_{i=1}^n X_i.
A prominent BLUE arises in the context of least-square estimation of the parameters of a linear
regression model, which we will discuss in the next chapter.
2.2.2. Asymptotic Properties

A consistent estimator converges in probability to what is being estimated. Thus, for large
enough n, there is a high probability that the outcome of T_n will be in the interval
[q(θ) - ε, q(θ) + ε] for arbitrarily small ε > 0.
Equivalently, the sampling density of T_n concentrates on the true value q(θ) as n → ∞.
Recall that convergence in mean square implies convergence in probability. Hence,

T_n →m q(θ)   ⟹   T_n →p q(θ),

where convergence in mean square obtains iff

lim_{n→∞} E T_n = q(θ)   and   lim_{n→∞} var(T_n) = 0.

Example 2.8 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². The sample mean X̄_n is a consistent estimator for μ, since E X̄_n = μ and
var(X̄_n) = σ²/n → 0 as n → ∞.
Now, consider as an alternative estimator T_n* = (1/(n-k)) ∑_{i=1}^n X_i for some fixed constant
k. Although T_n* is a biased estimator for μ, it is consistent, since

E T_n* = (n/(n-k)) μ → μ   and   var(T_n*) = (n/(n-k)²) σ² → 0   as n → ∞.
This example shows that we typically have many consistent estimators for an estimation problem.
The following example shows that an estimator can be consistent for q(θ) without converging
in mean square to q(θ).

Example 2.9 Let θ be a population parameter. Consider an estimator T_n for θ with two
possible outcomes {θ, n} and pdf

f(t_n; θ) = (1 - 1/n) I_{θ}(t_n) + (1/n) I_{n}(t_n).

T_n is consistent for θ, since P(T_n = θ) = 1 - 1/n → 1 as n → ∞. However,

E T_n = θ (1 - 1/n) + n (1/n) → θ + 1   as n → ∞,

so that the expectation does not converge to θ, and

E(T_n - θ)² = (1/n)(n - θ)² → ∞   as n → ∞:

T_n does not converge in mean square to θ.
Note that this behavior of the expectations is due to the fact that the pdf of T_n, although
collapsing to the point θ, ensuring consistency, is not collapsing at a fast enough rate for the
expectation to converge to θ.
An important class of estimators for asymptotic analysis is the class of consistent and asymptotically
normal (CAN) estimators: T_n is a CAN estimator of q(θ) iff

√n (T_n - q(θ)) →d N(0, σ²),   so that asymptotically   T_n ~a N(q(θ), σ²/n).

Note that a CAN estimator is indeed consistent, since by Slutsky's theorem

T_n - q(θ) = (1/√n) · √n (T_n - q(θ)) →d 0 · Z = 0,   where Z ~ N(0, σ²),

and convergence in distribution to the constant 0 implies T_n →p q(θ).
Example 2.10 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². Then the sample mean T_n = X̄_n = (1/n) ∑_{i=1}^n X_i is a CAN estimator for μ,
since by the Lindeberg-Lévy CLT

√n (X̄_n - μ) →d N(0, σ²)   and   X̄_n ~a N(μ, σ²/n).

By defining the class of CAN estimators in terms of the limiting distribution of the transformation
√n (T_n - q(θ)) (which utilizes as centering sequence the quantity to be estimated and √n as
scaling sequence), the asymptotic distribution

T_n ~a N(q(θ), σ²/n)

is uniquely determined. However, by Slutsky's theorem it follows that for any constant k < n

√(n/(n-k)) · √n (T_n - q(θ)) →d 1 · Z,   where Z ~ N(0, σ²),

since √(n/(n-k)) →p 1, and thus also

T_n ~a N(q(θ), ((n-k)/n²) σ²).
Hence, we have for Tn two alternative asymptotic distributions which would lead to different
asymptotic properties. The problem is that centering and scaling required to achieve a limiting
distribution is not unique. By restricting the use of asymptotic properties to the class of CAN
estimators which utilize the same sequence of centering and scaling we avoid nonuniqueness
of asymptotic properties.
Asymptotic versions of MSE, bias and variance can be defined w.r.t. the unique asymptotic
distribution of CAN estimators. The asymptotic MSE for a CAN estimator T_n for the
scalar q(θ) with T_n ~a N(q(θ), σ²/n) is

AMSE_θ(T_n) = E_A(T_n - q(θ))² = [Asymptotic Bias]² + Avar(T_n) = Avar(T_n) = σ²/n,

since the asymptotic bias of a CAN estimator is zero. The asymptotic relative efficiency of a
CAN estimator T_n w.r.t. another CAN estimator T_n* (with asymptotic variances σ_T²/n and
σ_{T*}²/n) is accordingly

AMSE_θ(T_n*) / AMSE_θ(T_n) = σ_{T*}² / σ_T².
If the estimator T_n is asymptotically relatively more efficient than T_n*, then T_n* is called
asymptotically inadmissible. Otherwise, T_n* is asymptotically admissible.
The definition of asymptotic relative efficiency given above refers to CAN estimators for a scalar
q(θ). For a multivariate generalization of this definition to CAN estimators for vectors q(θ)
see Mittelhammer (1996, Def. 7.16).
The definition of asymptotic relative efficiency suggests to define asymptotic efficiency in
terms of a choice of estimator in the CAN class that has uniformly the smallest variance.
However, such an estimator does not exist without further restrictions on the CAN class. In
particular, one can show that for any CAN estimator, there is an alternative estimator that
has a smaller variance for at least one θ ∈ Ω. Hence, we cannot define an achievable lower
bound to the asymptotic variance of CAN estimators.
On the other hand, one can show that under mild regularity conditions there does exist a lower
bound for the asymptotic variance of a CAN estimator that holds for all θ ∈ Ω except on
a finite set of θ-values (this is the so-called Cramér-Rao lower bound). This result, shown
by LeCam (1953)³, allows us to state a general definition of asymptotic efficiency for CAN
estimators.
Definition (Asymptotic efficiency (scalar case)): If T_n* is a CAN estimator of q(θ) having
the smallest asymptotic variance among all CAN estimators ∀ θ ∈ Ω, except on a finite
set of θ-values, T_n* is said to be asymptotically efficient.

³ Lucien Marie LeCam (1953), On some asymptotic properties of maximum likelihood estimates and related
Bayes estimates. University of California Publications in Statistics.

2.3. Sufficient Statistics
As we shall see later, sufficient statistics facilitate the construction of estimators with the MVUE
property or small MSEs.
Definition (Sufficient statistics): Let (X_1, ..., X_n) ~ f(x_1, ..., x_n; θ) be a random sample,
and let S_1 = s_1(X_1, ..., X_n), ..., S_r = s_r(X_1, ..., X_n) be r statistics. The r statistics are
said to be sufficient statistics for f(x; θ) iff

f(x_1, ..., x_n; θ | s_1, ..., s_r) = h(x_1, ..., x_n),

i.e., the conditional density of x, given s = [s_1, ..., s_r]′, does not depend on the parameter θ.

In order to interpret this definition, note first that the conditional pdf f(x_1, ..., x_n; θ | s_1, ..., s_r)
represents the probability distribution of the various ways in which the sample outcomes x occur
so as to generate exactly the value s = (s_1, ..., s_r)′. According to the definition, this probability
distribution has nothing to do with θ if S = (S_1, ..., S_r) is sufficient.
Thus analyzing the various ways in which a given value s can occur cannot provide any additional
information about θ, since the behavior of the outcomes of x, conditional on the fact
that s(x) = s, is totally unrelated to θ.
Example 2.11 Let X = (X_1, X_2, X_3)′ be a random sample from a Bernoulli population with
P (x = 1) = p. Consider the two statistics
S = s(X) = X1 + X2 + X3
and
T = t(X) = X1 X2 + X3 .
We now want to show that S is sufficient for p and T is not. The conditional pdfs f (x1 , x2 , x3 ; p | s)
and f (x1 , x2 , x3 ; p | t) are represented in the following table.
Values in the   Values   Values   f(x; p | s(x) = s)   f(x; p | t(x) = t)
range R(X)      of S     of T
(0, 0, 0)       0        0        1                    (1-p)/(1+p)
(0, 0, 1)       1        1        1/3                  (1-p)/(1+2p)
(0, 1, 0)       1        0        1/3                  p/(1+p)
(1, 0, 0)       1        0        1/3                  p/(1+p)
(0, 1, 1)       2        1        1/3                  p/(1+2p)
(1, 0, 1)       2        1        1/3                  p/(1+2p)
(1, 1, 0)       2        1        1/3                  p/(1+2p)
(1, 1, 1)       3        2        1                    1
The conditional probabilities given in the last two columns are obtained as follows. For instance,
the probability P(x_1 = 0, x_2 = 1, x_3 = 0 | s = 1) is obtained as

P(x_1 = 0, x_2 = 1, x_3 = 0 | s = 1)
  = P(x_1 = 0, x_2 = 1, x_3 = 0, s = 1) / P(s = 1)              (Def.)
  = p(1-p)² / [ (3 choose 1) p(1-p)² ]                          (since S ~ binomial(n = 3, p))
  = 1/3.

Analogously, for the statistic T we have, e.g.,

P(x_1 = 0, x_2 = 1, x_3 = 0 | t = 0) = p(1-p)² / [2p(1-p)² + (1-p)³] = p/(1+p).

Since the conditional pdf f(x_1, x_2, x_3; p | s) has nothing to do with the value of p, the
statistic S is sufficient. However, the conditional pdf f(x_1, x_2, x_3; p | t) depends on p; so T is
not sufficient.
In any problem of estimating q(θ), once the outcome of a set of sufficient statistics is observed,
the random sample outcome x can effectively be ignored for the remainder of the estimation
problem since s(x) captures all the relevant information that the sample has to offer regarding
q(θ).
On the other hand, this implies that any estimator which is not based upon a sufficient statistic
must be inefficient since it does not capture all the relevant information that the sample has to
offer.
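A brute-force check of the example above (an illustrative sketch, not part of the notes) enumerates all outcomes of (X_1, X_2, X_3), computes the conditional pmf of the sample given S and given T for two values of p, and confirms that only the conditional pmf given S is free of p.

```python
from itertools import product
from collections import defaultdict

def conditional_pmfs(p, stat):
    """Conditional pmf of the Bernoulli(p) sample x, n = 3, given the value of stat(x)."""
    joint = {x: p ** sum(x) * (1 - p) ** (3 - sum(x)) for x in product((0, 1), repeat=3)}
    by_value = defaultdict(float)
    for x, prob in joint.items():
        by_value[stat(x)] += prob                 # marginal pmf of the statistic
    return {x: prob / by_value[stat(x)] for x, prob in joint.items()}

S = lambda x: x[0] + x[1] + x[2]
T = lambda x: x[0] * x[1] + x[2]

for stat, name in [(S, "S"), (T, "T")]:
    pmf_a = conditional_pmfs(0.3, stat)
    pmf_b = conditional_pmfs(0.7, stat)
    depends_on_p = any(abs(pmf_a[x] - pmf_b[x]) > 1e-12 for x in pmf_a)
    print(f"conditional pmf given {name} depends on p: {depends_on_p}")
# prints False for S (sufficient) and True for T (not sufficient)
```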
A practical problem in the use of sufficient statistics is their identification. A criterion which can
be helpful for identification of sufficient statistics is that given by the Neyman factorization
theorem.
Theorem 2.1 (Neyman's Factorization Theorem) Let f(x; θ) be the pdf of the random
sample (X_1, ..., X_n). The statistics S_1, ..., S_r are sufficient statistics for f(x; θ) iff f(x; θ)
can be factored as

f(x; θ) = g(s_1(x), ..., s_r(x); θ) h(x),

where g is a function of only s_1(x), ..., s_r(x) and θ, and h(x) does not depend on θ.

Proof
(Discrete case) Sufficiency of the factorization: Suppose that the factorization criterion is met. Let
B(a_1, ..., a_r) denote the set of sample outcomes x generating s_1 = a_1, ..., s_r = a_r for fixed
values a_1, ..., a_r. Then

P(s_1 = a_1, ..., s_r = a_r; θ) = ∑_{x ∈ B(a)} f(x; θ) = g(a_1, ..., a_r; θ) ∑_{x ∈ B(a)} h(x).

Furthermore, we have for x ∈ B(a)

f(x; θ | s_1 = a_1, ..., s_r = a_r) = f(x; θ) / P(s_1 = a_1, ..., s_r = a_r; θ)
  = g(a; θ) h(x) / [ g(a; θ) ∑_{x ∈ B(a)} h(x) ]
  = h(x) / ∑_{x ∈ B(a)} h(x) = h*(x),

which does not depend on θ. Hence, if the factorization criterion is met, S_1, ..., S_r are sufficient
statistics.
Necessity of the factorization: Suppose S_1, ..., S_r are sufficient statistics. Note that by the definition
of the (discrete) conditional pdf

P(x, s_1 = a_1, ..., s_r = a_r; θ) = f(x; θ | s_1 = a_1, ..., s_r = a_r) P(s_1 = a_1, ..., s_r = a_r; θ).

Now, since P(x, s_1 = a_1, ..., s_r = a_r; θ) = f(x; θ) for x ∈ B(a), we get

f(x; θ) = f(x; θ | s_1 = a_1, ..., s_r = a_r) P(s_1 = a_1, ..., s_r = a_r; θ),

where the first factor does not depend on θ (by sufficiency) and corresponds to h(x), while the
second factor corresponds to g(s_1(x), ..., s_r(x); θ).
Hence, if S_1, ..., S_r are sufficient statistics, we can factor f(x; θ) into the product of a function g of
the s_i's and θ and a function h which does not depend on θ.
Example 2.12 Let (X_1, ..., X_n) be a random sample from a Bernoulli population with pdf

f(x; p) = p^x (1-p)^{1-x} I_{0,1}(x),   p ∈ [0, 1].

The joint pdf of the random sample is

f(x; p) = ∏_{i=1}^n p^{x_i} (1-p)^{1-x_i} I_{0,1}(x_i) = p^{∑_{i=1}^n x_i} (1-p)^{n - ∑_{i=1}^n x_i} · ∏_{i=1}^n I_{0,1}(x_i).

Setting S = ∑_{i=1}^n X_i, the first factor corresponds to g(s(x); p) and the last factor to h(x).
Hence, by Neyman's factorization theorem, S = ∑_{i=1}^n X_i is a sufficient statistic for f(x; p).
It follows that the value of the sum of the sample outcomes contains all the sample information
relevant for estimating q(p). Suppose, e.g., that n = 3 and that we observe s = 2. Then it is
irrelevant which of the following outcomes has generated s = 2:

x = (1, 1, 0),   x = (1, 0, 1),   x = (0, 1, 1).

Example 2.13 Let (X_1, ..., X_n) be a random sample from a N(μ, σ²) population with θ =
(μ, σ²)′. The joint pdf of the random sample is given by

f(x_1, ..., x_n; θ) = ∏_{i=1}^n (1/√(2πσ²)) e^{-(x_i - μ)²/(2σ²)}
  = (2πσ²)^{-n/2} e^{-(1/(2σ²)) ∑_{i=1}^n (x_i - μ)²}
  = (σ²)^{-n/2} e^{-(1/(2σ²)) (∑_{i=1}^n x_i² - 2μ ∑_{i=1}^n x_i + nμ²)} · (2π)^{-n/2}.

Setting S_1 = ∑_{i=1}^n X_i and S_2 = ∑_{i=1}^n X_i², the first factor corresponds to g(s_1(x), s_2(x); θ),
while the last factor (2π)^{-n/2} is independent of μ and σ² and corresponds to h(x). Hence, by
Neyman's factorization theorem, S_1 = ∑_{i=1}^n X_i and S_2 = ∑_{i=1}^n X_i² are sufficient statistics
for f(x; θ).
For further examples see Mittelhammer (1996) and Mood, Graybill and Boes (1974).
The use of Neyman's factorization criterion for identifying sufficient statistics requires that we
are able to define the appropriate g(s(x); θ) and h(x) functions that achieve the required
factorization. However, the appropriate definition of such functions is not always readily apparent.
In the following section we will discuss an approach that might be useful for providing direction
to the search for sufficient statistics.

2.4. Minimal Sufficient Statistics

Definition (Minimal sufficient statistics): Let S = s(X) be a sufficient statistic for f(x; θ).
S is said to be a minimal sufficient statistic iff, for every other sufficient statistic T = t(X),
there exists a function h_T such that S = h_T(T) for x ∈ R_Ω(x).

The notation for the sample space R_Ω(x) indicates that the range of x is taken over all θ's
in the parameter space Ω. If the support of the pdf does not change with θ (e.g., Normal,
Gamma, etc.) then R_Ω(x) = R(x).
This definition implies that the minimal sufficient statistic S utilizes the minimal set of
points for representing the sample information. This follows from the fact, that, by definition,
a function can never have more elements in its range than in its domain. (Recall that for each
argument we have only one function value, but one function value might be associated with
more than one argument.) Thus, if S = hT (T ) for any other sufficient statistic T , then the
number of elements in R(S) does not exceed the number of elements in R(T ), for any sufficient
statistic T . Hence minimal sufficient statistics provide the most parsimonious representation
of the sample information about the unknown parameters.
The definition is of little use in finding minimal sufficient statistics. Lehmann-Scheffé's Minimal
Sufficiency Theorem provides an approach for finding minimal sufficient statistics. Rather than
presenting this theorem, we consider a corollary of the theorem, which is helpful for identifying
minimal sufficient statistics in cases where the sample range is independent of the distribution
parameter.⁴

Corollary 2.1 (Minimal Sufficiency when R is independent of θ) Let x ~ f(x; θ), and
suppose that R(x) does not depend on θ. If the statistic S = s(x) is such that the ratio

f(x; θ) / f(y; θ)

does not depend on θ iff (x, y) are such that s(x) = s(y), then S is a minimal sufficient
statistic for f(x; θ).

This Lehmann-Scheffé result for identifying a minimal sufficient statistic requires that we are
able to find an appropriate function S = s(x). However, in many cases this result allows us to
transform the problem into one where a choice of S is readily apparent.
Example 2.14 Let (X_1, ..., X_n) be a random sample from a nondegenerate Bernoulli population
with joint pdf

f(x; p) = p^{∑_{i=1}^n x_i} (1-p)^{n - ∑_{i=1}^n x_i} ∏_{i=1}^n I_{0,1}(x_i)   for p ∈ (0, 1).

The Lehmann-Scheffé procedure for identifying a minimal sufficient statistic for p requires the
examination of the ratio

f(x; p) / f(y; p) = p^{∑_{i=1}^n x_i - ∑_{i=1}^n y_i} (1-p)^{∑_{i=1}^n y_i - ∑_{i=1}^n x_i},

which does not depend on p iff the constraint ∑_{i=1}^n x_i = ∑_{i=1}^n y_i is imposed.
Since the sample range R(x) is independent of p, it follows by Corollary 2.1 that ∑_{i=1}^n X_i is a
minimal sufficient statistic for p.

⁴ For a discussion of the Lehmann-Scheffé Minimal Sufficiency Theorem, which includes cases where the
sample range is dependent on the distribution parameter, see Mittelhammer (1996, p. 395-396).
The exponential class of distributions represents a collection of parametric families of distributions
for which minimal sufficient statistics are straightforwardly defined.

Theorem 2.2 (Exponential class and sufficient statistics) Let f(x; θ) be a member of the
exponential class of density functions

f(x; θ) = exp[ ∑_{i=1}^k c_i(θ) g_i(x) + d(θ) + z(x) ] I_A(x).

Then s(x) = [g_1(x), ..., g_k(x)]′ is a k-variate sufficient statistic, and if c_1(θ), ..., c_k(θ) are linearly
independent, the sufficient statistic is a minimal sufficient statistic.

Proof
That s(x) is a sufficient statistic follows directly from the Neyman factorization theorem by defining

g(s(x); θ) = exp[ ∑_{i=1}^k c_i(θ) g_i(x) + d(θ) ]   and   h(x) = exp[z(x)] I_A(x)

in the theorem.
That s(x) is a minimal sufficient statistic follows from the Lehmann-Scheffé approach of Corollary
2.1. In fact note that

f(x; θ) / f(y; θ) = exp{ ∑_{i=1}^k c_i(θ) [g_i(x) - g_i(y)] + z(x) - z(y) }.

Assuming that the c_i(θ)'s are linearly independent, this ratio does not depend on θ iff
x and y satisfy the constraints g_i(x) = g_i(y) for i = 1, ..., k.
Example 2.15 Let (X_1, ..., X_n) be a random sample from a Gamma population with a joint
pdf which belongs to the exponential class

f(x; α, β) = ∏_{i=1}^n (1/(Γ(α) β^α)) x_i^{α-1} e^{-x_i/β}
  = exp[ (α - 1) ∑_{i=1}^n ln x_i - (1/β) ∑_{i=1}^n x_i - n ln(Γ(α) β^α) ] ∏_{i=1}^n I_{(0,∞)}(x_i),

with c_1(α, β) = α - 1, g_1(x) = ∑_{i=1}^n ln x_i, c_2(α, β) = -1/β and g_2(x) = ∑_{i=1}^n x_i.
Thus, by Theorem 2.2 regarding the exponential class and sufficient statistics it follows that
[g_1(x), g_2(x)]′ = [∑_{i=1}^n ln x_i, ∑_{i=1}^n x_i]′ is a bivariate minimal sufficient statistic for (α, β).
Sufficient statistics are not unique. This means that any one-to-one (i.e., invertible) function
of a (minimal) sufficient statistic S is also a (minimal) sufficient statistic. This fact follows
from the observation that a one-to-one transformation of a (minimal) sufficient statistic S
provides the same sample information about the unknown parameter as that provided by S.
The following theorem formalizes this observation.
Theorem 2.3 (One-to-one functions of sufficient statistics) Let S = s(X) be a (minimal)
sufficient statistic for f(x; θ) and let τ be an invertible function. Then τ(S) is also a (minimal)
sufficient statistic for f(x; θ).

Proof
Since τ is assumed to be invertible, it follows that s(x) = τ^{-1}{τ[s(x)]}.
Furthermore, since s(x) is sufficient, it follows by Neyman's factorization theorem that

f(x; θ) = g(s(x); θ) h(x) = g(τ^{-1}{τ[s(x)]}; θ) h(x) = g*(τ[s(x)]; θ) h(x),

so that, again by the factorization theorem, τ[s(x)] is sufficient.
Example 2.16 Let (X_1, ..., X_n) be a random sample from a N(μ, σ²) population with θ =
(μ, σ²)′.
Recall that S = (∑_{i=1}^n X_i, ∑_{i=1}^n X_i²) is a bivariate sufficient statistic for estimating θ.
Furthermore, note that by Corollary 2.1, S is also a minimal sufficient statistic.
Consider the sample mean X̄_n = (1/n) ∑_{i=1}^n X_i and the sample variance S² =
(1/(n-1)) ∑_{i=1}^n (X_i - X̄_n)², which define an invertible function of S:

(X̄_n, S²) = τ(∑_{i=1}^n X_i, ∑_{i=1}^n X_i²).

Hence, by the preceding theorem, (X̄_n, S²) is also a minimal sufficient statistic for θ.
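A small numerical check (a sketch; the data vector is an arbitrary choice) shows that the map between (∑X_i, ∑X_i²) and (X̄_n, S²) is indeed one-to-one, i.e. each pair can be recovered from the other.

```python
import numpy as np

x = np.array([1.2, -0.7, 3.1, 0.4, 2.2])
n = x.size

s1, s2 = x.sum(), (x ** 2).sum()             # sufficient statistic (S1, S2)
xbar = s1 / n                                # tau: (S1, S2) -> (X̄_n, S²)
s2var = (s2 - n * xbar ** 2) / (n - 1)

s1_back = n * xbar                           # tau^{-1}: (X̄_n, S²) -> (S1, S2)
s2_back = (n - 1) * s2var + n * xbar ** 2

print(np.isclose(s1, s1_back), np.isclose(s2, s2_back))   # True True
print(np.isclose(s2var, x.var(ddof=1)))                    # True
```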
2.5. Minimum Variance Unbiased Estimation

In the first subsection we will derive a lower bound for the variance of unbiased estimators,
the Cramér-Rao Lower Bound (CRLB), and show how it can be useful in finding MVUEs.
In the second subsection we introduce the concept of complete sufficient statistics and show
how it can sometimes be used to identify MVUEs.
In the following discussion we will focus on the case where the parameter θ as well as q(θ) are
scalars and where the sampling distribution is continuous. For an extension to the multivariate
and/or discrete case see Mittelhammer (1996), p. 408-418.

2.5.1. Cramér-Rao Lower Bound (CRLB)
Definition (CRLB regularity conditions (scalar case)):

1. The parameter space Ω for the parameter θ indexing the pdf f(x; θ) is an open interval
   with Ω ⊂ R.
2. The support of f(x; θ), say A, is the same ∀ θ ∈ Ω.
3. ∂ ln f(x; θ)/∂θ exists and is finite ∀ x ∈ A and ∀ θ ∈ Ω.
4. We can differentiate under the integral as follows:
   ∂/∂θ ∫ ... ∫ f(x; θ) dx_1 ... dx_n = ∫ ... ∫ ∂f(x; θ)/∂θ dx_1 ... dx_n.
5. For all unbiased estimators t(x) for q(θ) with finite variance, we can differentiate under
   the integral as follows:
   ∂/∂θ ∫ ... ∫ t(x) f(x; θ) dx_1 ... dx_n = ∫ ... ∫ t(x) ∂f(x; θ)/∂θ dx_1 ... dx_n.
6. 0 < E_θ[ (∂ ln f(x; θ)/∂θ)² ] < ∞ ∀ θ ∈ Ω.

In practice, the CRLB regularity conditions (1), (2), (3), (4), and (6) are generally not difficult
to verify. However, condition (5) can be complicated, since it must hold true for all unbiased
estimators of q(θ).
There is a wide class of distributions, namely the exponential class, that satisfies the CRLB
regularity conditions (see Mittelhammer, 1996, Theorem 7.15).

Theorem 2.4 (Cramér-Rao Lower Bound (scalar case)) Let X_1, ..., X_n be a random sample
from a population with pdf f(x; θ) and let T = t(X) be an unbiased estimator for q(θ). Then
under the CRLB regularity conditions for the joint pdf f(x; θ) given above

var_θ(T) ≥ [dq(θ)/dθ]² / ( n E_θ[ (∂ ln f(X; θ)/∂θ)² ] ).

Equality prevails iff there exists a function, say K(θ, n), such that

∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = K(θ, n) [t(x) - q(θ)].
Proof
Since T = t(x) is an unbiased estimator for q(θ), we have

q(θ) = ∫ t(x) f(x; θ) dx.

Differentiating both sides w.r.t. θ (using condition 5) yields

dq(θ)/dθ = ∫ t(x) ∂f(x; θ)/∂θ dx
  = ∫ t(x) ∂f(x; θ)/∂θ dx - q(θ) ∫ ∂f(x; θ)/∂θ dx            (the last integral is 0 by condition 4)
  = ∫ [t(x) - q(θ)] ∂f(x; θ)/∂θ dx
  = ∫ [t(x) - q(θ)] [∂ ln f(x; θ)/∂θ] f(x; θ) dx              (since ∂ ln f/∂θ = (1/f) ∂f/∂θ)
  = E_θ{ [t(x) - q(θ)] ∂ ln f(x; θ)/∂θ }.

By the Cauchy-Schwarz inequality,

[dq(θ)/dθ]² ≤ E_θ[t(x) - q(θ)]² · E_θ[ (∂ ln f(x; θ)/∂θ)² ] = var_θ[t(x)] · E_θ[ (∂ ln f(x; θ)/∂θ)² ],

so that

var_θ[t(x)] ≥ [dq(θ)/dθ]² / E_θ[ (∂ ln f(x; θ)/∂θ)² ].

By the independence of the X_i's in x, it follows for the denominator of the r.h.s. that

E_θ[ (∂ ln f(x; θ)/∂θ)² ] = E_θ[ ( ∑_{i=1}^n ∂ ln f(X_i; θ)/∂θ )² ]
  = ∑_{i=1}^n E_θ[ (∂ ln f(X_i; θ)/∂θ)² ] + ∑_{i≠j} E_θ[ ∂ ln f(X_i; θ)/∂θ ] E_θ[ ∂ ln f(X_j; θ)/∂θ ]
  = n E_θ[ (∂ ln f(X; θ)/∂θ)² ],

since the cross terms vanish by noting that

E_θ[ ∂ ln f(X; θ)/∂θ ] = ∫ [∂ ln f(x; θ)/∂θ] f(x; θ) dx = ∫ ∂f(x; θ)/∂θ dx = ∂/∂θ ∫ f(x; θ) dx = 0.

Hence

var_θ(T) ≥ [dq(θ)/dθ]² / ( n E_θ[ (∂ ln f(X; θ)/∂θ)² ] ),

which completes the proof for the first part of the CRLB-theorem.
The inequality in the Cauchy-Schwarz inequality used above becomes an equality iff one function
is proportional to the other, i.e.

∂ ln f(x; θ)/∂θ = K(θ, n) [t(x) - q(θ)],

which is equivalent to the fact that there exists a factor K(θ, n) independent of x such that

∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = K(θ, n) [t(x) - q(θ)].

The CRLB regularity conditions, which were stated for continuous sampling distributions, can
be modified for discrete sampling distributions, leaving the statement of the CRLB-theorem
unchanged. Furthermore, the CRLB regularity conditions, which were stated for the case where
θ and q(θ) are scalars, can be modified for the case where θ and q(θ) are vectors, leading
to a multivariate version of the CRLB-theorem (see Mittelhammer, 1996, Theorems 7.16 and
7.17).
The CRLB theorem has two uses. First, it gives a lower bound for the variance of unbiased
estimators (first part of the theorem). Second, if an unbiased estimator whose variance coincides
with the CRLB can be found, then this estimator is the MVUE. The condition under which
an unbiased estimator has a variance that achieves the CRLB (second part of the theorem)
aids finding a MVUE.
In fact, if there exists an unbiased estimator T = t(x) such that

∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = K(θ, n) [t(x) - q(θ)]

for some function K(θ, n), then var(T) coincides with the CRLB and T is the MVUE.
Example 2.17 Let (X_1, ..., X_n) be a random sample from an exponential distribution with pdf

f(x; θ) = θ e^{-θx},   x ∈ (0, ∞), θ ∈ (0, ∞),   with   EX = 1/θ,   var(X) = 1/θ².

Find the CRLB for the variance of unbiased estimators of θ and 1/θ, and find the MVUEs of
θ and 1/θ.
First, we need to show that the six CRLB-conditions are satisfied.

1. Since Ω = (0, ∞), the parameter space is an open interval. ✓
2. The support of the exponential density is the same for all θ ∈ Ω. ✓
3. Note that the joint pdf of the random sample is

   f(x; θ) = θ^n e^{-θ ∑_{i=1}^n x_i}   and   ln f(x; θ) = n ln θ - θ ∑_{i=1}^n x_i,

   so that

   ∂ ln f(x; θ)/∂θ = n/θ - ∑_{i=1}^n x_i,

   which exists and is finite ∀ x and ∀ θ ∈ Ω. ✓
4. Condition 4 requires that ∂/∂θ ∫ f(x; θ) dx = ∫ ∂f(x; θ)/∂θ dx. For the r.h.s. we have

   ∫ ∂f(x; θ)/∂θ dx = E_θ[ ∂ ln f(x; θ)/∂θ ] = E_θ[ n/θ - ∑_{i=1}^n X_i ] = n/θ - n (1/θ) = 0.

   The l.h.s. of condition 4 is also equal to zero, since ∫ f(x; θ) dx = 1, as f(x; θ) is a pdf,
   and ∂1/∂θ = 0. ✓
5. Regularity condition 5 assumes that we can differentiate under the integral so that we
   have for any unbiased estimator t(x) for θ

   ∂/∂θ ∫ t(x) f(x; θ) dx = ∫ t(x) ∂f(x; θ)/∂θ dx.
   As mentioned above, the verification of this condition is rather complicated since we have
   to show that it holds true for all unbiased estimators of θ. Here, where we consider a
   sample from a pdf which belongs to the exponential class, it is satisfied as discussed in
   Mittelhammer (1996, p. 410-411). ✓
6. Regarding the last condition, first note that (see the proof for the CRLB-theorem)

   E_θ[ (∂ ln f(x; θ)/∂θ)² ] = n E_θ[ (∂ ln f(X_i; θ)/∂θ)² ],

   where

   E_θ[ (∂ ln f(X_i; θ)/∂θ)² ] = E_θ[ (1/θ - X_i)² ] = E_θ[ (X_i - EX_i)² ] = var(X_i) = 1/θ²,

   so that 0 < E_θ[ (∂ ln f(x; θ)/∂θ)² ] = n/θ² < ∞. ✓

Thus, the CRLB regularity conditions are met for our estimation problem. The CRLB for the
variance of an unbiased estimator of q(θ) = θ is

var_θ(T) ≥ [dq(θ)/dθ]² / ( n E_θ[ (∂ ln f(X; θ)/∂θ)² ] ) = 1 / ( n (1/θ²) ) = θ²/n,

and, correspondingly, the CRLB for an unbiased estimator of q(θ) = 1/θ is (1/θ⁴)/(n(1/θ²)) = 1/(nθ²).
Regarding the existence of an unbiased estimator whose variance achieves the CRLB (second
part of the CRLB-theorem) note that

∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = ∑_{i=1}^n (1/θ - x_i) = -n ( (1/n) ∑_{i=1}^n x_i - 1/θ ),

where 1/θ = EX_i. Hence, with K(θ, n) = -n, t(x) = (1/n) ∑_{i=1}^n x_i = x̄_n and q(θ) = 1/θ, the
equality condition of the CRLB-theorem is satisfied: the sample mean X̄_n is an unbiased estimator
of q(θ) = 1/θ whose variance, var(X̄_n) = var(X)/n = 1/(nθ²), attains the CRLB, so X̄_n is the
MVUE of 1/θ.
Further examples illustrating the use of the CRLB can be found in Mittelhammer (1996,
p. 411ff) and Mood, Graybill and Boes (1973, p. 319f).
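A quick simulation (a sketch only; θ, n, the seed, and the replication count are arbitrary choices) estimates the variance of X̄_n for exponential data and compares it with the CRLB 1/(nθ²) for unbiased estimators of q(θ) = 1/θ.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 50, 200_000

x = rng.exponential(scale=1 / theta, size=(reps, n))   # f(x; θ) = θ e^{-θx}, so EX = 1/θ
xbar = x.mean(axis=1)                                   # unbiased estimator of q(θ) = 1/θ

print("E[X̄_n]   =", xbar.mean(), "(theory:", 1 / theta, ")")
print("var(X̄_n) =", xbar.var(), " CRLB:", 1 / (n * theta ** 2))
```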
We conclude the discussion of the CRLB with remarks on its form and its attainability.
An alternative form of the CRLB that utilizes the second-order derivative of ln f(x; θ) w.r.t. θ
is sometimes useful in practice. Specifically, it is often the case that the so-called information
equality, given by

E_θ[ (∂ ln f(x; θ)/∂θ)² ] = -E_θ[ ∂² ln f(x; θ)/∂θ² ],

holds true. Now, note that under the assumption that the X_i's in x are iid, the l.h.s. becomes

E_θ[ (∂ ln f(x; θ)/∂θ)² ] = n E_θ[ (∂ ln f(X_i; θ)/∂θ)² ]   (independence),

while the r.h.s. becomes

-E_θ[ ∂² ln f(x; θ)/∂θ² ] = -∑_{i=1}^n E_θ[ ∂² ln f(X_i; θ)/∂θ² ] = -n E_θ[ ∂² ln f(X_i; θ)/∂θ² ]   (identical distr.).

Under the information equality, the CRLB can therefore be written as

CRLB = [dq(θ)/dθ]² / ( n E_θ[ (∂ ln f(X; θ)/∂θ)² ] ) = [dq(θ)/dθ]² / ( -n E_θ[ ∂² ln f(X; θ)/∂θ² ] ).

The information equality itself follows from the following argument. Note that

E_θ[ ∂² ln f(·; θ)/∂θ² ] = E_θ[ ∂/∂θ ( (1/f(·; θ)) ∂f(·; θ)/∂θ ) ]
  = E_θ[ (1/f(·; θ)) ∂²f(·; θ)/∂θ² - (1/f(·; θ)²) ( ∂f(·; θ)/∂θ )² ]
  = E_θ[ (1/f(·; θ)) ∂²f(·; θ)/∂θ² ] - E_θ[ (∂ ln f(·; θ)/∂θ)² ],

using ∂ ln f/∂θ = (1/f) ∂f/∂θ. The information equality thus holds if the first term on the r.h.s.
is zero. Now note that

E_θ[ (1/f(·; θ)) ∂²f(·; θ)/∂θ² ] = ∫ (1/f(x; θ)) [∂²f(x; θ)/∂θ²] f(x; θ) dx = ∫ ∂²f(x; θ)/∂θ² dx,

which is equal to zero if we may twice differentiate under the integral:

∫ ∂²f(x; θ)/∂θ² dx = ∂²/∂θ² ∫ f(x; θ) dx = ∂²(1)/∂θ² = 0.

Thus, the information equality and the alternative form of the CRLB hold if the joint pdf
f(x; θ) for the random sample under consideration allows us to change two times the order of
integration w.r.t. x and differentiation w.r.t. θ. This is typically the case if the joint pdf f(x; θ)
belongs to the exponential class (see Mittelhammer, 1996, p. 414).
Often the CRLB for the variance of an unbiased estimator is not attainable, that is, there
often exists a lower bound for the variance that is greater than the CRLB. In that case the
variance of the MVUE is greater than the CRLB, which implies that the CRLB theorem is of
limited use for the identification of the MVUE. The following proposition indicates the cases
where both the CRLB-regularity conditions are satisfied and the CRLB is attainable.⁶

Theorem 2.5 (Exponential class and CRLB) If T is an unbiased estimator of some q(θ)
whose variance coincides with the CRLB, then the pdf f(x; θ) belongs to the exponential class;
and, conversely, if f(x; θ) belongs to the exponential class, then there exists a unique unbiased
estimator T of some q(θ) whose variance coincides with the CRLB.

Proof
Omitted.

The theorem tells us that we will be able to find an unbiased estimator whose variance attains
the CRLB iff the sampling density belongs to the exponential class; and there is only one
function q(θ) for which such an estimator exists.
Recall the example where we considered a random sample from an exponential distribution
(belonging to the exponential class)

f(x; θ) = θ e^{-θx},   with EX = 1/θ,

where the variance of the sample mean X̄_n attains the CRLB for q(θ) = 1/θ.
This implies that for q(θ) = θ, e.g., no such estimator exists.

⁶ See Mittelhammer (1996, p. 418) and Mood, Graybill and Boes (1973, p. 320).
Since the CRLB is often of limited use in finding the MVUE, we need alternative approaches
for finding MVUEs. In the next subsection, we will discuss such an alternative approach.
2.5.2. Sufficiency and Completeness

Theorem 2.6 (Rao-Blackwell) Let T be an unbiased estimator of q(θ) and let S = (S_1, ..., S_r)′
be a sufficient statistic for f(x; θ). Then T′ = E(T|S) is a statistic, it is an unbiased estimator
of q(θ), and var_θ(T′) ≤ var_θ(T) ∀ θ ∈ Ω.

Proof
1. First note that, since S = (S_1, ..., S_r)′ is a sufficient statistic, the conditional density
   f(x; θ | s) = h(x | s) is independent of θ. Hence,

   T′ = E(T|S) = ∫ t(x) h(x | s) dx

   does not depend on θ and is therefore a statistic.
2. By the law of iterated expectations (l.i.e.),

   E(T′) = E[E(T|S)] = E(T) = q(θ),

   so that T′ is unbiased for q(θ).
3. The variance of T is

   var(T) = E[(T - ET)²] = E[(T - T′ + T′ - ET′)²]   (since ET = ET′)
     = E[(T - T′)²] + var(T′) + 2 E[(T′ - ET′)(T - T′)],

   where the cross term vanishes, since

   E[(T′ - ET′)(T - T′)] = E{ (T′ - ET′) E[(T - T′)|S] } = 0,

   as E[(T - T′)|S] = E(T|S) - T′ = 0, and therefore

   var(T) = E[(T - T′)²] + var(T′) ≥ var(T′).

The Rao-Blackwell theorem says: Given an unbiased estimator, another unbiased estimator
that is a function of a sufficient statistic can be constructed, and it will not have a larger
variance but possibly a smaller one. In other words, conditioning an unbiased estimator on a
sufficient statistic might improve its MSE performance while it will never deteriorate the MSE
performance. This implies that the search for an MVUE can be restricted to functions of
sufficient statistics!
Example 2.18 Let (X_1, X_2, X_3) be a random sample from a uniform distribution on the
interval [0, θ] with pdf

f(x; θ) = (1/θ) I_{[0,θ]}(x).

Note that this pdf does not belong to the exponential class, such that the CRLB theorem is not
applicable to find the MVUE for θ.
An unbiased estimator for θ is T = 2X_(2), that is, two times the sample median, since

ET = 2 E X_(2) = 2 EX = 2 (θ/2) = θ.

A sufficient statistic for the upper bound θ of the sample range is given by the sample
maximum, that is S = X_(3). (Note that the sample maximum contains all the information on
the upper bound θ that the sample has to offer.) According to the Rao-Blackwell theorem the
estimator of θ constructed as

T′ = E(T|S) = E(2X_(2) | X_(3))

should have a variance which is not larger than that of T.
The explicit functional form of T′ = E(T|S) is obtained as follows. First note that

T′ = E(2X_(2) | X_(3)) = ∫_0^{x_(3)} 2 x_(2) f(x_(2) | x_(3)) dx_(2),

where

f(x_(2) | x_(3)) = f(x_(2), x_(3)) / f(x_(3)) = (6 x_(2)/θ³) / (3 x_(3)²/θ³) = 2 x_(2)/x_(3)²   for x_(2) ≤ x_(3).

Thus, we have

T′ = E(2X_(2) | X_(3)) = ∫_0^{x_(3)} (4 x_(2)²/x_(3)²) dx_(2) = (4/3) x_(3).

The variance of T is

var(T) = var(2X_(2)) = ∫_0^θ (2x_(2))² f(x_(2)) dx_(2) - θ² = (1/5) θ²,   with f(x_(2)) = 6 x_(2)(θ - x_(2))/θ³,

while the variance of T′ is

var(T′) = var((4/3) X_(3)) = ∫_0^θ ((4/3) x_(3))² f(x_(3)) dx_(3) - θ² = (1/15) θ²,   with f(x_(3)) = 3 x_(3)²/θ³.

Thus, we have that var(T′) < var(T), which is consistent with the Rao-Blackwell theorem.
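A short simulation (a sketch; θ, the seed, and the replication count are arbitrary) reproduces the two variances and illustrates the Rao-Blackwell improvement.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, reps = 1.0, 200_000

x = rng.uniform(0, theta, size=(reps, 3))
t = 2 * np.median(x, axis=1)        # T  = 2 * sample median
t_rb = 4 / 3 * x.max(axis=1)        # T' = E(T | X_(3)) = (4/3) X_(3)

print("E[T]  =", t.mean(), "  var(T)  =", t.var(), "(theory:", theta ** 2 / 5, ")")
print("E[T'] =", t_rb.mean(), " var(T') =", t_rb.var(), "(theory:", theta ** 2 / 15, ")")
```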
Before moving on, two comments are appropriate. First, if an unbiased estimator T is already a
function of a sufficient statistic S, then the estimator T′ derived according to the Rao-Blackwell
theorem will be identical to T. Second, the Rao-Blackwell theorem tells us how to improve
on an unbiased estimator by conditioning on a sufficient statistic. This raises the question
whether or not such an unbiased estimator, obtained by conditioning on a sufficient statistic,
will be the MVUE. As we shall discuss now, this is the case if the sufficient statistic S used for
conditioning an unbiased estimator is complete.
Definition (Complete sufficient statistics): Let S = [S_1, ..., S_r]′ be a sufficient statistic for
f(x; θ). The sufficient statistic S is said to be complete iff

E_θ[z(S)] = 0 ∀ θ ∈ Ω   implies that   P_θ[z(s) = 0] = 1 ∀ θ ∈ Ω,

where z(S) is a statistic.

One implication of that definition is: If a sufficient statistic S is complete, it follows that
two different functions of S cannot have the same expected value. To see this, consider two
functions of a complete sufficient statistic S, say t(S) and t′(S), and suppose that they have
the same expected value

E_θ[t(S)] = E_θ[t′(S)] = q(θ) ∀ θ ∈ Ω.

Now define

z(S) = t(S) - t′(S),   so that   E_θ[z(S)] = 0 ∀ θ ∈ Ω.

Now, since z(S) is a function of a complete sufficient statistic, it must be the case (according
to the definition) that

P[z(s) = 0] = 1   and hence   P[t(s) = t′(s)] = 1.

Thus, t(S) and t′(S) are the same function with probability 1, if they have the same expected
value and if S is a complete sufficient statistic.
An important implication of this result is that any unbiased estimator of q(θ) that is a function
of a complete sufficient statistic is unique: there cannot be more than one unbiased estimator
of q(θ) defined as a function of a complete sufficient statistic.
Example 2.19 Let (X_1, ..., X_n) be a random sample from a Bernoulli population with P(X =
1) = p, and consider the statistic

S = ∑_{i=1}^n X_i,

which is a sufficient statistic for p. To determine whether S is a complete sufficient statistic
we need to show that a function z(S) of S for which E[z(S)] = 0 ∀ p ∈ [0, 1] must be zero with
probability 1. Since S ~ binomial(n, p), such a function is characterized by

E[z(S)] = ∑_{j=0}^n z(j) (n choose j) p^j (1-p)^{n-j} = (1-p)^n ∑_{j=0}^n z(j) (n choose j) η^j = 0,

where η = p/(1-p). Hence, E[z(S)] = 0 ∀ p ∈ [0, 1] requires that

∑_{j=0}^n z(j) (n choose j) η^j = 0   ∀ η ≥ 0.

For that polynomial in η to be equal to 0, all coefficients z(j)(n choose j) need to be equal to 0,
that is

z(j) (n choose j) = 0,   so that   z(j) = 0   ∀ j ∈ {0, 1, ..., n},

since (n choose j) ≠ 0. Hence, E[z(S)] = 0 ∀ p requires that z(j) = 0 ∀ j, such that E[z(S)] = 0
implies that P[z(s) = 0] = 1. Thus, S = ∑_{i=1}^n X_i is a complete sufficient statistic for p.
In general, the verification of the completeness of a sufficient statistic can require tricky analysis.
However, the following theorem identifies a large collection of distributions for which complete
sufficient statistics are relatively easy to identify.
Theorem 2.7 (Completeness in the exponential class) Let the joint density, f(x; θ), of
the random sample (X_1, ..., X_n) be a member of a parametric family of densities belonging to
the exponential class of densities with pdf

f(x; θ) = exp[ ∑_{i=1}^k c_i(θ) g_i(x) + d(θ) + z(x) ] I_A(x),   θ ∈ Ω.

If the range of [c_1(θ), ..., c_k(θ)]′, θ ∈ Ω, contains an open k-dimensional rectangle⁷, then
s(x) = [g_1(x), ..., g_k(x)]′ is a complete sufficient statistic for f(x; θ), θ ∈ Ω.

Proof
See Rohatgi and Saleh (2001), An Introduction to Probability Theory and Mathematical Statistics,
John Wiley and Sons, p. 367f.

If complete sufficient statistics exist for a statistical model {f(x; θ), θ ∈ Ω}, then an alternative
to the CRLB approach is available to identify the MVUE of q(θ). The approach is based
upon the Lehmann-Scheffé completeness theorem.

⁷ The condition that the range of [c_1(θ), ..., c_k(θ)] contains an open k-dimensional rectangle excludes cases
where the c_i(θ)'s are linearly dependent. For a random sample from a N(μ, σ²) distribution with (μ, σ²) ∈
R × R_+, for example, the range of [c_1(θ), c_2(θ)]′ = [μ/σ², -1/(2σ²)]′ is the set R × R_- and contains an open
2-dimensional rectangle.

Theorem 2.8 (Lehmann-Scheffé completeness theorem (scalar case)) Let S = (S_1, ..., S_r)′
be a complete sufficient statistic for f(x; θ). Let T = t(S) be an unbiased estimator for the
function q(θ). Then T = t(S) is the MVUE of q(θ).
Proof
Let T′ be any unbiased estimator of q(θ) which is a function of the complete sufficient statistic S,
that is T′ = t′(S). Then

E(T - T′) = 0 ∀ θ ∈ Ω   and   T - T′ is a function of S,

so by completeness of S,

P[t(S) = t′(S)] = 1.

Hence, there is only one unbiased estimator of q(θ) that is a function of S.
Now let T* be any unbiased estimator of q(θ). Then T must be equal to

T = E(T*|S),

since E(T*|S) is an unbiased estimator of q(θ) depending on S. By the Rao-Blackwell theorem,

var(T) ≤ var(T*) ∀ θ ∈ Ω,

so that T is the MVUE of q(θ).

The Lehmann-Scheffé completeness theorem says that if a complete sufficient statistic S exists
and if there is an unbiased estimator of q(θ), then there is an MVUE for q(θ); there are two
possible procedures for identifying the MVUE for q(θ):

1. Find a statistic of the form t(S) such that E t(S) = q(θ). Then t(S) is necessarily the
   MVUE of q(θ).
2. Find any unbiased estimator of q(θ), say t*(x). Then t(S) = E(t*(x)|S) is the MVUE
   of q(θ).
Example 2.20 Let (X_1, ..., X_n) be a random sample from a Poisson distribution with pdf

f(x; λ) = e^{-λ} λ^x / x!   for x = 0, 1, 2, ...,   with   EX = var(X) = λ.

To find the MVUE of q(λ) = λ, note first that the joint pdf f(x; λ) is a member of the exponential
class of densities and has the form

f(x; λ) = ∏_{i=1}^n f(x_i; λ) = e^{-nλ} λ^{∑_{i=1}^n x_i} / ∏_{i=1}^n x_i!
  = exp[ ln(λ) ∑_{i=1}^n x_i - nλ - ln(∏_{i=1}^n x_i!) ],

with c(λ) = ln(λ) and g(x) = ∑_{i=1}^n x_i. Since the range of c(λ) = ln(λ) for λ ∈ (0, ∞) is R and
thus contains an open interval, S = ∑_{i=1}^n X_i is a complete sufficient statistic for λ.
To identify the MVUE for λ, it suffices to find a function of the complete sufficient statistic S
which is unbiased for λ. Since

E( (1/n) ∑_{i=1}^n X_i ) = (1/n) n λ = λ,

the sample mean X̄_n = S/n is the MVUE for λ.
Its variance attains the CRLB: the CRLB for the variance of an unbiased estimator of λ is

var(T) ≥ [dq(λ)/dλ]² / ( n E[ (∂ ln f(X; λ)/∂λ)² ] ),

where

[∂ ln f(x; λ)/∂λ]² = (-1 + x/λ)² = 1 - 2x/λ + x²/λ²,

so that

E[ (∂ ln f(X; λ)/∂λ)² ] = 1 - 2 EX/λ + (var(X) + (EX)²)/λ² = 1 - 2 + 1 + 1/λ = 1/λ,

and

var(T) ≥ 1 / ( n (1/λ) ) = λ/n.

Thus, the variance of the MVUE X̄_n = (1/n) ∑_{i=1}^n X_i, which equals var(X)/n = λ/n, attains
the CRLB.
Example 2.21 Let (X_1, ..., X_n) be a random sample from a Poisson distribution with pdf
f(x; λ) = e^{-λ} λ^x / x!. Find the MVUE of q(λ) = P(x = 0) = e^{-λ}.
According to Lehmann-Scheffé's theorem, the MVUE can be derived by calculating the conditional
expectation of some unbiased estimator T of e^{-λ} given the complete sufficient statistic
S = ∑_{i=1}^n X_i.
Since we can use any unbiased estimator, we may choose a simple one that would make the
calculations easy. Such a simple unbiased estimator of P(x = 0) = e^{-λ} is

T = I_{0}(X_1),

so that

T* = E[ I_{0}(X_1) | ∑_{i=1}^n X_i ] = 1 · P(x_1 = 0 | ∑_{i=1}^n X_i) + 0 · P(x_1 > 0 | ∑_{i=1}^n X_i)

is the MVUE for e^{-λ}. To find this conditional expectation, we need to derive the conditional
probability P(x_1 = 0 | ∑_{i=1}^n X_i = s). This conditional probability obtains as

P(x_1 = 0 | ∑_{i=1}^n X_i = s) = P(x_1 = 0, ∑_{i=1}^n X_i = s) / P(∑_{i=1}^n X_i = s)    (by Def.)
  = P(x_1 = 0) P(∑_{i=2}^n X_i = s) / P(∑_{i=1}^n X_i = s).

Now we exploit the additivity property of the Poisson distribution, which implies

X_i ~ iid Poisson(λ)   ⟹   ∑_{i=1}^n X_i ~ Poisson(nλ).

Hence

P(∑_{i=1}^n X_i = s) = e^{-nλ} (nλ)^s / s!   and   P(∑_{i=2}^n X_i = s) = e^{-(n-1)λ} ([n-1]λ)^s / s!,

such that

P(x_1 = 0 | ∑_{i=1}^n X_i = s) = e^{-λ} e^{-(n-1)λ} ([n-1]λ)^s / [ e^{-nλ} (nλ)^s ] = ((n-1)/n)^s.

Therefore,

T* = E[ I_{0}(X_1) | ∑_{i=1}^n X_i ] = P(x_1 = 0 | ∑_{i=1}^n X_i) = ((n-1)/n)^{∑_{i=1}^n X_i}

is the MVUE of q(λ) = e^{-λ}. Its variance can be shown to be var(T*) = e^{-2λ}(e^{λ/n} - 1).
Note that this variance has to be larger than the CRLB for the variance of an unbiased estimator
of e^{-λ}, since the variance of the MVUE of λ attains the CRLB (see previous example) and there is
only one function of λ for which the MVUE attains the CRLB. In fact, the CRLB for the variance
of an unbiased estimator of q(λ) = e^{-λ} is

var(T) ≥ [dq(λ)/dλ]² / ( n E[ (∂ ln f(X; λ)/∂λ)² ] ) = e^{-2λ} / ( n (1/λ) ) = (λ/n) e^{-2λ}.
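A simulation sketch (λ, n, the seed, and the replication count are arbitrary choices) compares the crude unbiased estimator I{X_1 = 0} with the Lehmann-Scheffé MVUE ((n-1)/n)^S and with the CRLB for unbiased estimators of e^{-λ}.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 1.5, 20, 200_000

x = rng.poisson(lam, size=(reps, n))
s = x.sum(axis=1)

t_simple = (x[:, 0] == 0).astype(float)   # crude unbiased estimator I{X_1 = 0}
t_mvue = ((n - 1) / n) ** s               # Rao-Blackwellized / Lehmann-Scheffé MVUE

target = np.exp(-lam)
print("target e^{-lambda}        :", target)
print("mean, var of I{X_1 = 0}   :", t_simple.mean(), t_simple.var())
print("mean, var of ((n-1)/n)^S  :", t_mvue.mean(), t_mvue.var())
print("CRLB (lambda/n) e^{-2lam} :", lam / n * np.exp(-2 * lam))
```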
3. The Linear Regression Model

Consider the linear regression model (LRM)

Y_i = β_1 x_{i1} + β_2 x_{i2} + ... + β_k x_{ik} + ε_i,   i = 1, ..., n,

or, in matrix notation,

Y = xβ + ε,

where

Y = (Y_1, ..., Y_n)′,   x = (x_{ij}) is the non-random n × k matrix of explanatory variables with
rows x_{i.} = (x_{i1}, ..., x_{ik}),   β = (β_1, ..., β_k)′,   ε = (ε_1, ..., ε_n)′.

Note that the errors ε_i represent unobservable random variables, since they are deviations of
Y_i from the unknown mean EY_i = β_1 x_{i1} + ... + β_k x_{ik}.
In order to illustrate the generality of the LRM, note that a specification like

Y_i = β_1 + β_2 z_{i2}² + β_3 sin(z_{i3}) + β_4 (z_{i4}/z_{i5}) + ε_i,   i = 1, ..., n,

is consistent with the linear form of a LRM, and a representation that is linear in explanatory
variables obtains by defining

x_{i1} = 1,   x_{i2} = z_{i2}²,   x_{i3} = sin(z_{i3}),   x_{i4} = z_{i4}/z_{i5}.

Moreover, a relationship between dependent and independent variables that is initially nonlinear
in the parameters might be transformable into a LRM. Consider, e.g., the following stochastic
Cobb-Douglas production function

Q_i = γ k_i^{β_1} ℓ_i^{β_2} ε_i,

with Q_i being output, k_i and ℓ_i being capital and labor inputs, and ε_i denoting a random error
term. A logarithmic transformation yields

ln Q_i = ln γ + β_1 ln k_i + β_2 ln ℓ_i + ln ε_i,

which is a LRM with dependent variable Y_i = ln Q_i, regressors x_{i2} = ln k_i and x_{i3} = ln ℓ_i, and
error term ln ε_i.
Assumption (A1): EY = xβ and Eε = 0.

The first assumption (A1) says that (as already discussed above) the mean of the dependent
variables Y is a linear function in the regression parameters β and has the form xβ. This
necessarily implies that Eε = 0.

Assumption (A2): x is a non-random n × k matrix with rank rk(x) = k (full column rank).

Assumption (A3): Cov(ε) = E(εε′) = σ² I_(n).

Assumption (A3) specifies the covariance matrix of the errors, whose diagonal elements are the
variances var(ε_i) and whose off-diagonal elements are the covariances cov(ε_i, ε_j), as

Cov(ε) = diag(σ², ..., σ²) = σ² I_(n).
The fact that all of the variances of the elements in ε have the same value, i.e.

var(ε_i) = σ²,   i = 1, ..., n,

is referred to as homoskedasticity. The fact that the off-diagonal entries are zero implies that
the covariances are zero, i.e.

cov(ε_i, ε_j) = 0,   i ≠ j,

such that there is no correlation between any two elements in ε.
The three assumptions (A1)-(A3), i.e.

(A1): EY = xβ and Eε = 0,
(A2): x is a non-random n × k matrix with rank rk(x) = k,
(A3): Cov(ε) = E(εε′) = σ² I,

are the classical assumptions of the LRM

Y = xβ + ε.
The least squares (LS) estimate of β minimizes the sum of squared deviations

S(β) = ∑_{i=1}^n (y_i - x_{i.}β)² = (y - xβ)′(y - xβ),

where x_{i.} denotes the i-th row of x. The first-order conditions are

∂S(β)/∂β = -2x′y + 2x′xb = 0   ⟹   x′xb = x′y   ⟹   b = (x′x)^{-1}x′y.

Note that x is assumed to have full rank (A2). This implies that the (k × k) matrix x′x has
full rank and is thus invertible. The minimizer

b = (x′x)^{-1}x′y

defines the LS estimate for β. The LS estimator of β is then defined by the random vector

β̂ = (x′x)^{-1}x′Y = (x′x)^{-1}x′(xβ + ε) = β + (x′x)^{-1}x′ε.

Note that the second-order conditions for the minimization are satisfied, since

∂²S(β)/∂β∂β′ = 2x′x   is a p.d. matrix

(the matrix 2x′x is p.d. since for any vector ℓ ≠ 0 we obtain ℓ′2x′xℓ = 2(xℓ)′(xℓ) > 0), such
that the objective function S(β) is convex.
The fitted values are given by

ŷ = xb,   with ŷ_i = x_{i.}b,

which are estimates of the expectations EY_i = x_{i.}β. The LS residuals, given by

e_i = y_i - ŷ_i = y_i - x_{i.}b,   with   e = y - ŷ = y - xb,

satisfy, by the first-order conditions,

x′e = 0.

Hence, we have ∑_{i=1}^n x_{ij} e_i = 0 for j = 1, ..., k, so that the regressor matrix x and the LS-residual
vector e are orthogonal. If the LRM contains an intercept such that the first column of x is a
column of 1s, then the LS residuals sum to 0. This follows from ∑_{i=1}^n 1·e_i = ∑_{i=1}^n e_i = 0. The
vector of LS residuals can be written as

e = y - xb = y - x[(x′x)^{-1}x′y] = [I_(n) - x(x′x)^{-1}x′] y = My.
The (n n) matrix M = I (n) x(x0 x)1 x0 produces the LS residuals in the regression of y on x
when it pre-multiplies the vector y. Hence it is a residual generating matrix. The matrix
M is both symmetric (M = M 0 ) and idempotent (M M 0 = M ). Furthermore we have
M x = (I (n) x(x0 x)1 x0 )x = 0.
Hence, the LS residuals of a regression of x on x are equal to zero.
As a measure of how well the $Y$ outcomes have been explained by the fitted regression plane $xb$ we can use the coefficient of determination, which is defined as
\[
R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}
\qquad\text{(LRM with intercept)},
\]
where $SST = \sum_{i=1}^n (y_i - \bar y)^2$ is the total sum of squares and $SSR = \sum_{i=1}^n (\hat y_i - \bar y)^2$ is the explained (regression) sum of squares.
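A short Python sketch of the LS formulas above (the data-generating design, parameter values, and sample size are purely illustrative assumptions, not part of the notes): it computes $b=(x'x)^{-1}x'y$, checks the orthogonality $x'e=0$, and evaluates $R^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 regressors
beta = np.array([1.0, 2.0, -0.5])
y = x @ beta + rng.normal(size=n)

b = np.linalg.solve(x.T @ x, x.T @ y)     # LS estimate (x'x)^{-1} x'y
e = y - x @ b                             # LS residuals
print(x.T @ e)                            # orthogonality x'e = 0 (numerically ~ 0)

sst = np.sum((y - y.mean()) ** 2)         # total sum of squares
ssr = np.sum((x @ b - y.mean()) ** 2)     # explained sum of squares
print("R^2 =", ssr / sst)
```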
Note that the LS estimator has the form of a linear function of the random sample variables $Y_i$, i.e.
\[
\hat\beta = AY + d, \qquad\text{with}\quad A = (x'x)^{-1}x' \ \text{ and }\ d = 0.
\]
The following theorem establishes that the LS estimator has the smallest variance among all linear and unbiased estimators.
Theorem 3.1 (Gauss-Markov Theorem) Under the classical assumptions of the LRM, $\hat\beta = (x'x)^{-1}x'Y$ is the best linear unbiased estimator of $\beta$.
Proof
Let
\[
\tilde\beta = AY + d
\]
be any linear estimator of $\beta$. Its expectation is
\[
E\tilde\beta = A\,E(x\beta+\varepsilon) + d = Ax\beta + d.
\]
Unbiasedness of $\tilde\beta$ for every $\beta$ requires $Ax = I$ and $d = 0$. Defining $D = A - (x'x)^{-1}x'$, we can rewrite $\mathrm{Cov}(\tilde\beta)$ as
\[
\mathrm{Cov}(\tilde\beta) = \sigma^2 AA' = \sigma^2[(x'x)^{-1}x' + D][(x'x)^{-1}x' + D]' = \sigma^2[(x'x)^{-1} + DD'],
\]
where the last equation follows from the fact that
\[
Dx(x'x)^{-1} = [\underbrace{A - (x'x)^{-1}x'}_{=D}]\,x(x'x)^{-1}
= \underbrace{Ax}_{=I\ \text{(unbiasedness)}}(x'x)^{-1} - (x'x)^{-1}x'x(x'x)^{-1} = 0.
\]
Thus we have
\[
\mathrm{Cov}(\tilde\beta) = \mathrm{Cov}(\hat\beta) + \sigma^2 DD',
\]
where $\sigma^2 DD'$ is a p.s.d. matrix, so that for any linear combination $\ell_o'\beta$
\[
\mathrm{var}(\ell_o'\tilde\beta) \ \ge\ \mathrm{var}(\ell_o'\hat\beta). \qquad\Box
\]
The classical assumptions of the LRM are sufficient to allow the definition of an unbiased estimator of the disturbance variance $\mathrm{var}(\varepsilon_i) = \sigma^2$. This unbiased estimator is given by
\[
S^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-k}, \qquad \hat\varepsilon = Y - x\hat\beta.
\]
Since $\hat\varepsilon = M\varepsilon$ and $E(\varepsilon'M\varepsilon) = \sigma^2\,\mathrm{tr}(M) = \sigma^2(n-k)$, it follows that
\[
ES^2 = \frac{E(\hat\varepsilon'\hat\varepsilon)}{n-k} = \sigma^2.
\]
Theorem 3.2 (Consistency of the LS estimator) Under the classical assumptions of the LRM, if $(x'x)^{-1} \to 0$ as $n \to \infty$, then $\hat\beta \xrightarrow{p} \beta$, so that $\hat\beta = (x'x)^{-1}x'Y$ is a consistent estimator of $\beta$.
Proof
Under the classical LRM we have that
\[
E\hat\beta = \beta \qquad\text{and}\qquad \mathrm{Cov}(\hat\beta) = \sigma^2 (x'x)^{-1}.
\]
If $(x'x)^{-1} \to 0$, then $\mathrm{Cov}(\hat\beta) \to 0$ and thus $\hat\beta \xrightarrow{m} \beta$. But since convergence in mean square implies convergence in probability, $\hat\beta \xrightarrow{p} \beta$. $\Box$
Theorem 3.2 says that the condition that $(x'x)^{-1} \to 0$ is sufficient for consistency of the LS estimator. This condition ensures that the $x$ matrix is well-behaved in large samples. It is fairly weak and is likely to be satisfied by typical data sets encountered in practice. The convergence $(x'x)^{-1} \to 0$ will occur iff each of the diagonal entries of $(x'x)^{-1}$ goes to zero. The necessity of this condition is obvious. The sufficiency follows from the fact that for p.d. matrices the off-diagonal entries are bounded by the diagonal entries. To see this, consider any $2\times 2$ principal submatrix
\[
A = \begin{pmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix}
\]
of $(x'x)^{-1}$. Note that $A$ is symmetric and p.d. since $(x'x)$, and hence $(x'x)^{-1}$, is symmetric and p.d.. Since $A$ is symmetric and p.d., it follows that
\[
|A| = a_{11}a_{22} - a_{12}^2 > 0, \qquad\text{so that}\qquad a_{12}^2 < a_{11}a_{22}.
\]
This is the boundedness result mentioned above. Therefore, if $a_{11} \to 0$ and $a_{22} \to 0$, then $a_{12} \to 0$.
In the simplest case of a regression on a single regressor, $x'x = \sum_{i=1}^n x_i^2$, so that $(x'x)^{-1} = 1/\sum_{i=1}^n x_i^2 \to 0$ iff $\sum_{i=1}^n x_i^2 \to \infty$.
Theorem 3.3 (Consistency of $S^2$ - iid case) Under the classical assumptions of the LRM, if the disturbances $\varepsilon_i$ are iid, then $S^2 \xrightarrow{p} \sigma^2$, so that $S^2$ is a consistent estimator of $\sigma^2$.
Proof
Recall that the unbiased estimator $S^2$ can be represented by
\[
S^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-k} = \frac{\varepsilon'M\varepsilon}{n-k},
\qquad\text{since}\quad \hat\varepsilon = MY = M(x\beta+\varepsilon) = M\varepsilon.
\]
This yields
\[
S^2 = \underbrace{\frac{\varepsilon'\varepsilon}{n-k}}_{V_n \xrightarrow{p} \sigma^2}
\;-\; \underbrace{\frac{\varepsilon'x(x'x)^{-1}x'\varepsilon}{n-k}}_{Z_n \xrightarrow{p} 0}.
\]
Regarding the second term, $Z_n$ is a nonnegative random variable with expectation
\[
\frac{1}{n-k}E[\varepsilon'x(x'x)^{-1}x'\varepsilon]
= \frac{1}{n-k}\,\mathrm{tr}\bigl[x(x'x)^{-1}x'\underbrace{E(\varepsilon\varepsilon')}_{\sigma^2 I}\bigr]
= \frac{\sigma^2}{n-k}\,\mathrm{tr}\bigl[\underbrace{(x'x)^{-1}x'x}_{I^{(k)}}\bigr]
= \frac{\sigma^2 k}{n-k} \to 0,
\]
so that $Z_n \xrightarrow{p} 0$ (by Markov's inequality). Regarding the first term,
\[
\mathrm{plim}\,V_n = \mathrm{plim}\,\frac{n}{n-k}\cdot \mathrm{plim}\,\frac{1}{n}\sum_{i=1}^n \varepsilon_i^2 = \sigma^2.
\]
The iid assumption for the $\varepsilon_i$'s implies that the $\varepsilon_i^2$'s are also iid with $E\varepsilon_i^2 = \sigma^2$. So by Khinchin's WLLN, $\mathrm{plim}\,\frac{1}{n}\sum_{i=1}^n\varepsilon_i^2 = \sigma^2$, while $\mathrm{plim}\,\frac{n}{n-k} = 1$. $\Box$
Theorem 3.3 establishes consistency of $S^2$ for estimating $\sigma^2$ by assuming that the disturbances $\varepsilon_i$ are iid. This assumption is fairly restrictive and in practice often violated. A theorem for consistency of $S^2$ which relaxes the iid conditions is found in Mittelhammer (1996, p. 442). There the iid assumption is replaced with the assumption that $E\varepsilon_i^4 < \infty$ and certain conditions on the dependence structure of $\{\varepsilon_i\}$.
With a consistent estimator $S^2$ for $\sigma^2$ we can compute
\[
\widehat{\mathrm{Cov}}(\hat\beta) = S^2(x'x)^{-1},
\]
which is a consistent estimator for the covariance matrix of the LS estimator, $\mathrm{Cov}(\hat\beta) = \sigma^2(x'x)^{-1}$.
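A brief numerical sketch (simulated data; the design and parameter values are illustrative assumptions) of the estimated covariance matrix $S^2(x'x)^{-1}$ and the implied standard errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = x @ beta + rng.normal(size=n)

b = np.linalg.solve(x.T @ x, x.T @ y)
e = y - x @ b
s2 = (e @ e) / (n - k)                    # unbiased estimator S^2 of sigma^2
cov_b = s2 * np.linalg.inv(x.T @ x)       # estimated Cov(beta-hat) = S^2 (x'x)^{-1}
print(np.sqrt(np.diag(cov_b)))            # standard errors of the LS coefficients
```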
Theorem 3.4 (Asymptotic Normality of the LS estimator - iid case) Assume the classical assumptions of the LRM. In addition, assume that
1. the $\varepsilon_i$'s are iid with $P(|\varepsilon_i| < m) = 1$ for some $m < \infty$ and $\forall i$;
2. the regressors are such that $|x_{ij}| < \xi < \infty$ $\forall i$ and $j$;
3. $\tfrac{1}{n}x'x \to Q$, a finite positive definite matrix.
Then
\[
\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{-1})
\qquad\text{and}\qquad
\hat\beta \stackrel{a}{\sim} N\bigl(\beta, n^{-1}\sigma^2 Q^{-1}\bigr).
\]
Proof
Recall that the LS estimator can be expressed as $\hat\beta = \beta + (x'x)^{-1}x'\varepsilon$, so that
\[
\sqrt{n}(\hat\beta - \beta) = \sqrt{n}(x'x)^{-1}x'\varepsilon
= \underbrace{\Bigl(\tfrac{1}{n}x'x\Bigr)^{-1}}_{W_n \to Q^{-1}\ \text{(by assumption 3)}}\;
\underbrace{\tfrac{1}{\sqrt{n}}x'\varepsilon}_{V_n \xrightarrow{d} N(0,\sigma^2 Q)}.
\]
Note that
\[
\frac{1}{\sqrt{n}}x'\varepsilon = \frac{1}{\sqrt{n}}\sum_{i=1}^n x_{i.}'\varepsilon_i \qquad (k\times 1),
\qquad\text{with}\qquad
\frac{1}{n}\sum_{i=1}^n \mathrm{Cov}(x_{i.}'\varepsilon_i)
= \sigma^2\lim_{n\to\infty}\frac{1}{n}\underbrace{\sum_{i=1}^n x_{i.}'x_{i.}}_{x'x} = \sigma^2 Q.
\]
Thus the random vectors in $\{x_{i.}'\varepsilon_i\}$ satisfy the conditions of the CLT for independent bounded random vectors (see Adv. Stat. I, or Mittelhammer, 1996, Theorem 5.32), so that
\[
V_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n x_{i.}'\varepsilon_i \xrightarrow{d} V \sim N(0, \sigma^2 Q).
\]
By Slutsky's theorem,
\[
\sqrt{n}(\hat\beta - \beta) = W_n V_n \xrightarrow{d} Q^{-1}V \sim N(0, \sigma^2 Q^{-1}QQ^{-1}) = N(0, \sigma^2 Q^{-1}),
\]
so that
\[
\hat\beta \stackrel{a}{\sim} N\Bigl(\beta, \frac{\sigma^2}{n}Q^{-1}\Bigr) \approx N\bigl(\beta, \sigma^2(x'x)^{-1}\bigr). \qquad\Box
\]
Theorem 3.4 says that under certain conditions the LS estimator $\hat\beta$ is, regardless of the distribution of the disturbances $\varepsilon_i$, approximately normally distributed, which is a consequence of the CLT. As we shall see later, if the $\varepsilon_i$'s are normally distributed, then the normal distribution of $\hat\beta$ holds in every sample, so it holds asymptotically as well.
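An illustrative Monte Carlo check of this result (a sketch only; the bounded uniform disturbances, design, and replication count are assumptions made for the illustration): the standardized LS slope estimate should be approximately standard normal across replications.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
x = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
beta = np.array([0.5, 1.0])
sigma2 = 1.0 / 3.0                          # variance of U(-1, 1) disturbances
cov = sigma2 * np.linalg.inv(x.T @ x)       # exact Cov(beta-hat) for the fixed design
z = np.empty(reps)
for r in range(reps):
    eps = rng.uniform(-1, 1, size=n)        # bounded iid disturbances
    b = np.linalg.solve(x.T @ x, x.T @ (x @ beta + eps))
    z[r] = (b[1] - beta[1]) / np.sqrt(cov[1, 1])
print(np.mean(z), np.var(z))                # should be close to 0 and 1
```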
Theorem 3.4 assumes certain conditions on the values of the regressors and regarding the stochastic behavior of the disturbances, which deserve some discussion.
1. In practice, the iid assumption is often violated, e.g., due to heteroskedastic disturbances. Also, the boundedness condition excludes many distributions for $\varepsilon_i$, like the normal, Gamma, lognormal, student-t, etc. For a theorem for asymptotic normality of $\hat\beta$ which replaces those conditions by the weaker assumption that the $\varepsilon_i$'s are independent with $E\varepsilon_i^4 < \infty$ see Mittelhammer (1996, p. 445).
2. In practice, this condition is typically met.
3. This condition requires that
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n x_{ij}^2 < \infty
\qquad\text{and}\qquad
\lim_{n\to\infty}\Bigl|\frac{1}{n}\sum_{i=1}^n x_{ij}x_{il}\Bigr| < \infty,
\]
which is typically satisfied in cross-sectional applications. However, note that this condition is violated if we have a (time) trend such that $x_{ij} = i$, since $\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n i^2 = \infty$.
Theorem 3.5 (Asymptotic Normality of $S^2$ - iid case) Under the classical assumptions of the LRM, if the $\varepsilon_i$'s are iid, and if $E\varepsilon_i^4 = \mu_4' < \infty$, then
\[
\sqrt{n}(S^2 - \sigma^2) \xrightarrow{d} N\bigl(0, \mu_4' - \sigma^4\bigr)
\qquad\text{and}\qquad
S^2 \stackrel{a}{\sim} N\bigl(\sigma^2, \tfrac{1}{n}[\mu_4' - \sigma^4]\bigr).
\]
Proof
Using $S^2 = \varepsilon'M\varepsilon/(n-k)$ we can write
\[
\sqrt{n}(S^2 - \sigma^2)
= \sqrt{n}\Bigl(\frac{\varepsilon'(I - x(x'x)^{-1}x')\varepsilon}{n-k} - \sigma^2\Bigr)
= \underbrace{\frac{\sqrt{n}}{n-k}\bigl[\varepsilon'\varepsilon - (n-k)\sigma^2\bigr]}_{U_n \xrightarrow{d} N(0,\,\mu_4'-\sigma^4)}
\;-\;\underbrace{\frac{\sqrt{n}}{n-k}\,\varepsilon'x(x'x)^{-1}x'\varepsilon}_{W_n \xrightarrow{p} 0}.
\]
Regarding the limiting behavior of the second term $W_n$, note that $W_n$ is a positive random variable with
\[
EW_n = \frac{\sqrt{n}}{n-k}\,\sigma^2 k = \frac{\sigma^2 k\sqrt{n}}{n-k}.
\]
Since $\lim_{n\to\infty}\sigma^2 k\sqrt{n}/(n-k) = 0$, it follows that $W_n \xrightarrow{p} 0$.
Regarding the first term $U_n$, note that by Slutsky's theorem, $U_n$ and
\[
\tilde U_n = \frac{n-k}{n}\,U_n - \frac{k\sigma^2}{\sqrt{n}}
= \frac{1}{\sqrt{n}}\bigl(\varepsilon'\varepsilon - n\sigma^2\bigr)
= \sqrt{n}\Bigl(\frac{1}{n}\sum_{i=1}^n\varepsilon_i^2 - \sigma^2\Bigr)
\]
have the same limiting distribution, since $(n-k)/n \to 1$ and $k\sigma^2/\sqrt{n} \to 0$. Furthermore, $E\varepsilon_i^2 = \sigma^2$ and $E\varepsilon_i^4 = \mu_4' < \infty$, so that $\mathrm{var}(\varepsilon_i^2) = \mu_4' - \sigma^4 < \infty$. Hence, a direct application of the Lindberg-Levy CLT to the sequence of iid variables $\{\varepsilon_i^2\}$ yields
\[
\tilde U_n = \sqrt{n}\Bigl(\frac{1}{n}\sum_{i=1}^n\varepsilon_i^2 - \sigma^2\Bigr) \xrightarrow{d} U \sim N(0, \mu_4' - \sigma^4),
\]
so that $U_n \xrightarrow{d} U$. Therefore
\[
\sqrt{n}(S^2 - \sigma^2) \xrightarrow{d} N\bigl(0, \mu_4' - \sigma^4\bigr). \qquad\Box
\]
The classical assumptions (A1)-(A3) of the LRM form a basic set of assumptions on which useful properties (unbiasedness, BLUE-property, consistency, asymptotic normality) of the LS estimator depend. Hence, it is instructive to discuss the effect that violations of the classical assumptions have on those basic properties of the LS estimator.
1. Violation of (A1), i.e. $E\varepsilon \neq 0$: In general, this would imply that the LS estimator $\hat\beta$ is a biased estimator for $\beta$, since
\[
E\hat\beta = \beta + (x'x)^{-1}x'E\varepsilon \neq \beta.
\]
(Only for the special case that $x'E\varepsilon = 0$, the LS estimator remains unbiased.) This violation of (A1) also implies in general that the variance estimator $S^2$ would be biased, and $S^2$ would be inconsistent.
2. Violation of (A3), i.e. $\mathrm{Cov}(\varepsilon) = E[\varepsilon\varepsilon'] = \Phi \neq \sigma^2 I$: The linearity of $\hat\beta$ is unaffected, and taking expectations still yields $E\hat\beta = \beta$. However, note that the covariance matrix of $\hat\beta$ is no longer $\sigma^2(x'x)^{-1}$ but
\[
\mathrm{Cov}(\hat\beta) = (x'x)^{-1}x'\Phi x(x'x)^{-1}.
\]
Furthermore, under mild conditions on $\Phi$ ensuring that $\mathrm{Cov}(\hat\beta) \to 0$ as $n\to\infty$, the estimator is still consistent. However, the proof of the Gauss-Markov result (Theorem 3.1) that $\hat\beta$ is BLUE breaks down, so that $\hat\beta$ is no longer the best linear unbiased estimator. Finally, one can show that under the condition $E[\varepsilon\varepsilon'] = \Phi \neq \sigma^2 I$ the estimator $S^2$ is typically no longer unbiased and consistent for $\sigma^2$.
Up to this point, we investigated the properties of the LS estimator under the classical LRM, without assuming a specific parametric distribution for $Y$ or $\varepsilon$. In the classical LRM, we now introduce the additional assumption that
\[
\varepsilon \sim N(0, \sigma^2 I) \qquad\Longleftrightarrow\qquad Y \sim N(x\beta, \sigma^2 I).
\]
Under this normality assumption the following exact finite-sample results hold:
1. $\hat\beta \sim N\bigl(\beta, \sigma^2(x'x)^{-1}\bigr)$;
2. $(n-k)S^2/\sigma^2 \sim \chi^2_{(n-k)}$;
3. $\hat\beta$ and $S^2$ are independent.
To see result 2., note that
\[
\frac{(n-k)S^2}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2},
\]
where $M$ is a symmetric and idempotent matrix with $\mathrm{rk}(M) = \mathrm{tr}(M) = n-k$.³ Using the spectral decomposition of $M$, i.e. $M = P\Lambda P'$, where $\Lambda$ and $P$ are, respectively, the diagonal matrix of eigenvalues and the matrix of eigenvectors of $M$, we obtain
\[
\frac{(n-k)S^2}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2}
= \underbrace{\frac{\varepsilon'P}{\sigma}}_{Z'}\;\Lambda\;\underbrace{\frac{P'\varepsilon}{\sigma}}_{Z} = Z'\Lambda Z,
\qquad\text{where}\quad
Z \sim N\Bigl(0, \tfrac{1}{\sigma^2}\underbrace{P'\,\sigma^2 I\,P}_{P'P=I}\Bigr) = N(0, I).
\]
Since $M$ is idempotent with rank $n-k$, its eigenvalues are $n-k$ ones and $k$ zeros, so that
\[
\frac{(n-k)S^2}{\sigma^2} = Z'\begin{pmatrix} I^{(n-k)} & 0\\ 0 & 0\end{pmatrix}Z
= \sum_{i=1}^{n-k} Z_i^2 \sim \chi^2_{(n-k)}, \qquad Z_i\ \text{iid}\ N(0,1).
\]
3. The independence of $\hat\beta$ and $S^2$ follows from an application of suitable theorems from Adv. Stat. I.⁴
From the fact that $(n-k)S^2/\sigma^2 \sim \chi^2_{(n-k)}$ it follows that the variance estimator $S^2$ is Gamma distributed.⁵ The first two moments of $S^2$ obtain as
\[
E\Bigl(\frac{n-k}{\sigma^2}S^2\Bigr) = n-k \quad\text{so that}\quad ES^2 = \sigma^2,
\qquad
\mathrm{var}\Bigl(\frac{n-k}{\sigma^2}S^2\Bigr) = 2(n-k) \quad\text{so that}\quad
\mathrm{var}(S^2) = \frac{2(n-k)\sigma^4}{(n-k)^2} = \frac{2\sigma^4}{n-k}.
\]
We now come to a further important additional property of the LS estimator that results when the disturbances are normally distributed. In that case, $\hat\beta$ is not only the BLUE but also the MVUE, as stated in the following theorem.
Theorem (MVUE property under normality) Under the classical assumptions of the LRM with normally distributed disturbances, $\hat\beta$ and $S^2$ are the MVUEs of $\beta$ and $\sigma^2$.
³ See our discussion above of the unbiasedness of $S^2$.
⁴ The linear form $Bx$ and the quadratic form $x'Ax$ are independent if $x \sim N(0,\sigma^2 I)$ and $BA = 0$, which can be used to prove that the sample mean $\bar X_n$ and the sample variance $S_n^2$ from a normal population are independent.
⁵ See also Adv. Stat. I on the distribution of the sample variance from a normal population.
Proof
Note that the normality assumption for the disturbances implies that the vector of the random sample variables $Y$ follows a $N(x\beta,\sigma^2 I)$-distribution, which belongs to the exponential class of distributions. The form of the joint pdf of $Y$ indexed by $\theta = (\beta',\sigma^2)'$ is
\[
f(y;\theta) = \frac{1}{(2\pi)^{n/2}|\sigma^2 I|^{1/2}}\exp\Bigl\{-\frac{1}{2\sigma^2}(y-x\beta)'(y-x\beta)\Bigr\}
= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\Bigl\{-\frac{1}{2\sigma^2}\beta'x'x\beta\Bigr\}
\exp\Bigl\{\underbrace{-\frac{1}{2\sigma^2}\,y'y}_{c_1(\theta)g_1(y)} + \underbrace{\frac{1}{\sigma^2}\beta'x'y}_{c_2(\theta)'g_2(y)}\Bigr\}.
\]
Hence
\[
g_1(Y) = Y'Y \qquad\text{and}\qquad g_2(Y) = x'Y
\]
are complete sufficient statistics for estimating $\beta$ and $\sigma^2$, since the range of
\[
[c_1(\theta), c_2(\theta)']' = \bigl[-\tfrac{1}{2\sigma^2},\ \tfrac{1}{\sigma^2}\beta'\bigr]'
\]
contains an open $(k+1)$-dimensional rectangle. Then since
\[
\hat\beta = (x'x)^{-1}x'Y
\qquad\text{and}\qquad
S^2 = \frac{1}{n-k}(Y-x\hat\beta)'(Y-x\hat\beta) = \frac{1}{n-k}\bigl(Y'Y - Y'x(x'x)^{-1}x'Y\bigr)
\]
are unbiased functions of the complete sufficient statistics $g_1(Y)$ and $g_2(Y)$, it follows from the Lehmann-Scheffé completeness theorem that $\hat\beta$ and $S^2$ are the MVUEs of $\beta$ and $\sigma^2$. $\Box$
Under the normality assumption, the covariance of the estimator $\hat\beta$ given by $\sigma^2(x'x)^{-1}$ also achieves the CRLB, whereas the variance of $S^2$ given by $2\sigma^4/(n-k)$ does not. To show this, consider the multivariate form of the CRLB for an unbiased estimator of the $m$-dimensional $q(\theta)$:⁶
\[
\mathrm{Cov}(T) \ \ge\ \frac{\partial q(\theta)}{\partial\theta'}\Bigl[-E\frac{\partial^2\ln f(Y;\theta)}{\partial\theta\,\partial\theta'}\Bigr]^{-1}\Bigl(\frac{\partial q(\theta)}{\partial\theta'}\Bigr)'
\qquad\text{(under the information equality)}.
\]
Since under the classical LRM with normally distributed disturbances the joint pdf $f(Y;\theta)$ is a member of the exponential class, the information equality holds true. Furthermore note that for $q(\theta) = \theta = (\beta',\sigma^2)'$, we have $\partial q(\theta)/\partial\theta' = I^{(k+1)}$, and
\[
\frac{\partial\ln f}{\partial\beta} = \frac{1}{\sigma^2}x'(y-x\beta),
\qquad
\frac{\partial\ln f}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y-x\beta)'(y-x\beta).
\]
The second derivatives and their expectations are
\[
\frac{\partial^2\ln f}{\partial\beta\,\partial\beta'} = -\frac{1}{\sigma^2}x'x
\quad\text{with}\quad -E\Bigl[\frac{\partial^2\ln f}{\partial\beta\,\partial\beta'}\Bigr] = \frac{1}{\sigma^2}x'x,
\]
\[
\frac{\partial^2\ln f}{\partial\beta\,\partial\sigma^2} = -\frac{1}{\sigma^4}x'(y-x\beta)
\quad\text{with}\quad -E\Bigl[\frac{\partial^2\ln f}{\partial\beta\,\partial\sigma^2}\Bigr] = \frac{1}{\sigma^4}x'E(\varepsilon) = 0,
\]
\[
\frac{\partial^2\ln f}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(y-x\beta)'(y-x\beta)
\quad\text{with}\quad -E\Bigl[\frac{\partial^2\ln f}{\partial(\sigma^2)^2}\Bigr] = -\frac{n}{2\sigma^4} + \frac{n\sigma^2}{\sigma^6} = \frac{n}{2\sigma^4}.
\]
Hence the information matrix and its inverse are
\[
I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2}x'x & 0\\ 0 & \frac{n}{2\sigma^4}\end{pmatrix},
\qquad
I(\theta)^{-1} = \begin{pmatrix} \sigma^2(x'x)^{-1} & 0\\ 0 & \frac{2\sigma^4}{n}\end{pmatrix}.
\]
Thus the CRLB for unbiased estimation of $\beta$ is $\sigma^2(x'x)^{-1}$, which is attained by $\mathrm{Cov}(\hat\beta)$, whereas the CRLB for unbiased estimation of $\sigma^2$ is $2\sigma^4/n$, and
\[
\mathrm{var}(S^2) = \frac{2\sigma^4}{n-k} > \frac{2\sigma^4}{n},
\]
so that $S^2$ does not attain the CRLB.
⁶ This multivariate form of the CRLB obtains by a straightforward extension of the univariate CRLB; see also Mittelhammer (1996, Theorem 7.16).
If the likelihood $L(\theta;x)$ is differentiable w.r.t. $\theta$ and has a maximum in the interior of the parameter space $\Omega$, the ML estimate $\hat\theta$ can be found as the solution of the 1st-order conditions (f.o.c.) for the maximizing value of $\theta$, i.e.,
\[
\frac{\partial L(\theta;x)}{\partial\theta} = \begin{pmatrix} \partial L(\theta;x)/\partial\theta_1\\ \vdots\\ \partial L(\theta;x)/\partial\theta_k\end{pmatrix} = 0.
\]
The 2nd-order condition for the maximizing value requires that the Hessian $\partial^2 L(\theta;x)/\partial\theta\,\partial\theta'$ is a negative definite matrix.
Note that it may not be possible to explicitly solve the f.o.c., consisting of a system of $k$ equations in the $k$ unknown estimates $\hat\theta_1,\ldots,\hat\theta_k$. In this case, numerical methods are required to find a $\hat\theta$ that satisfies the f.o.c.. Even if there is no interior solution or if the likelihood is not differentiable, a value $\hat\theta$ that solves $\max_{\theta\in\Omega} L(\theta;x)$ is a ML estimate, no matter how it is obtained.
Note that in some situations the (numerical) calculations are simplified by maximizing $\ln L(\theta;x)$ instead of $L(\theta;x)$; since the logarithm is a strictly increasing transformation, both functions are maximized by the same value of $\theta$.
Example 3.3 Let $x = (X_1,\ldots,X_n)$ be a random sample from an exponential population with pdf
\[
f(x;\theta) = \tfrac{1}{\theta}\exp\bigl(-\tfrac{x}{\theta}\bigr), \qquad x\in(0,\infty),\ \Omega = (0,\infty).
\]
Then the likelihood function is given by
\[
L(\theta;x) \propto f(x;\theta) = \prod_{i=1}^n f(x_i;\theta) = \frac{1}{\theta^n}\exp\Bigl(-\frac{1}{\theta}\sum_{i=1}^n x_i\Bigr),
\qquad
\ln L(\theta;x) = -n\ln\theta - \frac{1}{\theta}\sum_{i=1}^n x_i.
\]
The f.o.c. is
\[
\frac{d\ln L(\theta;x)}{d\theta} = -\frac{n}{\theta} + \frac{\sum_{i=1}^n x_i}{\theta^2} = 0,
\]
which is solved by the ML estimate $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$, so that the MLE is $\hat\theta = \frac{1}{n}\sum_{i=1}^n X_i = \bar X_n$. The 2nd-order condition is satisfied, since
\[
\frac{d^2\ln L(\theta;x)}{d\theta^2}\bigg|_{\theta=\hat\theta}
= \frac{n}{\hat\theta^2} - \frac{2\sum_{i=1}^n x_i}{\hat\theta^3}
= \frac{n}{\hat\theta^2} - \frac{2n\hat\theta}{\hat\theta^3} = -\frac{n}{\hat\theta^2} < 0.
\]
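A small numerical illustration of this closed-form MLE (simulated data; the true parameter value and sample size are assumptions made only for the illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 2.0
x = rng.exponential(scale=theta_true, size=500)   # scale parameter equals theta here
theta_hat = x.mean()                              # MLE = (1/n) * sum(x_i)
print(theta_hat)                                  # close to theta_true = 2.0
```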
Example 3.4 Let $x = (X_1,\ldots,X_n)$ be a random sample from a normal population with pdf
\[
f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Bigl\{-\frac{(x-\mu)^2}{2\sigma^2}\Bigr\},
\qquad x\in(-\infty,\infty),\ \Omega = (-\infty,\infty)\times(0,\infty).
\]
The log-likelihood is
\[
\ln L(\mu,\sigma^2;x) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2,
\]
with f.o.c.s
\[
\frac{\partial\ln L(\mu,\sigma^2;x)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0,
\qquad
\frac{\partial\ln L(\mu,\sigma^2;x)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2[\sigma^2]^2}\sum_{i=1}^n (x_i-\mu)^2 = 0,
\]
which are solved by
\[
\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i \qquad\text{and}\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2.
\]
Thus the sample mean $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ and the sample variance $S_n^2 = \frac{1}{n}\sum_{i=1}^n(X_i-\bar X_n)^2$ are the MLEs of the population mean and variance for a random sample from a normal population.
To see that the 2nd-order condition for a maximizer is met, compute the Hessian of $\ln L(\theta;x)$ with $\theta = (\mu,\sigma^2)'$. There we have
\[
\frac{\partial^2\ln L}{\partial\mu^2} = -\frac{n}{\sigma^2},
\qquad
\frac{\partial^2\ln L}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n (x_i-\mu),
\qquad
\frac{\partial^2\ln L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^n (x_i-\mu)^2.
\]
Evaluated at $(\hat\mu,\hat\sigma^2)$ these are $-n/\hat\sigma^2$, $0$, and $n/(2\hat\sigma^4) - n\hat\sigma^2/\hat\sigma^6 = -n/(2\hat\sigma^4)$, so that
\[
\frac{\partial^2\ln L(\hat\theta;x)}{\partial\theta\,\partial\theta'} =
\begin{pmatrix} -\frac{n}{\hat\sigma^2} & 0\\ 0 & -\frac{n}{2\hat\sigma^4}\end{pmatrix},
\]
which is a negative definite matrix.
Example 3.5 Let $x = (X_1,\ldots,X_n)$ be a random sample from a uniform population with pdf
\[
f(x;a,b) = \frac{1}{b-a}\,I_{[a,b]}(x),
\]
so that the likelihood is
\[
L(a,b;x) = \Bigl(\frac{1}{b-a}\Bigr)^n \prod_{i=1}^n I_{[a,b]}(x_i).
\]
Note that $L(\cdot)$ is monotonically increasing in $a$, while the admissible values of $a$ are bounded above by the smallest order statistic $\min(x_1,\ldots,x_n)$, and $L(\cdot)$ is monotonically decreasing in $b$, while the admissible values of $b$ are bounded below by the largest order statistic $\max(x_1,\ldots,x_n)$. Thus the ML estimates are
\[
\hat a = \min(x_1,\ldots,x_n) \qquad\text{and}\qquad \hat b = \max(x_1,\ldots,x_n).
\]
Recall from the discussion of the CRLB that an unbiased estimator $t(x)$ of $\theta$ attains the CRLB iff the score is proportional to the estimation error, $\partial\ln f(x;\theta)/\partial\theta = K(\theta,n)[t(x)-\theta]$, i.e. iff
\[
t(x) = \theta + \frac{1}{K(\theta,n)}\,\frac{\partial\ln f(x;\theta)}{\partial\theta},
\]
with a variance
\[
\mathrm{var}[t(x)] = \frac{1}{K(\theta,n)^2}\,\mathrm{var}\Bigl[\frac{\partial\ln f(x;\theta)}{\partial\theta}\Bigr]
= \underbrace{\frac{1}{E\bigl[(\partial\ln f(x;\theta)/\partial\theta)^2\bigr]}}_{\text{CRLB}},
\]
which requires $K(\theta,n) = E\bigl[(\partial\ln f(x;\theta)/\partial\theta)^2\bigr]$
(since $E[\partial\ln f/\partial\theta] = \int(\partial\ln f/\partial\theta)f\,dx = \int(\partial f/\partial\theta)\,dx = \tfrac{\partial}{\partial\theta}\int f\,dx = 0$, so that $\mathrm{var}[\partial\ln f/\partial\theta] = E[(\partial\ln f/\partial\theta)^2]$).
Accounting for this functional form of $K$, we obtain for an unbiased estimator of $\theta$ attaining the CRLB the form
\[
t(x) = \theta + \frac{\partial\ln f(x;\theta)/\partial\theta}{E\bigl[(\partial\ln f(x;\theta)/\partial\theta)^2\bigr]}
= \theta + \frac{\partial\ln L(\theta;x)/\partial\theta}{E\bigl[(\partial\ln L(\theta;x)/\partial\theta)^2\bigr]}.
\]
Since this identity holds for every $\theta$, it holds in particular at $\theta = \hat\theta$, where the MLE satisfies the f.o.c. $\partial\ln L(\hat\theta;x)/\partial\theta = 0$, so that $t(x) = \hat\theta$. Then outcomes of the MLE and $t(x)$ coincide with probability 1. The extension of the corresponding arguments to the multivariate case is straightforward; for details see Mittelhammer (1996, p. 470). $\Box$
Theorem 3.8 says that
- if there exists an unbiased estimator for $\theta$ which attains the CRLB, and
- if the MLE is defined by solving the f.o.c. for maximizing the likelihood,
then the MLE will be the MVUE for $\theta$.
This raises the question whether or not the MLE will still be the MVUE in cases where there is no unbiased estimator attaining the CRLB. Hints to address this issue are provided by the following theorem.
Proof
The Neyman factorization theorem states that we can decompose the likelihood function as
\[
L(\theta;x) \propto f(x;\theta) = g(s_1,\ldots,s_r;\theta)\cdot h(x),
\]
where
- $S_1,\ldots,S_r$ are sufficient statistics (which can be complete sufficient statistics if they exist);
- $g$ and $h$ are nonnegative-valued functions;
- $h$ is independent of $\theta$.
It follows that for a given value of $x$, maximizing $L(\theta;x)$ w.r.t. $\theta$ is equivalent to maximizing $g(s_1,\ldots,s_r;\theta)$, which depends on the data only through $(s_1,\ldots,s_r)$. Hence, if the MLE is unique, then $\hat\theta$ must be a function of the sufficient statistics $(S_1,\ldots,S_r)$. $\Box$
If the sufficient statistics $S_1,\ldots,S_r$ used in the Neyman factorization are complete, then the unique MLE $\hat\theta$ is a function of the complete sufficient statistics, by Theorem 3.9. It follows from the Lehmann-Scheffé completeness theorem that if the MLE $\hat\theta$ is additionally unbiased, then the MLE is the MVUE for $\theta$. Thus if there exist complete sufficient statistics and if the MLE is unique and unbiased, then the MLE is the MVUE.
Example 3.6 Let $x = (X_1,\ldots,X_n)$ be a random sample from an exponential population with pdf
\[
f(x;\theta) = \tfrac{1}{\theta}\exp\bigl(-\tfrac{x}{\theta}\bigr), \qquad x\in(0,\infty),\ \Omega = (0,\infty),
\]
with MLE $\hat\theta = \frac{1}{n}\sum_{i=1}^n X_i$ and $E\hat\theta = \theta$.
Since the joint pdf of the random sample variables belongs to the exponential class and has the form
\[
f(x;\theta) = \exp\Bigl\{\underbrace{-\tfrac{1}{\theta}}_{c(\theta)}\underbrace{\textstyle\sum_{i=1}^n x_i}_{g(x)} - n\ln\theta\Bigr\},
\]
$g(x) = \sum_{i=1}^n x_i$ is a complete sufficient statistic. By Theorem 3.9 the MLE $\hat\theta$ is a function of the complete sufficient statistic $\sum_{i=1}^n X_i$. Since the MLE is additionally unbiased, it is the MVUE of $\theta$.
Example 3.7 Recall the example discussing the MLE for the parameter $\theta$ of an exponential population. There we found that the MLE for $\theta$ is given by
\[
\hat\theta = \hat\theta(x) = \frac{1}{n}\sum_{i=1}^n X_i.
\]
By application of Khinchin's WLLN and the CLT of Lindberg-Levy to this MLE (available as an explicit function of $x$) we can directly establish that
\[
\mathrm{plim}\,\hat\theta = \theta \qquad\text{and}\qquad \hat\theta \stackrel{a}{\sim} N(\theta, \theta^2/n).
\]
Example 3.8 Let $x = (X_1,\ldots,X_n)$ be a random sample from a Gamma population with pdf
\[
f(x;\theta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,x^{\alpha-1}\exp\bigl(-\tfrac{x}{\beta}\bigr),
\qquad x\in(0,\infty),\ \theta = (\alpha,\beta).
\]
The likelihood function is
\[
L(\theta;x) = \frac{1}{\beta^{n\alpha}\Gamma(\alpha)^n}\prod_{i=1}^n x_i^{\alpha-1}\exp\Bigl(-\frac{\sum_{i=1}^n x_i}{\beta}\Bigr),
\]
with log-likelihood
\[
\ln L = -n\alpha\ln\beta - n\ln\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^n\ln(x_i) - \frac{1}{\beta}\sum_{i=1}^n x_i.
\]
The f.o.c.s are
\[
\frac{\partial\ln L}{\partial\alpha} = -n\ln\beta - n\,\frac{d\Gamma(\alpha)/d\alpha}{\Gamma(\alpha)} + \sum_{i=1}^n\ln(x_i) = 0,
\qquad
\frac{\partial\ln L}{\partial\beta} = -\frac{n\alpha}{\beta} + \frac{1}{\beta^2}\sum_{i=1}^n x_i = 0.
\]
Note that there is no explicit solution of this nonlinear system of f.o.c.s for $\hat\theta$ in terms of $x$, although $\hat\theta$ is implicitly a function of $x$. A unique value for $\hat\theta$ satisfying the f.o.c.s needs to be obtained numerically.
Since an explicit functional form for the MLE is not identifiable, an analysis of consistency and asymptotic normality using the first approach is quite difficult. In such cases, we might try to identify appropriate regularity conditions on $L(\theta;x)$ that represent sufficient conditions for the MLE to be consistent and asymptotically normal and that are met by the specific likelihood $L(\theta;x)$ under consideration.
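A hedged numerical sketch of the Gamma MLE just described: the negative log-likelihood is minimized with a generic optimizer, since the f.o.c.s have no closed-form solution. The simulated data, starting values, and the choice of the Nelder-Mead method are illustrative assumptions, not prescriptions from the notes.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.5, scale=1.5, size=1000)   # simulated Gamma(alpha, beta) sample

def negloglik(params):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:                  # keep the parameters in the admissible region
        return np.inf
    return -np.sum((alpha - 1) * np.log(x) - x / beta
                   - gammaln(alpha) - alpha * np.log(beta))

res = minimize(negloglik, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
print(res.x)                                     # numerical ML estimates of (alpha, beta)
```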
Theorem 3.10 (MLE Consistency - iid and scalar case) Under regularity conditions (R1)-(R4) on the likelihood of the iid random sample (see Mittelhammer, 1996, for their precise statement), the ML estimate $\hat\theta(x)$ exists for all $x$, and
\[
\hat\theta \xrightarrow{p} \theta_0 \quad (\text{the true value of }\theta),
\]
so that the MLE is consistent for $\theta$.
Proof
For any $\delta > 0$, let
\[
\theta_\ell = \theta_0 - \delta, \qquad \theta_h = \theta_0 + \delta,
\]
so that $\theta_\ell, \theta_h \neq \theta_0$. Define
\[
\Lambda_n(x;\theta) = \frac{1}{n}\sum_{i=1}^n\bigl[\ln f(x_i;\theta) - \ln f(x_i;\theta_0)\bigr]
= \frac{1}{n}\sum_{i=1}^n\ln\frac{f(x_i;\theta)}{f(x_i;\theta_0)}
\]
and the event $A_n(\theta) = \{x : \Lambda_n(x;\theta) < 0\}$ that the log-likelihood at $\theta_0$ exceeds the log-likelihood at $\theta$. Because $\Lambda_n(x;\theta)$ is the sample mean of iid random variables, Khinchin's WLLN and Jensen's inequality imply that
\[
\Lambda_n(x;\theta) \xrightarrow{p} E\ln\frac{f(X_i;\theta)}{f(X_i;\theta_0)}
< \ln E\frac{f(X_i;\theta)}{f(X_i;\theta_0)} = \ln(1) = 0
\qquad\Bigl(\text{since }\int\frac{f(\cdot;\theta)}{f(\cdot;\theta_0)}f(\cdot;\theta_0)\,dx = \int f(\cdot;\theta)\,dx = 1\Bigr),
\]
so that
\[
\lim_{n\to\infty} P[x\in A_n(\theta)] = 1 \qquad\text{when}\quad \theta \neq \theta_0.
\]
Since the last equation holds for all $\theta$'s, including $\theta_\ell$ and $\theta_h$, this in turn implies that
\[
\lim_{n\to\infty} P[x\in A_n(\theta_\ell)\cap A_n(\theta_h)] = \lim_{n\to\infty} P[x\in A_n(\theta_\ell)] = \lim_{n\to\infty} P[x\in A_n(\theta_h)] = 1
\]
(since by Bonferroni's inequality $P(A_\ell\cap A_h) \ge 1 - P(\bar A_\ell) - P(\bar A_h)$, and $\lim_{n\to\infty}P(\bar A_\ell) = 0$ and $\lim_{n\to\infty}P(\bar A_h) = 0$). Hence, with probability converging to one, the log-likelihood at $\theta_0$ exceeds its values at $\theta_0-\delta$ and $\theta_0+\delta$, so that (under the regularity conditions) the maximizer $\hat\theta$ lies in $(\theta_0-\delta,\theta_0+\delta)$. Since $\delta > 0$ was arbitrary, $\hat\theta \xrightarrow{p} \theta_0$. $\Box$
Theorem 3.10 establishing the consistency of the MLE is based upon the assumption that the random sample variables are iid, which in many practical applications is hard to justify. This fairly restrictive iid assumption is not needed if we add to the list of the 4 regularity conditions in Theorem 3.10 the (fairly weak) condition
\[
\text{R5.}\qquad \lim_{n\to\infty} P\bigl[\ln L(\theta_0;x) > \ln L(\theta;x)\bigr] = 1 \qquad\text{for}\quad \theta \neq \theta_0.
\]
This condition essentially requires that the likelihood is such that as $n\to\infty$ the true value $\theta_0$ maximizes the likelihood (and hence satisfies the definition of the ML estimate) with probability 1. For further details see Mittelhammer (1996, Theorem 8.14).
Theorem 3.10 establishes the consistency of the MLE in the scalar case. The extension of the arguments to the multivariate case, when $\theta$ is a $k$-dimensional vector, is straightforward. The following theorem (Theorem 3.11), whose regularity conditions (M1)-(M4) are the multivariate analogues of (R1)-(R4) with open neighborhoods $N_\delta(\theta)$ replacing intervals, establishes such a multivariate extension. (Here $N_\delta(\theta)$ is an open interval, the interior of a circle, the interior of a sphere, and the interior of a hypersphere in 1, 2, 3, and 4 or more dimensions, respectively.)
Proof
See Mittelhammer (1996, p. 477-478).
Theorem 3.12 (MLE Asymptotic Normality - iid and scalar case) In addition to conditions (R1)-(R4) of Theorem 3.10 assume that
R6. $\partial^2\ln L(\theta;x)/\partial\theta^2$ exists and is continuous in $\theta$ for all $x$;
R7. for any sequence $\theta_n^* \xrightarrow{p} \theta_0$,
$\;\frac{1}{n}\,\partial^2\ln L(\theta_n^*;x)/\partial\theta^2 \xrightarrow{p} E\bigl[\partial^2\ln f(X_i;\theta_0)/\partial\theta^2\bigr] = H(\theta_0) \neq 0$.
Then
\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N\Bigl(0, \frac{E\bigl[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\bigr]}{H(\theta_0)^2}\Bigr)
\qquad\text{and}\qquad
\hat\theta \stackrel{a}{\sim} N\Bigl(\theta_0, \frac{1}{n}\,\frac{E\bigl[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\bigr]}{H(\theta_0)^2}\Bigr).
\]
Proof
The f.o.c. for maximizing the likelihood implies
\[
\frac{\partial\ln L(\hat\theta;x)}{\partial\theta} = 0.
\]
A first-order Taylor (mean value) expansion of this score around $\theta_0$ gives
\[
0 = \frac{\partial\ln L(\theta_0;x)}{\partial\theta} + \frac{\partial^2\ln L(\theta^*;x)}{\partial\theta^2}(\hat\theta - \theta_0),
\qquad \theta^* = \lambda\hat\theta + (1-\lambda)\theta_0 \ \text{ for some }\ \lambda\in[0,1].
\]
Solving for $\sqrt{n}(\hat\theta-\theta_0)$ and using the iid structure yields
\[
\sqrt{n}(\hat\theta - \theta_0)
= -\Bigl[\frac{1}{n}\frac{\partial^2\ln L(\theta^*;x)}{\partial\theta^2}\Bigr]^{-1}\Bigl[\frac{1}{\sqrt{n}}\frac{\partial\ln L(\theta_0;x)}{\partial\theta}\Bigr]
= -\underbrace{\Bigl[\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ln f(X_i;\theta^*)}{\partial\theta^2}\Bigr]^{-1}}_{[U_n]^{-1}}
\underbrace{\Bigl[\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Bigr]}_{W_n}.
\]
Regarding the limiting behavior of the first term, note that $\hat\theta \xrightarrow{p} \theta_0$ implies $\theta^* = \lambda\hat\theta + (1-\lambda)\theta_0 \xrightarrow{p} \theta_0$. Hence, by (R7),
\[
U_n = \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ln f(X_i;\theta^*)}{\partial\theta^2}
\xrightarrow{p} E\Bigl[\frac{\partial^2\ln f(X_i;\theta_0)}{\partial\theta^2}\Bigr] = H(\theta_0) \neq 0.
\]
Regarding the second term $W_n$, note that the iid assumption for the $X_i$'s implies that the summands $\{\partial\ln f(X_i;\theta_0)/\partial\theta\}$ are iid random variables with mean zero (since $E[\partial\ln f/\partial\theta] = \partial[\int f\,dx]/\partial\theta = 0$), so that by the Lindberg-Levy CLT
\[
W_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}
\xrightarrow{d} W \sim N\Bigl(0, E\Bigl[\Bigl(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Bigr)^2\Bigr]\Bigr).
\]
Hence, by Slutsky's theorem,
\[
\sqrt{n}(\hat\theta - \theta_0) = -[U_n]^{-1}W_n \xrightarrow{d} -[H(\theta_0)]^{-1}W
\sim N\Bigl(0, \frac{E\bigl[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\bigr]}{H(\theta_0)^2}\Bigr). \qquad\Box
\]
If the joint pdf defining the likelihood function meets the conditions for the information equality, then
\[
E\Bigl[\Bigl(\frac{\partial\ln f(X_i;\theta)}{\partial\theta}\Bigr)^2\Bigr] = -E\Bigl[\frac{\partial^2\ln f(X_i;\theta)}{\partial\theta^2}\Bigr] = -H(\theta),
\]
so that
\[
\sqrt{n}(\hat\theta-\theta_0) \xrightarrow{d} N\bigl(0, [-H(\theta_0)]^{-1}\bigr)
= N\Bigl(0, \Bigl\{E\Bigl[\Bigl(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Bigr)^2\Bigr]\Bigr\}^{-1}\Bigr)
\]
and
\[
\hat\theta \stackrel{a}{\sim} N\bigl(\theta_0, [-n\,H(\theta_0)]^{-1}\bigr)
= N\Bigl(\theta_0, \underbrace{\Bigl\{n\,E\Bigl[\Bigl(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Bigr)^2\Bigr]\Bigr\}^{-1}}_{\text{CRLB}}\Bigr).
\]
In such cases, the MLE is the best asymptotically normal estimator, i.e. asymptotically efficient.
In general, an estimator $T$ of $\theta$ is said to be asymptotically efficient if it has an asymptotic normal distribution with mean $\theta$ and a variance equal to the CRLB (see Mittelhammer, 1996, Definition 7.22).
Note that the variance of the asymptotic distribution of the MLE is unknown, since it depends via
\[
H(\theta_0) = E\Bigl[\frac{\partial^2\ln f(X_i;\theta_0)}{\partial\theta^2}\Bigr]
\qquad\text{and}\qquad
E\Bigl[\Bigl(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Bigr)^2\Bigr]
\]
on the unknown parameter $\theta_0$ that must be estimated. Consistent estimates for the variance of the asymptotic distribution of MLEs (which are necessary for asymptotic hypothesis tests) can be obtained by replacing the two unknown expectations by consistent estimates, given by
\[
\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ln f(X_i;\hat\theta)}{\partial\theta^2}
\qquad\text{and}\qquad
\frac{1}{n}\sum_{i=1}^n\Bigl(\frac{\partial\ln f(X_i;\hat\theta)}{\partial\theta}\Bigr)^2.
\]
Based upon those estimates we can compute the so-called asymptotic standard errors of MLEs.
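An illustrative computation of such an asymptotic standard error (a sketch under the iid assumptions above, using the exponential MLE of Example 3.3; the data are simulated): the sample average of the second derivatives of $\ln f$ evaluated at $\hat\theta$ estimates $H(\theta_0)$, and the resulting standard error equals $\hat\theta/\sqrt{n}$ in this model.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = 2.0
x = rng.exponential(scale=theta_true, size=400)
n = x.size
theta_hat = x.mean()                              # exponential MLE

# second derivatives of ln f(x_i; theta) = -ln theta - x_i/theta at theta_hat
d2 = 1.0 / theta_hat**2 - 2.0 * x / theta_hat**3
H_hat = d2.mean()                                 # estimates H(theta_0) = -1/theta_0^2
se = np.sqrt(-1.0 / (n * H_hat))                  # asymptotic standard error
print(theta_hat, se, theta_hat / np.sqrt(n))      # the last two coincide here
```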
Example 3.9 Let $x = (X_1,\ldots,X_n)$ be a random sample from an exponential population with pdf
\[
f(x;\theta) = \tfrac{1}{\theta}\exp\bigl(-\tfrac{x}{\theta}\bigr), \qquad x\in(0,\infty),\ \Omega = (0,\infty),
\]
and a log-likelihood function
\[
\ln L(\theta;x) = -n\ln\theta - \frac{1}{\theta}\sum_{i=1}^n x_i.
\]
The f.o.c.
\[
\frac{d\ln L(\theta;x)}{d\theta} = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^n x_i = 0
\]
has a unique solution defining the ML estimate $\hat\theta = \bigl(\sum_{i=1}^n x_i\bigr)/n = \bar x_n$. Hence, by Theorem 3.10, the MLE $\hat\theta = \bar X_n$ is consistent.
As to the regularity conditions (R6) and (R7) of Theorem 3.12:
R6. The second derivative, given by
\[
\frac{\partial^2\ln L(\theta;x)}{\partial\theta^2} = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{i=1}^n x_i,
\]
exists and is continuous in $\theta$ for all $x$.
R7. For any $\theta^* \xrightarrow{p} \theta_0$,
\[
\mathrm{plim}\,\frac{1}{n}\frac{\partial^2\ln L(\theta^*;x)}{\partial\theta^2}
= \mathrm{plim}\Bigl[\frac{1}{[\theta^*]^2} - \frac{2}{[\theta^*]^3}\cdot\frac{1}{n}\sum_{i=1}^n X_i\Bigr]
= \frac{1}{\theta_0^2} - \frac{2\theta_0}{\theta_0^3} = -\frac{1}{\theta_0^2} = H(\theta_0) \neq 0.
\]
Hence, it follows by Theorem 3.12 that the MLE $\hat\theta$ is such that
\[
\hat\theta \stackrel{a}{\sim} N\Bigl(\theta_0, \frac{1}{n}\,\frac{E\bigl[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\bigr]}{H(\theta_0)^2}\Bigr).
\]
With regard to the particular form of the asymptotic variance note that
\[
H(\theta_0)^2 = \Bigl(\frac{1}{\theta_0^2}\Bigr)^2 = \frac{1}{\theta_0^4},
\]
and
\[
E\Bigl[\Bigl(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Bigr)^2\Bigr]
= E\Bigl[\Bigl(-\frac{1}{\theta_0} + \frac{X_i}{\theta_0^2}\Bigr)^2\Bigr]
= E\Bigl[\frac{1}{\theta_0^2} - \frac{2X_i}{\theta_0^3} + \frac{X_i^2}{\theta_0^4}\Bigr]
= \frac{1}{\theta_0^2} - \frac{2}{\theta_0^2} + \frac{2}{\theta_0^2} = \frac{1}{\theta_0^2} = -H(\theta_0),
\]
so that the asymptotic variance is $\frac{1}{n}\cdot\frac{1/\theta_0^2}{1/\theta_0^4} = \frac{\theta_0^2}{n}$, i.e.
\[
\hat\theta \stackrel{a}{\sim} N\Bigl(\theta_0, \frac{\theta_0^2}{n}\Bigr).
\]
Theorem 3.12 establishes the asymptotic normality of the MLE in the scalar and iid case. The following theorem provides an extension to the multivariate case when $\theta$ is a $k$-dimensional vector and to the case where the random sample variables are not iid.
Theorem 3.13 (MLE Asymptotic Normality - Sufficient Conditions) In addition to conditions (M1)-(M4) of Theorem 3.11, assume
M5. $\partial^2\ln L(\theta;x)/\partial\theta\,\partial\theta'$ exists and is continuous in $\theta$ for all $x$;
together with two further conditions (M6) and (M7). Then the MLE $\hat\theta$ is asymptotically normally distributed.
Proof
See Mittelhammer (1996, p. 480-482).
The regularity conditions (M5) and (M6) correspond to the conditions (R6) and (R7) used for the univariate case in Theorem 3.12, while condition (M7) replaces the iid assumption used in Theorem 3.12.
Theorem (Invariance of the MLE) If $\hat\theta$ is the MLE of $\theta$, then for any function $q(\cdot)$ the MLE of $\gamma = q(\theta)$ is $\hat\gamma = q(\hat\theta)$.⁸
Proof
Consider for $\gamma = q(\theta)$ the induced likelihood, say $L^*$, defined as the maximum likelihood value which can be achieved for the set of $\theta$-values generating a fixed $\gamma$, i.e., by
\[
L^*(\gamma;x) = \max_{\{\theta:\,q(\theta)=\gamma\}} L(\theta;x).
\]
By definition, the value $\hat\gamma$ that maximizes $L^*(\gamma;x)$ is the MLE of $\gamma = q(\theta)$. Hence, we must show that $L^*(\hat\gamma;x) = L^*[q(\hat\theta);x]$. Note that
\[
L^*(\hat\gamma;x) = \max_{\gamma}\ \max_{\{\theta:\,q(\theta)=\gamma\}} L(\theta;x)
\quad\text{(by definition of } L^*\text{)}
\;=\; \max_{\theta} L(\theta;x)
\;=\; L(\hat\theta;x) \quad\text{(by definition of } \hat\theta\text{)}.
\]
Furthermore
\[
L(\hat\theta;x) = \max_{\{\theta:\,q(\theta)=q(\hat\theta)\}} L(\theta;x)
\quad\text{(since } \hat\theta \text{ is the ML estimate of } \theta\text{)}
\;=\; L^*[q(\hat\theta);x] \quad\text{(by definition of } L^*\text{)}.
\]
Hence $L^*(\hat\gamma;x) = L^*[q(\hat\theta);x]$, so that $q(\hat\theta)$ is the MLE of $q(\theta)$. $\Box$
Example 3.10 We found that the MLE for the parameter $\theta$ of an exponential population is given by $\hat\theta = \frac{1}{n}\sum_{i=1}^n X_i$. Then by the invariance principle, the MLE of $\theta^2$ is given by
\[
\widehat{\theta^2} = \Bigl(\frac{1}{n}\sum_{i=1}^n X_i\Bigr)^2.
\]
⁸ For a multivariate version of this theorem see Mittelhammer (1996, Theorem 8.20).
This raises the question whether or not the properties of the MLE $\hat\theta$, once $\hat\theta$ has been obtained, transfer to the estimator $q(\hat\theta)$. Under fairly general conditions, one can show that
- if $\hat\theta$ is a consistent estimator of $\theta$, then $q(\hat\theta)$ is a consistent estimator of $q(\theta)$, and
- if $\hat\theta$ is asymptotically normal, then $q(\hat\theta)$ is also asymptotically normal.
The first property directly follows from the continuous mapping theorem, which says that
\[
\mathrm{plim}\,q(\hat\theta) = q(\mathrm{plim}\,\hat\theta) = q(\theta).
\]
The second property directly follows from the delta method, which says that a smooth function of a random variable which is asymptotically normal is also asymptotically normal.
Summary of the MLE Properties
We have seen that the ML procedure is a relatively straightforward approach to obtain estimates of parameters $\theta$ indexing the joint pdf of random sample variables, or of any function $q(\theta)$. The MLE possesses the following useful properties, which make the ML approach very attractive:
- If an unbiased estimator of $\theta$ that achieves the CRLB exists, and if the MLE is defined by solving the f.o.c.s, then the MLE achieves the CRLB.
- If an MLE is unique, then the MLE is a function of any set of sufficient statistics, including complete sufficient statistics provided that they exist.
- If an MLE is unique and unbiased, and complete sufficient statistics exist, then the MLE is the MVUE.
- However, an MLE is not necessarily unbiased.
- Under general regularity conditions on the likelihood function, the MLE is consistent, asymptotically normal and asymptotically efficient.
- If $\hat\theta$ is the MLE of $\theta$, then $q(\hat\theta)$ is the MLE of $q(\theta)$.
As an example, consider a random sample $Y_1,\ldots,Y_n$ from a Gamma population with pdf
\[
f(y;\theta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,y^{\alpha-1}\exp\bigl(-\tfrac{y}{\beta}\bigr),
\qquad y\in(0,\infty),\ \theta = (\alpha,\beta).
\]
The first two non-central moments are $EY_t = \alpha\beta$ and $EY_t^2 = \alpha\beta^2(1+\alpha)$, so that at the true parameter values $EY_t = \alpha_0\beta_0$ and $EY_t^2 = \alpha_0\beta_0^2(1+\alpha_0)$. Hence, by setting
\[
g(Y_t,\theta) = \begin{pmatrix} Y_t - \alpha\beta\\ Y_t^2 - \alpha\beta^2(1+\alpha)\end{pmatrix},
\]
we obtain the following moment conditions
\[
Eg(Y_t,\theta_0) = E\begin{pmatrix} Y_t - \alpha_0\beta_0\\ Y_t^2 - \alpha_0\beta_0^2(1+\alpha_0)\end{pmatrix} = 0.
\]
In the case where $\ell = k$, the parameter $\theta$ is exactly identified by the moment conditions $Eg(Y_t,\theta) = 0$. Then the moment conditions represent a set of $k$ equations for $k$ unknowns (see the previous example). Solving these equations would give us the value of $\theta$ which satisfies the moment conditions, i.e. $\theta_0$.
In practice we replace the population moments by their sample counterparts and solve the sample moment conditions $\bar g_n(y,\theta) = 0$ for $\theta$. The estimator implied by this procedure is the MM estimator $\hat\theta$ defined by
\[
\bar g_n(Y,\hat\theta) = \frac{1}{n}\sum_{t=1}^n g(Y_t,\hat\theta) = 0.
\]
In more general contexts with over-identifying moment conditions ($\ell > k$), a generalized MM estimate (GMM) of $\theta$ is obtained by minimizing a weighted distance between $\bar g_n(y,\theta)$ and 0. See below.
Example 3.12 For the Gamma distribution in the previous example the sample moment counterpart to the moment conditions
\[
Eg(Y_t,\theta) = E\begin{pmatrix} Y_t - \alpha\beta\\ Y_t^2 - \alpha\beta^2(1+\alpha)\end{pmatrix} = 0
\]
is given by
\[
\bar g_n(y,\theta) = \frac{1}{n}\sum_{t=1}^n \begin{pmatrix} y_t - \alpha\beta\\ y_t^2 - \alpha\beta^2(1+\alpha)\end{pmatrix}
= \begin{pmatrix} m_1' - \alpha\beta\\ m_2' - \alpha\beta^2(1+\alpha)\end{pmatrix} = 0.
\]
Solving these equations for $\alpha$ and $\beta$, we find that the MM estimates are given by
\[
\hat\alpha = \frac{(m_1')^2}{m_2' - (m_1')^2}
\qquad\text{and}\qquad
\hat\beta = \frac{m_2' - (m_1')^2}{m_1'}.
\]
Example 3.13 Let $Y_1,\ldots,Y_n$ be a random sample from a $N(\mu,\sigma^2)$ population. Based upon the first two non-central moments, we can specify the following moment conditions
\[
Eg(Y_t,\theta) = E\begin{pmatrix} Y_t - \mu\\ Y_t^2 - (\sigma^2 + \mu^2)\end{pmatrix} = 0,
\qquad \theta = (\mu,\sigma^2)'.
\]
The sample moment counterpart is
\[
\bar g_n(y,\theta) = \frac{1}{n}\sum_{t=1}^n \begin{pmatrix} y_t - \mu\\ y_t^2 - (\sigma^2+\mu^2)\end{pmatrix}
= \begin{pmatrix} m_1' - \mu\\ m_2' - (\sigma^2+\mu^2)\end{pmatrix} = 0.
\]
Solving for $\mu$ and $\sigma^2$, we find that the MM estimates are given by
\[
\hat\mu = m_1' \qquad\text{and}\qquad \hat\sigma^2 = m_2' - (m_1')^2.
\]
Note that the MM estimates $\hat\mu$ and $\hat\sigma^2$ are the simple sample mean and the sample variance, respectively.
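A short numerical sketch of Example 3.13 (the data are simulated; the true values are assumptions for the illustration): the MM estimates based on the first two non-central sample moments reduce to the sample mean and the (1/n) sample variance.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(loc=1.0, scale=2.0, size=1000)
m1 = y.mean()               # first non-central sample moment
m2 = np.mean(y**2)          # second non-central sample moment
mu_mm = m1                  # MM estimate of mu
sigma2_mm = m2 - m1**2      # MM estimate of sigma^2 (the 1/n sample variance)
print(mu_mm, sigma2_mm)
```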
In order to discuss the properties of the MM estimator, we consider the following estimation context. It is assumed that the sample variables $Y = (Y_1,\ldots,Y_n)$ are a collection of iid random variables. Given a statistical model for the random sample $\{f(y;\theta),\ \theta\in\Omega\subset\mathbb{R}^k\}$, our interest is to estimate $\theta$. In general the first $k$ non-central moments of $Y_t$ are functions of $\theta$, i.e.
\[
\mu_r' = EY_t^r = h_r(\theta); \qquad r = 1,\ldots,k.
\]
This suggests the $k$ moment conditions
\[
Eg(Y_t,\theta) = E\begin{pmatrix} Y_t - h_1(\theta)\\ Y_t^2 - h_2(\theta)\\ \vdots\\ Y_t^k - h_k(\theta)\end{pmatrix} = 0
\qquad\Longleftrightarrow\qquad
\begin{pmatrix}\mu_1'\\ \mu_2'\\ \vdots\\ \mu_k'\end{pmatrix} =
\begin{pmatrix} h_1(\theta)\\ h_2(\theta)\\ \vdots\\ h_k(\theta)\end{pmatrix} = h(\theta) \quad\text{(say)}.
\]
The corresponding sample moment conditions are
\[
\frac{1}{n}\sum_{t=1}^n \begin{pmatrix} y_t - h_1(\theta)\\ y_t^2 - h_2(\theta)\\ \vdots\\ y_t^k - h_k(\theta)\end{pmatrix} = 0
\qquad\Longleftrightarrow\qquad
\begin{pmatrix} m_1'\\ m_2'\\ \vdots\\ m_k'\end{pmatrix} = h(\theta),
\]
and the solution for $\theta$ defines the MM estimate via the inverse function $h^{-1}$ as
\[
\hat\theta = h^{-1}(m_1',\ldots,m_k').
\]
The corresponding MM estimator is
\[
\hat\theta = h^{-1}(M_1',\ldots,M_k'), \qquad\text{where}\quad M_r' = \frac{1}{n}\sum_{t=1}^n Y_t^r.
\]
Since $M_r' \xrightarrow{p} \mu_r'$ by Khinchin's WLLN, and provided $h^{-1}$ is continuous, the MM estimator is consistent:
\[
\mathrm{plim}\,\hat\theta = h^{-1}(\mu_1',\ldots,\mu_k') = \theta.
\]
Example 3.14 Recall the example where we considered the MM estimator of $(\mu,\sigma^2)$ for a normal population using the first two non-central moments. In this case, the MM estimator is
\[
\begin{pmatrix}\hat\mu\\ \hat\sigma^2\end{pmatrix} = h^{-1}(M_1',M_2') = \begin{pmatrix} M_1'\\ M_2' - (M_1')^2\end{pmatrix},
\qquad\text{with}\qquad
\mathrm{plim}\begin{pmatrix}\hat\mu\\ \hat\sigma^2\end{pmatrix} = \begin{pmatrix} \mu_1'\\ \mu_2' - (\mu_1')^2\end{pmatrix} = \begin{pmatrix}\mu\\ \sigma^2\end{pmatrix}.
\]
Define the $(k\times k)$ Jacobian matrix of $h^{-1}$ evaluated at the population moments,
\[
A(\mu_1',\ldots,\mu_k') =
\begin{pmatrix}
\frac{\partial h_1^{-1}(\mu_1',\ldots,\mu_k')}{\partial\mu_1'} & \cdots & \frac{\partial h_1^{-1}(\mu_1',\ldots,\mu_k')}{\partial\mu_k'}\\
\vdots & & \vdots\\
\frac{\partial h_k^{-1}(\mu_1',\ldots,\mu_k')}{\partial\mu_1'} & \cdots & \frac{\partial h_k^{-1}(\mu_1',\ldots,\mu_k')}{\partial\mu_k'}
\end{pmatrix}.
\]
Theorem (Asymptotic Normality of the MM estimator) Under the conditions stated below in the proof,
\[
\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, A\Sigma A')
\qquad\text{and}\qquad
\hat\theta \stackrel{a}{\sim} N\bigl(\theta, \tfrac{1}{n}A\Sigma A'\bigr),
\]
where $\Sigma$ is the asymptotic covariance matrix of the sample moments.
Proof
Recall that the sample moments defining the MM estimator $\hat\theta = h^{-1}(M_1',\ldots,M_k')$ converge in distribution to a normal limiting distribution as
\[
\sqrt{n}\Bigl[\begin{pmatrix} M_1'\\ \vdots\\ M_k'\end{pmatrix} - \begin{pmatrix}\mu_1'\\ \vdots\\ \mu_k'\end{pmatrix}\Bigr] \xrightarrow{d} N(0,\Sigma).
\]
Since a) the MM estimator is a continuous function of those random variables (given by $h^{-1}$), b) the partial derivatives of this function contained in the Jacobian matrix $A$ are continuous, and c) $A$ has full rank, the asymptotic normal distribution of $\hat\theta$ stated by the theorem follows directly from the theorem on the asymptotic distribution of continuous functions of asymptotically normally distributed random variables (the delta method, e.g. Mittelhammer, 1996, Theorem 5.40, p. 288). $\Box$
In the discussion above, we have seen that the MM estimator is conceptually and computationally very simple, and is consistent and asymptotically normal under fairly general conditions. However, the MM estimator is often not unbiased, BLUE, MVUE, or asymptotically efficient. Furthermore, its small sample properties are often unknown and critically dependent upon the particular estimation problem under consideration. The MM approach has most often been applied in contexts such as the following.
Example 3.15 Consider the classical LRM $Y_t = x_{t.}\beta + \varepsilon_t$ with $E\varepsilon_t = 0$, $t = 1,\ldots,n$, where $\beta$ is a $k$-dimensional parameter vector. Moment conditions can be specified by the $k$-dimensional vector function
\[
Eg(Y_t,\beta) = E[x_{t.}'\varepsilon_t] = E\bigl[x_{t.}'(Y_t - x_{t.}\beta)\bigr] = 0.
\]
The corresponding sample moment conditions are given by
\[
\frac{1}{n}\sum_{t=1}^n g(y_t,\beta) = \frac{1}{n}\sum_{t=1}^n x_{t.}'(y_t - x_{t.}\beta) = \frac{1}{n}\bigl(x'y - x'x\beta\bigr) = 0.
\]
This system of equations solved for $\beta$ defines the MM estimate of $\beta$ as
\[
\hat\beta = (x'x)^{-1}x'y.
\]
Thus the LS estimator is an MM estimator.
Example 3.16 Let $Y = (Y_1,\ldots,Y_n)$ be an iid random sample from a population with pdf $f(y;\theta)$. Then the log-likelihood is
\[
\ln L(\theta;y) = \sum_{t=1}^n \ln f(y_t;\theta).
\]
Under the usual regularity conditions we have (in the continuous case)
\[
E\Bigl[\frac{\partial\ln f(Y_t;\theta)}{\partial\theta}\Bigr]
= \int \frac{\partial\ln f(y_t;\theta)}{\partial\theta}\,f(y_t;\theta)\,dy_t
= \int \frac{\partial f(y_t;\theta)}{\partial\theta}\,dy_t
= \frac{\partial}{\partial\theta}\int f(y_t;\theta)\,dy_t = 0.
\]
Hence we can specify the moment conditions
\[
Eg^*(Y_t,\theta) = E\Bigl[\frac{\partial\ln f(Y_t;\theta)}{\partial\theta}\Bigr] = 0,
\]
with sample counterpart
\[
\frac{1}{n}\sum_{t=1}^n g^*(y_t,\theta) = \frac{1}{n}\sum_{t=1}^n \frac{\partial\ln f(y_t;\theta)}{\partial\theta}
= \frac{1}{n}\frac{\partial\ln L(\theta;y)}{\partial\theta} = 0.
\]
The solution of these sample moment conditions is the MLE defined by the f.o.c.s for maximizing the log-likelihood. Thus the ML estimator can also be interpreted as an MM estimator.
The GMM estimator is used when the $k$-dimensional parameter vector $\theta$ is over-identified by the $\ell > k$ moment conditions $Eg(Y_t;\theta) = 0$. In this case the corresponding sample moment conditions
\[
\bar g_n(y;\theta) = \frac{1}{n}\sum_{t=1}^n g(y_t;\theta) = 0
\]
define a system with more equations than unknowns, such that we cannot find a vector $\hat\theta$ that exactly satisfies the sample moment conditions. In this over-identified case the GMM estimate is defined as the value of $\theta$ that satisfies the sample moment conditions as closely as possible. Formally, the GMM estimate is defined to be the value that minimizes a weighted measure of the distance between the sample moments $\bar g_n(y;\theta)$ and the zero vector, as follows.
Definition (GMM estimator): Let
\[
\hat\theta = \arg\min_{\theta\in\Omega} Q_n(\theta;y), \qquad\text{where}\quad Q_n(\theta;y) = \bar g_n(y;\theta)'\,W_n\,\bar g_n(y;\theta),
\]
and $W_n$ is an $(\ell\times\ell)$ symmetric, p.d. weighting matrix such that $W_n \xrightarrow{p} w$, with $w$ being a nonrandom, symmetric, and p.d. matrix.
Note that for $\ell > k$ different weighting matrices $W_n$ lead to different GMM estimators. The simplest weighting matrix would be $W_n = I$. Also note that
\[
Q_n(\theta;y) \ge 0
\]
(since the GMM objective function $Q_n(\theta;y)$ has a quadratic form with a p.d. weighting matrix $W_n$), and
\[
Q_n(\theta;y) = 0 \quad\Longleftrightarrow\quad \bar g_n(y;\theta) = 0.
\]
Example 3.17 Let $Y = (Y_1,\ldots,Y_n)$ be a random sample from a Gamma population with pdf
\[
f(y;\theta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,y^{\alpha-1}\exp\bigl(-\tfrac{y}{\beta}\bigr),
\qquad y\in(0,\infty),\ \theta = (\alpha,\beta).
\]
Moment conditions based on the first two non-central moments, the log moment, and the inverse moment are
\[
Eg(Y_t,\theta) = E\begin{pmatrix}
Y_t - \alpha\beta\\
Y_t^2 - \alpha\beta^2(1+\alpha)\\
\ln Y_t - \dfrac{d\ln\Gamma(\alpha)}{d\alpha} - \ln\beta\\
1/Y_t - 1/[\beta(\alpha-1)]
\end{pmatrix} = 0.
\]
Hence the two-dimensional parameter vector $\theta$ is over-identified by those four moment conditions. If we select $W_n = I$, the GMM estimate obtains from minimizing the following objective function:
\[
Q_n(\theta;y) = \Bigl[\frac{1}{n}\sum_{t=1}^n y_t - \alpha\beta\Bigr]^2
+ \Bigl[\frac{1}{n}\sum_{t=1}^n y_t^2 - \alpha\beta^2(1+\alpha)\Bigr]^2
+ \Bigl[\frac{1}{n}\sum_{t=1}^n\ln(y_t) - \frac{d\ln\Gamma(\alpha)}{d\alpha} - \ln\beta\Bigr]^2
+ \Bigl[\frac{1}{n}\sum_{t=1}^n\frac{1}{y_t} - \frac{1}{\beta(\alpha-1)}\Bigr]^2.
\]
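A hedged numerical sketch of this over-identified GMM problem with $W_n = I$ (simulated data, illustrative starting values, and a generic optimizer; none of these choices come from the notes):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import digamma

rng = np.random.default_rng(6)
y = rng.gamma(shape=3.0, scale=2.0, size=2000)   # simulated Gamma sample

def gbar(theta):
    alpha, beta = theta
    return np.array([
        y.mean() - alpha * beta,
        np.mean(y**2) - alpha * beta**2 * (1 + alpha),
        np.mean(np.log(y)) - digamma(alpha) - np.log(beta),
        np.mean(1.0 / y) - 1.0 / (beta * (alpha - 1)),
    ])

def Qn(theta):
    alpha, beta = theta
    if alpha <= 1 or beta <= 0:      # keep 1/[beta*(alpha-1)] well defined
        return np.inf
    g = gbar(theta)
    return g @ g                     # quadratic form with identity weighting matrix

res = minimize(Qn, x0=np.array([2.0, 1.0]), method="Nelder-Mead")
print(res.x)                         # GMM estimates of (alpha, beta)
```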
As was the case for the MM estimator, general statements regarding the finite sample properties of the GMM estimator are not possible to derive. In the following we will discuss⁹ a set of sufficient conditions for the consistency and the asymptotic normality of the GMM estimator. We will also consider the asymptotic efficiency of the GMM estimator for a given particular set of moment conditions.
For consistency of the GMM estimator we require some conditions on the behavior of the GMM objective function. A set of sufficient conditions is given in the following definition.
⁹ This discussion is based on L. Mátyás (1999), Generalized Method of Moments Estimation, Cambridge: Cambridge University Press, Chapter 1. The seminal paper on the asymptotic properties of the GMM estimator is L.P. Hansen (1982), Large sample properties of generalized method of moments estimators, Econometrica, p. 1029-1054.
Definition (Conditions C1-C3):
C1. $Eg(Y_t,\theta) = h(\theta)$ exists and is finite for all $\theta\in\Omega$;
C2. $Eg(Y_t,\theta) = 0 \iff \theta = \theta_0$;
C3. $\sup_{\theta\in\Omega}\bigl|\bar g_{n,j}(Y,\theta) - h_j(\theta)\bigr| \xrightarrow{p} 0$ for $j = 1,\ldots,\ell$.
Condition (C1) ensures that the population moments defining the moment conditions exist. Condition (C2) says that the population moments take the value 0 at $\theta_0$ and at no other value of $\theta$. This ensures that $\theta_0$ can be identified by the population moment conditions $Eg(Y_t,\theta) = 0$. Condition (C3) implies that the $j$th sample moment (as a function of $\theta$) converges in probability uniformly to the corresponding population moment (as a function of $\theta$). The uniformity of the convergence is a stronger requirement than the usual pointwise convergence in probability, which simply requires that $\bar g_{n,j}(Y,\theta) - h_j(\theta) \xrightarrow{p} 0$ for every single fixed $\theta$ value separately. The importance of the uniformity of the convergence for the consistency proof is that it implies that $\bar g_{n,j}(Y,\theta_n) - h_j(\theta_n) \xrightarrow{p} 0$, where $\theta_n$ is some sequence of $\theta$ values. This may not be true if we only have pointwise convergence.
With the conditions (C1)-(C3), the consistency of the GMM estimator can be shown.
Theorem 3.17 (Consistency of the GMM Estimator) Let the GMM estimator of $\theta$ be defined as
\[
\hat\theta = \arg\min_{\theta\in\Omega} Q_n(\theta;Y), \qquad Q_n(\theta;Y) = \bar g_n(Y;\theta)'\,W_n\,\bar g_n(Y;\theta),
\]
and where $W_n \xrightarrow{p} w$, with $w$ being a nonrandom, symmetric, and p.d. matrix. Then under the conditions (C1) to (C3), $\hat\theta \xrightarrow{p} \theta_0$.
Proof
The conditions (C1)-(C3) imply that the GMM objective function $Q_n(\theta;Y)$ converges in probability uniformly to the nonrandom (limit) function
\[
Q(\theta) = h(\theta)'\,w\,h(\theta) = E[g(Y_t,\theta)]'\,w\,E[g(Y_t,\theta)],
\]
i.e. $\sup_{\theta\in\Omega}|Q_n(\theta;Y) - Q(\theta)| \xrightarrow{p} 0$. By condition (C2), $Q(\theta) = 0 \iff \theta = \theta_0$, and note that $Q(\theta)$ has a quadratic form such that $Q(\theta)\ge 0$. Hence it follows that
\[
\theta_0 = \arg\min_{\theta\in\Omega} Q(\theta).
\]
Now since
- the GMM estimator $\hat\theta$ minimizes the objective function $Q_n(\theta;Y)$,
- the true value $\theta_0$ minimizes the corresponding limit function $Q(\theta)$,
- and $Q_n(\theta;Y)$ converges in probability uniformly to $Q(\theta)$,
it follows that $\hat\theta \xrightarrow{p} \theta_0$.¹⁰ $\Box$
Under additional conditions relating to the moment conditions, it can be shown that the GMM estimator is asymptotically normally distributed. A set of such conditions is given in the following definition.
¹⁰ For the formal details of the proof see T. Amemiya (1985, p. 107), Advanced Econometrics, Cambridge: Harvard University Press.
Definition (Conditions C4-C6):
C4. The $(\ell\times k)$ matrix of partial derivatives
\[
G_n(y;\theta) = \frac{\partial \bar g_n(y;\theta)}{\partial\theta'} = \frac{1}{n}\sum_{t=1}^n\frac{\partial g(y_t;\theta)}{\partial\theta'}
\]
exists and is continuous in $\theta$;
C5. for any sequence $\theta_n \xrightarrow{p} \theta_0$, $G_n(Y;\theta_n) \xrightarrow{p} G(\theta_0)$, where $G(\theta_0)$ is a nonrandom $(\ell\times k)$ matrix of full column rank;
C6. the sequence of moment functions $\{g(Y_t;\theta_0)\}$ satisfies a CLT, so that
\[
\sqrt{n}\,\bar g_n(Y;\theta_0) \xrightarrow{d} Z \sim N[0, V(\theta_0)].
\]
In cases where the moment functions $\{g(Y_t;\theta)\}$ are heteroskedastic and/or correlated random variables, the standard Lindberg-Levy CLT needs to be replaced by a CLT for heteroskedastic and/or correlated processes.
Theorem 3.18 (Asymptotic Normality of GMM Estimators) Under the conditions (C1) to (C6),
\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \Sigma),
\qquad
\Sigma = \bigl[G(\theta_0)'wG(\theta_0)\bigr]^{-1} G(\theta_0)'w\,V(\theta_0)\,w\,G(\theta_0)\bigl[G(\theta_0)'wG(\theta_0)\bigr]^{-1}.
\]
Proof
The GMM objective function to be minimized (ignoring its dependence on $Y$) is
\[
Q_n(\theta) = \bar g_n(\theta)'\,W_n\,\bar g_n(\theta),
\]
with first-order derivatives
\[
\frac{\partial Q_n(\theta)}{\partial\theta} = 2\Bigl[\frac{\partial\bar g_n(\theta)}{\partial\theta'}\Bigr]'W_n\,\bar g_n(\theta) = 2G_n(\theta)'W_n\,\bar g_n(\theta),
\]
so that the f.o.c.s defining the GMM estimator $\hat\theta$ are
\[
G_n(\hat\theta)'W_n\,\bar g_n(\hat\theta) = 0.
\]
Now consider a first-order Taylor series representation of the sample moments $\bar g_n(\hat\theta)$ given by
\[
\bar g_n(\hat\theta) = \bar g_n(\theta_0) + G_n(\theta_n^*)(\hat\theta - \theta_0),
\qquad \theta_n^* = \lambda\hat\theta + (1-\lambda)\theta_0,\ \lambda\in[0,1].
\]
Note that since $\hat\theta \xrightarrow{p} \theta_0$, we have $\theta_n^* \xrightarrow{p} \theta_0$.
If we pre-multiply the equation of the Taylor expansion by $G_n(\hat\theta)'W_n$ we get the following representation of the GMM f.o.c.s:
\[
G_n(\hat\theta)'W_n\,\bar g_n(\hat\theta)
= G_n(\hat\theta)'W_n\,\bar g_n(\theta_0) + G_n(\hat\theta)'W_n\,G_n(\theta_n^*)(\hat\theta-\theta_0) = 0.
\]
Multiplying the last equation by $\sqrt{n}$ and solving for $\sqrt{n}(\hat\theta-\theta_0)$ yields
\[
\sqrt{n}(\hat\theta - \theta_0)
= -\bigl[\underbrace{G_n(\hat\theta)'}_{\xrightarrow{p} G(\theta_0)'}\underbrace{W_n}_{\xrightarrow{p} w}\underbrace{G_n(\theta_n^*)}_{\xrightarrow{p} G(\theta_0)}\bigr]^{-1}
\underbrace{G_n(\hat\theta)'}_{\xrightarrow{p} G(\theta_0)'}\underbrace{W_n}_{\xrightarrow{p} w}
\underbrace{\sqrt{n}\,\bar g_n(\theta_0)}_{\xrightarrow{d} N[0,V(\theta_0)]}.
\]
The limiting distribution of $\sqrt{n}(\hat\theta-\theta_0)$ then follows from Slutsky's theorem:
\[
\sqrt{n}(\hat\theta-\theta_0) \xrightarrow{d}
N\Bigl(0,\ \bigl[G(\theta_0)'wG(\theta_0)\bigr]^{-1}G(\theta_0)'w\,V(\theta_0)\,w\,G(\theta_0)\bigl[G(\theta_0)'wG(\theta_0)\bigr]^{-1}\Bigr),
\]
so that
\[
\hat\theta \stackrel{a}{\sim} N\Bigl(\theta_0,\ \frac{1}{n}\bigl[G(\theta_0)'wG(\theta_0)\bigr]^{-1}G(\theta_0)'w\,V(\theta_0)\,w\,G(\theta_0)\bigl[G(\theta_0)'wG(\theta_0)\bigr]^{-1}\Bigr). \qquad\Box
\]
Hence, the asymptotic covariance of the GMM estimator for a given set of moment conditions and, therefore, its asymptotic efficiency depends upon the selected weighting matrix $W_n \xrightarrow{p} w$.
One can show that the optimal choice $W_n^*$ of $W_n$ which minimizes this asymptotic covariance for a given set of moment conditions is characterized by¹¹
\[
W_n^* \xrightarrow{p} \bigl\{n\,\mathrm{Cov}[\bar g_n(Y;\theta_0)]\bigr\}^{-1} = V(\theta_0)^{-1}.
\]
Note that the asymptotic covariance of the GMM estimator based upon this optimal weighting matrix is
\[
\mathrm{ACov}(\hat\theta)\Big|_{W_n = W_n^*} = \frac{1}{n}\bigl[G(\theta_0)'V(\theta_0)^{-1}G(\theta_0)\bigr]^{-1},
\]
which is the smallest possible for the set of moment conditions under consideration. The computation of the optimal weighting matrix $W_n^*$ and, thereby, the implementation of the asymptotically optimal GMM, requires a consistent estimate for
\[
V(\theta_0) = n\,\mathrm{Cov}[\bar g_n(Y;\theta_0)] = n\,\mathrm{Cov}\Bigl[\frac{1}{n}\sum_{t=1}^n g(Y_t;\theta_0)\Bigr].
\]
In the iid case such an estimate is given by
\[
\hat V = \frac{1}{n}\sum_{t=1}^n g(Y_t;\tilde\theta_n)\,g(Y_t;\tilde\theta_n)', \qquad \mathrm{plim}\,\tilde\theta_n = \theta_0.
\]
Note that the computation of the optimal weighting matrix $W_n^*$ requires a consistent estimate for $\theta$. For this reason we typically use in practice the following two-step procedure to obtain the asymptotically efficient estimator:
Step (1): Select some arbitrary weighting matrix (say $W_n = I$) to obtain an initial consistent estimate of $\theta$.
Step (2): Use this consistent estimate of $\theta$ to construct $W_n^*$ and to obtain the asymptotically efficient GMM estimate.
¹¹ For details see, e.g., L.P. Hansen (1982), Large sample properties of generalized method of moments estimators, Econometrica, p. 1029-1054.
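A sketch of this two-step procedure (reusing the illustrative Gamma moment conditions from the sketch above; data, starting values, and optimizer are again assumptions): step 1 uses $W_n = I$, step 2 re-weights with the inverse of the estimated covariance of the moment functions evaluated at the step-1 estimate.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import digamma

rng = np.random.default_rng(7)
y = rng.gamma(shape=3.0, scale=2.0, size=2000)

def g_t(theta):                      # (n x 4) matrix of moment functions g(y_t; theta)
    alpha, beta = theta
    return np.column_stack([
        y - alpha * beta,
        y**2 - alpha * beta**2 * (1 + alpha),
        np.log(y) - digamma(alpha) - np.log(beta),
        1.0 / y - 1.0 / (beta * (alpha - 1)),
    ])

def Qn(theta, W):
    alpha, beta = theta
    if alpha <= 1 or beta <= 0:
        return np.inf
    g = g_t(theta).mean(axis=0)      # sample moments g_bar_n
    return g @ W @ g

# Step 1: identity weighting matrix gives an initial consistent estimate
step1 = minimize(Qn, x0=np.array([2.0, 1.0]), args=(np.eye(4),), method="Nelder-Mead")
# Step 2: optimal weighting matrix V^{-1}, estimated at the step-1 estimate
G = g_t(step1.x)
V_hat = (G.T @ G) / len(y)
step2 = minimize(Qn, x0=step1.x, args=(np.linalg.inv(V_hat),), method="Nelder-Mead")
print(step1.x, step2.x)
```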
A prominent application of GMM is the consumption-based asset pricing model of Hansen and Singleton.¹² A representative agent chooses a consumption plan $\{c_{t+\tau}\}$ to maximize expected discounted utility
\[
E\Bigl[\sum_{\tau=0}^\infty \delta^\tau U(C_{t+\tau})\,\Big|\,x_t\Bigr],
\]
subject to the budget constraint, where $x_t$ denotes the information available at time $t$ and $\delta$ is the discount factor. The first-order (Euler) conditions for the optimal saving/investment decision in each of the $L$ traded assets equate the utility loss caused by saving 1\$ in $t$ to the expected discounted utility gain from the resulting payoff in $t+1$:
\[
\underbrace{\frac{dU(C_t)}{dC_t}}_{\text{utility loss caused by saving 1\$ in }t}
= E\Bigl[\delta\,(1+R_{i,t+1})\,\frac{dU(C_{t+1})}{dC_{t+1}}\,\Big|\,x_t\Bigr],
\qquad i = 1,\ldots,L.
\]
With the CRRA utility function $U(C_t) = C_t^{1-\gamma}/(1-\gamma)$, so that $dU(C_t)/dC_t = C_t^{-\gamma}$, the conditions become
\[
C_t^{-\gamma} = E\bigl[\delta(1+R_{i,t+1})\,C_{t+1}^{-\gamma}\,\big|\,x_t\bigr],
\qquad\text{i.e.}\qquad
1 = E\bigl[\delta(1+R_{i,t+1})(C_{t+1}/C_t)^{-\gamma}\,\big|\,x_t\bigr],
\]
and hence, for any vector of instruments $X_t$ contained in the information set $x_t$,
\[
E\Bigl\{\bigl[1 - \delta(1+R_{i,t+1})(C_{t+1}/C_t)^{-\gamma}\bigr]X_t\Bigr\} = 0 \qquad\text{(by the l.i.e.)}.
\]
The aim is to estimate the parameter $\theta = (\delta,\gamma)$. Note that without further assumptions about the distribution of the involved random variables we cannot derive a likelihood function, which would be required for ML estimation. However, the optimality conditions allow us to define a set of moment conditions that can be used to estimate $\theta$ by GMM.
Let $Y_t = (R_{1,t+1},\ldots,R_{L,t+1}, C_{t+1}/C_t, X_t')'$. Then stacking the optimality conditions for the $L$ assets produces a set of $L\cdot k$ moment conditions of the form
\[
Eg(Y_t;\theta) = E\begin{pmatrix}
\bigl[1 - \delta(1+R_{1,t+1})(C_{t+1}/C_t)^{-\gamma}\bigr]X_t\\
\vdots\\
\bigl[1 - \delta(1+R_{L,t+1})(C_{t+1}/C_t)^{-\gamma}\bigr]X_t
\end{pmatrix} = 0 \qquad (L\cdot k\times 1),
\]
and the GMM estimate obtains by minimizing
\[
\Bigl[\frac{1}{n}\sum_{t=1}^n g(y_t;\theta)\Bigr]'W_n\Bigl[\frac{1}{n}\sum_{t=1}^n g(y_t;\theta)\Bigr].
\]
¹² L.P. Hansen and K. Singleton (1982), Generalized instrumental variable estimation of nonlinear rational expectation models, Econometrica 50, p. 1029-1054.
By Bayes' theorem, the posterior density of $\theta$ given the data $y$ obtains as
\[
f(\theta|y) = \frac{f(\theta,y)}{f(y)} = \frac{f(y|\theta)f(\theta)}{f(y)} = \frac{f(y|\theta)f(\theta)}{\int f(y|\theta)f(\theta)\,d\theta}.
\]
The posterior combines our prior beliefs about $\theta$, summarized by the prior $f(\theta)$, with the information about $\theta$ in the data $y$, contained in the likelihood $f(y|\theta)$. Hence, the posterior represents our revised beliefs about the distribution of $\theta$ after seeing the data $y$. It obtains as a mixture of the prior information and the current information, that is, the data.
Once obtained, the posterior is available to be the prior when the next body of data is available. The principle involved is one of continually updating our knowledge about $\theta$. This appears nowhere in the classical analysis.
In this setting, $f(y)$ is referred to as the marginal data density, which does not involve the parameter of interest $\theta$. It represents an inessential (integrating) constant. Since the data density is an inessential constant not involving $\theta$, it is often dropped. We then write
\[
\underbrace{f(\theta|y)}_{\text{posterior}} \ \propto\ \underbrace{f(y|\theta)}_{\text{likelihood}}\cdot\underbrace{f(\theta)}_{\text{prior}},
\]
where the symbol $\propto$ means "is proportional to". Note that the product $f(y|\theta)f(\theta)$ does not define a proper density. It represents a so-called density kernel for the posterior density of $\theta$.
The posterior Bayes estimator of a function $q(\theta)$ is the posterior expectation
\[
E[q(\theta)|y] = \int q(\theta)f(\theta|y)\,d\theta = \frac{\int q(\theta)f(y|\theta)f(\theta)\,d\theta}{\int f(y|\theta)f(\theta)\,d\theta}.
\]
Note that the posterior Bayes estimator is a quantity that is obtained by integration. Only in special cases can this integral be worked out analytically (see the following examples). Whether or not an analytical characterization can be obtained critically depends on the functional forms of the likelihood and the chosen prior density.
In general, however, we cannot work out this integral analytically and we need to rely on numerical integration. Here we can use either deterministic integration methods (Simpson's rule, Laplace approximations, quadrature rules) or Monte-Carlo techniques (importance sampling, Gibbs sampling, Metropolis-Hastings sampling, Markov-Chain Monte-Carlo). We don't go into the details here.
Example 3.19 Let $Y = (Y_1,\ldots,Y_n)$ be a random sample from a $N(\mu,\sigma^2)$ population. The variance $\sigma^2$ is assumed to be known. Suppose that prior information about $\mu$ can be represented by a $N(m, v^2)$ prior distribution, where the prior parameters $(m, v^2)$ are known. The functional form of the posterior distribution for $\mu$,
\[
f(\mu|y) \propto f(y|\mu)\,f(\mu),
\]
obtains as follows. Ignoring the inessential integrating constants, the likelihood and the prior density have the form
\[
f(y|\mu) \propto \exp\Bigl\{-\frac{1}{2\sigma^2}\sum_i (y_i-\mu)^2\Bigr\},
\qquad
f(\mu) \propto \exp\Bigl\{-\frac{1}{2v^2}(\mu-m)^2\Bigr\}.
\]
Multiplying the two kernels and collecting the terms involving $\mu$ yields
\[
f(\mu|y) \propto \exp\Bigl\{-\frac{1}{2}\Bigl[\underbrace{\Bigl(\frac{n}{\sigma^2}+\frac{1}{v^2}\Bigr)}_{1/\tau^2}\mu^2
- 2\underbrace{\Bigl(\frac{\sum_i y_i}{\sigma^2}+\frac{m}{v^2}\Bigr)}_{\mu^*/\tau^2}\mu\Bigr]\Bigr\}.
\]
Note that the r.h.s. has the form of a density kernel of a normal distribution for $\mu$. To see this, consider the density for an $X \sim N(\mu_x,\sigma_x^2)$ given by
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\Bigl\{-\frac{1}{2}\Bigl(\frac{x^2}{\sigma_x^2} - \frac{2\mu_x x}{\sigma_x^2} + \frac{\mu_x^2}{\sigma_x^2}\Bigr)\Bigr\}.
\]
Hence the posterior of $\mu$ is normal with variance and mean
\[
\tau^2 = \frac{\sigma^2 v^2}{nv^2 + \sigma^2},
\qquad
\mu^* = \frac{n\bar y\,v^2 + m\,\sigma^2}{nv^2 + \sigma^2}, \qquad\text{where}\quad \bar y = \frac{1}{n}\sum_i y_i,
\]
so that the posterior Bayes estimate of $\mu$ is
\[
E(\mu|y) = \mu^* = \frac{n}{n + (\sigma^2/v^2)}\,\bar y + \frac{\sigma^2/v^2}{n + (\sigma^2/v^2)}\,m.
\]
Note that the classical ML estimate for $\mu$ is $\bar y$. Thus, the Bayesian estimate is a weighted average of the classical estimate and an estimate based on prior information, namely the prior mean $m$.
Furthermore note that the Bayesian estimate $E(\mu|y) = \mu^*$ depends on the prior variance $v^2$. Smaller values of $v^2$ correspond to greater confidence in the prior information, and this would make the Bayesian estimate closer to $m$. In contrast, as $v^2$ becomes larger, the Bayesian estimate approaches the classical estimate $\bar y$.
In the limit $v^2 \to \infty$ the prior density becomes a so-called diffuse or improper prior density. In this case the prior information is so poor that it is completely ignored in forming the estimate.
Recall that the ML estimator $\bar Y$ is an unbiased estimator of $\mu$ and the MVUE. This implies that the posterior Bayes estimator is biased for a finite $v^2$. One can show that in general a posterior Bayes estimator is biased.¹³
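A numerical check of the posterior formulas of Example 3.19 (the data, prior values, and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
sigma2, m, v2 = 4.0, 0.0, 1.0              # known variance, prior mean, prior variance
y = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)
n, ybar = y.size, y.mean()

post_var = sigma2 * v2 / (n * v2 + sigma2)             # tau^2
post_mean = (n * ybar * v2 + m * sigma2) / (n * v2 + sigma2)
# equivalent weighted-average form used in the text
w = n / (n + sigma2 / v2)
print(post_mean, w * ybar + (1 - w) * m, post_var)
```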
Example 3.20 Consider the classical LRM
\[
Y = x\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),
\]
where $Y$ is $(n\times 1)$, $x$ is $(n\times k)$, $\beta$ is $(k\times 1)$, $\varepsilon$ is $(n\times 1)$, and the variance $\sigma^2$ is assumed to be known. Suppose that prior information about $\beta$ can be represented by a $N(m, \sigma^2 V)$ prior distribution. The likelihood kernel is
\[
f(y|\beta) \propto \exp\Bigl\{-\frac{1}{2\sigma^2}(y - x\beta)'(y - x\beta)\Bigr\},
\]
and combining it with the prior kernel and collecting terms in $\beta$ yields a posterior kernel of the form
\[
f(\beta|y) \propto \exp\Bigl\{-\frac{1}{2\sigma^2}(\beta - \beta^*)'(V^{-1} + x'x)(\beta - \beta^*)\Bigr\},
\qquad\text{where}\quad \beta^* = (V^{-1} + x'x)^{-1}(V^{-1}m + x'y).
\]
Hence the posterior for $\beta$ is a $N[\beta^*, \sigma^2(V^{-1}+x'x)^{-1}]$ density. The posterior Bayes estimate is
\[
E(\beta|y) = \beta^* = (V^{-1} + x'x)^{-1}(V^{-1}m + x'y)
\]
and can be interpreted as follows: Poor prior information about $\beta$ corresponds to a large prior variance $V$, that means a small value of $V^{-1}$. The diffuse prior in this case can be represented as the limit $V^{-1} \to 0$, for which the posterior Bayes estimate becomes the LS estimate
\[
\beta^* \to (x'x)^{-1}x'y.
\]
The variance of the posterior distribution becomes $\sigma^2(x'x)^{-1}$. Thus the classical LS approach for the LRM is reproduced as a special case of Bayesian inference with a diffuse prior.
In the previous example, we have assumed that the residual variance $\sigma^2$ is known. In general, however, both $\beta$ and $\sigma^2$ are unknown, and a Bayesian analysis requires a prior distribution for $\beta$ and $\sigma^2$. For this application it is convenient to assume for $1/\sigma^2$ (the so-called precision) a Gamma prior distribution.
To evaluate estimators within the decision-theoretic Bayesian framework one specifies a loss function $\ell(t;\theta)$ measuring the loss incurred by the estimate $t$ when the parameter value is $\theta$; two common choices are
\[
\ell_1(t;\theta) = [t - q(\theta)]^2 \quad\text{(quadratic loss)}
\qquad\text{and}\qquad
\ell_2(t;\theta) = |t - q(\theta)| \quad\text{(absolute loss)}.
\]
Under both loss functions, the loss associated with large errors is larger than for small errors, and the loss is zero when the estimation error is zero.
The average loss associated with an estimator (decision rule) $T$ is the so-called risk function, defined as
\[
R_t(\theta) = E\bigl[\ell(T;\theta)\,\big|\,\theta\bigr] = E\bigl[\ell(t(Y);\theta)\,\big|\,\theta\bigr] = \int \ell(t(y);\theta)\,f(y|\theta)\,dy.
\]
The risk function measures the average loss computed across different realizations of the random sample $Y$ for a given value of the parameter $\theta$.
Note that the risk function for the quadratic loss function is
\[
E\bigl[\ell_1(T;\theta)\,\big|\,\theta\bigr] = E\bigl\{[t - q(\theta)]^2\,\big|\,\theta\bigr\} \quad\text{(mean-squared error)},
\]
and for the absolute loss function $E[\ell_2(T;\theta)\,|\,\theta] = E[\,|t - q(\theta)|\,\big|\,\theta]$.
Averaging the risk over the parameter values, weighted by the prior, gives the so-called Bayes risk of an estimator $T$,
\[
r_t = E_{f(\theta)}[R_t(\theta)] = \int \underbrace{R_t(\theta)}_{\text{risk}}\,\underbrace{f(\theta)}_{\text{prior}}\,d\theta.
\]
The Bayes estimator of $q(\theta)$ is defined as the decision rule that minimizes the Bayes risk; it is obtained by minimizing, for each observed $y$, the posterior risk
\[
E\bigl[\ell(T;\theta)\,\big|\,y\bigr] = \int \ell(t;\theta)\,\underbrace{f(\theta|y)}_{\text{posterior}}\,d\theta.
\]
Proof
The Bayes estimator has the smallest Bayes risk. The Bayes risk obtains as
\[
r_t = \int R_t(\theta)f(\theta)\,d\theta
= \int\!\!\int \ell(t(y);\theta)\,f(y|\theta)f(\theta)\,dy\,d\theta
= \int\Bigl[\int \ell(t(y);\theta)\,f(\theta|y)\,d\theta\Bigr]f(y)\,dy
= \int E\bigl[\ell(t;\theta)\,\big|\,y\bigr]f(y)\,dy,
\]
which is minimized as a function of $t(\cdot)$ if the
inner integral $E[\ell(t;\theta)\,|\,y]$, i.e. the posterior risk, is minimized for each $y$. $\Box$
We can thus establish the result that the Bayes estimator of $\theta$ for a quadratic loss function is given by the mean of the posterior distribution of $\theta$ (i.e. the posterior Bayes estimator).
Corollary 3.1 Under a quadratic loss function $\ell(t;\theta)$, the Bayes estimator of $q(\theta)$ is given by the posterior expectation of $q(\theta)$, i.e.
\[
E\bigl[q(\theta)\,\big|\,y\bigr] = \int q(\theta)\,f(\theta|y)\,d\theta.
\]
Proof
The Bayes risk for a quadratic loss function is minimized if the corresponding posterior risk, i.e.
\[
E\bigl[\ell(t;\theta)\,\big|\,y\bigr] = E\bigl\{[t - q(\theta)]^2\,\big|\,y\bigr\},
\]
is minimized. Now note that an MSE of the form $E[(a - Z)^2]$ is minimized as a function of $a$ by $a = EZ$. To see this consider the f.o.c.
\[
\frac{dE[(a-Z)^2]}{da} = 2E[(a - Z)] = 0 \quad\Longleftrightarrow\quad a = EZ.
\]
Thus, the posterior risk under a quadratic loss function $E\{[t - q(\theta)]^2\,|\,y\}$ is minimized by $t = E[q(\theta)\,|\,y]$. $\Box$
Example 3.21 Recall the example where we considered the posterior Bayes estimator of $\mu$ for a random sample from a $N(\mu,\sigma^2)$ population with a known variance $\sigma^2$ and a prior given by a $N(m, v^2)$ distribution. For a quadratic loss function the Bayes estimator of $\mu$ is equal to the posterior Bayes estimator and is given by
\[
E(\mu|y) = \mu^* = \frac{n}{n+(\sigma^2/v^2)}\,\bar y + \frac{\sigma^2/v^2}{n+(\sigma^2/v^2)}\,m.
\]
4. Hypothesis testing
There are two major areas of statistical inference: the estimation of parameters discussed
in the previous two chapters, and the testing of hypotheses, which we shall discuss in this
chapter.
Statistical hypothesis testing concerns the use of a random sample of observations from the
population under consideration to judge the validity of a statement or hypothesis about this
population in such a way that the probability of making incorrect decisions can be controlled.
Examples of statements that might be tested are
A certain brand of batteries lasts at least 3 hours;
The average monthly return of a certain portfolio of risky assets exceeds 5%;
The probability that a consumer will buy a certain brand of coffee depends on his age.
In the following section, various concepts of statistical hypothesis testing are introduced.
Example 4.1 A manufacturer of light bulbs claims that the percentage of defective light bulbs in a shipment is no more than 2%. This claim can be transformed into a statistical hypothesis as follows: We take a random sample of n light bulbs and define the random variable
\[
X_i = \begin{cases} 1 & \text{if the } i\text{th bulb is defective},\\ 0 & \text{otherwise},\end{cases} \qquad i = 1,\ldots,n.
\]
The manufacturer's claim corresponds to the set of joint Bernoulli distributions
\[
H = \Bigl\{f(x;p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i},\quad p\in[0, 0.02]\Bigr\}.
\]
Here $p\in[0, 0.02]$ defines the set of all distributions implied by the manufacturer's claim. Since this claim involves more than one distribution, the statistical hypothesis is a composite hypothesis. Since the functional form of the joint pdf $f(x;p)$ can be assumed to be known, the hypothesis can be represented, as is usually the case, in abbreviated form as
\[
H: 0 \le p \le 0.02, \qquad\text{and the complementary hypothesis as}\qquad \bar H: 0.02 < p \le 1.
\]
Suppose that the manufacturer's claim was that the percentage of defective light bulbs in a shipment is exactly 2%. Then we have a simple statistical hypothesis given by
\[
H = \Bigl\{f(x;p) = \prod_{i=1}^n 0.02^{x_i}(1-0.02)^{1-x_i}\Bigr\},
\qquad\text{or simply}\quad H: p = 0.02.
\]
Note that a statistical hypothesis $H$ implicitly defines a set of potential outcomes for the random sample $x$. Formally, this set obtains as
\[
R(x|H) = \{x: f(x;\theta) > 0 \text{ and } f(x;\theta)\in H\},
\]
and is referred to as the range of $x$ over $H$. Analogous definitions of the range of $x$ over $\bar H$ (the complement of $H$) and over $H\cup\bar H$ can be given as
\[
R(x|\bar H) = \{x: f(x;\theta) > 0 \text{ and } f(x;\theta)\in \bar H\},
\qquad
R(x|H\cup\bar H) = \{x: f(x;\theta) > 0 \text{ and } f(x;\theta)\in H\cup\bar H\}.
\]
In many applications it is the case that the ranges over $H$, $\bar H$, and $H\cup\bar H$ are the same, which occurs when all of the supports of the pdfs in $H$ and $\bar H$ are the same.
Example 4.2 Recall the previous example, where we considered the hypothesis that the percentage of defective light bulbs in a shipment is no more than 2%, such that
\[
H: 0\le p\le 0.02 \qquad\text{and}\qquad \bar H: 0.02 < p \le 1.
\]
The ranges of a sample $x = (X_1,\ldots,X_n)$ over $H$, $\bar H$, and $H\cup\bar H$ are
\[
R(x|H) = R(x|\bar H) = R(x|H\cup\bar H) = \times_{i=1}^n\{0,1\}.
\]
Suppose instead that the manufacturer's claim was that there are no defective bulbs, such that
\[
H: p = 0 \qquad\text{and}\qquad \bar H: 0 < p \le 1.
\]
Then $R(x|H) = \{(0,\ldots,0)\}$, whereas
\[
R(x|\bar H) = R(x|H\cup\bar H) = \times_{i=1}^n\{0,1\}.
\]
Example 4.3 Let $x = (X_1,\ldots,X_n)$ be a sample from a $N(\mu, 25)$ population, where $\mu$ is unknown. Consider the hypothesis
\[
H: \mu < 17.
\]
Intuitively, one would reject $H$ if the sample mean $\bar x_n$ is significantly larger than 17. Hence, as a critical region one could define the following set of random sample outcomes
\[
C_r = \{x: \bar x_n > 17 + \tfrac{5}{\sqrt{n}}\}.
\]
The different situations associated with the type I and type II errors are depicted below.
Ideal statistical test
Clearly, the ideal statistical test would be such that it leads with probability 1 to the correct decision, implying that
\[
P(\text{type I error}) = P(\text{type II error}) = 0.
\]
For such an ideal test to exist, it must be possible to define a critical region $C_r$ such that
\[
x\in C_r \iff H \text{ is not true} \qquad\text{and}\qquad x\notin C_r \iff H \text{ is true}.
\]
Note that the definition of such a critical region requires that $R(x|H)$ and $R(x|\bar H)$ are two disjoint sets partitioning the sample space, so that
\[
x\in C_r = R(x|\bar H) \text{ implies with certainty that } H \text{ is not true, and }
x\in C_a = R(x|H) \text{ implies with certainty that } H \text{ is true}.
\]
Hence, in this case $C_r = R(x|\bar H)$ would define an ideal error-free test.
If $R(x|H)\cap R(x|\bar H) \neq \emptyset$, there are potential outcomes $x$ that belong to $R(x|H)$ as well as to $R(x|\bar H)$. Those outcomes cannot be used to discriminate with certainty between $H$ and $\bar H$. Hence if $R(x|H)\cap R(x|\bar H)\neq\emptyset$, which is virtually always the case in practice, there would not exist an ideal test. In those cases, we might define tests that control the incidence of errors such that they occur with acceptable probabilities.
Test statistic
A scalar statistic whose outcomes are used to define critical regions for a test is called a test statistic.
Definition (Test statistic): Let $C_r$ define the critical region associated with a test of the hypothesis $H$ versus $\bar H$. If $T = t(x)$ is a scalar statistic such that $C_r = \{x: t(x)\in C_r^T\}$, i.e., the critical region can be defined in terms of outcomes, $C_r^T$, of the statistic $T$, then $T$ is referred to as a test statistic for the hypothesis $H$ versus $\bar H$. The set $C_r^T$ will be referred to as the critical (or rejection) region of the test statistic $T$.
Example 4.4 Consider a sample from a $N(\mu, 25)$ population and the hypothesis $H: \mu < 17$. Then the critical region
\[
C_r = \{x: \bar x_n > 17 + \tfrac{5}{\sqrt{n}}\}
\]
is defined in terms of the test statistic $\bar X_n$.
The statistical properties of a test are summarized by its power function.
Definition (Power function): The power function of a test with critical region $C_r$ is
\[
\pi(\theta) = P(x\in C_r;\theta) = \int_{x\in C_r} f(x;\theta)\,dx \ \ \text{(cont. case)}
\qquad\text{or}\qquad
\sum_{x\in C_r} f(x;\theta) \ \ \text{(discr. case)}.
\]
For $\theta\in H$ the power $\pi(\theta)$ is the probability of a type I error and $1-\pi(\theta) = P(\text{correct decision})$; for $\theta\in\bar H$ the power is the probability of a correct rejection and $1-\pi(\theta) = P(\text{type II error})$.
Hence, the power function summarizes all of the characteristics of a test w.r.t. the probabilities of making a correct/incorrect decision. This makes it a useful tool for the comparison of alternative tests for a particular parametric hypothesis.
Example 4.5 Let $x$ be a random sample of size $n = 200$ from a Bernoulli population with $P(x=1) = \theta$. Consider testing $H: 0\le\theta\le 0.02$ using a test based on
\[
C_r = \Bigl\{x: \sum_{i=1}^{200} x_i > 5\Bigr\}.
\]
The power function is
\[
\pi(\theta) = P(x\in C_r;\theta) = P\Bigl(\underbrace{\sum_{i=1}^{200}x_i}_{\text{Binomial}(200,\theta)} > 5;\ \theta\Bigr)
= \sum_{j=6}^{200}\frac{200!}{j!(200-j)!}\,\theta^j(1-\theta)^{200-j}, \qquad \theta\in[0,1].
\]
An ideal test of $H$ would have the power function
\[
\pi(\theta) = \begin{cases} 1 & \text{if } \theta\in\bar H,\\ 0 & \text{else}.\end{cases}
\]
When comparing two tests for a given $H$, a test is better if it has higher power for $\theta\in\bar H$ and lower power for $\theta\in H$, which implies that the better test has lower probabilities of both type I and type II error.
Properties of statistical tests
The power function can be used to define important properties of a test, including the size and the significance level of the test, and whether it is unbiased and consistent.
The size of a test is the maximum probability of a type I error.
Definition (Size of test): Let $\pi(\theta)$ be the power function of a test $C_r$ for the hypothesis $H$. Then the size of the test is
\[
\alpha = \sup_{\theta\in H}\pi(\theta) = \sup_{\theta\in H} P(x\in C_r;\theta).
\]
The significance level is an upper bound for the type I error probability. The difference between the size and the significance level is that the former is $\sup_{\theta\in H}P(\text{type I error})$, while the latter is only a bound that might not be equal to $P(\text{type I error})$ for any $\theta\in H$ nor equal to $\sup_{\theta\in H}P(\text{type I error})$. Thus a test of $H$ having size $\alpha$ is a test of significance level $\gamma$ for any $\gamma\ge\alpha$.
Example 4.6 Let $x = (X_1,\ldots,X_n)'$ be a random sample from a $N(\mu, 25)$ population and consider the hypothesis $H: \mu \le 17$. What is the size of the test with critical region
\[
C_r = \{x: \bar x_n > 17 + \tfrac{5}{\sqrt{n}}\}\,?
\]
To answer this question, we first need to derive the power function. The power function obtains as
\[
\pi(\mu) = P(x\in C_r;\mu) = P\bigl(\bar x_n > 17 + 5/\sqrt{n};\ \mu\bigr)
= 1 - P\bigl(\bar x_n \le 17 + 5/\sqrt{n};\ \mu\bigr)
= 1 - P\Bigl(\underbrace{\frac{\bar x_n - \mu}{5/\sqrt{n}}}_{N(0,1)} \le \frac{17 + 5/\sqrt{n} - \mu}{5/\sqrt{n}};\ \mu\Bigr),
\]
so that
\[
\pi(\mu) = 1 - \Phi\Bigl(\underbrace{\frac{17 + 5/\sqrt{n} - \mu}{5/\sqrt{n}}}_{\text{monot. decreasing in }\mu}\Bigr),
\]
which is monotonically increasing in $\mu$. Thus the maximizer of the power function under the constraint $H: \mu\le 17$ is the upper bound of $H$, namely $\mu = 17$. Hence the size of the test is
\[
\alpha = \sup_{\mu\in H}\pi(\mu) = 1 - \Phi(1) = 0.159.
\]
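A quick numerical evaluation of this size (a sketch; the critical region and the population variance 25 are taken from the example, while the sample size is an arbitrary illustrative choice that does not affect the result):

```python
import numpy as np
from scipy.stats import norm

n = 25                                     # illustrative sample size
mu = 17.0                                  # boundary value of H: mu <= 17
crit = 17.0 + 5.0 / np.sqrt(n)             # critical value for the sample mean
size = 1.0 - norm.cdf((crit - mu) / (5.0 / np.sqrt(n)))
print(size)                                # = 1 - Phi(1) ~ 0.159 for any n
```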
The concept of unbiasedness within the context of tests refers to a test that has smaller probability of rejecting H when H is true compared to when H is false.
Definition (Unbiasedness of a test): Let $\pi(\theta)$ be the power function of a test for the hypothesis $H$. The test is called unbiased iff
\[
\sup_{\theta\in H}\pi(\theta) \ \le\ \inf_{\theta\in\bar H}\pi(\theta).
\]
The definition implies that if the height of the power function graph is everywhere lower for $\theta\in H$ than for $\theta\in\bar H$, the test is unbiased. Clearly, unbiasedness is a desirable property of a test.
Another desirable property of a test is that, for a given size (supremum of the type-I error probability), the test exhibits the smallest possible type-II error probability. Such a test is called uniformly most powerful.
Definition (Uniformly most powerful (UMP) size-$\alpha$ test): Let $\Xi_\alpha = \{C_r: \sup_{\theta\in H}\pi_{C_r}(\theta) \le \alpha\}$ be the set of all critical regions with a size of at most $\alpha$ for the hypothesis $H$. A test with critical region $C_r^*\in\Xi_\alpha$ and with a power function $\pi_{C_r^*}(\theta)$ is called uniformly most powerful of size $\alpha$ iff
\[
\sup_{\theta\in H}\pi_{C_r^*}(\theta) = \alpha,
\qquad\text{and}\qquad
\pi_{C_r^*}(\theta) \ge \pi_{C_r}(\theta) \quad \forall\theta\in\bar H \ \text{and}\ \forall C_r\in\Xi_\alpha.
\]
The definition implies that the UMP test of $H$ is the best test of $H$ providing protection against type I error equal to $\alpha$. Equivalently, it is the size-$\alpha$ test having the most power uniformly in $\theta$ for $\theta\in\bar H$. In the case where $\bar H$ is simple, referring to one value of $\theta$, the test $C_r^*$ is called most powerful (the adverb "uniformly" being redundant in this case).
Unfortunately, in many cases of practical interest, UMP tests do not exist. But, as we shall see later, it is sometimes possible to restrict attention to the class of unbiased tests and define a UMP test within this class.
Definition (Admissibility of a test): Let $C_r$ be a test of $H$. If there exists an alternative test $C_r^*$ such that
\[
\pi_{C_r^*}(\theta) \le \pi_{C_r}(\theta) \quad\forall\theta\in H
\qquad\text{and}\qquad
\pi_{C_r^*}(\theta) \ge \pi_{C_r}(\theta) \quad\forall\theta\in\bar H,
\]
with strict inequality holding for some $\theta\in H\cup\bar H$, then $C_r$ is inadmissible.
From the definition, it follows that an inadmissible test is one that is dominated by another test in terms of protection against both type I and type II errors. Inadmissible tests can be eliminated from any further consideration in testing applications.
A further desirable property of a test is that for a given significance level the probability $P(\text{type II error}) \to 0$ as the sample size $n\to\infty$. This property is called consistency. In the definition, since the critical region of a test will generally change as $n$ changes, we use the notation $C_{r_n}$ to indicate the dependence of the critical region on $n$.
Definition (Consistency of a test): Let $\{C_{r_n}\}$ be a sequence of tests of $H$ based on a random sample $(X_1,\ldots,X_n)$. Let the significance level of the test $C_{r_n}$ be $\alpha$ $\forall n$. Then the sequence of tests $\{C_{r_n}\}$ is said to be a consistent sequence of significance level-$\alpha$ tests iff
\[
\lim_{n\to\infty}\pi_{C_{r_n}}(\theta) = 1 \qquad \forall\theta\in\bar H.
\]
From the definition, it follows that a consistent test is such that, in the limit, the probability is 1 that $H$ is rejected whenever $H$ is false.
Theorem 4.1 (Neyman-Pearson Lemma) Let $x$ be a random sample from $f(x;\theta)$. Furthermore, let $k > 0$ be a positive constant and $C_r$ a critical region which satisfy
\[
\text{1.}\quad P(x\in C_r;\theta_0) = \alpha, \quad 0 < \alpha < 1;
\qquad
\text{2.}\quad \frac{f(x;\theta_0)}{f(x;\theta_1)} \le k \quad\forall x\in C_r;
\qquad
\text{3.}\quad \frac{f(x;\theta_0)}{f(x;\theta_1)} > k \quad\forall x\notin C_r.
\]
Then $C_r$ is the most powerful critical region of size $\alpha$ for testing the hypothesis $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$.
Proof
Let $C_r^*$ represent any other critical region of size $\alpha$, so that $P(x\in C_r^*;\theta_0) = \alpha$. Define the indicator functions
\[
I_{C_r}(x) = \begin{cases}1 & \text{if } x\in C_r\\ 0 & \text{else}\end{cases}
\qquad\text{and}\qquad
I_{C_r^*}(x) = \begin{cases}1 & \text{if } x\in C_r^*\\ 0 & \text{else},\end{cases}
\]
and note that
\[
I_{C_r}(x) - I_{C_r^*}(x) = \begin{cases} 1 - I_{C_r^*}(x) \ge 0 & \text{if } x\in C_r\\ -I_{C_r^*}(x) \le 0 & \text{if } x\notin C_r,\end{cases}
\qquad\text{while}\qquad
f(x;\theta_1)\begin{cases} \ge \tfrac{1}{k}f(x;\theta_0) & \text{if } x\in C_r\\ < \tfrac{1}{k}f(x;\theta_0) & \text{if } x\notin C_r.\end{cases}
\]
In either case,
\[
[I_{C_r}(x) - I_{C_r^*}(x)]\,f(x;\theta_1) \ \ge\ \frac{1}{k}[I_{C_r}(x) - I_{C_r^*}(x)]\,f(x;\theta_0), \qquad\forall x.
\]
Assuming that $x$ is discrete, the summation of both sides of this inequality over all $x$ values in the sample space $R$ yields (if $x$ is continuous, substitute the summation by an integration)
\[
\sum_{x\in R}[I_{C_r}(x) - I_{C_r^*}(x)]f(x;\theta_1) = P(x\in C_r;\theta_1) - P(x\in C_r^*;\theta_1)
\ \ge\ \frac{1}{k}\bigl[\underbrace{P(x\in C_r;\theta_0)}_{=\ \text{size fixed to be }\alpha} - \underbrace{P(x\in C_r^*;\theta_0)}_{=\ \text{size fixed to be }\alpha}\bigr] = 0.
\]
Hence we obtain
\[
P(x\in C_r;\theta_1) - P(x\in C_r^*;\theta_1) \ \ge\ 0.
\]
Hence, the size-$\alpha$ test $C_r$ has a power for $\theta_1$ which is larger than or equal to that of any other size-$\alpha$ test $C_r^*$. Thus $C_r$ is the most powerful size-$\alpha$ test. $\Box$
The Neyman-Pearson lemma facilitates the construction of a most powerful test of $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$. In particular, the most powerful critical region is defined by a test statistic, namely the likelihood ratio
$$\lambda = \frac{f(x;\theta_0)}{f(x;\theta_1)}.$$
Note that the smaller $\lambda$, the less plausible is $H_0: \theta = \theta_0$ relative to $H_1: \theta = \theta_1$.
Using the likelihood ratio, the most powerful critical region can be represented as
$$C_r = \{x : f(x;\theta_0)/f(x;\theta_1) \le k\},$$
where the critical value $k$ is selected such that the test $C_r$ has a given size $\alpha$, i.e.
$$P\big(f(x;\theta_0)/f(x;\theta_1) \le k;\ \theta_0\big) = \alpha.$$
Example 4.7 Let $x$ be a random sample of size $n$ from an exponential population with pdf
$$f(x;\theta) = \theta \exp\{-\theta x\}; \quad x \in (0,\infty),\ \theta > 0.$$
Consider the hypotheses $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ with $\theta_0 < \theta_1$. The most powerful critical region is defined in terms of the likelihood ratio, which obtains as
$$\lambda = \frac{f(x;\theta_0)}{f(x;\theta_1)} = \frac{\theta_0^n \exp\{-\theta_0 \sum_{i=1}^n x_i\}}{\theta_1^n \exp\{-\theta_1 \sum_{i=1}^n x_i\}} = \Big(\frac{\theta_0}{\theta_1}\Big)^n \exp\Big\{(\theta_1 - \theta_0)\sum_{i=1}^n x_i\Big\}.$$
According to the Neyman-Pearson lemma, the most powerful critical region has the form
$$C_r = \{x : f(x;\theta_0)/f(x;\theta_1) \le k\} = \Big\{x : (\theta_0/\theta_1)^n \exp\Big\{(\theta_1 - \theta_0)\sum_{i=1}^n x_i\Big\} \le k\Big\} = \Big\{x : \sum_{i=1}^n x_i \le \underbrace{(\theta_1 - \theta_0)^{-1} \ln[(\theta_1/\theta_0)^n k]}_{=: k^*}\Big\},$$
defining the critical region in terms of the test statistic $\sum_i X_i$. Now, the value for $k^*$ is selected such that $C_r$ has a given size $\alpha$, i.e.
$$P(x \in C_r; \theta_0) = P\Big(\sum_{i=1}^n x_i \le k^*;\ \theta_0\Big) = \alpha.$$
Now note that since $X_i \sim \text{Exponential}(\theta)$, the sum $\sum_{i=1}^n X_i$ has a Gamma$(n,\theta)$-distribution; hence the relationship between the critical value $k^*$ and the size $\alpha$ is
$$P\Big(\sum_{i=1}^n x_i \le k^*;\ \theta_0\Big) = \int_0^{k^*} \frac{\theta_0^n\, v^{n-1}}{\Gamma(n)} \exp\{-\theta_0 v\}\, dv = \alpha,$$
which can be solved for $k^*$ (which is the $\alpha$-quantile of the Gamma$(n,\theta_0)$-distribution). Thus the most powerful test of size $\alpha$ for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ is based upon the following decision rule:
$$\text{Reject } H_0 \text{ iff } \sum_{i=1}^n x_i \le k^*.$$
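A minimal numerical sketch (not part of the original notes) of how the critical value $k^*$ and the power of this most powerful test could be computed with scipy; the values of n, theta0, theta1 and alpha below are illustrative assumptions.

```python
# Sketch: critical value and power of the NP most powerful test for the
# exponential example (reject H0 if sum(x) <= k_star). Illustrative inputs.
from scipy.stats import gamma

n, theta0, theta1, alpha = 10, 1.0, 2.0, 0.05    # illustrative values

# sum(X_i) ~ Gamma(n, rate=theta)  =>  scale = 1/theta in scipy's parametrization
k_star = gamma.ppf(alpha, a=n, scale=1/theta0)   # alpha-quantile under H0
power  = gamma.cdf(k_star, a=n, scale=1/theta1)  # P(sum(x) <= k_star; theta1)

print(f"k* = {k_star:.4f}, size = {alpha}, power at theta1 = {power:.4f}")
```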
It can be shown that whenever the random vector $x$ is continuous, a Neyman-Pearson most powerful test for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ will exist for any size $\alpha \in (0,1)$. In cases where the random vector $x$ is discrete, a Neyman-Pearson most powerful test will typically not exist for every choice of $\alpha$; in those cases, the most powerful test exists only for a limited number of sizes $\alpha$. In order to see this, consider the construction of the Neyman-Pearson most powerful critical region, whereby $C_r$ obtains as
$$C_r = \{x : f(x;\theta_0)/f(x;\theta_1) \le k\},$$
as illustrated in the following example.
Example 4.8 Let $x$ be a random sample of size $n$ from a Bernoulli population with pdf
$$f(x;\theta) = \theta^x (1-\theta)^{1-x}; \quad x \in \{0,1\},\ \theta \in [0,1].$$
The likelihood ratio obtains as
$$\frac{f(x;\theta_0)}{f(x;\theta_1)} = \frac{\theta_0^{\sum_i x_i}(1-\theta_0)^{n - \sum_i x_i}}{\theta_1^{\sum_i x_i}(1-\theta_1)^{n - \sum_i x_i}} = \Big[\frac{\theta_0(1-\theta_1)}{\theta_1(1-\theta_0)}\Big]^{\sum_i x_i} \Big(\frac{1-\theta_0}{1-\theta_1}\Big)^{n},$$
so that the Neyman-Pearson most powerful critical region has the form
$$C_r = \Big\{x : \Big[\frac{\theta_0(1-\theta_1)}{\theta_1(1-\theta_0)}\Big]^{\sum_i x_i} \Big(\frac{1-\theta_0}{1-\theta_1}\Big)^{n} \le k\Big\} = \Big\{x : \sum_{i=1}^n x_i \ge k^*\Big\}$$
(for $\theta_0 < \theta_1$ the bracketed ratio is smaller than one, so the likelihood ratio is decreasing in $\sum_i x_i$). Since $\sum_{i=1}^n X_i$ is Binomial$(n,\theta)$ distributed, the relationship between the choice of $k^*$ and the size of the test implied by $C_r$ is given by
$$\alpha = P(x \in C_r; \theta_0) = P\Big(\sum_{i=1}^n x_i \ge k^*;\ \theta_0\Big) = \sum_{j=k^*}^{n} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j}.$$
For $\theta_0 = 0.2$, $\theta_1 = 0.8$, and $n = 20$, possible choices of the test size are

 k*   alpha      k*   alpha
  1   0.9885      7   0.0867
  2   0.9308      8   0.0321
  3   0.7939      9   0.0100
  4   0.5886     10   0.0026
  5   0.3704     11   0.0006
  6   0.1958     12   0.0001

Note that there are no choices of $\alpha$ within the range $[0.0001, 0.9885]$ other than the ones displayed in the table. Hence, a Neyman-Pearson most powerful test of size, say, $\alpha = 0.05$ does not exist.
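The attainable sizes in the table can be reproduced numerically; the following sketch (illustrative, using scipy's binomial survival function) lists $\alpha = P(\sum_i x_i \ge k^*; \theta_0)$ for $k^* = 1,\ldots,12$.

```python
# Sketch: attainable sizes of the NP test in the Bernoulli example,
# alpha(k*) = P(sum(x) >= k*; theta0) with n = 20 and theta0 = 0.2.
from scipy.stats import binom

n, theta0 = 20, 0.2
for k_star in range(1, 13):
    alpha = binom.sf(k_star - 1, n, theta0)   # P(X >= k*) = 1 - P(X <= k*-1)
    print(f"k* = {k_star:2d}  alpha = {alpha:.4f}")
```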
One can show that the Neyman-Pearson most powerful test of $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ is also an unbiased test, that is, a test which has a smaller probability of rejecting $H_0$ when $H_0$ is true than when $H_0$ is false (see Mittelhammer, 1996, Theorem 9.2).
Neyman-Pearson approach - composite hypotheses
Up to this point, we considered the construction of the most powerful test for a simple $H_0$ against a simple $H_1$ by means of the Neyman-Pearson lemma. In some cases, the Neyman-Pearson approach can also be used to identify UMP tests when the alternative hypothesis is composite, i.e.
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \in \Omega_{H_1}.$$
The idea is to show via Neyman-Pearson that the critical region $C_r$ of a MP test of $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ is the same $\forall\, \theta_1 \in \Omega_{H_1}$. It would follow that this $C_r$ defines a uniformly most powerful test for $H_0: \theta = \theta_0$ versus $H_1: \theta \in \Omega_{H_1}$.
Theorem 4.2 (UMP Test of a simple H0 versus a composite H1) Let $x$ be a random sample from $f(x;\theta)$. Furthermore, let $\{k(\theta_1) > 0\}$ be a collection of positive constants with $\theta_1 \in \Omega_{H_1}$ and $C_r$ a critical region which satisfy

1. $P(x \in C_r; \theta_0) = \alpha$, $0 < \alpha < 1$;

2. $f(x;\theta_0)/f(x;\theta_1) \le k(\theta_1)$ $\ \forall\, x \in C_r$ and $\forall\, \theta_1 \in \Omega_{H_1}$;

3. $f(x;\theta_0)/f(x;\theta_1) > k(\theta_1)$ $\ \forall\, x \notin C_r$ and $\forall\, \theta_1 \in \Omega_{H_1}$.

Then $C_r$ is the uniformly most powerful critical region of size $\alpha$ for testing the hypothesis $H_0: \theta = \theta_0$ versus $H_1: \theta \in \Omega_{H_1}$.

Proof
According to the Neyman-Pearson lemma, the region $C_r$ satisfying the conditions (1)-(3) defines the most powerful size-$\alpha$ test for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$. Since the same critical region, $C_r$, applies for any $\theta_1 \in \Omega_{H_1}$, the region $C_r$ is also most powerful for any value of $\theta_1 \in \Omega_{H_1}$. Thus $C_r$ is a UMP size-$\alpha$ test.
Example 4.9 Recall the example where we considered a sample $x$ of size $n$ from an exponential population with pdf $f(x;\theta) = \theta\exp\{-\theta x\}$ and simple hypotheses $H_0$ and $H_1$. Now consider the hypotheses
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta > \theta_0.$$
As shown above, the most powerful size-$\alpha$ test for $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$ with $\theta_0 < \theta_1$ is
$$C_r = \Big\{x : \sum_{i=1}^n x_i \le \underbrace{(\theta_1 - \theta_0)^{-1}\ln[(\theta_1/\theta_0)^n k]}_{=k^*}\Big\}, \quad\text{with}\quad \int_0^{k^*} \frac{\theta_0^n v^{n-1}}{\Gamma(n)}\exp\{-\theta_0 v\}\, dv = \alpha.$$
The form of this critical region $C_r$ is independent of the particular value of $\theta_1$ (as long as $\theta_1 > \theta_0$): it is the same for each value of $\theta_1 > \theta_0$. (Note that, since $k^*$ is fixed, the values for $k$ depend upon $\theta_1$, so that $k$ is a function $k(\theta_1)$.) Thus, by the above Theorem, the test $C_r$ also defines a UMP test for $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$.
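The fact that one and the same critical region is most powerful for every $\theta_1 > \theta_0$ can also be illustrated numerically; the sketch below (with illustrative parameter values) fixes $k^*$ from the size condition at $\theta_0$ and then evaluates the power for several alternatives $\theta_1$.

```python
# Sketch: the UMP critical region {sum(x) <= k*} does not depend on theta1;
# only its power does. Illustrative values for n, theta0 and alpha.
from scipy.stats import gamma

n, theta0, alpha = 10, 1.0, 0.05
k_star = gamma.ppf(alpha, a=n, scale=1/theta0)   # the same k* for all theta1 > theta0

for theta1 in (1.2, 1.5, 2.0, 3.0):
    power = gamma.cdf(k_star, a=n, scale=1/theta1)
    print(f"theta1 = {theta1:3.1f}  power = {power:.4f}")
```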
Note that for the hypotheses
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta < \theta_0,$$
the most powerful size-$\alpha$ test for $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$ with $\theta_1 < \theta_0$ rejects for large values of $\sum_i x_i$, i.e.
$$C_r = \Big\{x : \sum_{i=1}^n x_i \ge \underbrace{(\theta_1 - \theta_0)^{-1}\ln[(\theta_1/\theta_0)^n k]}_{=k^*}\Big\}, \quad\text{with}\quad \int_{k^*}^{\infty} \frac{\theta_0^n v^{n-1}}{\Gamma(n)}\exp\{-\theta_0 v\}\, dv = \alpha.$$
Thus, relative to the initial case where we considered $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$, the critical region for the UMP size-$\alpha$ test has changed! This implies that there will not exist a UMP test for
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \ne \theta_0.$$
In the previous discussion we considered a testing situation with a simple null and a composite alternative hypothesis, i.e.,
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \in \Omega_{H_1}.$$
Note that a composite alternative can be either a one-sided alternative hypothesis, such as
$$H_1: \theta > \theta_0 \quad\text{or}\quad H_1: \theta < \theta_0,$$
or a two-sided alternative hypothesis, such as
$$H_1: \theta \ne \theta_0.$$
In practice, UMP size-$\alpha$ tests of simple $H_0$'s versus one-sided $H_1$'s typically exist when the parameter $\theta$ is a scalar and $x$ is continuous (when $x$ is discrete, the UMP test typically exists only for a limited number of sizes $\alpha$). In sharp contrast, UMP size-$\alpha$ tests of simple $H_0$'s versus two-sided $H_1$'s will typically not exist (see the previous example). In such cases one must generally resort to seeking a UMP test within a subclass of tests, such as the class of unbiased tests.
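The non-existence of a UMP test against a two-sided alternative can be illustrated numerically: each one-sided critical region beats the other on its own side of $\theta_0$ but is dominated on the opposite side. A sketch for the exponential example, with illustrative values of n, theta0 and alpha:

```python
# Sketch: neither one-sided exponential test dominates uniformly over a
# two-sided alternative. Illustrative values for n, theta0 and alpha.
from scipy.stats import gamma

n, theta0, alpha = 10, 1.0, 0.05
k_low  = gamma.ppf(alpha, a=n, scale=1/theta0)       # reject if sum(x) <= k_low  (aimed at theta > theta0)
k_high = gamma.ppf(1 - alpha, a=n, scale=1/theta0)   # reject if sum(x) >= k_high (aimed at theta < theta0)

for theta in (0.5, 0.8, 1.25, 2.0):                  # alternatives on both sides of theta0
    power_low  = gamma.cdf(k_low,  a=n, scale=1/theta)
    power_high = gamma.sf(k_high, a=n, scale=1/theta)
    print(f"theta = {theta:4.2f}  power(low tail) = {power_low:.4f}  power(high tail) = {power_high:.4f}")
```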
Monotone likelihood ratio approach
The monotone likelihood ratio approach can be used to identify UMP size-$\alpha$ tests of composite null hypotheses versus composite one-sided alternative hypotheses, i.e.,
$$H_0: \theta \le \theta_0 \ (\text{or } H_0: \theta \ge \theta_0) \quad\text{versus}\quad H_1: \theta > \theta_0 \ (\text{or } H_1: \theta < \theta_0).$$
This procedure for defining UMP size-$\alpha$ tests relies on the concept of monotone likelihood ratios in statistics $T = t(x)$. As we shall see, the concept of monotone likelihood ratios allows us to apply the Neyman-Pearson approach to identify UMP tests of composite null hypotheses versus composite one-sided alternative hypotheses.
Definition (Monotone likelihood ratio): A family of density functions $\{f(x;\theta),\ \theta \in \Omega\}$ is said to have a monotone likelihood ratio in the statistic $T = t(x)$ iff, $\forall\, \theta_1 > \theta_2$, the likelihood ratio
$$\frac{L(\theta_1; x)}{L(\theta_2; x)} = \frac{f(x;\theta_1)}{f(x;\theta_2)}$$
is a nondecreasing function of $t(x)$.

Example 4.10 For a random sample $x$ of size $n$ from an exponential population and $\theta_1 > \theta_2$, the likelihood ratio obtains as
$$\frac{L(\theta_1;x)}{L(\theta_2;x)} = \underbrace{\Big(\frac{\theta_1}{\theta_2}\Big)^n}_{>0} \exp\Big\{-\underbrace{(\theta_1 - \theta_2)}_{>0}\sum_{i=1}^n x_i\Big\}.$$
Hence, $\forall\, \theta_1 > \theta_2$ the likelihood ratio is a nonincreasing function of $\sum_{i=1}^n x_i$ and hence a nondecreasing function of $t(x) = 1/\sum_{i=1}^n x_i$. Thus the family of densities $\{f(x;\theta)\}$ for an exponential population has a monotone likelihood ratio.
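The monotonicity of the likelihood ratio in $\sum_i x_i$ can also be checked numerically; the sketch below evaluates $L(\theta_1;x)/L(\theta_2;x)$ on a grid of values of the statistic (the values of n, theta1 and theta2 are illustrative assumptions).

```python
# Sketch: the exponential likelihood ratio L(theta1;x)/L(theta2;x), theta1 > theta2,
# depends on x only through s = sum(x) and is monotone (decreasing) in s.
import numpy as np

n, theta1, theta2 = 10, 2.0, 1.0                 # illustrative, theta1 > theta2
s = np.linspace(0.5, 20.0, 40)                   # grid of values of sum(x)
lr = (theta1 / theta2) ** n * np.exp(-(theta1 - theta2) * s)

assert np.all(np.diff(lr) < 0)                   # nonincreasing in sum(x)
print(lr[:5])                                    # largest values occur at small sum(x)
```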
The verification of whether a particular family of densities has a monotone likelihood ratio can
be quite difficult. However, if the family belongs to the exponential class of densities, the
verification is simplified by the following result.
Theorem 4.3 (Monotone likelihood ratio and the exponential class) Let $\{f(x;\theta),\ \theta \in \Omega\}$ be a density family belonging to the exponential class of densities, as
$$f(x;\theta) = \exp\{c(\theta)g(x) + d(\theta) + z(x)\}.$$
If $c(\theta)$ is a nondecreasing function of $\theta$, then $\{f(x;\theta),\ \theta \in \Omega\}$ has a monotone likelihood ratio in the statistic $g(x)$.
Proof
Let $\theta_1 > \theta_2$, and examine the likelihood ratio
$$\frac{L(\theta_1;x)}{L(\theta_2;x)} = \exp\Big\{\underbrace{[c(\theta_1) - c(\theta_2)]}_{\ge 0,\ \text{since } c(\theta)\ \text{is nondecreasing and } \theta_1 > \theta_2}\, g(x) + d^*(\theta_1,\theta_2)\Big\},$$
where $d^*(\theta_1,\theta_2) = d(\theta_1) - d(\theta_2)$ does not depend on $x$. Since the exponent is nondecreasing in $g(x)$, the likelihood ratio is a nondecreasing function of $g(x)$.
If the family of densities can be shown to have a monotone likelihood ratio in some statistic, then UMP size-$\alpha$ tests of
$$H_0: \theta \le \theta_0 \ (\text{or } H_0: \theta \ge \theta_0) \quad\text{versus}\quad H_1: \theta > \theta_0 \ (\text{or } H_1: \theta < \theta_0)$$
will exist and can be identified by the Neyman-Pearson approach.
Theorem 4.4 (Monotone likelihood ratios and UMP size-$\alpha$ tests) Let $\{f(x;\theta),\ \theta \in \Omega\}$ be a density family having a monotone likelihood ratio in the statistic $t(x)$. Then

(1.) $C_r = \{x : t(x) \ge k\}$, with $P(t(x) \ge k; \theta_0) = \alpha$, is a UMP size-$\alpha$ test for $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$;

(2.) $C_r = \{x : t(x) \le k\}$, with $P(t(x) \le k; \theta_0) = \alpha$, is a UMP size-$\alpha$ test for $H_0: \theta \ge \theta_0$ versus $H_1: \theta < \theta_0$.
Proof
See Mittelhammer (1996), Theorem 9.6 and Corollary 9.1.
The intuition behind the form of the UMP critical region implied by this result is the following. For $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$ the UMP critical region is given by $C_r = \{x : t(x) \ge k\}$. Now note that, by the monotone likelihood ratio property, for all $\theta_i \le \theta_0$ the larger $t(x)$, the larger is $L(\theta_1;x)/L(\theta_i;x)$ for any $\theta_1 > \theta_0$; that is, large values of $t(x)$ make the parameter values under $H_1: \theta > \theta_0$ more plausible relative to those under $H_0$, so that it is reasonable to reject $H_0$ for large values of $t(x)$.
Example 4.11 Let $x$ be a random sample from an exponential population with density
$$f(x;\theta) = \theta\exp\{-\theta x\}, \quad \theta > 0.$$
Define a UMP size-$\alpha$ test for $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$. As shown in the previous example, the family of density functions $\{f(x;\theta)\}$ for an exponential population has a monotone likelihood ratio in the statistic $1/\sum_{i=1}^n X_i$. Hence, we can rely on the UMP size-$\alpha$ test for monotone likelihood ratios, so that the UMP size-$\alpha$ test is given by
$$C_r = \Big\{x : 1/\sum_{i=1}^n x_i \ge k\Big\}, \quad\text{with}\quad P\Big(1/\sum_{i=1}^n x_i \ge k;\ \theta_0\Big) = P\Big(\sum_{i=1}^n x_i \le 1/k;\ \theta_0\Big) = \alpha.$$
Since $\sum_{i=1}^n X_i$ has a Gamma$(n,\theta)$-distribution, the appropriate value of $1/k$ is the solution to the integral equation
$$\int_0^{1/k} \frac{\theta_0^n v^{n-1}}{\Gamma(n)}\exp\{-\theta_0 v\}\, dv = \alpha,$$
i.e., $1/k$ is the $\alpha$-quantile of the Gamma$(n,\theta_0)$-distribution.
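For this UMP test of a composite null, the entire power function can be traced out numerically; the sketch below uses illustrative values of n, theta0 and alpha (the critical value $1/k$ is the $\alpha$-quantile of the Gamma$(n,\theta_0)$ distribution).

```python
# Sketch: power function of the UMP size-alpha test of H0: theta <= theta0,
# which rejects when sum(x) <= 1/k. Illustrative values below.
import numpy as np
from scipy.stats import gamma

n, theta0, alpha = 10, 1.0, 0.05
c = gamma.ppf(alpha, a=n, scale=1/theta0)        # c = 1/k, the alpha-quantile under theta0

for theta in np.linspace(0.5, 3.0, 6):
    power = gamma.cdf(c, a=n, scale=1/theta)     # P(sum(x) <= c; theta)
    print(f"theta = {theta:4.2f}  power = {power:.4f}")
# The power is increasing in theta and equals alpha at theta = theta0,
# so the size over H0: theta <= theta0 is attained at theta0.
```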
Likelihood ratio (LR) tests are widely used testing procedures: they generally have excellent asymptotic properties, and they often have good power in finite samples.

1. The test is based on the generalized likelihood ratio (GLR)
$$\lambda(x) = \frac{\sup_{\theta \in \Omega_{H_0}} L(\theta;x)}{\sup_{\theta \in \Omega_{H_0} \cup \Omega_{H_1}} L(\theta;x)},$$
where the numerator maximizes the likelihood under the restriction implied by $H_0$ and the denominator maximizes it over the entire maintained parameter space, so that $\lambda(x) \in [0,1]$.

2. Note that the GLR
$$\lambda(x) = \frac{\sup_{\theta \in \Omega_{H_0}} L(\theta;x)}{\sup_{\theta \in \Omega_{H_0} \cup \Omega_{H_1}} L(\theta;x)}$$
tends to be small (large) when the restriction $H_0$ is not true (is true). Hence, it appears reasonable to use a critical region of the form
$$C_r = \{x : \lambda(x) \le c\},$$
where we reject $H_0$ when the GLR is too small.
Example 4.12 Let $x$ be a random sample of size $n$ from an exponential population with density
$$f(x;\theta) = \theta\exp\{-\theta x\}, \quad \theta > 0,$$
and consider the hypotheses
$$H_0: \theta \le \theta_0 \quad\text{versus}\quad H_1: \theta > \theta_0.$$
Recall that the (unconstrained) ML estimate of $\theta$ is $\hat\theta = \big(\frac{1}{n}\sum_i x_i\big)^{-1} = n/\sum_i x_i$. Since $\Omega_{H_0} \cup \Omega_{H_1} = (0,\infty)$, it follows that the denominator of the GLR is
$$\sup_{\theta > 0} L(\theta;x) = \sup_{\theta > 0}\big[\theta^n \exp\{-\theta\textstyle\sum_i x_i\}\big] = \Big(n/\sum_i x_i\Big)^n \exp\{-n\}.$$
The numerator of the GLR depends on whether the restriction $\theta \le \theta_0$ is binding, and obtains as
$$\sup_{0 < \theta \le \theta_0} L(\theta;x) = \begin{cases} \big(n/\sum_i x_i\big)^n \exp\{-n\} & \text{if } \hat\theta \le \theta_0 \ \text{(non-binding)} \\ \theta_0^n \exp\{-\theta_0\sum_i x_i\} & \text{if } \hat\theta > \theta_0 \ \text{(binding)} \end{cases}$$
(note that the likelihood function is strictly concave in $\theta$). Hence the GLR obtains as
$$\lambda(x) = \begin{cases} 1 & \text{if } \hat\theta \le \theta_0 \ \text{(non-binding)} \\ \underbrace{\big[\theta_0\sum_i x_i/n\big]^n \exp\{-\theta_0\sum_i x_i + n\}}_{<1} & \text{if } \hat\theta > \theta_0 \ \text{(binding)} \end{cases}.$$
The LR test rejects $H_0$ when $\lambda(x) \le c$ for $0 < c < 1$, where $c = 1$ is excluded. (Setting $c = 1$ would imply that we will always reject $H_0$.) Hence, $C_r$ can also be represented as
$$C_r = \Big\{x : \big[\theta_0\textstyle\sum_i x_i/n\big]^n \exp\{-\theta_0\sum_i x_i + n\} \le c \ \text{ and } \ \underbrace{\hat\theta = n/\sum_i x_i > \theta_0}_{\text{(restriction is binding)}}\Big\}.$$
For a size-$\alpha$ test, $c$ is selected such that
$$\sup_{0 < \theta \le \theta_0} P(x \in C_r; \theta) = \alpha.$$
To determine $c$, define $y = \theta_0\sum_i x_i/n$, and note that the function $y^n\exp\{-n(y-1)\}$ is nondecreasing on $[0,1]$ and has a maximum at $y = 1$. Hence we can rewrite $C_r$ as
$$C_r = \{x : y \le k^*\ \text{ with } 0 < k^* < 1\} = \Big\{x : \sum_i \theta_0 x_i \le n k^*\Big\},$$
with
$$\sup_{0 < \theta \le \theta_0} P(x \in C_r; \theta) = \sup_{0 < \theta \le \theta_0} P\Big(\sum_i \theta_0 x_i \le n k^*;\ \theta\Big) = P\Big(\sum_i \theta_0 x_i \le n k^*;\ \theta_0\Big)$$
(since $P(\sum_i \theta_0 x_i \le n k^*; \theta) \le P(\sum_i \theta_0 x_i \le n k^*; \theta_0)$ $\forall\, \theta \le \theta_0$).
Finally note that for $\theta = \theta_0$, the random variable $\sum_i \theta_0 X_i$ has a Gamma$(n,1)$ distribution. It follows that the appropriate value of $k^*$ for a size-$\alpha$ LR test is the solution to the integral equation
$$\int_0^{n k^*} \frac{v^{n-1}}{\Gamma(n)}\exp\{-v\}\, dv = \alpha.$$
Finite sample properties of the LR test
Whether the LR test is unbiased and/or a UMP test for a given hypothesis must be established on a case-by-case basis, since it depends both on the characteristics of $f(x;\theta)$ and on the specification of $H_0$ and $H_1$. However, there are some parallels between the LR test and the Neyman-Pearson UMP test. In particular, if both $H_0$ and $H_1$ are simple, then the size-$\alpha$ LR test and the Neyman-Pearson MP size-$\alpha$ test will be equivalent.
Theorem 4.5 (Equivalence of LR and NP tests when H0 and H1 are simple) Suppose a size-$\alpha$ LR test of $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ exists with critical region
$$C_r^{LR} = \{x : \lambda(x) \le c\}, \quad\text{where } P(\lambda(x) \le c; \theta_0) = \alpha.$$
Furthermore, suppose a Neyman-Pearson most powerful size-$\alpha$ test also exists with critical region
$$C_r = \{x : L(\theta_0;x)/L(\theta_1;x) \le k\}, \quad\text{where } P(x \in C_r; \theta_0) = \alpha.$$
Then the LR test and the Neyman-Pearson most powerful test are equivalent.
Proof
Since $\lambda(x) \in [0,1]$, and given that $\alpha \in (0,1)$, it follows that
$$P(\lambda(x) \le c; \theta_0) = \alpha \quad\text{only if}\quad c < 1.$$
Hence, for a size $\alpha < 1$, the critical value of the LR test $c$ must be less than 1. Now let
$$\hat\theta = \arg\max_{\theta \in \{\theta_0,\theta_1\}} L(\theta;x), \qquad \lambda(x) = \frac{L(\theta_0;x)}{L(\hat\theta;x)}.$$
Then the partitioning of the sample space leads to the following relations:
$$x \in A = \{x : \hat\theta = \theta_1\} \;\Rightarrow\; \underbrace{\lambda(x) = \frac{L(\theta_0;x)}{L(\hat\theta;x)}}_{\text{GLR}} = \underbrace{\frac{L(\theta_0;x)}{L(\theta_1;x)}}_{\text{simple LR}} \le 1,$$
$$x \in B = \{x : \hat\theta = \theta_0\} \;\Rightarrow\; \lambda(x) = \frac{L(\theta_0;x)}{L(\hat\theta;x)} = 1 \le \frac{L(\theta_0;x)}{L(\theta_1;x)}.$$
It follows that
$$\lambda(x) = \frac{L(\theta_0;x)}{L(\hat\theta;x)} \le c < 1 \quad\text{only if}\quad \hat\theta = \theta_1 \ \text{ and } \ \lambda(x) = \frac{L(\theta_0;x)}{L(\theta_1;x)}$$
(GLR = simple LR of Neyman-Pearson). Thus, for $c < 1$ and $\alpha \in (0,1)$, if
$$P\Big(\frac{L(\theta_0;x)}{L(\hat\theta;x)} \le c;\ \theta_0\Big) = \alpha \quad\text{and}\quad P\Big(\frac{L(\theta_0;x)}{L(\theta_1;x)} \le c;\ \theta_0\Big) = \alpha,$$
then $C_r = C_r^{LR}$ and the LR test is equivalent to the Neyman-Pearson MP test of size $\alpha$.
The theorem implies that the LR test for a simple $H_0$ versus a simple $H_1$ is a most powerful test, provided an LR test and an MP test exist.
Example 4.13 Recall the example in which we considered a sample $x$ of size $n$ from an exponential population with pdf
$$f(x;\theta) = \theta\exp\{-\theta x\}; \quad x \in (0,\infty),\ \theta > 0,$$
and the simple hypotheses $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ with $\theta_0 < \theta_1$, for which the Neyman-Pearson most powerful size-$\alpha$ critical region is
$$C_r = \Big\{x : \sum_i x_i \le k^*\Big\}, \quad\text{where } k^* \text{ is the } \alpha\text{-quantile of a Gamma}(n,\theta_0)\text{-distribution}.$$
Now the GLR for this problem is
$$\lambda(x) = \frac{L(\theta_0;x)}{\max_{\theta\in\{\theta_0,\theta_1\}} L(\theta;x)} = \begin{cases} 1 & \text{if } \hat\theta = \theta_0,\ \text{i.e. } L(\theta_0;x) \ge L(\theta_1;x) \\ \underbrace{(\theta_0/\theta_1)^n \exp\{(\theta_1-\theta_0)\sum_i x_i\}}_{\text{simple LR}} & \text{if } \hat\theta = \theta_1,\ \text{i.e. } L(\theta_0;x) < L(\theta_1;x) \end{cases}.$$
For $c < 1$, the size condition of the LR test becomes
$$P(\lambda(x) \le c; \theta) = P\big((\theta_0/\theta_1)^n \exp\{(\theta_1-\theta_0)\textstyle\sum_i x_i\} \le c;\ \theta\big) = P\Big(\sum_i x_i \le \underbrace{(\theta_1-\theta_0)^{-1}\ln[(\theta_1/\theta_0)^n c]}_{=: c^*};\ \theta\Big).$$
Since the event depends on $x$ only through $\sum_i x_i$ (and noting that $\theta_0 < \theta_1$), the LR critical region can be written as
$$C_r^{LR} = \Big\{x : \sum_i x_i \le c^*\Big\},$$
where $c^*$ is chosen as the $\alpha$-quantile of the Gamma$(n,\theta_0)$-distribution so that the test has size $\alpha$. Thus, the critical regions of the LR test and the Neyman-Pearson MP test are identical.
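The equivalence of the two critical regions can also be checked by simulation; the sketch below (illustrative parameter values) draws samples under $\theta_0$ and verifies that the GLR rule and the Neyman-Pearson rule always reach the same decision.

```python
# Sketch: for simple H0 vs. simple H1 in the exponential example, the GLR rule
# lambda(x) <= c and the NP rule sum(x) <= k* reject on exactly the same samples.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
n, theta0, theta1, alpha = 10, 1.0, 2.0, 0.05    # illustrative values
k_star = gamma.ppf(alpha, a=n, scale=1/theta0)   # NP critical value
# matching GLR cutoff: value of the (binding) GLR evaluated at sum(x) = k_star
c = (theta0 / theta1) ** n * np.exp((theta1 - theta0) * k_star)

agree = 0
for _ in range(10_000):
    s = rng.exponential(scale=1/theta0, size=n).sum()
    glr = min(1.0, (theta0 / theta1) ** n * np.exp((theta1 - theta0) * s))
    agree += (glr <= c) == (s <= k_star)
print(agree)                                     # 10000: identical decisions
```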
As discussed above, the Neyman-Pearson lemma for simple $H_0$'s versus simple $H_1$'s can be extended to the case of defining UMP tests for simple $H_0$'s versus composite $H_1$'s (see Theorem 4.2). Similar to this, the result of the equivalence of the LR test and the Neyman-Pearson MP test for simple $H_0$'s versus simple $H_1$'s (see Theorem 4.5) can be extended to the case of simple $H_0$'s versus composite $H_1$'s.
Theorem 4.6 (Equivalence of LR and NP tests when H0 is simple and H1 is composite) Consider the hypotheses $H_0: \theta = \theta_0$ versus $H_1: \theta \in \Omega_{H_1}$. Suppose that for every $\theta_1 \in \Omega_{H_1}$ a size-$\alpha$ LR test of $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ exists, with GLR
$$\lambda_{\theta_1}(x) = \frac{L(\theta_0;x)}{\max_{\theta\in\{\theta_0,\theta_1\}} L(\theta;x)}$$
and critical region $C_r^{LR} = \{x : \lambda_{\theta_1}(x) \le c(\theta_1)\}$, where
$$P(\lambda_{\theta_1}(x) \le c(\theta_1);\ \theta_0) = \alpha.$$
Finally, suppose a Neyman-Pearson UMP test $C_r$ of $H_0$ versus $H_1$ having size $\alpha$ exists. Then the size-$\alpha$ LR test $C_r^{LR}$ and the size-$\alpha$ Neyman-Pearson UMP test $C_r$ are equivalent.
Proof
Given that a size-$\alpha$ Neyman-Pearson UMP test for $H_0: \theta = \theta_0$ versus $H_1: \theta \in \Omega_{H_1}$ exists, this test is the size-$\alpha$ Neyman-Pearson MP test for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ for every $\theta_1 \in \Omega_{H_1}$. Because it is assumed that for every $\theta_1 \in \Omega_{H_1}$ both the Neyman-Pearson MP test and the LR test of size $\alpha$ for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ exist, both tests are equivalent (see Theorem 4.5). It follows that, for $H_0: \theta = \theta_0$ versus $H_1: \theta \in \Omega_{H_1}$, the size-$\alpha$ LR test is equivalent to the size-$\alpha$ Neyman-Pearson UMP test.
The theorem implies that if the critical region of a size-$\alpha$ LR test of $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ is the same $\forall\, \theta_1 \in \Omega_{H_1}$, then, if a size-$\alpha$ Neyman-Pearson UMP test for $H_0: \theta = \theta_0$ versus $H_1: \theta \in \Omega_{H_1}$ exists, it is given by the LR test.
Example 4.14 Consider a sample $x$ of size $n$ from an exponential population with pdf
$$f(x;\theta) = \theta\exp\{-\theta x\}; \quad x \in (0,\infty),\ \theta > 0,$$
and the hypotheses $H_0: \theta = 1$ versus $H_1: \theta \in \Omega_{H_1} = (0,1)$. The GLR for this problem obtains as
$$\lambda(x) = \frac{L(1;x)}{\sup_{\theta\in(0,1]} L(\theta;x)},$$
where $L(1;x) = \exp\{-\sum_i x_i\}$ and
$$\sup_{\theta\in(0,1]} L(\theta;x) = \begin{cases} \exp\{-\sum_i x_i\} & \text{if } \underbrace{\hat\theta = n/\sum_i x_i}_{\text{ML estimate}} \ge 1 \\ \big(n/\sum_i x_i\big)^n \exp\{-n\} & \text{if } \hat\theta < 1 \end{cases}.$$
Hence we get
$$\lambda(x) = \begin{cases} 1 & \text{if } \hat\theta \ge 1 \\ \big(\sum_i x_i/n\big)^n \exp\{n - \sum_i x_i\} & \text{if } \hat\theta < 1 \end{cases}.$$
For $c < 1$, the size condition of the LR test becomes
$$P(\lambda(x) \le c; \theta) = P\big(\big(\textstyle\sum_i x_i/n\big)^n \exp\{n - \sum_i x_i\} \le c;\ \theta\big) = P\Big(\sum_i x_i \ge c^*;\ \theta\Big).$$
Since the event depends on $x$ only through $\sum_i x_i$, the LR critical region can be written as
$$C_r^{LR} = \Big\{x : \sum_i x_i \ge c^*\Big\},$$
where $c^*$ is chosen such that the test has size $\alpha$. Recall that the Neyman-Pearson UMP size-$\alpha$ test for $H_0: \theta = \theta_0$ versus $H_1: \theta < \theta_0$ is
$$C_r = \Big\{x : \sum_i x_i \ge k^*\Big\}$$
(see the example where we considered UMP tests for $H_0: \theta = \theta_0$ vs. $H_1: \theta > \theta_0$ and $H_0: \theta = \theta_0$ vs. $H_1: \theta < \theta_0$). Thus, the critical regions of the LR test and the Neyman-Pearson UMP test are identical.
Asymptotic properties of the LR test
Consider now hypotheses of the form
$$H_0: R(\theta) = r \quad\text{versus}\quad H_1: R(\theta) \ne r,$$
where $R(\theta) = r$ places $q$ linear and/or nonlinear restrictions on the elements of $\theta$. Examples are $2\theta_1 + 3\theta_2 = 3$ or $\exp\{\theta_2\} = 3$. It is also assumed that none of the $q$ restrictions is redundant. In this case it can be shown that, when $H_0$ is true,
$$-2\ln\lambda(x) \overset{a}{\sim} \chi^2_{(q)}.$$
Thus an asymptotically valid size-$\alpha$ LR test of $H_0: R(\theta) = r$ versus $H_1: R(\theta) \ne r$ is
$$C_r = \{x : -2\ln\lambda(x) \ge \chi^2_{q,\alpha}\},$$
where $\chi^2_{q,\alpha}$ denotes the $(1-\alpha)$-quantile of the $\chi^2_{(q)}$ distribution.
The formal result on the asymptotic distribution of the transformed GLR $-2\ln\lambda(x)$ when $H_0$ is true is given in the following theorem.

Theorem 4.7 (Asymptotic distribution of the GLR when H0 is true) Assume that the MLE $\hat\theta$ of the $(k \times 1)$ vector $\theta$ is consistent, asymptotically normal and asymptotically efficient. Let
$$\lambda(x) = \frac{\sup_{\theta\in\Omega_{H_0}} L(\theta;x)}{\sup_{\theta\in\Omega_{H_0}\cup\Omega_{H_1}} L(\theta;x)}$$
be the GLR statistic for testing $H_0: R(\theta) = r$ versus $H_1: R(\theta) \ne r$, where $R(\theta)$ is a $(q \times 1)$ continuously differentiable vector function having nonredundant coordinate functions and $q \le k$. Then, when $H_0$ is true,
$$-2\ln\lambda(x) \overset{d}{\to} \chi^2_{q}.$$
Proof
See Mittelhammer (1996), Theorem 10.5.
Example 4.15 Consider a sample $x$ of size $n = 10$ from a Poisson population with pdf
$$f(x;\theta) = \frac{\exp\{-\theta\}\theta^x}{x!}; \quad x \in \{0,1,2,\ldots\},\ \theta > 0,$$
and assume that $\sum_i x_i = 20$. Use an asymptotically valid size $\alpha = 0.05$ LR test to test the hypotheses $H_0: \theta = 1.8$ versus $H_1: \theta \ne 1.8$.
The unrestricted ML estimate of $\theta$ is $\hat\theta = \sum_i x_i/n = 2$; the restricted ML estimate of $\theta$ is $\hat\theta_r = 1.8$. It follows that the value of the GLR is given by
$$\lambda(x) = \frac{L(1.8;x)}{\sup_{\theta\in(0,\infty)} L(\theta;x)} = \frac{\exp\{-10 \cdot 1.8\}\, 1.8^{20}}{\exp\{-10 \cdot 2\}\, 2^{20}},$$
so that
$$-2\ln\lambda(x) = -2\,[-18 + 20 + 20(\ln(1.8) - \ln(2))] = 0.2144.$$
The critical region for an asymptotic size of $\alpha = 0.05$ is
$$C_r = \{x : -2\ln\lambda(x) \ge \chi^2_{1,0.05} = 3.84\},$$
so that $H_0$ cannot be rejected.
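The numbers in this example are easy to reproduce; a minimal sketch using scipy for the $\chi^2$ critical value:

```python
# Sketch: asymptotic LR test of H0: theta = 1.8 in the Poisson example
# (n = 10, sum(x) = 20, so the unrestricted MLE is 2.0).
import numpy as np
from scipy.stats import chi2

n, sum_x, theta_r = 10, 20, 1.8
theta_hat = sum_x / n                            # unrestricted MLE = 2.0
# log GLR: log L(theta_r) - log L(theta_hat); the x_i! terms cancel
log_lambda = (-n*theta_r + sum_x*np.log(theta_r)) - (-n*theta_hat + sum_x*np.log(theta_hat))
lr_stat = -2 * log_lambda                        # = 0.2144
crit = chi2.ppf(0.95, df=1)                      # = 3.84
print(f"-2 ln lambda = {lr_stat:.4f}, critical value = {crit:.2f}, reject = {lr_stat >= crit}")
```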
Under the alternative hypothesis $H_1$, the statistic $-2\ln\lambda(x)$ is asymptotically distributed as a noncentral $\chi^2_q(\delta)$ random variable.

Lagrange multiplier (LM) tests are based on the ML estimation of $\theta$ restricted by $H_0: R(\theta) = r$. The restricted ML estimate $\hat\theta_r$ and the corresponding Lagrange multipliers (LMs) $\hat\lambda_r$ solve the first-order conditions (f.o.c.s) of the restricted ML problem,
$$\frac{\partial \ln L(\hat\theta_r;x)}{\partial\theta} - \Big(\frac{\partial R(\hat\theta_r)}{\partial\theta'}\Big)'\hat\lambda_r = 0 \quad\text{and}\quad R(\hat\theta_r) - r = 0,$$
where $\hat\theta_r$ is the restricted ML estimate that solves the f.o.c.s and $\hat\lambda_r$ the corresponding LMs. Note that large values of the LM $\hat\lambda_r$ and large values of the log-likelihood gradient $\partial\ln L(\hat\theta_r;x)/\partial\theta$ indicate that large likelihood increases are possible if we relax the constraint $H_0: R(\theta) = r$. Hence, an LM and a gradient which are significantly different from 0 suggest that the restriction $H_0$ is false and should be rejected.
The following theorem introduces the two versions of the LM test statistic (G) and establishes
its asymptotic distribution under H0 .
Theorem 4.8 (Asymptotic distribution of the LM test statistic when H0 is true) Assume that the MLE $\hat\theta$ of the $(k \times 1)$ vector $\theta$ is consistent, asymptotically normal and asymptotically efficient. Let $\hat\theta_r$ and $\hat\lambda_r$ denote the restricted MLE and the LMs that solve
$$\max_\theta\ \ln L(\theta;x) - \lambda'[R(\theta) - r].$$
Then, when $H_0: R(\theta) = r$ is true, the LM statistic
$$G = \hat\lambda_r'\,\frac{\partial R(\hat\theta_r)}{\partial\theta'}\Big[-\frac{\partial^2\ln L(\hat\theta_r;x)}{\partial\theta\,\partial\theta'}\Big]^{-1}\Big(\frac{\partial R(\hat\theta_r)}{\partial\theta'}\Big)'\hat\lambda_r = \Big(\frac{\partial\ln L(\hat\theta_r;x)}{\partial\theta}\Big)'\Big[-\frac{\partial^2\ln L(\hat\theta_r;x)}{\partial\theta\,\partial\theta'}\Big]^{-1}\frac{\partial\ln L(\hat\theta_r;x)}{\partial\theta} \overset{d}{\to} \chi^2_q.$$
Proof
See Mittelhammer (1996), Theorem 10.7.
Theorem 4.8 provides the asymptotic distribution of the LM test statistic $G$ under the restriction $R(\theta) = r$. From this we can construct the following asymptotic size-$\alpha$ test of $H_0: R(\theta) = r$ versus $H_1: R(\theta) \ne r$:
$$C_r = \{x : g \ge \chi^2_{q,\alpha}\},$$
where $g$ denotes the observed value of the LM statistic. The LM test can have a computational advantage over the LR test: the latter needs both the restricted and the unrestricted ML estimate of $\theta$, whereas the former requires only the restricted estimate.
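As an illustration (this is not Mittelhammer's Example 10.11, which is referenced next), the LM statistic can be evaluated for the Poisson setting of Example 4.15, where the single restriction is $\theta = 1.8$; the score is $\sum_i x_i/\theta - n$ and the (expected) information is $n/\theta$. The sketch below uses the expected information evaluated at the restricted estimate, one of several asymptotically equivalent variants.

```python
# Sketch: LM (score) statistic for the Poisson setting of Example 4.15,
# H0: theta = 1.8 with n = 10 and sum(x) = 20 (single restriction, q = 1).
from scipy.stats import chi2

n, sum_x, theta_r = 10, 20, 1.8
score = sum_x / theta_r - n                      # d ln L / d theta at the restricted MLE
info = n / theta_r                               # expected information n/theta at theta_r
lm_stat = score**2 / info                        # = 0.2222, close to the LR value 0.2144
print(f"LM = {lm_stat:.4f}, reject = {lm_stat >= chi2.ppf(0.95, df=1)}")
```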
Example 4.16 See Mittelhammer (1996), Example 10.11.
The following theorem introduces the Wald test statistic (W ) and establishes its asymptotic
distribution under H0 .
Theorem 4.9 (Asymptotic distribution of the Wald test statistic when H0 is true) Let the random sample $x$ of size $n$ have the joint probability density function $f(x;\theta_0)$, let $\hat\theta$ be a consistent estimator for $\theta_0$ such that $\sqrt{n}(\hat\theta - \theta_0) \overset{d}{\to} N(0,\Sigma)$, and let $\hat\Sigma_n$ be a consistent estimator of $\Sigma$. Furthermore, consider the hypotheses $H_0: R(\theta) = r$ versus $H_1: R(\theta) \ne r$, where $R(\theta)$ is a $(q \times 1)$ continuously differentiable vector function of $\theta$ for which $q \le k$ and $R(\theta)$ contains no redundant coordinate functions. Finally, let $\partial R(\theta_0)/\partial\theta'$ have full row rank. Then under $H_0$ it follows that:
$$W = n\,[R(\hat\theta) - r]'\Big[\frac{\partial R(\hat\theta)}{\partial\theta'}\,\hat\Sigma_n\,\Big(\frac{\partial R(\hat\theta)}{\partial\theta'}\Big)'\Big]^{-1}[R(\hat\theta) - r] \overset{d}{\to} \chi^2_q.$$
Proof
See Mittelhammer (1996), Theorem 10.9.
Theorem 4.9 provides the asymptotic distribution of the Wald test statistic $W$ under the restriction $R(\theta) = r$. From this we can construct the following asymptotic size-$\alpha$ test of $H_0: R(\theta) = r$ versus $H_1: R(\theta) \ne r$:
$$C_r = \{x : w \ge \chi^2_{q,\alpha}\},$$
where $w$ denotes the observed value of the Wald statistic.
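Analogously, a sketch of the Wald statistic for the same Poisson setting (illustrative, with $R(\theta) = \theta$, $r = 1.8$, and the asymptotic variance of the MLE, $\theta/n$, estimated by $\hat\theta/n$):

```python
# Sketch: Wald statistic for the Poisson setting of Example 4.15,
# R(theta) = theta, r = 1.8; avar(theta_hat) = theta/n, estimated by theta_hat/n.
from scipy.stats import chi2

n, sum_x, r = 10, 20, 1.8
theta_hat = sum_x / n                            # unrestricted MLE = 2.0
avar_hat = theta_hat / n                         # estimated asymptotic variance of theta_hat
w_stat = (theta_hat - r) ** 2 / avar_hat         # = 0.2000
print(f"W = {w_stat:.4f}, reject = {w_stat >= chi2.ppf(0.95, df=1)}")
```

All three statistics (LR = 0.2144, LM = 0.2222, W = 0.2000) stay well below the critical value 3.84, so none of the tests rejects $H_0$ in this example.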
Appendix
A. Tables
Table A.1.: Quantiles of the $\chi^2$ distribution (row: degrees of freedom df; column: probability p; entry: p-quantile)

 df    0.5%     1%      2.5%     5%      10%      90%      95%      97.5%    99%      99.5%
  1    0.000    0.000   0.001    0.004   0.016    2.706    3.841    5.024    6.635    7.879
  2    0.010    0.020   0.051    0.103   0.211    4.605    5.991    7.378    9.210   10.597
  3    0.039    0.100   0.213    0.353   0.587    6.252    7.816    9.350   11.346   12.836
  4    0.195    0.292   0.484    0.712   1.065    7.780    9.488   11.144   13.277   14.859
  5    0.406    0.552   0.831    1.146   1.611    9.237   11.071   12.833   15.086   16.749
  6    0.673    0.871   1.237    1.636   2.204   10.645   12.592   14.450   16.812   18.547
  7    0.987    1.238   1.690    2.168   2.833   12.017   14.067   16.013   18.475   20.277
  8    1.343    1.646   2.180    2.733   3.490   13.362   15.507   17.535   20.090   21.955
  9    1.734    2.088   2.700    3.325   4.168   14.684   16.919   19.023   21.666   23.589
 10    2.155    2.558   3.247    3.940   4.865   15.987   18.307   20.483   23.209   25.188
 11    2.603    3.053   3.816    4.575   5.578   17.275   19.675   21.920   24.725   26.757
 12    3.074    3.570   4.404    5.226   6.304   18.549   21.026   23.337   26.217   28.299
 13    3.565    4.107   5.009    5.892   7.042   19.812   22.362   24.736   27.688   29.819
 14    4.075    4.660   5.629    6.571   7.790   21.064   23.685   26.119   29.141   31.319
 15    4.601    5.229   6.262    7.261   8.547   22.307   24.996   27.488   30.578   32.801
 16    5.142    5.812   6.908    7.962   9.312   23.542   26.296   28.845   32.000   34.267
 17    5.697    6.408   7.564    8.672  10.085   24.769   27.587   30.191   33.409   35.718
 18    6.265    7.015   8.231    9.390  10.865   25.989   28.869   31.526   34.805   37.156
 19    6.844    7.633   8.907   10.117  11.651   27.204   30.144   32.852   36.191   38.582
 20    7.434    8.260   9.591   10.851  12.443   28.412   31.410   34.170   37.566   39.997
Table A.2.: Quantiles of the standard normal distribution (the entry in row 0.abx and column 0.00c is the quantile $z_p = \Phi^{-1}(p)$ for $p = 0.abc$)

  p       0.000    0.001    0.002    0.003    0.004    0.005    0.006    0.007    0.008    0.009
 0.50x    0.0000   0.0025   0.0050   0.0075   0.0100   0.0125   0.0150   0.0175   0.0201   0.0226
 0.51x    0.0251   0.0276   0.0301   0.0326   0.0351   0.0376   0.0401   0.0426   0.0451   0.0476
 0.52x    0.0502   0.0527   0.0552   0.0577   0.0602   0.0627   0.0652   0.0677   0.0702   0.0728
 0.53x    0.0753   0.0778   0.0803   0.0828   0.0853   0.0878   0.0904   0.0929   0.0954   0.0979
 0.54x    0.1004   0.1030   0.1055   0.1080   0.1105   0.1130   0.1156   0.1181   0.1206   0.1231
 0.55x    0.1257   0.1282   0.1307   0.1332   0.1358   0.1383   0.1408   0.1434   0.1459   0.1484
 0.56x    0.1510   0.1535   0.1560   0.1586   0.1611   0.1637   0.1662   0.1687   0.1713   0.1738
 0.57x    0.1764   0.1789   0.1815   0.1840   0.1866   0.1891   0.1917   0.1942   0.1968   0.1993
 0.58x    0.2019   0.2045   0.2070   0.2096   0.2121   0.2147   0.2173   0.2198   0.2224   0.2250
 0.59x    0.2275   0.2301   0.2327   0.2353   0.2378   0.2404   0.2430   0.2456   0.2482   0.2508
 0.60x    0.2533   0.2559   0.2585   0.2611   0.2637   0.2663   0.2689   0.2715   0.2741   0.2767
 0.61x    0.2793   0.2819   0.2845   0.2871   0.2898   0.2924   0.2950   0.2976   0.3002   0.3029
 0.62x    0.3055   0.3081   0.3107   0.3134   0.3160   0.3186   0.3213   0.3239   0.3266   0.3292
 0.63x    0.3319   0.3345   0.3372   0.3398   0.3425   0.3451   0.3478   0.3505   0.3531   0.3558
 0.64x    0.3585   0.3611   0.3638   0.3665   0.3692   0.3719   0.3745   0.3772   0.3799   0.3826
 0.65x    0.3853   0.3880   0.3907   0.3934   0.3961   0.3989   0.4016   0.4043   0.4070   0.4097
 0.66x    0.4125   0.4152   0.4179   0.4207   0.4234   0.4261   0.4289   0.4316   0.4344   0.4372
 0.67x    0.4399   0.4427   0.4454   0.4482   0.4510   0.4538   0.4565   0.4593   0.4621   0.4649
 0.68x    0.4677   0.4705   0.4733   0.4761   0.4789   0.4817   0.4845   0.4874   0.4902   0.4930
 0.69x    0.4959   0.4987   0.5015   0.5044   0.5072   0.5101   0.5129   0.5158   0.5187   0.5215
 0.70x    0.5244   0.5273   0.5302   0.5330   0.5359   0.5388   0.5417   0.5446   0.5476   0.5505
 0.71x    0.5534   0.5563   0.5592   0.5622   0.5651   0.5681   0.5710   0.5740   0.5769   0.5799
 0.72x    0.5828   0.5858   0.5888   0.5918   0.5948   0.5978   0.6008   0.6038   0.6068   0.6098
 0.73x    0.6128   0.6158   0.6189   0.6219   0.6250   0.6280   0.6311   0.6341   0.6372   0.6403
 0.74x    0.6433   0.6464   0.6495   0.6526   0.6557   0.6588   0.6620   0.6651   0.6682   0.6713
 0.75x    0.6745   0.6776   0.6808   0.6840   0.6871   0.6903   0.6935   0.6967   0.6999   0.7031
 0.76x    0.7063   0.7095   0.7128   0.7160   0.7192   0.7225   0.7257   0.7290   0.7323   0.7356
 0.77x    0.7388   0.7421   0.7454   0.7488   0.7521   0.7554   0.7588   0.7621   0.7655   0.7688
 0.78x    0.7722   0.7756   0.7790   0.7824   0.7858   0.7892   0.7926   0.7961   0.7995   0.8030
 0.79x    0.8064   0.8099   0.8134   0.8169   0.8204   0.8239   0.8274   0.8310   0.8345   0.8381
 0.80x    0.8416   0.8452   0.8488   0.8524   0.8560   0.8596   0.8633   0.8669   0.8705   0.8742
 0.81x    0.8779   0.8816   0.8853   0.8890   0.8927   0.8965   0.9002   0.9040   0.9078   0.9116
 0.82x    0.9154   0.9192   0.9230   0.9269   0.9307   0.9346   0.9385   0.9424   0.9463   0.9502
 0.83x    0.9542   0.9581   0.9621   0.9661   0.9701   0.9741   0.9782   0.9822   0.9863   0.9904
 0.84x    0.9945   0.9986   1.0027   1.0069   1.0110   1.0152   1.0194   1.0237   1.0279   1.0322
 0.85x    1.0364   1.0407   1.0450   1.0494   1.0537   1.0581   1.0625   1.0669   1.0714   1.0758
 0.86x    1.0803   1.0848   1.0893   1.0939   1.0985   1.1031   1.1077   1.1123   1.1170   1.1217
 0.87x    1.1264   1.1311   1.1359   1.1407   1.1455   1.1503   1.1552   1.1601   1.1650   1.1700
 0.88x    1.1750   1.1800   1.1850   1.1901   1.1952   1.2004   1.2055   1.2107   1.2160   1.2212
 0.89x    1.2265   1.2319   1.2372   1.2426   1.2481   1.2536   1.2591   1.2646   1.2702   1.2759
 0.90x    1.2816   1.2873   1.2930   1.2988   1.3047   1.3106   1.3165   1.3225   1.3285   1.3346
 0.91x    1.3408   1.3469   1.3532   1.3595   1.3658   1.3722   1.3787   1.3852   1.3917   1.3984
 0.92x    1.4051   1.4118   1.4187   1.4255   1.4325   1.4395   1.4466   1.4538   1.4611   1.4684
 0.93x    1.4758   1.4833   1.4909   1.4985   1.5063   1.5141   1.5220   1.5301   1.5382   1.5464
 0.94x    1.5548   1.5632   1.5718   1.5805   1.5893   1.5982   1.6072   1.6164   1.6258   1.6352
 0.95x    1.6449   1.6546   1.6646   1.6747   1.6849   1.6954   1.7060   1.7169   1.7279   1.7392
 0.96x    1.7507   1.7624   1.7744   1.7866   1.7991   1.8119   1.8250   1.8384   1.8522   1.8663
 0.97x    1.8808   1.8957   1.9110   1.9268   1.9431   1.9600   1.9774   1.9954   2.0141   2.0335
 0.98x    2.0537   2.0749   2.0969   2.1201   2.1444   2.1701   2.1973   2.2262   2.2571   2.2904
 0.99x    2.3263   2.3656   2.4089   2.4573   2.5121   2.5758   2.6521   2.7478   2.8782   3.0902
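Individual entries of both tables can be reproduced with scipy if no printed table is at hand; a minimal sketch:

```python
# Sketch: reproducing table entries with scipy.
from scipy.stats import chi2, norm

print(chi2.ppf(0.95, df=1))    # ~ 3.841   (Table A.1, df = 1, column 95%)
print(chi2.ppf(0.95, df=20))   # ~ 31.410  (Table A.1, df = 20, column 95%)
print(norm.ppf(0.975))         # ~ 1.9600  (Table A.2, p = 0.975)
print(norm.ppf(0.95))          # ~ 1.6449  (Table A.2, p = 0.950)
```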