MATHEMATICAL STATISTICS II
Spring 2018
Lecture Notes
Joshua M. Tebbs
Department of Statistics
University of South Carolina
© by Joshua M. Tebbs
Contents
6 Principles of Data Reduction
6.1 Introduction
6.2 The Sufficiency Principle
6.2.1 Sufficient statistics
6.2.2 Minimal sufficient statistics
6.2.3 Ancillary statistics
6.2.4 Sufficient, ancillary, and complete statistics
7 Point Estimation
7.1 Introduction
7.2 Methods of Finding Estimators
7.2.1 Method of moments
7.2.2 Maximum likelihood estimation
7.2.3 Bayesian estimation
7.3 Methods of Evaluating Estimators
7.3.1 Bias, variance, and MSE
7.3.2 Best unbiased estimators
7.3.3 Sufficiency and completeness
7.4 Appendix: CRLB Theory
8 Hypothesis Testing
8.1 Introduction
8.2 Methods of Finding Tests
8.2.1 Likelihood ratio tests
8.2.2 Bayesian tests
8.3 Methods of Evaluating Tests
8.3.1 Error probabilities and the power function
8.3.2 Most powerful tests
8.3.3 Uniformly most powerful tests
8.3.4 Probability values
6 Principles of Data Reduction
6.1 Introduction
A statistic T = T(X) partitions the sample space X into sets
A_t = {x ∈ X : T(x) = t},
for t ∈ T. The statistic T summarizes the data x in that one can report
T(x) = t ⟺ x ∈ A_t
instead of reporting x itself. This is the idea behind data reduction. We reduce the data
x so that they can be more easily understood without losing the meaning associated with
the set of observations.
Example 6.1. Suppose X1 , X2 , X3 are iid Bernoulli(θ), where 0 < θ < 1. The support of
X = (X1 , X2 , X3 ) is
X = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)}.
Consider the statistic T = T(X) = X1 + X2 + X3, whose possible values T = {0, 1, 2, 3} form the support of T. The statistic T summarizes the data in that it reports only the value
T (x) = t. It does not report which x ∈ X produced T (x) = t.
The Sufficiency Principle: If T(X) is a sufficient statistic for θ, then any inference about θ should depend on the sample X only through the value of T(X).
• In other words, if x ∈ X, y ∈ X, and T(x) = T(y), then inference for θ should be the same whether X = x or X = y is observed.
• For example, in Example 6.1, suppose
x = (1, 0, 0)
y = (0, 0, 1)
so that t = T (x) = T (y) = 1. The Sufficiency Principle says that inference for θ
depends only on the value of t = 1 and not on whether x or y was observed.
Definition 6.2.1 (CB): A statistic T(X) is a sufficient statistic for θ if the conditional distribution of the sample X given the value of T(X) does not depend on θ.
Theorem 6.2.2 (CB): If f_X(x|θ) is the joint pdf/pmf of X and f_T(t|θ) is the pdf/pmf of T(X), then T(X) is sufficient for θ if the ratio f_X(x|θ)/f_T(T(x)|θ) does not depend on θ.
Discussion: Note that in the discrete case, all distributions above can be interpreted as probabilities. From the definition of a conditional distribution,
f_{X|T}(x|t) = f_{X,T}(x, t|θ)/f_T(t|θ) = P_θ(X = x, T = t)/P_θ(T = t).
Because {X = x} ⊂ {T = t}, we have
P_θ(X = x, T = t) = P_θ(X = x) = f_X(x|θ).
Therefore,
f_{X|T}(x|t) = f_X(x|θ)/f_T(t|θ)
as claimed. If T is continuous, then f_T(t|θ) ≠ P_θ(T = t) and f_{X|T}(x|t) cannot be interpreted as a conditional probability. Fortunately, the criterion above; i.e.,
f_{X|T}(x|t) = f_X(x|θ)/f_T(t|θ)
being free of θ, still applies in the continuous case (although a more rigorous explanation
would be needed to see why).
Example 6.2. Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. Use Definition
6.2.1/Theorem 6.2.2 to show that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic.
Recall that T ∼ Poisson(nθ), shown by using mgfs. Therefore, the pmf of T , for t = 0, 1, 2, ...,
is
f_T(t|θ) = (nθ)^t e^{−nθ}/t!.
With t = Σ_{i=1}^n x_i, the conditional distribution
f_{X|T}(x|t) = f_X(x|θ)/f_T(t|θ) = [θ^{Σ_i x_i} e^{−nθ}/∏_{i=1}^n x_i!] / [(nθ)^t e^{−nθ}/t!] = t!/(n^t ∏_{i=1}^n x_i!),
which is free of θ. From the definition of sufficiency and from Theorem 6.2.2, we have shown that T = T(X) = Σ_{i=1}^n X_i is a sufficient statistic. □
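Sufficiency can also be illustrated numerically. The following R sketch (not part of the original notes; n = 3, t = 4, and the two values of θ are arbitrary choices) simulates Poisson samples, conditions on T = t, and shows that the conditional relative frequencies of the possible x vectors look the same for both values of θ:

# Conditional distribution of X given T = t does not depend on theta (Poisson case)
cond_freq <- function(theta, n = 3, t = 4, B = 2e5) {
  x <- matrix(rpois(B * n, theta), ncol = n)      # B samples of size n
  keep <- x[rowSums(x) == t, , drop = FALSE]      # condition on T = sum(X) = t
  tab <- table(apply(keep, 1, paste, collapse = ","))
  round(tab / nrow(keep), 3)                      # conditional relative frequencies
}
cond_freq(theta = 1)
cond_freq(theta = 2)   # agrees with the previous call up to Monte Carlo error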
Example 6.3. Suppose X1, X2, ..., Xn are iid exponential(θ), where θ > 0. Show that
T = T(X) = X̄
is a sufficient statistic.
Proof. The pdf of X, for x_i > 0, is given by
f_X(x|θ) = ∏_{i=1}^n (1/θ) e^{−x_i/θ} = (1/θ^n) e^{−Σ_{i=1}^n x_i/θ}.
Recall that X̄ ∼ gamma(n, θ/n), so f_T(x̄|θ) can be written down directly. Forming the ratio f_X(x|θ)/f_T(x̄|θ) and simplifying gives an expression
which is free of θ. From the definition of sufficiency and from Theorem 6.2.2, we have shown that T = T(X) = X̄ is a sufficient statistic. □
Example 6.4. Suppose X1 , X2 , ..., Xn is an iid sample from a continuous distribution with
pdf fX (x|θ), where θ ∈ Θ. Show that T = T(X) = (X(1) , X(2) , ..., X(n) ), the vector of order
statistics, is always sufficient.
Proof. Recall from Section 5.4 (CB) that the joint distribution of the n order statistics is
f_{X(1),X(2),...,X(n)}(x1, x2, ..., xn|θ) = n! f_X(x1|θ) f_X(x2|θ) ··· f_X(xn|θ) = n! f_X(x|θ),
where f_X(x|θ) denotes the joint pdf of X. Therefore,
f_X(x|θ)/f_T(t|θ) = f_X(x|θ)/[n! f_X(x|θ)] = 1/n!,
which is free of θ. From the definition of sufficiency and from Theorem 6.2.2, we have shown that T = T(X) = (X_(1), X_(2), ..., X_(n)) is a sufficient statistic. □
Discussion: Example 6.4 shows that (with continuous distributions), the order statistics
are always sufficient.
• Of course, reducing the sample X = (X1 , X2 , ..., Xn ) to T(X) = (X(1) , X(2) , ..., X(n) ) is
not that much of a reduction. However, in some parametric families, it is not possible
to reduce X any further without losing information about θ (e.g., Cauchy, logistic,
etc.); see pp 275 (CB).
• In some instances, it may be that the parametric form of fX (x|θ) is not specified. With
so little information provided about the population, we should not be surprised that
the only available reduction of X is to the order statistics.
Remark: The approach we have outlined to show that a statistic T is sufficient appeals to
Definition 6.2.1 and Theorem 6.2.2; i.e., we are using the definition of sufficiency directly by
showing that the conditional distribution of X given T is free of θ.
• What if we need to find a sufficient statistic? Then the approach we have just outlined
is not practical to implement (i.e., imagine trying different statistics T and for each
one attempting to show that fX|T (x|t) is free of θ). This might involve a large amount
of trial and error and you would have to derive the sampling distribution of T each
time (which for many statistics can be difficult or even intractable).
Theorem 6.2.6 (Factorization Theorem): Let f_X(x|θ) denote the joint pdf/pmf of the sample X. A statistic T(X) is sufficient for θ if and only if there exist functions g(t|θ) and h(x) such that, for all x and all θ,
f_X(x|θ) = g(T(x)|θ) h(x).
Proof. We prove the result for the discrete case only; the continuous case is beyond the scope of this course.
Necessity (⟹): Suppose T is sufficient. It suffices to show there exist functions g(t|θ) and
h(x) such that the factorization holds. Because T is sufficient, we know
fX|T (x|t) = P (X = x|T (X) = t)
is free of θ (this is the definition of sufficiency). Therefore, take
g(t|θ) = Pθ (T (X) = t)
h(x) = P (X = x|T (X) = t).
Because {X = x} ⊂ {T (X) = t},
fX (x|θ) = Pθ (X = x)
= Pθ (X = x, T (X) = t)
= Pθ (T (X) = t)P (X = x|T (X) = t) = g(t|θ)h(x).
Sufficiency (⇐=): Suppose the factorization holds. To establish that T = T (X) is sufficient,
it suffices to show that
fX|T (x|t) = P (X = x|T (X) = t)
is free of θ. Denoting T(x) = t, we have
f_{X|T}(x|t) = P(X = x|T(X) = t) = P_θ(X = x, T(X) = t)/P_θ(T(X) = t)
= P_θ(X = x) I(T(x) = t)/P_θ(T(X) = t)
= g(t|θ)h(x) I(T(x) = t)/P_θ(T(X) = t),
because the factorization holds by assumption. Now write
Pθ (T (X) = t) = Pθ (X ∈ At ),
where recall At = {x ∈ X : T (x) = t} is a set over (Rn , B(Rn ), PX ). Note that
P_θ(X ∈ A_t) = Σ_{x∈X: T(x)=t} P_θ(X = x)
= Σ_{x∈X: T(x)=t} g(t|θ)h(x)
= g(t|θ) Σ_{x∈X: T(x)=t} h(x).
Therefore,
f_{X|T}(x|t) = g(t|θ)h(x) I(T(x) = t) / [g(t|θ) Σ_{x∈X: T(x)=t} h(x)] = h(x) I(T(x) = t) / Σ_{x∈X: T(x)=t} h(x),
which is free of θ. □
Example 6.2 (continued). Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. We have
already shown that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic (using the definition of sufficiency). We now show this using the Factorization Theorem. For x_i = 0, 1, 2, ..., the pmf of X is
f_X(x|θ) = ∏_{i=1}^n θ^{x_i} e^{−θ}/x_i!
= θ^{Σ_i x_i} e^{−nθ} / ∏_{i=1}^n x_i!
= [θ^{Σ_i x_i} e^{−nθ}] × [1/∏_{i=1}^n x_i!] = g(t|θ) h(x),
where t = Σ_{i=1}^n x_i. By the Factorization Theorem, T = T(X) = Σ_{i=1}^n X_i is sufficient.
Example 6.5. Suppose X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. Find a sufficient statistic.
Solution. The pdf of X is
f_X(x|θ) = ∏_{i=1}^n (1/θ) I(0 < x_i < θ)
= (1/θ^n) ∏_{i=1}^n I(0 < x_i < θ)
= [(1/θ^n) I(x_(n) < θ)] × [∏_{i=1}^n I(x_i > 0)] = g(t|θ) h(x),
where t = x_(n). By the Factorization Theorem, T = T(X) = X_(n) is sufficient.
Example 6.6. Suppose X1 , X2 , ..., Xn are iid gamma(α, β), where α > 0 and β > 0. Note
that in this family, the parameter θ = (α, β) is two-dimensional. The pdf of X is
f_X(x|θ) = ∏_{i=1}^n [1/(Γ(α)β^α)] x_i^{α−1} e^{−x_i/β} I(x_i > 0)
= [1/(Γ(α)β^α)]^n (∏_{i=1}^n x_i)^α e^{−Σ_{i=1}^n x_i/β} × ∏_{i=1}^n [I(x_i > 0)/x_i] = g(t1, t2|θ) h(x),
where t1 = ∏_{i=1}^n x_i and t2 = Σ_{i=1}^n x_i. By the Factorization Theorem,
T = T(X) = ( ∏_{i=1}^n X_i, Σ_{i=1}^n X_i )
is sufficient.
Remark: In previous examples, we have seen that the dimension of a sufficient statistic T
often equals the dimension of the parameter θ:
• Example 6.2: Poisson(θ). T = Σ_{i=1}^n X_i; dim(T) = dim(θ) = 1
Sometimes the dimension of a sufficient statistic is larger than that of the parameter. We
have already seen this in Example 6.4 where T(X) = (X(1) , X(2) , ..., X(n) ), the vector of order
statistics, was sufficient; i.e., dim(T) = n. In some parametric families (e.g., Cauchy, etc.),
this statistic is sufficient and no further reduction is possible.
Example 6.7. Suppose X1 , X2 , ..., Xn are iid U(θ, θ + 1), where −∞ < θ < ∞. This is a
one-parameter family; i.e., dim(θ) = 1. The pdf of X is
f_X(x|θ) = ∏_{i=1}^n I(θ < x_i < θ + 1)
= ∏_{i=1}^n I(x_i > θ) ∏_{i=1}^n I(x_i − 1 < θ)
= [I(x_(1) > θ) I(x_(n) − 1 < θ)] × [∏_{i=1}^n I(x_i ∈ R)] = g(t1, t2|θ) h(x),
where t1 = x_(1) and t2 = x_(n). By the Factorization Theorem, T = T(X) = (X_(1), X_(n)) is sufficient; note that dim(T) = 2 > 1 = dim(θ).
where t1 = Σ_{i=1}^n y_i², t2 = Σ_{i=1}^n y_i, and t3 = Σ_{i=1}^n x_i y_i. Taking h(y) = 1, the Factorization Theorem shows that
T = T(Y) = ( Σ_{i=1}^n Y_i², Σ_{i=1}^n Y_i, Σ_{i=1}^n x_i Y_i )
is sufficient. Note that dim(T) = dim(θ) = 3.
Theorem 6.2.10. Suppose X1, X2, ..., Xn are iid from the exponential family
f_X(x|θ) = h(x)c(θ) exp{ Σ_{i=1}^k w_i(θ)t_i(x) },
where θ = (θ1, ..., θ_d), d ≤ k. Then
T = T(X) = ( Σ_{j=1}^n t_1(X_j), Σ_{j=1}^n t_2(X_j), ..., Σ_{j=1}^n t_k(X_j) )
is sufficient.
Example 6.9. Suppose X1 , X2 , ..., Xn are iid Bernoulli(θ), where 0 < θ < 1. For x = 0, 1,
the pmf of X is
f_X(x|θ) = θ^x (1 − θ)^{1−x}
= (1 − θ) [θ/(1 − θ)]^x
= (1 − θ) exp{ x ln[θ/(1 − θ)] }
= h(x)c(θ) exp{w1(θ)t1(x)},
where h(x) = 1, c(θ) = 1 − θ, w1(θ) = ln{θ/(1 − θ)}, and t1(x) = x. By Theorem 6.2.10,
T = T(X) = Σ_{j=1}^n t1(X_j) = Σ_{j=1}^n X_j
is sufficient.
Applications:
• In the N (µ, σ 2 ) family where both parameters are unknown, it is easy to show that
T = T(X) = ( Σ_{i=1}^n X_i, Σ_{i=1}^n X_i² )
is sufficient (just apply the Factorization Theorem directly or use our result dealing with exponential families). Define the function
r(t) = r(t1, t2) = ( t1/n, (t2 − t1²/n)/(n − 1) ),
and note that r(t) is one-to-one over T = {(t1, t2) : −∞ < t1 < ∞, t2 ≥ 0}. Therefore,
r(T(X)) = r( Σ_{i=1}^n X_i, Σ_{i=1}^n X_i² ) = ( X̄, S² )
is also sufficient, because a one-to-one function of a sufficient statistic is itself sufficient.
Remark: In the N (µ, σ 2 ) family where both parameters are unknown, the statistic T(X) =
(X, S 2 ) is sufficient.
Example 6.10. Suppose that X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02
is known. Each of the following statistics is sufficient:
T1(X) = X (the entire sample),  T2(X) = ( X1, Σ_{i=2}^n X_i ),  T3(X) = (X_(1), X_(2), ..., X_(n)),  T4(X) = X̄.
Definition: A sufficient statistic T = T (X) is called a minimal sufficient statistic if, for
any other sufficient statistic T ∗ (X), T (x) is a function of T ∗ (x).
Remark: A minimal sufficient statistic is a sufficient statistic that offers the most data
reduction. Note that "T(x) is a function of T*(x)" means that whenever T*(x) = T*(y), we must also have T(x) = T(y).
Informally, if you know T*(x), you can calculate T(x), but not necessarily vice versa.
Remark: You can also characterize minimality of a sufficient statistic using the partitioning
concept described at the beginning of this chapter. Consider the collection of sufficient
statistics. A minimal sufficient statistic T = T (X) admits the coarsest possible partition
in the collection.
Theorem 6.2.13. Suppose f_X(x|θ) is the pdf/pmf of X and T(X) is a statistic such that, for every two sample points x and y,
f_X(x|θ)/f_X(y|θ) is free of θ ⟺ T(x) = T(y).
Then T(X) is a minimal sufficient statistic.
Example 6.10 (continued). Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞
and σ02 is known. For x ∈ Rn , the pdf of X is
f_X(x|µ) = ∏_{i=1}^n (1/√(2πσ0²)) e^{−(x_i−µ)²/(2σ0²)}
= (2πσ0²)^{−n/2} e^{−Σ_{i=1}^n (x_i−µ)²/(2σ0²)}.
Now write
Σ_{i=1}^n (x_i − µ)² = Σ_{i=1}^n (x_i − x̄)² + n(x̄ − µ)²,
so that, for two sample points x and y, the ratio
f_X(x|µ)/f_X(y|µ) = exp{ −[Σ_{i=1}^n (x_i − x̄)² + n(x̄ − µ)² − Σ_{i=1}^n (y_i − ȳ)² − n(ȳ − µ)²]/(2σ0²) }.
Clearly, this ratio is free of µ if and only if x̄ = ȳ. By Theorem 6.2.13, we know that T(X) = X̄ is a minimal sufficient statistic.
Example 6.7 (continued). Suppose X1 , X2 , ..., Xn are iid U(θ, θ + 1), where −∞ < θ < ∞.
We have already shown the pdf of X is
f_X(x|θ) = I(x_(1) > θ) I(x_(n) − 1 < θ) ∏_{i=1}^n I(x_i ∈ R).
Therefore, for two sample points x and y, the ratio f_X(x|θ)/f_X(y|θ)
is free of θ if and only if (x_(1), x_(n)) = (y_(1), y_(n)). By Theorem 6.2.13, we know that T(X) = (X_(1), X_(n)) is a minimal sufficient statistic. Note that in this family, the dimension of a minimal sufficient statistic does not match the dimension of the parameter. Note also that a one-to-one function of T(X) is
( X_(n) − X_(1), (X_(1) + X_(n))/2 ),
the sample range and midrange, which is therefore also minimal sufficient.
Example 6.11. Suppose that X1 , X2 , ..., Xn are iid N (0, σ 2 ), where σ 2 > 0. Note that
X̄ ∼ N(0, σ²/n), so
S(X) = X̄/(S/√n) ∼ t_{n−1}
is ancillary because its distribution, t_{n−1}, does not depend on σ². Also, it is easy to show that
T(X) = Σ_{i=1}^n X_i²
is a (minimal) sufficient statistic for σ².
Recap:
• T(X) = Σ_{i=1}^n X_i² contains all the information about σ².
• I used R to generate B = 1000 draws from the bivariate distribution of (T (X), S(X)),
when n = 10 and σ 2 = 100; see Figure 6.1.
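Here is a short R sketch of that simulation (the seed is an arbitrary choice; the rest follows the stated setup with n = 10 and σ² = 100):

set.seed(713)                          # arbitrary seed
B <- 1000; n <- 10; sigma <- 10        # sigma^2 = 100
T.stat <- S.stat <- numeric(B)
for (b in 1:B) {
  x <- rnorm(n, mean = 0, sd = sigma)
  T.stat[b] <- sum(x^2)                        # sufficient statistic
  S.stat[b] <- mean(x) / (sd(x) / sqrt(n))     # ancillary t-statistic
}
plot(T.stat, S.stat, xlab = "t", ylab = "s")   # scatterplot as in Figure 6.1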
Remark: Finding ancillary statistics is easy when you are dealing with location or scale
families.
Definition: Suppose that S(x1 + a, x2 + a, ..., xn + a) = S(x1, x2, ..., xn) for every a ∈ R and for all x ∈ X. We say that S(X) is a location-invariant statistic. In other words, the value of S(x) is unaffected by location shifts.
Result: Suppose X1, X2, ..., Xn are iid from
f_X(x|µ) = f_Z(x − µ),
a location family with standard pdf f_Z(·) and location parameter −∞ < µ < ∞. If S(X) is location invariant, then it is ancillary.
Proof. Define Wi = Xi − µ, for i = 1, 2, ..., n. We perform an n-variate transformation to
find the distribution of W = (W1 , W2 , ..., Wn ). The inverse transformation is described by
Figure 6.1: Scatterplot of B = 1000 pairs of T (x) and S(x) in Example 6.11. Each point
was calculated based on an iid sample of size n = 10 with σ 2 = 100.
x_i = w_i + µ, for i = 1, 2, ..., n. It is easy to see that the Jacobian of the inverse transformation is 1 and therefore
f_W(w) = ∏_{i=1}^n f_Z(w_i),
which does not depend on µ. Because the distribution of W does not depend on µ, the distribution of the statistic S(W) cannot depend on µ either. But S(W) = S(X), so we are done. □
Example 6.12. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Show that the sample variance S 2 is ancillary.
Proof. First note that
f_X(x|µ) = (1/√(2πσ0²)) e^{−(x−µ)²/(2σ0²)} I(x ∈ R) = f_Z(x − µ),
where
f_Z(z) = (1/√(2πσ0²)) e^{−z²/(2σ0²)} I(z ∈ R),
the N (0, σ02 ) pdf. Therefore, the N (µ, σ02 ) family is a location family. We now show that
S(X) = S² is location invariant. Let W_i = X_i + c, for i = 1, 2, ..., n. Clearly, W̄ = X̄ + c and
S(W) = (1/(n−1)) Σ_{i=1}^n (W_i − W̄)²
= (1/(n−1)) Σ_{i=1}^n [(X_i + c) − (X̄ + c)]²
= (1/(n−1)) Σ_{i=1}^n (X_i − X̄)² = S(X).
Thus S² is location invariant and, by the result above, ancillary. □
Remark: The preceding argument only shows that the distribution of S 2 does not depend
on µ. However, in this example, it is easy to find the distribution of S 2 directly. Recall that
(n − 1)S²/σ0² ∼ χ²_{n−1} =d gamma((n−1)/2, 2) ⟹ S² ∼ gamma( (n−1)/2, 2σ0²/(n−1) ),
which does not depend on µ.
Definition: Suppose that S(cx1, cx2, ..., cxn) = S(x1, x2, ..., xn) for every c > 0 and for all x ∈ X. We say that S(X) is a scale-invariant statistic. In other words, the value of S(x) is unaffected by changes in scale.
Result: Suppose X1, X2, ..., Xn are iid from f_X(x|σ) = (1/σ) f_Z(x/σ), a scale family with standard pdf f_Z(·) and scale parameter σ > 0. If S(X) is scale invariant, then it is ancillary.
Proof. Define W_i = X_i/σ, for i = 1, 2, ..., n. The inverse transformation is described by x_i = σw_i, for i = 1, 2, ..., n. It is easy to see that the Jacobian of the inverse transformation is σ^n and therefore
f_W(w) = ∏_{i=1}^n f_Z(w_i),
which does not depend on σ. Because the distribution of W does not depend on σ, the distribution of the statistic S(W) cannot depend on σ either. But S(W) = S(X), so we are done. □
Examples: Each of the following is a scale-invariant statistic (and hence is ancillary when
sampling from a scale family):
S(X) = S/X̄,  S(X) = X_(n)/X_(1),  S(X) = Σ_{i=1}^k X_i² / Σ_{i=1}^n X_i².
Remark: The preceding argument only shows that the distribution of S(X) does not depend
on σ. It can be shown (verify!) that
S(X) = Σ_{i=1}^k |X_i| / Σ_{i=1}^n |X_i| ∼ beta(k, n − k),
which does not depend on σ.
Definition: Let {fT (t|θ); θ ∈ Θ} be a family of pdfs (or pmfs) for a statistic T = T (X). We
say that this family is a complete family if the following condition holds:
Eθ [g(T )] = 0 ∀θ ∈ Θ =⇒ Pθ (g(T ) = 0) = 1 ∀θ ∈ Θ;
i.e., g(T ) = 0 almost surely for all θ ∈ Θ. We call T = T (X) a complete statistic.
Remark: This condition basically says that the only function of T that is an unbiased
estimator of zero is the function that is zero itself (with probability 1).
Example 6.14. Suppose X1 , X2 , ..., Xn are iid Bernoulli(θ), where 0 < θ < 1. Show that
T = T(X) = Σ_{i=1}^n X_i
is a complete statistic.
Proof. We know that T ∼ b(n, θ), so it suffices to show that this family of distributions is a
complete family. Suppose
Eθ [g(T )] = 0 ∀θ ∈ (0, 1).
It suffices to show that Pθ (g(T ) = 0) = 1 for all θ ∈ (0, 1). Note that
0 = E_θ[g(T)]
= Σ_{t=0}^n g(t) C(n, t) θ^t (1 − θ)^{n−t}
= (1 − θ)^n Σ_{t=0}^n g(t) C(n, t) r^t,
where C(n, t) denotes the binomial coefficient and r = θ/(1 − θ) > 0. The sum is a polynomial (in r) of degree n. The only way this polynomial can be zero for all θ ∈ (0, 1); i.e., for all r > 0, is for the coefficients
C(n, t) g(t) = 0, for t = 0, 1, 2, ..., n.
Because C(n, t) ≠ 0, this can only happen when g(t) = 0, for t = 0, 1, 2, ..., n. We have shown that P_θ(g(T) = 0) = 1 for all θ ∈ (0, 1); i.e., T is complete. □
Remark: To show that a statistic T = T (X) is not complete, all we have to do is find one
nonzero function g(T ) that satisfies Eθ [g(T )] = 0, for all θ.
Example 6.15. Suppose X1 , X2 , ..., Xn are iid N (θ, θ2 ), where θ ∈ Θ = (−∞, 0) ∪ (0, ∞).
The pdf of X is
f_X(x|θ) = (1/√(2πθ²)) e^{−(x−θ)²/(2θ²)} I(x ∈ R),
a curved exponential family whose natural sufficient statistic is T = T(X) = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²). Because E_θ(X̄²) = θ²(n + 1)/n and E_θ(S²) = θ², the function g(T) = nX̄²/(n + 1) − S², for example, satisfies E_θ[g(T)] = 0 for every θ ∈ Θ.
We have found a nonzero function g(T) that has zero expectation. Therefore T cannot be
complete.
Basu's Theorem (Theorem 6.2.24, CB): If T(X) is a complete and sufficient statistic, then T(X) is independent of every ancillary statistic S(X).
The proof (in the discrete case) shows that the joint cdf of (S, T) factors into the product of the marginal cdfs; because s and t are arbitrary, the two statistics are independent. □
Example 6.16. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. Show that X(n)
and X(1) /X(n) are independent.
Proof. We will show that S(X) = X_(1)/X_(n) is ancillary and that T(X) = X_(n) is complete and sufficient.
The result will then follow from Basu's Theorem. First, note that
f_X(x|θ) = (1/θ) I(0 < x < θ) = (1/θ) f_Z(x/θ),
where fZ (z) = I(0 < z < 1) is the standard uniform pdf. Therefore, the U(0, θ) family is
a scale family. We now show that S(X) is scale invariant. For d > 0, let Wi = dXi , for
i = 1, 2, ..., n. We have
S(W) = W_(1)/W_(n) = dX_(1)/(dX_(n)) = X_(1)/X_(n) = S(X).
We have already shown that T = T(X) = X_(n) is sufficient; see Example 6.5 (notes). We now show T is complete. We first find the distribution of T. The pdf of T, the maximum order statistic, is given by
f_T(t|θ) = n t^{n−1}/θ^n, for 0 < t < θ.
Suppose E_θ[g(T)] = 0 for all θ > 0; i.e.,
∫_0^θ g(t) n t^{n−1}/θ^n dt = 0 for all θ > 0 ⟹ ∫_0^θ g(t) t^{n−1} dt = 0 for all θ > 0.
Differentiating both sides with respect to θ gives g(θ)θ^{n−1} = 0,
the last step following from the Fundamental Theorem of Calculus, provided that g is Riemann-integrable. Because θ^{n−1} ≠ 0, it must be true that g(θ) = 0 for all θ > 0. We have
therefore shown that the only function g satisfying E_θ[g(T)] = 0 for all θ > 0 is the function that is itself zero; i.e., we have shown that T = X_(n) is complete. Basu's Theorem now gives X_(n) ⊥⊥ X_(1)/X_(n). □
Remark: Our completeness argument in Example 6.16 is not entirely convincing. We have
basically established that
for the class of functions g which are Riemann-integrable. There are many functions g that
are not Riemann-integrable. CB note that “this distinction is not of concern.” This is another
way of saying that the authors do not want to present completeness from a more general
point of view (for good reason; this would involve a heavy dose of measure theory).
Application: Suppose we want to calculate E(X_(1)/X_(n)). Write
E(X_(1)) = E( X_(n) · X_(1)/X_(n) ) = E(X_(n)) E(X_(1)/X_(n)),
the last step following because X_(n) and X_(1)/X_(n) are independent. Therefore, we can calculate the desired expectation by instead calculating E(X_(1)) and E(X_(n)). These are easier to calculate:
E(X_(1)) = θ/(n + 1) and E(X_(n)) = nθ/(n + 1).
Therefore, we have
θ/(n + 1) = [nθ/(n + 1)] E(X_(1)/X_(n)) ⟹ E(X_(1)/X_(n)) = 1/n.
It makes sense that this expectation would not depend on θ; recall that S(X) = X(1) /X(n)
is ancillary.
Recall from Theorem 6.2.10 that, in the exponential family, T = T(X) = ( Σ_{j=1}^n t_1(X_j), ..., Σ_{j=1}^n t_k(X_j) ) is a sufficient statistic.
New result (Theorem 6.2.25): In the exponential family, the statistic T = T(X) is com-
plete if the natural parameter space
{η = (η1 , η2 , ..., ηk ) : ηi = wi (θ); θ ∈ Θ}
contains an open set in Rk . For the most part, this means:
• T = T(X) is complete if d = k, where d = dim(θ) (full exponential family)
• T = T(X) is not complete if d < k (curved exponential family).
Example 6.17. Suppose that X1 , X2 , ..., Xn is an iid sample from a gamma(α, 1/α2 ) distri-
bution. The pdf of X is
f_X(x|α) = [1/(Γ(α)(1/α²)^α)] x^{α−1} e^{−x/(1/α²)} I(x > 0)
= [I(x > 0)/x] [α^{2α}/Γ(α)] e^{α ln x} e^{−α²x}
= [I(x > 0)/x] [α^{2α}/Γ(α)] exp{α ln x − α²x}
= h(x)c(α) exp{w1(α)t1(x) + w2(α)t2(x)},
where h(x) = I(x > 0)/x, c(α) = α^{2α}/Γ(α), w1(α) = α, t1(x) = ln x, w2(α) = −α², and t2(x) = x. Theorem 6.2.10 tells us that
T = T(X) = ( Σ_{i=1}^n ln X_i, Σ_{i=1}^n X_i )
is a sufficient statistic. However, Theorem 6.2.25 tells us that T is not complete because
{fX (x|α), α > 0} is an exponential family with d = 1 and k = 2. Note also that
{η = (η1, η2) = (α, −α²) : α > 0}
is a half-parabola (which opens downward); this set does not contain an open set in R2 .
Example 6.18. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Prove that X̄ ⊥⊥ S².
Proof. We use Basu’s Theorem, but we have to use it carefully. Fix σ 2 = σ02 and consider
first the N (µ, σ02 ) subfamily. The pdf of X ∼ N (µ, σ02 ) is
f_X(x|µ) = (1/√(2πσ0²)) e^{−(x−µ)²/(2σ0²)} I(x ∈ R)
= [I(x ∈ R) e^{−x²/(2σ0²)}/√(2πσ0²)] e^{−µ²/(2σ0²)} e^{(µ/σ0²)x}
= h(x)c(µ) exp{w1(µ)t1(x)}.
Theorem 6.2.10 tells us that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic. Because d = k = 1 (remember, this is for the N (µ, σ02 ) subfamily),
Theorem 6.2.25 tells us that T is complete. In Example 6.12 (notes), we have already shown that S(X) = S² is ancillary in this subfamily.
Therefore, by Basu’s Theorem, we have proven that, in the N (µ, σ02 ) subfamily,
Σ_{i=1}^n X_i ⊥⊥ S² ⟹ X̄ ⊥⊥ S²,
the last implication being true because X̄ is a function of T = T(X) = Σ_{i=1}^n X_i and functions of independent statistics are independent. Finally, because we fixed σ² = σ0² arbitrarily, the same argument holds for every fixed value of σ0². Therefore, this independence result holds for any choice of σ² and hence for the full N(µ, σ²) family. □
Remark: It is important to see that in the preceding proof, we cannot work directly with
the N (µ, σ 2 ) family and claim that
• T(X) = Σ_{i=1}^n X_i is complete and sufficient
• S(X) = S² is ancillary
for this family. In fact, neither statement is true in the full family.
Remark: Outside the exponential family, Basu’s Theorem can be useful in showing that a
sufficient statistic T (X) is not complete.
Basu’s Theorem (Contrapositive version): Suppose T (X) is sufficient and S(X) is ancillary.
If T (X) and S(X) are not independent, then T (X) is not complete.
This shows that S(X) is ancillary in this family. Finally, we know from Example 6.4 (notes)
that the order statistics
T = T(X) = (X(1) , X(2) , ..., X(n) )
are sufficient for this family (in fact, T is minimal sufficient; see Exercise 6.9, CB, pp 301).
However, clearly S(X) and T(X) are not independent; e.g., if you know T(x), you can
calculate S(x). By Basu’s Theorem (the contrapositive version), we know that T(X) cannot
be complete.
Theorem 6.2.28. Suppose that T (X) is sufficient. If T (X) is complete, then T (X) is
minimal sufficient.
Remark: Example 6.19 shows that the converse to Theorem 6.2.28 is not true; i.e.,
T(X) minimal sufficient ⇏ T(X) complete.
Example 6.7 provides another counterexample. We showed that if X1 , X2 , ..., Xn are iid
U(θ, θ + 1), then T = T(X) = (X(1) , X(n) ) is a minimal sufficient statistic. However, T
cannot be complete because T and the sample range X(n) − X(1) (which is location invariant
and hence ancillary in this model) are not independent. This implies that there exists a
nonzero function g(T) that has zero expectation for all θ ∈ R. In fact, it is easy to show
that
E_θ(X_(n) − X_(1)) = (n − 1)/(n + 1).
Therefore,
g(T) = X_(n) − X_(1) − (n − 1)/(n + 1)
satisfies E_θ[g(T)] = 0 for all θ.
7 Point Estimation
7.1 Introduction
Remark: We will approach “the point estimation problem” from the following point of
view. We have a parametric model for X = (X1 , X2 , ..., Xn ):
X ∼ fX (x|θ), where θ ∈ Θ ⊆ Rk ,
and the model parameter θ = (θ1 , θ2 , ..., θk ) is unknown. We will assume that θ is fixed
(except when we discuss Bayesian estimation). Possible goals include
1. Estimating θ
Remark: For most of the situations we will encounter in this course, the random vector
X will consist of X1 , X2 , ..., Xn , an iid sample from the population fX (x|θ). However,
our discussion is also relevant when the independence assumption is relaxed, the identically
distributed assumption is relaxed, or both.
Definition: A point estimator W = W(X) = W(X1, X2, ..., Xn) is any function of the sample X. Therefore, any statistic is a point estimator. We call W(x) = W(x1, x2, ..., xn) a point estimate; W(x) is a realization of W(X).
Preview: This chapter is split into two parts. In this first part (Section 7.2), we present
different approaches of finding point estimators. These approaches are:
The second part (Section 7.3) focuses on evaluating point estimators; e.g., which estimators
are good/bad? What constitutes a “good” estimator? Is it possible to find the best one?
For that matter, how should we even define “best?”
7.2 Methods of Finding Estimators
7.2.1 Method of moments
Definition: The jth sample moment is m′_j = (1/n) Σ_{i=1}^n X_i^j and the jth population moment is µ′_j = E(X^j).
Intuition: The first k sample moments depend on the sample X. The first k population
moments will generally depend on θ = (θ1 , θ2 , ..., θk ). Therefore, the system of equations
m′_1 = E(X)
m′_2 = E(X²)
⋮
m′_k = E(X^k)
can (at least in theory) be solved for θ1 , θ2 , ..., θk . A solution to this system of equations is
called a method of moments (MOM) estimator.
Example 7.1. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. The first sample
moment is
m′_1 = (1/n) Σ_{i=1}^n X_i = X̄.
The first population moment is
µ′_1 = E(X) = θ/2.
We set these moments equal to each other; i.e.,
X̄ = θ/2,
and solve for θ. The solution
θ̂ = 2X̄
is a method of moments estimator for θ.
Example 7.2. Suppose that X1 , X2 , ..., Xn are iid U(−θ, θ), where θ > 0. For this popula-
tion, E(X) = 0 so this will not help us. Moving to second moments, we have
m′_2 = (1/n) Σ_{i=1}^n X_i²
and
µ′_2 = E(X²) = var(X) = θ²/3.
Therefore, we can set
(1/n) Σ_{i=1}^n X_i² = θ²/3
and solve for θ. The solution
θ̂ = +√( (3/n) Σ_{i=1}^n X_i² )
is a method of moments estimator for θ. We keep the positive solution because θ > 0
(although, technically, the negative solution is still a MOM estimator).
Example 7.3. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. The first two population moments are E(X) = µ and
E(X 2 ) = var(X) + [E(X)]2 = σ 2 + µ2 . Therefore, method of moments estimators for µ and
σ 2 are found by solving
X̄ = µ
(1/n) Σ_{i=1}^n X_i² = σ² + µ².
We have µ̂ = X̄ and
σ̂² = (1/n) Σ_{i=1}^n X_i² − X̄² = (1/n) Σ_{i=1}^n (X_i − X̄)².
Note that the method of moments estimator for σ 2 is not our “usual” sample variance (with
denominator n − 1).
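A minimal R sketch of Example 7.3 (the true values µ = 2 and σ² = 4 and the sample size n = 50 are arbitrary choices used only to generate data):

set.seed(1)
n <- 50
x <- rnorm(n, mean = 2, sd = 2)
mu.hat <- mean(x)                    # solves the first moment equation
sigma2.hat <- mean(x^2) - mean(x)^2  # equals (1/n) * sum((x - mean(x))^2)
c(mu.hat, sigma2.hat)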
Remarks:
• I think of MOM estimation as a “quick and dirty” approach. All we are doing is
matching moments. We are attempting to learn about a population fX (x|θ) by using
moments only.
• MOM estimators can be nonsensical. In fact, sometimes MOM estimators fall outside
the parameter space Θ. For example, in linear models with random effects, variance
components estimated via MOM can be negative.
7.2.2 Maximum likelihood estimation
Note: We first formally define a likelihood function; see also Section 6.3 (CB).
Definition: Let f_X(x|θ) denote the joint pdf/pmf of the sample X. Given that X = x is observed, the function of θ defined by L(θ|x) = f_X(x|θ) is called the likelihood function.
Note: The likelihood function L(θ|x) is the same function as the joint pdf/pmf f_X(x|θ). The only difference is in how we interpret each one.
• The function fX (x|θ) is a model that describes the random behavior of X when θ is
fixed.
• The function L(θ|x) is viewed as a function of θ with the data X = x held fixed.
That is, when X is discrete, we can interpret the likelihood function L(θ|x) literally as a
joint probability.
• Suppose that θ1 and θ2 are two possible values of θ. Suppose X is discrete and L(θ1|x) = P_{θ1}(X = x) > P_{θ2}(X = x) = L(θ2|x). This suggests the sample x is more likely to have occurred with θ = θ1 rather than if
θ = θ 2 . Therefore, in the discrete case, we can interpret L(θ|x) as “the probability of
the data x.”
• Section 6.3 (CB) describes how the likelihood function L(θ|x) can be viewed as a data
reduction device.
Definition: For each sample point x, let θ̂(x) denote a value of θ at which L(θ|x) attains its maximum as a function of θ (with x held fixed). We call θ̂(X) a maximum likelihood estimator (MLE).
Remarks:
1. Finding the MLE θ̂ is essentially a maximization problem. The estimate θ̂(x) must fall in the parameter space Θ because we are maximizing L(θ|x) over Θ; i.e.,
θ̂(x) = arg max_{θ∈Θ} L(θ|x).
Example 7.4. Suppose X1 , X2 , ..., Xn are iid U[0, θ], where θ > 0. Find the MLE of θ.
Solution. The likelihood function is
L(θ|x) = ∏_{i=1}^n (1/θ) I(0 ≤ x_i ≤ θ) = (1/θ^n) I(x_(n) ≤ θ) ∏_{i=1}^n I(x_i ≥ 0);
view this as a function of θ with x fixed. Note that
• For θ ≥ x_(n), L(θ|x) = 1/θ^n, which decreases as θ increases.
• For θ < x_(n), L(θ|x) = 0.
Therefore, L(θ|x) is maximized at θ = x_(n); i.e., θ̂ = X_(n) is the MLE of θ.
Remark: Note that in this example, we "closed the endpoints" on the support of X; i.e., the pdf of X is
f_X(x|θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.
Mathematically, this model is no different than had we "opened the endpoints." However, if we used open endpoints, note that any maximizer of L(θ|x) would have to satisfy
x_(n) < arg max_{θ>0} L(θ|x) < x_(n) + ε
for all ε > 0, and therefore the maximizer of L(θ|x); i.e., the MLE, would not exist.
Remark: If L(θ|x) is a differentiable function of θ = (θ1, θ2, ..., θ_k), then candidates for the MLE are the values of θ that solve
∂L(θ|x)/∂θ_j = 0, j = 1, 2, ..., k.
Example 7.5. Suppose that X1 , X2 , ..., Xn are iid N (θ, 1), where −∞ < θ < ∞. The
likelihood function is
L(θ|x) = ∏_{i=1}^n (1/√(2π)) e^{−(x_i−θ)²/2}
= (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (x_i−θ)²}.
The derivative, set equal to zero, is
∂L(θ|x)/∂θ = (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (x_i−θ)²} Σ_{i=1}^n (x_i − θ) = 0;
the exponential factor can never be zero, so
Σ_{i=1}^n (x_i − θ) = 0,
which is solved by θ = x̄. Because
∂²L(θ|x)/∂θ², evaluated at θ = x̄, equals −n (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (x_i−x̄)²} < 0,
the function L(θ|x) is concave down at θ = x̄; i.e., θ̂ = x̄ maximizes L(θ|x). Therefore,
θ̂ = θ̂(X) = X̄
is the MLE of θ.
Illustration: Under the N (θ, 1) model assumption, I graphed in Figure 7.1 the likelihood
function L(θ|x) after observing x1 = 2.437, x2 = 0.993, x3 = 1.123, x4 = 1.900, and
x5 = 3.794 (an iid sample of size n = 5). The sample mean x = 2.049 is our ML estimate of
θ based on this sample x.
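The following R sketch reproduces a plot like Figure 7.1 from these five observations:

xdata <- c(2.437, 0.993, 1.123, 1.900, 3.794)
lik <- function(theta) sapply(theta, function(th) prod(dnorm(xdata, mean = th, sd = 1)))
curve(lik(x), from = 0, to = 5, xlab = expression(theta), ylab = "Likelihood function")
abline(v = mean(xdata), lty = 2)   # MLE = sample mean = 2.0494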
Figure 7.1: Plot of L(θ|x) versus θ in Example 7.5. The data x were generated from a
N (θ = 1.5, 1) distribution with n = 5. The sample mean (MLE) is x = 2.049.
Remark: Because the natural log function is strictly increasing,
θ̂(x) = arg max_{θ∈Θ} L(θ|x) = arg max_{θ∈Θ} ln L(θ|x),
and it is usually easier to work with the log-likelihood function ln L(θ|x). Candidates for the MLE then solve the score equations
∂ ln L(θ|x)/∂θ_j = 0, j = 1, 2, ..., k.
Example 7.6. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). The likelihood function is
L(θ|x) = ∏_{i=1}^n (1/√(2πσ²)) e^{−(x_i−µ)²/(2σ²)}
= (1/(2πσ²))^{n/2} e^{−(1/(2σ²)) Σ_{i=1}^n (x_i−µ)²}.
The log-likelihood function is
ln L(θ|x) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)².
The score equations are
∂ ln L(θ|x)/∂µ = (1/σ²) Σ_{i=1}^n (x_i − µ) = 0
∂ ln L(θ|x)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)² = 0.
Clearly µ̂ = x̄ solves the first equation; inserting µ̂ = x̄ into the second equation and solving for σ² gives σ̂² = n⁻¹ Σ_{i=1}^n (x_i − x̄)². A first-order critical point is (x̄, n⁻¹ Σ_{i=1}^n (x_i − x̄)²).
Example 7.7. Suppose X1 and X2 are independent with
X1 ∼ b(n1, p1)
X2 ∼ b(n2, p2),
where 0 < p1 < 1 and 0 < p2 < 1. The likelihood function of θ = (p1, p2) is
L(θ|x1, x2) = C(n1, x1) p1^{x1}(1 − p1)^{n1−x1} C(n2, x2) p2^{x2}(1 − p2)^{n2−x2}.
Suppose we wish to maximize the likelihood subject to the restriction p1 = p2. We can use Lagrange multipliers to maximize ln L(θ|x1, x2) subject to the constraint that
Example 7.8. Logistic regression. In practice, finding maximum likelihood estimates usu-
ally requires numerical methods. Suppose Y1 , Y2 , ..., Yn are independent Bernoulli random
variables; specifically, Yi ∼ Bernoulli(pi ), where
ln[p_i/(1 − p_i)] = β0 + β1 x_i ⟺ p_i = exp(β0 + β1 x_i)/[1 + exp(β0 + β1 x_i)].
In this model, the x_i's are fixed constants. The likelihood function of θ = (β0, β1) is
L(θ|y) = ∏_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i}
= ∏_{i=1}^n [exp(β0 + β1 x_i)/(1 + exp(β0 + β1 x_i))]^{y_i} [1 − exp(β0 + β1 x_i)/(1 + exp(β0 + β1 x_i))]^{1−y_i}.
Taking logarithms and simplifying gives
ln L(θ|y) = Σ_{i=1}^n [ y_i(β0 + β1 x_i) − ln(1 + e^{β0+β1 x_i}) ].
Closed-form expressions for the maximizers βb0 and βb1 do not exist except in very simple
situations. Numerical methods are needed to maximize ln L(θ|y); e.g., iteratively re-weighted
least squares (the default method in R’s glm function).
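A hedged R sketch (the values of β0, β1, and n used to simulate data are arbitrary choices): the log-likelihood above can be maximized numerically with optim, and the answer agrees with glm:

set.seed(42)
n <- 100; x <- rnorm(n)
beta <- c(-0.5, 1.2)                          # true (beta0, beta1), arbitrary
y <- rbinom(n, size = 1, prob = plogis(beta[1] + beta[2] * x))
loglik <- function(b) sum(y * (b[1] + b[2] * x) - log(1 + exp(b[1] + b[2] * x)))
fit1 <- optim(c(0, 0), loglik, control = list(fnscale = -1))  # maximize log-likelihood
fit2 <- glm(y ~ x, family = binomial)                         # IRLS fit
rbind(optim = fit1$par, glm = coef(fit2))                     # nearly identical estimates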
Theorem 7.2.10 (Invariance property of MLEs). Suppose θ̂ is the MLE of θ. For any function τ(θ), the MLE of τ(θ) is τ(θ̂).
Proof. For simplicity, suppose θ is a scalar parameter and that τ : R → R is one-to-one (over
Θ). In this case,
η = τ (θ) ⇐⇒ θ = τ −1 (η).
The likelihood function of interest is L∗ (η). It suffices to show that L∗ (η) is maximized when
η = τ (θ),
b where θb is the maximizer of L(θ). For simplicity in notation, I drop emphasis of a
likelihood function's dependence on x. Let η̂ be a maximizer of L*(η). Then
L*(η̂) = sup_η L*(η) = sup_η L(τ⁻¹(η)) = sup_θ L(θ) = L(θ̂) = L*(τ(θ̂)),
so η̂ = τ(θ̂) maximizes L*(η); i.e., the MLE of η = τ(θ) is τ(θ̂). □
Remark: Our proof assumes that τ is a one-to-one function. However, Theorem 7.2.10 is
true for any function; see pp 319-320 (CB).
Example 7.9. Suppose X1 , X2 , ..., Xn are iid exponential(β), where β > 0. The likelihood
function is
L(β|x) = ∏_{i=1}^n (1/β) e^{−x_i/β} = (1/β^n) e^{−Σ_{i=1}^n x_i/β}.
The log-likelihood function is
ln L(β|x) = −n ln β − Σ_{i=1}^n x_i/β.
The score equation becomes
∂ ln L(β|x)/∂β = −n/β + Σ_{i=1}^n x_i/β² = 0.
Solving the score equation for β gives β̂ = x̄. It is easy to show that this value maximizes ln L(β|x). Therefore,
β̂ = β̂(X) = X̄
is the MLE of β.
• For t fixed, e^{−t/X̄} is the MLE of S_X(t|β) = e^{−t/β}, the survivor function of X at t.
7.2.3 Bayesian estimation
Bayesians do not consider the parameter θ to be fixed. They regard θ as random, having its own probability distribution. Therefore, Bayesians think of inference in this way:
Model θ ∼ π(θ) −→ Observe X|θ ∼ fX (x|θ) −→ Update with π(θ|x).
The model for θ on the front end is called the prior distribution. The model on the
back end is called the posterior distribution. The posterior distribution combines prior
information (supplied through the prior model) and the observed data x. For a Bayesian,
all inference flows from the posterior distribution.
Important: Here are the relevant probability distributions that arise in a Bayesian context.
These are given “in order” as to how the Bayesian uses them. Continue to assume that θ is
a scalar.
1. The prior distribution π(θ), which quantifies beliefs about θ before seeing the data.
2. The conditional distribution f_{X|θ}(x|θ) of the data given θ (the model).
3. The joint distribution f_{X,θ}(x, θ) = f_{X|θ}(x|θ)π(θ).
4. The marginal distribution of X,
m_X(x) = ∫_Θ f_{X|θ}(x|θ)π(θ) dθ,
obtained by integrating θ out of the joint distribution (a sum replaces the integral if θ is discrete).
5. The posterior distribution π(θ|x) = f_{X|θ}(x|θ)π(θ)/m_X(x).
Remark: The process of starting with π(θ) and performing the necessary calculations to
end up with π(θ|x) is informally known as “turning the Bayesian crank.” The distributions
above can be viewed as steps in a “recipe” for posterior construction (i.e., start with the
prior and the conditional, calculate the joint, calculate the marginal, calculate the posterior).
We will see momentarily that not all steps are needed. In fact, in practice, computational
techniques are used to essentially bypass Step 4 altogether. You can see that this might be
desirable, especially if θ is a vector (and perhaps high-dimensional).
Example 7.10. Suppose that, conditional on θ, X1 , X2 , ..., Xn are iid Poisson(θ), where the
prior distribution for θ ∼ gamma(a, b), a, b known. We now turn the Bayesian crank.
1. Prior distribution.
π(θ) = [1/(Γ(a)b^a)] θ^{a−1} e^{−θ/b} I(θ > 0).
2. Conditional distribution. For x_i = 0, 1, 2, ...,
f_{X|θ}(x|θ) = ∏_{i=1}^n θ^{x_i} e^{−θ}/x_i! = θ^{Σ_{i=1}^n x_i} e^{−nθ} / ∏_{i=1}^n x_i!.
3.-5. Completing the remaining steps (joint, marginal, posterior) shows that the posterior distribution is
θ | X = x ∼ gamma(a*, b*),
where
a* = Σ_{i=1}^n x_i + a and b* = 1/(n + 1/b).
Along the way, the marginal distribution of X works out to be
m_X(x) = [1/(∏_{i=1}^n x_i! Γ(a)b^a)] Γ( Σ_{i=1}^n x_i + a ) [1/(n + 1/b)]^{Σ_{i=1}^n x_i + a}.
Remark: Note that the shape and scale parameters of the posterior distribution π(θ|x)
depend on
In this sense, the posterior distribution combines information from the prior and the data.
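A small R sketch of Example 7.10 (the prior values a = 2 and b = 3, the data-generating value θ = 4, and n = 20 are arbitrary choices):

set.seed(713)
a <- 2; b <- 3                          # prior: theta ~ gamma(shape = a, scale = b)
n <- 20; xdat <- rpois(n, lambda = 4)   # data simulated with theta = 4
a.star <- sum(xdat) + a                 # posterior shape
b.star <- 1 / (n + 1/b)                 # posterior scale
a.star * b.star                                          # posterior mean E(theta | x)
(n*b/(n*b + 1)) * mean(xdat) + (1/(n*b + 1)) * a * b     # same value, weighted-average form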
Note that in Example 7.10 (the Poisson-gamma example), the posterior mean equals
θ̂_B = E(θ|X = x) = ( Σ_{i=1}^n x_i + a )/(n + 1/b)
= [nb/(nb + 1)] x̄ + [1/(nb + 1)] ab.
That is, the posterior mean is a weighted average of the sample mean x and the prior
mean ab. Note also that as the sample size n increases, more weight is given to the data
(through x̄) and less weight is given to the prior (through the prior mean).
At this step, we can clearly identify the kernel of the posterior distribution. We can therefore
skip calculating the marginal distribution mX (x) in Step 4, because we know mX (x) does
not depend on θ. Because of this, it is common to write, in general,
π(θ|x) ∝ L(θ|x) π(θ).
The posterior distribution is proportional to the likelihood function times the prior distri-
bution. A (classical) Bayesian analysis requires these two functions L(θ|x) and π(θ) only.
This shows that the posterior distribution will depend on the data x through the value of the
sufficient statistic t = T (x). We can therefore write the posterior distribution as depending
on t only; i.e.,
π(θ|t) ∝ fT |θ (t|θ)π(θ),
and restrict attention to the (sampling) distribution of T = T (X) from the beginning.
Example 7.11. Suppose that X1 , X2 , ..., Xn are iid Bernoulli(θ), where the prior distribution
for θ ∼ beta(a, b), a, b known. We know that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic for the Bernoulli family and that T ∼ b(n, θ). Therefore, for t = 0, 1, 2, ..., n and 0 < θ < 1, the posterior distribution
π(θ|t) ∝ f_{T|θ}(t|θ)π(θ)
= C(n, t) θ^t(1 − θ)^{n−t} [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1}(1 − θ)^{b−1}
= {C(n, t) Γ(a + b)/(Γ(a)Γ(b))} θ^{t+a−1}(1 − θ)^{n−t+b−1},
where the leading factor does not depend on θ and the remaining factor is a beta(a*, b*) kernel with a* = t + a and b* = n − t + b. Therefore, θ | T = t ∼ beta(t + a, n − t + b).
Definition: Let F = {fX (x|θ) : θ ∈ Θ} denote a class of pdfs or pmfs. A class Π of prior
distributions is said to be a conjugate prior family for F if the posterior distribution also
belongs to Π.
Example 7.12. Suppose X1, X2, ..., Xn are iid N(µ, σ²), where −∞ < µ < ∞ and σ² > 0. Commonly used priors in this model are
µ ∼ N(ξ, τ²), ξ, τ² known,
σ² ∼ IG(a, b), a, b known.
7.3 Methods of Evaluating Estimators
7.3.1 Bias, variance, and MSE
Definition: The bias of a point estimator W = W(X) of θ is Bias_θ(W) = E_θ(W) − θ; W is unbiased if E_θ(W) = θ for all θ ∈ Θ. The mean squared error (MSE) of W is
MSE_θ(W) = E_θ[(W − θ)²] = var_θ(W) + Bias²_θ(W).
If W is unbiased, then
E_θ(W) = θ ⟹ Bias_θ(W) = E_θ(W) − θ = 0.
In this case,
MSEθ (W ) = varθ (W ).
Obviously, we prefer estimators with small MSE because these estimators have small bias
(i.e., high accuracy) and small variance (i.e., high precision).
Example 7.13. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters unknown. Set θ = (µ, σ 2 ). Recall that our “usual” sample variance
estimator is
S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²
and, for all θ,
E_θ(S²) = σ² and var_θ(S²) = 2σ⁴/(n − 1).
Consider the "competing estimator"
Ŝ² = (1/n) Σ_{i=1}^n (X_i − X̄)².
Note that
Ŝ² = [(n−1)/n] S² ⟹ E_θ(Ŝ²) = E_θ( [(n−1)/n] S² ) = [(n−1)/n] E_θ(S²) = [(n−1)/n] σ².
That is, the estimator Ŝ² is biased; it underestimates σ² on average.
Comparison: Let's compare S² and Ŝ² on the basis of MSE. Because S² is an unbiased estimator of σ²,
MSE_θ(S²) = var_θ(S²) = 2σ⁴/(n − 1).
The MSE of Ŝ² is MSE_θ(Ŝ²) = var_θ(Ŝ²) + Bias²_θ(Ŝ²). The variance of Ŝ² is
var_θ(Ŝ²) = var_θ( [(n−1)/n] S² ) = [(n−1)/n]² var_θ(S²) = [(n−1)/n]² [2σ⁴/(n−1)] = 2(n−1)σ⁴/n².
The bias of Ŝ² is
E_θ(Ŝ² − σ²) = E_θ(Ŝ²) − σ² = [(n−1)/n] σ² − σ² = −σ²/n.
Therefore,
MSE_θ(Ŝ²) = 2(n−1)σ⁴/n² + ( [(n−1)/n] σ² − σ² )² = [(2n − 1)/n²] σ⁴.
Finally, to compare MSE_θ(S²) with MSE_θ(Ŝ²), we are left to compare the constants
2/(n − 1) and (2n − 1)/n².
Note that the ratio
[(2n − 1)/n²] / [2/(n − 1)] = (2n² − 3n + 1)/(2n²) < 1,
for all n ≥ 2. Therefore,
MSE_θ(Ŝ²) < MSE_θ(S²),
showing that Ŝ² is a "better" estimator than S² on the basis of MSE.
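A simulation sketch of this comparison (µ = 0, σ² = 4, n = 10, and the number of Monte Carlo replications are arbitrary choices):

set.seed(1)
n <- 10; sigma2 <- 4; B <- 1e5
S2 <- replicate(B, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))   # usual sample variance
S2b <- (n - 1)/n * S2                                            # competing estimator
c(mean((S2  - sigma2)^2),  2*sigma2^2/(n - 1),     # MSE of S2,  about 3.56
  mean((S2b - sigma2)^2), (2*n - 1)*sigma2^2/n^2)  # MSE of S2b, about 3.04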
• If both W1 and W2 are unbiased, we prefer the estimator with the smaller variance.
• If either W1 or W2 is biased (or perhaps both are biased), we prefer the estimator with
the smaller MSE.
There is no guarantee that one estimator, say W1 , will always beat the other for all θ ∈ Θ
(i.e., for all values of θ in the parameter space). For example, it may be that W1 has smaller
MSE for some values of θ ∈ Θ, but larger MSE for other values.
Remark: In some situations, we might have a biased estimator, but we can calculate its
bias. We can then “adjust” the (biased) estimator to make it unbiased. I like to call this
“making biased estimators unbiased.” The following example illustrates this.
Example 7.14. Suppose that X1 , X2 , ..., Xn are iid U[0, θ], where θ > 0. We know (from
Example 7.4) that the MLE of θ is X(n) , the maximum order statistic. It is easy to show
that
E_θ(X_(n)) = [n/(n + 1)] θ.
The MLE is biased because E_θ(X_(n)) ≠ θ. However, the estimator
[(n + 1)/n] X_(n)
is unbiased.
The estimator W1 = [(n + 1)/n] X_(n) is an unbiased version of the MLE. The estimator W2 = 2X̄ is the MOM estimator (which is also unbiased). I have calculated
var_θ(W1) = θ²/[n(n + 2)] and var_θ(W2) = θ²/(3n).
It is easy to see that varθ (W1 ) ≤ varθ (W2 ), for all n ≥ 2. Therefore, W1 is a “better”
estimator on the basis of this variance comparison. Are you surprised?
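A simulation sketch of the comparison (θ = 5, n = 10, and B are arbitrary choices):

set.seed(2)
theta <- 5; n <- 10; B <- 1e5
x <- matrix(runif(B * n, min = 0, max = theta), ncol = n)
W1 <- (n + 1)/n * apply(x, 1, max)   # unbiased version of the MLE
W2 <- 2 * rowMeans(x)                # MOM estimator
c(var(W1), theta^2/(n*(n + 2)),      # about 0.208
  var(W2), theta^2/(3*n))            # about 0.833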
Curiosity: Might there be another unbiased estimator, say W3 = W3 (X) that is “better”
than both W1 and W2 ? If a better (unbiased) estimator does exist, how do we find it?
Define C_τ = {W = W(X) : E_θ(W) = τ(θ) for all θ ∈ Θ}; that is, C_τ is the collection of all unbiased estimators of τ(θ). Our goal is to find the (unbiased) estimator W* ∈ C_τ that has the smallest variance.
Remark: On the surface, this task seems somewhat insurmountable because C_τ is a very large class. In Example 7.14, for example, both W1 = [(n+1)/n] X_(n) and W2 = 2X̄ are unbiased estimators of θ. However, so is the convex combination
W_a = W_a(X) = a [(n+1)/n] X_(n) + (1 − a) 2X̄,
for any a ∈ [0, 1].
Remark: It seems that our discussion of “best” estimators starts with the restriction that
we will consider only those that are unbiased. If we did not make a restriction like this,
then we would have to deal with too many estimators, many of which are nonsensical. For
example, suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0.
• The estimators X and S 2 emerge as candidate estimators because they are unbiased.
• However, suppose we widen our search to consider all possible estimators and then try
to find the one with the smallest MSE. Consider the estimator θ̂ = 17.
• We want to exclude nonsensical estimators like this. Our solution is to restrict attention
to estimators that are unbiased.
Approach 1: Determine a lower bound, say B(θ), on the variance of any unbiased esti-
mator of τ (θ). Then, if we can find an unbiased estimator W ∗ whose variance attains this
lower bound, that is,
varθ (W ∗ ) = B(θ),
for all θ ∈ Θ, then we know that W ∗ is UMVUE.
Approach 2: Link the notion of being “best” with that of sufficiency and completeness.
Theorem 7.3.9 (Cramér-Rao Inequality). Suppose X ∼ f_X(x|θ), where the pdf/pmf is such that, for any function h(x) with E_θ[|h(X)|] < ∞, the interchange
d/dθ ∫ h(x) f_X(x|θ) dx = ∫ h(x) [∂/∂θ f_X(x|θ)] dx
is justified; i.e., we can interchange the derivative and integral (derivative and sum if X is discrete). For any estimator W(X) with var_θ[W(X)] < ∞, the following inequality holds:
var_θ[W(X)] ≥ ( d/dθ E_θ[W(X)] )² / E_θ{ [∂/∂θ ln f_X(X|θ)]² }.
The quantity on the RHS is called the Cramér-Rao Lower Bound (CRLB) on the variance of the estimator W(X).
Remark: Note that in the statement of the CRLB in Theorem 7.3.9, we haven’t said exactly
what W (X) is an estimator for. This is to preserve the generality of the result; Theorem 7.3.9
holds for any estimator with finite variance. However, given our desire to restrict attention
to unbiased estimators, we will usually consider one of these cases:
Important special case (Corollary 7.3.10): When X consists of X1 , X2 , ..., Xn which are
iid from the population fX (x|θ), then the denominator in Theorem 7.3.9
E_θ{ [∂/∂θ ln f_X(X|θ)]² } = n E_θ{ [∂/∂θ ln f_X(X1|θ)]² },
where the expectation on the RHS is taken with respect to a single observation.
Lemma 7.3.11 (Information Equality): Under fairly mild assumptions (which hold for
exponential families, for example), the Fisher information based on one observation
I1(θ) = E_θ{ [∂/∂θ ln f_X(X|θ)]² } = −E_θ[ ∂²/∂θ² ln f_X(X|θ) ].
Preview: In Chapter 10, we will investigate the large-sample properties of MLEs. Under
certain regularity conditions, we will show an MLE θ̂ satisfies
√n(θ̂ − θ) →d N(0, σ²_θ̂),
where the asymptotic variance is σ²_θ̂ = [I1(θ)]⁻¹; in the multiparameter case, the limiting covariance matrix is Σ = [I1(θ)]⁻¹.
Example 7.15. Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. Find the CRLB on
the variance of unbiased estimators of τ (θ) = θ.
Solution. We know that the CRLB is
1/I_n(θ) = 1/[n I1(θ)],
where
I1(θ) = E_θ{ [∂/∂θ ln f_X(X|θ)]² } = −E_θ[ ∂²/∂θ² ln f_X(X|θ) ].
For x = 0, 1, 2, ...,
ln f_X(x|θ) = ln[ θ^x e^{−θ}/x! ] = x ln θ − θ − ln x!.
Therefore,
∂/∂θ ln f_X(x|θ) = x/θ − 1
∂²/∂θ² ln f_X(x|θ) = −x/θ².
The Fisher information based on one observation is
I1(θ) = −E_θ[ ∂²/∂θ² ln f_X(X|θ) ] = −E_θ(−X/θ²) = 1/θ.
Therefore,
CRLB = 1/[n I1(θ)] = θ/n.
Example 7.16. Suppose X1 , X2 , ..., Xn are iid gamma(α0 , β), where α0 is known and β > 0.
Find the CRLB on the variance of unbiased estimators of β.
Solution. We know that the CRLB is
1/I_n(β) = 1/[n I1(β)],
where
I1(β) = E_β{ [∂/∂β ln f_X(X|β)]² } = −E_β[ ∂²/∂β² ln f_X(X|β) ].
For x > 0,
ln f_X(x|β) = ln[ (1/(Γ(α0)β^{α0})) x^{α0−1} e^{−x/β} ] = −ln Γ(α0) − α0 ln β + (α0 − 1) ln x − x/β.
Therefore,
∂/∂β ln f_X(x|β) = −α0/β + x/β²
∂²/∂β² ln f_X(x|β) = α0/β² − 2x/β³.
The Fisher information based on one observation is
I1(β) = −E_β[ ∂²/∂β² ln f_X(X|β) ] = −E_β( α0/β² − 2X/β³ ) = α0/β².
Therefore,
CRLB = 1/[n I1(β)] = β²/(nα0).
1. Show that
W(X) = (nα0 − 1)/(nX̄)
is an unbiased estimator of τ(β) = 1/β.
2. Derive the CRLB for the variance of unbiased estimators of τ (β) = 1/β.
3. Calculate varβ [W (X)] and show that it is strictly larger than the CRLB (i.e., the
variance does not attain the CRLB).
Q: Does this necessarily imply that W (X) cannot be the UMVUE of τ (β) = 1/β?
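A simulation sketch of items 1-3 (α0 = 3, β = 2, n = 5, and B are arbitrary choices; the exact variance formula 1/((nα0 − 2)β²) quoted in the comments is my own side calculation, not taken from the notes):

set.seed(4)
alpha0 <- 3; beta <- 2; n <- 5; B <- 2e5
x <- matrix(rgamma(B * n, shape = alpha0, scale = beta), ncol = n)
W <- (n*alpha0 - 1) / rowSums(x)           # W(X) = (n*alpha0 - 1)/(n*xbar)
c(mean(W), 1/beta)                         # item 1: W is unbiased for 1/beta
c(var(W), 1/((n*alpha0 - 2)*beta^2),       # exact variance of W (own calculation)
  1/(n*alpha0*beta^2))                     # item 2: CRLB, which is strictly smaller (item 3)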
Remark: In general, the CRLB offers a lower bound on the variance of any unbiased
estimator of τ (θ). However, this lower bound may be unattainable. That is, the CRLB may
be strictly smaller than the variance of any unbiased estimator. If this is the case, then our
“CRLB approach” to finding an UMVUE will not be helpful.
Example 7.16 (continued). Suppose X1 , X2 , ..., Xn are iid gamma(α0 , β), where α0 is known
and β > 0. The likelihood function is
L(β|x) = ∏_{i=1}^n [1/(Γ(α0)β^{α0})] x_i^{α0−1} e^{−x_i/β}
= [1/(Γ(α0)β^{α0})]^n ( ∏_{i=1}^n x_i )^{α0−1} e^{−Σ_{i=1}^n x_i/β}.
The score function is
S(β|x) = ∂/∂β ln L(β|x) = −nα0/β + Σ_{i=1}^n x_i/β² = (nα0/β²)[ Σ_{i=1}^n x_i/(nα0) − β ] = a(β)[W(x) − β],
where
W(x) = Σ_{i=1}^n x_i/(nα0) = x̄/α0.
We have written the score function S(β|x) as a linear function of W(x) = x̄/α0. Because W(X) = X̄/α0 is an unbiased estimator of τ(β) = β (shown previously), the variance var_β[W(X)] attains the CRLB for the variance of unbiased estimators of τ(β) = β.
Remark: The attainment result is interesting, but I have found that its usefulness may be
limited if you want to find the UMVUE. Even if we can write
S(θ|x) = a(θ)[W(x) − τ(θ)],
where E_θ[W(X)] = τ(θ), the RHS might involve a function τ(θ) that we have no desire to estimate. To illustrate this, suppose X1, X2, ..., Xn are iid beta(θ, 1), where θ > 0. The score function is
S(θ|x) = n/θ + Σ_{i=1}^n ln x_i
= −n[ −(1/n) Σ_{i=1}^n ln x_i − 1/θ ]
= a(θ)[W(x) − τ(θ)],
with a(θ) = −n, W(x) = −(1/n) Σ_{i=1}^n ln x_i, and τ(θ) = 1/θ.
Unresolved issues:
1. What if fX (x|θ) does not satisfy the regularity conditions needed for the Cramér-Rao
Inequality to apply? For example, X ∼ U(0, θ).
Remark: We now move to our "second approach" for finding UMVUEs. This approach involves sufficiency and completeness, two topics we discussed in the last chapter. It also lets us address the unresolved issues noted above.
Theorem 7.3.17 (Rao-Blackwell). Suppose W = W(X) is an unbiased estimator of τ(θ) and T = T(X) is a sufficient statistic for θ. Define
φ(T) = E(W|T).
Then, first, E_θ[φ(T)] = τ(θ) for all θ ∈ Θ; i.e., φ(T) is also unbiased. Second, var_θ[φ(T)] ≤ var_θ(W) for all θ ∈ Θ; i.e., φ(T) is a uniformly better unbiased estimator of τ(θ).
Remark: To use the Rao-Blackwell Theorem, some students think they have to actually compute the conditional expectation E(W|T) for some initial unbiased estimator W.
This is not the case at all! Because φ(T ) = E(W |T ) is a function of the sufficient statistic
T , the Rao-Blackwell result simply convinces us that in our search for the UMVUE, we can
restrict attention to those estimators that are functions of a sufficient statistic.
Q: In the proof of the Rao-Blackwell Theorem, where did we use the fact that T was
sufficient?
A: Nowhere. Thus, it would seem that conditioning on any statistic, sufficient or not, will result in an improvement over the unbiased W. However, there is a catch: if T is not sufficient, then φ(T) = E(W|T) may depend on θ, in which case φ(T) is not a statistic and hence is not an estimator at all.
Remark: To understand how we can use the Rao-Blackwell result in our quest to find a
UMVUE, we need two additional results. One deals with uniqueness; the other describes an
interesting characterization of a UMVUE itself.
Theorem 7.3.19. If W is a UMVUE of τ(θ), then W is unique.
Proof. Suppose W and W′ are both UMVUEs of τ(θ), and define W* = (W + W′)/2, which is also unbiased for τ(θ). Then
var_θ(W*) = var_θ( (1/2)(W + W′) )
= (1/4) var_θ(W) + (1/4) var_θ(W′) + (1/2) cov_θ(W, W′)
≤ (1/4) var_θ(W) + (1/4) var_θ(W′) + (1/2) [var_θ(W) var_θ(W′)]^{1/2}
= var_θ(W),
where the inequality arises from the covariance inequality (CB, pp 188, application of
Cauchy-Schwarz) and the final equality holds because both W and W 0 are UMVUE by
assumption (so their variances must be equal). Therefore, we have shown that
1. E_θ(W*) = τ(θ), and
2. var_θ(W*) ≤ var_θ(W).
Because W is UMVUE (by assumption), the inequality in (2) can not be strict (or else it
would contradict the fact that W is UMVUE). Therefore, it must be true that
varθ (W ∗ ) = varθ (W ).
This implies that the inequality above (arising from the covariance inequality) is an equality;
therefore,
cov_θ(W, W′) = [var_θ(W) var_θ(W′)]^{1/2}.
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
Therefore,
by Theorem 4.5.7 (CB, pp 172), where a(θ) and b(θ) are constants. It therefore suffices to
show that a(θ) = 1 and b(θ) = 0. Note that
Theorem 7.3.20. Suppose Eθ (W ) = τ (θ) for all θ ∈ Θ. W is UMVUE of τ (θ) if and only
if W is uncorrelated with all unbiased estimators of 0.
Proof. Necessity (=⇒): Suppose Eθ (W ) = τ (θ) for all θ ∈ Θ. Suppose W is UMVUE of
τ (θ). Suppose Eθ (U ) = 0 for all θ ∈ Θ. It suffices to show covθ (W, U ) = 0 for all θ ∈ Θ.
Define
φa = W + aU,
where a is a constant. It is easy to see that φa is an unbiased estimator of τ (θ); for all θ ∈ Θ,
E_θ(φ_a) = E_θ(W + aU) = E_θ(W) + a E_θ(U) = τ(θ),
because E_θ(U) = 0.
Also,
var_θ(φ_a) = var_θ(W + aU) = var_θ(W) + 2a cov_θ(W, U) + a² var_θ(U).
Case 1: Suppose cov_θ(W, U) < 0 for some θ. Choosing a > 0 small enough makes 2a cov_θ(W, U) + a² var_θ(U) < 0, so that var_θ(φ_a) < var_θ(W), contradicting the assumption that W is UMVUE. Therefore, it must be true that cov_θ(W, U) ≥ 0.
Case 2: Suppose cov_θ(W, U) > 0 for some θ. Choosing a < 0 with |a| small enough again gives var_θ(φ_a) < var_θ(W).
However, this again contradicts the assumption that W is UMVUE. Therefore, it must
be true that covθ (W, U ) ≤ 0.
Combining Case 1 and Case 2, we are forced to conclude that covθ (W, U ) = 0. This proves
the necessity.
Sufficiency (⇐=): Suppose Eθ (W ) = τ (θ) for all θ ∈ Θ. Suppose covθ (W, U ) = 0 for all
θ ∈ Θ where U is any unbiased estimator of zero; i.e., Eθ (U ) = 0 for all θ ∈ Θ. Let W 0 be
any other unbiased estimator of τ (θ). It suffices to show that varθ (W ) ≤ varθ (W 0 ). Write
W 0 = W + (W 0 − W )
and calculate
Summary: We are now ready to put Theorem 7.3.17 (Rao-Blackwell), Theorem 7.3.19
(UMVUE uniqueness) and Theorem 7.3.20 together. Suppose X ∼ fX (x|θ), where θ ∈ Θ.
Our goal is to find the UMVUE of τ (θ).
• Theorem 7.3.20 assures us that φ(T ) is UMVUE if and only if φ(T ) is uncorrelated
with all unbiased estimators of 0.
Add the assumption that T is a complete statistic. The only unbiased estimator of 0 in
complete families is the zero function itself. Because covθ [φ(T ), 0] = 0 holds trivially, we
have shown that φ(T ) is uncorrelated with “all” unbiased estimators of 0. Theorem 7.3.20
says that φ(T ) must be UMVUE; Theorem 7.3.19 guarantees that φ(T ) is unique.
Recipe for finding UMVUEs: Suppose we want to find the UMVUE for τ(θ).
1. Find a complete and sufficient statistic T = T(X).
2. Find a function φ(T) satisfying E_θ[φ(T)] = τ(θ) for all θ ∈ Θ, either directly or by computing φ(T) = E(W|T) for some unbiased estimator W.
Then φ(T) is the UMVUE for τ(θ). This is essentially what is summarized in Theorem 7.3.23 (CB, pp 347).
Example 7.17. Suppose X1, X2, ..., Xn are iid Poisson(θ), where θ > 0.
• We already know that X̄ is UMVUE for θ; we proved this by showing that X̄ is unbiased and that var_θ(X̄) attains the CRLB on the variance of all unbiased estimators of θ. Here is the argument based on completeness and sufficiency. The pmf of X is
f_X(x|θ) = (θ^x e^{−θ}/x!) I(x = 0, 1, 2, ...)
= [I(x = 0, 1, 2, ...)/x!] e^{−θ} e^{(ln θ)x}
= h(x)c(θ) exp{w1(θ)t1(x)}.
Therefore X has pmf in the exponential family. Theorem 6.2.10 says that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic. Because d = k = 1 (i.e., a full family), Theorem 6.2.25 says that T is complete. Now,
E_θ(T) = E_θ( Σ_{i=1}^n X_i ) = Σ_{i=1}^n E_θ(X_i) = nθ.
Therefore,
E_θ(T/n) = E_θ(X̄) = θ.
Because X̄ is unbiased and is a function of T, a complete and sufficient statistic, we know that X̄ is the UMVUE.
Example 7.18. Suppose X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. We have previously
shown that
T = T (X) = X(n)
is sufficient and complete (see Example 6.5 and Example 6.16, respectively, in the notes). It
follows that
E_θ(T) = E_θ(X_(n)) = [n/(n + 1)] θ
for all θ > 0. Therefore,
E_θ( [(n + 1)/n] X_(n) ) = θ.
Because (n+1)X(n) /n is unbiased and is a function of X(n) , a complete and sufficient statistic,
it must be the UMVUE.
Example 7.19. Suppose X1 , X2 , ..., Xn are iid gamma(α0 , β), where α0 is known and β > 0.
Find the UMVUE of τ (β) = 1/β.
Solution. The pdf of X is
f_X(x|β) = [1/(Γ(α0)β^{α0})] x^{α0−1} e^{−x/β} I(x > 0)
= [x^{α0−1} I(x > 0)/Γ(α0)] [1/β^{α0}] e^{(−1/β)x}
= h(x)c(β) exp{w1(β)t1(x)},
with w1(β) = −1/β and t1(x) = x. Theorems 6.2.10 and 6.2.25 imply that T = T(X) = Σ_{i=1}^n X_i is a sufficient and complete statistic, respectively. In Example 7.16 (notes), we saw that
φ(T) = (nα0 − 1)/T
is an unbiased estimator of τ(β) = 1/β. Therefore, φ(T) must be the UMVUE.
Remark: In Example 7.16, recall that the CRLB on the variance of unbiased estimators of
τ (β) = 1/β was unattainable.
Example 7.20. Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. Find the UMVUE
for
τ (θ) = Pθ (X = 0) = e−θ .
Solution. We use an approach known as “direct conditioning.” We start with
T = T(X) = Σ_{i=1}^n X_i,
which is sufficient and complete. We know that the UMVUE therefore is a function of T .
Consider forming
φ(T ) = E(W |T ),
where W is any unbiased estimator of τ(θ) = e^{−θ}. We know that φ(T) by this construction is the UMVUE; clearly φ(T) = E(W|T) is a function of T and
E_θ[φ(T)] = E_θ[E(W|T)] = E_θ(W) = e^{−θ}.
How should we choose W? Any unbiased W will "work," so let's keep our choice simple, say
W = W(X) = I(X1 = 0).
Note that
Eθ (W ) = Eθ [I(X1 = 0)] = Pθ (X1 = 0) = e−θ ,
showing that W is an unbiased estimator. Now, we just calculate φ(T ) = E(W |T ) directly.
For t fixed, we have
φ(t) = E[ I(X1 = 0) | T = t ] = P(X1 = 0 | T = t) = P_θ(X1 = 0, Σ_{i=2}^n X_i = t)/P_θ(T = t) = P_θ(X1 = 0) P_θ( Σ_{i=2}^n X_i = t )/P_θ(T = t),
the last step following because X1 and Σ_{i=2}^n X_i are independent.
We can now calculate each of these probabilities. Recall that X1 ∼ Poisson(θ), Σ_{i=2}^n X_i ∼ Poisson((n − 1)θ), and T ∼ Poisson(nθ). Therefore,
φ(t) = P_θ(X1 = 0) P_θ( Σ_{i=2}^n X_i = t )/P_θ(T = t)
= [ e^{−θ} ((n − 1)θ)^t e^{−(n−1)θ}/t! ] / [ (nθ)^t e^{−nθ}/t! ] = ( (n − 1)/n )^t.
Therefore,
φ(T) = ( (n − 1)/n )^T
is the UMVUE of τ(θ) = e^{−θ}.
Note that φ(T) = (1 − 1/n)^T = (1 − 1/n)^{nX̄} ≈ e^{−X̄} for n large. Recall that e^{−X̄} is the MLE of τ(θ) = e^{−θ} by invariance.
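A simulation sketch comparing the UMVUE and the MLE of e^{−θ} (θ = 1.5, n = 15, and B are arbitrary choices):

set.seed(5)
theta <- 1.5; n <- 15; B <- 1e5
x <- matrix(rpois(B * n, lambda = theta), ncol = n)
Tsum <- rowSums(x)
umvue <- ((n - 1)/n)^Tsum
mle <- exp(-rowMeans(x))
c(mean(umvue), mean(mle), exp(-theta))   # UMVUE is unbiased; MLE is slightly biased
c(mean((umvue - exp(-theta))^2),         # the two MSEs are very close for this n
  mean((mle - exp(-theta))^2))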
7.4 Appendix: CRLB Theory
Remark: In this section, we provide the proofs that pertain to the CRLB approach to
finding UMVUEs. These proofs are also relevant for later discussions on MLEs and their
large-sample characteristics.
Theorem 7.3.9 (restated). Suppose X ∼ f_X(x|θ) and suppose that, for any function h(x) such that E_θ[|h(X)|] < ∞ for all θ ∈ Θ, the interchange
d/dθ ∫_{R^n} h(x) f_X(x|θ) dx = ∫_{R^n} h(x) [∂/∂θ f_X(x|θ)] dx
is justified; i.e., we can interchange the derivative and integral (derivative and sum if X is discrete). Then, for any estimator W(X) with var_θ[W(X)] < ∞, the following inequality holds:
var_θ[W(X)] ≥ ( d/dθ E_θ[W(X)] )² / E_θ{ [∂/∂θ ln f_X(X|θ)]² }.
Lemma. Let
S(θ|X) = ∂/∂θ ln f_X(X|θ)
denote the score function. The score function is a zero-mean random variable; that is,
E_θ[S(θ|X)] = E_θ[ ∂/∂θ ln f_X(X|θ) ] = 0.
Proof. Write
E_θ[ ∂/∂θ ln f_X(X|θ) ] = ∫ [∂/∂θ ln f_X(x|θ)] f_X(x|θ) dx = ∫ ∂/∂θ f_X(x|θ) dx = d/dθ ∫ f_X(x|θ) dx = d/dθ (1) = 0.
The interchange of derivative and integral above is justified based on the assumptions stated in Theorem 7.3.9. Therefore, the lemma is proven. □
that is, (
2 )
∂ ∂
varθ ln fX (X|θ) = Eθ ln fX (X|θ) .
∂θ ∂θ
PAGE 60
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
We get
2
∂ ∂
covθ W (X), ln fX (X|θ) ≤ varθ [W (X)] varθ ln fX (X|θ) ,
∂θ ∂θ
that is, (
2 2 )
d ∂
Eθ [W (X)] ≤ varθ [W (X)] Eθ ln fX (X|θ) .
dθ ∂θ
n 2 o
∂
Dividing both sides by Eθ ∂θ ln fX (X|θ) gives the result. 2
Corollary 7.3.10 (Cramér-Rao Inequality−iid case). With the same regularity conditions
stated in Theorem 7.3.9, in the iid case,
d 2
E θ [W (X)]
varθ [W (X)] ≥ ndθ 2 o .
∂
nEθ ∂θ ln fX (X|θ)
PAGE 61
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
n
( 2 ) X X
X ∂ ∂ ∂
= Eθ ln fX (Xi |θ) + Eθ ln fX (Xi |θ) ln fX (Xj |θ)
i=1
∂θ i6=j
∂θ ∂θ
n
( 2 )
indep
X ∂ XX ∂
∂
= Eθ ln fX (Xi |θ) + Eθ ln fX (Xi |θ) Eθ ln fX (Xj |θ) .
i=1
∂θ i6=j
∂θ ∂θ
| {z }| {z }
= 0 = 0
In the iid case, we have just proven that In (θ) = nI(θ). Therefore, in the iid case,
[τ 0 (θ)]2
CRLB = .
nI1 (θ)
PAGE 62
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
We have shown
( 2 )
∂2
∂
Eθ ln fX (X|θ) = −Eθ ln fX (X|θ) .
∂θ2 ∂θ
PAGE 63
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
[τ 0 (θ)]2
varθ [W (X)] ≥ n 2 o
∂
Eθ ∂θ
ln fX (X|θ)
iid [τ 0 (θ)]2
= n Qn 2 o .
∂
Eθ ∂θ
ln i=1 Xf (X i |θ)
Now, in the covariance inequality, we have equality when the correlation of W (X) and
∂
∂θ
ln fX (X|θ) equals ±1, which in turn implies
c(X − µX ) = Y − µY a.s.,
or restated,
∂
c[W (X) − τ (θ)] = ln fX (X|θ) − 0 a.s.
∂θ
This is an application of Theorem 4.5.7 (CB, pp 172); i.e., two random variables are per-
fectly correlated if and only if the random variables are perfectly linearly related. In these
equations, c is a constant. Also, I have written “−0” on the RHS of the last equation to
emphasize that " #
n
∂ ∂ Y
Eθ ln fX (X|θ) = Eθ ln fX (Xi |θ) = 0.
∂θ ∂θ i=1
Also, W (X) is an unbiased estimator of τ (θ) by assumption. Therefore, we have
∂
c[W (X) − τ (θ)] = ln fX (X|θ)
∂θ
n
∂ Y
= ln fX (Xi |θ)
∂θ i=1
∂
= ln L(θ|X)
∂θ
= S(θ|X),
where S(θ|X) is the score function. The constant c cannot depend on W (X) nor on
∂
∂θ
ln fX (X|θ), but it can depend on θ. To emphasize this, we write
Thus, varθ [W (X)] attains the CRLB when the score function S(θ|X) can be written as a
linear function of the unbiased estimator W (X). 2
PAGE 64
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
8 Hypothesis Testing
8.1 Introduction
H0 : θ ∈ Θ0
Example 8.1. Suppose X1 , X2 , ..., Xn are iid N (θ, σ02 ), where −∞ < θ < ∞ and σ02 is
known. Consider testing
H0 : θ = θ0
versus
H1 : θ 6= θ0 ,
where θ0 is a specified value of θ. The null parameter space Θ0 = {θ0 }, a singleton. The
alternative parameter space Θc0 = R \ {θ0 }.
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
PAGE 65
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Goal: In a statistical hypothesis testing problem, we decide between the two complementary
hypotheses H0 and H1 on the basis of observing X = x. In essence, a hypothesis test is a
specification of the test function
• The subset of X for which H0 is rejected is called the rejection region, denoted by
R.
• The subset of X for which H0 is not rejected is called the acceptance region, denoted
by Rc .
If
1, x ∈ R
φ(x) = I(x ∈ R) =
0, x ∈ Rc ,
the test is said to be non-randomized.
Example 8.2. Suppose X ∼ b(10, θ), where 0 < θ < 1, and consider testing
H0 : θ ≥ 0.35
versus
H1 : θ < 0.35.
PAGE 66
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1. We would like to work with test statistics that are sensible and confer tests with nice
statistical properties (does sufficiency play a role?)
2. We would like to find the sampling distribution of W under H0 and H1 .
Example 8.3. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Consider testing
H0 : σ 2 = 40
versus
H1 : σ 2 6= 40.
Example 8.4. McCann and Tebbs (2009) summarize a study examining perceived unmet
need for dental health care for people with HIV infection. Baseline in-person interviews were
conducted with 2,864 HIV infected individuals (aged 18 years and older) as part of the HIV
Cost and Services Utilization Study. Define
X1 = number of patients
with private insurance
X2 = number of patients
with medicare and private insurance
X3 = number of patients
without insurance
X4 = number of patients
with medicare but no private insurance.
Set X = (X1 , X2 , X3 , X4 ) and model X ∼ mult(2864, p1 , p2 , p3 , p4 ; 4i=1 pi = 1). Under this
P
assumption, consider testing
1
H0 : p1 = p2 = p3 = p4 = 4
versus
H1 : H0 not true.
Note that an observation like x = (0, 0, 0, 2864) should lead to a rejection of H0 . An obser-
vation like x = (716, 716, 716, 716) should not. What about x = (658, 839, 811, 556)? Can
we find a reasonable one-dimensional test statistic?
PAGE 67
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
where fX (x|θ) is the common population distribution (in the iid case). Recall that Θ is the
parameter space.
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θ \ Θ0
is defined by
sup L(θ|x)
θ∈Θ0
λ(x) = .
sup L(θ|x)
θ∈Θ
Intuition: The numerator of λ(x) is the largest the likelihood function can be over the null
parameter space Θ0 . The denominator is the largest the likelihood function can be over the
entire parameter space Θ. Clearly,
0 ≤ λ(x) ≤ 1.
PAGE 68
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
The form of the rejection region above says to “reject H0 when λ(x) is too small.” When
λ(x) is small, the data x are not consistent with the collection of models under H0 .
where θb0 is the MLE of θ subject to the constraint that θ ∈ Θ0 . That is, θb0 is the
value of θ that maximizes L(θ|x) over the null parameter space Θ0 . We call θb0 the
restricted MLE.
where θb is the MLE of θ. That is, θb is the value of θ that maximizes L(θ|x) over the
entire parameter space Θ. We call θb the unrestricted MLE.
L(θb0 |x)
λ(x) = .
L(θ|x)
b
This notation is easier and emphasizes how the definition of λ(x) is tied to maximum
likelihood estimation.
H0 : θ = θ 0 ,
That is, there is only one value of θ “allowed” under H0 . We are therefore maximizing the
likelihood function L(θ|x) over a single point in Θ.
Large-sample intuition: We will learn in Chapter 10 that (under suitable regularity con-
ditions), an MLE
p
θb −→ θ, as n → ∞,
i.e., “MLEs are consistent” (I have switched to the scalar case here only for convenience).
In the light of this asymptotic result, consider each of the following cases:
PAGE 69
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
The MLEs θb0 and θb are converging to the same quantity (in probability) so they should
be close to each other in large samples. Therefore, we would expect
L(θb0 |x)
λ(x) =
L(θ|x)
b
to be “close” to 1.
but θb0 ∈ Θ0 because θb0 is calculated by maximizing L(θ|x) over Θ0 (i.e., θb0 can never
“escape from” Θ0 ). Therefore, there is no guarantee that θb0 and θb will be close to each
other in large samples, and, in fact, the ratio
L(θb0 |x)
λ(x) =
L(θ|x)
b
• This is why (at least by appealing to large-sample intuition) it makes sense to reject
H0 when λ(x) is small.
Example 8.5. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 = 1.
Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
Θ0 = {µ0 }, a singleton
Θ = {µ : −∞ < µ < ∞}.
PAGE 70
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Clearly, n
1 1 Pn 2
sup L(µ|x) = L(µ0 |x) = √ e− 2 i=1 (xi −µ0 ) .
µ∈Θ0 2π
Over the entire parameter space Θ, the MLE is µb = X; see Example 7.5 (notes, pp 31).
Therefore, n
1 1 Pn 2
sup L(µ|x) = L(x|x) = √ e− 2 i=1 (xi −x) .
µ∈Θ 2π
R = {x ∈ X : λ(x) ≤ c} = {x ∈ X : |x − µ0 | ≥ c0 }.
Rejecting H0 when λ(x) is “too small” is the same as rejecting H0 when |x − µ0 | is “too
large.” The latter decision rule makes sense intuitively. Note that we have written our LRT
rejection region and the corresponding test function
1, |x − µ0 | ≥ c0
0
φ(x) = I(x ∈ R) = I(|x − µ0 | ≥ c ) =
0, |x − µ0 | < c0
PAGE 71
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
where −∞ < θ < ∞. Note that this is a location exponential population pdf; the location
parameter is θ. Consider testing
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
Note that W (X) = X(1) is a sufficient statistic by the Factorization Theorem. The relevant
parameter spaces are
Θ0 = {θ : −∞ < θ ≤ θ0 }
Θ = {θ : −∞ < θ < ∞}.
PAGE 72
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
• Therefore, L(θ|x) is an increasing function when θ is less than or equal to the minimum
order statistic x(1) ; when θ is larger than x(1) , the likelihood function drops to zero.
• Clearly, the unrestricted MLE of θ is θb = X(1) and hence the denominator of λ(x) is
b = L(x(1) |x).
sup L(θ|x) = L(θ|x)
θ∈Θ
Restricted MLE: By “restricted,” we mean “subject to the constraint that the estimate
fall in Θ0 = {θ : −∞ < θ ≤ θ0 }.”
• Case 1: If θ0 < x(1) , then the largest L(θ|x) can be is L(θ0 |x). Therefore, the restricted
MLE is θb0 = θ0 .
• Case 2: If θ0 ≥ x(1) , then the restricted MLE θb0 coincides with the unrestricted MLE
θb = X(1) .
• Therefore,
θ0 , θ0 < X(1)
θb0 =
X(1) , θ0 ≥ X(1) .
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
• It is only when x(1) > θ0 do we have evidence that θ might be larger than θ0 . The
larger x(1) is (x(1) > θ0 ), the smaller λ(x) becomes; see Figure 8.2.1 (CB, pp 377).
That is,
PAGE 73
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Not surprisingly, we can write our LRT rejection region in terms of W (X) = X(1) . When
θ0 < x(1) , the LRT statistic
Pn
L(θ0 |x) e− i=1 xi +nθ0
λ(x) = = − Pn x +nx = e−n(x(1) −θ0 ) .
L(x(1) |x) e i=1 i (1)
Note that
R = {x ∈ X : λ(x) ≤ c} = {x ∈ X : x(1) ≥ c0 }.
Rejecting H0 when λ(x) is “too small” is the same as rejecting H0 when x(1) is “too large.”
As noted earlier, the latter decision rule makes sense intuitively. Note that we have written
our LRT rejection region and the corresponding test function
1, x(1) ≥ c0
0
φ(x) = I(x ∈ R) = I(x(1) ≥ c ) =
0, x(1) < c0
in terms of the one-dimensional statistic W (X) = X(1) , which is sufficient for the location
exponential family.
Theorem 8.2.4. Suppose T = T (X) is a sufficient statistic for θ. If λ∗ (T (x)) = λ∗ (t) is the
LRT statistic based on T and if λ(x) is the LRT statistic based on X, then λ∗ (T (x)) = λ(x)
for all x ∈ X .
Proof. Because T = T (X) is sufficient, we can write (by the Factorization Theorem)
fX (x|θ) = gT (t|θ)h(x),
PAGE 74
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Example 8.7. Suppose X1 , X2 , ..., Xn are iid exponential(θ), where θ > 0. Consider testing
H0 : θ = θ0
versus
H1 : θ 6= θ0 .
n
∗ e
λ (t) = tn e−t/θ0 ,
nθ0
Example 8.8. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
The null hypothesis H0 above looks simple, but it is not. The relevant parameter spaces are
Θ0 = {θ = (µ, σ 2 ) : µ = µ0 , σ 2 > 0}
Θ = {θ = (µ, σ 2 ) : −∞ < µ < ∞, σ 2 > 0}.
In this problem, we call σ 2 a nuisance parameter, because it is not the parameter that is
of interest in H0 and H1 . The likelihood function is
n
Y 1 2 2
L(θ|x) = √ e−(xi −µ) /2σ
i=1 2πσ 2
n/2
1 − 12
Pn 2
i=1 (xi −µ) .
= 2
e 2σ
2πσ
PAGE 75
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Exercise: In Example 7.7 (notes, pp 34-35), derive the LRT statistic to test
H0 : p1 = p2
versus
H1 : p1 6= p2 .
Exercise: In Example 8.4 (notes, pp 67), show that the LRT statistic is
4 x
Y 2864 i
λ(x) = λ(x1 , x2 , x3 , x4 ) = .
i=1
4xi
PAGE 76
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 ,
where Θc0 = Θ \ Θ0 , can also be carried out within the Bayesian paradigm, but they are
performed differently. Recall that, for a Bayesian, all inference is carried out using the
posterior distribution π(θ|x).
make perfect sense and be calculated (or approximated) “exactly.” Note that these proba-
bilities make no sense to the non-Bayesian. S/he regards θ as fixed, so that {θ ∈ Θ0 } and
{θ ∈ Θc0 } are not random events. We do not assign probabilities to events that are not
random.
Example 8.9. Suppose that X1 , X2 , ..., Xn are iid Poisson(θ), where the prior distribution
for θ ∼ gamma(a, b), a, b known. In Example 7.10 (notes, pp 38-39), we showed that the
posterior distribution
n
!
X 1
θ|X = x ∼ gamma xi + a, .
i=1
n + 1b
As an application, consider the following data, which summarize the number of goals per
game in the 2013-2014 English Premier League season:
Goals 0 1 2 3 4 5 6 7 8 9 10+
Frequency 27 73 80 72 65 39 17 4 1 2 0
There were n = 380 games total. I modeled the number of goals per game X as a Poisson
random variable and assumed that X1 , X2 , ..., X380 are iid Poisson(θ). Before the season
started, I modeled the mean number of goals per game as θ ∼ gamma(1.5, 2), which is a
fairly diffuse prior distribution.
PAGE 77
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
0.25
4
0.20
Posterior distribution
Prior distribution
0.15
3
0.10
2
0.05
1
0.00
0
0 5 10 15 2.0 2.5 3.0 3.5
θ θ
Figure 8.1: 2013-2014 English Premier League data. Prior distribution (left) and posterior
distribution (right) for θ, the mean number of goals scored per game. Note that the horizontal
axes are different in the two figures.
> sum(goals)
[1] 1060
I have depicted the prior distribution π(θ) and the posterior distribution π(θ|x) in Figure
8.1. Suppose that I wanted to test H0 : θ ≥ 3 versus H1 : θ < 3 on the basis of the assumed
Bayesian model and the observed data x. The probability that H0 is true is
Z ∞
P (θ ≥ 3|x) = π(θ|x)dθ ≈ 0.008,
3
> 1-pgamma(3,1061.5,1/0.002628)
[1] 0.008019202
Therefore, it is far more likely that H1 is true, in fact, with probability over 0.99.
PAGE 78
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 ,
where Θc0 = Θ \ Θ0 . I will henceforth assume that θ is a scalar parameter (for simplicity
only).
Therefore, for any test that we perform, there are four possible scenarios, described in the
following table:
Decision
Reject H0 Do not reject H0
H0 Type I Error ,
Truth
H1 , Type II Error
Calculations:
It is very important to note that both of these probabilities depend on θ. This is why we
emphasize this in the notation.
PAGE 79
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
β(θ) = Pθ (X ∈ R) = Eθ [φ(X)].
In other words, the power function gives the probability of rejecting H0 for all θ ∈ Θ. Note
that if H1 is true, so that θ ∈ Θc0 ,
Example 8.10. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Consider testing
H0 : µ ≤ µ0
versus
H1 : µ > µ0 .
1, x −√µ0 ≥ c
φ(x) = σ0 / n
0, otherwise.
• The first requirement implies that P (Type I Error|µ) will not exceed 0.10 for all µ ≤ µ0
(H0 true).
• The second requirement implies that P (Type II Error|µ) will not exceed 0.20 for all
µ ≥ µ0 + σ0 (these are values of µ that make H1 true).
PAGE 80
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
0.4
0.2
0.0
0 1 2 3 4
Figure 8.2: Power function β(µ) in Example 8.10 with c = 1.28, n = 5, µ0 = 1.5 and σ0 = 1.
Horizontal lines at 0.10 and 0.80 have been added.
the 0.90 quantile of the N (0, 1) distribution. Also, because β(µ) is increasing,
√ set
inf β(µ) = β(µ0 + σ0 ) = 1 − FZ (1.28 − n) = 0.80
µ≥µ0 +σ0
√
=⇒ 1.28 − n = −0.84
=⇒ n = 4.49,
PAGE 81
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Note that if φ(x) is a size α test, then it is also level α. The converse is not true. In other
words,
{class of size α tests} ⊂ {class of level α tests}.
Remark: Often, it is unnecessary to differentiate between the two classes of tests. How-
ever, in testing problems involving discrete distributions (e.g., binomial, Poisson, etc.), it is
generally not possible to construct a size α test for a specified value of α; e.g., α = 0.05.
Thus (unless one randomizes), we may have to settle for a level α test.
Important: As the definition above indicates, the size of any test φ(x) is calculated by
maximizing the power function over the null parameter space Θ0 identified in H0 .
Example 8.11. Suppose X1 , X2 are iid Poisson(θ), where θ > 0, and consider testing
H0 : θ ≥ 3
versus
H1 : θ < 3.
Size calculations: The size of each test is calculated as follows. For the first test,
α = sup β1 (θ) = β1 (3) = e−3 ≈ 0.049787.
θ≥3
PAGE 82
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
β1(θ)
β2(θ)
0.4
0.2
0.0
0 1 2 3 4 5
Example 8.12. Suppose X1 , X2 , ..., Xn are iid from fX (x|θ) = e−(x−θ) I(x ≥ θ), where
−∞ < θ < ∞. In Example 8.6 (notes, pp 72-74), we considered testing
H0 : θ ≤ θ0
versus
H1 : θ > θ0
and derived the LRT to take the form φ(x) = I(x(1) ≥ c0 ). Find the value of c0 that makes
φ(x) a size α test.
Solution. The pdf of X(1) is fX(1) (x|θ) = ne−n(x−θ) I(x ≥ θ). We set
PAGE 83
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 ,
where Θc0 = Θ \ Θ0 . A test in C with power function β(θ) is a uniformly most powerful
(UMP) class C test if
β(θ) ≥ β ∗ (θ) for all θ ∈ Θc0 ,
where β ∗ (θ) is the power function of any other test in C. The “uniformly” part in this
definition refers to the fact that the power function β(θ) is larger than (i.e., at least as large
as) the power function of any other class C test for all θ ∈ Θc0 .
Important: In this course, we will restrict attention to tests φ(x) that are level α tests.
That is, we will take
C = {all level α tests}.
This restriction is analogous to the restriction we made in the “optimal estimation problem”
in Chapter 7. Recall that we restricted attention to unbiased estimators first; we then wanted
to find the one with the smallest variance (uniformly, for all θ ∈ Θ). In the same spirit, we
make the same type of restriction here by considering only those tests that are level α tests.
This is done so that we can avoid having to consider “silly tests,” e.g.,
The power function for this test is β(θ) = 1, for all θ ∈ Θ. This test cannot be beaten in
terms of power when H1 is true! Unfortunately, it is not a very good test when H0 is true.
sup β(θ) ≤ α.
θ∈Θ0
H0 : θ = θ0
versus
H1 : θ = θ1 .
Remark: This type of test is rarely of interest in practice. However, it is the “building
block” situation for more interesting problems.
PAGE 84
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ = θ0
versus
H1 : θ = θ1
and denote by fX (x|θ0 ) and fX (x|θ1 ) the pdfs (pmfs) of X = (X1 , X2 , ..., Xn ) corresponding
to θ0 and θ1 , respectively. Consider the test function
1, fX (x|θ1 ) > k
fX (x|θ0 )
φ(x) =
0, fX (x|θ1 ) < k,
fX (x|θ0 )
for k ≥ 0, where
α = Pθ0 (X ∈ R) = Eθ0 [φ(X)]. (8.1)
Sufficiency: Any test satisfying the definition of φ(x) above and Equation (8.1) is a most
powerful (MP) level α test.
Remarks:
• The necessity part of the Neyman-Pearson (NP) Lemma is less important for our
immediate purposes (see CB, pp 388).
Example 8.13. Suppose that X1 , X2 , ..., Xn are iid beta(θ, 1), where θ > 0; i.e., the popu-
lation pdf is
fX (x|θ) = θxθ−1 I(0 < x < 1).
Derive the MP level α test for
H0 : θ = 1
versus
H1 : θ = 2.
PAGE 85
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
The NP Lemma says that the MP level α test uses the rejection rejection
( n
)
Y
R = x ∈ X : 2n xi > k ,
i=1
Instead
Q of finding the constant k that satisfies this equation, we rewrite the rejection rule
{2n ni=1 xi > k} in a way that makes our life easier. Note that
n
Y n
Y
n
2 xi > k ⇐⇒ xi > 2−n k
i=1 i=1
n
X
⇐⇒ − ln xi < − ln(2−n k) = k 0 , say.
i=1
Qn Pn
We have rewritten the rejection rule {2n i=1 xi > k} as { − ln xi < k 0 }. Therefore,
i=1
n ! n !
Y X
− ln Xi < k 0 θ = 1 .
α=P 2n Xi > k θ = 1 = P
i=1 i=1
We have now changed the problem to choosing k 0 to solve this equation above.
Recall that
H H
Xi ∼0 U(0, 1) =⇒ − ln Xi ∼0 exponential(1)
n
H
X
=⇒ − ln Xi ∼0 gamma(n, 1).
i=1
Therefore, to satisfy the equation above, we take k 0 = gn,1,1−α , the (lower) α quantile of a
gamma(n, 1) distribution. This notation for quantiles is consistent with how CB have defined
them on pp 386. Thus, the MP level α test of H0 : θ = 1 versus H1 : θ = 2 has rejection
region ( )
Xn
R= x∈X : − ln xi < gn,1,1−α .
i=1
PAGE 86
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Therefore, Z 5.425
1
β(2) = u9 e−2u du ≈ 0.643.
1 10
0 Γ(10) 2
| {z }
gamma(10, 1/2) pdf
Proof of NP Lemma. We prove the sufficiency part only. Define the test function
fX (x|θ1 )
1, fX (x|θ0 ) > k
φ(x) =
fX (x|θ1 )
0, < k,
fX (x|θ0 )
where k ≥ 0 and
α = Pθ0 (X ∈ R) = Eθ0 [φ(X)];
i.e., φ(x) is a size α test. We want to show that φ(x) is MP level α. Therefore, let φ∗ (x) be
the test function for any other level α test of H0 versus H1 . Note that
Eθ0 [φ(X)] = α
Eθ0 [φ∗ (X)] ≤ α.
Thus,
Define
b(x) = [φ(x) − φ∗ (x)][fX (x|θ1 ) − kfX (x|θ0 )].
• Case 1: Suppose fX (x|θ1 ) − kfX (x|θ0 ) > 0. Then, by definition, φ(x) = 1. Because
0 ≤ φ∗ (x) ≤ 1, we have
PAGE 87
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
• Case 2: Suppose fX (x|θ1 ) − kfX (x|θ0 ) < 0. Then, by definition, φ(x) = 0. Because
0 ≤ φ∗ (x) ≤ 1, we have
We have shown that b(x) = [φ(x) − φ∗ (x)][fX (x|θ1 ) − kfX (x|θ0 )] ≥ 0. Therefore,
that is,
Eθ1 [φ(X) − φ∗ (X)] ≥ k Eθ0 [φ(X) − φ∗ (X)] ≥ 0.
| {z }
≥ 0, shown above
Therefore, Eθ1 [φ(X) − φ∗ (X)] ≥ 0 and hence Eθ1 [φ(X)] ≥ Eθ1 [φ∗ (X)]. This shows that φ(x)
is more powerful than φ∗ (x). Because φ∗ (x) is an arbitrary level α test, we are done. 2
H0 : θ = θ0
versus
H1 : θ = θ1 ,
and suppose that T = T (X) is a sufficient statistic. Denote by gT (t|θ0 ) and gT (t|θ1 ) the pdfs
(pmfs) of T corresponding to θ0 and θ1 , respectively. Consider the test function
gT (t|θ1 )
1, gT (t|θ0 ) > k
φ(t) =
0, gT (t|θ1 ) < k,
gT (t|θ0 )
PAGE 88
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Example 8.14. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Find the MP level α test for
H0 : µ = µ0
versus
H1 : µ = µ1 ,
where µ1 < µ0 .
Solution. The sample mean T = T (X) = X is a sufficient statistic for the N (µ, σ02 ) family.
Furthermore,
σ02
1 − n2 (t−µ)2
T ∼ N µ, =⇒ gT (t|µ) = p e 2σ0
,
n 2πσ02 /n
− n 2 2
2 [(t−µ1 ) −(t−µ0 ) ] 2σ02 n−1 ln k − (µ21 − µ20 )
e 2σ0
> k ⇐⇒ t < = k 0 , say.
2(µ0 − µ1 )
where k 0 satisfies
k 0 − µ0
0
α = Pµ0 (T < k ) = P Z < √
σ0 / n
k 0 − µ0
=⇒ √ = −zα
σ0 / n
√
=⇒ k 0 = µ0 − zα σ0 / n.
√
Therefore, the MP level α test rejects H0 when X < µ0 − zα σ0 / n. This is the same test we
would have gotten using fX (x|µ0 ) and fX (x|µ1 ) with the original version of the NP Lemma
(Theorem 8.3.12).
PAGE 89
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Remark: So far, we have discussed “test related optimality” in the context of simple-versus-
simple hypotheses. We now extend the idea of “most powerful” to more realistic situations
involving composite hypotheses; e.g., H0 : θ ≤ θ0 versus H1 : θ > θ0 .
Definition: A family of pdfs (pmfs) {gT (t|θ); θ ∈ Θ} for a univariate random variable T
has monotone likelihood ratio (MLR) if for all θ2 > θ1 , the ratio
gT (t|θ2 )
gT (t|θ1 )
is a nondecreasing function of t over the set {t : gT (t|θ1 ) > 0 or gT (t|θ2 ) > 0}.
Example 8.15. Suppose T ∼ b(n, θ), where 0 < θ < 1. The pmf of T is
n t
gT (t|θ) = θ (1 − θ)n−t ,
t
θ2 1 − θ1
> 1 and > 1.
θ1 1 − θ2
Therefore,
gT (t|θ2 )
= c(θ1 , θ2 ) at ,
gT (t|θ1 ) | {z }
>0
where a > 1. This is an increasing function of t over {t : t = 0, 1, 2, ..., n}. Therefore, the
family {gT (t|θ) : 0 < θ < 1} has MLR.
Remark: Many common families of pdfs (pmfs) have MLR. For example, if
T ∼ gT (t|θ) = h(t)c(θ)ew(θ)t ,
i.e., T has pdf (pmf) in the one-parameter exponential family, then {gT (t|θ); θ ∈ Θ} has
MLR if w(θ) is a nondecreasing function of θ.
Proof. Exercise.
PAGE 90
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
Suppose that T is sufficient. Suppose that {gT (t|θ); θ ∈ Θ} has MLR. The test that rejects
H0 iff T > t0 is a UMP level α test, where
α = Pθ0 (T > t0 ).
Similarly, when testing
H0 : θ ≥ θ0
versus
H1 : θ < θ0 ,
the test that rejects H0 iff T < t0 is UMP level α, where α = Pθ0 (T < t0 ).
Example 8.16. Suppose X1 , X2 , ..., Xn are iid Bernoulli(θ), where 0 < θ < 1, and consider
testing
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
We know that n
X
T = Xi
i=1
is a sufficient statistic and T ∼ b(n, θ). In Example 8.15, we showed that the family {gT (t|θ) :
0 < θ < 1} has MLR. Therefore, the Karlin-Rubin Theorem says that the UMP level α test
is
φ(t) = I(t > t0 ),
where t0 solves
n
X n t
α = Pθ0 (T > t0 ) = θ (1 − θ0 )n−t .
t 0
t=bt0 c+1
t0 Pθ0 (T ≥ bt0 c + 1)
7 ≤ t0 < 8 P (T ≥ 8|θ = 0.2) = 0.2392
8 ≤ t0 < 9 P (T ≥ 9|θ = 0.2) = 0.1287
9 ≤ t0 < 10 P (T ≥ 10|θ = 0.2) = 0.0611
10 ≤ t0 < 11 P (T ≥ 11|θ = 0.2) = 0.0256
11 ≤ t0 < 12 P (T ≥ 12|θ = 0.2) = 0.0095
PAGE 91
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
0.4
0.2
0.0
Figure 8.4: Power function β(θ) for the UMP level α = 0.0611 test in Example 8.16 with
n = 30 and θ0 = 0.2. A horizontal line at α = 0.0611 has been added.
Therefore, the UMP level α = 0.0611 test of H0 : θ ≤ 0.2 versus H1 : θ > 0.2 uses I(t ≥ 10).
The UMP level α = 0.0256 test uses I(t ≥ 11). Note that (without randomizing) it is not
possible to write a UMP level α = 0.05 test in this problem. For the level α = 0.0611 test,
the power function is
30
X 30 t
β(θ) = Pθ (T ≥ 10) = θ (1 − θ)30−t ,
t=10
t
Example 8.17. Suppose that X1 , X2 , ..., Xn are iid with population distribution
where θ > 0. Note that this population distribution is an exponential distribution with mean
1/θ. Derive the UMP level α test for
H0 : θ ≥ θ0
versus
H1 : θ < θ0 .
PAGE 92
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
is a sufficient statistic and T ∼ gamma(n, 1/θ). Suppose θ2 > θ1 and form the ratio
1
1 n
tn−1 e−θ2 t n
gT (t|θ2 ) Γ(n) θ2 θ2
= = e−t(θ2 −θ1 ) .
gT (t|θ1 ) 1 θ1
n tn−1 e−θ1 t
Γ(n) θ11
gT (t|θ2 )
gT (t|θ1 )
is a decreasing function of t over {tP: t > 0}. However, the ratio is an increasing function
of t = −t, and T = T (X) = − ni=1 Xi is still a sufficient statistic (it is a one-to-one
∗ ∗ ∗
where t0 satisfies
Using χ2 critical values: We can also write this rejection region in terms of a χ2 quantile.
To see why, note that when θ = θ0 , the quantity 2θ0 T ∼ χ22n so that
PAGE 93
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
0.4
0.2
0.0
0 2 4 6 8
Figure 8.5: Power function β(θ) for the UMP level α = 0.10 test in Example 8.17 with
n = 10 and θ0 = 4. A horizontal line at α = 0.10 has been added.
Remark: One advantage of writing the rejection region in this way is that it depends on a
χ2 quantile, which, historically, may have been available in probability tables (i.e., in times
before computers and R). Another small advantage is that we can express the power function
β(θ) in terms of a χ2 cdf instead of a more general gamma cdf.
Power function: The power function of the UMP level α test is given by
χ22n,α θχ22n,α
β(θ) = Pθ (X ∈ R) = Pθ T > = Pθ 2θT >
2θ0 θ0
2
θχ2n,α
= 1 − Fχ22n ,
θ0
where Fχ22n (·) is the χ22n cdf. A graph of this power function, when n = 10, α = 0.10, and
θ0 = 4, is shown in Figure 8.5 (above).
Proof of Karlin-Rubin Theorem. We will prove this theorem in parts. The first part is a
lemma.
cov[g(X), h(X)] ≥ 0.
PAGE 94
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
showing that [h(x1 ) − h(x2 )][g(x1 ) − g(x2 )] ≥ 0, for all x1 , x2 ∈ R. By Theorem 2.2.5 (CB,
pp 57), E{[h(X1 ) − h(X2 )][g(X1 ) − g(X2 )]} ≥ 0. 2
Lemma 2. Suppose the family {gT (t|θ) : θ ∈ Θ} has MLR. If ψ(t) ↑nd t, then Eθ [ψ(T )] ↑nd θ.
Proof. Suppose that θ2 > θ1 . Because {gT (t|θ) : θ ∈ Θ} has MLR, we know that
gT (t|θ2 ) x
t
gT (t|θ1 ) nd
over the set {t : gT (t|θ1 ) > 0 or gT (t|θ2 ) > 0}. Therefore, by Lemma 1, we know
gT (T |θ2 ) gT (T |θ2 ) gT (T |θ2 )
covθ1 ψ(T ), ≥ 0 =⇒ Eθ1 ψ(T ) ≥ Eθ1 [ψ(T )] Eθ1
gT (T |θ1 ) gT (T |θ1 ) gT (T |θ1 )
| {z } | {z }
= Eθ2 [ψ(T )] = 1
Pθ (T > t0 ) ↑nd θ
for all t0 ∈ R. In other words, the family {gT (t|θ) : θ ∈ Θ} is stochastically increasing in θ.
Proof. This is a special case of Lemma 2. Fix t0 . Take ψ(t) = I(t > t0 ). Clearly,
1, t > t0
ψ(t) =
0, t ≤ t0
PAGE 95
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
This shows that φ(t) = I(t > t0 ) is a size α (and hence level α) test function. Thus, all that
remains is to show that this test is uniformly most powerful (i.e., most powerful ∀θ > θ0 ).
Remember that we are considering the test
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
Let φ∗ (x) be any other level α test of H0 versus H1 . Fix θ1 > θ0 and consider the test of
H0∗ : θ = θ0
versus
∗
H1 : θ = θ1
because φ∗ (x) is a level α test of H0 versus H1 . This also means that φ∗ (x) is a level α test
of H0∗ versus H1∗ . However, Corollary 8.3.13 (Neyman Pearson with a sufficient statistic T )
says that φ(t) is the most powerful (MP) level α test of H0∗ versus H1∗ . This means that
Eθ1 [φ(T )] ≥ Eθ1 [φ∗ (X)].
Because θ1 > θ0 was chosen arbitrarily and because φ∗ (x) was too, we have
Eθ [φ(T )] ≥ Eθ [φ∗ (X)]
for all θ > θ0 and for any level α test φ∗ (x) of H0 versus H1 . Because φ(t) is a level α test
of H0 versus H1 (shown above), we are done. 2
PAGE 96
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Note: In single parameter exponential families, we can find UMP tests for H0 : θ ≤ θ0
versus H1 : θ > θ0 (or for H0 : θ ≥ θ0 versus H1 : θ < θ0 ). Unfortunately,
• once we get outside this setting (even with a one-sided H1 ), UMP tests do become
scarce.
In other words, the collection of problems for which a UMP test exists is somewhat small.
In many ways, this should not be surprising. Requiring a test to outperform all other level
α tests for all θ in the alternative space Θc0 is asking a lot. The “larger” Θc0 is, the harder
it is to find a UMP test.
Example 8.18. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
There is no UMP test for this problem. A UMP test would exist if we could find a test
whose power function “beats” the power function for all other level α tests. For one-sided
alternatives, it is possible to find one. However, a two-sided alternative space is too large.
To illustrate, suppose we considered testing
H00 : µ ≤ µ0
versus
H10 : µ > µ0 .
PAGE 97
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H000 : µ ≥ µ0
versus
H100 : µ < µ0 .
However, φ0 (x) 6= φ00 (x) for all x ∈ X . Therefore, no UMP test can exist for H0 versus H1 .
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 .
A test with power function β(θ) is unbiased if β(θ0 ) ≥ β(θ00 ) for all θ0 ∈ Θc0 and for all
θ00 ∈ Θ0 . That is, the power is always larger in the alternative parameter space than it is in
the null parameter space.
• Therefore, when no UMP test exists, we could further restrict attention to those tests
that are level α and are unbiased. Conceptually, define
PAGE 98
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
0.4
0.3
PDF
0.2
0.1
1−α
α 2 α 2
0.0
−4 −2 0 2 4
Figure 8.6: Pdf of Z ∼ N (0, 1). The UMPU level α rejection region in Example 8.18 is
shown shaded.
• The test in C U that is UMP is called the uniformly most powerful unbiased
(UMPU) test. The UMPU test has power function β(θ) that satisfies
Example 8.18 (continued). Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞
and σ02 is known. Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
PAGE 99
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
UMPU
UMP µ > µ0
UMP µ < µ0
0.8
0.6
Power function
0.4
0.2
0.0
2 4 6 8 10
Figure 8.7: Power function β(µ) of the UMPU level α = 0.05 test in Example 8.18 with
n = 10, µ0 = 6, and σ02 = 4. Also shown are the power functions corresponding to the two
UMP level α = 0.05 tests with H1 : µ > µ0 and H1 : µ < µ0 .
PAGE 100
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Figure 8.7 the UMP level α = 0.05 power functions for the two one-sided tests (i.e., the tests
with H1 : µ > µ0 and H1 : µ < µ0 , respectively).
• It is easy to see that the UMPU test is an unbiased test. Note that β(µ) is always
larger in the alternative parameter space {µ ∈ R : µ 6= µ0 } than it is when µ = µ0 .
• The UMPU test’s power function “loses” to each UMP test’s power function in the
region where that UMP test is most powerful. This is the price one must pay for
restricting attention to unbiased tests. The best unbiased test for a two-sided H1 will
not beat a one-sided UMP test. However, the UMPU test is clearly better than the
UMP tests in each UMP test’s null parameter space.
Definition: A p-value p(X) is a test statistic, satisfying 0 ≤ p(x) ≤ 1, for all x ∈ X . Small
values of p(x) are evidence against H0 . A p-value is said to be valid if
Pθ (p(X) ≤ α) ≤ α,
for all θ ∈ Θ0 and for all 0 ≤ α ≤ 1.
“If p(X) is a valid p-value, it is easy to construct a level α test based on p(X).
The test that rejects H0 if and only if p(X) ≤ α is a level α test.”
It is easy to see why this is true. The validity requirement above guarantees that
φ(x) = I(p(x) ≤ α)
is a level α test function. Why? Note that
sup Eθ [φ(X)] = sup Pθ (p(X) ≤ α) ≤ α.
θ∈Θ0 θ∈Θ0
Theorem 8.3.27. Let W = W (X) be a test statistic such that large values of W give
evidence against H0 . For each x ∈ X , define
p(x) = sup Pθ (W (X) ≥ w),
θ∈Θ0
where w = W (x). Then p(X) is a valid p-value. Note that the definition of p(x) for when
small values of W give evidence against H0 would be analogous.
Proof. Fix θ ∈ Θ0 . Let F−W (w|θ) denote the cdf of −W = −W (X). When the test rejects
for large values of W ,
pθ (x) ≡ Pθ (W (X) ≥ w) = Pθ (−W (X) ≤ −w) = F−W (−w|θ),
PAGE 101
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
where the notation X ≥ST Y means “the distribution of X is stochastically larger than the
distribution of Y ” (see Exercise 2.10, CB, pp 77). Combining both cases, we have
Pθ (pθ (X) ≤ α) ≤ α,
Because we fixed θ ∈ Θ0 arbitrarily, this result must hold for all θ ∈ Θ0 . We have shown
that p(X) is a valid p-value. 2
Example 8.19. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). Consider testing
H0 : µ ≤ µ0
versus
H1 : µ > µ0 .
X − µ0
W = W (X) = √
S/ n
are evidence against H0 (i.e., this is a “one-sample t test,” which is a LRT). The null
parameter space is
Θ0 = {θ = (µ, σ 2 ) : µ ≤ µ0 , σ 2 > 0}.
Therefore, with observed value w = W (x), the p-value for the test is
X − µ0
p(x) = sup Pθ (W (X) ≥ w) = sup Pθ √ ≥w
θ∈Θ0 θ∈Θ0 S/ n
X −µ µ0 − µ
= sup Pθ √ ≥w+ √
θ∈Θ0 S/ n S/ n
µ0 − µ
= sup Pθ Tn−1 ≥ w + √ = P (Tn−1 ≥ w) ,
µ≤µ0 S/ n
PAGE 102
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
p−values
0.4
0.2
0.0
U(0,1) quantiles
Remark: In Example 8.19, calculating the supremum over Θ0 is relatively easy. In other
problems, it might not be, especially when there are nuisance parameters. A very good
discussion on this is given in Berger and Boos (1994). These authors propose another type
of p-value by “suping” over subsets of Θ0 formed from calculating confidence intervals first
(which can make the computation easier).
Pθ0 (p(X) ≤ α) = α,
H
for all 0 ≤ α ≤ 1, then φ(x) = I(p(x) ≤ α) is a size α test and p(X) ∼0 U(0, 1).
Example 8.20. Suppose X1 , X2 , ..., Xn are iid N (0, 1). I used R to simulate B = 200
independent samples of this type, each with n = 30. With each sample, I performed a t test
for H0 : µ = 0 versus H1 : µ 6= 0 and calculated the p-value for each test (note that H0 is
true). A uniform qq plot of the 200 p-values in Figure 8.8 shows agreement with the U(0, 1)
distribution. Using α = 0.05, there were 9 tests (out of 200) that incorrectly rejected H0 .
PAGE 103
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
9 Interval Estimation
9.1 Introduction
Remark: In the definition above, a one-sided interval estimate is formed when one of
the endpoints is ±∞. For example, if L(x) = −∞, then the estimate is (−∞, U (x)]. If
U (x) = ∞, the estimate is [L(x), ∞).
Definition: Suppose [L(X), U (X)] is an interval estimator for θ. The coverage probabil-
ity of the interval is
Pθ (L(X) ≤ θ ≤ U (X)).
It is important to note the following:
• In the probability above, it is the endpoints L(X) and U (X) that are random; not θ
(it is fixed).
• The coverage probability is regarded as a function of θ. That is, the probability that
[L(X), U (X)] contains θ may be different for different values of θ ∈ Θ. This is usually
true when X is discrete.
Remark: In some problems, it is possible that the estimator itself is not an interval. More
generally, we use the term 1 − α confidence set to allow for these types of estimators. The
notation C(X) is used more generally to denote a confidence set.
PAGE 104
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.1. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. We consider two
interval estimators:
that is, the coverage probability is the same for all θ ∈ Θ = {θ : θ > 0}. The confidence
coefficient of the interval (aX(n) , bX(n) ) is therefore
n n n n
1 1 1 1
inf − = − .
θ>0 a b a b
On the other hand, the coverage probability for the second interval is
PAGE 105
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.2. Suppose that X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. A “1 − α
confidence interval” commonly taught in undergraduate courses is
r
pb(1 − pb)
pb ± zα/2 ,
n
where pb is the sample proportion, that is,
n
Y 1X
pb = = Xi ,
n n i=1
where Y = ni=1 Xi ∼ b(n, p), and zα/2 is the upper α/2 quantile of the N (0, 1) distribution.
P
In Chapter 10, we will learn that this is a large-sample “Wald-type” confidence interval. An
expression for the coverage probability of this interval is
r r !
pb(1 − pb) pb(1 − pb)
Pp pb − zα/2 ≤ p ≤ pb + zα/2
n n
s s
Y Y Y Y
Y (1 − n ) Y (1 − n )
= Ep I − zα/2 n ≤ p ≤ + zα/2 n
n n n n
n
ry ry !
y y
X y (1 − ) y (1 − ) n y
= I − zα/2 n n
≤ p ≤ + zα/2 n n
p (1 − p)y .
y=0
n n n n y
| {z }
b(n,p) pmf
Special case: I used R to graph this coverage probability function across values of 0 < p < 1
when n = 40 and α = 0.05; see Figure 9.1 (next page).
• The coverage probability rarely attains the nominal 0.95 level across 0 < p < 1.
• The jagged nature of the coverage probability function (of p) arises from the discrete-
ness of Y ∼ b(40, p).
• The confidence coefficient of the Wald interval (i.e., the infimum coverage probability
across all 0 < p < 1) is clearly 0.
• An excellent account of the performance of this confidence interval (and competing
intervals) is given in Brown et al. (2001, Statistical Science).
• When 1 − α = 0.95, one competing interval mentioned in Brown et al. (2001) replaces
y with y ∗ = y + 2 and n with n∗ = n + 4. This “add two successes-add two failures”
interval was proposed by Agresti and Coull (1998, American Statistician). Because
this interval’s coverage probability is much closer to the nominal level across 0 < p < 1
(and because it is so easy to compute), it has begun to usurp the Wald confidence
interval in introductory level courses.
PAGE 106
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
1.0
0.8
Coverage probability
0.6
0.4
0.2
0.0
Figure 9.1: Coverage probability of the Wald confidence interval for a binomial proportion
p when n = 40 and α = 0.05. A dotted horizontal line at 1 − α = 0.95 has been added.
Remark: This method of interval construction is motivated by the strong duality between
hypothesis testing and confidence intervals.
PAGE 107
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.3. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. A size α likelihood ratio test (LRT) of H0 : µ = µ0
versus H1 : µ 6= µ0 uses the test function
|x − µ0 |
1, √ ≥ tn−1,α/2
φ(x) = s/ n
0, otherwise.
Remark: As Example 9.3 suggests, when we invert a two-sided hypothesis test, we get a
two-sided confidence interval. This will be true in most problems. Analogously, inverting
one-sided tests generally leads to one-sided intervals.
Example 9.4. Suppose X1 , X2 , ..., Xn are iid exponential(θ), where θ > 0. A uniformly
most powerful (UMP) level α test of H0 : θ = θ0 versus H1 : θ > θ0 uses the test function
θ0
(
1, t ≥ χ22n,α
φ(t) = 2
0, otherwise,
where the sufficient statistic t = ni=1 xi . The “acceptance region” for this test is
P
θ0 2
Aθ0 = x ∈ X : t < χ2n,α ,
2
where, note that
θ0 2T
Pθ0 (X ∈ Aθ0 ) = Pθ0 T < χ22n,α = Pθ0 < χ22n,α = 1 − α,
2 θ0
H d
because 2T /θ0 ∼0 gamma(n, 2) = χ22n . Therefore, a 1 − α confidence set for θ is
θ 2
C(x) = {θ > 0 : x ∈ Aθ } = θ : t < χ2n,α
2
2t
= θ: 2 <θ .
χ2n,α
The random version of this confidence set is written as
2T
, ∞ ,
χ22n,α
where T = ni=1 Xi . This is a “one-sided” interval, as expected, because we have inverted a
P
one-sided test.
Remark: The test inversion method makes direct use of the relationship between hypothesis
tests and confidence intervals (sets). On pp 421, the authors of CB write,
“Both procedures look for consistency between sample statistics and population
parameters. The hypothesis test fixes the parameter and asks what sample values
(the acceptance region) are consistent with that fixed value. The confidence set
fixes the sample value and asks what parameter values (the confidence interval)
make this sample value most plausible.”
An illustrative figure (Figure 9.2.1, pp 421) displays this relationship in the N (µ, σ02 ) case;
i.e., writing a confidence interval for a normal mean µ when σ02 is known.
PAGE 109
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Remark: Finding pivots makes getting confidence intervals easy. If Q = Q(X, θ) is a pivot,
then we can set
1 − α = Pθ (a ≤ Q(X, θ) ≤ b),
where a and b are quantiles of the distribution of Q that satisfy the equation. Because Q
is a pivot, the probability on the RHS will be the same for all θ ∈ Θ. Therefore, a 1 − α
confidence interval can be determined from this equation.
Example 9.5. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. In Example 9.1,
we showed that
X(n)
Q = Q(X, θ) = ∼ beta(n, 1).
θ
Because the distribution of Q is free of θ, we know that Q is a pivot. Let bn,1,1−α/2 and
bn,1,α/2 denote the lower and upper α/2 quantiles of a beta(n, 1) distribution, respectively.
We can then write
X(n) 1 θ 1
1 − α = Pθ bn,1,1−α/2 ≤ ≤ bn,1,α/2 = Pθ ≥ ≥
θ bn,1,1−α/2 X(n) bn,1,α/2
X(n) X(n)
= Pθ ≤θ≤ .
bn,1,α/2 bn,1,1−α/2
Yi = β0 + β1 xi + i ,
where i ∼ iid N (0, σ 2 ) and the xi ’s are fixed constants (measured without error). Consider
writing a confidence interval for
θ = E(Y |x0 ) = β0 + β1 x0 ,
where x0 is a specified value of x. In a linear models course, you have shown that
2
!
1 (x 0 − x)
θb = βb0 + βb1 x0 ∼ N θ, σ 2 + Pn 2
,
n i=1 (xi − x)
where βb0 and βb1 are the least-squares estimators of β0 and β1 , respectively.
PAGE 110
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
θb − θ
Q(Y, θ) = r h 2
i ∼ N (0, 1)
σ2 1
+ Pn(x0 −x) 2
n (x
i=1 i −x)
θb − θ
Q(Y, θ) = r h 2
i ∼ tn−2 ,
1 (x 0 −x)
MSE n + Pn (xi −x)2
i=1
where MSE is the mean-squared error from the regression, is used as a pivot.
Remark: As Examples 9.5 and 9.6 illustrate, interval estimates are easily obtained after
writing 1 − α = Pθ (a ≤ Q(X, θ) ≤ b), for constants a and b (quantiles of Q). More generally,
{θ ∈ Θ : Q(x, θ) ∈ A} is a set estimate for θ, where A satisfies 1 − α = Pθ (Q(X, θ) ∈ A).
For example, in Example 9.5, we could have written
X(n) X(n)
1 − α = Pθ bn,1,1−α ≤ ≤1 = Pθ X(n) ≤ θ ≤
θ bn,1,1−α
Which one is “better?” For that matter, how should we define what “better” means?
PAGE 111
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.7. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). We know that
X −µ
Q1 = √ ∼ tn−1 ,
S/ n
that is, Q1 is a pivot. Therefore,
X −µ
1 − α = Pθ −tn−1,α/2 ≤ √ ≤ tn−1,α/2
S/ n
S S
= Pθ X − tn−1,α/2 √ ≤ µ ≤ X + tn−1,α/2 √ ,
n n
showing that
S S
C1 (X) = X − tn−1,α/2 √ , X + tn−1,α/2 √
n n
is a 1 − α confidence set for µ. Similarly, we know that
(n − 1)S 2
Q2 = ∼ χ2n−1 ,
σ2
that is, Q2 is also a pivot. Therefore,
(n − 1)S 2
2 2
1 − α = Pθ χn−1,1−α/2 ≤ ≤ χn−1,α/2
σ2
!
(n − 1)S 2 (n − 1)S 2
= Pθ ≤ σ2 ≤ 2 ,
χ2n−1,α/2 χn−1,1−α/2
showing that !
(n − 1)S 2 (n − 1)S 2
C2 (X) = ,
χ2n−1,α/2 χ2n−1,1−α/2
is a 1 − α confidence set for σ 2 .
Q: Is C1 (X) × C2 (X), the Cartesian product of C1 (X) and C2 (X), a 1 − α confidence region
for θ?
A: No. By Bonferroni’s Inequality,
Pθ (θ ∈ C1 (X) × C2 (X)) ≥ Pθ (µ ∈ C1 (X)) + Pθ (σ 2 ∈ C2 (X)) − 1
= (1 − α) + (1 − α) − 1
= 1 − 2α.
Therefore, C1 (X) × C2 (X) is a 1 − 2α confidence region for θ.
PAGE 112
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Bonferroni adjustment: Adjust C1 (X) and C2 (X) individually so that the confidence
coefficient of each is 1 − α/2. The adjusted set C1∗ (X) × C2∗ (X) is a 1 − α confidence region
for θ. This region has coverage probability larger than or equal to 1 − α for all θ (so it is
“conservative”).
Discussion: Example 9.2.7 (CB, pp 427-428) provides tips on how to find pivots in location
and scale (and location-scale) families.
In general, differences are pivotal in location family problems; ratios are pivotal for scale
parameters.
PAGE 113
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
PAGE 114
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
1.0
0.8
0.6
CDF
0.4
0.2
0.0
Figure 9.2: CDF of T = X(1) in Example 9.8, FT (t|θ), plotted as a function of θ with t fixed.
The value of t is 10.032, calculated based on an iid sample from fX (x|θ) with n = 5. Dotted
horizontal lines at α/2 = 0.025 and 1 − α/2 = 0.975 have been added.
Special case: I used R to simulate an iid sample of size n = 5 from fX (x|θ). The cdf of
T = X(1) is plotted in Figure 9.2 as a function of θ with the observed value of t = x(1) = 10.032
held fixed. A 0.95 confidence set is (9.293, 10.026). The true value of θ is 10.
Theorem 9.2.12. Suppose T is a statistic with a continuous cdf FT (t|θ). Suppose α1 +α2 =
α. Suppose for all t ∈ T , the functions θL (t) and θU (t) are defined as follows:
– FT (t|θU (t)) = α1
– FT (t|θL (t)) = 1 − α2 .
– FT (t|θU (t)) = 1 − α2
– FT (t|θL (t)) = α1 .
PAGE 115
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Remark: Theorem 9.2.12 remains valid for any statistic T with continuous cdf. In practice,
we would likely want T to be a sufficient statistic.
Remark: Pivoting the cdf always “works” because (if T is continuous), the cdf itself, when
viewed as random, is a pivot. From the Probability Integral Transformation, we know that
FT (T |θ) ∼ U(0, 1). Therefore, when FT (t|θ) is a decreasing function of θ, we have
Implementation: To pivot the cdf, it is not necessary that FT (t|θ) be available in closed
form (as in Example 9.8). All we really have to do is solve
Z t0 Z ∞
∗ set set
fT (t|θ1 (t0 ))dt = α/2 and fT (t|θ2∗ (t0 ))dt = α/2
−∞ t0
(in the equal α1 = α2 = α/2 case, say), based on the observed value T = t0 . We solve these
equations for θ1∗ (t0 ) and θ2∗ (t0 ). One of these will be the lower limit θL (t0 ) and the other
will be the upper limit θU (t0 ), depending on whether FT (t|θ) is an increasing or decreasing
function of θ.
Remark: The discrete case (i.e., the statistic T has a discrete distribution) is handled in
the same way except that the integrals above are replaced by sums.
Example 9.9.
P Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. We now pivot the
cdf ofPT = ni=1 Xi , a sufficient statistic, to write a 1 − α confidence set for θ. Recall that
T = ni=1 Xi ∼ Poisson(nθ). If T = t0 is observed, we set
t0
X (nθ)k e−nθ set
Pθ (T ≤ t0 ) = = α/2
k=0
k!
∞
X (nθ)k e−nθ set
Pθ (T ≥ t0 ) = = α/2
k=t
k!
0
and solve each equation for θ. In practice, the solutions could be found by setting up a
grid search over possible values of θ and then selecting the values that solve these equations
(one solution will be the lower endpoint; the other solution will be the upper endpoint). In
this example, however, it is possible to get closed-form expressions for the confidence set
endpoints. To see why, we need to recall the following result which “links” the Poisson and
gamma distributions.
PAGE 116
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
1.00
0.95
Coverage probability
0.90
0.85
0.80
Figure 9.3: Coverage probability of the confidence interval in Example 9.9 when n = 10 and
α = 0.10. A dotted horizontal line at 1 − α = 0.90 has been added.
P (X ≤ x) = P (Y ≥ a),
where Y ∼ Poisson(x/b). This identity was stated in Example 3.3.1 (CB, pp 100-101).
Application: If we apply this result in Example 9.9 for the second equation to be solved,
we have a = t0 , x/b = nθ, and
α set 2X
= Pθ (T ≥ t0 ) = Pθ (X ≤ bnθ) = Pθ ≤ 2nθ = Pθ (χ22t0 ≤ 2nθ).
2 b
Therefore, we set
2nθ = χ22t0 ,1−α/2
and solve for θ (this will give the lower endpoint). A similar argument shows that the upper
endpoint solves
2nθ = χ22(t0 +1),α/2 .
Therefore, a 1 − α confidence set for θ is
1 2 1 2
χ , χ .
2n 2t0 ,1−α/2 2n 2(t0 +1),α/2
PAGE 117
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Remark: Pivoting a discrete cdf can be used to write confidence sets for parameters in other
discrete distributions. For example, a 1 − α confidence interval for a binomial probability p
when using this technique is given by
!
x+1
1 F
n−x 2(x+1),2(n−x),α/2
, ,
1 + n−x+1
x
x+1
F2(n−x+1),2x,α/2 1 + n−x F2(x+1),2(n−x),α/2
where x is the realized value of X ∼ b(n, p) and Fa,b,α/2 is the upper α/2 quantile of an
F distribution with degrees of freedom a and b. This is known as the Clopper-Pearson
confidence interval for p and it can (not surprisingly) be very conservative; see Brown et al.
(2001, Statistical Science). The interval arises by first exploiting the relationship between
the binomial and beta distributions (see CB, Exercise 2.40, pp 82) and then the relationship
which “links” the beta and F distributions (see CB, Theorem 5.3.8, pp 225).
Recall: In the Bayesian paradigm, all inference is carried out using the posterior distribution
π(θ|x). However, because the posterior π(θ|x) is itself a legitimate probability distribution
(for θ, updated after seeing x), we can calculate probabilities involving θ directly by using
this distribution.
Note: Bayesian credible intervals are interpreted differently than confidence intervals.
• Confidence interval interpretation: “If we were to perform the experiment over
and over again, each time under identical conditions, and if we calculated a 1 − α
confidence interval each time the experiment was performed, then 100(1 − α) percent
of the intervals we calculated would contain the true value of θ. Any specific interval
we calculate represents one of these possible intervals.”
• Credible interval interpretation: “The probability our interval contains θ is 1−α.”
PAGE 118
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.10. Suppose that X1 , X2 , ..., Xn are iid Poisson(θ), where the prior distribution
for θ ∼ gamma(a, b), a, b known. In Example 7.10 (notes, pp 38-39), we showed that the
posterior distribution
n
!
X 1
θ|X = x ∼ gamma xi + a, .
i=1
n + 1b
In Example 8.9 (notes, pp 77-78), we used this Bayesian model setup with the number of
goals per game in the 2013-2014 English Premier League season and calculated the posterior
distribution for the mean number of goals θ to be
1 d
θ|X = x ∼ gamma 1060 + 1.5, = gamma(1061.5, 0.002628).
380 + 12
> qgamma(0.025,1061.5,1/0.002628)
[1] 2.624309
> qgamma(0.975,1061.5,1/0.002628)
[1] 2.959913
Q: Why did we select the “equal-tail” quantiles (0.025 and 0.975) in this example?
A: It’s easy!
Note: There are two types of Bayesian credible intervals commonly used: Equal-tail (ET)
intervals and highest posterior density (HPD) intervals.
A = {θ : π(θ|x) ≥ c}
and the credible probability of A is 1 − α. ET and HPD intervals will coincide only when
π(θ|x) is symmetric.
Remark: In practice, because Monte Carlo methods are often used to approximate posterior
distributions, simple ET intervals are usually the preferred choice. HPD intervals can be far
more difficult to construct and are rarely much better than ET intervals.
Note: We will not cover all of the material in this subsection. We will have only a brief
discussion of the relevant topics.
PAGE 119
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Evaluating estimators: When evaluating any interval estimator, there are two important
criteria to consider:
1. Coverage probability. When the coverage probability is not equal to 1 − α for all
θ ∈ Θ (as is usually the case in discrete distributions), we would like it to be as close
as possible to the nominal 1 − α level.
2. Interval length. Shorter intervals are more informative. Interval length (or expected
interval length) depends on the interval’s underlying confidence coefficient.
• It only makes sense to compare two interval estimators (on the basis of inter-
val length) when the intervals have the same coverage probability (or confidence
coefficient).
Example 9.11. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). A 1 − α confidence interval for µ is
S S
C(X) = X + a √ , X + b √ ,
n n
where the constants a and b are quantiles from the tn−1 distribution satisfying
S S
1 − α = Pθ X + a √ ≤ µ ≤ X + b √ .
n n
Which choice of a and b is “best?” More precisely, which choice minimizes the expected
length? The length of this interval is
S
L = (b − a) √ ,
n
Eθ (S) √
Eθ (L) = (b − a) √ = (b − a)c(n)σ/ n,
n
PAGE 120
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Pθ (a ≤ Q ≤ b) = 1 − α,
Remark: The version of Theorem 9.3.2 stated in CB (pp 441-442) is slightly different than
the one I present above; the authors’ version requires that the pdf of Q be unimodal (mine
requires that it be differentiable).
X −µ
Q = Q(X, θ) = √
S/ n
and
s s
C(x) = x + a √ , x + b √ .
n n
If we choose a = −tn−1,α/2 and b = tn−1,α/2 , then the conditions in Theorem 9.3.2 are
satisfied. Therefore,
S S
X − tn−1,α/2 √ , X + tn−1,α/2 √
n n
has the shortest expected length among all 1 − α confidence intervals based on Q.
1 − α = Pθ (a ≤ Q ≤ b) = FQ (b) − FQ (a)
so that
FQ (b) = 1 − α + FQ (a)
and
b = FQ−1 [1 − α + FQ (a)] ≡ b(a), say.
The goal is to minimize b − a = b(a) − a. Taking derivatives, we have (by the Chain Rule)
d d −1
[b(a) − a] = [F [1 − α + FQ (a)] − 1
da da Q
d d −1
= [1 − α + FQ (a)] F (η) − 1,
da dη Q
PAGE 121
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
where η = 1−α+FQ (a). However, note that by the inverse function theorem (from calculus),
d −1 1 1 1
FQ (η) = 0 −1 = 0 = .
dη FQ [FQ (η)] FQ (b) fQ (b)
Therefore,
d fQ (a) set
[b(a) − a] = − 1 = 0 =⇒ fQ (a) = fQ (b).
da fQ (b)
To finish the proof, all we need to show is that
d2
[b(a) − a] > 0
da2
whenever fQ (a) = fQ (b) and fQ0 (a) > fQ0 (b). This will guarantee that the conditions stated
in Theorem 9.3.2 lead to b − a being minimized. 2
Remark: The theorem we have just proven is applicable when an interval’s length (or
expected length) is proportional to b − a. This is often true when θ is a location parameter
and fX (x|θ) is a location family. When an interval’s length is not proportional to b − a, then
Theorem 9.3.2 is not directly applicable. However, we might be able to formulate a modified
version of the theorem that is applicable.
Example 9.12. Suppose X1 , X P2 ,n..., Xn are iid exponential(β), where β > 0. A pivotal
quantity based on T = T (X) = i=1 Xi , a sufficient statistic, is
2T
Q = Q(T, β) = ∼ χ22n .
β
Therefore, we can write
2T 2T 2T
1 − α = Pβ (a ≤ Q ≤ b) = Pβ a≤ ≤b = Pβ ≤β≤ ,
β b a
where a and b are quantiles from the χ22n distribution. In this example, the expected interval
length is not proportional to b − a. Instead, the expected length is
2T 2T 1 1 1 1
Eβ (L) = Eβ − = − Eβ (2T ) = − 2nβ,
a b a b a b
which is proportional to
1 1
− .
a b
Theorem 9.3.2 is therefore not applicable here. To modify the theorem (towards finding a
shortest expected length confidence interval based on Q), we would have to minimize
1 1 1 1
− = −
a b a b(a)
with respect to a subject to the constraint that
Z b(a)
fQ (q)dq = 1 − α,
a
PAGE 122
STAT 713: CHAPTER 10 JOSHUA M. TEBBS
10 Asymptotic Evaluations
10.1 Introduction
Preview: In this chapter, we revisit “large sample theory” and discuss three important
topics in statistical inference:
• Efficiency, consistency
• Large sample properties of maximum likelihood estimators
Our previous inference discussions (i.e., in Chapters 7-9 CB) dealt with finite sample topics
(i.e., unbiasedness, MSE, optimal estimators/tests, confidence intervals based on finite sam-
ple pivots, etc.). We now investigate large sample inference, a topic of utmost importance
in statistical research.
W 1 = X1
X 1 + X2
W2 =
2
X 1 + X2 + X 3
W3 = ,
3
PAGE 123
STAT 713: CHAPTER 10 JOSHUA M. TEBBS
An equivalent definition is
lim Pθ (|Wn − θ| < ) = 1.
n→∞
We call Wn a consistent estimator of θ. What makes consistency “different” from our
p
usual definition of convergence in probability is that we require Wn −→ θ for all θ ∈ Θ. In
other words, convergence of Wn must result for all members of the family {fX (x|θ) : θ ∈ Θ}.
Weak Law of Large Numbers: Suppose that X1 , X2 , ..., Xn are iid with Eθ (X1 ) = µ and
varθ (X1 ) = σ 2 < ∞. Let
n
1X
Xn = Xi
n i=1
denote the sample mean. As an estimator of µ, it is easy to see that the conditions of
Theorem 10.1.3 are satisfied. Therefore, X n is a consistent estimator of Eθ (X1 ) = µ.
PAGE 124
STAT 713: CHAPTER 10 JOSHUA M. TEBBS
Consistency of MLEs: Suppose X1 , X2 , ..., Xn are iid from fX (x|θ), where θ ∈ Θ. Let
denote the maximum likelihood estimator (MLE) of θ. Under “certain regularity conditions,”
it follows that
p
θb −→ θ for all θ ∈ Θ,
as n → ∞. That is, MLEs are consistent estimators.
fX (x|θ1 ) = fX (x|θ2 ) =⇒ θ1 = θ2 .
In other words, different values of θ cannot produce the same probability distribution.
3. The family of pdfs {fX (x|θ) : θ ∈ Θ} has common support X . This means that the
support does not depend on θ. In addition, the pdf fX (x|θ) is differentiable with
respect to θ.
4. The parameter space Θ contains an open set where the true value of θ, say θ0 , resides
as an interior point.
Remark: Conditions 1-4 generally hold for exponential families that are of full rank.
Example 10.1. Suppose X_1, X_2, ..., X_n are iid N(0, θ), where θ > 0. The MLE of θ is
\[
\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i^2.
\]
As an MLE, \hat{\theta} \overset{p}{\to} θ for all θ > 0; i.e., \hat{\theta} is a consistent estimator of θ.
Asymptotic normality of MLEs: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ. Let \hat{\theta} denote the MLE of θ. Under "certain regularity conditions," it follows that
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
Remark: The four regularity conditions above were sufficient conditions for consistency. For asymptotic normality, there are two additional sufficient conditions:
5. The pdf/pmf f_X(x|θ) is three times differentiable with respect to θ, the third derivative is continuous in θ, and \int_{\mathbb{R}} f_X(x|\theta)\, dx can be differentiated three times under the integral sign.
6. There exists a function M(x) such that
\[
\left| \frac{\partial^3}{\partial\theta^3} \ln f_X(x|\theta) \right| \le M(x)
\]
for all x ∈ X and all θ ∈ N_c(θ_0) for some c > 0, where E_{θ_0}[M(X)] < ∞.
Note: We now sketch a casual proof of the asymptotic normality result for MLEs. Let θ_0 denote the true value of θ. Let S(θ) = S(θ|x) denote the score function; i.e.,
\[
S(\theta) = \frac{\partial}{\partial\theta} \ln f_X(\mathbf{x}|\theta).
\]
Note that because \hat{\theta} is an MLE, it solves the score equation; i.e., S(\hat{\theta}) = 0. Therefore, we can write (via Taylor series expansion about θ_0)
\[
0 = S(\hat{\theta}) = S(\theta_0) + \frac{\partial S(\theta_0)}{\partial\theta}(\hat{\theta} - \theta_0) + \frac{1}{2}\frac{\partial^2 S(\hat{\theta}^*)}{\partial\theta^2}(\hat{\theta} - \theta_0)^2,
\]
where \hat{\theta}^* is between θ_0 and \hat{\theta}. Therefore, we have
\[
0 = S(\theta_0) + (\hat{\theta} - \theta_0)\left[\frac{\partial S(\theta_0)}{\partial\theta} + \frac{1}{2}\frac{\partial^2 S(\hat{\theta}^*)}{\partial\theta^2}(\hat{\theta} - \theta_0)\right].
\]
After simple algebra, we have
\[
\sqrt{n}(\hat{\theta} - \theta_0)
= \frac{-\sqrt{n}\, S(\theta_0)}{\dfrac{\partial S(\theta_0)}{\partial\theta} + \dfrac{1}{2}\dfrac{\partial^2 S(\hat{\theta}^*)}{\partial\theta^2}(\hat{\theta} - \theta_0)}
= \frac{-\sqrt{n}\, \dfrac{1}{n}\displaystyle\sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_X(X_i|\theta_0)}
{\dfrac{1}{n}\displaystyle\sum_{i=1}^n \dfrac{\partial^2}{\partial\theta^2} \ln f_X(X_i|\theta_0) + \dfrac{1}{2n}\displaystyle\sum_{i=1}^n \dfrac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*)(\hat{\theta} - \theta_0)}
= \frac{-A}{B+C},
\]
where
\begin{align*}
A &= \sqrt{n}\, \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_X(X_i|\theta_0) \\
B &= \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} \ln f_X(X_i|\theta_0) \\
C &= \frac{1}{2n}\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*)\,(\hat{\theta} - \theta_0).
\end{align*}
By the CLT, A \overset{d}{\to} N(0, I_1(\theta_0)), because the score \frac{\partial}{\partial\theta}\ln f_X(X|\theta_0) has mean 0 and variance I_1(\theta_0) under θ_0; by the WLLN, B \overset{p}{\to} E_{\theta_0}[\frac{\partial^2}{\partial\theta^2}\ln f_X(X|\theta_0)] = -I_1(\theta_0). It remains to handle C. Note that \hat{\theta} - \theta_0 \overset{p}{\to} 0, because \hat{\theta} is consistent (i.e., \hat{\theta} converges in probability to θ_0). Therefore, it suffices to show that
\[
\frac{1}{n}\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*)
\]
converges to something finite (in probability). Note that for n "large enough," i.e., as soon as \hat{\theta}^* ∈ N_c(θ_0) in Regularity Condition 6,
\[
\left| \frac{1}{n}\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*) \right|
\le \frac{1}{n}\sum_{i=1}^n \left| \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*) \right|
\le \frac{1}{n}\sum_{i=1}^n M(X_i) \overset{p}{\longrightarrow} E_{\theta_0}[M(X)] < \infty. \quad \Box
\]
We have shown that C \overset{p}{\to} 0 and hence B + C \overset{p}{\to} -I_1(\theta_0). Finally, note that
\[
\frac{-A}{B+C}
= \underbrace{\frac{1}{B+C}}_{\overset{p}{\to}\, -1/I_1(\theta_0)}\; \underbrace{(-A)}_{\overset{d}{\to}\, N(0,\, I_1(\theta_0))}
\overset{d}{\longrightarrow} N\!\left(0, \frac{1}{I_1(\theta_0)}\right),
\]
by Slutsky's Theorem. Therefore,
\[
\sqrt{n}(\hat{\theta} - \theta_0) = \frac{-A}{B+C} \overset{d}{\longrightarrow} N(0, v(\theta_0)), \quad \text{where } v(\theta_0) = \frac{1}{I_1(\theta_0)}. \quad \Box
\]
Now recall the Delta Method from Chapter 5; i.e., if g : \mathbb{R} \to \mathbb{R} is differentiable at θ and g'(θ) ≠ 0, then
\[
\sqrt{n}\,[g(\hat{\theta}) - g(\theta)] \overset{d}{\longrightarrow} N\!\left(0,\, [g'(\theta)]^2 v(\theta)\right).
\]
Therefore, not only are MLEs asymptotically normal, but functions of MLEs are too.
Example 10.1 (continued). Suppose X_1, X_2, ..., X_n are iid N(0, θ), where θ > 0. The MLE of θ is
\[
\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i^2.
\]
We know that \hat{\theta} \overset{p}{\to} θ, as n → ∞. We now derive the asymptotic distribution of \hat{\theta} (suitably centered and scaled). We know
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
Therefore, all we need to do is calculate I_1(θ). The pdf of X is, for all x ∈ \mathbb{R},
\[
f_X(x|\theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/2\theta}.
\]
Therefore,
\[
\ln f_X(x|\theta) = -\frac{1}{2}\ln(2\pi\theta) - \frac{x^2}{2\theta}.
\]
The derivatives of \ln f_X(x|\theta) are
\begin{align*}
\frac{\partial}{\partial\theta} \ln f_X(x|\theta) &= -\frac{1}{2\theta} + \frac{x^2}{2\theta^2} \\
\frac{\partial^2}{\partial\theta^2} \ln f_X(x|\theta) &= \frac{1}{2\theta^2} - \frac{x^2}{\theta^3}.
\end{align*}
Therefore,
\[
I_1(\theta) = -E_\theta\!\left[\frac{\partial^2}{\partial\theta^2} \ln f_X(X|\theta)\right] = E_\theta\!\left[\frac{X^2}{\theta^3} - \frac{1}{2\theta^2}\right] = \frac{\theta}{\theta^3} - \frac{1}{2\theta^2} = \frac{1}{2\theta^2}
\]
and
\[
v(\theta) = \frac{1}{I_1(\theta)} = 2\theta^2.
\]
We have
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, 2\theta^2).
\]
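Remark: This limiting distribution is easy to check by simulation. The sketch below (illustrative, not part of the original notes) simulates √n(θ̂ − θ) for N(0, θ) data and compares its variance to 2θ²; the values of θ, n, and the number of replications are arbitrary choices.

import numpy as np

rng = np.random.default_rng(713)
theta, n, reps = 4.0, 500, 20000
x = rng.normal(0.0, np.sqrt(theta), size=(reps, n))  # N(0, theta): standard deviation sqrt(theta)
theta_hat = (x ** 2).mean(axis=1)                    # MLE (1/n) sum X_i^2, one per replication
z = np.sqrt(n) * (theta_hat - theta)
print(z.var())                                       # should be close to 2*theta^2 = 32
print(2 * theta ** 2)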
Exercise: Use the Delta Method to derive the large sample distributions of g_1(\hat{\theta}) = \hat{\theta}^2, g_2(\hat{\theta}) = e^{\hat{\theta}}, and g_3(\hat{\theta}) = \ln \hat{\theta}, suitably centered and scaled.
In addition, if \widehat{v(\theta)} is any consistent estimator of v(θ), then
\[
Z_n^* = \frac{\hat{\theta} - \theta}{\sqrt{\widehat{v(\theta)}/n}}
= \underbrace{\frac{\hat{\theta} - \theta}{\sqrt{v(\theta)/n}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\sqrt{\frac{v(\theta)}{\widehat{v(\theta)}}}}_{\overset{p}{\to}\, 1}
\overset{d}{\longrightarrow} N(0,1),
\]
by Slutsky's Theorem.
Summary:
1. We start with a sequence of estimators (e.g., an MLE sequence, etc.) satisfying
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)).
\]
2. We find a consistent estimator \widehat{v(\theta)} of the asymptotic variance v(θ).
3. We form Z_n^* as above, which converges in distribution to N(0, 1).
One can then use Z_n^* to formulate large sample (Wald) hypothesis tests and confidence intervals; see Sections 10.3 and 10.4, respectively.
Example 10.1 (continued). Suppose X_1, X_2, ..., X_n are iid N(0, θ), where θ > 0. The MLE of θ is
\[
\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i^2.
\]
We have shown that
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, 2\theta^2)
\iff
Z_n = \frac{\hat{\theta} - \theta}{\sqrt{2\theta^2/n}} \overset{d}{\longrightarrow} N(0, 1).
\]
A consistent estimator of v(θ) = 2θ² is \widehat{v(\theta)} = 2\hat{\theta}^2, by continuity. Therefore,
\[
Z_n^* = \frac{\hat{\theta} - \theta}{\sqrt{2\hat{\theta}^2/n}}
= \underbrace{\frac{\hat{\theta} - \theta}{\sqrt{2\theta^2/n}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\sqrt{\frac{2\theta^2}{2\hat{\theta}^2}}}_{\overset{p}{\to}\, 1}
\overset{d}{\longrightarrow} N(0, 1),
\]
by Slutsky's Theorem.
The asymptotic relative efficiency (ARE) is commonly used to compare two competing consistent estimators; the comparison is made on the basis of each estimator's large sample distribution, with the ARE of V to W taken as the ratio of V's asymptotic variance to that of W.
Remark: Before we do an example illustrating ARE, let's have a brief discussion about sample quantile estimators.
Sample quantiles: Suppose X_1, X_2, ..., X_n are iid with continuous cdf F. Define
\[
\phi_p = F^{-1}(p) = \inf\{x \in \mathbb{R} : F(x) \ge p\}.
\]
We call \phi_p the pth quantile of the distribution of X. Note that if F is strictly increasing, then F^{-1}(p) is well defined by
\[
\phi_p = F^{-1}(p) \iff F(\phi_p) = p.
\]
The simplest definition of the sample pth quantile is \widehat{F}_n^{-1}(p), where
\[
\widehat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \le x)
\]
is the empirical distribution function (edf). The edf is a non-decreasing step function that takes steps of size 1/n at each observed X_i. Therefore,
\[
\hat{\phi}_p \equiv \widehat{F}_n^{-1}(p) =
\begin{cases}
X_{(np)}, & np \in \mathbb{Z}^+ \\
X_{(\lfloor np \rfloor + 1)}, & \text{otherwise}.
\end{cases}
\]
This is just a fancy way of saying that the sample pth quantile is one of the order statistics (note that other books may define this differently; e.g., by averaging order statistics, etc.).
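The definition above is simple to code directly. Here is a small sketch (illustrative, not part of the original notes) that returns the edf-based sample pth quantile as the appropriate order statistic.

import numpy as np

def sample_quantile(x, p):
    x_sorted = np.sort(x)              # order statistics X_(1) <= ... <= X_(n)
    n = len(x_sorted)
    np_ = n * p
    if float(np_).is_integer():
        k = int(np_)                   # np in Z+: take X_(np)
    else:
        k = int(np.floor(np_)) + 1     # otherwise: take X_(floor(np)+1)
    return x_sorted[k - 1]             # convert 1-based order statistic to 0-based index

rng = np.random.default_rng(0)
x = rng.normal(size=101)
print(sample_quantile(x, 0.5))         # sample median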
Whenever I teach STAT 823, I prove that
\[
\sqrt{n}(\hat{\phi}_p - \phi_p) \overset{d}{\longrightarrow} N\!\left(0, \frac{p(1-p)}{f^2(\phi_p)}\right),
\]
where f is the population pdf of X. For example, if p = 0.5, then \phi_p = \phi_{0.5} is the median of X and the sample median \hat{\phi}_{0.5} satisfies
\[
\sqrt{n}(\hat{\phi}_{0.5} - \phi_{0.5}) \overset{d}{\longrightarrow} N\!\left(0, \frac{1}{4f^2(\phi_{0.5})}\right).
\]
Example 10.2. Suppose X_1, X_2, ..., X_n are iid N(µ, σ²), where −∞ < µ < ∞ and σ² > 0; i.e., both parameters are unknown. Consider the following two estimators W_n = \overline{X}_n and V_n = \hat{\phi}_{0.5} as estimators of µ. Note that because the N(µ, σ²) population distribution is symmetric, the population median \phi_{0.5} = µ as well.
We know that
\[
\sqrt{n}(\overline{X}_n - \mu) \sim N(0, \sigma^2),
\]
that is, this "limiting distribution" is the exact distribution of \sqrt{n}(\overline{X}_n - \mu) for each n. From our previous discussion on sample quantiles, we know that
\[
\sqrt{n}(\hat{\phi}_{0.5} - \mu) \overset{d}{\longrightarrow} N\!\left(0, \frac{1}{4f^2(\phi_{0.5})}\right),
\]
where (under the normal assumption),
\[
\frac{1}{4f^2(\phi_{0.5})} = \frac{1}{4f^2(\mu)} = \frac{1}{4\left(\dfrac{1}{\sqrt{2\pi}\,\sigma}\right)^2} = \frac{\pi}{2}\sigma^2.
\]
Therefore, the asymptotic relative efficiency of the sample median \hat{\phi}_{0.5} when compared to the sample mean \overline{X}_n is
\[
\text{ARE}(\hat{\phi}_{0.5} \text{ to } \overline{X}_n) = \frac{\frac{\pi}{2}\sigma^2}{\sigma^2} = \frac{\pi}{2} \approx 1.57.
\]
Interpretation: The sample median \hat{\phi}_{0.5} would require 57 percent more observations to achieve the same level of (asymptotic) precision as \overline{X}_n.
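A quick Monte Carlo check (illustrative code, not part of the original notes): for normal data the finite-sample variance ratio of the sample median to the sample mean is already close to π/2 ≈ 1.57 at moderate n. The sample size and replication count below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 0.0, 1.0, 200, 20000
samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)     # np.median averages the two middle order statistics for even n
print(np.var(medians) / np.var(means))   # roughly pi/2
print(np.pi / 2)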
Example 10.3. Suppose X1 , X2 , ..., Xn are iid beta(θ, 1), where θ > 0.
[Figure: ARE of the method of moments estimator to the MLE; vertical axis 0 to 5, horizontal axis 0 to 5.]
Remark: In Chapter 8 (CB), we discussed methods to derive hypothesis tests and also
optimality issues based on finite sample criteria. These discussions revealed that optimal
tests (e.g., UMP tests) were available for just a small collection of problems (some of which
were not realistic).
Preview: In this section, we present three large sample approaches to formulate hypothesis tests:
1. Wald (1943)
2. Score (Rao, 1948); also known as "Lagrange multiplier tests"
3. Likelihood ratio (Neyman and Pearson, 1928).
Recall: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. As long as suitable regularity conditions hold, we know that an MLE \hat{\theta} satisfies
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
If v(θ) is a continuous function of θ, then v(\hat{\theta}) \overset{p}{\to} v(θ) and
\[
\frac{\hat{\theta} - \theta}{\sqrt{v(\hat{\theta})/n}}
= \underbrace{\frac{\hat{\theta} - \theta}{\sqrt{v(\theta)/n}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\sqrt{\frac{v(\theta)}{v(\hat{\theta})}}}_{\overset{p}{\to}\, 1}
\overset{d}{\longrightarrow} N(0, 1),
\]
by Slutsky's Theorem. This forms the basis for the Wald test.
Wald statistic: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
When H_0 is true, the Wald statistic
\[
Z_n^W = \frac{\hat{\theta} - \theta_0}{\sqrt{v(\hat{\theta})/n}}
\]
converges in distribution to N(0, 1). Therefore,
\[
R = \{\mathbf{x} \in \mathcal{X} : |z_n^W| \ge z_{\alpha/2}\},
\]
where z_{\alpha/2} is the upper α/2 quantile of the N(0, 1) distribution, is an approximate size α rejection region for testing H_0 versus H_1. One-sided tests also use Z_n^W; the only thing that changes is the form of R.
Example 10.4. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive the
Wald test of
H0 : p = p0
versus
H1 : p ≠ p0.
Solution. Recall that the MLE of p is \hat{p} = \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, and that
\[
\sqrt{n}(\hat{p} - p) \overset{d}{\longrightarrow} N(0, v(p)), \quad \text{where } v(p) = \frac{1}{I_1(p)}.
\]
We now calculate I_1(p). The pmf of X is, for x = 0, 1,
\[
f_X(x|p) = p^x (1-p)^{1-x}.
\]
Therefore,
\[
\ln f_X(x|p) = x \ln p + (1-x) \ln(1-p).
\]
The derivatives of \ln f_X(x|p) are
\begin{align*}
\frac{\partial}{\partial p} \ln f_X(x|p) &= \frac{x}{p} - \frac{1-x}{1-p} \\
\frac{\partial^2}{\partial p^2} \ln f_X(x|p) &= -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}.
\end{align*}
Therefore,
\[
I_1(p) = -E_p\!\left[\frac{\partial^2}{\partial p^2} \ln f_X(X|p)\right] = E_p\!\left[\frac{X}{p^2} + \frac{1-X}{(1-p)^2}\right] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p(1-p)}
\]
and
\[
v(p) = \frac{1}{I_1(p)} = p(1-p).
\]
We have
\[
\sqrt{n}(\hat{p} - p) \overset{d}{\longrightarrow} N(0, p(1-p)).
\]
Because the asymptotic variance v(p) = p(1-p) is a continuous function of p, it can be consistently estimated by v(\hat{p}) = \hat{p}(1-\hat{p}). The Wald statistic to test H_0: p = p_0 versus H_1: p ≠ p_0 is given by
\[
Z_n^W = \frac{\hat{p} - p_0}{\sqrt{v(\hat{p})/n}} = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})/n}}.
\]
An approximate size α rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : |z_n^W| \ge z_{\alpha/2}\}.
\]
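The Wald test is straightforward to compute. A short sketch (illustrative, not part of the original notes; the data and p_0 are arbitrary choices):

import numpy as np
from scipy.stats import norm

def wald_test(x, p0, alpha=0.05):
    x = np.asarray(x)
    n = len(x)
    p_hat = x.mean()                                   # MLE of p
    z_w = (p_hat - p0) / np.sqrt(p_hat * (1 - p_hat) / n)
    reject = abs(z_w) >= norm.ppf(1 - alpha / 2)       # |z_W| >= z_{alpha/2}
    return z_w, reject

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.35, size=100)                    # simulated Bernoulli data
print(wald_test(x, p0=0.3))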
Motivation: Suppose X1 , X2 , ..., Xn are iid from fX (x|θ), where θ ∈ Θ ⊆ R. Recall that
the score function, when viewed as random, is
\[
S(\theta|\mathbf{X}) = \frac{\partial}{\partial\theta} \ln L(\theta|\mathbf{X}) \overset{\text{iid}}{=} \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_X(X_i|\theta),
\]
which means
\[
\frac{\frac{1}{n} S(\theta|\mathbf{X})}{\sqrt{I_1(\theta)/n}} = \frac{S(\theta|\mathbf{X})}{\sqrt{nI_1(\theta)}} = \frac{S(\theta|\mathbf{X})}{\sqrt{I_n(\theta)}} \overset{d}{\longrightarrow} N(0, 1),
\]
where recall In (θ) = nI1 (θ) is the Fisher information based on all n iid observations. There-
fore, the score function divided by the square root of the Fisher information (based on all
n observations) behaves asymptotically like a N (0, 1) random variable. This fact forms the
basis for the score test.
Score statistic: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
When H_0 is true, the score statistic
\[
Z_n^S = \frac{S(\theta_0|\mathbf{X})}{\sqrt{I_n(\theta_0)}}
\]
converges in distribution to N(0, 1), so R = \{\mathbf{x} \in \mathcal{X} : |z_n^S| \ge z_{\alpha/2}\} is an approximate size α rejection region for testing H_0 versus H_1.
Example 10.5. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive the
score test of
H0 : p = p0
versus
H1 : p ≠ p0.
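With the information evaluated at p_0, the score statistic reduces to (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n} (cf. Example 10.9, where this simplification is derived). A short computational sketch (illustrative, not part of the original notes):

import numpy as np
from scipy.stats import norm

def score_test(x, p0, alpha=0.05):
    x = np.asarray(x)
    n = len(x)
    p_hat = x.mean()
    # S(p0|X)/sqrt(I_n(p0)) simplifies to (p_hat - p0)/sqrt(p0(1-p0)/n)
    z_s = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    reject = abs(z_s) >= norm.ppf(1 - alpha / 2)
    return z_s, reject

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.35, size=100)
print(score_test(x, p0=0.3))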
Setting: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
Suppose the regularity conditions needed for MLEs to be consistent and asymptotically normal hold. When H_0 is true, the LRT statistic λ(X) = L(θ_0|X)/L(\hat{\theta}|X) satisfies
\[
-2 \ln \lambda(\mathbf{X}) \overset{d}{\longrightarrow} \chi^2_1.
\]
Because small values of λ(x) are evidence against H_0, large values of −2 ln λ(x) are too. Therefore,
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{1,\alpha}\},
\]
where \chi^2_{1,\alpha} is the upper α quantile of the \chi^2_1 distribution, is an approximate size α rejection region for testing H_0 versus H_1.
where \hat{\theta}^* is between \hat{\theta} and θ_0. Now write \frac{\partial}{\partial\theta}\ln L(\theta_0) in a Taylor series expansion about \hat{\theta}; that is,
\[
\frac{\partial}{\partial\theta} \ln L(\theta_0) = \underbrace{\frac{\partial}{\partial\theta} \ln L(\hat{\theta})}_{=\,0} + (\theta_0 - \hat{\theta})\,\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**}),
\]
where \hat{\theta}^{**} is between θ_0 and \hat{\theta}. Note that \frac{\partial}{\partial\theta}\ln L(\hat{\theta}) = 0 because \hat{\theta} solves the score equation.
From the last equation, we have
\[
\frac{1}{\sqrt{n}} \frac{\partial}{\partial\theta} \ln L(\theta_0) = \sqrt{n}(\hat{\theta} - \theta_0)\left[-\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**})\right]. \tag{10.2}
\]
Substituting (10.2) into the expansion of \ln L(\hat{\theta}) about θ_0 gives
\[
\ln L(\hat{\theta}) = \ln L(\theta_0) + \sqrt{n}(\hat{\theta} - \theta_0)\,\sqrt{n}(\hat{\theta} - \theta_0)\left[-\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**})\right] + \frac{n}{2}(\hat{\theta} - \theta_0)^2\, \frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^*)
\]
so that
\[
\ln L(\hat{\theta}) - \ln L(\theta_0) = n(\hat{\theta} - \theta_0)^2\left[-\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**})\right] + \frac{n}{2}(\hat{\theta} - \theta_0)^2\left[\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^*)\right]. \tag{10.3}
\]
Because \hat{\theta} is consistent (and because H_0 is true), we know that \hat{\theta} \overset{p}{\to} θ_0, as n → ∞. Therefore, because \hat{\theta}^* and \hat{\theta}^{**} are both trapped between \hat{\theta} and θ_0, both terms in the brackets, i.e.,
\[
\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**}) \quad \text{and} \quad \frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^*),
\]
converge in probability to
\[
E_{\theta_0}\!\left[\frac{\partial^2}{\partial\theta^2} \ln f_X(X|\theta_0)\right] = -I_1(\theta_0),
\]
by the WLLN. Therefore, the RHS of Equation (10.3) will behave in the limit the same as
\[
\frac{n}{2}(\hat{\theta} - \theta_0)^2 I_1(\theta_0) = \frac{1}{2}\sqrt{n}(\hat{\theta} - \theta_0)\,\sqrt{n}(\hat{\theta} - \theta_0)\, I_1(\theta_0)
= \frac{1}{2}\, \underbrace{\frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{1/I_1(\theta_0)}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{1/I_1(\theta_0)}}}_{\overset{d}{\to}\, N(0,1)}
\overset{d}{\longrightarrow} \frac{1}{2}\chi^2_1,
\]
so that −2 ln λ(X) = 2[\ln L(\hat{\theta}|\mathbf{X}) - \ln L(\theta_0|\mathbf{X})] converges in distribution to \chi^2_1, as claimed. □
Example 10.6. Suppose X_1, X_2, ..., X_n are iid Bernoulli(p), where 0 < p < 1. Derive the large sample LRT of
H0 : p = p0
versus
H1 : p ≠ p0.
Solution. With \hat{p} = \overline{X}_n denoting the unrestricted MLE,
\[
-2 \ln \lambda(\mathbf{X}) = -2\left[\sum_{i=1}^n X_i \ln\!\left(\frac{p_0}{\hat{p}}\right) + \left(n - \sum_{i=1}^n X_i\right) \ln\!\left(\frac{1-p_0}{1-\hat{p}}\right)\right]
= -2\left[n\hat{p}\, \ln\!\left(\frac{p_0}{\hat{p}}\right) + n(1-\hat{p})\, \ln\!\left(\frac{1-p_0}{1-\hat{p}}\right)\right].
\]
An approximate size α rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{1,\alpha}\}.
\]
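A computational sketch of this large sample LRT (illustrative, not part of the original notes; it assumes 0 < \hat{p} < 1 so the logarithms are defined):

import numpy as np
from scipy.stats import chi2

def lrt_test(x, p0, alpha=0.05):
    x = np.asarray(x)
    n = len(x)
    p_hat = x.mean()                      # assumed strictly between 0 and 1
    stat = -2 * (n * p_hat * np.log(p0 / p_hat)
                 + n * (1 - p_hat) * np.log((1 - p0) / (1 - p_hat)))
    reject = stat >= chi2.ppf(1 - alpha, df=1)
    return stat, reject

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.35, size=100)
print(lrt_test(x, p0=0.3))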
Monte Carlo Simulation: When X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1,
we have derived the Wald, score, and large sample LRT for testing H0 : p = p0 versus
H1 : p 6= p0 . Each test is a large sample test, so the size of each one is approximately equal
to α when n is large. We now perform a simulation to assess finite sample characteristics.
The results from this simulation study are shown in Table 10.1.
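A sketch of the kind of size study summarized in Table 10.1 (illustrative code, not the author's; shown here only for the Wald test, with samples in which \hat{p} equals 0 or 1 counted as non-rejections):

import numpy as np
from scipy.stats import norm

def wald_rejects(x, p0, alpha=0.05):
    n, p_hat = len(x), x.mean()
    if p_hat in (0.0, 1.0):
        return False                      # Wald statistic undefined; count as a non-rejection
    z = (p_hat - p0) / np.sqrt(p_hat * (1 - p_hat) / n)
    return abs(z) >= norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(2018)
reps = 10000
for n in (20, 50, 100):
    for p0 in (0.1, 0.3):
        data = rng.binomial(1, p0, size=(reps, n))    # simulate under H0
        size_hat = np.mean([wald_rejects(row, p0) for row in data])
        print(n, p0, size_hat)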
Table 10.1: Monte Carlo simulation. Size estimates of nominal α = 0.05 Wald, score, and
LRTs for a binomial proportion p when n = 20, 50, 100 and p0 = 0.1, 0.3.
Important: Note that these sizes are really estimates of the true sizes (at each setting of n
and p0 ). Therefore, we should acknowledge that these are estimates and report the margin
of error associated with them.
• Because these are nominal size 0.05 tests, the margin of error associated with each "estimate," assuming a 99 percent confidence level, is equal to
\[
B = 2.58\sqrt{\frac{0.05(1-0.05)}{10000}} \approx 0.0056.
\]
• Size estimates between 0.0444 and 0.0556 indicate that the test is operating at the
nominal level. I have bolded the estimates in Table 10.1 that are within these bounds.
• Values < 0.0444 suggest the test is conservative (it rejects less often than it should under H0). Values > 0.0556 suggest the test is anti-conservative (it rejects too often under H0).
Summary: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Assume that the regularity conditions needed for MLEs to be consistent and asymptotically normal (CAN) hold. We have presented three large sample procedures to test
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
• Wald:
\[
Z_n^W = \frac{\hat{\theta} - \theta_0}{\sqrt{\widehat{v(\theta)}/n}} = \frac{\hat{\theta} - \theta_0}{\sqrt{1/\{nI_1(\hat{\theta})\}}} \overset{d}{\longrightarrow} N(0, 1)
\]
• Score:
\[
Z_n^S = \frac{S(\theta_0|\mathbf{X})}{\sqrt{I_n(\theta_0)}} \overset{d}{\longrightarrow} N(0, 1)
\]
• LRT:
\[
-2\ln\lambda(\mathbf{X}) = -2[\ln L(\theta_0|\mathbf{X}) - \ln L(\hat{\theta}|\mathbf{X})] \overset{d}{\longrightarrow} \chi^2_1.
\]
• Note that (ZnW )2 , (ZnS )2 , and −2 ln λ(X) each converge in distribution to a χ21 distri-
bution as n → ∞.
• In terms of power (i.e., rejecting H0 when H1 is true), all three testing procedures are
asymptotically equivalent when examining certain types of alternative sequences
(i.e., Pitman sequences of alternatives). For these alternative sequences, (ZnW )2 , (ZnS )2 ,
and −2 ln λ(X) each converge to the same (noncentral) χ21 (λ) distribution. However,
the powers may be quite different in finite samples.
Remark: The large sample LRT procedure can be easily generalized to multi-parameter
hypotheses.
Theorem 10.3.3. Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}^k. Assume that the regularity conditions needed for MLEs to be CAN hold. Consider testing
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θ0^c
and define
\[
\lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} L(\theta|\mathbf{x})}{\sup_{\theta \in \Theta} L(\theta|\mathbf{x})} = \frac{L(\hat{\theta}_0|\mathbf{x})}{L(\hat{\theta}|\mathbf{x})}.
\]
If θ ∈ Θ0, then
\[
-2\ln\lambda(\mathbf{X}) = -2[\ln L(\hat{\theta}_0|\mathbf{X}) - \ln L(\hat{\theta}|\mathbf{X})] \overset{d}{\longrightarrow} \chi^2_\nu,
\]
where ν = dim(Θ) − dim(Θ0), the number of "free parameters" between Θ and Θ0. An approximate size α rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{\nu,\alpha}\}.
\]
Example 10.7. McCann and Tebbs (2009) summarize a study examining perceived unmet
need for dental health care for people with HIV infection. Baseline in-person interviews were
conducted with 2,864 HIV infected individuals (aged 18 years and older) as part of the HIV
Cost and Services Utilization Study. Define
X1 = number of patients with private insurance
X2 = number of patients with Medicare and private insurance
X3 = number of patients without insurance
X4 = number of patients with Medicare but no private insurance.
Letting p_i denote the probability that a randomly selected patient falls into the ith category, so that (X1, X2, X3, X4) follows a multinomial distribution with 2,864 trials, consider testing
H0 : p1 = p2 = p3 = p4 = 1/4
versus
H1 : H0 not true.
Under H0, the restricted parameter space is
\[
\Theta_0 = \{\boldsymbol{\theta} = (p_1, p_2, p_3, p_4) : p_1 = p_2 = p_3 = p_4 = 1/4\},
\]
the singleton (1/4, 1/4, 1/4, 1/4). The entire parameter space is
\[
\Theta = \left\{\boldsymbol{\theta} = (p_1, p_2, p_3, p_4) : 0 < p_i < 1 \text{ for } i = 1, 2, 3, 4;\ \sum_{i=1}^4 p_i = 1\right\}.
\]
The likelihood ratio statistic is
\[
\lambda(\mathbf{x}) = \lambda(x_1, x_2, x_3, x_4) = \frac{L(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})}{L(\hat{p}_1, \hat{p}_2, \hat{p}_3, \hat{p}_4)}
= \frac{\dfrac{2864!}{x_1!\, x_2!\, x_3!\, x_4!}\, (\tfrac{1}{4})^{x_1} (\tfrac{1}{4})^{x_2} (\tfrac{1}{4})^{x_3} (\tfrac{1}{4})^{x_4}}
{\dfrac{2864!}{x_1!\, x_2!\, x_3!\, x_4!}\, (\tfrac{x_1}{2864})^{x_1} (\tfrac{x_2}{2864})^{x_2} (\tfrac{x_3}{2864})^{x_3} (\tfrac{x_4}{2864})^{x_4}}
= \prod_{i=1}^4 \left(\frac{2864}{4x_i}\right)^{x_i}.
\]
Here ν = dim(Θ) − dim(Θ0) = 3 − 0 = 3 and \chi^2_{3, 0.05} ≈ 7.81, so an approximate size α = 0.05 rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge 7.81\}.
\]
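Computing −2 ln λ(x) is simple once the four counts are in hand. The sketch below is illustrative only (it is not part of the original notes, and the counts used are made-up placeholders, not the study's data):

import numpy as np
from scipy.stats import chi2

x = np.array([900, 700, 650, 614])          # hypothetical category counts; sum is 2864
n = x.sum()
p0 = np.full(4, 0.25)                       # null probabilities under H0
p_hat = x / n                               # unrestricted MLEs x_i / 2864
stat = -2 * np.sum(x * np.log(p0 / p_hat))  # -2 ln lambda(x)
print(stat, stat >= chi2.ppf(0.95, df=3))   # compare to chi^2_{3, 0.05} = 7.81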
Preview: Each of the large sample tests above can be inverted to produce a confidence interval. We consider three constructions:
1. Wald
2. Score
3. Likelihood ratio.
These are known as the "large sample likelihood based confidence intervals."
Definition: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. The random variable
\[
Q_n = Q_n(\mathbf{X}, \theta)
\]
is called a large sample pivot if its asymptotic distribution is free of all unknown parameters. If Q_n is a large sample pivot and if
\[
P_\theta(Q_n(\mathbf{X}, \theta) \in A) \approx 1 - \alpha,
\]
then the set C(\mathbf{x}) = \{\theta : Q_n(\mathbf{x}, \theta) \in A\} is an approximate 1 − α confidence set for θ.
Recall: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. As long as suitable regularity conditions hold, we know that an MLE \hat{\theta} satisfies
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
If v(θ) is a continuous function of θ, then \widehat{v(\theta)} = v(\hat{\theta}) \overset{p}{\to} v(θ), for all θ; i.e., \widehat{v(\theta)} is a consistent estimator of v(θ), and
\[
Q_n(\mathbf{X}, \theta) = \frac{\hat{\theta} - \theta}{\sqrt{\widehat{v(\theta)}/n}} \overset{d}{\longrightarrow} N(0, 1),
\]
by Slutsky's Theorem. Therefore, Q_n(\mathbf{X}, \theta) is a large sample pivot and
\[
1 - \alpha \approx P_\theta\!\left(-z_{\alpha/2} \le Q_n(\mathbf{X}, \theta) \le z_{\alpha/2}\right)
= P_\theta\!\left(-z_{\alpha/2} \le \frac{\hat{\theta} - \theta}{\sqrt{\widehat{v(\theta)}/n}} \le z_{\alpha/2}\right)
= P_\theta\!\left(\hat{\theta} - z_{\alpha/2}\sqrt{\frac{\widehat{v(\theta)}}{n}} \le \theta \le \hat{\theta} + z_{\alpha/2}\sqrt{\frac{\widehat{v(\theta)}}{n}}\right).
\]
Therefore,
\[
\hat{\theta} \pm z_{\alpha/2}\sqrt{\frac{\widehat{v(\theta)}}{n}}
\]
is an approximate 1 − α confidence interval for θ.
Remark: We could have arrived at this same interval by inverting the large sample (Wald) test of
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
Extension: We can also write large sample Wald confidence intervals for functions of θ using the Delta Method. Recall that if g : \mathbb{R} \to \mathbb{R} is differentiable at θ and g'(θ) ≠ 0, then
\[
\sqrt{n}\,[g(\hat{\theta}) - g(\theta)] \overset{d}{\longrightarrow} N\!\left(0,\, [g'(\theta)]^2 v(\theta)\right).
\]
If [g'(θ)]²v(θ) is a continuous function of θ, then we can find a consistent estimator for it, namely [g'(\hat{\theta})]^2 v(\hat{\theta}), because MLEs are consistent themselves and consistency is preserved under continuous mappings. Therefore,
\[
Q_n(\mathbf{X}, \theta) = \frac{g(\hat{\theta}) - g(\theta)}{\sqrt{[g'(\hat{\theta})]^2 v(\hat{\theta})/n}} \overset{d}{\longrightarrow} N(0, 1),
\]
so that
\[
g(\hat{\theta}) \pm z_{\alpha/2}\sqrt{\frac{[g'(\hat{\theta})]^2 v(\hat{\theta})}{n}}
\]
is an approximate 1 − α confidence interval for g(θ).
Example 10.8. Suppose X_1, X_2, ..., X_n are iid Bernoulli(p), where 0 < p < 1. From Example 10.4, the MLE is \hat{p} = \overline{X}_n, with v(p) = p(1-p) estimated consistently by \hat{p}(1-\hat{p}). Therefore,
\[
\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
is an approximate 1 − α Wald confidence interval for p. The problems with this interval (i.e., in conferring the nominal coverage probability) are well known; see Brown et al. (2001).
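A small computational sketch of the Wald interval (illustrative, not part of the original notes):

import numpy as np
from scipy.stats import norm

def wald_ci(x, alpha=0.05):
    x = np.asarray(x)
    n, p_hat = len(x), x.mean()
    half = norm.ppf(1 - alpha / 2) * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=50)
print(wald_ci(x))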
Remarks: As you can see, constructing (large sample) Wald confidence intervals is straight-
forward. We rely on the MLE being consistent and asymptotically normal (CAN) and also
on being able to find a consistent estimator of the asymptotic variance of the MLE.
• More generally, if you have an estimator θb (not necessarily an MLE) that is asymp-
totically normal and if you can estimate its (large sample) variance consistently, you
can do Wald inference. This general strategy for large sample inference is ubiquitous
in statistical research.
• The problem, of course, is that because large sample standard errors must be estimated,
the performance of Wald confidence intervals (and tests) can be poor in small samples.
Brown et al. (2001) highlights this for the binomial proportion; however, this behavior
is seen in other settings.
• I view Wald inference as a “fall back.” It is what to do when no other large sample
inference procedures are available; i.e., “having something is better than nothing.”
• Of course, in very large sample settings (e.g., large scale Phase III clinical trials, public
health studies with thousands of individuals, etc.), Wald inference is usually the default
approach (probably because of its simplicity) and is generally satisfactory.
Recall: Suppose X1 , X2 , ..., Xn are iid from fX (x|θ), where θ ∈ Θ ⊆ R. We have shown
previously that
\[
Q_n(\mathbf{X}, \theta) = \frac{S(\theta|\mathbf{X})}{\sqrt{I_n(\theta)}} \overset{d}{\longrightarrow} N(0, 1),
\]
where In (θ) = nI1 (θ) is the Fisher information based on the sample.
Motivation: Score confidence intervals arise from inverting (large sample) score tests. Recall that in testing H0: θ = θ0 versus H1: θ ≠ θ0, the score statistic satisfies
\[
Q_n(\mathbf{X}, \theta_0) = \frac{S(\theta_0|\mathbf{X})}{\sqrt{I_n(\theta_0)}} \overset{d}{\longrightarrow} N(0, 1)
\]
when H0 is true, so that R = \{\mathbf{x} \in \mathcal{X} : |Q_n(\mathbf{x}, \theta_0)| \ge z_{\alpha/2}\} is an approximate size α rejection region for testing H0 versus H1. The acceptance region is \{\mathbf{x} \in \mathcal{X} : |Q_n(\mathbf{x}, \theta_0)| < z_{\alpha/2}\}; inverting it (i.e., collecting all values of θ0 that would not be rejected) gives the score confidence set C(\mathbf{x}) = \{\theta : |Q_n(\mathbf{x}, \theta)| < z_{\alpha/2}\}.
Example 10.9. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive a
1 − α (large sample) score confidence interval for p.
Solution. From Example 10.5, we have
\[
Q_n(\mathbf{X}, p) = \frac{S(p|\mathbf{X})}{\sqrt{I_n(p)}}
= \frac{\dfrac{\sum_{i=1}^n X_i}{p} - \dfrac{n - \sum_{i=1}^n X_i}{1-p}}{\sqrt{\dfrac{n}{p(1-p)}}}
= \frac{\hat{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}}.
\]
From our discussion above, the (random) set
\[
C(\mathbf{X}) = \{p : |Q_n(\mathbf{X}, p)| < z_{\alpha/2}\} = \left\{p : \left|\frac{\hat{p} - p}{\sqrt{p(1-p)/n}}\right| < z_{\alpha/2}\right\}
\]
forms the score interval for p. After observing X = x, this interval could be calculated numerically (e.g., using a grid search over values of p that satisfy this inequality). However, in the binomial case, we can get closed-form expressions for the endpoints. To see why, note that the boundary
\[
|Q_n(\mathbf{x}, p)| = z_{\alpha/2} \iff (\hat{p} - p)^2 = z_{\alpha/2}^2\, \frac{p(1-p)}{n}.
\]
After algebra, this equation becomes
\[
\left(1 + \frac{z_{\alpha/2}^2}{n}\right) p^2 - \left(2\hat{p} + \frac{z_{\alpha/2}^2}{n}\right) p + \hat{p}^2 = 0.
\]
The LHS of the last equation is a quadratic function of p. The roots of this equation, if they
are real, delimit the score interval for p. Using the quadratic formula, the lower and upper
limits are
\[
p_L = \frac{\left(2\hat{p} + z_{\alpha/2}^2/n\right) - \sqrt{\left(2\hat{p} + z_{\alpha/2}^2/n\right)^2 - 4\left(1 + z_{\alpha/2}^2/n\right)\hat{p}^2}}{2\left(1 + z_{\alpha/2}^2/n\right)}
\qquad\text{and}\qquad
p_U = \frac{\left(2\hat{p} + z_{\alpha/2}^2/n\right) + \sqrt{\left(2\hat{p} + z_{\alpha/2}^2/n\right)^2 - 4\left(1 + z_{\alpha/2}^2/n\right)\hat{p}^2}}{2\left(1 + z_{\alpha/2}^2/n\right)},
\]
respectively. Note that the score interval is much more complex than the Wald interval.
However, the score interval (in this setting and elsewhere) typically confers very good cover-
age probability, that is, close to the nominal 1 − α level, even for small samples. Therefore,
although we have added complexity, the score interval is typically much better.
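A computational sketch of these closed-form endpoints (illustrative, not part of the original notes):

import numpy as np
from scipy.stats import norm

def score_ci(x, alpha=0.05):
    x = np.asarray(x)
    n, p_hat = len(x), x.mean()
    z2 = norm.ppf(1 - alpha / 2) ** 2
    a_coef = 1 + z2 / n                     # coefficient of p^2
    b_coef = 2 * p_hat + z2 / n             # negative of the coefficient of p
    disc = np.sqrt(b_coef ** 2 - 4 * a_coef * p_hat ** 2)
    return (b_coef - disc) / (2 * a_coef), (b_coef + disc) / (2 * a_coef)

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=50)
print(score_ci(x))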
Recall: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing H0: θ = θ0 versus H1: θ ≠ θ0. The LRT statistic is
\[
\lambda(\mathbf{x}) = \frac{L(\theta_0|\mathbf{x})}{L(\hat{\theta}|\mathbf{x})}
\]
and
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{1,\alpha}\}
\]
is an approximate size α rejection region for testing H0 versus H1. Inverting the acceptance region,
\[
C(\mathbf{x}) = \left\{\theta : -2\ln\!\left[\frac{L(\theta|\mathbf{x})}{L(\hat{\theta}|\mathbf{x})}\right] < \chi^2_{1,\alpha}\right\}
\]
is an approximate 1 − α confidence set for θ. If C(\mathbf{x}) is an interval, then we call it a likelihood ratio confidence interval.
Example 10.10. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive a
1 − α (large sample) likelihood ratio confidence interval for p.
Solution. From Example 10.6, we have
\[
-2\ln\!\left[\frac{L(p|\mathbf{x})}{L(\hat{p}|\mathbf{x})}\right]
= -2\left[n\hat{p}\,\ln\!\left(\frac{p}{\hat{p}}\right) + n(1-\hat{p})\,\ln\!\left(\frac{1-p}{1-\hat{p}}\right)\right].
\]
Therefore, the confidence interval is
\[
C(\mathbf{x}) = \left\{p : -2\left[n\hat{p}\,\ln\!\left(\frac{p}{\hat{p}}\right) + n(1-\hat{p})\,\ln\!\left(\frac{1-p}{1-\hat{p}}\right)\right] < \chi^2_{1,\alpha}\right\}.
\]
This interval must be calculated using numerical search methods.
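One simple approach is a grid search, as in the following sketch (illustrative, not part of the original notes; it assumes 0 < \hat{p} < 1):

import numpy as np
from scipy.stats import chi2

def lrt_ci(x, alpha=0.05, grid_size=100001):
    x = np.asarray(x)
    n, p_hat = len(x), x.mean()
    cutoff = chi2.ppf(1 - alpha, df=1)
    grid = np.linspace(1e-6, 1 - 1e-6, grid_size)
    stat = -2 * (n * p_hat * np.log(grid / p_hat)
                 + n * (1 - p_hat) * np.log((1 - grid) / (1 - p_hat)))
    inside = grid[stat < cutoff]            # values of p not rejected by the LRT
    return inside.min(), inside.max()

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=50)
print(lrt_ci(x))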