MATHEMATICAL STATISTICS II
Spring 2018
Lecture Notes
Joshua M. Tebbs
Department of Statistics
University of South Carolina
© by Joshua M. Tebbs
Contents
6 Principles of Data Reduction
6.1 Introduction
6.2 The Sufficiency Principle
6.2.1 Sufficient statistics
6.2.2 Minimal sufficient statistics
6.2.3 Ancillary statistics
6.2.4 Sufficient, ancillary, and complete statistics
7 Point Estimation
7.1 Introduction
7.2 Methods of Finding Estimators
7.2.1 Method of moments
7.2.2 Maximum likelihood estimation
7.2.3 Bayesian estimation
7.3 Methods of Evaluating Estimators
7.3.1 Bias, variance, and MSE
7.3.2 Best unbiased estimators
7.3.3 Sufficiency and completeness
7.4 Appendix: CRLB Theory
8 Hypothesis Testing
8.1 Introduction
8.2 Methods of Finding Tests
8.2.1 Likelihood ratio tests
8.2.2 Bayesian tests
8.3 Methods of Evaluating Tests
8.3.1 Error probabilities and the power function
8.3.2 Most powerful tests
8.3.3 Uniformly most powerful tests
8.3.4 Probability values
6 Principles of Data Reduction
6.1 Introduction
A statistic T = T(X) partitions the sample space X into sets
A_t = {x ∈ X : T(x) = t},
for t ∈ T. The statistic T summarizes the data x in that one can report
T(x) = t ⟺ x ∈ A_t
instead of reporting x itself. This is the idea behind data reduction. We reduce the data
x so that they can be more easily understood without losing the meaning associated with
the set of observations.
Example 6.1. Suppose X1 , X2 , X3 are iid Bernoulli(θ), where 0 < θ < 1. The support of
X = (X1 , X2 , X3 ) is
X = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)}.
Consider the statistic T = T(X) = X1 + X2 + X3, whose possible values T = {0, 1, 2, 3} form the support of T. The statistic T summarizes the data in that it reports only the value
T (x) = t. It does not report which x ∈ X produced T (x) = t.
The Sufficiency Principle: If T(X) is a sufficient statistic for θ, then any inference about θ should depend on the sample X only through the value of T(X).
• In other words, if x ∈ X, y ∈ X, and T(x) = T(y), then inference for θ should be the same whether X = x or X = y is observed.
• For example, in Example 6.1, suppose
x = (1, 0, 0)
y = (0, 0, 1)
so that t = T (x) = T (y) = 1. The Sufficiency Principle says that inference for θ
depends only on the value of t = 1 and not on whether x or y was observed.
Definition 6.2.1 (CB): A statistic T(X) is a sufficient statistic for θ if the conditional distribution of the sample X given the value of T(X) does not depend on θ.
Theorem 6.2.2 (CB): If f_X(x|θ) is the joint pdf/pmf of X and f_T(t|θ) is the pdf/pmf of T(X), then T(X) is sufficient for θ if the ratio f_X(x|θ)/f_T(T(x)|θ) does not depend on θ.
Discussion: Note that in the discrete case, all distributions above can be interpreted as probabilities. From the definition of a conditional distribution,
f_{X|T}(x|t) = f_{X,T}(x, t|θ)/f_T(t|θ) = P_θ(X = x, T = t)/P_θ(T = t).
Because {X = x} ⊂ {T = t}, we have
P_θ(X = x, T = t) = P_θ(X = x) = f_X(x|θ).
Therefore,
f_{X|T}(x|t) = f_X(x|θ)/f_T(t|θ)
as claimed. If T is continuous, then f_T(t|θ) ≠ P_θ(T = t) and f_{X|T}(x|t) cannot be interpreted as a conditional probability. Fortunately, the criterion above; i.e.,
f_{X|T}(x|t) = f_X(x|θ)/f_T(t|θ)
being free of θ, still applies in the continuous case (although a more rigorous explanation
would be needed to see why).
Example 6.2. Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. Use Definition
6.2.1/Theorem 6.2.2 to show that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic.
Recall that T ∼ Poisson(nθ), shown by using mgfs. Therefore, the pmf of T , for t = 0, 1, 2, ...,
is
f_T(t|θ) = (nθ)^t e^{−nθ}/t!.
With t = Σ_{i=1}^n x_i, the conditional distribution
f_{X|T}(x|t) = f_X(x|θ)/f_T(t|θ) = [θ^{Σ_i x_i} e^{−nθ}/∏_{i=1}^n x_i!] / [(nθ)^t e^{−nθ}/t!] = t!/(n^t ∏_{i=1}^n x_i!),
which is free of θ. From the definition of sufficiency and from Theorem 6.2.2, we have shown that T = T(X) = Σ_{i=1}^n X_i is a sufficient statistic. □
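Sufficiency can also be illustrated numerically. The following R sketch (not part of the original notes; n = 3, t = 4, and the two values of θ are arbitrary choices) simulates Poisson samples, conditions on T = t, and shows that the conditional relative frequencies of the possible x vectors look the same for both values of θ:

# Conditional distribution of X given T = t does not depend on theta (Poisson case)
cond_freq <- function(theta, n = 3, t = 4, B = 2e5) {
  x <- matrix(rpois(B * n, theta), ncol = n)      # B samples of size n
  keep <- x[rowSums(x) == t, , drop = FALSE]      # condition on T = sum(X) = t
  tab <- table(apply(keep, 1, paste, collapse = ","))
  round(tab / nrow(keep), 3)                      # conditional relative frequencies
}
cond_freq(theta = 1)
cond_freq(theta = 2)   # agrees with the previous call up to Monte Carlo error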
Example 6.3. Suppose X1, X2, ..., Xn are iid exponential(θ), where θ > 0. Show that
T = T(X) = X̄
is a sufficient statistic.
Proof. The pdf of X, for x_i > 0, is given by
f_X(x|θ) = ∏_{i=1}^n (1/θ) e^{−x_i/θ} = (1/θ^n) e^{−Σ_{i=1}^n x_i/θ}.
Recall that X̄ ∼ gamma(n, θ/n), so f_T(x̄|θ) can be written down directly. Forming the ratio f_X(x|θ)/f_T(x̄|θ) and simplifying gives an expression
which is free of θ. From the definition of sufficiency and from Theorem 6.2.2, we have shown that T = T(X) = X̄ is a sufficient statistic. □
Example 6.4. Suppose X1 , X2 , ..., Xn is an iid sample from a continuous distribution with
pdf fX (x|θ), where θ ∈ Θ. Show that T = T(X) = (X(1) , X(2) , ..., X(n) ), the vector of order
statistics, is always sufficient.
Proof. Recall from Section 5.4 (CB) that the joint distribution of the n order statistics is
f_{X(1),X(2),...,X(n)}(x1, x2, ..., xn|θ) = n! f_X(x1|θ) f_X(x2|θ) ··· f_X(xn|θ) = n! f_X(x|θ),
where f_X(x|θ) denotes the joint pdf of X. Therefore,
f_X(x|θ)/f_T(t|θ) = f_X(x|θ)/[n! f_X(x|θ)] = 1/n!,
which is free of θ. From the definition of sufficiency and from Theorem 6.2.2, we have shown that T = T(X) = (X_(1), X_(2), ..., X_(n)) is a sufficient statistic. □
Discussion: Example 6.4 shows that (with continuous distributions), the order statistics
are always sufficient.
• Of course, reducing the sample X = (X1 , X2 , ..., Xn ) to T(X) = (X(1) , X(2) , ..., X(n) ) is
not that much of a reduction. However, in some parametric families, it is not possible
to reduce X any further without losing information about θ (e.g., Cauchy, logistic,
etc.); see pp 275 (CB).
• In some instances, it may be that the parametric form of fX (x|θ) is not specified. With
so little information provided about the population, we should not be surprised that
the only available reduction of X is to the order statistics.
Remark: The approach we have outlined to show that a statistic T is sufficient appeals to
Definition 6.2.1 and Theorem 6.2.2; i.e., we are using the definition of sufficiency directly by
showing that the conditional distribution of X given T is free of θ.
• What if we need to find a sufficient statistic? Then the approach we have just outlined
is not practical to implement (i.e., imagine trying different statistics T and for each
one attempting to show that fX|T (x|t) is free of θ). This might involve a large amount
of trial and error and you would have to derive the sampling distribution of T each
time (which for many statistics can be difficult or even intractable).
Theorem 6.2.6 (Factorization Theorem): Let f_X(x|θ) denote the joint pdf/pmf of the sample X. A statistic T(X) is sufficient for θ if and only if there exist functions g(t|θ) and h(x) such that, for all x and all θ,
f_X(x|θ) = g(T(x)|θ) h(x).
Proof. We prove the result for the discrete case only; the continuous case is beyond the scope of this course.
Necessity (⟹): Suppose T is sufficient. It suffices to show there exist functions g(t|θ) and
h(x) such that the factorization holds. Because T is sufficient, we know
fX|T (x|t) = P (X = x|T (X) = t)
is free of θ (this is the definition of sufficiency). Therefore, take
g(t|θ) = Pθ (T (X) = t)
h(x) = P (X = x|T (X) = t).
Because {X = x} ⊂ {T (X) = t},
fX (x|θ) = Pθ (X = x)
= Pθ (X = x, T (X) = t)
= Pθ (T (X) = t)P (X = x|T (X) = t) = g(t|θ)h(x).
Sufficiency (⇐=): Suppose the factorization holds. To establish that T = T (X) is sufficient,
it suffices to show that
fX|T (x|t) = P (X = x|T (X) = t)
is free of θ. Denoting T(x) = t, we have
f_{X|T}(x|t) = P(X = x|T(X) = t) = P_θ(X = x, T(X) = t)/P_θ(T(X) = t)
= P_θ(X = x) I(T(x) = t)/P_θ(T(X) = t)
= g(t|θ)h(x) I(T(x) = t)/P_θ(T(X) = t),
because the factorization holds by assumption. Now write
Pθ (T (X) = t) = Pθ (X ∈ At ),
where recall At = {x ∈ X : T (x) = t} is a set over (Rn , B(Rn ), PX ). Note that
P_θ(X ∈ A_t) = Σ_{x∈X: T(x)=t} P_θ(X = x)
= Σ_{x∈X: T(x)=t} g(t|θ)h(x)
= g(t|θ) Σ_{x∈X: T(x)=t} h(x).
Therefore,
f_{X|T}(x|t) = g(t|θ)h(x) I(T(x) = t) / [g(t|θ) Σ_{x∈X: T(x)=t} h(x)] = h(x) I(T(x) = t) / Σ_{x∈X: T(x)=t} h(x),
which is free of θ. □
Example 6.2 (continued). Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. We have
already shown that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic (using the definition of sufficiency). We now show this using the Factorization Theorem. For x_i = 0, 1, 2, ..., the pmf of X is
f_X(x|θ) = ∏_{i=1}^n θ^{x_i} e^{−θ}/x_i!
= θ^{Σ_i x_i} e^{−nθ} / ∏_{i=1}^n x_i!
= [θ^{Σ_i x_i} e^{−nθ}] × [1/∏_{i=1}^n x_i!] = g(t|θ) h(x),
where t = Σ_{i=1}^n x_i. By the Factorization Theorem, T = T(X) = Σ_{i=1}^n X_i is sufficient.
Example 6.5. Suppose X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. Find a sufficient statistic.
Solution. The pdf of X is
f_X(x|θ) = ∏_{i=1}^n (1/θ) I(0 < x_i < θ)
= (1/θ^n) ∏_{i=1}^n I(0 < x_i < θ)
= [(1/θ^n) I(x_(n) < θ)] × [∏_{i=1}^n I(x_i > 0)] = g(t|θ) h(x),
where t = x_(n). By the Factorization Theorem, T = T(X) = X_(n) is sufficient.
Example 6.6. Suppose X1 , X2 , ..., Xn are iid gamma(α, β), where α > 0 and β > 0. Note
that in this family, the parameter θ = (α, β) is two-dimensional. The pdf of X is
f_X(x|θ) = ∏_{i=1}^n [1/(Γ(α)β^α)] x_i^{α−1} e^{−x_i/β} I(x_i > 0)
= [1/(Γ(α)β^α)]^n (∏_{i=1}^n x_i)^α e^{−Σ_{i=1}^n x_i/β} × ∏_{i=1}^n [I(x_i > 0)/x_i] = g(t1, t2|θ) h(x),
where t1 = ∏_{i=1}^n x_i and t2 = Σ_{i=1}^n x_i. By the Factorization Theorem,
T = T(X) = ( ∏_{i=1}^n X_i, Σ_{i=1}^n X_i )
is sufficient.
Remark: In previous examples, we have seen that the dimension of a sufficient statistic T
often equals the dimension of the parameter θ:
• Example 6.2: Poisson(θ). T = Σ_{i=1}^n X_i; dim(T) = dim(θ) = 1
Sometimes the dimension of a sufficient statistic is larger than that of the parameter. We
have already seen this in Example 6.4 where T(X) = (X(1) , X(2) , ..., X(n) ), the vector of order
statistics, was sufficient; i.e., dim(T) = n. In some parametric families (e.g., Cauchy, etc.),
this statistic is sufficient and no further reduction is possible.
Example 6.7. Suppose X1 , X2 , ..., Xn are iid U(θ, θ + 1), where −∞ < θ < ∞. This is a
one-parameter family; i.e., dim(θ) = 1. The pdf of X is
f_X(x|θ) = ∏_{i=1}^n I(θ < x_i < θ + 1)
= ∏_{i=1}^n I(x_i > θ) ∏_{i=1}^n I(x_i − 1 < θ)
= [I(x_(1) > θ) I(x_(n) − 1 < θ)] × [∏_{i=1}^n I(x_i ∈ R)] = g(t1, t2|θ) h(x),
where t1 = x_(1) and t2 = x_(n). By the Factorization Theorem, T = T(X) = (X_(1), X_(n)) is sufficient; note that dim(T) = 2 > 1 = dim(θ).
where t1 = Σ_{i=1}^n y_i², t2 = Σ_{i=1}^n y_i, and t3 = Σ_{i=1}^n x_i y_i. Taking h(y) = 1, the Factorization Theorem shows that
T = T(Y) = ( Σ_{i=1}^n Y_i², Σ_{i=1}^n Y_i, Σ_{i=1}^n x_i Y_i )
is sufficient. Note that dim(T) = dim(θ) = 3.
Theorem 6.2.10. Suppose X1, X2, ..., Xn are iid from the exponential family
f_X(x|θ) = h(x)c(θ) exp{ Σ_{i=1}^k w_i(θ)t_i(x) },
where θ = (θ1, ..., θ_d), d ≤ k. Then
T = T(X) = ( Σ_{j=1}^n t_1(X_j), Σ_{j=1}^n t_2(X_j), ..., Σ_{j=1}^n t_k(X_j) )
is sufficient.
Example 6.9. Suppose X1 , X2 , ..., Xn are iid Bernoulli(θ), where 0 < θ < 1. For x = 0, 1,
the pmf of X is
f_X(x|θ) = θ^x (1 − θ)^{1−x}
= (1 − θ) [θ/(1 − θ)]^x
= (1 − θ) exp{ x ln[θ/(1 − θ)] }
= h(x)c(θ) exp{w1(θ)t1(x)},
where h(x) = 1, c(θ) = 1 − θ, w1(θ) = ln{θ/(1 − θ)}, and t1(x) = x. By Theorem 6.2.10,
T = T(X) = Σ_{j=1}^n t1(X_j) = Σ_{j=1}^n X_j
is sufficient.
Applications:
• In the N (µ, σ 2 ) family where both parameters are unknown, it is easy to show that
T = T(X) = ( Σ_{i=1}^n X_i, Σ_{i=1}^n X_i² )
is sufficient (just apply the Factorization Theorem directly or use our result dealing with exponential families). Define the function
r(t) = r(t1, t2) = ( t1/n, (t2 − t1²/n)/(n − 1) ),
and note that r(t) is one-to-one over T = {(t1, t2) : −∞ < t1 < ∞, t2 ≥ 0}. Therefore,
r(T(X)) = r( Σ_{i=1}^n X_i, Σ_{i=1}^n X_i² ) = ( X̄, S² )
is also sufficient, because a one-to-one function of a sufficient statistic is itself sufficient.
Remark: In the N (µ, σ 2 ) family where both parameters are unknown, the statistic T(X) =
(X, S 2 ) is sufficient.
Example 6.10. Suppose that X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02
is known. Each of the following statistics is sufficient:
T1(X) = X (the entire sample),  T2(X) = ( X1, Σ_{i=2}^n X_i ),  T3(X) = (X_(1), X_(2), ..., X_(n)),  T4(X) = X̄.
Definition: A sufficient statistic T = T (X) is called a minimal sufficient statistic if, for
any other sufficient statistic T ∗ (X), T (x) is a function of T ∗ (x).
Remark: A minimal sufficient statistic is a sufficient statistic that offers the most data
reduction. Note that "T(x) is a function of T*(x)" means that whenever T*(x) = T*(y), we must also have T(x) = T(y).
Informally, if you know T*(x), you can calculate T(x), but not necessarily vice versa.
Remark: You can also characterize minimality of a sufficient statistic using the partitioning
concept described at the beginning of this chapter. Consider the collection of sufficient
statistics. A minimal sufficient statistic T = T (X) admits the coarsest possible partition
in the collection.
Theorem 6.2.13. Suppose f_X(x|θ) is the pdf/pmf of X and T(X) is a statistic such that, for every two sample points x and y,
f_X(x|θ)/f_X(y|θ) is free of θ ⟺ T(x) = T(y).
Then T(X) is a minimal sufficient statistic.
Example 6.10 (continued). Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞
and σ02 is known. For x ∈ Rn , the pdf of X is
f_X(x|µ) = ∏_{i=1}^n (1/√(2πσ0²)) e^{−(x_i−µ)²/(2σ0²)}
= (2πσ0²)^{−n/2} e^{−Σ_{i=1}^n (x_i−µ)²/(2σ0²)}.
Now write
Σ_{i=1}^n (x_i − µ)² = Σ_{i=1}^n (x_i − x̄)² + n(x̄ − µ)²,
so that, for two sample points x and y, the ratio
f_X(x|µ)/f_X(y|µ) = exp{ −[Σ_{i=1}^n (x_i − x̄)² + n(x̄ − µ)² − Σ_{i=1}^n (y_i − ȳ)² − n(ȳ − µ)²]/(2σ0²) }.
Clearly, this ratio is free of µ if and only if x̄ = ȳ. By Theorem 6.2.13, we know that T(X) = X̄ is a minimal sufficient statistic.
Example 6.7 (continued). Suppose X1 , X2 , ..., Xn are iid U(θ, θ + 1), where −∞ < θ < ∞.
We have already shown the pdf of X is
f_X(x|θ) = I(x_(1) > θ) I(x_(n) − 1 < θ) ∏_{i=1}^n I(x_i ∈ R).
Therefore, for two sample points x and y, the ratio f_X(x|θ)/f_X(y|θ)
is free of θ if and only if (x_(1), x_(n)) = (y_(1), y_(n)). By Theorem 6.2.13, we know that T(X) = (X_(1), X_(n)) is a minimal sufficient statistic. Note that in this family, the dimension of a minimal sufficient statistic does not match the dimension of the parameter. Note also that a one-to-one function of T(X) is
( X_(n) − X_(1), (X_(1) + X_(n))/2 ),
the sample range and midrange, which is therefore also minimal sufficient.
Example 6.11. Suppose that X1 , X2 , ..., Xn are iid N (0, σ 2 ), where σ 2 > 0. Note that
X̄ ∼ N(0, σ²/n), so
S(X) = X̄/(S/√n) ∼ t_{n−1}
is ancillary because its distribution, t_{n−1}, does not depend on σ². Also, it is easy to show that
T(X) = Σ_{i=1}^n X_i²
is a (minimal) sufficient statistic for σ².
Recap:
• T(X) = Σ_{i=1}^n X_i² contains all the information about σ².
• I used R to generate B = 1000 draws from the bivariate distribution of (T (X), S(X)),
when n = 10 and σ 2 = 100; see Figure 6.1.
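Here is a short R sketch of that simulation (the seed is an arbitrary choice; the rest follows the stated setup with n = 10 and σ² = 100):

set.seed(713)                          # arbitrary seed
B <- 1000; n <- 10; sigma <- 10        # sigma^2 = 100
T.stat <- S.stat <- numeric(B)
for (b in 1:B) {
  x <- rnorm(n, mean = 0, sd = sigma)
  T.stat[b] <- sum(x^2)                        # sufficient statistic
  S.stat[b] <- mean(x) / (sd(x) / sqrt(n))     # ancillary t-statistic
}
plot(T.stat, S.stat, xlab = "t", ylab = "s")   # scatterplot as in Figure 6.1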
Remark: Finding ancillary statistics is easy when you are dealing with location or scale
families.
Definition: Suppose that S(x1 + a, x2 + a, ..., xn + a) = S(x1, x2, ..., xn) for every a ∈ R and for all x ∈ X. We say that S(X) is a location-invariant statistic. In other words, the value of S(x) is unaffected by location shifts.
Result: Suppose X1, X2, ..., Xn are iid from
f_X(x|µ) = f_Z(x − µ),
a location family with standard pdf f_Z(·) and location parameter −∞ < µ < ∞. If S(X) is location invariant, then it is ancillary.
Proof. Define Wi = Xi − µ, for i = 1, 2, ..., n. We perform an n-variate transformation to
find the distribution of W = (W1 , W2 , ..., Wn ). The inverse transformation is described by
Figure 6.1: Scatterplot of B = 1000 pairs of T (x) and S(x) in Example 6.11. Each point
was calculated based on an iid sample of size n = 10 with σ 2 = 100.
x_i = w_i + µ, for i = 1, 2, ..., n. It is easy to see that the Jacobian of the inverse transformation is 1 and therefore
f_W(w) = ∏_{i=1}^n f_Z(w_i),
which does not depend on µ. Because the distribution of W does not depend on µ, the distribution of the statistic S(W) cannot depend on µ either. But S(W) = S(X), so we are done. □
Example 6.12. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Show that the sample variance S 2 is ancillary.
Proof. First note that
f_X(x|µ) = (1/√(2πσ0²)) e^{−(x−µ)²/(2σ0²)} I(x ∈ R) = f_Z(x − µ),
where
f_Z(z) = (1/√(2πσ0²)) e^{−z²/(2σ0²)} I(z ∈ R),
the N (0, σ02 ) pdf. Therefore, the N (µ, σ02 ) family is a location family. We now show that
S(X) = S² is location invariant. Let W_i = X_i + c, for i = 1, 2, ..., n. Clearly, W̄ = X̄ + c and
S(W) = (1/(n−1)) Σ_{i=1}^n (W_i − W̄)²
= (1/(n−1)) Σ_{i=1}^n [(X_i + c) − (X̄ + c)]²
= (1/(n−1)) Σ_{i=1}^n (X_i − X̄)² = S(X).
Thus S² is location invariant and, by the result above, ancillary. □
Remark: The preceding argument only shows that the distribution of S 2 does not depend
on µ. However, in this example, it is easy to find the distribution of S 2 directly. Recall that
(n − 1)S²/σ0² ∼ χ²_{n−1} =d gamma((n−1)/2, 2) ⟹ S² ∼ gamma( (n−1)/2, 2σ0²/(n−1) ),
which does not depend on µ.
Definition: Suppose that S(cx1, cx2, ..., cxn) = S(x1, x2, ..., xn) for every c > 0 and for all x ∈ X. We say that S(X) is a scale-invariant statistic. In other words, the value of S(x) is unaffected by changes in scale.
Result: Suppose X1, X2, ..., Xn are iid from f_X(x|σ) = (1/σ) f_Z(x/σ), a scale family with standard pdf f_Z(·) and scale parameter σ > 0. If S(X) is scale invariant, then it is ancillary.
Proof. Define W_i = X_i/σ, for i = 1, 2, ..., n. The inverse transformation is described by x_i = σw_i, for i = 1, 2, ..., n. It is easy to see that the Jacobian of the inverse transformation is σ^n and therefore
f_W(w) = ∏_{i=1}^n f_Z(w_i),
which does not depend on σ. Because the distribution of W does not depend on σ, the distribution of the statistic S(W) cannot depend on σ either. But S(W) = S(X), so we are done. □
Examples: Each of the following is a scale-invariant statistic (and hence is ancillary when
sampling from a scale family):
S(X) = S/X̄,  S(X) = X_(n)/X_(1),  S(X) = Σ_{i=1}^k X_i² / Σ_{i=1}^n X_i².
Remark: The preceding argument only shows that the distribution of S(X) does not depend
on σ. It can be shown (verify!) that
S(X) = Σ_{i=1}^k |X_i| / Σ_{i=1}^n |X_i| ∼ beta(k, n − k),
which does not depend on σ.
Definition: Let {fT (t|θ); θ ∈ Θ} be a family of pdfs (or pmfs) for a statistic T = T (X). We
say that this family is a complete family if the following condition holds:
Eθ [g(T )] = 0 ∀θ ∈ Θ =⇒ Pθ (g(T ) = 0) = 1 ∀θ ∈ Θ;
i.e., g(T ) = 0 almost surely for all θ ∈ Θ. We call T = T (X) a complete statistic.
Remark: This condition basically says that the only function of T that is an unbiased
estimator of zero is the function that is zero itself (with probability 1).
Example 6.14. Suppose X1 , X2 , ..., Xn are iid Bernoulli(θ), where 0 < θ < 1. Show that
T = T(X) = Σ_{i=1}^n X_i
is a complete statistic.
Proof. We know that T ∼ b(n, θ), so it suffices to show that this family of distributions is a
complete family. Suppose
Eθ [g(T )] = 0 ∀θ ∈ (0, 1).
It suffices to show that Pθ (g(T ) = 0) = 1 for all θ ∈ (0, 1). Note that
0 = E_θ[g(T)]
= Σ_{t=0}^n g(t) C(n, t) θ^t (1 − θ)^{n−t}
= (1 − θ)^n Σ_{t=0}^n g(t) C(n, t) r^t,
where C(n, t) denotes the binomial coefficient and r = θ/(1 − θ) > 0. The sum is a polynomial (in r) of degree n. The only way this polynomial can be zero for all θ ∈ (0, 1); i.e., for all r > 0, is for the coefficients
C(n, t) g(t) = 0, for t = 0, 1, 2, ..., n.
Because C(n, t) ≠ 0, this can only happen when g(t) = 0, for t = 0, 1, 2, ..., n. We have shown that P_θ(g(T) = 0) = 1 for all θ ∈ (0, 1); i.e., T is complete. □
Remark: To show that a statistic T = T (X) is not complete, all we have to do is find one
nonzero function g(T ) that satisfies Eθ [g(T )] = 0, for all θ.
Example 6.15. Suppose X1 , X2 , ..., Xn are iid N (θ, θ2 ), where θ ∈ Θ = (−∞, 0) ∪ (0, ∞).
The pdf of X is
f_X(x|θ) = (1/√(2πθ²)) e^{−(x−θ)²/(2θ²)} I(x ∈ R),
a curved exponential family whose natural sufficient statistic is T = T(X) = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²). Because E_θ(X̄²) = θ²(n + 1)/n and E_θ(S²) = θ², the function g(T) = nX̄²/(n + 1) − S², for example, satisfies E_θ[g(T)] = 0 for every θ ∈ Θ.
We have found a nonzero function g(T) that has zero expectation. Therefore T cannot be
complete.
Basu's Theorem (Theorem 6.2.24, CB): If T(X) is a complete and sufficient statistic, then T(X) is independent of every ancillary statistic S(X).
The proof (in the discrete case) shows that the joint cdf of (S, T) factors into the product of the marginal cdfs; because s and t are arbitrary, the two statistics are independent. □
Example 6.16. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. Show that X(n)
and X(1) /X(n) are independent.
Proof. We will show that S(X) = X_(1)/X_(n) is ancillary and that T(X) = X_(n) is complete and sufficient.
The result will then follow from Basu's Theorem. First, note that
f_X(x|θ) = (1/θ) I(0 < x < θ) = (1/θ) f_Z(x/θ),
where fZ (z) = I(0 < z < 1) is the standard uniform pdf. Therefore, the U(0, θ) family is
a scale family. We now show that S(X) is scale invariant. For d > 0, let Wi = dXi , for
i = 1, 2, ..., n. We have
S(W) = W_(1)/W_(n) = dX_(1)/(dX_(n)) = X_(1)/X_(n) = S(X).
We have already shown that T = T(X) = X_(n) is sufficient; see Example 6.5 (notes). We now show T is complete. We first find the distribution of T. The pdf of T, the maximum order statistic, is given by
f_T(t|θ) = n t^{n−1}/θ^n, for 0 < t < θ.
Suppose E_θ[g(T)] = 0 for all θ > 0; i.e.,
∫_0^θ g(t) n t^{n−1}/θ^n dt = 0 for all θ > 0 ⟹ ∫_0^θ g(t) t^{n−1} dt = 0 for all θ > 0.
Differentiating both sides with respect to θ gives g(θ)θ^{n−1} = 0,
the last step following from the Fundamental Theorem of Calculus, provided that g is Riemann-integrable. Because θ^{n−1} ≠ 0, it must be true that g(θ) = 0 for all θ > 0. We have
therefore shown that the only function g satisfying E_θ[g(T)] = 0 for all θ > 0 is the function that is itself zero; i.e., we have shown that T = X_(n) is complete. Basu's Theorem now gives X_(n) ⊥⊥ X_(1)/X_(n). □
Remark: Our completeness argument in Example 6.16 is not entirely convincing. We have
basically established that
for the class of functions g which are Riemann-integrable. There are many functions g that
are not Riemann-integrable. CB note that “this distinction is not of concern.” This is another
way of saying that the authors do not want to present completeness from a more general
point of view (for good reason; this would involve a heavy dose of measure theory).
Application: Suppose we want to calculate E(X_(1)/X_(n)). Write
E(X_(1)) = E( X_(n) · X_(1)/X_(n) ) = E(X_(n)) E(X_(1)/X_(n)),
the last step following because X_(n) and X_(1)/X_(n) are independent. Therefore, we can calculate the desired expectation by instead calculating E(X_(1)) and E(X_(n)). These are easier to calculate:
E(X_(1)) = θ/(n + 1) and E(X_(n)) = nθ/(n + 1).
Therefore, we have
θ/(n + 1) = [nθ/(n + 1)] E(X_(1)/X_(n)) ⟹ E(X_(1)/X_(n)) = 1/n.
It makes sense that this expectation would not depend on θ; recall that S(X) = X(1) /X(n)
is ancillary.
Recall from Theorem 6.2.10 that, in the exponential family, T = T(X) = ( Σ_{j=1}^n t_1(X_j), ..., Σ_{j=1}^n t_k(X_j) ) is a sufficient statistic.
New result (Theorem 6.2.25): In the exponential family, the statistic T = T(X) is com-
plete if the natural parameter space
{η = (η1 , η2 , ..., ηk ) : ηi = wi (θ); θ ∈ Θ}
contains an open set in Rk . For the most part, this means:
• T = T(X) is complete if d = k, where d = dim(θ) (full exponential family)
• T = T(X) is not complete if d < k (curved exponential family).
Example 6.17. Suppose that X1 , X2 , ..., Xn is an iid sample from a gamma(α, 1/α2 ) distri-
bution. The pdf of X is
f_X(x|α) = [1/(Γ(α)(1/α²)^α)] x^{α−1} e^{−x/(1/α²)} I(x > 0)
= [I(x > 0)/x] [α^{2α}/Γ(α)] e^{α ln x} e^{−α²x}
= [I(x > 0)/x] [α^{2α}/Γ(α)] exp{α ln x − α²x}
= h(x)c(α) exp{w1(α)t1(x) + w2(α)t2(x)},
where h(x) = I(x > 0)/x, c(α) = α^{2α}/Γ(α), w1(α) = α, t1(x) = ln x, w2(α) = −α², and t2(x) = x. Theorem 6.2.10 tells us that
T = T(X) = ( Σ_{i=1}^n ln X_i, Σ_{i=1}^n X_i )
is a sufficient statistic. However, Theorem 6.2.25 tells us that T is not complete because
{fX (x|α), α > 0} is an exponential family with d = 1 and k = 2. Note also that
{η = (η1, η2) = (α, −α²) : α > 0}
is a half-parabola (which opens downward); this set does not contain an open set in R2 .
Example 6.18. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Prove that X̄ ⊥⊥ S².
Proof. We use Basu’s Theorem, but we have to use it carefully. Fix σ 2 = σ02 and consider
first the N (µ, σ02 ) subfamily. The pdf of X ∼ N (µ, σ02 ) is
f_X(x|µ) = (1/√(2πσ0²)) e^{−(x−µ)²/(2σ0²)} I(x ∈ R)
= [I(x ∈ R) e^{−x²/(2σ0²)}/√(2πσ0²)] e^{−µ²/(2σ0²)} e^{(µ/σ0²)x}
= h(x)c(µ) exp{w1(µ)t1(x)}.
Theorem 6.2.10 tells us that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic. Because d = k = 1 (remember, this is for the N (µ, σ02 ) subfamily),
Theorem 6.2.25 tells us that T is complete. In Example 6.12 (notes), we have already shown that S(X) = S² is ancillary in this subfamily.
Therefore, by Basu’s Theorem, we have proven that, in the N (µ, σ02 ) subfamily,
Σ_{i=1}^n X_i ⊥⊥ S² ⟹ X̄ ⊥⊥ S²,
the last implication being true because X̄ is a function of T = T(X) = Σ_{i=1}^n X_i and functions of independent statistics are independent. Finally, because we fixed σ² = σ0² arbitrarily, the same argument holds for every fixed value of σ0². Therefore, this independence result holds for any choice of σ² and hence for the full N(µ, σ²) family. □
Remark: It is important to see that in the preceding proof, we cannot work directly with
the N (µ, σ 2 ) family and claim that
• T(X) = Σ_{i=1}^n X_i is complete and sufficient
• S(X) = S² is ancillary
for this family. In fact, neither statement is true in the full family.
Remark: Outside the exponential family, Basu’s Theorem can be useful in showing that a
sufficient statistic T (X) is not complete.
Basu’s Theorem (Contrapositive version): Suppose T (X) is sufficient and S(X) is ancillary.
If T (X) and S(X) are not independent, then T (X) is not complete.
This shows that S(X) is ancillary in this family. Finally, we know from Example 6.4 (notes)
that the order statistics
T = T(X) = (X(1) , X(2) , ..., X(n) )
are sufficient for this family (in fact, T is minimal sufficient; see Exercise 6.9, CB, pp 301).
However, clearly S(X) and T(X) are not independent; e.g., if you know T(x), you can
calculate S(x). By Basu’s Theorem (the contrapositive version), we know that T(X) cannot
be complete.
Theorem 6.2.28. Suppose that T (X) is sufficient. If T (X) is complete, then T (X) is
minimal sufficient.
Remark: Example 6.19 shows that the converse to Theorem 6.2.28 is not true; i.e.,
T(X) minimal sufficient ⇏ T(X) complete.
Example 6.7 provides another counterexample. We showed that if X1 , X2 , ..., Xn are iid
U(θ, θ + 1), then T = T(X) = (X(1) , X(n) ) is a minimal sufficient statistic. However, T
cannot be complete because T and the sample range X(n) − X(1) (which is location invariant
and hence ancillary in this model) are not independent. This implies that there exists a
nonzero function g(T) that has zero expectation for all θ ∈ R. In fact, it is easy to show
that
E_θ(X_(n) − X_(1)) = (n − 1)/(n + 1).
Therefore,
g(T) = X_(n) − X_(1) − (n − 1)/(n + 1)
satisfies E_θ[g(T)] = 0 for all θ.
7 Point Estimation
7.1 Introduction
Remark: We will approach “the point estimation problem” from the following point of
view. We have a parametric model for X = (X1 , X2 , ..., Xn ):
X ∼ fX (x|θ), where θ ∈ Θ ⊆ Rk ,
and the model parameter θ = (θ1 , θ2 , ..., θk ) is unknown. We will assume that θ is fixed
(except when we discuss Bayesian estimation). Possible goals include
1. Estimating θ
Remark: For most of the situations we will encounter in this course, the random vector
X will consist of X1 , X2 , ..., Xn , an iid sample from the population fX (x|θ). However,
our discussion is also relevant when the independence assumption is relaxed, the identically
distributed assumption is relaxed, or both.
Definition: A point estimator W = W(X) = W(X1, X2, ..., Xn) is any function of the sample X. Therefore, any statistic is a point estimator. We call W(x) = W(x1, x2, ..., xn) a point estimate; W(x) is a realization of W(X).
Preview: This chapter is split into two parts. In this first part (Section 7.2), we present
different approaches of finding point estimators. These approaches are:
The second part (Section 7.3) focuses on evaluating point estimators; e.g., which estimators
are good/bad? What constitutes a “good” estimator? Is it possible to find the best one?
For that matter, how should we even define “best?”
7.2 Methods of Finding Estimators
7.2.1 Method of moments
Definition: The jth sample moment is m′_j = (1/n) Σ_{i=1}^n X_i^j and the jth population moment is µ′_j = E(X^j).
Intuition: The first k sample moments depend on the sample X. The first k population
moments will generally depend on θ = (θ1 , θ2 , ..., θk ). Therefore, the system of equations
m′_1 = E(X)
m′_2 = E(X²)
⋮
m′_k = E(X^k)
can (at least in theory) be solved for θ1 , θ2 , ..., θk . A solution to this system of equations is
called a method of moments (MOM) estimator.
Example 7.1. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. The first sample
moment is
m′_1 = (1/n) Σ_{i=1}^n X_i = X̄.
The first population moment is
µ′_1 = E(X) = θ/2.
We set these moments equal to each other; i.e.,
X̄ = θ/2,
and solve for θ. The solution
θ̂ = 2X̄
is a method of moments estimator for θ.
Example 7.2. Suppose that X1 , X2 , ..., Xn are iid U(−θ, θ), where θ > 0. For this popula-
tion, E(X) = 0 so this will not help us. Moving to second moments, we have
m′_2 = (1/n) Σ_{i=1}^n X_i²
and
µ′_2 = E(X²) = var(X) = θ²/3.
Therefore, we can set
(1/n) Σ_{i=1}^n X_i² = θ²/3
and solve for θ. The solution
θ̂ = +√( (3/n) Σ_{i=1}^n X_i² )
is a method of moments estimator for θ. We keep the positive solution because θ > 0
(although, technically, the negative solution is still a MOM estimator).
Example 7.3. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. The first two population moments are E(X) = µ and
E(X 2 ) = var(X) + [E(X)]2 = σ 2 + µ2 . Therefore, method of moments estimators for µ and
σ 2 are found by solving
X̄ = µ
(1/n) Σ_{i=1}^n X_i² = σ² + µ².
We have µ̂ = X̄ and
σ̂² = (1/n) Σ_{i=1}^n X_i² − X̄² = (1/n) Σ_{i=1}^n (X_i − X̄)².
Note that the method of moments estimator for σ 2 is not our “usual” sample variance (with
denominator n − 1).
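A minimal R sketch of Example 7.3 (the true values µ = 2 and σ² = 4 and the sample size n = 50 are arbitrary choices used only to generate data):

set.seed(1)
n <- 50
x <- rnorm(n, mean = 2, sd = 2)
mu.hat <- mean(x)                    # solves the first moment equation
sigma2.hat <- mean(x^2) - mean(x)^2  # equals (1/n) * sum((x - mean(x))^2)
c(mu.hat, sigma2.hat)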
Remarks:
• I think of MOM estimation as a “quick and dirty” approach. All we are doing is
matching moments. We are attempting to learn about a population fX (x|θ) by using
moments only.
• MOM estimators can be nonsensical. In fact, sometimes MOM estimators fall outside
the parameter space Θ. For example, in linear models with random effects, variance
components estimated via MOM can be negative.
7.2.2 Maximum likelihood estimation
Note: We first formally define a likelihood function; see also Section 6.3 (CB).
Definition: Let f_X(x|θ) denote the joint pdf/pmf of the sample X. Given that X = x is observed, the function of θ defined by L(θ|x) = f_X(x|θ) is called the likelihood function.
Note: The likelihood function L(θ|x) is the same function as the joint pdf/pmf f_X(x|θ). The only difference is in how we interpret each one.
• The function fX (x|θ) is a model that describes the random behavior of X when θ is
fixed.
• The function L(θ|x) is viewed as a function of θ with the data X = x held fixed.
That is, when X is discrete, we can interpret the likelihood function L(θ|x) literally as a
joint probability.
• Suppose that θ1 and θ2 are two possible values of θ. Suppose X is discrete and L(θ1|x) = P_{θ1}(X = x) > P_{θ2}(X = x) = L(θ2|x). This suggests the sample x is more likely to have occurred with θ = θ1 rather than if
θ = θ 2 . Therefore, in the discrete case, we can interpret L(θ|x) as “the probability of
the data x.”
• Section 6.3 (CB) describes how the likelihood function L(θ|x) can be viewed as a data
reduction device.
Definition: For each sample point x, let θ̂(x) denote a value of θ at which L(θ|x) attains its maximum as a function of θ (with x held fixed). We call θ̂(X) a maximum likelihood estimator (MLE).
Remarks:
1. Finding the MLE θ̂ is essentially a maximization problem. The estimate θ̂(x) must fall in the parameter space Θ because we are maximizing L(θ|x) over Θ; i.e.,
θ̂(x) = arg max_{θ∈Θ} L(θ|x).
Example 7.4. Suppose X1 , X2 , ..., Xn are iid U[0, θ], where θ > 0. Find the MLE of θ.
Solution. The likelihood function is
L(θ|x) = ∏_{i=1}^n (1/θ) I(0 ≤ x_i ≤ θ) = (1/θ^n) I(x_(n) ≤ θ) ∏_{i=1}^n I(x_i ≥ 0);
view this as a function of θ with x fixed. Note that
• For θ ≥ x_(n), L(θ|x) = 1/θ^n, which decreases as θ increases.
• For θ < x_(n), L(θ|x) = 0.
Therefore, L(θ|x) is maximized at θ = x_(n); i.e., θ̂ = X_(n) is the MLE of θ.
Remark: Note that in this example, we "closed the endpoints" on the support of X; i.e., the pdf of X is
f_X(x|θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.
Mathematically, this model is no different than had we "opened the endpoints." However, if we used open endpoints, note that any maximizer of L(θ|x) would have to satisfy
x_(n) < arg max_{θ>0} L(θ|x) < x_(n) + ε
for all ε > 0, and therefore the maximizer of L(θ|x); i.e., the MLE, would not exist.
Remark: If L(θ|x) is a differentiable function of θ = (θ1, θ2, ..., θ_k), then candidates for the MLE are the values of θ that solve
∂L(θ|x)/∂θ_j = 0, j = 1, 2, ..., k.
Example 7.5. Suppose that X1 , X2 , ..., Xn are iid N (θ, 1), where −∞ < θ < ∞. The
likelihood function is
L(θ|x) = ∏_{i=1}^n (1/√(2π)) e^{−(x_i−θ)²/2}
= (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (x_i−θ)²}.
The derivative, set equal to zero, is
∂L(θ|x)/∂θ = (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (x_i−θ)²} Σ_{i=1}^n (x_i − θ) = 0;
the exponential factor can never be zero, so
Σ_{i=1}^n (x_i − θ) = 0,
which is solved by θ = x̄. Because
∂²L(θ|x)/∂θ², evaluated at θ = x̄, equals −n (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (x_i−x̄)²} < 0,
the function L(θ|x) is concave down at θ = x̄; i.e., θ̂ = x̄ maximizes L(θ|x). Therefore,
θ̂ = θ̂(X) = X̄
is the MLE of θ.
Illustration: Under the N (θ, 1) model assumption, I graphed in Figure 7.1 the likelihood
function L(θ|x) after observing x1 = 2.437, x2 = 0.993, x3 = 1.123, x4 = 1.900, and
x5 = 3.794 (an iid sample of size n = 5). The sample mean x = 2.049 is our ML estimate of
θ based on this sample x.
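The following R sketch reproduces a plot like Figure 7.1 from these five observations:

xdata <- c(2.437, 0.993, 1.123, 1.900, 3.794)
lik <- function(theta) sapply(theta, function(th) prod(dnorm(xdata, mean = th, sd = 1)))
curve(lik(x), from = 0, to = 5, xlab = expression(theta), ylab = "Likelihood function")
abline(v = mean(xdata), lty = 2)   # MLE = sample mean = 2.0494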
Figure 7.1: Plot of L(θ|x) versus θ in Example 7.5. The data x were generated from a
N (θ = 1.5, 1) distribution with n = 5. The sample mean (MLE) is x = 2.049.
Remark: Because the natural log function is strictly increasing,
θ̂(x) = arg max_{θ∈Θ} L(θ|x) = arg max_{θ∈Θ} ln L(θ|x),
and it is usually easier to work with the log-likelihood function ln L(θ|x). Candidates for the MLE then solve the score equations
∂ ln L(θ|x)/∂θ_j = 0, j = 1, 2, ..., k.
Example 7.6. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). The likelihood function is
L(θ|x) = ∏_{i=1}^n (1/√(2πσ²)) e^{−(x_i−µ)²/(2σ²)}
= (1/(2πσ²))^{n/2} e^{−(1/(2σ²)) Σ_{i=1}^n (x_i−µ)²}.
The log-likelihood function is
ln L(θ|x) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)².
The score equations are
∂ ln L(θ|x)/∂µ = (1/σ²) Σ_{i=1}^n (x_i − µ) = 0
∂ ln L(θ|x)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)² = 0.
Clearly µ̂ = x̄ solves the first equation; inserting µ̂ = x̄ into the second equation and solving for σ² gives σ̂² = n⁻¹ Σ_{i=1}^n (x_i − x̄)². A first-order critical point is (x̄, n⁻¹ Σ_{i=1}^n (x_i − x̄)²).
Example 7.7. Suppose X1 and X2 are independent with
X1 ∼ b(n1, p1)
X2 ∼ b(n2, p2),
where 0 < p1 < 1 and 0 < p2 < 1. The likelihood function of θ = (p1, p2) is
L(θ|x1, x2) = C(n1, x1) p1^{x1}(1 − p1)^{n1−x1} C(n2, x2) p2^{x2}(1 − p2)^{n2−x2}.
Suppose we wish to maximize the likelihood subject to the restriction p1 = p2. We can use Lagrange multipliers to maximize ln L(θ|x1, x2) subject to the constraint that
Example 7.8. Logistic regression. In practice, finding maximum likelihood estimates usu-
ally requires numerical methods. Suppose Y1 , Y2 , ..., Yn are independent Bernoulli random
variables; specifically, Yi ∼ Bernoulli(pi ), where
ln[p_i/(1 − p_i)] = β0 + β1 x_i ⟺ p_i = exp(β0 + β1 x_i)/[1 + exp(β0 + β1 x_i)].
In this model, the x_i's are fixed constants. The likelihood function of θ = (β0, β1) is
L(θ|y) = ∏_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i}
= ∏_{i=1}^n [exp(β0 + β1 x_i)/(1 + exp(β0 + β1 x_i))]^{y_i} [1 − exp(β0 + β1 x_i)/(1 + exp(β0 + β1 x_i))]^{1−y_i}.
Taking logarithms and simplifying gives
ln L(θ|y) = Σ_{i=1}^n [ y_i(β0 + β1 x_i) − ln(1 + e^{β0+β1 x_i}) ].
Closed-form expressions for the maximizers βb0 and βb1 do not exist except in very simple
situations. Numerical methods are needed to maximize ln L(θ|y); e.g., iteratively re-weighted
least squares (the default method in R’s glm function).
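A hedged R sketch (the values of β0, β1, and n used to simulate data are arbitrary choices): the log-likelihood above can be maximized numerically with optim, and the answer agrees with glm:

set.seed(42)
n <- 100; x <- rnorm(n)
beta <- c(-0.5, 1.2)                          # true (beta0, beta1), arbitrary
y <- rbinom(n, size = 1, prob = plogis(beta[1] + beta[2] * x))
loglik <- function(b) sum(y * (b[1] + b[2] * x) - log(1 + exp(b[1] + b[2] * x)))
fit1 <- optim(c(0, 0), loglik, control = list(fnscale = -1))  # maximize log-likelihood
fit2 <- glm(y ~ x, family = binomial)                         # IRLS fit
rbind(optim = fit1$par, glm = coef(fit2))                     # nearly identical estimates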
Theorem 7.2.10 (Invariance property of MLEs). Suppose θ̂ is the MLE of θ. For any function τ(θ), the MLE of τ(θ) is τ(θ̂).
Proof. For simplicity, suppose θ is a scalar parameter and that τ : R → R is one-to-one (over
Θ). In this case,
η = τ (θ) ⇐⇒ θ = τ −1 (η).
The likelihood function of interest is L∗ (η). It suffices to show that L∗ (η) is maximized when
η = τ (θ),
b where θb is the maximizer of L(θ). For simplicity in notation, I drop emphasis of a
likelihood function's dependence on x. Let η̂ be a maximizer of L*(η). Then
L*(η̂) = sup_η L*(η) = sup_η L(τ⁻¹(η)) = sup_θ L(θ) = L(θ̂) = L*(τ(θ̂)),
so η̂ = τ(θ̂) maximizes L*(η); i.e., the MLE of η = τ(θ) is τ(θ̂). □
Remark: Our proof assumes that τ is a one-to-one function. However, Theorem 7.2.10 is
true for any function; see pp 319-320 (CB).
Example 7.9. Suppose X1 , X2 , ..., Xn are iid exponential(β), where β > 0. The likelihood
function is
L(β|x) = ∏_{i=1}^n (1/β) e^{−x_i/β} = (1/β^n) e^{−Σ_{i=1}^n x_i/β}.
The log-likelihood function is
ln L(β|x) = −n ln β − Σ_{i=1}^n x_i/β.
The score equation becomes
∂ ln L(β|x)/∂β = −n/β + Σ_{i=1}^n x_i/β² = 0.
Solving the score equation for β gives β̂ = x̄. It is easy to show that this value maximizes ln L(β|x). Therefore,
β̂ = β̂(X) = X̄
is the MLE of β.
• For t fixed, e^{−t/X̄} is the MLE of S_X(t|β) = e^{−t/β}, the survivor function of X at t.
7.2.3 Bayesian estimation
Bayesians do not consider the parameter θ to be fixed. They regard θ as random, having its own probability distribution. Therefore, Bayesians think of inference in this way:
Model θ ∼ π(θ) −→ Observe X|θ ∼ fX (x|θ) −→ Update with π(θ|x).
The model for θ on the front end is called the prior distribution. The model on the
back end is called the posterior distribution. The posterior distribution combines prior
information (supplied through the prior model) and the observed data x. For a Bayesian,
all inference flows from the posterior distribution.
Important: Here are the relevant probability distributions that arise in a Bayesian context.
These are given “in order” as to how the Bayesian uses them. Continue to assume that θ is
a scalar.
1. The prior distribution π(θ), which quantifies beliefs about θ before seeing the data.
2. The conditional distribution f_{X|θ}(x|θ) of the data given θ (the model).
3. The joint distribution f_{X,θ}(x, θ) = f_{X|θ}(x|θ)π(θ).
4. The marginal distribution of X,
m_X(x) = ∫_Θ f_{X|θ}(x|θ)π(θ) dθ,
obtained by integrating θ out of the joint distribution (a sum replaces the integral if θ is discrete).
5. The posterior distribution π(θ|x) = f_{X|θ}(x|θ)π(θ)/m_X(x).
Remark: The process of starting with π(θ) and performing the necessary calculations to
end up with π(θ|x) is informally known as “turning the Bayesian crank.” The distributions
above can be viewed as steps in a “recipe” for posterior construction (i.e., start with the
prior and the conditional, calculate the joint, calculate the marginal, calculate the posterior).
We will see momentarily that not all steps are needed. In fact, in practice, computational
techniques are used to essentially bypass Step 4 altogether. You can see that this might be
desirable, especially if θ is a vector (and perhaps high-dimensional).
Example 7.10. Suppose that, conditional on θ, X1 , X2 , ..., Xn are iid Poisson(θ), where the
prior distribution for θ ∼ gamma(a, b), a, b known. We now turn the Bayesian crank.
1. Prior distribution.
π(θ) = [1/(Γ(a)b^a)] θ^{a−1} e^{−θ/b} I(θ > 0).
2. Conditional distribution. For x_i = 0, 1, 2, ...,
f_{X|θ}(x|θ) = ∏_{i=1}^n θ^{x_i} e^{−θ}/x_i! = θ^{Σ_{i=1}^n x_i} e^{−nθ} / ∏_{i=1}^n x_i!.
3.-5. Completing the remaining steps (joint, marginal, posterior) shows that the posterior distribution is
θ | X = x ∼ gamma(a*, b*),
where
a* = Σ_{i=1}^n x_i + a and b* = 1/(n + 1/b).
Along the way, the marginal distribution of X works out to be
m_X(x) = [1/(∏_{i=1}^n x_i! Γ(a)b^a)] Γ( Σ_{i=1}^n x_i + a ) [1/(n + 1/b)]^{Σ_{i=1}^n x_i + a}.
Remark: Note that the shape and scale parameters of the posterior distribution π(θ|x)
depend on
In this sense, the posterior distribution combines information from the prior and the data.
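A small R sketch of Example 7.10 (the prior values a = 2 and b = 3, the data-generating value θ = 4, and n = 20 are arbitrary choices):

set.seed(713)
a <- 2; b <- 3                          # prior: theta ~ gamma(shape = a, scale = b)
n <- 20; xdat <- rpois(n, lambda = 4)   # data simulated with theta = 4
a.star <- sum(xdat) + a                 # posterior shape
b.star <- 1 / (n + 1/b)                 # posterior scale
a.star * b.star                                          # posterior mean E(theta | x)
(n*b/(n*b + 1)) * mean(xdat) + (1/(n*b + 1)) * a * b     # same value, weighted-average form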
Note that in Example 7.10 (the Poisson-gamma example), the posterior mean equals
θ̂_B = E(θ|X = x) = ( Σ_{i=1}^n x_i + a )/(n + 1/b)
= [nb/(nb + 1)] x̄ + [1/(nb + 1)] ab.
That is, the posterior mean is a weighted average of the sample mean x and the prior
mean ab. Note also that as the sample size n increases, more weight is given to the data
(through x̄) and less weight is given to the prior (through the prior mean).
At this step, we can clearly identify the kernel of the posterior distribution. We can therefore
skip calculating the marginal distribution mX (x) in Step 4, because we know mX (x) does
not depend on θ. Because of this, it is common to write, in general,
π(θ|x) ∝ L(θ|x) π(θ).
The posterior distribution is proportional to the likelihood function times the prior distri-
bution. A (classical) Bayesian analysis requires these two functions L(θ|x) and π(θ) only.
This shows that the posterior distribution will depend on the data x through the value of the
sufficient statistic t = T (x). We can therefore write the posterior distribution as depending
on t only; i.e.,
π(θ|t) ∝ fT |θ (t|θ)π(θ),
and restrict attention to the (sampling) distribution of T = T (X) from the beginning.
Example 7.11. Suppose that X1 , X2 , ..., Xn are iid Bernoulli(θ), where the prior distribution
for θ ∼ beta(a, b), a, b known. We know that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic for the Bernoulli family and that T ∼ b(n, θ). Therefore, for t = 0, 1, 2, ..., n and 0 < θ < 1, the posterior distribution
π(θ|t) ∝ f_{T|θ}(t|θ)π(θ)
= C(n, t) θ^t(1 − θ)^{n−t} [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1}(1 − θ)^{b−1}
= {C(n, t) Γ(a + b)/(Γ(a)Γ(b))} θ^{t+a−1}(1 − θ)^{n−t+b−1},
where the leading factor does not depend on θ and the remaining factor is a beta(a*, b*) kernel with a* = t + a and b* = n − t + b. Therefore, θ | T = t ∼ beta(t + a, n − t + b).
Definition: Let F = {fX (x|θ) : θ ∈ Θ} denote a class of pdfs or pmfs. A class Π of prior
distributions is said to be a conjugate prior family for F if the posterior distribution also
belongs to Π.
Example 7.12. Suppose X1, X2, ..., Xn are iid N(µ, σ²), where −∞ < µ < ∞ and σ² > 0. Commonly used priors in this model are
µ ∼ N(ξ, τ²), ξ, τ² known,
σ² ∼ IG(a, b), a, b known.
7.3 Methods of Evaluating Estimators
7.3.1 Bias, variance, and MSE
Definition: The bias of a point estimator W = W(X) of θ is Bias_θ(W) = E_θ(W) − θ; W is unbiased if E_θ(W) = θ for all θ ∈ Θ. The mean squared error (MSE) of W is
MSE_θ(W) = E_θ[(W − θ)²] = var_θ(W) + Bias²_θ(W).
If W is unbiased, then
E_θ(W) = θ ⟹ Bias_θ(W) = E_θ(W) − θ = 0.
In this case,
MSEθ (W ) = varθ (W ).
Obviously, we prefer estimators with small MSE because these estimators have small bias
(i.e., high accuracy) and small variance (i.e., high precision).
Example 7.13. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters unknown. Set θ = (µ, σ 2 ). Recall that our “usual” sample variance
estimator is
S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²
and, for all θ,
E_θ(S²) = σ² and var_θ(S²) = 2σ⁴/(n − 1).
Consider the "competing estimator"
Ŝ² = (1/n) Σ_{i=1}^n (X_i − X̄)².
Note that
Ŝ² = [(n−1)/n] S² ⟹ E_θ(Ŝ²) = E_θ( [(n−1)/n] S² ) = [(n−1)/n] E_θ(S²) = [(n−1)/n] σ².
That is, the estimator Ŝ² is biased; it underestimates σ² on average.
Comparison: Let's compare S² and Ŝ² on the basis of MSE. Because S² is an unbiased estimator of σ²,
MSE_θ(S²) = var_θ(S²) = 2σ⁴/(n − 1).
The MSE of Ŝ² is MSE_θ(Ŝ²) = var_θ(Ŝ²) + Bias²_θ(Ŝ²). The variance of Ŝ² is
var_θ(Ŝ²) = var_θ( [(n−1)/n] S² ) = [(n−1)/n]² var_θ(S²) = [(n−1)/n]² [2σ⁴/(n−1)] = 2(n−1)σ⁴/n².
The bias of Ŝ² is
E_θ(Ŝ² − σ²) = E_θ(Ŝ²) − σ² = [(n−1)/n] σ² − σ² = −σ²/n.
Therefore,
MSE_θ(Ŝ²) = 2(n−1)σ⁴/n² + ( [(n−1)/n] σ² − σ² )² = [(2n − 1)/n²] σ⁴.
Finally, to compare MSE_θ(S²) with MSE_θ(Ŝ²), we are left to compare the constants
2/(n − 1) and (2n − 1)/n².
Note that the ratio
[(2n − 1)/n²] / [2/(n − 1)] = (2n² − 3n + 1)/(2n²) < 1,
for all n ≥ 2. Therefore,
MSE_θ(Ŝ²) < MSE_θ(S²),
showing that Ŝ² is a "better" estimator than S² on the basis of MSE.
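A simulation sketch of this comparison (µ = 0, σ² = 4, n = 10, and the number of Monte Carlo replications are arbitrary choices):

set.seed(1)
n <- 10; sigma2 <- 4; B <- 1e5
S2 <- replicate(B, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))   # usual sample variance
S2b <- (n - 1)/n * S2                                            # competing estimator
c(mean((S2  - sigma2)^2),  2*sigma2^2/(n - 1),     # MSE of S2,  about 3.56
  mean((S2b - sigma2)^2), (2*n - 1)*sigma2^2/n^2)  # MSE of S2b, about 3.04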
• If both W1 and W2 are unbiased, we prefer the estimator with the smaller variance.
• If either W1 or W2 is biased (or perhaps both are biased), we prefer the estimator with
the smaller MSE.
There is no guarantee that one estimator, say W1 , will always beat the other for all θ ∈ Θ
(i.e., for all values of θ in the parameter space). For example, it may be that W1 has smaller
MSE for some values of θ ∈ Θ, but larger MSE for other values.
Remark: In some situations, we might have a biased estimator, but we can calculate its
bias. We can then “adjust” the (biased) estimator to make it unbiased. I like to call this
“making biased estimators unbiased.” The following example illustrates this.
Example 7.14. Suppose that X1 , X2 , ..., Xn are iid U[0, θ], where θ > 0. We know (from
Example 7.4) that the MLE of θ is X(n) , the maximum order statistic. It is easy to show
that
E_θ(X_(n)) = [n/(n + 1)] θ.
The MLE is biased because E_θ(X_(n)) ≠ θ. However, the estimator
[(n + 1)/n] X_(n)
is unbiased.
The estimator W1 = [(n + 1)/n] X_(n) is an unbiased version of the MLE. The estimator W2 = 2X̄ is the MOM estimator (which is also unbiased). I have calculated
var_θ(W1) = θ²/[n(n + 2)] and var_θ(W2) = θ²/(3n).
It is easy to see that varθ (W1 ) ≤ varθ (W2 ), for all n ≥ 2. Therefore, W1 is a “better”
estimator on the basis of this variance comparison. Are you surprised?
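A simulation sketch of the comparison (θ = 5, n = 10, and B are arbitrary choices):

set.seed(2)
theta <- 5; n <- 10; B <- 1e5
x <- matrix(runif(B * n, min = 0, max = theta), ncol = n)
W1 <- (n + 1)/n * apply(x, 1, max)   # unbiased version of the MLE
W2 <- 2 * rowMeans(x)                # MOM estimator
c(var(W1), theta^2/(n*(n + 2)),      # about 0.208
  var(W2), theta^2/(3*n))            # about 0.833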
Curiosity: Might there be another unbiased estimator, say W3 = W3 (X) that is “better”
than both W1 and W2 ? If a better (unbiased) estimator does exist, how do we find it?
Define C_τ = {W = W(X) : E_θ(W) = τ(θ) for all θ ∈ Θ}; that is, C_τ is the collection of all unbiased estimators of τ(θ). Our goal is to find the (unbiased) estimator W* ∈ C_τ that has the smallest variance.
Remark: On the surface, this task seems somewhat insurmountable because C_τ is a very large class. In Example 7.14, for example, both W1 = [(n+1)/n] X_(n) and W2 = 2X̄ are unbiased estimators of θ. However, so is the convex combination
W_a = W_a(X) = a [(n+1)/n] X_(n) + (1 − a) 2X̄,
for any a ∈ [0, 1].
Remark: It seems that our discussion of “best” estimators starts with the restriction that
we will consider only those that are unbiased. If we did not make a restriction like this,
then we would have to deal with too many estimators, many of which are nonsensical. For
example, suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0.
• The estimators X and S 2 emerge as candidate estimators because they are unbiased.
• However, suppose we widen our search to consider all possible estimators and then try
to find the one with the smallest MSE. Consider the estimator θ̂ = 17.
• We want to exclude nonsensical estimators like this. Our solution is to restrict attention
to estimators that are unbiased.
Approach 1: Determine a lower bound, say B(θ), on the variance of any unbiased esti-
mator of τ (θ). Then, if we can find an unbiased estimator W ∗ whose variance attains this
lower bound, that is,
varθ (W ∗ ) = B(θ),
for all θ ∈ Θ, then we know that W ∗ is UMVUE.
Approach 2: Link the notion of being “best” with that of sufficiency and completeness.
Theorem 7.3.9 (Cramér-Rao Inequality). Suppose X ∼ f_X(x|θ), where the pdf/pmf is such that, for any function h(x) with E_θ[|h(X)|] < ∞, the interchange
d/dθ ∫ h(x) f_X(x|θ) dx = ∫ h(x) [∂/∂θ f_X(x|θ)] dx
is justified; i.e., we can interchange the derivative and integral (derivative and sum if X is discrete). For any estimator W(X) with var_θ[W(X)] < ∞, the following inequality holds:
var_θ[W(X)] ≥ ( d/dθ E_θ[W(X)] )² / E_θ{ [∂/∂θ ln f_X(X|θ)]² }.
The quantity on the RHS is called the Cramér-Rao Lower Bound (CRLB) on the variance of the estimator W(X).
Remark: Note that in the statement of the CRLB in Theorem 7.3.9, we haven’t said exactly
what W (X) is an estimator for. This is to preserve the generality of the result; Theorem 7.3.9
holds for any estimator with finite variance. However, given our desire to restrict attention
to unbiased estimators, we will usually consider one of these cases:
Important special case (Corollary 7.3.10): When X consists of X1 , X2 , ..., Xn which are
iid from the population fX (x|θ), then the denominator in Theorem 7.3.9
E_θ{ [∂/∂θ ln f_X(X|θ)]² } = n E_θ{ [∂/∂θ ln f_X(X1|θ)]² },
where the expectation on the RHS is taken with respect to a single observation.
Lemma 7.3.11 (Information Equality): Under fairly mild assumptions (which hold for
exponential families, for example), the Fisher information based on one observation
I1(θ) = E_θ{ [∂/∂θ ln f_X(X|θ)]² } = −E_θ[ ∂²/∂θ² ln f_X(X|θ) ].
Preview: In Chapter 10, we will investigate the large-sample properties of MLEs. Under
certain regularity conditions, we will show an MLE θ̂ satisfies
√n(θ̂ − θ) →d N(0, σ²_θ̂),
where the asymptotic variance is σ²_θ̂ = [I1(θ)]⁻¹; in the multiparameter case, the limiting covariance matrix is Σ = [I1(θ)]⁻¹.
Example 7.15. Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. Find the CRLB on
the variance of unbiased estimators of τ (θ) = θ.
Solution. We know that the CRLB is
1/I_n(θ) = 1/[n I1(θ)],
where
I1(θ) = E_θ{ [∂/∂θ ln f_X(X|θ)]² } = −E_θ[ ∂²/∂θ² ln f_X(X|θ) ].
For x = 0, 1, 2, ...,
ln f_X(x|θ) = ln[ θ^x e^{−θ}/x! ] = x ln θ − θ − ln x!.
Therefore,
∂/∂θ ln f_X(x|θ) = x/θ − 1
∂²/∂θ² ln f_X(x|θ) = −x/θ².
The Fisher information based on one observation is
I1(θ) = −E_θ[ ∂²/∂θ² ln f_X(X|θ) ] = −E_θ(−X/θ²) = 1/θ.
Therefore,
CRLB = 1/[n I1(θ)] = θ/n.
Example 7.16. Suppose X1 , X2 , ..., Xn are iid gamma(α0 , β), where α0 is known and β > 0.
Find the CRLB on the variance of unbiased estimators of β.
Solution. We know that the CRLB is
1/I_n(β) = 1/[n I1(β)],
where
I1(β) = E_β{ [∂/∂β ln f_X(X|β)]² } = −E_β[ ∂²/∂β² ln f_X(X|β) ].
For x > 0,
ln f_X(x|β) = ln[ (1/(Γ(α0)β^{α0})) x^{α0−1} e^{−x/β} ] = −ln Γ(α0) − α0 ln β + (α0 − 1) ln x − x/β.
Therefore,
∂/∂β ln f_X(x|β) = −α0/β + x/β²
∂²/∂β² ln f_X(x|β) = α0/β² − 2x/β³.
The Fisher information based on one observation is
I1(β) = −E_β[ ∂²/∂β² ln f_X(X|β) ] = −E_β( α0/β² − 2X/β³ ) = α0/β².
Therefore,
CRLB = 1/[n I1(β)] = β²/(nα0).
1. Show that
W(X) = (nα0 − 1)/(nX̄)
is an unbiased estimator of τ(β) = 1/β.
2. Derive the CRLB for the variance of unbiased estimators of τ (β) = 1/β.
3. Calculate varβ [W (X)] and show that it is strictly larger than the CRLB (i.e., the
variance does not attain the CRLB).
Q: Does this necessarily imply that W (X) cannot be the UMVUE of τ (β) = 1/β?
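A simulation sketch of items 1-3 (α0 = 3, β = 2, n = 5, and B are arbitrary choices; the exact variance formula 1/((nα0 − 2)β²) quoted in the comments is my own side calculation, not taken from the notes):

set.seed(4)
alpha0 <- 3; beta <- 2; n <- 5; B <- 2e5
x <- matrix(rgamma(B * n, shape = alpha0, scale = beta), ncol = n)
W <- (n*alpha0 - 1) / rowSums(x)           # W(X) = (n*alpha0 - 1)/(n*xbar)
c(mean(W), 1/beta)                         # item 1: W is unbiased for 1/beta
c(var(W), 1/((n*alpha0 - 2)*beta^2),       # exact variance of W (own calculation)
  1/(n*alpha0*beta^2))                     # item 2: CRLB, which is strictly smaller (item 3)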
Remark: In general, the CRLB offers a lower bound on the variance of any unbiased
estimator of τ (θ). However, this lower bound may be unattainable. That is, the CRLB may
be strictly smaller than the variance of any unbiased estimator. If this is the case, then our
“CRLB approach” to finding an UMVUE will not be helpful.
Example 7.16 (continued). Suppose X1 , X2 , ..., Xn are iid gamma(α0 , β), where α0 is known
and β > 0. The likelihood function is
L(β|x) = ∏_{i=1}^n [1/(Γ(α0)β^{α0})] x_i^{α0−1} e^{−x_i/β}
= [1/(Γ(α0)β^{α0})]^n ( ∏_{i=1}^n x_i )^{α0−1} e^{−Σ_{i=1}^n x_i/β}.
The score function is
S(β|x) = ∂/∂β ln L(β|x) = −nα0/β + Σ_{i=1}^n x_i/β² = (nα0/β²)[ Σ_{i=1}^n x_i/(nα0) − β ] = a(β)[W(x) − β],
where
W(x) = Σ_{i=1}^n x_i/(nα0) = x̄/α0.
We have written the score function S(β|x) as a linear function of W(x) = x̄/α0. Because W(X) = X̄/α0 is an unbiased estimator of τ(β) = β (shown previously), the variance var_β[W(X)] attains the CRLB for the variance of unbiased estimators of τ(β) = β.
Remark: The attainment result is interesting, but I have found that its usefulness may be
limited if you want to find the UMVUE. Even if we can write
S(θ|x) = a(θ)[W(x) − τ(θ)],
where E_θ[W(X)] = τ(θ), the RHS might involve a function τ(θ) that we have no desire to estimate. To illustrate this, suppose X1, X2, ..., Xn are iid beta(θ, 1), where θ > 0. The score function is
S(θ|x) = n/θ + Σ_{i=1}^n ln x_i
= −n[ −(1/n) Σ_{i=1}^n ln x_i − 1/θ ]
= a(θ)[W(x) − τ(θ)],
with a(θ) = −n, W(x) = −(1/n) Σ_{i=1}^n ln x_i, and τ(θ) = 1/θ.
Unresolved issues:
1. What if fX (x|θ) does not satisfy the regularity conditions needed for the Cramér-Rao
Inequality to apply? For example, X ∼ U(0, θ).
Remark: We now move to our "second approach" for finding UMVUEs. This approach involves sufficiency and completeness, two topics we discussed in the last chapter. It also lets us address the unresolved issues noted above.
Theorem 7.3.17 (Rao-Blackwell). Suppose W = W(X) is an unbiased estimator of τ(θ) and T = T(X) is a sufficient statistic for θ. Define
φ(T) = E(W|T).
Then, first, E_θ[φ(T)] = τ(θ) for all θ ∈ Θ; i.e., φ(T) is also unbiased. Second, var_θ[φ(T)] ≤ var_θ(W) for all θ ∈ Θ; i.e., φ(T) is a uniformly better unbiased estimator of τ(θ).
Remark: To use the Rao-Blackwell Theorem, some students think they have to actually compute the conditional expectation E(W|T) for some initial unbiased estimator W.
This is not the case at all! Because φ(T ) = E(W |T ) is a function of the sufficient statistic
T , the Rao-Blackwell result simply convinces us that in our search for the UMVUE, we can
restrict attention to those estimators that are functions of a sufficient statistic.
Q: In the proof of the Rao-Blackwell Theorem, where did we use the fact that T was
sufficient?
A: Nowhere. Thus, it would seem that conditioning on any statistic, sufficient or not, will result in an improvement over the unbiased W. However, there is a catch: if T is not sufficient, then φ(T) = E(W|T) may depend on θ, in which case φ(T) is not a statistic and hence is not an estimator at all.
Remark: To understand how we can use the Rao-Blackwell result in our quest to find a
UMVUE, we need two additional results. One deals with uniqueness; the other describes an
interesting characterization of a UMVUE itself.
Theorem 7.3.19. If W is a UMVUE of τ(θ), then W is unique.
Proof. Suppose W and W′ are both UMVUEs of τ(θ), and define W* = (W + W′)/2, which is also unbiased for τ(θ). Then
var_θ(W*) = var_θ( (1/2)(W + W′) )
= (1/4) var_θ(W) + (1/4) var_θ(W′) + (1/2) cov_θ(W, W′)
≤ (1/4) var_θ(W) + (1/4) var_θ(W′) + (1/2) [var_θ(W) var_θ(W′)]^{1/2}
= var_θ(W),
where the inequality arises from the covariance inequality (CB, pp 188, application of
Cauchy-Schwarz) and the final equality holds because both W and W 0 are UMVUE by
assumption (so their variances must be equal). Therefore, we have shown that
1. E_θ(W*) = τ(θ), and
2. var_θ(W*) ≤ var_θ(W).
Because W is UMVUE (by assumption), the inequality in (2) can not be strict (or else it
would contradict the fact that W is UMVUE). Therefore, it must be true that
varθ (W ∗ ) = varθ (W ).
This implies that the inequality above (arising from the covariance inequality) is an equality;
therefore,
cov_θ(W, W′) = [var_θ(W) var_θ(W′)]^{1/2}.
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
Therefore,
by Theorem 4.5.7 (CB, pp 172), where a(θ) and b(θ) are constants. It therefore suffices to
show that a(θ) = 1 and b(θ) = 0. Note that
Theorem 7.3.20. Suppose Eθ (W ) = τ (θ) for all θ ∈ Θ. W is UMVUE of τ (θ) if and only
if W is uncorrelated with all unbiased estimators of 0.
Proof. Necessity (=⇒): Suppose Eθ (W ) = τ (θ) for all θ ∈ Θ. Suppose W is UMVUE of
τ (θ). Suppose Eθ (U ) = 0 for all θ ∈ Θ. It suffices to show covθ (W, U ) = 0 for all θ ∈ Θ.
Define
φa = W + aU,
where a is a constant. It is easy to see that φa is an unbiased estimator of τ (θ); for all θ ∈ Θ,
E_θ(φ_a) = E_θ(W + aU) = E_θ(W) + a E_θ(U) = τ(θ),
because E_θ(U) = 0.
Also,
var_θ(φ_a) = var_θ(W + aU) = var_θ(W) + 2a cov_θ(W, U) + a² var_θ(U).
Case 1: Suppose cov_θ(W, U) < 0 for some θ. Choosing a > 0 small enough makes 2a cov_θ(W, U) + a² var_θ(U) < 0, so that var_θ(φ_a) < var_θ(W), contradicting the assumption that W is UMVUE. Therefore, it must be true that cov_θ(W, U) ≥ 0.
Case 2: Suppose cov_θ(W, U) > 0 for some θ. Choosing a < 0 with |a| small enough again gives var_θ(φ_a) < var_θ(W).
However, this again contradicts the assumption that W is UMVUE. Therefore, it must
be true that covθ (W, U ) ≤ 0.
Combining Case 1 and Case 2, we are forced to conclude that covθ (W, U ) = 0. This proves
the necessity.
Sufficiency (⇐=): Suppose Eθ (W ) = τ (θ) for all θ ∈ Θ. Suppose covθ (W, U ) = 0 for all
θ ∈ Θ where U is any unbiased estimator of zero; i.e., Eθ (U ) = 0 for all θ ∈ Θ. Let W 0 be
any other unbiased estimator of τ (θ). It suffices to show that varθ (W ) ≤ varθ (W 0 ). Write
W 0 = W + (W 0 − W )
and calculate
Summary: We are now ready to put Theorem 7.3.17 (Rao-Blackwell), Theorem 7.3.19
(UMVUE uniqueness) and Theorem 7.3.20 together. Suppose X ∼ fX (x|θ), where θ ∈ Θ.
Our goal is to find the UMVUE of τ (θ).
• Theorem 7.3.20 assures us that φ(T ) is UMVUE if and only if φ(T ) is uncorrelated
with all unbiased estimators of 0.
Add the assumption that T is a complete statistic. The only unbiased estimator of 0 in
complete families is the zero function itself. Because covθ [φ(T ), 0] = 0 holds trivially, we
have shown that φ(T ) is uncorrelated with “all” unbiased estimators of 0. Theorem 7.3.20
says that φ(T ) must be UMVUE; Theorem 7.3.19 guarantees that φ(T ) is unique.
Recipe for finding UMVUEs: Suppose we want to find the UMVUE for τ(θ).
1. Find a complete and sufficient statistic T = T(X).
2. Find a function φ(T) satisfying E_θ[φ(T)] = τ(θ) for all θ ∈ Θ, either directly or by computing φ(T) = E(W|T) for some unbiased estimator W.
Then φ(T) is the UMVUE for τ(θ). This is essentially what is summarized in Theorem 7.3.23 (CB, pp 347).
Example 7.17. Suppose X1, X2, ..., Xn are iid Poisson(θ), where θ > 0.
• We already know that X̄ is UMVUE for θ; we proved this by showing that X̄ is unbiased and that var_θ(X̄) attains the CRLB on the variance of all unbiased estimators of θ. Here is the argument based on completeness and sufficiency. The pmf of X is
f_X(x|θ) = (θ^x e^{−θ}/x!) I(x = 0, 1, 2, ...)
= [I(x = 0, 1, 2, ...)/x!] e^{−θ} e^{(ln θ)x}
= h(x)c(θ) exp{w1(θ)t1(x)}.
Therefore X has pmf in the exponential family. Theorem 6.2.10 says that
T = T(X) = Σ_{i=1}^n X_i
is a sufficient statistic. Because d = k = 1 (i.e., a full family), Theorem 6.2.25 says that T is complete. Now,
E_θ(T) = E_θ( Σ_{i=1}^n X_i ) = Σ_{i=1}^n E_θ(X_i) = nθ.
Therefore,
E_θ(T/n) = E_θ(X̄) = θ.
Because X̄ is unbiased and is a function of T, a complete and sufficient statistic, we know that X̄ is the UMVUE.
Example 7.18. Suppose X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. We have previously
shown that
T = T (X) = X(n)
is sufficient and complete (see Example 6.5 and Example 6.16, respectively, in the notes). It
follows that
E_θ(T) = E_θ(X_(n)) = [n/(n + 1)] θ
for all θ > 0. Therefore,
E_θ( [(n + 1)/n] X_(n) ) = θ.
Because (n+1)X(n) /n is unbiased and is a function of X(n) , a complete and sufficient statistic,
it must be the UMVUE.
Example 7.19. Suppose X1 , X2 , ..., Xn are iid gamma(α0 , β), where α0 is known and β > 0.
Find the UMVUE of τ (β) = 1/β.
Solution. The pdf of X is
f_X(x|β) = [1/(Γ(α0)β^{α0})] x^{α0−1} e^{−x/β} I(x > 0)
= [x^{α0−1} I(x > 0)/Γ(α0)] [1/β^{α0}] e^{(−1/β)x}
= h(x)c(β) exp{w1(β)t1(x)},
with w1(β) = −1/β and t1(x) = x. Theorems 6.2.10 and 6.2.25 imply that T = T(X) = Σ_{i=1}^n X_i is a sufficient and complete statistic, respectively. In Example 7.16 (notes), we saw that
φ(T) = (nα0 − 1)/T
is an unbiased estimator of τ(β) = 1/β. Therefore, φ(T) must be the UMVUE.
Remark: In Example 7.16, recall that the CRLB on the variance of unbiased estimators of
τ (β) = 1/β was unattainable.
Example 7.20. Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. Find the UMVUE
for
τ (θ) = Pθ (X = 0) = e−θ .
Solution. We use an approach known as “direct conditioning.” We start with
T = T(X) = Σ_{i=1}^n X_i,
which is sufficient and complete. We know that the UMVUE therefore is a function of T .
Consider forming
φ(T ) = E(W |T ),
where W is any unbiased estimator of τ(θ) = e^{−θ}. We know that φ(T) by this construction is the UMVUE; clearly φ(T) = E(W|T) is a function of T and
E_θ[φ(T)] = E_θ[E(W|T)] = E_θ(W) = e^{−θ}.
How should we choose W? Any unbiased W will "work," so let's keep our choice simple, say
W = W(X) = I(X1 = 0).
Note that
Eθ (W ) = Eθ [I(X1 = 0)] = Pθ (X1 = 0) = e−θ ,
showing that W is an unbiased estimator. Now, we just calculate φ(T ) = E(W |T ) directly.
For t fixed, we have
φ(t) = E[ I(X1 = 0) | T = t ] = P(X1 = 0 | T = t) = P_θ(X1 = 0, Σ_{i=2}^n X_i = t)/P_θ(T = t) = P_θ(X1 = 0) P_θ( Σ_{i=2}^n X_i = t )/P_θ(T = t),
the last step following because X1 and Σ_{i=2}^n X_i are independent.
We can now calculate each of these probabilities. Recall that X1 ∼ Poisson(θ), Σ_{i=2}^n X_i ∼ Poisson((n − 1)θ), and T ∼ Poisson(nθ). Therefore,
φ(t) = P_θ(X1 = 0) P_θ( Σ_{i=2}^n X_i = t )/P_θ(T = t)
= [ e^{−θ} ((n − 1)θ)^t e^{−(n−1)θ}/t! ] / [ (nθ)^t e^{−nθ}/t! ] = ( (n − 1)/n )^t.
Therefore,
φ(T) = ( (n − 1)/n )^T
is the UMVUE of τ(θ) = e^{−θ}.
Note that φ(T) = (1 − 1/n)^T = (1 − 1/n)^{nX̄} ≈ e^{−X̄} for n large. Recall that e^{−X̄} is the MLE of τ(θ) = e^{−θ} by invariance.
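A simulation sketch comparing the UMVUE and the MLE of e^{−θ} (θ = 1.5, n = 15, and B are arbitrary choices):

set.seed(5)
theta <- 1.5; n <- 15; B <- 1e5
x <- matrix(rpois(B * n, lambda = theta), ncol = n)
Tsum <- rowSums(x)
umvue <- ((n - 1)/n)^Tsum
mle <- exp(-rowMeans(x))
c(mean(umvue), mean(mle), exp(-theta))   # UMVUE is unbiased; MLE is slightly biased
c(mean((umvue - exp(-theta))^2),         # the two MSEs are very close for this n
  mean((mle - exp(-theta))^2))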
7.4 Appendix: CRLB Theory
Remark: In this section, we provide the proofs that pertain to the CRLB approach to
finding UMVUEs. These proofs are also relevant for later discussions on MLEs and their
large-sample characteristics.
Theorem 7.3.9 (restated). Suppose X ∼ f_X(x|θ) and suppose that, for any function h(x) such that E_θ[|h(X)|] < ∞ for all θ ∈ Θ, the interchange
d/dθ ∫_{R^n} h(x) f_X(x|θ) dx = ∫_{R^n} h(x) [∂/∂θ f_X(x|θ)] dx
is justified; i.e., we can interchange the derivative and integral (derivative and sum if X is discrete). Then, for any estimator W(X) with var_θ[W(X)] < ∞, the following inequality holds:
var_θ[W(X)] ≥ ( d/dθ E_θ[W(X)] )² / E_θ{ [∂/∂θ ln f_X(X|θ)]² }.
Lemma. Let
S(θ|X) = ∂/∂θ ln f_X(X|θ)
denote the score function. The score function is a zero-mean random variable; that is,
E_θ[S(θ|X)] = E_θ[ ∂/∂θ ln f_X(X|θ) ] = 0.
Proof. Write
E_θ[ ∂/∂θ ln f_X(X|θ) ] = ∫ [∂/∂θ ln f_X(x|θ)] f_X(x|θ) dx = ∫ ∂/∂θ f_X(x|θ) dx = d/dθ ∫ f_X(x|θ) dx = d/dθ (1) = 0.
The interchange of derivative and integral above is justified based on the assumptions stated in Theorem 7.3.9. Therefore, the lemma is proven. □
that is, (
2 )
∂ ∂
varθ ln fX (X|θ) = Eθ ln fX (X|θ) .
∂θ ∂θ
PAGE 60
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
We get
2
∂ ∂
covθ W (X), ln fX (X|θ) ≤ varθ [W (X)] varθ ln fX (X|θ) ,
∂θ ∂θ
that is, (
2 2 )
d ∂
Eθ [W (X)] ≤ varθ [W (X)] Eθ ln fX (X|θ) .
dθ ∂θ
n 2 o
∂
Dividing both sides by Eθ ∂θ ln fX (X|θ) gives the result. 2
Corollary 7.3.10 (Cramér-Rao Inequality−iid case). With the same regularity conditions
stated in Theorem 7.3.9, in the iid case,
d 2
E θ [W (X)]
varθ [W (X)] ≥ ndθ 2 o .
∂
nEθ ∂θ ln fX (X|θ)
PAGE 61
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
n
( 2 ) X X
X ∂ ∂ ∂
= Eθ ln fX (Xi |θ) + Eθ ln fX (Xi |θ) ln fX (Xj |θ)
i=1
∂θ i6=j
∂θ ∂θ
n
( 2 )
indep
X ∂ XX ∂
∂
= Eθ ln fX (Xi |θ) + Eθ ln fX (Xi |θ) Eθ ln fX (Xj |θ) .
i=1
∂θ i6=j
∂θ ∂θ
| {z }| {z }
= 0 = 0
In the iid case, we have just proven that In (θ) = nI(θ). Therefore, in the iid case,
[τ 0 (θ)]2
CRLB = .
nI1 (θ)
PAGE 62
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
We have shown
( 2 )
∂2
∂
Eθ ln fX (X|θ) = −Eθ ln fX (X|θ) .
∂θ2 ∂θ
PAGE 63
STAT 713: CHAPTER 7 JOSHUA M. TEBBS
[τ 0 (θ)]2
varθ [W (X)] ≥ n 2 o
∂
Eθ ∂θ
ln fX (X|θ)
iid [τ 0 (θ)]2
= n Qn 2 o .
∂
Eθ ∂θ
ln i=1 Xf (X i |θ)
Now, in the covariance inequality, we have equality when the correlation of W (X) and
∂
∂θ
ln fX (X|θ) equals ±1, which in turn implies
c(X − µX ) = Y − µY a.s.,
or restated,
∂
c[W (X) − τ (θ)] = ln fX (X|θ) − 0 a.s.
∂θ
This is an application of Theorem 4.5.7 (CB, pp 172); i.e., two random variables are per-
fectly correlated if and only if the random variables are perfectly linearly related. In these
equations, c is a constant. Also, I have written “−0” on the RHS of the last equation to
emphasize that " #
n
∂ ∂ Y
Eθ ln fX (X|θ) = Eθ ln fX (Xi |θ) = 0.
∂θ ∂θ i=1
Also, W (X) is an unbiased estimator of τ (θ) by assumption. Therefore, we have
∂
c[W (X) − τ (θ)] = ln fX (X|θ)
∂θ
n
∂ Y
= ln fX (Xi |θ)
∂θ i=1
∂
= ln L(θ|X)
∂θ
= S(θ|X),
where S(θ|X) is the score function. The constant c cannot depend on W (X) nor on
∂
∂θ
ln fX (X|θ), but it can depend on θ. To emphasize this, we write
Thus, varθ [W (X)] attains the CRLB when the score function S(θ|X) can be written as a
linear function of the unbiased estimator W (X). 2
PAGE 64
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
8 Hypothesis Testing
8.1 Introduction
H0 : θ ∈ Θ0
Example 8.1. Suppose X1 , X2 , ..., Xn are iid N (θ, σ02 ), where −∞ < θ < ∞ and σ02 is
known. Consider testing
H0 : θ = θ0
versus
H1 : θ 6= θ0 ,
where θ0 is a specified value of θ. The null parameter space Θ0 = {θ0 }, a singleton. The
alternative parameter space Θc0 = R \ {θ0 }.
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
PAGE 65
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Goal: In a statistical hypothesis testing problem, we decide between the two complementary
hypotheses H0 and H1 on the basis of observing X = x. In essence, a hypothesis test is a
specification of the test function
• The subset of X for which H0 is rejected is called the rejection region, denoted by
R.
• The subset of X for which H0 is not rejected is called the acceptance region, denoted
by Rc .
If
1, x ∈ R
φ(x) = I(x ∈ R) =
0, x ∈ Rc ,
the test is said to be non-randomized.
Example 8.2. Suppose X ∼ b(10, θ), where 0 < θ < 1, and consider testing
H0 : θ ≥ 0.35
versus
H1 : θ < 0.35.
PAGE 66
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1. We would like to work with test statistics that are sensible and confer tests with nice
statistical properties (does sufficiency play a role?)
2. We would like to find the sampling distribution of W under H0 and H1 .
Example 8.3. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Consider testing
H0 : σ 2 = 40
versus
H1 : σ 2 6= 40.
Example 8.4. McCann and Tebbs (2009) summarize a study examining perceived unmet
need for dental health care for people with HIV infection. Baseline in-person interviews were
conducted with 2,864 HIV infected individuals (aged 18 years and older) as part of the HIV
Cost and Services Utilization Study. Define
X1 = number of patients
with private insurance
X2 = number of patients
with medicare and private insurance
X3 = number of patients
without insurance
X4 = number of patients
with medicare but no private insurance.
Set X = (X1 , X2 , X3 , X4 ) and model X ∼ mult(2864, p1 , p2 , p3 , p4 ; 4i=1 pi = 1). Under this
P
assumption, consider testing
1
H0 : p1 = p2 = p3 = p4 = 4
versus
H1 : H0 not true.
Note that an observation like x = (0, 0, 0, 2864) should lead to a rejection of H0 . An obser-
vation like x = (716, 716, 716, 716) should not. What about x = (658, 839, 811, 556)? Can
we find a reasonable one-dimensional test statistic?
PAGE 67
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
where fX (x|θ) is the common population distribution (in the iid case). Recall that Θ is the
parameter space.
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θ \ Θ0
is defined by
sup L(θ|x)
θ∈Θ0
λ(x) = .
sup L(θ|x)
θ∈Θ
Intuition: The numerator of λ(x) is the largest the likelihood function can be over the null
parameter space Θ0 . The denominator is the largest the likelihood function can be over the
entire parameter space Θ. Clearly,
0 ≤ λ(x) ≤ 1.
PAGE 68
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
The form of the rejection region above says to “reject H0 when λ(x) is too small.” When
λ(x) is small, the data x are not consistent with the collection of models under H0 .
where θb0 is the MLE of θ subject to the constraint that θ ∈ Θ0 . That is, θb0 is the
value of θ that maximizes L(θ|x) over the null parameter space Θ0 . We call θb0 the
restricted MLE.
where θb is the MLE of θ. That is, θb is the value of θ that maximizes L(θ|x) over the
entire parameter space Θ. We call θb the unrestricted MLE.
L(θb0 |x)
λ(x) = .
L(θ|x)
b
This notation is easier and emphasizes how the definition of λ(x) is tied to maximum
likelihood estimation.
H0 : θ = θ 0 ,
That is, there is only one value of θ “allowed” under H0 . We are therefore maximizing the
likelihood function L(θ|x) over a single point in Θ.
Large-sample intuition: We will learn in Chapter 10 that (under suitable regularity con-
ditions), an MLE
p
θb −→ θ, as n → ∞,
i.e., “MLEs are consistent” (I have switched to the scalar case here only for convenience).
In the light of this asymptotic result, consider each of the following cases:
PAGE 69
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
The MLEs θb0 and θb are converging to the same quantity (in probability) so they should
be close to each other in large samples. Therefore, we would expect
L(θb0 |x)
λ(x) =
L(θ|x)
b
to be “close” to 1.
but θb0 ∈ Θ0 because θb0 is calculated by maximizing L(θ|x) over Θ0 (i.e., θb0 can never
“escape from” Θ0 ). Therefore, there is no guarantee that θb0 and θb will be close to each
other in large samples, and, in fact, the ratio
L(θb0 |x)
λ(x) =
L(θ|x)
b
• This is why (at least by appealing to large-sample intuition) it makes sense to reject
H0 when λ(x) is small.
Example 8.5. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 = 1.
Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
Θ0 = {µ0 }, a singleton
Θ = {µ : −∞ < µ < ∞}.
PAGE 70
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Clearly, n
1 1 Pn 2
sup L(µ|x) = L(µ0 |x) = √ e− 2 i=1 (xi −µ0 ) .
µ∈Θ0 2π
Over the entire parameter space Θ, the MLE is µb = X; see Example 7.5 (notes, pp 31).
Therefore, n
1 1 Pn 2
sup L(µ|x) = L(x|x) = √ e− 2 i=1 (xi −x) .
µ∈Θ 2π
R = {x ∈ X : λ(x) ≤ c} = {x ∈ X : |x − µ0 | ≥ c0 }.
Rejecting H0 when λ(x) is “too small” is the same as rejecting H0 when |x − µ0 | is “too
large.” The latter decision rule makes sense intuitively. Note that we have written our LRT
rejection region and the corresponding test function
1, |x − µ0 | ≥ c0
0
φ(x) = I(x ∈ R) = I(|x − µ0 | ≥ c ) =
0, |x − µ0 | < c0
PAGE 71
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
where −∞ < θ < ∞. Note that this is a location exponential population pdf; the location
parameter is θ. Consider testing
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
Note that W (X) = X(1) is a sufficient statistic by the Factorization Theorem. The relevant
parameter spaces are
Θ0 = {θ : −∞ < θ ≤ θ0 }
Θ = {θ : −∞ < θ < ∞}.
PAGE 72
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
• Therefore, L(θ|x) is an increasing function when θ is less than or equal to the minimum
order statistic x(1) ; when θ is larger than x(1) , the likelihood function drops to zero.
• Clearly, the unrestricted MLE of θ is θb = X(1) and hence the denominator of λ(x) is
b = L(x(1) |x).
sup L(θ|x) = L(θ|x)
θ∈Θ
Restricted MLE: By “restricted,” we mean “subject to the constraint that the estimate
fall in Θ0 = {θ : −∞ < θ ≤ θ0 }.”
• Case 1: If θ0 < x(1) , then the largest L(θ|x) can be is L(θ0 |x). Therefore, the restricted
MLE is θb0 = θ0 .
• Case 2: If θ0 ≥ x(1) , then the restricted MLE θb0 coincides with the unrestricted MLE
θb = X(1) .
• Therefore,
θ0 , θ0 < X(1)
θb0 =
X(1) , θ0 ≥ X(1) .
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
• It is only when x(1) > θ0 do we have evidence that θ might be larger than θ0 . The
larger x(1) is (x(1) > θ0 ), the smaller λ(x) becomes; see Figure 8.2.1 (CB, pp 377).
That is,
PAGE 73
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Not surprisingly, we can write our LRT rejection region in terms of W (X) = X(1) . When
θ0 < x(1) , the LRT statistic
Pn
L(θ0 |x) e− i=1 xi +nθ0
λ(x) = = − Pn x +nx = e−n(x(1) −θ0 ) .
L(x(1) |x) e i=1 i (1)
Note that
R = {x ∈ X : λ(x) ≤ c} = {x ∈ X : x(1) ≥ c0 }.
Rejecting H0 when λ(x) is “too small” is the same as rejecting H0 when x(1) is “too large.”
As noted earlier, the latter decision rule makes sense intuitively. Note that we have written
our LRT rejection region and the corresponding test function
1, x(1) ≥ c0
0
φ(x) = I(x ∈ R) = I(x(1) ≥ c ) =
0, x(1) < c0
in terms of the one-dimensional statistic W (X) = X(1) , which is sufficient for the location
exponential family.
Theorem 8.2.4. Suppose T = T (X) is a sufficient statistic for θ. If λ∗ (T (x)) = λ∗ (t) is the
LRT statistic based on T and if λ(x) is the LRT statistic based on X, then λ∗ (T (x)) = λ(x)
for all x ∈ X .
Proof. Because T = T (X) is sufficient, we can write (by the Factorization Theorem)
fX (x|θ) = gT (t|θ)h(x),
PAGE 74
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Example 8.7. Suppose X1 , X2 , ..., Xn are iid exponential(θ), where θ > 0. Consider testing
H0 : θ = θ0
versus
H1 : θ 6= θ0 .
n
∗ e
λ (t) = tn e−t/θ0 ,
nθ0
Example 8.8. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
The null hypothesis H0 above looks simple, but it is not. The relevant parameter spaces are
Θ0 = {θ = (µ, σ 2 ) : µ = µ0 , σ 2 > 0}
Θ = {θ = (µ, σ 2 ) : −∞ < µ < ∞, σ 2 > 0}.
In this problem, we call σ 2 a nuisance parameter, because it is not the parameter that is
of interest in H0 and H1 . The likelihood function is
n
Y 1 2 2
L(θ|x) = √ e−(xi −µ) /2σ
i=1 2πσ 2
n/2
1 − 12
Pn 2
i=1 (xi −µ) .
= 2
e 2σ
2πσ
PAGE 75
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Exercise: In Example 7.7 (notes, pp 34-35), derive the LRT statistic to test
H0 : p1 = p2
versus
H1 : p1 6= p2 .
Exercise: In Example 8.4 (notes, pp 67), show that the LRT statistic is
4 x
Y 2864 i
λ(x) = λ(x1 , x2 , x3 , x4 ) = .
i=1
4xi
PAGE 76
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 ,
where Θc0 = Θ \ Θ0 , can also be carried out within the Bayesian paradigm, but they are
performed differently. Recall that, for a Bayesian, all inference is carried out using the
posterior distribution π(θ|x).
make perfect sense and be calculated (or approximated) “exactly.” Note that these proba-
bilities make no sense to the non-Bayesian. S/he regards θ as fixed, so that {θ ∈ Θ0 } and
{θ ∈ Θc0 } are not random events. We do not assign probabilities to events that are not
random.
Example 8.9. Suppose that X1 , X2 , ..., Xn are iid Poisson(θ), where the prior distribution
for θ ∼ gamma(a, b), a, b known. In Example 7.10 (notes, pp 38-39), we showed that the
posterior distribution
n
!
X 1
θ|X = x ∼ gamma xi + a, .
i=1
n + 1b
As an application, consider the following data, which summarize the number of goals per
game in the 2013-2014 English Premier League season:
Goals 0 1 2 3 4 5 6 7 8 9 10+
Frequency 27 73 80 72 65 39 17 4 1 2 0
There were n = 380 games total. I modeled the number of goals per game X as a Poisson
random variable and assumed that X1 , X2 , ..., X380 are iid Poisson(θ). Before the season
started, I modeled the mean number of goals per game as θ ∼ gamma(1.5, 2), which is a
fairly diffuse prior distribution.
PAGE 77
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
0.25
4
0.20
Posterior distribution
Prior distribution
0.15
3
0.10
2
0.05
1
0.00
0
0 5 10 15 2.0 2.5 3.0 3.5
θ θ
Figure 8.1: 2013-2014 English Premier League data. Prior distribution (left) and posterior
distribution (right) for θ, the mean number of goals scored per game. Note that the horizontal
axes are different in the two figures.
> sum(goals)
[1] 1060
I have depicted the prior distribution π(θ) and the posterior distribution π(θ|x) in Figure
8.1. Suppose that I wanted to test H0 : θ ≥ 3 versus H1 : θ < 3 on the basis of the assumed
Bayesian model and the observed data x. The probability that H0 is true is
Z ∞
P (θ ≥ 3|x) = π(θ|x)dθ ≈ 0.008,
3
> 1-pgamma(3,1061.5,1/0.002628)
[1] 0.008019202
Therefore, it is far more likely that H1 is true, in fact, with probability over 0.99.
PAGE 78
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 ,
where Θc0 = Θ \ Θ0 . I will henceforth assume that θ is a scalar parameter (for simplicity
only).
Therefore, for any test that we perform, there are four possible scenarios, described in the
following table:
Decision
Reject H0 Do not reject H0
H0 Type I Error ,
Truth
H1 , Type II Error
Calculations:
It is very important to note that both of these probabilities depend on θ. This is why we
emphasize this in the notation.
PAGE 79
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
β(θ) = Pθ (X ∈ R) = Eθ [φ(X)].
In other words, the power function gives the probability of rejecting H0 for all θ ∈ Θ. Note
that if H1 is true, so that θ ∈ Θc0 ,
Example 8.10. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Consider testing
H0 : µ ≤ µ0
versus
H1 : µ > µ0 .
1, x −√µ0 ≥ c
φ(x) = σ0 / n
0, otherwise.
• The first requirement implies that P (Type I Error|µ) will not exceed 0.10 for all µ ≤ µ0
(H0 true).
• The second requirement implies that P (Type II Error|µ) will not exceed 0.20 for all
µ ≥ µ0 + σ0 (these are values of µ that make H1 true).
PAGE 80
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
0.4
0.2
0.0
0 1 2 3 4
Figure 8.2: Power function β(µ) in Example 8.10 with c = 1.28, n = 5, µ0 = 1.5 and σ0 = 1.
Horizontal lines at 0.10 and 0.80 have been added.
the 0.90 quantile of the N (0, 1) distribution. Also, because β(µ) is increasing,
√ set
inf β(µ) = β(µ0 + σ0 ) = 1 − FZ (1.28 − n) = 0.80
µ≥µ0 +σ0
√
=⇒ 1.28 − n = −0.84
=⇒ n = 4.49,
PAGE 81
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Note that if φ(x) is a size α test, then it is also level α. The converse is not true. In other
words,
{class of size α tests} ⊂ {class of level α tests}.
Remark: Often, it is unnecessary to differentiate between the two classes of tests. How-
ever, in testing problems involving discrete distributions (e.g., binomial, Poisson, etc.), it is
generally not possible to construct a size α test for a specified value of α; e.g., α = 0.05.
Thus (unless one randomizes), we may have to settle for a level α test.
Important: As the definition above indicates, the size of any test φ(x) is calculated by
maximizing the power function over the null parameter space Θ0 identified in H0 .
Example 8.11. Suppose X1 , X2 are iid Poisson(θ), where θ > 0, and consider testing
H0 : θ ≥ 3
versus
H1 : θ < 3.
Size calculations: The size of each test is calculated as follows. For the first test,
α = sup β1 (θ) = β1 (3) = e−3 ≈ 0.049787.
θ≥3
PAGE 82
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
β1(θ)
β2(θ)
0.4
0.2
0.0
0 1 2 3 4 5
Example 8.12. Suppose X1 , X2 , ..., Xn are iid from fX (x|θ) = e−(x−θ) I(x ≥ θ), where
−∞ < θ < ∞. In Example 8.6 (notes, pp 72-74), we considered testing
H0 : θ ≤ θ0
versus
H1 : θ > θ0
and derived the LRT to take the form φ(x) = I(x(1) ≥ c0 ). Find the value of c0 that makes
φ(x) a size α test.
Solution. The pdf of X(1) is fX(1) (x|θ) = ne−n(x−θ) I(x ≥ θ). We set
PAGE 83
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 ,
where Θc0 = Θ \ Θ0 . A test in C with power function β(θ) is a uniformly most powerful
(UMP) class C test if
β(θ) ≥ β ∗ (θ) for all θ ∈ Θc0 ,
where β ∗ (θ) is the power function of any other test in C. The “uniformly” part in this
definition refers to the fact that the power function β(θ) is larger than (i.e., at least as large
as) the power function of any other class C test for all θ ∈ Θc0 .
Important: In this course, we will restrict attention to tests φ(x) that are level α tests.
That is, we will take
C = {all level α tests}.
This restriction is analogous to the restriction we made in the “optimal estimation problem”
in Chapter 7. Recall that we restricted attention to unbiased estimators first; we then wanted
to find the one with the smallest variance (uniformly, for all θ ∈ Θ). In the same spirit, we
make the same type of restriction here by considering only those tests that are level α tests.
This is done so that we can avoid having to consider “silly tests,” e.g.,
The power function for this test is β(θ) = 1, for all θ ∈ Θ. This test cannot be beaten in
terms of power when H1 is true! Unfortunately, it is not a very good test when H0 is true.
sup β(θ) ≤ α.
θ∈Θ0
H0 : θ = θ0
versus
H1 : θ = θ1 .
Remark: This type of test is rarely of interest in practice. However, it is the “building
block” situation for more interesting problems.
PAGE 84
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ = θ0
versus
H1 : θ = θ1
and denote by fX (x|θ0 ) and fX (x|θ1 ) the pdfs (pmfs) of X = (X1 , X2 , ..., Xn ) corresponding
to θ0 and θ1 , respectively. Consider the test function
1, fX (x|θ1 ) > k
fX (x|θ0 )
φ(x) =
0, fX (x|θ1 ) < k,
fX (x|θ0 )
for k ≥ 0, where
α = Pθ0 (X ∈ R) = Eθ0 [φ(X)]. (8.1)
Sufficiency: Any test satisfying the definition of φ(x) above and Equation (8.1) is a most
powerful (MP) level α test.
Remarks:
• The necessity part of the Neyman-Pearson (NP) Lemma is less important for our
immediate purposes (see CB, pp 388).
Example 8.13. Suppose that X1 , X2 , ..., Xn are iid beta(θ, 1), where θ > 0; i.e., the popu-
lation pdf is
fX (x|θ) = θxθ−1 I(0 < x < 1).
Derive the MP level α test for
H0 : θ = 1
versus
H1 : θ = 2.
PAGE 85
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
The NP Lemma says that the MP level α test uses the rejection rejection
( n
)
Y
R = x ∈ X : 2n xi > k ,
i=1
Instead
Q of finding the constant k that satisfies this equation, we rewrite the rejection rule
{2n ni=1 xi > k} in a way that makes our life easier. Note that
n
Y n
Y
n
2 xi > k ⇐⇒ xi > 2−n k
i=1 i=1
n
X
⇐⇒ − ln xi < − ln(2−n k) = k 0 , say.
i=1
Qn Pn
We have rewritten the rejection rule {2n i=1 xi > k} as { − ln xi < k 0 }. Therefore,
i=1
n ! n !
Y X
− ln Xi < k 0 θ = 1 .
α=P 2n Xi > k θ = 1 = P
i=1 i=1
We have now changed the problem to choosing k 0 to solve this equation above.
Recall that
H H
Xi ∼0 U(0, 1) =⇒ − ln Xi ∼0 exponential(1)
n
H
X
=⇒ − ln Xi ∼0 gamma(n, 1).
i=1
Therefore, to satisfy the equation above, we take k 0 = gn,1,1−α , the (lower) α quantile of a
gamma(n, 1) distribution. This notation for quantiles is consistent with how CB have defined
them on pp 386. Thus, the MP level α test of H0 : θ = 1 versus H1 : θ = 2 has rejection
region ( )
Xn
R= x∈X : − ln xi < gn,1,1−α .
i=1
PAGE 86
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Therefore, Z 5.425
1
β(2) = u9 e−2u du ≈ 0.643.
1 10
0 Γ(10) 2
| {z }
gamma(10, 1/2) pdf
Proof of NP Lemma. We prove the sufficiency part only. Define the test function
fX (x|θ1 )
1, fX (x|θ0 ) > k
φ(x) =
fX (x|θ1 )
0, < k,
fX (x|θ0 )
where k ≥ 0 and
α = Pθ0 (X ∈ R) = Eθ0 [φ(X)];
i.e., φ(x) is a size α test. We want to show that φ(x) is MP level α. Therefore, let φ∗ (x) be
the test function for any other level α test of H0 versus H1 . Note that
Eθ0 [φ(X)] = α
Eθ0 [φ∗ (X)] ≤ α.
Thus,
Define
b(x) = [φ(x) − φ∗ (x)][fX (x|θ1 ) − kfX (x|θ0 )].
• Case 1: Suppose fX (x|θ1 ) − kfX (x|θ0 ) > 0. Then, by definition, φ(x) = 1. Because
0 ≤ φ∗ (x) ≤ 1, we have
PAGE 87
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
• Case 2: Suppose fX (x|θ1 ) − kfX (x|θ0 ) < 0. Then, by definition, φ(x) = 0. Because
0 ≤ φ∗ (x) ≤ 1, we have
We have shown that b(x) = [φ(x) − φ∗ (x)][fX (x|θ1 ) − kfX (x|θ0 )] ≥ 0. Therefore,
that is,
Eθ1 [φ(X) − φ∗ (X)] ≥ k Eθ0 [φ(X) − φ∗ (X)] ≥ 0.
| {z }
≥ 0, shown above
Therefore, Eθ1 [φ(X) − φ∗ (X)] ≥ 0 and hence Eθ1 [φ(X)] ≥ Eθ1 [φ∗ (X)]. This shows that φ(x)
is more powerful than φ∗ (x). Because φ∗ (x) is an arbitrary level α test, we are done. 2
H0 : θ = θ0
versus
H1 : θ = θ1 ,
and suppose that T = T (X) is a sufficient statistic. Denote by gT (t|θ0 ) and gT (t|θ1 ) the pdfs
(pmfs) of T corresponding to θ0 and θ1 , respectively. Consider the test function
gT (t|θ1 )
1, gT (t|θ0 ) > k
φ(t) =
0, gT (t|θ1 ) < k,
gT (t|θ0 )
PAGE 88
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Example 8.14. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Find the MP level α test for
H0 : µ = µ0
versus
H1 : µ = µ1 ,
where µ1 < µ0 .
Solution. The sample mean T = T (X) = X is a sufficient statistic for the N (µ, σ02 ) family.
Furthermore,
σ02
1 − n2 (t−µ)2
T ∼ N µ, =⇒ gT (t|µ) = p e 2σ0
,
n 2πσ02 /n
− n 2 2
2 [(t−µ1 ) −(t−µ0 ) ] 2σ02 n−1 ln k − (µ21 − µ20 )
e 2σ0
> k ⇐⇒ t < = k 0 , say.
2(µ0 − µ1 )
where k 0 satisfies
k 0 − µ0
0
α = Pµ0 (T < k ) = P Z < √
σ0 / n
k 0 − µ0
=⇒ √ = −zα
σ0 / n
√
=⇒ k 0 = µ0 − zα σ0 / n.
√
Therefore, the MP level α test rejects H0 when X < µ0 − zα σ0 / n. This is the same test we
would have gotten using fX (x|µ0 ) and fX (x|µ1 ) with the original version of the NP Lemma
(Theorem 8.3.12).
PAGE 89
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Remark: So far, we have discussed “test related optimality” in the context of simple-versus-
simple hypotheses. We now extend the idea of “most powerful” to more realistic situations
involving composite hypotheses; e.g., H0 : θ ≤ θ0 versus H1 : θ > θ0 .
Definition: A family of pdfs (pmfs) {gT (t|θ); θ ∈ Θ} for a univariate random variable T
has monotone likelihood ratio (MLR) if for all θ2 > θ1 , the ratio
gT (t|θ2 )
gT (t|θ1 )
is a nondecreasing function of t over the set {t : gT (t|θ1 ) > 0 or gT (t|θ2 ) > 0}.
Example 8.15. Suppose T ∼ b(n, θ), where 0 < θ < 1. The pmf of T is
n t
gT (t|θ) = θ (1 − θ)n−t ,
t
θ2 1 − θ1
> 1 and > 1.
θ1 1 − θ2
Therefore,
gT (t|θ2 )
= c(θ1 , θ2 ) at ,
gT (t|θ1 ) | {z }
>0
where a > 1. This is an increasing function of t over {t : t = 0, 1, 2, ..., n}. Therefore, the
family {gT (t|θ) : 0 < θ < 1} has MLR.
Remark: Many common families of pdfs (pmfs) have MLR. For example, if
T ∼ gT (t|θ) = h(t)c(θ)ew(θ)t ,
i.e., T has pdf (pmf) in the one-parameter exponential family, then {gT (t|θ); θ ∈ Θ} has
MLR if w(θ) is a nondecreasing function of θ.
Proof. Exercise.
PAGE 90
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
Suppose that T is sufficient. Suppose that {gT (t|θ); θ ∈ Θ} has MLR. The test that rejects
H0 iff T > t0 is a UMP level α test, where
α = Pθ0 (T > t0 ).
Similarly, when testing
H0 : θ ≥ θ0
versus
H1 : θ < θ0 ,
the test that rejects H0 iff T < t0 is UMP level α, where α = Pθ0 (T < t0 ).
Example 8.16. Suppose X1 , X2 , ..., Xn are iid Bernoulli(θ), where 0 < θ < 1, and consider
testing
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
We know that n
X
T = Xi
i=1
is a sufficient statistic and T ∼ b(n, θ). In Example 8.15, we showed that the family {gT (t|θ) :
0 < θ < 1} has MLR. Therefore, the Karlin-Rubin Theorem says that the UMP level α test
is
φ(t) = I(t > t0 ),
where t0 solves
n
X n t
α = Pθ0 (T > t0 ) = θ (1 − θ0 )n−t .
t 0
t=bt0 c+1
t0 Pθ0 (T ≥ bt0 c + 1)
7 ≤ t0 < 8 P (T ≥ 8|θ = 0.2) = 0.2392
8 ≤ t0 < 9 P (T ≥ 9|θ = 0.2) = 0.1287
9 ≤ t0 < 10 P (T ≥ 10|θ = 0.2) = 0.0611
10 ≤ t0 < 11 P (T ≥ 11|θ = 0.2) = 0.0256
11 ≤ t0 < 12 P (T ≥ 12|θ = 0.2) = 0.0095
PAGE 91
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
0.4
0.2
0.0
Figure 8.4: Power function β(θ) for the UMP level α = 0.0611 test in Example 8.16 with
n = 30 and θ0 = 0.2. A horizontal line at α = 0.0611 has been added.
Therefore, the UMP level α = 0.0611 test of H0 : θ ≤ 0.2 versus H1 : θ > 0.2 uses I(t ≥ 10).
The UMP level α = 0.0256 test uses I(t ≥ 11). Note that (without randomizing) it is not
possible to write a UMP level α = 0.05 test in this problem. For the level α = 0.0611 test,
the power function is
30
X 30 t
β(θ) = Pθ (T ≥ 10) = θ (1 − θ)30−t ,
t=10
t
Example 8.17. Suppose that X1 , X2 , ..., Xn are iid with population distribution
where θ > 0. Note that this population distribution is an exponential distribution with mean
1/θ. Derive the UMP level α test for
H0 : θ ≥ θ0
versus
H1 : θ < θ0 .
PAGE 92
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
is a sufficient statistic and T ∼ gamma(n, 1/θ). Suppose θ2 > θ1 and form the ratio
1
1 n
tn−1 e−θ2 t n
gT (t|θ2 ) Γ(n) θ2 θ2
= = e−t(θ2 −θ1 ) .
gT (t|θ1 ) 1 θ1
n tn−1 e−θ1 t
Γ(n) θ11
gT (t|θ2 )
gT (t|θ1 )
is a decreasing function of t over {tP: t > 0}. However, the ratio is an increasing function
of t = −t, and T = T (X) = − ni=1 Xi is still a sufficient statistic (it is a one-to-one
∗ ∗ ∗
where t0 satisfies
Using χ2 critical values: We can also write this rejection region in terms of a χ2 quantile.
To see why, note that when θ = θ0 , the quantity 2θ0 T ∼ χ22n so that
PAGE 93
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
Power function
0.4
0.2
0.0
0 2 4 6 8
Figure 8.5: Power function β(θ) for the UMP level α = 0.10 test in Example 8.17 with
n = 10 and θ0 = 4. A horizontal line at α = 0.10 has been added.
Remark: One advantage of writing the rejection region in this way is that it depends on a
χ2 quantile, which, historically, may have been available in probability tables (i.e., in times
before computers and R). Another small advantage is that we can express the power function
β(θ) in terms of a χ2 cdf instead of a more general gamma cdf.
Power function: The power function of the UMP level α test is given by
χ22n,α θχ22n,α
β(θ) = Pθ (X ∈ R) = Pθ T > = Pθ 2θT >
2θ0 θ0
2
θχ2n,α
= 1 − Fχ22n ,
θ0
where Fχ22n (·) is the χ22n cdf. A graph of this power function, when n = 10, α = 0.10, and
θ0 = 4, is shown in Figure 8.5 (above).
Proof of Karlin-Rubin Theorem. We will prove this theorem in parts. The first part is a
lemma.
cov[g(X), h(X)] ≥ 0.
PAGE 94
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
showing that [h(x1 ) − h(x2 )][g(x1 ) − g(x2 )] ≥ 0, for all x1 , x2 ∈ R. By Theorem 2.2.5 (CB,
pp 57), E{[h(X1 ) − h(X2 )][g(X1 ) − g(X2 )]} ≥ 0. 2
Lemma 2. Suppose the family {gT (t|θ) : θ ∈ Θ} has MLR. If ψ(t) ↑nd t, then Eθ [ψ(T )] ↑nd θ.
Proof. Suppose that θ2 > θ1 . Because {gT (t|θ) : θ ∈ Θ} has MLR, we know that
gT (t|θ2 ) x
t
gT (t|θ1 ) nd
over the set {t : gT (t|θ1 ) > 0 or gT (t|θ2 ) > 0}. Therefore, by Lemma 1, we know
gT (T |θ2 ) gT (T |θ2 ) gT (T |θ2 )
covθ1 ψ(T ), ≥ 0 =⇒ Eθ1 ψ(T ) ≥ Eθ1 [ψ(T )] Eθ1
gT (T |θ1 ) gT (T |θ1 ) gT (T |θ1 )
| {z } | {z }
= Eθ2 [ψ(T )] = 1
Pθ (T > t0 ) ↑nd θ
for all t0 ∈ R. In other words, the family {gT (t|θ) : θ ∈ Θ} is stochastically increasing in θ.
Proof. This is a special case of Lemma 2. Fix t0 . Take ψ(t) = I(t > t0 ). Clearly,
1, t > t0
ψ(t) =
0, t ≤ t0
PAGE 95
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
This shows that φ(t) = I(t > t0 ) is a size α (and hence level α) test function. Thus, all that
remains is to show that this test is uniformly most powerful (i.e., most powerful ∀θ > θ0 ).
Remember that we are considering the test
H0 : θ ≤ θ0
versus
H1 : θ > θ0 .
Let φ∗ (x) be any other level α test of H0 versus H1 . Fix θ1 > θ0 and consider the test of
H0∗ : θ = θ0
versus
∗
H1 : θ = θ1
because φ∗ (x) is a level α test of H0 versus H1 . This also means that φ∗ (x) is a level α test
of H0∗ versus H1∗ . However, Corollary 8.3.13 (Neyman Pearson with a sufficient statistic T )
says that φ(t) is the most powerful (MP) level α test of H0∗ versus H1∗ . This means that
Eθ1 [φ(T )] ≥ Eθ1 [φ∗ (X)].
Because θ1 > θ0 was chosen arbitrarily and because φ∗ (x) was too, we have
Eθ [φ(T )] ≥ Eθ [φ∗ (X)]
for all θ > θ0 and for any level α test φ∗ (x) of H0 versus H1 . Because φ(t) is a level α test
of H0 versus H1 (shown above), we are done. 2
PAGE 96
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Note: In single parameter exponential families, we can find UMP tests for H0 : θ ≤ θ0
versus H1 : θ > θ0 (or for H0 : θ ≥ θ0 versus H1 : θ < θ0 ). Unfortunately,
• once we get outside this setting (even with a one-sided H1 ), UMP tests do become
scarce.
In other words, the collection of problems for which a UMP test exists is somewhat small.
In many ways, this should not be surprising. Requiring a test to outperform all other level
α tests for all θ in the alternative space Θc0 is asking a lot. The “larger” Θc0 is, the harder
it is to find a UMP test.
Example 8.18. Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞ and σ02 is
known. Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
There is no UMP test for this problem. A UMP test would exist if we could find a test
whose power function “beats” the power function for all other level α tests. For one-sided
alternatives, it is possible to find one. However, a two-sided alternative space is too large.
To illustrate, suppose we considered testing
H00 : µ ≤ µ0
versus
H10 : µ > µ0 .
PAGE 97
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
H000 : µ ≥ µ0
versus
H100 : µ < µ0 .
However, φ0 (x) 6= φ00 (x) for all x ∈ X . Therefore, no UMP test can exist for H0 versus H1 .
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θc0 .
A test with power function β(θ) is unbiased if β(θ0 ) ≥ β(θ00 ) for all θ0 ∈ Θc0 and for all
θ00 ∈ Θ0 . That is, the power is always larger in the alternative parameter space than it is in
the null parameter space.
• Therefore, when no UMP test exists, we could further restrict attention to those tests
that are level α and are unbiased. Conceptually, define
PAGE 98
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
0.4
0.3
PDF
0.2
0.1
1−α
α 2 α 2
0.0
−4 −2 0 2 4
Figure 8.6: Pdf of Z ∼ N (0, 1). The UMPU level α rejection region in Example 8.18 is
shown shaded.
• The test in C U that is UMP is called the uniformly most powerful unbiased
(UMPU) test. The UMPU test has power function β(θ) that satisfies
Example 8.18 (continued). Suppose X1 , X2 , ..., Xn are iid N (µ, σ02 ), where −∞ < µ < ∞
and σ02 is known. Consider testing
H0 : µ = µ0
versus
H1 : µ 6= µ0 .
PAGE 99
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
UMPU
UMP µ > µ0
UMP µ < µ0
0.8
0.6
Power function
0.4
0.2
0.0
2 4 6 8 10
Figure 8.7: Power function β(µ) of the UMPU level α = 0.05 test in Example 8.18 with
n = 10, µ0 = 6, and σ02 = 4. Also shown are the power functions corresponding to the two
UMP level α = 0.05 tests with H1 : µ > µ0 and H1 : µ < µ0 .
PAGE 100
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
Figure 8.7 the UMP level α = 0.05 power functions for the two one-sided tests (i.e., the tests
with H1 : µ > µ0 and H1 : µ < µ0 , respectively).
• It is easy to see that the UMPU test is an unbiased test. Note that β(µ) is always
larger in the alternative parameter space {µ ∈ R : µ 6= µ0 } than it is when µ = µ0 .
• The UMPU test’s power function “loses” to each UMP test’s power function in the
region where that UMP test is most powerful. This is the price one must pay for
restricting attention to unbiased tests. The best unbiased test for a two-sided H1 will
not beat a one-sided UMP test. However, the UMPU test is clearly better than the
UMP tests in each UMP test’s null parameter space.
Definition: A p-value p(X) is a test statistic, satisfying 0 ≤ p(x) ≤ 1, for all x ∈ X . Small
values of p(x) are evidence against H0 . A p-value is said to be valid if
Pθ (p(X) ≤ α) ≤ α,
for all θ ∈ Θ0 and for all 0 ≤ α ≤ 1.
“If p(X) is a valid p-value, it is easy to construct a level α test based on p(X).
The test that rejects H0 if and only if p(X) ≤ α is a level α test.”
It is easy to see why this is true. The validity requirement above guarantees that
φ(x) = I(p(x) ≤ α)
is a level α test function. Why? Note that
sup Eθ [φ(X)] = sup Pθ (p(X) ≤ α) ≤ α.
θ∈Θ0 θ∈Θ0
Theorem 8.3.27. Let W = W (X) be a test statistic such that large values of W give
evidence against H0 . For each x ∈ X , define
p(x) = sup Pθ (W (X) ≥ w),
θ∈Θ0
where w = W (x). Then p(X) is a valid p-value. Note that the definition of p(x) for when
small values of W give evidence against H0 would be analogous.
Proof. Fix θ ∈ Θ0 . Let F−W (w|θ) denote the cdf of −W = −W (X). When the test rejects
for large values of W ,
pθ (x) ≡ Pθ (W (X) ≥ w) = Pθ (−W (X) ≤ −w) = F−W (−w|θ),
PAGE 101
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
where the notation X ≥ST Y means “the distribution of X is stochastically larger than the
distribution of Y ” (see Exercise 2.10, CB, pp 77). Combining both cases, we have
Pθ (pθ (X) ≤ α) ≤ α,
Because we fixed θ ∈ Θ0 arbitrarily, this result must hold for all θ ∈ Θ0 . We have shown
that p(X) is a valid p-value. 2
Example 8.19. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). Consider testing
H0 : µ ≤ µ0
versus
H1 : µ > µ0 .
X − µ0
W = W (X) = √
S/ n
are evidence against H0 (i.e., this is a “one-sample t test,” which is a LRT). The null
parameter space is
Θ0 = {θ = (µ, σ 2 ) : µ ≤ µ0 , σ 2 > 0}.
Therefore, with observed value w = W (x), the p-value for the test is
X − µ0
p(x) = sup Pθ (W (X) ≥ w) = sup Pθ √ ≥w
θ∈Θ0 θ∈Θ0 S/ n
X −µ µ0 − µ
= sup Pθ √ ≥w+ √
θ∈Θ0 S/ n S/ n
µ0 − µ
= sup Pθ Tn−1 ≥ w + √ = P (Tn−1 ≥ w) ,
µ≤µ0 S/ n
PAGE 102
STAT 713: CHAPTER 8 JOSHUA M. TEBBS
1.0
0.8
0.6
p−values
0.4
0.2
0.0
U(0,1) quantiles
Remark: In Example 8.19, calculating the supremum over Θ0 is relatively easy. In other
problems, it might not be, especially when there are nuisance parameters. A very good
discussion on this is given in Berger and Boos (1994). These authors propose another type
of p-value by “suping” over subsets of Θ0 formed from calculating confidence intervals first
(which can make the computation easier).
Pθ0 (p(X) ≤ α) = α,
H
for all 0 ≤ α ≤ 1, then φ(x) = I(p(x) ≤ α) is a size α test and p(X) ∼0 U(0, 1).
Example 8.20. Suppose X1 , X2 , ..., Xn are iid N (0, 1). I used R to simulate B = 200
independent samples of this type, each with n = 30. With each sample, I performed a t test
for H0 : µ = 0 versus H1 : µ 6= 0 and calculated the p-value for each test (note that H0 is
true). A uniform qq plot of the 200 p-values in Figure 8.8 shows agreement with the U(0, 1)
distribution. Using α = 0.05, there were 9 tests (out of 200) that incorrectly rejected H0 .
PAGE 103
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
9 Interval Estimation
9.1 Introduction
Remark: In the definition above, a one-sided interval estimate is formed when one of
the endpoints is ±∞. For example, if L(x) = −∞, then the estimate is (−∞, U (x)]. If
U (x) = ∞, the estimate is [L(x), ∞).
Definition: Suppose [L(X), U (X)] is an interval estimator for θ. The coverage probabil-
ity of the interval is
Pθ (L(X) ≤ θ ≤ U (X)).
It is important to note the following:
• In the probability above, it is the endpoints L(X) and U (X) that are random; not θ
(it is fixed).
• The coverage probability is regarded as a function of θ. That is, the probability that
[L(X), U (X)] contains θ may be different for different values of θ ∈ Θ. This is usually
true when X is discrete.
Remark: In some problems, it is possible that the estimator itself is not an interval. More
generally, we use the term 1 − α confidence set to allow for these types of estimators. The
notation C(X) is used more generally to denote a confidence set.
PAGE 104
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.1. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. We consider two
interval estimators:
that is, the coverage probability is the same for all θ ∈ Θ = {θ : θ > 0}. The confidence
coefficient of the interval (aX(n) , bX(n) ) is therefore
n n n n
1 1 1 1
inf − = − .
θ>0 a b a b
On the other hand, the coverage probability for the second interval is
PAGE 105
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.2. Suppose that X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. A “1 − α
confidence interval” commonly taught in undergraduate courses is
r
pb(1 − pb)
pb ± zα/2 ,
n
where pb is the sample proportion, that is,
n
Y 1X
pb = = Xi ,
n n i=1
where Y = ni=1 Xi ∼ b(n, p), and zα/2 is the upper α/2 quantile of the N (0, 1) distribution.
P
In Chapter 10, we will learn that this is a large-sample “Wald-type” confidence interval. An
expression for the coverage probability of this interval is
r r !
pb(1 − pb) pb(1 − pb)
Pp pb − zα/2 ≤ p ≤ pb + zα/2
n n
s s
Y Y Y Y
Y (1 − n ) Y (1 − n )
= Ep I − zα/2 n ≤ p ≤ + zα/2 n
n n n n
n
ry ry !
y y
X y (1 − ) y (1 − ) n y
= I − zα/2 n n
≤ p ≤ + zα/2 n n
p (1 − p)y .
y=0
n n n n y
| {z }
b(n,p) pmf
Special case: I used R to graph this coverage probability function across values of 0 < p < 1
when n = 40 and α = 0.05; see Figure 9.1 (next page).
• The coverage probability rarely attains the nominal 0.95 level across 0 < p < 1.
• The jagged nature of the coverage probability function (of p) arises from the discrete-
ness of Y ∼ b(40, p).
• The confidence coefficient of the Wald interval (i.e., the infimum coverage probability
across all 0 < p < 1) is clearly 0.
• An excellent account of the performance of this confidence interval (and competing
intervals) is given in Brown et al. (2001, Statistical Science).
• When 1 − α = 0.95, one competing interval mentioned in Brown et al. (2001) replaces
y with y ∗ = y + 2 and n with n∗ = n + 4. This “add two successes-add two failures”
interval was proposed by Agresti and Coull (1998, American Statistician). Because
this interval’s coverage probability is much closer to the nominal level across 0 < p < 1
(and because it is so easy to compute), it has begun to usurp the Wald confidence
interval in introductory level courses.
PAGE 106
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
1.0
0.8
Coverage probability
0.6
0.4
0.2
0.0
Figure 9.1: Coverage probability of the Wald confidence interval for a binomial proportion
p when n = 40 and α = 0.05. A dotted horizontal line at 1 − α = 0.95 has been added.
Remark: This method of interval construction is motivated by the strong duality between
hypothesis testing and confidence intervals.
PAGE 107
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.3. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. A size α likelihood ratio test (LRT) of H0 : µ = µ0
versus H1 : µ 6= µ0 uses the test function
|x − µ0 |
1, √ ≥ tn−1,α/2
φ(x) = s/ n
0, otherwise.
Remark: As Example 9.3 suggests, when we invert a two-sided hypothesis test, we get a
two-sided confidence interval. This will be true in most problems. Analogously, inverting
one-sided tests generally leads to one-sided intervals.
Example 9.4. Suppose X1 , X2 , ..., Xn are iid exponential(θ), where θ > 0. A uniformly
most powerful (UMP) level α test of H0 : θ = θ0 versus H1 : θ > θ0 uses the test function
θ0
(
1, t ≥ χ22n,α
φ(t) = 2
0, otherwise,
where the sufficient statistic t = ni=1 xi . The “acceptance region” for this test is
P
θ0 2
Aθ0 = x ∈ X : t < χ2n,α ,
2
where, note that
θ0 2T
Pθ0 (X ∈ Aθ0 ) = Pθ0 T < χ22n,α = Pθ0 < χ22n,α = 1 − α,
2 θ0
H d
because 2T /θ0 ∼0 gamma(n, 2) = χ22n . Therefore, a 1 − α confidence set for θ is
θ 2
C(x) = {θ > 0 : x ∈ Aθ } = θ : t < χ2n,α
2
2t
= θ: 2 <θ .
χ2n,α
The random version of this confidence set is written as
2T
, ∞ ,
χ22n,α
where T = ni=1 Xi . This is a “one-sided” interval, as expected, because we have inverted a
P
one-sided test.
Remark: The test inversion method makes direct use of the relationship between hypothesis
tests and confidence intervals (sets). On pp 421, the authors of CB write,
“Both procedures look for consistency between sample statistics and population
parameters. The hypothesis test fixes the parameter and asks what sample values
(the acceptance region) are consistent with that fixed value. The confidence set
fixes the sample value and asks what parameter values (the confidence interval)
make this sample value most plausible.”
An illustrative figure (Figure 9.2.1, pp 421) displays this relationship in the N (µ, σ02 ) case;
i.e., writing a confidence interval for a normal mean µ when σ02 is known.
PAGE 109
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Remark: Finding pivots makes getting confidence intervals easy. If Q = Q(X, θ) is a pivot,
then we can set
1 − α = Pθ (a ≤ Q(X, θ) ≤ b),
where a and b are quantiles of the distribution of Q that satisfy the equation. Because Q
is a pivot, the probability on the RHS will be the same for all θ ∈ Θ. Therefore, a 1 − α
confidence interval can be determined from this equation.
Example 9.5. Suppose that X1 , X2 , ..., Xn are iid U(0, θ), where θ > 0. In Example 9.1,
we showed that
X(n)
Q = Q(X, θ) = ∼ beta(n, 1).
θ
Because the distribution of Q is free of θ, we know that Q is a pivot. Let bn,1,1−α/2 and
bn,1,α/2 denote the lower and upper α/2 quantiles of a beta(n, 1) distribution, respectively.
We can then write
X(n) 1 θ 1
1 − α = Pθ bn,1,1−α/2 ≤ ≤ bn,1,α/2 = Pθ ≥ ≥
θ bn,1,1−α/2 X(n) bn,1,α/2
X(n) X(n)
= Pθ ≤θ≤ .
bn,1,α/2 bn,1,1−α/2
Yi = β0 + β1 xi + i ,
where i ∼ iid N (0, σ 2 ) and the xi ’s are fixed constants (measured without error). Consider
writing a confidence interval for
θ = E(Y |x0 ) = β0 + β1 x0 ,
where x0 is a specified value of x. In a linear models course, you have shown that
2
!
1 (x 0 − x)
θb = βb0 + βb1 x0 ∼ N θ, σ 2 + Pn 2
,
n i=1 (xi − x)
where βb0 and βb1 are the least-squares estimators of β0 and β1 , respectively.
PAGE 110
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
θb − θ
Q(Y, θ) = r h 2
i ∼ N (0, 1)
σ2 1
+ Pn(x0 −x) 2
n (x
i=1 i −x)
θb − θ
Q(Y, θ) = r h 2
i ∼ tn−2 ,
1 (x 0 −x)
MSE n + Pn (xi −x)2
i=1
where MSE is the mean-squared error from the regression, is used as a pivot.
Remark: As Examples 9.5 and 9.6 illustrate, interval estimates are easily obtained after
writing 1 − α = Pθ (a ≤ Q(X, θ) ≤ b), for constants a and b (quantiles of Q). More generally,
{θ ∈ Θ : Q(x, θ) ∈ A} is a set estimate for θ, where A satisfies 1 − α = Pθ (Q(X, θ) ∈ A).
For example, in Example 9.5, we could have written
X(n) X(n)
1 − α = Pθ bn,1,1−α ≤ ≤1 = Pθ X(n) ≤ θ ≤
θ bn,1,1−α
Which one is “better?” For that matter, how should we define what “better” means?
PAGE 111
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.7. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). We know that
X −µ
Q1 = √ ∼ tn−1 ,
S/ n
that is, Q1 is a pivot. Therefore,
X −µ
1 − α = Pθ −tn−1,α/2 ≤ √ ≤ tn−1,α/2
S/ n
S S
= Pθ X − tn−1,α/2 √ ≤ µ ≤ X + tn−1,α/2 √ ,
n n
showing that
S S
C1 (X) = X − tn−1,α/2 √ , X + tn−1,α/2 √
n n
is a 1 − α confidence set for µ. Similarly, we know that
(n − 1)S 2
Q2 = ∼ χ2n−1 ,
σ2
that is, Q2 is also a pivot. Therefore,
(n − 1)S 2
2 2
1 − α = Pθ χn−1,1−α/2 ≤ ≤ χn−1,α/2
σ2
!
(n − 1)S 2 (n − 1)S 2
= Pθ ≤ σ2 ≤ 2 ,
χ2n−1,α/2 χn−1,1−α/2
showing that !
(n − 1)S 2 (n − 1)S 2
C2 (X) = ,
χ2n−1,α/2 χ2n−1,1−α/2
is a 1 − α confidence set for σ 2 .
Q: Is C1 (X) × C2 (X), the Cartesian product of C1 (X) and C2 (X), a 1 − α confidence region
for θ?
A: No. By Bonferroni’s Inequality,
Pθ (θ ∈ C1 (X) × C2 (X)) ≥ Pθ (µ ∈ C1 (X)) + Pθ (σ 2 ∈ C2 (X)) − 1
= (1 − α) + (1 − α) − 1
= 1 − 2α.
Therefore, C1 (X) × C2 (X) is a 1 − 2α confidence region for θ.
PAGE 112
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Bonferroni adjustment: Adjust C1 (X) and C2 (X) individually so that the confidence
coefficient of each is 1 − α/2. The adjusted set C1∗ (X) × C2∗ (X) is a 1 − α confidence region
for θ. This region has coverage probability larger than or equal to 1 − α for all θ (so it is
“conservative”).
Discussion: Example 9.2.7 (CB, pp 427-428) provides tips on how to find pivots in location
and scale (and location-scale) families.
In general, differences are pivotal in location family problems; ratios are pivotal for scale
parameters.
PAGE 113
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
PAGE 114
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
1.0
0.8
0.6
CDF
0.4
0.2
0.0
Figure 9.2: CDF of T = X(1) in Example 9.8, FT (t|θ), plotted as a function of θ with t fixed.
The value of t is 10.032, calculated based on an iid sample from fX (x|θ) with n = 5. Dotted
horizontal lines at α/2 = 0.025 and 1 − α/2 = 0.975 have been added.
Special case: I used R to simulate an iid sample of size n = 5 from fX (x|θ). The cdf of
T = X(1) is plotted in Figure 9.2 as a function of θ with the observed value of t = x(1) = 10.032
held fixed. A 0.95 confidence set is (9.293, 10.026). The true value of θ is 10.
Theorem 9.2.12. Suppose T is a statistic with a continuous cdf FT (t|θ). Suppose α1 +α2 =
α. Suppose for all t ∈ T , the functions θL (t) and θU (t) are defined as follows:
– FT (t|θU (t)) = α1
– FT (t|θL (t)) = 1 − α2 .
– FT (t|θU (t)) = 1 − α2
– FT (t|θL (t)) = α1 .
PAGE 115
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Remark: Theorem 9.2.12 remains valid for any statistic T with continuous cdf. In practice,
we would likely want T to be a sufficient statistic.
Remark: Pivoting the cdf always “works” because (if T is continuous), the cdf itself, when
viewed as random, is a pivot. From the Probability Integral Transformation, we know that
FT (T |θ) ∼ U(0, 1). Therefore, when FT (t|θ) is a decreasing function of θ, we have
Implementation: To pivot the cdf, it is not necessary that FT (t|θ) be available in closed
form (as in Example 9.8). All we really have to do is solve
Z t0 Z ∞
∗ set set
fT (t|θ1 (t0 ))dt = α/2 and fT (t|θ2∗ (t0 ))dt = α/2
−∞ t0
(in the equal α1 = α2 = α/2 case, say), based on the observed value T = t0 . We solve these
equations for θ1∗ (t0 ) and θ2∗ (t0 ). One of these will be the lower limit θL (t0 ) and the other
will be the upper limit θU (t0 ), depending on whether FT (t|θ) is an increasing or decreasing
function of θ.
Remark: The discrete case (i.e., the statistic T has a discrete distribution) is handled in
the same way except that the integrals above are replaced by sums.
Example 9.9.
P Suppose X1 , X2 , ..., Xn are iid Poisson(θ), where θ > 0. We now pivot the
cdf ofPT = ni=1 Xi , a sufficient statistic, to write a 1 − α confidence set for θ. Recall that
T = ni=1 Xi ∼ Poisson(nθ). If T = t0 is observed, we set
t0
X (nθ)k e−nθ set
Pθ (T ≤ t0 ) = = α/2
k=0
k!
∞
X (nθ)k e−nθ set
Pθ (T ≥ t0 ) = = α/2
k=t
k!
0
and solve each equation for θ. In practice, the solutions could be found by setting up a
grid search over possible values of θ and then selecting the values that solve these equations
(one solution will be the lower endpoint; the other solution will be the upper endpoint). In
this example, however, it is possible to get closed-form expressions for the confidence set
endpoints. To see why, we need to recall the following result which “links” the Poisson and
gamma distributions.
PAGE 116
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
1.00
0.95
Coverage probability
0.90
0.85
0.80
Figure 9.3: Coverage probability of the confidence interval in Example 9.9 when n = 10 and
α = 0.10. A dotted horizontal line at 1 − α = 0.90 has been added.
P (X ≤ x) = P (Y ≥ a),
where Y ∼ Poisson(x/b). This identity was stated in Example 3.3.1 (CB, pp 100-101).
Application: If we apply this result in Example 9.9 for the second equation to be solved,
we have a = t0 , x/b = nθ, and
α set 2X
= Pθ (T ≥ t0 ) = Pθ (X ≤ bnθ) = Pθ ≤ 2nθ = Pθ (χ22t0 ≤ 2nθ).
2 b
Therefore, we set
2nθ = χ22t0 ,1−α/2
and solve for θ (this will give the lower endpoint). A similar argument shows that the upper
endpoint solves
2nθ = χ22(t0 +1),α/2 .
Therefore, a 1 − α confidence set for θ is
1 2 1 2
χ , χ .
2n 2t0 ,1−α/2 2n 2(t0 +1),α/2
PAGE 117
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Remark: Pivoting a discrete cdf can be used to write confidence sets for parameters in other
discrete distributions. For example, a 1 − α confidence interval for a binomial probability p
when using this technique is given by
!
x+1
1 F
n−x 2(x+1),2(n−x),α/2
, ,
1 + n−x+1
x
x+1
F2(n−x+1),2x,α/2 1 + n−x F2(x+1),2(n−x),α/2
where x is the realized value of X ∼ b(n, p) and Fa,b,α/2 is the upper α/2 quantile of an
F distribution with degrees of freedom a and b. This is known as the Clopper-Pearson
confidence interval for p and it can (not surprisingly) be very conservative; see Brown et al.
(2001, Statistical Science). The interval arises by first exploiting the relationship between
the binomial and beta distributions (see CB, Exercise 2.40, pp 82) and then the relationship
which “links” the beta and F distributions (see CB, Theorem 5.3.8, pp 225).
Recall: In the Bayesian paradigm, all inference is carried out using the posterior distribution
π(θ|x). However, because the posterior π(θ|x) is itself a legitimate probability distribution
(for θ, updated after seeing x), we can calculate probabilities involving θ directly by using
this distribution.
Note: Bayesian credible intervals are interpreted differently than confidence intervals.
• Confidence interval interpretation: “If we were to perform the experiment over
and over again, each time under identical conditions, and if we calculated a 1 − α
confidence interval each time the experiment was performed, then 100(1 − α) percent
of the intervals we calculated would contain the true value of θ. Any specific interval
we calculate represents one of these possible intervals.”
• Credible interval interpretation: “The probability our interval contains θ is 1−α.”
PAGE 118
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Example 9.10. Suppose that X1 , X2 , ..., Xn are iid Poisson(θ), where the prior distribution
for θ ∼ gamma(a, b), a, b known. In Example 7.10 (notes, pp 38-39), we showed that the
posterior distribution
n
!
X 1
θ|X = x ∼ gamma xi + a, .
i=1
n + 1b
In Example 8.9 (notes, pp 77-78), we used this Bayesian model setup with the number of
goals per game in the 2013-2014 English Premier League season and calculated the posterior
distribution for the mean number of goals θ to be
1 d
θ|X = x ∼ gamma 1060 + 1.5, = gamma(1061.5, 0.002628).
380 + 12
> qgamma(0.025,1061.5,1/0.002628)
[1] 2.624309
> qgamma(0.975,1061.5,1/0.002628)
[1] 2.959913
Q: Why did we select the “equal-tail” quantiles (0.025 and 0.975) in this example?
A: It’s easy!
Note: There are two types of Bayesian credible intervals commonly used: Equal-tail (ET)
intervals and highest posterior density (HPD) intervals.
A = {θ : π(θ|x) ≥ c}
and the credible probability of A is 1 − α. ET and HPD intervals will coincide only when
π(θ|x) is symmetric.
Remark: In practice, because Monte Carlo methods are often used to approximate posterior
distributions, simple ET intervals are usually the preferred choice. HPD intervals can be far
more difficult to construct and are rarely much better than ET intervals.
Note: We will not cover all of the material in this subsection. We will have only a brief
discussion of the relevant topics.
PAGE 119
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Evaluating estimators: When evaluating any interval estimator, there are two important
criteria to consider:
1. Coverage probability. When the coverage probability is not equal to 1 − α for all
θ ∈ Θ (as is usually the case in discrete distributions), we would like it to be as close
as possible to the nominal 1 − α level.
2. Interval length. Shorter intervals are more informative. Interval length (or expected
interval length) depends on the interval’s underlying confidence coefficient.
• It only makes sense to compare two interval estimators (on the basis of inter-
val length) when the intervals have the same coverage probability (or confidence
coefficient).
Example 9.11. Suppose X1 , X2 , ..., Xn are iid N (µ, σ 2 ), where −∞ < µ < ∞ and σ 2 > 0;
i.e., both parameters are unknown. Set θ = (µ, σ 2 ). A 1 − α confidence interval for µ is
S S
C(X) = X + a √ , X + b √ ,
n n
where the constants a and b are quantiles from the tn−1 distribution satisfying
S S
1 − α = Pθ X + a √ ≤ µ ≤ X + b √ .
n n
Which choice of a and b is “best?” More precisely, which choice minimizes the expected
length? The length of this interval is
S
L = (b − a) √ ,
n
Eθ (S) √
Eθ (L) = (b − a) √ = (b − a)c(n)σ/ n,
n
PAGE 120
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
Pθ (a ≤ Q ≤ b) = 1 − α,
Remark: The version of Theorem 9.3.2 stated in CB (pp 441-442) is slightly different than
the one I present above; the authors’ version requires that the pdf of Q be unimodal (mine
requires that it be differentiable).
X −µ
Q = Q(X, θ) = √
S/ n
and
s s
C(x) = x + a √ , x + b √ .
n n
If we choose a = −tn−1,α/2 and b = tn−1,α/2 , then the conditions in Theorem 9.3.2 are
satisfied. Therefore,
S S
X − tn−1,α/2 √ , X + tn−1,α/2 √
n n
has the shortest expected length among all 1 − α confidence intervals based on Q.
1 − α = Pθ (a ≤ Q ≤ b) = FQ (b) − FQ (a)
so that
FQ (b) = 1 − α + FQ (a)
and
b = FQ−1 [1 − α + FQ (a)] ≡ b(a), say.
The goal is to minimize b − a = b(a) − a. Taking derivatives, we have (by the Chain Rule)
d d −1
[b(a) − a] = [F [1 − α + FQ (a)] − 1
da da Q
d d −1
= [1 − α + FQ (a)] F (η) − 1,
da dη Q
PAGE 121
STAT 713: CHAPTER 9 JOSHUA M. TEBBS
where η = 1−α+FQ (a). However, note that by the inverse function theorem (from calculus),
d −1 1 1 1
FQ (η) = 0 −1 = 0 = .
dη FQ [FQ (η)] FQ (b) fQ (b)
Therefore,
d fQ (a) set
[b(a) − a] = − 1 = 0 =⇒ fQ (a) = fQ (b).
da fQ (b)
To finish the proof, all we need to show is that
d2
[b(a) − a] > 0
da2
whenever fQ (a) = fQ (b) and fQ0 (a) > fQ0 (b). This will guarantee that the conditions stated
in Theorem 9.3.2 lead to b − a being minimized. 2
Remark: The theorem we have just proven is applicable when an interval’s length (or
expected length) is proportional to b − a. This is often true when θ is a location parameter
and fX (x|θ) is a location family. When an interval’s length is not proportional to b − a, then
Theorem 9.3.2 is not directly applicable. However, we might be able to formulate a modified
version of the theorem that is applicable.
Example 9.12. Suppose X1 , X P2 ,n..., Xn are iid exponential(β), where β > 0. A pivotal
quantity based on T = T (X) = i=1 Xi , a sufficient statistic, is
2T
Q = Q(T, β) = ∼ χ22n .
β
Therefore, we can write
2T 2T 2T
1 − α = Pβ (a ≤ Q ≤ b) = Pβ a≤ ≤b = Pβ ≤β≤ ,
β b a
where a and b are quantiles from the χ22n distribution. In this example, the expected interval
length is not proportional to b − a. Instead, the expected length is
2T 2T 1 1 1 1
Eβ (L) = Eβ − = − Eβ (2T ) = − 2nβ,
a b a b a b
which is proportional to
1 1
− .
a b
Theorem 9.3.2 is therefore not applicable here. To modify the theorem (towards finding a
shortest expected length confidence interval based on Q), we would have to minimize
1 1 1 1
− = −
a b a b(a)
with respect to a subject to the constraint that
Z b(a)
fQ (q)dq = 1 − α,
a
PAGE 122
STAT 713: CHAPTER 10 JOSHUA M. TEBBS
10 Asymptotic Evaluations
10.1 Introduction
Preview: In this chapter, we revisit “large sample theory” and discuss three important
topics in statistical inference:
• Efficiency, consistency
• Large sample properties of maximum likelihood estimators
Our previous inference discussions (i.e., in Chapters 7-9 CB) dealt with finite sample topics
(i.e., unbiasedness, MSE, optimal estimators/tests, confidence intervals based on finite sam-
ple pivots, etc.). We now investigate large sample inference, a topic of utmost importance
in statistical research.
W 1 = X1
X 1 + X2
W2 =
2
X 1 + X2 + X 3
W3 = ,
3
PAGE 123
STAT 713: CHAPTER 10 JOSHUA M. TEBBS
An equivalent definition is
lim Pθ (|Wn − θ| < ) = 1.
n→∞
We call Wn a consistent estimator of θ. What makes consistency “different” from our
p
usual definition of convergence in probability is that we require Wn −→ θ for all θ ∈ Θ. In
other words, convergence of Wn must result for all members of the family {fX (x|θ) : θ ∈ Θ}.
Weak Law of Large Numbers: Suppose that X1 , X2 , ..., Xn are iid with Eθ (X1 ) = µ and
varθ (X1 ) = σ 2 < ∞. Let
n
1X
Xn = Xi
n i=1
denote the sample mean. As an estimator of µ, it is easy to see that the conditions of
Theorem 10.1.3 are satisfied. Therefore, X n is a consistent estimator of Eθ (X1 ) = µ.
PAGE 124
STAT 713: CHAPTER 10 JOSHUA M. TEBBS
Consistency of MLEs: Suppose X1 , X2 , ..., Xn are iid from fX (x|θ), where θ ∈ Θ. Let
denote the maximum likelihood estimator (MLE) of θ. Under “certain regularity conditions,”
it follows that
p
θb −→ θ for all θ ∈ Θ,
as n → ∞. That is, MLEs are consistent estimators.
fX (x|θ1 ) = fX (x|θ2 ) =⇒ θ1 = θ2 .
In other words, different values of θ cannot produce the same probability distribution.
3. The family of pdfs {fX (x|θ) : θ ∈ Θ} has common support X . This means that the
support does not depend on θ. In addition, the pdf fX (x|θ) is differentiable with
respect to θ.
4. The parameter space Θ contains an open set where the true value of θ, say θ0 , resides
as an interior point.
Remark: Conditions 1-4 generally hold for exponential families that are of full rank.
Example 10.1. Suppose X_1, X_2, ..., X_n are iid N(0, θ), where θ > 0. The MLE of θ is
\[
\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i^2.
\]
As an MLE, \hat{\theta} \overset{p}{\to} θ for all θ > 0; i.e., \hat{\theta} is a consistent estimator of θ.
Asymptotic normality of MLEs: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ. Let \hat{\theta} denote the MLE of θ. Under "certain regularity conditions," it follows that
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
Remark: The four regularity conditions above were sufficient conditions for consistency. For asymptotic normality, there are two additional sufficient conditions:
5. The pdf/pmf f_X(x|θ) is three times differentiable with respect to θ, the third derivative is continuous in θ, and \int_{\mathbb{R}} f_X(x|\theta)\, dx can be differentiated three times under the integral sign.
6. There exists a function M(x) such that
\[
\left| \frac{\partial^3}{\partial\theta^3} \ln f_X(x|\theta) \right| \le M(x)
\]
for all x ∈ X and all θ ∈ N_c(θ_0) for some c > 0, where E_{θ_0}[M(X)] < ∞.
Note: We now sketch a casual proof of the asymptotic normality result for MLEs. Let θ_0 denote the true value of θ. Let S(θ) = S(θ|x) denote the score function; i.e.,
\[
S(\theta) = \frac{\partial}{\partial\theta} \ln f_X(\mathbf{x}|\theta).
\]
Note that because \hat{\theta} is an MLE, it solves the score equation; i.e., S(\hat{\theta}) = 0. Therefore, we can write (via Taylor series expansion about θ_0)
\[
0 = S(\hat{\theta}) = S(\theta_0) + \frac{\partial S(\theta_0)}{\partial\theta}(\hat{\theta} - \theta_0) + \frac{1}{2}\frac{\partial^2 S(\hat{\theta}^*)}{\partial\theta^2}(\hat{\theta} - \theta_0)^2,
\]
where \hat{\theta}^* is between θ_0 and \hat{\theta}. Therefore, we have
\[
0 = S(\theta_0) + (\hat{\theta} - \theta_0)\left[\frac{\partial S(\theta_0)}{\partial\theta} + \frac{1}{2}\frac{\partial^2 S(\hat{\theta}^*)}{\partial\theta^2}(\hat{\theta} - \theta_0)\right].
\]
After simple algebra, we have
\[
\sqrt{n}(\hat{\theta} - \theta_0)
= \frac{-\sqrt{n}\, S(\theta_0)}{\dfrac{\partial S(\theta_0)}{\partial\theta} + \dfrac{1}{2}\dfrac{\partial^2 S(\hat{\theta}^*)}{\partial\theta^2}(\hat{\theta} - \theta_0)}
= \frac{-\sqrt{n}\, \dfrac{1}{n}\displaystyle\sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_X(X_i|\theta_0)}
{\dfrac{1}{n}\displaystyle\sum_{i=1}^n \dfrac{\partial^2}{\partial\theta^2} \ln f_X(X_i|\theta_0) + \dfrac{1}{2n}\displaystyle\sum_{i=1}^n \dfrac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*)(\hat{\theta} - \theta_0)}
= \frac{-A}{B+C},
\]
where
\begin{align*}
A &= \sqrt{n}\, \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_X(X_i|\theta_0) \\
B &= \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} \ln f_X(X_i|\theta_0) \\
C &= \frac{1}{2n}\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*)\,(\hat{\theta} - \theta_0).
\end{align*}
By the CLT, A \overset{d}{\to} N(0, I_1(\theta_0)), because the score \frac{\partial}{\partial\theta}\ln f_X(X|\theta_0) has mean 0 and variance I_1(\theta_0) under θ_0; by the WLLN, B \overset{p}{\to} E_{\theta_0}[\frac{\partial^2}{\partial\theta^2}\ln f_X(X|\theta_0)] = -I_1(\theta_0). It remains to handle C. Note that \hat{\theta} - \theta_0 \overset{p}{\to} 0, because \hat{\theta} is consistent (i.e., \hat{\theta} converges in probability to θ_0). Therefore, it suffices to show that
\[
\frac{1}{n}\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*)
\]
converges to something finite (in probability). Note that for n "large enough," i.e., as soon as \hat{\theta}^* ∈ N_c(θ_0) in Regularity Condition 6,
\[
\left| \frac{1}{n}\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*) \right|
\le \frac{1}{n}\sum_{i=1}^n \left| \frac{\partial^3}{\partial\theta^3} \ln f_X(X_i|\hat{\theta}^*) \right|
\le \frac{1}{n}\sum_{i=1}^n M(X_i) \overset{p}{\longrightarrow} E_{\theta_0}[M(X)] < \infty. \quad \Box
\]
We have shown that C \overset{p}{\to} 0 and hence B + C \overset{p}{\to} -I_1(\theta_0). Finally, note that
\[
\frac{-A}{B+C}
= \underbrace{\frac{1}{B+C}}_{\overset{p}{\to}\, -1/I_1(\theta_0)}\; \underbrace{(-A)}_{\overset{d}{\to}\, N(0,\, I_1(\theta_0))}
\overset{d}{\longrightarrow} N\!\left(0, \frac{1}{I_1(\theta_0)}\right),
\]
by Slutsky's Theorem. Therefore,
\[
\sqrt{n}(\hat{\theta} - \theta_0) = \frac{-A}{B+C} \overset{d}{\longrightarrow} N(0, v(\theta_0)), \quad \text{where } v(\theta_0) = \frac{1}{I_1(\theta_0)}. \quad \Box
\]
Now recall the Delta Method from Chapter 5; i.e., if g : \mathbb{R} \to \mathbb{R} is differentiable at θ and g'(θ) ≠ 0, then
\[
\sqrt{n}\,[g(\hat{\theta}) - g(\theta)] \overset{d}{\longrightarrow} N\!\left(0,\, [g'(\theta)]^2 v(\theta)\right).
\]
Therefore, not only are MLEs asymptotically normal, but functions of MLEs are too.
Example 10.1 (continued). Suppose X_1, X_2, ..., X_n are iid N(0, θ), where θ > 0. The MLE of θ is
\[
\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i^2.
\]
We know that \hat{\theta} \overset{p}{\to} θ, as n → ∞. We now derive the asymptotic distribution of \hat{\theta} (suitably centered and scaled). We know
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
Therefore, all we need to do is calculate I_1(θ). The pdf of X is, for all x ∈ \mathbb{R},
\[
f_X(x|\theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-x^2/2\theta}.
\]
Therefore,
\[
\ln f_X(x|\theta) = -\frac{1}{2}\ln(2\pi\theta) - \frac{x^2}{2\theta}.
\]
The derivatives of \ln f_X(x|\theta) are
\begin{align*}
\frac{\partial}{\partial\theta} \ln f_X(x|\theta) &= -\frac{1}{2\theta} + \frac{x^2}{2\theta^2} \\
\frac{\partial^2}{\partial\theta^2} \ln f_X(x|\theta) &= \frac{1}{2\theta^2} - \frac{x^2}{\theta^3}.
\end{align*}
Therefore,
\[
I_1(\theta) = -E_\theta\!\left[\frac{\partial^2}{\partial\theta^2} \ln f_X(X|\theta)\right] = E_\theta\!\left[\frac{X^2}{\theta^3} - \frac{1}{2\theta^2}\right] = \frac{\theta}{\theta^3} - \frac{1}{2\theta^2} = \frac{1}{2\theta^2}
\]
and
\[
v(\theta) = \frac{1}{I_1(\theta)} = 2\theta^2.
\]
We have
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, 2\theta^2).
\]
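Remark: This limiting distribution is easy to check by simulation. The sketch below (illustrative, not part of the original notes) simulates √n(θ̂ − θ) for N(0, θ) data and compares its variance to 2θ²; the values of θ, n, and the number of replications are arbitrary choices.

import numpy as np

rng = np.random.default_rng(713)
theta, n, reps = 4.0, 500, 20000
x = rng.normal(0.0, np.sqrt(theta), size=(reps, n))  # N(0, theta): standard deviation sqrt(theta)
theta_hat = (x ** 2).mean(axis=1)                    # MLE (1/n) sum X_i^2, one per replication
z = np.sqrt(n) * (theta_hat - theta)
print(z.var())                                       # should be close to 2*theta^2 = 32
print(2 * theta ** 2)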
Exercise: Use the Delta Method to derive the large sample distributions of g_1(\hat{\theta}) = \hat{\theta}^2, g_2(\hat{\theta}) = e^{\hat{\theta}}, and g_3(\hat{\theta}) = \ln \hat{\theta}, suitably centered and scaled.
In addition, if \widehat{v(\theta)} is any consistent estimator of v(θ), then
\[
Z_n^* = \frac{\hat{\theta} - \theta}{\sqrt{\widehat{v(\theta)}/n}}
= \underbrace{\frac{\hat{\theta} - \theta}{\sqrt{v(\theta)/n}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\sqrt{\frac{v(\theta)}{\widehat{v(\theta)}}}}_{\overset{p}{\to}\, 1}
\overset{d}{\longrightarrow} N(0,1),
\]
by Slutsky's Theorem.
Summary:
1. We start with a sequence of estimators (e.g., an MLE sequence, etc.) satisfying
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)).
\]
2. We find a consistent estimator \widehat{v(\theta)} of the asymptotic variance v(θ).
3. We form Z_n^* as above, which converges in distribution to N(0, 1).
One can then use Z_n^* to formulate large sample (Wald) hypothesis tests and confidence intervals; see Sections 10.3 and 10.4, respectively.
Example 10.1 (continued). Suppose X_1, X_2, ..., X_n are iid N(0, θ), where θ > 0. The MLE of θ is
\[
\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i^2.
\]
We have shown that
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, 2\theta^2)
\iff
Z_n = \frac{\hat{\theta} - \theta}{\sqrt{2\theta^2/n}} \overset{d}{\longrightarrow} N(0, 1).
\]
A consistent estimator of v(θ) = 2θ² is \widehat{v(\theta)} = 2\hat{\theta}^2, by continuity. Therefore,
\[
Z_n^* = \frac{\hat{\theta} - \theta}{\sqrt{2\hat{\theta}^2/n}}
= \underbrace{\frac{\hat{\theta} - \theta}{\sqrt{2\theta^2/n}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\sqrt{\frac{2\theta^2}{2\hat{\theta}^2}}}_{\overset{p}{\to}\, 1}
\overset{d}{\longrightarrow} N(0, 1),
\]
by Slutsky's Theorem.
The asymptotic relative efficiency (ARE) is commonly used to compare two competing consistent estimators; the comparison is made on the basis of each estimator's large sample distribution, with the ARE of V to W taken as the ratio of V's asymptotic variance to that of W.
Remark: Before we do an example illustrating ARE, let's have a brief discussion about sample quantile estimators.
Sample quantiles: Suppose X_1, X_2, ..., X_n are iid with continuous cdf F. Define
\[
\phi_p = F^{-1}(p) = \inf\{x \in \mathbb{R} : F(x) \ge p\}.
\]
We call \phi_p the pth quantile of the distribution of X. Note that if F is strictly increasing, then F^{-1}(p) is well defined by
\[
\phi_p = F^{-1}(p) \iff F(\phi_p) = p.
\]
The simplest definition of the sample pth quantile is \widehat{F}_n^{-1}(p), where
\[
\widehat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \le x)
\]
is the empirical distribution function (edf). The edf is a non-decreasing step function that takes steps of size 1/n at each observed X_i. Therefore,
\[
\hat{\phi}_p \equiv \widehat{F}_n^{-1}(p) =
\begin{cases}
X_{(np)}, & np \in \mathbb{Z}^+ \\
X_{(\lfloor np \rfloor + 1)}, & \text{otherwise}.
\end{cases}
\]
This is just a fancy way of saying that the sample pth quantile is one of the order statistics (note that other books may define this differently; e.g., by averaging order statistics, etc.).
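The definition above is simple to code directly. Here is a small sketch (illustrative, not part of the original notes) that returns the edf-based sample pth quantile as the appropriate order statistic.

import numpy as np

def sample_quantile(x, p):
    x_sorted = np.sort(x)              # order statistics X_(1) <= ... <= X_(n)
    n = len(x_sorted)
    np_ = n * p
    if float(np_).is_integer():
        k = int(np_)                   # np in Z+: take X_(np)
    else:
        k = int(np.floor(np_)) + 1     # otherwise: take X_(floor(np)+1)
    return x_sorted[k - 1]             # convert 1-based order statistic to 0-based index

rng = np.random.default_rng(0)
x = rng.normal(size=101)
print(sample_quantile(x, 0.5))         # sample median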
Whenever I teach STAT 823, I prove that
\[
\sqrt{n}(\hat{\phi}_p - \phi_p) \overset{d}{\longrightarrow} N\!\left(0, \frac{p(1-p)}{f^2(\phi_p)}\right),
\]
where f is the population pdf of X. For example, if p = 0.5, then \phi_p = \phi_{0.5} is the median of X and the sample median \hat{\phi}_{0.5} satisfies
\[
\sqrt{n}(\hat{\phi}_{0.5} - \phi_{0.5}) \overset{d}{\longrightarrow} N\!\left(0, \frac{1}{4f^2(\phi_{0.5})}\right).
\]
Example 10.2. Suppose X_1, X_2, ..., X_n are iid N(µ, σ²), where −∞ < µ < ∞ and σ² > 0; i.e., both parameters are unknown. Consider the following two estimators W_n = \overline{X}_n and V_n = \hat{\phi}_{0.5} as estimators of µ. Note that because the N(µ, σ²) population distribution is symmetric, the population median \phi_{0.5} = µ as well.
We know that
\[
\sqrt{n}(\overline{X}_n - \mu) \sim N(0, \sigma^2),
\]
that is, this "limiting distribution" is the exact distribution of \sqrt{n}(\overline{X}_n - \mu) for each n. From our previous discussion on sample quantiles, we know that
\[
\sqrt{n}(\hat{\phi}_{0.5} - \mu) \overset{d}{\longrightarrow} N\!\left(0, \frac{1}{4f^2(\phi_{0.5})}\right),
\]
where (under the normal assumption),
\[
\frac{1}{4f^2(\phi_{0.5})} = \frac{1}{4f^2(\mu)} = \frac{1}{4\left(\dfrac{1}{\sqrt{2\pi}\,\sigma}\right)^2} = \frac{\pi}{2}\sigma^2.
\]
Therefore, the asymptotic relative efficiency of the sample median \hat{\phi}_{0.5} when compared to the sample mean \overline{X}_n is
\[
\text{ARE}(\hat{\phi}_{0.5} \text{ to } \overline{X}_n) = \frac{\frac{\pi}{2}\sigma^2}{\sigma^2} = \frac{\pi}{2} \approx 1.57.
\]
Interpretation: The sample median \hat{\phi}_{0.5} would require 57 percent more observations to achieve the same level of (asymptotic) precision as \overline{X}_n.
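A quick Monte Carlo check (illustrative code, not part of the original notes): for normal data the finite-sample variance ratio of the sample median to the sample mean is already close to π/2 ≈ 1.57 at moderate n. The sample size and replication count below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 0.0, 1.0, 200, 20000
samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)     # np.median averages the two middle order statistics for even n
print(np.var(medians) / np.var(means))   # roughly pi/2
print(np.pi / 2)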
Example 10.3. Suppose X1 , X2 , ..., Xn are iid beta(θ, 1), where θ > 0.
[Figure: ARE of the method of moments estimator to the MLE; vertical axis 0 to 5, horizontal axis 0 to 5.]
Remark: In Chapter 8 (CB), we discussed methods to derive hypothesis tests and also
optimality issues based on finite sample criteria. These discussions revealed that optimal
tests (e.g., UMP tests) were available for just a small collection of problems (some of which
were not realistic).
Preview: In this section, we present three large sample approaches to formulate hypothesis tests:
1. Wald (1943)
2. Score (Rao, 1948); also known as "Lagrange multiplier tests"
3. Likelihood ratio (Neyman and Pearson, 1928).
Recall: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. As long as suitable regularity conditions hold, we know that an MLE \hat{\theta} satisfies
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
If v(θ) is a continuous function of θ, then v(\hat{\theta}) \overset{p}{\to} v(θ) and
\[
\frac{\hat{\theta} - \theta}{\sqrt{v(\hat{\theta})/n}}
= \underbrace{\frac{\hat{\theta} - \theta}{\sqrt{v(\theta)/n}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\sqrt{\frac{v(\theta)}{v(\hat{\theta})}}}_{\overset{p}{\to}\, 1}
\overset{d}{\longrightarrow} N(0, 1),
\]
by Slutsky's Theorem. This forms the basis for the Wald test.
Wald statistic: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
When H_0 is true, the Wald statistic
\[
Z_n^W = \frac{\hat{\theta} - \theta_0}{\sqrt{v(\hat{\theta})/n}}
\]
converges in distribution to N(0, 1). Therefore,
\[
R = \{\mathbf{x} \in \mathcal{X} : |z_n^W| \ge z_{\alpha/2}\},
\]
where z_{\alpha/2} is the upper α/2 quantile of the N(0, 1) distribution, is an approximate size α rejection region for testing H_0 versus H_1. One-sided tests also use Z_n^W; the only thing that changes is the form of R.
Example 10.4. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive the
Wald test of
H0 : p = p0
versus
H1 : p ≠ p0.
Solution. Recall that the MLE of p is \hat{p} = \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, and that
\[
\sqrt{n}(\hat{p} - p) \overset{d}{\longrightarrow} N(0, v(p)), \quad \text{where } v(p) = \frac{1}{I_1(p)}.
\]
We now calculate I_1(p). The pmf of X is, for x = 0, 1,
\[
f_X(x|p) = p^x (1-p)^{1-x}.
\]
Therefore,
\[
\ln f_X(x|p) = x \ln p + (1-x) \ln(1-p).
\]
The derivatives of \ln f_X(x|p) are
\begin{align*}
\frac{\partial}{\partial p} \ln f_X(x|p) &= \frac{x}{p} - \frac{1-x}{1-p} \\
\frac{\partial^2}{\partial p^2} \ln f_X(x|p) &= -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}.
\end{align*}
Therefore,
\[
I_1(p) = -E_p\!\left[\frac{\partial^2}{\partial p^2} \ln f_X(X|p)\right] = E_p\!\left[\frac{X}{p^2} + \frac{1-X}{(1-p)^2}\right] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p(1-p)}
\]
and
\[
v(p) = \frac{1}{I_1(p)} = p(1-p).
\]
We have
\[
\sqrt{n}(\hat{p} - p) \overset{d}{\longrightarrow} N(0, p(1-p)).
\]
Because the asymptotic variance v(p) = p(1-p) is a continuous function of p, it can be consistently estimated by v(\hat{p}) = \hat{p}(1-\hat{p}). The Wald statistic to test H_0: p = p_0 versus H_1: p ≠ p_0 is given by
\[
Z_n^W = \frac{\hat{p} - p_0}{\sqrt{v(\hat{p})/n}} = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})/n}}.
\]
An approximate size α rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : |z_n^W| \ge z_{\alpha/2}\}.
\]
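The Wald test is straightforward to compute. A short sketch (illustrative, not part of the original notes; the data and p_0 are arbitrary choices):

import numpy as np
from scipy.stats import norm

def wald_test(x, p0, alpha=0.05):
    x = np.asarray(x)
    n = len(x)
    p_hat = x.mean()                                   # MLE of p
    z_w = (p_hat - p0) / np.sqrt(p_hat * (1 - p_hat) / n)
    reject = abs(z_w) >= norm.ppf(1 - alpha / 2)       # |z_W| >= z_{alpha/2}
    return z_w, reject

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.35, size=100)                    # simulated Bernoulli data
print(wald_test(x, p0=0.3))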
Motivation: Suppose X1 , X2 , ..., Xn are iid from fX (x|θ), where θ ∈ Θ ⊆ R. Recall that
the score function, when viewed as random, is
\[
S(\theta|\mathbf{X}) = \frac{\partial}{\partial\theta} \ln L(\theta|\mathbf{X}) \overset{\text{iid}}{=} \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_X(X_i|\theta),
\]
which means
\[
\frac{\frac{1}{n} S(\theta|\mathbf{X})}{\sqrt{I_1(\theta)/n}} = \frac{S(\theta|\mathbf{X})}{\sqrt{nI_1(\theta)}} = \frac{S(\theta|\mathbf{X})}{\sqrt{I_n(\theta)}} \overset{d}{\longrightarrow} N(0, 1),
\]
where recall In (θ) = nI1 (θ) is the Fisher information based on all n iid observations. There-
fore, the score function divided by the square root of the Fisher information (based on all
n observations) behaves asymptotically like a N (0, 1) random variable. This fact forms the
basis for the score test.
Score statistic: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
When H_0 is true, the score statistic
\[
Z_n^S = \frac{S(\theta_0|\mathbf{X})}{\sqrt{I_n(\theta_0)}}
\]
converges in distribution to N(0, 1), so R = \{\mathbf{x} \in \mathcal{X} : |z_n^S| \ge z_{\alpha/2}\} is an approximate size α rejection region for testing H_0 versus H_1.
Example 10.5. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive the
score test of
H0 : p = p0
versus
H1 : p ≠ p0.
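With the information evaluated at p_0, the score statistic reduces to (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n} (cf. Example 10.9, where this simplification is derived). A short computational sketch (illustrative, not part of the original notes):

import numpy as np
from scipy.stats import norm

def score_test(x, p0, alpha=0.05):
    x = np.asarray(x)
    n = len(x)
    p_hat = x.mean()
    # S(p0|X)/sqrt(I_n(p0)) simplifies to (p_hat - p0)/sqrt(p0(1-p0)/n)
    z_s = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    reject = abs(z_s) >= norm.ppf(1 - alpha / 2)
    return z_s, reject

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.35, size=100)
print(score_test(x, p0=0.3))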
Setting: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
Suppose the regularity conditions needed for MLEs to be consistent and asymptotically normal hold. When H_0 is true, the LRT statistic λ(X) = L(θ_0|X)/L(\hat{\theta}|X) satisfies
\[
-2 \ln \lambda(\mathbf{X}) \overset{d}{\longrightarrow} \chi^2_1.
\]
Because small values of λ(x) are evidence against H_0, large values of −2 ln λ(x) are too. Therefore,
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{1,\alpha}\},
\]
where \chi^2_{1,\alpha} is the upper α quantile of the \chi^2_1 distribution, is an approximate size α rejection region for testing H_0 versus H_1.
where \hat{\theta}^* is between \hat{\theta} and θ_0. Now write \frac{\partial}{\partial\theta}\ln L(\theta_0) in a Taylor series expansion about \hat{\theta}; that is,
\[
\frac{\partial}{\partial\theta} \ln L(\theta_0) = \underbrace{\frac{\partial}{\partial\theta} \ln L(\hat{\theta})}_{=\,0} + (\theta_0 - \hat{\theta})\,\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**}),
\]
where \hat{\theta}^{**} is between θ_0 and \hat{\theta}. Note that \frac{\partial}{\partial\theta}\ln L(\hat{\theta}) = 0 because \hat{\theta} solves the score equation.
From the last equation, we have
\[
\frac{1}{\sqrt{n}} \frac{\partial}{\partial\theta} \ln L(\theta_0) = \sqrt{n}(\hat{\theta} - \theta_0)\left[-\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**})\right]. \tag{10.2}
\]
Substituting (10.2) into the expansion of \ln L(\hat{\theta}) about θ_0 gives
\[
\ln L(\hat{\theta}) = \ln L(\theta_0) + \sqrt{n}(\hat{\theta} - \theta_0)\,\sqrt{n}(\hat{\theta} - \theta_0)\left[-\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**})\right] + \frac{n}{2}(\hat{\theta} - \theta_0)^2\, \frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^*)
\]
so that
\[
\ln L(\hat{\theta}) - \ln L(\theta_0) = n(\hat{\theta} - \theta_0)^2\left[-\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**})\right] + \frac{n}{2}(\hat{\theta} - \theta_0)^2\left[\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^*)\right]. \tag{10.3}
\]
Because \hat{\theta} is consistent (and because H_0 is true), we know that \hat{\theta} \overset{p}{\to} θ_0, as n → ∞. Therefore, because \hat{\theta}^* and \hat{\theta}^{**} are both trapped between \hat{\theta} and θ_0, both terms in the brackets, i.e.,
\[
\frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^{**}) \quad \text{and} \quad \frac{1}{n}\frac{\partial^2}{\partial\theta^2} \ln L(\hat{\theta}^*),
\]
converge in probability to
\[
E_{\theta_0}\!\left[\frac{\partial^2}{\partial\theta^2} \ln f_X(X|\theta_0)\right] = -I_1(\theta_0),
\]
by the WLLN. Therefore, the RHS of Equation (10.3) will behave in the limit the same as
\[
\frac{n}{2}(\hat{\theta} - \theta_0)^2 I_1(\theta_0) = \frac{1}{2}\sqrt{n}(\hat{\theta} - \theta_0)\,\sqrt{n}(\hat{\theta} - \theta_0)\, I_1(\theta_0)
= \frac{1}{2}\, \underbrace{\frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{1/I_1(\theta_0)}}}_{\overset{d}{\to}\, N(0,1)}\; \underbrace{\frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sqrt{1/I_1(\theta_0)}}}_{\overset{d}{\to}\, N(0,1)}
\overset{d}{\longrightarrow} \frac{1}{2}\chi^2_1,
\]
so that −2 ln λ(X) = 2[\ln L(\hat{\theta}|\mathbf{X}) - \ln L(\theta_0|\mathbf{X})] converges in distribution to \chi^2_1, as claimed. □
Example 10.6. Suppose X_1, X_2, ..., X_n are iid Bernoulli(p), where 0 < p < 1. Derive the large sample LRT of
H0 : p = p0
versus
H1 : p ≠ p0.
Solution. With \hat{p} = \overline{X}_n denoting the unrestricted MLE,
\[
-2 \ln \lambda(\mathbf{X}) = -2\left[\sum_{i=1}^n X_i \ln\!\left(\frac{p_0}{\hat{p}}\right) + \left(n - \sum_{i=1}^n X_i\right) \ln\!\left(\frac{1-p_0}{1-\hat{p}}\right)\right]
= -2\left[n\hat{p}\, \ln\!\left(\frac{p_0}{\hat{p}}\right) + n(1-\hat{p})\, \ln\!\left(\frac{1-p_0}{1-\hat{p}}\right)\right].
\]
An approximate size α rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{1,\alpha}\}.
\]
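A computational sketch of this large sample LRT (illustrative, not part of the original notes; it assumes 0 < \hat{p} < 1 so the logarithms are defined):

import numpy as np
from scipy.stats import chi2

def lrt_test(x, p0, alpha=0.05):
    x = np.asarray(x)
    n = len(x)
    p_hat = x.mean()                      # assumed strictly between 0 and 1
    stat = -2 * (n * p_hat * np.log(p0 / p_hat)
                 + n * (1 - p_hat) * np.log((1 - p0) / (1 - p_hat)))
    reject = stat >= chi2.ppf(1 - alpha, df=1)
    return stat, reject

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.35, size=100)
print(lrt_test(x, p0=0.3))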
Monte Carlo Simulation: When X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1,
we have derived the Wald, score, and large sample LRT for testing H0 : p = p0 versus
H1 : p 6= p0 . Each test is a large sample test, so the size of each one is approximately equal
to α when n is large. We now perform a simulation to assess finite sample characteristics.
The results from this simulation study are shown in Table 10.1.
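A sketch of the kind of size study summarized in Table 10.1 (illustrative code, not the author's; shown here only for the Wald test, with samples in which \hat{p} equals 0 or 1 counted as non-rejections):

import numpy as np
from scipy.stats import norm

def wald_rejects(x, p0, alpha=0.05):
    n, p_hat = len(x), x.mean()
    if p_hat in (0.0, 1.0):
        return False                      # Wald statistic undefined; count as a non-rejection
    z = (p_hat - p0) / np.sqrt(p_hat * (1 - p_hat) / n)
    return abs(z) >= norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(2018)
reps = 10000
for n in (20, 50, 100):
    for p0 in (0.1, 0.3):
        data = rng.binomial(1, p0, size=(reps, n))    # simulate under H0
        size_hat = np.mean([wald_rejects(row, p0) for row in data])
        print(n, p0, size_hat)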
Table 10.1: Monte Carlo simulation. Size estimates of nominal α = 0.05 Wald, score, and
LRTs for a binomial proportion p when n = 20, 50, 100 and p0 = 0.1, 0.3.
Important: Note that these sizes are really estimates of the true sizes (at each setting of n
and p0 ). Therefore, we should acknowledge that these are estimates and report the margin
of error associated with them.
• Because these are nominal size 0.05 tests, the margin of error associated with each "estimate," assuming a 99 percent confidence level, is equal to
\[
B = 2.58\sqrt{\frac{0.05(1-0.05)}{10000}} \approx 0.0056.
\]
• Size estimates between 0.0444 and 0.0556 indicate that the test is operating at the
nominal level. I have bolded the estimates in Table 10.1 that are within these bounds.
• Values < 0.0444 suggest the test is conservative (it rejects less often than it should under H0). Values > 0.0556 suggest the test is anti-conservative (it rejects too often under H0).
Summary: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Assume that the regularity conditions needed for MLEs to be consistent and asymptotically normal (CAN) hold. We have presented three large sample procedures to test
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
• Wald:
\[
Z_n^W = \frac{\hat{\theta} - \theta_0}{\sqrt{\widehat{v(\theta)}/n}} = \frac{\hat{\theta} - \theta_0}{\sqrt{1/\{nI_1(\hat{\theta})\}}} \overset{d}{\longrightarrow} N(0, 1)
\]
• Score:
\[
Z_n^S = \frac{S(\theta_0|\mathbf{X})}{\sqrt{I_n(\theta_0)}} \overset{d}{\longrightarrow} N(0, 1)
\]
• LRT:
\[
-2\ln\lambda(\mathbf{X}) = -2[\ln L(\theta_0|\mathbf{X}) - \ln L(\hat{\theta}|\mathbf{X})] \overset{d}{\longrightarrow} \chi^2_1.
\]
• Note that (ZnW )2 , (ZnS )2 , and −2 ln λ(X) each converge in distribution to a χ21 distri-
bution as n → ∞.
• In terms of power (i.e., rejecting H0 when H1 is true), all three testing procedures are
asymptotically equivalent when examining certain types of alternative sequences
(i.e., Pitman sequences of alternatives). For these alternative sequences, (ZnW )2 , (ZnS )2 ,
and −2 ln λ(X) each converge to the same (noncentral) χ21 (λ) distribution. However,
the powers may be quite different in finite samples.
Remark: The large sample LRT procedure can be easily generalized to multi-parameter
hypotheses.
Theorem 10.3.3. Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}^k. Assume that the regularity conditions needed for MLEs to be CAN hold. Consider testing
H0 : θ ∈ Θ0
versus
H1 : θ ∈ Θ0^c
and define
\[
\lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} L(\theta|\mathbf{x})}{\sup_{\theta \in \Theta} L(\theta|\mathbf{x})} = \frac{L(\hat{\theta}_0|\mathbf{x})}{L(\hat{\theta}|\mathbf{x})}.
\]
If θ ∈ Θ0, then
\[
-2\ln\lambda(\mathbf{X}) = -2[\ln L(\hat{\theta}_0|\mathbf{X}) - \ln L(\hat{\theta}|\mathbf{X})] \overset{d}{\longrightarrow} \chi^2_\nu,
\]
where ν = dim(Θ) − dim(Θ0), the number of "free parameters" between Θ and Θ0. An approximate size α rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{\nu,\alpha}\}.
\]
Example 10.7. McCann and Tebbs (2009) summarize a study examining perceived unmet
need for dental health care for people with HIV infection. Baseline in-person interviews were
conducted with 2,864 HIV infected individuals (aged 18 years and older) as part of the HIV
Cost and Services Utilization Study. Define
X1 = number of patients with private insurance
X2 = number of patients with Medicare and private insurance
X3 = number of patients without insurance
X4 = number of patients with Medicare but no private insurance.
Letting p_i denote the probability that a randomly selected patient falls into the ith category, so that (X1, X2, X3, X4) follows a multinomial distribution with 2,864 trials, consider testing
H0 : p1 = p2 = p3 = p4 = 1/4
versus
H1 : H0 not true.
Under H0, the restricted parameter space is
\[
\Theta_0 = \{\boldsymbol{\theta} = (p_1, p_2, p_3, p_4) : p_1 = p_2 = p_3 = p_4 = 1/4\},
\]
the singleton (1/4, 1/4, 1/4, 1/4). The entire parameter space is
\[
\Theta = \left\{\boldsymbol{\theta} = (p_1, p_2, p_3, p_4) : 0 < p_i < 1 \text{ for } i = 1, 2, 3, 4;\ \sum_{i=1}^4 p_i = 1\right\}.
\]
The likelihood ratio statistic is
\[
\lambda(\mathbf{x}) = \lambda(x_1, x_2, x_3, x_4) = \frac{L(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})}{L(\hat{p}_1, \hat{p}_2, \hat{p}_3, \hat{p}_4)}
= \frac{\dfrac{2864!}{x_1!\, x_2!\, x_3!\, x_4!}\, (\tfrac{1}{4})^{x_1} (\tfrac{1}{4})^{x_2} (\tfrac{1}{4})^{x_3} (\tfrac{1}{4})^{x_4}}
{\dfrac{2864!}{x_1!\, x_2!\, x_3!\, x_4!}\, (\tfrac{x_1}{2864})^{x_1} (\tfrac{x_2}{2864})^{x_2} (\tfrac{x_3}{2864})^{x_3} (\tfrac{x_4}{2864})^{x_4}}
= \prod_{i=1}^4 \left(\frac{2864}{4x_i}\right)^{x_i}.
\]
Here ν = dim(Θ) − dim(Θ0) = 3 − 0 = 3 and \chi^2_{3, 0.05} ≈ 7.81, so an approximate size α = 0.05 rejection region is
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge 7.81\}.
\]
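Computing −2 ln λ(x) is simple once the four counts are in hand. The sketch below is illustrative only (it is not part of the original notes, and the counts used are made-up placeholders, not the study's data):

import numpy as np
from scipy.stats import chi2

x = np.array([900, 700, 650, 614])          # hypothetical category counts; sum is 2864
n = x.sum()
p0 = np.full(4, 0.25)                       # null probabilities under H0
p_hat = x / n                               # unrestricted MLEs x_i / 2864
stat = -2 * np.sum(x * np.log(p0 / p_hat))  # -2 ln lambda(x)
print(stat, stat >= chi2.ppf(0.95, df=3))   # compare to chi^2_{3, 0.05} = 7.81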
Preview: Each of the large sample tests above can be inverted to produce a confidence interval. We consider three constructions:
1. Wald
2. Score
3. Likelihood ratio.
These are known as the "large sample likelihood based confidence intervals."
Definition: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. The random variable
\[
Q_n = Q_n(\mathbf{X}, \theta)
\]
is called a large sample pivot if its asymptotic distribution is free of all unknown parameters. If Q_n is a large sample pivot and if
\[
P_\theta(Q_n(\mathbf{X}, \theta) \in A) \approx 1 - \alpha,
\]
then the set C(\mathbf{x}) = \{\theta : Q_n(\mathbf{x}, \theta) \in A\} is an approximate 1 − α confidence set for θ.
Recall: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. As long as suitable regularity conditions hold, we know that an MLE \hat{\theta} satisfies
\[
\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\longrightarrow} N(0, v(\theta)), \quad \text{where } v(\theta) = \frac{1}{I_1(\theta)}.
\]
If v(θ) is a continuous function of θ, then \widehat{v(\theta)} = v(\hat{\theta}) \overset{p}{\to} v(θ), for all θ; i.e., \widehat{v(\theta)} is a consistent estimator of v(θ), and
\[
Q_n(\mathbf{X}, \theta) = \frac{\hat{\theta} - \theta}{\sqrt{\widehat{v(\theta)}/n}} \overset{d}{\longrightarrow} N(0, 1),
\]
by Slutsky's Theorem. Therefore, Q_n(\mathbf{X}, \theta) is a large sample pivot and
\[
1 - \alpha \approx P_\theta\!\left(-z_{\alpha/2} \le Q_n(\mathbf{X}, \theta) \le z_{\alpha/2}\right)
= P_\theta\!\left(-z_{\alpha/2} \le \frac{\hat{\theta} - \theta}{\sqrt{\widehat{v(\theta)}/n}} \le z_{\alpha/2}\right)
= P_\theta\!\left(\hat{\theta} - z_{\alpha/2}\sqrt{\frac{\widehat{v(\theta)}}{n}} \le \theta \le \hat{\theta} + z_{\alpha/2}\sqrt{\frac{\widehat{v(\theta)}}{n}}\right).
\]
Therefore,
\[
\hat{\theta} \pm z_{\alpha/2}\sqrt{\frac{\widehat{v(\theta)}}{n}}
\]
is an approximate 1 − α confidence interval for θ.
Remark: We could have arrived at this same interval by inverting the large sample (Wald) test of
H0 : θ = θ0
versus
H1 : θ ≠ θ0.
Extension: We can also write large sample Wald confidence intervals for functions of θ using the Delta Method. Recall that if g : \mathbb{R} \to \mathbb{R} is differentiable at θ and g'(θ) ≠ 0, then
\[
\sqrt{n}\,[g(\hat{\theta}) - g(\theta)] \overset{d}{\longrightarrow} N\!\left(0,\, [g'(\theta)]^2 v(\theta)\right).
\]
If [g'(θ)]²v(θ) is a continuous function of θ, then we can find a consistent estimator for it, namely [g'(\hat{\theta})]^2 v(\hat{\theta}), because MLEs are consistent themselves and consistency is preserved under continuous mappings. Therefore,
\[
Q_n(\mathbf{X}, \theta) = \frac{g(\hat{\theta}) - g(\theta)}{\sqrt{[g'(\hat{\theta})]^2 v(\hat{\theta})/n}} \overset{d}{\longrightarrow} N(0, 1),
\]
so that
\[
g(\hat{\theta}) \pm z_{\alpha/2}\sqrt{\frac{[g'(\hat{\theta})]^2 v(\hat{\theta})}{n}}
\]
is an approximate 1 − α confidence interval for g(θ).
Example 10.8. Suppose X_1, X_2, ..., X_n are iid Bernoulli(p), where 0 < p < 1. From Example 10.4, the MLE is \hat{p} = \overline{X}_n, with v(p) = p(1-p) estimated consistently by \hat{p}(1-\hat{p}). Therefore,
\[
\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
is an approximate 1 − α Wald confidence interval for p. The problems with this interval (i.e., in conferring the nominal coverage probability) are well known; see Brown et al. (2001).
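A small computational sketch of the Wald interval (illustrative, not part of the original notes):

import numpy as np
from scipy.stats import norm

def wald_ci(x, alpha=0.05):
    x = np.asarray(x)
    n, p_hat = len(x), x.mean()
    half = norm.ppf(1 - alpha / 2) * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=50)
print(wald_ci(x))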
Remarks: As you can see, constructing (large sample) Wald confidence intervals is straight-
forward. We rely on the MLE being consistent and asymptotically normal (CAN) and also
on being able to find a consistent estimator of the asymptotic variance of the MLE.
• More generally, if you have an estimator θb (not necessarily an MLE) that is asymp-
totically normal and if you can estimate its (large sample) variance consistently, you
can do Wald inference. This general strategy for large sample inference is ubiquitous
in statistical research.
• The problem, of course, is that because large sample standard errors must be estimated,
the performance of Wald confidence intervals (and tests) can be poor in small samples.
Brown et al. (2001) highlights this for the binomial proportion; however, this behavior
is seen in other settings.
• I view Wald inference as a “fall back.” It is what to do when no other large sample
inference procedures are available; i.e., “having something is better than nothing.”
• Of course, in very large sample settings (e.g., large scale Phase III clinical trials, public
health studies with thousands of individuals, etc.), Wald inference is usually the default
approach (probably because of its simplicity) and is generally satisfactory.
Recall: Suppose X1 , X2 , ..., Xn are iid from fX (x|θ), where θ ∈ Θ ⊆ R. We have shown
previously that
\[
Q_n(\mathbf{X}, \theta) = \frac{S(\theta|\mathbf{X})}{\sqrt{I_n(\theta)}} \overset{d}{\longrightarrow} N(0, 1),
\]
where In (θ) = nI1 (θ) is the Fisher information based on the sample.
Motivation: Score confidence intervals arise from inverting (large sample) score tests. Recall that in testing H0: θ = θ0 versus H1: θ ≠ θ0, the score statistic satisfies
\[
Q_n(\mathbf{X}, \theta_0) = \frac{S(\theta_0|\mathbf{X})}{\sqrt{I_n(\theta_0)}} \overset{d}{\longrightarrow} N(0, 1)
\]
when H0 is true, so that R = \{\mathbf{x} \in \mathcal{X} : |Q_n(\mathbf{x}, \theta_0)| \ge z_{\alpha/2}\} is an approximate size α rejection region for testing H0 versus H1. The acceptance region is \{\mathbf{x} \in \mathcal{X} : |Q_n(\mathbf{x}, \theta_0)| < z_{\alpha/2}\}; inverting it (i.e., collecting all values of θ0 that would not be rejected) gives the score confidence set C(\mathbf{x}) = \{\theta : |Q_n(\mathbf{x}, \theta)| < z_{\alpha/2}\}.
Example 10.9. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive a
1 − α (large sample) score confidence interval for p.
Solution. From Example 10.5, we have
\[
Q_n(\mathbf{X}, p) = \frac{S(p|\mathbf{X})}{\sqrt{I_n(p)}}
= \frac{\dfrac{\sum_{i=1}^n X_i}{p} - \dfrac{n - \sum_{i=1}^n X_i}{1-p}}{\sqrt{\dfrac{n}{p(1-p)}}}
= \frac{\hat{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}}.
\]
From our discussion above, the (random) set
\[
C(\mathbf{X}) = \{p : |Q_n(\mathbf{X}, p)| < z_{\alpha/2}\} = \left\{p : \left|\frac{\hat{p} - p}{\sqrt{p(1-p)/n}}\right| < z_{\alpha/2}\right\}
\]
forms the score interval for p. After observing X = x, this interval could be calculated numerically (e.g., using a grid search over values of p that satisfy this inequality). However, in the binomial case, we can get closed-form expressions for the endpoints. To see why, note that the boundary
\[
|Q_n(\mathbf{x}, p)| = z_{\alpha/2} \iff (\hat{p} - p)^2 = z_{\alpha/2}^2\, \frac{p(1-p)}{n}.
\]
After algebra, this equation becomes
\[
\left(1 + \frac{z_{\alpha/2}^2}{n}\right) p^2 - \left(2\hat{p} + \frac{z_{\alpha/2}^2}{n}\right) p + \hat{p}^2 = 0.
\]
The LHS of the last equation is a quadratic function of p. The roots of this equation, if they
are real, delimit the score interval for p. Using the quadratic formula, the lower and upper
limits are
\[
p_L = \frac{\left(2\hat{p} + z_{\alpha/2}^2/n\right) - \sqrt{\left(2\hat{p} + z_{\alpha/2}^2/n\right)^2 - 4\left(1 + z_{\alpha/2}^2/n\right)\hat{p}^2}}{2\left(1 + z_{\alpha/2}^2/n\right)}
\qquad\text{and}\qquad
p_U = \frac{\left(2\hat{p} + z_{\alpha/2}^2/n\right) + \sqrt{\left(2\hat{p} + z_{\alpha/2}^2/n\right)^2 - 4\left(1 + z_{\alpha/2}^2/n\right)\hat{p}^2}}{2\left(1 + z_{\alpha/2}^2/n\right)},
\]
respectively. Note that the score interval is much more complex than the Wald interval.
However, the score interval (in this setting and elsewhere) typically confers very good cover-
age probability, that is, close to the nominal 1 − α level, even for small samples. Therefore,
although we have added complexity, the score interval is typically much better.
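A computational sketch of these closed-form endpoints (illustrative, not part of the original notes):

import numpy as np
from scipy.stats import norm

def score_ci(x, alpha=0.05):
    x = np.asarray(x)
    n, p_hat = len(x), x.mean()
    z2 = norm.ppf(1 - alpha / 2) ** 2
    a_coef = 1 + z2 / n                     # coefficient of p^2
    b_coef = 2 * p_hat + z2 / n             # negative of the coefficient of p
    disc = np.sqrt(b_coef ** 2 - 4 * a_coef * p_hat ** 2)
    return (b_coef - disc) / (2 * a_coef), (b_coef + disc) / (2 * a_coef)

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=50)
print(score_ci(x))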
Recall: Suppose X_1, X_2, ..., X_n are iid from f_X(x|θ), where θ ∈ Θ ⊆ \mathbb{R}. Consider testing H0: θ = θ0 versus H1: θ ≠ θ0. The LRT statistic is
\[
\lambda(\mathbf{x}) = \frac{L(\theta_0|\mathbf{x})}{L(\hat{\theta}|\mathbf{x})}
\]
and
\[
R = \{\mathbf{x} \in \mathcal{X} : -2\ln\lambda(\mathbf{x}) \ge \chi^2_{1,\alpha}\}
\]
is an approximate size α rejection region for testing H0 versus H1. Inverting the acceptance region,
\[
C(\mathbf{x}) = \left\{\theta : -2\ln\!\left[\frac{L(\theta|\mathbf{x})}{L(\hat{\theta}|\mathbf{x})}\right] < \chi^2_{1,\alpha}\right\}
\]
is an approximate 1 − α confidence set for θ. If C(\mathbf{x}) is an interval, then we call it a likelihood ratio confidence interval.
Example 10.10. Suppose X1 , X2 , ..., Xn are iid Bernoulli(p), where 0 < p < 1. Derive a
1 − α (large sample) likelihood ratio confidence interval for p.
Solution. From Example 10.6, we have
\[
-2\ln\!\left[\frac{L(p|\mathbf{x})}{L(\hat{p}|\mathbf{x})}\right]
= -2\left[n\hat{p}\,\ln\!\left(\frac{p}{\hat{p}}\right) + n(1-\hat{p})\,\ln\!\left(\frac{1-p}{1-\hat{p}}\right)\right].
\]
Therefore, the confidence interval is
\[
C(\mathbf{x}) = \left\{p : -2\left[n\hat{p}\,\ln\!\left(\frac{p}{\hat{p}}\right) + n(1-\hat{p})\,\ln\!\left(\frac{1-p}{1-\hat{p}}\right)\right] < \chi^2_{1,\alpha}\right\}.
\]
This interval must be calculated using numerical search methods.
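One simple approach is a grid search, as in the following sketch (illustrative, not part of the original notes; it assumes 0 < \hat{p} < 1):

import numpy as np
from scipy.stats import chi2

def lrt_ci(x, alpha=0.05, grid_size=100001):
    x = np.asarray(x)
    n, p_hat = len(x), x.mean()
    cutoff = chi2.ppf(1 - alpha, df=1)
    grid = np.linspace(1e-6, 1 - 1e-6, grid_size)
    stat = -2 * (n * p_hat * np.log(grid / p_hat)
                 + n * (1 - p_hat) * np.log((1 - grid) / (1 - p_hat)))
    inside = grid[stat < cutoff]            # values of p not rejected by the LRT
    return inside.min(), inside.max()

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=50)
print(lrt_ci(x))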