
STAT-36700 Homework 4 - Solutions

Fall 2018
September 28, 2018

This contains solutions for Homework 4. Please note that we have included several additional comments and approaches to the problems to give you better insight.

Problem 1. Suppose that X1 , . . . , Xn ∼ Geom( p), i.e. the samples have
a geometric distribution with parameter p. A geometric distribution is the
distribution of the number of coin flips needed to see one head.

(a) Write down the likelihood as a function of the observed data X1 , . . . , Xn ,
and the unknown parameter p.

(b) Compute the MLE of p. In order to do this you need to find a zero of the
derivative of the likelihood, and also check that the second derivative of
the likelihood at the point is negative.

(c) Compute the method-of-moments estimator for p. Is this the same as the
MLE?

(d) Extra Credit: Are the estimators you have derived above unbiased? As a
hint: think about using Jensen’s inequality, and when Jensen’s inequality
is a strict inequality.

Solution 1. We derive each of the results below:


(a) We claim that \( L(p \mid x_1, x_2, \ldots, x_n) = p^n (1-p)^{\sum_{i=1}^n x_i - n} \).

Proof. We first note that for X ∼ Geom( p) the PMF is given by
\( P(X = k) = (1-p)^{k-1} p \). Now we have the likelihood as:
\begin{align*}
L(p \mid x_1, x_2, \ldots, x_n) &= \prod_{i=1}^n P(X_i = x_i) \\
&= \prod_{i=1}^n (1-p)^{x_i - 1} p \\
&= p^n (1-p)^{\sum_{i=1}^n x_i - n}
\end{align*}

(b) We claim that \( \hat{p}_{\mathrm{MLE}} = 1/\bar{X}_n \).

Proof. From the previous part we have that the likelihood of p is given by
\[
L(p \mid x_1, x_2, \ldots, x_n) = p^n (1-p)^{\sum_{i=1}^n x_i - n}.
\]

We can then derive the log-likelihood of p as follows:
\[
\ell(p \mid x_1, x_2, \ldots, x_n) = n \log(p) + \left( \sum_{i=1}^n x_i - n \right) \log(1-p)
\]
(we now maximize the log-likelihood).

Now we can differentiate the log-likelihood and set it to 0 to find the MLE as follows:
\begin{align*}
\frac{\partial \ell}{\partial p} = \frac{n}{p} - \frac{\sum_{i=1}^n x_i - n}{1-p} &= 0 && \text{(setting the partial derivative to 0)} \\
\implies n - np &= p \sum_{i=1}^n x_i - np \\
\implies n &= p \sum_{i=1}^n x_i \\
\implies p &= \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}_n}
\end{align*}
We then verify that this is a maximum using the second derivative as follows:
\[
\frac{\partial^2 \ell}{\partial p^2} = -\frac{n}{p^2} - \frac{\sum_{i=1}^n x_i - n}{(1-p)^2}
= -\left( \frac{n}{p^2} + \frac{\sum_{i=1}^n x_i - n}{(1-p)^2} \right) < 0,
\]
where we use that \( \sum_{i=1}^n x_i \geq n \) since each geometric observation is at least 1. Since the log-likelihood has negative curvature for all p, we have \( \hat{p}_{\mathrm{MLE}} = 1/\bar{X}_n \) as required.

(c) We claim that \( \hat{p}_{\mathrm{MOM}} = 1/\bar{X}_n \).

Proof. For X ∼ Geom( p) we have that E( X ) = 1/p (Note: prove this as an exercise).
Now we use the method of moments as follows:
\begin{align*}
\frac{1}{\hat{p}_{\mathrm{MOM}}} &= \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}_n \\
\implies \hat{p}_{\mathrm{MOM}} &= \frac{1}{\bar{x}_n}
\end{align*}
So the method-of-moments estimator is \( \hat{p}_{\mathrm{MOM}} = 1/\bar{X}_n \), which is equal to \( \hat{p}_{\mathrm{MLE}} \) in this case.

(d) Extra Credit: We claim that \( \hat{p}_{\mathrm{MLE}} = \hat{p}_{\mathrm{MOM}} = 1/\bar{X}_n \) are biased estimators.

Proof. Firstly we note that
\begin{align*}
E(\bar{X}_n) &= E\left( \frac{1}{n} \sum_{i=1}^n X_i \right) \\
&= E(X_1) && \text{(using linearity of expectation and IID of the } X_i\text{'s)} \\
&= \frac{1}{p} && \text{(using the expectation of a Geom(p) variable (Exercise: derive this))}
\end{align*}

Now, per the hint, using Jensen's inequality for the convex function \( x \mapsto \frac{1}{x} \) (convexity is checked by noting the second derivative \( \frac{\partial^2}{\partial x^2} \frac{1}{x} = \frac{2}{x^3} > 0 \; \forall x > 0 \)), we have:
\begin{align*}
E(\hat{p}_{\mathrm{MLE}}) = E\left( \frac{1}{\bar{X}_n} \right)
&> \frac{1}{E(\bar{X}_n)} && \text{(note the strict inequality since our convex map is not linear)} \\
&= \frac{1}{1/p} \\
&= p
\end{align*}
Since \( E(\hat{p}_{\mathrm{MLE}}) = E(\hat{p}_{\mathrm{MOM}}) > p \), our estimators are biased.
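As an optional illustration (our own addition, not required for the extra credit), a short R simulation makes the bias visible; the values of p, n, and the number of replications below are arbitrary choices:

# Simulate the bias of 1 / mean(X) for geometric samples
set.seed(36700)
p <- 0.3
n <- 20
reps <- 5000

p_hat <- replicate(reps, {
  # rgeom() counts failures before the first success, so add 1 to count flips
  x <- rgeom(n, prob = p) + 1
  1 / mean(x)
})

# The average estimate typically sits slightly above the true p,
# consistent with the strict Jensen inequality above
mean(p_hat)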

Problem 2. Suppose we have samples X1 , . . . , Xn ∼ Unif [0, θ ].

(a) Write down the likelihood as a function of the observed data X1 , . . . , Xn ,
and the unknown parameter θ.

(b) Compute the MLE of θ.

(c) Use the method of moments to derive an estimator of θ. Is this the same
as the MLE?

Solution 2. We derive each of the results below:

(a) We claim that \( L(\theta \mid x_1, x_2, \ldots, x_n) = \left( \frac{1}{\theta} \right)^n \mathbb{1}(0 \leq x_{(1)} \leq x_{(n)} \leq \theta) \).

Proof.
\begin{align*}
L(\theta \mid x_1, x_2, \ldots, x_n) &= \prod_{i=1}^n \frac{1}{\theta} \mathbb{1}(0 \leq x_i \leq \theta) \\
&= \left( \frac{1}{\theta} \right)^n \prod_{i=1}^n \mathbb{1}(0 \leq x_i \leq \theta) \\
&= \left( \frac{1}{\theta} \right)^n \mathbb{1}(0 \leq x_{(1)} \leq x_{(n)} \leq \theta)
\end{align*}

(b) We claim that \( \hat{\theta}_{\mathrm{MLE}} = X_{(n)} \). (Note: In this case it is critical to understand that the parameter we are seeking to estimate, i.e. θ, appears in the support of our likelihood. As such we can't just differentiate the log-likelihood as usual; we have to carefully consider the shape of the likelihood over its support and choose the θ value that maximizes it.)

Proof. We note from part (a) that the log-likelihood is:
\[
\ell(\theta \mid x_1, x_2, \ldots, x_n) = -n \log(\theta) \, \mathbb{1}(0 \leq x_{(1)} \leq x_{(n)} \leq \theta)
\]
\[
\implies \frac{\partial \ell}{\partial \theta} = -\frac{n}{\theta} \, \mathbb{1}(0 \leq x_{(1)} \leq x_{(n)} \leq \theta),
\]
so the likelihood is a decreasing function of θ over its support. It is therefore maximized at the smallest θ consistent with the data, namely θ = x(n) . We then have that \( \hat{\theta}_{\mathrm{MLE}} = X_{(n)} \) as required.

(c) We claim that \( \hat{\theta}_{\mathrm{MOM}} = 2\bar{X}_n \).

Proof.
\begin{align*}
E(X) &= \int_0^\theta \frac{x}{\theta} \, dx \\
&= \left[ \frac{x^2}{2\theta} \right]_0^\theta \\
&= \frac{\theta^2}{2\theta} \\
&= \frac{\theta}{2} \\
\implies \frac{\hat{\theta}_{\mathrm{MOM}}}{2} &= \frac{1}{n} \sum_{i=1}^n x_i \\
\implies \hat{\theta}_{\mathrm{MOM}} &= 2\bar{x}_n
\end{align*}
We then have \( \hat{\theta}_{\mathrm{MOM}} = 2\bar{X}_n \) as required. This is not necessarily the same as the MLE.
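As an aside (our own addition), a short R simulation contrasts the two estimators; theta, n, and the number of replications below are arbitrary choices:

# Compare the MLE max(x) and the MOM estimator 2 * mean(x) for Unif[0, theta]
set.seed(1)
theta <- 5
n <- 50
reps <- 2000

est <- replicate(reps, {
  x <- runif(n, min = 0, max = theta)
  c(mle = max(x), mom = 2 * mean(x))
})

# The MLE tends to sit just below theta (it is biased downwards),
# while the MOM estimator is unbiased but has larger variance here
rowMeans(est)
apply(est, 1, var)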

Problem 3. In this question we will explore what is known as Monte Carlo
integration. We often wish to calculate an integral of the form:
\[
I(f) = \int_0^1 f(x) \, dx,
\]

but cannot do so analytically. The Monte Carlo method is a way to approximate this integral.

(a) Re-express this integral as an expected value of the function f . What is
the distribution of the underlying random variable?

(b) The above suggests that we can generate random variables in a certain
way and approximate the integral of interest. Use the weak law of large
numbers (WLLN) to give a precise (asymptotic) guarantee for this
method.

(c) As a concrete example, we consider the evaluation of:
\[
I(f) = \int_0^1 \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \, dx.
\]

Use R (or any other programming language) to approximate the above
integral using 1000 random samples. Report your result. Repeat this
100 times just to get a sense of the variability - you do not need to report
these 100 numbers.

(d) Use the standard Gaussian CDF to give a precise evaluation of this
integral. You should use R (or any other language) to evaluate the CDF.

Solution 3. We derive each of the results below:

(a) We claim that this is EX ( f ( X )) where X ∼ Unif [0, 1].

Proof. For X ∼ Unif [0, 1] we have the PDF
\[
p_X(x) = \mathbb{1}(0 \leq x \leq 1).
\]
So we have:
\begin{align*}
E_X(f(X)) &:= \int_{\mathcal{X}} f(x) \, p_X(x) \, dx \\
&= \int_{\mathcal{X}} f(x) \, \mathbb{1}(0 \leq x \leq 1) \, dx \\
&= \int_0^1 f(x) \, dx
\end{align*}
as required.

(b) Now that we have expressed the required integral as the expectation of a
function of a random variable X ∼ Unif [0, 1], we have that for IID samples
X1 , X2 , . . . , Xn ∼ Unif [0, 1], by the WLLN:
\[
\frac{1}{n} \sum_{i=1}^n f(X_i) \xrightarrow{p} E_X(f(X)) = \int_0^1 f(x) \, dx
\]

(c) As a concrete example, we consider the evaluation of:
\[
I(f) = \int_0^1 \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \, dx.
\]
We can use R to approximate the above integral using 1000 random samples as follows:
# install.packages("tidyverse")
library(tidyverse)

# Set seed for reproducibility
set.seed(35842)

# Define the required function we want to integrate
f <- function(x) {
  base::return((1 / sqrt(2 * pi)) * exp(-(x^2 / 2)))
}

# Define the Monte Carlo simulation
monte_carlo <- function(inp_fn, n) {
  # Generate random samples
  samps <- purrr::map_dbl(.x = 1:n, ~ runif(n = 1, min = 0, max = 1))

  # Approximate the integral
  integl <- mean(inp_fn(samps))

  base::return(integl)
}

# Single run
one_sim <- monte_carlo(inp_fn = f, n = 1000)
one_sim

This returns a value of 0.3416494.

We can repeat this 100 times as follows:
# Let's replicate the simulation 100 times
num_replications <- 100
sims <- replicate(expr = monte_carlo(inp_fn = f, n = 1000),
                  n = num_replications)

# Let's plot a histogram
hist(sims)

(d) We can represent this exactly as a difference of CDF values of the standard
normal distribution as follows:
\begin{align*}
I(f) &= \int_0^1 \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \, dx \\
&= \int_{-\infty}^1 \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \, dx - \int_{-\infty}^0 \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \, dx \\
&= \Phi(1) - \Phi(0)
\end{align*}

This can be numerically calculated (approximated) in R using the following command:
pnorm(1) - pnorm(0)

which evaluates to 0.3413447.
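As a further cross-check (our own addition), R's built-in numerical integrator gives essentially the same value:

# Deterministic numerical integration of the standard normal density over [0, 1]
integrate(function(x) dnorm(x), lower = 0, upper = 1)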

Problem 4. The Poisson distribution with parameter λ has pmf:
\[
P(X = k) = \frac{\exp(-\lambda) \lambda^k}{k!}.
\]
For large values of λ the Poisson distribution is well approximated by a
Gaussian distribution.

(a) Use moment matching to find the Gaussian that best approximates a
Poi(λ) distribution. In other words, we can use N (µ, σ2 ) to approximate
the Poisson, if we choose µ and σ2 to match the mean and variance of the
Poisson.

(b) Suppose that we have a system that emits a random number X of particles
according to a Poisson distribution with mean λ = 900 per hour. Use the
above approximation to calculate the probability P( X > 950). You should
express this in terms of an appropriate standard Gaussian quantile, i.e.,
express your answer in terms of the function Φ(z) = P( Z ≤ z) where Z
has a standard normal distribution.

(c) Use R to compute the value of P( X > 950), approximately using
the Gaussian quantile and exactly using the Poisson CDF. There are
functions built in to R for each of these.

Solution 4. We derive each of the results below:

(a) We are given that X ∼ Poiss(λ) and Y ∼ N (µ, σ²). We know that
E(X²) = Var(X) + (E(X))².
Equating first moments we have:
\[
E(X) = E(Y) \implies \mu = \lambda
\]
Equating second moments we have:
\begin{align*}
E(X^2) &= E(Y^2) \\
\implies \lambda^2 + \lambda &= \sigma^2 + \mu^2 = \sigma^2 + \lambda^2 \\
\implies \sigma^2 &= \lambda
\end{align*}
So the closest Gaussian approximation to the Poisson by matching the
first 2 moments is Y ∼ N (λ, λ).

(b) We have from part (a) that X is approximately distributed as Y ∼ N (µ = λ, σ² = λ). This shows that:
\begin{align*}
P(X > 950) &= 1 - P(X \leq 950) \\
&= 1 - P\left( \frac{X - \mu}{\sigma} \leq \frac{950 - \mu}{\sigma} \right) = 1 - P\left( \frac{X - \lambda}{\sqrt{\lambda}} \leq \frac{950 - \lambda}{\sqrt{\lambda}} \right) \\
&= 1 - P\left( Z \leq \frac{50}{30} \right) \\
&= 1 - \Phi\left( \frac{5}{3} \right)
\end{align*}

(c) We can use R to perform the approximation as follows:

# Setup parameters
lambd <- 900
mu <- lambd
sigma_sq <- lambd
t_val <- 950
t_val_std <- (t_val - mu) / sqrt(sigma_sq)

# Normal approximation to the Poisson
1 - pnorm(t_val_std)

which gives a value of 0.04779035.

If we calculate using the built-in Poisson functions we get the more
precise numerical estimate:
1 - ppois(q = 950, lambda = lambd)

which gives a value of 0.04711902.
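A direct simulation (our own addition, not required by the problem) gives a similar answer:

# Empirical estimate of P(X > 950) for X ~ Poisson(900) via simulation
set.seed(4)
mean(rpois(1e6, lambda = 900) > 950)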

Problem 5. A Binomial RV is the sum of independent Bernoullis and is thus
well approximated by a Gaussian by the central limit theorem. We will use
this to perform a rudimentary hypothesis test.
We believe a coin is fair and toss it 100 times. It lands heads up 60
times. Use moment matching (or the CLT) to approximate the probability
P( X ≥ 60) assuming the coin is fair in terms of the standard Gaussian
CDF.
Use R to give a numerical value. Roughly, our intuition is that if this
value is small then we should be “suspicious” of our initial hypothesis that
the coin is fair. Are you suspicious?

Solution 5. Let Y1 , ..., Yn ∼ Bernoulli ( p) with p = 1/2. Then
X = {number of times landing heads up} = \( \sum_{i=1}^n Y_i \). Since n = 100
is large, by applying the CLT we have \( \frac{\sqrt{n}(\bar{Y} - \mu)}{\sigma} \sim N(0, 1) \).

Equivalently,
\[
\frac{\sqrt{n}(\bar{Y} - \mu)}{\sigma} \sim N(0, 1)
\;\Rightarrow\; \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i \sim N\left( \mu, \frac{\sigma^2}{n} \right)
\;\Rightarrow\; \sum_{i=1}^n Y_i \sim N(n\mu, n\sigma^2)
\]
Since µ = E[Yi ] = p = 1/2 and σ² = Var(Yi ) = p(1 − p), we have
X = \( \sum_{i=1}^n Y_i \) ∼ N (np, np(1 − p)) = N (50, 25). Hence
\begin{align*}
P(X \geq 60) &= P\left( \frac{X - 50}{5} \geq \frac{60 - 50}{5} \right) \\
&= P(Z \geq 2) \\
&= 1 - F_Z(2) \\
&\approx 0.02275 && \text{(can be computed using 1 - pnorm(2) in R)}
\end{align*}

where Z is a standard Gaussian random variable. This value is small
enough that we should be suspicious.
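As an optional cross-check (our own addition), the exact binomial tail can be computed in R and compared with the CLT value; it comes out slightly larger (roughly 0.028 versus 0.023), but either way small enough to raise suspicion:

# Exact tail probability under the fair-coin model: P(X >= 60) for X ~ Bin(100, 0.5)
1 - pbinom(59, size = 100, prob = 0.5)

# CLT approximation used above
1 - pnorm(2)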

Problem 6. Suppose that X1 , . . . , Xn are repeated measurements of a
quantity µ, and that E[ Xi ] = µ, and Var( Xi ) = σ2 . Further, let us suppose
that each Xi ∈ [0, 1]. Let X n denote the average of the measurements.

(a) Use the fact that each Xi ∈ [0, 1] to give some bounds on µ and σ.
(b) Suppose that we take 16 measurements and that σ² = 1/12. Use the CLT to
approximate the probability that the average deviates from µ by more than
0.5.

(c) Use Chebyshev’s inequality to give an upper bound on the same quantity.

(d) Repeat the above calculation but now use Hoeffding’s inequality.

(e) Now use R to estimate this probability in the following way. Suppose
that each Xi is U [0, 1]. The mean is 0.5 and the variance is exactly 1/12.
Draw 16 measurements and track if the sample mean is within 0.5 of the
true mean. Repeat this 1000 times to get an accurate estimate. Compare
the answer to what you obtained analytically. Particularly, order the
confidence intervals by length.

Solution 6.(a) 0 ≤ Xi ≤ 1 ⇒ 0 ≤ E[ Xi ] ≤ 1 ⇒ 0 ≤ µ ≤ 1.
0 ≤ Var ( Xi ) = E[( Xi − E[ Xi ])2 ] ≤ (1 − 0)2 = 1 ⇒ 0 ≤ σ ≤ 1 (Note
that we can actually have Var ( Xi ) ≤ 1/4 by Popoviciu’s inequality).


(b) By the CLT, \( \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1) \), then
\begin{align*}
P(|\bar{X} - \mu| > 0.5) &= P\left( \frac{\sqrt{n} |\bar{X} - \mu|}{\sigma} > \frac{0.5 \sqrt{n}}{\sigma} \right) \\
&= P\left( |Z| > \frac{0.5 \times \sqrt{16}}{\sqrt{1/12}} \right) \\
&= P(|Z| > 4\sqrt{3}) \\
&= 2 \times P(Z > 4\sqrt{3}) \\
&\approx 4.26 \times 10^{-12} && \text{(R command: 2*(1-pnorm(4*sqrt(3))))}
\end{align*}

(c) First notice that E( X̄ ) = µ and Var ( X̄ ) = Var ( Xi )/n = σ²/n. Then
by Chebyshev’s inequality,
\begin{align*}
P(|\bar{X} - \mu| > 0.5) &\leq \frac{\mathrm{Var}(\bar{X})}{0.5^2} \\
&= \frac{\sigma^2}{0.25 n} \\
&= \frac{1}{12 \times 0.25 \times 16} \\
&= \frac{1}{48} \approx 0.0208
\end{align*}

(d) By Hoeffding’s inequality,
\begin{align*}
P(|\bar{X} - \mu| > 0.5) &\leq 2 \exp\left( -\frac{2 n^2 \cdot 0.5^2}{\sum_{i=1}^n (1 - 0)^2} \right) \\
&= 2 \exp(-2 \times 0.25 \times 16) \\
&\approx 0.00067
\end{align*}
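For reference (our own addition), the three analytic quantities from parts (b)-(d) can be computed side by side in R:

# CLT approximation, Chebyshev bound, and Hoeffding bound for P(|Xbar - mu| > 0.5)
n <- 16
sigma_sq <- 1 / 12

p_clt <- 2 * (1 - pnorm(0.5 * sqrt(n) / sqrt(sigma_sq)))
p_chebyshev <- (sigma_sq / n) / 0.5^2
p_hoeffding <- 2 * exp(-2 * n * 0.5^2)

c(CLT = p_clt, Chebyshev = p_chebyshev, Hoeffding = p_hoeffding)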

(e) Use R to simulate the samples:

# initialize
L = rep(0, 1000)

for (k in 1:1000) {
  # generate 16 uniformly distributed samples
  X = runif(16, min = 0, max = 1)
  # sample mean
  X.mean = mean(X)
  # check whether the sample mean deviates from the true mean by more than 0.5
  L[k] = (abs(X.mean - 0.5) > 0.5)
}
# estimated probability (equals 0 in this simulation)
prob = sum(L) / length(L)

Based on the R simulations, the estimated P(| X̄ − µ| > 0.5) is 0. Summarizing parts
(b)(c)(d)(e), we have
\[
P_{\mathrm{R}} < P_{\mathrm{CLT}} < P_{\mathrm{Hoeffding}} < P_{\mathrm{Chebyshev}}
\]
and hence the α-level confidence intervals for (b)(c)(d)(e) satisfy
\[
CI_{\mathrm{R}} \subset CI_{\mathrm{CLT}} \subset CI_{\mathrm{Hoeffding}} \subset CI_{\mathrm{Chebyshev}}
\]

Problem 7. Conjugate Priors: In lectures we have seen that the Beta and
Binomial are conjugate distributions, i.e. if we are estimating a Binomial
parameter and use a Beta prior then the posterior is a Beta distribution.
There is a similar relationship between Gamma and Poisson distributions.
You can use any reference (Wikipedia) for the distributions you need - for the
Gamma use the shape/rate parameterization.

(a) Suppose the data X1 , . . . , Xn ∼ Poi(λ). Write down the likelihood as a
function of the parameter λ.

(b) Compute the MLE for λ.

(c) Assume that λ ∼ Gamma(α, β), and write down the posterior distri-
bution over the parameter λ. Show that this posterior distribution is a
Gamma distribution, and compute its parameters.

(d) The mean of a Gamma distribution is α/β. Compute the posterior mean.
This will be our point estimate.

(e) Write the posterior mean as a convex combination of the prior mean and
the MLE. What happens if α, β are fixed and n → ∞?

Solution 7.(a)
\begin{align*}
L(\lambda; x_1, \ldots, x_n) = L(\lambda) &= \prod_{i=1}^n P(X_i = x_i) \\
&= \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} \\
&= \frac{e^{-n\lambda} \lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}
\end{align*}

(b) Since L(λ) > 0 and log is monotone increasing, \( \lambda_{\mathrm{MLE}} = \arg\max_\lambda L(\lambda) = \arg\max_\lambda \log L(\lambda) \).
Such a λ satisfies \( \frac{\partial \log L(\lambda)}{\partial \lambda} = 0 \) and \( \frac{\partial^2 \log L(\lambda)}{\partial \lambda^2} \leq 0 \).
\[
\log L(\lambda) = -n\lambda + \sum_{i=1}^n x_i \log \lambda - \log \prod_{i=1}^n x_i!
\]

\[
\frac{\partial \log L(\lambda)}{\partial \lambda} = -n + \frac{\sum_{i=1}^n x_i}{\lambda} = 0
\;\Rightarrow\; \lambda = \frac{\sum_{i=1}^n x_i}{n}
\]
\[
\frac{\partial^2 \log L(\lambda)}{\partial \lambda^2} = -\frac{\sum_{i=1}^n x_i}{\lambda^2} < 0.
\]
Therefore \( \lambda_{\mathrm{MLE}} = \frac{\sum_{i=1}^n x_i}{n} = \bar{x}_n \).

(c) From lecture notes 9, we have
\begin{align*}
p(\lambda \mid x_1, \ldots, x_n) &\propto L(\lambda) \, p(\lambda) \\
&\propto \frac{e^{-n\lambda} \lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda} \\
&\propto \frac{(\beta + n)^{\alpha + \sum_{i=1}^n x_i}}{\Gamma\left( \alpha + \sum_{i=1}^n x_i \right)} \lambda^{\alpha + \sum_{i=1}^n x_i - 1} e^{-(\beta + n)\lambda}
\end{align*}
Therefore \( \lambda \mid X_1, \ldots, X_n \sim \mathrm{Gamma}\left( \alpha + \sum_{i=1}^n X_i, \; \beta + n \right) \).

(d) This is
\[
\lambda_{\mathrm{Bayes}} = \frac{\alpha + \sum_{i=1}^n x_i}{\beta + n}
\]

(e)
\[
\lambda_{\mathrm{Bayes}} = \frac{\beta}{\beta + n} \cdot \frac{\alpha}{\beta} + \frac{n}{\beta + n} \cdot \frac{\sum_{i=1}^n x_i}{n},
\]
a convex combination of the prior mean α/β and the MLE. When n → ∞ with α, β fixed, the weight on the prior mean vanishes and \( \lambda_{\mathrm{Bayes}} \to \lambda_{\mathrm{MLE}} \).
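To illustrate the shrinkage (our own addition, not part of the graded solution), here is a small R sketch of the conjugate update; the prior hyperparameters and the true rate below are arbitrary choices:

# Gamma-Poisson conjugate update: the posterior mean moves from the prior mean toward the MLE
set.seed(7)
alpha <- 2
beta <- 1          # prior Gamma(shape = alpha, rate = beta); prior mean = alpha / beta
lambda_true <- 4

for (n in c(5, 50, 500)) {
  x <- rpois(n, lambda = lambda_true)
  post_shape <- alpha + sum(x)
  post_rate <- beta + n
  cat(sprintf("n = %3d: MLE = %.3f, posterior mean = %.3f\n",
              n, mean(x), post_shape / post_rate))
}

As n grows, the posterior mean approaches the MLE, in line with part (e).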

Problem 8. Computing a Bayes rule: Suppose that X ∼ N (µ, 1) is
one sample from a Normal distribution. Let µ ∼ N (0, 1) be a prior. Assume squared
error loss. Recall that the Bayes estimator is the estimator that minimizes
the expected posterior loss.

(a) Compute the posterior distribution for µ.

(b) For the squared loss, compute the Bayes estimator.



(c) Extra credit: Compute the Bayes estimator using the loss |µ̂ − µ|.

Solution 8.(a) From lecture notes 9, we have
\begin{align*}
f(\mu \mid x) &\propto L(\mu; x) f(\mu) \\
&\propto \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2} \right) \times \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{\mu^2}{2} \right) \\
&\propto \exp\left( -\frac{2\mu^2 - 2x\mu + x^2}{2} \right) \\
&= \exp\left( -\frac{(\mu - \frac{x}{2})^2 + \frac{x^2}{4}}{2 \cdot \frac{1}{2}} \right) \\
&\propto \exp\left( -\frac{(\mu - \frac{x}{2})^2}{2 \cdot \frac{1}{2}} \right)
\end{align*}
Therefore \( \mu \mid X \sim N\left( \frac{X}{2}, \frac{1}{2} \right) \).

(b) Let \( R_{L_2}(\mu, \hat{\mu}) = E_{\mu \mid X} (\hat{\mu} - \mu)^2 \) and
\( \hat{\mu}_{L_2} = \arg\min_{\hat{\mu}} R_{L_2}(\mu, \hat{\mu}) \). Such a \( \hat{\mu}_{L_2} \) satisfies
\( \frac{\partial R_{L_2}(\mu, \hat{\mu})}{\partial \hat{\mu}} = 0 \) and \( \frac{\partial^2 R_{L_2}(\mu, \hat{\mu})}{\partial \hat{\mu}^2} \geq 0 \).
\[
R_{L_2}(\mu, \hat{\mu}) = E_{\mu \mid X} (\hat{\mu} - \mu)^2 = \int_{-\infty}^{+\infty} (\hat{\mu} - \mu)^2 f_{\mu \mid X}(\mu) \, d\mu
\]
\begin{align*}
\frac{\partial R_{L_2}(\mu, \hat{\mu})}{\partial \hat{\mu}} &= \int_{-\infty}^{+\infty} \frac{\partial (\hat{\mu} - \mu)^2}{\partial \hat{\mu}} f_{\mu \mid X}(\mu) \, d\mu \\
&= \int_{-\infty}^{+\infty} 2(\hat{\mu} - \mu) f_{\mu \mid X}(\mu) \, d\mu = 0 \\
\Rightarrow \hat{\mu} \int_{-\infty}^{+\infty} f_{\mu \mid X}(\mu) \, d\mu &= \int_{-\infty}^{+\infty} \mu f_{\mu \mid X}(\mu) \, d\mu = E_{\mu \mid X}(\mu) \\
\Rightarrow \hat{\mu} &= E_{\mu \mid X}(\mu) = \frac{X}{2}
\end{align*}
And
\[
\frac{\partial^2 R_{L_2}(\mu, \hat{\mu})}{\partial \hat{\mu}^2} = \int_{-\infty}^{+\infty} 2 f_{\mu \mid X}(\mu) \, d\mu = 2 > 0.
\]
Therefore \( \hat{\mu}_{L_2} = \frac{X}{2} \).

(c) The idea is the same as in (b), and one can get the following. (Note: for \( f(x) = |x| \), the derivative is \( f'(x) = 1 \) for \( x > 0 \) and \( f'(x) = -1 \) for \( x < 0 \).)
\[
R_{L_1}(\mu, \hat{\mu}) = E_{\mu \mid X} |\hat{\mu} - \mu| = \int_{-\infty}^{+\infty} |\hat{\mu} - \mu| f_{\mu \mid X}(\mu) \, d\mu
\]

\begin{align*}
\frac{\partial R_{L_1}(\mu, \hat{\mu})}{\partial \hat{\mu}} &= \int_{-\infty}^{+\infty} \frac{\partial |\hat{\mu} - \mu|}{\partial \hat{\mu}} f_{\mu \mid X}(\mu) \, d\mu \\
&= \int_{-\infty}^{\hat{\mu}} f_{\mu \mid X}(\mu) \, d\mu - \int_{\hat{\mu}}^{+\infty} f_{\mu \mid X}(\mu) \, d\mu = 0 \\
\Rightarrow P(\mu \leq \hat{\mu}) &= P(\mu > \hat{\mu}) \\
\Rightarrow \hat{\mu} &= \frac{X}{2} && \text{(by symmetry of the normal distribution)}
\end{align*}
And the second-order condition holds, so this is indeed a minimizer:
\[
\frac{\partial^2 R_{L_1}(\mu, \hat{\mu})}{\partial \hat{\mu}^2} = 2 f_{\mu \mid X}(\hat{\mu}) > 0.
\]
Therefore \( \hat{\mu}_{L_1} = \frac{X}{2} \). (Note: In this case, for the absolute value loss, the Bayes estimator is the median of the posterior, not the mean as for the squared loss. For the normal distribution the median and the mean coincide, so the estimator is the same as in the previous part.)
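As a numerical sanity check (our own addition), a simple importance-sampling sketch in R recovers the posterior N(X/2, 1/2) from part (a); the observed value x_obs and the number of draws are arbitrary choices:

# Weight draws from the prior mu ~ N(0, 1) by the likelihood of the single observation
set.seed(11)
x_obs <- 1.7
n_draws <- 200000

mu_prior <- rnorm(n_draws, mean = 0, sd = 1)
w <- dnorm(x_obs, mean = mu_prior, sd = 1)   # likelihood weights

post_mean <- sum(w * mu_prior) / sum(w)
post_var <- sum(w * (mu_prior - post_mean)^2) / sum(w)

# Should be close to x_obs / 2 = 0.85 and 1 / 2 respectively
c(post_mean = post_mean, post_var = post_var)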
