Reading 3: MLE


M347

Mathematical statistics

Point estimation
This publication forms part of an Open University module. Details of this and other
Open University modules can be obtained from the Student Registration and Enquiry Service, The
Open University, PO Box 197, Milton Keynes MK7 6BJ, United Kingdom (tel. +44 (0)845 300 6090;
email general-enquiries@open.ac.uk).
Alternatively, you may visit the Open University website at www.open.ac.uk where you can learn
more about the wide range of modules and packs offered at all levels by The Open University.
To purchase a selection of Open University materials visit www.ouw.co.uk, or contact Open
University Worldwide, Walton Hall, Milton Keynes MK7 6AA, United Kingdom for a brochure
(tel. +44 (0)1908 858793; fax +44 (0)1908 858787; email ouw-customer-services@open.ac.uk).

Note to reader
Mathematical/statistical content at the Open University is usually provided to students in
printed books, with PDFs of the same online. This format ensures that mathematical notation
is presented accurately and clearly. The PDF of this extract thus shows the content exactly as
it would be seen by an Open University student. Please note that the PDF may contain
references to other parts of the module and/or to software or audio-visual components of the
module. Regrettably mathematical and statistical content in PDF files is unlikely to be
accessible using a screenreader, and some OpenLearn units may have PDF files that are not
searchable. You may need additional help to read these documents.

The Open University, Walton Hall, Milton Keynes, MK7 6AA.


First published 2015.
Copyright © 2015 The Open University
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, transmitted
or utilised in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without
written permission from the publisher or a licence from the Copyright Licensing Agency Ltd. Details of such
licences (for reprographic reproduction) may be obtained from the Copyright Licensing Agency Ltd, Saffron
House, 6–10 Kirby Street, London EC1N 8TS (website www.cla.co.uk).
Open University materials may also be made available in electronic formats for use by students of the
University. All rights, including copyright and related rights and database rights, in electronic materials and
their contents are owned by or licensed to The Open University, or otherwise used by The Open University as
permitted by applicable law.
In using electronic materials and their contents you agree that your use will be solely for the purposes of
following an Open University course of study or otherwise as licensed by The Open University or its assigns.
Except as permitted above you undertake not to copy, store in any medium (including electronic storage or
use in a website), distribute, transmit or retransmit, broadcast, modify or show in public such electronic
materials in whole or in part without the prior written consent of The Open University or in accordance with
the Copyright, Designs and Patents Act 1988.
Edited, designed and typeset by The Open University, using the Open University TEX System.
Printed in the United Kingdom by Martins the Printers, Berwick upon Tweed.
ISBN 978 1 4730 0341 5
1.1
Contents
1 Maximum likelihood estimation
1.1 Background
1.2 Initial ideas
1.2.1 More on the binomial case
1.3 Maximum log-likelihood estimation
1.4 The general strategy
1.4.1 Two particular cases
1.4.2 Estimating normal parameters
1.5 A non-calculus example
1.6 Estimating a function of the parameter
1.6.1 Estimating σ and σ²
1.6.2 Estimating h(θ)
2 Properties of estimators
2.1 Sampling distributions
2.2 Unbiased estimators
2.2.1 Examples
2.3 Choosing between unbiased estimators
2.4 The Cramér–Rao lower bound
2.4.1 An alternative form
2.4.2 Examples
Solutions

1 Maximum likelihood estimation


This section considers the method of maximum likelihood estimation for obtaining a point estimator $\hat{\theta}$ of the unknown parameter θ. You will need to do lots of differentiation in this section.

1.1 Background
You should know a little about the whys and wherefores of the likelihood
function from your previous studies and the review of likelihood in
Section 5 of Unit 4. A further brief review of the fundamentals is
nevertheless included here.
Write f (x|θ) to denote either the probability density function associated
with X if X is continuous, or the probability mass function if X is
discrete. The ‘likelihood function’ gives a measure of how likely a value of
θ is, given that it is known that the sample X1 , X2 , . . . , Xn has the values
x1 , x2 , . . . , xn .

The likelihood function, or likelihood for short, for independent observations x_1, x_2, ..., x_n can be written as the product of n probability density or mass functions:
$$L(\theta) = f(x_1|\theta) \times f(x_2|\theta) \times \cdots \times f(x_n|\theta) = \prod_{i=1}^{n} f(x_i|\theta) \quad \text{for short.}$$
(A more precise notation would be $L(\theta \mid x_1, x_2, \ldots, x_n)$.)

Here, $\prod$ is very useful notation for the product of terms.

L(θ) can be evaluated for any value of θ in the parameter space that you
care to try. An important point to remember, therefore, is that the
likelihood is thought of as a function of θ (not of x). See, for example,
Figure 4.6 in Subsection 5.2 of Unit 4, which shows a likelihood function
for a parameter λ.
It follows that the best estimator for the value of θ is the one that
maximises the likelihood function . . . the maximum likelihood
estimator of θ. In this sense, the maximum likelihood estimator is the
most likely value of θ given the data.

1.2 Initial ideas


The maximum likelihood estimator is the maximiser of the likelihood function. A (differentiable) likelihood function, L(θ) say, is no different from any other (differentiable) function in the way that its maxima are determined using calculus. (The likelihood functions you will deal with in this unit are all differentiable with respect to θ.)
First, its stationary points are defined as the points θ such that
$$\frac{d}{d\theta} L(\theta) = L'(\theta) = 0.$$
In general, some of these stationary points correspond to maxima, some to minima and some to saddlepoints. In the simplest cases of likelihood functions, there will be just one stationary point and it will correspond to a maximum. To check whether any stationary point that you find, $\hat{\theta}$ say, does indeed correspond to a maximum, check the sign of the second derivative,
$$\frac{d^2}{d\theta^2} L(\theta) = L''(\theta),$$
at $\theta = \hat{\theta}$. If $L''(\hat{\theta}) < 0$, then $\hat{\theta}$ corresponds to a maximum. (If $L''(\hat{\theta}) > 0$, then $\hat{\theta}$ minimises the likelihood function!) You can then declare $\hat{\theta}$ to be the maximum likelihood estimator of θ and its numerical value to be the maximum likelihood estimate of θ. Further remarks on this strategy will be made in Subsection 1.4.

Exercise 5.1

Suppose that a coin is tossed 100 times with a view to estimating p, the probability of obtaining a head, 0 < p < 1. The probability mass function of the distribution of X, the random variable representing the number of heads in 100 tosses of the coin, is that of the binomial distribution with parameters n = 100 and p. Therefore,
$$f(x|p) = \binom{100}{x} p^x (1-p)^{100-x}.$$
Now suppose that the outcome of the experiment was 54 heads (and 100 − 54 = 46 tails).
What is the likelihood, L(p), for p given that X = x = 54?

Example 5.1 Maximising a binomial likelihood

This example picks up the problem considered in Exercise 5.1. It concerns


a coin tossed 100 times resulting in 54 heads; what is the maximum
likelihood estimate of p?
The likelihood function obtained in Exercise 5.1 is plotted as a function of p in Figure 5.1. The idea now is to choose as estimate of p the value of p that maximises L(p). This value, $\hat{p}$, is marked on Figure 5.1.

Figure 5.1  Plot of L(p) with maximum marked as $\hat{p}$

Write $C = \binom{100}{54}$ and note that C is positive and does not depend on p. Then, write the likelihood function that it is required to maximise as
$$L(p) = C p^{54} (1-p)^{46}.$$
Differentiation with respect to p requires that the product of the functions $p^{54}$ and $(1-p)^{46}$ is differentiated (by the product rule):
$$\frac{d}{dp}L(p) = L'(p) = C\left[(1-p)^{46}\frac{d}{dp}(p^{54}) + p^{54}\frac{d}{dp}\{(1-p)^{46}\}\right]$$
$$= C\left\{(1-p)^{46}\, 54p^{53} - p^{54}\, 46(1-p)^{45}\right\}$$
$$= C p^{53}(1-p)^{45}\{54(1-p) - 46p\}$$
$$= C p^{53}(1-p)^{45}(54 - 100p).$$
Setting L'(p) = 0 means that the potential maximum value of p satisfies
$$C p^{53}(1-p)^{45}(54 - 100p) = 0.$$
This equation can be divided through by C and $p^{53}(1-p)^{45}$ since both of these are positive, the latter for 0 < p < 1. Therefore, p solves
$$54 - 100p = 0,$$
or p = 54/100 = 0.54.
It can be checked that this value of p does indeed correspond to a maximum, by differentiating L'(p) again with respect to p, i.e. calculating
$$L''(p) = \frac{d}{dp}L'(p) = \frac{d}{dp}\left\{C p^{53}(1-p)^{45}(54 - 100p)\right\}.$$
Such a calculation would confirm that p = 0.54 corresponds to a maximum of L(p), because L''(0.54) < 0.
But stop right there, because this simple problem seems to be leading to a surprisingly difficult method of solution. There must be a better way! And there is, as you will now learn.

Define the log-likelihood function, or log-likelihood for short, by
$$\ell(\theta) = \log L(\theta).$$
(A more precise notation would be $\ell(\theta \mid x_1, x_2, \ldots, x_n)$. Remember that log means log to the base e.)

The trick, now, is this.

The maximum likelihood estimator, defined as the maximiser of the likelihood L(θ), is also the maximiser of the log-likelihood ℓ(θ). (If this is not obvious to you, please take its veracity on trust for the rest of this subsection.)

It turns out that it is almost always easier (or at least as easy) to work with the log-likelihood function rather than the likelihood function, and statisticians do this all the time. The fundamental reason is that the log of a product – which is the basic form of the likelihood function – is the sum of the logs of the elements of the product, and sums tend to be easier to work with than products (even though logs are involved!). This is illustrated in Subsection 1.2.1 in the binomial case.


1.2.1 More on the binomial case

Example 5.2 (Example 5.1 continued)

The likelihood in the binomial case considered in Example 5.1 was shown to be
$$L(p) = C p^{54}(1-p)^{46}.$$
The log-likelihood is therefore
$$\ell(p) = \log L(p) = \log\{C p^{54}(1-p)^{46}\} = \log C + \log(p^{54}) + \log((1-p)^{46}) = \log C + 54\log p + 46\log(1-p).$$
Now, differentiating with respect to p:
$$\ell'(p) = \frac{d}{dp}\ell(p) = \frac{d}{dp}(\log C) + 54\frac{d}{dp}(\log p) + 46\frac{d}{dp}\{\log(1-p)\} = 0 + \frac{54}{p} - \frac{46}{1-p} = \frac{54(1-p) - 46p}{p(1-p)} = \frac{54 - 100p}{p(1-p)}.$$
Setting ℓ'(p) = 0 means that the potential maximum value of p satisfies
$$\frac{54 - 100p}{p(1-p)} = 0.$$
This equation can be multiplied through by p(1 − p) since p(1 − p) is positive for 0 < p < 1. Therefore, p solves
$$54 - 100p = 0,$$
or p = 54/100 = 0.54. You might or might not feel that this calculation is a little easier than the one done in Example 5.1!
However, this time checking that this value of p corresponds to a maximum is much more easily done:
$$\ell''(p) = \frac{d^2}{dp^2}\ell(p) = \frac{d}{dp}\ell'(p) = \frac{d}{dp}\left(\frac{54}{p}\right) - \frac{d}{dp}\left(\frac{46}{1-p}\right) = -\frac{54}{p^2} - \frac{46}{(1-p)^2}.$$
At p = 54/100,
$$\ell''\left(\frac{54}{100}\right) = -\frac{54 \times 100^2}{54^2} - \frac{46 \times 100^2}{46^2} = -10\,000 \times \left(\frac{1}{54} + \frac{1}{46}\right) < 0,$$
and so $\hat{p} = 54/100$ is indeed the maximum likelihood estimate of p. In this case, ℓ''(p) happens to be negative for all 0 < p < 1, but this property is not required to confirm that p = 54/100 corresponds to a maximum.


You might have expected the ‘natural’ estimate of the proportion of heads
based on this experiment to be the observed number of heads, 54, divided
by the total number of coin tosses, 100. All this effort has confirmed that
the general approach of maximum likelihood estimation agrees with
intuition in this case.
Also, the fact that $\hat{p} = 0.54$ maximises both the likelihood and log-likelihood functions is confirmed pictorially in Figure 5.2.

Figure 5.2  (a) Plot of L(p) with maximum marked as $\hat{p} = 0.54$; (b) plot of ℓ(p) with maximum marked as $\hat{p} = 0.54$
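As an optional aside (this code is a supplement, not part of the original module text), the calculation can be checked numerically. The sketch below maximises the log-likelihood of Example 5.2, up to the additive constant log C, using scipy; the variable names and the bounds are simply illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x, n = 54, 100  # observed number of heads and number of tosses

def neg_log_likelihood(p):
    # negative of ell(p) = log C + x log p + (n - x) log(1 - p), dropping the constant log C
    return -(x * np.log(p) + (n - x) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # approximately 0.54 = x / n
```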

Example 5.2 came up with a particular numerical maximum likelihood


estimate rather than a formula for a maximum likelihood estimator
because you were given the numerical result arising from the coin-tossing
experiment. Suppose now that you were concerned with a ‘generic’
coin-tossing experiment in which a coin was tossed n times. The model for
this situation is that the number of heads, X, follows the B(n, p)
distribution. The number of coin tosses, n, is known, as is the outcome of
the experiment that x heads and n − x tails were observed. The next
exercise will lead you through the steps to finding the maximum likelihood
estimator of p. The factor $\binom{n}{x}$ that appears in the binomial probability mass function can be written as a constant C because it does not depend on p.


Exercise 5.2

Consider the situation described above, assuming 0 < x < n.


(a) What is the likelihood, L(p), for p given that X = x?
(b) What is the log-likelihood, ℓ(p), for p given that X = x?
(c) Find a candidate formula for the maximum likelihood estimator of p
by differentiating ℓ(p) with respect to p and solving ℓ′ (p) = 0.
(d) Confirm that the point you located in part (c) corresponds to a
maximum of the log-likelihood by checking that ℓ′′ (p) < 0 at that
value of p.
(e) In the B(n, p) model with known n, what is the maximum likelihood estimator, $\hat{p}$, of p? In one sentence, describe what $\hat{p}$ means in terms of the coin-tossing experiment.

1.3 Maximum log-likelihood estimation


The following screencast will demonstrate why the value of θ which
maximises the likelihood is the same as the value of θ which maximises the
log-likelihood.
Screencast 5.1 Demonstration that the likelihood and log-likelihood
have the same maximiser

Interactive content appears here. Please visit the website to use it.

If you would like to follow through the calculus argument for why you can
maximise the log-likelihood instead of the likelihood, refer to Section 1 of
the ‘Optional material for Unit 5’.

1.4 The general strategy


Since obtaining maximum likelihood estimators and estimates can be
based wholly on the log-likelihood function rather than the likelihood
function itself, you might as well start the process from ℓ(θ) rather than
from L(θ). The next exercise yields a result to facilitate this.

Exercise 5.3

The likelihood function for parameter θ based on independent observations x_1, x_2, ..., x_n is
$$L(\theta) = \prod_{i=1}^{n} f(x_i|\theta),$$
where f(x|θ) denotes either the probability density function associated with X if X is continuous or the probability mass function if X is discrete.
Write down an expression in terms of the f(x_i|θ)'s for the log-likelihood function, ℓ(θ).


The following strategy can therefore be followed for finding maximum


likelihood estimators and estimates. It assumes, in Step 3 of the strategy,
that there is just one solution to the relevant equation. This will be
adequate for the examples to follow in M347.

Step 1  Obtain the log-likelihood
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i|\theta).$$
Step 2  Differentiate the log-likelihood function with respect to θ to obtain ℓ'(θ).
Step 3  Rearrange the equation ℓ'(θ) = 0 to find the solution for θ; this gives the candidate for the maximum likelihood estimator $\hat{\theta}$.
Step 4  Differentiate the log-likelihood a second time with respect to θ to obtain ℓ''(θ). Check whether $\ell''(\hat{\theta}) < 0$. If it is, declare $\hat{\theta}$ to be the maximum likelihood estimator.
Step 5  If you have numerical data, insert the numbers into the formula for the maximum likelihood estimator to find the maximum likelihood estimate of θ.
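In practice, Steps 1 to 4 are often carried out numerically by minimising the negative log-likelihood. The following sketch is a supplement to the text, not part of the module; the helper name mle_1d and the data values are invented for illustration, and it uses the exponential model of Example 5.3, which follows, as a test case.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mle_1d(log_pdf, data, bounds):
    """Numerically maximise ell(theta) = sum_i log f(x_i | theta) over a single parameter."""
    data = np.asarray(data, dtype=float)

    def neg_log_likelihood(theta):
        return -np.sum(log_pdf(data, theta))

    return minimize_scalar(neg_log_likelihood, bounds=bounds, method="bounded").x

# Test case: exponential data, log f(x | lam) = log(lam) - lam * x,
# for which the MLE should be 1 / xbar (see Example 5.3 below).
data = [0.8, 1.7, 0.4, 2.9, 1.1]
lam_hat = mle_1d(lambda x, lam: np.log(lam) - lam * x, data, bounds=(1e-9, 50.0))
print(lam_hat, 1 / np.mean(data))  # the two values agree closely
```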

The assumption made in Step 3 above is quite a big one. There are
alternatives. For example, more generally, the maximum of the
log-likelihood is actually either at a stationary point of the log-likelihood
function or at one of the endpoints of the parameter space. (One or both
of these endpoints might be infinite.) You could then check the values of
ℓ(θ) at the stationary points and the endpoints, and return the point from
among these that yields the largest value of the log-likelihood as the
maximum likelihood estimator. It is also possible that the log-likelihood
function is not differentiable everywhere in the interior of the parameter
space, in which case the calculus route to maximum likelihood estimation
fails. You will see an example of this in Subsection 1.5. Nonetheless, you
should use the approach of the above box throughout M347, unless
specifically told not to.
This might be a useful time to introduce a standard piece of terminology
to avoid the mouthfuls ‘maximum likelihood estimator’ and ‘maximum
likelihood estimate’; either is often referred to as the MLE.


On the other hand, mouthfuls are just what you want at the other
MLE: Major League Eating, ‘the world body that oversees all
professional eating contests’ and ‘includes the sport’s governing body,
the International Federation of Competitive Eating’.

1.4.1 Two particular cases

Example 5.3 Maximising an exponential likelihood

Suppose that x_1, x_2, ..., x_n is a set of independent observations each arising from the same exponential distribution, M(λ), which has pdf f(x|λ) = λe^{−λx} on x > 0. It is desired to estimate the positive parameter λ via maximum likelihood. (You have in fact already calculated L(λ) in Example 4.8 of Subsection 5.2 in Unit 4.)
Step 1  The log-likelihood function is given by
$$\ell(\lambda) = \sum_{i=1}^{n} \log f(x_i|\lambda) = \sum_{i=1}^{n} \log\left(\lambda e^{-\lambda x_i}\right) = \sum_{i=1}^{n} \left\{\log\lambda + \log(e^{-\lambda x_i})\right\} = \sum_{i=1}^{n} (\log\lambda - \lambda x_i) = n\log\lambda - \lambda\sum_{i=1}^{n} x_i.$$
The sample mean of the data is $\bar{x} = \sum_{i=1}^{n} x_i / n$; so, rearranging, $\sum_{i=1}^{n} x_i = n\bar{x}$. It saves a little writing to make this notational change now:
$$\ell(\lambda) = n\log\lambda - \lambda n\bar{x} = n(\log\lambda - \lambda\bar{x}).$$


Step 2  Differentiating the log-likelihood with respect to λ,
$$\ell'(\lambda) = \frac{d}{d\lambda}\ell(\lambda) = n\left(\frac{1}{\lambda} - \bar{x}\right).$$
Step 3  Solving ℓ'(λ) = 0 means solving
$$n\left(\frac{1}{\lambda} - \bar{x}\right) = 0.$$
Dividing throughout by n > 0, this is equivalent to
$$\frac{1}{\lambda} - \bar{x} = 0,$$
that is, $1/\lambda = \bar{x}$ or $\lambda = 1/\bar{x}$. The candidate MLE of λ is therefore $\hat{\lambda} = 1/\bar{X}$.
Step 4  To check whether $\hat{\lambda}$ is the MLE of λ, first evaluate ℓ''(λ):
$$\ell''(\lambda) = \frac{d}{d\lambda}\ell'(\lambda) = \frac{d}{d\lambda}\left\{n\left(\frac{1}{\lambda} - \bar{x}\right)\right\} = -\frac{n}{\lambda^2}.$$
Then check that $\ell''(\hat{\lambda}) < 0$, which it surely is: $\ell''(\hat{\lambda}) = -n\bar{x}^2 < 0$. In fact, again, ℓ''(λ) < 0 for all λ. $\hat{\lambda}$ is therefore the MLE of λ.
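As an optional supplement (not part of the module text), the same four steps can be reproduced symbolically; the sketch below uses the sympy library, and the symbol names are illustrative.

```python
import sympy as sp

lam, n, xbar = sp.symbols("lambda n xbar", positive=True)
ell = n * (sp.log(lam) - lam * xbar)                     # Step 1: ell(lambda) = n(log lambda - lambda xbar)
lam_hat = sp.solve(sp.Eq(sp.diff(ell, lam), 0), lam)[0]  # Steps 2 and 3: solve ell'(lambda) = 0
print(lam_hat)                                           # 1/xbar
print(sp.diff(ell, lam, 2).subs(lam, lam_hat))           # Step 4: ell''(1/xbar) = -n*xbar**2 < 0
```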

Exercise 5.4

Suppose independent observations x_1, x_2, ..., x_n are available from a Poisson distribution with pmf
$$f(x|\mu) = \frac{e^{-\mu}\mu^x}{x!} \quad \text{on } x = 0, 1, 2, \ldots,$$
for positive parameter µ. In this exercise you will show that the MLE, $\hat{\mu}$, of µ is $\bar{X}$.
(a) Show that the log-likelihood is (Step 1)
$$\ell(\mu) = -n\mu + n\bar{x}\log\mu - C,$$
where $C = \sum_{i=1}^{n}\log(x_i!)$ is a quantity that does not depend on µ.
(b) Find ℓ'(µ) and hence show that the candidate MLE is $\hat{\mu} = \bar{X}$. (Steps 2 and 3)
(c) Confirm that $\hat{\mu} = \bar{X}$ is indeed the MLE of µ. (Step 4)
(d) Table 5.1 shows the number of days on which there were 0, 1, 2, 3, 4, and 5 or more murders in London in a period of n = 1095 days from April 2004 to March 2007. It turns out that it is reasonable to assume that the numbers of murders in London on different days are independent observations from a Poisson distribution.

Table 5.1  Observed number of days with murders occurring in London between April 2004 and March 2007

Number of murders on a day:         0    1    2   3   4   5 or more
Number of days on which observed: 713  299   66  16   1   0

(Source: Spiegelhalter, D. and Barnett, A. (2009) 'London murders: a predictable pattern?', Significance, vol. 6, pp. 5–8.)
What is the maximum likelihood estimate of µ based on these data? (Step 5) (Give your answer correct to three decimal places.) Interpret your answer.

You might have noticed that, so far, all the MLEs that have been derived
arise from equating the sample mean with the population mean. (And this
will happen again!) However, it is certainly not the case that MLEs always
correspond to such a simple alternative approach.

1.4.2 Estimating normal parameters

Exercise 5.5

Suppose independent observations x_1, x_2, ..., x_n are available from a normal distribution with unknown mean µ and known standard deviation σ_0. (This situation is not very realistic but it makes a nice example and informs something later in this subsection.) This has pdf
$$f(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma_0}\right)^2\right\} \quad \text{on } -\infty < x < \infty.$$
In this exercise you will show that the MLE, $\hat{\mu}$, of µ is, once more, $\bar{X}$.
(a) Show that the log-likelihood is (Step 1)
$$\ell(\mu) = C - \frac{1}{2\sigma_0^2}\sum_{i=1}^{n}(x_i - \mu)^2,$$
where $C = -n\log(\sqrt{2\pi}\,\sigma_0)$. Notice that C is a quantity that does not depend on µ, even though it does depend on the fixed quantity σ_0.
(b) Find ℓ'(µ) and hence show that the candidate MLE is $\hat{\mu} = \bar{X}$. (Steps 2 and 3)
(c) Confirm that $\hat{\mu} = \bar{X}$ is indeed the MLE of µ. (Step 4)

Exercise 5.6

Suppose independent observations x_1, x_2, ..., x_n are available from a normal distribution with known mean µ_0 and unknown standard deviation σ. (This situation is not very realistic either, but it too contributes to something more useful shortly.) The pdf is the same as that in Exercise 5.5 except for some relabelling of the parameters:
$$f(x|\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu_0}{\sigma}\right)^2\right\} \quad \text{on } -\infty < x < \infty.$$
In this exercise you will show that the MLE, $\hat{\sigma}$, of σ is the square root of the average of the squared deviations about µ_0:
$$\hat{\sigma} = \left\{\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_0)^2\right\}^{1/2}.$$
(a) Show that the log-likelihood is (Step 1)
$$\ell(\sigma) = C_1 - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu_0)^2,$$
where $C_1 = -n\log(\sqrt{2\pi})$. Notice that C_1 differs from C in Exercise 5.5 because σ, which formed part of C, is now the parameter of interest.
(b) Find ℓ'(σ) and hence show that the candidate MLE is $\hat{\sigma}$ given above. (Steps 2 and 3)
(c) Confirm that $\hat{\sigma}$ is indeed the MLE of σ. (Step 4) Hint: You might find manipulations easier if you use the shorthand notation $S_0 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_0)^2$ and note that S_0, being a sum of squares, is positive.

You were promised that in this unit you would have to deal with the estimation of only one parameter at a time. Nonetheless it is opportune to mention a result that involves estimating two parameters simultaneously, the parameters of the normal model. (Now, both µ and σ are unknown.)

The maximum likelihood estimators of the pair of parameters µ and σ in the N(µ, σ²) model are
$$\hat{\mu} = \bar{X} \quad \text{and} \quad \hat{\sigma} = \left\{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right\}^{1/2}.$$

Given the result of Exercise 5.5, it is no surprise that $\hat{\mu} = \bar{X}$. It might also be unsurprising that $\hat{\sigma}$ has the same form when µ is unknown as it does when µ is known, with µ_0 in the latter replaced by its estimator $\bar{X}$ in the former. However, proof of the boxed result is not entirely straightforward and is beyond the scope of this unit. Moreover, you might also notice that $\hat{\sigma}$ is not the usual sample standard deviation; more on that later in the unit.
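The remark that $\hat{\sigma}$ is not the usual sample standard deviation can be seen concretely in the following supplementary sketch (not part of the module text; the simulated data and parameter values are arbitrary), which compares the MLE, which divides by n, with the usual sample standard deviation, which divides by n − 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=20)      # a simulated sample from N(5, 2^2)

mu_hat = x.mean()                                # MLE of mu
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))  # MLE of sigma: divides by n
s = x.std(ddof=1)                                # usual sample standard deviation: divides by n - 1

print(mu_hat, sigma_hat, s)                      # sigma_hat is a little smaller than s
```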

1.5 A non-calculus example

Exercise 5.7

Suppose you have data values x_1, x_2, ..., x_n from a uniform distribution on the interval 0 to θ with θ > 0 unknown.
(a) Write down the pdf of the uniform distribution on (0, θ) and hence write down the likelihood for θ based on x_1, x_2, ..., x_n. Hint: The support of the distribution is important.
(b) For θ > x_max, where x_max is the maximum data value, evaluate L'(θ). Is L'(θ) = 0 for any finite positive θ? Evaluate L''(θ). What is the value of $\lim_{\theta\to\infty} L''(\theta)$?
(c) Sketch a graph of L(θ) for all θ > 0.
(d) Using the graph of L(θ) obtained in part (c), what is the maximum likelihood estimator of θ?

Obtaining the maximum likelihood estimator of θ in the U(0, θ) distribution is therefore an easy problem if you look at the graph of the likelihood and a confusing one if you blindly use calculus (as in Exercise 5.7(b))! The problem here is the lack of continuity of L(θ) at the MLE $\theta = \hat{\theta} = X_{\max}$. This, in turn, is caused by the parameter θ itself being the boundary of the support of f(x|θ), and is common to such 'irregular' likelihood problems. (It is arguable as to how important such models are/should be in statistics.)
Rest assured that calculus usually works in maximum likelihood contexts, that it should remain your approach of choice to such problems in M347, and that no attempt will be made to catch you out in this way. The real world, however, won't always be so kind!
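As a supplementary illustration (not part of the module text; the simulated data are arbitrary), the following sketch evaluates the U(0, θ) likelihood on a grid and confirms that it is maximised at the largest observation.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 4.0
x = rng.uniform(0.0, theta_true, size=50)

# L(theta) = theta**(-n) if theta exceeds every observation, and 0 otherwise
grid = np.linspace(0.01, 8.0, 2000)
L = np.where(grid >= x.max(), grid ** (-len(x)), 0.0)

print(grid[np.argmax(L)], x.max())  # the grid maximiser sits at (or just above) max(x)
```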

1.6 Estimating a function of the parameter


Exercise 5.6 might have set another question brewing in your mind. If, for simplicity in the case of µ = µ_0 known, the MLE of the standard deviation σ in the normal model is
$$\hat{\sigma} = \left\{\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_0)^2\right\}^{1/2},$$
then is the MLE of σ² equal to the square of the MLE of σ, i.e. is it true that
$$\widehat{\sigma^2} = (\hat{\sigma})^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_0)^2?$$
(Notice that the first, wide, hat covers all of σ², while the second, narrow, hat covers just σ.)
This question is addressed in Subsection 1.6.1. Then, in Subsection 1.6.2, the more general question of estimating a function of the parameter, h(θ), will be explored.

1.6.1 Estimating σ and σ2

Exercise 5.8

Suppose independent observations x_1, x_2, ..., x_n are available from a normal distribution with known mean µ_0 and unknown variance σ². The model has not changed, so the pdf is precisely the same as that in Exercise 5.6. The focus of interest has changed, however, from σ to σ². The likelihood now has to be maximised with respect to σ² rather than with respect to σ.
The key to avoiding confusion in the calculations is to give σ² a different notation, v = σ², say, and work in terms of that. Remember that v > 0.
(a) Show that the log-likelihood in terms of v is (Step 1)
$$\ell(v) = C_1 - \frac{n}{2}\log v - \frac{1}{2v}\sum_{i=1}^{n}(x_i - \mu_0)^2,$$
where $C_1 = -n\log(\sqrt{2\pi})$.
(b) Find ℓ'(v) and hence show that the candidate MLE of v is (Steps 2 and 3)
$$\hat{v} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_0)^2.$$
(c) Confirm that $\hat{v}$ is indeed the MLE of v. (Step 4)
(d) So, is it true that $\widehat{\sigma^2} = (\hat{\sigma})^2$?

A log-likelihood based on n = 10 independent observations from an N(0, σ²) distribution is shown as a function of σ in Figure 5.3(a) and as a function of v = σ² in Figure 5.3(b). Notice that the maximum of the log-likelihood corresponds to σ = 1.0399 and v = 1.0814 = σ², respectively. (In fact, the value of the log-likelihood at the maximum is the same in each case.)

Figure 5.3  Plot of an N(0, 1)-based log-likelihood as: (a) a function of σ with maximum marked at $\hat{\sigma} = 1.0399$; (b) a function of v = σ² with maximum marked at $\hat{v} = 1.0814 = 1.0399^2$

1.6.2 Estimating h(θ)


The property demonstrated in Exercise 5.8 is, in fact, a quite general property of maximum likelihood estimators.

The MLE of a function of a parameter is the function of the MLE of the parameter. That is, if τ = h(θ) and $\hat{\theta}$ is the MLE of θ, then the MLE of τ is
$$\hat{\tau} = h(\hat{\theta}).$$

This means that in future there will be no need to make separate calculations to find the MLEs of, for example, a parameter (like σ) and its square (σ²); it will always be the case that the MLE of the square of the parameter is the square of the MLE of the parameter.
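The following supplementary sketch (not part of the module text; the data and bounds are arbitrary choices) illustrates the invariance property numerically in the setting of Exercise 5.8 and Figure 5.3: maximising the log-likelihood over σ and over v = σ² gives maximisers related by squaring.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=10)  # n = 10 observations with known mean mu_0 = 0

# Log-likelihoods up to additive constants, parametrised by sigma and by v = sigma**2
def neg_ll_sigma(sigma):
    return -np.sum(-np.log(sigma) - 0.5 * (x / sigma) ** 2)

def neg_ll_v(v):
    return -np.sum(-0.5 * np.log(v) - 0.5 * x ** 2 / v)

sigma_hat = minimize_scalar(neg_ll_sigma, bounds=(1e-3, 10.0), method="bounded").x
v_hat = minimize_scalar(neg_ll_v, bounds=(1e-6, 100.0), method="bounded").x
print(sigma_hat ** 2, v_hat)  # the two maximisers agree: v_hat equals sigma_hat squared
```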

15
Point estimation

Exercise 5.9

Consider a Poisson distribution with unknown mean µ.


(a) What is the probability, θ say, that a random variable X from this
distribution takes the value 0?
(b) Suppose that independent observations x1 , x2 , . . . , xn are available
from this Poisson distribution. Using the result of Exercise 5.4, what is
the MLE of θ?

The main result applies to virtually any function h, but it will be proved only for increasing h. This means that h has an inverse function g, say, so that since τ = h(θ), θ = g(τ). Notice that g is an increasing function of τ. (g is usually written as h⁻¹ but it will be less confusing not to do so here.)
Now, by setting θ = g(τ) in the log-likelihood ℓ(θ) it is possible to write the log-likelihood in terms of τ (as was done in a special case in Exercise 5.8). Denoting the log-likelihood for τ as m(τ),
$$m(\tau) = \ell(g(\tau)) = \ell(\theta).$$
Now, let $\hat{\theta}$ be the MLE of θ. Write $\hat{\tau} = h(\hat{\theta})$, so that $\hat{\theta} = g(\hat{\tau})$, and let τ_0 be some value of τ different from $\hat{\tau}$. Write θ_0 = g(τ_0). Because $\hat{\theta}$ is the MLE, you know that
$$\ell(\hat{\theta}) > \ell(\theta_0).$$
Then, combining the equation and the inequality above,
$$m(\hat{\tau}) = \ell(g(\hat{\tau})) = \ell(\hat{\theta}) > \ell(\theta_0) = \ell(g(\tau_0)) = m(\tau_0).$$
That is, the likelihood associated with $\hat{\tau}$ is greater than the likelihood associated with any other value τ_0, and hence $\hat{\tau}$ is indeed the MLE of τ.

2 Properties of estimators

Define $\tilde{\theta}$ to be any point estimator of a parameter θ. (The estimator $\tilde{\theta}$ is read as 'theta tilde'.) This section concerns general properties of point estimators. The notation $\tilde{\theta}$ is used to distinguish a general estimator from $\hat{\theta}$, which in the remainder of this unit will always refer to the MLE.

2.1 Sampling distributions


You were briefly exposed to the idea of sampling distributions in Unit 4.
Nevertheless, the topic will be explored in more detail and from scratch in
this section.
When a set of observations x_1, x_2, ..., x_n is obtained it is important to recognise that what has been gathered is only one of the many possible sets of observations that might have been obtained in the situation of interest. This particular dataset gives rise to the point estimate $\tilde{\theta}$ of θ, which is really of the form $\tilde{\theta} = t(x_1, x_2, \ldots, x_n)$ for some function t. On another, separate, occasion a completely different set of observations, $x_1^*, x_2^*, \ldots, x_n^*$ say, would have been obtained even from the same situation of interest. (This is the nature of random variation!) This, in turn, gives rise to a different value of the point estimate of θ, $\tilde{\theta}^*$ say, where $\tilde{\theta}^* = t(x_1^*, x_2^*, \ldots, x_n^*)$.


In fact, a whole range of realisations of the estimator $\tilde{\theta} = t(X_1, X_2, \ldots, X_n)$ is possible, where X_1, X_2, ..., X_n is the set of random variables giving rise to the observations. The probability distribution of X_1, X_2, ..., X_n induces a probability distribution for $\tilde{\theta}$, which is called the sampling distribution of $\tilde{\theta}$. The following example and animation are intended to try to clarify these ideas in a simple specific context.

Example 5.4  The sampling distribution of the sample mean of data from a normal distribution

Suppose that X_1, X_2, ..., X_n is a set of independent random variables each arising from the same normal distribution, N(θ, σ²), and it is desired to estimate the population mean θ. The sample mean is $\bar{X} = \sum_{i=1}^{n} X_i / n$. In this situation, $\bar{X} = \tilde{\theta}$. Moreover, $\bar{X} = \tilde{\theta} = t(X_1, X_2, \ldots, X_n)$ where
$$t(X_1, X_2, \ldots, X_n) = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
You already know (Exercise 1.12 in Subsection 1.5.2 of Unit 1) that
$$\tilde{\theta} = \bar{X} \sim N\left(\theta, \frac{\sigma^2}{n}\right),$$
where ∼ is shorthand for 'is distributed as'. This distribution is, therefore, the sampling distribution of the point estimator $\tilde{\theta} = \bar{X}$. The estimator $\tilde{\theta}$ estimates the parameter θ, and inference for θ will be based on this sampling distribution.
Two aspects of the sampling distribution that are especially informative about $\tilde{\theta}$ and its relationship to θ are its mean and its variance. In this case, $E(\tilde{\theta}) = \theta$; i.e. the average value of all the possible $\tilde{\theta}$'s is θ. Also, $V(\tilde{\theta}) = \sigma^2/n$, which shows that the $\tilde{\theta}$'s are less variable about θ whenever σ is smaller and/or the sample size n is larger.
The way this sampling distribution changes as the sample size increases is shown in Animation 5.1.
Animation 5.1  Demonstration of how the sampling distribution changes with sample size

Interactive content appears here. Please visit the website to use it.
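For readers who prefer a simulation to an animation, the following supplementary sketch (not part of the module text; the parameter values and seed are arbitrary) draws many samples of size n from N(θ, σ²) and looks at the resulting sample means: their average is close to θ and their variance close to σ²/n.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n = 10.0, 2.0, 25
n_repeats = 100_000

# Draw many independent samples of size n and record the sample mean of each
means = rng.normal(loc=theta, scale=sigma, size=(n_repeats, n)).mean(axis=1)

print(means.mean())        # close to theta = 10
print(means.var(ddof=0))   # close to sigma**2 / n = 0.16
```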

In general, the sampling distribution of $\tilde{\theta}$ depends on the unknown parameter θ that it is desired to estimate. It is through the sampling distribution of the sampled data that all conclusions are reached with regard to (i) what information in the data is relevant and (ii) how it should usefully be employed for estimating the unknown parameter in the 'parent' distribution. This sampling distribution is the basis of the classical theory of statistical inference with which this and several other units in M347 are concerned. (Bayesian inference, on the other hand, is not concerned solely with sampling distributions.)


2.2 Unbiased estimators


Since a point estimator, $\tilde{\theta}$, follows a sampling distribution as in Subsection 2.1, it is meaningful to consider its mean $E(\tilde{\theta})$, where the expectation is taken over the sampling distribution of $\tilde{\theta}$.

The bias of an estimator is defined by
$$B(\tilde{\theta}) = E(\tilde{\theta}) - \theta.$$
If the expectation of the estimator $\tilde{\theta}$ is equal to the true parameter value, i.e.
$$E(\tilde{\theta}) = \theta,$$
then the bias of the estimator is zero and $\tilde{\theta}$ is said to be an unbiased estimator of θ.

(You will have met both these concepts if you studied M248.)

If $\tilde{\theta}$ is an unbiased estimator there is no particular tendency for it to under- or over-estimate θ. This is clearly a desirable attribute for a point estimator to possess. For example, Figure 5.4 shows the sampling distributions of two estimators, $\tilde{\theta}_1$ and $\tilde{\theta}_2$, of the same parameter θ. The left-hand distribution, with pdf $f_s(\tilde{\theta}_1)$, has mean θ; the right-hand one, with pdf $f_s(\tilde{\theta}_2)$, clearly does not. (Note the locations of $E(\tilde{\theta}_1) = \theta$ and $E(\tilde{\theta}_2)$, and the bias $B(\tilde{\theta}_2) = E(\tilde{\theta}_2) - \theta$.) It would appear that $\tilde{\theta}_1$ is a better estimator of θ than is $\tilde{\theta}_2$.

Figure 5.4  To the left, the pdf $f_s(\tilde{\theta}_1)$ of the sampling distribution of $\tilde{\theta}_1$; to the right, the pdf $f_s(\tilde{\theta}_2)$ of the sampling distribution of $\tilde{\theta}_2$

That said, it is not considered absolutely necessary for a good point


estimator to be exactly unbiased, a point that will be returned to later in
the unit.


Bias (of Priene) was one of the Seven Sages of Ancient Greece.
Among his sayings is this one relevant to the OU student: ‘Choose
the course which you adopt with deliberation; but when you have
adopted it, then persevere in it with firmness.’

2.2.1 Examples

Example 5.5  Unbiasedness of the sample mean as estimator of the population mean

Suppose that X_1, X_2, ..., X_n is a set of independent random variables from any distribution with population mean µ. The sample mean is $\tilde{\mu} = \bar{X} = \sum_{i=1}^{n} X_i / n$. Its expectation is (by linearity of expectation)
$$E(\tilde{\mu}) = E(\bar{X}) = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}\,n\mu = \mu,$$
since E(X_i) = µ for each i. Thus the sample mean, $\tilde{\mu} = \bar{X}$, is an unbiased estimator of the population mean µ.
Notice that this result holds whatever the distribution of the X_i's, provided only that it has finite mean µ.

The result that you will obtain in the next exercise is important both in its own right and as a prerequisite for the example concerning unbiasedness which follows it.


Exercise 5.10

Let X_1, X_2, ..., X_n be a set of independent random variables from any distribution with finite population variance σ². Show that
$$V(\bar{X}) = \frac{\sigma^2}{n}.$$

Example 5.6 Unbiased estimation of the population variance

Suppose that X_1, X_2, ..., X_n is a set of n ≥ 2 independent random variables from any distribution with population mean µ and variance σ². You already know that the sample variance is usually defined as
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$
You might not know, however, why the denominator is n − 1 rather than the n that you might have expected. The answer is that S² is an unbiased estimator of σ². This result also holds whatever the distribution of the X_i's (provided only that it has finite variance σ²). This will now be proved.
The trick used in the following calculation, of adding and subtracting µ inside the brackets, makes the calculation so much easier than it would be otherwise:
$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}(X_i - \mu + \mu - \bar{X})^2$$
$$= \sum_{i=1}^{n}\{(X_i - \mu) - (\bar{X} - \mu)\}^2$$
$$= \sum_{i=1}^{n}\{(X_i - \mu)^2 - 2(X_i - \mu)(\bar{X} - \mu) + (\bar{X} - \mu)^2\}$$
$$= \sum_{i=1}^{n}(X_i - \mu)^2 - 2(\bar{X} - \mu)\sum_{i=1}^{n}(X_i - \mu) + \sum_{i=1}^{n}(\bar{X} - \mu)^2$$
$$= \sum_{i=1}^{n}(X_i - \mu)^2 - 2n(\bar{X} - \mu)^2 + n(\bar{X} - \mu)^2 \qquad \left(\text{because } \sum_{i=1}^{n}(X_i - \mu) = n(\bar{X} - \mu)\right)$$
$$= \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2.$$

After a brief pause for breath, expectations can be taken. Explicitly,
$$E(S^2) = E\left\{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2\right\} = \frac{1}{n-1}E\left\{\sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2\right\} = \frac{1}{n-1}\left[\sum_{i=1}^{n}E\{(X_i - \mu)^2\} - n\,E\{(\bar{X} - \mu)^2\}\right].$$
Now, remember that E(X_i) = µ, i = 1, 2, ..., n, and $E(\bar{X}) = \mu$. It follows that, by the definition of variance, $E\{(X_i - \mu)^2\} = V(X_i)$ and $E\{(\bar{X} - \mu)^2\} = V(\bar{X})$. Therefore,
$$E(S^2) = \frac{1}{n-1}\left\{\sum_{i=1}^{n}V(X_i) - n\,V(\bar{X})\right\}.$$
But V(X_i) = σ², i = 1, 2, ..., n, and, from Exercise 5.10, $V(\bar{X}) = \sigma^2/n$. It follows that
$$E(S^2) = \frac{1}{n-1}\left(n\sigma^2 - n\,\frac{\sigma^2}{n}\right) = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2.$$
That is, E(S²) = σ² and therefore S² is an unbiased estimator of σ².

These results are summarised in the following box.

Let X_1, X_2, ..., X_n be a set of independent random variables from any distribution with population mean µ and population variance σ². Then the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ and sample variance $S^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2/(n-1)$ are unbiased estimators of µ and σ², respectively.
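The boxed result can be checked by simulation. The following supplementary sketch (not part of the module text) uses normally distributed data purely as a convenient choice; the result itself holds for any distribution with finite variance.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 3.0, 1.5, 8
n_repeats = 200_000

samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)   # sample variance, dividing by n - 1

print(xbar.mean())   # close to mu = 3.0
print(s2.mean())     # close to sigma**2 = 2.25
```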

Exercise 5.11

Let X_1, X_2, ..., X_n be a set of independent random variables from the N(µ, σ²) distribution. Towards the end of Subsection 1.4.2, you saw that, when both parameters are unknown, the MLE for σ is
$$\hat{\sigma} = \left\{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right\}^{1/2}.$$
Since the MLE of a function of a parameter is that function of the MLE of the parameter (Subsection 1.6.2), it must also be the case that
$$\widehat{\sigma^2} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$
Is $\widehat{\sigma^2}$ a biased estimator of σ²? If so, what is its bias? Hint: Relate $\widehat{\sigma^2}$ to the sample variance S².


2.3 Choosing between unbiased estimators


Unbiasedness on its own is not a property that uniquely defines the 'best' estimator of a parameter. In particular, there may be more than one estimator, $\tilde{\theta}_1$ and $\tilde{\theta}_3$ say, both of which are unbiased estimators of θ, i.e.
$$E(\tilde{\theta}_1) = E(\tilde{\theta}_3) = \theta.$$
Figure 5.5 shows the sampling distributions of two such estimators. Both distributions have mean θ. However, the distribution of $\tilde{\theta}_3$ is much wider than the distribution of $\tilde{\theta}_1$. There remains much more uncertainty about the value of θ when $\tilde{\theta}_3$ is its estimator than when $\tilde{\theta}_1$ is its estimator; $\tilde{\theta}_1$ seems to be a better estimator of θ than does $\tilde{\theta}_3$.

Figure 5.5  The pdfs $f_s(\tilde{\theta}_1)$ and $f_s(\tilde{\theta}_3)$ of the sampling distributions of $\tilde{\theta}_1$ and $\tilde{\theta}_3$, respectively

This comparison can be quantified by looking at the variances of the


estimators concerned. The lower the variance of an unbiased estimator, the
better the unbiased estimator. The best unbiased estimators of all,
therefore, are the unbiased estimators with minimum variance.
Unsurprisingly, when they exist – and they certainly don’t always – these
are called minimum variance unbiased estimators, or MVUEs for
short.

Exercise 5.12

Let X_1, X_2, ..., X_n be a set of independent random variables from a distribution with mean µ and variance σ². Write the sample mean in slightly different notation from earlier as $\bar{X}_n$ and assume that n ≥ 2. You know from Example 5.5 that $\bar{X}_n$ is an unbiased estimator of µ, and from Exercise 5.10 that its variance is σ²/n.
(a) A lazy statistician decides to use just the first observation he obtains, that is, X_1, as his estimator of µ. Is X_1 an unbiased estimator of µ? What is its variance? (If X_1 is a good estimator of µ, there'd be no need to make the other observations.)
(b) Which of the unbiased estimators mentioned so far in this exercise has the lower variance?
(c) The lazy statistician acknowledges the error of his ways but tries another tack: what if he uses $\bar{X}_m$, where $\bar{X}_m$ is the sample mean based on the first m observations that arise, and m < n? He continues to argue, correctly, that he is using an unbiased estimator. Convince the lazy statistician that he is still losing out in terms of the variability of his estimator.

Exercise 5.12 makes an argument – in terms of minimising the variance of


the sample mean as estimator of the population mean – for taking larger
samples of observations.


2.4 The Cramér–Rao lower bound


This subsection concerns a rather remarkable result. Amongst unbiased
estimators, the minimum possible variance – the very lowest that it is
possible to achieve with any unbiased estimator – is defined by the
‘Cramér–Rao lower bound’, often abbreviated to CRLB. This result is
named after two eminent mathematical statisticians, the Swede Harald
Cramér (1893–1985) and the Indian Calyampudi R. Rao (born 1920), who
independently produced versions of the result, as opposed to working
together on it.

The Cramér–Rao lower bound for the variance of any unbiased estimator $\tilde{\theta}$ of θ based on independent random variables X_1, X_2, ..., X_n is given by
$$V(\tilde{\theta}) \geq \frac{1}{E[\{\ell'(\theta \mid X_1, X_2, \ldots, X_n)\}^2]}.$$
Here, $\ell(\theta \mid X_1, X_2, \ldots, X_n)$ is once more the log-likelihood function,
$$\ell'(\theta \mid X_1, X_2, \ldots, X_n) = \frac{d}{d\theta}\{\ell(\theta \mid X_1, X_2, \ldots, X_n)\},$$
and E is the expectation over the distribution of X_1, X_2, ..., X_n. (It is useful here to use this more precise notation for the log-likelihood and its derivative.)

The Cramér–Rao result is subject to certain ‘regularity conditions’ which


ensure that the manipulations performed in its proof are valid. You should
note that these conditions include the requirement that the support of the
distribution of X1 , X2 , . . . , Xn must not depend on θ. The regularity
conditions are, therefore, not met if the data come from the U (0, θ)
distribution considered in Subsection 1.5.
The proof of the CRLB (in the continuous case) is an entirely optional
extra, provided in Section 2 of the ‘Optional material for Unit 5’.

2.4.1 An alternative form


A second, equivalent but often more useful, form for the Cramér–Rao inequality will be derived next in the continuous case. First, define the random variable
$$\phi = \ell'(\theta \mid X_1, X_2, \ldots, X_n).$$
The CRLB, therefore, depends on
$$E[\{\ell'(\theta \mid X_1, X_2, \ldots, X_n)\}^2] = E(\phi^2) = V(\phi) + \{E(\phi)\}^2.$$
To simplify φ, use the fact that, by independence,
$$\log f(X_1, X_2, \ldots, X_n \mid \theta) = \sum_{i=1}^{n}\log f(X_i|\theta)$$
to see that
$$\phi = \ell'(\theta \mid X_1, X_2, \ldots, X_n) = \frac{d}{d\theta}\log f(X_1, X_2, \ldots, X_n \mid \theta) = \frac{d}{d\theta}\left(\sum_{i=1}^{n}\log f(X_i|\theta)\right) = \sum_{i=1}^{n}\frac{d}{d\theta}\log f(X_i|\theta) = \sum_{i=1}^{n} U_i, \text{ say,}$$


where
$$U_i = \frac{d}{d\theta}\log f(X_i|\theta), \quad i = 1, 2, \ldots, n.$$
Notice that U_1, U_2, ..., U_n are independent random variables because X_1, X_2, ..., X_n are.
An important subsidiary result is that E(U_i) = 0, i = 1, 2, ..., n, as the following shows.
$$E(U_i) = E\left\{\frac{d}{d\theta}\log f(X_i|\theta)\right\} = \int \frac{d}{d\theta}\{\log f(x|\theta)\}\, f(x|\theta)\, dx$$
$$= \int \frac{d}{d\theta}\{f(x|\theta)\}\,\frac{1}{f(x|\theta)}\, f(x|\theta)\, dx \quad \text{(using the chain rule)}$$
$$= \int \frac{d}{d\theta} f(x|\theta)\, dx$$
$$= \frac{d}{d\theta}\int f(x|\theta)\, dx \quad \text{(swapping the order of integration and differentiation)}$$
$$= \frac{d}{d\theta}(1) \quad \text{(since } f(x|\theta) \text{ is a density)}$$
$$= 0.$$
(The regularity conditions mentioned above ensure the validity of swapping the order of integration and differentiation.)
The following exercise will provide results that lead to the restatement of
the CRLB that follows it.

Exercise 5.13

Recall that
$$\phi = \ell'(\theta \mid X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} U_i$$
and that U_1, U_2, ..., U_n are independent with
$$U_i = \frac{d}{d\theta}\log f(X_i|\theta).$$
(a) Show that E(φ) = 0.
(b) Write $\sigma_U^2 = V(U_i)$, i = 1, 2, ..., n. Show that $V(\phi) = n\sigma_U^2$.
(c) Show that
$$\sigma_U^2 = E\left[\left\{\frac{d}{d\theta}\log f(X|\theta)\right\}^2\right],$$
where X is a random variable from the distribution with pdf f(x|θ).

Now, in the notation of Exercise 5.13, the CRLB states that
$$V(\tilde{\theta}) \geq \frac{1}{E(\phi^2)} = \frac{1}{V(\phi)} = \frac{1}{n\sigma_U^2},$$
where E(φ²) = V(φ) because E(φ) = 0. The formula you derived in Exercise 5.13(c) then gives the result in the following box in the continuous case; the result also holds in the discrete case.

The Cramér–Rao lower bound for the variance of any unbiased estimator $\tilde{\theta}$ of θ based on independent observations X_1, X_2, ..., X_n from the distribution with pdf or pmf f(x|θ) is also given by
$$V(\tilde{\theta}) \geq \frac{1}{n\,E\left[\left\{\frac{d}{d\theta}\log f(X|\theta)\right\}^2\right]}.$$

There is a clear effect of sample size n in this second formulation: this is because $E\left[\left\{\frac{d}{d\theta}\log f(X|\theta)\right\}^2\right]$ does not depend on n, and so the CRLB is proportional to 1/n. Indeed, the CRLB provides another argument – in terms of minimising the lower bound on the variance of an unbiased estimator – for taking larger samples of observations.

2.4.2 Examples

Example 5.7  The CRLB for estimating the parameter of the Poisson distribution

Suppose that X_1, X_2, ..., X_n is a set of independent random variables each arising from the same Poisson distribution with parameter µ.
To compute the CRLB, start from the pmf of the Poisson distribution,
$$f(x|\mu) = \frac{\mu^x e^{-\mu}}{x!}.$$
Take logs:
$$\log f(x|\mu) = x\log\mu - \mu - \log(x!).$$
Differentiate with respect to the parameter µ:
$$\frac{d}{d\mu}\log f(x|\mu) = \frac{x}{\mu} - 1.$$
Square this:
$$\left\{\frac{d}{d\mu}\log f(x|\mu)\right\}^2 = \left(\frac{x}{\mu} - 1\right)^2 = \frac{1}{\mu^2}(x - \mu)^2.$$
Finally, take the expectation:
$$E\left[\left\{\frac{d}{d\mu}\log f(X|\mu)\right\}^2\right] = \frac{1}{\mu^2}E\{(X - \mu)^2\} = \frac{1}{\mu^2}V(X) = \frac{1}{\mu^2}\,\mu = \frac{1}{\mu}.$$
Here, E(X) = V(X) = µ for the Poisson distribution has been used. The CRLB is the reciprocal of n times this value, i.e. 1/(n/µ) = µ/n.
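As a supplementary check (not part of the module text; the parameter values are arbitrary), note that the sample mean is unbiased for µ and, by Exercise 5.10 with σ² = µ, has variance µ/n, so it attains this bound; the sketch below confirms this by simulation.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n = 2.5, 30
n_repeats = 200_000

# X-bar is unbiased for mu and, by Exercise 5.10 with sigma^2 = mu, has variance mu/n,
# which equals the CRLB just computed.
xbar = rng.poisson(lam=mu, size=(n_repeats, n)).mean(axis=1)

print(xbar.var(ddof=0))   # close to ...
print(mu / n)             # ... the CRLB, mu/n, approximately 0.0833
```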


Exercise 5.14

Suppose that X_1, X_2, ..., X_n is a set of independent random variables each arising from the same exponential distribution with parameter λ. The pdf of the exponential distribution is
$$f(x|\lambda) = \lambda e^{-\lambda x} \quad \text{on } x > 0.$$
Calculate the CRLB for unbiased estimators of λ.

Exercise 5.15

Suppose that X_1, X_2, ..., X_n is a set of independent random variables each arising from the same Bernoulli distribution with parameter p. The pmf of the Bernoulli distribution is
$$f(x|p) = p^x (1-p)^{1-x}.$$
Calculate the CRLB for unbiased estimators of p.


Solutions
Solution 5.1
The likelihood, L(p), for p given that X = x = 54 is the pmf when x = 54:
$$L(p) = f(54|p) = \binom{100}{54} p^{54}(1-p)^{100-54} = \binom{100}{54} p^{54}(1-p)^{46},$$
thought of as a function of p.

Solution 5.2
(a) $L(p) = f(x|p) = C p^x (1-p)^{n-x}$.
(b) $\ell(p) = \log C + x\log p + (n-x)\log(1-p)$.
(c) $\ell'(p) = \dfrac{x}{p} - \dfrac{n-x}{1-p} = \dfrac{x(1-p) - (n-x)p}{p(1-p)} = \dfrac{x - np}{p(1-p)}$,
which equals zero when x = np and hence p = x/n. The candidate formula for the maximum likelihood estimator is therefore $\hat{p} = X/n$.
(d) $\ell''(p) = -\dfrac{x}{p^2} - \dfrac{n-x}{(1-p)^2}$.
Noting that
$$1 - \hat{p} = 1 - \frac{x}{n} = \frac{n-x}{n},$$
$$\ell''(\hat{p}) = \ell''\left(\frac{x}{n}\right) = -\frac{x n^2}{x^2} - \frac{(n-x)n^2}{(n-x)^2} = -\frac{n^2}{x} - \frac{n^2}{n-x} < 0.$$
(Again, in this case, ℓ''(p) < 0 for all 0 < p < 1.) Therefore, $\hat{p} = x/n$ maximises the log-likelihood.
(e) The maximum likelihood estimator of p is $\hat{p} = X/n$. $\hat{p}$ is the number of heads divided by the total number of coin tosses, or $\hat{p}$ is the proportion of heads observed in the experiment.

Solution 5.3
$$\ell(\theta) = \log L(\theta) = \log\left\{\prod_{i=1}^{n} f(x_i|\theta)\right\} = \sum_{i=1}^{n}\log f(x_i|\theta).$$

Solution 5.4
(a) $\log f(x_i|\mu) = \log(e^{-\mu}) + \log(\mu^{x_i}) - \log(x_i!) = -\mu + x_i\log\mu - \log(x_i!)$,
so that
$$\ell(\mu) = \sum_{i=1}^{n}\log f(x_i|\mu) = -n\mu + \sum_{i=1}^{n} x_i\log\mu - \sum_{i=1}^{n}\log(x_i!) = -n\mu + n\bar{x}\log\mu - C,$$
as required.
(b) $\ell'(\mu) = \dfrac{d}{d\mu}(-n\mu + n\bar{x}\log\mu - C) = -n + \dfrac{n\bar{x}}{\mu} = n\left(\dfrac{\bar{x}}{\mu} - 1\right)$.
Setting this equal to zero yields the requirement that $\bar{x}/\mu = 1$, which is equivalent to $\mu = \bar{x}$. Hence the candidate MLE is $\hat{\mu} = \bar{X}$.
(c) $\ell''(\mu) = \dfrac{d}{d\mu}\left\{n\left(\dfrac{\bar{x}}{\mu} - 1\right)\right\} = -n\dfrac{\bar{x}}{\mu^2}$,
so that
$$\ell''(\hat{\mu}) = -n\frac{\bar{x}}{\bar{x}^2} = -\frac{n}{\bar{x}} < 0.$$
The negativity is because each of x_1, x_2, ..., x_n is non-negative and, provided they are not all zero, x̄ is positive. Therefore, $\hat{\mu} = \bar{X}$ is indeed the MLE of µ.
(d) $\hat{\mu} = \bar{x} = \dfrac{713 \times 0 + 299 \times 1 + 66 \times 2 + 16 \times 3 + 1 \times 4}{1095} = \dfrac{299 + 132 + 48 + 4}{1095} = \dfrac{483}{1095} = 0.441$
correct to three decimal places.
Thus, the estimated murder rate in London is 0.441 murders per day or, more roughly, about 1 murder every two days.
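For completeness, a supplementary sketch (not part of the module text) of the same arithmetic in Python:

```python
counts = [0, 1, 2, 3, 4]        # number of murders on a day ('5 or more' was observed on 0 days)
days = [713, 299, 66, 16, 1]    # number of days on which each count was observed

mu_hat = sum(c * d for c, d in zip(counts, days)) / sum(days)   # 483 / 1095
print(round(mu_hat, 3))         # 0.441 murders per day
```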

Solution 5.5
(a) $\log f(x_i|\mu) = -\log(\sqrt{2\pi}\,\sigma_0) - \dfrac{(x_i-\mu)^2}{2\sigma_0^2}$,
so that
$$\ell(\mu) = \sum_{i=1}^{n}\log f(x_i|\mu) = -n\log(\sqrt{2\pi}\,\sigma_0) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma_0^2} = C - \frac{1}{2\sigma_0^2}\sum_{i=1}^{n}(x_i-\mu)^2,$$
as required.
(b) $\ell'(\mu) = \dfrac{2}{2\sigma_0^2}\sum_{i=1}^{n}(x_i-\mu) = \dfrac{1}{\sigma_0^2}(n\bar{x} - n\mu) = \dfrac{n}{\sigma_0^2}(\bar{x}-\mu)$.
For this to be zero, $\mu = \bar{x}$, and so the candidate maximum likelihood estimator is $\hat{\mu} = \bar{X}$.
(c) $\ell''(\mu) = -\dfrac{n}{\sigma_0^2} = \ell''(\hat{\mu}) < 0$.
Therefore, $\hat{\mu} = \bar{X}$ is indeed the MLE of µ.

Solution 5.6
(a) As in Exercise 5.5(a), with relabelling of parameters,
$$\ell(\sigma) = -n\log(\sqrt{2\pi}\,\sigma) - \sum_{i=1}^{n}\frac{(x_i-\mu_0)^2}{2\sigma^2},$$
which can be written as
$$-n\log(\sqrt{2\pi}) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu_0)^2 = C_1 - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu_0)^2,$$
as required.
(b) Differentiating with respect to σ,
$$\ell'(\sigma) = -\frac{n}{\sigma} + \frac{2}{2\sigma^3}\sum_{i=1}^{n}(x_i-\mu_0)^2.$$
Set ℓ'(σ) = 0 and multiply throughout by σ³ > 0 to get
$$-n\sigma^2 + \sum_{i=1}^{n}(x_i-\mu_0)^2 = 0,$$
so that
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_0)^2$$
and hence $\hat{\sigma}$ is a candidate MLE for σ.
(c) $\ell''(\sigma) = \dfrac{n}{\sigma^2} - \dfrac{3}{\sigma^4}\sum_{i=1}^{n}(x_i-\mu_0)^2 = \dfrac{n}{\sigma^2} - \dfrac{3n}{\sigma^4}S_0$,
where S_0 is given in the question. Since $\hat{\sigma}^2 = S_0$,
$$\ell''(\hat{\sigma}) = \frac{n}{S_0} - \frac{3n S_0}{S_0^2} = \frac{n}{S_0} - \frac{3n}{S_0} = -\frac{2n}{S_0} < 0,$$
because S_0 > 0. Therefore $\hat{\sigma}$ is indeed the MLE of σ. Notice that in this case it is not true that ℓ''(σ) < 0 for all σ.

Solution 5.7
(a) The pdf of the uniform distribution on (0, θ) is
$$f(x|\theta) = \frac{1}{\theta} \quad \text{on } 0 < x < \theta.$$
The likelihood is
$$L(\theta) = \prod_{i=1}^{n} f(x_i|\theta).$$
Now, if θ is less than or equal to any of the x's, the corresponding pdf is zero and so, therefore, is the likelihood. Only if θ is greater than all of the x's, and in particular greater than the maximum data value, x_max, say, is each pdf in the likelihood equal to 1/θ and hence
$$L(\theta) = \frac{1}{\theta^n} \quad \text{on } \theta > x_{\max}.$$
(b) For θ > x_max,
$$L'(\theta) = -\frac{n}{\theta^{n+1}}.$$
This is negative for all finite positive θ and, in particular, is never zero. Also,
$$L''(\theta) = \frac{n(n+1)}{\theta^{n+2}},$$
and its limit as θ → ∞ is zero.
(c) A graph of L(θ) for all θ > 0 shows L(θ) equal to zero up to θ = x_max and, for θ > x_max, equal to 1/θⁿ, decreasing towards zero.
(d) From the graph in part (c), the MLE is $\hat{\theta} = X_{\max}$.

Solution 5.8
(a) Set $\sigma = \sqrt{v}$ in the log-likelihood given in Exercise 5.6(a) and remember that $\log\sqrt{v} = \frac{1}{2}\log v$. The desired formula then follows.
(b) $\ell'(v) = -\dfrac{n}{2v} + \dfrac{1}{2v^2}\sum_{i=1}^{n}(x_i-\mu_0)^2$.
Set ℓ'(v) equal to zero and multiply throughout by 2v² > 0 to get
$$-nv + \sum_{i=1}^{n}(x_i-\mu_0)^2 = 0,$$
which is satisfied by
$$v = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_0)^2.$$
That is, the candidate MLE of v is
$$\hat{v} = \frac{1}{n}\sum_{i=1}^{n}(X_i-\mu_0)^2.$$
(c) $\ell''(v) = \dfrac{n}{2v^2} - \dfrac{2}{2v^3}\sum_{i=1}^{n}(x_i-\mu_0)^2 = \dfrac{n}{2v^2} - \dfrac{n}{v^3}S_0$,
where $S_0 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_0)^2$ (as in Exercise 5.6). Now, since $\hat{v} = S_0$,
$$\ell''(\hat{v}) = \frac{n}{2S_0^2} - \frac{n}{S_0^2} = -\frac{n}{2S_0^2} < 0.$$
Therefore, $\hat{v}$ is indeed the MLE of v.
(d) $$\hat{\sigma} = \left\{\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu_0)^2\right\}^{1/2},$$
so that
$$(\hat{\sigma})^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\mu_0)^2.$$
But you have just shown that
$$\hat{v} = \widehat{\sigma^2} = \frac{1}{n}\sum_{i=1}^{n}(X_i-\mu_0)^2.$$
So, yes, the MLE of σ² is the square of the MLE of σ.

Solution 5.9
(a) For the Poisson distribution,
$$\theta = P(X = 0) = \frac{\mu^0 e^{-\mu}}{0!} = e^{-\mu}.$$
(b) From Exercise 5.4, $\hat{\mu} = \bar{X}$. Therefore
$$\hat{\theta} = e^{-\hat{\mu}} = e^{-\bar{X}}.$$

Solution 5.10
$$V(\bar{X}) = V\left(\frac{1}{n}\sum_{i=1}^{n}X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}V(X_i) = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n},$$
since V(X_i) = σ² for each i.

Solution 5.11
$$\widehat{\sigma^2} = \frac{n-1}{n}S^2,$$
so
$$E(\widehat{\sigma^2}) = \frac{n-1}{n}E(S^2) = \frac{n-1}{n}\sigma^2.$$
Therefore, $\widehat{\sigma^2}$ is a biased estimator of σ². Its bias is
$$B(\widehat{\sigma^2}) = E(\widehat{\sigma^2}) - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{1}{n}\sigma^2.$$
Solution 5.12
(a) E(X_1) = µ, so X_1 is an unbiased estimator of µ. Its variance is V(X_1) = σ².
(b) $V(\bar{X}_n) = \dfrac{\sigma^2}{n} < \sigma^2 = V(X_1)$ for all n ≥ 2.
(c) $V(\bar{X}_m) = \dfrac{\sigma^2}{m} > \dfrac{\sigma^2}{n} = V(\bar{X}_n)$, since m < n. The lazy statistician's estimator has greater variability than $\bar{X}_n$. He will have to admit defeat and choose m = n.

Solution 5.13
(a) Using the result (just shown) that E(U_i) = 0, and linearity of expectation,
$$E(\phi) = E\left(\sum_{i=1}^{n}U_i\right) = \sum_{i=1}^{n}E(U_i) = \sum_{i=1}^{n}0 = 0.$$
(b) $$V(\phi) = V\left(\sum_{i=1}^{n}U_i\right) = \sum_{i=1}^{n}V(U_i) = \sum_{i=1}^{n}\sigma_U^2 = n\sigma_U^2,$$
because U_1, U_2, ..., U_n are independent.
(c) $$\sigma_U^2 = V(U_i) = E(U_i^2) - \{E(U_i)\}^2 = E\left[\left\{\frac{d}{d\theta}\log f(X_i|\theta)\right\}^2\right],$$
since E(U_i) = 0. This equals the required formula because each X_i has the same distribution as X.

Solution 5.14
First, take logs:
$$\log f(x|\lambda) = \log\lambda - \lambda x.$$
Then differentiate with respect to the parameter λ:
$$\frac{d}{d\lambda}\log f(x|\lambda) = \frac{1}{\lambda} - x.$$
Square this:
$$\left\{\frac{d}{d\lambda}\log f(x|\lambda)\right\}^2 = \left(\frac{1}{\lambda} - x\right)^2.$$
Finally, take the expectation:
$$E\left[\left\{\frac{d}{d\lambda}\log f(X|\lambda)\right\}^2\right] = E\left\{\left(\frac{1}{\lambda} - X\right)^2\right\} = E\left\{\left(X - \frac{1}{\lambda}\right)^2\right\} = V(X) = \frac{1}{\lambda^2},$$
since E(X) = 1/λ and V(X) = 1/λ². The CRLB is the reciprocal of n times this value, that is, 1/(n/λ²) = λ²/n.

Solution 5.15
First, take logs:
$$\log f(x|p) = x\log p + (1-x)\log(1-p).$$
Then differentiate with respect to the parameter p:
$$\frac{d}{dp}\log f(x|p) = \frac{x}{p} - \frac{1-x}{1-p} = \frac{x(1-p) - (1-x)p}{p(1-p)} = \frac{x-p}{p(1-p)}.$$
Square this:
$$\left\{\frac{d}{dp}\log f(x|p)\right\}^2 = \frac{(x-p)^2}{p^2(1-p)^2}.$$
Finally, take the expectation:
$$E\left[\left\{\frac{d}{dp}\log f(X|p)\right\}^2\right] = \frac{E\{(X-p)^2\}}{p^2(1-p)^2} = \frac{V(X)}{p^2(1-p)^2} = \frac{p(1-p)}{p^2(1-p)^2} = \frac{1}{p(1-p)},$$
since E(X) = p and V(X) = p(1 − p). So the CRLB is p(1 − p)/n.

