Estimation


Bayesian Inference

Ricardo Ehlers
ehlers@icmc.usp.br

Departamento de Matemática Aplicada e Estatística


Universidade de São Paulo
Introduction to Decision Theory

A decision problem is completely specified by the description of the following spaces:

• Parameter space, or states of nature, Θ.
• Space of possible results of an experiment, Ω.
• Space of possible actions, A.

A decision rule δ is a function defined on Ω taking values in A, i.e. δ : Ω → A.

For each decision rule δ and each possible value of the parameter θ we can associate a loss L(δ, θ), assumed to take positive values. This is intended to measure the penalty associated with decision δ when the parameter takes the value θ.

Definition
The risk of a decision rule δ, denoted R(δ), is the posterior expected loss, i.e.

R(δ) = E_{θ|x}[L(δ, θ)] = ∫_Θ L(δ, θ) p(θ|x) dθ.

Definition
A decision rule δ* is optimal if its risk is minimum, i.e.

R(δ*) ≤ R(δ), for all δ.

• This rule is called the Bayes rule and its risk is the Bayes risk.
• Then δ* = arg min_δ R(δ) is the Bayes estimator.
• Bayes risks can be used to compare estimators: a decision rule δ₁ is preferred to a rule δ₂ if R(δ₁) < R(δ₂).
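The minimization can be carried out numerically. A minimal sketch (Python), assuming a hypothetical Beta(2, 3) posterior: it evaluates the posterior expected loss on a grid of candidate decisions and picks the minimizer, which under quadratic loss lands on the posterior mean.

import numpy as np
from scipy import stats

posterior = stats.beta(2, 3)          # hypothetical posterior p(theta|x)
theta = np.linspace(0, 1, 2001)       # grid over the parameter space
w = posterior.pdf(theta)
w /= w.sum()                          # normalized integration weights

def risk(delta, loss):
    # posterior expected loss R(delta), by numerical integration on the grid
    return np.sum(loss(delta, theta) * w)

deltas = np.linspace(0, 1, 2001)
risks = [risk(d, lambda d_, t: (d_ - t) ** 2) for d in deltas]
print(deltas[np.argmin(risks)], posterior.mean())   # both close to 0.4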
Loss Functions

Definition
The quadratic loss function is defined as

L(δ, θ) = (δ − θ)²,

and the associated risk is

R(δ) = E_{θ|x}[(δ − θ)²] = ∫_Θ (δ − θ)² p(θ|x) dθ.
Lemma
If L(δ, θ) = (δ − θ)², the Bayes estimator of θ is E(θ|x). The Bayes risk is

E{[E(θ|x) − θ]²} = Var(θ|x).
Definition
The absolute loss function is defined as

L(δ, θ) = |δ − θ|.

Lemma
If L(δ, θ) = |δ − θ|, the Bayes estimator of θ is the median of the posterior distribution.
Definition
The 0-1 loss function is defined as

L(δ, θ) = 1 − I(|θ − δ| < ε),

where I(·) is an indicator function and ε > 0.

Lemma
Let δ(ε) be the Bayes rule for this loss function and δ* the mode of the posterior distribution. Then,

lim_{ε→0} δ(ε) = δ*.

So, the Bayes estimator for a 0-1 loss function is the mode of the posterior distribution.
Consequently, for a 0-1 loss function and θ continuous the Bayes estimate is

θ* = arg max_θ p(θ|x) = arg max_θ p(x|θ)p(θ) = arg max_θ [log p(x|θ) + log p(θ)].

This is also referred to as the generalized maximum likelihood estimator (GMLE).
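So the quadratic, absolute and 0-1 losses lead to the posterior mean, median and mode, respectively. A minimal sketch, assuming a hypothetical skewed Gamma(3, 2) posterior (shape 3, rate 2), where the three estimates differ:

import numpy as np
from scipy import stats, optimize

post = stats.gamma(a=3, scale=1 / 2)   # Gamma(3, 2): shape 3, rate 2
print(post.mean())                     # quadratic loss -> posterior mean = 1.5
print(post.median())                   # absolute loss  -> posterior median ~ 1.34
# 0-1 loss -> posterior mode, found by maximizing the log-density
res = optimize.minimize_scalar(lambda t: -post.logpdf(t),
                               bounds=(1e-6, 10), method="bounded")
print(res.x)                           # mode = (3 - 1)/2 = 1.0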
Example. Let X1, . . . , Xn be a random sample from a Bernoulli distribution with parameter θ. For a Beta(α, β) conjugate prior it follows that the posterior distribution is

θ|x ∼ Beta(α + t, β + n − t),

where t = Σ_{i=1}^n x_i. Under quadratic loss the Bayes estimate is given by

E(θ|x) = (α + Σ_{i=1}^n x_i)/(α + β + n).
Example. In the previous example suppose that n = 100 and t = 10, so that the Bayes estimate under quadratic loss is

E(θ|x) = (α + 10)/(α + β + 100).

For a Beta(1,1) prior the estimate is 0.108, while for quite a different Beta(1,2) prior the estimate is 0.107. Both estimates are close to the maximum likelihood estimate 0.1. Under 0-1 loss with a Beta(1,1) prior the Bayes estimate (the posterior mode) is then 0.1.
Example. Bayes estimates of θ under quadratic loss with a Beta(a, b) prior, varying n and keeping x̄ = 0.1.

                           Prior parameters (a, b)
  n      (1,1)           (1,2)           (2,1)           (0.001,0.001)   (7,1.5)
  10     0.167 (0.067)   0.154 (0.063)   0.231 (0.061)   0.1 (0.082)     0.432 (0.039)
  20     0.136 (0.038)   0.13 (0.037)    0.174 (0.035)   0.1 (0.043)     0.316 (0.025)
  50     0.115 (0.016)   0.113 (0.016)   0.132 (0.016)   0.1 (0.018)     0.205 (0.013)
  100    0.108 (0.008)   0.107 (0.008)   0.117 (0.008)   0.1 (0.009)     0.157 (0.007)
  200    0.104 (0.004)   0.103 (0.004)   0.108 (0.004)   0.1 (0.004)     0.129 (0.004)
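The point estimates follow directly from the conjugate update. A minimal sketch reproducing the first column of the table:

def beta_post_mean(a, b, n, xbar=0.1):
    # E(theta|x) = (a + t)/(a + b + n) with t = n * xbar successes
    t = n * xbar
    return (a + t) / (a + b + n)

for n in (10, 20, 50, 100, 200):
    print(n, round(beta_post_mean(1, 1, n), 3))
# prints 0.167, 0.136, 0.115, 0.108, 0.104, approaching the MLE 0.1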
Example. Let X1, . . . , Xn be a random sample from a Poisson distribution with parameter θ. Using a conjugate prior it follows that

X1, . . . , Xn ∼ Poisson(θ)
θ ∼ Gamma(α, β)
θ|x ∼ Gamma(α + t, β + n),

where t = Σ_{i=1}^n x_i. The Bayes estimate under quadratic loss is

E(θ|x) = (α + Σ_{i=1}^n x_i)/(β + n) = (α + n x̄)/(β + n),

while the Bayes risk is

Var(θ|x) = (α + n x̄)/(β + n)² = E(θ|x)/(β + n).

If α → 0 and β → 0 then E(θ|x) → x̄ and Var(θ|x) → x̄/n. If n → ∞ then E(θ|x) → x̄.
Bayes estimates of θ under quadratic loss using a Gamma(a, b) prior, varying n and keeping x̄ = 10 (posterior variances, i.e. Bayes risks, in parentheses).

                            Prior parameters (a, b)
  n      (1,0.01)         (1,2)           (2,1)           (0.001,0.001)   (7,1.5)
  10     10.09 (1.008)    8.417 (0.701)   9.273 (0.843)   9.999 (1)       9.304 (0.809)
  20     10.045 (0.502)   9.136 (0.415)   9.619 (0.458)   10 (0.5)        9.628 (0.448)
  50     10.018 (0.2)     9.635 (0.185)   9.843 (0.193)   10 (0.2)        9.845 (0.191)
  100    10.009 (0.1)     9.814 (0.096)   9.921 (0.098)   10 (0.1)        9.921 (0.098)
  200    10.004 (0.05)    9.906 (0.049)   9.96 (0.05)     10 (0.05)       9.96 (0.049)
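A minimal sketch reproducing the first column of this table from the conjugate update:

def poisson_bayes(a, b, n, xbar=10.0):
    # posterior Gamma(a + n*xbar, b + n): returns (mean, variance)
    a1, b1 = a + n * xbar, b + n
    return a1 / b1, a1 / b1 ** 2

for n in (10, 20, 50, 100, 200):
    mean, var = poisson_bayes(1, 0.01, n)
    print(n, round(mean, 3), round(var, 3))
# n = 10 gives 10.09 (1.008), matching the table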
Example. If X1, . . . , Xn is a random sample from a N(θ, σ²) with σ² known, and using the conjugate prior θ ∼ N(µ₀, τ₀²), then the posterior is also normal, in which case mean, median and mode coincide. The Bayes estimate of θ is then given by

(τ₀⁻² µ₀ + n σ⁻² x̄)/(τ₀⁻² + n σ⁻²).

Under a Jeffreys prior for θ the Bayes estimate is simply x̄.
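The estimate is a precision-weighted average of the prior mean and the sample mean. A minimal sketch with hypothetical values (σ = 2, µ₀ = 0, τ₀ = 10):

import numpy as np

rng = np.random.default_rng(0)
sigma, mu0, tau0 = 2.0, 0.0, 10.0           # known sd, prior mean and sd
x = rng.normal(1.5, sigma, size=50)         # simulated data
prec0, prec_lik = tau0 ** -2, len(x) / sigma ** 2
post_mean = (prec0 * mu0 + prec_lik * x.mean()) / (prec0 + prec_lik)
print(post_mean, x.mean())   # posterior mean shrinks x-bar slightly toward mu0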
Example. Let X1, . . . , Xn be a random sample from a N(θ, σ²) distribution with θ known and φ = σ⁻² unknown. Then,

φ ∼ Gamma(n₀/2, n₀σ₀²/2)
φ|x ∼ Gamma((n₀ + n)/2, (n₀σ₀² + ns₀²)/2),

where ns₀² = Σ_{i=1}^n (x_i − θ)². Then,

E(φ|x) = (n₀ + n)/(n₀σ₀² + ns₀²)

is the Bayes estimate under quadratic loss.
• The quadratic loss can be extended to the multivariate case,

L(δ, θ) = (δ − θ)′(δ − θ),

and the Bayes estimate is E(θ|x).

• Likewise, the 0-1 loss can also be extended,

L(δ, θ) = 1 − I(θ − δ ∈ A), taking the Bayes rule in the limit as vol(A) → 0,

and the Bayes estimate is the joint mode of the posterior distribution.

• However, the absolute loss has no clear multivariate extension.
Example. Suppose X = (X1, . . . , Xp) has a multinomial distribution with parameters n and θ = (θ1, . . . , θp). If we adopt a Dirichlet prior with parameters α1, . . . , αp, the posterior distribution is Dirichlet with parameters x_i + α_i, i = 1, . . . , p. Under quadratic loss, the Bayes estimate of θ is E(θ|x), where

E(θ_i|x) = (x_i + α_i)/(n + Σ_{j=1}^p α_j).
Definition
A quantile loss function is defined as

L(δ, θ) = c₁(δ − θ) I_{(−∞,δ)}(θ) + c₂(θ − δ) I_{(δ,∞)}(θ),

where c₁ > 0 and c₂ > 0.

It can be shown that the Bayes estimate of θ is a value θ* such that

P(θ ≤ θ*) = c₂/(c₁ + c₂).

So the Bayes estimate is the quantile of order c₂/(c₁ + c₂) of the posterior distribution.
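This can be checked numerically: minimizing the posterior expected quantile loss by Monte Carlo recovers the c₂/(c₁ + c₂) posterior quantile. A minimal sketch, assuming a hypothetical N(0, 1) posterior with c₁ = 1 and c₂ = 3:

import numpy as np
from scipy import stats, optimize

post = stats.norm(0, 1)
c1, c2 = 1.0, 3.0
theta = post.rvs(size=200_000, random_state=1)     # posterior draws

def expected_loss(d):
    # quantile loss: c1*(d - theta) if theta < d, else c2*(theta - d)
    return np.mean(np.where(theta < d, c1 * (d - theta), c2 * (theta - d)))

d_star = optimize.minimize_scalar(expected_loss,
                                  bounds=(-4, 4), method="bounded").x
print(d_star, post.ppf(c2 / (c1 + c2)))            # both ~ 0.674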
Definition
The Linex (LINear EXponential) loss function is defined as

L(δ, θ) = exp[c(δ − θ)] − c(δ − θ) − 1, c ≠ 0.

It can be shown that the Bayes estimator is

δ* = −(1/c) log E[e^{−cθ} | x].
[Figure: the Linex loss function for c = −0.5, −1 and −2, plotted against δ − θ. With c < 0 it reflects small losses for overestimation and large losses for underestimation.]
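For a normal posterior the Linex estimator has a closed form: if θ|x ∼ N(µ, τ²) then E[e^{−cθ}|x] = exp(−cµ + c²τ²/2), so δ* = µ − cτ²/2; with c < 0 the estimate sits above the posterior mean, guarding against underestimation. A minimal Monte Carlo check with hypothetical values:

import numpy as np
from scipy import stats

mu, tau, c = 2.0, 1.0, -1.0                  # c < 0 penalizes underestimation
theta = stats.norm(mu, tau).rvs(size=500_000, random_state=2)
delta_mc = -np.log(np.mean(np.exp(-c * theta))) / c
print(delta_mc, mu - c * tau ** 2 / 2)       # both ~ 2.5, above the mean 2.0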
Credible Sets

• Point estimates reduce the posterior distribution to a single figure.
• How precise are point estimates?
• We seek a compromise between reporting a single number summarizing the posterior distribution and reporting the distribution itself.
Definition
A set C ⊆ Θ is a 100(1 − α)% credible set for θ if

P(θ ∈ C | x) ≥ 1 − α.

• The inequality is useful when θ has a discrete distribution; otherwise an equality is used in practice.
• This definition differs fundamentally from that of a classical confidence region.
Invariance under Transformation

Credible sets are invariant under one-to-one parameter transformations. Let θ* = g(θ) and let C* denote the image of C under g. Then,

P(θ* ∈ C* | x) = 1 − α.

In the univariate case, if C = [a, b] is a 100(1 − α)% credible interval for θ, then [g(a), g(b)] is a 100(1 − α)% credible interval for θ* (with the endpoints swapped if g is decreasing).
• Credible sets are not unique in general.
• For any α > 0 there are infinitely many solutions to P(θ ∈ C | x) = 1 − α, as the figure and sketch below illustrate.
[Figure: three different 90% credible intervals for a Poisson parameter θ whose posterior is Gamma(4, 0.5), plotted against the posterior density over 0 ≤ θ ≤ 25.]
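A minimal sketch of this non-uniqueness for the Gamma(4, 0.5) posterior of the figure: every interval [F⁻¹(p), F⁻¹(p + 0.9)] has posterior probability 0.90.

from scipy import stats

post = stats.gamma(a=4, scale=1 / 0.5)       # Gamma(4, 0.5): rate 0.5 -> scale 2
for p in (0.0, 0.025, 0.05, 0.075):          # split alpha = 0.10 across the tails
    a, b = post.ppf(p), post.ppf(p + 0.90)
    print(round(a, 2), round(b, 2))          # four distinct 90% intervals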
[Figure: three different 90% credible intervals for a Binomial parameter with posterior θ|x ∼ Beta(2, 1.5), plotted against the posterior density over 0 ≤ θ ≤ 1.]
[Figure: three different 90% credible intervals for θ|x ∼ Normal(2, 1), plotted against the posterior density over −2 ≤ θ ≤ 6.]
Definition
A 100(1 − α)% highest probability density (HPD) credible set for θ is a 100(1 − α)% credible set for θ with the property that

p(θ₁|x) ≥ p(θ₂|x), ∀ θ₁ ∈ C and ∀ θ₂ ∉ C.
• For symmetric unimodal distributions, HPD credible sets are obtained by assigning the same probability to each tail.
• HPD credible sets are not invariant under transformation.
• In the univariate case, if C = [a, b] is a 100(1 − α)% HPD interval for θ, then [g(a), g(b)] is a 100(1 − α)% credible interval for g(θ), but not necessarily an HPD one. A numerical sketch follows this list.
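For a unimodal posterior, the HPD interval is the shortest among all intervals with the required probability, which gives a simple numerical recipe. A minimal sketch for a hypothetical skewed Gamma(4, 0.5) posterior:

from scipy import optimize, stats

post = stats.gamma(a=4, scale=2)             # Gamma(4, 0.5), a skewed posterior
cred = 0.90

def width(p):
    # length of the interval [ppf(p), ppf(p + cred)] with probability cred
    return post.ppf(p + cred) - post.ppf(p)

p_opt = optimize.minimize_scalar(width, bounds=(1e-6, 1 - cred - 1e-6),
                                 method="bounded").x
print(post.ppf(p_opt), post.ppf(p_opt + cred))   # shortest = HPD interval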
Example. Let X1, . . . , Xn be a random sample from a N(θ, σ²) with σ² known. If θ ∼ N(µ₀, τ₀²) then θ|x ∼ N(µ₁, τ₁²) and

Z = (θ − µ₁)/τ₁ | x ∼ N(0, 1).

Define z_{α/2} as the value of Z such that P(Z ≤ z_{α/2}) = 1 − α/2. We can then write

P(−z_{α/2} ≤ (θ − µ₁)/τ₁ ≤ z_{α/2}) = 1 − α,

or, equivalently,

P(µ₁ − z_{α/2} τ₁ ≤ θ ≤ µ₁ + z_{α/2} τ₁) = 1 − α.

Then [µ₁ − z_{α/2} τ₁ ; µ₁ + z_{α/2} τ₁] is the 100(1 − α)% HPD interval for θ.
Example. In the previous example, if τ₀² → ∞ it follows that τ₁⁻² → nσ⁻² and µ₁ → x̄. Then,

Z = √n (θ − x̄)/σ | x ∼ N(0, 1).

The 100(1 − α)% HPD credible interval is given by

[x̄ − z_{α/2} σ/√n ; x̄ + z_{α/2} σ/√n],

which coincides numerically with the classical confidence interval. The interpretation, however, is completely different.
Example. In the previous example, the classical approach would base inference on

X̄ ∼ N(θ, σ²/n),

or equivalently,

U = √n (X̄ − θ)/σ ∼ N(0, 1).

U (called a pivot) is a function of the sample and of θ, but its distribution does not depend on θ.
Again we can find the percentile z_{α/2} such that

P(−z_{α/2} ≤ U ≤ z_{α/2}) = 1 − α,

or, equivalently,

P(X̄ − z_{α/2} σ/√n ≤ θ ≤ X̄ + z_{α/2} σ/√n) = 1 − α.

However, this is a probabilistic statement about the limits of the interval, not about θ.
The classical interpretation
If the same experiment were to be repeated infinitely many times, in 100(1 − α)% of them the random limits of the interval would include θ.

This is of little use in practice, since it is based on unobserved samples.

In the example, when X̄ = x̄ is observed, it is said that there is a 100(1 − α)% confidence (not probability) that the interval (x̄ − z_{α/2} σ/√n ; x̄ + z_{α/2} σ/√n) contains θ.
[Figure: 95% confidence intervals for the mean of 100 samples of size 20 simulated from a N(0, 100); arrows indicate intervals that do not contain zero. Realized confidence level: 95%.]
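A minimal sketch reproducing this coverage experiment (the seed, and hence the exact count, is arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sigma, n, z = 10.0, 20, stats.norm.ppf(0.975)    # N(0, 100): sd = 10
covered = 0
for _ in range(100):
    x = rng.normal(0.0, sigma, size=n)
    half = z * sigma / np.sqrt(n)                # half-width of the 95% CI
    covered += x.mean() - half <= 0.0 <= x.mean() + half
print(covered, "of 100 intervals contain the true mean 0")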
Normal Approximation

If the posterior distribution is unimodal and approximately symmetric, it can be approximated by a normal distribution centered at the posterior mode. Consider the Taylor expansion of log p(θ|x) about the mode θ*:

log p(θ|x) = log p(θ*|x) + (θ − θ*) [d log p(θ|x)/dθ]_{θ=θ*} + (1/2)(θ − θ*)² [d² log p(θ|x)/dθ²]_{θ=θ*} + · · ·
By definition,

[d log p(θ|x)/dθ]_{θ=θ*} = 0.

Defining

h(θ) = −d² log p(θ|x)/dθ²,

it follows that

p(θ|x) ≈ constant × exp{−(h(θ*)/2)(θ − θ*)²}.

Then, for large n, we have the following approximation:

θ|x ∼ N(θ*, h(θ*)⁻¹).
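In practice the mode and the curvature h(θ*) can be obtained numerically. A minimal sketch, assuming a hypothetical unnormalized log-posterior 3 log θ − 2θ (a Gamma(4, 2) kernel, mode 1.5):

import numpy as np
from scipy import optimize

def log_post(theta):
    # unnormalized log-posterior: Gamma(4, 2) kernel, used for illustration
    return 3 * np.log(theta) - 2 * theta

mode = optimize.minimize_scalar(lambda t: -log_post(t),
                                bounds=(1e-6, 50), method="bounded").x
eps = 1e-4                # central-difference estimate of the second derivative
h = -(log_post(mode + eps) - 2 * log_post(mode) + log_post(mode - eps)) / eps ** 2
print(mode, 1 / h)        # approximation N(theta*, 1/h): roughly N(1.5, 0.75)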
These results can be extended to the multivariate case. Since

[∂ log p(θ|x)/∂θ]_{θ=θ*} = 0,

defining the matrix

H(θ) = −∂² log p(θ|x)/∂θ∂θ′,

then, for large n, we have the following approximation:

θ|x ∼ N(θ*, H(θ*)⁻¹).

In particular, it is possible to construct approximate credibility regions based on the above results.
Definition
Let θ ∈ Θ. A region C ⊂ Θ is an asymptotic 100(1 − α)% credibility region if

lim_{n→∞} P(θ ∈ C|x) ≥ 1 − α.
[Figure: posterior Gamma density p(θ|x) and its normal approximation, for simulated data with n = 10.]
[Figure: posterior Gamma density p(θ|x) and its normal approximation, for simulated data with n = 100.]
Example. Consider the model

X1, . . . , Xn ∼ Poisson(θ)
θ ∼ Gamma(α, β).

The posterior distribution is given by

θ|x ∼ Gamma(α + Σx_i, β + n),

therefore,

p(θ|x) ∝ θ^{α + Σx_i − 1} exp{−θ(β + n)},

or equivalently,

log p(θ|x) = (α + Σx_i − 1) log θ − θ(β + n) + constant.

First and second derivatives:

d log p(θ|x)/dθ = −(β + n) + (α + Σx_i − 1)/θ
d² log p(θ|x)/dθ² = −(α + Σx_i − 1)/θ².
It then follows that

θ* = (α + Σx_i − 1)/(β + n),   h(θ) = (α + Σx_i − 1)/θ²,   h(θ*) = (β + n)²/(α + Σx_i − 1).

The approximate posterior distribution is

θ|x ∼ N((α + Σx_i − 1)/(β + n), (α + Σx_i − 1)/(β + n)²).

An approximate 100(1 − α)% credible interval for θ is

θ* − z_{α/2} h(θ*)^{−1/2} < θ < θ* + z_{α/2} h(θ*)^{−1/2}.
[Figure: posterior density p(θ|x) and its normal approximation for 20 simulated Poisson observations with θ = 2, prior Gamma(1, 2) and Σx_i = 35.]
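A minimal sketch computing the approximation of the figure and comparing it with the exact Gamma posterior (n = 20, prior Gamma(1, 2), Σx_i = 35):

import numpy as np
from scipy import stats

alpha, beta, n, t = 1.0, 2.0, 20, 35
mode = (alpha + t - 1) / (beta + n)              # theta* = 35/22 ~ 1.59
var = (alpha + t - 1) / (beta + n) ** 2          # 1/h(theta*)
z = stats.norm.ppf(0.975)
print(mode - z * np.sqrt(var), mode + z * np.sqrt(var))   # approximate 95% CI
exact = stats.gamma(a=alpha + t, scale=1 / (beta + n))    # Gamma(36, 22)
print(exact.interval(0.95))                      # exact equal-tail interval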
Example. For the model X1, . . . , Xn ∼ Bernoulli(θ) with θ ∼ Beta(α, β) the posterior is

θ|x ∼ Beta(α + t, β + n − t),   t = Σ_{i=1}^n x_i.

Then,

p(θ|x) ∝ θ^{α+t−1}(1 − θ)^{β+n−t−1}

log p(θ|x) = (α + t − 1) log θ + (β + n − t − 1) log(1 − θ) + constant

d log p(θ|x)/dθ = (α + t − 1)/θ − (β + n − t − 1)/(1 − θ)

d² log p(θ|x)/dθ² = −(α + t − 1)/θ² − (β + n − t − 1)/(1 − θ)².
θ* = (α + t − 1)/(α + β + n − 2)

h(θ) = (α + t − 1)/θ² + (β + n − t − 1)/(1 − θ)²

h(θ*) = (α + β + n − 2)/(θ*(1 − θ*)).
The approximate posterior distribution is

θ|x ∼ N(θ*, θ*(1 − θ*)/(α + β + n − 2)).

An approximate 100(1 − α)% credible interval for θ is

[θ* − z_{α/2} √(θ*(1 − θ*)/(α + β + n − 2)) ; θ* + z_{α/2} √(θ*(1 − θ*)/(α + β + n − 2))].
[Figure: conjugate posterior density and its normal approximation for 20 simulated Bernoulli observations with θ = 0.2, prior Beta(1, 1) and Σ_{i=1}^{20} x_i = 3.]
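A minimal sketch for the setting of the figure (n = 20, prior Beta(1, 1), t = 3), comparing the normal approximation with the exact Beta posterior:

import numpy as np
from scipy import stats

a, b, n, t = 1.0, 1.0, 20, 3
mode = (a + t - 1) / (a + b + n - 2)             # theta* = 3/20 = 0.15
var = mode * (1 - mode) / (a + b + n - 2)        # 1/h(theta*)
exact = stats.beta(a + t, b + n - t)             # conjugate posterior Beta(4, 18)
print(mode, np.sqrt(var))                        # approximate mean and sd
print(exact.mean(), exact.std())                 # exact mean ~0.18, sd ~0.08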
Example. If X1, . . . , Xn ∼ Exp(θ) and p(θ) ∝ 1, it follows that

p(θ|x) ∝ θⁿ e^{−θt},   t = Σ_{i=1}^n x_i

π(θ) = log p(θ|x) = n log θ − θt + constant

π′(θ) = n/θ − t

π″(θ) = −n/θ².

Then,

Mode(θ|x) = θ* = n/t = 1/x̄

h(θ*) = n/(θ*)² = n x̄².
The approximate posterior distribution is

θ|x ∼ N(1/x̄, 1/(n x̄²)).

An approximate 100(1 − α)% credible interval for θ is

[1/x̄ − z_{α/2} √(1/(n x̄²)) ; 1/x̄ + z_{α/2} √(1/(n x̄²))].
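A minimal sketch with simulated data (hypothetical true θ = 0.5), comparing the approximate interval with the exact flat-prior posterior, which is Gamma(n + 1, t):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 50
x = rng.exponential(scale=1 / 0.5, size=n)       # Exp(theta) with theta = 0.5
xbar, t = x.mean(), x.sum()
mode, sd = 1 / xbar, np.sqrt(1 / (n * xbar ** 2))
z = stats.norm.ppf(0.975)
print(mode - z * sd, mode + z * sd)              # approximate 95% interval
print(stats.gamma(a=n + 1, scale=1 / t).interval(0.95))   # exact posterior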
