Estimation


Bayesian Inference

Ricardo Ehlers
ehlers@icmc.usp.br

Departamento de Matemática Aplicada e Estatística


Universidade de São Paulo
Introduction to Decision Theory

A decision problem is completely specified by the description of the following spaces:

• Parameter space, or states of nature, Θ.
• Space of possible results of an experiment, Ω.
• Space of possible actions, A.

A decision rule δ is a function defined on Ω taking values in A, i.e. δ : Ω → A.

For each decision rule δ and each possible value of the parameter θ we can associate a loss L(δ, θ), assumed to take positive values. This is intended to measure the penalty associated with decision δ when the parameter takes the value θ.

Definition
The risk of a decision rule δ, denoted R(δ), is the posterior expected loss, i.e.

R(δ) = E_{θ|x}[L(δ, θ)] = ∫_Θ L(δ, θ) p(θ|x) dθ.

Definition
A decision rule δ* is optimal if its risk is minimum, i.e.

R(δ*) ≤ R(δ), for all δ.

• This rule is called the Bayes rule and its risk is the Bayes risk.
• Then δ* = arg min_δ R(δ) is the Bayes estimator.
• Bayes risks can be used to compare estimators: a decision rule δ₁ is preferred to a rule δ₂ if R(δ₁) < R(δ₂).
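The minimization can be carried out numerically. A minimal sketch (Python), assuming a hypothetical Beta(2, 3) posterior: it evaluates the posterior expected loss on a grid of candidate decisions and picks the minimizer, which under quadratic loss lands on the posterior mean.

import numpy as np
from scipy import stats

posterior = stats.beta(2, 3)          # hypothetical posterior p(theta|x)
theta = np.linspace(0, 1, 2001)       # grid over the parameter space
w = posterior.pdf(theta)
w /= w.sum()                          # normalized integration weights

def risk(delta, loss):
    # posterior expected loss R(delta), by numerical integration on the grid
    return np.sum(loss(delta, theta) * w)

deltas = np.linspace(0, 1, 2001)
risks = [risk(d, lambda d_, t: (d_ - t) ** 2) for d in deltas]
print(deltas[np.argmin(risks)], posterior.mean())   # both close to 0.4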
Loss Functions

Definition
The quadratic loss function is defined as

L(δ, θ) = (δ − θ)²,

and the associated risk is

R(δ) = E_{θ|x}[(δ − θ)²] = ∫_Θ (δ − θ)² p(θ|x) dθ.
Lemma
If L(δ, θ) = (δ − θ)², the Bayes estimator of θ is E(θ|x). The Bayes risk is

E{[E(θ|x) − θ]²} = Var(θ|x).
Definition
The absolute loss function is defined as

L(δ, θ) = |δ − θ|.

Lemma
If L(δ, θ) = |δ − θ|, the Bayes estimator of θ is the median of the posterior distribution.
Definition
The 0-1 loss function is defined as

L(δ, θ) = 1 − I(|θ − δ| < ε),

where I(·) is an indicator function and ε > 0.

Lemma
Let δ(ε) be the Bayes rule for this loss function and δ* the mode of the posterior distribution. Then,

lim_{ε→0} δ(ε) = δ*.

So, the Bayes estimator for a 0-1 loss function is the mode of the posterior distribution.
Consequently, for a 0-1 loss function and θ continuous the Bayes estimate is

θ* = arg max_θ p(θ|x) = arg max_θ p(x|θ)p(θ) = arg max_θ [log p(x|θ) + log p(θ)].

This is also referred to as the generalized maximum likelihood estimator (GMLE).
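So the quadratic, absolute and 0-1 losses lead to the posterior mean, median and mode, respectively. A minimal sketch, assuming a hypothetical skewed Gamma(3, 2) posterior (shape 3, rate 2), where the three estimates differ:

import numpy as np
from scipy import stats, optimize

post = stats.gamma(a=3, scale=1 / 2)   # Gamma(3, 2): shape 3, rate 2
print(post.mean())                     # quadratic loss -> posterior mean = 1.5
print(post.median())                   # absolute loss  -> posterior median ~ 1.34
# 0-1 loss -> posterior mode, found by maximizing the log-density
res = optimize.minimize_scalar(lambda t: -post.logpdf(t),
                               bounds=(1e-6, 10), method="bounded")
print(res.x)                           # mode = (3 - 1)/2 = 1.0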
Example. Let X1, . . . , Xn be a random sample from a Bernoulli distribution with parameter θ. For a Beta(α, β) conjugate prior it follows that the posterior distribution is

θ|x ∼ Beta(α + t, β + n − t),

where t = Σ_{i=1}^n x_i. Under quadratic loss the Bayes estimate is given by

E(θ|x) = (α + Σ_{i=1}^n x_i)/(α + β + n).
Example. In the previous example suppose that n = 100 and t = 10, so that the Bayes estimate under quadratic loss is

E(θ|x) = (α + 10)/(α + β + 100).

For a Beta(1,1) prior the estimate is 0.108, while for quite a different Beta(1,2) prior the estimate is 0.107. Both estimates are close to the maximum likelihood estimate 0.1. Under 0-1 loss with a Beta(1,1) prior the Bayes estimate (the posterior mode) is then 0.1.
Example. Bayes estimates of θ under quadratic loss with a Beta(a, b) prior, varying n and keeping x̄ = 0.1.

                           Prior parameters (a, b)
  n      (1,1)           (1,2)           (2,1)           (0.001,0.001)   (7,1.5)
  10     0.167 (0.067)   0.154 (0.063)   0.231 (0.061)   0.1 (0.082)     0.432 (0.039)
  20     0.136 (0.038)   0.13 (0.037)    0.174 (0.035)   0.1 (0.043)     0.316 (0.025)
  50     0.115 (0.016)   0.113 (0.016)   0.132 (0.016)   0.1 (0.018)     0.205 (0.013)
  100    0.108 (0.008)   0.107 (0.008)   0.117 (0.008)   0.1 (0.009)     0.157 (0.007)
  200    0.104 (0.004)   0.103 (0.004)   0.108 (0.004)   0.1 (0.004)     0.129 (0.004)
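The point estimates follow directly from the conjugate update. A minimal sketch reproducing the first column of the table:

def beta_post_mean(a, b, n, xbar=0.1):
    # E(theta|x) = (a + t)/(a + b + n) with t = n * xbar successes
    t = n * xbar
    return (a + t) / (a + b + n)

for n in (10, 20, 50, 100, 200):
    print(n, round(beta_post_mean(1, 1, n), 3))
# prints 0.167, 0.136, 0.115, 0.108, 0.104, approaching the MLE 0.1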
Example. Let X1, . . . , Xn be a random sample from a Poisson distribution with parameter θ. Using a conjugate prior it follows that

X1, . . . , Xn ∼ Poisson(θ)
θ ∼ Gamma(α, β)
θ|x ∼ Gamma(α + t, β + n),

where t = Σ_{i=1}^n x_i. The Bayes estimate under quadratic loss is

E(θ|x) = (α + Σ_{i=1}^n x_i)/(β + n) = (α + n x̄)/(β + n),

while the Bayes risk is

Var(θ|x) = (α + n x̄)/(β + n)² = E(θ|x)/(β + n).

If α → 0 and β → 0 then E(θ|x) → x̄ and Var(θ|x) → x̄/n. If n → ∞ then E(θ|x) → x̄.
Bayes estimates of θ under quadratic loss using a Gamma(a, b) prior, varying n and keeping x̄ = 10 (posterior variances, i.e. Bayes risks, in parentheses).

                            Prior parameters (a, b)
  n      (1,0.01)         (1,2)           (2,1)           (0.001,0.001)   (7,1.5)
  10     10.09 (1.008)    8.417 (0.701)   9.273 (0.843)   9.999 (1)       9.304 (0.809)
  20     10.045 (0.502)   9.136 (0.415)   9.619 (0.458)   10 (0.5)        9.628 (0.448)
  50     10.018 (0.2)     9.635 (0.185)   9.843 (0.193)   10 (0.2)        9.845 (0.191)
  100    10.009 (0.1)     9.814 (0.096)   9.921 (0.098)   10 (0.1)        9.921 (0.098)
  200    10.004 (0.05)    9.906 (0.049)   9.96 (0.05)     10 (0.05)       9.96 (0.049)
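A minimal sketch reproducing the first column of this table from the conjugate update:

def poisson_bayes(a, b, n, xbar=10.0):
    # posterior Gamma(a + n*xbar, b + n): returns (mean, variance)
    a1, b1 = a + n * xbar, b + n
    return a1 / b1, a1 / b1 ** 2

for n in (10, 20, 50, 100, 200):
    mean, var = poisson_bayes(1, 0.01, n)
    print(n, round(mean, 3), round(var, 3))
# n = 10 gives 10.09 (1.008), matching the table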
Example. If X1, . . . , Xn is a random sample from a N(θ, σ²) with σ² known, and using the conjugate prior θ ∼ N(µ₀, τ₀²), then the posterior is also normal, in which case mean, median and mode coincide. The Bayes estimate of θ is then given by

(τ₀⁻² µ₀ + n σ⁻² x̄)/(τ₀⁻² + n σ⁻²).

Under a Jeffreys prior for θ the Bayes estimate is simply x̄.
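The estimate is a precision-weighted average of the prior mean and the sample mean. A minimal sketch with hypothetical values (σ = 2, µ₀ = 0, τ₀ = 10):

import numpy as np

rng = np.random.default_rng(0)
sigma, mu0, tau0 = 2.0, 0.0, 10.0           # known sd, prior mean and sd
x = rng.normal(1.5, sigma, size=50)         # simulated data
prec0, prec_lik = tau0 ** -2, len(x) / sigma ** 2
post_mean = (prec0 * mu0 + prec_lik * x.mean()) / (prec0 + prec_lik)
print(post_mean, x.mean())   # posterior mean shrinks x-bar slightly toward mu0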
Example. Let X1, . . . , Xn be a random sample from a N(θ, σ²) distribution with θ known and φ = σ⁻² unknown. Then,

φ ∼ Gamma(n₀/2, n₀σ₀²/2)
φ|x ∼ Gamma((n₀ + n)/2, (n₀σ₀² + ns₀²)/2),

where ns₀² = Σ_{i=1}^n (x_i − θ)². Then,

E(φ|x) = (n₀ + n)/(n₀σ₀² + ns₀²)

is the Bayes estimate under quadratic loss.
• The quadratic loss can be extended to the multivariate case,

L(δ, θ) = (δ − θ)′(δ − θ),

and the Bayes estimate is E(θ|x).

• Likewise, the 0-1 loss can also be extended,

L(δ, θ) = 1 − I(θ − δ ∈ A), taking the Bayes rule in the limit as vol(A) → 0,

and the Bayes estimate is the joint mode of the posterior distribution.

• However, the absolute loss has no clear multivariate extension.
Example. Suppose X = (X1, . . . , Xp) has a multinomial distribution with parameters n and θ = (θ1, . . . , θp). If we adopt a Dirichlet prior with parameters α1, . . . , αp, the posterior distribution is Dirichlet with parameters x_i + α_i, i = 1, . . . , p. Under quadratic loss, the Bayes estimate of θ is E(θ|x), where

E(θ_i|x) = (x_i + α_i)/(n + Σ_{j=1}^p α_j).
Definition
A quantile loss function is defined as

L(δ, θ) = c₁(δ − θ) I_{(−∞,δ)}(θ) + c₂(θ − δ) I_{(δ,∞)}(θ),

where c₁ > 0 and c₂ > 0.

It can be shown that the Bayes estimate of θ is a value θ* such that

P(θ ≤ θ*) = c₂/(c₁ + c₂).

So the Bayes estimate is the quantile of order c₂/(c₁ + c₂) of the posterior distribution.
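This can be checked numerically: minimizing the posterior expected quantile loss by Monte Carlo recovers the c₂/(c₁ + c₂) posterior quantile. A minimal sketch, assuming a hypothetical N(0, 1) posterior with c₁ = 1 and c₂ = 3:

import numpy as np
from scipy import stats, optimize

post = stats.norm(0, 1)
c1, c2 = 1.0, 3.0
theta = post.rvs(size=200_000, random_state=1)     # posterior draws

def expected_loss(d):
    # quantile loss: c1*(d - theta) if theta < d, else c2*(theta - d)
    return np.mean(np.where(theta < d, c1 * (d - theta), c2 * (theta - d)))

d_star = optimize.minimize_scalar(expected_loss,
                                  bounds=(-4, 4), method="bounded").x
print(d_star, post.ppf(c2 / (c1 + c2)))            # both ~ 0.674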
Definition
The Linex (LINear EXponential) loss function is defined as

L(δ, θ) = exp[c(δ − θ)] − c(δ − θ) − 1, c ≠ 0.

It can be shown that the Bayes estimator is

δ* = −(1/c) log E[e^{−cθ} | x].
[Figure: the Linex loss function for c = −0.5, −1 and −2, plotted against δ − θ. With c < 0 it reflects small losses for overestimation and large losses for underestimation.]
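For a normal posterior the Linex estimator has a closed form: if θ|x ∼ N(µ, τ²) then E[e^{−cθ}|x] = exp(−cµ + c²τ²/2), so δ* = µ − cτ²/2; with c < 0 the estimate sits above the posterior mean, guarding against underestimation. A minimal Monte Carlo check with hypothetical values:

import numpy as np
from scipy import stats

mu, tau, c = 2.0, 1.0, -1.0                  # c < 0 penalizes underestimation
theta = stats.norm(mu, tau).rvs(size=500_000, random_state=2)
delta_mc = -np.log(np.mean(np.exp(-c * theta))) / c
print(delta_mc, mu - c * tau ** 2 / 2)       # both ~ 2.5, above the mean 2.0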
Credible Sets

• Point estimates reduce the posterior distribution to a single figure.
• How precise are point estimates?
• We seek a compromise between reporting a single number summarizing the posterior distribution and reporting the distribution itself.
Definition
A set C ⊆ Θ is a 100(1 − α)% credible set for θ if

P(θ ∈ C | x) ≥ 1 − α.

• The inequality is useful when θ has a discrete distribution; otherwise an equality is used in practice.
• This definition differs fundamentally from that of a classical confidence region.
Invariance under Transformation

Credible sets are invariant under one-to-one parameter transformations. Let θ* = g(θ) and let C* denote the image of C under g. Then,

P(θ* ∈ C* | x) = 1 − α.

In the univariate case, if C = [a, b] is a 100(1 − α)% credible interval for θ, then [g(a), g(b)] is a 100(1 − α)% credible interval for θ* (with the endpoints swapped if g is decreasing).
• Credible sets are not unique in general.
• For any α > 0 there are infinitely many solutions to P(θ ∈ C | x) = 1 − α, as the figure and sketch below illustrate.
[Figure: three different 90% credible intervals for a Poisson parameter θ whose posterior is Gamma(4, 0.5), plotted against the posterior density over 0 ≤ θ ≤ 25.]
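A minimal sketch of this non-uniqueness for the Gamma(4, 0.5) posterior of the figure: every interval [F⁻¹(p), F⁻¹(p + 0.9)] has posterior probability 0.90.

from scipy import stats

post = stats.gamma(a=4, scale=1 / 0.5)       # Gamma(4, 0.5): rate 0.5 -> scale 2
for p in (0.0, 0.025, 0.05, 0.075):          # split alpha = 0.10 across the tails
    a, b = post.ppf(p), post.ppf(p + 0.90)
    print(round(a, 2), round(b, 2))          # four distinct 90% intervals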
[Figure: three different 90% credible intervals for a Binomial parameter with posterior θ|x ∼ Beta(2, 1.5), plotted against the posterior density over 0 ≤ θ ≤ 1.]
[Figure: three different 90% credible intervals for θ|x ∼ Normal(2, 1), plotted against the posterior density over −2 ≤ θ ≤ 6.]
Definition
A 100(1 − α)% highest probability density (HPD) credible set for θ is a 100(1 − α)% credible set for θ with the property that

p(θ₁|x) ≥ p(θ₂|x), ∀ θ₁ ∈ C and ∀ θ₂ ∉ C.
• For symmetric unimodal distributions, HPD credible sets are obtained by assigning the same probability to each tail.
• HPD credible sets are not invariant under transformation.
• In the univariate case, if C = [a, b] is a 100(1 − α)% HPD interval for θ, then [g(a), g(b)] is a 100(1 − α)% credible interval for g(θ), but not necessarily an HPD one. A numerical sketch follows this list.
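For a unimodal posterior, the HPD interval is the shortest among all intervals with the required probability, which gives a simple numerical recipe. A minimal sketch for a hypothetical skewed Gamma(4, 0.5) posterior:

from scipy import optimize, stats

post = stats.gamma(a=4, scale=2)             # Gamma(4, 0.5), a skewed posterior
cred = 0.90

def width(p):
    # length of the interval [ppf(p), ppf(p + cred)] with probability cred
    return post.ppf(p + cred) - post.ppf(p)

p_opt = optimize.minimize_scalar(width, bounds=(1e-6, 1 - cred - 1e-6),
                                 method="bounded").x
print(post.ppf(p_opt), post.ppf(p_opt + cred))   # shortest = HPD interval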
Example. Let X1, . . . , Xn be a random sample from a N(θ, σ²) with σ² known. If θ ∼ N(µ₀, τ₀²) then θ|x ∼ N(µ₁, τ₁²) and

Z = (θ − µ₁)/τ₁ | x ∼ N(0, 1).

Define z_{α/2} as the value of Z such that P(Z ≤ z_{α/2}) = 1 − α/2. We can then write

P(−z_{α/2} ≤ (θ − µ₁)/τ₁ ≤ z_{α/2}) = 1 − α,

or, equivalently,

P(µ₁ − z_{α/2} τ₁ ≤ θ ≤ µ₁ + z_{α/2} τ₁) = 1 − α.

Then [µ₁ − z_{α/2} τ₁ ; µ₁ + z_{α/2} τ₁] is the 100(1 − α)% HPD interval for θ.
Example. In the previous example, if τ₀² → ∞ it follows that τ₁⁻² → nσ⁻² and µ₁ → x̄. Then,

Z = √n (θ − x̄)/σ | x ∼ N(0, 1).

The 100(1 − α)% HPD credible interval is given by

[x̄ − z_{α/2} σ/√n ; x̄ + z_{α/2} σ/√n],

which coincides numerically with the classical confidence interval. The interpretation, however, is completely different.
Example. In the previous example, the classical approach would base inference on

X̄ ∼ N(θ, σ²/n),

or equivalently,

U = √n (X̄ − θ)/σ ∼ N(0, 1).

U (called a pivot) is a function of the sample and of θ, but its distribution does not depend on θ.
Again we can find the percentile z_{α/2} such that

P(−z_{α/2} ≤ U ≤ z_{α/2}) = 1 − α,

or, equivalently,

P(X̄ − z_{α/2} σ/√n ≤ θ ≤ X̄ + z_{α/2} σ/√n) = 1 − α.

However, this is a probabilistic statement about the limits of the interval, not about θ.
The classical interpretation
If the same experiment were to be repeated infinitely many times, in 100(1 − α)% of them the random limits of the interval would include θ.

This is of little use in practice, since it is based on unobserved samples.

In the example, when X̄ = x̄ is observed, it is said that there is a 100(1 − α)% confidence (not probability) that the interval (x̄ − z_{α/2} σ/√n ; x̄ + z_{α/2} σ/√n) contains θ.
[Figure: 95% confidence intervals for the mean of 100 samples of size 20 simulated from a N(0, 100); arrows indicate intervals that do not contain zero. Realized confidence level: 95%.]
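A minimal sketch reproducing this coverage experiment (the seed, and hence the exact count, is arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sigma, n, z = 10.0, 20, stats.norm.ppf(0.975)    # N(0, 100): sd = 10
covered = 0
for _ in range(100):
    x = rng.normal(0.0, sigma, size=n)
    half = z * sigma / np.sqrt(n)                # half-width of the 95% CI
    covered += x.mean() - half <= 0.0 <= x.mean() + half
print(covered, "of 100 intervals contain the true mean 0")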
Normal Approximation

If the posterior distribution is unimodal and approximately symmetric, it can be approximated by a normal distribution centered at the posterior mode. Consider the Taylor expansion of log p(θ|x) about the mode θ*:

log p(θ|x) = log p(θ*|x) + (θ − θ*) [d log p(θ|x)/dθ]_{θ=θ*} + (1/2)(θ − θ*)² [d² log p(θ|x)/dθ²]_{θ=θ*} + · · ·
By definition,

[d log p(θ|x)/dθ]_{θ=θ*} = 0.

Defining

h(θ) = −d² log p(θ|x)/dθ²,

it follows that

p(θ|x) ≈ constant × exp{−(h(θ*)/2)(θ − θ*)²}.

Then, for large n, we have the following approximation:

θ|x ∼ N(θ*, h(θ*)⁻¹).
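In practice the mode and the curvature h(θ*) can be obtained numerically. A minimal sketch, assuming a hypothetical unnormalized log-posterior 3 log θ − 2θ (a Gamma(4, 2) kernel, mode 1.5):

import numpy as np
from scipy import optimize

def log_post(theta):
    # unnormalized log-posterior: Gamma(4, 2) kernel, used for illustration
    return 3 * np.log(theta) - 2 * theta

mode = optimize.minimize_scalar(lambda t: -log_post(t),
                                bounds=(1e-6, 50), method="bounded").x
eps = 1e-4                # central-difference estimate of the second derivative
h = -(log_post(mode + eps) - 2 * log_post(mode) + log_post(mode - eps)) / eps ** 2
print(mode, 1 / h)        # approximation N(theta*, 1/h): roughly N(1.5, 0.75)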
These results can be extended to the multivariate case. Since

[∂ log p(θ|x)/∂θ]_{θ=θ*} = 0,

defining the matrix

H(θ) = −∂² log p(θ|x)/∂θ∂θ′,

then, for large n, we have the following approximation:

θ|x ∼ N(θ*, H(θ*)⁻¹).

In particular, it is possible to construct approximate credibility regions based on the above results.
Definition
Let θ ∈ Θ. A region C ⊂ Θ is an asymptotic 100(1 − α)% credibility region if

lim_{n→∞} P(θ ∈ C|x) ≥ 1 − α.
[Figure: posterior Gamma density p(θ|x) and its normal approximation, for simulated data with n = 10.]
[Figure: posterior Gamma density p(θ|x) and its normal approximation, for simulated data with n = 100.]
Example. Consider the model

X1, . . . , Xn ∼ Poisson(θ)
θ ∼ Gamma(α, β).

The posterior distribution is given by

θ|x ∼ Gamma(α + Σx_i, β + n),

therefore,

p(θ|x) ∝ θ^{α + Σx_i − 1} exp{−θ(β + n)},

or equivalently,

log p(θ|x) = (α + Σx_i − 1) log θ − θ(β + n) + constant.

First and second derivatives:

d log p(θ|x)/dθ = −(β + n) + (α + Σx_i − 1)/θ
d² log p(θ|x)/dθ² = −(α + Σx_i − 1)/θ².
It then follows that

θ* = (α + Σx_i − 1)/(β + n),   h(θ) = (α + Σx_i − 1)/θ²,   h(θ*) = (β + n)²/(α + Σx_i − 1).

The approximate posterior distribution is

θ|x ∼ N((α + Σx_i − 1)/(β + n), (α + Σx_i − 1)/(β + n)²).

An approximate 100(1 − α)% credible interval for θ is

θ* − z_{α/2} h(θ*)^{−1/2} < θ < θ* + z_{α/2} h(θ*)^{−1/2}.
[Figure: posterior density p(θ|x) and its normal approximation for 20 simulated Poisson observations with θ = 2, prior Gamma(1, 2) and Σx_i = 35.]
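A minimal sketch computing the approximation of the figure and comparing it with the exact Gamma posterior (n = 20, prior Gamma(1, 2), Σx_i = 35):

import numpy as np
from scipy import stats

alpha, beta, n, t = 1.0, 2.0, 20, 35
mode = (alpha + t - 1) / (beta + n)              # theta* = 35/22 ~ 1.59
var = (alpha + t - 1) / (beta + n) ** 2          # 1/h(theta*)
z = stats.norm.ppf(0.975)
print(mode - z * np.sqrt(var), mode + z * np.sqrt(var))   # approximate 95% CI
exact = stats.gamma(a=alpha + t, scale=1 / (beta + n))    # Gamma(36, 22)
print(exact.interval(0.95))                      # exact equal-tail interval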
Example. For the model X1, . . . , Xn ∼ Bernoulli(θ) with θ ∼ Beta(α, β) the posterior is

θ|x ∼ Beta(α + t, β + n − t),   t = Σ_{i=1}^n x_i.

Then,

p(θ|x) ∝ θ^{α+t−1}(1 − θ)^{β+n−t−1}

log p(θ|x) = (α + t − 1) log θ + (β + n − t − 1) log(1 − θ) + constant

d log p(θ|x)/dθ = (α + t − 1)/θ − (β + n − t − 1)/(1 − θ)

d² log p(θ|x)/dθ² = −(α + t − 1)/θ² − (β + n − t − 1)/(1 − θ)².
θ* = (α + t − 1)/(α + β + n − 2)

h(θ) = (α + t − 1)/θ² + (β + n − t − 1)/(1 − θ)²

h(θ*) = (α + β + n − 2)/(θ*(1 − θ*)).
The approximate posterior distribution is

θ|x ∼ N(θ*, θ*(1 − θ*)/(α + β + n − 2)).

An approximate 100(1 − α)% credible interval for θ is

[θ* − z_{α/2} √(θ*(1 − θ*)/(α + β + n − 2)) ; θ* + z_{α/2} √(θ*(1 − θ*)/(α + β + n − 2))].
[Figure: conjugate posterior density and its normal approximation for 20 simulated Bernoulli observations with θ = 0.2, prior Beta(1, 1) and Σ_{i=1}^{20} x_i = 3.]
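A minimal sketch for the setting of the figure (n = 20, prior Beta(1, 1), t = 3), comparing the normal approximation with the exact Beta posterior:

import numpy as np
from scipy import stats

a, b, n, t = 1.0, 1.0, 20, 3
mode = (a + t - 1) / (a + b + n - 2)             # theta* = 3/20 = 0.15
var = mode * (1 - mode) / (a + b + n - 2)        # 1/h(theta*)
exact = stats.beta(a + t, b + n - t)             # conjugate posterior Beta(4, 18)
print(mode, np.sqrt(var))                        # approximate mean and sd
print(exact.mean(), exact.std())                 # exact mean ~0.18, sd ~0.08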
Example. If X1, . . . , Xn ∼ Exp(θ) and p(θ) ∝ 1, it follows that

p(θ|x) ∝ θⁿ e^{−θt},   t = Σ_{i=1}^n x_i

π(θ) = log p(θ|x) = n log θ − θt + constant

π′(θ) = n/θ − t

π″(θ) = −n/θ².

Then,

Mode(θ|x) = θ* = n/t = 1/x̄

h(θ*) = n/(θ*)² = n x̄².
The approximate posterior distribution is

θ|x ∼ N(1/x̄, 1/(n x̄²)).

An approximate 100(1 − α)% credible interval for θ is

[1/x̄ − z_{α/2} √(1/(n x̄²)) ; 1/x̄ + z_{α/2} √(1/(n x̄²))].
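A minimal sketch with simulated data (hypothetical true θ = 0.5), comparing the approximate interval with the exact flat-prior posterior, which is Gamma(n + 1, t):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 50
x = rng.exponential(scale=1 / 0.5, size=n)       # Exp(theta) with theta = 0.5
xbar, t = x.mean(), x.sum()
mode, sd = 1 / xbar, np.sqrt(1 / (n * xbar ** 2))
z = stats.norm.ppf(0.975)
print(mode - z * sd, mode + z * sd)              # approximate 95% interval
print(stats.gamma(a=n + 1, scale=1 / t).interval(0.95))   # exact posterior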
