Estimation
Ricardo Ehlers
ehlers@icmc.usp.br
For each decision δ and each possible value of the parameter θ we can associate a loss L(δ, θ) taking positive values. It is intended to measure the penalty associated with decision δ when the parameter takes the value θ.
Definition
The risk of a decision rule δ, denoted R(δ), is the posterior expected loss, i.e.
R(δ) = E_{θ|x}[L(δ, θ)] = ∫ L(δ, θ) p(θ|x) dθ.
Definition
A decision rule δ* is optimal if its risk is minimum, i.e. R(δ*) = min_δ R(δ).
• This rule is called the Bayes rule and its risk is the Bayes risk.
• Then,
δ* = arg min_δ R(δ)
is the Bayes estimator.
• Bayes risks can be used to compare estimators: a decision rule δ1 is preferred to a rule δ2 if R(δ1) < R(δ2), as illustrated in the sketch below.
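As a quick illustration (my own sketch, not from the original notes), the risk of any candidate decision can be approximated by averaging the loss over draws from the posterior. The Beta(2, 10) posterior below is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical posterior: theta | x ~ Beta(2, 10).
theta = stats.beta(2, 10).rvs(size=100_000, random_state=rng)

def risk(delta, theta):
    """Monte Carlo estimate of the posterior expected quadratic loss R(delta)."""
    return np.mean((delta - theta) ** 2)

# Compare two candidate decisions: the posterior mean and an arbitrary value.
for delta in (theta.mean(), 0.3):
    print(f"R({delta:.3f}) = {risk(delta, theta):.5f}")
# The posterior mean attains the smaller risk, as the Bayes rule should.
```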
Loss Functions
Definition
The quadratic loss function is defined as,
L(δ, θ) = (δ − θ)².
Lemma
If L(δ, θ) = (δ − θ)² the Bayes estimator of θ is E(θ|x). This follows since
E[(δ − θ)² | x] = Var(θ|x) + (δ − E(θ|x))²,
which is minimized at δ = E(θ|x). The Bayes risk is,
R(δ*) = Var(θ|x).
Definition
The absolute loss function is defined as,
L(δ, θ) = |δ − θ|.
Lemma
If L(δ, θ) = |δ − θ| the Bayes estimator of θ is the median of the
posterior distribution.
Definition
The 0-1 loss function is defined as,
L_ε(δ, θ) = 1 if |δ − θ| > ε, and 0 if |δ − θ| ≤ ε.
Denoting by δ(ε) the Bayes estimator under this loss, it can be shown that
lim_{ε→0} δ(ε) = δ*,
the mode of the posterior distribution.
So, the Bayes estimator for a 0-1 loss function is the mode of the posterior distribution.
Consequently, for a 0-1 loss function and θ continuous the Bayes estimate is,
δ* = arg max_θ p(θ|x) = Mode(θ|x),
as the numerical sketch below illustrates.
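A numerical check (a sketch of mine, not part of the notes): minimizing the Monte Carlo risk over a grid of decisions recovers the posterior mean, median and mode under quadratic, absolute and small-ε 0-1 loss respectively. The Gamma posterior here is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
posterior = stats.gamma(a=4, scale=1 / 0.5)     # Gamma(4, rate 0.5), right-skewed
theta = posterior.rvs(size=50_000, random_state=rng)

grid = np.linspace(0.1, 25, 500)                # candidate decisions delta
eps = 0.25                                      # tolerance for the 0-1 loss

losses = {
    "quadratic": lambda d: np.mean((d - theta) ** 2),
    "absolute":  lambda d: np.mean(np.abs(d - theta)),
    "0-1":       lambda d: np.mean(np.abs(d - theta) > eps),
}
for name, risk in losses.items():
    best = grid[np.argmin([risk(d) for d in grid])]
    print(f"{name:9s} loss -> delta* ≈ {best:.2f}")

# Analytic values for Gamma(4, rate 0.5): mean = 8, median ≈ 7.34, mode = 6.
print(posterior.mean(), posterior.median(), 3 / 0.5)
```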
Example. Let X1, . . . , Xn be a random sample from a Bernoulli distribution with parameter θ. For a Beta(α, β) conjugate prior it follows that the posterior distribution is,
θ|x ∼ Beta(α + t, β + n − t),
where t = Σ_{i=1}^n x_i.
Under quadratic loss the Bayes estimate is given by,
E(θ|x) = (α + Σ_{i=1}^n x_i)/(α + β + n).
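The conjugate algebra is easy to verify numerically; a minimal sketch using scipy (the values t = 10, n = 100 match the next example):

```python
from scipy import stats

# Beta(1, 1) prior with n = 100 Bernoulli trials and t = 10 successes.
alpha, beta, n, t = 1, 1, 100, 10

posterior = stats.beta(alpha + t, beta + n - t)
print(posterior.mean())   # Bayes estimate (alpha + t)/(alpha + beta + n) ≈ 0.108
print(posterior.var())    # Bayes risk under quadratic loss (posterior variance)
```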
Example. In the previous example suppose that n = 100 and t = 10, so that the Bayes estimate under quadratic loss is,
E(θ|x) = (α + 10)/(α + β + 100).
Example. Bayesian estimates of θ under quadratic loss with a Beta(a, b) prior, varying n and keeping x̄ = 0.1.

                        Prior parameters (a, b)
n     (1,1)          (1,2)          (2,1)          (0.001,0.001)  (7,1.5)
10    0.167 (0.067)  0.154 (0.063)  0.231 (0.061)  0.1 (0.082)    0.432 (0.039)
20    0.136 (0.038)  0.13 (0.037)   0.174 (0.035)  0.1 (0.043)    0.316 (0.025)
50    0.115 (0.016)  0.113 (0.016)  0.132 (0.016)  0.1 (0.018)    0.205 (0.013)
100   0.108 (0.008)  0.107 (0.008)  0.117 (0.008)  0.1 (0.009)    0.157 (0.007)
200   0.104 (0.004)  0.103 (0.004)  0.108 (0.004)  0.1 (0.004)    0.129 (0.004)
Example. Let X1, . . . , Xn be a random sample from a Poisson distribution with parameter θ. Using a conjugate prior it follows that,
X1, . . . , Xn ∼ Poisson(θ)
θ ∼ Gamma(α, β)
θ|x ∼ Gamma(α + t, β + n),
where t = Σ_{i=1}^n x_i.
The Bayes estimate under quadratic loss is,
E[θ|x] = (α + Σ_{i=1}^n x_i)/(β + n) = (α + n x̄)/(β + n),
while the Bayes risk is,
Var(θ|x) = (α + n x̄)/(β + n)² = E[θ|x]/(β + n).
Example. Bayesian estimates of θ under quadratic loss (Bayes risks in parentheses) with a Gamma(a, b) prior, varying n and keeping x̄ = 10.

                         Prior parameters (a, b)
n     (1,0.01)        (1,2)          (2,1)          (0.001,0.001)  (7,1.5)
10    10.09 (1.008)   8.417 (0.701)  9.273 (0.843)  9.999 (1)      9.304 (0.809)
20    10.045 (0.502)  9.136 (0.415)  9.619 (0.458)  10 (0.5)       9.628 (0.448)
50    10.018 (0.2)    9.635 (0.185)  9.843 (0.193)  10 (0.2)       9.845 (0.191)
100   10.009 (0.1)    9.814 (0.096)  9.921 (0.098)  10 (0.1)       9.921 (0.098)
200   10.004 (0.05)   9.906 (0.049)  9.96 (0.05)    10 (0.05)      9.96 (0.049)
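These entries follow directly from the formulas above; a small sketch assuming x̄ = 10, which is consistent with every column of the table:

```python
import numpy as np

xbar = 10.0
priors = [(1, 0.01), (1, 2), (2, 1), (0.001, 0.001), (7, 1.5)]

print("n    " + "".join(f"{str(p):>16s}" for p in priors))
for n in (10, 20, 50, 100, 200):
    row = []
    for a, b in priors:
        mean = (a + n * xbar) / (b + n)   # Bayes estimate under quadratic loss
        var = mean / (b + n)              # Bayes risk = posterior variance
        row.append(f"{mean:.3f} ({var:.3f})")
    print(f"{n:<5d}" + "".join(f"{c:>16s}" for c in row))
```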
Example. If X1, . . . , Xn is a random sample from a N(θ, σ²) with σ² known and using the conjugate prior, i.e. θ ∼ N(µ0, τ0²), then the posterior is also normal, in which case mean, median and mode coincide.
The Bayes estimate of θ is then given by,
(τ0⁻² µ0 + n σ⁻² x̄)/(τ0⁻² + n σ⁻²).
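The precision-weighted form suggests a one-line implementation; a sketch with hypothetical numbers:

```python
def normal_posterior(mu0, tau0_sq, sigma_sq, xbar, n):
    """Posterior mean and variance for a N(theta, sigma^2) sample, sigma^2 known."""
    prec = 1 / tau0_sq + n / sigma_sq                    # posterior precision
    mean = (mu0 / tau0_sq + n * xbar / sigma_sq) / prec  # precision-weighted average
    return mean, 1 / prec

# Hypothetical: vague prior N(0, 100), data with xbar = 2.5, n = 25, sigma^2 = 4.
print(normal_posterior(0.0, 100.0, 4.0, 2.5, 25))
```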
Example. Let X1, . . . , Xn be a random sample from a N(θ, σ²) distribution with θ known and φ = σ⁻² unknown. Then,
φ ∼ Gamma(n0/2, n0 σ0²/2)
φ|x ∼ Gamma((n0 + n)/2, (n0 σ0² + n s0²)/2),
and
E(φ|x) = (n0 + n)/(n0 σ0² + n s0²)
is the Bayes estimate under quadratic loss.
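A sketch for the precision case; the data are made up, and I take s0² = Σ(xi − θ)²/n, a definition the notes do not spell out:

```python
import numpy as np
from scipy import stats

theta, n0, sigma0_sq = 0.0, 2, 1.0                 # known mean and prior parameters
x = np.array([0.8, -1.2, 0.5, 2.1, -0.3, 1.7, -0.9, 0.4])
n = len(x)
s0_sq = np.mean((x - theta) ** 2)                  # assumed definition of s0^2

# phi | x ~ Gamma((n0 + n)/2, (n0*sigma0^2 + n*s0^2)/2); scipy uses shape/scale.
post = stats.gamma(a=(n0 + n) / 2, scale=2 / (n0 * sigma0_sq + n * s0_sq))
print(post.mean())                                 # (n0 + n)/(n0*sigma0^2 + n*s0^2)
```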
• The quadratic loss can be extended to the multivariate case,
L(δ, θ) = (δ − θ)′(δ − θ),
in which case the Bayes estimator of θ is the posterior mean vector E(θ|x).
Example. Suppose X = (X1, . . . , Xp) has a multinomial distribution with parameters n and θ = (θ1, . . . , θp). If we adopt a Dirichlet prior with parameters α1, . . . , αp, the posterior distribution is Dirichlet with parameters xi + αi, i = 1, . . . , p.
Under quadratic loss, the Bayes estimate of θ is E(θ|x) where,
E(θi|x) = E(θi|xi) = (xi + αi)/(n + Σ_{j=1}^p αj).
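A minimal sketch with hypothetical counts:

```python
import numpy as np

x = np.array([42, 31, 27])            # hypothetical multinomial counts, n = 100
alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet(1, 1, 1) prior

theta_hat = (x + alpha) / (x.sum() + alpha.sum())   # E(theta_i | x)
print(theta_hat, theta_hat.sum())                   # estimates sum to one
```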
Definition
A quantile loss function is defined as,
L(δ, θ) = p(θ − δ) if θ ≥ δ, and (1 − p)(δ − θ) if θ < δ, for p ∈ (0, 1).
The corresponding Bayes estimator is the 100p% percentile of the posterior distribution.
Definition
The Linex (Linear Exponential) loss function is defined as,
L(δ, θ) = exp{c(δ − θ)} − c(δ − θ) − 1, c ≠ 0,
and the corresponding Bayes estimator is δ* = −(1/c) log E[exp(−cθ)|x].
[Figure: Linex loss functions with c < 0, reflecting small losses for overestimation and large losses for underestimation; curves for c = −0.5, −1 and −2, plotted as Loss against δ − θ.]
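The Linex estimator is easy to approximate from posterior draws; a sketch of mine with a hypothetical N(2, 1) posterior (for a normal posterior the exact answer is µ − cτ²/2):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, tau = 2.0, 1.0
theta = rng.normal(mu, tau, size=200_000)   # draws from a N(2, 1) posterior

c = -1.0                                    # c < 0: underestimation penalized most
# Bayes estimator under Linex loss: delta* = -(1/c) log E[exp(-c*theta) | x].
delta_star = -np.log(np.mean(np.exp(-c * theta))) / c
print(delta_star)                           # ≈ mu - c*tau^2/2 = 2.5
print(theta.mean())                         # ≈ 2.0: the estimate sits above the mean,
                                            # guarding against costly underestimation
```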
Credible Sets
Definition
A set C ⊆ Θ is a 100(1 − α)% credible set for θ if,
P(θ ∈ C) ≥ 1 − α.
Invariance under Transformation
If C is a 100(1 − α)% credible set for θ and θ* = g(θ) is a one-to-one transformation, then C* = g(C) is a 100(1 − α)% credible set for θ*, since
P(θ* ∈ C*) = P(θ ∈ C) = 1 − α.
• Credible sets are not unique in general: many sets C satisfy
P(θ ∈ C) = 1 − α.
It is then natural to look for the smallest such set, which motivates highest posterior density (HPD) sets.
[Figure: 90% credible intervals for a Poisson parameter θ when the posterior is Gamma(4, 0.5); posterior density plotted over θ ∈ (0, 25).]
[Figure: 90% credible intervals for a Binomial parameter θ with posterior θ|x ∼ Beta(2, 1.5); three panels show different intervals with the same 90% probability content, with the density plotted over θ ∈ (0, 1).]
[Figure: analogous credible intervals for a bell-shaped posterior density plotted over θ ∈ (−2, 6), again in three panels.]
Definition
A 100(1 − α)% credible set C for θ is a highest posterior density (HPD) set if
p(θ1|x) ≥ p(θ2|x), ∀ θ1 ∈ C and all θ2 ∉ C.
• For symmetric unimodal distributions, HPD credible sets are obtained by assigning the same probability to each tail.
Example. Let X1, . . . , Xn be a random sample from a N(θ, σ²) with σ² known. If θ ∼ N(µ0, τ0²) then θ|x ∼ N(µ1, τ1²) and,
Z = (θ − µ1)/τ1 | x ∼ N(0, 1).
Hence P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α or, equivalently,
P(µ1 − z_{α/2} τ1 ≤ θ ≤ µ1 + z_{α/2} τ1) = 1 − α.
Then, (µ1 − z_{α/2} τ1 ; µ1 + z_{α/2} τ1) is the 100(1 − α)% HPD interval for θ.
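A direct sketch of this interval (the posterior values are hypothetical):

```python
from scipy import stats

def normal_hpd(mu1, tau1, alpha=0.10):
    """HPD interval for a N(mu1, tau1^2) posterior; equal tails by symmetry."""
    z = stats.norm.ppf(1 - alpha / 2)
    return mu1 - z * tau1, mu1 + z * tau1

print(normal_hpd(1.2, 0.3))   # 90% HPD interval for a N(1.2, 0.09) posterior
```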
Example. In the previous example, if τ0² → ∞ it follows that τ1⁻² → n σ⁻² and µ1 → x̄. Then,
Z = √n(θ − x̄)/σ | x ∼ N(0, 1),
and the HPD interval becomes x̄ ± z_{α/2} σ/√n.
Example. In the previous example, the classical approach would base inference on,
X̄ ∼ N(θ, σ²/n),
or equivalently,
U = √n(X̄ − θ)/σ ∼ N(0, 1).
Again we can find the percentile z_{α/2} such that,
P(−z_{α/2} ≤ U ≤ z_{α/2}) = 1 − α
or, equivalently,
P(X̄ − z_{α/2} σ/√n ≤ θ ≤ X̄ + z_{α/2} σ/√n) = 1 − α.
The classical interpretation
If the same experiment were to be repeated infinitely many times,
in 100(1 − α)% of them the random limits of the interval would
include θ.
[Figure: 95% confidence intervals for the mean of 100 samples of size 20 simulated from a N(0, 100); arrows indicate intervals that do not contain zero. Realized confidence level = 95%.]
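The experiment in the figure can be repeated in a few lines; a sketch under the same settings (100 samples of size 20 from N(0, 100)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_samples, n, sigma, alpha = 100, 20, 10.0, 0.05
z = stats.norm.ppf(1 - alpha / 2)

covered = 0
for _ in range(n_samples):
    x = rng.normal(0.0, sigma, size=n)          # true mean is zero
    half = z * sigma / np.sqrt(n)               # CI half-width
    covered += (x.mean() - half <= 0.0 <= x.mean() + half)

print(f"{covered} of {n_samples} intervals contain the true mean")
```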
Normal Approximation
Expanding log p(θ|x) in a Taylor series around the posterior mode θ*,
log p(θ|x) = log p(θ*|x) + (θ − θ*) [d log p(θ|x)/dθ]_{θ=θ*} + (1/2)(θ − θ*)² [d² log p(θ|x)/dθ²]_{θ=θ*} + . . .
By definition,
[d log p(θ|x)/dθ]_{θ=θ*} = 0.
Defining,
h(θ) = −d² log p(θ|x)/dθ²,
it follows that,
p(θ|x) ≈ constant × exp{−(h(θ*)/2)(θ − θ*)²},
i.e. the posterior is approximately N(θ*, h(θ*)⁻¹).
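Nothing here requires conjugacy; mode and curvature can be found numerically for any log-posterior. A sketch using a Gamma(4, rate 2) kernel as a stand-in posterior:

```python
import numpy as np
from scipy import optimize

def log_post(theta):
    return 3 * np.log(theta) - 2 * theta        # Gamma(4, rate 2) kernel

# Posterior mode by numerical maximization of the log-posterior.
res = optimize.minimize_scalar(lambda t: -log_post(t), bounds=(1e-6, 50), method="bounded")
mode = res.x                                    # exact mode: 3/2 = 1.5

# h(theta*) by a central finite-difference second derivative.
eps = 1e-4
h = -(log_post(mode + eps) - 2 * log_post(mode) + log_post(mode - eps)) / eps**2

print(f"approximate posterior: N({mode:.4f}, {1 / h:.4f})")   # exact: N(1.5, 0.75)
```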
These results can be extended to the multivariate case. Since,
[∂ log p(θ|x)/∂θ]_{θ=θ*} = 0,
defining the matrix,
H(θ) = −∂² log p(θ|x)/∂θ∂θ′,
the posterior is approximately N(θ*, H(θ*)⁻¹).
Definition
Let θ ∈ Θ. A region C ⊂ Θ is an asymptotic 100(1 − α)% credibility region if
lim_{n→∞} P(θ ∈ C | x) ≥ 1 − α.
[Figure: posterior Gamma density p(θ|x) and its normal approximation with simulated data (n = 10).]
[Figure: posterior Gamma density p(θ|x) and its normal approximation with simulated data (n = 100).]
Example. Consider the model,
X1, . . . , Xn ∼ Poisson(θ)
θ ∼ Gamma(α, β),
so that θ|x ∼ Gamma(α + t, β + n) with t = Σ_{i=1}^n x_i. The posterior mode and curvature are,
θ* = (α + t − 1)/(β + n), h(θ*) = (β + n)²/(α + t − 1),
giving the approximation θ|x ≈ N(θ*, (α + t − 1)/(β + n)²).
[Figure: conjugate posterior p(θ|x) and its normal approximation for 20 simulated Poisson observations with θ = 2, prior Gamma(1, 2) and Σ x_i = 35.]
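The agreement in the figure can be checked numerically; a sketch using the same numbers (prior Gamma(1, 2), n = 20, Σ x_i = 35):

```python
import numpy as np
from scipy import stats

alpha, beta, n, t = 1, 2, 20, 35

post = stats.gamma(a=alpha + t, scale=1 / (beta + n))   # exact Gamma posterior
mode = (alpha + t - 1) / (beta + n)                     # theta* = 35/22
var = (alpha + t - 1) / (beta + n) ** 2                 # 1/h(theta*)
approx = stats.norm(mode, np.sqrt(var))

grid = np.linspace(0.5, 3.0, 6)
print(np.round(post.pdf(grid), 3))     # the two densities nearly coincide
print(np.round(approx.pdf(grid), 3))   # around the mode
```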
Example. For the model X1, . . . , Xn ∼ Bernoulli(θ) with θ ∼ Beta(α, β) the posterior is,
θ|x ∼ Beta(α + t, β + n − t), t = Σ_{i=1}^n x_i.
Then,
log p(θ|x) = (α + t − 1) log θ + (β + n − t − 1) log(1 − θ) + constant
d log p(θ|x)/dθ = (α + t − 1)/θ − (β + n − t − 1)/(1 − θ)
d² log p(θ|x)/dθ² = −(α + t − 1)/θ² − (β + n − t − 1)/(1 − θ)².
Setting the first derivative to zero,
θ* = (α + t − 1)/(α + β + n − 2),
and
h(θ) = (α + t − 1)/θ² + (β + n − t − 1)/(1 − θ)²,
so that
h(θ*) = (α + β + n − 2)/(θ*(1 − θ*)).
The approximate posterior distribution is,
θ|x ∼ N(θ*, θ*(1 − θ*)/(α + β + n − 2)).
[Figure: conjugate posterior and its normal approximation for 20 simulated Bernoulli observations with θ = 0.2, prior Beta(1, 1) and Σ_{i=1}^{20} x_i = 3.]
Example. If X1, . . . , Xn ∼ Exp(θ) and p(θ) ∝ 1, it follows that,
p(θ|x) ∝ θⁿ e^{−θt}, t = Σ_{i=1}^n x_i
π(θ) = log p(θ|x) = n log(θ) − θt + c
π′(θ) = n/θ − t
π″(θ) = −n/θ².
Then,
Mode(θ|x) = θ* = n/t = 1/x̄
h(θ*) = n/(θ*)² = n x̄².
The approximate posterior distribution is,
θ|x ∼ N(1/x̄, 1/(n x̄²)).
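A final sketch checking the exponential case on simulated data (θ = 2 is my choice):

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, n = 2.0, 50
x = rng.exponential(scale=1 / theta_true, size=n)   # Exp(theta) has mean 1/theta

xbar = x.mean()
print(f"approximate posterior: N({1 / xbar:.3f}, {1 / (n * xbar**2):.5f})")
```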