Bayesian Inference
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003221623
Typeset in CMR10
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents
Preface ix
1 Introduction 1
2 Bayesian Modelling 5
2.1 Statistical Model 5
2.2 Bayes Model 10
2.3 Advantages 22
2.3.1 Sequential Analysis 22
2.3.2 Big Data 23
2.3.3 Hierarchical Models 25
2.4 List of Problems 27
3 Choice of Prior 29
3.1 Subjective Priors 31
3.2 Conjugate Priors 38
3.3 Non-informative Priors 50
3.3.1 Laplace Prior 50
3.3.2 Jeffreys Prior 53
3.3.3 Reference Priors 63
3.4 List of Problems 74
4 Decision Theory 78
4.1 Basics of Decision Theory 78
4.2 Bayesian Decision Theory 83
4.3 Common Bayes Decision Rules 85
4.3.1 Quadratic Loss 85
4.3.2 Absolute Error Loss 89
4.3.3 Prediction 91
4.3.4 The 0–1 Loss 93
4.3.5 Intrinsic Losses 95
4.4 The Minimax Criterion 97
4.5 Bridges 101
4.6 List of Problems 110
5 Asymptotic Theory 113
5.1 Consistency 113
5.2 Schwartz’ Theorem 120
5.3 List of Problems 125
7 Estimation 183
7.1 Maximum a Posteriori (MAP) Estimator 186
7.1.1 Regularized Estimators 188
7.2 Bayes Rules 193
7.2.1 Estimation in Univariate Linear Models 195
7.2.2 Estimation in Multivariate Linear Models 198
7.3 Credible Sets 199
7.3.1 Credible Sets in Linear Models 203
7.4 Prediction 206
7.4.1 Prediction in Linear Models 210
7.5 List of Problems 212
10 Solutions 284
10.1 Solutions for Chapter 2 284
10.2 Solutions for Chapter 3 287
10.3 Solutions for Chapter 4 292
10.4 Solutions for Chapter 5 296
10.5 Solutions for Chapter 6 300
10.6 Solutions for Chapter 7 306
10.7 Solutions for Chapter 8 312
10.8 Solutions for Chapter 9 319
11 Appendix 323
11.1 Discrete Distributions 323
11.2 Continuous Distributions 324
11.3 Multivariate Distributions 325
11.4 Matrix-Variate Distributions 326
Bibliography 329
Index 333
Preface
The book is written primarily as a textbook, but it also contains ample material for researchers. As a textbook, it contains more material than is needed for a one-semester course; the choice of chapters depends on the background and interests of the students and on the specific aims of the course. A number of examples and exercises are embedded in all chapters, and solutions to all exercises are provided, often with reasonably detailed solution steps.
The book is also a new piece of joint output in the authors' long-term collaboration, which has mostly led to research articles or book chapters.
Most material of the book stems from the Master and Ph.D. courses on
Bayesian Statistics over the last several years. The courses were offered
through the Department of Mathematics and the Center for Interdisciplinary
Mathematics (CIM), Uppsala University. Special thanks are due to the col-
leagues and students who have read the preliminary drafts of the book and
provided valuable feedback. Any remnants of mistakes or typos are of course
the authors’ responsibility, and any information in this regard will be highly
appreciated.
We wish the readers a relishing amble through the pleasures and challenges
of Bayesian statistics!
Chapter 1
Introduction
Bayesian theory has become increasingly popular over the recent few decades. This can be ascribed to a combination of two factors: the fast growth of computational power, and the applicability of the theory to a wide variety of real-life problems.
At the heart of the theory lies the theorem of Thomas Bayes (1701–1761), which in its simplest form states that
P(A|B) = P(B|A) P(A) / P(B).
The theorem, however, remained unknown to the wider world until one of Bayes' closest friends, Richard Price (1723–1791), posthumously dug it up in 1764, meticulously edited it, and sent it to the Royal Society, of which both men were members. Price also attached a detailed note to the article, arguing for the value and worth of the theory Bayes had put forth in it. For a detailed historical profile of the subject, and of Thomas Bayes, see e.g., Bellhouse (2004) and Stigler (1982).
What was introduced by Bayes and popularized by Price was later formalized by Laplace, modernized by Jeffreys, and mathematicised by de Finetti. Harold Jeffreys aptly remarked that Bayes' theorem is to the theory of probability what the Pythagorean theorem is to geometry. Over time, Bayesian theory has occupied a prominent place in the research community.
We may remind ourselves that Statistics is more a research tool than a discipline like others; as such, Statistics helps develop and apply techniques to measure and assess uncertainty in inductive inference. This implies the applicability of Statistics in any research context involving a random mechanism where the researcher intends to draw conclusions inductively.
A parametric statistical model is a family of densities
P = {f(x|θ); θ ∈ Θ},
and Bayesian inference combines it with a prior distribution π(θ) on the parameter, leading to the posterior
π(θ|x) ∝ f(x|θ) π(θ).
The prior can be based on subjective beliefs. This pertains to the fact, one that cannot be overemphasized, that scientific inquiry often incorporates subjectively formulated conjectures as essential components in exploring nature. This is evidenced by the use of subjectivism by world-renowned researchers who excelled in a variety of scientific and philosophical domains. Press and Tanur (2001) list many of them, in the context of their subjectivism, including giants such as Johannes Kepler, Gregor Mendel, Galileo Galilei,
Isaac Newton, Alexander von Humboldt, Charles Darwin, Sigmund Freud,
and Albert Einstein.
During the last decades, however, the more pragmatic Bayesian approach has
become popular, especially because of the iterative structure of the formulas
which allow adaptive applications. An objective Bayesian analysis is devel-
oped, where the prior is determined by information–theoretic criteria, avoid-
ing a subjective influence on the result of the study on one hand and exploring
the advantages of Bayes analysis on the other.
A major reason for this renaissance lies in efficient computational support. The availability of computer-intensive methods such as MCMC (Markov chain Monte Carlo) and ABC (Approximate Bayesian Computation) has led to a second coming of Bayesian theory. This has obviously widened the spectrum of applications of Bayesian theory into modern areas of statistical inference involving big data and its associated problems. Relying less on mathematics and more on computational algorithms has rightfully broadened the appeal of Bayesian inference to researchers and to situations where mathematical intricacies can be avoided.
Our motivation stems from the fact that Bayesian inference provides a prag-
matic statistical approach to discern information from real life data. This
aspect indeed sets the book apart from many, if not all, others already in the
literature on the same subject, in that we focus on the cases, both in univari-
ate and multivariate inference, where exact posterior can be derived. Further,
the solutions of all exercises are furnished, to make the book self-contained.
The book can be used for a one-semester Bayesian inference course at the
undergraduate or master level, depending on the students’ background, for
which we expect a reasonable orientation to basic statistical inference, at the
level of e.g., Liero and Zwanzig (2011).
Chapter 6 deals with Bayesian theory of linear models under normal distribu-
tion. It includes univariate linear models, linear mixed models and multivariate
models. The explicit formulas for the posteriors are derived. These results are
also important for evaluating simulation methods.
Chapter 2
Bayesian Modelling
P(N1 = n1, N2 = n2, N3 = n3) = ( n! / (n1! n2! n3!) ) p1^{n1} p2^{n2} p3^{n3},   (2.1)
with n3 = n − n1 − n2, p3 = 1 − p1 − p2 and 0 < pj < 1 for j = 1, 2, 3. The unknown parameter θ = (p1, p2, p3) consists of the probabilities of each color. With only three colors, we have p1 + p2 + p3 = 1. The parameter space is
Θ = { (p1, p2, p3) : 0 < pj < 1, p1 + p2 + p3 = 1 }.
Thus
P = { Mult(n, p1, p2, p3) : θ = (p1, p2, p3) ∈ Θ }.
2
Example 2.2 (Measurements)
Suppose we carry out a physical experiment. Under identical conditions we
take repeated measurements. The data consist of n real numbers xi , i =
1, . . . , n and X = Rn . The sources of randomness are the measurement errors,
which can be assumed as normally distributed. Thus we have a sample of i.i.d.
r.v.’s and the model is given by
P = { Pθ^{⊗n} : EX = μ, Var X = σ², (μ, σ²) ∈ R × R+, f ∈ F }.
The unknown parameter θ consists of μ, σ² and f, and the parameter space Θ is determined as R × R+ × F. In this case, F is some functional space of infinite dimension. This model does not fulfill the conditions of a statistical model in Definition 2.1. However, if we suppose that F is a known parametric distribution family with density f(x) = fθ(x), θ = (μ, σ², ϑ), ϑ ∈ Rq, then the parameter space has dimension less than or equal to q + 2 and the model fulfills the conditions of Definition 2.1. □
Let us consider one more example which is probably not realistic but still
a good classroom example for understanding purposes; see also Liero and
Zwanzig (2011).
At night the lion eats x people with a probability Pθ (x) given by the following
table:
x     0      1      2      3      4
θ1    0      0.05   0.05   0.8    0.1
θ2    0.05   0.05   0.8    0.1    0
θ3    0.9    0.05   0.05   0      0
Thus the model consists of three different distributions over X = {0, 1, 2, 3, 4}.
2
The Corona crisis, which started in early 2020, led to an increased interest in statistics. We consider an example related to Corona mortality statistics.
Example 2.5 (Corona) The data consist of the number of deaths related to
an infection with COVID-19 in the ith county in Sweden i = 1, . . . , 17, during
the first wave in week 17 (end of April) in 2020. We assume a logistic model
for the probability of death p by COVID-19, depending on the covariates x1
(age), x2 (gender), x3 (health condition) and x4 (accommodation). Thus,
p = exp(α1 x1 + α2 x2 + α3 x3 + α4 x4) / ( 1 + exp(α1 x1 + α2 x2 + α3 x3 + α4 x4) ).
2
Given a statistical model P = {Pθ : θ ∈ Θ}, the main tool for exploring the
information in the data is the likelihood function.
For convenience, we introduce the probability function to handle both dis-
crete and continuous distributions because measure theory is not a prerequisite
of our book. Let A ⊆ X . For a continuous r.v. X with density f (·|θ)
Pθ(A) = ∫_A f(x|θ) dx,
and for a discrete r.v.
Pθ(A) = Σ_{x∈A} Pθ({x}).
The likelihood function is defined by ℓ(θ|x) = p(x|θ), regarded as a function of θ for fixed x.
For independent observations the likelihood function is
ℓ(θ|x) = ∏_{i=1}^n P_{i,θ}(xi) in the discrete case   (2.2)
and
ℓ(θ|x) = ∏_{i=1}^n f_i(xi|θ) in the continuous case,   (2.3)
"... then they bring the same information about θ and must lead to identical inferences."
The other strategy is to fix the number k of successes and observe the sequence
until k successes are attained. The number of failures x2 is counted, where x2
is a realization of X2 which follows a negative binomial distribution:
X2 ∼ NB(k, p),
The likelihood of having observed x = 3 eaten persons is
              θ1     θ2     θ3
ℓ(θ|x = 3)    0.8    0.1    0
This leads to the conclusion that a lion, having eaten 3 persons, was hungry. □
Example 2.12 (Normal sample) Let x = (x1, . . . , xn) be an i.i.d. sample from N(μ, σ²) with θ = (μ, σ²). The likelihood function is
ℓ(θ|x) ∝ (σ²)^{-n/2} exp( − (1/(2σ²)) Σ_{i=1}^n (xi − μ)² ).
Since Σ_{i=1}^n (xi − x̄)(x̄ − μ) = 0, we obtain
Σ_{i=1}^n (xi − μ)² = Σ_{i=1}^n (xi − x̄)² + n(x̄ − μ)².
With the sample variance
s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²,
it follows that the likelihood function depends only on the sufficient statistic (x̄, s²) since
ℓ(θ|x) ∝ (σ²)^{-n/2} exp( − (1/(2σ²)) ( (n − 1)s² + n(x̄ − μ)² ) ).
In particular, it holds that ℓ(θ|x) ∝ ℓ(μ|σ², x) ℓ(σ²|x) with
ℓ(μ|σ², x) ∝ exp( − (n/(2σ²)) (x̄ − μ)² )
and
ℓ(σ²|x) ∝ (σ²)^{-n/2} exp( − (n − 1)s² / (2σ²) ).
The likelihood function of μ given σ² has the form of the Gaussian bell curve with center x̄ and inflection points at x̄ − σ/√n and x̄ + σ/√n. The likelihood functions of μ and σ² are plotted in Figure 2.1. □
Figure 2.1: Likelihood functions in Example 2.12. Left: Likelihood function ℓ(μ|σ², x) with maximum at the sample mean x̄ and inflection points S1 = x̄ − σ/√n and S2 = x̄ + σ/√n. Right: Likelihood function ℓ(σ²|x) with maximum at S = ((n − 1)/n) s², where s² is the sample variance.
In Bayesian inference we have two random variables: θ and X, and both play
different roles. The random variable θ is not observed, but the parameter
generating distribution π is known. The random variable X is observed.
The data generating distribution is the conditional distribution of X given
θ and it is known up to θ.
The choice of the prior distribution is a very important part of Bayesian mod-
elling. Different approaches and principles for choosing a prior are presented
in Chapter 3.
The following historical example of Thomas Bayes is used to illustrate the
construction of the parameter generating experiment and the data generating
experiment; see the details in Stigler (1982).
Summarizing, we obtain that (θ, X1, X2)ᵀ is jointly normal with mean vector (μ0, μ0, μ0)ᵀ and covariance matrix whose diagonal entries are σ0², σ0² + σ², σ0² + σ² and whose off-diagonal entries all equal σ0².
In fact, we are not interested in the joint distribution of (θ, X); our interest is
to learn about the underlying data generating distribution PX|θ , the main tool
for which is the conditional distribution of θ given X = x. The information
on the parameter based on prior and on the experiment is included in the
conditional distribution of θ given x, denoted by π(.|x) and called posterior
distribution. Roughly speaking the posterior distribution takes over the role
of the likelihood. We formulate the Bayesian inference principle as follows.
In other words, when we know π(.|x), we can do all inference, including esti-
mation and testing. The derivation of the posterior distribution is the first step
in a Bayes study, and the main tool to arrive at this is the Bayes Theorem.
It holds that
P(θ,X) = Pθ PX|θ = PX Pθ|X.
Thus
Pθ|X = Pθ PX|θ / PX.
Assuming that the data and the parameter have continuous distributions, we rewrite this relation using the respective densities:
π(θ|x) = π(θ) f(x|θ) / f(x),
where f(x|θ) is the likelihood function ℓ(θ|x). The joint density of (θ, X) can be written as
f(θ, x) = f(x|θ) π(θ),
and f(x) is the density of the marginal distribution, i.e.,
f(x) = ∫_Θ f(x|θ) π(θ) dθ.   (2.6)
Hence
π(θ|x) = π(θ) ℓ(θ|x) / ∫_Θ ℓ(θ|x) π(θ) dθ ∝ π(θ) ℓ(θ|x),
and in the discrete case
π(θ|x) = π(θ) ℓ(θ|x) / Σ_{θ∈Θ} ℓ(θ|x) π(θ) ∝ π(θ) ℓ(θ|x),
where we use ℓ(θ|x) = Pθ({x}). In each case the most important relation for determining the posterior distribution is:
π(θ|x) ∝ π(θ) ℓ(θ|x).
The product π(θ)ℓ(θ|x) is the kernel function of the posterior. It includes the prior information (π) and the information from the data (ℓ(·|x)). The posterior distribution is then determined up to a constant. To determine the complete posterior, we can apply different methods.
• Compare the kernel function with known distribution families,
• Calculate the normalizing constant analytically,
• Calculate the normalizing constant with Monte Carlo methods,
• Generate a sequence θ(1) , . . . , θ(N ) which is distributed as π(θ|x), and
• Generate a sequence θ(1) , . . . , θ(N ) which is distributed approximately as
π(θ|x).
The computer-intensive methods for Bayesian computations are presented in Chapter 9. Here we illustrate the first three items.
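As a small illustration of the second and third items, the following R lines (not from the book) normalize a posterior kernel numerically on a grid; the truncated-normal prior is a hypothetical choice.

n <- 10; x <- 6
theta <- seq(0.001, 0.999, by = 0.001)
prior <- dnorm(theta, mean = 0.5, sd = 0.2)           # unnormalized prior on (0, 1)
kernel <- prior * dbinom(x, size = n, prob = theta)   # prior times likelihood
posterior <- kernel / sum(kernel * 0.001)             # numerical normalizing constant
sum(posterior * 0.001)                                # check: integrates to (approximately) 1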
The density of U(0, 1) is constant and equals one. Hence the posterior distribution is determined by
π(θ|x) ∝ θ^x (1 − θ)^{n−x},
which is the kernel of a beta distribution Beta(x + 1, n − x + 1), where the normalizing constant B(α, β) is the beta function. Note that U(0, 1) is the beta distribution with α = 1, β = 1. In the more general case where the prior is a beta distribution Beta(α0, β0), we have
π(θ|x) ∝ θ^{α0+x−1} (1 − θ)^{β0+n−x−1}
and
θ|x ∼ Beta(α0 + x, β0 + n − x). □
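A minimal R sketch of this conjugate update, using the data x = 6 from Bin(10, θ) shown in Figure 2.4; the plotting choices are ours.

a0 <- 1; b0 <- 1            # uniform prior U(0,1) = Beta(1,1)
n  <- 10; x <- 6            # observed x = 6 from Bin(10, theta)
theta <- seq(0, 1, by = 0.01)
plot(theta, dbeta(theta, a0 + x, b0 + n - x), type = "l",
     ylab = "posterior density")        # posterior Beta(7, 5)
lines(theta, dbeta(theta, a0, b0), lty = 2)  # prior for comparison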
Consider again an i.i.d. sample x = (x1, . . . , xn) from N(μ, σ²) with known σ², now with a normal prior μ ∼ N(μ0, σ0²). Then
ℓ(μ|x) ∝ exp( − (1/(2σ²)) Σ_{i=1}^n (xi − μ)² )
and
π(μ) ∝ exp( − (1/(2σ0²)) (μ − μ0)² ).
Figure 2.4: Example 2.11. Left: Prior distributions (Beta(1, 1) and Beta(4, 2)). Right: Posterior distributions (Beta(7, 5) and Beta(10, 6)), after observing x = 6 from Bin(10, θ).
Thus
π(μ|x) ∝ exp( − (1/(2σ²)) Σ_{i=1}^n (xi − μ)² − (1/(2σ0²)) (μ − μ0)² ).
Using the identity Σ_{i=1}^n (xi − μ)² = (n − 1)s² + n(μ − x̄)², where s² is the sample variance
s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²,
we obtain
π(μ|x) ∝ exp( − (n/(2σ²)) (μ − x̄)² − (1/(2σ0²)) (μ − μ0)² ).
By completing the squares, we get
π(μ|x) ∝ exp( − ((nσ0² + σ²)/(2σ0²σ²)) ( μ − (x̄nσ0² + μ0σ²)/(nσ0² + σ²) )² ),
i.e., μ|x ∼ N( (x̄nσ0² + μ0σ²)/(nσ0² + σ²), σ0²σ²/(nσ0² + σ²) ).
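A minimal R sketch of this normal-normal update with known σ²; the numbers are hypothetical.

mu0 <- 1; sigma0sq <- 1        # prior N(mu0, sigma0^2)
sigmasq <- 1                   # known data variance
x <- c(0.3, 1.2, 2.1, 0.8)     # hypothetical sample
n <- length(x); xbar <- mean(x)
post_mean <- (xbar * n * sigma0sq + mu0 * sigmasq) / (n * sigma0sq + sigmasq)
post_var  <- (sigma0sq * sigmasq) / (n * sigma0sq + sigmasq)
c(post_mean, post_var)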
[Figure: prior, likelihood and posterior densities of θ for the normal example.]
Let x = (x1, . . . , xn) be an i.i.d. sample from N(0, σ²) and let the parameter of interest be the precision τ = 1/σ². As prior we take a gamma distribution with density
f(τ|α, β) = ( β^α / Γ(α) ) τ^{α−1} exp(−βτ),   (2.9)
where α > 0 is the shape parameter and β > 0 is the rate parameter. In this case, we have
ℓ(τ|x) ∝ τ^{n/2} exp( − (τ/2) Σ_{i=1}^n xi² )
and
π(τ|α, β) ∝ τ^{α−1} exp(−βτ).
Then the posterior distribution has the kernel
π(τ|x, α, β) ∝ τ^{α−1+n/2} exp( − τ ( (1/2) Σ_{i=1}^n xi² + β ) ).
Hence the posterior distribution is also a gamma distribution with shape parameter α + n/2 and rate parameter (1/2) Σ_{i=1}^n xi² + β:
τ|x ∼ Gamma( α + n/2, (1/2) Σ_{i=1}^n xi² + β ).   (2.10)
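A minimal R sketch of the precision update (2.10); the data are hypothetical.

x <- c(-0.5, 1.1, 0.3, -1.7)        # i.i.d. sample, assumed N(0, sigma^2)
alpha0 <- 2; beta0 <- 2             # prior Gamma(2, 2) on tau = 1/sigma^2
alpha1 <- alpha0 + length(x) / 2    # posterior shape
beta1  <- beta0 + sum(x^2) / 2      # posterior rate
curve(dgamma(x, shape = alpha1, rate = beta1), from = 0, to = 5,
      ylab = "posterior density of tau")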
Let us consider the same set up once more, where now the parameter of interest is the variance θ = σ². The likelihood is
ℓ(θ|x) ∝ θ^{-n/2} exp( − (1/(2θ)) Σ_{i=1}^n xi² ),
and as prior we take the inverse gamma distribution InvGamma(α, β) with kernel
π(θ|α, β) ∝ (1/θ)^{α+1} exp( − β/θ ).
Then the posterior distribution has the kernel
π(θ|x, α, β) ∝ (1/θ)^{α+1+n/2} exp( − ( β + (1/2) Σ_{i=1}^n xi² ) / θ ),
so that
σ²|x ∼ InvGamma( α + n/2, (1/2) Σ_{i=1}^n xi² + β ).   (2.12)
We revisit the lion example to illustrate a case where the constant can be
easily calculated.
Figure 2.6: Example 2.14. The posterior distribution is more concentrated and slightly shifted towards the true parameter.
Experience says that an adult male lion is lethargic with probability 0.8, whereas the probability that he is hungry is 0.1:
        θ1     θ2     θ3
π(θ)    0.1    0.1    0.8
Thus the kernel π(θ)ℓ(θ|x = 3) takes the values 0.08, 0.01 and 0 for θ1, θ2 and θ3. The normalizing constant is 0.09, and we obtain for the posterior distribution
              θ1      θ2      θ3
π(θ|x = 3)    0.889   0.111   0
After knowing that the lion has eaten 3 persons, the probability that he was hungry is really high; no surprise! □
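A minimal R sketch of this discrete posterior computation, using the tables above.

prior      <- c(0.1, 0.1, 0.8)          # pi(theta1), pi(theta2), pi(theta3)
likelihood <- c(0.8, 0.1, 0.0)          # l(theta | x = 3)
kernel     <- prior * likelihood        # 0.08, 0.01, 0.00
posterior  <- kernel / sum(kernel)      # normalizing constant 0.09
round(posterior, 3)                     # 0.889 0.111 0.000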
[Figure: prior, likelihood, likelihood × prior, and posterior, plotted against θ.]
For a sample from the Cauchy distribution C(θ, 1) with a normal prior N(μ, σ²), the posterior kernel is
π(θ|x) ∝ ( ∏_{i=1}^n 1/(1 + (xi − θ)²) ) exp( − (θ − μ)²/(2σ²) ).   (2.14)
2.3 Advantages
In this section we present three settings, where the Bayesian approach gives
easy computations in many useful models.
θ ∼ π.
The posterior distribution after the first two data sets is N(μ2, σ2²). Given a third data set z1, . . . , z4 of four further observations, the posterior distribution after all three data sets is N(μ3, σ3²) with
μ3 = ( Σ_{i=1}^4 zi σ2² + μ2 σ² ) / (4σ2² + σ²)   and   σ3² = σ2² σ² / (4σ2² + σ²).
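A minimal R sketch of such sequential updating of a normal mean with known variance, where each batch uses the previous posterior as the new prior; the data are hypothetical.

update_normal <- function(mu_prior, var_prior, x, sigmasq) {
  n <- length(x)
  var_post <- 1 / (n / sigmasq + 1 / var_prior)
  mu_post  <- var_post * (sum(x) / sigmasq + mu_prior / var_prior)
  c(mean = mu_post, var = var_post)
}
sigmasq <- 1
p1 <- update_normal(0, 4, c(2.1, 1.7, 2.4), sigmasq)                  # first batch
p2 <- update_normal(p1[["mean"]], p1[["var"]], c(1.9, 2.6), sigmasq)  # second batch
p2   # same result as updating once with all five observations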
Consider the linear model Y = Xβ + ε, ε ∼ Nn(0, σ²In), where, with more parameters than observations, the matrix XᵀX does
Figure 2.8: Example 2.17. The iteratively calculated posterior distributions are more
and more concentrated around the unknown underlying parameter.
not have full rank, so that the inverse of XᵀX does not exist. The likelihood function
ℓ(θ|Y) ∝ (1/σ²)^{n/2} exp( − (1/(2σ²)) (Y − Xβ)ᵀ(Y − Xβ) ),   (2.16)
has no unique maximum. The maximum likelihood estimators β̂ are the solutions of
XᵀXβ = XᵀY.
Using the Bayesian approach we assume a normal prior
β ∼ Np(β0, σ²Σ0),
and the posterior distribution is β|Y ∼ Np(β1, σ²Σ1) with
β1 = (XᵀX + Σ0⁻¹)⁻¹ (XᵀY + Σ0⁻¹β0)
and
Σ1⁻¹ = XᵀX + Σ0⁻¹.
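A minimal R sketch of this update for a rank-deficient design; the design matrix, data and prior are hypothetical.

set.seed(1)
n <- 20
x1 <- rnorm(n); x2 <- x1                      # collinear columns: t(X) %*% X is singular
X  <- cbind(1, x1, x2)
Y  <- 1 + 2 * x1 + rnorm(n)
beta0  <- rep(0, 3); Sigma0 <- diag(3)        # prior N(beta0, sigma^2 * Sigma0)
Sigma1 <- solve(t(X) %*% X + solve(Sigma0))   # posterior covariance factor
beta1  <- Sigma1 %*% (t(X) %*% Y + solve(Sigma0) %*% beta0)
beta1   # well defined even though t(X) %*% X is not invertible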
α ∼ πμ0,
which gives
π(θ|x) ∝ π(θ) ℓ(θ|x),
where
π(θ) = ∫ π(θ|α) πμ0(α) dα.
The next example, taken from Dupuis (1995), illustrates a hierarchical model.
It is assumed that
qj (r, s) = φj (r)ψj (r, s)
where φj (r) is the survival probability and ψj (r, s) is the probability of moving
from r to s. The parameter of interest is given by
θ = (p, φ, ψ),
where p = (p1(1), . . . , p6(k)), φ = (φ1(1), . . . , φ6(k)) and ψ = (ψ1(1, 1), . . . , ψ6(k, k)). The following assumptions are made:
p(x|p1, . . . , pk) = ( Γ(p0) / (Γ(p1) · · · Γ(pk)) ) x1^{p1−1} · · · xk^{pk−1}.
The components of ψj (r) are dependent. The other parameters are assumed to
be independent. The hyperparameters e1 , . . . , ek are independent of the time
j and the stratum s. In the study the experimenters set all hyperparameters
as known, using the knowledge on the behavior of the lizards. The Bayesian
inference is done by using Gibbs sampling, see Subsection 9.4.2. 2
1. The grade x of a student takes values in {D, C, B, A}, with probabilities Pθ(x) depending on the level of preparation θ ∈ {0, 1, 2}:
x       D     C     B     A
θ = 0   0.8   0.1   0.1   0
θ = 1   0.2   0.5   0.2   0.1
θ = 2   0     0.1   0.5   0.4
The probability that a student has done nothing is 0.1 and that the student is very well prepared is 0.3. Calculate the posterior distribution for each x.
2. Assume X|θ ∼ Pθ and θ ∼ π. Show that:
The statistic T = T (X) is sufficient iff the posteriors of T and X coincide:
π(θ|X) = π(θ|T (X)).
3. Assume X|n ∼ Bin(n, 0.5). Find a prior on n such that n|X ∼ NB(x, 0.5).
4. Consider a sample X = (X1 , . . . , Xn ) from N(0, σ 2 ), n = 4. The unknown
parameter is the precision parameter θ = τ = σ −2 . As prior distribution of
τ , we set Gamma(2, 2). Derive the posterior distribution.
5. In summer 2020, in a small town, 1000 inhabitants were randomly chosen
and their blood samples are tested for Corona antibodies. 15 persons got
a positive test result. From another study in Spring it was noted that the
proportion of the population with antibodies is around 2%. Let X be the
number of inhabitants tested positive in the study. Let θ be the probability
of a positive test. Two different Bayes models are proposed.
• M0 : X ∼ Bin(1000, p), p ∼ Beta(1, 20)
• M1 : X ∼ Poi(λ), λ ∼ Gamma(20, 1)
(a) Derive the posterior distributions for both models.
(b) Discuss the differences between the two models.
(c) Give a recommendation.
6. Consider a multiple regression model
yi = β0 + xi β1 + zi β2 + εi , i = 1, . . . , n
Chapter 3
Choice of Prior
The main difference between a statistical model and a Bayes model lies in the
assumption that the parameter of the data generating distribution is the re-
alization of a random variable from a known distribution. This distribution is
called prior distribution. The parameter is not observed but its prior distribu-
tion is assumed to be known. The choice of the prior distribution is essential.
In this chapter we present different principles for determining the prior.
Principles of Modelling
We begin by summarizing a general set of principles of statistical modelling.
These principles are valid for a statistical model P = {Pθ : θ ∈ Θ} following
Definition 2.1 where we aim to determine a family of distributions. Concerning
Bayes model {P, π}, given in Definition 2.3, we need to model both a family
of distributions and a single prior distribution. Thus the principles become
even more important.
• First, all statistical models need assumptions for their valid applications.
It is, therefore, worth keeping in mind, while developing the mathematical
properties of models, that these properties will only hold as long as the
assumptions do.
• Second, the development or selection of a statistical model should be
objective-oriented. A model which is good for prediction, for example, may
not be appropriate to explore certain relationships between variables.
• Third, since models are generally only approximative, it is highly recom-
mended to be pragmatic rather than perfectionist. One should thus take
the risk of setting necessary assumptions and, more so, keep track of them
while analyzing the data under the postulated model.
• Fourth, it makes much sense to fit different models to the same problem
and compare them.
• Finally, even with all the aforementioned requisites fully taken care of, it
is always befitting to recall G.E.P. Box’s well-known adage: Essentially all
models are wrong, but some are useful. The practical question is how wrong
they have to be to not be useful. In a nutshell, as in everyday life, one should
Figure 3.1: Example 3.1. Left: The prior has a small variance and a bad guess, and it dominates the likelihood. Right: The prior involves the same bad guess but also a high uncertainty. It does not dominate.
Young lions are more active. In this case her prior is:
        θ1     θ2     θ3
π(θ)    0.3    0.1    0.6
□
Figure 3.3: Example 3.3. The subjective priors are determined by experiences (curves labelled August 1989 and August 1991).
Figure 3.4: Example 3.4. Left: Determining α by P(θ > 3) ≈ 0.8. The distribution function of InvGamma(α, 2(α + 1)) at 3 and the line at level 0.8 are plotted. Right: The prior density with α = 10 and β = 22 reflects the subjective knowledge.
Figure 3.5: Example 3.5. The subjective prior reflects the vague subjective information.
We assume a Cauchy distribution C(1, γ), which is heavy tailed and symmetric around 1. The hyperparameter is the scaling parameter γ. The quantile function of C(x0, γ) is given by
Q(p | x0, γ) = x0 + γ tan( π(p − 1/2) ).
In the following, B(a, b) denotes the beta function, given by the integral
B(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx, a > 0, b > 0.   (3.2)
Figure 3.6: Example 3.4. Left: The marginal distribution for X is plotted. Right: The parameter of the prior is determined such that the marginal distribution has the desired properties.
n <- 4   # sample size (consistent with the factor 6 = choose(4, 2) used below)
aa <- seq(1, 10, 0.01); a0 <- 3.8
# marginal probability P(X = n/2) under X ~ Bin(n, theta), theta ~ Beta(a, a), as a function of a
plot(aa, beta(aa + 2, aa + 2) / beta(aa, aa) * choose(n, n / 2), "l")
lines(aa, rep(0.3, length(aa)), "l", lty = 2)                   # target value 0.3
points(a0, 6 * beta(a0 + 2, a0 + 2) / beta(a0, a0), lwd = 3)    # attained at a0 = 3.8
segments(a0, 6 * beta(a0 + 2, a0 + 2) / beta(a0, a0), a0, 0, lty = 2)
Another good proposal for a subjective prior is a mixture distribution. Mix-
ture distributions are rich parametric classes. A folk theorem says that every
distribution can be approximated by a mixture distribution, which is used in
several contexts without a clear statement. We quote here a recent result of
Nguyen et al. (2020).
such that
lim_{m→∞} ∫ | f(x) − hm(x) | dx = 0.
Figure 3.7: Illustration for Example 3.7. The subjective prior is a mixture of normal distributions.
Figure 3.8: Subjective prior for the expected value of hours of sunshine.
The family F is also called closed under sampling. The prior distribution
which is an element of a conjugate family is named conjugate prior.
We obtain
π(θ | (n1, n2, n3)) ∝ θ1^{e1+n1−1} θ2^{e2+n2−1} θ3^{e3+n3−1},
which is the kernel of a Dirichlet distribution with parameters
e1 + n1, e2 + n2, e3 + n3.   (3.10)
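A minimal R sketch of how one might draw from the Dirichlet posterior (3.10) via the standard gamma representation; the prior parameters and counts are hypothetical.

e <- c(1, 1, 1); counts <- c(5, 3, 2)
rdirichlet1 <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }
draws <- t(replicate(1000, rdirichlet1(e + counts)))
colMeans(draws)   # close to (e + counts) / sum(e + counts)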
we obtain
P(n1, . . . , nk) = h(n1, . . . , nk) exp( Σ_{i=1}^k ln(pi) ni ),
with
h(n1, . . . , nk) = n! / (n1! · · · nk!).
Hence the multinomial distribution belongs to an exponential family. By using (3.13) we can reduce the dimension, since
Σ_{i=1}^k ln(pi) ni = Σ_{i=1}^{k−1} ni ln(pi / pk) + n ln(pk),   pk = 1 − Σ_{i=1}^{k−1} pi.
The Dirichlet density is
f(x | α1, . . . , αk) = ( Γ(α0) / (Γ(α1) · · · Γ(αk)) ) x1^{α1−1} · · · xk^{αk−1},
where α0 = Σ_{i=1}^k αi. Using the relation (3.14) we rewrite
f(x | α1, . . . , αk) = ( Γ(α0) / (Γ(α1) · · · Γ(αk)) ) exp( Σ_{i=1}^k ln(xi)(αi − 1) ).
Set h(x) = x1^{−1} · · · xk^{−1}. We get
f(x | α1, . . . , αk) = h(x) ( Γ(α0) / (Γ(α1) · · · Γ(αk)) ) exp( Σ_{i=1}^k ln(xi) αi ).
Theorem 4.12 gives a very useful result on exponential families, namely that
for an i.i.d. sample to follow an exponential family, it suffices that Xi follows
an exponential family; see Liero and Zwanzig (2011).
Theorem 3.2
If X = (X1 , . . . , Xn ) is an i.i.d. sample from a distribution of the form
(3.11) with functions ζj and Tj , j = 1, . . . , k then the distribution of X
belongs to an exponential family with parameter ζj , and statistics
n
T(n,j) (x) = Tj (xi ), j = 1, . . . , k.
i=1
Proof: Since the distribution of Xi belongs to an exponential family, the probability function of the sample is given by
p(x|θ) = ∏_{i=1}^n C(θ) exp( Σ_{j=1}^k ζj(θ) Tj(xi) ) h(xi)
       = C(θ)^n exp( Σ_{j=1}^k ζj(θ) Σ_{i=1}^n Tj(xi) ) h̃(x),   (3.16)
with h̃(x) = ∏_{i=1}^n h(xi). Thus the distribution of X = (X1, . . . , Xn) belongs to an exponential family with the functions ζj and the statistics T(n,j)(x) = Σ_{i=1}^n Tj(xi). □
We apply this result to a normal i.i.d. sample.
Theorem 3.3
Assume X has a distribution of the form (3.17). Then the conjugate family over Θ is given by
F = { K(μ, λ) exp( θᵀμ − λΨ(θ) ) : μ ∈ Rᵏ, λ ∈ R },
and the posterior belonging to the prior with parameters (μ0, λ0) is the member of F with parameters
μ = μ0 + T(x) and λ = λ0 + 1.
Proof: Suppose π(θ) ∈ F such that π(θ) = K(μ0, λ0) exp( θᵀμ0 − λ0Ψ(θ) ). Then
π(θ|x) ∝ π(θ) p(x|θ) ∝ exp( θᵀ(μ0 + T(x)) − (λ0 + 1)Ψ(θ) ),
which is the kernel of the member of F with μ = μ0 + T(x) and λ = λ0 + 1. □
For the gamma distribution Gamma(α, θ) with known shape α and rate θ we have
T(x) = −x,  Ψ(θ) = −α ln(θ),  h(x) = x^{α−1}/Γ(α).
Generalized linear models also belong to exponential families. Let us get back
to the Corona Example 2.5.
Example 3.14 (Logistic regression)
We consider a general logistic regression model. The data set is given by
(yi , x1i , . . . , xpi ) with i = 1, . . . , n, where y1 , . . . , yn are independent re-
sponse variables. The success probability depends on the covariates xi =
(x1i , . . . , xpi )T , i.e.,
P(yi = 1|xi , θ) = p(xi ).
The statistical model is
p(y1, . . . , yn | x, θ) = ∏_{i=1}^n p(xi)^{yi} (1 − p(xi))^{1−yi}
 = exp( Σ_{i=1}^n ln(p(xi)) yi + Σ_{i=1}^n ln(1 − p(xi))(1 − yi) )
 = exp( Σ_{i=1}^n ln( p(xi)/(1 − p(xi)) ) yi + Σ_{i=1}^n ln(1 − p(xi)) ).   (3.20)
The main trick is to find a canonical link function g(·) such that
Σ_{i=1}^n ln( p(xi)/(1 − p(xi)) ) yi = Σ_{i=1}^n yi xiᵀθ.
In case we want to include the prior knowledge from different sources we can
also construct an averaging prior which is conjugate. The following theorem
is an extension of Theorem 3.3 to mixtures.
Theorem 3.4
Let X have a distribution of the form (3.17). Then the family of mixtures
FN = { Σ_{i=1}^N ωi π(θ|μi, λi) : Σ_{i=1}^N ωi = 1, ωi > 0,
       π(θ|μi, λi) = K(μi, λi) exp( θᵀμi − λiΨ(θ) ), μi ∈ Rᵏ, λi ∈ R }   (3.22)
is conjugate: for the prior
π(θ) = Σ_{i=1}^N ωi π(θ|μi, λi),
the posterior is
π(θ|x) = Σ_{i=1}^N ωi(x) π(θ|μi + T(x), λi + 1)
with
ωi(x) ∝ ωi K(μi, λi) / K(μi + T(x), λi + 1).
Proof: We have
π(θ|x) ∝ π(θ) ℓ(θ|x)
 ∝ Σ_{i=1}^N ωi K(μi, λi) exp( θᵀμi − λiΨ(θ) ) exp( θᵀT(x) − Ψ(θ) )
 ∝ Σ_{i=1}^N ωi K(μi, λi) exp( θᵀ(μi + T(x)) − (λi + 1)Ψ(θ) )
 ∝ Σ_{i=1}^N ωi(x) K(μi + T(x), λi + 1) exp( θᵀ(μi + T(x)) − (λi + 1)Ψ(θ) ).   (3.23)
□
We now turn back to the classroom Example 3.7 and Figure 3.7.
Example 3.15 (Weather)
We assume that the temperature measurements are normally distributed, X ∼ N(θ, σ²) with known σ². Thus
f(x|θ) = ( 1/(√(2π) σ) ) exp( −x²/(2σ²) ) exp( θx/σ² − θ²/(2σ²) ).
We have
Ψ(θ) = θ²/(2σ²),  T(x) = x/σ².
From Theorem 3.3 a conjugate family is
F = { K(μ, λ) exp( θμ − λ θ²/(2σ²) ) : μ ∈ R, λ > 0 }
with
K(μ, λ) = ( √λ/(√(2π) σ) ) exp( − σ²μ²/(2λ) ).
Parameterizing by τ² = σ²/λ and m = (σ²/λ) μ, it is the family of normal distributions
F = { N(m, τ²) : m ∈ R, τ² > 0 }.
Recall Example 2.12 with n = 1: the posterior belonging to the prior N(m, τ²) is N(m(x), τp²), where
m(x) = (xτ² + mσ²)/(τ² + σ²)  and  τp² = τ²σ²/(τ² + σ²).
The prior in Example 3.7 fulfills the conditions of Theorem 3.4. It is a mixture of two normal distributions,
π(θ) = ω1 N(m1, τ1²) + ω2 N(m2, τ2²),
and by Theorem 3.4 the posterior is again a mixture,
π(θ|x) = ω1(x) N(m1(x), τ1,p²) + ω2(x) N(m2(x), τ2,p²),
with
mi(x) = (xτi² + mi σ²)/(τi² + σ²),  τi,p² = τi²σ²/(τi² + σ²),
and
ωi(x) ∝ ωi (τi,p/τi) exp( − mi²/(2τi²) + mi(x)²/(2τi,p²) ),
with ω1(x) + ω2(x) = 1. Figures 3.9 and 3.10 show the prior and posterior. □
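A minimal R sketch of this mixture update for a single observation, using the equivalent fact that the posterior weight is proportional to the prior weight times the marginal density of x under that component; the experts' prior parameters below are hypothetical stand-ins, not the book's values.

sigma <- 2                       # data standard deviation, x | theta ~ N(theta, sigma^2)
w   <- c(0.8, 0.2)               # prior weights of expert 1 and expert 2
m   <- c(3, 14); tau <- c(2, 3)  # prior means and standard deviations (hypothetical)
x   <- 4
tau_p2 <- tau^2 * sigma^2 / (tau^2 + sigma^2)              # posterior component variances
m_x    <- (x * tau^2 + m * sigma^2) / (tau^2 + sigma^2)    # posterior component means
w_x <- w * dnorm(x, mean = m, sd = sqrt(sigma^2 + tau^2))  # prior weight times marginal density
w_x <- w_x / sum(w_x)
round(w_x, 3)   # the weight of an expert whose prior puts x far in the tail shrinks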
Figure 3.9: Illustration for Example 3.15. Left: The subjective prior (broken line) of expert 1 and the posterior after observing x = 4. Right: The prior (broken line) and posterior related to expert 2.
Figure 3.10: Illustration for Example 3.15. The broken line is the prior mixture distribution. The continuous line is the posterior, which is also a mixture, but the posterior weight of the second expert is shifted from 0.2 to 0.035, because the observation x = 4 lies in the tail of the second expert's prior.
3.3 Non-informative Priors
Basically, non-informative priors are recommended when no prior knowledge is available but we still want to exploit the advantages of Bayesian modelling. They can, however, also be used for other reasons, e.g., when we do not trust the subjective knowledge and want to avoid any conflict with the likelihood approach. It is unclear which prior deserves the name non-informative: there are different approaches, the problem is still open, and new set-ups are under development. We present here the main principles.
The Laplace non-informative prior assigns equal weight everywhere:
p(ω) ∝ const,   π(θ) ∝ const.
Note that, the notational system in statistics is not unique, so that sometimes
in literature prior distributions following a Laplace distribution are named
Laplace prior, which are of course not constant.
Definition 3.3 is also applied to unbounded Θ. In general, priors which are not
probability measures are called improper priors.
Suppose a finite set Θ = {θ1, . . . , θm}. The Shannon entropy is defined by
H(π) = − Σ_{i=1}^m π(θi) log(π(θi)).   (3.25)
It describes how much the probability mass of π is spread out on Θ. The higher the entropy, the less informative the prior. It holds for all j = 1, . . . , m that
d/dkj ( − Σ_{i=1}^m ki log(ki) ) = −kj (1/kj) − log(kj) = −1 − log(kj) = 0,
i.e., the measure with maximal entropy has constant weight on all elements of Θ. Thus the Laplace prior fulfills the idea of no information. For illustration,
see Figure 3.11.
Figure 3.11: Laplace non-informative prior. Left: Illustration of spread and entropy. E stands for the entropy measure (3.25). Right: In Example 3.18 the Laplace prior can be considered as limit of normal priors under increasing variance.
where |J| = det(J) is the Jacobian determinant. A constant prior with respect
to θ does not deliver a constant prior with respect to η, so that the Laplace
non-informative prior does not fulfill (3.27). We illustrate it in the following
example.
and
ln( p(x|θ)/p(x|θ′) ) = ln p(x|θ) − ln p(x|θ′) ≈ (1/p(x|θ)) p′(x|θ)ᵀ(θ − θ′),
we obtain
I2(Pθ, Pθ′) ≈ (θ − θ′)ᵀ [ ∫ (1/p(x|θ)) p′(x|θ) p′(x|θ)ᵀ dx ] (θ − θ′).
From the definition of the score function
V(θ|x) = ∂ ln p(x|θ)/∂θ = (1/p(x|θ)) p′(x|θ),
the Fisher information matrix is (see Liero and Zwanzig, 2011)
I(θ) = Covθ V(θ|x).   (3.29)
We have
∫ (1/p(x|θ)) p′(x|θ) p′(x|θ)ᵀ dx = ∫ (1/p(x|θ)²) p′(x|θ) p′(x|θ)ᵀ p(x|θ) dx
 = Eθ( V(θ|x) V(θ|x)ᵀ )
 = Covθ( V(θ|x) )
 = I(θ),
so that
I2(Pθ, Pθ′) ≈ (θ − θ′)ᵀ I(θ) (θ − θ′).   (3.30)
Thus the distance between the probability measures generates a weighted distance in the parameter space. Changing the parametrization and using
θ − θ′ ≈ J(η − η′),
where J is the Jacobian matrix defined in (3.26), we have
I2(Pθ, Pθ′) ≈ (θ − θ′)ᵀ I(θ)(θ − θ′) ≈ (η − η′)ᵀ Jᵀ I(θ) J (η − η′).
Further, by the chain rule, the score function V(η|x) with respect to η is
V(η|x) = V(θ|x)ᵀ J,
which gives
I(η) = Jᵀ I(θ) J   (3.31)
and we see that different parametrizations for the same probability distributions deliver the same distance
I2(Pθ, Pθ′) = I2(Pη, Pη′).
Although Jeffreys' first goal was to find an invariant prior, his prior is also non-informative in the sense that the prior has essentially no influence: the posterior distribution based on Jeffreys prior coincides approximately with the likelihood function. Let us explain this in more detail.
Assume that we have an i.i.d. sample x = (x1, . . . , xn) from a regular statistical model, which means the Fisher information matrix exists and it holds that (see Liero and Zwanzig, 2011, Theorem 3.6)
I(θ) = Covθ V(θ|xi)  and  In(θ) = Covθ V(θ|x) = −Eθ J(θ|x) = n I(θ),   (3.32)
where V(θ|x) is the vector of first derivatives of the log likelihood function and J(θ|x) is the matrix of second derivatives of the log likelihood function. Let θ̂ be the maximum likelihood estimator with V(θ̂|x) = 0. Applying the Taylor expansion about θ̂ we have
ln p(x|θ) ≈ ln p(x|θ̂) + (θ − θ̂)ᵀ V(θ̂|x) + (1/2)(θ − θ̂)ᵀ J(θ̂|x)(θ − θ̂)
          ≈ ln p(x|θ̂) + (1/2)(θ − θ̂)ᵀ J(θ̂|x)(θ − θ̂).
With Jeffreys prior π(θ) ∝ det(I(θ))^{1/2} and J(θ̂|x) ≈ −n I(θ̂), we obtain
π(θ|x) ∝ π(θ) ℓ(θ|x)
        ∝ det(I(θ))^{1/2} ℓ(θ|x)
        ∝ det(I(θ̂))^{1/2} exp( − (n/2)(θ − θ̂)ᵀ I(θ̂)(θ − θ̂) ),
which is the kernel of N(θ̂, (1/n) I(θ̂)⁻¹). Thus Jeffreys prior is precisely the right weight needed to obtain the same result as in asymptotic inference theory, where it is shown that (see e.g., van der Vaart, 1998)
√n (θ̂ − θ) →_D N(0, I(θ)⁻¹).
The first example we want to calculate Jeffreys prior for, and to compare it
with other approaches, is the binomial model.
with
V(θ|x) = x/θ − (n − x)/(1 − θ),
J(θ|x) = −x/θ² − (n − x)/(1 − θ)².
Using EX = nθ, we get the Fisher information
I(θ) = −E J(θ|x) = nθ/θ² + (n − nθ)/(1 − θ)² = n/(θ(1 − θ)),
so that Jeffreys prior is
π(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2}.
This is the kernel of Beta(1/2, 1/2), which is symmetric and gives more weight to small and large parameters, but less weight to parameters around 1/2; see Figure 3.12. □
Figure 3.12: Illustration for Examples 3.6, 3.17 and 3.20. Jeffreys prior is Beta(1/2, 1/2), the Laplace non-informative prior is Beta(1, 1) and the subjective prior is Beta(3.8, 3.8), as determined in Example 3.6.
I(θ) = I(f ).
Figure 3.13: Example 3.21. Jeffreys prior for a location parameter. Left: Densities of the Laplace distribution La(θ, 1) with location parameters −4, −2, 0, 2, 4. Right: Jeffreys prior.
Figure 3.14: Example 3.22. Jeffreys prior for a scale parameter. Left: Densities of the Laplace distribution La(0, σ) with different scale parameters. Right: Jeffreys prior.
The situation changes when we study the location parameter μ and the scale parameter σ simultaneously, i.e., θ = (μ, σ). In this case, Jeffreys additionally requires independence between μ and σ, so that the joint prior is taken as the product of the two single-parameter priors. Applying I(μ, σ²) = Jᵀ I(μ, σ) J, we get
I(μ, σ²) = diag( 1/σ², 1/(2σ⁴) ).   (3.41)
p(x|θ) ∝ ∏_{i=1}^{k−1} pi^{ni} (1 − δ)^{n−s},   (3.42)
where δ = Σ_{i=1}^{k−1} pi and s = Σ_{i=1}^{k−1} ni. The log-likelihood function is
ln p((n1, . . . , nk−1)|θ) = Σ_{i=1}^{k−1} ni ln(pi) + (n − s) ln(1 − δ) + const
with the first and second derivatives, for i, j = 1, . . . , k − 1,
∂/∂pi ln p(x|θ) = ni/pi − (n − s)/(1 − δ),
∂²/∂pi² ln p(x|θ) = −ni/pi² − (n − s)/(1 − δ)²,   (3.43)
∂²/(∂pi ∂pj) ln p(x|θ) = −(n − s)/(1 − δ)²,  j ≠ i.
It holds that E ni = npi and E(n − s) = n(1 − δ). Using pk = 1 − δ, the Fisher information is
I(θ) = n ( diag(1/p1, . . . , 1/p_{k−1}) + (1/pk) 1_{k−1} 1_{k−1}ᵀ ),   (3.44)
i.e., the diagonal entries are 1/pi + 1/pk and all off-diagonal entries equal 1/pk, where 1_{k−1} is the column vector consisting of k − 1 ones and 1_{k−1}1_{k−1}ᵀ is the (k − 1) × (k − 1) matrix consisting of ones. Using the rule
det(A + uvᵀ) = det(A)(1 + vᵀA⁻¹u),
we obtain
det(I(θ)) = n^{k−1} ( ∏_{i=1}^{k−1} 1/pi ) ( 1 + (1/pk) Σ_{i=1}^{k−1} pi ) = n^{k−1} / ∏_{i=1}^{k} pi.
Hence Jeffreys prior for p1, . . . , pk with Σ_{i=1}^k pi = 1 is
π_Jeff(p1, . . . , pk) ∝ p1^{−1/2} · · · pk^{−1/2}.   (3.45)
In the following table, we summarize the Jeffreys priors for some commonly used distributions.
recall that in (3.25) the entropy is defined for the discrete case. The expected information can be presented as the difference between the entropy of the prior and the expected entropy of the posterior. Using (3.46) and π(θ|x)p(x) = π(θ)p(x|θ), we get
I(P, π) = ∫ p(x) ∫ π(θ|x) ln( π(θ|x)/π(θ) ) dθ dx
 = − ∫∫ π(θ|x) p(x) ln(π(θ)) dx dθ + ∫∫ p(x) π(θ|x) ln(π(θ|x)) dx dθ
 = − ∫ π(θ) ln(π(θ)) ∫ p(x|θ) dx dθ + ∫∫ p(x) π(θ|x) ln(π(θ|x)) dx dθ
 = − ∫ π(θ) ln(π(θ)) dθ + ∫ p(x) ∫ π(θ|x) ln(π(θ|x)) dθ dx.
We have
I(P, π) = H(π) − ∫ p(x) H(π(·|x)) dx.   (3.49)
or alternatively
Ψ(x) = (d/dx) ln(Γ(x)).
The marginal distribution of X is calculated in (3.5). For illustration let us
consider as candidate class C of prior distributions on Θ all symmetric beta
priors with α = β. Then the marginal distribution in (3.5) fulfills p(k) =
p(n − k), and the expected information can be written as a function I(α) of α, given in (3.50). H(Beta(α, β)) is non-positive for all α, β, with maximal value zero attained
for α = 1, β = 1, where B(1, 1) = 1. The prior Beta(1, 1) is equal to the
uniform distribution U[0, 1], which is also the Laplace non-informative prior;
see Example 3.20. 2
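The expected information of Example 3.27 can be evaluated numerically, since it equals the average Kullback-Leibler divergence between posterior and prior and the KL divergence between beta distributions has a closed form. A minimal R sketch (not from the book):

kl_beta <- function(a1, b1, a2, b2) {      # KL( Beta(a1,b1) || Beta(a2,b2) )
  lbeta(a2, b2) - lbeta(a1, b1) +
    (a1 - a2) * (digamma(a1) - digamma(a1 + b1)) +
    (b1 - b2) * (digamma(b1) - digamma(a1 + b1))
}
expected_info <- function(a, n) {
  k <- 0:n
  pk <- choose(n, k) * beta(a + k, a + n - k) / beta(a, a)   # marginal p(k)
  sum(pk * kl_beta(a + k, a + n - k, a, a))
}
alpha <- seq(0.05, 2, by = 0.01)
I5 <- sapply(alpha, expected_info, n = 5)
alpha[which.max(I5)]   # should lie close to the value 0.142 reported in Figure 3.15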
Unfortunately the requirement to take the prior which maximizes the expected
information I(P, π) does not give tractable results in all cases (see the discus-
sion in Berger and Bernardo, 1992a). Bernardo (1979) proposed to compare
the prior information with the theoretical best posterior information, which
he described as the limit information for n → ∞ from an experiment with n
repeated independent copies of the original experiment, P^n = { Pθ^{⊗n} : θ ∈ Θ }.
Figure 3.15: Example 3.27. The expected information I(α) in (3.50) has its maximum at α = 0.142 for n = 5, α = 0.272 for n = 20, α = 0.41 for n = 200 and α = 0.45 for n = 1000.
Definition 3.5 does not provide a construction of reference priors. The follow-
ing theorem from Berger et al. (2009) provides an explicit expression.
Theorem 3.5
Assume P = {Pθ : θ ∈ Θ ⊆ R} and a class of prior distributions C on Θ. For all p ∈ C and all θ ∈ Θ, p(θ) > 0 and ∫_Θ p(θ)p(x|θ) dθ < ∞. Let P^n = {Pθ^{⊗n} : θ ∈ Θ} and let tn be a sufficient statistic of P^n. Let π*(θ) be a strictly positive function such that the posterior π*(θ|tn) is proper and asymptotically consistent. For any θ0 in an open subset of Θ, define
fn(θ) = exp( ∫ p(tn|θ) ln(π*(θ|tn)) dtn ),   f(θ) = lim_{n→∞} fn(θ)/fn(θ0).   (3.52)
If
1. for all n, fn(θ) is continuous,
2. the ratio fn(θ)/fn(θ0) is either monotonic in n or bounded by an integrable function, and
3. conditions 1 and 2 of Definition 3.5 are fulfilled,
then f(θ) is a reference prior.
For cases where the integrals and the limit require complex calculations,
Berger et al. (2009) proposed the following algorithm for the numerical com-
putation of the reference prior. The algorithm includes Monte Carlo steps for
the integrals in (3.52) and an approximation of the limit, see also Chapter 9.
1. Split the parameter θ = (ω, λ) into the parameter of interest ω and the
nuisance parameter λ.
2. For each parameter of interest ω, derive the reference prior π(λ|ω) related
to the statistical model {p(x|ω, λ) : λ ∈ Λ}.
3. Use π(λ|ω) to eliminate the nuisance parameter by
p(x|ω) = ∫_Λ p(x|ω, λ) π(λ|ω) dλ.
4. Derive the reference prior π(ω) for the marginal model {p(x|ω)} and set
π(θ) = π(λ|ω) π(ω).
where π(p) and π(q) are the beta distributions Beta(0.5, 0.5). Let us now compare the reference prior with the Jeffreys prior for θ = (p, q). Using (3.54) we have the score function
V(θ|x) = ( (x1 − np)/(p(1 − p)), (x2 − x1 q)/(q(1 − q)) )ᵀ = (V1(p), V2(q))ᵀ
and the second derivatives
d/dp V1(p) = ( x1(2p − 1) − np² ) / ( p²(1 − p)² )  and  d/dq V2(q) = ( x2(2q − 1) − x1 q² ) / ( q²(1 − q)² ),
where d/dq V1(p) and d/dp V2(q) are zero. Further, the Fisher information matrix and hence the Jeffreys prior follow from these derivatives. Note that it is not a product of two beta distributions, but it is still a proper prior. □
1. For j = m, m − 1, . . . , 1:
(a) Suppose the current state is π(j+1) (θ[∼j] |θ[j] ).
(b) Determine the marginal model
p(x|θ[j]) = ∫ p(x|θ) π(j+1)(θ[∼j] | θ[j]) dθ[∼j].
(c) Determine the reference prior h(j) (θ(j) |θ[j−1] ) related to the model
{p(x|θ[j] ) : θ(j) ∈ Θj },
where p(x|θ) is given in (3.57). This conditional model is regular. The reference prior is Jeffreys prior, given in (3.45), constrained on (0, 1 − (p1 + p2)),
π(p3 | p1, p2) ∝ p3^{−1/2} (1 − δ)^{−1/2},   0 < p3 < 1 − (p1 + p2).
It holds that
p(x|θ1, θ2) ∝ p1^{n1} p2^{n2} ∫_0^{1−(p1+p2)} (1 − δ)^{n4 − 1/2} p3^{n3 − 1/2} dp3.
Using (3.58) again, the marginal model is found to be the binomial distribution Bin(n, p1), and the reference prior is Jeffreys prior
π(θ1) ∝ p1^{−1/2} (1 − p1)^{−1/2}.
This gives
π(p1, p2, p3) ∝ p1^{−1/2}(1 − p1)^{−1/2} p2^{−1/2}(1 − (p1 + p2))^{−1/2} p3^{−1/2}(1 − (p1 + p2 + p3))^{−1/2}.
x       +     +/−   −      0
θ = 2   0.6   0.1   0      0.3
θ = 1   0.1   0.2   0.1    0.6
θ = 0   0     0.2   0.79   0.01
Pθ(k) = (1 − θ)^k θ.
(d) For which λ and μ0 is the second prior less informative than the first?
Chapter 4
Decision Theory
[Figure: the loss functions L1, L2 and the absolute error loss, plotted against θ − d.]
Example 4.1 (Lion’s appetite) The appetite of a lion has three different
stages:
θ ∈ Θ = {hungry, moderate, lethargic} = {θ1 , θ2 , θ3 }.
It would be much more dangerous to classify a hungry lion as lethargic than
the other way around. We consider the following losses:
L(θ, d) d = θ1 d = θ2 d = θ3
θ1 0 0.8 1
θ2 0.3 0 0.8
θ3 0.3 0.1 0
2
Applying the loss on a decision rule gives L(θ, δ(x)) which is data dependent.
The quality of a decision rule should not depend on the observed data, but on
the data generating probability distribution. We have to consider the loss as
a random variable and to study the distribution of L(θ, δ(X)) under X ∼ Pθ .
In decision theory we are mainly interested in the expectation. We define the
risk as the expected loss
R(θ, δ) = ∫_X L(θ, δ(x)) p(x|θ) dx.   (4.3)
The risk in (4.3) is often called frequentist risk, because it is based on the
statistical model P only. We consider decision rule δ1 better than decision rule
δ2 iff
R(θ, δ1 ) ≤ R(θ, δ2 ) for all θ ∈ Θ.
Since R(θ, δ) has to be compared over the entire parameter space, there are risks that are not comparable. We illustrate this with the toy example.
R(θ, δ) θ1 θ2 θ3
δ1 0.73 0.08 0.005
δ2 0.08 0.07 0.01
δ3 0 0.3 0.3
In general we cannot find an optimal decision rule which is better than all the
other decision rules. The way out is that we search for the rule for which we
cannot find a better one.
Figure 4.3: Example 4.2. No decision rule has a smaller risk for all parameters. It is impossible to compare them by the frequentist risk.
The admissibility property is weaker than the requirement of minimal risk for
all parameters. We have a better chance to find an admissible decision rule
than an optimal one. Already from this type of definition it becomes clear
that it is easier to prove inadmissibility than the opposite. We will see that
most of the proofs in this field are indirect proofs.
We conclude this section with a prominent example on the James–Stein esti-
mator, defined in James and Stein (1960).
For the loss function L(θ, d) = ||θ − d||² it can be shown that the James–Stein estimator is better. We consider the main steps of the proof. Comparing the estimators T_nat(x) = x and S_JS we get
D = R(θ, T_nat) − R(θ, S_JS)
 = 2 Σ_{j=1}^d Eθ[ (xj − θj) ((d − 2)/||x||²) xj ] − Eθ[ (d − 2)²/||x||² ].   (4.6)
We obtain r(π, δ1 ) = 0.085, r(π, δ2 ) = 0.023 and r(π, δ3 ) = 0.27. The second
rule is better than the first rule. The pessimistic and data-independent deci-
sion rule δ3 is the worst. 2
Definition 4.2 (Bayes decision rule) Assume the Bayes model {P, π}, the decision space D and the loss function L. A decision rule δπ is called a Bayes decision rule iff it minimizes the integrated risk r(π, δ). The value
r(π) = r(π, δπ) = inf_{δ} r(π, δ)
is called the Bayes risk.
The posterior expectation of the loss function is called the posterior expected loss, i.e.,
ρ(π, d|x) = Eπ( L(θ, d) | x ) = ∫_Θ L(θ, d) π(θ|x) dθ.   (4.8)
The following theorem gives us a method to find the Bayes decision.
To begin with the estimation problem, we consider a Bayes model {P, π} with Θ ⊆ Rᵖ. The goal is to find an "optimal" estimator θ̂ : x ∈ X → Θ. In other words, we set D = Θ and search for θ̂ = δπ.
The weighted quadratic loss is
L_W(θ, d) = ||θ − d||²_W = (θ − d)ᵀ W (θ − d),   (4.11)
which for a diagonal weight matrix W = diag(w1, . . . , wp) reduces to
L_W(θ, d) = Σ_{i=1}^p wi (θi − di)².
Proof: We have
ρ(π, d|x) = ∫_Θ (θ − d)ᵀ W (θ − d) π(θ|x) dθ
 = ∫_Θ (θ − δπ(x))ᵀ W (θ − δπ(x)) π(θ|x) dθ   (4.13)
 + 2 ∫_Θ (d − δπ(x))ᵀ W (δπ(x) − θ) π(θ|x) dθ
 + ∫_Θ (d − δπ(x))ᵀ W (d − δπ(x)) π(θ|x) dθ.
Since δπ(x) = ∫_Θ θ π(θ|x) dθ, it holds for the mixed term that
∫_Θ (d − δπ(x))ᵀ W (δπ(x) − θ) π(θ|x) dθ = (d − δπ(x))ᵀ W ∫_Θ (δπ(x) − θ) π(θ|x) dθ
 = (d − δπ(x))ᵀ W ( δπ(x) − ∫_Θ θ π(θ|x) dθ ) = 0.
Hence
ρ(π, d|x) = ∫_Θ (θ − δπ(x))ᵀ W (θ − δπ(x)) π(θ|x) dθ + (d − δπ(x))ᵀ W (d − δπ(x)) ≥ ρ(π, δπ(x)|x). □
We turn back to the toy data, introduced in Example 2.4.
In case the posterior belongs to a known distribution class, with the first two
moments finite and known, we obtain the Bayes decision rule and the Bayes
risk without additional problems. We demonstrate it with the help of next
three examples.
Figure 4.4: Example 4.6. The Bayes estimators are related to the priors Beta(0.5, 0.5), Beta(1, 1), Beta(3.8, 3.8); see also Figure 3.12.
The Laplace prior is Beta(1, 1) and Jeffreys prior is Beta(1/2, 1/2), for which the associated Bayes estimators are
δ^{π_Lap}(x) = (1 + x)/(2 + n),  δ^{π_Jeff}(x) = (0.5 + x)/(1 + n).
The moment estimator equals the maximum likelihood estimator x/n. See Figure 4.4. □
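A minimal R sketch of these posterior-mean estimators; the sample size n below is a hypothetical choice.

bayes_est <- function(x, n, a, b) (a + x) / (a + b + n)   # posterior mean of Beta(a + x, b + n - x)
n <- 10; x <- 0:n
cbind(mle      = x / n,
      jeffreys = bayes_est(x, n, 0.5, 0.5),
      laplace  = bayes_est(x, n, 1, 1),
      subj     = bayes_est(x, n, 3.8, 3.8))   # subjective prior Beta(3.8, 3.8) from Example 3.6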
The posterior is N(μ1, σ1²) with
μ1 = ( x̄ n σ0² + μ0 σ² ) / ( n σ0² + σ² )  and  σ1² = σ0² σ² / ( n σ0² + σ² ).
We obtain as Bayes estimator
δπ(x) = ( x̄ n σ0² + μ0 σ² ) / ( n σ0² + σ² ).   (4.16)
For different prior parameters μ0 and σ0², the estimators are plotted as functions of the sufficient statistic x̄ in Figure 4.5. □
Figure 4.5: Example 4.7. The maximum likelihood estimator and Bayes estimators related to the priors N(1, 1) and N(−1, 0.5) are plotted as functions of the sample mean, where n = 4 and σ² = 1.
The last example is related to Example 2.16. Here we have no closed-form expression for the Bayes estimator, and computer-intensive methods are required. The posterior kernel is
π(θ|x) ∝ h(θ) φ(θ),  with h(θ) = ∏_{i=1}^n 1/(1 + (xi − θ)²)
and φ the density of the normal prior N(μ, σ²). Generating θ^(1), . . . , θ^(N) from N(μ, σ²), the normalizing constant ∫ h(θ)φ(θ) dθ is approximated by
m(x) = (1/N) Σ_{j=1}^N h(θ^(j)).
Figure 4.6: Illustration of different loss functions. The absolute error loss defined in (4.17) is plotted with k1 = 2, k2 = 4.
For the other integral, generate a new sequence θ^(1), . . . , θ^(N) from N(μ, σ²) and approximate m1(x) by
m1(x) = (1/N) Σ_{j=1}^N θ^(j) h(θ^(j)).
The Bayes estimate under quadratic loss is then approximated by the ratio m1(x)/m(x).
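A minimal R sketch of this Monte Carlo approximation; the data and prior parameters are hypothetical.

set.seed(3)
x  <- c(-1.2, 0.4, 0.9, 2.3)           # observations, assumed from C(theta, 1)
mu <- 0; sigma <- 2                     # prior theta ~ N(mu, sigma^2)
h  <- function(theta) sapply(theta, function(t) prod(1 / (1 + (x - t)^2)))
N  <- 1e5
th1 <- rnorm(N, mu, sigma)              # draws for m(x)
th2 <- rnorm(N, mu, sigma)              # independent draws for m1(x)
m  <- mean(h(th1))
m1 <- mean(th2 * h(th2))
m1 / m                                  # approximate posterior mean of theta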
Theorem 4.3 For every x ∈ X, the Bayes rule δπ(x) with respect to L(k1,k2) is given by the k2/(k1 + k2) fractile of the posterior distribution π(θ|x):
Pπ( θ ≤ δπ(x) | x ) = k2/(k1 + k2).   (4.18)
Proof: The posterior expected loss is
ρ(π, d|x) = ∫_Θ L(k1,k2)(θ, d) π(θ|x) dθ
 = k1 ∫_{−∞}^{d} (d − θ) π(θ|x) dθ + k2 ∫_{d}^{∞} (θ − d) π(θ|x) dθ.
Using d/dx( −(1 − F(x)) ) = f(x) and integration by parts, we obtain for the second term
∫_{d}^{∞} (θ − d) π(θ|x) dθ = (θ − d)( −(1 − Fπ(θ|x)) ) |_{d}^{∞} + ∫_{d}^{∞} (1 − Fπ(θ|x)) dθ
 = ∫_{d}^{∞} (1 − Fπ(θ|x)) dθ.
Hence
ρ(π, d|x) = k1 ∫_{−∞}^{d} Fπ(θ|x) dθ + k2 ∫_{d}^{∞} (1 − Fπ(θ|x)) dθ.
The optimal decision should fulfill (4.9), which implies that the derivative of the posterior loss with respect to d should be zero, ρ′(π, d|x) = 0. We have
ρ′(π, d|x) = k1 Fπ(d|x) − k2 (1 − Fπ(d|x)) = 0
and obtain
Fπ( δπ(x) | x ) = k2/(k1 + k2). □
Note that for the L1 loss we have k1 = k2, and the Bayes estimate is the median of the posterior distribution. Let us consider an example. For a Beta(α, β) distribution the median can be approximated by
med ≈ (α − 1/3)/(α + β − 2/3),  α > 1, β > 1.   (4.19)
Hence
δπ(x) ≈ (α + x − 1/3)/(α + β + n − 2/3).   (4.20)
The Laplace prior is Beta(1, 1) and Jeffreys prior is Beta(1/2, 1/2), and the related Bayes estimators are, for 0 < x < n,
δ^{π_Lap}(x) ≈ (x + 2/3)/(n + 4/3),  δ^{π_Jeff}(x) ≈ (x + 1/6)/(n + 1/3). □
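A minimal R sketch comparing the exact posterior median (the Bayes rule under L1 loss) with the approximation (4.19), for the Laplace prior and hypothetical n, x.

n <- 10; x <- 6
a <- 1 + x; b <- 1 + n - x              # posterior Beta(a, b) under the Beta(1, 1) prior
exact  <- qbeta(0.5, a, b)              # exact posterior median
approx <- (a - 1/3) / (a + b - 2/3)     # approximation (4.19)
c(exact = exact, approx = approx)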
4.3.3 Prediction
We assume the Bayes model {P, π} which has the posterior π(θ|x). The future
data point xf is generated by a distribution Qθ , possibly depending on x;
setting z = xf , we have q(z|θ, x). The prediction error Lpred (z, d) is the
loss we incur by predicting z = xf by some d ∈ D = Xf . Note that, the future
observation is a realization of a random variable; the prediction error is not a loss function as given in (4.1). We define the loss function as the expected prediction error
L(θ, d) = ∫_{Xf} L_pred(z, d) q(z|θ, x) dz.   (4.21)
Hence
ρ(π, d|x) = ∫_{Xf} L_pred(z, d) π(z|x) dz.   (4.25)
Applying Theorem 4.1 we obtain the Bayes rule by minimizing the expected posterior loss for each x ∈ X. We call this Bayes rule the Bayes predictor:
ρ(π, δπ(x)|x) = inf_{d∈D} ρ(π, d|x) = inf_{d∈D} ∫_{Xf} L_pred(z, d) π(z|x) dz.   (4.26)
Theorem 4.4
1. For L_pred(z, d) = (z − d)ᵀ W (z − d), with W positive definite, the Bayes predictor is the expectation of the predictive distribution, E(z|x).
2. Set Xf = R and L_pred(z, d) = |z − d|. The Bayes predictor is the median of the predictive distribution, Median(z|x).
Applying known results for the normal distribution, i.e., Theorem 6.1, we obtain the predictive distribution and hence the Bayes predictor x̂f = x̄ + aμ1. It is just the prediction rule obtained by plugging in the estimator; for the Bayes predictor we plug in the Bayes estimator. □
Figure 4.7: Example 4.11. The posterior π(θ|x) is plotted, with prior InvGamma(6, 9). Left: Θ0 = (0, 1] and Θ1 = (1, ∞). The posterior probability of Θ1 is higher. The Bayes decision is δπ(x) = 1. Right: Θ0 = {1} and Θ1 = (0, 1) ∪ (1, ∞). The posterior probability of Θ0 is zero. The Bayes decision is δπ(x) = 1.
L_H(θ, d) = H²(Pθ, Pd) = (1/2) ∫_X ( √( p(x|d)/p(x|θ) ) − 1 )² p(x|θ) dx.   (4.31)
Note that the loss L_KL(θ, d) is asymmetric, but L_H(θ, d) is symmetric. We can reformulate the Hellinger distance as
L_H(θ, d) = (1/2) ∫_X ( √( p(x|d)/p(x|θ) ) − 1 )² p(x|θ) dx
 = (1/2) ∫_X ( p(x|d)/p(x|θ) − 2√( p(x|d)/p(x|θ) ) + 1 ) p(x|θ) dx   (4.32)
 = (1/2) ( ∫_X p(x|d) dx − 2 ∫_X √( p(x|θ) p(x|d) ) dx + 1 )
 = 1 − ∫_X √( p(x|θ) p(x|d) ) dx.
A famous example in this context is the Neyman–Pearson test, see for example
Liero and Zwanzig (2011, Section 5.3.1).
A randomized decision rule assigns probabilities to the possible decisions,
δ*(x, ·): p1(x) on d1, p2(x) on d2, p3(x) on d3, with p3(x) = 1 − p1(x) − p2(x).
The non-randomized decision rule δ1 in Example 4.2 corresponds to
δ1*(1, ·) = (0, 0, 1), . . . , δ1*(4, ·) = (1, 0, 0) on (d1, d2, d3).
The loss of a randomized decision rule is calculated as
L(θ, δ*(x, ·)) = p1(x) L(θ, d1) + p2(x) L(θ, d2) + p3(x) L(θ, d3),
where L(θ, d) is defined in Example 4.1. We obtain the risk
R(θ, δ*) = Eθ p1(x) L(θ, d1) + Eθ p2(x) L(θ, d2) + Eθ p3(x) L(θ, d3).
A randomized decision rule δ*(x, ·) is defined by
x        0     1     2     3     4
p1(x)    0     0.2   0.5   0.8   0.9
p2(x)    0.2   0.3   0.4   0.2   0.1
p3(x)    0.8   0.5   0.1   0     0
This gives the risks
            θ1      θ2      θ3
R(θ, δ*)    0.194   0.263   0.032
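A minimal R sketch that reproduces these risks from the lion model Pθ(x), the loss table of Example 4.1 and the randomized rule above.

P <- rbind(theta1 = c(0, .05, .05, .8, .1),   # P_theta(x), x = 0,...,4
           theta2 = c(.05, .05, .8, .1, 0),
           theta3 = c(.9, .05, .05, 0, 0))
L <- rbind(theta1 = c(0, .8, 1),              # L(theta, d), d = d1, d2, d3
           theta2 = c(.3, 0, .8),
           theta3 = c(.3, .1, 0))
p <- cbind(d1 = c(0, .2, .5, .8, .9),         # randomized rule p_j(x), x = 0,...,4
           d2 = c(.2, .3, .4, .2, .1),
           d3 = c(.8, .5, .1, 0, 0))
risk <- diag(P %*% (p %*% t(L)))              # R(theta, delta*) for theta1, theta2, theta3
round(risk, 3)                                # 0.194 0.263 0.032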
Figure 4.8: Example 4.14. We compare only four rules. δ1, δ2, δ3 are the non-randomized rules in Example 4.2. The randomized rule δ* given in Example 4.14 is plotted as a line. The comparison criterion is the maximal risk, highlighted by thick points. The decision minimizing the maximal risk is δ2.
In Figure 4.8 the risk functions are shown for δ1, δ2, δ3 from Example 4.2 and for δ*. Comparing these four rules, the minimax rule is δ2. □
Next we give two statements, which provide an insight into the relationship
between the minimax concept and admissibility. As shown in Figure 4.8 the
minimax rule is also admissible. Theorem 4.6 gives the corresponding result.
Theorem 4.6
If there exists a unique minimax rule, then it is admissible.
Proof: Let δ* be a unique minimax rule. We carry out the proof by contradiction: suppose δ* is not admissible. Then there exists a rule δ1 such that
R(θ, δ1) ≤ R(θ, δ*) for all θ ∈ Θ, with strict inequality for some θ ∈ Θ.
This implies sup_{θ∈Θ} R(θ, δ1) ≤ sup_{θ∈Θ} R(θ, δ*), so δ1 is also minimax, which contradicts the uniqueness of δ*. □
Theorem 4.7
Let δ0 be an admissible rule. If for all θ ∈ Θ the loss function L(θ, ·) is strictly convex and if for all θ ∈ Θ
R(θ, δ0) = const,
then δ0 is the unique minimax rule.
Proof: Let δ0 be an admissible rule with constant risk; then for an arbitrary θ0 ∈ Θ,
sup_{θ∈Θ} R(θ, δ0) = R(θ0, δ0).
We carry out the proof by contradiction and assume that δ0 is not the unique minimax rule. Then there exists δ1 such that
sup_{θ∈Θ} R(θ, δ1) ≤ sup_{θ∈Θ} R(θ, δ0).
Thus
R(θ0, δ1) ≤ sup_{θ∈Θ} R(θ, δ1) ≤ sup_{θ∈Θ} R(θ, δ0) = R(θ0, δ0).   (4.40)
4.5 Bridges
In this section we show optimality results for Bayesian procedures in the fre-
quentist context. In a sense it is a bridge between the statistician who prefers
the Bayes model and the one who does not like the additional assumption of
a prior distribution. But it is not only some kind of justification of the Bayes
model, rather the bridge results deliver a method to prove optimality results
for a “frequentist” method when it can be rewritten as a Bayes rule.
We start with the statement that Bayes rules are not randomized.
Theorem 4.8 Assume a Bayes model {P, π}. Then it holds that
inf_{δ*∈D*} r(π, δ*) = inf_{δ∈D} r(π, δ),
i.e., randomization does not improve the Bayes risk.
Proof: Consider the posterior risk. For all x ∈ X and all δ* ∈ D* we have
ρ(π, δ*(x, ·)|x) = ∫_Θ ( ∫_D L(θ, a) δ*(x, a) da ) π(θ|x) dθ
 = ∫_D ( ∫_Θ L(θ, a) π(θ|x) dθ ) δ*(x, a) da
 = ∫_D ρ(π, a|x) δ*(x, a) da
 ≥ inf_{a∈D} ρ(π, a|x) ∫_D δ*(x, a) da = inf_{a∈D} ρ(π, a|x).
Because r(π, δ*) = ∫_X ρ(π, δ*(x, ·)|x) p(x) dx, we further have
r(π, δ*) ≥ ∫_X inf_{a∈D} ρ(π, a|x) p(x) dx = inf_{δ∈D} r(π, δ). □
Here we have a bridge statement, which says that a Bayes rule can be admis-
sible. The Bayes rule δ π is optimal within the framework of the Bayes decision
theory, whereas admissibility is an optimality property which does not require
a Bayes model.
102 DECISION THEORY
and if
R(θ, δ) is continuous in θ for all δ,
then δ π is admissible.
and
R(θ1 , δ1 ) < R(θ1 , δ π ), for some θ1 ∈ Θ.
Because R(., δ) is continuous there exists a neighborhood C, θ1 ∈ C and
π(C) > 0 such that
and
R(θ, δ1 )π(θ) dθ < R(θ, δ π )π(θ) dθ
C C
and for C = Θ \ C
R(θ, δ1 )π(θ) dθ ≤ R(θ, δ π )π(θ) dθ.
C C
Hence
r(π, δ1 ) = R(θ, δ1 )π(θ) dθ + R(θ, δ1 )π(θ) dθ
C C
< R(θ, δ π )π(θ) dθ + R(θ, δ π )π(θ) dθ
C C
= r(π, δ π ) = inf r(π, δ) < ∞.
δ∈D
Theorem 4.10 Assume a Bayes model {P, π}, where π can be improper.
If for all θ ∈ Θ the loss function L(θ, .) is strictly convex, and if δ π is a
Bayes rule with finite Bayes risk
then δ π is admissible.
and
R(θ1 , δ1 ) < R(θ1 , δ π ), for some θ1 ∈ Θ.
We define a new decision rule
δ1 (x) + δ π (x)
δ2 (x) = .
2
104 DECISION THEORY
Because of the strong convexity of the loss function we obtain
R(θ, δ2 ) = L(θ, δ2 (x))p(x|θ) dx
X
1 1 π
< L(θ, δ1 (x)) + L(θ, δ (x)) p(x|θ) dx
X 2 2
1 1
= R(θ, δ1 ) + R(θ, δ π )
2 2
≤ R(θ, δ π ).
Thus
R(θ, δ2 ) < R(θ, δ π ) for all θ ∈ Θ.
For r(π, δ) = R(θ, δ) π(θ) dθ we obtain
r(π, δ2 ) < r(π, δ π ) = inf r(π, δ) < ∞,
δ∈D
which is a contradiction!
2
We apply this theorem in the following example on testing hypotheses.
Theorem 4.11 Assume a Bayes model {P, π}. If the Bayes rule δ π is
π
unique, then δ is admissible.
Theorem 4.12
Assume
inf sup r(π, δ) = sup inf r(π, δ). (4.46)
δ∈D π π δ∈D
The prior π0 in (4.47) is called least favourable prior. The statement above
means for a minimax estimator δ0 that
r(π0 , δ) ≤ r(π0 , δ0 ) = r(π0 ) for all δ ∈ D. (4.48)
Proof: We begin by showing the following general result for f (x) ≥ 0
sup f (x) = sup f (x)π(x) dx. (4.49)
x {π: π(x)dx=1,π(x)≥0}
We have, for all π, π(x)dx = 1 and f (x) ≥ 0, so that
f (x)π(x) dx ≤ sup f (x) π(x) dx = sup f (x).
x x
106 DECISION THEORY
Take a sequence {xn } with limn→∞ f (xn ) = supx f (x) and define a sequence
of Dirac measures πn , πn (x) = 1 for x = xn , otherwise πn (x) = 0. Then
sup f (x)π(x) dx ≥ f (x)πn (x)dx = f (xn )
π
and (4.49) follows. Now we prove (4.48). It suffices to show that r(π0 , δ0 ) ≤
r(π0 ). We have
r(π0 , δ0 ) = R(θ, δ0 )π0 (θ) dθ (definition of integrated risk)
L(θ, d) d = θ0 d = θ1
θ0 0 0.2 (4.51)
θ1 0.8 0
We obtain
ρ(π, θ0 |x) = 0.2 π(θ1 |x) and ρ(π, θ1 |x) = 0.8 π(θ0 |x),
x 0 1 2 3 4
(4.53)
g(x) 1 0.67 0.32 0.2 0
This implies that we have five different Bayes rules δj , j = 1, . . . , 5. Set 1 for
decision θ1 and 0 for decision θ0 . Then
x 0 1 2 3 4
p = 0 δ1 (x) 0 0 0 0 0
0 < p ≤ 0.2 δ2 (x) 0 0 0 0 1
(4.54)
0.2 < p ≤ 0.32 δ3 (x) 0 0 0 1 1
0.32 < p ≤ 0.67 δ4 (x) 0 0 1 1 1
0.67 < p ≤ 1 δ5 (x) 0 1 1 1 1
108 DECISION THEORY
0.04
0.03
Bayes risk
0.02
0.01
0.00
Figure 4.9: Example 4.18. The least favourable prior distributions have p between
0.32 and 0.67.
5
R(θ, δj ) = L(θ, δj (xi ))Pθ (xi )
i=1
and
The Bayes risk is piecewise linear. The supremum is attained for all p ∈
[0.32, 0.67]. The Bayes risk is plotted in Figure 4.9. For each given prior dis-
tribution the Bayes rule is unique. By Theorem 4.11, the Bayes rules in (4.54)
are admissible. 2
We now cross the bridge in the other direction. The next theorem gives a
sufficient condition for a Bayes rule to be minimax. Note that, the condition
is only sufficient!
Theorem 4.13 Assume a Bayes model {P, π}. If the Bayes rule δ π has
constant risk
R(θ, δ π ) = const, (4.55)
then δ π is minimax.
BRIDGES 109
Proof: Suppose that δ is not minimax. Then there exists a decision rule
π
δ1 such that
Otherwise
sup R(θ, δ1 ) ≥ R(θ, δ1 )π(θ) dθ = r(π, δ1 ).
θ Θ
This implies
r(π, δ1 ) < r(π, δ π )
which is a contradiction.
2
Further r(π0 ) = 0.04 = supp r(p), see Figure 4.9. Hence the rule δ4 is our
favorite,
x 0 1 2 3 4
(4.56)
δ4 (x) 0 0 1 1 1
It is admissible and minimax and associated with a least favourable prior. We
are on the safe side and decide that if the lion has eaten more than 1 person
the lion was hungry; no surprise! 2
1.0
●
Laplace Laplace ●
2.5
Jeffreys ●
0.8
Jeffreys ●
Least favourable minimax
MLE
2.0
Bayes estimation
●
●
0.6
●
prior
1.5
0.4
●
●
1.0
0.2
●
0.5
●
●
0.0
0.0
●
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0
theta x
√
n
Plugging α = β = 2 in (4.45) we get
2n 2n
n − 2α2 − 2αβ = n − − =0
4 4
and √
(α + β)2 − n = ( n)2 − n = 0
so that the risk is constant. The
√
least favourable prior belongs to the conjugate
√
family; it is the prior Beta( 2n , 2n ). Applying Theorem 4.13 the estimator
√
n
+x
δ(x) = 2
√ (4.57)
n+ n
is minimax. In Figure 4.10 different priors and Bayes estimates are plotted
and compared. 2
x −1 0 1
θ=0 0.5 0.3 0.2
θ=1 0 0.5 0.5
(a) Derive the Bayes estimator associated with the loss in (4.58).
(b) Set k1 = k2 . Give an approximate expression for the Bayes estimator.
(c) Discuss possible applications of this asymmetric approach. Give an ex-
ample.
(d) Is it possible to find a prior inside the class of beta distributions
{Beta(α, β) : α > 1, β > 1} such that the Bayes estimator has a constant
risk? Why is a constant risk of interest?
3. Consider X ∼ Gamma(α, β), where α > 2 is known. The parameter of
interest is θ = β.
(a) Derive Jeffreys prior and the related posterior.
(b) Derive the Bayes estimator δ π (x) under the L2 loss function.
(c) Calculate the frequentist risk, the posterior expected loss and the Bayes
risk of δ π (x).
(d) Show that δ π (x) is admissible.
4. Consider X ∼ Gamma(α, β), where α > 2 is known. The parameter of
interest is θ = β1 .
(a) Does Gamma(α, β) belong to an exponential family? Derive the natural
parameter and the related sufficient statistics.
(b) Derive the efficient estimator δ ∗ .
(c) Derive the conjugate prior for θ. Calculate the posterior.
112 DECISION THEORY
π
(d) Derive the Bayes estimator δ (x) under the L2 loss function.
(e) Calculate the frequentist risk of δ π and of δ ∗ , the posterior expected loss
of δ π and the Bayes risk of δ π .
(f) Is δ π admissible?
(g) Compare the frequentist risk of δ π and of δ ∗ .
5. Consider X ∼ Mult(n, θ1 , θ2 , θ3 ) and
n
θ = (θ1 , θ2 , θ3 ) ∈ Θ = {(ϑ1 , ϑ2 , ϑ3 ) : ϑi ∈ [0, 1], ϑi = 1}
i=1
Asymptotic Theory
5.1 Consistency
Recall from Chapter 2 that, in a Bayes model {P, π}, the notion of a true
parameter makes no sense. The parameter is a random variable with known
distribution π. But given the data x we can ask for the data generating dis-
tribution. Like in decision theory we cross the bridge and consider a Bayes
procedure inside the model P = {Pθ : θ ∈ Θ} and study its consistency.
Having in mind the Bayesian inference principle, that all inference on θ is
based on the posterior distribution, we define consistency with respect to the
posterior distribution. Note that, we use the capital letter X(n) for the random
variable and x(n) for its realization, we set Pπ (.) for the prior distribution
with density π(.) and Pπn (.|x(n) ) for the posterior distribution with density
πn (.|x(n) ).
This definition is general and also includes dependent data. In case of an i.i.d.
sample we have X(n) = (X1 , . . . , Xn ) ∼ P⊗n
θ0 . The posterior distribution con-
tracts to the data generating parameter θ0 and becomes the Dirac distribution
at the point θ0 . We illustrate the definition for the binomial model.
Since 0 ≤ x ≤ n, we have
(α + n)(β + n)
Eθ0 (Var(θ|X = x)) ≤ → 0, for n → ∞.
(α + β + n)2 (α + β + n + 1)
L : (θ, d) ∈ Θ × Θ → L(θ, d) ∈ R+ ,
is the penalty for the choice of d, instead of θ. We assume the following contrast
condition, that there exists a constant c0 > 0 for all d, such that
Theorem 5.1
Assume the following conditions:
1. The loss function fulfills the conditions (5.2) and (5.3).
2. For all ε > 0 and all open sets O ⊂ Θ with θ0 ∈ O it holds for
that
Pπ (Bε (θ0 )) > 0. (5.4)
(n)
3. Let X(n) ∼ Pθ0 and the sequence of posteriors πn (θ|X(n) ) be strongly
consistent at θ0 .
Then for n → ∞
δnπ (X(n) ) → θ0 a.s.
≥ π
L(θ, δn (x(n) ))πn (θ|x(n) ) dθ
B (5.5)
≥ (L(θ0 , δnπ (x(n) )) − ε)πn (θ|x(n) ) dθ
B
≥ (L(θ0 , δnπ (x(n) )) − ε) Pπn (B|x(n) ).
Otherwise we have
L(θ, θ0 )πn (θ|x(n) ) dθ ≤ (L(θ0 , θ0 ) + ε)Pπn (B|x(n) ).
B
116 ASYMPTOTIC THEORY
and because of (5.3) and the Cauchy–Schwartz inequality,
2
L(θ, θ0 )πn (θ|x(n) ) dθ
c
B
≤ L(θ, θ0 )2 πn (θ|x(n) ) dθ IB c (θ)πn (θ|x(n) ) dθ
≤ K 2 Pπn (B c |x(n) ).
Summarizing, we get
ρn (π, θ0 |x(n) ) = L(θ, θ0 )πn (θ|x(n) ) dθ + L(θ, θ0 )πn (θ|x(n) ) dθ
B Bc (5.6)
1
≤ (L(θ0 , θ0 ) + ε)Pπn (B|x(n) ) +K (Pπn (B c |x(n) )) 2 .
and 1
(Pπ (B c |x(n) )) 2
L(θ0 , δnπ (x(n) )) ≤ 2ε + L(θ0 , θ0 ) + K nπ .
Pn (B|x(n) )
From (5.4), B ⊂ O and the consistency of the posterior, it holds that
1
(Pπn (B c |x(n) )) 2
→ 0, a.s.
Pπn (B|x(n) )
2
The assumption (5.4) includes the requirement that the prior distribution
is positive around the data generating parameter θ0 . The following example
illustrates what happens when the prior excludes θ0 .
likelihood
prior
8
posterior, n=1
posterior, n=2
posterior, n=10
6
4
2
0
−2 −1 0 1 2 3
theta
Figure 5.1: Example 5.2. The posterior distributions are concentrated at the border
b = 1, while the data generating parameter is θ0 = 2.
θ0 = 2, see Figure 5.1. We apply a quadratic loss L(θ, d) = |θ − d|2 . The loss
fulfills conditions (5.2)and (5.3). From Theorem 4.2 we get
1 ϕ(α) − ϕ(β)
δnπ (x(n) ) = E(θ|x(n) ) = x̄ + √ ,
n Φ(β) − Φ(α)
√ √
where α = n(−1 − x̄), β = n(1 − x̄), ϕ and Φ are the density and dis-
tribution functions of N(0, 1). Note that, this expression cannot be used for
calculations, since the posterior support is lying in the tails of a normal dis-
tribution that implies the fraction of two nearly zero terms. From the Law of
Large Numbers we obtain that x̄ → 2 a.s. for n → ∞. Applying the rule of
L’ Hospitale we get
δnπ (x(n) ) − x̄ → −1 a.s.
2
Note that, Assumption (5.4) is stronger than that the prior covers all possi-
ble parameters. It also requires a continuity of the parametrization. Here we
present a counterexample from Bahadur; see Schwartz (1965).
118 ASYMPTOTIC THEORY
Example 5.3 (Bahadur’s counterexample)
Assume X1 , . . . , Xn are i.i.d from Pθ with θ ∈ Θ = [1, 2), where
⎧
⎨ U[0, 1] for θ = 1
Pθ =
⎩ U[0, 2 ] for 1 < θ < 2
θ
Consider now the Bayes model with uniform prior π(θ) = 1 for 1 ≤ θ < 2.
The Bayes rule under quadratic loss is the expectation of the posterior. We
have π(θ|x(n) ) ∝ (θ|x(n) ) and
2
θ (θ|x(n) ) dθ
δ(x(n) ) = 1 2
1
(θ|x(n) ) dθ
b
Applying the integral a
z m dz = m+11
(bm+1 − am+1 ), with a = 1 and b = 2
y
we obtain ⎧
⎨ n+1 2n+2 −1 for y ≤ 1
n+2 2n+1 −1
δ(x(n) ) = .
⎩ n+1 bn+2 −1 for y > 1, b = 2
n+2 b n+1 −1 y
In Example 5.1 the limit behavior of the posterior is independent of the prior.
This can be proved generally. This property was also one of the key points for
the construction of the reference priors; see the discussion in Section 3.3.3.
CONSISTENCY 119
1.2
1.0
theta=1
theta=1.3
1.0
0.8
theta=1.9
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
0.0 0.5 1.0 1.5 2.0 1.0 1.2 1.4 1.6 1.8 2.0
x theta
Figure 5.2: Example 5.3. Left: densities for θ = 1 and θ = 1.3, 1.9. Right: likelihood
functions with xmax = 1.1 and with xmax = 0.8 (broken line). The maximum likeli-
hood estimator is xmax
2
≈ 1.82 in the first case and 1 in the second.
(n)
Theorem 5.2 Assume X(n) ∼ Pθ0 and two different priors π1 , π2 , both
continuous and positive at θ0 . Assume that for i = 1, 2 the posterior series
πi,n (θ|X(n) ), associated with πi , are strongly consistent. Then
1 μ1 (Θ)
α≤ ≤ η 2 α. (5.11)
η2 μ2 (Θ)
c
Set Om 0
= Θ \ Om0 . Using (5.11) we can estimate the posteriors
μ1 (A) μ2 (A)
Pπn1 (A|X(n) )−Pπn2 (A|X(n) ) = −
μ1 (Θ) μ2 (Θ)
μ1 (A ∩ Om0 ) μ2 (A ∩ Om0 ) μ1 (Om c
)
≤ − + 0
μ2 (Θ) μ1 (Θ)
≤ (η 3 − 1)Pπn2 (Om0 |X(n) ) + Pπn1 (Om
c
0
|X(n) ).
Pπ (Kε (θ0 )) > 0, where Kε (θ0 ) = {θ : K(Pθ0 |Pθ ) < ε}. (5.14)
For every open set O ⊂ Θ with θ0 ∈ O and Oc = Θ\O, there exist constants
D0 and q0 , q0 < 1, such that
H 21 (P⊗n
θ0 , Pn,O ) ≤ D0 q0 ,
c
n
(5.15)
Proof: We only provide a sketch of the proof. For a precise proof based on
measure theory we recommend Liese and Miescke (2008, p. 354ff). We have
to show that for an arbitrary open subset O with θ0 ∈ O
It holds that
μ(A)
Pπn (A|x(n) ) = with μ(A) = pn (x(n) |θ) π(θ) dθ. (5.18)
μ(Θ) A
with
Numn (β) = exp(βn)μ(Oc ) pn (x(n) |θ0 )−1 ,
(5.20)
Denn (β) = exp(βn)μ(Θ) pn (x(n) |θ0 )−1 .
The proof is split into two steps. In the first step we show that under assump-
tion (5.15) there exists a β0 such that
In the second step we show that under (5.14), for all β > 0,
Using (5.25) and the inequality for non-negative random variables Z that
P(Z ≥ 1) ≤ EZ, we obtain for arbitrary r > 1
1
P(r2n g(x(n) ) ≥ 1) = P(rn g(x(n) ) 2 ≥ 1)
1
(5.26)
≤ rn Eθ0 g 2 ≤ D0 (r q0 )n .
1
Choosing r such that 1 < r < q0 we get
∞
∞
P(r g(x(n) ) ≥ 1) ≤ D0
2n
(r q0 )n < ∞. (5.27)
n=1 n=1
Numn (β0 ) = r−n r2n g(x(n) ) Pπ (Oc ) ≤ r−n r2n g(x(n) ) ≤ r−n a.s.
Since r > 1 we get r−n → 0 and (5.21), which completes the first step of the
proof. To show (5.22), we set β = 2nε for arbitrary ε > 0. We have
pn (x(n) |θ)
lim inf Denn (β) = lim inf exp(2nε) π(θ) dθ
n→∞ n→∞ Θ n (x(n) |θ0 )
p
pn (x(n) |θ)
≥ lim inf exp(2nε) π(θ) dθ
n→∞ Kε (θ0 ) n (x(n) |θ0 )
p
pn (x(n) |θ0 )
= lim inf exp n2ε − ln π(θ) dθ
n→∞ K (θ )
ε 0
pn (x(n) |θ)
=D
(5.29)
SCHWARTZ’ THEOREM 123
P⊗n
(n)
As Pθ = θ , we have
pn (x(n) |θ0 )
ln = ln(pn (x(n) |θ0 )) − ln(pn (x(n) |θ))
pn (x(n) |θ)
n n
= ln(pn (xi |θ0 )) − ln(pn (xi |θ)) (5.30)
i=1 i=1
n
pn (xi |θ0 )
= ln .
i=1
pn (xi |θ)
Applying Jensen’s inequality, that f (EX) ≤ E(f (x)) for convex functions f ,
and using the factorization
π(θ) = Pπ (Kε (θ0 ))πε (θ) with πε (θ) = π(θ|Kε (θ0 )) (5.31)
we obtain
1 pn (xi |θ0 )
n
D ≥ lim inf exp n 2ε −
ln π(θ) dθ
n→∞ Kε (θ0 ) n i=1 pn (xi |θ)
(5.32)
1
n
≥ lim inf exp Pπ (Kε (θ0 )) n 2ε πε (θ) dθ − Yi ,
n→∞ Kε (θ0 ) n i=1
where
p(xi |θ0 )
Yi = ln πε (θ) dθ.
p(xi |θ)
n
The strong law of large numbers yields n1 i=1 Yi → EY1 , a.s. for n → ∞. We
calculate EY1 ,
p(x1 |θ0 )
EY1 = ln( )πε (θ) dθ p(x1 |θ0 ) dx1
p(x1 |θ)
p(x1 |θ0 )
= ln( )p(x1 |θ0 ) dx1 πε (θ) dθ (5.33)
p(x1 |θ)
= K(Pθ0 |Pθ )πε (θ) dθ,
Summarizing, we obtain
lim inf Denn (β) ≥ lim inf exp (n Pπ (Kε (θ0 )) (2ε − ε)) = ∞
n→∞ n→∞
H2 (P⊗n c
⊗n
θ0 , Pn,O ) = 1 − H 12 (Pθ0 , Pn,O ) > 1 − D0 q0 .
c
n
H0 : P⊗n ⊗n
θ0 versus H1 : {Pθ : θ ∈ Θ \ O}
Proof: Set C1,n = {x(n) : ϕn (x(n) ) = 1}, the critical region of test ϕn , and
C0,n = {x(n) : ϕn (x(n) ) = 0} its complement. Recall the definition of Pn,Oc
in (5.16), and set πr (θ) for the prior restricted on Θ \ O. Then by Schwarz’
inequality we have
12
H 12 (P⊗n
1
θ0 , Pn,O )
c = p(x(n) |θ0 ) 2 p(x(n) |θ0 )πr (θ) dθ dx(n)
≤ P⊗n ⊗n
1 1 1 1
θ0 (C1,n ) Pn,O (C1,n ) + Pθ0 (C0,n ) Pn,O (C0,n )
2 c 2 2 c 2
≤ P⊗n
1 1
θ0 (C1,n ) + Pn,O (C0,n ) .
2 c 2
Since
P⊗n
θ0 (C1,n ) = Eθ0 ϕn (X(n) )
LIST OF PROBLEMS 125
and
Pn,Oc (C0,n ) = p(x(n) |θ)πr (θ) dθ dx(n)
C0,n
≤ sup P⊗n
θ (C0,n )
θ∈Θ\O
≤ sup Eθ (1 − ϕn (X(n) ))
θ∈Θ\O
This chapter deals with Bayesian theory of linear models, beginning with a
detailed treatment of univariate case, followed by a brief multivariate exten-
sion. Our focus will be on Bayesian analysis of linear models under normality
assumption, i.e., models of the form
We recall that the linearity of such models follows from that of β in the
expectation, Xβ. We shall be mainly concerned with two Bayes models based
on (6.1), namely,
{P, πc } and {P, πJeff },
where πc and πJeff stand, respectively, for conjugate and Jeffreys priors; see
Chapter 3 for details. Under this setting, our aim will be to derive poste-
rior distributions, particularly focusing on closed-form expressions. The conju-
gate families, as the normal-inverse-gamma distributions or the normal-inverse
Wishart distributions, are well studied and computer-intensive methods are
not needed. Unfortunately, there is no standard parametrization for these dis-
tributions in the literature. We introduce the distributions in a general set
up in the running text and sign a gray frame around them for a better read-
ing. Furthermore these explicit posteriors give a chance to test the simulation
methods.
y = Xβ + , (6.2)
where the first column of X is often a vector of ones, denoted 1n . Model (6.2)
expresses each of n observations in y as a linear combination of unknown
parameters in β with coefficients from X, i.e.,
p
yi = xTi β + i = xij βj + i , (6.4)
j=1
30
20
x1
10
5
0
60
50
x2
40
30
20
160
120
x3
80
40
0 5 10 15 20 25 30 40 60 80 100 140
Figure 6.1 depicts matrix scatter plot of independent variables. The model
was in fact employed to study two phosphorous contents, i.e., two depen-
dent variables. Here, we use only the first of them, adjourning the treat-
ment of bivariate model as a special case of multivariate linear model in
Section 6.4. 2
The first part of β represents X variables on each patient, where the second
part represents the week effect, and these parts correspond, respectively, to
first four and last three columns of X. 2
Our main focus is on full rank models, so that r(X) = p and Σ(ϑ) = σ 2 Σ,
where Σ is positive definite and known. Under this set up, the likelihood and
log-likelihood functions for model (6.2) with θ = (β, σ 2 ) can be stated as
1 1 1 T −1
(θ|y) = exp − (y − Xβ) Σ (y − Xβ) (6.7)
(2πσ 2 )n/2 |Σ| 12 2σ 2
n 1 1
l(θ|y) = − ln(2πσ 2 ) − ln(|Σ|) − 2 (y − Xβ)T Σ−1 (y − Xβ),(6.8)
2 2 2σ
which yield the maximum likelihood estimator, equivalently also the general-
ized least-squares estimator (GLSE), of β,
90
Y (Length of Life, years)
80
70
60
50
40
6 7 8 9 10 11 12
X (Length of Lifeline, cm)
For a short introduction into linear models we refer to Liero and Zwanzig
(2011, Chapter 6).
so that β = (87.89 − 2.26), i.e., the least-squares fitted line predicting life
length using life line as predictor, is yi = 87.89 − 2.26xi . The fitted line is also
shown in Figure 6.2, along with the scatter plot of observed data.
Roughly, the negative slope seems to counter the popular belief. In fact, a test
of H0 : β1 = 0 could not be rejected at any reasonable α (F = 2.09, p-value
= 0.155), indicating that the belief is nothing more than a mere perception.
Later we analyze the same data under Bayes model. 2
library(aprean3)
#"Applied Regression Analysis" by N.R.Draper and H.Smith
life<-dse03v # life data: data(dse03v)
attach(life); age<-life$x; length<-life$y
BAYES LINEAR MODELS 131
plot(length,age)
# outlier age=19, length=13.20
length[45]; age[1] # outlier
# new data without outlier
Age<-c(age[2:44],age[46:50])
Length<-c(length[2:44],length[46:50])
plot(Length,Age)
M<-lm(Age~Length); abline(M); summary(M)
We have
y|β ∼ Nn (Xβ, Σ) and β ∼ Np (γ, Γ). (6.13)
As the set up is all about joined, marginal and conditional normal distribution,
the following two lemmas deliver useful results for simplifying the calculations.
Lemma 6.1 states that if the joint distribution is multivariate normal, then
132 NORMAL LINEAR MODELS
all marginal and conditional distributions are also multivariate normal of re-
spective dimensions; see e.g., Mardia et al. (1979).
X1 ∼ Nr (μ1 , Σ11 )
X2 |X1 = x1 ∼ Np−r (μ2|1 , Σ2|1 ),
where
The converse of Lemma 6.1 does not hold in general. However, a converse is
possible if E(X2 |X1 = x1 ) is a linear function of x1 and Cov(X2 |X1 = x1 )
does not even depend on x1 . As this result, unlike Lemma 6.1, is rarely found
in statistical literature, we state it below.
X1 ∼ Nr (μ1 , Σ11 )
X2 |X1 = x1 ∼ Np−r (Ax1 + b, Λ),
Proof: We write the joint density, using the given marginal and conditional
normals, as
Partitioning this sum, term by term, and collecting quadratic and bilinear,
linear, and constant terms together, we have
Q1 + Q2 = XT C−1 T −1 T −1
1 X − 2X C2 + μ1 Σ11 μ1 + b Λ
T
b,
where
⎛ ⎞ ⎛ ⎞
Σ−1 T −1
11 + A Λ A −AT Λ−1 Σ−1 T −1
11 μ1 − A Λ b
C−1
1 =⎝ ⎠ , C2 = ⎝ ⎠.
−Λ−1 A Λ−1 Λ−1 b
Now, recall that, a random vector X ∈ Rp with X ∼ Np (μ, Σ), must have the
density:
1
f (X) ∝ exp − (XT Σ−1 X − 2XT Σ−1 μ)
2
Thus, we have to show, that C1 = Σ and C2 = Σ−1 μ. We use the inverse of
a partitioned matrix as given in Seber (2008), as
⎛ ⎞−1 ⎛ ⎞
A−1 −1
11 + A11 A12 S
−1
A21 A−1 −A−1
11 A12 S
−1
A11 A12
⎝ 11 ⎠ =⎝ ⎠,
−S −1
A21 A−1
11 S −1
A21 A22
In the next step we apply Lemma 6.1 to the joint multivariate normal distri-
bution in (6.15) and obtain the following theorem.
it holds that
y ∼ Nn (μy , Σy ) and β|y ∼ Np μβ|y , Σβ|y ,
with
μy = Xγ (6.16)
T
Σy = Σ + XΓX (6.17)
T −1
μβ|y = T
γ + ΓX (Σ + XΓX ) (y − Xγ) (6.18)
T −1
Σβ|y = Γ − ΓX (Σ + XΓX )
T
XΓ. (6.19)
We also note, however, that the posterior moments often provide a better
insight when expressed in terms of precision matrices, Σ−1 and Γ−1 . The
following lemma on special matrix inverse identities will be useful in achieving
these objectives.
BAYES LINEAR MODELS 135
In particular,
with
−1
m1 = γ − ΓXT XΓXT + Σ Xγ
= Γ − ΓX (XΓX + Σ) XΓ Γ−1 γ
T T −1
= Σβ|y Γ−1 γ
and
−1
m2 = ΓXT XΓXT + Σ y
−1
= Γ 2 Γ 2 XT Σ− 2 Σ− 2 XΓXT Σ− 2 + In Σ− 2 y
1 1 1 1 1 1
1 −1 − 1
= Γ 2 C CT C + In Σ 2 y,
1 −1
CΣ− 2 y
1
m2 = Γ 2 Im + CCT
−1
= Γ 2 Im + Γ 2 XT Σ−1 XΓ 2 Γ 2 XT Σ−1 y
1 1 1 1
−1 T −1
= Γ−1 + XT Σ−1 X X Σ y
= Σβ|y XT Σ−1 y.
It holds that
Cov(β) Cov(β|y)
−1
since X Σ X 0. This implies that the posterior not only updates the
T
prior, using the information in data, rather also improves it by reducing its
variance.
Before we present special cases of (6.23) and (6.24) we show an alternative
method of their derivation.
Following the general notion (see Chapter 2), the posterior distribution of β|y
can be obtained as
π(β|y) ∝ π(β)(β|y).
BAYES LINEAR MODELS 137
For the set up in (6.13), the prior and the likelihood can be written, respec-
tively, as
1 1 T −1
π(β) = exp − (β − γ) Γ (β − γ)
(2π)p/2 |Γ|1/2 2
1 1 T −1
(θ|y) = exp − (y − Xβ) Σ (y − Xβ) ,
(2π)n/2 |Σ|1/2 2
with
Q1 = (β − γ)T Γ−1 (β − γ)
Q2 = (y − Xβ)T Σ−1 (y − Xβ).
with
Γ1 = (Γ−1 + XT Σ−1 X)−1
γ1 = Γ1 (Γ−1 γ + XT Σ−1 y)
const1 = γ T Γ−1 γ + yT Σ−1 y.
Q1 + Q2 = β T Γ−1 T −1 T −1
1 β − 2β Γ1 γ1 + γ1 Γ1 γ1 + const1 − Q3
= (β − γ1 )T Γ−1
1 (β − γ1 ) + const2
where
const2 = γ T Γ−1 γ + yT Σ−1 y − γ1T Γ−1
1 γ1 .
yT Σ−1 y + γ T Γ−1 γ + Q4 .
138 NORMAL LINEAR MODELS
with
Q4 = −2γ1T XT Σ−1 y + γ1T XT Σ−1 Xγ1 − 2γ1 Γ−1 γ + γ1 Γ−1 γ1
= −2γ1T (XT Σ−1 y + Γ−1 γ) + γ1T (Γ−1 + XT Σ−1 X)γ1
= −2γ1T Γ−1 T −1
1 γ1 + γ1 Γ1 γ1
= −γ1T Γ−1 γ1
which gives the formula of const2 . Summarizing we obtain for the posterior
1
π(β|y) ∝ exp − (β − γ1 )T Γ−11 (β − γ 1 ) .
2
This is the kernel of Np (γ1 , Γ1 ). Comparing the formulas for γ1 and Γ1 with
(6.23) and (6.24) we obtain again
can be expressed as
Q = (β −γ1 )T Γ−1 T −1
1 (β −γ1 )+(y−Xγ1 ) Σ (y−Xγ1 )+(γ −γ1 )T Γ−1 (γ −γ1 )
(6.25)
or alternatively as
Q = (β − γ1 )T Γ−1 T −1
1 (β − γ1 ) + y Σ y + γ T Γ−1 γ − γ1T Γ−1
1 γ1 (6.26)
with
γ1 = Γ1 (Γ−1 γ + XT Σ−1 y).
Γ1 = (Γ−1 + XT Σ−1 X)−1
∼ Nn (0, σ 2 In ),
which implies Cov(y) = σ 2 In . Note that, structures like σ 2 In are called spher-
ical because the level sets of the density are spheres. In this case, the posterior
BAYES LINEAR MODELS 139
distribution is N(μβ|y , Σβ|y ) with
−1
μβ|y = Γ−1 + σ −2 XT X σ −2 XT y + Γ−1 γ (6.27)
−1
Σβ|y = Γ−1 + σ −2 XT X . (6.28)
Another special case is that each component of β has independent prior in-
formation with same precision, i.e., we assume
β ∼ Np (γ, τ 2 Ip ).
β = (XT X)−1 XT y = η −2 XT y.
and ⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
α 100 100 0
⎝ ⎠ ∼ N2 ⎝⎝ ⎠ , σ 2 ⎝ ⎠⎠ .
β 0 0 100
We apply Corollary 6.1 and obtain, for the first prior, the posterior
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
α 86.69 215.40 −23.30
⎝ ⎠ |y ∼ N2 ⎝⎝ ⎠,⎝ ⎠⎠
β −2.13 −23.30 2.56
We see that the choice of prior of the intercept has an influence on the posterior
expectation of the slope. We observe much reduced variance in the posterior
of the slope compared with the prior. The posterior variance of the intercept
in model (6.33) is still high.
Because we are mostly interested in the slope, we center the response and the
X–variable. The new model is:
We assume the prior β ∼ N(0, σ 2 100) and obtain β|y ∼ N(−2.257, 2.59). One
of posteriors of the slope is plotted in Figure 6.3. 2
library(matrixcalc)
sigma<-sqrt(160) # variance known
# alpha~N(a,sa^2); beta~N(b,sb^2) # prior
# First case
a<-0 # prior mean intercept
BAYES LINEAR MODELS 141
0.25
0.20
0.15
0.10
0.05
0.00
−5 0 5
slope
Figure 6.3: Example 6.4 on life length data. The broken line is the prior density,
N(0, σ 2 100), of the slope, and the continuous curve is the respective posterior density.
prior = N(0,10)
beta 1
beta 2
3
beta 3
2
1
0
−1 0 1 2 3
parameter
Figure 6.4: Example 6.5. Prior and posterior distributions of β1 , β2 , β3 for corn data
with conjugate prior.
BAYES LINEAR MODELS 143
where yij are independent observations measured on jth subject in ith group,
j = 1, . . . , ri , i = 1, . . . , a. It holds that yij ∼ N(μi , σ 2 ), hence
ri
y i ∼ N(μi , σ 2 /ri ) where y i = yij /ri .
j=1
We let μi ∼ N(γi , τ 2 ) be the prior, independent for each μi and assume γi and
τ 2 , along with σ 2 , known. Model (6.35) consists of a independent samples.
For known variance σ 2 the statistic y i is sufficient in each sample. Following
theorem gives the posterior distribution for each group.
Theorem 6.3 Given y i ∼ N(μi , σ 2 /ri ) with known σ 2 and the prior μi ∼
ri
N(γi , τ 2 ), where y i = j=1 yij /ri , i = 1, . . . , a. The posterior distribution
of μi |y i follows for each group i as
where
y i τ 2 + γi σ 2 /ri τ 2 σ 2 /ri
Ai = 2 2
and m2i = 2 . (6.37)
τ + σ /ri τ + σ 2 /ri
Denoting the amplitude measured on jth patient under ith condition as yij ,
the model can be stated as
yij = μi + ij , i = 1, 2, 3, j = 1, . . . , 16.
144 NORMAL LINEAR MODELS
π(μi)
π(μ1 | y1)
π(μ2 | y2)
yi
π(μi)
π(μ3 | y3)
1 2 3
i μi
Figure 6.5: Illustration for Example 6.6. Left: Bar plot of the Parkinson data. Right:
Prior and posterior distributions of μi .
The three sample means, y i , are 0.495, 0.575, 0.535, with the corresponding
variances as 0.144, 0.234 and 0.208. The left panel of Figure 6.5 shows the
basic statistics as bar plots.
For the analysis, we use μi ∼ N(0.5, 0.3) as prior and take ij ∼ N(0, 0.2).
This leads to the posterior distributions as N(0.496, 0.012), N(0.574, 0.012)
and N(0.534, 0.012), respectively for μ1 , μ2 and μ3 . The prior and posteriors
are plotted in the right panel of Figure 6.5. 2
The model for concentration measured on jth subject in ith group is stated
as
yij = μi + ij , i = 1, . . . , 4, j = 1, . . . , 7.
The sample means and variances of the four groups are, respectively, (10.857,
8.286, 6.143, 3.286) and (2.476, 2.571, 2.810, 3.238). The downward trend of
the averages across four groups and almost identical variances are also clear
from the bar plot of the data shown in the left panel of Figure 6.6.
Using μi ∼ N(5, 2.5) as prior for each group and ∼ N(0, 2.5), the posterior
distributions as obtained as N(10.125, 0.313), N(7.875, 0.313), N(6, 0.313) and
N(3.5, 0.313), respectively for μ1 , μ2 , μ3 and μ4 , and the same are plotted,
along with the prior, in the right panel of Figure 6.6. 2
BAYES LINEAR MODELS 145
π(μi)
π(μ1 | y1)
yi
π(μ2 | y2)
π(μi)
π(μ3 | y3)
π(μ4 | y4)
1 2 3 4
i μi
Figure 6.6: Example 6.7. Left: Bar plot of smokers data. Right: Prior and posterior
distributions of μi .
y|β, σ 2 ∼ Nn (Xβ, σ 2 Σ)
β|σ 2 ∼ Np (γ, σ 2 Γ) (6.39)
σ 2
∼ InvGamma(a/2, b/2).
Note that, β and σ 2 are not independent. The joint distribution of θ = (β, σ 2 )
is called normal-inverse-gamma (NIG) distribution . We introduce the distri-
bution in a general setting as follows.
146 NORMAL LINEAR MODELS
(X, λ) ∼ NIG(α, β, μ, Σ)
iff
X|λ ∼ Np (μ, λΣ) and λ ∼ InvGamma(α/2, β/2) (6.40)
with density
− p+α+2 1 T −1
f (X, λ) = c λ 2exp − β + (X − μ) Σ (X − μ) .
2λ
(6.41)
The constant c is given by
α
1 b2
c= α . (6.42)
(2π)p/2 |Σ|1/2 2 2 Γ( α2 )
Theorem 6.4 For the Bayes model (6.43), the joint posterior distribution
of (β, σ 2 ) is the normal-inverse-gamma (NIG) distribution,
with
a1 = a + n
b1 = b + (y − Xγ1 )T Σ−1 (y − Xγ1 ) + (γ − γ1 )T Γ−1 (γ − γ1 )
(6.45)
γ1 = Γ1 (XT Σ−1 y + Γ−1 γ)
Γ1 = (XT Σ−1 X + Γ−1 )−1 .
BAYES LINEAR MODELS 147
Proof: Consider prior in (6.43). Taking the likelihood function,
1 1 −1
(y|β, σ 2 ) = exp − (y − Xβ)Σ (y − Xβ) ,
(2π)n/2 (σ 2 )n/2 |Σ|1/2 2σ 2
with
Q = (y − Xβ)T Σ−1 (y − Xβ) + (β − γ)T Γ−1 (β − γ).
Applying Lemma 6.4, (6.25) delivers the statement.
2
The posterior moments in Theorem 6.4 include the inverse matrices of Σ and
Γ. In case σ 2 is given, two different representations of the posterior moments
follow from Theorem 6.2 and Corollary 6.1. The formulas in (6.45) are related
to Corollary 6.1. We derive the presentations of the posterior moments related
to Theorem 6.4 for θ = (β, σ 2 ) in the following corollary. One advantage of
the new formulas is that the inverses of Σ and Γ are not needed.
Corollary 6.2 For the Bayes model (6.43), the joint posterior distribution
of (β, σ 2 ) is the normal-inverse gamma (NIG) distribution,
with
a1 = a + n
b1 = b + (y − Xγ)T (Σ + XΓXT )−1 (y − Xγ)
(6.47)
γ1 = γ + ΓXT (Σ + XΓXT )−1 (y − Xγ)
Γ1 = Γ − ΓXT (Σ + XΓXT )−1 XΓ.
Proof: The formulas for γ1 and Γ1 in (6.45) are the same as in Corollary
6.1. We can apply Theorem 6.2. It remains to check the expression of b1 . Set
M = Σ + XΓXT . (6.48)
148 NORMAL LINEAR MODELS
−1
We have to show that Q = (y − Xγ) M T
(y − Xγ) = Q1 + Q2 , where
thus
Q2 = (y − Xγ)T M−1 XΓXT M−1 (y − Xγ). (6.50)
Using (6.49) and (6.48), we get
and
Q1 = (y − Xγ)T M−1 ΣM−1 (y − Xγ).
Applying (6.48) we obtain
Q = Q1 + Q2
= (y − Xγ)T M−1 (Σ + XΓXT )M−1 (y − Xγ)
= (y − Xγ)T M−1 (y − Xγ),
X ∼ tp (ν, μ, Σ),
β
X ∼ tp (α, μ, Σ) and λ ∼ InvGamma(α/2, β/2).
α
∞
Proof: To derive f (X), we need to integrate λ out of f (X) = f (X, λ)dλ,
0
where
−(p+α+2)/2 1
f (X, λ) ∝ λ exp − (Q + β) with Q = (X − μ)T Σ−1 (X − μ).
2λ
∞
−(p+α)/2
f (X) ∝ (Q + β) x(p+α)/2−1 exp(−x)dx.
0
Recall
∞ a−1that, for any real a > 0, the gamma function is defined as Γ(a) =
0
x exp(−x)dx; see e.g., Mathai and Haubold (2008). Thus, the integral
in f (X) is a gamma function with a = (p + α)/2, and we can write
1
f (X) ∝ (Q + β)−(p+α)/2 ∝ (1 + (X − μ)T Σ−1 (X − μ))−(p+α)/2 . (6.52)
β
Comparing this kernel with (6.51) delivers the statement. The marginal dis-
tribution of λ is given by the hierarchical set up of NIG.
2
150 NORMAL LINEAR MODELS
For a better overview, we summarize once more all results related to the con-
jugate prior set up in Theorem 6.5.
then
b
y ∼ tn (a, m, M) (6.56)
a
β|y, σ 2 ∼ Np (γ1 , σ 2 Γ1 ) (6.57)
b1
β|y ∼ tp (a1 , γ1 , Γ1 ) (6.58)
a1
σ 2 |y ∼ InvGamma(a1 /2, b1 /2). (6.59)
with
m = Xγ (6.60)
M = Σ + XΓXT (6.61)
γ1 = Γ1 (XT Σ−1 y + Γ−1 γ) (6.62)
−1
= γ + ΓX M T
(y − Xγ) (6.63)
−1 T −1 −1
Γ1 = (Γ +X Σ X) (6.64)
= Γ − ΓXT M−1 XΓ. (6.65)
a1 = a+n (6.66)
T −1 T −1
b1 = b + (y − Xγ1 ) Σ (y − Xγ1 ) + (γ − γ1 ) Γ (γ − γ1 )(6.67)
= b + (y − Xγ1 )T M−1 (y − Xγ1 ). (6.68)
= b + yT Σ−1 y + γ T Γ−1 γ − γ1T Γ−1
1 γ1 . (6.69)
Proof: The equivalent formulas (6.62) and (6.63) are given in Theorem 6.4
and Corollary 6.2. The same is true for (6.64) and (6.65). The three equivalent
presentations of b1 are given in Theorem 6.4, Corollary 6.2 and Lemma 6.4.
The t–distribution of β in (6.58) is a consequence of Lemma 6.5. It remains
to show (6.56). From Theorem 6.1 it follows for given σ 2 that
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
2 2 T
β γ σ Γ σ ΓX
⎝ ⎠ |σ 2 ∼ Nn+p ⎝⎝ ⎠ , ⎝ ⎠⎠ .
y m σ 2 XΓ σ2 M
BAYES LINEAR MODELS 151
This includes
y|σ 2 ∼ Nn (m, σ 2 M),
which, together with (6.55), implies
In the following example we consider again the simple linear regression model,
but now we include the intercept as parameter. Also in this case the formulas
can be written explicitly.
with
σ 2 ∼ InvGamma(a/2, b/2).
Then Σ = I2 , ⎛ ⎞ ⎛ ⎞
γa λ1 0
γ=⎝ ⎠ and Γ = ⎝ ⎠.
γb 0 λ2
We apply (6.62) and (6.64). It holds that
⎛ n ⎞
−1
n + λ x
XT X + Γ−1 = ⎝ i=1 i ⎠.
1
n n 2 −1
i=1 x i i=1 x i + λ 2
and
n
n
x2i = sxx + nx̄2 , with sxx = (xi − x̄)2
i=1 i=1
we obtain ⎛ ⎞
−1
1 ⎝sxx + nx̄2 + λ2 −nx̄
Γ1 = ⎠
d −nx̄ n + λ−1
1
with
d = (n + λ−1 −1 −1 2
1 )(sxx + λ2 ) + λ1 nx̄ . (6.70)
Further ⎛ ⎞
n −1
y + λ γ
XT y + Γ−1 γ = ⎝ ⎠
i=1 i 1 a
n −1
i=1 x i y i + λ 2 γ b
and
n
n
xi yi = sxy + nx̄ȳ, with sxy = (xi − x̄)(yi − ȳ)
i=1 i=1
so that
1
γ1,a = nsxx α + nλ−1 −1
2 αprior + λ1 γa + λ2
−1
d (6.71)
1
γ1,b = (n + λ−1 −1 −1
1 )(sxx β + λ2 γb ) + nx̄λ1 (ȳ − γa )
d
BAYES LINEAR MODELS 153
with
α = ȳ − β x̄
αprior = ȳ − γb x̄ (6.72)
sxy
β = . (6.73)
sxx
2
In Figure 6.7 the posteriors together with the prior are plotted. 2
y<-Agec; x<-Lengthc
# prior NIG(a,b,gamma,Gamma)
a<-10; b<-2*640
gamma<-0; Gamma<-100
## posterior NIG(a1,b1,gamma1,Gamma1)
a1<-a+n
Gamma1<-(Gamma^(-1)+sum(x^2))^(-1)
gamma1<-Gamma1*(sum(x*y)+Gamma^(-1)*gamma)
b1<-b+sum(y*y)+gamma^2*Gamma^(-1)-gamma1^2*Gamma1^(-1)
b1/a1*Gamma1
0.15
prior
prior
posterior posterior
0.010
0.10
0.005
0.05
0.000
0.00
50 100 150 200 250 300 350 400
−10 −5 0 5
variance
slope
Figure 6.7: Example 6.10 on life length data. Left: The broken line is the prior
InvGamma(5, 640) of the error variance, and the continuous line is the posterior
InvGamma(29, 4096). Right: The broken line is the prior density t1 (10, 0, 12800) of
the slope, and the continuous line is the posterior density t1 (58, −2.26, 2.29).
Under this assumption Jeffreys prior and the reference prior coincide, see
(3.51) and the discussion in Subsection 3.3.3. In Definition 3.4 Jeffreys prior
is defined as the square root of the determinant of Fisher information.
We begin by computing Fisher information.
where J(θ|x) is the matrix of the second derivatives, Hessian, of the log-
likelihood function. We have
with
∂l 1 T −1
= X Σ (y − Xβ)
∂β σ2
∂l n 1
= − 2+ (y − Xβ)T Σ−1 (y − Xβ).
∂σ 2 2σ 2(σ 2 )2
Differentiating again,
∂2l 1 T −1
= − X Σ X
∂ββ T σ2
∂2l n 1
= − 2 3 (y − Xβ)T Σ−1 (y − Xβ)
∂(σ 2 )2 2(σ 2 )2 (σ )
∂2l 1
= − 2 2 XT Σ−1 (y − Xβ)
∂β∂σ 2 (σ )
so that
⎛ ⎞
− σ12 XT Σ−1 X − σ14 XT Σ−1 (y − Xβ)
J(θ|y) = ⎝ ⎠.
1
σ 4 (y − Xβ)T XT Σ−1 n
2σ 4 − 1
σ 6 (y − Xβ)T Σ−1 (y − Xβ)
Corollary 6.3 For y|θ ∼ Nn (Xβ, σ 2 Σ), the Jeffreys prior of θ = (β, σ 2 )
is given as
1
π(θ) ∝ . (6.77)
(σ 2 )p/2+1
156 NORMAL LINEAR MODELS
Proof: As I(θ) is a block diagonal matrix, we get see (see Seber, 2008)
1 T −1 n 1
det(I(θ)) = det( X Σ X) · ∝ 2 p+2 , (6.78)
σ2 2(σ 2 )2 (σ )
We observe that the Jeffreys prior under (6.80) is different from that in Corol-
lary 6.3. In particular, it does not depend on p. In both cases, however, Jeffreys
prior reduces to an improper prior for θ = (β, σ 2 ). But the posterior belongs
to the conjugate family, as under the conjugate prior. This is stated in the fol-
lowing theorem. First recall that the maximum likelihood estimator in model
BAYES LINEAR MODELS 157
(6.74) is the generalized least-squares estimator
Theorem 6.7 Given y|θ ∼ Nn (Xβ, σ 2 Σ) with Σ known. Then, under Jef-
freys’ priors
π(β, σ 2 ) ∝ (σ 2 )−m
with 2m = p + 2 in (6.77) and m = 1 in (6.81). It follows that
and
b
β|y ∼ tp (am , βΣ , (XT Σ−1 X)−1 ) (6.84)
am
with
am = 2m + n − p − 2
b = (y − XβΣ )T Σ−1 (y − XβΣ )
(6.85)
γ = (XT Σ−1 X)−1 XT Σ−1 y = βΣ
Γ = (XT Σ−1 X)−1 .
Proof: We calculate
Since
we have
y − XβΣ = + X(β − βΣ )
= (In − X(XT Σ−1 X)−1 XT Σ−1 ) .
158 NORMAL LINEAR MODELS
The projection property of the generalized least-squares estimator implies that
the mixed term is zero:
1
n
sxy
Σ = In , β = , XT Σ−1 X = sxx , s2 = (yi − ȳ − β(xi − x̄))2 .
sxx n − 1 i=1
0.15
0.008
0.10
0.004
0.05
0.000
0.00
50 100 150 200 250 300 −10 −5 0 5
variance slope
In Example 6.3 we calculated β = −2.26, sxx = 61.65 and s2 = 147, 07. Figure
6.8 depicts the posterior distributions.
2
π(β1 | y)
π(β2 | y)
π(β3 | y)
π(β)
Figure 6.9: Example 6.12. Prior and posterior distributions of β1 , β2 , β3 for corn data
with Jeffreys prior.
yij = β0 + xi β1 + γj + ij , i = 1, . . . , 10, j = 1, 2.
γ = (γ1 , γ2 )T , Z = (I2 , . . . , I2 )T
Estimation of β
First we are only interested in estimating the regression parameter β. We
eliminate the random component γ in the model. Setting
ξ = Zγ + , and V = ZΔZT + Σ
The model (6.89) is called marginal model . The maximum likelihood estimator
of β follows as
50
● first visit
● second visit
45 ● ●
●
● ●
●
Haematocrit
40
●
●
●
35
30
64 66 68 70 72 74 76
Age
Figure 6.10: Illustration for the data set used in Example 6.13. Values under the
broken line indicate anemia. But the blood values are improved for most of patients.
Note that, the three patients with same age 70, are plotted as with age 69.5, 70, 70.5.
Prediction of γ
First we re-formulate the model (6.87) in a hierarchical setting
It holds that
1 T −1 T −1
h(β, γ) ∝ exp − 2 (y − Xβ − Zγ) Σ (y − Xβ − Zγ) + γ Δ γ .
2σ
XT WXT β = XT Wy
with
W = Σ−1 − Σ−1 ZT (ZT Σ−1 Z + Δ−1 )−1 ZT Σ−1 .
Applying Lemma 6.3 we obtain
Thus the solution of the equation system for β coincides with βV in (6.90)
and the prediction of γ is given by
It follows from
Note that, the joint distribution (6.97) explains the name marginal model for
(6.89). Lemma 6.1 implies
and derive the posterior distributions π(θ|y) under conjugate and non-
informative priors. We start with the conjugate prior πc , i.e.,
The Bayes model {P, πc } is a univariate linear model with conjugate prior.
We can apply Theorem 6.5 and obtain:
Theorem 6.8 For the marginal model set up {P, πc }, it holds that
and
b1
β|y ∼ tp (a1 , β1 , Γ1 ), (6.100)
a1
with
a1 = a+n (6.101)
−1
b1 = b + (y − Xβ1 ) M T
(y − Xβ1 ) (6.102)
β1 = β0 + ΓXT M−1 (y − Xβ0 ) (6.103)
−1
Γ1 = Γ − ΓX M T
XΓ (6.104)
M = V + XΓXT (6.105)
where m = 1 gives the prior, under the condition that location and scale
parameter are independent. Otherwise, 2m = p + 2 without this condition;
see Section 6.2.3. The Bayes model {P, πm } is a univariate linear model with
non-informative prior. We apply Theorem 6.7 and obtain:
LINEAR MIXED MODELS 165
Theorem 6.9 For the marginal model set up {P, πm } it holds that
and
b T −1
β|y ∼ tp (am , βV , X V X), (6.108)
am
where
βV = (XT V−1 X)−1 V−1 Xy
am = 2m + n − p − 2 (6.109)
−1
b = (y − XβV ) VT
(y − XβV ). (6.110)
We assume
θ ∼ NIG(a, b, β0 , Γ). (6.111)
The entire model set up can be formulated as
where β|σ 2 , and γ|σ 2 are assumed to be mutually independent. We see that
under Bayes model set up the random component γ and the parameter θ have
equivalent status. It also explains why often γ is called random parameter and
its distribution prior. Set the joint parameter as α = (β, γ), α0 = (β0 , 0) and
U = (X Z), Ψ = diag(Γ Δ). Then the model set up can also be written as
y|α, σ 2 ∼ Nn (Uα, σ 2 Σ)
(6.116)
(α, σ 2 ) ∼ NIG(a, b, α0 , Ψ).
166 NORMAL LINEAR MODELS
Following theorem is a consequence of Theorem 6.5, as it sums up the poste-
rior under (6.116).
and
b1
α|y ∼ tp+q (a1 , α1 , Ψ1 ), (6.118)
a1
with
a1 = a+n (6.119)
−1
b1 = b + (y − Uα1 ) M T
(y − Uα1 ) (6.120)
= b + y Σ y + α0T Ψ−1 α0 −
T −1
α1T Ψ−1
1 α1 (6.121)
α1 = α0 + ΨUT M−1 (y − Uα0 ) (6.122)
= Ψ1 (UT Σ−1 y + Ψ−1 α0 ) (6.123)
−1
Ψ1 = Ψ − ΨU M UΨ
T
(6.124)
= (Ψ−1 + UT Σ−1 U)−1 (6.125)
M = Σ + UΨUT (6.126)
where the component for β|y in α|y is the posterior distribution of β and
the component for γ|y is the predictive distribution for γ.
Note that the formulas for the posterior parameters can be presented in dif-
ferent ways as stated in Theorem 6.5. The presentation above is helpful for a
separation between β and γ. Especially in (6.120) it holds for α1T = (β1T , γ1T )
that
b1 = b + (y − Xβ1 − Zγ1 )T M−1 (y − Xβ1 − Zγ1 )
and in (6.124)
⎛ ⎞
Γ − ΓXT M−1 XΓ −ΓXT M−1 ZΔ
Ψ1 = ⎝ ⎠.
−ΔZT M−1 XΓ Δ − ΔZT M−1 ZΔ
Using the presentation (6.125) we obtain
⎛ ⎞
T −1 T −1
Γ + X Σ X X Σ Z
Ψ−1
1 =
⎝ ⎠. (6.127)
ZT Σ−1 X Δ + ZT Σ−1 Z
Further
M = Σ + XΓXT + ZΔZT = V + XΓXT
LINEAR MIXED MODELS 167
−1
β1 = β0 + ΓX M T
(y − Xβ0 )
and
γ1 = ΔZT M−1 y.
We see that β1 included in (6.122) coincides with β1 in (6.103).
The practically most commonly used special case of Theorem 6.10 is when
all covariance matrices are spherical and the design matrices are orthogonal.
Furthermore we consider σ 2 as known. We state the result as a corollary to
Theorem 6.10.
0 rγ−1 Iq
with ⎛ ⎞
XT y + rβ−1 β0
UT Σ−1 y + Ψ−1 α0 = ⎝ ⎠
ZT y
and obtain μβ|y and μγ|y .
2
For a better understanding of the formulas above we discuss a simple special
case.
β|y, σ 2 ∼ N(β1 , σ 2 τ1 )
with
β1 = τ1 (sxy + τ −1 β0 )
τ1 = (sxx + τ −1 )−1
and
γ|y, σ 2 ∼ N((n + λ−1 )−1 nȳ, σ 2 (n + λ−1 )−1 ).
From Theorem 6.10 with (6.121), it follows that
with
and ⎛ ⎛ ⎞ ⎛ ⎞⎞
4.86 9.330 7.751
γ|y ∼ t2 ⎝26, ⎝ ⎠,⎝ ⎠⎠ .
−2.23 7.751 9.330
2
# data
Age<-c(66,66,70,70,70,70,74,74,65,65,71,71,68,68,69,69,
64,64,70,70)
Haematocrit<-c(47.1,32.8,44.1,37,43.3,36.6,37.4,29.05,45.7,
39.8,46.05,37.8,42.1,36.05,38.25,30.5,43,36.65,37.8,30.6)
# joint design matrix
U<-matrix(rep(0,4*20),ncol=4)
U[,1]<-rep(1,20)
U[,2]<-Age
U[,3]<-c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0)
U[,4]<-c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
y<-Haematocrit
# prior: NIG(a0,b0,alpha0,Psi)
a0<-3*2
170 NORMAL LINEAR MODELS
b0<-10*4
b0/(a0-2) # prior expectation variance
(b0/2)^2/((a0/2-1)^2*(a0/2-2))# prior variance of variance
Psi<-diag(4)
alpha0<-c(0,0,0,0)
# posterior: NIG(a1,b1,alpha1,Psi1)
UU<-t(U)%*%U
Psi1<-solve(solve(Psi)+UU)
alpha1<-Psi1%*%(t(U)%*%y+solve(Psi)%*%alpha0)
a1<-a0+length(Age)
b1<-b0+y%*%y+alpha0%*%solve(Psi)%*%alpha0
-alpha1%*%solve(Psi1)%*%alpha1
# t-distribution
s<-as.numeric(b1/a1)
s*Psi1
Y = XB + E, (6.131)
y i = BT x i + i , i ∼ Nd (0, Σ) i = 1, . . . , n (6.132)
are i.i.d. This implies that the elements of the response matrix are normally
distributed, where the elements in different columns are correlated, while ele-
ments in different rows are independent.
This implies that, E ∼ MNn,d (O, In , Σ), where In is the identity matrix.
Correspondingly, then,
with density
−nd/2 −n/2 1 −1
f (Y|B, Σ) = (2π) |Σ| exp − tr Σ (Y − XB) (Y − XB) . T
2
(6.135)
The multivariate model (6.131) can be considered columnwise as a collection
of d correlated univariate models. Let y(s) , β(s) , and (s) be the sth columns of
the response matrix Y, parameter matrix B, and error matrix E, respectively.
Then
Example 6.16 (Corn plants) We analyze complete corn data, using both
dependent variables; see Example 6.1 on page 127. The three independent
variables are the same as used in Examples 6.5 and 6.12, where the analysis
with one dependent variable was discussed.
With n = 17 independent vectors, where d = 2 and p = 4, we have Y and X
as 17 × 2 and 17 × 4 matrices, respectively. 2
We continue with analysis of the statistical model (6.137) with unknown pa-
rameter θ = (B, Σ). Writing (6.135) as likelihood function the maximum like-
lihood estimation problem
n 1
arg max − ln |Σ| − tr Σ−1 (Y − XB)T (Y − XB) ,
B,Σ 2 2
nΣ ∼ Wd (n − p, Σ).
where ⎛ ⎞−1
k
ν+1−j ⎠
c W = ⎝2
kν k(k−1)
2 π 4 Γ( ) . (6.142)
j=1
2
(Z, W) ∼ NIW(ν, M, U, V)
iff
Z|W ∼ MNm,k (M, U, W)
W ∼ IWk (ν, V)
Z ∼ tm,k (ν, M, U, V)
Then
Z ∼ tm,k (ν − k + 1, M, U, V).
Note that
− ν+m+k+1 ν+m 1
K(W) ∝ |W| 2 |A| 2 exp − tr W−1 A
2
is kernel of IW(ν + m, A). We have
ν+m
f (Z, W) ∝ |U|− 2 |V| 2 |A|−
k ν
2 K(W).
Thus we can integrate out W and obtain
ν+m
f (Z) ∝ |U|− 2 |V| 2 |A|−
k ν
2
ν+m
∝ |U|− 2 |V| 2 |V + (Z − M) U−1 (Z − M) |−
k ν T 2
ν+m
∝ |U|− 2 |V|− 2 |Ik + V−1 (Z − M) U−1 (Z − M) |−
k m T 2 .
176 NORMAL LINEAR MODELS
This is the kernel of tm,k (ν − k + 1, U, V).
2
We set the conjugate prior of θ = (B, Σ) as
(B, Σ) ∼ NIW(ν0 , B0 , C0 , Σ0 ), (6.150)
where B0 is first guess for the matrix of regression parameters B, C0 says
something about the accuracy of the guess, Σ0 includes subjective information
on the correlation between different univariate models, and ν0 measures how
many observations the guesses are based on. The density of normal-inverse-
Wishart (NIW) distribution, NIW(ν0 , B0 , C0 , Σ0 ), is:
ν0 +p+d+1 1
π(B, Σ) ∝ |Σ0 |− 2 exp − tr Σ−1 (Σ0 + (B − B0 )T C−10 (B − B 0 ))
2
(6.151)
The entire Bayes model setup can be stated as
with ν1 = ν0 + n and
B1 = C1 (XT Y + C−1
0 B0 )
−1 −1
C1 = C0 + XT X
Σ1 = Σ0 + (Y − XB1 )T (Y − XB1 ) + (B0 − B1 )T C−1
0 (B0 − B1 ).
Proof: Using π(θ|x) ∝ π(θ)(θ|x), where now the observations x are denoted
by Y generated by the multivariate linear model in (6.131), we obtain
ν +p+d+1+n
− 0 1 −1
π(θ|Y) ∝ |Σ| 2 exp − tr Σ term1
2
with
term1 = Σ0 + (Y − XB)T (Y − XB) + (B − B0 )T C−1
0 (B − B0 ).
BAYES MULTIVARIATE LINEAR MODELS 177
Completing the squares for B gives
term1 = Σ1 + (B − B1 )T C−1
1 (B − B1 ),
where
Σ1 = Σ0 + (Y − XB1 )T (Y − XB1 ) + (B0 − B1 )T C−1
0 (B0 − B1 )
with
C1 = (C−1 T
0 + X X)
−1
and B1 = C1 (XT XB + C−1
0 B0 ).
Theorem 6.12 For the Bayes model {P, πJeff }, with P given in (6.137)
and πJeff as defined in (6.156)
−1
(B, Σ)|Y ∼ NIW(n, B, XT X , S)
T −1
B|Y ∼ tp,d (n − d + 1, B, X X , S)
with
B = (XT X)−1 XT Y
S = (Y − XB)T (Y − XB).
Thus ⎛ ⎞
64.440 26.823
⎜ ⎟ ⎛ ⎞
⎜ ⎟
⎜ 1.301 0.090 ⎟ 2087.17 −349.79
B=⎜
⎜
⎟, S = ⎝
⎟
⎠.
⎜−0.130 1.189 ⎟ −349.79 3693.80
⎝ ⎠
0.023 0.095
We apply Theorem 6.12 with Jeffreys prior, π(Σ) ∝ |Σ|3/2 , and assuming
Y ∼ MN17,2 (XB, I17 , Σ), the posterior of B is the multivariate t distribution,
t4,2 (16, B, (XT X)−1 , S), the components of which are given above.
For the conjugate prior, we assume ν0 = 5, σ02 = 500, with
B|Σ ∼ MN4,2 (0, I4 , Σ) and Σ ∼ IW(ν0 , σ02 I2 ), i.e., we set the
prior NIW(ν0 , 0, I4 , σ02 I2 ). From Theorem 6.11, the joint posterior is
NIW(ν1 , B1 , C1 , Σ1 ), where ν1 = 22, and the moments are computed as
⎛ ⎞
48367.52 −49.42 −602.68 −168.71
⎜ ⎟
⎜ ⎟
⎜ −49.42 79.06 −24.73 1.65 ⎟
C1 = 10−5 ⎜ ⎜
⎟
⎟
⎜ −602.68 −24.73 36.75 −5.08 ⎟
⎝ ⎠
−168.71 1.65 −5.08 3.12
⎛ ⎞
33.27 13.86
⎜ ⎟ ⎛ ⎞
⎜ ⎟
⎜ 1.33 0.10 ⎟ 4732.92 543.10
B1 = ⎜
⎜
⎟ , Σ1 = ⎝
⎟
⎠.
⎜ 0.26 1.35 ⎟ 543.10 4567.11
⎝ ⎠
0.13 0.14
2
180 NORMAL LINEAR MODELS
yi = β0 + xi β1 + zi β2 + εi , i = 1, . . . , n
n n n
with orthogonal design,
i.e., i=1 xi = 0, i=1 zi = 0, i=1 xi zi = 0,
n 2 n 2
x
i=1 i = n and i=1 iz = n where ε i are i.i.d. normally distributed with
expectation zero and variance σ 2 = 1. The unknown three dimensional pa-
rameter β = (β0 , β1 , β2 )T is normally distributed with mean μ = (1, 1, 1)T
and covariance matrix ⎡ ⎤
1 0.5 0
⎢ ⎥
⎢ ⎥
Σ = ⎢0.5 1 0⎥ .
⎣ ⎦
0 0 1
(a) Write the model equation in matrix form.
(b) Determine the posterior distribution of β. Specify the expressions in
(6.23) and (6.24).
2. Consider the simple linear regression model
Estimation
We consider the Bayes model {P, π}, where P = {Pθ : θ ∈ Θ} and π is the
known prior distribution of θ over Θ. The data generating distribution Pθ is
the conditional distribution of X given θ and it is known up to θ.
Having observed x for X ∼ Pθ we want to determine the underlying parameter
θ – exploiting the model assumption {P, π}.
Applying the Bayesian inference principle means that the posterior distribu-
tion takes over the role of the likelihood; see Chapter 2. All information we
have about θ is included in the posterior. Depending on different estimation
strategies we choose as estimator the
• mode,
• expectation, or
• median
of the posterior distribution. Before we discuss each method in more detail,
we illustrate it by three examples. In the first example all estimators coincide.
4
prior = beta(2,3)
posterior = beta(3,10)
3
● mle
● mode
2
expectation
median
1
● ●
0
theta
Figure 7.1: Example 7.2. All estimates are situated in the center of the posterior
distribution: mle= 0.123, mode= 0.18, expectation= 0.23, median= 0.21.
The Bayesian estimation approach has at least two advantages. One is the ele-
gant handling of the nuisance parameters by exploring the marginal posterior
of the parameter of interest, and the other is that the posterior distribution
delivers information about the precision of the estimator. We demonstrate
both properties in the following example.
MAXIMUM A POSTERIORI (MAP) ESTIMATOR 185
Example 7.3 (Normal i.i.d. sample and conjugate prior)
We have an i.i.d. sample X = (X1 , . . . , Xn ) from N(μ, σ 2 ) with unknown
expectation μ and unknown variance σ 2 , thus θ = (μ, σ 2 ). The parame-
ter of interest is μ; the variance is the nuisance parameter. In Chapter 6
the conjugate prior for the linear model is derived. This example is a spe-
cial case with p = 1, where all design points equal 1. The conjugate family
is the normal-inverse-gamma distribution, given in (6.41). We set the prior
NIG(α0 , β0 , μ0 , σ02 ). Recall, the prior of μ given σ 2 is N(μ0 , σ 2 σ02 ), the prior of
σ 2 is the inverse-gamma distribution InvGamma( α20 , β20 ). The prior parameters
α0 , β0 , μ0 and σ02 are known. We have
α +3
2 − 02 1 1
π(θ) ∝ (σ ) exp − 2 (β0 + 2 (μ − μ0 ) ) . 2
2σ σ0
We give the main steps of deriving the posterior, as follows,
π(θ|x) ∝ π(θ)(θ|x)
1
n
∝ π(θ)(σ 2 )− 2 exp −
n
(xi − μ)2 .
2 σ 2 i=1
Note that, it is enough to know the kernel of π(θ|x) ∝ π(θ)(θ|x). The inte-
gration step is not required. Otherwise, for unknown distributions we have to
carry out a numerical maximization procedure. In case the posterior belongs
to a known distribution family, we can apply mode’s formula.
We continue now with two examples, the first of which uses mode’s formula.
α−1
Mode = , α ≥ 1, (7.5)
β
and we obtain, for n > 1,
2α0 + n − 2
θMAP (x) = n .
2β0 + i=1 x2i
with
and
τi 1 1
ωi (x) ∝ ωi exp − 2 m2i + 2 mi (x)2
τi,p 2τi 2τi,p
with ω1 (x) + ω2 (x) = 1. Figure 3.10 shows the prior, the posterior, and
the observation. By searching for the maximum of the y-values we obtain
θMAP (4) = 5.33. 2
y = Xβ + . (7.6)
We obtain for
βMAP (y) ∈ arg maxp π(β|y) (7.9)
β∈R
that it is equivalent to
It means, choosing the right Bayes model, the Bayes MAP estimator coincides
with a regularized estimator in regression. It is a bridge between Bayes and
frequentist approach. There is one more bridge. The optimization problem in
(7.10) has an equivalent formulation that there exists a constant k > 0 such
that
βMAP (y) ∈ arg min L(y − Xβ). (7.11)
{β:pen(β)≤k}
This means, choosing the right Bayes model, the Bayes MAP estimator co-
incides with a restricted estimator in regression. For short overview of
regularized and restricted estimators, we recommend Zwanzig and Mahjani
(2020, Section 6.4) and Hastie et al. (2015).
Ridge
We start with normal linear model with known covariance matrix Σ and con-
jugate prior:
1 T −1
(β|y) ∝ exp − (y − Xβ) Σ (y − Xβ)
2
(7.12)
1 T −1
π(β) ∝ exp − (β − γ) Γ (β − γ)
2
190 ESTIMATION
The posterior is N(μβ|y , Σβ|y ) given in Corollary 6.1 as
1
π(β|y) ∝ exp − (β − μβ|y )T Σ−1 β|y (β − μ β|y )
2
(7.13)
μβ|y = Σβ|y (XT Σ−1 y + Γ−1 γ)
Σβ|y = (XT Σ−1 X + Γ−1 )−1 .
The posterior is symmetric around μβ|y , such that mode and expectation equal
μβ|y . Applying (6.18) for μβ|y , we obtain
Lasso
Suppose a normal linear model with covariance matrix Σ = σ 2 I. The com-
ponents of β are independent and Laplace distributed La(m, b) with location
2
parameter m = 0 and scale parameter b = 2σλ , so that
1
(β|y) ∝ exp − 2 (y − Xβ) (y − Xβ)
T
2σ
⎛ ⎞
p (7.15)
λ
π(β) ∝ exp ⎝− 2 |βj |⎠ .
2σ j=1
The posterior is
⎛ ⎛ ⎞⎞
1 p
π(β|y) ∝ exp ⎝− 2 ⎝(y − Xβ)T (y − Xβ) + λ |βj |⎠⎠ . (7.16)
2σ j=1
Zou and Hastie (2005) propose the following compromise between Lasso and
ridge estimators.
MAXIMUM A POSTERIORI (MAP) ESTIMATOR 191
Elastic Net
Consider a normal linear model with covariance matrix Σ = σ 2 I. The compo-
nents of β are i.i.d. from a prior distribution given by a compromise between
σ2 σ2
N(0, λ(1−α) ) and Laplace La(0, λα 2 ) prior. For α = 1, it is the Laplace prior
2 2
La(0, σλ ) and for α = 0, it is the normal prior N(0, σλ ); all other values
α ∈ (0, 1) lead to a compromise. It gives
1
(β|y) ∝ exp − 2 (y − Xβ)T (y − Xβ)
2σ
⎛ ⎛ ⎞⎞
p p (7.18)
⎝ λ ⎝ ⎠ ⎠
π(β) ∝ exp − 2 (1 − α) 2
βj + α |βj | .
2σ j=1 j=1
Note that, the regularized estimators depend strongly on the weight λ assigned
to the penalty term. There exists an extensive literature on adaptive methods
for choosing the tuning parameter λ and respectively α, but this is beyond
the scope of this book. In the Bayesian context the tuning parameters are
hyperparameters of the prior. We refer to Chapter 3 where different proposals
for the prior choice are presented.
yi = β0 + β1 xi + β3 x2i + β4 x4i + εi , i = 1, . . . , n.
A small data set is generated with 9 equidistant design points between 0 and
4; εi ∼ N(0, 0.52 ) i.i.d. and the true regression function
Using R the lasso estimate is calculated for λ = 0.012; the ridge estima-
tor for λ = 0.05; and the elastic net estimator for α = 0.5 and λ = 0.02.
192 ESTIMATION
The R-packages also include methods for an adaptive choice of the tuning
parameters. Here for illustrative purposes the tuning parameters are chosen
arbitrarily. Figure 7.3 shows different fitted polynomials. 2
5
4
● ●
3
● ● ●
2
●
1
true ●
lse
0
lasso ●
ridge ●
−1
net
−2
0 1 2 3 4
Figure 7.3: Example 7.6. Since the sample size n = 9 is small with respect to the
number of parameters p = 5 the least-squares method overfits the data. Different
regularized estimators correct the overfit.
The subindex is related to the loss functions. The estimator θL2 obeys opti-
mality properties with respect to L2 loss and θL1 with respect to L1 loss. We
have already presented several examples related to these estimates. Besides
the introductory Examples 7.1, 7.2 and 7.3, check Examples 4.6, 4.7 and 4.8
for θL2 and Example 4.9 for θL1 .
In the following table priors and corresponding posterior expectations are
collected related to some popular one parameter exponential families with pa-
rameter θ, assuming all other parameters known. The table is partly taken
from Robert (2001), where posterior distributions are given for single x.
194 ESTIMATION
Poisson Gamma
α+x
Poi(θ) Gamma(α, β) β+1
Gamma Gamma
α+ν
Gamma(ν, θ) Gamma(α, β) β+x
Binomial Beta
α+x
Bin(n, θ) Beta(α, β) β+n−x
Multinomial Dirichlet
Multk (θ1 , . . . , θk ) Dir(α1 , . . . , αk ) αi +xi
j αj +n
Normal Gamma
N μ, θ1 Gamma(α, β) 2α+1
2β+(x−μ)2
Normal InvGamma
2β+(μ−x)2
N (μ, θ) InvGamma(α, β) 2α−1
Recall that for symmetric single mode posteriors both estimators in (7.20)
coincide. In order to illustrate a posterior with two local maxima we come
back to the classroom Example 3.7 now with two more controversial and more
equally weighted weather experts.
where φ(μ,σ2 ) denotes the density of N(μ, σ 2 ). Applying (3.24) we obtain for
x=1
π(θ|x) = 0.34 φ(−2,3.33) (θ) + 0.66 φ(4.89,2.22) (θ).
BAYES RULES 195
prior
0.15
posterior
0.10
● expectation
median
0.05
0.00
−15 −10 −5 0 5 10 15
theta
The posterior is more concentrated, but it is still a clear mixture of two com-
ponents, as seen in Figure 7.4. The estimate
θL1 ≈ 3.81
under both Bayes models, {P, πc }, with conjugate prior, and {P, πJeff }, with
non-informative prior. The joint posterior distribution is a normal-inverse-
gamma distribution,
θ|y ∼ NIG(a, b, γ, Γ),
196 ESTIMATION
see Theorems 6.5 and 6.7, respectively. Under the linear mixed model (6.87) we
also obtain a posterior belonging to the normal-inverse-gamma distributions,
see Theorems 6.8, 6.9 and 6.10.
This implies the marginal distributions
b
β|y ∼ tp (a, γ, Γ) and σ 2 ∼ InvGamma(a/2, b/2).
a
In case of known σ 2 we use
β|(y, σ 2 ) ∼ Np (γ, Γ).
The t distribution and the normal distribution are symmetric around the
location parameter. We have
βL2 (y) = βL1 (y) = γ.
The inverse-gamma distribution is not symmetric. We apply the formula for
expectation and obtain
b
σ 2 = σL2 2 (y) = .
a−2
Unfortunately, the median is not given in a closed form. In case, the L1 -
estimate of the variance is of main interest, numerical methods are needed.
There is no difference in the estimate of β wether σ 2 is known or unknown.
But its precision is different. It holds that
Cov(β|y, σ 2 ) = σ 2 Γ
and
b
Cov(β|y) = Γ = σ 2 Γ.
a−2
Now let us compare the estimates under a simple linear model with intercept.
we obtain
αL2 = 86.69, βL2 = −2.13.
Alternatively, assuming
The priors differ only in the expected value of the intercept. Also in this study
we get a negative trend against the popular belief. We will revisit these results
later. 2
Y = XB + E, E ∼ MNn,d (O, In , Σ)
2
CREDIBLE SETS 199
7.3 Credible Sets
In this section we deal with Bayes confidence regions. We assume the Bayes
model {P, π}, where P = {Pθ : θ ∈ Θ} and π is the known prior distribution
of θ over Θ. Having observed x for X ∼ Pθ we want to determine a region
Cx ∈ Θ such that the underlying parameter θ ∈ Cx . According to the Bayesian
inference principle that all information is included in the posterior π(θ|x), we
define the set Cx as following, where HPD stands for highest posterior
density.
Pπ (Cx |x) ≥ 1 − α.
In case the posterior is a continues distribution with density π(θ|x) the defi-
nition simplifies to
Cx = {θ : π(θ|x) ≥ kα } with Pπ (π(θ|x) ≥ kα |x) = 1 − α. (7.24)
We illustrate it by the following examples.
0.30
0.30
0.25
0.25
0.20
0.20
posterior
posterior
0.15
0.15
0.10
0.10
0.05
0.05
0.00
0.00
[ ] [ ]
0 2 4 6 8 10 0 2 4 6 8 10
theta theta
Figure 7.5: Example 7.25. Left: HPD interval for α = 0.5 with interval length 5.88.
Right: Credible interval for α = 0.5, which is not HPD. The interval length is 7.12.
Cx = {θ : μ1 − z1− α2 σ1 ≤ θ ≤ μ1 + z1− α2 σ1 },
which has the same structure as the frequentist confidence interval, but now
around the Bayes estimate. 2
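For a normal posterior the HPD interval above is obtained directly from normal quantiles; a minimal R sketch, with illustrative values for μ1, σ1 and α (not from the text):

mu1<-2; sigma1<-1.5; alpha<-0.1          # illustrative posterior parameters
z<-qnorm(1-alpha/2)
c(mu1-z*sigma1, mu1+z*sigma1)            # HPD (1-alpha)-credible interval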
The density is unimodal but skewed. The HPD α-credible interval is given by
{θ : π(θ|x) ≥ kα} = {θ : (1/θ)^{α1+1} exp(−β1/θ) ≥ cα}    (7.27)
Figure 7.6: Example 7.14. Left: HPD interval for α = 0.1 with interval length 2.46.
The right tail probability is around 0.097. Right: Credible interval for α = 0.1, which
is equal–tailed. The interval length is 3.19.
library(invgamma)
a<-3; b<-3 # posterior parameter
aa<-seq(0,0.1,0.000001) # confidence levels on the lower side
q1<-qinvgamma(aa,a,b) # quantile lower bound
q2<-qinvgamma(aa+0.9,a,b) # quantile upper bound
plot(aa,dinvgamma(q1,a,b),"l") # see Figure
lines(aa,dinvgamma(q2,a,b))
# Simple search algorithm gives the crossing point at 0.078
k<-0.078; lines(aa,rep(k,length(aa)))
# Simple searching algorithm gives approximative bounds
L<-0.30155; U<-2.765
# such that k=dinvgamma(L,a,b) and k=dinvgamma(U,a,b).
Unfortunately in case of multi-modal densities, the interpretation of HPD
credible intervals is not so convincing. We demonstrate it with the help of
the classroom example (Example 3.7, page 37) about the two contradicting
weather experts.
Recall Example 3.15 on page 48. Continuing with Example 7.7, the poste-
rior is a normal mixture distribution shown in Figure 7.4. Depending on the
confidence level α, it may happen that the α-credible region consists of two
separate intervals as demonstrated in Figure 7.8. 2
Figure 7.8: Example 7.15. Left: HPD interval for α = 0.05. Right: HPD interval for
α = 0.1.
The interpretation of HPD credible regions can be complicated for discrete
posterior distributions, i.e., for the parameter space Θ ⊆ Z. In this case the
HPD α-credible regions do not have to be unique. We discuss it with the help
of the lion example on page 7.
               θ1     θ2     θ3
π(θ|x = 1)     0.1    0.1    0.8
y = Xβ + ε,  ε ∼ Nn(0, σ²In),  θ = (β, σ²).

β|y, σ² ∼ Np(γ1, σ²Γ1),
β|y ∼ tp(a1, γ1, (b1/a1)Γ1),    (7.28)
σ²|y ∼ InvGamma(a1/2, b1/2).
Figure 7.9: Example 7.17. The larger ellipse is the HPD credible region for α = 0.05;
the smaller is for α = 0.1.
The credible region for (α, β) is an ellipse given in (7.32) where the F–quantiles
are F2,a1 (0.9) = 2.39 and F2,a1 (0.95) = 3.15, illustrated in Figure 7.9. 2
The best choice, with minimal volume, is the set with highest predictive
probability:

∫_{Xpred} π(z|x) dz = 1 − α   and   π(z|x) ≥ π(z′|x) for z ∈ Xpred, z′ ∉ Xpred.    (7.36)
The main problem is now to derive the predictive distribution. In this sec-
tion we present several cases, where it is possible to derive explicit formulas.
Otherwise computer-intensive methods, explained in Chapter 9, can be used.
In the following example we have an i.i.d. sample and want to predict the next
observation.
and for θ = λ

ℓ(θ|x) ∝ θ^{Σ_{i=1}^n x_i} exp(−nθ).    (7.38)
Assume the conjugate prior Gamma(α, β)
π(θ) = (β^α/Γ(α)) θ^{α−1} exp(−βθ),    (7.39)
where α and β are known; α/β stands for a guess of θ and β for the sample
size this guess is based on. We obtain
π(θ|x) ∝ π(θ) ℓ(θ|x)
        ∝ θ^{α−1} exp(−βθ) θ^{Σ_{i=1}^n x_i} exp(−nθ)    (7.40)
        ∝ θ^{α+Σ_{i=1}^n x_i − 1} exp(−(β + n)θ),

hence the posterior is Gamma(α + Σ_{i=1}^n x_i, β + n). Summarizing, for an i.i.d.
Poisson sample we have

θ ∼ Gamma(α, β)   and   θ|x ∼ Gamma(α + Σ_{i=1}^n x_i, β + n);    (7.41)
see Figure 7.10. The Bayes estimator of θ is the expectation of the posterior,
θ̂_{L2} = (α + Σ_{i=1}^n x_i)/(β + n).
For the predictive distribution, set s_n = Σ_{i=1}^n x_i. We have X_f ∼ Poi(θ),
independent of x, and

P(X_f = k|x) = ∫_Θ q(k|θ, x) π(θ|x) dθ
             = ∫_0^∞ (θ^k/k!) exp(−θ) ((β + n)^{α+s_n}/Γ(α + s_n)) θ^{α+s_n−1} exp(−(β + n)θ) dθ
             = ((β + n)^{α+s_n}/(Γ(α + s_n) k!)) ∫_0^∞ θ^{α+s_n+k−1} exp(−(β + n + 1)θ) dθ.    (7.42)
Figure 7.10: Example 7.18. Left: Likelihood function of an i.i.d Poisson sample with
n = 5, prior Gamma(3, 2) and the posterior Gamma(28, 7). Right: The predictive
distribution NB(28, 0.875). The mode is 3. The predictive region {1, 2, 3, 4, 5, 6, 7, 8}
has the level 0.946.
x̂_f(x) = (1/(β + n)) (α + Σ_{i=1}^n x_i),    (7.45)
so that Bayes predictor and Bayes estimator coincide. This is no surprise, since
the parameter θ = λ is the expectation of Poi(λ). In general, the expectation
is not an integer. We recommend instead the mode of the predictive distribution.
The mode of the generalized negative binomial distribution NB(r, p) is the largest
integer less than or equal to (1 − p)(r − 1)/p, for r > 1. It is illustrated in Figure
7.10. 2
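The predictive distribution in Figure 7.10 can be reproduced with R's dnbinom, whose (size, prob) parameterization matches NB(r, p) here; a short sketch:

r<-28; p<-0.875                          # predictive NB(28, 0.875) from Figure 7.10
k<-0:12
round(dnbinom(k,size=r,prob=p),3)        # predictive probabilities
k[which.max(dnbinom(k,size=r,prob=p))]   # mode, equals 3
sum(dnbinom(1:8,size=r,prob=p))          # level of {1,...,8}, cf. the value 0.946 in Figure 7.10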
The next example is related to a dynamic model where the future observation
depends on the past.
Var(Xt|θ) = ρ² Var(Xt−1|θ) + σ² = ((1 − ρ^{2t})/(1 − ρ²)) σ²,   |ρ| < 1.    (7.47)
We assume the conjugate prior NIG(a0 , b0 , γ0 , κ0 ), so that
ρ|σ² ∼ N(γ0, σ²κ0)   and   σ² ∼ InvGamma(a0/2, b0/2).    (7.48)
We derive an iterative formula for the posterior by setting the prior equal to
the previous posterior: see Section 2.3.1. Set T = 1. Applying Theorem 6.5 we
obtain the posterior NIG(a1 , b1 , γ1 , κ1 ), i.e.,
ρ|σ², x1 ∼ N(γ1, σ²κ1)   and   σ²|x1 ∼ InvGamma(a1/2, b1/2)

with

a1 = a0 + 1,
b1 = b0 + (x1 − γ0 x0)² (1 + κ0 x0²)^{−1},
γ1 = κ1 (x0 x1 + γ0/κ0),                         (7.49)
κ1 = (x0² + 1/κ0)^{−1}.
Note that, for x0 = 0 and γ0 = 0 we have
a1 = a0 + 1, b1 = b0 + x21 , γ1 = 0, and κ1 = κ0 . (7.50)
Set T = 2. Applying Theorem 6.5 we obtain the posterior
θ|(x0 , x1 , x2 ) ∼ NIG(a2 , b2 , γ2 , κ2 )
with

a2 = a1 + 1,
b2 = b1 + (x2 − γ1 x1)² (1 + κ1 x1²)^{−1},
γ2 = κ2 (x1 x2 + γ1/κ1),                         (7.51)
κ2 = (x1² + 1/κ1)^{−1}.
Note that, also for γ1 = 0 we have γ2 ≠ 0 a.s. In general we have the iterative
formula for the posterior

θ|(x0, . . . , xT) ∼ NIG(aT, bT, γT, κT)    (7.52)

with

aT = aT−1 + 1,
bT = bT−1 + (xT − γT−1 xT−1)² (1 + κT−1 xT−1²)^{−1},
γT = κT (xT−1 xT + γT−1/κT−1),                       (7.53)
κT = (xT−1² + 1/κT−1)^{−1}.
We are interested in the distribution of XT +1 given xT = (x0 , . . . , xT ). The
posterior θ|xT ∼ NIG(aT , bT , γT , κT ) and (7.46) imply
ρ|σ 2 , xT ∼ N(γT , σ 2 κT )
(7.54)
XT +1 |(ρ, σ 2 , xT ) ∼ N(ρxT , σ 2 ).
Further, we have
σ²|xT ∼ InvGamma(aT/2, bT/2).

Applying Lemma 6.5, we obtain the predictive distribution

XT+1|xT ∼ t1(aT, γT xT, (bT/aT)(κT xT² + 1)).    (7.56)
Prediction regions are of the form
{x : (aT/bT)(x − γT xT)² (κT xT² + 1)^{−1} ≤ const}.    (7.57)
Recall that, for X ∼ t1(ν, μ, σ²), it holds that (X − μ)/σ ∼ tν,    (7.58)

where tν is Student's t-distribution with ν degrees of freedom. Setting tν,1−α/2
for the (1 − α/2)-quantile, we obtain the prediction interval

[γT xT − t_{aT,1−α/2} √((bT/aT)(κT xT² + 1)),  γT xT + t_{aT,1−α/2} √((bT/aT)(κT xT² + 1))].    (7.59)
2
y = Xβ + ε,  ε ∼ Nn(0, σ²In),  θ = (β, σ²),
β|y, σ² ∼ Np(γ1, σ²Γ1),                          (7.60)
σ²|y ∼ InvGamma(a1/2, b1/2).
We are interested in the prediction of q future observations generated by
yf = Zβ + ε, ε ∼ Nq (0, σ 2 Σf ), (7.61)
yf |(β, σ 2 , y) ∼ Nq (Zβ, σ 2 Σf )
(7.62)
β|y, σ 2 ∼ Np (γ1 , σ 2 Γ1 ).
Further, we have
σ 2 |y ∼ InvGamma(a1 /2, b1 /2).
Applying Lemma 6.5, we obtain the predictive distribution
yf|y ∼ tq(a1, Zγ1, (b1/a1)(ZΓ1Zᵀ + Σf)).    (7.64)
The multivariate t-distribution tq (ν, μ, Σ) belongs to the elliptical class. The
level sets are ellipsoids around the location μ; the matrix Σ gives the shape of
the ellipsoid. Using (7.31) the prediction region is
{z : (a1/(q b1)) (z − Zγ1)ᵀ (ZΓ1Zᵀ + Σf)^{−1} (z − Zγ1) ≤ Fq,a1(1 − α)}    (7.65)
yf|y ∼ t1(a1, ŷf, σf²),  with  ŷf = Zγ1,  σf² = (b1/a1)(ZΓ1Zᵀ + Σf),    (7.66)

and

(yf − ŷf)/σf ∼ t_{a1},    (7.67)

where t_{a1} is the t-distribution with a1 degrees of freedom. Let t_{a1}(1 − α) be
the quantile of t_{a1}; then we obtain the prediction interval

[ŷf − t_{a1}(1 − α/2) σf,  ŷf + t_{a1}(1 − α/2) σf]    (7.68)
and set the prior NIG(a0 , b0 , γ0 , Γ0 ) with a0 = 10, b0 = 1, γ0 = (0, 0)T , and
Γ0 = 2I2 . The posterior is NIG(a1 , b1 , γ1 , Γ1 ) with a1 = 31, b1 = 7.12, γ1 =
(0.903, 0.869)ᵀ, and

Γ1 = [ 0.122   ≈0
        ≈0    0.180 ].
We set α = 0.1, then the quantile is ta1 ≈ 1.70. Applying (7.68) the prediction
interval is calculated at xf = 1.1 as [1.08, 3.01] and at xf = 1.4 as [1.84, 4.10].
Figure 7.11 illustrates the example. 2
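A short R sketch reproducing the interval (7.68) with the posterior values above; the regression vector zf = (xf, xf²) and Σf = 1 are assumptions inferred from the quoted numbers, not stated explicitly here:

a1<-31; b1<-7.12
g1<-c(0.903,0.869); G1<-diag(c(0.122,0.180))
xf<-1.1; zf<-c(xf,xf^2)                  # assumed regression function
yf<-sum(zf*g1)                           # predicted mean
q<-sum(zf*(G1%*%zf))                     # quadratic form z_f' Gamma_1 z_f
sf<-sqrt(b1/a1*(q+1))                    # predictive scale, Sigma_f = 1 assumed
yf+c(-1,1)*qt(0.95,df=a1)*sf             # approx. [1.08, 3.01]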
5. Consider the simple linear regression model with Σ_{i=1}^n xi = 0:

yi = αy + βy xi + εi,  i = 1, . . . , n,  εi ∼ N(0, σ²) i.i.d.,
                                                               (7.70)
zi = αz + βz xi + ξi,  i = 1, . . . , n,  ξi ∼ N(0, σ²) i.i.d.,

where εi and ξi are mutually independent and Σ_{i=1}^n xi = 0, Σ_{i=1}^n xi² = n.
The unknown parameter is θ = (αy , βy , αz , βz , σ 2 ). We are mainly inter-
ested in η = 12 (βy + βz ).
(a) Re–write model (7.70) as univariate model with response variable y =
(y1 , . . . , yn , z1 , . . . , zn )T .
(b) Assume prior θ ∼ NIG(a, b, m, Γ) with Γ = diag(λ1 , λ2 , λ3 , λ4 ).
Derive the posterior distribution of η.
(c) Assume that σ 2 is known. Determine the HPD α-credible interval
C1 (y, σ 2 ) for η.
(d) Assume that σ 2 is unknown. Determine the Bayesian L2 estimate σ̃12 for
σ2 .
(e) Assume that σ 2 is unknown. Determine the HPD α-credible interval
C1 (y) for η.
6. Consider model (7.70), with Σ_{i=1}^n xi = 0 and Σ_{i=1}^n xi² = n. Set

ui = ½(yi + zi) = αu + βu xi + ε̃i,  ε̃i ∼ N(0, ½σ²), i.i.d.,  i = 1, . . . , n,    (7.71)

where αu = ½(αy + αz), βu = ½(βy + βz) and ε̃i = ½(εi + ξi). The unknown
parameter is θu = (αu, βu, σ²) and θu ∼ NIG(a, b, 0, diag(c1, c2)).
(a) Derive the posterior distribution of βu .
(b) Assume that σ 2 is known. Determine the HPD α-credible interval
C2 (u, σ 2 ) for βu .
(c) Assume that σ 2 is unknown. Determine the Bayesian L2 estimate σ̃22 for
σ2 .
(d) Assume that σ 2 is unknown. Determine the HPD α-credible interval
C2 (u) for βu .
7. Consider the two models (7.70) and (7.71). The parameter of interest is
η = 12 (βy +βz ) = βu . Compare the results in Problems 5 and 6, particularly:
(a) Specify prior distributions such that (η, σ 2 ) have the same prior in both
models.
(b) Assume that σ 2 is known. Compare the HPD α-credible intervals for η.
(c) Assume that σ 2 is unknown. Compare the Bayesian L2 estimates for σ 2 .
(d) Assume that σ 2 is unknown. Compare the HPD α-credible intervals for
η.
8. Related to Problem 10 in Chapter 6. Assume two correlated regression lines
This chapter deals with the Bayesian approach to hypothesis testing. The
hypotheses are two alternative Bayes models. The goal is to figure out which
of the models can be the right one for the observed data. We set
H0 : M0 = {P0 , π0 } versus H1 : M1 = {P1 , π1 } (8.1)
where
Pj = {Pj,θj : θj ∈ Θj }, θj ∼ πj , j = 0, 1. (8.2)
It is possible that the statistical models are different including the parameter
spaces. Note that, θj is not a component of the parameter θ, rather another
parameter. We give an example for two different Bayes models.
Pj = {Pθ : θ ∈ Θj } , j = 0, 1.
We assume that the prior probability, Pπ (Θ0 ) > 0. The prior π on Θ is de-
composed as
π(θ) = Pπ (Θ0 )π0 (θ) + Pπ (Θ1 )π1 (θ)
with Pπ (Θ0 ) + Pπ (Θ1 ) = 1, and
π0(θ) = { π(θ)/Pπ(Θ0)  for θ ∈ Θ0,  0 else },    π1(θ) = { π(θ)/Pπ(Θ1)  for θ ∈ Θ1,  0 else }.
H0 : Θ0 versus H1 : Θ1 . (8.3)
where X = C1 ∪ C0 , with C1 ∩ C0 = ∅.
Note that Pπ (Θ0 |x) + Pπ (Θ1 |x) = 1. For a0 = a1 the model, whose parameter
space has higher posterior probability, is accepted. This procedure has a nice
heuristic background. Following the Bayesian inference principle, we apply
the ratio of posterior probabilities instead of likelihood ratio. But note that,
it does not give the same answer as in the Neyman–Pearson theory, where first
the probability of type I error is bounded and then the probability of type II
error is minimized. The test in (8.4) treats both errors simultaneously, only
corrected by different weights.
Figure 8.1: Example 8.3. The white storks decide the SRB.
N(μ1, σ1²),  with  μ1 = (xσ0² + μ0σ²)/(σ0² + σ²)  and  σ1² = σ0²σ²/(σ0² + σ²).    (8.6)
H0 : θ ≤ 0 versus H1 : θ > 0.
Calculate
Pπ(θ ≤ 0|x) = Pπ((θ − μ1)/σ1 ≤ −μ1/σ1 | x) = Φ(−μ1/σ1),

where Φ is the distribution function of N(0, 1). Setting za for the quantile

Φ(za) = a1/(a0 + a1),
Figure 8.2: Left: Example 8.4. The null hypothesis is rejected. Right: Illustration for
point-null hypothesis. The null hypothesis is always rejected for any experiment’s
result.
H0 : β ≥ 0 versus H1 : β < 0.
Figure 8.3: Example 8.5. Bayes model with θ = (α, β, σ 2 ). The null hypothesis is
clearly rejected.
Bayes models,
H0 : M0 = {P0 , π0 } versus H1 : M1 = {P1 , π1 } (8.8)
with Pj = {Pj,θj : θj ∈ Θj }.
We start with the construction of a common Bayes model Mcom which embeds
both models. We introduce an additional parameter, the model indicator, and
a prior on it:
k ∈ {0, 1}, π(k) = pk , p0 + p1 = 1, p0 > 0. (8.9)
The parameter space of the common model Pcom is
Θcom = {0} × Θ0 ∪ {1} × Θ1
with parameter θ = (k, θk ) ∈ Θcom , such that
Pcom = {P0 ∪ P1 : θ ∈ Θcom }.
The common prior is the mixture
π(θ) = π((k, θk ))
= π((k, θk )|k = 0)π(k = 0) + π((k, θk )|k = 1)π(k = 1) (8.10)
= π0 (θ0 )p0 + π1 (θ1 )p1 .
We get the common Bayes model
Mcom = {Pcom , π}. (8.11)
The parameter of interest is the model indicator k. Following the Bayesian
inference principle, the posterior distribution p(k|x) is the main source of
information. Applying Bayes theorem we obtain
p(k)p(x|k)
p(k|x) = , (8.12)
p(0)p(x|0) + p(1)p(x|1)
where p(x|k) is the marginal distribution of the data x ∈ X given the model
Mk. To compare the models, the ratio of main interest is
p(0|x) p(x|0) p(0)
= . (8.13)
p(1|x) p(x|1) p(1)
We have
p(x|k) = ∫_{Θk} p(x, θk|k) dθk = ∫_{Θk} pk(x, θk) dθk = ∫_{Θk} ℓk(θk|x) πk(θk) dθk.    (8.14)

We get

p(0|x)/p(1|x) = [ ∫_{Θ0} ℓ0(θ0|x) π0(θ0) dθ0 / ∫_{Θ1} ℓ1(θ1|x) π1(θ1) dθ1 ] · p(0)/p(1).    (8.15)
The information from the data is included in the Bayes factor, defined as
follows.
Definition 8.2 (Bayes factor) Assume the test problem (8.8). The Bayes factor is

B01 = Bπ01 = ∫_{Θ0} ℓ0(θ0|x) π0(θ0) dθ0 / ∫_{Θ1} ℓ1(θ1|x) π1(θ1) dθ1.
Applying (8.15) we get the relation between the prior odds ratio,

p(0)/p(1) = p(0)/(1 − p(0)),

and the posterior odds ratio,

p(0|x)/p(1|x) = p(0|x)/(1 − p(0|x)),

with the help of the Bayes factor:

p(0|x)/(1 − p(0|x)) = B01 · p(0)/(1 − p(0)),

i.e.,
The Bayes factor measures the change in the odds. Small Bayes factor B01
means evidence against the null hypothesis. There are different empirical scales
for judging the size of Bayes factor. We quote here a table from Kass and
Raftery (1997). Note that, the table is given for B10 = (B01 )−1 .
Using

C(n, x) = 1/((n + 1) B(x + 1, n − x + 1)),

we obtain

A = (1/(n + 1)) · B(α0 + x, β0 + n − x)/(B(x + 1, n − x + 1) B(α0, β0)).
The prior in model M1 is Gamma(α1 , β1 ). It holds that
B := ∫_{Θ1} ℓ1(θ1|x) π1(θ1) dθ1
   = ∫_0^∞ (λ^x/x!) exp(−λ) (β1^{α1}/Γ(α1)) λ^{α1−1} exp(−β1λ) dλ
   = (β1^{α1}/(x! Γ(α1))) ∫_0^∞ λ^{x+α1−1} exp(−λ(β1 + 1)) dλ.

Using

∫_0^∞ λ^{x+α1−1} exp(−λ(β1 + 1)) dλ = Γ(x + α1)/(β1 + 1)^{α1+x},

we obtain

B = (β1^{α1}/(x! Γ(α1))) · Γ(x + α1)/(β1 + 1)^{α1+x}.
In general the Bayes factor A/B is an unwieldy expression, the ratio of two
such integrals. In case α0 = β0 = 1, we can use B(1, 1) = 1 and simplify A to
1/(n + 1). Setting also α1 = β1 = 1 in B, we obtain

B01 = A/B = 2^{x+1}/(n + 1).
2
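A small R sketch computing A, B and the Bayes factor above; the data values x and n are illustrative:

x<-3; n<-10                                   # illustrative data
a0<-1; b0<-1; a1<-1; b1<-1                    # prior parameters as in the special case
A<-beta(a0+x,b0+n-x)/((n+1)*beta(x+1,n-x+1)*beta(a0,b0))
B<-b1^a1*gamma(x+a1)/(factorial(x)*gamma(a1)*(b1+1)^(a1+x))
A/B                                           # Bayes factor B01
2^(x+1)/(n+1)                                 # closed form, same value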
In the special case where we formulate the alternatives by splitting the com-
mon model {P, π}, as in Section 8.1 we can re-formulate the Bayes factor as
follows. The prior probability pk of the submodel Mk coincides with Pπ (Θk ),
k = 0, 1 and
∫_{Θk} ℓk(θk|x) πk(θk) dθk = (1/Pπ(Θk)) ∫_{Θk} ℓ(θ|x) π(θ) dθ
                           = (m(x)/Pπ(Θk)) ∫_{Θk} π(θ|x) dθ    (8.16)
                           = (m(x)/Pπ(Θk)) Pπ(Θk|x),

where m(x) = ∫_Θ ℓ(θ|x) π(θ) dθ. We obtain, for the test problem (8.3), the
Bayes factor as

Bπ01 = (Pπ(Θ0|x)/Pπ(Θ1|x)) · (Pπ(Θ1)/Pπ(Θ0)).
Note that, Pπ and Pπ (.|x) are the prior and posterior probabilities related to
the common model {P, π}. In this case (8.4) is the optimal decision rule. Both
approaches, decision making or evidence finding, can be transformed to each
other. We have the relation:
Reject H0  ⇔  Pπ(Θ0|x) < a1/(a0 + a1)  ⇔  Bπ01 < (a1 p1)/(a0 p0).
8.2.1 Point Null Hypothesis
Given {P, π} with P = {Pθ : θ ∈ Θ}, a point-null hypothesis is defined by
splitting Θ = Θ1 ∪ {θ0 }, where Θ1 = Θ \ {θ0 }, written as
H0 : θ = θ0 versus H1 : θ ≠ θ0,

with two different priors. The prior of the point-null model is the Dirac measure on θ0,

π0(θ) = { 1 if θ = θ0,  0 otherwise }.

Further

∫_{Θ0} p(x|θ) π0(θ) dθ = p(x|θ0),

hence

B01 = p(x|θ0)/m(x).    (8.18)
In Example 8.3 we studied the sex ratio at birth in Sweden and concluded
that the rate is less than the expected rate given by WHO. Now we want to
check the related one–point hypothesis.
In likelihood testing theory we can test a one–point hypothesis with the help of
a confidence region. The same method is possible in Bayes inference applying
HPD credible regions. Assume that
then
π(θ0|x) = p(x|θ0) π(θ0)/m(x) > k
Figure 8.4: Sex ratio at birth (SRB) data. Left: Example 8.3: The one-sided test
rejects the null hypothesis. The SRB in Sweden is significantly less than the value
of WHO. Right: Example 8.9: The WHO value lies inside of credible region. There
is no evidence against the one–point null hypothesis.
and

B01 = p(x|θ0)/m(x) > k/π(θ0).
Small B01 delivers evidence against H0 , but here we have a lower bound.
Roughly speaking, if θ0 ∈ Cx we will find no evidence against H0 .
We illustrate it with the Swedish SRB data.
and
πj (θj ) = πj (βj |σ 2 )π(σ 2 ).
This set up can be illustrated as follows.
Example 8.10 (Comparison of regression curves) We fit two different
models to the same data and want to figure out which model is better. Con-
sider two polynomial regression curves of different degree but same error as-
sumptions. Model M0 is a cubic regression:
with posterior NIG(a, b, γ, Γ). Using the known posterior we calculate the integral

m(y) = ∫_Θ ℓ(θ|y) π(θ) dθ.

Set

ℓ(θ|y) π(θ) = c0 cπ ker(θ|y),

where cπ is the constant from the prior and

c0 = (1/√(2π))^n (1/|Σ|^{1/2}).

Further, for

π(θ|y) = c1 ker(θ|y)

with the same kernel function ker(θ|y) and

c1 = (1/√(2π))^p (1/|Γ|^{1/2}) (b/2)^{a/2} (1/Γ(a/2)),

we obtain

m(y) = c0 cπ / c1.

Note that, the constant c0 is the same in both models. Summarizing, we obtain
the Bayes factor

B01 = m^{(0)}(y)/m^{(1)}(y) = (cπ^{(0)} c1^{(1)})/(cπ^{(1)} c1^{(0)}),
where the superscript indicates the corresponding model. We consider the
conjugate prior and Jeffreys prior. For conjugate priors, we assume
βj|σ² ∼ N_{pj}(γ0^{(j)}, σ²Γ0^{(j)}),   σ² ∼ InvGamma(a0/2, b0/2).

The constants of the variance prior are the same in both models. We get

cπ^{(0)}/cπ^{(1)} = (√(2π))^{p1−p0} (|Γ0^{(1)}|/|Γ0^{(0)}|)^{1/2}.
Now we calculate the ratio of the constants of NIG(a(0) , b(0) , γ (0) , Γ(0) ) and
NIG(a(1) , b(1) , γ (1) , Γ(1) ). Note that, a(0) = a(1) = a0 + n, where all other
parameters differ. We get
c1^{(1)}/c1^{(0)} = (√(2π))^{p0−p1} (|Γ^{(0)}|/|Γ^{(1)}|)^{1/2} (b^{(1)}/b^{(0)})^{(a0+n)/2}.

Summarizing, we have

B01 = (|Γ^{(0)}|/|Γ^{(1)}|)^{1/2} (|Γ0^{(1)}|/|Γ0^{(0)}|)^{1/2} (b^{(1)}/b^{(0)})^{(a0+n)/2}.    (8.20)
y^{(j)} = X_{(j)} γ^{(j)},   pen^{(j)} = (γ0^{(j)} − γ^{(j)})ᵀ (Γ0^{(j)})^{−1} (γ0^{(j)} − γ^{(j)}).
We transform the Bayes factor and take only the leading terms for n → ∞
into account, such that
B01 = c1^{(1)}/c1^{(0)} = (√(2π))^{p0−p1} (|Γ^{(0)}|/|Γ^{(1)}|)^{1/2} · ((b^{(1)}/2)^{(n−p1)/2} Γ((n − p0)/2)) / ((b^{(0)}/2)^{(n−p0)/2} Γ((n − p1)/2)).
we get

Γ((n − p0)/2)/Γ((n − p1)/2) ≈ (n/2)^{(p1−p0)/2}.
Under (8.21) we have
|Γ^{(0)}|/|Γ^{(1)}| = |X_{(1)}ᵀ Σ^{−1} X_{(1)}| / |X_{(0)}ᵀ Σ^{−1} X_{(0)}| = n^{p1−p0} (|C_{(1)}|/|C_{(0)}| + o(1)).
We transform the Bayes factor and take only the leading terms for n → ∞
into account, so that
Hence we prefer the model with the smaller BIC (Bayes information criterion,
or Schwarz criterion; Schwarz (1978)). In the linear normal model, BIC is
given as (see Subsection 8.3.1)
BIC = n ln(RSS/n) + p ln(n).
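For a fitted linear model this criterion can be computed directly from the residual sum of squares; a minimal R sketch on simulated data (all values illustrative):

set.seed(2)
n<-50; x<-rnorm(n); y<-1+2*x+rnorm(n)    # illustrative data
fit<-lm(y~x)
RSS<-sum(resid(fit)^2); p<-length(coef(fit))
n*log(RSS/n)+p*log(n)                    # BIC as given above
# R's built-in BIC(fit) differs by model-independent terms and by counting sigma^2 as a parameter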
H0 : θ = 0 versus H1 : θ ≠ 0.

B01 = p(x|θ = 0)/m(x)

with

m(x) = ∫_{−∞}^{∞} p(x|θ) dθ = ∫_{−∞}^{∞} (1/√(2π)) exp(−(x − θ)²/2) dθ = 1

and

p(x|θ = 0) = (1/√(2π)) exp(−x²/2) ≤ 1/√(2π).

Hence

B01 ≤ 1/√(2π) ≈ 0.4.
This procedure favours rejecting H0 . 2
The pseudo Bayes factor delivers a possible way out of the problems regarding
improper priors. The idea is to exploit the iterative structure of Bayes
methods. The sample is split into a training sample, sufficiently large to derive
a proper posterior, and this posterior is set as new proper prior. The training
sample should be chosen as small as possible. The following example demonstrates
the applicability of this idea.
The term const is the same for two competing models, such that it does not
change the result. We calculate BIC for the normal linear model with Jeffreys
prior
{{Nn(Xβ, σ²Σ) : θ = (β, σ²), β ∈ Rᵖ, σ² ∈ R⁺}, πJeff},   πJeff(θ) ∝ 1/σ².
Recall that, the data are now y and X denotes the design matrix. The log-
likelihood l(θ|y) = ln ℓ(θ|y) is

l(θ|y) = −(n/2) ln(2πσ²) − (1/2) ln(|Σ|) − (1/(2σ²)) (y − Xβ)ᵀ Σ^{−1} (y − Xβ).    (8.22)
Its maximum is attained at
and
σ̂² = (1/n) RSS,   RSS = (y − Xβ̂Σ)ᵀ Σ^{−1} (y − Xβ̂Σ).
Thus
−2 max_{θ∈Θ} l(θ|y) = n ln((1/n) RSS) + n ln(2π) + ln(|Σ|) + n

and with const = −n − ln(|Σ|) − n ln(2π) we obtain

BIC = n ln((1/n) RSS) + p ln(n).    (8.23)
where the constant term is independent of the model and has no influence on
model choice. The model fit is measured by the posterior expectation of the
deviance. The complexity of the model is given by
DIC = E(D(θ)|x) + pD .
y = Xβ + ε, ε ∼ Nn (0, σ 2 Σ).
Theorem 8.1
Assume the Bayes model
{{Nn(Xβ, σ²Σ) : θ = (β, σ²), β ∈ Rᵖ, σ² ∈ R⁺}, π}.

Then

DIC = n ln(b/(a − 2)) − 2n Ψ(a/2) + ((a + 2)/b) FIT + 2 tr(Γ XᵀΣ^{−1}X) + const,    (8.26)

where Ψ(·) is the digamma function and
and obtain
E((y − Xβ)T Σ−1 (y − Xβ)|y, σ 2 ) = FIT + E((γ − β)T XT Σ−1 X(γ − β)|y, σ 2 ).
Further
E((γ − β)T XT Σ−1 X(γ − β)|y, σ 2 ) =E(tr (γ − β)(γ − β)T XT Σ−1 X|y, σ 2 )
=tr Cov(β|y, σ 2 )XT Σ−1 X
=σ 2 tr Γ XT Σ−1 X .
It follows that
E(D(θ)|y) = n E(ln(σ²)|y) + E((1/σ²)(y − Xβ)ᵀΣ^{−1}(y − Xβ)|y) + const
          = n(ln(b/2) − Ψ(a/2)) + (a/b) FIT + tr(Γ XᵀΣ^{−1}X) + const.    (8.29)

Further

pD = n(ln(2/(a − 2)) − Ψ(a/2)) + (2/b) FIT + tr(Γ XᵀΣ^{−1}X).    (8.30)
Summarizing, we obtain the statement.
2
In the following we calculate the difference of DIC between two Bayes linear
models
ΔDIC = DIC(0) − DIC(1) .
We assume the same set up as in Subsection 8.2.2, for comparing models with
different linear regression functions but the same error distribution.
The expressions for the posteriors are given in Theorem 6.5. Note that the
parameters a0 and a1 are the same in both models. Applying Theorem 8.1 for
j = 0, 1 we calculate
ΔDIC = D(0) − D(1)
where
D^{(j)} = n ln(b1^{(j)}) + ((a1 + 2)/b1^{(j)}) FIT^{(j)} + 2 tr(Γ1^{(j)} X_{(j)}ᵀ Σ^{−1} X_{(j)}).    (8.32)
Especially
and

Γ1^{(j)} = ((Γ0^{(j)})^{−1} + X_{(j)}ᵀ Σ^{−1} X_{(j)})^{−1}

with

γ1^{(j)} = Γ1^{(j)} (X_{(j)}ᵀ Σ^{−1} y + (Γ0^{(j)})^{−1} γ0^{(j)}).
Under (8.21) we have

n Γ1^{(j)} = ((1/n)(Γ0^{(j)})^{−1} + (1/n) X_{(j)}ᵀ Σ^{−1} X_{(j)})^{−1} = C_{(j)}^{−1} + o(1)

and

tr(Γ1^{(j)} X_{(j)}ᵀ Σ^{−1} X_{(j)}) = (1/n) tr(n Γ1^{(j)} X_{(j)}ᵀ Σ^{−1} X_{(j)}) = pj + o(1).
where, for j = 0, 1,

D^{(j)} = n ln(RSS^{(j)}/n) + 2 pj.    (8.33)
Note that, in this case, DIC coincides with Akaike information criterion
(AIC).
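Under the approximation (8.33) the model comparison reduces to an AIC-type criterion; a small R sketch comparing two nested regression functions on simulated data (illustrative only):

set.seed(3)
n<-100; x<-rnorm(n); y<-1+2*x+rnorm(n)       # illustrative data
fit0<-lm(y~x); fit1<-lm(y~x+I(x^2))          # two competing regression functions
D<-function(fit,n) n*log(sum(resid(fit)^2)/n)+2*length(coef(fit))
D(fit0,n)-D(fit1,n)                          # Delta DIC in the approximation (8.33)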
f(x|θ) = (θ^α/Γ(α)) x^{α−1} exp(−θx)    (8.35)
Computational Techniques
Following the Bayesian inference principle, all we need is the posterior
distribution. In general, however, we rarely find a known distribution family
which has the same kernel as π(θ)ℓ(θ|x). In this case computational methods
help.
Nowadays we can derive the posterior for almost all combinations of π(θ)
and ℓ(θ|x). Furthermore we can also do it in cases where we do not have an
expression of the likelihood function, but only a data generating procedure.
Likewise, a closed form expression of the prior is not required either.
Even when these expressions require computational methods for special func-
tions, such as the incomplete beta function or digamma function, properties of
these functions are well studied and can help understand the problem better.
It can also be useful to approximate the posterior by analytic expressions, and
in this case we can study the properties of the posterior more generally.
Figure 9.1: The lazy mathematician gives up and lets the computer do it.
details we refer to Albert (2005), Chen et al. (2002), Givens and Hoeting
(2005), Robert and Casella (2010), Lui (2001), Zwanzig and Mahjani (2020,
Chapter 2), and the literature therein.
9.1.1 Brute-Force
1. Discretize Θ ≈ {θ1 , . . . , θN }.
2. Calculate π(θj) ℓ(θj|x) for j = 1, . . . , N.
3. Calculate mN(x) = (1/N) Σ_{j=1}^N π(θj) ℓ(θj|x).
4. Approximate π(θj|x) ≈ (1/mN(x)) π(θj) ℓ(θj|x).
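A minimal R sketch of the brute-force steps for a normal-mean example with a normal prior (the data value, prior and grid are illustrative assumptions):

x<-1.3; sig<-1                          # one observation, known variance (illustrative)
prior<-function(t) dnorm(t,0,2)         # illustrative prior
lik<-function(t) dnorm(x,t,sig)
theta<-seq(-6,8,length=1000)            # step 1: discretize
k<-prior(theta)*lik(theta)              # step 2
mN<-mean(k)                             # step 3
post<-k/mN                              # step 4: approximate posterior on the grid
post.dens<-k/(sum(k)*diff(theta)[1])    # normalized as a density, for plotting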
[Figure: prior, likelihood and posterior over theta; the observation x is marked.]
where θ̂ = arg min h(θ), b̂ = b(θ̂), ĥ = h(θ̂) and σ̂² = h″(θ̂)^{−1}; see Tierney
et al. (1989).
For better explanation of the Laplace approximation we set b(θ) = 1 and assume
that h(θ) is smooth, has the unique minimum θ̂ = arg min h(θ), and h′(θ̂) = 0.
Applying Taylor expansion and the integral ∫ exp(−(1/(2a))(x − b)²) dx = √(2πa), we
obtain

∫ exp(−n h(θ)) dθ ≈ ∫ exp(−n ĥ − n h′(θ̂)(θ − θ̂) − (n/2) h″(θ̂)(θ − θ̂)²) dθ
                  = ∫ exp(−n ĥ − (n/2) h″(θ̂)(θ − θ̂)²) dθ
                  = exp(−n ĥ) ∫ exp(−(n/2) h″(θ̂)(θ − θ̂)²) dθ
                  = exp(−n ĥ) √(2π) σ̂ n^{−1/2},
They assume that g(θ) is smooth, and that n is sufficiently large such that
the MAP estimator θMAP is unique. Note that, the second assumption is not
very strong; see the results in Chapter 5. Further, they need that (1/n) ln ℓ(θ|x)
is asymptotically independent of n.
μ̂(x) = (b̂N σ̂N exp(−n ĥN))/(b̂D σ̂D exp(−n ĥD)) + O(n^{−2}).    (9.2)
We illustrate the approximation (9.2) for g(θ) = θ in the following example.
Example 9.2 (Binomial distribution and beta prior)
Set X|θ ∼ Bin(n, θ) and θ ∼ Beta(α0 , β0 ). Then the posterior is Beta(α, β),
with α = α0 + x and β = β0 + n − x; see Example 2.11. We know that the
expectation of Beta(α, β) is
μ = α/(α + β).

We set bN = bD = 1 and further

hD(θ) = −(1/n)((α − 1) ln(θ) + (β − 1) ln(1 − θ)),
hN(θ) = −(1/n)(α ln(θ) + (β − 1) ln(1 − θ)).

We calculate

h′D(θ) = −(1/n)((α − 1)/θ − (β − 1)/(1 − θ)),   h″D(θ) = −(1/n)(−(α − 1)/θ² − (β − 1)/(1 − θ)²).

For illustration we consider θ ∼ U[0, 1]. Then we have α + β − 2 = n. Figure 9.3
shows the convergence in (9.3) for n → ∞ under different values of p = α/(α + β),
such that α = p(n + 2), β = (1 − p)(n + 2). 2
Figure 9.3: Laplace approximation in Example 9.2. Left: The symmetric case α = β,
p = 0.5. Right: p = 0.1. Note that, we need n ≥ 8.
The striking idea of Monte Carlo methods (MC) is that, depending on the
factorization,
h(θ, x) π(θ|x) = m(θ, x) p(θ|x), (9.5)
the integral can be rewritten as an expected value
μ(x) = ∫ m(θ, x) p(θ|x) dθ.

μ̂(x) = (1/N) Σ_{i=1}^N m(θ^{(i)}, x).    (9.6)
For
σ²(x) = ∫ (m(θ, x) − μ(x))² p(θ|x) dθ < ∞,    (9.7)
Note that, the rate 1/√N cannot be improved for Monte Carlo methods. The
Monte Carlo approximation of the deterministic integral is a random variable,
which for large N is close to the integral value with high probability.
The factorization ansatz (9.5) is essential. We need a distribution p(θ|x) with
both a small variance and a good random number generator. The choice of the
fraction in (9.4), the application of the MC on the denominator and numerator
separately, and generating a sample from the prior π(θ) may not be good
choices, since non-informative priors can be improper or have at least a high
variance. On the other hand, priors with small variances risk dominating
the likelihood; see Example 3.1 on page 30.
The following algorithm is called independent MC because it is based on an
independent sample θ(1) , . . . , θ(N ) .
μ̂(x) = (1/N)(m(θ^{(1)}, x) + . . . + m(θ^{(N)}, x)).
We refer to Zwanzig and Mahjani (2020, Section 2.1), Albert (2005, Section
5.7), Robert and Casella (2010, Section 3.2). Here we give only an example
for illustration of two MC methods, one based on sampling from posterior
and the other on twice independent sampling from the prior for separate
approximation of the numerator and denominator in the fraction (9.4).
Note that, for given data x, μ̂(x) is a random approximation. We get μ̂(x) =
2.803. For the second method we draw θ1^{(1)}, . . . , θ1^{(N)} from the prior. Then we
approximate the denominator integral

μDen(x) = ∫_0^∞ θ^{−(α+n/2+1)} exp(−(β + (n/2) θ̂MLE)/θ) dθ

by

μ̂Den(x) = (1/N) Σ_{j=1}^N (θ1^{(j)})^{−n/2} exp(−(n/2) θ̂MLE / θ1^{(j)}).
(1) (N )
Further we draw θ2 , . . . , θ2 from the prior. Then we approximate the nu-
merator integral
∞
β + n2 θMLE
θ−(α+ 2 ) exp(−
n
μNum (x) = )dθ
0 θ
by
1 (j) − n −1
N n
θMLE
μNum (x) = (θ2 ) 2 exp(− 2 (j) ).
N j=1 θ 2
x<-c(-0.06,2.43,1.55,1.76,1.27,0.94,0.48,1.16,1.32,1.81)# data
n<-length(x) # sample size
MLE<-sum(x**2)/n # maximum likelihood estimate
N<-100 # simulation size
library(invgamma)
a0<-4; b0<-12 # prior parameters
b0^2/((a0-1)^2*(a0-2)) # prior variance
a1<-a0+n/2; b1<-b0+MLE*n/2 # posterior parameters
b1^2/((a1-1)^2*(a1-2)) # posterior variance
b1/(a1-1) # Bayes estimate
# Sampling from the posterior (MC1)
MC1<-function(N,a1,b1)
{mu<-mean(rinvgamma(N,a1,b1)); return(mu)}
MC1(N,a1,b1)# MC1 approximation of the Bayes estimate
# Sampling from the prior (MC2)
MC2<-function(N,a0,b0){
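# A possible completion of MC2 (a sketch based on the approximation formulas
# for the numerator and denominator above; not the book's original code):
theta1<-rinvgamma(N,a0,b0); theta2<-rinvgamma(N,a0,b0)   # two independent prior samples
Den<-mean(theta1^(-n/2)*exp(-n*MLE/(2*theta1)))          # denominator approximation
Num<-mean(theta2^(-n/2+1)*exp(-n*MLE/(2*theta2)))        # numerator approximation
return(Num/Den)}
MC2(N,a0,b0)  # MC2 approximation of the Bayes estimate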
Figure 9.4: Example 9.3. The MC approximations converge to the Bayes estimate
(straight line) with increasing simulation size N . But even for large N the approxi-
mations still have random deviations around the Bayes estimate. The Bayes estimate
and the maximum likelihood estimate differ from the true parameter. For them, only
a convergence for increasing sample size n is possible.
w^{(j)} = w(θ^{(j)}, x) = k(θ^{(j)}|x)/g(θ^{(j)}|x),   j = 1, . . . , N.

3. Approximate μ(x) = ∫ h(θ, x) π(θ|x) dθ by
Note that, the constant of the posterior is not needed. Usually we apply
π(θ)ℓ(θ|x) ∝ k(θ|x). In contrast to independent MC, where the numerator
integral and the denominator integral are approximated separately, we gener-
ate only one sample θ(1) , . . . , θ(N ) .
The IS method is based on a weighted law of large numbers. For π(θ|x) =
c0 (x) k(θ|x), we have
(1/N) Σ_{j=1}^N w^{(j)} h(θ^{(j)}, x) →^P ∫ h(θ, x) w(θ, x) g(θ|x) dθ = c0(x)^{−1} μ(x),

since

∫ (h(θ, x) k(θ|x)/g(θ|x)) g(θ|x) dθ = ∫ h(θ, x) k(θ|x) dθ = c0(x)^{−1} ∫ h(θ, x) π(θ|x) dθ,

and

(1/N) Σ_{j=1}^N w^{(j)} →^P ∫ (k(θ|x)/g(θ|x)) g(θ|x) dθ = c0(x)^{−1} ∫ π(θ|x) dθ = c0(x)^{−1},

such that μ̂(x) →^P μ(x).
The importance weights are quotients of two densities up to a constant. For
this, we state the following theorem from Lui (2001, Theorem 2.5.1, page 31).
Theorem 9.1 Let f (z1 , z2 ) and g(z1 , z2 ) be two probability densities, where
the support of f is a subset of the support of g. Then,
Var_g( f(z1, z2)/g(z1, z2) ) ≥ Var_g( f1(z1)/g1(z1) ),
Proof: Using f1(z1) = ∫ f(z1, z2) dz2,  g(z1, z2) = g1(z1) g2|1(z2|z1), and
r(z1, z2) = f(z1, z2)/g(z1, z2), we obtain

r1(z1) = f1(z1)/g1(z1) = ∫ r(z1, z2) g2|1(z2|z1) dz2 = E_g(r(z1, z2)|z1).
The statement follows from
2
Theorem 9.1 is based on the Rao–Blackwell Theorem; see Liero and Zwanzig
(2011, Theorem 4.5, page 104). Based on this result, Lui formulated a rule of
thumb for Monte Carlo computation:
We refer to Zwanzig and Mahjani (2020, Section 2.1.1), Albert (2005, Section
5.9), Robert and Casella (2010, Section 3.3), Chen et al. (2002, Chapter 5)
and Lui (2001, Section 2.5).
Figure 9.5: Example 9.3. The continuous line, inverse-gamma distribution, is the
target, and the broken line, the gamma distribution, is the trial.
In Figure 9.5 the posterior and the trial distribution are plotted together. We
set N = 100 and generate observations from the trial, given by θ(1) , . . . , θ(N ) .
Using π(θ)ℓ(θ|x) ∝ k(θ|x), with

k(θ|x) = (1/θ)^{α+1+n/2} exp(−(β + (n/2) θ̂MLE)/θ),

the importance weights are calculated by w^{(j)} = k(θ^{(j)}|x)/g(θ^{(j)}). We obtain
Σ_{j=1}^m w^{(j)} = 2.778851 × 10^{−6} and Σ_{j=1}^m w^{(j)} θ^{(j)} = 7.860973 × 10^{−6}. Thus
the Bayes estimate approximated by importance sampling is 2.83. Recall that,
the Bayes estimate is 2.78 and σ0² = 2.25. See also the following R code. 2
N<-100
a0<-4; b0<-12 # prior parameters
a<-6; b<-2 # trial parameters
IS<-function(N,a0,b0,MLE,a,b)
{
theta<-rgamma(N,a,b); k<-theta^{-(a0+6)}*exp(-(b0+5*MLE)/theta);
w<-k/dgamma(theta,a,b);
return(sum(w*theta)/sum(w))
}
Figure 9.6: Comparison of methods in Examples 9.3 and 9.4 by violinplots. The
straight line is the Bayes estimate.
library(invgamma);library(UsingR)
M1<-rep(NA,200);M2<-rep(NA,200);M3<-rep(NA,200)
for (i in 1:200){
M1[i]<-MC1(N,a1,b1); M2[i]<-MC2(N,a0,b0)
M3[i]<-IS(N,a0,b0,MLE,a,b)} # run each method 200 times
Mcomb<-c(M1,M2,M3); comb<-c(rep(1,200),rep(2,200),rep(3,200))
violinplot(Mcomb~comb,col=grey(0.8),names=c("MC1","MC2","IS"))
Figure 9.7: Example 9.5. The thick continuous line is the target, the thin line is the
trial and the broken line is the density estimation of SIR simulated sample.
Thus we conclude that the SIR method works. See also the following R code
and Figure 9.7. 2
library(invgamma)
a0<-4; b0<-12; n<-10; MLE<-2.051
a1<-a0+n/2; b1<-b0+n/2*MLE #posterior parameters
a<-6; b<-2 # trial parameters
m<-1000; y<-rgamma(m,a,b) # step 1
w<-dinvgamma(y,a1,b1)/dgamma(y,a,b); ws<-w/sum(w) # weights
N<-100; yy<-sample(y,N,replace=TRUE,prob=ws)# step 2
ks.test(yy,"pinvgamma",a1,b1) # one sample ks test
z<-rinvgamma(N,a1,b1); ks.test(yy,z) # two sample ks.test
Note that, the constant of the target distribution is not needed. The condition
(9.8) implies that the trial has heavier tails than the target. We will see that
the algorithm rejects less, when the trial distribution is close to the target and
M (x) is as small as possible. The rejection algorithm generates an independent
sample θ^{(1)}, . . . , θ^{(N)} from the target. It is given as follows.
r(θ, x) = k(θ|x)/(M(x) g(θ|x)).
Figure 9.8: Rejection algorithm. The girl changes the fortune area on the wheel
depending on the chosen ball.
(c0 (x)M (x))−1 , that is why the choice of a small M (x) fulfilling (9.8) is es-
sential for an effective algorithm.
We refer to Zwanzig and Mahjani (2020, Section 1.3), Lui (2001, Section 2.2),
Robert and Casella (2010, Section 2.3), and Givens and Hoeting (2005, Section
6.2.3). Here we continue with Examples 9.3 and 9.4.
Figure 9.9: Example 9.6. The trial distribution is Gamma(6, 2). The target is
InvGamma(9, 22.26). Left: The target is covered with M = 1.6, which is large since
trial and target have different modes. Right: The histogram of the random numbers
generated by the rejection algorithm and the density of the target and trial. The
generated sample belongs to the target.
library(invgamma)
n<-10; MLE<-2.051922 # sample size, sufficient statistic
a0<-4; b0<-12 # prior parameters
a1<-a0+n/2; b1<-b0+MLE*n/2 # posterior parameters
rand.rej<-function(N,M)
{rand<-rep(0,N)
for( i in 1:N)
{
L<-TRUE
while(L){
rand[i]<-rgamma(1,6,2)
r<-dinvgamma(rand[i],a1,b1)/(dgamma(rand[i],6,2)*M)
if(runif(1)<r){L<-FALSE}
}
}
return(rand)
}
R<-rand.rej(1000,1.6)
The general principle of Markov Chain Monte Carlo Methods can be formu-
lated as follows.
such that

π(θ|x) = ∫ A(ϑ, θ) π(ϑ|x) dϑ.    (9.10)

μ̂(x) = (1/(N − m)) Σ_{j=m+1}^N h(θ^{(j)}, x).
Let us explain, why MCMC works; for more details see Lui (2001, p. 249 and
pp. 269). Suppose the Markov chain θ(1) , . . . , θ(N ) is irreducible and aperiodic.
Then for any initial distribution of θ^{(0)} it holds that μ̂(x) →^P μ(x) for N → ∞,
and furthermore

√N (μ̂(x) − μ(x))/σh(x) → N(0, 1),   where   σh²(x) = σ²(x)(1 + 2 Σ_{j=1}^∞ ρj(x)),

with

σ²(x) = Var(h(θ^{(1)}, x)),   ρj(x) σ²(x) = Cov(h(θ^{(1)}, x), h(θ^{(j+1)}, x)).

Thus

μ̂(x) = μ(x) + (1/√N) σh(x) OP(1).    (9.11)
The approximation in (9.11) is better for smaller σh (x). We search for a
Markov chain which can be generated easily from the previous members, but
still has a sufficiently weak dependence structure.
= π(ϑ |x).
It remains to show (9.12). We assume that the trial function gives always a
new proposal: T (ϑ, ϑ) = 0. That implies that the chain only gets stuck when
the proposal is rejected. Let δϑ (ϑ ) be the Dirac mass, i.e., δϑ (ϑ ) = 1 for
ϑ = ϑ and δϑ (ϑ ) = 0 otherwise. The proposal is accepted with probability
r(ϑ, ϑ ) = min(1, R(ϑ, ϑ )). Thus
Thus
A(ϑ, ϑ ) = T (ϑ, ϑ )r(ϑ, ϑ )(1 − δϑ (ϑ )) + (1 − r(ϑ))δϑ (ϑ ).
Since T (ϑ, ϑ )δϑ (ϑ ) = T (ϑ, ϑ) = 0, we have
and
Here we present only the algorithm, where the proposal distribution is a ran-
dom walk. The previous member is disturbed by a random variable with sym-
metric distribution around zero. The step length from one member of the
chain to the following is regulated by the variance of the disturbance; see
Figure 9.12.
where LRT is the likelihood ratio and PRT is the prior ratio, here the quotient
of Cauchy densities

LRT(ϑ, θ) = ℓ(ϑ|y)/ℓ(θ|y),   PRT(ϑ, θ) = π(ϑ)/π(θ).
We see in Figure 9.11 that after m = 120 the equilibrium is reached. We take
d = 0.8 and N = 1000, and obtain for the Bayes estimates
The least squares estimates are θ1,lse = 0.338 and θ2,lse = 2.969. The true
values are θ = (1, 2). Observe, that the sample size n = 8 is quite small. The
following R code gives the calculations. 2
MCMC<-function(d,seed1,seed2,N)
{
rand1<-rep(0,N); rand2<-rep(0,N)
LRT<-rep(1,N); PRT<-rep(1,N); R<-rep(1,N)
rand1[1]<-seed1; rand2[1]<-seed2
for ( i in 2:N)
{
rand1[i]<-rand1[i-1]+runif(1,-d,d)# proposal intercept
rand2[i]<-rand2[i-1]+runif(1,-d,d)# proposal slope (symmetric random walk)
A<-prod(dnorm(Y,rand1[i]+rand2[i]*x,sqrt(0.5)))# likelihood at proposal
B<-prod(dnorm(Y,rand1[i-1]+rand2[i-1]*x,sqrt(0.5)))# likelihood at current state
LRT[i]<-A/B # new/old # likelihood ratio
p1<-dcauchy(rand1[i],0,1.5)*dcauchy(rand2[i],1,2)    # prior at proposal
p2<-dcauchy(rand1[i-1],0,1.5)*dcauchy(rand2[i-1],1,2)# prior at current state
PRT[i]<-p1/p2 # new/old # prior ratio
R[i]<- LRT[i]*PRT[i] # Metropolis ratio
r<-min(1,R[i])
u<-runif(1)
if (u<r){rand1[i]<-rand1[i]}else{rand1[i]<-rand1[i-1]}
if (u<r){rand2[i]<-rand2[i]}else{rand2[i]<-rand2[i-1]}
}
return(data.frame(rand1,rand2,LRT,PRT,R))
}
MM<-MCMC(0.8,0,0,1000) # carry out MCMC
mean(MM$rand1[121:1000]) # Bayes estimate intercept
mean(MM$rand2[121:1000]) # Bayes estimate slope
Figure 9.10: Example 9.7. Left: Generated data points and the true regression line.
The broken line is the least squares fit. Right: The prior distribution for the intercept
is a Cauchy distribution with location parameter 0, and scale parameter 1.5; the prior
for the slope is a Cauchy distribution with location parameter 1 and scale parameter
2.
Figure 9.11: Example 9.7. Left: The generated Markov chains with d = 0.8 and seed
= (0.0). Right: Different start values for studying the burn-in time.
Figure 9.12: Example 9.7. Right: The generated Markov chain for the intercept
with d = 0.2 which is too small and with d = 2 which is too large. Left: The
autocorrelation functions for different turning parameter d.
First we consider the random-scan algorithm, where in every step one com-
ponent is randomly chosen and updated, while the other components remain
unchanged.
Alternatively, we can also change the components one after the other.
Let us explain why the Gibbs sampler produces a “right” Markov chain. Let
i = it be the index chosen at step t. Then the actual transition probability
from ϑ to ϑ′ at step t is

At(ϑ, ϑ′) = { πi(θi^{(t)} = ϑ′i | θ[−i]^{(t−1)} = ϑ[−i])   for ϑ, ϑ′ with ϑ[−i] = ϑ′[−i],
            { 0                                             otherwise.
The chain is non-stationary, but aperiodic and irreducible for π(θ|x) > 0 for
all θ.
Moreover Σ_ϑ At(ϑ, ϑ′) π(ϑ|x) = π(ϑ′|x), since for Si = {ϑ : ϑ[−i] = ϑ′[−i]}

Σ_ϑ At(ϑ, ϑ′) π(ϑ|x) = Σ_{ϑ∈Si} πi(θi^{(t)} = ϑ′i | θ[−i]^{(t−1)} = ϑ[−i]) π(ϑ|x)
                     = πi(θi^{(t)} = ϑ′i | θ[−i]^{(t−1)} = ϑ′[−i]) π[−i](ϑ′[−i]|x)
                     = π(ϑ′|x).
For more details see Zwanzig and Mahjani (2020, Section 2.2.5), Albert (2005,
Chapter 10), Chen et al. (2002, Section 2.1), Lui (2001, Chapter 6), Robert
and Casella (2010, Chapter 7), Givens and Hoeting (2005, Section 7.2), and
the references therein.
Here we give two examples for illustration. The first example explores the
simulated data from Example 9.7.
gibbs.beta<-function(a,b,n,N)
{
theta=rep(0,N);x=rep(0,N);
theta[1]<-rbeta(1,a,b);
x[1]<-rbinom(1,n,theta[1]);
for(i in 2:N){
theta[i]<-rbeta(1,a+x[i-1],b+n-x[i-1]);
x[i]<-rbinom(1,n,theta[i])
}
return(data.frame(x,theta))
}
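A possible call reproducing the setting of Figure 9.13 (a = b = 0.5, n = 10; the simulation size is an arbitrary choice):

G<-gibbs.beta(0.5,0.5,10,5000)
hist(G$x,freq=FALSE)       # marginal of x: BetaBin(10,0.5,0.5)
hist(G$theta,freq=FALSE)   # marginal of theta: Beta(0.5,0.5)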
Figure 9.13: Example 9.9. The values are generated by the Gibbs sampler, using R
code 18. Left: BetaBin(10, 0.5, 0.5) Right: Beta(0.5, 0.5)
The problem is to determine the distance d(S(xnew ), S(x)) and the threshold
q. For small threshold q the rejection rate is high and the algorithm is slow.
On the other hand for large thresholds the generated sample follows the prior
more than the posterior; see Figure 9.14.
In the following we present an example for illustration. It is the same set up
as in Example 9.3 on page 246, with the data x given in R code 11. This
example has the advantage that we can compare the ABC simulated sample
with the known posterior distribution.
x<-c(-0.06,2.43,1.55,1.76,1.27,0.94,0.48,1.16,1.32,1.81)# data
n<-length(x) # sample size
S<-sum(x*x) # sufficient statistic
library(invgamma)
a0<-4; b0<-12 # prior parameters
a1<-a0+n/2; b1<-b0+S/2 # posterior parameters
## ABC ##
Figure 9.14: Example 9.10. The histogram of the random numbers generated
by the ABC algorithm. The continuous line is the target, the posterior density
InvGamma(9, 22.25); the dotted line is the trial, the prior InvGamma(4, 12). Left: For
the threshold q = 0.1 the generated sample is approximatively distributed like the
target. Right: For the large threshold q = 10 the generated sample is approxima-
tively distributed like the trial.
ABC.dist<-function(N,q,S)
{
rand<-rep(0,N)
for( i in 1:N)
{
L<-TRUE
while(L){
rand[i]<-rinvgamma(1,a0,b0) # draw theta
xx<-rnorm(n,0,sqrt(rand[i])) # new data
SS<-sum(xx*xx) # new sufficient statistics
D<-abs(SS-S) # metric
if(D<q){L<-FALSE}
}
}
return(rand)
}
C<-ABC.dist(1000,0.1,S) # carry out ABC with the observed sufficient statistic
# Figure
hist(C,freq=FALSE,xlab="",ylim=c(0,0.6),main="",nclass=14)
xx<-seq(0.01,15,0.1); lines(xx,dinvgamma(xx,a1,b1),lty=2,lwd=2)
box(lty=1,col=1)
The following example is an illustration of the ABC method when the likeli-
hood function is not given in a closed form and a data generating algorithm
is used.
Example 9.11 (ABC)
Consider the following latent model
yi = f (xi , θ) + εi , i = 1, . . . , n
d(z, z′) = Σ_{i=1}^n (zi − z′i)².
The priors are a ∼ N(−1, 1), b ∼ N(1, 1), and c ∼ N(0, 1), mutually indepen-
dent. The fitted Bayes curve f˜(x) is calculated from θ(1) , . . . , θ(N ) by
θ̃ = (1/N) Σ_{j=1}^N θ^{(j)} = (ã, b̃, c̃),   f̃(x) = ã x + b̃ x² + c̃ x³.
The left side of Figure 9.15 shows the true curve and the non-truncated points.
On the right side, data points together with filter region are plotted. In Fig-
ure 9.16 the true and a fitted Bayes curves are compared. The algorithm with
q = 3 delivers a good fit but the choice q = 13 gives bad fit. 2
Figure 9.15: Example 9.11. Left: The latent model. The observations are simulated
with a = −2, b = 2 and c = 0.3. Right: The observed model. Observations inside
the gray area are unchanged; otherwise the point on the borderline is taken.
ABC<-function(N,q) # N: sample size, q: threshold; Data(a,b,c) is the data-generating procedure of the example
{
randa<-rep(0,N); randb<-rep(0,N);randc<-rep(0,N)
for( i in 1:N)
{
L<-TRUE
while(L){
a<-rnorm(1,-1,1) # priors
b<-rnorm(1,1,1); c<-rnorm(1,0,1)
D<-Data(a,b,c)
Diff<-sum((yobs-D)*(yobs-D))
if(Diff<q){randa[i]<-a;randb[i]<-b;randc[i]<-c;L<-FALSE}
}
}
rand<-data.frame(randa,randb,randc)
return(rand)
}
AA<-ABC(100,3) # carry out ABC
ma<-mean(AA$randa); mb<-mean(AA$randb) # Bayes estimates
mc<-mean(AA$randc)
There are many different ABC methods. We refer to Beaumont (2019) and
the references therein.
We mention only one more. The combination of MCMC and ABC avoids
generation from the prior; instead MCMC steps are included. Note that, this
algorithm produces a Markov chain so that the elements in the sequence are
no longer independent.
Figure 9.16: Example 9.11 with N = 100. Left: The broken line is the Bayes curve,
calculated with q = 3. Right: The ABC procedure delivers a reasonable fit for q = 3
and a bad fit for q = 13.
The set D is called the variational family. The idea is to use q*(θ|x) instead of
π(θ|x) and to exploit the properties of D. The challenge is to find a compromise
between a simple family D, which gives a poor approximation, and a complex
family D with a high-quality approximation, for which, however, it is hard to
find q*(θ|x).
We refer to Lee (2022) and Bishop (2006) for more details. Here we explain
only the main approach.
Set Θ = Θ1 × · · · × Θm ⊂ Rᵖ. We focus on the mean-field variational family,
DMF, which requires that the components of θ are independent:

q(θ|x) = Π_{j=1}^m qj(θj|x),   θj ∈ Θj,    (9.15)
where

ELBO(q) = ∫ q(θ|x) ln(p(θ, x)) dθ − ∫ q(θ|x) ln(q(θ|x)) dθ.
Since K(q(θ|x)|π(θ|x)) ≥ 0, it holds that

ELBO(q) ≤ ln(p(x)).

This explains the notation ELBO; it stands for evidence lower bound. Minimization
of the Kullback–Leibler divergence means maximization of the ELBO.
Let us illustrate the concept of variational inference for a linear model with
normal errors and known error variance.
Figure 9.17: Example 9.12. The grey contours are from the posterior N(γ1 , Γ1 ),
the black contours belong to the variational distribution N(γ1 , Γ∗ ). Observe that,
the components of the variational distribution are independent, but have different
variances.
it holds that
qk*(θk|x) ∝ exp( ∫ q(θ[−k]) ln(π(θk|θ[−k], x)) dθ[−k] ).    (9.18)
we have

K2 = ∫ qk(θk|x) q(θ[−k]) ln(π(θ|x)) dθ
   = ∫ qk(θk|x) q(θ[−k]) ln(π(θk|θ[−k], x)) dθ + ∫ qk(θk|x) q(θ[−k]) ln(π(θ[−k]|x)) dθ
   = ∫ qk(θk|x) ln(vk(θk)) dθk + ∫ q(θ[−k]) ln(π(θ[−k]|x)) dθ[−k]
   = ∫ qk(θk|x) ln(vk(θk)/cv) dθk + const_{K2}.

Summarizing,

K(q(θ|x)|π(θ|x)) = K(qk(θk|x) | vk(θk)/cv) + const.

Since K(Q|P) = 0 for Q = P, the minimizer is vk(θk)/cv.
2
Applying Theorem 9.2 we formulate the coordinate ascent variational infer-
ence (CAVI) algorithm.
where

q[−k](θ|x) = Π_{j=1}^{k−1} qj(θj|x)^{(t+1)} · Π_{j=k+1}^m qj(θj|x)^{(t)}.
and

∫ q(μ|x)^{(t)} ln(π(σ²|μ, x)) dμ = −∫ q(μ|x)^{(t)} (β1(μ)/σ²) dμ − (α1 + 1) ln(σ²) + const.
2
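As a concrete illustration of CAVI, here is a minimal R sketch for a normal sample with unknown mean and precision under an assumed normal–gamma prior; this is a generic sketch, not the exact model of the example above:

set.seed(4)
x<-rnorm(30,1,2); n<-length(x); xb<-mean(x)      # illustrative data
mu0<-0; lam0<-1; a0<-1; b0<-1                    # assumed prior: mu|tau~N(mu0,1/(lam0*tau)), tau~Gamma(a0,b0)
aN<-a0+(n+1)/2; bN<-b0; Etau<-a0/b0              # initial values
for(it in 1:50){                                 # CAVI updates
  muN<-(lam0*mu0+n*xb)/(lam0+n)                  # q(mu)=N(muN,1/lamN)
  lamN<-(lam0+n)*Etau
  Emu2<-muN^2+1/lamN                             # E(mu^2) under q(mu)
  bN<-b0+0.5*(sum(x^2)-2*muN*sum(x)+n*Emu2+lam0*(Emu2-2*mu0*muN+mu0^2))
  Etau<-aN/bN                                    # q(tau)=Gamma(aN,bN)
}
c(muN,1/lamN,aN/bN)                              # variational mean of mu, its variance, E(tau)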
              success    size
treatment       119     11037
control          98     11034

Let p1 be the probability of success (to get a stroke) under the treatment
and p2 the probability of success in the control group. We are interested in
θ = p1/p2, especially in testing H0 : θ ≤ 1.
(a) Which distribution model is valid?
(b) Give the Jeffreys prior for p1 and p2 .
(c) Give the least favourable prior for p1 and p2 .
(d) Derive the posterior distributions for p1 and p2 in (b) and (c).
(e) Give an R code for generating a sample of size N from the posterior
distributions of θ.
(f) In Figure 9.18 the density estimators of the posterior distributions of θ
are plotted. Can we reject H0 : θ ≤ 1?
2. Here are R codes for three different methods for calculating the same inte-
gral. Method 1
[Figure 9.18: density estimates of the posterior of θ under the Jeffreys prior and under the least favourable prior.]
x1<-runif(10000);
sum(dbeta(x1,1,20)*dbinom(15,1000,x1))/10000
Method 2
x2<-rbeta(10000,1,20); sum(dbinom(15,1000,x2))/10000
Method 3
f1<-function(x){dbinom(15,1000,x)*dbeta(x,1,20)}
integrate(f1,0,1)
The following results are obtained:

Method    1          2          3
Result    0.01228    0.01397    0.01476
P(yi = 1|θ) = exp(θxi)/(1 + exp(θxi)),   i = 1, . . . , n.
The prior π(θ) is a Cauchy distribution. Given the data set (x, y):
ϑ = θj + a × u, u ∼ U[−1, 1].
Solutions
This chapter provides solutions to the problems listed at the end of each
chapter. In some cases, main arguments or sketch of solution are given.
π(λ|X = k) ∝ λ^{α−1} exp(−λβ) (λ^k/k!) exp(−λ)
           ∝ λ^{α+k−1} exp(−λ(β + 1)).
π(β|y) ∝ π(β) ℓ(β|y)
       ∝ exp(−(1/2) Σ_{k=0}^2 (βk − 1)²) exp(−2 Σ_{i=1}^n (yi − β0 − β1 xi − β2 zi)²)
       ∝ exp(−(1/2) Σ_{k=0}^2 (βk − 1)² − 2n((β0 − ȳ)² + (β1 − x̄y)² + (β2 − z̄y)²))

with ȳ = (1/n) Σ_{i=1}^n yi, x̄y = (1/n) Σ_{i=1}^n xi yi, z̄y = (1/n) Σ_{i=1}^n zi yi. Completing the squares,

c(x − a)² + (x − b)² = (c + 1)(x − (ac + b)/(c + 1))² − (ac + b)²/(c + 1) + ca² + b²,

we obtain

π(β|y) ∝ exp(−((4n + 1)/2) ((β0 − b0)² + (β1 − b1)² + (β2 − b2)²))

with

b0 = (4n ȳ + 1)/(4n + 1),   b1 = (4n x̄y + 1)/(4n + 1),   b2 = (4n z̄y + 1)/(4n + 1).

Thus the posteriors are independent normal distributions:

β0|y ∼ N(b0, 1/(4n + 1)),   β1|y ∼ N(b1, 1/(4n + 1)),   β2|y ∼ N(b2, 1/(4n + 1)).

The independence is a consequence of Σ_{i=1}^n xi = 0, Σ_{i=1}^n zi = 0 and Σ_{i=1}^n xi zi = 0.
7. Hierarchical model:
(a) The joint distribution of (θ, μ) is normal, thus the marginal distribution
of θ is Nn (E(θ), Cov(θ)) with E(θ) = E (E(θ|μ)) = 0. Set 1n = (1, . . . , 1)T ,
for an n × 1 vector of 1s. Then Eθ = μ1. Applying (2.5),
Cov(θ) = Eμ (Cov(θ|μ)) + Covμ (E(θ|μ)) = In + 1n 1Tn .
(b) Recall that the posterior depends on the sufficient statistic only. The
sufficient statistic is

X̄ = ((1/k) Σ_{j=1}^k X_{1j}, . . . , (1/k) Σ_{j=1}^k X_{nj}),   X̄|θ ∼ Nn(θ, (1/k) In).

θ̄|x ∼ N(m, v²),   m = ((n + 1)/(k(n + 1) + 1)) (1/n) Σ_{i=1}^n Σ_{j=1}^k x_{ij},   v² = (n + 1)/(n(k(n + 1) + 1)).
10.2 Solutions for Chapter 3
1. Costumer satisfaction:
(a) Set the prior π(0) = 0.3, π(1) = 0.5, π(2) = 0.2. The posterior probabil-
ities π(θ|x) are given as follows.
x           +        +/−        −         0
θ = 2     12/17      1/9        0       20/121
θ = 1      5/17      5/9      50/287   100/121
θ = 0       0        3/9     237/287     1/121
(b) Highest entropy prior is π(0) = π(1) = π(2) = 1/3. The posterior is the
normalized likelihood.
(c) Set π(0) = p, π(1) = 2p, π(2) = 1 − 3p. Using (3.25),

H(p) = −p ln(p) − 2p ln(2p) − (1 − 3p) ln(1 − 3p),   arg max_p H(p) = 1/(3 + 2^{2/3}).
(c) The posterior is not Cauchy distribution since the posterior moments
exist.
(d) The Fisher information is 1/σ². Thus πJeff(θ) ∝ 1.
3. Pareto distribution:
(a) No. The support depends on μ.
(b) No. The support depends on μ.
(c) Yes, with natural parameter α and
n
1
T (x) = − xi , Ψ(α) = − ln(α), because f (x, α) = α exp(−αx).
i=1
x
1 1 1−θ 1 1
π(θ) ∝ ∝ ∝ .
h(θ) (1 − θ) 2 θ (1 − θ) 2 θ(1 − θ)
with a ≥ 0, b ≥ 0 is proper.
(c) Applying π(ξ) = πθ (h(ξ))| h(ξ)
dξ | with
ξ dh(ξ) 1 b
θ = h(ξ) = , = , c= ,
c+ξ dξ (c + ξ)2 a
gives
a−1 b−1
ξ c 1 ξ a−1
π(ξ) ∝ 2
∝ .
c+ξ c+ξ (c + ξ) (b + aξ)a+b
with
exp(ϑ) 1
θ= , Ψ(ϑ) = −n ln(1 − θ) = −n ln .
1 + exp(ϑ) 1 + exp(ϑ)
with a = μ and b = nλ − μ.
(c) Applying Theorem 3.3,
exp(ϑ)ν+x exp(ϑ)a1
π(ϑ|x) ∝ ∝
(1 + exp(ϑ))n(λ+1) (1 + exp(ϑ))a1 +b1
with a1 = a + x, b1 = b + n − x.
(d) We shift the natural parameter to obtain a known family. Set
1 b θ 1 b
z = ln = C + ϑ, ϑ = 2z − 2C = h(z), C = ln( ).
2 a1−θ 2 a
exp(z)2a exp(z)2a
π(z) ∝ ∝ .
( ab exp(2z) + 1)a+b (2a exp(2z) + 2b)a+b
(b) Natural parameter: η = ln(θ/(1 − θ)), sufficient statistic: T(x) = 2I0(x) +
I1(x), and Ψ0(θ) = −2 ln(1 − θ). It has form (3.17).
(c) We have θ = h(η) = exp(η)/(1 + exp(η)). Apply Theorem 3.3 with Ψ(η) =
Ψ0(h(η)) = 2 ln(1 + exp(η)). A conjugate prior is

π(η|μ, λ) ∝ exp(ημ − 2λ ln(1 + exp(η))) ∝ exp(η)^μ/(1 + exp(η))^{2λ}.

(d) Theorem 3.3 delivers the posterior

π(η|x) = π(η|μ + T(x), λ + 1) ∝ exp(η)^{μ+T(x)}/(1 + exp(η))^{2(λ+1)}.

(e) Calculate the Fisher information I(η) = Var(V(η|x)), with score function
V(η|x) = (d/dη) ln(P(x|η)), as

V(η|x) = T(x) − 2 exp(η)/(1 + exp(η)),   Var(T(x)) = 2 exp(η)/(1 + exp(η))².
(c) The Kullback–Leibler divergence K(π^k_Jeff(·|x) | πJeff(·|x)) equals

∫_{−k}^{k} c(k)^{−1} (1/√(2π)) exp(−(1/2)(x − θ)²) (−ln(c(k))) dθ = −ln(c(k)),

so that
(d) The prior π^k_Jeff(θ) is independent of θ, thus p0(θ) = 1.
11. Shannon entropy of N(μ, σ²): We have

H(N(μ, σ²)) = −∫ φμ,σ²(x) ln(1/(√(2π)σ)) dx + (1/(2σ²)) ∫ φμ,σ²(x)(x − μ)² dx
            = ln(√(2π)σ) + (1/(2σ²)) σ² = (1/2)(ln(2πσ²) + 1) = (1/2) ln(2πσ²e).
12. Compare two normal priors:
(a) A sufficient statistic is x̄ ∼ N(θ, σ²/n); see Example 2.12. For j = 1, 2, the
posterior distributions related to πj are N(μj,1, σ²j,1) with

σ²1,1 = σ0²σ²/(nσ0² + σ²),   σ²2,1 = λσ0²σ²/(nλσ0² + σ²).

(b) Use (3.49) and the result of Problem 11 above. Note that the Shannon
entropy of N(μ, σ²) is independent of the mean.

I(Pⁿ, π1) = (1/2) ln(2πσ0²e) − (1/2) ln(2πσ²1,1 e) = (1/2) ln(nσ0² + σ²) − (1/2) ln(σ²)

and

I(Pⁿ, π2) = (1/2) ln(nλσ0² + σ²) − (1/2) ln(σ²).

(c) We have

I(Pⁿ, π1) − I(Pⁿ, π2) = (1/2) ln((nσ0² + σ²)/(nλσ0² + σ²)),

lim_{n→∞} (nσ0² + σ²)/(nλσ0² + σ²) = lim_{n→∞} (σ0² + σ²/n)/(λσ0² + σ²/n) = λ^{−1},

lim_{n→∞} (I(Pⁿ, π1) − I(Pⁿ, π2)) = −(1/2) ln(λ).
(d) For λ > 1 and every μ0 the second prior is less informative.
π(θ = 0|x) = π(0) p(x|0)/(π(0) p(x|0) + π(1) p(x|1)) = 1 − π(θ = 1|x),

it follows that

                       x       −1    0    1
0 ≤ p ≤ 2/7         δ1(x)       0    0    0
2/7 < p ≤ 3/8       δ2(x)       0    0    1
3/8 < p ≤ 1         δ3(x)       0    1    1
(c) The frequentist risk is calculated by R(0, δπ) = P(C1|0), R(1, δπ) = P(C0|1):

                      θ        0      1
0 ≤ p ≤ 2/7                   0      1
2/7 < p ≤ 3/8                 0.2    0.5
3/8 < p ≤ 1                   0.5    0

(e) It holds that max_p r(π) = max_p (0.2 + 0.3 p) = 0.3125. The least
favourable prior is π0(1) = 3/8.
2. Binomial model :
(a) Applying Theorem 4.3 the Bayes estimator is the k1k+k 2
2
fractile of the
posterior Beta(α + x, β + n − x).
(b) See Example 4.9. For k1 = k2 the Bayes estimator is the median of
Beta(α + x, β + n − x), approximated by (4.19).
(c) The asymmetric approach is of interest, when the underestimation of the
success probability is more dangerous than overestimation, i.e., when the
success means the risk of a traffic accident.
(d) Set n = 1 and k1 = k2 . The frequentist risk is
θ < δ π (0) < δ π (1), δ π (0) < θ < δ π (1), δ π (0) < δ π (1) < θ
In each case it is impossible to find a beta prior such that the frequentist
risk is constant. We consider the first case only.
But 1 + δ π (1) − δ π (0) > 0 for all beta priors with α > 1, β > 1. Thus
inside {Beta(α, β) : α > 1, β > 1} we cannot find a prior, which fulfills
the sufficient condition for a minimax estimator given in Theorem 4.13.
R(θ, δπ) = θ² d(α),   d(α) = α²/((α − 1)²(α − 2)) + 1/(α − 1)².
Posterior expected risk:
ρ(π, δπ(x)|x) = E((θ − δπ(x))²|x) = Var(θ|x) = α/x².

Bayes risk:

r(π) = ∫_0^∞ R(θ, δπ) (1/θ) dθ = d(α) ∫_0^∞ θ dθ = ∞.
(d) We can neither apply Theorem 4.9 nor Theorem 4.10. Since δ π is unique,
by Theorem 4.11, δ π is admissible.
4. Gamma distribution, θ = 1/β, conjugate prior:
(a) See Example 3.13. Natural parameter η = −β = −1/θ and T(x) = x.
(b) See Liero and Zwanzig (2011, Theorem 4.3). Since E(T(X)) = α/β = θα,
the efficient estimator is δ*(x) = x/α.
(c) In Example 3.13 the conjugate prior for β is Gamma(α0 , β0 ) and the
posterior is Gamma(α0 + α, β0 + x). Thus the conjugate prior of θ = 1/β
is the inverse-gamma distribution InvGamma(α0 , β0 ) and the posterior is
InvGamma(α0 + α, β0 + x).
(d) Applying Theorem 4.2 and the expectation of InvGamma(α0 + α, β0 + x),
δπ(x) = (β0 + x)/(α0 + α − 1).
.
(e) Frequentist risk of δ π and δ ∗ :
2
R(θ, δ π ) = Var(δ π (X)|θ) + (θ − E(δ π (X)|θ))
1 2
= θ (α + α0 ) − 2θα0 (β0 + 1) + (β02 + 1)2 ,
(α0 + α − 1) 2
1 α θ2
R(θ, δ ∗ ) = Var(δ ∗ (X)|θ) = = .
α2 β 2 α
Posterior expected risk:
(β0 + x)2
ρ(π, δ π (x)|x) = E((θ − δ π (x))2 |x) = Var(θ|x) = .
(α − 1)2 (α − 2)
Bayes risk: The frequentist risk is quadratic in θ. For α0 > 2 the second
prior moment exists, such that r(π) < ∞. Particularly
1 −2 2 α0 − 2
r(π) = β + β 0 + 1 .
(α0 + α + 1)2 α0 − 2 0 α0 − 1
(f) Applying Theorem 4.9, 4.10 or 4.11 we note that under α0 > 2 the Bayes
estimate is admissible.
(g) Due to admissibility of the Bayes estimator, there exists a θ with
R(θ , δ ∗ ) > R(θ , δ π ), which is no contradiction to the efficiency of δ ∗
since the Bayes estimator is biased.
Using the asymptotic relations for Γ(.) and digamma function Ψ(.)
Γ(n + z) Ψ(z)
lim = 1, lim 1 = 1,
n→∞ Γ(n) z n z→∞ ln(z) − 2z
we have
nν + α1 1
K(P1 |P2 ) = ln − (α1 − α2 ) + o(1) = o(1).
nν 2(nν + α1 )
1 t2 1 t2
Φ(−t) ≤ exp(− ), 1 − Φ(t) ≤ exp(− ), for t > 0,
2 2 2 2
there exists a positive constant c0 such that Φ(a1 )+1−Φ(b1 ) ≤ exp(−c0 n).
Hence for Pπ (Oc ) > C −1 , we have q(x(n) ) ≤ p(x(n) )C exp(−c0 n).
Set H = H 12 (P⊗nθ0 , Pn,O ) and p0 (x(n) ) = pn (x(n) |θ0 ) and split the integra-
c
where -
H1 = p0 (x(n) )p(x(n) ) dx(n) < C exp(−c0 n).
Bd
2
Applying Cauchy–Schwarz inequality and Pθ0 (Bdc ) ≤ exp(− nd
2σ 2 ),
1
2 1
2
1
H2 ≤ p0 (x(n) ) dx(n) q(x(n) ) dx(n) ≤ Pθ0 (Bdc ) 2
Bdc Bdc
2
d
For D0 > C + 1 and q0 = exp(− min( 4σ 2 , c0 )), (5.15) follows.
1 1
lim γ1,1 = (a + b) = a, lim γ1,2 = (a + b) = b.
n→∞ 2 n→∞ 2
SOLUTIONS FOR CHAPTER 5 299
(b) Inconsistency of the posterior: Take θ0 = (a0 , b0 ) with a0 > 0, b0 > 0 and
a0 = b0 . Set O = {θ : ||θ − θ0 || < c} and A = {θ : |a − a0 | < c, b ∈ R}.
Then O ⊂ A and
where
−c − γ11 + a0 1 1
Ln = → √ (−c + (a0 − b0 )) = c1 a.s.
(1 − λn ) 2 2
c − γ11 + a0 1 1
Un = → √ (c + (b0 − a0 )) = c2 a.s..
(1 − λn ) 2 2
(c) Check condition (5.15): Let O be the open set defined above. Applying
pn (x(n) |θ)π(θ) = π(θ|x(n) )p(x(n) ), and setting C = 2P1−c
π (O c ) ,
0
pn (x(n) ) π c
q(x(n) ) = P (O |x(n) ) < Cpn (x(n) ).
Pπ (Oc )
Thus
H 12 (P⊗n ⊗n
θ0 , Pn,O ) > C H 12 (Pθ0 , P
c
x(n)
).
P⊗n
θ0 is Nn (μ1 , Σ1 ) and P
x(n)
is Nn (μ2 , Σ2 ) (see Theorem 6.2 on page
134), with μ1 = 1n 12 θ0 , Σ1 = In , and μ2 = 1n 1T2 γ1 , Σ2 = In + 21n 1Tn .
T
1
det(In + aaT ) = 1 + aT a, (In + aaT )−1 = In − aaT ,
1 + aT a
we obtain
14
1 + 2n 1 n
Hn = exp − 2 (a0 + b0 )2 .
(1 + n)2 8 σ (1 + n)
Hn converges to zero, but not with the required rate. It holds for all
q0 < 1, that limn→∞ q0−n Hn = limn→∞ q0−n n− 4 const = ∞.
1
300 SOLUTIONS
10.5 Solutions for Chapter 6
1. Specify (6.23) and (6.24):
(a) Matrix form is given by y = Xβ + with
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
y1 1 x 1 z1 ⎛ ⎞ ε1
⎜ ⎟ ⎜ ⎟ β ⎜ ⎟
⎜ ⎟ ⎜ ⎟ ⎜ 1⎟ ⎜ ⎟
⎜ y2 ⎟ ⎜1 x 2 z 2 ⎟ ⎜ ⎟ ⎜ ε2 ⎟
y=⎜ ⎟ ⎜
⎜ .. ⎟ , X = ⎜ .. ..
⎟
.. ⎟ , β = ⎜β2 ⎟ , =⎜ ⎟
⎜ .. ⎟ .
⎜.⎟ ⎜. . .⎟ ⎝ ⎠ ⎜.⎟
⎝ ⎠ ⎝ ⎠ β3 ⎝ ⎠
yn 1 x n zn εn
where d = ( 43 + n)2 − 49 .
2. Simple linear regression with Jeffreys prior :
(a) By Corollary 6.4, πJeff (θ) ∝ σ −2 .
n
(b) Posterior
n distributions: Set sxx = i=1 (xi − x̄)2= nx2 − n(x̄)2 , sxy =
n n
(x − x̄)(y − ȳ) = nxy − nx̄ȳ with nx̄ = x , nȳ = i=1 yi
i=1 i
2
n
i
2
n
i=1 i
and nx = i=1 xi , nxy = i=1 xi yi . It holds that
⎛ ⎞T
1 . . . 1
XT = ⎝ ⎠ ,
x 1 . . . xn
⎛ ⎞ ⎛ ⎞
n nx̄ 1 ⎝ x2 −x̄
XT X = ⎝ ⎠, (XT X)−1 = ⎠.
nx̄ nx2 sxx −x̄ 1
sxy
The least-squares estimates are α = ȳ − x̄β, β = sxx .
SOLUTIONS FOR CHAPTER 6 301
i. Applying Theorem 6.7 and (6.40) we obtain
⎛ ⎞ ⎛⎛ ⎞ ⎞
α α
⎝ ⎠ |y, σ 2 ∼ N2 ⎝⎝ ⎠ , σ 2 (XT X)−1 ⎠ . (10.4)
β β
2
ii. The marginal distribution of β in (10.4) is β|y, σ 2 ∼ N(β, sσxx ).
iii. Applying Lemma 6.1 on page 132 to (10.4) with
x̄ 1 1
Σ21 Σ−1
11 = − , Σ22 − Σ221 Σ−1
11 = ¯ − αx̄)
, μ2|1 = ¯2 (xy
x2 nx2 x
gives
xy σ2x̄
β|(α, σ , y) ∼ N
2
−α , .
x2 x2 sxx
Note that, the mean in the posterior corresponds to the least-squares
estimator ofβ when the intercept α is known.
n
iv. Set RSS = i=1 (yi − yi )2 , with yi = α + xi β. Applying Theorem 6.7
and (6.40),
n − 2 RSS
σ |y ∼ InvGamma
2
, . (10.5)
2 2
v. From (ii) and (iv) it follows that (β, σ 2 )|y ∼ NIG(n − 2, RSS, β, s−1
xx ).
Applying Lemma 6.5 we get
RSS
β ∼ t1 (n − 2, β, σ 2 s−1 2
xx ), with σ = .
n−4
3. Simple linear regression with variance as parameter of interest:
(a) In case of known α and β it is an i.i.d. sample problem with i = yi −
α − βxi . Hence,
n2
1 2
n
1 1
π(σ |y, β, α) ∝ 2
2
exp − .
σ σ2 2σ 2 i=1 i
n
This is the kernel of InvGamma( n2 , 2b ) with b = i=1 (yi − α − βxi )2 .
(b) The posteriors are different. In Problem 2 (b) the regression parameters
are estimated and the degree of freedom is reduced by the number of
estimated parameters.
4. Transformation of simple linear regression model to a centered model : We
eliminate the intercept α, by
ȳ = α + β x̄ + ε̄, yi − ȳ = α − α + β(xi − x̄) + εi − ε̄.
n
(a) The transformation is given by ξi = εi − n1 j=1 εj . Set 1n = (1, . . . , 1)T
the n-dimensional column vector of ones. Then
1 1
ξ = ε − 1n 1Tn ε = Pε, P = In − 1n 1Tn
n n
where P is a projection matrix.
302 SOLUTIONS
(b) Applying the projection properties we obtain a singular covariance ma-
trix
Cov(ξ) = PCov(ε)PT = σ 2 P, det(P) = 0.
(c) Deleting the last observation:
1 1
ξ(−n) = ε(−n) − 1n−1 1Tn ε = Aε(−n) − εn 1n−1
n n
with A = In−1 − n1 1n−1 1Tn−1 , where A is not a projection matrix. Further
1
Cov(ξ(−n) ) = σ 2 (AA + 1n−1 1Tn−1 ) = σ 2 Σ
n2
with Σ = In−1 − n1 1n−1 1Tn−1 . Let e be an eigenvector orthogonal to 1n−1 .
Then Σe = e, with corresponding eigenvalue 1. Consider the eigenvector
v = a1n−1 . Then Σv = n−1 v and the eigenvalue is n1 = 0. Thus det(Σ) =
1
n > 0.
(d) Consider model (6.159) with deleted last observation and ξ(−n) ∼
N(0, σ 2 Σ), where
1
Σ = In−1 − 1n−1 1Tn−1 , Σ−1 = In−1 + 1n−1 1Tn−1
n
(x(−n) − x̄1n−1 )T Σ−1 (x(−n) − x̄1n−1 ) = sxx
(x(−n) − x̄1n−1 )T Σ−1 (y(−n) − ȳ1n−1 ) = sxy
i. Model (6.159) with deleted last observation and Jeffreys prior: Apply-
ing Theorem 6.7 and using the expressions above we have β|y, σ 2 ∼
sxy σ 2
N1 sxx , sxx and σ 2 |y ∼ InvGamma n−2 RSS
2 , 2 , which is the same
result as in model (10.5).
ii. Model (6.159) with deleted last observation and conjugate prior,
σ 2 = 1: Applying Corollary 6.1 and using the expressions above, we
λ s +γb
have β|y ∼ N1 (μ1 , σ02 ) with μ1 = λ22 sxy
xx +1
and σ02 = λ2 sλxx2 +1 . This
posterior is different from the posterior (6.71), given in Example 6.9
on page 151.
5. Precision parameter : The conjugate family includes the normal-gamma dis-
tributions, defined as follows. A vector valued random variable X with sam-
ple space X ⊆ Rp and a positive random scalar λ have a normal-gamma
distribution
(X, λ) ∼ NGam(a, b, μ, P)
iff
−1 −1 a b
X|λ ∼ Np (μ, λ P ), λ ∼ Gamma , .
2 2
The kernel of the density of NGam(a, b, μ, P) is given as
a+p−2 λ
λ 2 exp − (b + (X − μ) P(X − μ)) .
T
2
SOLUTIONS FOR CHAPTER 6 303
−2
Set λ = σ . Suppose that θ = (β, λ) ∼ NGam(a0 , b0 , γ0 , P0 ). Then
a0 +p−2 λ
π(θ|y) ∝ λ 2 exp − (b0 + (β − γ0 )T P0 (β − γ0 ))
2
n λ
× λ 2 exp − (y − Xβ)T (y − Xβ) .
2
Applying (6.25) given in Lemma 6.4 on page 138 we obtain θ|y ∼
NGam(a1 , b1 , γ1 , P1 ) where a1 = a0 + n,
b1 = b0 + (y − Xγ1 )T (y − Xγ1 ) + (γ0 − γ1 )T P0 (γ0 − γ1 ),
P1 = P0 + XT X, γ1 = P−1 T
1 (P0 γ + X y).
6. Jeffreys prior for the scale parameter : Set η T = (η1T , η2 ) = (β T , σ) and
θT = (θ1T , θ2 ) = (β T , σ 2 ). From Theorem 6.6 we know that
⎛ ⎞
1 T
X X 0
I(θ) = ⎝ θ2 ⎠.
n
0 2θ 2
2
−(p+1)
Using Definition 3.4 we obtain π(η) ∝ η2 . Under the independence
assumption we calculate separately π(η1 ) ∝ const, and π(η2 ) ∝ η2−1 , so
that π(η) ∝ η2−1 .
7. Three sample problem:
(a) Reformulation as linear model is given by
⎛ ⎞ ⎛ ⎞⎛ ⎞
XT 1m 0 0 1 0 ⎛ ⎞
⎜ ⎟ ⎜ ⎟⎜ ⎟ μ
⎜ ⎟ ⎜ ⎟⎜ ⎟
1 ⎟⎝ ⎠ + ,
1
y = ⎜ Y T ⎟ = ⎜ 0 1m 0 ⎟ ⎜ 0
⎝ ⎠ ⎝ ⎠⎝ ⎠ μ2
ZT 0 0 1m −1 −1
1
n
1 1
E(σ12 |Y) = (yi − ỹi )(zi − z̃i ) + α̃y α̃y + β̃y β̃z .
ν+n−3 i=1
c1 c2
306 SOLUTIONS
10.6 Solutions for Chapter 7
1. Sample from Gamma(ν, θ): Likelihood and prior, θ ∼ Gamma(α, β), are
given by
n
(θ|x) ∝ θ nν
exp(− xi θ), π(θ) ∝ θα−1 exp(−βθ).
i=1
2 b1 n
σMAP = , b1 = b + sxx + x̄2 .
a1 + 2 n+1
n 2 2 λ2
Set 2
i=1 xi = nx and use sxx + nx = nx .
2 We have for c0 = b + λ12 γ 2 ,
2
λ
c1 = −2 λ12 γ, c2 = −λ21 n2 , and c3 = n, that
2 1
σMAP = (c0 + c1 x + c2 x2 + c3 x2 ).
a+n+2
(d) Set μ = 0 and σ 2 = 1. Then X1 , . . . , Xn i.i.d. N(0, 1), with Ex = 0,
Ex3 = 0, Ex2 x = 0, and Ex2 = n1 . For γ = 0
nx 2 λ+n
2
Cov(μMAP , σMAP ) = E σ = −2γ 2 = 0.
a + λ MAP λ (a + n + 2)
3. Simple linear regression: Applying (6.64) and (6.62) we get θ|y ∼
N2 (γ1 , σ 2 Γ1 ), where γ1 = (α̃, β̃)T and
⎛ ⎞
λ1
0
Γ1 = ⎝ 1+λ1 n ⎠ , α̃ = λ1 ny + γa , β̃ = λ2 nxy + γb .
0 λ2 1 + nλ1 1 + λ2 sxx
1+λ2 sxx
where z(1− α20 ) is the (1 − α20 )–quantile of N(0, 1), i.e., Φ(zγ ) = γ.
(b) Applying Lemma 6.2 on page 132 to
⎛⎛ ⎞ ⎛ ⎞⎞
⎛ ⎞ 1 z s2 (z) + 1 (1, z)Γ1
y ⎜⎜ ⎟ ⎜ ⎛ ⎞ ⎟⎟
⎝ f ⎠ |y ∼ N3 ⎜ ⎜ ⎟ ⎜
⎜⎜1 0⎟ γ1 , σ 2 ⎜ T 1
⎟⎟
⎟⎟
θ ⎝ ⎝ ⎠ ⎝ Γ1
⎝ ⎠ Γ 1
⎠⎠
0 1 z
gives
yf |y ∼ N μ̃(z), σ 2 (s2 (z) + 1) . The prediction interval is given by
[μ̃(z) − z(1− α0 ) σ s2 (z) + 1, μ̃(z) + z(1− α20 ) σ s2 (z) + 1].
2
308 SOLUTIONS
4. Predictive distribution in binomial model : In Example 2.11 the posterior is
calculated as θ|X ∼ Beta(α, β) with α = α0 + x and β = β0 + n − x.
(a) Applying (7.33), the predictive distribution,
1
nf x f 1
π(xf |x) = θ (1 − θ)(nf −xf ) θα−1 (1 − θ)β−1 dθ
0 xf B(α, β)
1
nf 1
= θxf +α−1 (1 − θ)(nf −xf +β−1) dθ
xf B(α, β) 0
nf B(α + xf , β + n − xf )
= ,
xf B(α, β)
is a beta-binomial distribution,
xf |x ∼ BetaBin(nf , α0 + x, β0 + n − x).
with
nλ1 y + m1 nλ2 xy + m2
α̃y = , β̃y = ,
1 + nλ1 1 + nλ2
SOLUTIONS FOR CHAPTER 7 309
nλ3 z + m3 nλ4 xz + m4
α̃z = , β̃z = .
1 + nλ3 1 + nλ4
Then η|σ 2 , y ∼ N(η̃, σ 2 s2y ) with
1 1 λ2 λ4
η̃ = (β̃y + β̃z ), s2y = + .
2 4 1 + nλ2 1 + nλ4
with ỹi = α̃y + β̃y xi and z̃i = α̃z + β̃z xi . From Lemma 6.5 it follows that
b1 (y) 2
η|y ∼ t(a + 2n, η̃, s ).
a + 2n y
with
2 b1 (y) 2 b1 (y) 1 λ2 λ4
s1 (y) = s = +
a + 2n y a + 2n 4 1 + nλ2 1 + nλ4
where t(df,γ) is the γ-quantile of t1 (df, 0, 1).
6. Averaged observations: Model (7.71) is a simple linear regression model
with Σ = 12 In .
(a) Applying Theorem 6.5 and Lemma 6.5,
c2
βu |u, σ 2 ∼ N β̃u , σ 2 , σ 2 |u ∼ InvGamma((a+n)/2, b1 (u)/2),
1 + 2nc2
310 SOLUTIONS
b1 (u)c2
βu |u ∼ t a + n, β̃u , s2 (u)2 , with s2 (u)2 = .
(a + n)(1 + 2nc2 )
Further ũi = α̃u + β̃u xi , where
2nc1 u 2nc2 xu
α̃u = , β̃u = ,
1 + 2nc1 1 + 2nc2
n
1 2 1
b1 (u) = b + 2 (ui − ũi )2 + α̃u + β̃u2 .
i=1
c1 c2
(b) For known σ the posterior is η|(σ 2 , y) ∼ N(β̃u , σ 2 s2u ) with s2u =
2 c2
1+2nc2 .
Thus
c2
n
1 nc2 1
β̃u = 2 (yi + zi )xi = (xy + xz) = (β̃y + β̃z ) = η̃.
1 + 2nc2 i=1 2 1 + 2nc2 2
1 λ2 λ4 1 2c2
s2y = + = = s2u .
4 1 + nλ2 1 + nλ4 2 1 + 2nc2
This is expected, because u is a sufficient statistic for η. The orthog-
onal design and the diagonal prior covariance matrices imply that the
intercepts have no influence on η in both models.
(c) Note that, u is not sufficient for σ 2 . Assume additionally that m1 =
m3 = 0 and λ1 = λ3 = 2c1 , λ2 = λ4 = 2c2 . Then
2c1 1
n
nc1 1
α̃u = (yi + zi ) = (y + z) = (α̃y + α̃z )
1 + 2nc1 i=1 2 1 + 2nc1 2
SOLUTIONS FOR CHAPTER 7 311
n
and ũi = α̃u + β̃u xi
= (ỹi + z̃i ). Set Ry = i=1 (yi − ỹi ) , Rz =
1
2
2
n n n
i=1 (zi − z̃i ) 2
, R u = i=1 (ui − ũi ) and Rzy =
2
i=1 (zi − z̃i )(yi − ỹi ).
1
Thus 2Ru = 2 (Ry + Rz ) + Rzy . The estimates σ̃1 and σ̃22 are different,
2
B|Y ∼ t2,2 (ν + n − 1, B1 , C1 , Σ1 ).
1 1
1T2 Σ1 12 = 2 + Ry + 2Ryz + Rz + (α̃y + α̃z )2 + (β̃y + β̃z )2 .
λ1 λ2
θαn α−1
n n
p(x|θ) = x exp(−θ xi ).
Γ(α)n i=1 i i=1
SOLUTIONS FOR CHAPTER 8 313
∞
Using the integral 0 xa−1 exp(−bx) dx = b−a Γ(a), we obtain
n
m(x) = Γ(α)−n Γ(α0 )−1 β0α0 xiα−1 Γ(α1 )β1−α1 .
i=1
Hence
Γ(α0 ) θ0αn n n
α0 +nα
B01 = (β 0 + x i ) exp(−θ 0 xi ).
Γ(α0 + nα) β0α0 i=1 i=1
(c) Applying the table from Kass and Vos (1997), we get very strong evi-
dence against the null hypothesis.
3. Two sample problem, Delphin data:
(a) In both models the parameter θ is related to location parameter. The
non-informative prior is Jeffreys, π(θ) ∝ const; see Example 3.21.
(b) First we consider the model {P0 , π}. Only the distribution of the first m
observations depends of θ. The posterior is N(γ(0) , Γ(0) ) with
1
m
1 2
γ(0) = x̄(1) , Γ(0) = σ , x̄(1) = xi .
m 1 m i=1
The second model {P1 , π} is the special case of a linear model. We apply
Theorem 6.7 on page 157 , where p = 1, σ 2 = 1, X = (1, . . . , 1, a, . . . , a)T
2 2 2 2 (2) 1 n
and Σ = diag(σ1 , . . . , σ1 , σ2 , . . . , σ2 ). Set x̄ = n−m i=m+1 xi . We
obtain the posterior N(γ(1) , Γ(1) ) with
1
m
(0) (0)
xi = xi = x̄(1) , i = 1, . . . , m, xi = μ, i = m + 1, . . . , n,
m i=1
with
1 1
m n
RSS(0) = 2 (xi − x̄(1) )2 + 2 (xi − μ)2 .
σ1 i=1 σ2 i=m+1
In the second model we have
(1) (1)
xi = γ (1) , i = 1, . . . , m, xi = aγ (1) , i = m + 1, . . . , n,
with
1 1
m n
RSS(1) = (x i − γ (1) ) 2
+ (xi − aγ(1) )2 .
σ12 i=1 σ22 i=m+1
314 SOLUTIONS
(d) The Bayes factor is
∞
m0 (x)
B01 = , mj (x) = pj (x|θ) dθ, j = 0, 1.
m1 (x) −∞
We have n
m0 (x) exp(− 2σ1 2 i=m+1 (xi − μ)2 )A1
2
=
m1 (x) A2
with
1
m
A1 = exp − (xi − θ)2 dθ
2σ12 i=1
and
1 1
m n
A2 = exp − (x i − θ) 2
− (xi − aθ)2 dθ.
2σ12 i=1 2σ22 i=m+1
First consider A1 . Applying
√ (xi − θ)2 = (xi − x̄)2 + m(θ − x̄)2 and
1
exp(− 2a (x − b)2 )dx = 2πa, we get
√ 1
m
σ12 1
A1 = 2π( ) 2 exp − 2 (xi − x̄(1) )2 .
m 2σ1 i=1
1 1
m n
(x i − θ) 2
+ (xi − aθ)2 = RSS(1) + Γ−1
(1) (θ − γ(1) ) .
2
σ12 i=1 σ22 i=m+1
Thus √ 1
A2 = 2π(Γ(1) ) 2 exp −RSS(1) .
Summarizing,
n − m σ12 2 1 1
B01 = (1 + a ) 2 exp (RSS(1) − RSS(0) ) .
n σ22 2
(c) Reject H0 iff
0.2 − (η1 − η2 ) −0.2 − (η1 − η2 )
Φ −Φ ≤ 0.5,
ση ση
and
(0)
n
(0)
n
(0)
b1 = b + (yi − yi )2 + (zi − zi )2 + (γ (0) )T γ (0) (10.13)
i=1 i=1
(0) (0)
with yi = αy + βxi , zi = αz + βxi . Setting
n
n
R1 = (yi − αy )2 , R2 = (zi − αz )2 ,
i=1 i=1
we obtain
(0)
b1 = b + R1 + R2 + (αy )2 + (αz )2 − (2n + 1)(β)2 . (10.14)
(b) See the notation and solution of Problem 5(a), (b) of Chapter 6. We
(1)
apply Theorem 6.5 and obtain the posterior NIG(a1 , b1 , γ (1) , Γ(1) ) with
a1 = a + n,
⎛ ⎞ ⎛ ⎞
n
y α
⎜ n+1 ⎟ ⎜ y ⎟
⎜ n ⎟ ⎜ ⎟
1 ⎜ n+1 z ⎟ ⎜αz ⎟
Γ(1) = I4 , γ (1) = ⎜ ⎟ ⎜ ⎟
⎜ 1 T ⎟ = ⎜ ⎟.
n+1 ⎜ n+1 x y ⎟ ⎜ βy ⎟
⎝ ⎠ ⎝ ⎠
1 T
n+1 x z βz
316 SOLUTIONS
Note that, the estimates of αy and of αz coincide in both models, because
of the condition x̄ = 0. Further
(1)
n
(1)
n
(1)
b1 = b + (yi − yi )2 + (zi − zi )2 + (γ (1) )T γ (1) (10.15)
i=1 i=1
(1) (1)
with yi = αy + βy xi , zi = αz + βz xi . We obtain
(1)
b1 = b + R1 + R2 + (αy )2 + (αz )2 − (n + 1)(βy )2 − (n + 1)(βz )2 . (10.16)
n+1
(c) Note that, β = 2n+1 (βy + βz ). Applying (10.14) and (10.16) we get
with
n+2 1
k1 (θ(1) |y) = (σ 2 ) 2 exp(− (RSS + (β − β)T XT X(β − β))).
2σ 2
Since
n−p
−p 1 1 RSS 2
π(β|y) = (2π) 2 |X X|
T 2 k1 (θ(1) |y),
Γ( n−p
2 )
2
we get
− n−p
− n−p − 12 n−p RSS 2
m1 (y) = (2π) 2 |X X|
T
Γ( ) .
2 2
Summarizing,
n−p
RSS 2
n − p −1 1
B01 = Γ( ) exp(− 2 RSS).
σ02 2 2σ0
σ2
2 ln(B10 ) = (n − p)(x − ln(x)) + rest(n), x = ,
σ02
with
1 n−p
rest(n) = ln(Γ( )) − (n − p) ln(n − p).
2 2
The function f (x) = x − ln(x) has its minimum at x = 1. Using Stirling
approximation we obtain limn→∞ rest(n) = 0.
7. Test on correlation between two regression lines:
(a) Under H0 the lines are independent. We have the univariate model (10.8)
given in Problem 5(a), Chapter 7. Using the notation and solution of
1
Problem 5(b), Chapter 7, the posterior is NIG(2 + 2n, b1 , γ, n+1 I4 ) with
n
γ T = (α y , β y , α z , β z ) T = (y, xy, z, xz)T .
n+1
Set yi = αy + xi βy and zi = αz + xi βz . Then b1 = b1,y + b1,z with
n
n
b1,y = 1 + (yi − yi )2 + αy2 + βy2 , b1,z = 1 + (zi − zi )2 + αz2 + βz2 .
i=1 i=1
318 SOLUTIONS
(b) Under H1 the lines are dependent. We have model (10.6)
n + 1, we obtain
(2π)n! n!
n+2 n+1 = Γ(n + 1) = 1.
Γ( 2 )Γ( 2 )
Summarizing, we get
n+1
√ n − 2 n+1 |Σ|
2
1
B01 = 2(n − 2)( ) |Σ| 2 .
n
|Σ|
SOLUTIONS FOR CHAPTER 9 319
(e) It holds that
2 − ln |Σ| + rest(n)
ln(B10 ) = ln |Σ|
n+1
with
1 n−2 2 1
rest(n) = ln( )+ ln(n − 2) + (ln(2) + ln |Σ|) = oP (1),
2 n n+1 n+1
since the Bayes estimate is bounded in probability.
(θ|x) n
r(θ, x) = , (θ|x) = fC (xi |θ, 1), π(θ) = fC (θ|3, 1)
π(θ) i=1
The chain is not stationary, but the limit distribution is the right one,
since for |ρ| < 1, limt→∞ ρt = 0.
6. ABC for Iris data:
(a) The data is an i.i.d. sample from a truncated bivariate normal distribu-
tion, TMVN(μ, Σ, a, b). This is N2 (μ, Σ) with
⎛ ⎞ ⎛ ⎞
μ1 σ12 ρσ1 , σ2
μ = ⎝ ⎠, Σ = ⎝ ⎠
μ2 ρσ1 σ2 σ22
truncated to the support [a1 , b1 ] × [a2 , b3 ]. Here we have [0, ∞) × [0, ∞).
The parameter is θ = (μ1 , μ2 , σ1 , σ2 , ρ), m1 is the average of the first
variable and m2 is the average of the second variable.
(b) The priors are σ12 ∼ U[0.2, 2], σ22 ∼ U[0.2, 2], ρ ∼ U[0, 0.9], μ1 ∼
TN(0, ∞, 3, 1), and μ1 ∼ TN(0, ∞, 2, 1), where TN(0, ∞, m, v 2 ) is the
normal distribution N1 (m, v 2 ) truncated on [0, ∞).
322 SOLUTIONS
(1) (N )
(c) The algorithm samples θ , . . . , θ independently from a distribution
which approximates the posterior. The posterior has no closed form ex-
pression.
(d) It is an ABC Algorithm: Given current state θ(j) :
i. Draw independently θ1 from TN(0, ∞, 3, 1), θ2 from TN(0, ∞, 2, 1), θ3
from U[0.2, 2], θ4 from U[0.2, 2], and θ5 from U[0, 0.9].
ii. For i = 1, . . . , 50, generate new observations
Znew,i ∼ TMVN(μ, Σ, a, b), where
⎛ ⎞ ⎛ ⎞
θ1 θ32 θ 5 θ 3 , θ4
μ = ⎝ ⎠, Σ = ⎝ ⎠.
θ2 θ5 θ3 θ4 θ42
n
iii. Calculate n1 i=1 Znew,i = (mnew,1 , mnew,2 ), and D = (m1 −
mnew,1 )2 + (m2 − mnew,2 )2 .
iv. For D < tol we update θ(j+1) = θ; otherwise we go back to Step 1.
Chapter 11
Appendix
Here we briefly list some of the most commonly used distributions in Bayesian
inference. For details, we recommend other sources given in bibliography.
Poi(λ) dpois(k,lambda)
Bin(n, p) dbinom(k,n,p)
(k,n,alpha,beta)
NB(r, p) dnbinom(k,r,p)
Geo(p) dgeom(k,p)
f >2
f1 +f2
−
f1
−1 f1 2 f2 2f22 (f1 +f2 −2)
Ff1 ,f2 x 2 1+ f2 x f2 −2 f1 (f2 −2)2 (f2 −4)
f2 > 2 f2 > 4
1 1
Exp (λ) λ exp(−λx) λ λ2
2 −1
C(m, σ) 1 + ( x−m
σ ) – –
. .
La(μ, σ) exp − . x−μ
σ
. μ 2σ 2
Beta(α, β) xα−1 (1 − x)β−1 α
α+β
αβ
(α+β)2 (α+β+1)
α α
Gamma(α, β) xα−1 exp(−xβ) β β2
β2
InvGamma(α, β) x−α−1 exp(− βx ) β
α−1 (α−1)2 (α−2)
α>1 α>2
MULTIVARIATE DISTRIBUTIONS 325
N(μ, σ 2 ) dnorm(x,mu,sigma)
C(m, σ) dcauchy(x,m,sigma)
Beta(α, β) dbeta(x,alpha,beta)
Gamma(α, β) dgamma(x,alpha,beta)
We also refer to the toolbox given in Chapter 6: Lemma 6.1, Lemma 6.2,
Theorem 6.1, and Lemma 6.5.
326 APPENDIX
P = Σ−1 Eλ = α
β Var(λ) =
2α
β 2 (α−1) , α>2
Cov(X, λ) = 0
exp(− 2λ
1
(β+ X−μ 2
Σ )) β
NIG(α, β, μ, Σ) p+α+2 EX = μ Cov(X) = α−2 Σ
λ 2
β
Eλ = α−2 Var(λ)
2β 2
α>2 = (α−2)2 (α−1)
Cov(X, λ) = 0
where X − μ 2
Σ = (X − μ)T Σ−1 (X − μ).
(mu,P,2*alpha,2*beta)
(xx,2*alpha,2*beta,mu,Sigma)
ν>2
Notation f (W|θ) ∝ EW
ν−k−1 −1
Wk (ν, V) |W| 2 exp − 12 tr WV νV
ν+k+1
IWk (ν, V) |W|− 2 exp − 12 tr W−1 V 1
ν−k−1 V
Further
Notation Cov(vec(W))
2
+ (ν+1)v(v−2) (Ik2 + Kk,k )(V ⊗ V), ν > 2
random numbers
Bibliography
329
330 BIBLIOGRAPHY
W.W. Daniel and C.L. Cross. Biostatistics: A Foundation for Analysis in the
Health Sciences, 10th ed. Wiley, 2013.
E. Demidenko. Mixed models: Theory and Applications with R. Wiley, 2013.
P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The
Annals of Statistics, 14(1):1–26, 1986.
N. R. Draper and H. Smith. Applied Regression Analysis. Wiley, 1966.
J.A. Dupuis. Bayesian estimation of movement and survival probabilities from
capture-recapture data. Biometrika, 82(4):761–772, 1995.
G.H. Givens and J.L. Hoeting. Computational Statistics. Wiley, 2005.
A.K. Gupta and D.K. Nagar. Matrix Variate Distributions. CRC Press, 2000.
T.J. Hastie, R.J. Tibshirani, and M. Wainright. Statistical Learning with
Sparsity: The LASSO and Generalizations. CRC Press, 2015.
W.K. Hastings. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57:97–109, 1970.
A.E. Hoerl and R.W. Kennard. Ridge regression: Applications to nonorthog-
onal problems. Technometrics, 12:55–67, 1970.
W. James and C. Stein. Estimation with quadratic loss. In Proc. 4th Berkeley
Symp. Math. Statist. Probab., volume 1, pages 361–380, Berkeley, 1960.
Univ. of California Press.
H. Jeffreys. An invariant form for the prior probability in estimation problems.
Proceedings of the Royal Statistical Society of London, 186:453–461, 1946.
R. E. Kass and P.W. Vos. Geometrical Foundations of Asymptotic Inference.
Wiley, 1997.
R. E. Kass and L. Wasserman. An invariant form for the prior probability in
estimation problems. Journal of the American Statistical Association, 91
(435):1343–1370, 1996.
R.E. Kass and A.E. Raftery. Bayes factors. Journal of the American Statistical
Association, 90(430):773–795, 1997.
K-R. Koch. Introduction to Bayesian Statistics, 2nd ed. Springer, 2007.
T. Kollo and D. von Rosen. Advanced Multivariate Statistics with Matrices.
Springer, 2005.
Se Yoon Lee. Gibbs sampler and coordinate ascent variational inference: A
set theoretical review. Communications in Statistics-Theory and Methods,
51(6):1549–1568, 2022.
H. Liero and S. Zwanzig. Introduction to the Theory of Statistical Inference.
CRC Press, 2011.
F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.
Pi-Erh Lin. Some characterization of the multivariate t distribution. Journal
of Multivariate Analysis, 2:339–344, 1972.
B. W. Lindgren. Statistical Theory. Chapman & Hall, 1962.
BIBLIOGRAPHY 331
D.V. Lindley. On a measure of the information provided by an experiment.
The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
D.V. Lindley and A.F.M. Smith. Bayes estimates for the linear model. Journal
of the Royal Statistical Society, Ser. B, 34:1–41, 1972.
Jun S. Lui. Monte Carlo Strategies in Scientific Computing. Springer Series
in Statistics. Springer, 2001. ISBN 0-387-95230-6.
J.R. Magnus and H. Neudecker. The elimination matrix: Some lemmas and
applications. SIAM Journal on Algebraic Discrete Methods, 1(4):422–449,
1980.
J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications
in Statistics and Econometrics. Wiley, 1999.
K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic
Press, 1979.
A.M. Mathai and H.J. Haubold. Special Functions for Applied Scientists.
Springer, 2008.
N. Metropolis, A. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller.
Equations of state calculations by fast computing machines. Journal of
Chemical Physics, 21(6):1087–1091, 1953.
T.Tin. Nguyen, Hien D. Nguyen, F. Chamroukhi, and G.L. McLachlan.
Bayesian estimation of movement and survival probabilities from capture-
recapture data. Cogent Mathematics & Statistics, 7, 2020.
N. Polson and L. Wasserman. Prior distributions for the bivariate binomial.
Biometrika, 77(4):901–904, 1990.
J. Press and J. M. Tamur. The Subjectivity of Scientists and the Bayesian
Approach. Wiley, 2001.
H. Raiffa and R. Schlaifer. Applied Statistical Decision Theory. Harvard
University, Boston, 1961.
C. P. Robert. The Bayesian Choice. Springer, 2001.
C.P. Robert and G. Casella. Introducing Monte Carlo Methods with R.
Springer, 2010.
L. Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie, 4:10–26,
1965.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics,
6(2):461–464, 1978.
S. Searle, G. Casella, and C.E. McCulloch. Variance Components. Wiley,
2006.
S.R. Searle. Linear Models. Wiley, 1971.
S.R. Searle and M.J. Gruber. Linear Models, 2nd ed. Wiley, 2017.
G.A.F. Seber. A Matrix Handbook for Statisticians. Wiley, 2008.
332 BIBLIOGRAPHY
G.W. Snedecor and W.G. Cochran. Statistical Methods, 8th ed. Iowa State
University Press, 1989.
D.J. Spiegelhalter, N.G. Best, P.C. Bradley, and A. Van der Linde. Bayesian
measures of model complexity and fit. Journal of the Royal Statistical So-
ciety Ser. B, 64(4):583–639, 2002.
S. M. Stigler. Thomas Bayes’s Bayesian inference. Journal of the Royal Sta-
tistical Society. Ser. A, 145(2):250–258, 1982.
T.J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Ser. B, 58:267–288, 1996.
L. Tierney, Robert E.K., and J.B. Kadane. Fully exponential laplace approx-
imation to expectations and variances of nonpositive functions. Journal of
the American Statistical Association, 84(407):710–716, 1989.
A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press,
1998.
A. Zellner. An Introduction to Bayesian Inference in Econometrics. Wiley,
1971.
Hui Zou and T. Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society, Ser. B, 67(2):301–320, 2005.
S. Zwanzig and B. Mahjani. Computer Intensive Methods in Statistics. CRC
Press, 2020.
Index
333
334 INDEX
Wishart, 172, 327 lion’s appetite, 7, 9, 20, 31, 51,
80, 81, 84, 86, 98, 106, 109,
Entropy 203
Shannon, 51, 64, 77, 291 mixture distribution, 48, 202
Estimation multinomial distribution, 61, 73
regularized, 191 no consistency, 116, 118
Estimator normal, 87
Bayes, 115 normal distribution, 10, 12, 16,
elastic net, 191 18, 19, 30, 51, 96, 183, 185,
lasso, 190 187, 199, 200, 219, 246
least-squares normal i.i.d. sample, 94
multivariate, 172 Parkinson, 143
Maximum a Posteriori Estimator pendulum, 6
(MAP), 186, 243 random-walk Metropolis, 262
Maximum Likelihood Estimator rejection algorithm, 256
(MLE), 186 sampling importance resampling,
minimax, 98 253
regularized, 189 sequential data, 23
restricted, 189 sex ratio at birth, 218, 226, 227
ridge estimator, 190 side effects, 128
Example smokers, 144
ABC, 270, 272 systematic-scan Gibbs Sampler,
Bahadur, 118 266
big data, 23 Variational Inference, 276
billiard table, 12, 16 weather, 37, 48, 187, 194, 202
binomial distribution, 51, 52, 56, Exponential family
64, 66, 86, 91, 102, 109, 218, conjugate prior, 43, 47
225 definition, 40
bivariate binomial, 70 natural parameter, 40
capture–recapture, 25, 32 sample, 42
Cauchy distribution, 21
Cauchy i.i.d. sample, 88 Fisher information, 54
CAVI algorithm, 279
corn plants, 127, 142, 159, 172, Gamma distribution
179 conjugate prior, 44, 75
Corona, 7, 45 Gamma function, 149
flowers, 6, 39
gamma distribution, 44 Hardy–Weinberg Model, 290
hip operation, 160, 169, 198 Hellinger distance, 95, 124
importance sampling, 250 Hellinger transform, 96, 120, 124
independent MC, 246 Hyperparameter, 25, 32
James–Stein estimator, 82
Information
Laplace approximation, 244
expected information, 63
life length, 130, 139, 153, 158,
mutual information, 64
197, 204, 220
Information criterion
INDEX 335
Akaike ( AIC), 236 Markov Chain Monte Carlo
Bayesian (BIC), 232 annealing, 261
Deviance (DIC), 233 balance inequality, 260
DIC general MCMC, 258
normal linear model, 234 Gibbs Sampling, 27, 263
Schwarz, 232 MCMC, 257
Instrumental distribution, 249 Metropolis algorithm, 259
Metropolis–Hastings, 259
Jacobian, 52, 303 Metropolis–Hastings ratio, 259
Jeffreys–Lindley paradox, 230 proposal distribution, 261
random walk, 262
Kullback–Leibler divergence, 53, 76, thinning, 261
95, 120, 125, 275, 276, 291, MCMC
296 logistic regression, 320
Minimaxity, 98
Laplace approximation, 242
Model
Likelihood
Bayes model, 11
likelihood function, 8
Hardy–Weinberg, 76
matrix-variate normal, 171
hierarchical, 25
maximum likelihood principle, 9
statistical model, 5, 11
multivariate normal, 137, 147,
Monte Carlo
154
Importance Sampling (IS), 248
Linear mixed model
importance function, 249
marginal model, 161
importance weights, 249
Linear model, 126
instrumental, 249
orthogonal design, 139
trial, 249
Bayesian, 131
independent MC, 245
mixed, 159, 197
multivariate, 170 Normal distribution
univariate, 127 Gibbs sampler, 282
Location model, 56 Jeffreys prior, 60
Location scale model, 104
Location-scale model, 60 Odds ratio, 52
Logistic regression, 45
Loss Parameter
0–1 loss, 93 model indicator, 221
L1 , 79 nuisance, 70
L2 , 79 of interest, 70
absolute error, 89 Parameter space, 5, 11, 173
contrast condition, 114 Partitioned matrix, 133
Hellinger, 95 inverse, 133
Kullback–Leibler, 95 Pinskers’ inequality, 125
posterior expected, 84, 115 Posterior, 136
randomized decision, 97 precision parameter, 139
weighted quadratic, 85 joint, 166, 178
Loss function, 79 marginal, 178
336 INDEX
matrix-variate t, 178 right Haar measure, 61
normal, 143 subjective, 31
odds ratio, 222 uniform distribution, 68
precision parameter, 135 Prior distribution, 11
robustness, 119 Probability function, 7
strongly consistent, 114
Posterior distribution, 14 Reference Analysis, 63
Prediction Risk
Bayes predictor, 92, 206 Bayes, 84
prediction error, 91 integrated, 83
predictive distribution, 92, 206 frequentist, 80, 97
quadratic regression, 211
Principle Sample
Bayesian, 14, 113, 183, 199, 218, i.i.d., 12
222, 240 Sample space, 5
likelihood, 8 Scale model, 58
Principles Schur complement, 133
modelling, 29 Schwartz’ Theorem, 121
Prior Score function, 54
Berger–Bernardo method, 63, 72 Sherman-Morrison-Woodbury matrix
conjugate, 38, 131, 165 identity, 134
exponential family, 43 Spherical structure, 138
gamma distribution, 212 Strong consistency, 114
hyperparameter, 32, 34
Target distribution, 249
improper, 50
Test
invariant, 55
p-value, 104
inverse-gamma, 145
randomized, 97
Jeffreys, 54
Trial distribution, 249, 255
joint, 177
Laplace, 50 Variational inference
least favourable, 105 CAVI, 279
left Haar measure, 61 evidence lower bound (ELBO),
NIG, 147 276
non-informative, 55, 64, 153 mean-field, 275
objective, 31 variational density, 275
odds ratio, 222 variational factor, 278
reference, 63, 67 variational family, 275